Apache Lucene - Index File Formats

2022-12-06 MasonChan

Sources:

Official Java Docs
Source directory: lucene-9.4.1-src/lucene-9.4.1/lucene/core/src/java/org/apache/lucene/codecs/lucene94/package-info.java

Package org.apache.lucene.codecs.lucene94

Lucene 9.3 file format.

Apache Lucene - Index File Formats

Introduction
Definitions
    Inverted Indexing
    Types of Fields
    Segments
    Document Numbers
Index Structure Overview
File Naming
Summary of File Extensions
    Lock File
    History
    Limitations

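Introduction

This document defines the index file formats used in the current version of Lucene. If you are using a different version of Lucene, please consult the copy of docs/ that was distributed with the version you are using.

This document attempts to provide a high-level definition of the Apache Lucene file formats.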

Definitions

The fundamental concepts in Lucene are index, document, field and term:

An index contains a sequence of docs
A doc is a sequence of fields
A field is a named sequence of terms
A term is a sequence of bytes

Document

A doc can be pictured as a piece of JSON text (Lucene itself only requires a doc to be a sequence of fields). A simple example:

{"k0":"123", "k1":"123", "k2":"hello world"}

Field

A field corresponds to one element of the JSON. It consists of a field name and a field value; the field value is usually called the field text.

The doc example above has 3 fields:

field0: "k0":"123", field name is k0, field text is 123
field1: "k1":"123", field name is k1, field text is 123
field2: "k2":"hello world", field name is k2, field text is hello world

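Term

Source code: lucene-9.4.1-src/lucene-9.4.1/lucene/core/src/java/org/apache/lucene/index/Term.java

In MySQL the basic unit of retrieval is the column, which consists of a column name and a column value. Similarly, in Lucene the basic unit of retrieval is the term, which consists of a term name and a term value:

The term name is the same as the field name; it is a string.
The term value is a piece of the field text. The field text is split into term values by tokenization, for example with a split() method; each term value is held as a sequence of bytes.

Tokenization is a fundamental operation in Lucene: it is used both when building the inverted index and when searching. Different languages and different tokenization strategies have led to many tokenizers, which are usually shipped as plugins (jar packages).

If whitespace is used as the delimiter, the doc example above contains 4 terms:

term0: "k0":"123", term name is k0, term value is 123
term1: "k1":"123", term name is k1, term value is 123
term2: "k2":"hello", term name is k2, term value is hello
term3: "k2":"world", term name is k2, term value is world

Although term0 and term1 have the same value, their term names differ, so term0 and term1 are treated as 2 different terms.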
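The whitespace tokenization above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: TermSketch, SimpleTerm, tokenize and exampleDoc are hypothetical names made up for this note, not Lucene classes, and the term value is kept as a String for readability instead of bytes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration class; it does not use or mirror the Lucene API.
class TermSketch {
    /** A term is identified by the pair (field name, value). */
    static final class SimpleTerm {
        final String field;
        final String value;

        SimpleTerm(String field, String value) {
            this.field = field;
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SimpleTerm)) {
                return false;
            }
            SimpleTerm other = (SimpleTerm) o;
            return field.equals(other.field) && value.equals(other.value);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, value);
        }

        @Override
        public String toString() {
            return field + ":" + value;
        }
    }

    /** Splits every field text on whitespace and emits one term per token. */
    static List<SimpleTerm> tokenize(Map<String, String> doc) {
        List<SimpleTerm> terms = new ArrayList<>();
        for (Map.Entry<String, String> field : doc.entrySet()) {
            for (String token : field.getValue().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    terms.add(new SimpleTerm(field.getKey(), token));
                }
            }
        }
        return terms;
    }

    /** The example doc from the text, with fields kept in insertion order. */
    static Map<String, String> exampleDoc() {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("k0", "123");
        doc.put("k1", "123");
        doc.put("k2", "hello world");
        return doc;
    }

    public static void main(String[] args) {
        List<SimpleTerm> terms = tokenize(exampleDoc());
        System.out.println(terms); // [k0:123, k1:123, k2:hello, k2:world]
        // Same value, different field name: two different terms.
        System.out.println(terms.get(0).equals(terms.get(1))); // false
    }
}
```

Because equality is defined on the (field name, value) pair, k0:123 and k1:123 never collapse into one term, which is the behaviour described for term0 and term1.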

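Inverted Indexing

A Lucene index stores terms and statistics about those terms, which makes term-based search more efficient.

Like a MySQL column index, a Lucene term index is built on values. The difference is that a column index is a forward index, whereas a term index is an inverted index. The intuitive direction is to find a doc first and then look up the terms inside it, which is the basic logic of a forward index. An inverted index works the other way round: given a term, it finds the docs that contain that term.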
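To make the "term to docs" direction concrete, here is a minimal sketch of an inverted index over a single text field. It is an assumption-laden toy: InvertedIndexSketch, build and search are hypothetical names, tokens are split on whitespace, and nothing here reflects how Lucene lays the data out on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use the Lucene API.
class InvertedIndexSketch {
    /** Maps each whitespace-separated token to the ascending doc ids that contain it. */
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(token, k -> new ArrayList<>());
                // Doc ids arrive in ascending order, so only the last entry can be a duplicate.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return postings;
    }

    /** Given a term, returns the docs that contain it (empty if the term is unknown). */
    static List<Integer> search(Map<String, List<Integer>> postings, String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("hello world", "hello lucene", "goodbye world");
        Map<String, List<Integer>> postings = build(docs);
        System.out.println(search(postings, "hello")); // [0, 1]
        System.out.println(search(postings, "world")); // [0, 2]
    }
}
```

The forward direction would be the docs list itself (doc id to text); the map built here is the inverse of it.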

Types of Fields

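A field can be stored, indexed, or both:

Stored means that the original field data is kept in the index in a non-inverted manner.
Indexed means that the field has indexing enabled, so its terms are added to the inverted index.

In other words, a Lucene index holds both the original field data and the index built from the fields. The 2 kinds of data live in different files with different file formats.

When the inverted index is built, the field text is either split into multiple terms or used literally as 1 term, depending on the analysis the user has configured. Most fields are tokenized, but for certain identifier fields it is useful to index the text literally.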
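The stored/indexed distinction and the tokenized/literal choice can be sketched with two separate maps. Everything here is hypothetical: FieldTypeSketch and its three boolean flags are invented for this note and are not the Lucene Field API; the real library expresses these options through its own field types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the class and its flags are not the Lucene API.
class FieldTypeSketch {
    // Stored side: doc number -> (field name -> original field text).
    final Map<Integer, Map<String, String>> stored = new TreeMap<>();
    // Indexed side: "field:value" term -> ascending doc numbers.
    final Map<String, List<Integer>> inverted = new TreeMap<>();

    void addField(int docId, String name, String text,
                  boolean store, boolean index, boolean tokenize) {
        if (store) {
            stored.computeIfAbsent(docId, k -> new LinkedHashMap<>()).put(name, text);
        }
        if (index) {
            // Tokenized fields yield one term per token; literal fields yield exactly one term.
            String[] values = tokenize ? text.trim().split("\\s+") : new String[] {text};
            for (String value : values) {
                List<Integer> ids =
                    inverted.computeIfAbsent(name + ":" + value, k -> new ArrayList<>());
                if (!ids.contains(docId)) {
                    ids.add(docId);
                }
            }
        }
    }

    public static void main(String[] args) {
        FieldTypeSketch idx = new FieldTypeSketch();
        idx.addField(0, "id", "order 42", true, true, false);      // stored, indexed literally
        idx.addField(0, "body", "hello world", false, true, true); // indexed and tokenized, not stored
        System.out.println(idx.stored);   // {0={id=order 42}}
        System.out.println(idx.inverted); // {body:hello=[0], body:world=[0], id:order 42=[0]}
    }
}
```

The literal id field produces the single term "order 42", so a search for "order" alone would not match it, while the tokenized body field can be found by either of its words but cannot be returned as original text because it was not stored.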

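Segments

A Lucene index is made up of multiple sub-indexes, called segments. Each segment is a fully independent index that can be searched separately. As a result, an index and its segments behave as follows:

Newly added docs go into newly created segments
Existing segments get merged
A single search may involve multiple segments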
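The three behaviours above can be mimicked with a list of lists. This is a rough sketch that assumes each batch of added docs becomes its own new segment; SegmentSketch, flush, search and mergeAll are hypothetical names, and the per-segment search is a plain linear scan rather than a real inverted index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; it does not use the Lucene API.
class SegmentSketch {
    /** Each segment is modelled as a list of whitespace-tokenized docs. */
    final List<List<String>> segments = new ArrayList<>();

    /** Assumption of this sketch: every batch of new docs becomes a new segment. */
    void flush(String... docs) {
        segments.add(new ArrayList<>(Arrays.asList(docs)));
    }

    /** A search visits every segment; hits are reported as "segment/localDoc". */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int s = 0; s < segments.size(); s++) {
            List<String> docs = segments.get(s);
            for (int d = 0; d < docs.size(); d++) {
                if (Arrays.asList(docs.get(d).trim().split("\\s+")).contains(term)) {
                    hits.add(s + "/" + d);
                }
            }
        }
        return hits;
    }

    /** Merging rewrites all existing segments into one new segment. */
    void mergeAll() {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        segments.clear();
        segments.add(merged);
    }

    public static void main(String[] args) {
        SegmentSketch index = new SegmentSketch();
        index.flush("hello world", "hello lucene");
        index.flush("goodbye world");
        System.out.println(index.search("world")); // [0/0, 1/0]
        index.mergeAll();
        System.out.println(index.search("world")); // [0/0, 0/2]
    }
}
```

Note how the same doc is reported under a different segment and local number after the merge.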

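Document Numbers

Internally, Lucene refers to a document by its document number, also called the document id. The first doc added to an index is numbered 0, and each subsequent doc gets a number one greater than the previous one.

Note that document numbers may change, so be careful when storing these numbers outside Lucene. In particular, numbers may change in the following situations:

The numbers stored in each segment are unique only within that segment and must be converted before they can be used in a larger context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment's base document number is added. To convert an external value back to a segment-specific value, the segment is identified by the range that the external value falls in, and the segment's base document number is subtracted. For example, when two segments of 5 documents each are combined, the first segment has a base document number of 0 and the second a base document number of 5, so doc3 of the second segment has an external value of 8 = 5+3: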

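Segment	Base document number	Document numbers within the segment	External document number after combining
segment0	0	0-4	doc3 becomes 3 = 0+3
segment1	5	0-4	doc3 becomes 8 = 5+3

When documents are deleted, their numbers no longer point to any document, which leaves gaps in the numbering. When segments are merged, a new segment is generated, deleted documents are dropped, and document numbers are reassigned. A freshly merged segment therefore has no gaps in its numbering.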
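The base-number conversion described in this section is plain prefix-sum arithmetic, and a small sketch may help. DocNumberSketch and its methods are hypothetical names for this note; the sketch assumes every segment holds at least one doc and is not taken from the Lucene source.

```java
import java.util.Arrays;

// Hypothetical illustration class; it does not use the Lucene API.
class DocNumberSketch {
    /** bases[i] is the base document number of segment i; the extra last entry is the total. */
    static int[] bases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length + 1];
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i + 1] = bases[i] + segmentSizes[i];
        }
        return bases;
    }

    /** Segment-local number to external number: add the segment's base. */
    static int toExternal(int[] bases, int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    /** External number to {segment, localDoc}: find the range, then subtract the base. */
    static int[] toLocal(int[] bases, int externalDoc) {
        int total = bases[bases.length - 1];
        if (externalDoc < 0 || externalDoc >= total) {
            throw new IllegalArgumentException("no such document: " + externalDoc);
        }
        // Search only the real bases, not the trailing total.
        int pos = Arrays.binarySearch(bases, 0, bases.length - 1, externalDoc);
        // A miss returns -(insertionPoint) - 1; the owning segment is insertionPoint - 1.
        int segment = pos >= 0 ? pos : -pos - 2;
        return new int[] {segment, externalDoc - bases[segment]};
    }

    public static void main(String[] args) {
        int[] bases = bases(new int[] {5, 5});                  // two segments of 5 docs each
        System.out.println(toExternal(bases, 1, 3));            // 8
        System.out.println(Arrays.toString(toLocal(bases, 8))); // [1, 3]
    }
}
```

With the two 5-doc segments from the example, doc3 of the second segment maps to 8 and back again. After a merge that drops deleted docs, the segment sizes change, so the bases and all external numbers have to be recomputed.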


Index Structure Overview

Each segment index contains the following information:

Segment info. Metadata about the segment, such as the number of docs it contains, the files it uses, and information about how the segment is sorted.
Field names. Metadata about the set of named fields used in the index.
Stored Field values. The original field data. For each doc this holds a list of attribute-value pairs, where the attribute is the field name and the value is the field text. The stored fields are what is returned for each hit of a search, keyed by document number.
Term dictionary. A dictionary of the terms of all fields that have indexing enabled. It also records the number of docs that contain each term, and pointers to the term's frequency and proximity data.
Term Frequency data. For each term in the dictionary, the numbers of all docs that contain the term and the frequency of the term in each of those docs, unless frequencies are omitted.
Term Proximity data. For each term in the dictionary, the positions at which the term occurs in each doc, unless position data is omitted.
Normalization factors. For each field in each doc, a value is stored that is multiplied into the score for hits on that field, so that scores do not vary widely just because docs differ in length.
Term Vectors. For each field in each doc, the term vector (sometimes called the doc vector) may optionally be stored. A term vector consists of term text and term frequency. See the Field constructors.
Per-document values. Like stored values, these are also keyed by document number, but are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.
Live documents. An optional file indicating which documents are live.
Point values. Optional pair of files, recording dimensionally indexed fields, to enable fast numeric range filtering and large numeric values like BigInteger and BigDecimal (1D) and geographic shape intersection (2D, 3D).
Vector values. The vector format stores numeric vectors in a format optimized for random access and computation, supporting high-dimensional nearest-neighbor search.
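To relate the term dictionary, term frequency data and term proximity data from the list above, the following sketch keeps all three in one nested map for a single text field. It is only a mental model under simplifying assumptions: PostingsSketch and its methods are hypothetical names, positions are whitespace token offsets, and the real files are separate, compressed structures rather than Java maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use the Lucene API.
class PostingsSketch {
    /** term -> (doc number -> token positions of the term in that doc). */
    static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> dict = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] tokens = docs.get(docId).trim().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) {
                    continue;
                }
                dict.computeIfAbsent(tokens[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
            }
        }
        return dict;
    }

    /** Term dictionary statistic: number of docs that contain the term. */
    static int docFreq(Map<String, Map<Integer, List<Integer>>> dict, String term) {
        Map<Integer, List<Integer>> postings = dict.get(term);
        return postings == null ? 0 : postings.size();
    }

    /** Proximity data: positions of the term inside one doc. */
    static List<Integer> positions(Map<String, Map<Integer, List<Integer>>> dict,
                                   String term, int docId) {
        Map<Integer, List<Integer>> postings = dict.get(term);
        if (postings == null || !postings.containsKey(docId)) {
            return new ArrayList<>();
        }
        return postings.get(docId);
    }

    /** Frequency data: how often the term occurs inside one doc. */
    static int termFreq(Map<String, Map<Integer, List<Integer>>> dict, String term, int docId) {
        return positions(dict, term, docId).size();
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Integer>>> dict =
            build(Arrays.asList("hello world hello", "world of lucene"));
        System.out.println(docFreq(dict, "world"));      // 2
        System.out.println(termFreq(dict, "hello", 0));  // 2
        System.out.println(positions(dict, "hello", 0)); // [0, 2]
    }
}
```

Dropping the position lists and keeping only their sizes would correspond to omitting position data; dropping the sizes as well would leave only the doc numbers per term.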