Term Vector Index File Format

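Normalization factors (file extension .nrm) are stored for every field of every document. At search time, the normalization factor is multiplied by the base score to produce the final score. They are stored in the DocValues format.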

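Normalization factors are usually stored as part of the compound index format, and each segment has one corresponding norms file, as the screenshot below shows: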

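The .cfs file is a virtual file that contains all of the segment's index files and provides access to them. The .cfe file holds the entry table listing each file stored in the corresponding .cfs file.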

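Term Vectors (also called document vectors) are optional: at index time you decide whether to store them. A term vector mainly holds the term text, the term frequency, and the positions and offsets. In most cases there is no need to enable term vectors, because they store extra information and make the index larger. Some requirements, such as highlighting, do need them. Even then, it is advisable to do the highlighting on the front end where possible, which reduces both server load and the size of the Lucene index.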

In Lucene, the term vector index is made up of three files:

1. The document index file (.tvx). For each Document, it stores the offset into the document file (.tvd) and the offset into the field data file (.tvf).

DocumentIndex (.tvx) --> Header,<DocumentPosition,FieldPosition>^NumDocs

Header --> CodecHeader
DocumentPosition --> UInt64 (offset in the .tvd file)
FieldPosition --> UInt64 (offset in the .tvf file)
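
Because every document occupies a fixed-size pair of UInt64 values, the entry for a given document can be located with simple arithmetic. The following is a minimal sketch of that lookup, not Lucene's own reader. It assumes the header length is already known (header_len is a hypothetical parameter, since the CodecHeader size is not given here) and that the UInt64 values are big-endian.

```python
import struct

ENTRY_SIZE = 16  # one DocumentPosition + one FieldPosition, 8 bytes each


def read_tvx_entry(tvx_bytes, header_len, doc_id):
    """Return (tvd_offset, tvf_offset) for doc_id from raw .tvx bytes.

    header_len is a hypothetical parameter: the caller must know how many
    bytes the CodecHeader occupies. Big-endian byte order is an assumption.
    """
    start = header_len + doc_id * ENTRY_SIZE
    if doc_id < 0 or start + ENTRY_SIZE > len(tvx_bytes):
        raise IndexError("doc_id out of range for this .tvx data")
    return struct.unpack_from(">QQ", tvx_bytes, start)


# Build a tiny fake .tvx: a 4-byte stand-in header followed by three entries.
fake_tvx = b"HDR!" + b"".join(
    struct.pack(">QQ", tvd, tvf) for tvd, tvf in [(10, 200), (100, 2000), (150, 2600)]
)
entry = read_tvx_entry(fake_tvx, 4, 1)
```

The fixed entry size is what makes random access by document number cheap: no scanning of earlier entries is needed.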

2. The .tvd file. It mainly stores information about fields, including the number of fields and the pointer to each field's data in the .tvf file.

Document (.tvd) --> Header,<NumFields, FieldNums, FieldPositions>^NumDocs

Header --> CodecHeader
NumFields --> VInt
FieldNums --> <FieldNumDelta>^NumFields
FieldNumDelta --> VInt
FieldPositions --> <FieldPositionDelta>^(NumFields-1)
FieldPositionDelta --> VLong
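
To make the record layout concrete, here is a minimal sketch that decodes one per-document .tvd record from a byte buffer. It rests on several assumptions: VInt/VLong use the usual 7-bits-per-byte encoding with the high bit as a continuation flag, low-order group first; the first field's .tvf position comes from the .tvx entry (passed in as the hypothetical parameter first_field_pos); and each FieldPositionDelta is added to the previous position. The FieldNums values are returned exactly as read, without interpreting them.

```python
def read_vint(buf, pos):
    """Decode one variable-length integer; return (value, next_pos).

    Each byte carries 7 data bits; a set high bit means more bytes follow.
    VLong uses the same scheme, so this function serves for both here.
    """
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return value, pos
        shift += 7


def read_tvd_record(buf, pos, first_field_pos):
    """Decode <NumFields, FieldNums, FieldPositions> starting at pos.

    first_field_pos is assumed to be the FieldPosition taken from .tvx.
    Returns (raw_field_nums, field_positions, next_pos).
    """
    num_fields, pos = read_vint(buf, pos)
    raw_field_nums = []
    for _ in range(num_fields):
        value, pos = read_vint(buf, pos)
        raw_field_nums.append(value)
    field_positions = [first_field_pos] if num_fields else []
    for _ in range(max(num_fields - 1, 0)):
        delta, pos = read_vint(buf, pos)
        field_positions.append(field_positions[-1] + delta)
    return raw_field_nums, field_positions, pos


# Three fields; the position deltas are 150 (two bytes: 0x96 0x01) and 10.
record = bytes([3, 0, 2, 5, 0x96, 0x01, 10])
field_nums, field_positions, next_pos = read_tvd_record(record, 0, 1000)
```

Only NumFields-1 deltas are stored because the first position is already known; this mirrors the NumFields-1 repetition count in the grammar above.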

3. The .tvf file. It mainly stores information about terms, including the term text itself, term frequency, offsets, positions, and payloads.

Field (.tvf) --> Header,<NumTerms, Flags, TermFreqs>^NumFields

Header --> CodecHeader
NumTerms --> VInt
Flags --> Byte
TermFreqs --> <TermText, TermFreq, Positions?, PayloadData?, Offsets?>^NumTerms
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt
Positions --> <PositionDelta PayloadLength?>^TermFreq
PositionDelta --> VInt
PayloadLength --> VInt
PayloadData --> Byte^NumPayloadBytes
Offsets --> <VInt, VInt>^TermFreq

The Flags byte records whether position, offset, and payload information is stored for this term vector.

Term byte prefixes are shared. The PrefixLength is the number of initial bytes from the previous term which must be pre-pended to a term's suffix in order to form the term's bytes. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
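
The prefix sharing described above can be illustrated with a short sketch. It takes already-parsed (PrefixLength, Suffix) pairs and rebuilds the full term bytes; the function name and input shape are hypothetical, chosen only for illustration.

```python
def decode_terms(entries):
    """Rebuild full term bytes from (prefix_length, suffix_bytes) pairs.

    Each term reuses the first prefix_length bytes of the previous term.
    """
    terms = []
    previous = b""
    for prefix_length, suffix in entries:
        if prefix_length > len(previous):
            raise ValueError("prefix longer than the previous term")
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms


# "bone" followed by "boy": the second entry shares the 2-byte prefix "bo".
terms = decode_terms([(0, b"bone"), (2, b"y")])
```

Since terms are stored in sorted order, neighbouring terms tend to share long prefixes, which is why this scheme saves space.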

If payloads are disabled for the term's field, PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position.
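
The PositionDelta rules can be sketched as follows. The input is a flat list of integers that have already been VInt-decoded, in stream order, so a PayloadLength value (when present) directly follows its PositionDelta. The function name is hypothetical. Two assumptions: the running position and the initial payload length both start at 0, and, with payloads enabled, an occurrence without a stored PayloadLength reuses the previous length.

```python
def decode_positions(values, term_freq, has_payloads):
    """Return (positions, payload_lengths) for one term.

    values: VInt-decoded integers in stream order.
    payload_lengths stays empty when has_payloads is False.
    """
    positions = []
    payload_lengths = []
    position = 0
    payload_length = 0  # assumption: length before any PayloadLength is read
    i = 0
    for _ in range(term_freq):
        code = values[i]
        i += 1
        if has_payloads:
            position += code >> 1  # PositionDelta/2 is the real delta
            if code & 1:           # odd: a PayloadLength follows
                payload_length = values[i]
                i += 1
            payload_lengths.append(payload_length)
        else:
            position += code
        positions.append(position)
    return positions, payload_lengths


# Three occurrences at positions 6, 8, 9 with payload lengths 4, 4, 2.
with_payloads = decode_positions([13, 4, 4, 3, 2], 3, True)
without_payloads = decode_positions([6, 2, 1], 3, False)
```

Folding the "length follows" flag into the low bit of PositionDelta avoids spending a separate byte per occurrence when consecutive payloads have the same length.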

PayloadData is metadata associated with a term position. If PayloadLength is stored at the current position, then it indicates the length of this payload. If PayloadLength is not stored, then this payload has the same length as the payload at the previous position. PayloadData encodes the concatenated bytes for all of a term's occurrences.

Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.
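
The text does not spell out what each delta is relative to. The sketch below shows one plausible reading, and it is an assumption rather than a confirmed description of the format: the first value of each pair is the distance from the previous occurrence's endOffset (starting at 0), and the second is the distance from this occurrence's startOffset to its endOffset. The function name is hypothetical.

```python
def decode_offsets(pairs):
    """Turn (start_delta, end_delta) pairs into absolute (start, end) offsets.

    Assumed scheme: start = previous end + start_delta, end = start + end_delta.
    """
    offsets = []
    last_end = 0
    for start_delta, end_delta in pairs:
        start = last_end + start_delta
        end = start + end_delta
        offsets.append((start, end))
        last_end = end
    return offsets


# Two occurrences: characters 0..4, then 5..8 after a one-character gap.
offsets = decode_offsets([(0, 4), (1, 3)])
```

Under this reading every stored value stays small regardless of document length, which keeps most of them within a single VInt byte.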
