[lucene] fields
2020-03-05
cdz620
Options accepted by Lucene fields
There are three main categories:
- indexing
- storing
- term vector
Combining field options
Index | Store | TermVector | Example usage |
---|---|---|---|
NOT_ANALYZED_NO_NORMS | YES | NO | Identifiers (filenames, primary keys), telephone and Social Security numbers, URLs, personal names, dates, and textual fields for sorting |
ANALYZED | YES | WITH_POSITIONS_OFFSETS | Document title, document abstract |
ANALYZED | NO | WITH_POSITIONS_OFFSETS | Document body |
NO | YES | NO | Document type, database primary key (if not used for searching) |
NOT_ANALYZED | NO | NO | Hidden keywords |
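Field options for indexing
Index.ANALYZED
- The field is indexed.
- The analyzer breaks the value into a token stream. Suited to text fields, e.g. body, title, abstract.
Index.NOT_ANALYZED
- The field is indexed.
- The value is not analyzed; the whole string is indexed as a single token.
- e.g. URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers
Index.ANALYZED_NO_NORMS
- Like Index.ANALYZED, but norms are not stored in the index. Norms record index-time boost information and can consume a lot of memory at search time.
Index.NOT_ANALYZED_NO_NORMS
- Like Index.NOT_ANALYZED, but also without norms.
- Used to save index space and memory.
Index.NO
The field is not indexed, so it cannot be searched.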
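To make the difference between the indexing options concrete, here is a minimal plain-Java sketch of which tokens each option would put into the index. It does not use Lucene at all: the class `IndexOptionSketch` and its methods are hypothetical, and the "analyzer" is only assumed to lowercase the text and split on non-alphanumeric characters, which is far simpler than a real Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IndexOptionSketch {

    // Stand-in for Index.ANALYZED: lowercase the value and split it into tokens.
    // (Assumed toy analyzer, not Lucene's StandardAnalyzer.)
    public static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stand-in for Index.NOT_ANALYZED: the whole string is one token.
    public static List<String> notAnalyzed(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value);
        return tokens;
    }

    // Stand-in for Index.NO: nothing is indexed, so nothing can be searched.
    public static List<String> notIndexed(String value) {
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        String path = "/home/docs/Lucene Notes.txt";
        System.out.println(analyzed(path));    // several lowercase tokens
        System.out.println(notAnalyzed(path)); // one token: the exact original string
        System.out.println(notIndexed(path));  // no tokens
    }
}
```

This is why a file path or primary key indexed as NOT_ANALYZED can only be matched by its exact value, while an ANALYZED title can be matched by any of its words.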
Field options for storing fields
Field.Store.*
Store.YES
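The value is stored so it can be retrieved and shown with search results, e.g. URL, title, or database primary key
Store.NO
The value is not stored. Usually paired with Index.ANALYZED on large text fields whose original value does not need to be retrieved.
CompressionTools
- Storage-related: CompressionTools can compress and decompress stored data.
- This costs performance: values have to be compressed when indexing and decompressed again when they are read back.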
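The space-versus-CPU trade-off can be sketched with the JDK's own `java.util.zip` classes. This is a stand-in under stated assumptions, not Lucene's CompressionTools API: the class `StoredValueCompressionSketch` and its two methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class StoredValueCompressionSketch {

    // Compress a value before it would be stored as a binary field.
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a value after it has been read back from storage.
    public static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("lucene stored field ");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        byte[] restored = decompress(packed);
        System.out.println(original.length + " bytes stored as " + packed.length + " bytes");
        System.out.println(new String(restored, StandardCharsets.UTF_8).length());
    }
}
```

Every read of the stored value pays for a `decompress` call, so compression tends to be worth it only for large, rarely retrieved values.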
Field options for term vectors
Term vectors
TermVector.YES
- unique terms
- how many times each term occurs; positions and offsets are not stored
TermVector.WITH_POSITIONS
- unique terms
- each unique term's count
- term positions
TermVector.WITH_OFFSETS
- unique terms
- each unique term's count
- term offsets (the character positions where the term starts and ends)
TermVector.WITH_POSITIONS_OFFSETS
- unique terms
- each unique term's count
- term positions
- term offsets (the character positions where the term starts and ends)
TermVector.NO
No term vector information is stored.
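The information listed for each TermVector option can be illustrated with a small plain-Java sketch that computes it for one field value. This is not how Lucene builds term vectors internally: `TermVectorSketch` and `Entry` are hypothetical names, and tokenization is assumed to be a simple lowercase whitespace split.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TermVectorSketch {

    // Everything WITH_POSITIONS_OFFSETS would record for one unique term.
    public static final class Entry {
        public int count;                                        // kept by every option except NO
        public final List<Integer> positions = new ArrayList<>(); // WITH_POSITIONS
        public final List<int[]> offsets = new ArrayList<>();     // WITH_OFFSETS: {start, end}, end exclusive
    }

    public static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new LinkedHashMap<>();
        int position = 0; // token index within the field
        int i = 0;        // character index within the field
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++; // skip whitespace between tokens
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++; // consume one token
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                Entry entry = vector.computeIfAbsent(term, k -> new Entry());
                entry.count++;
                entry.positions.add(position++);
                entry.offsets.add(new int[] {start, i});
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Entry> vector = build("the quick fox jumps over the lazy dog");
        for (Map.Entry<String, Entry> e : vector.entrySet()) {
            System.out.println(e.getKey() + " count=" + e.getValue().count
                    + " positions=" + e.getValue().positions);
        }
    }
}
```

Positions count tokens while offsets count characters, which is why highlighting usually needs offsets and phrase-style processing needs positions.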
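Field values
Reader
- The value cannot be stored.
- The value is always analyzed and indexed (Index.ANALYZED).
- Suited to source data where holding the full String in memory is too costly or inconvenient.
TokenStream
- The field value is analyzed into a TokenStream ahead of time.
- preanalyzed fields
byte[]
Binary data; the value is stored but not indexed.
Vector Space Model
- how many times each term occurs
- where each term occurs (its positions)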
Field options for sorting
For results to be sortable on a field, the following must hold:
- The field's options must be set correctly.
- The field must contain a single token per document (Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS).
- With other index options, sorting still works as long as the analyzer emits only one token, e.g. KeywordAnalyzer.
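The single-token requirement can be sketched in plain Java. This is only an illustration of the rule, not Lucene's sorting code: `SortableFieldSketch` is a hypothetical class, each document is represented by the list of tokens its field produced, and the check is simplified to "exactly one token per document".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortableFieldSketch {

    // A field gives an unambiguous sort key only if each document has one token.
    public static boolean isSortable(List<List<String>> tokensPerDoc) {
        for (List<String> tokens : tokensPerDoc) {
            if (tokens.size() != 1) {
                return false;
            }
        }
        return true;
    }

    // Return document ids ordered by their single token.
    public static List<Integer> sortedDocIds(List<List<String>> tokensPerDoc) {
        if (!isSortable(tokensPerDoc)) {
            throw new IllegalArgumentException("field must have exactly one token per document");
        }
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < tokensPerDoc.size(); i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparing((Integer id) -> tokensPerDoc.get(id).get(0)));
        return ids;
    }

    public static void main(String[] args) {
        // NOT_ANALYZED titles: one token each, so sorting is well defined.
        List<List<String>> notAnalyzed = Arrays.asList(
                Arrays.asList("Lucene in Action"),
                Arrays.asList("Apache Solr"));
        System.out.println(sortedDocIds(notAnalyzed));

        // ANALYZED titles: several tokens each, so there is no single sort key.
        List<List<String>> analyzed = Arrays.asList(
                Arrays.asList("lucene", "in", "action"),
                Arrays.asList("apache", "solr"));
        System.out.println(isSortable(analyzed));
    }
}
```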
Multivalued fields
A single field name can carry multiple values:
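// Needs org.apache.lucene.document.Document and org.apache.lucene.document.Field (Lucene 3.x API).
// authors is assumed to be a String[] or Iterable<String> defined elsewhere.
Document doc = new Document();
for (String author : authors) {
    // Adding several fields under the same name makes "author" multivalued.
    doc.add(new Field("author", author,
                      Field.Store.YES,
                      Field.Index.ANALYZED));
}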
Store.YES
Every value is stored, so the retrieved document contains multiple fields with the same name.