IR Chapter1-3
chapter 1 boolean search
布尔检索是数据库检索最基本的方法,是用逻辑“或”(+、OR)、逻辑"与"(×、AND)、逻辑"非"(-、NOT)等算符在数据库中对相关文献的定性选择的方法。
- 单词-文档矩阵,
- 索引结构是倒排索引 unit-posting,基于关键词的查找
- 查找
- 合并merge:按照Doc Frequency排序,从最小的数组开始合并
chapter 2 vocabulary list and posting
更多是关于web search,不是精确检索。在search前需要mapping。
1.文档处理document delineation and char seguence decoding
- index granularity索引粒度:确定document unit
(a message / a file /a message and its contained attachment / a book or a chapter or a sentence)
PS: 系统需要提供可选的索引granularity
2. vocabulary
-
tokenization( segment,eliminate punctuation etc.):hyphen、space
-
match:返回内容充分满足expectations,windows->Windows、WINDOWS 各种变换的形式
-
common words(stop words,pressure for time and space):
big list → small list → statistics
Web search engines generally do not use stop word lists.More precise with stop words -
Normalization (helps in queries, not in IR performance)
standard way:create equivalence classes
1). mapping【badcase U.S.A match USA,but C.A.T. not match cat】
2). maintain relations
a. index unnormalized tokens + a query expansion list
b. index construction, index 含有token同义词的document3). the forms of normalization
accents /diacritics
capitalization/case-folding:reduce letters to lower case
idiosyncratic issue in English:color vs colour -
Stemming and lemmatization:reduce inflectional forms
stemming remove derivational affixes
lemma 还原词根
3. Faster postings list intersection
- skip list pointer,而不是原来的逐一遍历
【Where do we place skips?】 There is a tradeoff.
heuristic : ` √P evenly-spaced skip pointers.
4. Positional postings and phrase queries
- ▼ Biword indexes→phrase index
single-word term index also needed
【cons】
large vocabulary→partial phrase index - ▲ Positional postings 位置信息索引
【cons】
expands posting size,depending on token - ▲+▼ combined scheme【biword index + position index】
index what biword?- query behavior:index frequently queried phrase
- individual word high frequency, phrase comparatively low
chapter 3 Dictionaries and tolerant retrieval(容错检索)
-
Search structures :Dictionary
-
Solution :
- hashing
- search tree【demand prescribed ordering】
Binary tree
B-tree
binary tree
B-Tree
-
Wildcard queries 通配符检索
找索引单元/term/单词,字符串查找
query:mon*
solution:regular B-Tree
process: walk down the tree following the symbol m, o, n.
query:*mon
solution:reversed B-Tree
process: walk down the tree following the symbol m, o, n.
e.g. lemon root->n->o->m->e->l
query:se*mon
solution2:permuterm indexes
solution1:regular B-Tree + reversed B-Tree
process:
search regular B-Tree, prefix se-, set A
search reversed B-Tree, suffix -mon, set B
intersect A ∩ B
General wildcard queries
-
permuterm index
也就是索引单元包含了首、尾的信息
cons:the dictionary becomes quite large, lead to blowup
image.png
e.g. m*n → n$m → man、moron
e.g. fi*mo*er → 首 er 尾fi → candidate terms →filter by exhaustive enumeration,检查中间是否包含"mo"
找完单词以后,进行document retrive -
k-gram indexes for wildcard queries
index k characters, with ¥ denoting beginning and end.
相比于permuterm index,数量相对少一些
e.g. k=3,castle →¥ca, cas, ast, stl, tle, le$
search
e.g. re*ve → ¥re and ve¥
filtering demanded string-matching operation
e.g. red* → ed¥→ retired【invalid,因为query左侧没有其他符号】
Spelling correction
1. implementation of spelling correction
2. 2 steps to solving the problem
- edit distance
- k-gram overlap
- implementation
- 选择正确答案的原则
选择最近的,proximity;
选择最常见的(出现最多;使用最多) - forms
isolated-term correction
context-sensitive correction.
- 编辑距离的计算【动态规划】
可以用来计算相似度
编辑距离其实是编辑操作的最小数目
edit operation:insert, replace, delete => 带权重的edit operation
3.k-gram索引 for spelling correction
【isolated spelling correction】
【Jaccard coefficient】
set A query里面的k-gram集合
set B term里面的k-gram集合
Jaccard coefficient = |A ∩ B| / |A ∪ B|
流程:
for t in term-posting list
Jaccard( t ) > 阈值
选择t作为候选
t 的 k-gram如何获得?
Jaccard = num(A&B) / num(A+B)-num(A+B)
num(A+B)是 posting hits
方法:
step1、经验方法,直接选出candidate
step2、计算编辑距离,选出具有最小edit distance的t
【context sensitive spelling correction】
query = w1 + w2 +.....wn
列举出每一个词的candidate
trim,bi-words,previous queries by users