IR Chapter1-3

2019-05-07  本文已影响0人  马小马_f182

chapter 1 boolean search

布尔检索是数据库检索最基本的方法,是用逻辑“或”(+、OR)、逻辑"与"(×、AND)、逻辑"非"(-、NOT)等算符在数据库中对相关文献的定性选择的方法。

  1. 单词-文档矩阵,
  2. 索引结构是倒排索引 unit-posting,基于关键词的查找

chapter 2 vocabulary list and posting

更多是关于web search,不是精确检索。在search前需要mapping。

1.文档处理document delineation and char seguence decoding

2. vocabulary

3. Faster postings list intersection

4. Positional postings and phrase queries

chapter 3 Dictionaries and tolerant retrieval(容错检索)

  1. hashing
  2. search tree【demand prescribed ordering】
    Binary tree
    B-tree
    binary tree
    B-Tree

找索引单元/term/单词,字符串查找

query:mon*
solution:regular B-Tree
process: walk down the tree following the symbol m, o, n.

query:*mon
solution:reversed B-Tree
process: walk down the tree following the symbol m, o, n.
e.g. lemon root->n->o->m->e->l

query:se*mon
solution2:permuterm indexes
solution1:regular B-Tree + reversed B-Tree
process:
search regular B-Tree, prefix se-, set A
search reversed B-Tree, suffix -mon, set B
intersect A ∩ B

General wildcard queries

  1. permuterm index
    也就是索引单元包含了首、尾的信息
    cons:the dictionary becomes quite large, lead to blowup


    image.png

    e.g. m*n → n$m → man、moron
    e.g. fi*mo*er → 首 er 尾fi → candidate terms →filter by exhaustive enumeration,检查中间是否包含"mo"
    找完单词以后,进行document retrive

  2. k-gram indexes for wildcard queries
    index k characters, with ¥ denoting beginning and end.
    相比于permuterm index,数量相对少一些
    e.g. k=3,castle →¥ca, cas, ast, stl, tle, le$
    search
    e.g. re*ve → ¥re and ve¥
    filtering demanded string-matching operation
    e.g. red* → ed¥→ retired【invalid,因为query左侧没有其他符号】

Spelling correction

1. implementation of spelling correction

2. 2 steps to solving the problem

  1. implementation
  1. 编辑距离的计算【动态规划】
    可以用来计算相似度
    编辑距离其实是编辑操作的最小数目
    edit operation:insert, replace, delete => 带权重的edit operation

3.k-gram索引 for spelling correction
【isolated spelling correction】
【Jaccard coefficient】
set A query里面的k-gram集合
set B term里面的k-gram集合
Jaccard coefficient = |A ∩ B| / |A ∪ B|
流程:
for t in term-posting list
   Jaccard( t ) > 阈值
   选择t作为候选

t 的 k-gram如何获得?
Jaccard = num(A&B) / num(A+B)-num(A+B)
num(A+B)是 posting hits
方法:
step1、经验方法,直接选出candidate
step2、计算编辑距离,选出具有最小edit distance的t

【context sensitive spelling correction】
query = w1 + w2 +.....wn
列举出每一个词的candidate
trim,bi-words,previous queries by users

上一篇下一篇

猜你喜欢

热点阅读