IR chapter 1: Boolean retrieval

2017-04-17  woodsouthmmm

Information retrieval

meaning

Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

keywords: unstructured, large scale. Compared with daunting database-style searching, IR offers a more natural and acceptable form of human-machine interaction, but it also makes data organization and query processing more challenging. (In fact, no data is truly unstructured.)

IR also covers supporting users in browsing or filtering document
collections, or in further processing a set of retrieved documents.

scale

An example information retrieval problem

Which plays in Shakespeare's collected works contain the words Brutus and Caesar but not Calpurnia?

grep
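The grep approach is a linear scan over every document at query time. A minimal sketch follows, assuming a made-up three-document collection (the texts are hypothetical, not Shakespeare's actual wording) and hypothetical helper names `contains` and `linear_scan`:

```python
import re

# Hypothetical toy collection: title -> text (illustrative, not real play text).
plays = {
    "Julius Caesar": "Brutus stabs Caesar while Calpurnia waits at home",
    "Hamlet": "Polonius says he played Caesar and Brutus killed him",
    "Macbeth": "Macbeth speaks of Caesar",
}

def contains(word, text):
    """True if `word` occurs in `text` as a whole word."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def linear_scan(docs):
    """Answer 'Brutus AND Caesar AND NOT Calpurnia' by reading every document."""
    return [
        title
        for title, text in docs.items()
        if contains("Brutus", text)
        and contains("Caesar", text)
        and not contains("Calpurnia", text)
    ]

result = linear_scan(plays)
print(result)
```

Every query rereads the whole collection, so the cost grows with the total text size. This is the motivation for indexing the documents in advance.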

(But what if we need to handle larger data, more flexible queries, and ranked retrieval, and to do it all more quickly?)

incidence matrix

incidence matrix for Shakespeare's collections query processing

extremely sparse
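With an incidence matrix, the query Brutus AND Caesar AND NOT Calpurnia becomes a bitwise operation on term rows. A rough sketch follows, storing each row as a Python integer with one bit per play. The three rows are the values I recall from the textbook's Shakespeare figure, so treat them as illustrative. The names `incidence` and `boolean_query` are my own.

```python
# Column order of the matrix: the leftmost bit is the first play.
plays = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One row per term; bit = 1 if the term occurs in that play.
# Values assumed from the textbook example, for illustration only.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

n = len(plays)
mask = (1 << n) - 1  # keeps the complement within n bits

def boolean_query(matrix):
    """Brutus AND Caesar AND NOT Calpurnia as bitwise operations."""
    not_calpurnia = ~matrix["Calpurnia"] & mask
    return matrix["Brutus"] & matrix["Caesar"] & not_calpurnia

answer = boolean_query(incidence)
bits = format(answer, f"0{n}b")
matching = [play for play, bit in zip(plays, bits) if bit == "1"]
print(bits, matching)
```

A full matrix stores a cell for every term-document pair, and almost all of those cells are 0. Recording only the 1s leads to the inverted index.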

terminology

ll type of true and false

inverted index/inverted file/index

part of inverted index for Shakespeare's collections

a first take at building an inverted index

  1. collect documents to be indexed
  2. tokenize the text, turning each document into a list of tokens
  3. do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
  4. index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings
4th step
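The four steps above can be sketched in a few lines. This is only a toy version: the two documents are short illustrative snippets, tokenization is a simple regular expression, and normalization is just lowercasing. The names `tokenize`, `normalize`, and `build_index` are assumptions of this sketch, not part of the notes.

```python
import re
from collections import defaultdict

# Step 1: collect the documents to be indexed (docID -> text, toy data).
docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text):
    """Step 2: turn a document into a list of tokens (letters only)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Step 3: linguistic preprocessing, here reduced to lowercasing."""
    return token.lower()

def build_index(documents):
    """Step 4: map each term to the sorted list of docIDs it occurs in."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Dictionary sorted by term, postings lists sorted by docID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_index(docs)
print(index["brutus"], index["capitol"], index["ambitious"])
```

Keeping each postings list sorted by docID is what makes the linear-time merge for Boolean queries possible.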

processing boolean queries

merge algorithm (algorithm for conjunctive queries)
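A minimal sketch of the merge for a two-term conjunctive query, assuming both postings lists are sorted by docID. The function name `intersect` and the sample postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists by walking them in step."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

# Illustrative postings lists (docIDs), sorted ascending.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]

result = intersect(brutus, calpurnia)
print(result)
```

Each step advances at least one pointer, so the merge takes time linear in the combined length of the two lists. For queries with more than two terms, a common heuristic is to intersect the shortest postings lists first so that intermediate results stay small.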

The extended Boolean model versus ranked retrieval
