IR-chapter1:Boolean retrieval

2017-04-17 本文已影响0人 woodsouthmmm

Information retrieval

meaning

Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

keywords: unstructured, large scale - provides a more natural and acceptable way of human-machine interaction compared with daunting database-style searching, also gives more challenge to data organization and query processing.(while In fact, no data is truly unstructured)

IR also covers supporting users in browsing or filtering document
collections or further processing a set of retrieved documents

scale

web search
billions of documents stored on millions of computers
gather documents to indexed
build efficient system
exploit hypertext
protect from being boosted
personal information retrieval
spotlight, instant search
email program, search and classification
enterprise, institutional, and domain-specific search

An example information retrieval problem

Shakespeare's collected works, containing the words Brutus and Caesar and not Calpurnia.

grep

(How about requiring lager data, more flexible query, ranked retrieval more quickly)

incidence matrix

incidence matrix for Shakespeare' collections

query processing

extremely sparse

terminology

boolean retrieval model
a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms.
term
the smallest unit we treat as the element of the set
document
units we have decided to build a retrieval system over
collection/corpus
the group of documents
information need
the topic about which the user desires to know more.
query
what the user convey to the computer.
relevant
a document is relevant if it is the one that the user perceives as containing information of value with respect to their personal informational need.
effectiveness
the quality of its search results

ll type of true and false

pricision
TP/(TP+FP)
recall
TP/(TP+FN)

inverted index/inverted file/index

part of inverted index for Shakespeare's collections

vocabulary/lexicon
the set of terms
dictionary
the data structure of the items
posting
each item in the list
posting list
postings
all posting lists

a first take at building an inverted index

collect documents to be indexed
tokenize the text, turning each document into a list of tokens
do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

4th step

storage
memory - disk(a linked list of fixed length arrays for each term)

processing boolean queries

simple conjunctive query

merge algorithm

query optimization
process in increasing order of term frequency

Algorithm for conjunctive queries

asymmetric
difference is large

The extended Boolean model versus ranked retrieval

ranked retrieval model
such as the vector space model, in which users freely use free text queries
the extended Boolean model
proximity operator: specify that two terms in a query must occur close to each other in a document