Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University
Recap: IR System & Tasks Involved
[Diagram of the IR system. Labels: INFORMATION NEED; DOCUMENTS; User Interface; QUERY; QUERY PROCESSING (PARSING & TERM PROCESSING); LOGICAL VIEW OF THE INFORMATION NEED; SELECT DATA FOR INDEXING; PARSING & TERM PROCESSING; LOGICAL VIEW OF THE DOCUMENTS (INDEX); SEARCHING; RANKING; RESULTS; DOCS.; RESULT REPRESENTATION; PERFORMANCE EVALUATION]
Data Structure for Search (Index)
Requirements:
  a) Represent documents appropriately
  b) Enable efficient and effective search
  In addition: c) Limit storage (note: tradeoff with speed)
Operations:
  Search
  Generation (add documents)
  Update (remove / replace documents)
  Others (Boolean search, phrases, etc.)
The Index
Example query: capital AND France (Boolean query)
  Doc. 1: The capital of France is called Paris.
  Doc. 2: Paris is the capital of France.
  Doc. 3: The capitals of France and England are called Paris and London, respectively.
Naive approach: scanning
The Boolean query (capital AND France) delivers Doc. 1 and Doc. 2 as results.
Question: Can we do this more efficiently?
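The naive scanning approach can be sketched in a few lines of Python. This is only an illustration, not an implementation from the lecture: the names `DOCS`, `tokenize` and `scan_and` are made up here, and terms are assumed to be matched exactly (case-sensitive, no stemming), which is why "capitals" in Doc. 3 does not match "capital".

```python
import re

DOCS = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def tokenize(text):
    """Split text into words, dropping punctuation but keeping case."""
    return re.findall(r"[A-Za-z]+", text)

def scan_and(docs, *terms):
    """Return the ids of all documents that contain every query term."""
    hits = []
    for doc_id, text in docs.items():
        words = set(tokenize(text))  # every document is re-read for every query
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits

print(scan_and(DOCS, "capital", "France"))  # [1, 2]
```

Every query touches the full text of every document, which is what the index structures on the following slides avoid.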
Term-Document Incidence Matrix
Idea: Build a matrix with
  Columns = documents
  Rows = all appearing words (alphabetically sorted)
Example: Doc. 1: The capital of France is called Paris.

                 DOC. 1   DOC. 2   DOC. 3
  and              0        0        1
  are              0        0        1
  called           1        0        1
  capital          1        1        0
  capitals         0        0        1
  England          0        0        1
  France           1        1        1
  is               1        1        0
  London           0        0        1
  of               1        1        1
  Paris            1        1        1
  respectively     0        0        1
  The              1        0        1
  the              0        1        0

1 = word appears in doc.
0 = word does not appear
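As a rough sketch, the matrix for the three example documents could be built as follows. The function name `build_incidence_matrix` and the tokenization rule (letters only, case preserved) are assumptions made for this example; the sort key is chosen so that the row order matches the slide ("The" before "the").

```python
import re

DOCS = [
    "The capital of France is called Paris.",
    "Paris is the capital of France.",
    "The capitals of France and England are called Paris and London, respectively.",
]

def build_incidence_matrix(docs):
    """Map each term to a row of 0/1 flags, one flag per document."""
    tokenized = [set(re.findall(r"[A-Za-z]+", doc)) for doc in docs]
    # Case-insensitive alphabetical order, upper case first on ties.
    vocabulary = sorted(set().union(*tokenized), key=lambda w: (w.lower(), w))
    return {term: [int(term in words) for words in tokenized]
            for term in vocabulary}

matrix = build_incidence_matrix(DOCS)
for term, row in matrix.items():
    print(f"{term:<14}{row}")
```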
Term-Document Incidence Matrix: Boolean Queries
Query: capital AND France
  capital          ( 1, 1, 0 )
  France           ( 1, 1, 1 )
  bitwise AND      ( 1, 1, 0 )   i.e. Doc. 1 and Doc. 2 match
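The AND step is just a column-wise conjunction of the two term rows. A minimal sketch, with the two rows copied from the matrix and a hypothetical helper `boolean_and`:

```python
CAPITAL = (1, 1, 0)  # row for "capital"
FRANCE = (1, 1, 1)   # row for "France"

def boolean_and(*rows):
    """Combine incidence rows position by position with logical AND."""
    return tuple(int(all(bits)) for bits in zip(*rows))

result = boolean_and(CAPITAL, FRANCE)
matching_docs = [i + 1 for i, bit in enumerate(result) if bit]

print(result)         # (1, 1, 0)
print(matching_docs)  # [1, 2]
```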
Term-Document Incidence Matrix
Search: very easy, but not very efficient (the matrix needs one cell for every document-term pair)
The good news: this matrix is very sparse (i.e. lots of 0's, only a few 1's)
Idea: just store the 'hits' (term incidences)
Data structure: inverted file
Inverted File
For each term, store only the documents in which it appears (the 1-entries of the matrix):

  and            3
  are            3
  called         1, 3
  capital        1, 2
  capitals       3
  England        3
  France         1, 2, 3
  is             1, 2
  London         3
  of             1, 2, 3
  Paris          1, 2, 3
  respectively   3
  The            1, 3
  the            2
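One possible way to build such an inverted file in memory is sketched below. The names `DOCS` and `build_inverted_file` are illustrative, and the tokenization (letters only, case preserved) is an assumption chosen to match the example terms.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending ids keep every postings list sorted
        for term in set(re.findall(r"[A-Za-z]+", docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_file(DOCS)
print(index["capital"])  # [1, 2]
print(index["France"])   # [1, 2, 3]
```

Only the hits are stored, so the space needed grows with the number of term occurrences rather than with the number of documents times the number of terms.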
Inverted File
Main advantage: easy, efficient search
Disadvantages:
  Storage (10%-100% of doc. size)
  Modifications (updates, …)
Often other information is stored as well
  to support advanced queries (e.g. position for phrases)
  to speed up the search process (e.g. frequency for query optimization)
Inverted File & Term Frequency
Query: France AND London AND capitals
  France     … (203 documents)
  capitals   … (163 documents)
  London     … (24 documents)
Optimize the query to speed up search (i.e. limit the number of merging steps). Two possible evaluation orders:
  (France AND London) AND capitals
  (capitals AND London) AND France
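A common heuristic is to process the terms in order of increasing document frequency, so that intermediate results stay small. The sketch below shows this together with a standard merge of two sorted postings lists. The postings in `toy_index` are invented toy data (the actual lists on the slide are not given), and `intersect` and `and_query` are illustrative names.

```python
def intersect(a, b):
    """Merge two sorted postings lists, keeping the ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Evaluate term1 AND term2 AND ..., starting with the rarest term."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, index.get(term, []))
    return result

# Toy postings lists (hypothetical document ids, not the data from the slide).
toy_index = {
    "France": list(range(1, 21)),       # 20 documents
    "capitals": list(range(2, 21, 2)),  # 10 documents
    "London": [4, 8, 12],               # 3 documents
}
print(and_query(toy_index, ["France", "London", "capitals"]))  # [4, 8, 12]
```

Storing the document frequency in the dictionary makes this ordering possible without reading any postings list first.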
Implementation
[Diagram: dictionary entries (capitals, France, …) pointing to their postings lists]
Dictionary: usually kept in memory (fast!)
Postings: kept on disk, accessed via offset
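The split between an in-memory dictionary and on-disk postings can be sketched as follows. The file layout (postings stored as 4-byte little-endian unsigned integers, dictionary entries holding a byte offset and a count) is an assumption made for this example, as are the names `write_postings` and `read_postings`.

```python
import os
import struct
import tempfile

def write_postings(index, path):
    """Write all postings lists to one binary file.

    Returns the in-memory dictionary: term -> (byte offset, number of postings).
    """
    dictionary = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            postings = index[term]
            dictionary[term] = (f.tell(), len(postings))
            f.write(struct.pack(f"<{len(postings)}I", *postings))
    return dictionary

def read_postings(dictionary, path, term):
    """Look the term up in memory, then read only its postings from disk."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))

index = {"capitals": [3], "France": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    dictionary = write_postings(index, path)
    print(read_postings(dictionary, path, "France"))    # [1, 2, 3]
    print(read_postings(dictionary, path, "capitals"))  # [3]
```

A lookup costs one dictionary access in memory plus one seek and one read on disk, regardless of how many other terms are indexed.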
Dictionary: Size
The dictionary is usually kept in memory (speed). How big does it get?
Heaps' law
[Plot axes: text size N, dictionary size]
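Heaps' law is usually stated as V = K * N^beta, where N is the text size in tokens and V the number of distinct terms. The sketch below evaluates this formula; the default values K = 30 and beta = 0.5 are illustrative assumptions only (the constants depend on the collection and have to be fitted to it).

```python
def heaps_law(n_tokens, k=30.0, beta=0.5):
    """Estimate the dictionary size V = k * N**beta for a text of N tokens.

    k and beta are collection-dependent; the defaults are illustrative only.
    """
    return k * n_tokens ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(f"N = {n:>11,}  ->  V = {heaps_law(n):,.0f} terms")
```

With beta below 1 the dictionary grows much more slowly than the text, which is why keeping it in memory remains feasible for large collections.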
Entries in the Dictionary
and, are, called, capital, capitals, England, France, is, London, of, Paris, respectively, The, the
What terms / tokens should go into the dictionary?
Bag-of-words approaches: ignore word order
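What ignoring word order means can be seen in a tiny sketch: two sentences with different meanings end up with the same bag of words. The second sentence is invented for this illustration, and `bag_of_words` is a hypothetical helper.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, discarding the order of the words."""
    return Counter(re.findall(r"[A-Za-z]+", text))

a = bag_of_words("Paris is the capital of France.")
b = bag_of_words("France is the capital of Paris.")  # made-up counterexample

print(a == b)  # True
```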