1
Web Search – Summer Term 2006
II. Information Retrieval (Basics)
(c) Wolfgang Hürst, Albert-Ludwigs-University
2
Recap: IR System & Tasks Involved
[Diagram of the IR system. Components shown:
  User side: INFORMATION NEED, User Interface, QUERY, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORMATION NEED
  Document side: DOCUMENTS, SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, LOGICAL VIEW OF THE DOCUMENTS (INDEX)
  Retrieval: SEARCHING, RANKING, RESULTS, DOCS., RESULT REPRESENTATION
  PERFORMANCE EVALUATION]
3
Data Structure for Search (Index)
Requirements:
  a) Represent documents appropriately
  b) Enable efficient and effective search
In addition:
  c) Limit storage (note: tradeoff with speed)
Operations:
  - Search
  - Generation (add documents)
  - Update (remove / replace documents)
  - Others (Boolean search, phrases, etc.)
4
The Index
Example Query: capital AND France (Boolean Query)
  Doc. 1: The capital of France is called Paris.
  Doc. 2: Paris is the capital of France.
  Doc. 3: The capitals of France and England are called Paris and London, respectively.
Naive approach: Scanning
  Boolean query (capital AND France) delivers Doc. 1 and Doc. 2 as results
Question: Can we do this more efficiently?
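The naive scanning approach can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper and its letters-only, case-sensitive rule are assumptions chosen to match the terms that appear on the following slides, not something the slides specify.

```python
import re

DOCS = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def tokenize(text):
    # Assumed rule: runs of letters, case-sensitive, punctuation dropped.
    return re.findall(r"[A-Za-z]+", text)

def scan_and(docs, *terms):
    """Read every document in full and keep those containing all terms."""
    hits = []
    for doc_id, text in docs.items():
        words = set(tokenize(text))
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits

print(scan_and(DOCS, "capital", "France"))  # [1, 2]
```

Doc. 3 is not returned because it contains "capitals", not "capital". Every query rereads every document, which is the cost the index is meant to avoid.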
5
Term-Document Incidence Matrix
Idea: Build a matrix with
  Columns = documents
  Rows = all appearing words (alphabetically sorted)
Example: Doc. 1: The capital of France is called Paris.

                DOC. 1  DOC. 2  DOC. 3
  and              0       0       1
  are              0       0       1
  called           1       0       1
  capital          1       1       0
  capitals         0       0       1
  England          0       0       1
  France           1       1       1
  is               1       1       0
  London           0       0       1
  of               1       1       1
  Paris            1       1       1
  respectively     0       0       1
  The              1       0       1
  the              0       1       0

  1 = Word appears in doc.
  0 = Word does not appear
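As a rough sketch, the matrix for the three example documents could be built as follows. The tokenizer and the sort key (case-insensitive, with "The" before "the") are assumptions made so that the rows come out in the same order as on the slide.

```python
import re

DOCS = [
    "The capital of France is called Paris.",
    "Paris is the capital of France.",
    "The capitals of France and England are called Paris and London, respectively.",
]

def tokenize(text):
    # Assumed rule: runs of letters, case-sensitive, punctuation dropped.
    return re.findall(r"[A-Za-z]+", text)

def incidence_matrix(docs):
    """Map each term to a row of 0/1 flags, one flag per document."""
    token_sets = [set(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*token_sets), key=lambda w: (w.lower(), w))
    return {term: [int(term in s) for s in token_sets] for term in vocabulary}

matrix = incidence_matrix(DOCS)
for term, row in matrix.items():
    print(f"{term:<13}{row}")
```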
6
Term-Document Incidence Matrix
Boolean Queries
Query: capital AND France

                DOC. 1  DOC. 2  DOC. 3
  and              0       0       1
  are              0       0       1
  called           1       0       1
  capital          1       1       0
  capitals         0       0       1
  England          0       0       1
  France           1       1       1
  is               1       1       0
  London           0       0       1
  of               1       1       1
  Paris            1       1       1
  respectively     0       0       1
  The              1       0       1
  the              0       1       0
7
Boolean Queries
Query: capital AND France

                DOC. 1  DOC. 2  DOC. 3
  and              0       0       1
  are              0       0       1
  called           1       0       1
  capital          1       1       0
  capitals         0       0       1
  England          0       0       1
  France           1       1       1
  is               1       1       0
  London           0       0       1
  of               1       1       1
  Paris            1       1       1
  respectively     0       0       1
  The              1       0       1
  the              0       1       0

  capital AND France = ( 1, 1, 0 ) AND ( 1, 1, 1 ) = ( 1, 1, 0 )
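Answering the query then amounts to a position-wise AND of the two rows. A minimal sketch, with the two rows copied from the matrix and the function name `boolean_and` chosen here for illustration:

```python
ROWS = {
    "capital": (1, 1, 0),
    "France": (1, 1, 1),
}

def boolean_and(*rows):
    """Combine incidence rows position by position with logical AND."""
    return tuple(int(all(bits)) for bits in zip(*rows))

result = boolean_and(ROWS["capital"], ROWS["France"])
matching_docs = [i + 1 for i, bit in enumerate(result) if bit]
print(result, matching_docs)  # (1, 1, 0) [1, 2]
```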
8
Term-Document Incidence Matrix
Search: Very easy, but not very efficient
  (e.g. 1 000 000 docs, 1 000 terms = Matrix with 1 000 000 000 cells)
The good news: This matrix is very sparse (i.e. lots of 0’s, only few 1’s)
Idea: Just store the ‘hits’ (term incidences)
Data structure: Inverted File
9
Inverted File

                DOC1  DOC2  DOC3     Postings
  and             0     0     1      3
  are             0     0     1      3
  called          1     0     1      1 3
  capital         1     1     0      1 2
  capitals        0     0     1      3
  England         0     0     1      3
  France          1     1     1      1 2 3
  is              1     1     0      1 2
  London          0     0     1      3
  of              1     1     1      1 2 3
  Paris           1     1     1      1 2 3
  respectively    0     0     1      3
  The             1     0     1      1 3
  the             0     1     0      2
10
Inverted File
Main advantage: Easy, efficient search
Disadvantages:
  - Storage (10%-100% of doc. size)
  - Modifications (updates, …)
Often other information is stored as well
  - to support advanced queries (e.g. position for phrases)
  - to speed up the search process (e.g. frequency for query optimization)
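One way to picture the extra position information is a positional index: for each term and document, the list of word positions. The sketch below is an assumption about how such an index might look, not the layout used in the lecture; `phrase_search` handles two-word phrases only.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The capital of France is called Paris.",
    2: "Paris is the capital of France.",
    3: "The capitals of France and England are called Paris and London, respectively.",
}

def tokenize(text):
    # Assumed rule: runs of letters, case-sensitive, punctuation dropped.
    return re.findall(r"[A-Za-z]+", text)

def build_positional_index(docs):
    """Return {term: {doc_id: [word positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term][doc_id].append(position)
    return index

def phrase_search(index, first, second):
    """Ids of documents where `second` directly follows `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

pos_index = build_positional_index(DOCS)
print(phrase_search(pos_index, "called", "Paris"))  # [1, 3]
```

The document frequency used for query optimization falls out of the same structure as `len(pos_index[term])`.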
11
Inverted File & Term Frequency
Query: France AND London AND capitals
  France:    24 32 37 …  (203 documents)
  capitals:  16 24 37 …  (163 documents)
  London:    28 32 37 …  (24 documents)
Optimize query to speed up search (i.e. limit number of merging steps)
  (France AND London) AND capitals
  (capitals AND London) AND France
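A common heuristic for this kind of optimization is to intersect the shortest postings lists first, so that intermediate results stay small. The sketch below assumes sorted postings lists; the lists in `INDEX` are short hypothetical stand-ins, since the slide only shows the first few entries of the real ones.

```python
def intersect(a, b):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """AND all terms, processing them in order of increasing frequency."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # nothing left to match, skip the remaining merges
        result = intersect(result, index.get(term, []))
    return result

# Hypothetical postings lists, for illustration only.
INDEX = {
    "France": [2, 4, 8, 16, 24, 32, 37, 41],
    "capitals": [16, 24, 37, 50],
    "London": [28, 32, 37],
}
print(and_query(INDEX, ["France", "London", "capitals"]))  # [37]
```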
12
Implementation
  Dictionary: capitals, France, …
  Postings:   24 32 37 …
              28 32 37 …
Dictionary: Usually kept in memory (fast!)
Postings: Kept on disk, accessed via offset
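The split between an in-memory dictionary and on-disk postings might look roughly like this. The file layout (unsigned 32-bit little-endian ids, one contiguous run per term) and the `(offset, count)` dictionary entries are assumptions made for this sketch.

```python
import os
import struct
import tempfile

def write_index(index, path):
    """Write postings to disk; return the in-memory dictionary."""
    dictionary = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            ids = index[term]
            # Remember where this term's postings start and how many follow.
            dictionary[term] = (f.tell(), len(ids))
            f.write(struct.pack(f"<{len(ids)}I", *ids))
    return dictionary

def read_postings(dictionary, path, term):
    """Look the term up in memory, then seek straight to its postings."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))

index = {"capitals": [3], "France": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    dictionary = write_index(index, path)
    france = read_postings(dictionary, path, "France")
    missing = read_postings(dictionary, path, "Berlin")
print(dictionary["France"], france)
```

Only the dictionary lookup touches memory; each query term then costs one seek and one read on the postings file.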
13
Dictionary: Size
Dictionary usually kept in memory (speed)
How big does it get? Heaps' law
[Plot: dictionary size as a function of text size N]
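Heaps' law is usually stated as V = K · N^β, where N is the text size in tokens, V the dictionary size, and K and β are constants fitted to the collection, with 0 < β < 1. The slide gives no parameter values, so the defaults below (K = 30, β = 0.5) are purely illustrative assumptions.

```python
def heaps_dictionary_size(n_tokens, k=30.0, beta=0.5):
    """Estimate the number of distinct terms as V = k * N**beta.

    k and beta are illustrative defaults, not values from the lecture;
    in practice they are fitted to a specific collection.
    """
    return k * n_tokens ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_dictionary_size(n)))
```

Because β < 1 the dictionary grows sublinearly: under these assumed parameters, a hundredfold increase in text size gives only a tenfold increase in dictionary size.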
14
Entries in the Dictionary

  Term            Postings
  and             3
  are             3
  called          1 3
  capital         1 2
  capitals        3
  England         3
  France          1 2 3
  is              1 2
  London          3
  of              1 2 3
  Paris           1 2 3
  respectively    3
  The             1 3
  the             2

What terms / tokens should go into the dictionary?
Bag-of-words approaches: ignore word order
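That a bag-of-words representation ignores word order can be seen with a small sketch. The second sentence is a made-up example (and factually wrong on purpose): it uses the same words as Doc. 2 in a different order, so both get the same representation.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Assumed tokenizer: runs of letters, case-sensitive.
    return Counter(re.findall(r"[A-Za-z]+", text))

bag_a = bag_of_words("Paris is the capital of France.")
bag_b = bag_of_words("France is the capital of Paris.")
print(bag_a == bag_b)  # True
```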
15
Recap: IR System & Tasks Involved
[Diagram of the IR system. Components shown:
  User side: INFORMATION NEED, User Interface, QUERY, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORMATION NEED
  Document side: DOCUMENTS, SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, LOGICAL VIEW OF THE DOCUMENTS (INDEX)
  Retrieval: SEARCHING, RANKING, RESULTS, DOCS., RESULT REPRESENTATION
  PERFORMANCE EVALUATION]