CS122B: Projects in Databases and Web Applications Winter 2017


1 CS122B: Projects in Databases and Web Applications Winter 2017
Professor Chen Li, Department of Computer Science, UC Irvine
Notes 13: Inverted Index
Slides borrowed from Prof. Manning at Stanford

2 Query
Which documents contain the words Cat AND Dog but NOT Fish?

3 Inverted index
For each term T, we must store a list of all documents that contain T. Do we use an array or a list for this?
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34
Fish: 13 16
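The idea of storing, for each term, the list of documents that contain it can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function name `build_inverted_index` and the toy documents are assumptions made for the example, and terms are split on whitespace only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Keep every postings list sorted by docID so later merges stay linear.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

# Hypothetical toy collection (not from the slides).
docs = {
    1: "dog barks",
    2: "cat and dog",
    3: "dog sleeps",
    4: "cat naps",
}
index = build_inverted_index(docs)
print(index["cat"])  # [2, 4]
print(index["dog"])  # [1, 2, 3]
```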

4 Inverted index
Linked lists generally preferred to arrays:
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers
Dictionary: Cat, Dog, Fish
Postings:
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34
Fish: 13 16
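A linked-list postings structure might look like the sketch below. The class names `PostingNode` and `PostingsList` are hypothetical. The point is only that a new docID can be spliced in without reallocating or shifting an array, at the price of one extra pointer per entry.

```python
class PostingNode:
    """One postings entry: a docID plus a pointer to the next entry."""
    def __init__(self, doc_id, next_node=None):
        self.doc_id = doc_id
        self.next_node = next_node

class PostingsList:
    """Singly linked postings list kept sorted by docID."""
    def __init__(self):
        self.head = None

    def insert(self, doc_id):
        # New smallest docID (or empty list): splice in at the head.
        if self.head is None or doc_id < self.head.doc_id:
            self.head = PostingNode(doc_id, self.head)
            return
        # Walk to the last node whose docID is <= the new one.
        node = self.head
        while node.next_node is not None and node.next_node.doc_id <= doc_id:
            node = node.next_node
        # Skip duplicates; otherwise splice the new node in after `node`.
        if node.doc_id != doc_id:
            node.next_node = PostingNode(doc_id, node.next_node)

    def to_list(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.doc_id)
            node = node.next_node
        return out

fish = PostingsList()
for doc_id in (16, 13, 16):
    fish.insert(doc_id)
print(fish.to_list())  # [13, 16]
```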

5 Query processing
Consider processing the query: Cat AND Dog
Locate Cat in the Dictionary; retrieve its postings.
Locate Dog in the Dictionary; retrieve its postings.
"Merge" the two postings:
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34

6 The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34
Result: 2 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
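The merge can be sketched as a two-pointer walk over the sorted lists. The function name `intersect` is an assumption for this example. The loop advances whichever pointer sits on the smaller docID, so each entry is visited at most once.

```python
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: it matches the AND query.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind; advance it
        else:
            j += 1  # p2 is behind; advance it
    return answer

cat = [2, 4, 8, 16, 32, 64, 128]
dog = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(cat, dog))  # [2, 8]
```

The walk only works because both lists are sorted by docID. On unsorted lists the comparison `p1[i] < p2[j]` would no longer justify skipping an entry.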

7 Boolean queries: Exact match
Boolean queries are queries using AND, OR and NOT together with query terms.
Views each document as a set of words.
Is precise: a document either matches the condition or it does not.
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.
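As a rough illustration of exact-match semantics, the sketch below evaluates the query Cat AND Dog but NOT Fish over the postings shown on the earlier slides. The helper `boolean_query` is hypothetical, and it uses Python sets for brevity in place of the linear merge.

```python
def boolean_query(postings, required, excluded=()):
    """Return sorted docIDs that contain every required term and no excluded term."""
    if not required:
        return []
    result = set(postings.get(required[0], []))
    for term in required[1:]:
        result &= set(postings.get(term, []))  # AND
    for term in excluded:
        result -= set(postings.get(term, []))  # NOT
    return sorted(result)

postings = {
    "cat": [2, 4, 8, 16, 32, 64, 128],
    "dog": [1, 2, 3, 5, 8, 13, 21, 34],
    "fish": [13, 16],
}
print(boolean_query(postings, ["cat", "dog"], ["fish"]))  # [2, 8]
```

A document is either in the result or not; there is no ranking or partial match.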

8 Other Challenges
Stemming
Tokenization
Stop words
Synonyms
Especially hard for non-Latin languages, e.g., Chinese, Japanese
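To make the first three challenges concrete, here is a deliberately naive normalization sketch. Everything in it is an assumption for illustration: the stop-word list is a tiny made-up sample, and `naive_stem` is crude suffix stripping, not a real stemming algorithm. The regular expression also only recognizes ASCII letters and digits, so it would not handle text such as Chinese or Japanese, which is not written with spaces between words.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    """Tokenize on ASCII letters/digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("The cats and the dogs"))  # ['cat', 'dog']
```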
