CS122B: Projects in Databases and Web Applications
Winter 2017
Professor Chen Li
Department of Computer Science, UC Irvine
Notes 13: Inverted Index
Slides borrowed from Prof. Manning at Stanford
Query
Which documents contain the words Cat AND Dog but NOT Fish?
Inverted index
For each term T, we must store a list of all documents that contain T.
Do we use an array or a list for this?
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34
Fish: 13 16
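A minimal sketch of building such an index in Python, assuming whitespace-separated, lowercased terms. The function name `build_inverted_index` and the tiny document collection are made up for illustration; the docIDs are chosen to agree with the postings shown on the slide.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort each postings list by docID so that lists can be merged later.
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical document collection (docID -> text).
docs = {
    1: "dog",
    2: "cat dog",
    4: "cat",
    13: "dog fish",
    16: "cat fish",
}

index = build_inverted_index(docs)
print(index["cat"])   # [2, 4, 16]
print(index["dog"])   # [1, 2, 13]
print(index["fish"])  # [13, 16]
```

Here each postings list is a Python list (an array); whether to use an array or a linked list is the trade-off the slides turn to next.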
Inverted index
Linked lists are generally preferred to arrays:
Dynamic space allocation
Easy insertion of new documents into a term's postings
Drawback: space overhead of pointers

Dictionary   Postings
Cat          2 4 8 16 32 64 128
Dog          1 2 3 5 8 13 21 34
Fish         13 16
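To make the linked-list option concrete, here is a rough sketch of a postings list kept as a singly linked list in docID order. The class names `PostingNode` and `PostingsList` are hypothetical and not taken from the slides; the point is only that an insertion relinks one pointer instead of shifting array elements, at the cost of one pointer per entry.

```python
class PostingNode:
    """One postings entry: a docID plus a pointer to the next entry."""
    def __init__(self, doc_id, next_node=None):
        self.doc_id = doc_id
        self.next = next_node

class PostingsList:
    """Singly linked postings list kept in ascending docID order."""
    def __init__(self):
        self.head = None

    def insert(self, doc_id):
        """Insert doc_id in sorted position; duplicates are ignored."""
        if self.head is None or doc_id < self.head.doc_id:
            self.head = PostingNode(doc_id, self.head)
            return
        node = self.head
        # Advance to the last node whose docID is <= doc_id.
        while node.next is not None and node.next.doc_id <= doc_id:
            node = node.next
        if node.doc_id != doc_id:
            node.next = PostingNode(doc_id, node.next)

    def to_list(self):
        """Collect the docIDs into a Python list, in stored order."""
        out = []
        node = self.head
        while node is not None:
            out.append(node.doc_id)
            node = node.next
        return out

fish = PostingsList()
for doc_id in (16, 13, 16):   # arrives out of order, with a duplicate
    fish.insert(doc_id)
print(fish.to_list())  # [13, 16]
```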
Query processing
Consider processing the query: Cat AND Dog
Locate Cat in the Dictionary; retrieve its postings.
Locate Dog in the Dictionary; retrieve its postings.
"Merge" the two postings:
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.
Cat: 2 4 8 16 32 64 128
Dog: 1 2 3 5 8 13 21 34
Result: 2 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
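The merge can be sketched in Python as a two-pointer walk over the two sorted lists. The function name `intersect` is an illustrative choice; the postings are the Cat and Dog lists from the slide, held here as plain Python lists.

```python
def intersect(p1, p2):
    """Return the docIDs present in both sorted postings lists."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: it matches the AND query.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that sits on the smaller docID.
            i += 1
        else:
            j += 1
    return answer

cat = [2, 4, 8, 16, 32, 64, 128]
dog = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(cat, dog))  # [2, 8]
```

Each loop iteration advances at least one pointer, so the loop runs at most x+y times. The walk relies on both lists being sorted by docID; on unsorted lists it could skip matches.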
Boolean queries: Exact match
Boolean queries are queries using AND, OR and NOT together with query terms.
Each document is viewed as a set of words.
Precise: a document either matches the condition or it does not.
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.
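As an illustration of exact-match semantics, the query Cat AND Dog but NOT Fish can be evaluated with Python's built-in set operations over the postings from the earlier slides. This is a simplification for readability: representing postings as sets is an assumption of this sketch, and a real engine would more likely use linear merges over sorted lists.

```python
# Postings from the slides, held as sets of docIDs.
cat = {2, 4, 8, 16, 32, 64, 128}
dog = {1, 2, 3, 5, 8, 13, 21, 34}
fish = {13, 16}

# AND is set intersection; AND NOT is set difference.
cat_and_dog_not_fish = sorted((cat & dog) - fish)
print(cat_and_dog_not_fish)  # [2, 8]

# OR is set union.
cat_or_fish = sorted(cat | fish)
print(cat_or_fish)  # [2, 4, 8, 13, 16, 32, 64, 128]
```

Every docID is either in the result or not; there is no ranking or partial match.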
Other Challenges
Stemming
Tokenization
Stop words
Synonyms
These are especially hard for non-Latin languages, e.g., Chinese and Japanese.
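Two of these steps, tokenization and stop-word removal, can be sketched as follows. The regular expression and the tiny `STOP_WORDS` set are assumptions made for illustration, not a recommended configuration. Stemming and synonyms are left out because they usually depend on language-specific resources.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text):
    """Tokenize, then drop stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The cat and the dog"))  # ['cat', 'dog']

# The same ASCII-only pattern finds no tokens at all in Chinese text,
# which has no spaces between words and no ASCII letters.
print(tokenize("我喜欢猫"))  # []
```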