Heavy-Tailed Distribution and Multi-Keyword Queries. Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui. Microsoft Corporation. SIGIR 2007.

2008. 07. 31. Summarized by JongHeum Yeon, IDS Lab., Seoul National University

INTRODUCTION
- Inverted index in information retrieval
  - Documents: T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
  - Index: "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
  - Search "what", "is", "it": {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
- Some queries require costly deep traversal of long lists on websites (Amazon, eBay, ...) with large product catalogs
- The challenge is to reduce the worst-case overhead required to process arbitrary keyword queries
Copyright © 2008 by CEBT
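
The inverted-index example on this slide can be reproduced with a minimal sketch. The function names `build_inverted_index` and `search` are hypothetical, chosen for illustration, and whitespace tokenization is an assumption; the slide does not say how text is tokenized.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():  # assumption: whitespace tokenization
            index[word].add(doc_id)
    return dict(index)

def search(index, keywords):
    """Return the IDs of documents containing every keyword."""
    lists = [index.get(w, set()) for w in keywords]
    if not lists:
        return set()
    return set.intersection(*lists)

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
result = search(index, ["what", "is", "it"])
print(result)  # {0, 1}
```

The cost of `search` grows with the length of the lists being intersected, which is the problem the rest of the deck addresses.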

Motivating Scenario
- More frequent terms have relatively long inverted lists
- Intersections of long inverted lists are very slow relative to other queries
- [Figure: 20 million products; keyword frequency classes F (>900K), M (50K), L (<1K)]

Problem Statement
- Given a document collection, propose a set of indexes to materialize such that:
  - The time for intersecting keywords does not exceed a given threshold Δ
  - The additional indexes are no larger than k (a small factor) times the size of the original inverted index

INDEX STRUCTURE AND USAGE
- Notation
  - Query Q: words(Q) = {w_1, ..., w_l}
  - k_max: maximum number of terms in a query
  - γ: global vocabulary
  - π: global ordering
    - Given a keyword combination C = {w_1, ..., w_l}, sort its words by the global ordering to avoid storing permutations of the same combination
  - size(Q): number of items (documents) whose text contains all keywords of a query Q
  - size(w): for a single word w, the number of documents containing w
  - |Q|: number of keywords a query Q contains
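
A minimal sketch of how a global ordering removes permutations: every ordering of the same keywords maps to one canonical key. The ordering `pi` below (alphabetical rank) and the helper name `canonical_combination` are assumptions for illustration; the slide does not say which ordering is used.

```python
def canonical_combination(words, ordering):
    """Sort a keyword combination by the global ordering.

    Every permutation of the same keywords yields the same tuple,
    so the combination can be stored under a single key.
    """
    return tuple(sorted(set(words), key=lambda w: ordering[w]))

# Hypothetical global ordering: rank of each word in the vocabulary.
pi = {"a": 0, "banana": 1, "is": 2, "it": 3, "what": 4}

key1 = canonical_combination(["what", "is", "it"], pi)
key2 = canonical_combination(["it", "what", "is"], pi)
print(key1 == key2)  # True
```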

Cost Model
- Cost = disk seeks to the beginning of posting lists + scanning of postings
- Unit of cost: scanning a single posting in an inverted index
- Δ: cost bound
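
The cost model can be made concrete with a small sketch. It assumes one seek per posting list and a full scan of each list; the seek cost of 1000 units is the value quoted on the experiments slide, while the list lengths and the bound `delta` in the example are made-up numbers.

```python
SEEK_COST = 1000  # cost of one disk seek, in units of "one posting scanned"

def id_intersection_cost(list_lengths, seek_cost=SEEK_COST):
    """Cost of seeking to each posting list and scanning it entirely."""
    return seek_cost * len(list_lengths) + sum(list_lengths)

def within_bound(list_lengths, delta, seek_cost=SEEK_COST):
    """True if the query can be answered without exceeding the bound."""
    return id_intersection_cost(list_lengths, seek_cost) <= delta

# Hypothetical query over one frequent and one mid-frequency keyword.
cost = id_intersection_cost([900_000, 50_000])
print(cost)                                       # 952000
print(within_bound([900_000, 50_000], 200_000))   # False
```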

Processing Strategies
- Execution strategies
  - ID-intersection
    - Retrieves the inverted lists of all queried keywords and intersects them
    - Requires |Q| seek accesses to disk and reads each list in full
  - Post-filtering
    - Used when some w_i in Q is very rare
    - Retrieves the documents containing w_i through its inverted list, then verifies the remaining keyword constraints against the document text
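
The two strategies can be contrasted with a minimal in-memory sketch. The function names are hypothetical, and checking the remaining keywords by splitting the stored text on whitespace is an assumption; the slide only says the constraints are verified using the text.

```python
def id_intersection(index, keywords):
    """Read every queried posting list in full and intersect them."""
    postings = [set(index.get(w, ())) for w in keywords]
    return set.intersection(*postings) if postings else set()

def post_filtering(index, docs, keywords, rare_word):
    """Read only the rare word's list, then check the other keywords
    against the text of each candidate document."""
    rest = [w for w in keywords if w != rare_word]
    matches = set()
    for doc_id in index.get(rare_word, ()):
        tokens = set(docs[doc_id].split())  # assumption: whitespace tokens
        if all(w in tokens for w in rest):
            matches.add(doc_id)
    return matches

docs = ["it is what it is", "what is it", "it is a banana"]
index = {"a": {2}, "banana": {2}, "is": {0, 1, 2},
         "it": {0, 1, 2}, "what": {0, 1}}

by_intersection = id_intersection(index, ["banana", "is"])
by_filtering = post_filtering(index, docs, ["banana", "is"], "banana")
print(by_intersection == by_filtering)  # True
```

Post-filtering touches one short list plus a few documents, so it avoids scanning the long list for "is" at all.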

Index Structure
- Materialize inverted indexes for combinations of frequent keywords, but only for a small fraction of all such combinations
- For each vocabulary item w, keep a list of all keyword combinations containing w for which the corresponding inverted index has been materialized
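
One way to picture the per-word lists is the sketch below. It assumes the materialized indexes are held in a dictionary keyed by the sorted keyword combination; the names `materialized` and `build_match_lists` and the sample postings are hypothetical, not taken from the paper.

```python
from collections import defaultdict

def build_match_lists(materialized):
    """For each word, list the materialized combinations that contain it."""
    match_lists = defaultdict(list)
    for combo in materialized:
        for word in combo:
            match_lists[word].append(combo)
    return dict(match_lists)

# Hypothetical materialized indexes: combination -> posting list.
materialized = {
    ("is", "it"): [0, 1, 2],
    ("is", "it", "what"): [0, 1],
}
match_lists = build_match_lists(materialized)
print(match_lists["what"])  # [('is', 'it', 'what')]
```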

Query Processing
- Query Q = {w_1, ..., w_l}
- If Q contains a rare keyword: use the post-filtering strategy
- Otherwise: retrieve all match-list entries
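
The dispatch between the two cases might look like the sketch below. Several parts are assumptions not stated on the slide: the rarity test (`size` below a hypothetical `rare_threshold`), the choice of the rarest word for post-filtering, and the filter that keeps only materialized combinations fully contained in the query. How the retrieved entries are then combined is not covered here.

```python
def plan_query(query, size, match_lists, rare_threshold):
    """Choose a strategy for a query (illustrative only)."""
    words = sorted(set(query))
    rare = [w for w in words if size.get(w, 0) < rare_threshold]
    if rare:
        # Assumption: post-filter on the rarest keyword.
        return ("post-filtering", min(rare, key=lambda w: size.get(w, 0)))
    candidates = set()
    for w in words:
        for combo in match_lists.get(w, []):
            # Assumption: only combinations covered by the query are useful.
            if set(combo) <= set(words):
                candidates.add(combo)
    return ("match-lists", sorted(candidates))

size = {"a": 1, "banana": 1, "is": 3, "it": 3, "what": 2}
match_lists = {
    "is": [("is", "it"), ("is", "it", "what")],
    "it": [("is", "it"), ("is", "it", "what")],
    "what": [("is", "it", "what")],
}

plan_rare = plan_query(["banana", "is"], size, match_lists, rare_threshold=2)
plan_frequent = plan_query(["what", "is", "it"], size, match_lists, rare_threshold=2)
print(plan_rare)      # ('post-filtering', 'banana')
print(plan_frequent)  # ('match-lists', [('is', 'it'), ('is', 'it', 'what')])
```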

EXPERIMENTS
- Evaluation of query cost
  - Materialized the index structure for 10K frequent words
  - k_max = 4, Cost_Seek = 1000
  - Δ: cost of scanning 20% of the number of postings
  - Speed-ups
    - 18x (2 keywords)
    - 14x (4 keywords)
- Evaluation of index sizes
  - 899M postings
  - No additional indexes for keywords occurring in fewer than 50 documents
  - 141K keywords for indexing
  - Multi-keyword index structures contained 734M postings
- Accuracy of intersection-size estimation
  - Match list covers 99.3%
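
The two size figures can be related to the space budget from the problem statement with a quick calculation. This assumes that the 899M postings are the size of the original inverted index and that the 734M postings are the additional multi-keyword structures; the slide lists both numbers without saying so explicitly.

```python
ORIGINAL_POSTINGS = 899_000_000       # assumed: original inverted index
MULTI_KEYWORD_POSTINGS = 734_000_000  # assumed: additional structures

# Additional space as a fraction of the original index.
overhead_ratio = MULTI_KEYWORD_POSTINGS / ORIGINAL_POSTINGS
print(f"{overhead_ratio:.3f}")  # 0.816
```

Under these assumptions the additional structures are smaller than the original index, so the space factor is below 1.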

