Published by Prosper Smith. Modified over 9 years ago.
1
Heavy-Tailed Distribution and Multi-Keyword Queries
Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui
Microsoft Corporation
SIGIR 2007
2008. 07. 31.
Summarized by JongHeum Yeon, IDS Lab., Seoul National University
2
Copyright 2008 by CEBT
INTRODUCTION
Inverted Index in Information Retrieval
– T_0 = "it is what it is", T_1 = "what is it", T_2 = "it is a banana"
– "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
– Search "what", "is", "it": {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
Some queries require costly deep traversal of long lists on web sites (Amazon, eBay, …) with large catalogs of products
The challenge is to reduce the worst-case overhead required to process arbitrary keyword queries
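The inverted-index example on this slide can be reproduced with a minimal Python sketch. The function names `build_inverted_index` and `intersect` are illustrative and do not come from the paper; whitespace tokenization is an assumption that happens to fit the three toy documents.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():  # assumption: plain whitespace tokenization
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def intersect(index, keywords):
    """Return the ids of the documents that contain every keyword."""
    lists = [set(index.get(w, ())) for w in keywords]
    return sorted(set.intersection(*lists)) if lists else []

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
result = intersect(index, ["what", "is", "it"])
print(result)  # [0, 1]
```

The cost of `intersect` grows with the length of the posting lists involved, which is the problem the rest of the slides address.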
3
Motivating Scenario
More frequent terms have relatively long inverted lists
Intersections of long inverted lists are very slow relative to other queries
[Figure: collection of 20 million products; keyword frequency classes F (> 900K), M (50K), L (< 1K)]
4
Problem Statement
Given a document collection, propose a set of indexes to materialize such that
– The time for intersecting keywords does not exceed a given threshold Δ
– The additional indexes are not larger than k (a small factor) times the size of the original inverted index
5
INDEX STRUCTURE AND USAGE
Notation
– Query Q, words(Q) = {w_1, …, w_l}
– k_max : maximum number of terms in a query
– γ : global vocabulary
– π : global ordering. Given a keyword combination C = {w_1, …, w_l}, the words are sorted by the global ordering to avoid storing permutations of the same combination
– size(Q) : number of items (= documents) whose text contains all keywords of query Q
– size(w) : for a single word w, the number of documents containing w
– |Q| : number of keywords that query Q contains
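To make the notation concrete, here is a small sketch on the toy collection from the introduction. The slides do not say which global ordering π is used, so alphabetical order is assumed here purely for illustration; `canonical` and `size` are hypothetical helper names.

```python
# Toy posting sets from the introduction's example (word -> document ids).
index = {"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}}

# Assumption: pi is alphabetical order over the vocabulary.
pi = {word: rank for rank, word in enumerate(sorted(index))}

def canonical(words):
    """Sort a keyword combination by pi so every permutation maps to one key."""
    return tuple(sorted(set(words), key=lambda w: pi[w]))

def size(words):
    """size(Q): number of documents containing every keyword in words."""
    postings = [index.get(w, set()) for w in words]
    return len(set.intersection(*postings)) if postings else 0

q = ["what", "is", "it"]
print(canonical(q), size(q), len(q))  # canonical key, size(Q), |Q|
```

With a fixed ordering, {what, is, it} and {it, what, is} resolve to the same key, so a materialized combination needs to be stored only once.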
6
Cost Model
Cost = disk seeks to the beginning of posting lists + scanning of postings
Unit of cost : scanning a single posting in an inverted index
Δ : cost bound
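One way to read this cost model is as a sum of a per-list seek charge and a per-posting scan charge. The sketch below assumes exactly that; the seek cost of 1000 units is the value reported on the experiments slide, while the function names and the example list lengths are illustrative.

```python
COST_SEEK = 1000  # seek cost in units of "one posting scanned" (experiments slide)

def id_intersection_cost(list_lengths, cost_seek=COST_SEEK):
    """Assumed cost: one seek per posting list plus one unit per posting read."""
    return len(list_lengths) * cost_seek + sum(list_lengths)

def within_bound(list_lengths, delta):
    """True if the query can be answered within the cost bound delta."""
    return id_intersection_cost(list_lengths) <= delta

# Two keywords with illustrative list lengths of 900K and 50K postings.
print(id_intersection_cost([900_000, 50_000]))  # 952000
```

Under this reading, long lists dominate the cost, and the seek term matters mostly for queries over short lists.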
7
Processing Strategies
Execution Strategies
ID-intersection
– Retrieves the inverted lists of all queried keywords and intersects them
– Requires |Q| seek accesses to disk and reads the lists in their entirety
Post-filtering
– Used when a keyword w_i in Q is very rare
– Reads the text of the documents containing w_i via its inverted list, then verifies the remaining keyword constraints against that text
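The two strategies can be contrasted on the toy collection with a short sketch. It assumes a non-empty query, in-memory documents, and whitespace tokenization; choosing the keyword with the shortest list as the "rare" one is also an assumption, since the slide does not define the rarity test.

```python
docs = ["it is what it is", "what is it", "it is a banana"]
index = {"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}}

def id_intersection(query):
    """Read the full posting list of every keyword and intersect them."""
    postings = [index.get(w, set()) for w in query]
    return sorted(set.intersection(*postings))

def post_filtering(query):
    """Scan only the rarest keyword's list, then check the rest against the text."""
    rare = min(query, key=lambda w: len(index.get(w, ())))  # assumed rarity test
    others = [w for w in query if w != rare]
    hits = []
    for doc_id in sorted(index.get(rare, ())):
        words = set(docs[doc_id].split())
        if all(w in words for w in others):
            hits.append(doc_id)
    return hits

print(id_intersection(["banana", "is"]), post_filtering(["banana", "is"]))
```

Both functions return the same documents; they differ in what they read. Post-filtering touches one short list plus a few documents, while ID-intersection reads every list in full.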
8
Index Structure
Materialize inverted indexes for combinations of frequent keywords, but only for a small fraction of these combinations
For each vocabulary item w, keep a list of all keyword combinations containing w for which the corresponding inverted index has been materialized
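A rough sketch of such a structure follows. How the combinations are selected is not described on this slide, so the combinations passed in below are hand-picked for illustration; `materialize`, `materialized`, and `match_list` are hypothetical names, and alphabetical order stands in for the global ordering.

```python
from collections import defaultdict

index = {"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}}

def materialize(combinations):
    """Precompute intersections for the given combinations and index them by word."""
    materialized = {}               # canonical combination -> document ids
    match_list = defaultdict(list)  # word -> combinations that contain it
    for combo in combinations:
        key = tuple(sorted(set(combo)))  # assumption: alphabetical global ordering
        materialized[key] = set.intersection(*(index[w] for w in key))
        for word in key:
            match_list[word].append(key)
    return materialized, dict(match_list)

# Hand-picked combinations of the frequent toy keywords.
materialized, match_list = materialize([("it", "is"), ("what", "is")])
print(match_list["is"])
```

Looking up a query word in `match_list` shows directly which precomputed intersections are available for it.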
9
Query Processing
Query Q = {w_1, …, w_l}
– Q contains a rare keyword : post-filtering strategy
– Otherwise : retrieve all match-list entries
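The dispatch described here might look like the sketch below. Several parts are assumptions: the rarity threshold, the hand-built `materialized` and `match_list` tables, and the rule of using every materialized combination contained in the query and covering leftover words with their single-word lists. The slide says only that the match-list entries are retrieved, not how they are combined.

```python
docs = ["it is what it is", "what is it", "it is a banana"]
index = {"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}}
materialized = {("is", "it"): {0, 1, 2}}  # hypothetical precomputed intersection
match_list = {"is": [("is", "it")], "it": [("is", "it")]}
RARE_THRESHOLD = 2  # hypothetical: a word in fewer than 2 documents is "rare"

def process(query):
    """Answer a non-empty query; returns (strategy name, matching document ids)."""
    words = set(query)
    rare = [w for w in words if len(index.get(w, ())) < RARE_THRESHOLD]
    if rare:
        # Post-filtering: scan one rare list, verify the other words in the text.
        candidates = index.get(rare[0], set())
        hits = [d for d in sorted(candidates) if words <= set(docs[d].split())]
        return "post-filtering", hits
    # Otherwise use materialized combinations fully contained in the query.
    entries = {c for w in words for c in match_list.get(w, []) if set(c) <= words}
    lists = [materialized[c] for c in entries]
    covered = {w for c in entries for w in c}
    lists += [index[w] for w in words - covered]  # single lists for the rest
    return "match-list", sorted(set.intersection(*lists))

print(process(["what", "is", "it"]))  # ('match-list', [0, 1])
print(process(["banana", "is"]))      # ('post-filtering', [2])
```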
10
EXPERIMENTS
Evaluation of Query Cost
– Materialized the index structure for the 10K most frequent words
– k_max = 4, Cost_Seek = 1000
– Δ : cost of scanning 20% of the number of postings
– Speed-ups: 18x (2 keywords), 14x (4 keywords)
Evaluation of Index Sizes
– 899M postings
– No additional indexes for keywords occurring in fewer than 50 documents
– 141K keywords for indexing
– Multi-keyword index structures contained 734M postings
Accuracy of Intersection-size Estimation
– Match list covers 99.3%
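The index-size figures can be related back to the space budget in the problem statement with a line of arithmetic. The sketch assumes that the 899M postings are the original inverted index and the 734M postings are the additional multi-keyword structures, which is how the slide reads but is not stated outright.

```python
base_postings = 899_000_000   # original inverted index (as read from the slide)
extra_postings = 734_000_000  # additional multi-keyword index structures

ratio = extra_postings / base_postings
print(f"additional indexes = {ratio:.2f} x original index size")  # 0.82
```

Under that reading, the extra structures stay below the size of the original index, meaning the factor k is less than 1 in this experiment.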