Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany
ACM SIGIR ’05

An Initial Example…
TREC Robust Track ’04, hard query no. 363 (Aquaint news corpus): “transportation tunnel disasters”

Query terms (weight 1.0 each) and their expansions:
  transportation: transit, highway, train, truck, metro, “rail car”, car, … (similarities shown: 0.9, 0.8, 0.7, 0.6, 0.5, 0.1)
  tunnel: tube, underground, “Mont Blanc”, … (similarities shown: 0.9, 0.8, 0.7)
  disasters: catastrophe, accident, fire, flood, earthquake, “land slide”, … (similarities shown: 1.0, 0.9, 0.7, 0.6, 0.5)
  Example documents: d1, d2

- Expansion terms from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.
- Term similarities, e.g., Rocchio, Robertson & Sparck-Jones, concept similarities, or other correlation measures
- Increased retrieval robustness
  - Count only the best match per document and expansion set
- Increased efficiency
  - Top-k-style query evaluations
  - Open scans on new terms only on demand
  - No threshold tuning

Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

Computational Model
- Vector space model with a Cartesian product space D1×…×Dm and a data set D ⊆ D1×…×Dm ⊆ ℝ^m
- Precomputed local scores s(ti,d) ∈ Di for all d ∈ D
  - e.g., tf*idf variations, probabilistic models (Okapi BM25), etc.
  - typically normalized to s(ti,d) ∈ [0,1]
- Monotonic score aggregation aggr: (D1×…×Dm) → ℝ+
  - e.g., sum, max, product (using sum over log sij), cosine (using L2 norm)
- Partial-match queries (a.k.a. “andish”)
  - Non-conjunctive query evaluations
  - Weak local matches can be compensated
- Access model
  - Disk-resident inverted index over large text corpus
  - Inverted lists sorted by decreasing local scores
  - Inexpensive sequential accesses to per-term lists: “getNextItem()”
  - More expensive random accesses: “getItemBy(docid)”
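
The access model above can be pictured with a small in-memory stand-in for one inverted list. This is only a sketch: the class `InvertedList`, its Python method names, the helper `aggregate_sum`, and the sample postings are assumptions for illustration; the slides only name the two access primitives “getNextItem()” and “getItemBy(docid)”.

```python
from typing import Dict, List, Optional, Tuple

Posting = Tuple[str, float]  # (doc id, local score s(ti, d) in [0, 1])


class InvertedList:
    """Hypothetical in-memory stand-in for one disk-resident inverted list."""

    def __init__(self, postings: List[Posting]) -> None:
        # Lists are kept sorted by decreasing local score.
        self._postings = sorted(postings, key=lambda p: -p[1])
        self._by_doc: Dict[str, float] = dict(postings)
        self._pos = 0

    def get_next_item(self) -> Optional[Posting]:
        """Sequential access: next posting in score order, None at the end."""
        if self._pos >= len(self._postings):
            return None
        item = self._postings[self._pos]
        self._pos += 1
        return item

    def get_item_by(self, doc_id: str) -> float:
        """Random access: local score of doc_id, 0.0 if the term is absent."""
        return self._by_doc.get(doc_id, 0.0)


def aggregate_sum(lists: List[InvertedList], doc_id: str) -> float:
    """Monotonic sum aggregation; missing terms contribute 0 (partial match)."""
    return sum(lst.get_item_by(doc_id) for lst in lists)
```

With sum aggregation a document that misses one query term still receives a score, which is the "andish" behaviour: a strong match on one term can compensate for a weak or missing match on another.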

No-Random-Access (NRA) Algorithm
[Fagin et al., PODS ’01; Balke et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]

Corpus: d1,…,dn, where each document carries local scores, e.g., s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query q = (transportation, tunnel, disaster), k = 1

NRA(q,L):
  scan all lists Li (i = 1..m) in parallel  // e.g., round-robin
    <d, s(ti,d)> = Li.getNextItem()
    E(d) = E(d) ∪ {i}
    highi = s(ti,d)
    worstscore(d) = ∑ν∈E(d) s(tν,d)
    bestscore(d) = worstscore(d) + ∑ν∉E(d) highν
    if worstscore(d) > min-k then
      add d to top-k
      min-k = min{ worstscore(d’) | d’ ∈ top-k }
    else if bestscore(d) > min-k then
      candidates = candidates ∪ {d}
    if max{ bestscore(d’) | d’ ∈ candidates } ≤ min-k then return top-k

Inverted index (lists sorted by decreasing local score):
  transport: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
  tunnel:    d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
  disaster:  d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Scan depth 1:
  Rank  Doc#  Worst-score  Best-score
  1     d78   0.9          2.4
  2     d64   0.8          2.4
  3     d10   0.7          2.4

Scan depth 2:
  Rank  Doc#  Worst-score  Best-score
  1     d78   1.4          2.0
  2     d23   1.4          1.9
  3     d64   0.8          2.1
  4     d10   0.7          2.1

Scan depth 3 (STOP!):
  Rank  Doc#  Worst-score  Best-score
  1     d10   2.1          2.1
  2     d78   1.4          2.0
  3     d23   1.4          1.8
  4     d64   1.2          2.0

Naive Join-then-Sort: between O(mn) and O(mn²) runtime
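
The NRA loop can be traced on the three example lists with a few lines of Python. This is a minimal sketch, not the authors' implementation: it re-evaluates all bounds after each round-robin pass instead of after each single access, it adds an explicit check that no unseen document can still enter the top-k, and it replaces the second d10 entry of the tunnel list with a hypothetical d12 (the scan stops before that entry is read).

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (doc id, local score)

TRANSPORT = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
TUNNEL = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)]  # d12 assumed
DISASTER = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]


def nra(lists: List[List[Posting]], k: int) -> Tuple[List[Tuple[str, float]], int]:
    """Round-robin NRA; returns (top-k as (doc, worstscore), scan depth)."""
    m = len(lists)
    high = [1.0] * m                         # bound on unseen scores per list
    seen: Dict[str, Dict[int, float]] = {}   # doc -> {list index: local score}
    worst: Dict[str, float] = {}
    top_k: List[str] = []
    depth = 0
    max_depth = max(len(lst) for lst in lists)

    while depth < max_depth:
        # One round of sorted accesses: one getNextItem() per list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0                # list exhausted
        depth += 1

        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {
            d: worst[d] + sum(high[i] for i in range(m) if i not in s)
            for d, s in seen.items()
        }
        ranked = sorted(seen, key=lambda d: (-worst[d], d))
        top_k = ranked[:k]
        if len(top_k) < k:
            continue
        min_k = worst[top_k[-1]]
        candidates = [d for d in ranked[k:] if best[d] > min_k]
        # Stop when neither a candidate nor an unseen document can beat min-k.
        if not candidates and sum(high) <= min_k:
            break

    return [(d, worst[d]) for d in top_k], depth


top, depth = nra([TRANSPORT, TUNNEL, DISASTER], k=1)
print(top, depth)
```

On this data the scan ends after depth 3 with d10 as the single top result, in line with the worst-score and best-score values of the example.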

Dynamic & Self-tuning Query Expansions
top-k (transport, tunnel, ~disaster)
- Incrementally merge inverted lists Li1…Lim’ in descending order of local scores
- Dynamically add lists into set of active expansions exp(ti)
- Only touch short prefixes of each list, don’t need to open all lists
- Best match score aggregation for combined term similarities and local scores

[Figure: top-k operator over the index lists for transport and tunnel and the virtual list ~disaster, which an incremental merge produces from the lists for disaster, accident, fire, …]

- Increased retrieval robustness & fewer topic drifts
- Increased efficiency through fewer active expansions
- No threshold tuning of term similarities in the expansions
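
Best-match aggregation can be written as a small function: for each query term, only the highest value of sim(ti, tij) · s(tij, d) over its expansion set counts. The sketch below assumes sum aggregation across query terms and uses made-up similarities and local scores; `best_match_score` and the dictionary layout are hypothetical names, not taken from the slides.

```python
from typing import Dict

# expansions[query_term][expansion_term] = sim(query_term, expansion_term)
Expansions = Dict[str, Dict[str, float]]

# Illustrative values only.
EXPANSIONS: Expansions = {
    "transport": {"transport": 1.0},
    "tunnel": {"tunnel": 1.0},
    "~disaster": {"disaster": 1.0, "accident": 0.9, "fire": 0.7},
}


def best_match_score(expansions: Expansions, local_scores: Dict[str, float]) -> float:
    """Sum over query terms of the best sim * local score within each expansion set."""
    total = 0.0
    for exp in expansions.values():
        total += max(
            (sim * local_scores.get(term, 0.0) for term, sim in exp.items()),
            default=0.0,
        )
    return total


# A document matching "transport", "accident" and "fire", but not "tunnel".
doc = {"transport": 0.5, "accident": 0.8, "fire": 0.9}
print(round(best_match_score(EXPANSIONS, doc), 2))
```

Because only the best match per expansion set counts, a document that mentions many loosely related expansion terms does not accumulate score from all of them, which is what limits topic drift.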

Incremental Merge Operator
Inputs:
- Expansion terms ~t = {t1, t2, t3}: relevance feedback, thesaurus lookups, …
- Expansion similarities: correlation measures, large corpus statistics, …
- Initial high-scores: index list metadata (e.g., histograms)

Index lists with expansion similarities and local scores:
  t1, sim(t, t1) = 1.0: d78 0.9, d23 0.8, d10 0.8, d1 0.4, d88 0.3, …
  t2, sim(t, t2) = 0.9: d64 0.8, d23 0.8, d10 0.7, d12 0.2, d78 0.1, …
  t3, sim(t, t3) = 0.5: d11 0.9, d78 0.9, d64 …, d99 0.7, d34 0.6, …
Combined scores are sim(t, tj) · s(tj, d), e.g., 0.72 and 0.18 for t2, 0.45 and 0.35 for t3.

Incremental Merge is iteratively triggered by the top-k operator through sequential accesses “getNextItem()”.

Merged virtual list ~t:
  d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, d88 0.3, …
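
One way to sketch the merge is a priority queue keyed by the combined score sim(t, tj) · s(tj, d). The code below is an illustration under assumptions rather than the operator from the paper: `incremental_merge` is a hypothetical name, the lists are in-memory prefixes of the example (t3 without its d64 entry), the first posting's score stands in for the initial high-score from index metadata, and the reader peeks one posting ahead to key the next heap entry.

```python
import heapq
from itertools import islice
from typing import Iterator, List, Set, Tuple

Posting = Tuple[str, float]                   # (doc id, local score)
Expansion = Tuple[str, float, List[Posting]]  # (term, sim(t, term), sorted list)

EXPANSIONS: List[Expansion] = [
    ("t1", 1.0, [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]),
    ("t2", 0.9, [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]),
    ("t3", 0.5, [("d11", 0.9), ("d78", 0.9), ("d99", 0.7), ("d34", 0.6)]),
]


def incremental_merge(
    expansions: List[Expansion], opened: Set[str]
) -> Iterator[Tuple[str, float, str]]:
    """Yield (doc, sim * local score, source term) in descending combined score."""
    # Heap entries: (-combined score of the next posting, list index, position).
    # At position 0 this value plays the role of sim * initial high-score, so a
    # list counts as opened only once its first entry is actually popped.
    heap = [
        (-sim * postings[0][1], idx, 0)
        for idx, (_, sim, postings) in enumerate(expansions)
        if postings
    ]
    heapq.heapify(heap)
    while heap:
        neg_score, idx, pos = heapq.heappop(heap)
        term, sim, postings = expansions[idx]
        opened.add(term)
        doc, _ = postings[pos]
        yield doc, -neg_score, term
        if pos + 1 < len(postings):
            heapq.heappush(heap, (-sim * postings[pos + 1][1], idx, pos + 1))


opened: Set[str] = set()
for doc, score, term in islice(incremental_merge(EXPANSIONS, opened), 3):
    print(doc, round(score, 2), term)
print(sorted(opened))
```

Because the generator is driven by the consumer, pulling only three items leaves t2 and t3 untouched; they are opened only when the merged stream has dropped to their combined high-scores. Note that the same document can appear several times in the merged stream (d23 via t1 and t2), so the consuming top-k operator still has to keep only the best match per document.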

Probabilistic Candidate Pruning [Theobald, Schenkel & Weikum, VLDB ’04]
- For each physically stored index list Li
  - Treat each s(ti,d) ∈ [0,1] as a random variable Si
  - Approximate the local score distribution using an equi-width histogram with n buckets
- For a virtual index list ~Li = Li1…Lim’
  - Consider the max-distribution (feature independence)
  - Alternatively, construct a meta histogram for the active expansions
- For all d in the candidate queue
  - Consider the convolution over local score distributions to predict aggregated scores
  - Drop d from the candidate queue if the predicted probability that d can still reach the top-k does not exceed ε
- Return current top-k list if candidate queue is empty!
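
The histogram arithmetic behind this pruning step can be sketched as follows. The exact formulas are not preserved in this transcript, so the drop test used here, P[worstscore(d) + Σ Si > min-k] ≤ ε over the lists in which d has not been seen yet, is an assumed form; the function names, the midpoint discretization, and the sample histograms are illustrative choices as well.

```python
from typing import List

Histogram = List[float]  # n equi-width buckets over [0, 1], probabilities sum to 1


def convolve(p: Histogram, q: Histogram) -> List[float]:
    """Distribution of the sum of two independent, bucketized scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def max_distribution(hists: List[Histogram]) -> Histogram:
    """Histogram of max(S_1..S_r) under independence: the CDFs multiply."""
    n = len(hists[0])
    cdf_prod = [1.0] * n
    for h in hists:
        running = 0.0
        for b in range(n):
            running += h[b]
            cdf_prod[b] *= running
    return [cdf_prod[0]] + [cdf_prod[b] - cdf_prod[b - 1] for b in range(1, n)]


def exceed_probability(hists: List[Histogram], delta: float) -> float:
    """P[S_1 + ... + S_r > delta], each bucket represented by its midpoint."""
    if not hists:
        return 1.0 if delta < 0.0 else 0.0
    n = len(hists[0])
    dist = list(hists[0])
    for h in hists[1:]:
        dist = convolve(dist, h)
    r = len(hists)
    # Index j of the convolved vector is the sum of r bucket indices.
    return sum(p for j, p in enumerate(dist) if (j + 0.5 * r) / n > delta)


def should_drop(worstscore: float, min_k: float,
                remaining: List[Histogram], eps: float) -> bool:
    """Assumed pruning rule: drop d if it beats min-k with probability <= eps."""
    return exceed_probability(remaining, min_k - worstscore) <= eps


uniform = [0.1] * 10
skewed = [0.9, 0.1] + [0.0] * 8   # most scores in this list are very low
print(should_drop(0.3, 1.0, [skewed], eps=0.1))
print(should_drop(0.3, 1.0, [uniform], eps=0.1))
```

For a virtual list, `max_distribution` over the histograms of the active expansions would supply the histogram that enters the convolution; scaling each expansion's scores by its similarity beforehand is left out of this sketch.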

Incremental Merge for Multidimensional Phrases
q = {undersea “fiber optic cable”}
- Nested Top-k operator iteratively prefetches & joins candidate items for each subquery condition via “getNextItem()”
- Propagates candidates in descending order of bestscore(d) values to provide monotonic upper score bounds
- Provides [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
- Top-level top-k operator performs phrase tests only for the most promising items (random access)
  (Expensive predicates & minimal probes [Chang & Hwang, SIGMOD ’02])
- Single threshold condition for algorithm termination (candidate pruning at the top-level queue only)

Phrase expansions:
  sim(“fiber optic cable”, “fiber optic cable”) = 1.0
  sim(“fiber optic cable”, “fiber optics”) = 0.8

[Figure: operator tree. The top-level Top-k operator combines the index list for undersea with an Incr.Merge over two Nested Top-k operators, one over the lists for fiber, optic, cable and one over the lists for fiber, optics; phrase tests use random accesses to a term-to-position index.]
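
The phrase test itself is the expensive, random-access part, which is why it is deferred to the most promising candidates. A minimal sketch of such a test is shown below; the layout of the term-to-position index (term → doc id → token offsets), the function name `phrase_match`, and the sample positions are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical term-to-position index: term -> doc id -> token offsets.
PositionIndex = Dict[str, Dict[str, List[int]]]

INDEX: PositionIndex = {
    "fiber": {"d78": [3, 10], "d1": [0]},
    "optic": {"d78": [11], "d1": [5]},
    "cable": {"d78": [12]},
}


def phrase_match(index: PositionIndex, doc_id: str, phrase: List[str]) -> bool:
    """True if the phrase terms occur at consecutive positions in doc_id."""
    if not phrase:
        return False
    starts = index.get(phrase[0], {}).get(doc_id, [])
    rest = [set(index.get(term, {}).get(doc_id, [])) for term in phrase[1:]]
    return any(
        all(start + offset + 1 in positions for offset, positions in enumerate(rest))
        for start in starts
    )


print(phrase_match(INDEX, "d78", ["fiber", "optic", "cable"]))
print(phrase_match(INDEX, "d1", ["fiber", "optic", "cable"]))
```

In the nested setting, a candidate would first be scored from the per-term lists alone, and a check like this would run only once its bestscore makes it a serious contender for the top-k.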

Experiments – Aquaint with Fixed Expansions
- Aquaint corpus of English news articles (528,155 docs)
- 50 “hard” queries from TREC 2004 Robust track
- WordNet expansions using a simple form of WSD
- Okapi-BM25 model for local scores, Dice coefficients as term similarities
- Fixed expansion technique (synonyms + first-order hyponyms)

Method                #SA          #RA       CPU sec   max(KB)   P@10    MAP@1000   relPrec
Title-only Baseline (avg(m) = 2.5, max(m) = 4)
  Join&Sort           2,305,637    -         -         -         -       -          -
  NRA-Baseline        1,439,815    -         9.4       432       0.252   0.092      1.000
Static Expansions (avg(m) = 35, max(m) = 118)
  Join&Sort           20,582,764   -         -         -         -       -          -
  NRA+Phrases, ε=0.0  18,258,834   210,531   245.0     37,355    0.286   0.105      1.000
  NRA+Phrases, ε=0.1  3,622,686    49,783    79.6      5,895     0.238   0.086      0.541
Dynamic Expansions (avg(m) = 35, max(m) = 118)
  Incr.Merge, ε=0.0   7,908,666    53,050    159.1     17,393    0.310   0.118      1.000
  Incr.Merge, ε=0.1   5,908,017    48,622    79.4      13,424    0.298   0.110      0.786

Experiments – Aquaint with Fixed Expansions, cont’d
[Figure: probabilistic pruning performance, Incremental Merge vs. top-k with static expansions]
- Epsilon controls pruning aggressiveness, 0 ≤ ε ≤ 1

Conclusions & Ongoing Work
- Increased efficiency
  - Incremental Merge vs. Join-then-Sort & top-k using static expansions
  - Very good precision/runtime ratio for probabilistic pruning
- Increased retrieval robustness
  - Largely avoids topic drifts
  - Modeling of fine-grained semantic similarities (Incremental Merge & Nested Top-k operators)
- Scalability (see paper)
  - Large expansions (m < 876 terms per query) on Aquaint
  - Expansions for Terabyte collection (~25,000,000 docs)
- Efficient support for XML-IR (INEX Benchmark)
  - Inverted lists for combined tag-term pairs, e.g., sec=mining
  - Efficiently supports child-or-descendant axis, e.g., //article//sec//=mining
  - Vague content & structure queries (VCAS), e.g., //article//~sec=~mining
  - TopX-Engine, VLDB ’05

Thank you!