TopX: Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data
PhD Defense, May 16, 2006
Martin Theobald
Max Planck Institute for Informatics
An XML-IR Scenario (INEX IEEE)

Example NEXI query:
  //article[.//bib[about(.//item, "W3C")]]
    //sec[about(.//, "XML retrieval")]
      //par[about(.//, "native XML databases")]

[Figure: two example IEEE articles as XML trees (sec, par, title, bib, item, url elements with text contents such as "Native XML data base systems can store schemaless data ..."), matched against the query; the annotations highlight the three core challenges: RANKING, VAGUENESS, and PRUNING]
Outline
- Data & relevance scoring model
- Database schema & indexing
- TopX query processing
- Index access scheduling & probabilistic candidate pruning
- Dynamic query relaxation & expansion
- Experiments & conclusions
Data Model

Example document:

  <article>
    <title>XML Data Management</title>
    <abs>XML management systems vary widely in their expressive power.</abs>
    <sec>
      <title>Native XML Data Bases.</title>
      <par>Native XML data base systems can store schemaless data.</par>
    </sec>
  </article>

- XML tree model with pre/postorder labels for all tags and merged tag-term pairs (XPath Accelerator [Grust, SIGMOD '02])
- Redundant full-content text nodes: each element is indexed with the concatenated, stemmed text of its entire subtree, e.g., for article_1: "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data"
- Full-content term frequencies ftf(t_i, e), e.g., ftf("xml", article_1) = 4
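As a sketch of the labeling scheme (not the TopX code; names are illustrative), the following assigns pre/postorder ranks in one traversal and uses them for the constant-time ancestor-descendant test that the XPath Accelerator relies on:

```python
class Elem:
    """A toy XML element; text nodes are omitted for brevity."""
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, list(children)

def assign_pre_post(root):
    """Assign (pre, post) ranks in one depth-first traversal."""
    labels, pre, post = {}, [0], [0]
    def visit(node):
        pre[0] += 1
        my_pre = pre[0]
        for child in node.children:
            visit(child)
        post[0] += 1
        labels[node] = (my_pre, post[0])
    visit(root)
    return labels

def is_descendant(d, a):
    """d is a descendant of a iff pre(a) < pre(d) and post(d) < post(a)."""
    return a[0] < d[0] and d[1] < a[1]

# The example document from above:
par = Elem("par")
sec = Elem("sec", [Elem("title"), par])
article = Elem("article", [Elem("title"), Elem("abs"), sec])
labels = assign_pre_post(article)
assert is_descendant(labels[par], labels[article])
```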
Full-Content Scoring Model

Element-specific statistics (INEX IEEE):

  tag     | N         | avg. length
  article | 12,223    | 2,903
  sec     | 96,709    | 413
  par     | 1,024,907 | 32
  fig     | 109,230   | 13

  BM25 parameters: k1 = 10.5, b = 0.75

- Basic scoring idea within the IR-style family of TF*IDF ranking functions, e.g., scoring bib["transactions"] differently from par["transactions"]
- Extended Okapi BM25 probabilistic model for XML with element-specific parameterization [VLDB '05 & INEX '05]
- Additional static score mass c for relaxable structural conditions and non-conjunctive ("andish") XPath evaluations
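The element-specific BM25 variant has the usual Okapi shape, instantiated per element type; as a sketch (the exact formula and parameter choices are in the VLDB '05 paper), with the statistics of the table above taken per tag A:

\[
\mathit{score}(t, e) \;=\; \frac{(k_1 + 1)\,\mathit{ftf}(t, e)}{K_A + \mathit{ftf}(t, e)} \cdot \log \frac{N_A - \mathit{ef}_A(t) + 0.5}{\mathit{ef}_A(t) + 0.5},
\qquad
K_A = k_1 \left( (1 - b) + b\,\frac{\mathit{length}(e)}{\mathit{avglength}(A)} \right)
\]

where A is the tag of element e, N_A is the number of A-elements, and ef_A(t) is the number of A-elements containing term t in their full content.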
Inverted Block-Index for Content & Structure

[Figure: inverted lists for the tag-term keys sec["xml"], title["native"], and par["retrieval"]; each entry carries (eid, docid, score, pre, post, max-score) and supports sorted access (SA) via sequential block scans as well as random access (RA)]

- Combined inverted index over merged tag-term pairs (on redundant element full-contents)
- Sequential block scans:
  - Group elements per list in descending order of (max-score, docid)
  - Block-scan all elements of a doc for a given (tag, term) key
- Stored as inverted files or database tables (two B+-tree indexes over the full range of attributes)
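A minimal sketch of the block-scan layout, with hypothetical postings loosely modeled on the figure's sec["xml"] list: the list is kept in descending (max-score, docid) order, so one sorted access delivers the whole element block of a document:

```python
from itertools import groupby

# Hypothetical posting list for one (tag, term) key, e.g. sec["xml"]:
# entries are (docid, eid, score, pre, post, max_score), already sorted
# in descending (max_score, docid) order; max_score is the best element
# score within the document's block.
postings = [
    (2, 46, 0.90, 15, 9, 0.90),
    (2, 10, 0.50, 10, 8, 0.90),
    (5, 171, 0.85, 1, 20, 0.85),
    (3, 84, 0.10, 12, 4, 0.10),   # pre/post partly invented for the sketch
]

def sorted_access(postings):
    """One sequential scan, yielding a whole element block per document."""
    for docid, block in groupby(postings, key=lambda p: p[0]):
        yield docid, list(block)

for docid, block in sorted_access(postings):
    print(docid, block)
```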
Navigational Index

[Figure: the element directory for tag sec (entries (eid, docid, pre, post), static score mass C = 1.0) next to the content lists for title["native"] and par["retrieval"], accessed via RA]

- Additional element directory for resolving structural conditions
- Random accesses on a B+-tree index using (docid, tag) as key
- Carefully scheduled probes
- Schema-oblivious indexing & querying:
  - Non-schematic, heterogeneous data sources (no DTD required)
  - Supports the full NEXI syntax
  - Supports all 13 XPath axes (+ level)
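A sketch of such a probe, with a hypothetical in-memory stand-in for the B+-tree and illustrative values: a random access fetches all elements of a given tag within one document and tests containment against a candidate element via the pre/post labels:

```python
# Hypothetical stand-in for the B+-tree keyed by (docid, tag):
# each entry maps to the document's elements of that tag as (eid, pre, post).
nav_index = {
    (2, "sec"): [(46, 15, 9)],
    (2, "par"): [(28, 8, 14)],
}

def probe(docid, tag):
    """RA: fetch all tag-elements of one document."""
    return nav_index.get((docid, tag), [])

def has_ancestor(docid, cand_pre, cand_post, tag):
    """Does some tag-element of the document contain the candidate?"""
    return any(pre < cand_pre and cand_post < post
               for _, pre, post in probe(docid, tag))

# e.g.: does element (pre=8, post=14) of doc 2 have a sec ancestor?
print(has_ancestor(2, 8, 14, "sec"))
```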
TopX Query Processor

- Adapts the Threshold Algorithm (TA) paradigm [Fagin et al., PODS '01]:
  - Focus on inexpensive SAs & postpone expensive RAs (NRA & CA)
  - Keep an intermediate top-k & enqueue partially evaluated candidates
  - Lower/upper score guarantees for each candidate d, remembering the set of evaluated query dimensions E(d):
      worstscore(d) = Σ_{i ∈ E(d)} score(t_i, e_d)
      bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i
  - Early min-k threshold termination: return the current top-k iff bestscore(d) ≤ min-k for every remaining candidate d, where min-k is the worstscore of the current rank-k result
- TopX core engine [VLDB '04]:
  - SA batching & efficient queue management
  - Multi-threaded SA & query processing
  - Probabilistic cost model for RA scheduling
  - Probabilistic candidate pruning for approximate top-k results
- XML engine [VLDB '05]:
  - Efficiently deals with uncertainty in structure & content ("andish XPath")
  - Controlled amount of RAs (unique among current XML top-k engines)
  - Dynamically switches between document & element granularity
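The NRA backbone can be sketched as follows; this is a single-threaded toy version without TopX's SA batching, queue management, or RA scheduling, just the round-robin scans, the [worstscore, bestscore] bookkeeping, and the min-k stopping test:

```python
import heapq

def nra(lists, k):
    """lists: one iterator of (docid, score) per query dimension, each in
    descending score order. Returns the top-k (docid, worstscore) pairs."""
    high = [1.0] * len(lists)            # last score seen per list
    seen = {}                            # docid -> {dimension: score}
    active = [iter(l) for l in lists]
    exhausted = [False] * len(lists)
    while True:
        # one round-robin batch of sorted accesses
        for i, it in enumerate(active):
            if exhausted[i]:
                continue
            try:
                docid, score = next(it)
                high[i] = score
                seen.setdefault(docid, {})[i] = score
            except StopIteration:
                exhausted[i], high[i] = True, 0.0
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(h for i, h in enumerate(high)
                                  if i not in seen[d]) for d in seen}
        topk = heapq.nlargest(k, worst, key=worst.get)
        min_k = worst[topk[-1]] if len(topk) == k else 0.0
        # min-k threshold: no outside candidate can still enter the top-k
        if all(exhausted) or (len(topk) == k and
                all(best[d] <= min_k for d in seen if d not in topk)):
            return [(d, worst[d]) for d in topk]
```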
TopX Query Processing by Example (NRA)

[Figure: animated run of the NRA algorithm on the lists sec["xml"], title["native"], and par["retrieval"]. Round-robin sorted accesses tighten each candidate document's [worstscore, bestscore] interval in the candidate queue (doc1, doc2, doc3, doc5, doc17, plus a pseudo-doc); the min-2 threshold rises (0.0 → 0.5 → 0.9 → 1.0 → 1.6) until the top-2 results, doc2 with worstscore 2.2 and doc5 with worstscore 1.6, can no longer be overtaken]
"Andish" XPath over Element Blocks

[Figure: element blocks of one document for the conditions item="w3c", sec="xml retrieve", and par="native database", with their pre/post intervals and scores; getParentScore()/getSubtreeScore() RAs and the static score masses C (e.g., 1.0 vs. 0.2) propagate partial scores into worstscore(d)]

- Incremental & non-conjunctive XPath evaluation using:
  - Hash joins on the content conditions
  - Staircase joins [Grust, VLDB '03] on the structure
- Tight & accurate [worstscore(d), bestscore(d)] bounds for early pruning (ensuring monotonic updates)
- Virtual support elements for navigation
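As a flavor of the structural join, here is a minimal staircase-join-style pass. It assumes interval-style (begin, end) labels, which are equivalent to pre/post ranks for containment tests, and a context already pruned to a "staircase", i.e., no context node is an ancestor of another:

```python
def staircase_descendants(context, candidates):
    """Compute the descendant axis in one sequential pass.
    context, candidates: lists of (begin, end) interval labels in
    ascending begin order; context is assumed ancestor-free."""
    pairs, j = [], 0
    for a_begin, a_end in context:
        while j < len(candidates) and candidates[j][0] <= a_begin:
            j += 1                          # skip nodes before this subtree
        while j < len(candidates) and candidates[j][0] < a_end:
            pairs.append(((a_begin, a_end), candidates[j]))
            j += 1                          # consume nodes inside the subtree
    return pairs

# e.g. two disjoint sec subtrees and three par candidates:
print(staircase_descendants([(10, 20), (30, 40)],
                            [(12, 13), (25, 26), (31, 32)]))
# -> [((10, 20), (12, 13)), ((30, 40), (31, 32))]
```

Because the context subtrees are disjoint and both inputs are in document order, every candidate is inspected at most once, which is the point of the staircase join.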
Random Access Scheduling – Minimal Probing

[Figure: a partially evaluated candidate with resolved content conditions (item="w3c", sec="xml retrieve", par="native database") and still-unresolved structural conditions (article, bib, sec), each worth static score mass c if confirmed via RA]

- MinProbe: schedule RAs only for the most promising candidates
- Extends "Expensive Predicates & Minimal Probing" [Chang & Hwang, SIGMOD '02]
- Schedule a batch of RAs on d only iff worstscore(d) + r_d * c > min-k, where
  - worstscore(d) is d's evaluated content- & structure-related score
  - r_d * c is the unresolved, static structural score mass (r_d open structural conditions, each worth c)
  - min-k is the rank-k worstscore
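In code, the test is a one-liner (names are illustrative): only candidates whose unresolved structural score mass could still lift them past the current rank-k result earn their random accesses:

```python
def schedule_ra(worstscore_d, r_d, c, min_k):
    """MinProbe: probe d only if its r_d unresolved structural
    conditions (worth c each) could push it above min-k."""
    return worstscore_d + r_d * c > min_k

# e.g. worstscore 0.49, two open conditions at c = 0.2, min-k = 0.75:
# 0.49 + 0.4 > 0.75, so the RA batch is scheduled
assert schedule_ra(0.49, 2, 0.2, 0.75)
```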
Cost-Based Scheduling (CA) – Ben Probing

- Goal: minimize the overall execution cost #SA + (c_R/c_S) * #RA
- Access costs on d are wasted if d does not make it into the final top-k (considering both structural selectivities & content scores)
- Probabilistic cost model comparing different types of Expected Wasted Costs (EWC):
  - EWC-RA_s(d): looking up d in the remaining structure
  - EWC-RA_c(d): looking up d in the remaining content
  - EWC-SA(d): not seeing d in the next batch of b SAs
- BenProbe: schedule a batch of RAs on d iff #EWC-RA_{s|c}(d) * c_R/c_S < #EWC-SA
- Bounds the ratio between #RA and #SA
- Schedule RAs late & last, in ascending order of EWC-RA_{s|c}(d)
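The decision rule itself is compact; the hard part, stubbed out below as placeholder functions, is estimating the expected wasted costs from structural selectivities and score distributions. This sketch only mirrors the inequality and the probe ordering from the slide:

```python
def ben_probe(candidates, ewc_ra, ewc_sa, c_ratio):
    """Schedule RA batches for candidates whose expected wasted RA cost,
    weighted by c_ratio = c_R/c_S, stays below their expected wasted SA
    cost; probe the cheapest candidates first (ascending EWC-RA)."""
    due = [d for d in candidates if ewc_ra(d) * c_ratio < ewc_sa(d)]
    return sorted(due, key=ewc_ra)
```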
Selectivity Estimator [VLDB '05]

Example query: //sec[//figure="java"][//par="xml"][//bib="vldb"]

- Split the query into a set of basic, characteristic XML patterns (twigs, paths & tag-term pairs) with estimated selectivities:
    //sec[//figure]//par   p1 = 0.682
    //sec[//figure]//bib   p2 = 0.001
    //sec[//par]//bib      p3 = 0.002
    //sec//figure          p4 = 0.688
    //sec//par             p5 = 0.968
    //sec//bib             p6 = 0.002
    //bib="vldb"           p7 = 0.023
    //par="xml"            p8 = 0.067
    //figure="java"        p9 = 0.011
- Consider the structural selectivities of the unresolved & non-redundant patterns Y:
  - conjunctive: P_S[d satisfies all structural conditions Y]
  - "andish": P_S[d satisfies a subset Y' of the structural conditions Y]
- Consider binary correlations between structural patterns and/or tag-term pairs (data sampling, query logs, etc.)
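Under an independence assumption across the patterns (the actual estimator refines this with the binary correlations mentioned above), the two probabilities reduce to products of the per-pattern selectivities p_i; as a sketch:

\[
P_S[\,d \text{ satisfies all of } Y\,] \;\approx\; \prod_{i \in Y} p_i,
\qquad
P_S[\,d \text{ satisfies exactly } Y' \subseteq Y\,] \;\approx\; \prod_{i \in Y'} p_i \prod_{j \in Y \setminus Y'} (1 - p_j)
\]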
Score Predictor [VLDB '04]

[Figure: score histograms f_1, f_2 of the lists title["native"] and par["retrieval"] up to their current high_1, high_2 bounds, obtained by sampling and convolved into the distribution of the remaining score mass δ(d)]

- Consider the score distributions of the content-related inverted lists to estimate P_C[d gets into the final top-k]
- Convolutions of score histograms (assuming independence among lists)
- Probabilistic candidate pruning: drop d from the candidate queue iff P_C[d gets into the final top-k] < ε (with probabilistic guarantees for relative precision & recall)
- Closed-form convolutions, e.g., truncated Poisson
- Moment-generating functions & Chernoff-Hoeffding bounds
- Combined score predictor & selectivity estimator
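A histogram version of the predictor can be sketched as follows (the engine also supports closed-form convolutions and Chernoff-Hoeffding tail bounds; equi-width bins and independence among lists are assumed here): the convolved distribution of δ(d) yields P[worstscore(d) + δ(d) > min-k], and d is pruned when that probability drops below ε:

```python
import numpy as np

def convolve(hists):
    """hists: per unseen list, an array with hists[i][b] = P[score in bin b].
    Returns the distribution of the sum delta(d) over those lists."""
    dist = np.array([1.0])              # point mass at 0
    for h in hists:
        dist = np.convolve(dist, h)     # distribution of the running sum
    return dist

def p_enters_topk(worstscore, min_k, hists, bin_width):
    """Estimate P[worstscore + delta(d) > min-k]."""
    delta = convolve(hists)
    first = max(0, int(np.ceil((min_k - worstscore) / bin_width)))
    return float(delta[first:].sum())
```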
Dynamic and Self-Tuning Query Expansion [SIGIR '05]

Example: top-k query (transport, tunnel, ~disaster) for TREC Robust topic no. 363, where ~disaster expands into {disaster, accident, fire, ...} and the expansion lists are merged on the fly

- Incrementally merge the inverted lists for a set of active expansions exp(t_1)..exp(t_m) in descending order of scores s(t_i, d)
- Max-score aggregation for fending off topic drift
- Dynamically extend the set of active expansions only when beneficial for finding the final top-k results
- Specialized expansion operators:
  - Incremental Merge operator
  - Nested Top-k operator (phrase matching)
- Supports text, structured records & XML
- Boolean (but ranked) retrieval mode
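A minimal, in-memory sketch of the Incremental Merge operator: the expansion lists stay sorted by score, heapq.merge interleaves them lazily in descending score order, and taking a document's first (i.e., highest) hit implements the max-score aggregation:

```python
import heapq

def incremental_merge(expansion_lists):
    """expansion_lists: per expansion term, a list of (score, docid) in
    descending score order. Yields (docid, score), best first, where a
    document's score is the max over its expansions."""
    merged = heapq.merge(*expansion_lists, key=lambda p: -p[0])
    emitted = set()
    for score, docid in merged:
        if docid not in emitted:        # first hit = max score for docid
            emitted.add(docid)
            yield docid, score

# e.g. hypothetical expansion lists for ~disaster:
disaster = [(0.9, "d42"), (0.7, "d11"), (0.3, "d92")]
accident = [(0.8, "d11"), (0.5, "d37")]
fire     = [(0.6, "d92"), (0.2, "d21")]
print(list(incremental_merge([disaster, accident, fire])))
# -> [('d42', 0.9), ('d11', 0.8), ('d92', 0.6), ('d37', 0.5), ('d21', 0.2)]
```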
Data Collections & Competitors

- INEX '04 Ad-hoc Track setting:
  - IEEE collection with 12,223 docs & 12M elements in 534 MB of XML data
  - 46 NEXI queries with official relevance judgments and a strict quantization, e.g., //article[.//bib="QBIC" and .//par="image retrieval"]
- TREC '04 Robust Track setting:
  - Aquaint news collection with 528,155 docs in 1,904 MB of text data
  - 50 "hard" queries from the TREC Robust Track '04 with official relevance judgments, e.g., "transportation tunnel disasters" or "Hubble telescope achievements"
- Competitors for the XML setup:
  - DBMS-style Join&Sort using full index scans on the TopX index (Holistic Twig Joins)
  - StructIndex [Kaushik et al., SIGMOD '04]: top-k with separate indexes for content & structure, DataGuide-like structural index, eager RAs (Fagin's TA)
  - StructIndex+: extent-chaining technique for DataGuide-based extent identifiers (skip scans on the content index)
INEX: TopX vs. Join&Sort & StructIndex 3.22 84,424 723,169 0.0 10 TopX – BenProbe 0.17 0.09 17.02 3,25,068 761,970 n/a StructIndex 12.01 9,122,318 Join&Sort 1.00 0.34 80.02 5,074,384 77,482 StructIndex+ 1.38 64,807 635,507 TopX – MinProbe 0.03 16.10 1,902,427 882,929 1,000 relPrec # SA CPU sec P@k MAP@k epsilon # RA k rel.Prec 46 NEXI Queries
INEX: TopX with Probabilistic Pruning 0.07 0.08 0.09 0.77 0.34 2.31 56,952 392,395 0.25 10 1.00 1.38 64,807 635,507 0.00 TopX - MinProbe 0.65 0.31 0.92 48,963 231,109 0.50 0.51 0.33 0.46 42,174 102,118 0.75 0.38 0.30 35,327 36,936 # SA CPU sec P@k MAP@k epsilon # RA k rel.Prec 46 NEXI Queries
TREC Robust: Dynamic vs. Static Query Expansion (50 keyword + phrase queries)

- Careful WordNet expansions using automatic word sense disambiguation & phrase detection [WebDB '03 & PKDD '05], with m < 118 expansion terms
- MinProbe RA scheduling for phrase matching (auxiliary term-offset table)
- Incremental Merge + Nested Top-k (m_top < 22) vs. static expansions (m_top < 118)
Conclusions

- Efficient and versatile TopX query processor:
  - Extensible framework for XML-IR & full-text search
  - Very good precision/runtime ratio for probabilistic candidate pruning
  - Self-tuning solution for robust query expansion & IR-style vague search
  - Combined SA and RA scheduling close to the lower bound for CA access cost [submitted for VLDB '06]
- Scalability:
  - Optimized for query-processing I/O
  - Exploits cheap disk space for redundant index structures (constant redundancy factor of 4-5 for INEX IEEE)
  - Extensive TREC Terabyte runs with 25,000,000 text documents (426 GB)
- INEX 2006:
  - New Wikipedia XML collection with 660,000 documents & 120,000,000 elements (~6 GB raw XML)
  - Official host for the Topic Development and Interactive Track (69 groups registered worldwide)
  - TopX WebService available (SOAP connector)
That’s it. Thank you!
TREC Terabyte: Comparison of Scheduling Strategies

Thanks to Holger Bast & Deb Majumdar!