An Efficient and Versatile Query Engine for TopX Search
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany


1 An Efficient and Versatile Query Engine for TopX Search. Martin Theobald, Ralf Schenkel, Gerhard Weikum. Max-Planck Institute for Informatics, Saarbrücken, Germany. VLDB '05

2 An XML-IR Scenario
Query (NEXI-style):
//article[ //sec[about(.//, "XML retrieval")] //par[about(.//, "native XML database")] ] //bib[about(.//item, "W3C")]
[Figure: two example article trees. The first contains section/paragraph content such as "Current Approaches to XML Data Management", "XML-QL: A Query Language for XML", "Native XML databases", "Native XML database systems can store schemaless data...", and a bib item "Proc. Query Languages Workshop, W3C, 1998". The second contains titles "The XML Files", "The Ontology Game", "The Dirty Little Secret", paragraph text "What does XML add for retrieval? It adds formal ways...", and a bib url w3c.org/xml. Both documents match the query only approximately, motivating ranked XML retrieval.]

3 TopX: Efficient XML-IR
Extend top-k query processing algorithms for sorted lists [Buckley '85; Güntzer, Balke & Kießling '00; Fagin '01] to XML data:
- Non-schematic, heterogeneous data sources.
- Combined inverted index for content & structure.
- Avoid full index scans; postpone expensive random accesses to large disk-resident data structures.
- Exploit cheap disk space for redundant indexing.
Goal: efficiently retrieve the best results of a similarity query.

4 XML-IR: History and Related Work
Timeline 1995-2005, from IR on structured data (SGML) to IR on XML.
Web query languages: W3QS (Technion Haifa), WebSQL (U Toronto), Lorel (Stanford U), Araneus (U Roma), XML-QL (AT&T Labs).
XML query languages: XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery (W3C), NEXI (INEX benchmark), TeXQuery (AT&T Labs), XPath & XQuery Full-Text (W3C).
IR on structured data: HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU), OED etc. (U Waterloo).
IR on XML: XIRQL (U Dortmund / Essen), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD).
Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...

5 Outline: Data & scoring model · Database schema & indexing · Top-k query processing for XML · Scheduling & probabilistic candidate pruning · Experiments & conclusions

6 Computational Model
Precomputed content scores score(t_i, e) ∈ ℝ:
- E.g., term/element frequencies, probabilistic models (Okapi BM25), etc.
- Typically normalized to score(t_i, e) ∈ [0,1].
Monotone score aggregation aggr: (D_1 × … × D_m) → ℝ+:
- E.g., sum, max, product (using log), cosine (using L2 norm).
Structural query conditions:
- Complex query DAGs.
- Aggregate a constant score c for each matched structural condition (edge).
Similarity queries (aka "andish"):
- Non-conjunctive query evaluation; weak content matches can be compensated; vague structural matches.
Access model:
- Disk-resident inverted index.
- Inexpensive sequential accesses (SA) to inverted lists: getNextItem().
- More expensive random accesses (RA): getItemById().
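The monotonicity requirement on the aggregation function is what makes threshold-style pruning sound: raising any per-condition score can never lower the aggregate. A minimal sketch of the three aggregations named on the slide, using hypothetical per-condition scores (not from the talk):

```python
import math

def aggr_sum(scores):
    # Monotone: increasing any input can only increase the sum.
    return sum(scores)

def aggr_max(scores):
    return max(scores)

def aggr_product(scores):
    # Product computed via summed logs (as the slide hints);
    # still monotone since log is an increasing function.
    return math.exp(sum(math.log(s) for s in scores))

# Hypothetical per-condition scores, normalized to [0, 1]:
scores = [0.9, 0.5, 0.8]
```

Monotone aggregation is what later lets the worstscore/bestscore bounds remain valid lower/upper bounds during candidate pruning.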

7 Data Model
Simplified XML model, disregarding IDRef & XLink/XPointer.
Redundant full contents: each element's full content is the concatenated text of its subtree (e.g., the article's full content "xml ir ir technique xml clustering xml evaluation" subsumes those of its title, abs, sec, and par descendants such as "ir technique xml" and "clustering xml evaluation").
Per-element term frequencies ftf(t_i, e) for full contents, e.g., ftf("xml", article_1) = 3.
Pre/postorder labels for each tag-term pair.
[Figure: example article tree with title "XML-IR", abstract "IR techniques for XML", and a section "Clustering on XML" with title and par children, annotated with pre/postorder labels and full-content strings.]
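Pre/postorder labels reduce the descendant test to a constant-time comparison, which is what the later in-memory structural joins rely on. A sketch with hypothetical labels (the slide's exact numbering is not fully recoverable):

```python
def is_descendant(anc, desc):
    """anc, desc: (pre, post) labels of two elements in the same document.
    desc lies below anc iff it is visited after anc in preorder but
    finishes before anc in postorder."""
    return anc[0] < desc[0] and desc[1] < anc[1]

# Hypothetical labels for an article tree: article > {title, sec > par}
article = (1, 6)
title   = (2, 1)
sec     = (3, 5)
par     = (4, 3)
```

The same comparison, restricted to one docid range, drives the structural joins over the inverted block-index described later in the talk.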

8 Full-Content Scoring Model
Full-content scores cast into an Okapi BM25 probabilistic model with element-specific model parameterization.
Basic scoring idea within the IR-style family of TF·IDF ranking functions.
Element statistics:
tag      | N         | avglength | k1   | b
article  | 16,850    | 2,903     | 10.5 | 0.75
sec      | 96,709    | 413       | 10.5 | 0.75
par      | 1,024,907 | 32        | 10.5 | 0.75
fig      | 109,230   | 13        | 10.5 | 0.75
Additional static score mass c for relaxable structural conditions.
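A sketch of an element-level Okapi BM25 scorer in the spirit of the slide, parameterized per tag with the statistics above; the exact TopX weighting may differ, and the tf/df values in the example call are assumptions:

```python
import math

def bm25_element_score(tf, df, N, length, avglength, k1=10.5, b=0.75):
    """Okapi BM25 score of one term in one element, with tag-specific
    element count N and average full-content length avglength."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avglength))
    return idf * tf_norm

# Per-tag statistics from the slide: sec has N = 96,709 and avglength = 413.
score = bm25_element_score(tf=3, df=1200, N=96709, length=500, avglength=413)
```

Element-specific avglength is the point of the table: a 500-term sec is only slightly longer than average, while a 500-term par would be heavily length-penalized.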

9 Outline: Data & scoring model · Database schema & indexing · Top-k query processing for XML · Scheduling & probabilistic candidate pruning · Experiments & conclusions

10 Inverted Block-Index for Content & Structure
Inverted index over tag-term pairs (full contents):
- Benefits from the increased selectivity of combined tag-term pairs.
- Accelerates the child-or-descendant axis, e.g., sec//"clustering".
Sequential block scans:
- Re-order elements in descending order of (maxscore, docid, score) per list.
- Fetch all tag-term pairs per document in one sequential block access.
- docid limits the range of in-memory structural joins.
Stored as inverted files or database tables (B+-tree indexes).
Example lists (eid, docid, score, pre, post, maxscore):
sec[clustering]: (46, 2, 0.9, 2, 15, 0.9), (9, 2, 0.5, 10, 8, 0.9), (171, 5, 0.85, 1, 20, 0.85), (84, 3, 0.1, 1, 12, 0.1)
title[xml]: (216, 17, 0.9, 2, 15, 0.9), (72, 3, 0.8, 10, 8, 0.8), (51, 2, 0.5, 4, 12, 0.5), (671, 31, 0.4, 12, 23, 0.4)
par[evaluation]: (3, 1, 1.0, 1, 21, 1.0), (28, 2, 0.8, 8, 14, 0.8), (182, 5, 0.75, 3, 7, -), (96, 4, 0.75, 6, 4, -)
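The (maxscore, docid, score) ordering can be sketched as follows: grouping by docid is what allows one sequential block access per document. This is a simplified in-memory sketch of the sort order, not the actual on-disk layout; entries are the sec[clustering] list from the slide, reduced to (eid, docid, score):

```python
from itertools import groupby

def block_order(entries):
    """entries: (eid, docid, score) tuples of one tag-term list.
    Returns per-document blocks, ordered by descending per-document
    maxscore, then docid, then descending score within a block."""
    maxscore = {}
    for eid, docid, score in entries:
        maxscore[docid] = max(maxscore.get(docid, 0.0), score)
    ordered = sorted(entries,
                     key=lambda e: (-maxscore[e[1]], e[1], -e[2]))
    # groupby needs consecutive keys, which the sort guarantees per docid
    return [(docid, list(block))
            for docid, block in groupby(ordered, key=lambda e: e[1])]

# sec[clustering] entries from the slide:
sec_clustering = [(46, 2, 0.9), (9, 2, 0.5), (171, 5, 0.85), (84, 3, 0.1)]
```

A scan thus sees document 2's two elements (46 and 9) back-to-back before moving on to documents 5 and 3.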

11 Navigational Index
Additional navigational index:
- Non-redundant element directory.
- Supports element paths and branching path queries.
- Random accesses using (docid, tag) as key.
- Schema-oblivious indexing & querying.
Example entries (eid, docid, pre, post):
sec: (46, 2, 2, 15), (9, 2, 10, 8), (171, 5, 1, 20), (84, 3, 1, 12)
title: (216, 17, 2, 15), (72, 3, 10, 8), (51, 2, 4, 12), (671, 31, 12, 23)
par: (3, 1, 1, 21), (28, 2, 8, 14), (182, 5, 3, 7), (96, 4, 6, 4)

12 Outline: Data & scoring model · Database schema & indexing · Top-k query processing for XML · Scheduling & probabilistic candidate pruning · Experiments & conclusions

13 TopX Query Processing
Adapt the Threshold Algorithm (TA) paradigm [Buckley & Lewit, SIGIR '85; Güntzer et al., VLDB '00; Fagin et al., PODS '01]:
- Focus on inexpensive sequential/sorted accesses; postpone expensive random accesses.
Candidate d = connected sub-pattern with element ids and scores:
- Incrementally evaluate path constraints using pre/postorder labels.
- In-memory structural joins (nested-loops, staircase, or holistic twig joins).
Upper/lower score guarantees per candidate; remember the set of evaluated dimensions E(d):
- worstscore(d) = Σ_{i ∈ E(d)} score(t_i, e)
- bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i
Early threshold termination with candidate queuing: stop if bestscore(d) ≤ min-k (the worstscore of the current rank-k result) for all remaining candidates.
Extensions [Theobald, Schenkel & Weikum, VLDB '04]:
- Batching of sorted accesses & efficient queue management.
- Cost model for random access scheduling.
- Probabilistic candidate pruning for approximate top-k results.
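The sorted-access phase with worstscore/bestscore bounds and min-k termination can be sketched roughly as follows; this is a strong simplification of the actual TopX engine (content conditions only, sum aggregation, no structural joins, no random accesses):

```python
import heapq

def topx_scan(index_lists, k):
    """index_lists: one list per query condition, each sorted by descending
    score, with entries (docid, score) in [0, 1]. Returns top-k docids."""
    m = len(index_lists)
    high = [1.0] * m          # last score seen per list (upper bound for unseen)
    pos = [0] * m
    seen = {}                 # docid -> {condition index: score}

    def worst(d):
        return sum(seen[d].values())

    def best(d):
        return worst(d) + sum(high[i] for i in range(m) if i not in seen[d])

    while any(pos[i] < len(index_lists[i]) for i in range(m)):
        # one round-robin batch of sorted accesses
        for i in range(m):
            if pos[i] < len(index_lists[i]):
                docid, score = index_lists[i][pos[i]]
                pos[i] += 1
                high[i] = score
                seen.setdefault(docid, {})[i] = score
        # current top-k by worstscore and the min-k threshold
        topk = heapq.nlargest(k, seen, key=worst)
        min_k = worst(topk[-1]) if len(topk) == k else 0.0
        # early termination: no queued candidate and no unseen document
        # can still beat the current rank-k worstscore
        others_done = all(best(d) <= min_k for d in seen if d not in topk)
        if others_done and sum(high) <= min_k:
            break
    return heapq.nlargest(k, seen, key=worst)
```

The full engine additionally interleaves structural joins and scheduled random accesses, but the stopping logic is the same: candidates whose bestscore falls below min-k can never enter the result.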

14 TopX Query Processing by Example
Query: sec[clustering], title[xml], par[evaluation]; top-2 results.
[Animation: round-robin sorted accesses over the three inverted lists; candidates in the queue carry [worstscore, bestscore] intervals that tighten as more dimensions are evaluated (e.g., the candidate around element 46 of doc 2 starts at worst=0.9/best=2.9 and ends with a final score of 2.2 over elements 46, 28, 51); the min-2 threshold rises from 0.0 to 0.5 to 0.9 to 1.6, pruning candidates whose bestscore drops below it, until the top-2 results are certain.]

15 Incremental Path Validations
Complex query DAGs: transitive closure of descendant constraints.
Aggregate additional static score mass c for a structural condition i if all edges rooted at i are satisfiable.
Incrementally test structural constraints:
- Quickly decrease bestscores for early pruning.
- Schedule random accesses in ascending order of structural selectivities.
Example query: //article[//sec//par//"xml java"] //bib//item//title//"security"
[Figure: query DAG over article, sec, par="xml", par="java", bib, item, title="security" with child-or-descendant edges; a promising candidate's bounds shrink from worst=1.5/best=6.5 to best=5.5 and then best=4.5 as random accesses resolve structural conditions, measured against min-k = 4.8.]

16 Outline: Data & scoring model · Database schema & indexing · Top-k query processing for XML · Scheduling & probabilistic candidate pruning · Experiments & conclusions

17 Random Access Scheduling: Minimal Probes (MinProbe)
Treat structural conditions as "soft filters" (expensive predicates & minimal probes [Chang & Hwang, SIGMOD '02]).
Schedule random accesses only for the most promising candidates:
- Schedule a batch of RAs on d if worstscore(d) + o_d · c > min-k,
where worstscore(d) is the evaluated content- & structure-related score, and o_d · c is the unevaluated structural score mass: the constant score mass c for each of the o_d still-untested structural conditions.
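The MinProbe test itself is a one-liner over the candidate queue; a sketch with a hypothetical queue (the field names are illustrative, not TopX's actual data structures):

```python
def minprobe_batch(candidates, min_k, c=1.0):
    """Select candidates worth a batch of random accesses: only those whose
    worstscore plus the constant score mass c for each of their o_d still
    untested structural conditions could beat the min-k threshold."""
    return [d for d in candidates
            if d["worstscore"] + d["open_conds"] * c > min_k]

# Hypothetical candidate queue:
queue = [
    {"id": 46, "worstscore": 1.7, "open_conds": 2},  # 1.7 + 2.0 > 2.5 -> probe
    {"id": 9,  "worstscore": 0.5, "open_conds": 1},  # 0.5 + 1.0 < 2.5 -> skip
]
```

Candidates failing the test are left in the queue: cheaper sorted accesses may still raise their worstscore later.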

18 Cost-Based Scheduling (BenProbe)
Analytic cost model. Basic idea:
- Compare expected random access costs to an optimal schedule.
- Access costs on d are wasted if d does not make it into the final top-k (considering both content & structure).
Compare different expected wasted costs (EWC):
- EWC-RA_s(d): looking up d in the structure.
- EWC-RA_c(d): looking up d in the content.
- EWC-SA(d): not seeing d in the next batch of b sorted accesses.
Schedule a batch of RAs on d if EWC-RA_{s|c}(d) · C_RA < EWC-SA(d) · C_SA, weighting by the relative costs of a random vs. a sequential access.

19 Structural Selectivity Estimator (for EWC-RA_s(d))
Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag-term pairs.
Consider structural selectivities (independence assumed):
- P[d satisfies all structural conditions Y] = Π_{j ∈ Y} p_j
- P[d satisfies exactly a subset Y' of the structural conditions Y] = Π_{j ∈ Y'} p_j · Π_{j ∈ Y\Y'} (1 − p_j)
Consider binary correlations between structural patterns and/or tag-term pairs (estimated from data sampling, query logs, etc.).
Example query //sec[//figure="java"][//par="xml"][//bib="vldb"] decomposes into the patterns //sec[//figure]//par, //sec[//figure]//bib, //sec[//par]//bib, //sec//figure, //sec//par, //sec//bib, //bib="vldb", //par="xml", //figure="java", with estimated selectivities p1 = 0.682, p2 = 0.001, p3 = 0.002, p4 = 0.688, p5 = 0.968, p6 = 0.002, p7 = 0.023, p8 = 0.067, p9 = 0.011.
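Under the independence assumption, these probabilities are plain products of per-pattern selectivities; the subset case multiplies in the complements of the unsatisfied conditions. A minimal sketch (the particular selectivity values are taken from the slide's example):

```python
def p_all(selectivities):
    """P[d satisfies all structural conditions], independence assumed:
    the product of the per-pattern selectivities."""
    p = 1.0
    for s in selectivities:
        p *= s
    return p

def p_exactly(satisfied, unsatisfied):
    """P[d satisfies exactly the subset Y' of conditions]: satisfied
    patterns contribute p_j, unsatisfied ones contribute (1 - p_j)."""
    return p_all(satisfied) * p_all([1.0 - s for s in unsatisfied])

# e.g., //sec//par (p5 = 0.968) and //sec//figure (p4 = 0.688) together:
p_both = p_all([0.968, 0.688])
```

The engine refines these estimates with the measured binary correlations where available, since structural patterns over the same document are rarely independent.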

20 Full-Content Score Predictor (for EWC-RA_c(d))
For each inverted list L_i (i.e., each tag-term pair): approximate the local score distribution S_i by an equi-width histogram.
Periodically test all d in the candidate queue; consider an aggregated score predictor: the convolution of the histograms S_1, S_2, ... of the still-unevaluated dimensions, evaluated against δ(d) = min-k − worstscore(d), with each S_i bounded by high_i.
P[d gets into the final top-k] = P[worstscore(d) + Σ_{i ∉ E(d)} S_i > min-k].
Probabilistic candidate pruning: drop d from the candidate queue if P[d gets into the final top-k] < ε.
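The convolution step can be sketched with small per-bucket probability vectors; the bucket granularity and values below are illustrative, not the engine's actual histograms:

```python
def convolve(h1, h2):
    """Distribution of the sum of two independent bucketized scores:
    h1, h2 hold per-bucket probabilities of equi-width histograms."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def p_still_reaches_topk(hist, delta_bucket):
    """P[aggregated unseen score >= delta(d) = min-k - worstscore(d)],
    with delta(d) discretized to a bucket index; the candidate is
    dropped when this probability falls below epsilon."""
    return sum(hist[delta_bucket:])
```

With more than two unevaluated dimensions, the convolution is applied repeatedly; the result feeds the ε-threshold pruning rule on the slide.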

21 Outline: Data & scoring model · Database schema & indexing · Top-k query processing for XML · Scheduling & probabilistic candidate pruning · Experiments & conclusions

22 Data Collections & Competitors
INEX '04 benchmark setting:
- 12,223 docs; 12M elements; 119M index entries; 534 MB.
- 46 queries with official relevance judgments, e.g., //article[.//bib="QBIC" and .//par="image retrieval"].
IMDB (Internet Movie Database):
- 386,529 docs; 34M elements; 130M index entries; 1,117 MB.
- 20 queries, e.g., //movie[.//casting[.//actor="John Wayne"] and .//role="Sheriff"]//[.//year="1959" and .//genre="Western"].
Competitors:
- DBMS-style Join&Sort, using full index scans on the TopX schema.
- StructIndex [Kaushik et al., SIGMOD '04]: top-k with separate inverted indexes for content & structure; DataGuide-like structural index; full evaluations, so no uncertainty about final document scores; no candidate queuing, eager random accesses.
- StructIndex+: extent-chaining technique for DataGuide-based extent identifiers (skip scans).

23 INEX Results

Method          | k     | eps | #SA       | #RA       | CPU sec | MAP@k | relPrec
TopX - BenProbe | 10    | 0.0 | 723,169   | 84,424    | 0.07    |       |
StructIndex     | 10    | n/a | 761,970   | 325,068   | 0.37    | 0.09  | 0.17
Join&Sort       | 10    | n/a | 9,122,318 | 0         |         | 0.26  |
StructIndex+    | 10    | n/a | 77,482    | 5,074,384 | 1.87    | 0.34  | 1.00
TopX - MinProbe | 10    | 0.0 | 635,507   | 64,807    | 0.03    |       |
TopX - BenProbe | 1,000 | 0.0 | 882,929   | 1,902,427 | 0.35    | 0.03  | 1.00

24 IMDB Results

Method          | k  | eps | #SA        | #RA     | CPU sec | relPrec
StructIndex     | 10 | n/a | 346,697    | 291,655 | 0.16    | n/a
Join&Sort       | 10 | n/a | 14,510,077 | 0       | 37.7    |
StructIndex+    | 10 | n/a | 22,445     | 301,647 | 0.17    | 1.00
TopX - MinProbe | 10 | 0.0 | 317,380    | 72,196  | 0.08    |
TopX - BenProbe | 10 | 0.0 | 241,471    | 50,016  | 0.06    |

25 INEX with Probabilistic Pruning (TopX - MinProbe, k = 10)

epsilon | #SA     | #RA    | CPU sec | MAP@k | relPrec
0.00    | 635,507 | 64,807 | 0.03    | 0.34  | 1.00
0.25    | 392,395 | 56,952 | 0.05    | 0.34  | 0.77
0.50    | 231,109 | 48,963 | 0.02    | 0.31  | 0.65
0.75    | 102,118 | 42,174 | 0.01    | 0.33  | 0.51
1.00    | 36,936  | 35,327 | 0.01    | 0.30  | 0.38

26 Conclusions & Ongoing Work
Efficient and versatile TopX query processor:
- Extensible framework for text, semi-structured & structured data.
- Probabilistic cost model for random access scheduling.
- Very good precision/runtime ratio for probabilistic candidate pruning.
Scalability:
- Optimized for runtime; exploits cheap disk space (factor 4-5 for INEX).
- Experiments on the TREC Terabyte text collection (see paper).
Support for typical IR extensions:
- Phrase matching, mandatory terms ("+"), negation ("-").
- Query weights (e.g., relevance feedback, ontological similarities).
Ongoing: dynamic and self-tuning query expansion [SIGIR '05]:
- Incrementally merges inverted lists on demand.
- Dynamically opens scans on additional expansion terms.
- Vague Content & Structure (VCAS) queries.

27 Thank you! Demo available!

