Max Planck Institute for Informatics

Presentation transcript:

TopX: Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data
PhD Defense, May 16th 2006
Martin Theobald, Max Planck Institute for Informatics
VLDB '05

An XML-IR Scenario (INEX IEEE)
Query: //article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]
[Figure: two example article trees (elements article, title, sec, par, bib, item, inproc, url) with sample text content such as “Native XML data base systems can store schemaless data ...”, “XML-QL: A Query Language for XML.” and “Proc. Query Languages Workshop, W3C, 1998.”; the figure is annotated with the three challenges RANKING, VAGUENESS and PRUNING.]

Outline
  Data & relevance scoring model
  Database schema & indexing
  TopX query processing
  Index access scheduling & probabilistic candidate pruning
  Dynamic query relaxation & expansion
  Experiments & conclusions

Data Model
  XML tree model
  Pre/postorder labels for all tags and merged tag-term pairs → XPath Accelerator [Grust, Sigmod ’02]
  Redundant full-content text nodes
  Full-content term frequencies ftf(ti, e), e.g., ftf(“xml”, article1) = 4

Example document:
  <article>
    <title>XML Data Management</title>
    <abs>XML management systems vary widely in their expressive power.</abs>
    <sec>
      <title>Native XML Data Bases.</title>
      <par>Native XML data base systems can store schemaless data.</par>
    </sec>
  </article>

Full contents per element (stemmed, stopwords removed):
  article: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
  title: “xml data manage”
  abs: “xml manage system vary wide expressive power”
  sec: “native xml data base native xml data base system store schemaless data”
  sec/title: “native xml data base”
  par: “native xml data base system store schemaless data”
[Figure: the element tree of the example document, with nodes numbered 1 to 6.]
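As a rough illustration of the data model, the following Python sketch assigns pre/postorder labels to the example document and counts full-content term frequencies. It is a simplification under stated assumptions: there is no stemming or stopword removal, and the helper names (`prepost_labels`, `full_content_tf`, `is_descendant`) are hypothetical, not part of TopX.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article>
<title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec>
<title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par>
</sec>
</article>"""

def prepost_labels(root):
    """Map every element to its (preorder, postorder) rank."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]
        for child in elem:
            visit(child)
        counters["post"] += 1
        labels[elem] = (pre, counters["post"])

    visit(root)
    return labels

def full_content_tf(elem):
    """Term frequencies over the concatenated text of elem and all its descendants."""
    text = " ".join(elem.itertext()).lower()
    return Counter(re.findall(r"[a-z]+", text))

def is_descendant(labels, anc, desc):
    """Pre/post containment test: desc starts after anc and finishes before it."""
    (pre_a, post_a), (pre_d, post_d) = labels[anc], labels[desc]
    return pre_a < pre_d and post_d < post_a

root = ET.fromstring(DOC)
labels = prepost_labels(root)
sec = root.find("sec")
par = sec.find("par")
print(full_content_tf(root)["xml"], full_content_tf(sec)["xml"], labels[root])
```

Because every element carries its full content redundantly, a term frequency for any (tag, term) pair can be read off without walking the subtree at query time; the pre/post labels then decide ancestor/descendant relationships with two integer comparisons.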

Full-Content Scoring Model
  Basic scoring idea within the IR-style family of TF*IDF ranking functions
    e.g., bib[“transactions”] vs. par[“transactions”]
  Extended Okapi-BM25 probabilistic model for XML with element-specific parameterization [VLDB ’05 & INEX ’05]
  Additional static score mass c for relaxable structural conditions and non-conjunctive (“andish”) XPath evaluations

Individual element statistics:
  tag      N          avg. length  k1    b
  article  12,223     2,903        10.5  0.75
  sec      96,709     413
  par      1,024,907  32
  fig      109,230    13
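The scoring formula itself is not preserved in this transcript. The sketch below shows a generic BM25-style score with per-tag statistics, only to illustrate what "element-specific parameterization" can look like. Assumptions: the textbook BM25 form is used, k1 and b are taken to be the same for all tags (the slide lists them once), and the element frequency `ef` and element length passed in are hypothetical inputs.

```python
import math

# Per-tag statistics from the slide (N = number of elements, avg_len = average length).
TAG_STATS = {
    "article": {"N": 12223, "avg_len": 2903},
    "sec": {"N": 96709, "avg_len": 413},
    "par": {"N": 1024907, "avg_len": 32},
    "fig": {"N": 109230, "avg_len": 13},
}
K1, B = 10.5, 0.75  # assumed to apply to every tag

def bm25_element_score(tag, ftf, ef, length, k1=K1, b=B):
    """BM25-style score of a term for one element with the given tag.

    ftf    -- full-content term frequency of the term in the element
    ef     -- number of elements with this tag that contain the term (hypothetical input)
    length -- length of the element's full content in terms
    """
    stats = TAG_STATS[tag]
    norm = k1 * ((1 - b) + b * length / stats["avg_len"])
    tf_part = (k1 + 1) * ftf / (norm + ftf)
    idf_part = math.log((stats["N"] - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# The same raw frequency scores differently under bib-like vs. par-like statistics;
# here: one occurrence in an average-length par vs. an average-length sec.
print(bm25_element_score("par", ftf=1, ef=1000, length=32))
print(bm25_element_score("sec", ftf=1, ef=1000, length=413))
```

Normalizing length and document frequency per tag is what lets a match in a short, rare element type be ranked against a match in a long, common one.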

Inverted Block-Index for Content & Structure
  Combined inverted index over merged tag-term pairs (on redundant element full-contents)
  Sequential block-scans
    Group elements in descending order of (maxscore, docid) per list
    Block-scan all elements per doc for a given (tag, term) key
  Stored as inverted files or database tables (two B+-tree indexes over full range of attributes)
[Figure: three inverted lists, sec[“xml”], title[“native”] and par[“retrieval”], each with columns eid, docid, score, pre, post, max-score; the lists are read by sorted access (SA) and probed by random access (RA).]
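A minimal sketch of the block ordering described above: all elements of one document form a block, and blocks are sorted by descending (maxscore, docid). The postings are loosely modeled on the sec[“xml”] list of the slide and should be read as illustrative values; `build_block_list` is a hypothetical helper name.

```python
from collections import defaultdict

def build_block_list(postings):
    """Group (eid, docid, score) postings of one (tag, term) key into per-document
    blocks, ordered by descending (maxscore, docid)."""
    by_doc = defaultdict(list)
    for eid, docid, score in postings:
        by_doc[docid].append((eid, score))
    blocks = []
    for docid, elems in by_doc.items():
        max_score = max(score for _, score in elems)
        blocks.append((max_score, docid, elems))
    blocks.sort(key=lambda block: (-block[0], -block[1]))
    return blocks

# Illustrative postings for sec["xml"]: (eid, docid, score)
SEC_XML = [(46, 2, 0.9), (9, 2, 0.5), (171, 5, 0.85), (84, 3, 0.1)]

for max_score, docid, elems in build_block_list(SEC_XML):
    print(docid, max_score, elems)
```

With this layout a single sequential scan delivers every matching element of a document at once, so element 9 (score 0.5) is read together with element 46 (score 0.9) even though document 5 has a higher-scoring element than 9.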

Navigational Index
  Additional element directory
    Random accesses on B+-tree index using (docid, tag) as key
    Carefully scheduled probes
  Schema-oblivious indexing & querying
    Non-schematic, heterogeneous data sources (no DTD required)
    Supports full NEXI syntax
    Supports all 13 XPath axes (+level)
[Figure: the content lists title[“native”] and par[“retrieval”] (columns eid, docid, score, pre, post, max-score) read by sorted access (SA), next to a structure-only list for sec (columns eid, docid, pre, post; C=1.0) probed by random access (RA).]

TopX Query Processor
  Adapt Threshold Algorithm (TA) paradigm [Fagin et al., PODS ‘01]
    Focus on inexpensive SA & postpone expensive RA (NRA & CA)
    Keep intermediate top-k & enqueue partially evaluated candidates
  Lower/upper score guarantees for each candidate d
    Remember set of evaluated query dimensions E(d)
    worstscore(d) = \sum_{i \in E(d)} score(t_i, e_d)
    bestscore(d) = worstscore(d) + \sum_{i \notin E(d)} high_i
  Early min-k threshold termination
    Return current top-k iff bestscore(d) \le min-k for every remaining candidate d, where min-k is the worstscore at rank k
  TopX core engine [VLDB ’04]
    SA batching & efficient queue management
    Multi-threaded SA & query processing
    Probabilistic cost model for RA scheduling
    Probabilistic candidate pruning for approximate top-k results
  XML engine [VLDB ’05]
    Efficiently deals with uncertainty in the structure & content (“andish XPath”)
    Controlled amount of RA (unique among current XML-top-k engines)
    Dynamically switch between document & element granularity

TopX Query Processing By Example (NRA)
[Figure: animated walkthrough of sorted accesses on the lists sec[“xml”], title[“native”] and par[“retrieval”]. After each scan step the slide shows the [worstscore, bestscore] interval of every candidate in the queue (doc1, doc2, doc3, doc5, doc17 and a pseudo-document for unseen candidates) and the current min-2 threshold, which takes the values 0.0, 0.5, 0.9, 1.0 and finally 1.6.]
Top-2 results: doc2 with worst=2.2 (elements 46, 28, 51) and doc5 with worst=1.6 (elements 171, 182); min-2=1.6
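The NRA-style bookkeeping in this example can be sketched as follows. This is a simplified, document-level version under stated assumptions: each list holds at most one (docid, score) entry per document, lists are scanned round-robin one entry at a time, and the input values are loosely taken from the slide (elements collapsed to their documents). It is not the TopX engine itself, which works on element blocks and batches of accesses.

```python
def nra_top_k(lists, k):
    """No-random-access top-k over score-sorted lists of (docid, score)."""
    m = len(lists)
    seen = {}            # docid -> {list index: score}
    high = [0.0] * m     # score at the current scan position of each list
    depth = 0
    max_depth = max(len(lst) for lst in lists)
    top, worst = [], {}
    while depth < max_depth:
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # exhausted list cannot contribute any more
        depth += 1
        # worstscore: sum over evaluated dimensions; bestscore: add high_i for the rest
        worst = {d: sum(dims.values()) for d, dims in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in dims)
                for d, dims in seen.items()}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        rivals = [best[d] for d in seen if d not in top]
        rivals.append(sum(high))  # best possible score of a still unseen document
        if max(rivals) <= min_k:
            break                 # nobody outside the top-k can still overtake
    return [(d, worst[d]) for d in top]

SEC_XML = [(2, 0.9), (5, 0.85), (3, 0.1)]
TITLE_NATIVE = [(17, 0.9), (3, 0.8), (2, 0.5), (31, 0.4)]
PAR_RETRIEVAL = [(1, 1.0), (2, 0.8), (5, 0.75)]

print(nra_top_k([SEC_XML, TITLE_NATIVE, PAR_RETRIEVAL], k=2))
```

On these inputs the sketch ends with doc2 (about 2.2) and doc5 (about 1.6), matching the top-2 of the slide; intermediate bounds such as best=2.55 for doc17 after the second round also agree with the figure.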

“Andish” XPath over Element Blocks
  Incremental & non-conjunctive XPath evaluations using
    Hash joins on the content conditions
    Staircase joins [Grust, VLDB ‘03] on the structure
  Tight & accurate [worstscore(d), bestscore(d)] bounds for early pruning (ensuring monotonic updates)
    → Virtual support elements for navigation
[Figure: query tree article / bib / sec evaluated over element blocks of one document. Content lists item=w3c, sec=xml retrieve and par=native database are read by SA and contain entries of the form score [pre, post], e.g., 0.49 [174, 324], 0.21 [169, 348], 0.16 [351, 389]; structural matches for article, bib and sec carry the static score C=1.0 or C=0.2, e.g., 1.0 [1, 419], 1.0 [398, 418], and are fetched by RA. The figure shows calls to getSubtreeScore() and getParentScore() and a candidate with worstscore(d) = 0.14.]

Random Access Scheduling – Minimal Probing
  MinProbe: Schedule RAs only for the most promising candidates
  Extending “Expensive Predicates & Minimal Probing” [Chang&Hwang, SIGMOD ‘02]
  Schedule batch of RAs on d only iff
    worstscore(d) + r_d c > min-k
    where worstscore(d) is the evaluated content & structure-related score, r_d c is the unresolved, static structural score mass, and min-k is the rank-k worstscore
[Figure: query tree article / bib / sec with structural matches 1.0 [1, 419], 1.0 [398, 418], 1.0 [169, 348] fetched by RA, and content lists item=w3c, sec=xml retrieve, par=native database read by SA.]
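The MinProbe condition translates into a one-line test. In the sketch below, `unresolved` stands for r_d (read here as the number of structural conditions of d that have not been checked yet) and `c` for the static score mass per condition; the function names and the candidate values are hypothetical.

```python
def min_probe(worstscore, unresolved, c, min_k):
    """True iff random accesses on the candidate are worth scheduling:
    even a candidate with all remaining structural conditions satisfied
    must be able to exceed the current rank-k worstscore."""
    return worstscore + unresolved * c > min_k

def candidates_to_probe(candidates, c, min_k):
    """Filter (docid, worstscore, unresolved) triples by the MinProbe test."""
    return [doc for doc, worst, r in candidates if min_probe(worst, r, c, min_k)]

# Hypothetical queue state: (docid, worstscore, number of unresolved conditions)
QUEUE = [("d1", 0.49, 2), ("d2", 0.49, 1), ("d3", 1.20, 1)]
print(candidates_to_probe(QUEUE, c=1.0, min_k=1.6))
```

A candidate such as d2 is never probed: even if its one open structural condition held, it would reach only 1.49 and stay below min-k.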

Cost-based Scheduling (CA) – Ben Probing
  Goal: Minimize overall execution cost #SA + (cR/cS) #RA
  Access costs on d are wasted if d does not make it into the final top-k (considering both structural selectivities & content scores)
  Probabilistic cost model comparing different types of Expected Wasted Costs
    EWC-RAs(d) of looking up d in the remaining structure
    EWC-RAc(d) of looking up d in the remaining content
    EWC-SA(d) of not seeing d in the next batch of b SAs
  BenProbe: Schedule batch of RAs on d iff
    #EWC-RAs|c(d) cR/cS < #EWC-SA
  Bounds the ratio between #RA and #SA
  Schedule RAs late & last
  Schedule RAs in asc. order of EWC-RAs|c(d)
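The cost measure and the BenProbe decision rule can be written down directly. The sketch below takes the expected wasted costs as given numbers, since the slide does not spell out how they are computed; the cost ratio and all example values are hypothetical.

```python
def execution_cost(num_sa, num_ra, cr_over_cs):
    """Overall cost in units of one sorted access: #SA + (cR/cS) * #RA."""
    return num_sa + cr_over_cs * num_ra

def ben_probe(ewc_ra, ewc_sa, cr_over_cs):
    """Schedule random accesses on a candidate iff their expected wasted cost,
    weighted by the RA/SA cost ratio, is below the expected wasted cost of
    continuing with sorted accesses."""
    return ewc_ra * cr_over_cs < ewc_sa

def schedule(candidates, cr_over_cs):
    """Return the candidates to probe, cheapest expected waste first.
    candidates: list of (docid, ewc_ra, ewc_sa) with hypothetical estimates."""
    chosen = [(ewc_ra, doc) for doc, ewc_ra, ewc_sa in candidates
              if ben_probe(ewc_ra, ewc_sa, cr_over_cs)]
    return [doc for _, doc in sorted(chosen)]

CANDIDATES = [("d1", 2.0, 150.0), ("d2", 4.0, 150.0), ("d3", 1.0, 150.0)]
print(execution_cost(num_sa=1000, num_ra=10, cr_over_cs=50))
print(schedule(CANDIDATES, cr_over_cs=50))
```

Sorting by ascending expected waste mirrors the last bullet of the slide: the probes least likely to be wasted are issued first.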

Selectivity Estimator [VLDB ’05]
  Example query: //sec[//figure=“java”][//par=“xml”][//bib=“vldb”]
  Split the query into a set of basic, characteristic XML patterns: twigs, paths & tag-term pairs
  Consider structural selectivities of unresolved & non-redundant patterns Y
    conjunctive: PS [d satisfies all structural conditions Y] = [formula not preserved in the transcript]
    “andish”: PS [d satisfies a subset Y’ of structural conditions Y] = [formula not preserved in the transcript]
  Consider binary correlations between structural patterns and/or tag-term pairs (data sampling, query logs, etc.)

Pattern selectivities (patterns and values paired in the order they appear on the slide):
  p1 = 0.682  //sec[//figure]//par
  p2 = 0.001  //sec[//figure]//bib
  p3 = 0.002  //sec[//par]//bib
  p4 = 0.688  //sec//figure
  p5 = 0.968  //sec//par
  p6 = 0.002  //sec//bib
  p7 = 0.023  //bib=“vldb”
  p8 = 0.067  //par=“xml”
  p9 = 0.011  //figure=“java”
[Figure: the query tree sec with children figure=“java”, par=“xml” and bib=“vldb”.]
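Since the formulas are missing from the transcript, the sketch below shows only the simplest possible estimator: multiplying path selectivities as if the patterns were independent. That independence is an assumption made here for illustration. The selectivity values are read off the slide, with patterns and p-values paired by position.

```python
from math import prod

def conjunctive_selectivity(selectivities):
    """Probability that a document satisfies all patterns, assuming independence."""
    return prod(selectivities)

P_SEC_FIGURE = 0.688   # //sec//figure
P_SEC_PAR = 0.968      # //sec//par
P_SEC_BIB = 0.002      # //sec//bib
P_TWIG_FIGURE_PAR = 0.682  # //sec[//figure]//par, listed on the slide

independent = conjunctive_selectivity([P_SEC_FIGURE, P_SEC_PAR])
print(independent, P_TWIG_FIGURE_PAR)
print(conjunctive_selectivity([P_SEC_FIGURE, P_SEC_PAR, P_SEC_BIB]))
```

The product of the two path selectivities (about 0.666) comes out slightly below the twig selectivity on the slide (0.682), a small example of why correlations between patterns are worth tracking.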

Score Predictor [VLDB ’04]
  Consider score distributions of the content-related inverted lists
    PC [d gets in the final top-k] = [formula not preserved in the transcript]
  Convolutions of score histograms (assuming independence)
    Closed-form convolutions, e.g., truncated Poisson
    Moment-generating functions & Chernoff-Hoeffding bounds
  Probabilistic candidate pruning: Drop d from the candidate queue iff PC [d gets in the final top-k] < ε (with probabilistic guarantees for relative precision & recall)
  Combined score predictor & selectivity estimator
[Figure: score densities f1 and f2 sampled from the lists title[“native”] and par[“retrieval”], truncated at the current scan positions high1 and high2, and their convolution evaluated at the score gap δ(d).]
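The histogram convolution behind the score predictor can be sketched in a few lines. Assumptions made here: each list's remaining scores are described by an equi-width histogram whose bin i stands for the score i * bin_width, the lists are independent, and δ(d) is the gap a candidate still has to close to beat min-k. The function names and example histograms are hypothetical.

```python
def convolve(h1, h2):
    """Distribution of the sum of two independent discrete score variables."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def prob_exceeds(histograms, delta, bin_width):
    """P[sum of the remaining scores > delta] under independence."""
    total = [1.0]
    for hist in histograms:
        total = convolve(total, hist)
    return sum(p for i, p in enumerate(total) if i * bin_width > delta)

def should_prune(histograms, delta, bin_width, epsilon):
    """Drop the candidate iff its chance of closing the gap is below epsilon."""
    return prob_exceeds(histograms, delta, bin_width) < epsilon

# Two unevaluated lists, each yielding score 0.0 or 0.5 with equal probability.
HISTS = [[0.5, 0.5], [0.5, 0.5]]
print(prob_exceeds(HISTS, delta=0.4, bin_width=0.5))
print(should_prune(HISTS, delta=0.9, bin_width=0.5, epsilon=0.3))
```

With ε = 0 the test never fires and the result stays exact; raising ε trades result quality for fewer accesses, which is the effect measured in the pruning experiments.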

Dynamic and Self-tuning Query Expansion [SIGIR ’05]
  Incrementally merge inverted lists for a set of active expansions exp(t1)..exp(tm) in descending order of scores s(ti, d)
  Max-score aggregation for fending off topic drifts
  Dynamically expand set of active expansions only when beneficial for finding the final top-k results
  Specialized expansion operators
    Incremental Merge operator
    Nested Top-k operator (phrase matching)
    Supports text, structured records & XML
    Boolean (but ranked) retrieval mode
[Figure: TREC Robust Topic no. 363, Top-k (transport, tunnel, ~disaster). The lists for transport (d66, d93, d95, ..., d101) and tunnel (d17, d11, d99, ...) are read by SA; the list for ~disaster is produced by an Incr. Merge operator over the lists of the expansions disaster, accident and fire.]
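The Incremental Merge operator can be approximated with a lazy k-way merge. In this sketch each expansion term has a similarity weight and a score-sorted posting list; weighted scores are merged in descending order and each document is emitted once, at its maximum weighted score. The weights, document ids and scores are hypothetical, and weighting by similarity is an assumption about how expansion scores are combined.

```python
import heapq

def weighted(sim, postings):
    """Lazily scale a score-sorted posting list by the expansion's similarity."""
    for doc, score in postings:
        yield (sim * score, doc)

def incremental_merge(expansions):
    """Yield (doc, score) in descending score order, one entry per document.

    expansions: {term: (similarity, [(doc, score), ...] sorted by descending score)}
    The first time a document shows up in the merged stream is its maximum
    weighted score, so later duplicates are skipped (max-score aggregation).
    """
    streams = [weighted(sim, postings) for sim, postings in expansions.values()]
    seen = set()
    for score, doc in heapq.merge(*streams, key=lambda item: item[0], reverse=True):
        if doc not in seen:
            seen.add(doc)
            yield doc, score

EXPANSIONS = {
    "disaster": (1.0, [("d42", 0.9), ("d11", 0.6)]),
    "accident": (0.8, [("d11", 0.9), ("d92", 0.5)]),
    "fire": (0.6, [("d37", 0.9), ("d42", 0.5)]),
}

for doc, score in incremental_merge(EXPANSIONS):
    print(doc, round(score, 2))
```

Because the merge is a generator, the surrounding top-k operator can pull only as many entries as it needs. Taking the maximum instead of the sum keeps a document that matches many loosely related expansions from outranking one strong match.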

Data Collections & Competitors
  INEX ‘04 Ad-hoc Track setting
    IEEE collection with 12,223 docs & 12M elements in 534 MB XML data
    46 NEXI queries with official relevance judgments and a strict quantization
    e.g., //article[.//bib=“QBIC” and .//par=“image retrieval”]
  TREC ‘04 Robust Track setting
    Aquaint news collection with 528,155 docs in 1,904 MB text data
    50 “hard” queries from TREC Robust Track ‘04 with official relevance judgments
    e.g., “transportation tunnel disasters” or “Hubble telescope achievements”
  Competitors for XML setup
    DBMS-style Join&Sort
      Using index full scans on the TopX index (Holistic Twig Joins)
    StructIndex [Kaushik et al, Sigmod ’04]
      Top-k with separate indexes for content & structure
      DataGuide-like structural index
      Eager RAs (Fagin’s TA)
    StructIndex+
      Extent chaining technique for DataGuide-based extent identifiers (skip scans on the content index)

INEX: TopX vs. Join&Sort & StructIndex (46 NEXI Queries)
  Method            k      epsilon  # SA       # RA       CPU sec
  Join&Sort         10     n/a      9,122,318             12.01
  StructIndex       10     n/a      761,970    3,25,068   17.02
  StructIndex+      10     n/a      77,482     5,074,384  80.02
  TopX – MinProbe   10     0.0      635,507    64,807     1.38
  TopX – BenProbe   10     0.0      723,169    84,424     3.22
  (label missing)   1,000  0.0      882,929    1,902,427  16.10
  Result quality at k = 10 (shared across rows): P@k 0.34, MAP@k 0.09, rel.Prec 1.00
  Result quality at k = 1,000: P@k 0.03, MAP@k 0.17

INEX: TopX with Probabilistic Pruning (46 NEXI Queries, TopX - MinProbe, k = 10)
  epsilon  # SA     # RA    CPU sec  P@k   rel.Prec
  0.00     635,507  64,807  1.38     0.34  1.00
  0.25     392,395  56,952  2.31     0.34  0.77
  0.50     231,109  48,963  0.92     0.31  0.65
  0.75     102,118  42,174  0.46     0.33  0.51
           36,936   35,327           0.30  0.38
  MAP@k values on the slide: 0.09, 0.08, 0.07 (row assignment not recoverable; the epsilon and CPU cells of the last row are missing)

TREC Robust: Dynamic vs. Static Query Expansion (50 Keyword + Phrase Queries)
  Careful WordNet expansions using automatic Word Sense Disambiguation & phrase detection [WebDB ’03 & PKDD ’05] with (m < 118)
  MinProbe RA scheduling for phrase matching (auxiliary term-offset table)
  Incremental Merge + Nested Top-k (mtop < 22) vs. Static Expansions (mtop < 118)

Conclusions
  Efficient and versatile TopX query processor
    Extensible framework for XML-IR & full-text search
    Very good precision/runtime ratio for probabilistic candidate pruning
    Self-tuning solution for robust query expansions & IR-style vague search
    Combined SA and RA scheduling close to lower bound for CA access cost [Submitted for VLDB ’06]
  Scalability
    Optimized for query processing IO
    Exploits cheap disk space for redundant index structures (constant redundancy factor of 4-5 for INEX IEEE)
    Extensive TREC Terabyte runs with 25,000,000 text documents (426 GB)
  INEX 2006
    New Wikipedia XML collection with 660,000 documents & 120,000,000 elements (~ 6 GB raw XML)
    Official host for the Topic Development and Interactive Track (69 groups registered worldwide)
    TopX WebService available (SOAP connector)

That’s it. Thank you!

TREC Terabyte: Comparison of Scheduling Strategies Thanks to Holger Bast & Deb Majumdar!