TopX 2.0 — A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Max-Planck Institute Ralf Schenkel Max-Planck Institute Mohammed.

Presentation transcript:

TopX 2.0 — A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Max-Planck Institute Ralf Schenkel Max-Planck Institute Mohammed AbuJarour Hasso-Plattner Institute

Example: ranked XPath retrieval (from the INEX '03-'05 IEEE collection)

Query:
//article[about(.//bib//item, "W3C")]
  //sec[about(.//, "XML retrieval")]
  //par[about(.//, "native XML databases")]

Challenges highlighted on the slide: RANKING, VAGUENESS, EARLY PRUNING

First example document (elements: article, sec, par, title, bib, item, inproc):
- title: "Current Approaches to XML Data Management"
- "Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …"
- "XML queries with an expressive power similar to that of Datalog …"
- "Native XML Data Bases."
- "Native XML data base systems can store schemaless data..."
- "XML-QL: A Query Language for XML."
- "Proc. Query Languages Workshop, W3C,1998."

Second example document (elements: article, sec, par, title, bib, item, url):
- title: "The XML Files"
- title: "The Ontology Game"
- title: "The Dirty Little Secret"
- "What does XML add for retrieval? It adds formal ways …"
- "Sophisticated technologies developed by smart people."
- "There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …"
- url: "w3c.org/xml"
- title: "XML"

TopX 1.0 architecture (diagram components)
- Frontends: Web Interface, Web Service API
- TopX 1.0 Query Processor
  - Non-conjunctive Top-k XPath Query Processing
  - Probabilistic Candidate Pruning
  - Probabilistic Index Access Scheduling
  - Dynamic Query Expansion
  - Top-k Queue, Candidate Queue, Scan Threads
- Sequential Access (SA) and Random Access (RA) to the index
- Expensive Predicates (RA): Path Conditions, Phrases & Proximity, Other Full-Text Op's
- Ontology / Large Thesaurus: WordNet, OpenCyc, etc.
- Index Metadata: Selectivities, Histograms, Correlations
- Indexer/Crawler
- Relational DBMS Backend with Unified Text & XML Schema, accessed via JDBC 2.0

Data Model
- XML trees (no XLink/ID/IDRef)
- Pre-/postorder ranges for the structural index
- Redundant full-content text nodes

Example document:
- article
  - title: "XML Data Management"
  - abs: "XML management systems vary widely in their expressive power."
  - sec
    - title: "Native XML Data Bases."
    - par: "Native XML data base systems can store schemaless data."

Full-content text per element (after stemming and stopword removal):
- article: "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data"
- title: "xml data manage"
- abs: "xml manage system vary wide expressive power"
- sec: "native xml data base native xml data base system store schemaless data"
- sec/title: "native xml data base"
- sec/par: "native xml data base system store schemaless data"

Full-content term frequencies:
- ftf("xml", article 1) = 4
- ftf("xml", sec 4) = 2
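The data model above can be illustrated with a small sketch. It assigns pre-/postorder ranks to the example tree and accumulates full-content term frequencies bottom-up. The `Element` class, `annotate`, and `is_descendant` are hypothetical names introduced for illustration, and the input text is assumed to be already stemmed; this is not TopX's actual indexer.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Element:
    tag: str
    text: str = ""                       # assumed: already stemmed, stopwords removed
    children: list = field(default_factory=list)
    pre: int = 0
    post: int = 0
    ftf: Counter = field(default_factory=Counter)


def annotate(root):
    """Assign pre-/postorder ranks and full-content term frequencies."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        node.ftf = Counter(node.text.split())
        for child in node.children:
            visit(child)
            node.ftf.update(child.ftf)   # full content of a parent includes its subtree
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)
    return root


def is_descendant(d, a):
    """d lies below a iff a is opened before d and closed after d."""
    return a.pre < d.pre and d.post < a.post


par = Element("par", "native xml data base system store schemaless data")
sec_title = Element("title", "native xml data base")
sec = Element("sec", children=[sec_title, par])
abs_ = Element("abs", "xml manage system vary wide expressive power")
title = Element("title", "xml data manage")
article = annotate(Element("article", children=[title, abs_, sec]))

print(article.ftf["xml"], sec.ftf["xml"], sec.pre)
```

With this numbering the `sec` element receives preorder rank 4, which matches the "sec 4" label on the slide, and the redundancy of the full-content model is visible in the way every term is counted once per ancestor.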

Scoring Model [INEX '05/'06/'07/'08]
- XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text)
- Formula components labeled on the slide: Content Index (Tag-Term Pairs), Element Freq., Element Statistics
- Scores are tag-specific: author["gates"] vs. section["gates"]

TopX 1.0: Relational Schema
- Precompute & materialize the scoring model into a combined inverted index over tag-term pairs
- Supports sorted access (SA, by descending MaxScore) and random access (RA, by DocID)
- Typically two B+trees in a DBMS

Sorted access (SA) for sec["xml"]:
    SELECT DocID, Pre, Post, Score
    FROM TagTermIndex
    WHERE tag = 'sec' AND term = 'xml'
    ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC

Random access (RA) for sec["xml"] in document 3:
    SELECT Pre, Post, Score
    FROM TagTermIndex
    WHERE DocID = 3 AND tag = 'sec' AND term = 'xml'
    ORDER BY Pre ASC, Post DESC

Top-k XPath over a Relational Schema [TopX, VLDB '05 & VLDB-J(1) '08]

Content-only (CO) & "structure enriched" queries:
//sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]

Index lists accessed: sec["xml"], title["native"], par["retrieval"]

- Sequentially scan each index list in descending order of MaxScore
- Hash-join element blocks by DocID in-memory
- Do "some" incremental XPath evaluation using Pre/Post indices
- Aggregate Score along connected path fragments
- Use a variant of Fagin's threshold algorithm for top-k-style early termination
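The early-termination idea can be sketched with the classic threshold algorithm (TA): scan all lists in parallel by descending score, fully score every newly seen document, and stop once the k-th best score reaches the sum of the scores at the current scan positions. This is the textbook TA with eager random accesses, not TopX's own variant; it works on document-level scores only, and the function name, index lists, and scores are made up for illustration. Missing entries count as 0, mimicking non-conjunctive evaluation.

```python
import heapq


def threshold_top_k(index_lists, k):
    """index_lists: dict name -> list of (doc_id, score), sorted by score descending."""
    lookup = {name: dict(entries) for name, entries in index_lists.items()}  # random access
    seen = set()
    top = []                       # min-heap of (aggregated score, doc_id)
    depth = 0
    longest = max(len(entries) for entries in index_lists.values())
    while depth < longest:
        threshold = 0.0
        for name, entries in index_lists.items():
            if depth >= len(entries):
                continue           # an exhausted list contributes 0 to the threshold
            doc_id, score = entries[depth]
            threshold += score
            if doc_id not in seen:
                seen.add(doc_id)
                total = sum(lookup[other].get(doc_id, 0.0) for other in index_lists)
                heapq.heappush(top, (total, doc_id))
                if len(top) > k:
                    heapq.heappop(top)
        depth += 1
        if len(top) == k and top[0][0] >= threshold:
            break                  # no unseen document can still enter the top-k
    return sorted(top, reverse=True), depth


# Hypothetical scores (dyadic fractions keep the float sums exact).
index_lists = {
    'sec["xml"]': [(1, 0.875), (2, 0.75), (3, 0.25), (4, 0.125)],
    'title["native"]': [(2, 0.75), (1, 0.5), (5, 0.25)],
    'par["retrieval"]': [(1, 0.75), (3, 0.5), (2, 0.375), (5, 0.125)],
}
result, depth = threshold_top_k(index_lists, k=2)
print(result, depth)
```

On this toy input the scan stops after two of at most four positions per list, which is the kind of saving that early termination is meant to deliver.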

Top-k XPath over a Relational Schema [TopX, VLDB '05 & VLDB-J(1) '08]

Content-and-structure (CAS) queries:
//article//sec[about(.//, "XML")]

- Sorted access (SA) to the content index list sec["xml"]
- Expensive predicate probes (RA) to the structure index (3rd B+tree), here for the tag article
- Non-conjunctive XPath evaluations
- Dynamically relax content- & structure-related query conditions (top-k results are entirely driven by score aggregations for content & structure conditions)

Random access (RA) to the structure index:
    SELECT Pre, Post
    FROM TagIndex
    WHERE DocID = 2123 AND Tag = 'article'
    ORDER BY Pre ASC, Post DESC

Relational Schema (ct'd)
- No shredding into a DTD-specific relational schema!
- No DTD at all for INEX Wikipedia!
- 20,810,942 distinct tag-term pairs (such as sec["xml"]) for the 4.38 GB Wikipedia collection
- 1,107 distinct tags (such as article)

TopX 1.0: Top-k XPath over a Relational Schema
- Two-dimensional source of redundancy
  - Full-content scoring model (redundancy factor ≈ avg. depth of a text node ≈ 6.7 for INEX-Wiki)
  - De-normalized relational schema, many redundant attributes
- High overhead in the architecture (Java -> JDBC -> DBMS & back)
- Element-block sizes are data-driven; the layout on disk is not easy to control
- Hashing is too slow compared to very efficient in-memory merge-joins

Index sizes:
- Content Index: ( ) bytes × 567,262,445 tag-term pairs ≈ 16 GB
- Structure Index: ( ) bytes × 52,561,559 tags ≈ 0.85 GB

TopX 2.0: Object-Oriented Storage

Binary file layout:
- B = element block separator
- L = index list separator
- Element block fields shown on the slide: DocID, MaxScore
- Offset index into the binary file: sec["xml"] -> 0, title["xml"] -> 122,564, …, par["xml"] -> 432,534

Storage size:
- Relational: ( ) × 567,262,445 ≈ 16 GB
- Object-oriented: 4 × 456,466,649 + (4+4+4) × 567,262,445 ≈ 8.6 GB (still uncompressed)
- Plus (4+4) × 20,810,942 = 166 MB for the offset index
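The size estimate for the object-oriented layout can be checked with a few lines of arithmetic. The counts are taken from the slides; reading the 4-byte term as one field per element block and the (4+4+4)-byte term as three fields per tag-term pair is an assumption about what the factors stand for.

```python
ELEMENT_BLOCKS = 456_466_649           # element blocks in the content index
TAG_TERM_PAIRS = 567_262_445           # tag-term pairs in the content index
DISTINCT_TAG_TERM_PAIRS = 20_810_942   # one offset-index entry per index list

# Assumed reading: 4 bytes per element block, 4+4+4 bytes per tag-term pair.
object_store_bytes = 4 * ELEMENT_BLOCKS + (4 + 4 + 4) * TAG_TERM_PAIRS
# Assumed reading: two 4-byte fields per offset-index entry.
offset_index_bytes = (4 + 4) * DISTINCT_TAG_TERM_PAIRS

print(f"{object_store_bytes:,} bytes = {object_store_bytes / 1e9:.1f} GB")
print(f"{offset_index_bytes:,} bytes = {offset_index_bytes / 1e6:.0f} MB")
```

Both figures use decimal units (1 GB = 10^9 bytes), which is the convention under which the slide's 8.6 GB and 166 MB come out.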

Object-Oriented Storage w/Block-Merging
- Group element blocks with similar MaxScore into document blocks of bounded length (e.g. < 256KB)
- Sort element blocks within each document block by DocID
- Supports
  - Sorted access (SA) by MaxScore
  - Merge-joins by DocID
  - Raw disk access

Layout (diagram): the offset index (sec["xml"] -> 0, title["xml"] -> 122,564) points into the binary file, where each index list ends with L and consists of document blocks (< 256KB) ordered by MaxScore; each document block holds element blocks separated by B and sorted by DocID.
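The grouping step can be sketched as follows: order element blocks by descending MaxScore, pack them greedily into document blocks of bounded size, and sort each document block by DocID. The function name, the tuple layout, and the use of an entry count instead of a byte limit are simplifying assumptions; the slides do not spell out the actual packing procedure.

```python
def build_document_blocks(element_blocks, max_entries):
    """element_blocks: list of (doc_id, max_score, entries).

    Packs element blocks in descending max_score order into document blocks of
    at most max_entries entries, then sorts each document block by doc_id.
    """
    ordered = sorted(element_blocks, key=lambda block: block[1], reverse=True)
    doc_blocks, current, size = [], [], 0
    for block in ordered:
        n = len(block[2])
        if current and size + n > max_entries:
            doc_blocks.append(sorted(current, key=lambda b: b[0]))
            current, size = [], 0
        current.append(block)
        size += n
    if current:
        doc_blocks.append(sorted(current, key=lambda b: b[0]))
    return doc_blocks


# Hypothetical element blocks; entries are (pre, post, score) triples.
element_blocks = [
    (5, 0.9, [(2, 7, 0.9), (9, 3, 0.4)]),
    (1, 0.8, [(4, 1, 0.8)]),
    (3, 0.5, [(6, 2, 0.5), (8, 5, 0.1)]),
    (6, 0.4, [(3, 3, 0.4)]),
    (2, 0.2, [(5, 4, 0.2)]),
]
doc_blocks = build_document_blocks(element_blocks, max_entries=3)
print([[block[0] for block in doc_block] for doc_block in doc_blocks])
```

Scanning the document blocks in order gives approximately score-ordered access, while the DocID order inside each block is what makes merge-joins possible.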

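Merging Document Blocks Incrementally
- Sequential access (SA) and efficient merge-joins on top of large document blocks

Example query:
//sec[about(.//, "XML")]//par[about(.//, "retrieval")]

Diagram: the index lists sec["xml"] and par["retrieval"] are read one document block at a time; element blocks with the same DocID are merge-joined, and the Max(MaxScore) of the not yet read blocks is tracked per list.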
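A merge-join over two DocID-sorted document blocks can be sketched as below. It is written as an outer merge so that documents matching only one condition survive as partial candidates, in line with the non-conjunctive evaluation described earlier; the function name, DocIDs, and payloads are hypothetical.

```python
def merge_by_doc_id(left, right):
    """Outer merge of two lists of (doc_id, payload), each sorted by ascending doc_id.

    Yields (doc_id, left_payload_or_None, right_payload_or_None).
    """
    i = j = 0
    while i < len(left) or j < len(right):
        l_doc = left[i][0] if i < len(left) else None
        r_doc = right[j][0] if j < len(right) else None
        if r_doc is None or (l_doc is not None and l_doc < r_doc):
            yield l_doc, left[i][1], None
            i += 1
        elif l_doc is None or r_doc < l_doc:
            yield r_doc, None, right[j][1]
            j += 1
        else:                      # same document in both blocks
            yield l_doc, left[i][1], right[j][1]
            i += 1
            j += 1


# Hypothetical document blocks for sec["xml"] and par["retrieval"].
sec_xml = [(3, "sec@3"), (5, "sec@5"), (6, "sec@6")]
par_retrieval = [(5, "par@5"), (6, "par@6"), (7, "par@7"), (9, "par@9")]
merged = list(merge_by_doc_id(sec_xml, par_retrieval))
print(merged)
```

Each input is read exactly once and no hash table is needed, which is the advantage over hash-joins that the slides point to.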

Compressed Number Encoding
- Multi-attribute (=4) double-nested block-index structure
  - Delta encoding only works for DocID (and to some extent for Pre)
  - No specific assumptions on distributions of Pre/Post or Score
- No Unary or Huffman coding (prefix-free but needs an additional coding table)
- Sophisticated compression schemes may be expensive to decode
  - No Zip, etc.; not even PFor-Delta (needs a second pass for each attribute type)
- But have known number ranges
  - DocID [1, 659,388] -> 3 bytes (254^3 = 16,387,064, lossless)
  - Pre/Post [1, 43,114] -> 2 bytes (254^2 = 64,516, lossless)
  - Score [0,1] -> rounded to 1 byte (256 buckets, lossy)
- Variable-length byte encoding w/leading length-indicator byte

Diagram: an entry consists of a length byte (Len) followed by Pre, Post, and Score; the two sizes shown are 4+1 = 5 bytes and 9+1 = 10 bytes.
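The 254^3 and 254^2 capacities suggest that each byte carries a base-254 digit. A minimal sketch of such a fixed-width encoding follows; the assumption that byte values 254 and 255 are kept free (for example for the B and L separators) is an inference from those numbers and is not stated on the slides. The helper names are hypothetical, and the lossy 1-byte score rounding is left out.

```python
BASE = 254  # assumption: byte values 254 and 255 stay reserved, e.g. for separators


def encode(value, width):
    """Encode a non-negative integer as `width` base-254 digits, most significant first."""
    if not 0 <= value < BASE ** width:
        raise ValueError(f"{value} does not fit into {width} base-{BASE} digits")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        digits.append(digit)
    return bytes(reversed(digits))


def decode(data):
    """Inverse of encode()."""
    value = 0
    for byte in data:
        value = value * BASE + byte
    return value


doc_id = encode(659_388, 3)     # largest DocID fits into 3 bytes
pre = encode(43_114, 2)         # largest Pre/Post value fits into 2 bytes
print(list(doc_id), decode(doc_id), list(pre), decode(pre))
```

Because no digit ever reaches 254, a decoder could scan for reserved byte values without mistaking payload bytes for them.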

Some more tricks…
- Dump leading histogram blocks into index list headers
  - Histograms only for index lists that exceed one document block (<5% of all lists)
  - Supports probabilistic pruning and cost-based index access scheduling [IO-Top-K, VLDB '06]
- Incrementally read & process precomputed memory images for fast top-k queries on top of large disk blocks

Layout (diagram): the index list for sec["xml"] starts with a histogram block (36 bytes, score vs. freq), followed by document blocks DB 1, DB 2, …, DB l (256 KB each), each holding element blocks EB 1, EB 2, …, EB k.

Block Access Scheduling [IO-Top-K, VLDB '06]
- SA Scheduling
  - Look-ahead Δi through precomputed score histograms
  - Knapsack-based optimization of Score Reduction
- RA Scheduling
  - 2-phase probing: schedule RAs "late & last", i.e., clean up the queue if …
  - Extended probabilistic cost model for integrated SA & RA scheduling

Diagram: Inverted Block-Index (256KB doc-blocks) with look-ahead values Δ1,3 = 0.8 and Δ3,3 = 0.2.

Object Storage Summary (from 4.38 GB Wikipedia XML sources)

Content Index (incl. histograms)
- 567,262,445 tag-term pairs
- 20,810,942 distinct tag-term pairs
- 20,815,884 document blocks (<256KB)
- 456,466,649 element blocks
- 3,729,714,594 total bytes (3.47GB), 6.57 bytes/tag-term pair on avg.

Structure Index
- 52,561,559 tags (elements)
- 1,107 distinct tags
- 2,323 document blocks (<256KB)
- 8,999,193 element blocks
- 205,021,938 total bytes (195MB), 3.9 bytes/tag on avg.

Efficiency Track Results – Focused, All
566/568 efficiency topics (CO & CAS)
[Table; columns: iP[0.0], iP[0.01], iP[0.05], iP[0.10], MAiP, AVG MS, SUM SEC; rows: three CO runs and three CAS runs]
All experiments: AMD Opteron quad-core 2.6 GHz, 16 GB RAM, RAID 5, Windows Server 2003

Efficiency Track Results – Focused, Type (A)
538/540 type (A) efficiency topics (CO & CAS)
[Table; columns: MAiP, AVG MS, SUM SEC; rows: three CO runs and three CAS runs]

Efficiency Track Results – Focused, Type (B)
/21 type (B) efficiency topics (CO & CAS)
[Table; columns: MAiP, AVG MS, SUM SEC; rows: three CO runs and three CAS runs]

Efficiency Track Results – Focused, Type (C)
7/7 type (C) efficiency topics (CO & CAS)
[Table; columns: MAiP, AVG MS, SUM SEC; rows: three CO runs and three CAS runs, including CO-15, CO-150, CAS-15, and CAS-150; MAiP is n/a for all runs]

Efficiency Track Results – Thorough, All
566/568 efficiency topics (CO & CAS)
[Table; columns: MAP, AVG MS, SUM SEC; rows: CO, CAS]
Note: top-15 only!

Conclusions & Outlook
- Scalable and efficient XML-IR with vague search
  - TopX 1.0 is our mature system, the default engine for INEX topic development & interactive tracks [VLDB-J Special Issue on DB&IR Integration '08]
- Brand-new TopX 2.0 prototype
  - Efficient reimplementation in C++, object-oriented XML storage, moderate compression rates
  - 20 to 30 times better sequential throughput than relational
  - Can do CAS in 0.05 sec avg. & CO in 0.02 sec avg. (classic ad-hoc topics), and CAS in 0.09 sec avg. & CO in 0.05 sec avg. (incl. difficult topics)
- More features
  - Generalized proximity search, graph top-k
  - Updates (gaps within document blocks)
  - XQuery Full-Text (top-k-style bounds over IF, For-Let)
  - …