1
TopX 2.0 — A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Stanford University Ralf Schenkel Max-Planck Institute Mohammed AbuJarour Hasso-Plattner Institute
2
Example query:
//article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]
[Figure: two article trees from the INEX ’03-’05 IEEE Collection, built from article, sec, par, title, bib, item, inproc and url elements. Text nodes include “Native XML data base systems can store schemaless data...”, “XML-QL: A Query Language for XML.”, “Current Approaches to XML Data Management”, “What does XML add for retrieval? It adds formal ways …” and “The Ontology Game”.]
Challenges illustrated: RANKING, VAGUENESS, EARLY PRUNING
3
[Figure: TopX 1.0 architecture]
Frontends: Web Interface, Web Service API
Indexer/Crawler
Ontology/Large Thesaurus: WordNet, OpenCyc, etc.
Relational DBMS Backend (JDBC 2.0): Unified Text & XML Schema
Index Metadata: Selectivities, Histograms, Correlations
TopX 1.0 Query Processor: Non-conjunctive Top-k XPath Query Processing
Query processor components: Scan Threads, Candidate Queue, Top-k Queue
Index access: Sequential Access (SA), Random Access (RA)
Expensive Predicates (RA): Path Conditions, Phrases & Proximity, Other Full-Text Op’s
Probabilistic Candidate Pruning
Probabilistic Index Access Scheduling
Dynamic Query Expansion
4
Data Model
XML trees (no XLink/ID/IDRef)
Pre-/postorder ranges for the structural index
Redundant full-content text nodes
Example document:
article with title “XML Data Management”, abs “XML management systems vary widely in their expressive power.”, and a sec holding title “Native XML Data Bases.” and par “Native XML data base systems can store schemaless data.”
Elements with (pre, post) and stemmed full content:
article (1, 6): “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
title (2, 1): “xml data manage”
abs (3, 2): “xml manage system vary wide expressive power”
sec (4, 5): “native xml data base native xml data base system store schemaless data”
title (5, 3): “native xml data base”
par (6, 4): “native xml data base system store schemaless data”
ftf(“xml”, article 1) = 4
ftf(“xml”, sec 4) = 2
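The labeling on this slide can be reproduced with a short sketch. It assigns pre- and postorder ranks in one depth-first pass and counts full-content term frequencies over the stemmed example text. The class name Element and the helper functions are hypothetical, chosen for illustration only, and say nothing about how TopX itself implements the index.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    tag: str
    text: str = ""                       # stemmed text directly under this element
    children: list = field(default_factory=list)
    pre: int = 0
    post: int = 0

def label(root):
    """Assign 1-based pre- and postorder ranks in a single depth-first traversal."""
    counters = {"pre": 0, "post": 0}
    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]
    visit(root)
    return root

def full_content(node):
    """Terms of the element itself plus the terms of all its descendants."""
    terms = node.text.split()
    for child in node.children:
        terms.extend(full_content(child))
    return terms

def ftf(term, node):
    """Full-content term frequency of `term` in `node`."""
    return full_content(node).count(term)

def is_descendant(d, a):
    """d lies below a exactly if a's (pre, post) range encloses d's."""
    return a.pre < d.pre and d.post < a.post

sec = Element("sec", children=[
    Element("title", "native xml data base"),
    Element("par", "native xml data base system store schemaless data"),
])
article = label(Element("article", children=[
    Element("title", "xml data manage"),
    Element("abs", "xml manage system vary wide expressive power"),
    sec,
]))

print(article.pre, article.post, ftf("xml", article))
print(sec.pre, sec.post, ftf("xml", sec))
```

The ancestor test needs only two integer comparisons per element pair, which is what makes pre-/postorder ranges attractive as a structural index.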
5
Scoring Model [INEX ’05/’06/’07]
XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text)
[Formula not preserved in this transcript; its annotated parts were: Content Index (Tag-Term Pairs), Element Freq., Element Statistics]
Example: bib[“transactions”] vs. par[“transactions”]
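Since the slide's formula did not survive, the following is only a sketch of a textbook Okapi BM25 weight computed with per-tag statistics, to show why bib[“transactions”] and par[“transactions”] can score differently. It is not the authors' exact formula. The function name, the parameter defaults k1 = 1.2 and b = 0.75, and all numbers in the example are assumptions made for illustration.

```python
import math

def bm25_element_score(ftf, ef, n_tag, length, avg_length, k1=1.2, b=0.75):
    """Textbook BM25 weight for one (tag, term) pair in one element.

    ftf        -- full-content term frequency of the term in the element
    ef         -- number of elements with this tag that contain the term
    n_tag      -- total number of elements with this tag
    length     -- full-content length of the element
    avg_length -- average full-content length of elements with this tag
    """
    k = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (k + ftf)
    idf_part = math.log((n_tag - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# Invented statistics: the term is rare among bib elements, common among par elements.
bib_score = bm25_element_score(ftf=2, ef=10, n_tag=1000, length=50, avg_length=50)
par_score = bm25_element_score(ftf=2, ef=400, n_tag=1000, length=50, avg_length=50)
print(bib_score > par_score)
```

With identical term frequency and length, the element-frequency part alone separates the two tag-term pairs, which is the point of keeping statistics per tag.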
6
TopX 1.0: Relational Schema
Precompute & materialize scoring model into combined inverted index over tag-term pairs
Supports sorted access (by MaxScore) and random access (by DocID)
Two B+trees
Sorted access (SA) for sec[“xml”]:
Select DocID, Pre, Post, Score From TagTermIndex Where tag='sec' and term='xml' Order by MaxScore desc, DocID desc, Pre asc, Post desc
Random access (RA):
Select Pre, Post, Score From TagTermIndex Where DocID=3 and tag='sec' and term='xml' Order by Pre asc, Post desc
7
Top-k XPath on a Relational Schema [VLDB ’05]
Content-only (CO) & “structure enriched” queries:
//sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
Sequentially (mostly) scan each index list in desc. order of MaxScore
Hash-join element blocks by DocID in-memory
Do “some” incremental XPath evaluation using Pre/Post indices
Aggregate Score along connected path fragments
Use variant of Fagin’s threshold algorithm for top-k-style early termination
Index lists scanned: sec[“xml”], title[“native”], par[“retrieval”]
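The early-termination idea can be sketched with a much simplified, document-level threshold algorithm in the no-random-access style. It is not TopX's actual processing: it ignores element blocks, XPath evaluation and probabilistic pruning, and it assumes each document appears at most once per list. The function name nra_top_k and the sample lists are hypothetical.

```python
def nra_top_k(index_lists, k):
    """index_lists: lists of (doc_id, score), each sorted by score descending.

    Returns (top_k, sorted_accesses). Scores in top_k are lower bounds
    (worst scores); missing conditions count as 0 (non-conjunctive).
    """
    m = len(index_lists)
    positions = [0] * m
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]  # bound for unseen entries
    seen = {}                                                  # doc_id -> {list index: score}
    while True:
        progressed = False
        for i, lst in enumerate(index_lists):                  # round-robin sorted access
            if positions[i] < len(lst):
                doc, score = lst[positions[i]]
                positions[i] += 1
                seen.setdefault(doc, {})[i] = score
                high[i] = lst[positions[i]][1] if positions[i] < len(lst) else 0.0
                progressed = True
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        top = sorted(worst, key=lambda d: (-worst[d], d))[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            rivals = max((best[d] for d in seen if d not in top), default=0.0)
            # Stop once no seen candidate and no unseen document can still overtake.
            if min_k >= max(rivals, sum(high)):
                return [(d, worst[d]) for d in top], sum(positions)
        if not progressed:                                     # all lists exhausted
            return [(d, worst[d]) for d in top], sum(positions)

list_a = [(1, 0.9), (2, 0.8), (3, 0.1)]
list_b = [(2, 0.9), (1, 0.3), (4, 0.2)]
result, accesses = nra_top_k([list_a, list_b], k=1)
print(result, accesses)
```

In this example the scan stops after two entries per list, because the remaining entries of both lists cannot lift any other document above the current top result.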
8
Top-k XPath on a Relational Schema [VLDB ’05]
Content-and-structure (CAS) queries:
//article//sec[about(.//, “XML”)]
Non-conjunctive XPath evaluations
Dynamically relax content- & structure-related query conditions (top-k results entirely driven by score aggregations for content & structure cond.’s)
Expensive predicate probes (RA) to the structure index (3rd B+tree):
Select Pre, Post From TagIndex Where DocID=2123 and Tag='article' Order by Pre asc, Post desc
[Figure: sorted access (SA) on sec[“xml”], random access (RA) probes for article]
9
Relational Schema (cont’d)
No shredding into DTD-specific relational schema!
No DTD at all for INEX Wikipedia!
20,810,942 distinct tag-term pairs for 4.38 GB Wikipedia collection
1,107 distinct tags
[Figure: index entries for sec[“xml”] and article]
10
Relational Schema (cont’d)
2-dimensional source of redundancy
  Full-content scoring model (#terms times avg. depth of a text node, 6.7 for INEX Wiki)
  De-normalized relational schema
High overhead in the architecture (Java->JDBC->DBMS & back)
Element-block sizes are data-driven, not easy to control layout on disk
Hashing too slow compared to very efficient in-memory merge-joins
Content Index: (4+4+4+4+4+4+4) bytes x 567,262,445 tag-term pairs, about 16 GB
Structure Index: (4+4+4+4) bytes x 52,561,559 tags, about 0.85 GB
11
TopX 2.0: Object-Oriented Storage
[Figure: binary file layout. Each index list is a sequence of element blocks; an element block holds one DocID followed by its (Pre, Post, Score) entries. The figure labels DocID and MaxScore on the blocks.]
B – Element block separator
L – Index list separator
Offset index into the binary file: sec[“xml”] 0, title[“xml”] 122,564, …, par[“xml”] 432,534
Relational: (4+4+4+4+4+4+4) x 567,262,445, about 16 GB
Object-oriented: 4 x 456,466,649 + (4+4+4) x 567,262,445, about 8.6 GB (+ (4+4) x 20,810,942 = 166 MB for the offset index)
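The size estimates on this slide can be checked with plain arithmetic. The sketch below only recomputes the slide's own expressions. Reading them as "one 4-byte DocID per element block plus three 4-byte fields per tag-term pair" is an interpretation, and the constant names are made up.

```python
INT_BYTES = 4
TAG_TERM_PAIRS = 567_262_445
ELEMENT_BLOCKS = 456_466_649
DISTINCT_TAG_TERM_PAIRS = 20_810_942

# Relational: seven 4-byte columns per tag-term pair.
relational = 7 * INT_BYTES * TAG_TERM_PAIRS

# Object-oriented: 4 bytes per element block plus three 4-byte fields per pair.
object_oriented = INT_BYTES * ELEMENT_BLOCKS + 3 * INT_BYTES * TAG_TERM_PAIRS

# Offset index: two 4-byte values per distinct tag-term pair.
offset_index = 2 * INT_BYTES * DISTINCT_TAG_TERM_PAIRS

for name, size in [("relational", relational),
                   ("object-oriented", object_oriented),
                   ("offset index", offset_index)]:
    print(f"{name}: {size:,} bytes")
```

The object-oriented layout saves space mainly because tag, term and MaxScore are no longer repeated in every row.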
12
Object-Oriented Storage w/Block-Merging
Group element blocks with similar MaxScore into document blocks of fixed length (e.g. 256KB)
Sort element blocks within each document block by DocID
Supports
  Sorted access by MaxScore
  Merge-joins by DocID
  Raw disk access
[Figure: index lists sec[“xml”] (offset 0) and title[“xml”] (offset 122,564). Element blocks, separated by B, are grouped into document blocks ordered by MaxScore; L ends an index list.]
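The two grouping rules above can be sketched as a greedy packing step. This is an illustration under assumptions, not the TopX 2.0 index builder: element blocks are modeled as (doc_id, max_score, size_bytes) tuples, the function name is hypothetical, and an element block larger than the capacity simply gets a document block of its own.

```python
def build_document_blocks(element_blocks, capacity_bytes):
    """Pack element blocks into document blocks of bounded size.

    element_blocks: list of (doc_id, max_score, size_bytes).
    Blocks are taken in descending MaxScore order, so each document block
    covers a contiguous score range; inside a document block the element
    blocks are re-sorted by DocID to allow merge-joins.
    """
    ordered = sorted(element_blocks, key=lambda eb: -eb[1])
    doc_blocks, current, used = [], [], 0
    for eb in ordered:
        if current and used + eb[2] > capacity_bytes:
            doc_blocks.append(sorted(current, key=lambda e: e[0]))
            current, used = [], 0
        current.append(eb)
        used += eb[2]
    if current:
        doc_blocks.append(sorted(current, key=lambda e: e[0]))
    return doc_blocks

sample = [(1, 0.5, 10), (2, 0.9, 10), (3, 0.8, 10),
          (5, 0.3, 10), (6, 0.7, 10), (7, 0.6, 10)]
blocks = build_document_blocks(sample, capacity_bytes=30)
print([[doc_id for doc_id, _, _ in block] for block in blocks])
```

Scanning document blocks in order approximates sorted access by MaxScore, while the DocID order inside each block is what the merge-joins rely on.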
13
Merging Document Blocks
Sequential access and efficient merge-joins on top of large document blocks
//sec[about(.//, “XML”)]//par[about(.//, “retrieval”)]
[Figure: sorted access (SA) over the index lists sec[“xml”] and par[“retrieval”]; document blocks of both lists are merge-joined by DocID. Example score values shown: 1.0, 0.8, 0.7.]
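A merge-join over two DocID-sorted document blocks can be sketched with two cursors. The version below is an outer join, matching the non-conjunctive setting in which a document may satisfy only one condition. It works on (doc_id, score) pairs rather than on real element blocks, and the function name and sample data are invented.

```python
def merge_join(left, right):
    """Outer merge-join of two lists of (doc_id, score) sorted by doc_id.

    Documents present in both inputs get the sum of their scores;
    documents present in only one input keep their single score.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_doc, left_score = left[i]
        right_doc, right_score = right[j]
        if left_doc == right_doc:
            out.append((left_doc, left_score + right_score))
            i += 1
            j += 1
        elif left_doc < right_doc:
            out.append((left_doc, left_score))
            i += 1
        else:
            out.append((right_doc, right_score))
            j += 1
    out.extend(left[i:])     # leftovers match nothing on the other side
    out.extend(right[j:])
    return out

sec_xml = [(2, 0.7), (3, 0.3), (5, 0.9)]
par_retrieval = [(3, 0.9), (5, 0.5), (7, 0.2)]
joined = merge_join(sec_xml, par_retrieval)
print(joined)
```

Each input is read once from front to back, so the join needs no hash table and touches memory sequentially.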
14
Compressed Number Encoding
Multi-attribute (4), double-nested block-index structure
Delta encoding only works for DocID (and to some extent for Pre)
No specific assumptions on distributions of Pre/Post or Score
  No Unary or Huffman coding (prefix-free but additional coding table)
Sophisticated compression schemes may be expensive to decode
  No Zip, etc.
But known number ranges
  DocID [1, 659,388] -> 3 bytes (254^3 = 16,387,064, lossless)
  Pre/Post [1, 43,114] -> 2 bytes (254^2 = 64,516, lossless)
  Score [0,1] -> rounded to 1 byte (254 buckets, lossy)
Variable-length byte encoding w/leading length-indicator byte
[Figure: example encoded entries with fields Len, Pre, Post, Score, occupying 5 bytes and 10 bytes]
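The byte widths above follow from writing numbers in base 254. The sketch below shows a fixed-width base-254 encoding for DocID, Pre and Post and a 254-bucket quantization for Score. It assumes that the two unused byte values are reserved for other purposes (for example the B and L separators), which the slide does not state. It also leaves out the length-indicator byte and delta encoding, and all function names are hypothetical.

```python
BASE = 254  # assumption: 2 of the 256 byte values are reserved, e.g. as separators

def encode_fixed(value, width):
    """Encode 0 <= value < BASE**width as `width` base-254 digits, big-endian."""
    if not 0 <= value < BASE ** width:
        raise ValueError("value out of range for this width")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        digits.append(digit)
    return bytes(reversed(digits))

def decode_fixed(data):
    """Inverse of encode_fixed."""
    value = 0
    for byte in data:
        value = value * BASE + byte
    return value

def encode_score(score):
    """Lossy: map a score in [0, 1] to one of 254 buckets (0..253)."""
    return round(score * (BASE - 1))

def decode_score(bucket):
    return bucket / (BASE - 1)

def encode_entry(doc_id, pre, post, score):
    """3 bytes DocID + 2 bytes Pre + 2 bytes Post + 1 byte Score."""
    return (encode_fixed(doc_id, 3) + encode_fixed(pre, 2)
            + encode_fixed(post, 2) + bytes([encode_score(score)]))

entry = encode_entry(659_388, 43_114, 1, 0.9)
print(len(entry), decode_fixed(entry[:3]), decode_fixed(entry[3:5]))
```

Because the digits are stored most significant first, comparing two encoded DocIDs byte by byte gives the same order as comparing the numbers, so such values can be compared without decoding them.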
15
Some more tricks…
Dump leading histogram blocks into index list headers
Histograms only for index lists that exceed one document block (<5% of all lists)
Own native compare methods for DocID, Pre/Post
Decode only Score for arithmetic op’s (mostly perform pointer operations at query-processing time)
Incrementally read & process precomputed memory image for fast top-k queries on top of large disk blocks
[Figure: layout of the index list sec[“xml”]: a 36-byte histogram block (score vs. freq) followed by document blocks DB 1 … DB l of 256 KB each, each holding element blocks EB 1 … EB k]
16
Block Access Scheduling [VLDB ’06]
Inverted Block-Index (256KB doc-blocks)
SA Scheduling
  Look-ahead Δi through precomputed score histograms
  Knapsack-based optimization of Score Reduction
RA Scheduling
  2-phase probing: Schedule RAs “late & last”, i.e., clean up the queue if …
  Extended probabilistic cost model for integrated SA & RA scheduling
[Figure: index lists with example block scores between 1.0 and 0.2, SA and RA steps, and look-ahead score reductions Δ3,3 = 0.2 and Δ1,3 = 0.8]
17
Object Storage Summary
4.38 GB Wikipedia XML sources
Content Index (incl. histograms)
  567,262,445 tag-term pairs
  20,810,942 distinct tag-term pairs
  20,815,884 document blocks (256KB)
  456,466,649 element blocks
  4,703,385,686 total bytes (8.3 bytes/tag-term pair)
Structure Index
  52,561,559 tags (elements)
  1,107 distinct tags
  2,323 document blocks (256KB)
  8,999,193 element blocks
  246,601,752 total bytes (4.7 bytes/tag)
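The per-entry figures in this summary follow directly from the totals. The short check below uses only numbers from the slide; the variable names are made up.

```python
content_bytes = 4_703_385_686
tag_term_pairs = 567_262_445
structure_bytes = 246_601_752
tags = 52_561_559

bytes_per_pair = content_bytes / tag_term_pairs   # content index
bytes_per_tag = structure_bytes / tags            # structure index

print(f"content index:   {bytes_per_pair:.1f} bytes per tag-term pair")
print(f"structure index: {bytes_per_tag:.1f} bytes per tag")
```

Both values are well below the 28 and 16 bytes per row of the relational layout described earlier in the deck.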
18
Preliminary Runtime Experiments CO (top-10, non-conjunctive)
19
Preliminary Runtime Experiments CAS (top-10, non-conjunctive)
20
Some INEX Results CAS (top-1,500, non-conjunctive)
21
Some INEX Results CAS (top-1,500, non-conjunctive)
22
Conclusions & Outlook
Scalable and efficient XML-IR with vague search
Mature system, reference engine for INEX topic development & interactive tracks [VLDB Special Issue on DB&IR Integration ’08]
Brand-new TopX 2.0 prototype
  Very efficient reimplementation in C++
  Object-oriented XML storage, moderate compression rates
  10-20 times better sequential throughput than relational
More features
  Generalized proximity search, graph top-k
  Updates (gaps within document blocks)
  XQuery Full-Text (top-k-style bounds over IF, For-Let)
  …
23
http://www.inex.otago.ac.nz/