TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks Martin Theobald Max Planck Institute Informatics Ralf Schenkel Saarland University Ablimit Aji Emory.


1 TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks
Martin Theobald (Max Planck Institute Informatics), Ralf Schenkel (Saarland University), Ablimit Aji (Emory University)

2 Outline
- Query rewriting
- Data & scoring model
- Distributed indexing (new for 2009!)
- Query processing
- Results
  - Ad-hoc (Focused)
  - Efficiency (Focused)

3 Query Rewriting I (NEXI/XPath-FT)
- CAS queries
  - //article//(sec|p)[(about(.//header, “Yoga Lessons”) or about(.//title, +Yoga -history)) and about(.//figure, exercise)]
- Query DAGs
  - tag-term pairs as leaves
  - navigational tags as support elements
- Discard all Boolean constraints; “andish” mode for both CO and CAS
[Figure: query DAG with navigational nodes article, sec, p (edges labeled // and self) and leaves header$yoga, header$lesson, title$yoga, figure$exercise]
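The rewriting step on this slide can be sketched in a few lines of Python. This is a minimal illustration and not TopX's parser: the function name `tag_term_leaves` and the regular expressions are assumptions, negated terms are simply dropped (the figure shows no leaf for "history"), and stemming ("Lessons" to "lesson") is left out.

```python
import re

def tag_term_leaves(nexi: str) -> list[str]:
    """Collect tag$term leaves from the about() clauses of a NEXI query.

    Hypothetical helper: Boolean operators between clauses are ignored,
    which mirrors the "andish" evaluation mode named on the slide.
    """
    leaves = []
    # Each about(.//tag, keywords) clause contributes one leaf per keyword.
    for tag, keywords in re.findall(r'about\(\s*\.//(\w+)\s*,\s*([^)]*)\)', nexi):
        for token in re.findall(r'(?<!\w)[+-]?\w+', keywords):
            if token.startswith('-'):
                continue  # assumption: negated terms produce no leaf
            leaves.append(f"{tag}${token.lstrip('+').lower()}")
    return leaves

query = ('//article//(sec|p)[(about(.//header, "Yoga Lessons") '
         'or about(.//title, +Yoga -history)) and about(.//figure, exercise)]')
print(tag_term_leaves(query))
```

The navigational part of the path (article, sec, p) is not extracted here; in the DAG it only supports the leaves and carries no terms of its own.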

4 Query Rewriting II (NEXI)
- CO queries
  - “Yoga Lessons” +Yoga -history exercise
  - //*[about(., “Yoga Lessons” +Yoga -history exercise)]
  - Virtual * tag, fully precomputed and materialized in inverted lists as *-term pairs
  - Can be generalized to specific tag classes (e.g. )
[Figure: leaves *$yoga, *$lesson, *$exercise attached via self]

5 Data Model
- XML trees (no XLink/ID/IDRef)
- Pre-/post-order ranges for the structure
- Redundant full-content text nodes
[Figure: example document "XML Data Management" with the text "XML management systems vary widely in their expressive power. Native XML Data Bases. Native XML data base systems can store schemaless data." Element tree with (pre, post) labels: article (1, 6), title (2, 1), abs (3, 2), sec (4, 5), title (5, 3), par (6, 4). Each element carries its stemmed full content, e.g. article: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”; sec: “native xml data base native xml data base system store schemaless data”. Resulting full-content term frequencies: ftf(“xml”, article_1) = 4, ftf(“xml”, sec_4) = 2]
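A small sketch of this data model, using Python's standard `xml.etree.ElementTree`. The helper names `annotate`, `is_descendant`, and `ftf` are illustrative, and the example document is a hand-stemmed version of the one on the slide; tokenization is plain whitespace splitting, which is an assumption.

```python
import xml.etree.ElementTree as ET

def annotate(root):
    """Assign (pre, post) ranks to every element in one depth-first pass."""
    ranks, counter = {}, {"pre": 0, "post": 0}

    def visit(node):
        counter["pre"] += 1
        pre = counter["pre"]
        for child in node:
            visit(child)
        counter["post"] += 1
        ranks[node] = (pre, counter["post"])

    visit(root)
    return ranks

def is_descendant(ranks, d, a):
    """d lies below a iff a starts earlier (pre) and finishes later (post)."""
    return ranks[a][0] < ranks[d][0] and ranks[d][1] < ranks[a][1]

def ftf(term, element):
    """Full-content term frequency: count term in the element's whole subtree."""
    return " ".join(element.itertext()).lower().split().count(term)

doc = ET.fromstring(
    "<article><title>xml data manage</title>"
    "<abs>xml manage system vary wide expressive power</abs>"
    "<sec><title>native xml data base</title>"
    "<par>native xml data base system store schemaless data</par></sec>"
    "</article>"
)
ranks = annotate(doc)
sec = doc.find("sec")
print(ranks[doc], ranks[sec], ftf("xml", doc), ftf("xml", sec))
```

Because every element is scored on its full subtree content, the same occurrence of a term is counted once per ancestor, which is the redundancy the slide refers to.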

6 Scoring Model [TopX @ INEX ’05–’09]
- XML-specific variant of Okapi BM25 (a.k.a. E-BM25, Robertson et al. [INEX ’05]) with k1 = 2.0, b = 0.75, and a decay factor for ftf of 0.925
[Formula not preserved; its annotated parts are: Content Index (Tag-Term Pairs), Element Freq., Element Statistics]
- Example: author[“gates”] vs. section[“gates”]
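The formula itself did not survive in this transcript, so the following is only a sketch of what an element-level BM25 score looks like when the textbook Okapi BM25 shape is applied per tag. The parameter names are assumptions, the statistics (element count, element frequency, average length) are taken to be tag-specific, and the ftf decay factor of 0.925 is not modeled because the slide does not say how it is applied.

```python
import math

def element_bm25(ftf, element_length, avg_length, n_elements, element_freq,
                 k1=2.0, b=0.75):
    """Score one tag-term pair for one element.

    ftf            full-content term frequency of the term in the element
    element_length length of the element's full content
    avg_length     average full-content length of elements with this tag
    n_elements     number of elements with this tag in the collection
    element_freq   number of those elements that contain the term
    """
    # Length normalization: longer-than-average elements need more hits.
    K = k1 * ((1 - b) + b * element_length / avg_length)
    tf_part = (k1 + 1) * ftf / (K + ftf)
    idf_part = math.log((n_elements - element_freq + 0.5) / (element_freq + 0.5))
    return tf_part * idf_part

# Same term, same ftf, but different tags give different statistics and scores.
print(element_bm25(ftf=1, element_length=3, avg_length=3, n_elements=1000, element_freq=5))
print(element_bm25(ftf=1, element_length=400, avg_length=300, n_elements=5000, element_freq=900))
```

Keeping statistics per tag is what lets a rare match such as author[“gates”] outweigh a common one such as section[“gates”].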

7 How to create a full CAS index for a large XML collection efficiently?
- TopX index statistics for Wikipedia 2009 (55 GB XML sources)
- Go distributed!

8 Distributed Indexing I
[Figure: p nodes (Node 1, Node 2, …, Node p) below the Top-k Engine. Document partitions: Docs [1, …, n/p], Docs [(n/p)+1, …, 2n/p], …, Docs [(p-1)(n/p)+1, …, n]. Index files: File [1] … File [f/p], File [(f/p)+1] … File [2f/p], …, File [(p-1)(f/p)+1] … File [f]. Each file holds inverted lists for keys such as tag$term 1, tag$term 3, …]
- Two-level hashing:
  - At query processing time: hash(t_i) → NodeId|FileId|ByteOffset (64-bit dictionary)
  - At indexing time: FileId(t_i) = hash(t_i) mod f, NodeId(t_i) = FileId(t_i) mod p
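The indexing-time half of the two-level hashing can be sketched as follows. The hash function is an assumption (the slide does not name one); a 64-bit BLAKE2b digest from Python's `hashlib` stands in for it, and `locate` is a hypothetical helper name.

```python
import hashlib

def term_hash(key: str) -> int:
    """Stable 64-bit hash of a tag$term key (stand-in for the unnamed hash)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def locate(key: str, f: int, p: int) -> tuple[int, int]:
    """Return (node_id, file_id) for a key, given f index files on p nodes."""
    file_id = term_hash(key) % f   # first level: which index file
    node_id = file_id % p          # second level: which node owns that file
    return node_id, file_id

for key in ("sec$xml", "title$xml", "par$retrieval"):
    print(key, locate(key, f=64, p=4))
```

Since the node is derived from the file and the file from the key alone, every indexing node can decide independently where a posting belongs, without coordination.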

9 Distributed Indexing II
- Shared dictionary maps 64-bit keys → 64-bit values
  - Using hash(t_i) as keys
  - Using 8 bits/NodeId, 12 bits/FileId, 44 bits/ByteOffset as values
- Max. distributed index size: 4,096 files x 2^44 bytes per file, where 2^44 bytes = 16 terabytes
  (the dictionary itself takes ~4 GB for 200 million keys)
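The 8/12/44-bit value layout can be illustrated with plain bit arithmetic. The field order NodeId|FileId|ByteOffset follows the previous slide; the function names `pack` and `unpack` are illustrative.

```python
NODE_BITS, FILE_BITS, OFFSET_BITS = 8, 12, 44  # 8 + 12 + 44 = 64

def pack(node_id: int, file_id: int, offset: int) -> int:
    """Pack the three fields into one 64-bit dictionary value."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in 8 bits")
    if not 0 <= file_id < (1 << FILE_BITS):
        raise ValueError("file_id does not fit in 12 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 44 bits")
    return (node_id << (FILE_BITS + OFFSET_BITS)) | (file_id << OFFSET_BITS) | offset

def unpack(value: int) -> tuple[int, int, int]:
    """Split a 64-bit dictionary value back into (node_id, file_id, offset)."""
    offset = value & ((1 << OFFSET_BITS) - 1)
    file_id = (value >> OFFSET_BITS) & ((1 << FILE_BITS) - 1)
    node_id = value >> (FILE_BITS + OFFSET_BITS)
    return node_id, file_id, offset

value = pack(node_id=3, file_id=2049, offset=122_564)
print(hex(value), unpack(value))
```

With 200 million entries of one 64-bit key plus one 64-bit value each, the raw payload is 200,000,000 x 16 bytes = 3.2 GB, which is consistent with the ~4 GB quoted on the slide once hash-table overhead is added.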

10 Index Files: Inverted Block Structure for CAS Queries
- Group element blocks with similar Max-Score into document blocks of fixed length (e.g. 256KB)
- Sort element blocks within each document block by Doc-ID
- Supports
  - Sequential (“sorted”) access by descending max(Max-Score)
  - Merge-joins by Doc-ID
- Dynamic top-k pruning, efficient merge-joins over large blocks
[Figure: index lists sec[“xml”] (labeled 0) and title[“xml”] (labeled 122,564). Each list is a sequence of document blocks (≤ 256KB) read by sorted access (SA). The first document block holds element blocks for Doc-ID 1, Doc-ID 5, Doc-ID 2, …; the next one holds Doc-ID 3, Doc-ID 6, …. Each element block carries its Max-Score and a list of (pre, post, score) entries]
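The grouping and sorting rules above can be sketched like this. It is a simplified model under stated assumptions: a document block is capped by a number of element blocks instead of a byte size, the postings are made-up illustrative values, and `build_document_blocks` is a hypothetical name.

```python
from collections import defaultdict

def build_document_blocks(postings, blocks_per_document_block=3):
    """Lay out one inverted list as document blocks of element blocks.

    postings: iterable of (doc_id, pre, post, score) for one tag-term pair.
    Returns a list of document blocks; each is a list of
    (doc_id, max_score, [(pre, post, score), ...]) element blocks.
    """
    by_doc = defaultdict(list)
    for doc_id, pre, post, score in postings:
        by_doc[doc_id].append((pre, post, score))

    # One element block per document, tagged with its best element score.
    element_blocks = [
        (doc_id, max(score for _, _, score in elems), sorted(elems))
        for doc_id, elems in by_doc.items()
    ]
    # Sorted access order: best documents first.
    element_blocks.sort(key=lambda block: -block[1])

    document_blocks = []
    for start in range(0, len(element_blocks), blocks_per_document_block):
        chunk = element_blocks[start:start + blocks_per_document_block]
        # Inside a document block, order by Doc-ID to enable merge-joins.
        document_blocks.append(sorted(chunk, key=lambda block: block[0]))
    return document_blocks

postings = [  # illustrative data, not taken from the slide
    (1, 2, 15, 0.9), (1, 3, 11, 0.3), (5, 23, 48, 0.8), (5, 10, 8, 0.5),
    (2, 2, 24, 0.7), (3, 6, 15, 0.6), (3, 13, 17, 0.5), (6, 5, 23, 0.4),
]
for block in build_document_blocks(postings):
    print([(doc_id, max_score) for doc_id, max_score, _ in block])
```

Reading document blocks front to back yields non-increasing upper bounds on the scores still unseen, which is what top-k pruning needs, while the Doc-ID order inside a block keeps joins between lists cheap.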

11 Merging Blocks Incrementally
- Sorted access and efficient merge-joins on top of large document blocks from disk
[Figure: document blocks of sec[“xml”] and par[“retrieval”] for the conditions //sec[about(.//, “XML”)] and //par[about(.//, “retrieval”)], read by sorted access (SA); the blocks are labeled with Max(Max-Score) values 1.0, 0.8, 0.9, 0.6]
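A merge-join over two Doc-ID-sorted document blocks can be sketched with the classic two-pointer scan. This covers only the join of one pair of blocks; the incremental scheduling of which block to read next is not modeled, and the function name and the score aggregation by summation are assumptions.

```python
def merge_join(left, right):
    """Join two document blocks that are sorted by Doc-ID.

    left, right: lists of (doc_id, max_score), ascending by doc_id.
    Returns (doc_id, combined_score) for documents present in both blocks.
    """
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_doc, left_score = left[i]
        right_doc, right_score = right[j]
        if left_doc == right_doc:
            joined.append((left_doc, left_score + right_score))
            i += 1
            j += 1
        elif left_doc < right_doc:
            i += 1  # document has no partner in the right block
        else:
            j += 1  # document has no partner in the left block
    return joined

sec_xml = [(1, 0.5), (2, 0.75), (5, 0.25)]          # illustrative values
par_retrieval = [(2, 1.0), (4, 0.5), (5, 0.5), (7, 0.75)]
print(merge_join(sec_xml, par_retrieval))
```

Each block is scanned once, so the cost is linear in the combined block size. In an “andish” evaluation, documents that appear in only one block would normally be kept as partial candidates instead of being skipped as they are here.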

12 Some more tricks…
- Dump leading histogram blocks directly into index list headers (histogram block: ~36 bytes)
- Histograms only for index lists that exceed one document block (<5% of all lists)
- Supports probabilistic pruning and cost-based index access scheduling [Prob-Top-K, VLDB ’04; IO-Top-K, VLDB ’06]
- Efficient on-the-fly index decompression (S16), internal caching of decompressed index lists
- Incrementally read & process precomputed memory images for fast top-k queries on top of large disk blocks

13 Runs
- Ad-hoc Track (Article-Only, CO & CAS)
  - Focused
  - Best-In-Context
  - Thorough
- Efficiency
  - Type (A) Focused (same as Ad-hoc Focused): Top-15, Top-150, Top-1500, Article-Only, CO & CAS
  - Type (B) Focused, CO only: Top-15 only, but up to 96 keywords/query

14 Results – Ad-hoc, Focused

15 Results – Ad-hoc, Best-In-Context

16 Results – Ad-hoc, Thorough

17 Results – Efficiency, Focused (Type A)

18

19 Results – Efficiency, Focused (Type B)

20

21 Future Work
- Phrase-matching & proximity ranking (non-monotonic!)
- “Holistic” top-k for XQuery
  - Multiple XPaths per XQuery
  - Efficient inter-document retrieval
  - Complex Boolean constraints among paths
- Updates!
- Full-fledged open-source platform for W3C XQuery Full-Text

