1/28 Efficient Top-k Queries for XML Information Retrieval
Gerhard Weikum, joint work with Ralf Schenkel and Martin Theobald
Max-Planck-Gesellschaft

2/28 A Few Challenging Queries (on Web / Deep Web / Intranet / Personal Info)
- Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king?
- Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
- Who was the woman from Paris that I met at the PC meeting where Alon Halevy was PC chair?
- Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g?
- Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture?

3/28 XML-IR Example (1)
[Figure: an example document. Professor: Name: Gerhard Weikum, Address: ..., City: SB, Country: Germany; Teaching: Course: Title: IR, Description: Information retrieval ..., Syllabus: ...; Research: Book, Article, ..., Project: Title: Intelligent Search of XML Data, Sponsor: German Science Foundation, ...]
Query:
Select P, C, R
From Index
Where Professor As P
And P = "Saarbruecken"
And P//Course = "IR" As C
And P//Research = "XML" As R

4/28 XML-IR Example (2)
[Figure: the professor document from the previous slide plus a second document. Lecturer: Name: Ralf Schenkel, Address: Max-Planck Institute for CS, Germany, Interests: Semistructured Data, IR; Teaching: Seminar: Title: Statistical Language Models, Contents: Ranked Search ...; Literature: Book ...]
Exact query:
Select P, C, R From Index
Where Professor As P And P = "Saarbruecken"
And P//Course = "IR" As C And P//Research = "XML" As R
Relaxed query with similarity conditions:
Select P, C, R From Index
Where ~Professor As P And P = "~Saarbruecken"
And P//~Course = "~IR" As C And P//~Research = "~XML" As R

5/28 XML-IR: History and Related Work
Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WHIRL (CMU)
IR on structured docs (SGML): HyperStorM (GMD Darmstadt), HySpirit (U Dortmund)
XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery (W3C), TeXQuery (AT&T Labs), FleXPath (AT&T Labs)
IR on XML: XIRQL (U Dortmund), XXL (U Saarland / MPI), XRank (Cornell U), JuruXML (IBM Haifa), ELIXIR (U Dublin), ApproXQL (U Berlin / U Munich), Timber (U Michigan), XSearch (Hebrew U), PowerDB-IR (ETH Zurich), Compass (U Saarland / MPI); commercial software (MarkLogic, Verity?, Oracle?, Google?, ...)
Evaluation: INEX benchmark

6/28 XML-IR Concepts
Where clause: a conjunction of restricted path expressions with binding of variables, e.g.:
Select P, C, R From Index
Where ~Professor As P
And P = "Saarbruecken"
And P//~Course = "Information Retrieval" As C
And P//~Research = "~XML" As R
This combines elementary conditions on names and contents with "semantic" similarity conditions on names and contents (e.g., ~Research = "~XML").
Relevance scoring: based on tf*idf similarity of contents, ontological similarity of names, and aggregation of local scores into global scores.
Query result, exact matching: the query is a path/tree/graph pattern; results are isomorphic paths/subtrees/subgraphs of the data graph.
Query result, relaxed matching: the query is a pattern with relaxable conditions; results are approximate matches to the query with similarity scores.
Applicable to both XML and HTML data graphs.

7/28 Ontologies/Thesauri: Example WordNet
woman, adult female – (an adult female person)
  => amazon, virago – (a large, strong, and aggressive woman)
  => donna – (an Italian woman of rank)
  => geisha, geisha girl – (...)
  => lady – (a polite name for any woman)
  ...
  => wife – (a married woman, a man's partner in marriage)
  => witch – (a being, usually female, imagined to have special powers derived from the devil)

8/28 Ontology Graph
An ontology graph is a directed graph with concepts (and their descriptions) as nodes and semantic relationships as edges (e.g., hypernyms).
[Figure: an example graph around the concept "woman" with related nodes such as human, body, personality, character, lady, witch, nanny, fairy, heart and instances Mary Poppins and Lady Di; edges are labeled with relationship type and weight, e.g., syn (1.0), hyper (0.9), hypo (0.77), hypo (0.35), mero (0.5), part (0.3), instance (0.61).]
Weighted edges capture the strength of relationships; they are the key to identifying closely related concepts.
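The slide does not commit to a formula for deriving concept-to-concept similarity from the edge weights; one plausible reading is to take the maximum product of edge weights over any connecting path. The following Python sketch (function name and toy edge list are mine, with weights loosely borrowed from the figure) computes this with a Dijkstra-style best-first search:

```python
import heapq
from collections import defaultdict

def best_path_similarity(edges, source, target):
    """Similarity of two concepts as the maximum product of edge weights
    over any connecting path (one possible interpretation of the weighted
    ontology graph, not the talk's fixed definition)."""
    graph = defaultdict(list)
    for u, v, w in edges:                 # treat edges as undirected for the sketch
        graph[u].append((v, w))
        graph[v].append((u, w))
    best = {source: 1.0}
    heap = [(-1.0, source)]               # max-heap via negated similarities
    while heap:
        neg_sim, node = heapq.heappop(heap)
        sim = -neg_sim
        if node == target:
            return sim                    # first pop of the target is the best path
        if sim < best.get(node, 0.0):
            continue                      # stale queue entry
        for nxt, w in graph[node]:
            cand = sim * w
            if cand > best.get(nxt, 0.0):
                best[nxt] = cand
                heapq.heappush(heap, (-cand, nxt))
    return 0.0

# Toy edges roughly following the figure:
edges = [("woman", "lady", 0.77), ("woman", "witch", 0.35),
         ("witch", "Mary Poppins", 0.2), ("lady", "Lady Di", 0.61)]
print(round(best_path_similarity(edges, "woman", "Mary Poppins"), 2))  # 0.07
```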

9/28 Query Expansion
Threshold-based query expansion: substitute ~w by the disjunction (c1 | ... | ck) of all concepts ci for which sim(w, ci) >= θ. This is an "old hat" in IR and highly disputed because of the danger of topic dilution.
Approach to careful expansion: determine phrases from the query or from the best initial query results (e.g., by forming 3-grams and looking up ontology/thesaurus entries); if a phrase is uniquely mapped to one concept, expand it with synonyms and weighted hyponyms.
Problem: the choice of the threshold θ → see top-k query processing.
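As a concrete illustration of the threshold-based variant, here is a minimal Python sketch; the ontology lookup table, its similarity values, and the function name are made up for the example, and only the substitution step described above is shown:

```python
def expand(term, ontology_sims, theta):
    """Threshold-based expansion: replace ~term by a weighted disjunction of
    all related concepts c with sim(term, c) >= theta."""
    related = [(c, s) for c, s in ontology_sims.get(term, []) if s >= theta]
    return [(term, 1.0)] + related

# Hypothetical similarities for ~performance:
sims = {"performance": [("response time", 0.7), ("throughput", 0.6),
                        ("queueing", 0.3), ("delay", 0.25)]}
print(expand("performance", sims, theta=0.5))
# [('performance', 1.0), ('response time', 0.7), ('throughput', 0.6)]
# The open question from the slide remains: how to pick theta.
```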

10/28 Outline
Motivation & Preliminaries
Prob-k: Efficient Approximate Top-k
TopX: Top-k for XML

11/28 Top-k Query Processing with Scoring
Given: a query q = t1 t2 ... tz with z (conjunctive) keywords and a similarity scoring function score(q,d) for docs d ∈ D, e.g., score(q,d) = aggr{ s_i(d) } with s_i(d) the tf*idf score of term t_i in d and aggr the sum over the query terms.
Find: the top-k results w.r.t. score(q,d).
Index structure: a B+ tree on terms pointing to index lists with (DocId, s = tf*idf) entries, sorted by DocId. [Figure: example lists for the query q = "algorithm performance z-transform".]
Scale (Google): > 10 mio. terms, > 8 bio. docs, > 4 TB index.
Naive join&sort QP algorithm:
top-k( σ[term=t1](index) ⋈DocId σ[term=t2](index) ⋈DocId ... ⋈DocId σ[term=tz](index) order by s desc )
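For reference, the naive join&sort baseline fits in a few lines of Python; the doc ids and tf*idf scores below are purely illustrative, and each per-term index list is modeled as a plain dict rather than a B+ tree leaf chain:

```python
def naive_topk(index_lists, k):
    """Naive join&sort baseline: conjunctively join the per-term lists on
    DocId, aggregate scores by summation, sort, and keep the best k."""
    common = set(index_lists[0])
    for lst in index_lists[1:]:
        common &= set(lst)                       # conjunctive query: doc must appear in every list
    scored = [(sum(lst[d] for lst in index_lists), d) for d in common]
    return sorted(scored, reverse=True)[:k]

lists = [{17: 0.3, 44: 0.4, 52: 0.1, 53: 0.8},   # "algorithm"
         {17: 0.6, 44: 0.4, 53: 0.1, 61: 0.3},   # "performance"
         {17: 0.1, 44: 0.5, 53: 0.2}]            # "z-transform"
print(naive_topk(lists, k=2))                    # doc 44 scores 1.3, doc 53 scores 1.1
```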

12/28 TA (Fagin '01; Güntzer/Kießling/Balke; Nepal et al.)
scan all lists L_i (i = 1..m) in parallel:
  consider doc d_j at position pos_i in L_i; high_i := s_i(d_j);
  if d_j not yet in top-k then {
    look up s_l(d_j) in all lists L_l with l ≠ i;   // random accesses
    compute s(d_j) := aggr{ s_l(d_j) | l = 1..m };
    if s(d_j) > min score among top-k then
      add d_j to top-k and remove the min-score doc from top-k;
  };
  if min score among top-k >= aggr{ high_l | l = 1..m } then exit;
[Figure: example with m = 3 lists, aggr = sum, k = 2:
  L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
  L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
  L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
  yielding the top-2 result a: 0.95, b: 0.8 (f: 0.75 just misses).]
But random accesses are expensive! → TA-sorted → Prob-sorted
Applicable to XML data, e.g.: course = "~Internet" and ~topic = "performance".
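A minimal Python sketch of the TA scan above, run on the three example lists from the figure; the dicts stand in for the random-access index, and the function name is chosen here, not taken from the talk:

```python
def threshold_algorithm(lists, k, aggr=sum):
    """TA sketch: each list is a sequence of (doc, score) pairs sorted by
    descending score; a dict per list serves as the random-access structure."""
    as_dict = [dict(lst) for lst in lists]
    topk = {}                                    # doc -> fully aggregated score
    for pos in range(max(len(lst) for lst in lists)):
        highs = []
        for i, lst in enumerate(lists):
            if pos >= len(lst):
                highs.append(0.0)                # list i is exhausted
                continue
            doc, s = lst[pos]
            highs.append(s)                      # high_i = score at the current scan depth
            if doc not in topk:
                full = aggr(d.get(doc, 0.0) for d in as_dict)  # random accesses
                topk[doc] = full
                if len(topk) > k:
                    topk.pop(min(topk, key=topk.get))          # evict the min-score doc
        if len(topk) == k and min(topk.values()) >= aggr(highs):
            break                                # no unseen doc can still enter the top-k
    return sorted(topk.items(), key=lambda x: -x[1])

L1 = [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)]
L2 = [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)]
L3 = [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)]
print(threshold_algorithm([L1, L2, L3], k=2))    # top-2: a ≈ 0.95, b ≈ 0.8
```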

13/28 TA-Sorted (aka NRA)
scan index lists in parallel:
  consider doc d_j at position pos_i in L_i; E(d_j) := E(d_j) ∪ {i}; high_i := s_i(q,d_j);
  bestscore(d_j) := aggr{x_1, ..., x_m} with x_i := s_i(q,d_j) for i ∈ E(d_j) and x_i := high_i for i ∉ E(d_j);
  worstscore(d_j) := aggr{x_1, ..., x_m} with x_i := s_i(q,d_j) for i ∈ E(d_j) and x_i := 0 for i ∉ E(d_j);
  top-k := the k docs with largest worstscore;
  if min worstscore among top-k >= max bestscore{d | d not in top-k} then exit;
[Figure: the same example lists with m = 3, aggr = sum, k = 2; the current top-k holds a and b, while f, h, d, c, and g remain candidates with open best-scores.]
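A corresponding sketch of TA-sorted / NRA with sequential accesses only; the E(d), worstscore, and bestscore bookkeeping follows the pseudocode above, and the stopping test additionally accounts for documents not seen in any list yet:

```python
def nra(lists, k):
    """NRA sketch: lists are (doc, score) pairs sorted by descending score;
    no random accesses are performed."""
    m = len(lists)
    seen = {}                                    # doc -> {list index: score}, i.e. E(d) plus scores
    highs = [lst[0][1] for lst in lists]
    for pos in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):
            if pos < len(lst):
                doc, s = lst[pos]
                highs[i] = s                     # current high_i bound
                seen.setdefault(doc, {})[i] = s
            else:
                highs[i] = 0.0                   # list i is exhausted
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(highs[i] for i in range(m) if i not in sc)
                for d, sc in seen.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = min(worst[d] for d in topk) if len(topk) == k else 0.0
        max_rest = max((best[d] for d in seen if d not in topk), default=0.0)
        if len(topk) == k and min_k >= max_rest and min_k >= sum(highs):
            break                                # neither a candidate nor an unseen doc can overtake
    return [(d, worst[d]) for d in topk]

L1 = [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)]
L2 = [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)]
L3 = [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)]
print(nra([L1, L2, L3], k=2))                    # top-2: a ≈ 0.95, b ≈ 0.8
```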

14/28 Evolution of a Candidate's Score
The TA family of algorithms is based on the invariant (with sum as aggr)
  worstscore(d) <= score(d) <= bestscore(d).
[Figure: worstscore(d) and bestscore(d) slowly converge to the final score as the scan depth grows, relative to the current min-k score.]
Add d to the top-k result if worstscore(d) > min-k.
Drop d from the candidate queue only if bestscore(d) < min-k; otherwise keep it in the queue.
→ Overly conservative threshold and long sequential index scans.
Approximate top-k instead asks: "What is the probability that d qualifies for the top-k?"

15/28 Top-k Queries with Probabilistic Guarantees
The TA family of algorithms is based on the invariant (with sum as aggr)
  worstscore(d) = Σ_{i ∈ E(d)} s_i(d)  <=  score(d)  <=  worstscore(d) + Σ_{i ∉ E(d)} high_i = bestscore(d).
This is relaxed into a probabilistic invariant:
  p(d) := P[ worstscore(d) + Σ_{i ∉ E(d)} S_i > min-k ],
where each random variable S_i has some (postulated and/or estimated) distribution in the interval (0, high_i].
[Figure: the three example index lists with S_1, S_2, S_3 denoting the unknown scores below the current scan positions.]
Discard candidates with p(d) <= ε; exit the index scan when the candidate list is empty.

16/28 Probabilistic Threshold Test
Options for evaluating the probabilistic test:
- postulate a uniform or Zipf score distribution in [0, high_i] and compute the convolution using Laplace-Stieltjes transforms (LSTs)
- use Chernoff-Hoeffding tail bounds, or generalized bounds for correlated dimensions (Siegel 1995)
- fit a Poisson distribution (or Poisson mixture) over equidistant values: easy and exact convolution
- approximate the distributions by histograms, precomputed for each dimension, with dynamic convolution at query-execution time, for independent S_i's or correlated S_i's
Engineering-wise, histograms work best!
[Figure: densities f_2(x) and f_3(x) over (0, high_2] and (0, high_3] and their convolution, evaluated against the gap δ(d) for a candidate doc d not yet seen in lists 2 and 3.]
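A hedged sketch of the histogram-based option, which the slide singles out as the best engineering choice: assuming the missing scores S_i are independent, convolve their per-dimension histograms and read off the probability that the candidate can still overtake min-k. The bin width, the uniform toy histograms, and the function name are assumptions made for illustration:

```python
import numpy as np

def qualify_probability(worstscore, min_k, missing_hists, bin_width):
    """Estimate p(d) = P[worstscore + sum of missing scores > min-k] by
    convolving the histograms of the unseen dimensions (independence assumed)."""
    dist = np.array([1.0])                        # point mass at 0
    for h in missing_hists:
        dist = np.convolve(dist, h / h.sum())     # distribution of the running sum
    gap = min_k - worstscore
    if gap <= 0:
        return 1.0
    # bin j of the convolved histogram holds mass for sums near j * bin_width;
    # the coarse discretization is fine for a pruning heuristic
    return float(dist[int(np.ceil(gap / bin_width)):].sum())

# Candidate seen with worstscore 0.3; two unseen lists with high_i = 0.5,
# modeled by uniform histograms with 0.1-wide bins; current min-k is 0.9:
uniform = np.ones(5) / 5
p = qualify_probability(0.3, 0.9, [uniform, uniform], bin_width=0.1)
print(round(p, 2))   # 0.24 -> drop the candidate if this falls below epsilon
```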

17/28 Prob-sorted Algorithm (Smart Variant)
Prob-sorted (RebuildPeriod r, QueueBound b):
  ...
  scan all lists L_i (i = 1..m) in parallel:
    ... same code as TA-sorted ...
    // queue management
    for all priority queues q for which d is relevant do
      insert d into q with priority bestscore(d);
    // periodic clean-up
    if step-number mod r = 0 then            // rebuild; single bounded queue
      if strategy = Smart then
        for all queue elements e in q do
          update bestscore(e) with the current high_i values;
        rebuild the bounded queue with the best b elements;
        if prob[top(q) can qualify for top-k] < ε then exit;
    if all queues are empty then exit;

18/28 Performance Results for .Gov Queries
On the .GOV corpus from the TREC-12 Web track: 1.25 mio. docs (html, pdf, etc.); 50 keyword queries, e.g.: "Lewis Clark expedition", "juvenile delinquency", "legalization Marihuana", "air bag safety reducing injuries death facts".

                     TA-sorted    Prob-sorted (smart)
#sorted accesses     2,263,652    527,980
elapsed time [s]
max queue size
relative recall      1            0.69
rank distance        0            39.5
score error

Speedup by a factor of 10 at high precision/recall (relative to TA-sorted); aggressive queue management even yields a factor of 100 at ...% precision/recall.

19/28 .Gov Expanded Queries
On the .GOV corpus with query expansion based on WordNet synonyms; 50 keyword queries, e.g.: "juvenile delinquency youth minor crime law jurisdiction offense prevention", "legalization marijuana cannabis drug soft leaves plant smoked chewed euphoric abuse substance possession control pot grass dope weed smoke".

                     TA-sorted     Prob-sorted (smart)
#sorted accesses     22,403,490    18,287,636
elapsed time [s]
max queue size
relative recall      1             0.88
rank distance        0             14.5
score error          0             0.035

20/28 Performance Results for IMDB Queries
On the IMDB corpus (Web site: Internet Movie Database): movies and 1.2 mio. persons (html/xml); 20 structured/text queries with Dice-coefficient-based similarities of the categorical attributes Genre and Actor, e.g.:
Genre ~ {Western} and Actor ~ {John Wayne, Katherine Hepburn} and Description ~ {sheriff, marshall}
Genre ~ {Thriller} and Actor ~ {Arnold Schwarzenegger} and Description ~ {robot}

                     TA-sorted    Prob-sorted (smart)
#sorted accesses     1,003,650    403,981
elapsed time [s]
max queue size
relative recall      1            0.75
rank distance
score error          0            0.25

21/28 Handling Ontology-Based Query Expansions
Consider the expandable query "algorithm and ~performance" with score
  Σ_{i ∈ q} max_{j ∈ onto(i)} { sim(i,j) * s_j(d) }.
Dynamic query expansion with incremental on-demand merging of additional index lists: a B+ tree index on terms holds one index list per term (algorithm, performance, response time, throughput, ...), and an ontology / meta-index maps ~performance to related terms with their similarities, e.g., response time: 0.7, throughput: 0.6, queueing: 0.3, delay: ...
[Figure: the index list for "algorithm" and the incrementally merged lists for the expansion terms of "~performance".]
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
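One way to realize the incremental on-demand merging is sketched below under my own assumptions about the data layout (it is not the talk's implementation): the expansion terms of ~t are treated as one virtual index list, and the scan always continues in whichever physical list offers the highest next value of sim(t, t_j) * s_j(d):

```python
import heapq, itertools

def incremental_merge(expansion_lists):
    """Yield (doc, sim-weighted score) pairs in globally descending order.
    Each input is (sim, [(doc, score), ...]) with the list sorted by score desc."""
    counter = itertools.count()                   # tiebreaker so the heap never compares lists
    heap = []
    for sim, entries in expansion_lists:
        if entries:
            heapq.heappush(heap, (-sim * entries[0][1], next(counter), sim, 0, entries))
    while heap:
        neg, _, sim, pos, entries = heapq.heappop(heap)
        yield entries[pos][0], -neg               # next-best entry across all expansion lists
        if pos + 1 < len(entries):
            heapq.heappush(heap, (-sim * entries[pos + 1][1], next(counter), sim, pos + 1, entries))

# ~performance expanded into weighted lists (illustrative numbers):
expansion = [(1.0, [("d52", 0.7), ("d17", 0.6)]),   # performance itself
             (0.7, [("d44", 0.9), ("d52", 0.5)]),   # response time, sim 0.7
             (0.6, [("d61", 0.8), ("d17", 0.4)])]   # throughput, sim 0.6
for doc, s in incremental_merge(expansion):
    print(doc, round(s, 2))
# A top-k scan consumes this stream lazily, so lists for weakly related terms
# are touched only as far as their scaled scores can still matter.
```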

22/28 Outline
Motivation & Preliminaries
Prob-k: Efficient Approximate Top-k
TopX: Top-k for XML

23/28 Top-k Search on XML
A TA-style algorithm should also handle:
- tag-value conditions such as title = "algorithm" and ~topic = "~performance", using standard indexes
- exact path conditions of the form /book//literature, using path indexes (pre/post index, HOPI, etc.)
- arbitrary combinations of tag-term conditions and path conditions → the TopX algorithm
Example query (NEXI, XPath & IR):
//book[about(.// "Information Retrieval" "XML")
  and .//affiliation[about(.// "Stanford")]
  and .//reference[about(.// "Page Rank")] ]//publisher//country

24/28 Problems Addressed by TopX
Problems:
1) content conditions (CC) on both tags and terms
2) scores for elements or subtrees, docs as results
3) score aggregation not necessarily monotonic
4) test path conditions (PC), but avoid random accesses
Solutions:
0) disk space is cheap, disk I/O is not!
1) build index lists for each tag-term pair
2) block-fetch all elements of the same doc in descending order of MaxScore(e) = max{Score(e') | e' ∈ doc(e)}
3) precompute and store scores for entire subtrees
4a) test PCs on candidates in memory
4b) postpone evaluation of the remaining PCs until after the threshold test
Example query (NEXI, XPath & IR):
//book[about(.// "Information Retrieval" "XML") and .//affiliation[about(.// "Stanford")] and .//reference[about(.// "Page Rank")] ]//publisher//country

25/28 Simplified Scoring Model for TopX
Restricted to tree data (disregarding links), using only tag-term conditions for scoring.
The score of a doc d (or of the subtree rooted at d.n) for a query q with tag-term conditions A_1[a_1], ..., A_m[a_m] aggregates the local scores of the matching nodes n_1, ..., n_m.
[Figure: three example documents d1, d2, d3, each a small element tree with tags such as R, A, B, C, X, Z and term content drawn from {a, b, c, x, y, z}.]

26/28 TopX Pseudocode
Based on an index table L (Tag, Term, MaxScore, DocId, Score, ElemId, Pre, Post).
decompose the query into content conditions (CC) and path conditions (PC);
for each index list L_i (extracted from L by tag and/or term) do:
  block-scan the next elements from the same doc d; E(d) := E(d) ∪ {i};
  for each CC l ∈ E(d) - {i} do:
    for each element pair (e, e') ∈ elems(d, L_i) × elems(d, L_l) do:
      test PC(i,l) connecting e and e' using the pre & post labels of e and e';
    delete e ∈ elems(d, L_i) if there is no e' such that (e, e') satisfies PC(i,l);
    delete e' ∈ elems(d, L_l) if there is no e such that (e, e') satisfies PC(i,l);
  bestscore(d) := Σ_{j ∈ E(d)} max{Score(e) | e ∈ elems(L_j)} + Σ_{j ∉ E(d)} high_j;
  worstscore(d) := 0; if E(d) is complete then worstscore(d) := bestscore(d);
  ... // proceed with the standard top-k algorithm
  test the remaining PCs on d and drop d if they are not satisfied;
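The PC(i,l) test relies on the Pre and Post columns of the index table: with pre/post-order labels, the descendant-axis check reduces to two integer comparisons. A small illustrative sketch (the element coordinates below are hypothetical, not taken from the example table on the next slide):

```python
def is_descendant(anc, desc):
    """Pre/post-order containment test: anc is an ancestor of desc iff it
    starts earlier in pre-order and ends later in post-order.
    Elements are (pre, post) pairs."""
    return anc[0] < desc[0] and anc[1] > desc[1]

def satisfying_pairs(outer_elems, inner_elems):
    """Element pairs that satisfy a descendant-axis path condition,
    mirroring the pairwise PC(i,l) test in the pseudocode above."""
    return [(e, e2) for e in outer_elems for e2 in inner_elems
            if is_descendant(e, e2)]

# Hypothetical elements: a book element (pre=1, post=9) and a country
# element nested inside it (pre=7, post=4):
print(is_descendant((1, 9), (7, 4)))   # True
print(is_descendant((7, 4), (1, 9)))   # False
print(satisfying_pairs([(1, 9)], [(7, 4), (12, 11)]))  # only the nested pair remains
```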

27/28 TopX Example Data
Example query: //A[.//"a" & .//B[.//"b"] & .//C[.//"c"]]
[Figure: the three example documents d1, d2, d3 from the scoring-model slide.]
Precomputed index table with an appropriate B+ tree index on (Tag, Term, MaxScore, DocId, Score, ElemId):

Tag  Term  MaxScore  DocId  Score  ElemId  Pre  Post
A    a     1         d3     1      e5      5    2
A    a     1         d3     1/4    e...
A    a     1/2       d1     1/2    e2      2    4
A    a     2/9       d2     2/9    e1      1    7
B    b     1         d1     1      e8      8    5
B    b     1         d1     1/2    e4      4    1
B    b     1         d1     3/7    e6      6    8
B    b     1         d3     1      e9      9    7
B    b     1         d3     1/3    e2      2    4
B    b     2/3       d2     2/3    e4      4    1
B    b     2/3       d2     1/3    e3      3    3
B    b     2/3       d2     1/3    e6      6    6
C    c     1         d2     1      e5      5    2
C    c     1         d2     1/3    e7      7    5
C    c     2/3       d3     2/3    e7      7    5
C    c     2/3       d3     1/5    e...
C    c     3/5       d1     3/5    e9      9    6
C    c     3/5       d1     1/2    e5      5    2

Block-scans proceed in the order: (A, a, d3, ...), (B, b, d1, ...), (C, c, d2, ...), (A, a, d1, ...), (B, b, d3, ...), (C, c, d3, ...), (A, a, d2, ...), (B, b, d2, ...), (C, c, d1, ...).

28/28 Experimental Results: INEX Benchmark
On IEEE-CS journal and conference articles: 12,000 XML docs with 12 mio. elements; 7.9 GB for all indexes.
20 CO queries, e.g.: "XML editors or parsers"
20 CAS queries, e.g.: //article[.//bibl[about(.// "QBIC")] and .//p[about(.// "image retrieval")] ]

CO                   join&sort   TopX (ε=0.0)   TopX (ε=0.1)
#sorted accesses     472,227     70,674         5,534
#random accesses
elapsed time [s]
relative recall

CAS                  join&sort   TopX (ε=0.0)   TopX (ε=0.1)
#sorted accesses     1,077,302   280,249        163,084
#random accesses                 2,022          1,931
elapsed time [s]
relative recall

29/28 INEX with Query Expansion
20 CO queries, e.g.: "XML editors or parsers"
20 CAS queries, e.g.: //article[.//bibl[about(.// "QBIC")] and .//p[about(.// "image retrieval")] ]

CO                   static expansion (threshold = 0.8, ε = 0.1)   incremental merge (ε = 0.1)
#sorted accesses     471,128                                       533,385
#random accesses     4,...
elapsed time [s]
max #terms

CAS                  static expansion (threshold = 0.8, ε = 0.1)   incremental merge (ε = 0.1)
#sorted accesses     814,123                                       1,271,379
#random accesses     12,...
elapsed time [s]
max #terms

30/28 Conclusion: Ongoing and Future Work
Observation: approximations with statistical guarantees are key to obtaining Web-scale efficiency (e.g., TREC '04 Terabyte benchmark: ca. 25 mio. docs, ca. ... terms, 5-50 terms per query).
Challenges:
- generalize TopX to arbitrary graphs
- efficient consideration of correlated dimensions
- integrated support for all kinds of XML similarity search: content & ontological similarity, structural similarity
- scheduling of index-scan steps and few random accesses
- integration of the top-k operator into the physical algebra and query optimizer of an XML engine