Efficient Top-k Queries for XML Information Retrieval
Gerhard Weikum, joint work with Martin Theobald and Ralf Schenkel

Queries Beyond Google

Example queries that keyword search cannot answer well:
- professors from Saarbruecken who teach DB or IR and have projects on XML
- a drama with three women making a prophecy to a British nobleman that he will become king
- the woman from Paris whom I met at the PC meeting chaired by Raghu Ramakrishnan

"Semantic search" must:
- exploit structure and annotations in the data,
- exploit background knowledge (ontologies/thesauri + statistics),
- connect/join/fuse information fragments.

What If The Semantic Web Existed And All Information Were in XML?

Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?

Example data (sketch of one document):
  Professor
    Name: Gerhard Weikum
    Address ... City: SB, Country: Germany
    Teaching:
      Course
        Title: IR
        Description: Information retrieval ...
        Syllabus ...
      Book, Article ...
    Research:
      Project
        Title: Intelligent Search of XML Data
        Sponsor: German Science Foundation ...

XML-IR Example (1)

Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?

  //Professor[//* = "Saarbruecken"]
             [//Course[//* = "IR"]]
             [//Research[//* = "XML"]]

This pattern matches the example document above: a Professor element with City: SB, a Course titled IR, and a Project on intelligent search of XML data.

XML-IR Example (2)

An exact-match query misses relevant data that uses different tags and terms, e.g. a Lecturer (Name: Ralf Schenkel; Interests: Semistructured Data, IR; Address: Max-Planck Institute for CS, Germany) who teaches a Seminar (Title: Statistical Language Models; Contents: Ranked Search ...). So all conditions are relaxed with similarity operators:

  //Professor[//* = "Saarbruecken"][//Course[//* = "IR"]][//Research[//* = "XML"]]
becomes
  //~Professor[//* = "~Saarbruecken"][//~Course[//* = "~IR"]][//~Research[//* = "~XML"]]

Need to combine DB and IR techniques with logics, statistics, AI, ML, NLP for ranked retrieval.

Outline
- Motivation and Strategic Direction
- XML IR & Ontologies
- Efficient Top-k QP
- TopX: Efficient XML IR
- Conclusion

XML-IR: History and Related Work

Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto), ...
IR on structured docs (SGML): ProximalNodes (U Chile), HySpirit (U Dortmund), WHIRL (CMU), OED etc. (U Waterloo), ...
IR on XML: XIRQL (U Dortmund), XXL & TopX (U Saarland / MPI), XRank & Quark (Cornell U), JuruXML (IBM Haifa), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), CIRQUID (CWI), PowerDB-IR (ETH Zurich), Timber (U Michigan), TeXQuery (AT&T), FleXPath (AT&T Labs), ...
XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery (W3C), W3C XPath & XQuery Full-Text
Benchmark: INEX
Commercial software: MarkLogic, Fast, Verity, IBM, ...

XML-IR Concepts
[MPII XXL 00 & TopX 05, U Duisburg XIRQL 00, U Dublin Elixir 00, Cornell XRank & Quark 03, U Michigan 02, U Wisconsin 04, CWI Cirquid 03, AT&T FleXPath 04, W3C XPath Full-Text 05, IBM, Verity, Fast, etc.]

Search condition: conjunction of restricted path expressions, e.g.
  //Professor[//* = "Saarbruecken"][//Course[//* = "Information Retrieval"]][//Research[//* = "XML"]]
- elementary conditions on names and contents
- "semantic" similarity conditions on names and contents, e.g. ~Research[//* = "~XML"]

Relevance scoring:
- term-frequency statistics for content similarity (BM25 variant with tf per tag-term pair and tag-specific idf)
- ontological similarity of concept names
- aggregation of local scores into global scores

Query result:
- exact matching: the query is a path/tree/graph pattern; results are isomorphic paths/subtrees/subgraphs of the data graph
- ranked retrieval: the query is a pattern with relaxable conditions; results are approximate matches with similarity scores
- applicable to XML, HTML, blogs, relational data graphs, etc.

Query Expansion and Execution

Thesaurus/ontology: concepts, relationships, glosses from WordNet, gazetteers, Web forms & tables, Wikipedia; relationships quantified by statistical correlation measures. Example fragment around "professor": teacher, scholar, academic/academician/faculty member, mentor, scientist, researcher/investigator, with weighted edges such as HYPONYM (0.749) and RELATED (0.48).

User query: ~c = ~t1 ... ~tm, e.g. ~professor and (~course = "~IR"), or //professor[//place = "SB"]//course = "IR"

Query expansion:
- map terms to concepts (with word sense disambiguation)
- exp(ti) = {w | sim(ti, w) >= theta}
- weighted expanded query, e.g.: (professor OR lecturer (0.749) OR scholar (0.71) ...) and ((course OR class (1.0) OR seminar (0.84) ...) = ("IR" OR "Web search" (0.653) ...))

Efficient top-k search with dynamic expansion yields better recall and better mean precision for hard queries.
Problem: tuning the threshold theta -> top-k with incremental merge (a sketch of threshold-based expansion follows below).
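
As a concrete illustration, here is a minimal Python sketch of threshold-based expansion with the max-per-condition score aggregation described above; the thesaurus weights and document scores are made-up toy values, not TopX code:

    # Sketch: expand each query term t into exp(t) = {w | sim(t,w) >= theta},
    # then score(d) = sum over t of max over w in exp(t) of sim(t,w) * s_w(d).

    def expand(term, sim, theta):
        """All words w with sim(term, w) >= theta, plus the term itself."""
        exp = {term: 1.0}
        for w, s in sim.get(term, {}).items():
            if s >= theta:
                exp[w] = s
        return exp

    def score(doc_term_scores, query_terms, sim, theta):
        total = 0.0
        for t in query_terms:
            candidates = expand(t, sim, theta)
            total += max((s * doc_term_scores.get(w, 0.0)
                          for w, s in candidates.items()), default=0.0)
        return total

    sim = {"professor": {"lecturer": 0.749, "scholar": 0.71}}   # toy thesaurus
    doc = {"lecturer": 0.8, "ir": 0.5}                          # toy scores
    print(score(doc, ["professor", "ir"], sim, theta=0.7))      # 0.5992 + 0.5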

Towards a Statistically Semantic Web

Information extraction (NLP, reg. exp., lexicons, HMMs, CRFs, etc.) turns text about Sir Isaac Newton, Leibniz, Kneller, the Philosophiae Naturalis, and gravitation into tables such as:

  Person             TimePeriod   ...
  Sir Isaac Newton   4 Jan ...    ...

  Publication              Topic
  Philosophiae Naturalis   gravitation

  Scientist          ...      Author   Publication
  Sir Isaac Newton   ...      Newton   Philosophia...
  Leibniz            ...      ...      ...

This yields a database (tables or XML), but with confidence < 1
=> a Semantic-Web database with uncertainty!
=> ranked retrieval!

Outline
- Motivation and Strategic Direction
- XML IR & Ontologies
- Efficient Top-k QP
- TopX: Efficient XML IR
- Conclusion

Efficient Top-k Search
[Buckley 85, Güntzer et al. 00, Fagin 01]

Setting: data items d1, ..., dn; query q = (t1, t2, t3); one index list per term with precomputed scores, e.g. s(t1, d1) = 0.7, ..., s(tm, d1) = 0.2. (Scale, Google as an example: > 10 mio. terms, > 8 bio. docs, > 4 TB index.) The naive plan fetches all lists, then joins & sorts.

TA with sorted access only (NRA):
  scan index lists L1, ..., Lm in parallel; consider d at position i in Li;
    E(d) := E(d) ∪ {i};
    high_i := s(ti, d);
    worstscore(d) := aggr{s(tν, d) | ν ∈ E(d)};
    bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
    if worstscore(d) > min-k then
      add d to top-k; min-k := min{worstscore(d') | d' ∈ top-k};
    else if bestscore(d) > min-k then cand := cand ∪ {d};
    threshold := max{bestscore(d') | d' ∈ cand};
    if threshold ≤ min-k then exit;

For k = 1, walking the example lists at scan depths 1, 2, 3 narrows the [worstscore, bestscore] intervals per doc until the best doc's worstscore dominates every candidate's bestscore: STOP!

TA: efficient & principled top-k query processing with monotonic score aggregation.
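
A runnable Python sketch of the NRA loop above, with sum as the monotonic aggregation; the bookkeeping is recomputed from scratch each round for clarity, so this illustrates the invariant rather than an optimized implementation:

    def nra(index_lists, k):
        """index_lists: one list of (doc, score) per term, score-descending."""
        m = len(index_lists)
        seen = {}                                   # doc -> {list idx: score}
        high = [lst[0][1] if lst else 0.0 for lst in index_lists]
        for depth in range(max(len(lst) for lst in index_lists)):
            for i, lst in enumerate(index_lists):  # one round-robin round
                if depth < len(lst):
                    doc, s = lst[depth]
                    seen.setdefault(doc, {})[i] = s
                    high[i] = s
                else:
                    high[i] = 0.0                   # list i is exhausted
            worst = {d: sum(sc.values()) for d, sc in seen.items()}
            best = {d: worst[d] + sum(high[i] for i in range(m)
                                      if i not in seen[d]) for d in seen}
            topk = sorted(worst, key=worst.get, reverse=True)[:k]
            if len(topk) == k:
                min_k = worst[topk[-1]]
                threshold = max((best[d] for d in seen if d not in topk),
                                default=0.0)
                unseen_best = sum(high)             # bound for unseen docs
                if max(threshold, unseen_best) <= min_k:
                    return [(d, worst[d]) for d in topk]
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        return sorted(worst.items(), key=lambda kv: -kv[1])[:k]

    L1 = [("d1", 0.7), ("d2", 0.6), ("d3", 0.1)]
    L2 = [("d2", 0.8), ("d1", 0.2), ("d3", 0.1)]
    print(nra([L1, L2], k=1))    # top-1 is d2 with score 0.6 + 0.8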

Probabilistic Pruning of Top-k Candidates [VLDB 04]

The TA family of algorithms is based on the invariant (with sum as score aggregation):
  worstscore(d) <= score(d) <= bestscore(d)
Add d to the top-k result if worstscore(d) > min-k; drop d only if bestscore(d) < min-k, otherwise keep d in the priority queue (PQ). This is often overly conservative (deep scans, high memory for the PQ).

Approximate top-k with probabilistic guarantees: estimate p(d) = P[score(d) > min-k] from the gap between min-k and the [worstscore(d), bestscore(d)] interval at the current scan depth, and discard candidate d from the priority queue if p(d) <= epsilon; then E[relative recall] = 1 - epsilon. The score predictor can use LSTs & Chernoff bounds, Poisson approximations, or histogram convolution.

Probabilistic Threshold Test

For a candidate d with, say, 2 ∉ E(d) and 3 ∉ E(d), the missing score in list i is a random variable over [0, high_i]; d can still make it if the sum of the missing scores exceeds δ(d) = min-k - worstscore(d). Options for the score predictor:
- postulate a uniform or Pareto score distribution in [0, high_i]; compute the convolution using LSTs
- use Chernoff-Hoeffding tail bounds, or generalized bounds for correlated dimensions (Siegel 1995)
- fit a Poisson distribution or Poisson mixture over equidistant values: easy and exact convolution
- approximate the distribution by histograms: precomputed per dimension, with dynamic convolution at query-execution time

Engineering-wise, histograms work best! (See the sketch below.)
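
A minimal sketch of the histogram-based predictor: each missing dimension is modeled by a normalized histogram over [0, high_i] with a common bin width, and convolving the histograms yields the distribution of the missing score mass. The histograms below are made-up toy data:

    # Drop candidate d if P[missing score mass > delta(d)] <= epsilon.

    def convolve(p, q):
        """Distribution of the sum of two independent discretized scores."""
        out = [0.0] * (len(p) + len(q) - 1)
        for i, a in enumerate(p):
            for j, b in enumerate(q):
                out[i + j] += a * b
        return out

    def prob_exceeds(hists, delta, bin_width):
        dist = [1.0]
        for h in hists:
            dist = convolve(dist, h)
        first = round(delta / bin_width) + 1    # bins strictly above delta
        return sum(dist[first:])

    h2 = [0.5, 0.3, 0.2]    # list 2: scores in [0, 0.3], bin width 0.1
    h3 = [0.7, 0.2, 0.1]    # list 3: scores in [0, 0.3]
    print(prob_exceeds([h2, h3], delta=0.3, bin_width=0.1))   # 0.02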

Performance Results for .Gov

Queries on the .GOV corpus from the TREC-12 Web track: 1.25 mio. docs (html, pdf, etc.); 50 keyword queries, e.g.: "Lewis Clark expedition", "juvenile delinquency", "legalization Marihuana", "air bag safety reducing injuries death facts".

                      TA-sorted    Prob-sorted (smart)
  #sorted accesses    2,263,652    527,980
  elapsed time [s]    ...          ...
  max queue size      ...          ...
  relative recall     1            0.69
  rank distance       0            39.5
  score error         ...          ...

Speedup by factor 10 at high precision/recall (relative to TA-sorted); aggressive queue mgt. even yields factor 100 at ...% prec./recall.

.Gov Expanded Queries

Queries on the .GOV corpus with query expansion based on WordNet synonyms; 50 keyword queries, e.g.: "juvenile delinquency youth minor crime law jurisdiction offense prevention", "legalization marijuana cannabis drug soft leaves plant smoked chewed euphoric abuse substance possession control pot grass dope weed smoke".

                      TA-sorted     Prob-sorted (smart)
  #sorted accesses    22,403,490    18,287,636
  elapsed time [s]    ...           ...
  max queue size      ...           ...
  relative recall     1             0.88
  rank distance       0             14.5
  score error         0             0.035

Top-k Queries with Query Expansion [SIGIR 05]

Consider the expandable query ~professor and research = XML, with score
  sum over i in q of max{sim(i,j) * s_j(d) | j in exp(i)}

Dynamic query expansion with incremental on-demand merging of additional index lists: a B+ tree index on tag-term pairs and terms holds the lists for professor, lecturer, scholar, ..., and research: XML, while a thesaurus / meta-index maps professor -> lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5, ...; expansion lists are opened and merged only as needed (a sketch of the merge follows below).

+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift

TREC-13 Robust Track benchmark: speedup by factor 4 at high precision/recall; also handles the TREC-13 Terabyte benchmark.
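
A sketch of the incremental merge: each expansion term's index list is scanned lazily, and entries are emitted in descending order of sim(t, w) * score, so the top-k algorithm sees one combined list for the expandable condition ~t. The lists and similarity weights below are made-up toy data:

    import heapq

    def incremental_merge(lists, sims):
        """lists: {word: [(doc, score), ...] score-desc}; sims: {word: sim}."""
        heap, iters = [], {w: iter(lst) for w, lst in lists.items()}
        for w, it in iters.items():
            doc, s = next(it, (None, None))
            if doc is not None:
                heapq.heappush(heap, (-sims[w] * s, w, doc))
        while heap:
            neg, w, doc = heapq.heappop(heap)
            yield doc, -neg                      # next-best combined entry
            nxt = next(iters[w], None)
            if nxt is not None:
                heapq.heappush(heap, (-sims[w] * nxt[1], w, nxt[0]))

    lists = {"professor": [("d5", 0.9), ("d2", 0.4)],
             "lecturer":  [("d7", 0.8), ("d5", 0.6)]}
    sims = {"professor": 1.0, "lecturer": 0.7}
    for doc, s in incremental_merge(lists, sims):
        print(doc, round(s, 3))    # d5 0.9, d7 0.56, d5 0.42, d2 0.4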

Combined Algorithm (CA) for Balanced SA/RA Scheduling [Fagin 03]

- perform NRA (TA-sorted) ...
- after every r rounds of SA (m*r scan steps), perform RA to look up all missing scores of the "best candidate" in Q
- cost ratio C_RA / C_SA = r
- cost competitiveness w.r.t. the "optimal schedule" (scan until sum_i high_i <= min{bestscore(d) | d in final top-k}, then perform RAs for all d' with bestscore(d') > min-k): 4m + k

Index Access Optimization
[joint work with Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald]

Perform additional RAs when helpful:
1) to increase min-k (increase worstscore of d in top-k), or
2) to prune candidates (decrease bestscore of d in Q).
Last Probing (2-phase schedule): perform RAs for all candidates at the point t when the total cost of the remaining RAs equals the total cost of the SAs up to t, with a score-prediction & cost model for deciding the RA order.

For SA scheduling: plan the next b1, ..., bm index scan steps for a batch of b steps overall s.t. sum_{i=1..m} bi = b and benefit(b1, ..., bm) is maximal; solve a knapsack-style NP-hard problem for batched scans, or use greedy heuristics.

Performance of SA/RA Scheduling Methods
[joint work with Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald]

Absolute run-times for TREC'04 Robust queries on Terabyte .Gov data (C++, Berkeley DB, Opteron, 8 GB):
  full merge:          170 ms per query
  RR-Never (NRA):      155 ms for k=10, ... ms for k=50
  KSR-Last-Ben (NEW):   30 ms for k=10,  55 ms for k=50

Example query: kyrgyzstan united states relation; 15 mio. list entries; NEW scans 2% and performs 300 RAs for 10 ms response time.

Outline
- Motivation and Strategic Direction
- XML IR & Ontologies
- Efficient Top-k QP
- TopX: Efficient XML IR
- Conclusion

TopX Search on XML Data [VLDB 05]

Example query (NEXI, XPath Full-Text):
  //book[about(.//"Information Retrieval" "XML")
    and .//affiliation[about(.//"Stanford")]
    and .//reference[about(.//"Page Rank")]]//publisher//country

Apply & adapt (probabilistic) TA-style top-k query processing. Problems -> solutions:
0) disk space is cheap, disk I/O is not -> precompute and store scores for entire subtrees
1) content conditions (CC) on both tags and terms -> build index lists for each tag-term pair
2) scores for elements or subtrees, docs as results -> block-fetch all elements for the same doc
3) test path conditions (PC), but avoid RAs -> test PCs on candidates in memory via Pre&Post coding (see the sketch below) and carefully schedule RAs
4) PCs may be relaxable -> unsatisfied PCs result in a score penalty
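
The in-memory PC test relies on a standard property of pre/post order labels: u is an ancestor of v iff pre(u) < pre(v) and post(u) > post(v). A minimal sketch with hypothetical labels:

    from typing import NamedTuple

    class Elem(NamedTuple):
        tag: str
        pre: int
        post: int

    def is_ancestor(u: Elem, v: Elem) -> bool:
        """Ancestor-descendant test on pre/post coded XML elements."""
        return u.pre < v.pre and u.post > v.post

    sec = Elem("sec", pre=3, post=9)
    par = Elem("par", pre=5, post=4)
    print(is_ancestor(sec, par))    # True: the PC //sec//par is satisfied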

TopX Algorithm

Based on an index table (with several B+ tree indexes):
  L (Tag, Term, MaxScorePerDoc, DocId, Score, ElemId, Pre, Post)

- decompose the query into content conditions (CC) & path conditions (PC); // conditions may be optional or mandatory
- for each index list Li (extracted from L by tag and/or term) do:
    block-scan the next elements from the same doc d;
    test the evaluable PCs of all elements of d;
    drop elements and docs that do not satisfy mandatory PCs or CCs;
    update the score bookkeeping for d;
    consider random accesses for d via the cost-based scheduler;
    drop d if the (probabilistic) score threshold is not reached;
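
A small SQLite stand-in for the index table and a sorted block-scan; the schema follows the slide, while the table rows and the query condition are toy data:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE L (
        Tag TEXT, Term TEXT, MaxScorePerDoc REAL,
        DocId TEXT, Score REAL, ElemId INTEGER, Pre INTEGER, Post INTEGER)""")
    # B+-tree-style index enabling sorted access per (tag, term) list,
    # descending by the per-document score bound:
    conn.execute("""CREATE INDEX L_idx
        ON L (Tag, Term, MaxScorePerDoc DESC, DocId)""")
    conn.executemany("INSERT INTO L VALUES (?,?,?,?,?,?,?,?)", [
        ("sec", "clustering", 0.9, "d2", 0.9, 46, 2, 9),
        ("sec", "clustering", 0.9, "d2", 0.5, 9,  5, 4),
        ("sec", "clustering", 0.8, "d1", 0.8, 3,  1, 12),
    ])
    # block-scan: all elements of the next-best document arrive contiguously
    rows = conn.execute("""SELECT DocId, ElemId, Score, Pre, Post FROM L
        WHERE Tag = 'sec' AND Term = 'clustering'
        ORDER BY MaxScorePerDoc DESC, DocId""").fetchall()
    print(rows)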

TopX Query Processing By Example

Example query: //sec[clustering]//title[xml]//par[evaluation], evaluated over three index lists (eid, docid, score, pre, post) for //sec[clustering], //title[xml], and //par[evaluation], with k = 2.

Scanning the lists block by block maintains [worstscore, bestscore] intervals for the candidates d1, d2, d3, d5, d17, ... in the candidate queue, plus a "pseudo candidate" standing for all unseen docs (worst=0.0, best=2.9 initially, its bestscore shrinking to 2.8, 2.75, 2.65, 2.45, 1.7, 1.4, 1.35 as the lists are consumed). min-k rises from 0.0 to 0.5, 0.9, and 1.6 as candidates complete; candidates whose bestscore falls below min-k are dropped, until only the top-k results (k=2) remain.

Experimental Results: INEX Benchmark

On IEEE-CS journal and conference articles: 12,000 XML docs with 12 mio. elements, 7.9 GB for all indexes. 20 CO queries, e.g.: "XML editors or parsers"; 20 CAS queries, e.g.: //article[.//bibl[about(.//"QBIC")] and .//p[about(.//"image retrieval")]]

              Join&Sort   StructIndex   TopX (eps=0.0)   TopX (eps=0.1)
  #sorted     ...         ...           635,...          ...,986
  #random     ...         ...           64,807           59,414
  CPU sec     ...         ...           ...              ...
  rel. prec.  ...         ...           ...              ...

TopX outperforms Join&Sort by factor > 10 and beats StructIndex by factor > 20 on INEX, factor 2-3 on IMDB.

Challenge: XML IR on Graphs

Q: professor from SB teaching IR and research on XML, posed against a graph with scores as node weights and weighted edges (XML with XLinks, Web-to-XML graph, etc.): e.g. a prof node (Gerhard Weikum, addr SB) with course (IR), teaching, project (XML), publ, and bibl(IR) neighbors, linked to lecturer nodes (Holger Bast; Ralf Schenkel, affil MPII) with seminar (IR), research (IR algorithms, LSI), and project (XML, INEX) neighbors.

Combine content-score aggregation with result-subgraph compactness
=> compute top-k Steiner trees (or top-k MSTs as an approximation).
Initial work: BANKS (IIT Bombay), SphereSearch (MPII).

Outline
- Motivation and Strategic Direction
- XML IR & Ontologies
- Efficient Top-k QP
- TopX: Efficient XML IR
- Conclusion

Conclusion: Ongoing and Future Work

Observations:
- XML IR matters for enterprise search, DLs, data integration, Web + information extraction
- approximations with statistical guarantees are key to Web-scale efficiency (TREC'04 TB: 25 mio. docs, ... terms, 5-50 terms per query; Wikipedia for INEX'06: ... docs, 130 mio. elements)

Challenges:
- scheduling of index-scan steps and random accesses, and efficient handling of correlated dimensions
- integrate info-extraction confidence values into XML similarity search (content & ontology & structure)
- generalize TopX to arbitrary graphs
- integrate the top-k operator into the physical algebra and query optimizer of an XML engine
- re-invent SQL and XQuery with probabilistic ranking

Backup Slides

Ontologies & Thesauri: Example WordNet

IR & NLP approach, e.g. the WordNet thesaurus (Princeton), a large collection of concepts with lexical & linguistic relations. Example entry:

woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)

Query Expansion Example

From the TREC 2004 Robust Track benchmark:
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.

Expanded query:
{international[0.145|1.00], ~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}], organ[0.213|1.00], crime[0.312|1.00], collabor[0.415|0.20], columbian[0.686|0.20], cartel[0.466|0.20], ...}

Results (... sorted accesses in ... s):
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME ... A parliamentary commission accused Swiss prosecutors today of doing little to stop drug and money-laundering international networks from pumping billions of dollars through Swiss companies. ...

Sample snippet: "Let us take, for example, the case of Medellin cartel's boss Pablo Escobar. Will the fact that he was eliminated change anything at all? No, it may perhaps have a psychological effect on other drug dealers, but ... organizing the illicit export of metals and import of arms. It is extremely difficult for the law-enforcement organs to investigate and stamp out corruption among leading officials. ..."

Performance Results for IMDB

Queries on the IMDB corpus (Web site: Internet Movie Database): ... movies, 1.2 mio. persons (html/xml); 20 structured/text queries with Dice-coefficient-based similarities of the categorical attributes Genre and Actor, e.g.:
  Genre in {Western} and Actor in {John Wayne, Katherine Hepburn} and Description in {sheriff, marshall}
  Genre in {Thriller} and Actor in {Arnold Schwarzenegger} and Description in {robot}

                      TA-sorted    Prob-sorted (smart)
  #sorted accesses    1,003,650    403,981
  elapsed time [s]    ...          ...
  max queue size      ...          ...
  relative recall     1            0.75
  rank distance       ...          ...
  score error         0            0.25

Comparison of Probabilistic Predictors

Experiments with TREC-13 Robust Track

On the Aquaint corpus (news articles): ... docs, 2 GB raw data, 8 GB for all indexes; Okapi BM25 scoring model; the 50 most difficult queries, e.g.: "transportation tunnel disasters", "Hubble telescope achievements", potentially expanded into: "earthquake, flood, wind, seismology, accident, car, auto, train, ..." and "astronomical, electromagnetic radiation, cosmic source, nebulae, ...".

                    no exp.      static exp.           static exp.           incr. merge
                    (eps=0.1)    (theta=0.3, eps=0.0)  (theta=0.3, eps=0.1)  (eps=0.1)
  #sorted acc.      1,333,756    10,586,175            3,622,686             5,671,493
  #random acc.      0            555,176               49,783                34,895
  elapsed time [s]  ...          ...                   ...                   ...
  max #terms        ...          ...                   ...                   ...
  relative recall   ...          ...                   ...                   ...

Speedup by factor 4 at high precision/recall; no topic drift, no need for threshold tuning; also handles the TREC-13 Terabyte benchmark.

Gerhard Weikum May 19, /28 phrase matching requires lookups in term-position index  expensive random accesses or extra handicap for effective candidate pruning  embed phrase matching in operator tree as nested top-k Nested Top-k Operators for Phrase MatchingTop-k undersea ~“fiber optic cable“ Term- Position- Index Term- Position- Index NestedTop-k Incr.Merge NestedTop-k fiber optics fiber opticcable sim(„fiber optic cable“, „fiber optics“) = 0.8 undersea … d d d d d1 0.8 ~“fiber optic cable“ … d d1 0.7 d d d … d d7 0.4 d d d … d d2 0.3 d d d … d d1 0.7 d d d … d d5 0.4 d d d sim(„fiber optic cable“, „fiber optic cable“) = 1.0

Combined Algorithm (CA) for Balanced SA/RA Scheduling [Fagin 03]

- perform NRA (TA-sorted) with [worstscore, bestscore] bookkeeping in priority queue Q and round-robin SA to the m index lists ...
- after every r rounds of SA (i.e. m*r scan steps), perform RA to look up all missing scores of the "best candidate" in Q (where "best" is in terms of bestscore, worstscore, E[score], or P[score > min-k])
- assume cost ratio C_RA / C_SA = r
- cost competitiveness w.r.t. the "optimal schedule" (scan until sum_i high_i <= min{bestscore(d) | d in final top-k}, then perform RAs for all d' with bestscore(d') > min-k): 4m + k

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X:  = 1.46  = 1.7 batch of b =  i=1..m b i steps: choose b i values so as to achieve high score reduction  + carefully chosen RAs: score lookups for „interesting“ candidates

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X:  = 1.46  = 1.7 choose b i values so as to achieve high score reduction 

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X: compute top-1 result using flexible SAs and RAs

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X: A: [0.8, 2.4] G: [0.7, 2.4] Y: [0.9, 2.4] ?: [0.0, 2.4] candidates:

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X: A: [1.5, 2.0] G: [0.7, 1.6] Y: [0.9, 1.6] ?: [0.0, 1.4] candidates:

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X: A: [1.5, 2.0] G: [0.7, 1.2] Y: [1.4, 1.6] candidates:

Gerhard Weikum May 19, /28 pos i +b i score reduction  i... ii Scheduling Example LiLi L1L1 L2L2 L3L3 score (pos i ) score (b i + pos i ) available info: = high i = new high i + histograms + convolutions A: 0.8 B: 0.2 K: 0.19 F: 0.17 M: 0.16 Z: 0.15 W: 0.1 Q: 0.07 G: 0.7 H: 0.5 R: 0.5 Y: 0.5 W: 0.3 D: 0.25 W: 0.2 A: 0.2 Y: 0.9 A: 0.7 P: 0.3 F: 0.25 S: 0.25 T: 0.2 Q: 0.15 X: A: [1.7, 2.0]Y: [1.4, 1.6] candidates: execution costs: 9 SA + 1 RA

SA Scheduling: Objective and Optimization

Plan the next b1, ..., bm index scan steps for a batch of b steps overall s.t. sum_{i=1..m} bi = b and benefit(b1, ..., bm) is maximal. Possible benefit definitions: score reduction (drop of the high_i bounds) or score gradient. Solve the knapsack-style NP-hard optimization problem (e.g. for batched scans), or use the greedy heuristics
  bi := b * benefit(bi = b) / sum_{nu=1..m} benefit(b_nu = b)
(see the sketch below).
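
A sketch of the greedy allocation: each list gets a share of the b scan steps proportional to the benefit (here: score reduction of high_i) it would achieve if it received the whole batch alone. The index lists are toy data:

    def score_reduction(scores, pos, steps):
        """Drop of the high_i bound after `steps` further sorted accesses."""
        new_pos = min(pos + steps, len(scores) - 1)
        return scores[pos] - scores[new_pos]

    def allocate_batch(lists, positions, b):
        gains = [score_reduction(lst, pos, b)
                 for lst, pos in zip(lists, positions)]
        total = sum(gains) or 1.0
        # rounding may leave a small remainder; a real scheduler
        # would redistribute the leftover steps
        return [round(b * g / total) for g in gains]

    L1 = [0.8, 0.2, 0.19, 0.17, 0.16, 0.15, 0.1, 0.07]
    L2 = [0.7, 0.5, 0.5, 0.5, 0.3, 0.25, 0.2, 0.2]
    L3 = [0.9, 0.7, 0.3, 0.25, 0.25, 0.2, 0.15, 0.1]
    print(allocate_batch([L1, L2, L3], [0, 0, 0], b=6))   # e.g. [2, 2, 2]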

SA Scheduling: Benefit Aggregation

Consider the current top-k T and the candidate queue Q. For each d in Q we know E(d) ⊆ {1..m}, R(d) = {1..m} - E(d), bestscore(d), worstscore(d), and p(d) = P[score(d) > min-k], with
  surplus(d) = bestscore(d) - min-k
  gap(d) = min-k - worstscore(d)
The expected score per list segment, E[score(j) | j in [pos_i, pos_i + b_i]], weighs documents and dimensions in the benefit function.

Random-Access Scheduling

Perform additional RAs when helpful:
1) to increase min-k (increase worstscore of d in top-k), or
2) to prune candidates (decrease bestscore of d in Q).

For 1), Top Probing: perform RAs for the current top-k (whenever min-k changes), and possibly for the best d from Q (in descending order of bestscore, worstscore, or P[score(d) > min-k]).

For 2), Last Probing (2-phase schedule): perform RAs for all candidates at the point t when the total cost of the remaining RAs equals the total cost of the SAs up to t (this would be optimal for linearly increasing SA-cost(t) and hyperbolically decreasing remaining-RA-cost(t)), with a score-prediction & cost model for deciding the RA order. (A sketch of the phase-switch test follows.)
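
A minimal sketch of the Last-Probing phase switch under the cost model above (one RA costs r times one SA): keep scanning while resolving all remaining candidates by RA would cost more than the SAs spent so far, and switch to the RA phase at the crossover point. All numbers are illustrative:

    def should_switch_to_ra(sa_steps_so_far, missing_per_candidate, r):
        """missing_per_candidate: number of unresolved dimensions per
        queued candidate; one RA resolves one missing score."""
        remaining_ra_cost = r * sum(missing_per_candidate)
        return remaining_ra_cost <= sa_steps_so_far

    # e.g. 3 candidates missing 2, 1, and 1 scores; C_RA / C_SA = 100
    print(should_switch_to_ra(5000, [2, 1, 1], r=100))   # True: 400 <= 5000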

Performance of SA/RA Scheduling Methods

Absolute run-times for TREC'04 Robust queries on Terabyte .Gov data (C++, Berkeley DB, Opteron, 8 GB):
  full merge:          170 ms per query
  RR-Never (NRA):      155 ms for k=10, ... ms for k=50
  RR-Last-Best (NEW):   30 ms for k=10,  55 ms for k=50

Example query: kyrgyzstan united states relation; 15 mio. list entries; NEW scans 2% and performs 300 RAs for 10 ms response time.

Query Scores for Content Conditions

Content-based scores are cast into an Okapi-BM25 probabilistic model with element-specific model parameterization; the basic scoring idea stays within the IR-style family of TF*IDF ranking functions. In the standard BM25 form, the score of a tag-term condition ci for an element e with tag T is

  score(ci, e) = (k1 + 1) * tf(ci, e) / (K + tf(ci, e)) * log((N_T - ef_T(ci) + 0.5) / (ef_T(ci) + 0.5))
  with K = k1 * ((1 - b) + b * length(e) / avglength)

where
  tf(ci, e) = number of occurrences of ci in element e
  N_T       = number of elements with tag T
  ef_T(ci)  = number of elements with tag T that contain ci

Element statistics (per tag):
  tag      N           avglength   k1    b
  article  16,850      2,9...      ...   ...
  sec      96,...      ...         ...   ...
  par      1,024,...   ...         ...   ...
  fig      109,...     ...         ...   ...
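
A sketch of the element-level BM25 scoring above; the per-tag statistics (N_T, avglength) and the k1/b defaults below are illustrative stand-ins (common BM25 choices), since the slide's table values are partly lost:

    import math

    def bm25_element_score(tf, elem_length, n_tag, ef,
                           k1=1.2, b=0.75, avglength=85.0):
        """Score of one tag-term condition for one element."""
        K = k1 * ((1 - b) + b * elem_length / avglength)
        idf = math.log((n_tag - ef + 0.5) / (ef + 0.5))
        return (k1 + 1) * tf / (K + tf) * idf

    # e.g. a 'par' element of length 120 with tf = 3 for "xml":
    print(bm25_element_score(tf=3, elem_length=120,
                             n_tag=1_024_000, ef=12_000))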

Scoring Model for TopX

Restricted to tree data (disregarding links), using tag-term conditions for scoring. The score of doc d (or the subtree rooted at d.n) for a query q with tag-term conditions X1[t1], ..., Xm[tm], matched by nodes n1, ..., nm, aggregates:
- relevance: tf-based or BM25
- specificity: idf per tag type
- compactness: subtree size

Example query: //A[.//"a" & .//B[.//"b"] & .//C[.//"c"]], evaluated over example docs d1, d2, d3 with pre-order-numbered nodes (e.g. d1: 1:R, 2:A, 3:X, 4:B, 5:C, 6:B, 7:X, 8:B, 9:C with terms a, b, c, x, y at the leaves).

Gerhard Weikum May 19, /28 worst(d)= 1.5 best(d) = 5.5 worst(d)= 2.5 best(d) = 5.5 worst(d)= 2.5 best(d) = 4.5 article item bib title= security title= security sec par= xml par= xml par= java par= java bib item Incremental Testing of Path Conditions Complex query DAGs –Transitive closure of PC’s –Aggregate add’l score mass c for PC i if all edges rooted at i are satisfied Incrementally test PC’s –Quickly decrease bestscores for pruning –Schedule RAs based on cost model; Ben-Probe: EWC-RA(d) < EWC-SA(d) based on histogram convolutions and selectivity estimation //article[//sec//par=“xml java”]//bib//item//title=“security” article sec par= “xml” par= “java” bib title= “security” item child-or-descendant article sec par= “xml” par= “xml” par= “java” par= “java” bib title= “security” title= “security” item Query: Promising candidate [0.0, high i ] c = [1.0] [1.0] RA

Gerhard Weikum May 19, /28 BenProbe-Scheduling –Analytic cost model –Basic idea Compare expected access costs to the costs of an optimal schedule Access costs on d are wasted, if d does not make it into the final top-k (considering both content & structure) –Compare different Expected Wasted Costs (EWC) EWC-RA s (d) of looking up d in the structure EWC-RA c (d) of looking up d in the content EWC-SA(d) of not seeing d in the next batch of b sorted accesses –Schedule batch of RAs on d, if EWC-RA(d) [RA] < EWC-SA [SA] Scheduling: Cost Model EWC-SA =

Gerhard Weikum May 19, /28 Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag- term pairs Consider structural selectivities P[d satisfies all structural conditions Y] = P[d satisfies a subset Y’ of structural conditions Y] = Consider binary correlations between structural patterns and/or tag-term pairs P[d satisfies all structural conditions Y] = With cov( X i, X j ) estimated from contingency tables, query logs, etc. Structural Selectivity Estimator //sec[//figure=“java”] [//par=“xml”] [//bib=“vldb”] //sec[//figure]//par //sec[//figure]//bib //sec[//par]//bib //sec//figure //sec//par //sec//bib //sec //par=“xml” //figure=“java” //bib=“vldb” p 1 = p 2 = p 3 = p 4 = p 5 = p 6 = p 7 = p 8 = p 9 = p 10 = figure= “java” figure= “java” sec par= “xml” par= “xml” bib= “vldb” bib= “vldb” bib= “vldb” bib= “vldb” sec EWC- RA s (d)

TopX Example Data

Example query: //A[.//"a" & .//B[.//"b"] & .//C[.//"c"]], decomposed into the tag-term conditions //A[a], //B[b], //C[c], over the example docs d1, d2, d3 from the scoring-model slide.

Precomputed index table with an appropriate B+ tree index on (Tag, Term, MaxScore, DocId, Score, ElemId):

  Tag  Term  MaxScore  DocId  Score  ElemId  Pre  Post
  A    a     1         d3     1      e5      5    2
  A    a     1         d3     1/4    e...    ...  ...
  A    a     1/2       d1     1/2    e2      2    4
  A    a     2/9       d2     2/9    e1      1    7
  B    b     1         d1     1      e8      8    5
  B    b     1         d1     1/2    e4      4    1
  B    b     1         d1     3/7    e6      6    8
  B    b     1         d3     1      e9      9    7
  B    b     1         d3     1/3    e2      2    4
  B    b     2/3       d2     2/3    e4      4    1
  B    b     2/3       d2     1/3    e3      3    3
  B    b     2/3       d2     1/3    e6      6    6
  C    c     1         d2     1      e5      5    2
  C    c     1         d2     1/3    e7      7    5
  C    c     2/3       d3     2/3    e7      7    5
  C    c     2/3       d3     1/5    e...    ...  ...
  C    c     3/5       d1     3/5    e9      9    6
  C    c     3/5       d1     1/2    e5      5    2

Block-scans proceed round-robin across the three lists, one document block at a time: (A, a, d3, ...), (B, b, d1, ...), (C, c, d2, ...), (A, a, d1, ...), (B, b, d3, ...), (C, c, d3, ...), (A, a, d2, ...), (B, b, d2, ...), (C, c, d1, ...).

Gerhard Weikum May 19, /28 Experiments – Datasets & Competitors INEX ‘04 benchmark setting –12,223 docs; 12M elemt’s; 119M content features; 534MB –46 queries, e.g., //article[.//bib=“QBIC” and.//p=“image retrieval”] IMDB (Internet Movie Database) –386,529 docs; 34M elemt’s; 130M content features; 1,117 MB –20 queries, e.g., //movie[.//casting[.//actor=“John Wayne”] //role=“Sheriff”]//[.//year=“1959” and.//genre=“Western”] Competitors –DBMS-style Join&Sort Using the TopX schema –StructIndex Top-k with separate inverted indexes for content & structure DataGuide-like structural index Full evaluations  no uncertainty about final document scores No candidate queue, eager random access –StructIndex+ Extent chaining technique for DataGuide-based extent identifiers (skip scans)

Gerhard Weikum May 19, /28 INEX Results , , TopX – BenProbe ,25, ,970n/a10StructIndex ,122,318n/a10Join&Sort ,074,384 77,482n/a10StructIndex , , TopX – MinProbe ,902, , ,000TopX – BenProbe relPrec # SA CPU epsilon # RA k relPrec

Gerhard Weikum May 19, /28 IMDB Results n/a , ,697n/a10StructIndex , n/a10Join&Sort ,647 22,445n/a10StructIndex , , TopX – MinProbe , , TopX – BenProbe # SA CPU epsilon # RA k relPrec

Gerhard Weikum May 19, /28 INEX with Probabilistic Pruning , , , , TopX - MinProbe , , , , ,327 36, # SA CPU epsilon # RA k relPrec