Martin Theobald Max-Planck-Institut Informatik Stanford University

Presentation transcript:

TopX: Efficient & Versatile Top-k Query Processing for Text, Semistructured & Structured Data
Martin Theobald, Max-Planck-Institut Informatik / Stanford University

Three challenges: RANKING, VAGUENESS, PRUNING
Example query: //article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]
[Figure: two example article trees built from article, sec, par, title, bib, item, inproc, and url elements. The first contains the title “Current Approaches to XML Data Management”, a paragraph starting “Native XML data base systems can store schemaless data ...”, and a bibliography item “Proc. Query Languages Workshop, W3C, 1998.” The second contains the title “The XML Files”, a paragraph starting “What does XML add for retrieval? It adds formal ways …”, and the url “w3c.org/xml”. The trees are annotated with RANKING, VAGUENESS, and PRUNING.]

[Figure: TopX architecture diagram, with the following components.]
Frontends: Web Interface, Web Service API.
TopX Query Processor (query processing time): Probabilistic Index Access Scheduling, Probabilistic Candidate Pruning, Dynamic Query Expansion, Incremental XPath Engine, Auxiliary Predicates.
Query processor data structures: Candidate Queue, Candidate Cache, Top-k Queue, SA Scan Threads.
Index access: Sequential Access (SA) and Random Access (RA) to the DBMS / Inverted Lists under a Unified Text & XML Schema.
Thesaurus: WordNet, OpenCyc, etc.
Index Metadata: Selectivities, Histograms, Correlations.
Indexer / Crawler (indexing time).

Data Model
XML trees (no XLinks or ID/IDref attributes).
Pre-/postorder node labels.
Redundant full-content text nodes.
Example document:
  <article>
    <title>XML Data Management</title>
    <abs>XML management systems vary widely in their expressive power.</abs>
    <sec>
      <title>Native XML Data Bases.</title>
      <par>Native XML data base systems can store schemaless data.</par>
    </sec>
  </article>
Each element stores the stemmed full content of its subtree:
  article: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
  title: “xml data manage”
  abs: “xml manage system vary wide expressive power”
  sec: “native xml data base native xml data base system store schemaless data”
  title (in sec): “native xml data base”
  par: “native xml data base system store schemaless data”
Hence ftf(“xml”, article1) = 4.
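The pre-/postorder labels on this slide can be illustrated with a small sketch. It assumes the usual property of such labels, namely that a node is a descendant of another exactly when its preorder rank is larger and its postorder rank is smaller. The `Node` class and the helper names are hypothetical and are not taken from TopX.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    pre: int = 0   # rank in a preorder traversal (1-based)
    post: int = 0  # rank in a postorder traversal (1-based)


def label(root: Node) -> None:
    """Assign pre- and postorder ranks in a single depth-first traversal."""
    pre_counter = 0
    post_counter = 0

    def visit(node: Node) -> None:
        nonlocal pre_counter, post_counter
        pre_counter += 1
        node.pre = pre_counter
        for child in node.children:
            visit(child)
        post_counter += 1
        node.post = post_counter

    visit(root)


def is_descendant(d: Node, a: Node) -> bool:
    """True if d lies strictly below a in the tree."""
    return a.pre < d.pre and d.post < a.post


# The six-element example document from the slide.
title = Node("title")
abstract = Node("abs")
sec_title = Node("title")
par = Node("par")
sec = Node("sec", [sec_title, par])
article = Node("article", [title, abstract, sec])
label(article)
```

With these labels, an ancestor/descendant test needs only two integer comparisons per candidate pair, with no tree traversal at query time.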

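Scoring Model [INEX ’06/’07]
XML-specific extension to Okapi BM25 (originating from probabilistic text IR).
ftf (full-content term frequency) instead of tf.
ef (element frequency) instead of df.
Element type-specific length normalization.
Tunable parameters k1 and b.
Example: bib[“transactions”] vs. par[“transactions”], where the same term is scored with the statistics of its own element type.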
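A minimal sketch of what such a BM25-style element score could look like. It takes the standard Okapi BM25 form and applies the substitutions named on the slide (ftf for tf, ef for df, per-tag length normalization). The exact TopX formula may differ. The function name, argument names, and the default values of k1 and b are assumptions for illustration.

```python
import math


def element_score(ftf: float, ef: int, n_elements: int,
                  length: float, avg_length: float,
                  k1: float = 1.2, b: float = 0.75) -> float:
    """BM25-style score of one term in one element.

    ftf        -- full-content term frequency of the term in the element
    ef         -- number of elements with this tag that contain the term
    n_elements -- number of elements with this tag in the collection
    length     -- length of this element's full content
    avg_length -- average full-content length over elements with this tag
    """
    # Length normalization uses statistics of the element's own tag.
    norm = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (norm + ftf)
    idf_part = math.log((n_elements - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


# Hypothetical statistics: the same term is rarer among bib elements than
# among par elements, so each tag gets its own ef, n_elements and avg_length.
score_in_bib = element_score(ftf=2, ef=50, n_elements=10_000,
                             length=80, avg_length=120)
score_in_par = element_score(ftf=2, ef=4_000, n_elements=100_000,
                             length=80, avg_length=60)
```

Keeping the statistics per tag is what lets a term such as “transactions” carry a different weight in a bibliography entry than in a paragraph.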

TopX Query Processing [VLDB ’05]
Example query: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
The query is evaluated over three index lists, one per tag-term condition: sec[“xml”], title[“native”], and par[“retrieval”].
Each index list entry has the columns eid, docid, score, pre, post.
[Figure: animation of sequential scans over the three index lists with a top-2 queue and a candidate queue. Each candidate carries a worst score (e.g. worst=0.9 for element 46, rising to worst=2.2) and the queue is bounded by max-q (from 3.0 down to 1.6), while the top-2 threshold min-2 rises from 0.0 to 1.6.]
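The bookkeeping shown on this slide (worst scores, best-score bounds, and the min-k threshold) can be sketched with a plain NRA-style threshold algorithm. This is a deliberately simplified, document-level version: it ignores the XML structure tests and the pre/post columns, so it is not the TopX algorithm itself. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (id, score); each list is sorted by descending score


def nra_top_k(lists: List[List[Posting]], k: int) -> List[Tuple[str, float]]:
    """Return k ids with their worst scores, using sequential access only."""
    seen: Dict[str, Dict[int, float]] = {}  # id -> {list index: score}
    top: List[Tuple[str, float]] = []
    depth = 0
    max_len = max((len(lst) for lst in lists), default=0)
    while depth < max_len:
        # One round of sequential accesses, one entry per list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
        depth += 1
        # high[i] bounds the score of any entry not yet read from list i.
        high = [lst[depth][1] if depth < len(lst) else 0.0 for lst in lists]
        worst = {d: sum(s.values()) for d, s in seen.items()}
        top = sorted(worst.items(), key=lambda kv: -kv[1])[:k]
        if len(top) < k:
            continue
        min_k = top[-1][1]
        in_top = {d for d, _ in top}

        def best(d: str) -> float:
            # Worst score plus the high scores of the lists where d is unseen.
            return worst[d] + sum(h for i, h in enumerate(high)
                                  if i not in seen[d])

        # Stop once neither a queued candidate nor an unseen id can still
        # exceed the k-th worst score.
        if sum(high) <= min_k and all(best(d) <= min_k
                                      for d in seen if d not in in_top):
            break
    return top
```

The returned scores are lower bounds (worst scores). The scan can stop before every list is exhausted, which is where the savings over a full merge come from.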

Index Access Scheduling [VLDB ’06]
Inverted Block Index.
SA scheduling: look-ahead Δi through precomputed score histograms; knapsack-based optimization of score reduction.
RA scheduling: 2-phase probing that schedules RAs “late & last”; extended probabilistic cost model for integrating SA & RA scheduling.
[Figure: three index lists read by sequential access (SA), with scores such as 1.0, 0.9, 0.8, 0.2, 0.7, 0.6, …, the example look-aheads Δ1,3 = 0.8 and Δ3,3 = 0.2, and a random access (RA).]

Probabilistic Pruning [VLDB ’04]
Convolutions of score distributions (assuming independence).
Indexing time: score distributions f1, f2 with upper bounds high1, high2 are obtained by sampling the index lists, here title[“native”] and par[“retrieval”].
Query processing time: P[d gets in the final top-k] is estimated from the convolved distributions of the lists not yet seen for d and the score gap δ(d).
Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε.
With probabilistic guarantees for precision & recall.
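A rough sketch of the pruning test follows. It assumes that each remaining index list is summarized by a discrete histogram over integer score bins, that the lists are independent (as the slide states), and that δ(d) is the number of bins the candidate still has to gain to beat the current k-th score. The bin representation and all function names are assumptions for illustration.

```python
from typing import List


def convolve(p: List[float], q: List[float]) -> List[float]:
    """Distribution of the sum of two independent discrete scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def prob_exceeds(histograms: List[List[float]], delta_bins: int) -> float:
    """P[sum of the remaining scores > delta_bins].

    histograms[i][s] is the probability that list i still contributes a
    score of exactly s bins to the candidate.
    """
    dist = [1.0]  # distribution of an empty sum: all mass at 0
    for h in histograms:
        dist = convolve(dist, h)
    return sum(p for s, p in enumerate(dist) if s > delta_bins)


def should_prune(histograms: List[List[float]], delta_bins: int,
                 epsilon: float) -> bool:
    """Drop the candidate if it is unlikely to reach the final top-k."""
    return prob_exceeds(histograms, delta_bins) < epsilon
```

For example, with two remaining lists that each contribute 0 or 1 bin with equal probability, the candidate gains more than one bin with probability 0.25, so it is pruned for any ε above 0.25.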

Dynamic Query Expansion [SIGIR ’05]
Example: TREC Robust Topic #363, Top-k (transport, tunnel, ~disaster), where ~disaster expands to disaster, accident, fire.
Incrementally merge the inverted lists for the expansion terms ti,1 ... ti,m in descending order of s(tij, d).
Best-match score aggregation.
Specialized expansion operators: Incremental Merge operator; Nested Top-k operator (efficient phrase matching); Boolean (but ranked) retrieval mode.
Supports any sorted inverted index for text, structured records & XML.
[Figure: sequential accesses (SA) on the index lists for transport, tunnel, and ~disaster; the ~disaster list is produced by an Incr. Merge over the lists for disaster, accident, and fire.]

Incremental Merge Operator
Expansion terms ~t = { t1, t2, t3 }, obtained from thesaurus lookups or relevance feedback.
Expansion similarities from large corpus term correlations: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5.
Index list metadata (e.g., histograms) provides the initial high-scores.
[Figure: the index lists of t1, t2, and t3 are read by sequential access (SA) and merged into one list for ~t, with each score weighted by its expansion similarity: d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, d88 0.3, ...]
Meta histograms seamlessly integrate Incremental Merge into probabilistic scheduling and candidate pruning.
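The merge step can be sketched as a lazy k-way merge over the expansion lists, with each score multiplied by the similarity of its expansion term. The posting lists below are a best-effort reading of the slide's garbled figure and should be treated as illustrative. The function names are hypothetical.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

Posting = Tuple[str, float]  # (doc id, score), sorted by descending score


def incremental_merge(lists: Dict[str, List[Posting]],
                      sim: Dict[str, float]) -> Iterator[Tuple[str, float, str]]:
    """Yield (doc, sim * score, term) in descending order of weighted score.

    Only the current head of each list sits in the heap, so the lists are
    consumed lazily, one sequential access at a time.
    """
    iters = {term: iter(postings) for term, postings in lists.items()}
    heap: List[Tuple[float, int, str, str]] = []
    seq = 0  # tie-breaker: equal scores come out in insertion order

    def push_next(term: str) -> None:
        nonlocal seq
        nxt = next(iters[term], None)
        if nxt is not None:
            doc, score = nxt
            heapq.heappush(heap, (-sim[term] * score, seq, term, doc))
            seq += 1

    for term in lists:
        push_next(term)
    while heap:
        neg_score, _, term, doc = heapq.heappop(heap)
        yield doc, -neg_score, term
        push_next(term)


def best_match(merged: Iterable[Tuple[str, float, str]]) -> Dict[str, float]:
    """Best-match aggregation: keep the highest weighted score per document."""
    best: Dict[str, float] = {}
    for doc, score, _ in merged:
        best[doc] = max(best.get(doc, 0.0), score)
    return best


# Illustrative lists, reconstructed from the slide figure (an assumption).
LISTS = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)],
    "t2": [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)],
    "t3": [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)],
}
SIM = {"t1": 1.0, "t2": 0.9, "t3": 0.5}
```

Because the merged stream is itself sorted by descending score, a top-k processor can treat ~t as a single virtual index list and stop reading it early.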

Some Experiments
New XML-ified Wikipedia corpus (INEX 2006): 660,000 documents with 130,000,000 elements.
125 INEX queries, each in a content-only (CO) and a content-and-structure (CAS) formulation.
CO: +“state machine” figure Mealy Moore
CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)]
Primary cost metric: Cost = #SA + (cR/cS) · #RA

TopX vs. Full-Merge
Significant cost savings for large ranges of k.
CAS cheaper than CO!

Efficiency vs. Effectiveness
Very good precision/runtime ratio for probabilistic pruning.

Static vs. Dynamic Expansions
Query expansions with up to m=292 keywords & phrases.
Balanced amount of sorted vs. random disk access.
Adaptive scheduling w.r.t. the cR/cS cost ratio.
Dynamic expansions are superior to static expansions & full-merge in both efficiency & effectiveness.

Thanks…
Gerhard Weikum, Ralf Schenkel, Norbert Fuhr, Michalis Vazirgiannis, Holger Bast, Debapriyo Majumdar, and all the MPI & INEX folks.

topx.sourceforge.net
See our SIGMOD ’07 demo!