INEX ‘05: Martin Theobald, Ralf Schenkel, Gerhard Weikum. Max Planck Institute for Informatics, Saarbrücken.

Presentation transcript:

INEX ‘05: An Efficient and Versatile Query Engine for TopX Search

Example query:
//article[ //sec[about(.//, “XML retrieval”)] //par[about(.//, “native XML database”)] ] //bib[about(.//item, “W3C”)]

[Figure: two example article trees built from article, sec, par, title, bib, item, inproc, and url elements. Text shown in the figure, in order:]
- “Current Approaches to XML Data Management.”
- “Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …”
- “XML queries with an expressive power similar to that of Datalog …”
- “XML-QL: A Query Language for XML.”
- “Native XML database systems can store schemaless data ...”
- “Proc. Query Languages Workshop, W3C, 1998.”
- “Native XML databases.”
- “Sophisticated technologies developed by smart people.”
- “The XML Files”
- “The Ontology Game”
- “The Dirty Little Secret”
- “What does XML add for retrieval? It adds formal ways …”
- “w3c.org/xml”
- “There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …”
- “XML”

TopX: Efficient XML-IR [VLDB ’05]

- Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ’01] to XML data
- Non-schematic, heterogeneous data sources
- Combined inverted index for content & structure
- Avoid full index scans, postpone expensive random accesses to large disk-resident data structures
- Exploit cheap disk space for redundant indexing
- Goal: Efficiently retrieve the best results of a similarity query

XML-IR: History and Related Work

- IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)
- IR on XML: XIRQL (U Dortmund / Essen), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)
- Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
- XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery (W3C), TeXQuery (AT&T Labs), XPath & XQuery Full-Text (W3C), NEXI (INEX Benchmark)
- Web query languages: Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WebSQL (U Toronto)

Computational Model

- Precomputed content scores score(t_i, e) ∈ ℝ⁺
  - E.g., term/element frequencies, probabilistic models (Okapi BM25), etc.
  - Typically normalized to score(t_i, e) ∈ [0,1]
- Monotone score aggregation aggr: (D_1 × … × D_m) → ℝ⁺
  - E.g., sum, max, product (using log), cosine (using L2 norm)
- Structural query conditions
  - Complex query DAGs
  - Aggregate constant score c for each matched structural condition (edges)
- Similarity queries (aka “andish”)
  - Non-conjunctive query evaluations
  - Weak content matches can be compensated
  - Vague structural matches
- Access model
  - Disk-resident inverted index
  - Inexpensive sequential accesses (SA) to inverted lists: “getNextItem()”
  - More expensive random accesses (RA): “getItemBy(Id)”
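The "andish" evaluation idea can be illustrated with a minimal sketch. It assumes summation as the monotone aggregation function and a constant c = 1.0 per matched structural condition; the function name `aggregate`, the value of c, and the example scores are hypothetical and not taken from the slides.

```python
from typing import Dict


def aggregate(content_scores: Dict[str, float], matched_edges: int, c: float = 1.0) -> float:
    """Monotone sum aggregation (an assumed choice; max or product would also qualify).

    content_scores maps each matched query term to its score in [0, 1].
    Each matched structural condition (edge) contributes the constant c.
    """
    return sum(content_scores.values()) + c * matched_edges


# Candidate that matches every term, but only weakly.
weak_all = aggregate({"xml": 0.1, "retrieval": 0.1, "database": 0.1}, matched_edges=1)

# Candidate that misses one term entirely but matches the others strongly.
# Under non-conjunctive ("andish") evaluation it is still a valid result.
strong_partial = aggregate({"xml": 0.9, "retrieval": 0.8}, matched_edges=1)

print(weak_all, strong_partial)
```

Because the aggregation is monotone, raising any single content score can never lower the aggregated score, which is the property the threshold-style algorithms on the later slides rely on.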

Data Model

- Simplified XML model disregarding IDRef & XLink/XPointer
- Redundant full-contents
- Per-element term frequencies ftf(t_i, e) for full-contents
- Pre/postorder labels for each tag-term pair

Example document (article 1) with the full-content of each element:
- article: “xml ir ir technique xml clustering xml evaluation”
  - title (“XML-IR”): “xml ir”
  - abs (“IR techniques for XML”): “ir technique xml”
  - sec: “clustering xml evaluation”
    - title (“Clustering on XML”): “clustering xml”
    - par (“Evaluation”): “evaluation”

ftf(“xml”, article_1) = 3
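As a sketch of how full-content term frequencies and pre/postorder labels could be derived in a single traversal, the following reproduces the example tree from this slide. The `Element` class and `label` function are hypothetical names, and the terms are passed in already tokenized and stemmed (an assumption; the slides do not show the text-processing step).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    tag: str
    terms: List[str] = field(default_factory=list)       # own text, already tokenized
    children: List["Element"] = field(default_factory=list)
    pre: int = 0
    post: int = 0
    ftf: Counter = field(default_factory=Counter)         # full-content term frequencies


def label(root: Element) -> None:
    """Assign pre/postorder ranks and accumulate full-content term frequencies."""
    counters = {"pre": 0, "post": 0}

    def visit(e: Element) -> None:
        counters["pre"] += 1
        e.pre = counters["pre"]
        e.ftf = Counter(e.terms)
        for child in e.children:
            visit(child)
            e.ftf.update(child.ftf)   # full-content = own terms + all descendants' terms
        counters["post"] += 1
        e.post = counters["post"]

    visit(root)


sec = Element("sec", children=[
    Element("title", ["clustering", "xml"]),
    Element("par", ["evaluation"]),
])
article = Element("article", children=[
    Element("title", ["xml", "ir"]),
    Element("abs", ["ir", "technique", "xml"]),
    sec,
])
label(article)
print(article.ftf["xml"], (article.pre, article.post), (sec.pre, sec.post))
```

The redundancy is visible here: every term is counted once for each of its ancestors, which trades index size for the ability to score any element without visiting its subtree at query time.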

Full-Content Scoring Model

- Full-content scores cast into an Okapi-BM25 probabilistic model with element-specific parameterization
- Basic scoring idea within IR-style family of TF*IDF ranking functions
- Additional static score mass c for relaxable structural conditions

Per-element statistics (columns: tag, N, avglength, k1, b; the values are truncated in the transcript):
- article: 12,223 | 2,…
- sec: 96,…
- par: 1,024,…
- fig: 109,…
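The slide's formula did not survive in the transcript. As an illustration only, the sketch below uses the standard Okapi BM25 form with per-tag statistics plugged in. The function name, the parameter names, and the default values k1 = 1.2 and b = 0.75 are assumptions, not the values used by TopX.

```python
import math


def bm25_element_score(
    ftf: float,        # full-content term frequency of the term in the element
    length: float,     # length of the element's full-content
    avglength: float,  # average full-content length for this tag
    n_elements: int,   # N: number of elements with this tag
    ef: int,           # number of elements with this tag that contain the term
    k1: float = 1.2,   # assumed default, tuned per tag in practice
    b: float = 0.75,   # assumed default, tuned per tag in practice
) -> float:
    """Standard BM25 shape: saturating TF component times an IDF-like component."""
    K = k1 * ((1.0 - b) + b * length / avglength)
    tf_part = (k1 + 1.0) * ftf / (K + ftf)
    idf_part = math.log((n_elements - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


# Hypothetical statistics for a "sec" element.
short_sec = bm25_element_score(ftf=3, length=100, avglength=100, n_elements=1000, ef=10)
long_sec = bm25_element_score(ftf=3, length=400, avglength=100, n_elements=1000, ef=10)
print(short_sec, long_sec)
```

Keeping N, avglength, k1, and b per tag is what "element-specific parameterization" refers to: a three-word title and a three-thousand-word article are each normalized against elements of the same kind.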

Inverted Block-Index for Content & Structure

- Inverted index over tag-term pairs (full-contents)
  - Benefits from increased selectivity of combined tag-term pairs
  - Accelerates child-or-descendant axis, e.g., sec//“clustering”
- Sequential block-scans
  - Re-order elements in descending order of (maxscore, docid, score) per list
  - Fetch all tag-term pairs per doc in one sequential block-access
  - docid limits the range of in-memory structural joins
- Stored as inverted files or database tables (B+-tree indexes)

[Figure: example index lists for sec[clustering], title[xml], and par[evaluation], each with columns eid, docid, score, pre, post, maxscore.]
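The block ordering can be sketched as follows, assuming one in-memory posting list per tag-term pair. The `Posting` type, the `build_block_list` helper, and the sample values are hypothetical; ties between blocks are broken by ascending docid here, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Posting(NamedTuple):
    eid: int
    docid: int
    score: float
    pre: int
    post: int


def build_block_list(postings: List[Posting]) -> List[Posting]:
    """Group postings into per-document blocks and order blocks by their maxscore."""
    blocks: Dict[int, List[Posting]] = defaultdict(list)
    for p in postings:
        blocks[p.docid].append(p)
    # Blocks in descending maxscore order; docid keeps each block contiguous.
    ordered_blocks = sorted(
        blocks.items(),
        key=lambda item: (-max(p.score for p in item[1]), item[0]),
    )
    result: List[Posting] = []
    for _docid, block in ordered_blocks:
        result.extend(sorted(block, key=lambda p: -p.score))  # best element first
    return result


# Hypothetical postings for the list sec[clustering].
postings = [
    Posting(eid=9, docid=2, score=0.5, pre=10, post=8),
    Posting(eid=84, docid=3, score=0.1, pre=1, post=12),
    Posting(eid=46, docid=2, score=0.9, pre=2, post=15),
    Posting(eid=171, docid=5, score=0.85, pre=1, post=20),
]
ordered = build_block_list(postings)
print([(p.docid, p.score) for p in ordered])
```

A scan over this list reads all matching elements of one document together, so structural joins only ever compare elements that share a docid.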

Navigational Index

- Additional navigational index
  - Non-redundant element directory
  - Supports element paths and branching path queries
  - Random accesses using (docid, tag) as key
  - Schema-oblivious indexing & querying

[Figure: example index lists for the tags sec, title, and par, each with columns eid, docid, pre, post.]
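A minimal in-memory sketch of such an element directory, keyed by (docid, tag), together with the usual pre/postorder descendant test. The names `NavEntry`, `is_descendant`, and `has_descendant_with_tag` and the sample labels are hypothetical; a real system would keep this index on disk.

```python
from typing import Dict, List, NamedTuple, Tuple


class NavEntry(NamedTuple):
    eid: int
    pre: int
    post: int


NavIndex = Dict[Tuple[int, str], List[NavEntry]]


def is_descendant(anc: NavEntry, desc: NavEntry) -> bool:
    """desc lies below anc iff it starts later in preorder and ends earlier in postorder."""
    return anc.pre < desc.pre and anc.post > desc.post


def has_descendant_with_tag(index: NavIndex, docid: int, anc: NavEntry, tag: str) -> bool:
    """One random access by (docid, tag), then an in-memory label comparison."""
    return any(is_descendant(anc, e) for e in index.get((docid, tag), []))


# Labels for a small document: article(title, abs, sec(title, par)).
sec_entry = NavEntry(eid=4, pre=4, post=5)
nav_index: NavIndex = {
    (1, "article"): [NavEntry(eid=1, pre=1, post=6)],
    (1, "title"): [NavEntry(eid=2, pre=2, post=1), NavEntry(eid=5, pre=5, post=3)],
    (1, "abs"): [NavEntry(eid=3, pre=3, post=2)],
    (1, "sec"): [sec_entry],
    (1, "par"): [NavEntry(eid=6, pre=6, post=4)],
}
print(has_descendant_with_tag(nav_index, 1, sec_entry, "par"))
print(has_descendant_with_tag(nav_index, 1, sec_entry, "abs"))
```

Since the test only compares labels, it needs no knowledge of a schema or DTD, which is what makes the approach schema-oblivious.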

TopX Query Processing

- Adapt Threshold Algorithm (TA) paradigm [Fagin et al., PODS ’01; Güntzer et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]
  - Focus on inexpensive sequential/sorted accesses
  - Postpone expensive random accesses
- Candidate d = connected sub-pattern with element ids and scores
  - Incrementally evaluate path constraints using pre/postorder labels
  - In-memory structural joins (nested loops, staircase, or holistic twig joins)
- Upper/lower score guarantees per candidate
  - Remember set of evaluated dimensions E(d)
  - worstscore(d) = ∑_{i ∈ E(d)} score(t_i, e)
  - bestscore(d) = worstscore(d) + ∑_{i ∉ E(d)} high_i
- Early threshold termination
  - Candidate queuing
  - Stop if bestscore(d) ≤ min-k for every queued candidate d, where min-k is the worstscore of the current rank-k result
- Extensions
  - Batching of sorted accesses & efficient queue management
  - Cost model for random access scheduling
  - Probabilistic candidate pruning for approximate top-k results [VLDB ’04]
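The score bookkeeping and the threshold test can be sketched with a simplified, document-level variant that uses sorted accesses only. This is not the TopX implementation: it ignores structural conditions, random access scheduling, and probabilistic pruning, assumes sum aggregation and k ≥ 1, and the function name and sample lists are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def topk_sorted_access(
    lists: List[List[Tuple[str, float]]], k: int
) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k candidates with worstscores, number of sorted accesses).

    Each input list holds (candidate id, score) pairs in descending score order.
    """
    m = len(lists)
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # high_i: bound on unseen scores
    worst: Dict[str, float] = {}                          # worstscore(d)
    evaluated: Dict[str, Set[int]] = {}                   # E(d)
    accesses = 0
    depth = 0

    def best(d: str) -> float:
        """bestscore(d) = worstscore(d) + sum of high_i over unevaluated dimensions."""
        return worst[d] + sum(high[i] for i in range(m) if i not in evaluated[d])

    while True:
        progressed = False
        for i, lst in enumerate(lists):       # one round-robin batch of sorted accesses
            if depth < len(lst):
                d, s = lst[depth]
                worst[d] = worst.get(d, 0.0) + s
                evaluated.setdefault(d, set()).add(i)
                high[i] = s
                accesses += 1
                progressed = True
            else:
                high[i] = 0.0                 # list exhausted, nothing left to gain
        depth += 1
        ranked = sorted(worst.items(), key=lambda item: -item[1])
        top, queue = ranked[:k], ranked[k:]
        if not progressed:
            return top, accesses
        if len(top) == k:
            min_k = top[-1][1]
            # sum(high) bounds any candidate that has not been seen at all.
            if sum(high) <= min_k and all(best(d) <= min_k for d, _ in queue):
                return top, accesses


index_lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.3)],
    [("d2", 0.7), ("d1", 0.2), ("d3", 0.1)],
    [("d3", 0.9), ("d2", 0.6), ("d1", 0.1)],
]
top1, sorted_accesses = topk_sorted_access(index_lists, k=1)
print(top1, sorted_accesses)
```

On this input the scan stops after two of the three rounds, because no queued or unseen candidate can still exceed the worstscore of d2. The scores returned are worstscores, so for candidates with unevaluated dimensions they are lower bounds rather than exact totals.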

TopX Query Processing By Example

[Figure: animated walk-through of a top-2 query over the index lists sec[clustering], title[xml], and par[evaluation] (columns eid, docid, score, pre, post), with candidates from doc 1, doc 2, doc 3, doc 5, and doc 17. The candidate queue shows worstscore/bestscore bounds per candidate. The bestscore of the pseudo-candidate for unseen documents drops from 2.9 through 2.8, 2.75, 2.65, 2.45, 1.7, and 1.4 to 1.35, while min-2 rises from 0.0 through 0.5 and 0.9 to 1.6.]

CO.Thorough

- Element-granularity
- Turn query into pseudo CAS query using “//*”
- No post-filtering on specific element types
- … = … (rank 22 of 55)
- MAP = … (rank 37 of 55)
- Old INEX_eval: MAP = 0.058 (rank 3)

COS.Fetch&Browse

- Document-granularity
- Rank documents according to their best target element
- Strict evaluation of support & target elements
- Return all target elements per doc using the document score (no overlap)
- MAP = … (rank 4 of 19)

SSCAS

- Element-granularity with strict support & target elements (no overlap)
- … = 0.45 (ranks 1 & 2 of 25)
- MAP = … & … (ranks 1 & 6)

Top-k Efficiency

[Table: comparison of Join&Sort, StructIndex, TopX – MinProbe, and TopX – BenProbe; columns k, epsilon, # SA, # RA, CPU sec, relPrec. The baselines are listed with k = 10 and epsilon n/a; TopX – BenProbe is also listed for k = 1,000. The remaining cell values are not recoverable from the transcript.]

Probabilistic Pruning

[Table: TopX – MinProbe runs; columns k, epsilon, # SA, # RA, CPU sec, relPrec. The cell values are not recoverable from the transcript.]

Conclusions & Ongoing Work

- Efficient and versatile TopX query processor
  - Extensible framework for text, semi-structured & structured data
- Probabilistic extensions
  - Probabilistic cost model for random access scheduling
  - Very good precision/runtime ratio for probabilistic candidate pruning
- Full NEXI support
  - Phrase matching, mandatory terms “+”, negation “-”, attributes
  - Query weights (e.g., relevance feedback, ontological similarities)
- Scalability
  - Optimized for runtime, exploits cheap disk space (redundancy factor 4-5 for INEX)
  - Participated in the TREC Terabyte Efficiency Task
- Dynamic and self-tuning query expansions [SIGIR ’05]
  - Incrementally merges inverted lists for a set of active expansions
- Vague Content & Structure (VCAS) queries (maybe next year...)