Similar presentations:
Even More TopX: Relevance Feedback Ralf Schenkel Joint work with Osama Samodi, Martin Theobald.

Information Retrieval in Practice
Chapter 5: Introduction to Information Retrieval
Introduction to Information Retrieval
Martin Theobald Max Planck Institute for Computer Science Stanford University Joint work with Ralf Schenkel, Gerhard Weikum TopX Efficient & Versatile.
Best-Effort Top-k Query Processing Under Budgetary Constraints
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Top-k Query Evaluation with Probabilistic Guarantees by Martin Theobald, Gerhard Weikum, Ralf Schenkel.
Language Model based Information Retrieval: University of Saarland 1 A Hidden Markov Model Information Retrieval System Mahboob Alam Khalid.
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval University of Illinois at Urbana-Champaign Zhichen Xu, Yan Chen Northwestern.
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 7: Scores in a Complete Search.
Top- K Query Evaluation with Probabilistic Guarantees Martin Theobald, Gerhard Weikum, Ralf Schenkel Presenter: Avinandan Sengupta.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR2006.
VLDB ’04 Top-k Query Evaluation with Probabilistic Guarantees Martin Theobald Gerhard Weikum Ralf Schenkel Max-Planck Institute for Computer Science Saarbrücken, Germany.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Access Path Selection in a Relational Database Management System Selinger et al.
Term Weighting and Ranking Models Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks Martin Theobald Max Planck Institute Informatics Ralf Schenkel Saarland University Ablimit Aji Emory.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Ranking in DB Laks V.S. Lakshmanan Dept. of CS, UBC.
Search A Basic Overview Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014.
Chapter 6: Information Retrieval and Web Search
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Efficient Processing of Top-k Spatial Preference Queries
An Efficient and Versatile Query Engine for TopX Search Martin Theobald Ralf Schenkel Gerhard Weikum Max-Planck Institute for Informatics Saarbrücken, Germany.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
IO-Top-k: Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger.
Winter Semester 2003/2004Selected Topics in Web IR and Mining5-1 5 Index Pruning 5.1 Index-based Query Processing 5.2 Pruning with Combined Authority/Similarity.
Vector Space Models.
INEX ’05 Martin Theobald Ralf Schenkel Gerhard Weikum Max Planck Institute for Informatics Saarbrücken.
Searching Specification Documents R. Agrawal, R. Srikant. WWW-2002.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
NRA: Top-k query processing using No Random Access (only sequential access). Algorithm: 1) scan index lists in parallel; 2) consider.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Concept-based P2P Search How to find more relevant documents Ingmar Weber Max-Planck-Institute for Computer Science Joint work with Holger Bast Torino,
Sudhanshu Khemka.  Treats each document as a vector with one component corresponding to each term in the dictionary  Weight of a component is calculated.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Efficient.
The Loquacious ( 愛說話 ) User: A Document-Independent Source of Terms for Query Expansion Diane Kelly et al. University of North Carolina at Chapel Hill.
Fast Indexes and Algorithms For Set Similarity Selection Queries M. Hadjieleftheriou A.Chandel N. Koudas D. Srivastava.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing Martin Theobald Ralf Schenkel Gerhard Weikum Max-Planck Institute for.
Efficient Top-k Querying over Social-Tagging Networks Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Xavier Parreira,
Information Retrieval in Practice
Queensland University of Technology
An Efficient Algorithm for Incremental Update of Concept space
Neighborhood - based Tag Prediction
Indexing & querying text
Chinese Academy of Sciences, Beijing, China
Introduction to Query Optimization
Spatio-temporal Pattern Queries
Compact Query Term Selection Using Topically Related Text
Max Planck Institute for Informatics
Martin Theobald Max-Planck-Institut Informatik Stanford University
Laks V.S. Lakshmanan Dept. of CS, UBC
Information Organization: Clustering
Structure and Content Scoring for XML
8. Efficient Scoring Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
Chapter 5: Information Retrieval and Web Search
Structure and Content Scoring for XML
Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web Yizhe Ge.
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max-Planck Institute for Informatics, Saarbrücken, Germany
ACM SIGIR ’05

An Initial Example…

TREC Robust Track ’04, hard query no. 363 (Aquaint news corpus): “transportation tunnel disasters”

Expansion terms come from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.; term similarities from Rocchio, Robertson & Sparck-Jones, concept similarities, or other correlation measures:
- transportation (1.0): transit, highway, train, truck, metro, “rail car”, car, … (similarities 0.9, 0.8, 0.7, 0.6, 0.5, 0.1, …)
- tunnel (1.0): tube (0.9), underground (0.8), “Mont Blanc” (0.7), …
- disasters (1.0): catastrophe, accident, fire, flood, earthquake, “land slide”, … (similarities 0.9, 0.7, 0.6, 0.5, …)

Goals:
- Increased retrieval robustness: count only the best match per document and expansion set
- Increased efficiency: top-k-style query evaluations, scans on new terms opened only on demand, no threshold tuning

Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions

Computational Model
- Vector space model with a Cartesian product space D1 × … × Dm and a data set D ⊆ D1 × … × Dm
- Precomputed local scores s(ti,d) ∈ Di for all d ∈ D, e.g., tf·idf variations or probabilistic models (Okapi BM25), typically normalized to s(ti,d) ∈ [0,1]
- Monotonic score aggregation aggr: (D1 × … × Dm) × (D1 × … × Dm) → ℝ+, e.g., sum, max, product (using sum over log sij), cosine (using the L2 norm)
- Partial-match queries (aka “andish”): non-conjunctive query evaluations; weak local matches can be compensated
- Access model: a disk-resident inverted index over a large text corpus, with inverted lists sorted by decreasing local scores; sequential accesses to per-term lists (“getNextItem()”) are inexpensive, random accesses (“getItemBy(docid)”) are more expensive
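The access model above can be mimicked in a few lines. The following is an illustrative in-memory stand-in (the slides describe disk-resident lists; class and method names merely echo the getNextItem()/getItemBy() terminology):

```python
class InvertedList:
    """In-memory stand-in for a disk-resident, score-sorted index list."""

    def __init__(self, postings):
        # postings: (doc, local_score) pairs; kept sorted by descending score
        self.postings = sorted(postings, key=lambda p: -p[1])
        self.by_doc = dict(self.postings)   # doc -> score, for random access
        self.pos = 0

    def get_next_item(self):
        """Inexpensive sequential access: next (doc, score) pair, or None."""
        if self.pos == len(self.postings):
            return None
        item = self.postings[self.pos]
        self.pos += 1
        return item

    def get_item_by(self, doc):
        """More expensive random access: local score of doc, or None."""
        return self.by_doc.get(doc)
```

A real engine would back get_item_by with an index probe; here a dictionary plays that role.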

No-Random-Access (NRA) Algorithm [Fagin et al., PODS ’01; Balke et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel  // e.g., round-robin
    ⟨d, s(ti,d)⟩ = Li.getNextItem()
    E(d) = E(d) ∪ {i}; high_i = s(ti,d)
    worstscore(d) = ∑ i∈E(d) s(ti,d)
    bestscore(d) = worstscore(d) + ∑ i∉E(d) high_i
    if worstscore(d) > min-k then
      add d to top-k; min-k = min{ worstscore(d’) | d’ ∈ top-k }
    else if bestscore(d) > min-k then
      candidates = candidates ∪ {d}
    if max{ bestscore(d’) | d’ ∈ candidates } ≤ min-k then return top-k

Example: query q = (transportation, tunnel, disaster) over score-sorted inverted lists (transport: d78:0.9, d23:0.8, d10:0.8, …; tunnel: d64:0.8, d23:0.6, d10:0.6, …; disaster: d10:0.7, d78:0.5, d64:0.4, …). For k = 1, after scan depth 1 the interim top-1 is d78 with [worstscore, bestscore] = [0.9, 2.4]; after depth 2 it is d78 with [1.4, 2.0]; after depth 3 it is d10 with worstscore 2.1, and since no candidate’s bestscore exceeds 2.1, the scan stops. A naive Join-then-Sort baseline lies between O(mn) and O(mn²) runtime for a corpus of n documents and m lists.
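The scan loop above can be sketched in Python. This is a minimal in-memory illustration, not the disk-based engine: plain iterators stand in for the index lists, and each round advances every list once:

```python
import heapq
from collections import defaultdict

def nra(lists, k):
    """No-Random-Access top-k: lists are iterators yielding (doc, score)
    in descending score order, one iterator per query term."""
    m = len(lists)
    high = [1.0] * m              # last score seen per list (upper bound)
    seen = defaultdict(dict)      # doc -> {list index: local score}
    while True:
        progressed = False
        for i, it in enumerate(lists):        # round-robin sequential access
            item = next(it, None)
            if item is None:
                high[i] = 0.0                 # list i is exhausted
                continue
            progressed = True
            doc, s = item
            high[i] = s
            seen[doc][i] = s
        scored = []
        for doc, parts in seen.items():
            worst = sum(parts.values())       # known dimensions only
            best = worst + sum(high[i] for i in range(m) if i not in parts)
            scored.append((worst, best, doc))
        topk = heapq.nlargest(k, scored)      # current top-k by worstscore
        if len(topk) == k:
            min_k = topk[-1][0]
            cand_best = max((b for w, b, d in scored
                             if (w, b, d) not in topk), default=0.0)
            # stop once no candidate can still overtake the current top-k
            if cand_best <= min_k:
                return [(d, w) for w, b, d in topk]
        if not progressed:                    # all lists exhausted
            return [(d, w) for w, b, d in topk]
```

Running it on the slide's example lists returns d10 with aggregated score 2.1 after scan depth 3, matching the trace above.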


Dynamic & Self-tuning Query Expansions

Example: top-k(transport, tunnel, ~disaster), where ~disaster denotes a virtual inverted list incrementally merged from the lists of its expansions (disaster, accident, fire, …)
- Incrementally merge the inverted lists Li1 … Lim’ in descending order of local scores
- Dynamically add lists to the set of active expansions exp(ti); only short prefixes of each list are touched, and not all lists need to be opened
- Best-match score aggregation combines term similarities and local scores

Benefits: increased retrieval robustness and fewer topic drifts; increased efficiency through fewer active expansions; no threshold tuning of term similarities in the expansions

Incremental Merge Operator
- Expansion terms ~t = {t1, t2, t3} from relevance feedback, thesaurus lookups, etc.; initial high-scores from index-list metadata (e.g., histograms)
- Expansion similarities from correlation measures and large-corpus statistics, e.g., sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
- The Incremental Merge operator is iteratively triggered by the top-k operator via sequential “getNextItem()” accesses; each entry of list ti contributes the combined score sim(t, ti) · s(ti,d)
- Example: merging t1 = [d78:0.9, d23:0.8, d10:0.8, d1:0.4, d88:0.3] (sim 1.0), t2 = [d64:0.8, d23:0.8, d10:0.7, d12:0.2, d78:0.1] (sim 0.9, rescored to 0.72, 0.72, 0.63, 0.18, 0.09), and t3 = [d11:0.9, d78:0.9, d99:0.7, d34:0.6] (sim 0.5, rescored to 0.45, 0.45, 0.35, 0.3) yields the virtual list ~t = [d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, d88:0.3, …]
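The merge itself can be sketched under the same in-memory assumptions: each expansion list is rescored by its term similarity, and a heap of per-list cursors emits the virtual list in descending combined-score order. Duplicate documents are passed through, since per the slides the consuming top-k operator counts only the best match per document:

```python
import heapq

def incremental_merge(lists, sims):
    """Lazily merge expansion-term lists into one virtual inverted list.

    lists[i] holds (doc, local_score) pairs in descending score order;
    sims[i] is sim(t, ti). Entries are yielded in descending order of
    the combined score sims[i] * local_score. Illustrative in-memory
    stand-in for the disk-based operator.
    """
    cursors = [iter(l) for l in lists]
    heap = []
    for i, it in enumerate(cursors):
        for doc, s in it:                  # push only the head of each list
            heapq.heappush(heap, (-sims[i] * s, i, doc))
            break
    while heap:
        neg, i, doc = heapq.heappop(heap)
        for nxt_doc, s in cursors[i]:      # advance list i by one entry
            heapq.heappush(heap, (-sims[i] * s, i, nxt_doc))
            break
        yield doc, -neg
```

On the example lists above, the generator reproduces the virtual list ~t, including the duplicate d23 and d10 entries that best-match aggregation later collapses.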


Probabilistic Candidate Pruning [Theobald, Schenkel & Weikum, VLDB ’04]
- For each physically stored index list Li: treat each s(ti,d) ∈ [0,1] as a random variable Si and approximate its local score distribution by an equi-width histogram with n buckets
- For a virtual index list ~Li merged from Li1, …, Lim’: consider the max-distribution (assuming feature independence), or alternatively construct a meta histogram over the active expansions
- For all d in the candidate queue: use the convolution over the local score distributions to predict aggregated scores, and drop d from the candidate queue if the predicted probability of beating min-k is at most ε
- Return the current top-k list as soon as the candidate queue is empty
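The convolution step can be sketched as follows. This is an illustrative reading of the scheme, with deliberately coarse bucket handling; the drop test (prune when the probability of closing the gap to min-k is at most ε) is assumed from the description above:

```python
def convolve(h1, h2):
    """Bucketed distribution of the sum of two independent scores."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def keep_candidate(worstscore, unseen_hists, min_k, eps, n_buckets=10):
    """Keep d only if P[worstscore(d) + unseen score mass > min-k] > eps.

    unseen_hists: one equi-width histogram (n_buckets buckets over [0,1])
    per query dimension in which d has not been seen yet.
    """
    width = 1.0 / n_buckets
    dist = [1.0]                  # point mass: no unseen contribution yet
    for h in unseen_hists:
        dist = convolve(dist, h)
    gap = min_k - worstscore      # score mass d still needs to catch up
    # a bucket counts if some score inside it would close the gap
    p = sum(q for b, q in enumerate(dist) if (b + 1) * width > gap)
    return p > eps
```

With a uniform histogram and a gap of 0.5, half the probability mass can close the gap, so the candidate survives a small ε but is dropped for ε above 0.5 — exactly the knob the experiments vary.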


Incremental Merge for Multidimensional Phrases

Example query: q = {undersea “fiber optic cable”}, with expansion similarities sim(“fiber optic cable”, “fiber optic cable”) = 1.0 and sim(“fiber optic cable”, “fiber optics”) = 0.8
- A Nested Top-k operator iteratively prefetches & joins candidate items for each subquery condition (“getNextItem()”)
- It propagates candidates in descending order of their bestscore(d) values to provide monotonic upper score bounds, and gives [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
- The top-level top-k operator performs phrase tests only for the most promising items, using random accesses to a term-to-position index (expensive predicates & minimal probes [Chang & Hwang, SIGMOD ’02])
- A single threshold condition suffices for algorithm termination (candidate pruning at the top-level queue only)
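The random-access phrase test can be sketched against a term-to-position index. The dictionary layout here is hypothetical; the point is that the expensive probe touches one candidate document at a time, so it is only issued for the most promising items:

```python
def phrase_match(pos_index, doc, phrase):
    """Random-access phrase test for one candidate document.

    pos_index: hypothetical term-to-position index,
               term -> {doc: [word offsets]}.
    phrase: list of terms that must occur at consecutive offsets.
    """
    offsets = [set(pos_index.get(t, {}).get(doc, ())) for t in phrase]
    if not all(offsets):
        return False                   # some phrase term missing from doc
    # the phrase occurs if some occurrence of term 0 is followed by
    # term i at offset + i, for every remaining i
    return any(all(p + i in offsets[i] for i in range(1, len(phrase)))
               for p in offsets[0])
```

A document containing "fiber optic cable" at consecutive offsets passes; the same terms in scrambled order do not.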


Experiments – Aquaint with Fixed Expansions
- Aquaint corpus of English news articles (528,155 docs)
- 50 “hard” queries from the TREC 2004 Robust track
- WordNet expansions using a simple form of WSD; fixed expansion technique (synonyms + first-order hyponyms)
- Okapi BM25 model for local scores, Dice coefficients as term similarities

Results (table reconstructed from the flattened slide layout; avg(m)/max(m) are the average/maximum number of terms per expanded query):

| Setting | avg(m) | max(m) | # SA | # RA | CPU sec | max KB | P@10 | MAP@1000 | relPrec |
| Title-only: Join&Sort | 2.5 | 4 | 2,305,637 | – | – | – | – | – | – |
| Title-only: NRA-Baseline | 2.5 | 4 | 1,439,815 | – | 9.4 | 432 | 0.252 | 0.092 | 1.000 |
| Static expansions: Join&Sort | 35 | 118 | 20,582,764 | – | – | – | – | – | – |
| Static expansions: NRA+Phrases, ε=0.0 | 35 | 118 | 18,258,834 | 210,531 | 245.0 | 37,355 | 0.286 | 0.105 | 1.000 |
| Static expansions: NRA+Phrases, ε=0.1 | 35 | 118 | 3,622,686 | 49,783 | 79.6 | 5,895 | 0.238 | 0.086 | 0.541 |
| Dynamic expansions: Incr.Merge, ε=0.0 | 35 | 118 | 7,908,666 | 53,050 | 159.1 | 17,393 | 0.310 | 0.118 | 1.000 |
| Dynamic expansions: Incr.Merge, ε=0.1 | 35 | 118 | 5,908,017 | 48,622 | 79.4 | 13,424 | 0.298 | 0.110 | 0.786 |

Experiments – Aquaint with Fixed Expansions, cont’d
- Probabilistic pruning performance: Incremental Merge vs. top-k with static expansions
- ε controls the pruning aggressiveness, with 0 ≤ ε ≤ 1

Conclusions & Ongoing Work
- Increased efficiency: Incremental Merge vs. Join-then-Sort and top-k with static expansions; very good precision/runtime ratio for probabilistic pruning
- Increased retrieval robustness: largely avoids topic drifts; fine-grained modeling of semantic similarities (Incremental Merge & Nested Top-k operators)
- Scalability (see paper): large expansions (m < 876 terms per query) on Aquaint; expansions for the Terabyte collection (~25,000,000 docs)
- Efficient support for XML-IR (INEX benchmark): inverted lists for combined tag-term pairs, e.g., sec=mining; efficient support for the child-or-descendant axis, e.g., //article//sec=mining; vague content-and-structure queries (VCAS), e.g., //article//~sec=~mining (TopX engine, VLDB ’05)
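The tag-term pair idea can be illustrated in a few lines (the flattened-element input format is hypothetical; the point is that an element condition such as sec=mining becomes an ordinary inverted-list lookup):

```python
from collections import defaultdict

def build_tag_term_index(docs):
    """Build a toy inverted index keyed by (tag, term) pairs.

    docs: {docid: [(tag, element_text)]} — flattened XML elements
    (illustrative stand-in for a real XML parse).
    """
    index = defaultdict(list)
    for docid, elements in docs.items():
        for tag, text in elements:
            for term in text.lower().split():
                # one posting per (tag, term) occurrence; a real engine
                # would store scores and positions here as well
                index[(tag, term)].append(docid)
    return index
```

A condition like //sec with content "mining" then reduces to reading the list for the key ('sec', 'mining'), just as for a plain term.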

Thank you!