Heavy-Tailed Distribution and Multi-Keyword Queries. Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui. Microsoft Corporation. SIGIR 2007.

Presentation transcript:

Heavy-Tailed Distribution and Multi-Keyword Queries
Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui
Microsoft Corporation, SIGIR 2007
Summarized by JongHeum Yeon, IDS Lab., Seoul National University

Copyright © 2008 by CEBT

INTRODUCTION
- Inverted index in information retrieval
  - T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
  - "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
  - Search for "what", "is", "it": {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
- Some queries require a costly deep traversal of long lists on websites with large product catalogs (Amazon, eBay, …)
- The challenge is to reduce the worst-case overhead required to process arbitrary keyword queries
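The inverted-index example above can be reproduced with a short Python sketch. The function names `build_inverted_index` and `conjunctive_query` are illustrative choices, not taken from the paper, and tokenization by whitespace is an assumption that happens to fit the three example texts.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():  # assumption: whitespace tokenization
            index[word].add(doc_id)
    return dict(index)


def conjunctive_query(index, keywords):
    """Return the IDs of documents that contain every keyword."""
    postings = [index.get(w, set()) for w in keywords]
    if not postings:
        return set()
    # Start from the shortest list so intermediate results stay small.
    postings.sort(key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
result = conjunctive_query(index, ["what", "is", "it"])
print(result)
```

With the three example texts, the query for "what", "is", "it" yields documents 0 and 1, matching the intersection shown on the slide.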

Motivating Scenario
- More frequent terms have relatively long inverted lists
- Intersections of long inverted lists are very slow relative to other queries
- Figure: 20 million products; keyword frequency classes F (>900K), M (50K), L (<1K)

Problem Statement
- Given a document collection, propose a set of indexes to materialize such that:
  - the time for intersecting keywords does not exceed a given threshold Δ
  - the additional indexes are no larger than k (a small factor) times the size of the original inverted index

INDEX STRUCTURE AND USAGE
- Notation
  - Q: a query, with words(Q) = {w_1, …, w_l}
  - k_max: maximum number of terms in a query
  - γ: global vocabulary
  - π: global ordering
    - Given a keyword combination C = {w_1, …, w_l}, the words are sorted by the global ordering so that permutations of the same combination are not stored separately
  - size(Q): number of items (documents) whose text contains all keywords of query Q
  - size(w): for a single word w, the number of documents containing w
  - |Q|: number of keywords that query Q contains
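The role of the global ordering π can be illustrated with a minimal Python sketch. The helper `canonical_key` and the concrete ordering in `pi` are hypothetical; the slides do not say how π is chosen, so the list below is only a placeholder.

```python
def canonical_key(combination, rank):
    """Sort a keyword combination by the global ordering so that every
    permutation of the same keywords maps to one tuple."""
    return tuple(sorted(set(combination), key=lambda w: rank[w]))


# Hypothetical global ordering pi: a word's position in the list is its rank.
pi = ["banana", "is", "it", "what"]
rank = {w: i for i, w in enumerate(pi)}

key_a = canonical_key(["what", "is"], rank)
key_b = canonical_key(["is", "what"], rank)
print(key_a, key_b)
```

Both calls return the same tuple, so an index keyed by such tuples needs one entry per keyword combination rather than one per permutation.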

Cost Model
- Cost = disk seeks to the beginning of posting lists + scanning of postings
- Unit of cost: scanning a single posting in an inverted index
- Δ: cost bound
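A rough Python sketch of this cost model, under the assumption that every queried list is read in full and that a seek is charged as a fixed number of posting scans. The default of 1000 follows the seek cost quoted in the experiments; the function names and the example list lengths are illustrative.

```python
def id_intersection_cost(list_lengths, cost_seek=1000):
    """One seek per posting list plus a full scan of every list,
    measured in units of 'scanning one posting'."""
    return len(list_lengths) * cost_seek + sum(list_lengths)


def within_bound(cost, delta):
    """Check a query cost against the cost bound delta."""
    return cost <= delta


# Illustrative list lengths: one very frequent and one medium-frequency keyword.
cost = id_intersection_cost([900_000, 50_000])
print(cost)
```

Here the two seeks contribute 2,000 units and the scans 950,000 units, so long lists dominate the cost, which is the case the additional indexes are meant to address.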

Processing Strategies
- Execution strategies
  - ID-intersection
    - Retrieves the inverted lists of all queried keywords and intersects them
    - Requires |Q| disk seeks and reads each list in full
  - Post-filtering
    - Used when some keyword w_i in Q is very rare
    - Retrieves the documents containing w_i through its inverted list, then verifies the remaining keyword constraints against the document text
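The two execution strategies can be contrasted in a small Python sketch. It assumes an in-memory index and whitespace tokenization, and the function names are illustrative; the paper's system works on disk-resident lists.

```python
def id_intersection(index, query):
    """Read the posting list of every keyword and intersect them."""
    lists = [index.get(w, set()) for w in query]
    return set.intersection(*lists) if lists else set()


def post_filtering(index, docs, query, rare_word):
    """Read only the rare keyword's list, then check the remaining
    keywords directly against each candidate document's text."""
    candidates = index.get(rare_word, set())
    return {d for d in candidates
            if all(w in docs[d].split() for w in query)}


docs = ["it is what it is", "what is it", "it is a banana"]
index = {}
for doc_id, text in enumerate(docs):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

query = ["banana", "is", "it"]
by_intersection = id_intersection(index, query)
by_filtering = post_filtering(index, docs, query, rare_word="banana")
print(by_intersection, by_filtering)
```

Both strategies return the same answer; post-filtering touches only the one-entry list for "banana" plus one document text, while ID-intersection reads all three lists.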

Index Structure
- Materialize inverted indexes for combinations of frequent keywords, but only for a small fraction of those combinations
- For each vocabulary item w, keep a list of all keyword combinations containing w for which the corresponding inverted index has been materialized

Query Processing
- Query Q = {w_1, …, w_l}
- If Q contains a rare keyword: use the post-filtering strategy
- Otherwise: retrieve all match-list entries
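The dispatch described above might look like the following Python sketch. Everything concrete here is an assumption for illustration: the `rare_threshold` parameter, the `match_list` layout (word mapped to a list of materialized keyword tuples), the example words and sizes, and the rule that a materialized combination is usable only if all of its words appear in the query. The slides state only the two branches.

```python
def plan_query(query, size, match_list, rare_threshold):
    """Choose a strategy for a conjunctive keyword query.

    size:        word -> number of documents containing it
    match_list:  word -> materialized keyword combinations containing it
    """
    words = set(query)
    rare = [w for w in words if size.get(w, 0) < rare_threshold]
    if rare:
        # Drive post-filtering from the rarest keyword.
        return ("post-filtering", min(rare, key=lambda w: size.get(w, 0)))
    # Collect match-list entries whose keywords all occur in the query.
    usable = {combo
              for w in words
              for combo in match_list.get(w, [])
              if set(combo) <= words}
    return ("match-list", sorted(usable))


# Hypothetical statistics and match lists.
size = {"digital": 900_000, "camera": 800_000, "leica": 30}
match_list = {
    "digital": [("camera", "digital")],
    "camera": [("camera", "digital")],
}

plan_frequent = plan_query(["digital", "camera"], size, match_list, 50)
plan_rare = plan_query(["digital", "leica"], size, match_list, 50)
print(plan_frequent, plan_rare)
```

A query made only of frequent keywords is answered from a materialized combination, while a query containing a rare keyword falls back to post-filtering on that keyword.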

EXPERIMENTS
- Evaluation of query cost
  - Materialized the index structure: 10K frequent words
  - k_max = 4, Cost_seek = 1000
  - Δ: cost of scanning 20% of the number of postings
  - Speed-ups: 18x (2 keywords), 14x (4 keywords)
- Evaluation of index sizes
  - 899M postings
  - No additional indexes for keywords occurring in fewer than 50 documents
  - 141K keywords for indexing
  - Multi-keyword index structures contained 734M postings
- Accuracy of intersection-size estimation
  - Match list covers 99.3%