Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE RMIT University 2004 Burak Görener 201195001 Doğuş University.

Search Engines
- Need to evaluate queries extremely fast.
- Must handle queries that involve phrases.
- Should support phrase queries with low disk overhead.

Introduction
- Most queries consist of a simple list of words.
- In some queries the terms must be ordered and adjacent; this is typically indicated by enclosing them in quotation marks.
- The standard way to evaluate phrase queries is to use an inverted index.
- An inverted index stores a list of postings for each term. Each posting includes a document ID and a list of offsets (ordinal word positions).
- A phrase query is evaluated by combining the postings lists of the query terms to find the documents in which the terms occur in order and adjacent.
- This process is usually fast, but not always, because of common words.
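The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the names `build_index` and `phrase_query`, the whitespace tokenizer, and the three toy documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [ordinal word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return {doc_id: [start positions]} where the words are ordered and adjacent."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Start from the postings of the first term, then narrow term by term.
    result = {d: list(p) for d, p in index[terms[0]].items()}
    for k, term in enumerate(terms[1:], start=1):
        postings = index[term]
        narrowed = {}
        for d, starts in result.items():
            if d not in postings:
                continue
            offsets = set(postings[d])
            # Keep a start position only if term k occurs exactly k words later.
            kept = [s for s in starts if s + k in offsets]
            if kept:
                narrowed[d] = kept
        result = narrowed
    return result

# Hypothetical toy collection.
docs = {
    1: "the tower of london is old",
    2: "london has a tower",
    3: "visit the tower of london today",
}
index = build_index(docs)
matches = phrase_query(index, "tower of london")
```

Note that the postings list for of must be read in full even though it contributes little to selectivity, which is the cost the slide attributes to common words.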

Introduction Cont.
- The inverted list for a common term requires several megabytes for each gigabyte of indexed data.
- A crude solution is stopping (ignoring common words).
- Google neglected common words in phrase queries until 2002; until then, many queries were evaluated incorrectly.

Introduction Cont.
- A nextword index is similar to an inverted index.
- Its index terms are word pairs: a firstword and a nextword.
- For each firstword, it stores a list of the nextwords that follow that term; the firstword and nextword occur as a pair.
- Its disadvantages are its storage size and the fact that the nextword lists must be processed linearly.
- Directly indexing the 10,000 most common phrase queries reduces query evaluation time by over 10%.

Next... Introduction (Fin) Properties of Phrase Queries Inverted Index in Phrase Queries Partial Phrase and Nextword Indexing Combining Phrase and Inverted Indexing Experimental Result Conclusion

Properties of Queries
- The research used Excite query logs from 1997 and 1999; the two logs have similar properties.
- Query counts include duplicates.
- 8.3% of the queries were explicit phrase queries; in total, 5-10% of queries are explicit phrases.
- Queries were matched against a Web dataset of around 20 GB.
- 8.4% of the phrase queries include one of the three common words the, to, and of.
- In total, 14.4% of the phrase queries include one of the 20 commonest terms.

Properties of Queries
- Common words play an important role.
- In tower of london, the common word of can be safely neglected during evaluation.
- In special names such as movie or band names, for example end of days or the who, it cannot; these queries are difficult to evaluate with stopwords removed.
- The query logs also include:
  - to be or not to be
  - who are we
  - all in all

Properties of Queries
- Stopping may yield efficiency gains, but a significant number of queries then cannot be evaluated correctly.
- Example: with stopping, the query tower of london is evaluated as tower - london.
- Stopping the 3 commonest words: 309 x 10^6 matches.
- Stopping the 20 commonest words: 490 x 10^6 matches.
- Stopping the 254 commonest words: 1693 x 10^6 matches.
- The most frequent confusion is between from and to: flights from london and flights to london are mismatched.
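A small sketch makes the mismatch concrete. The stopword list below is an illustrative assumption, not the list used in the paper, and `stop` is a hypothetical helper that simply drops stopwords from a query.

```python
# Illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "to", "of", "from", "in", "on"}

def stop(query):
    """Return the query terms that survive stopping, in order."""
    return tuple(w for w in query.lower().split() if w not in STOPWORDS)

pairs = [
    ("flights from london", "flights to london"),
    ("man in the moon", "man on the moon"),
]
# Two different queries collide when they reduce to the same stopped form.
collisions = [(a, b) for a, b in pairs if stop(a) == stop(b)]
```

After stopping, both queries in each pair reduce to the same term sequence, so an index built without the stopped words cannot tell them apart.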

Properties of Queries
- Other examples of mismatches:
  - so many roads -> how many roads
  - man in the moon -> man on the moon
- Length of the phrase queries:
  - Generally 2 words.
  - 34% contain 3 words.
  - 1.3% contain 6 or more words.

Properties of Queries
- Test data: the WT10g collection.
- It consists of Web data (HTML) comprising 1.67 million documents.
- It was crawled in 1997.

Most Frequent Words and Word Pairs

Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries Partial Phrase and Nextword Indexing Combining Phrase and Inverted Indexing Experimental Result Conclusion

Inverted Index
- It is a standard method for supporting queries on large text databases.
- It is fast for ranked query evaluation.
- It uses a two-level structure: the upper level is a vocabulary (lexicon); the lower level is a set of postings lists.
- Notation of Zobel and Moffat (1998):
  - d is the document ID.
  - f_{d,t} is the frequency of term t in document d.
  - o_x is a position of the term in document d.

Inverted Index
- Example: "hatful of hollow".
- This is the general structure of an inverted index.
- It contains term and document frequencies.
- Word positions are ordinal.
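Since the slide's figure is not preserved, the following sketch shows what the postings for one term might look like, using a document identifier, an in-document frequency, and a list of ordinal positions. The function `postings`, the two toy documents, and the choice to count positions from 1 are assumptions made for illustration.

```python
def postings(docs, term):
    """Return (f_t, [(d, f_dt, [offsets]), ...]) for a single term.

    f_t is the number of documents containing the term; each entry holds
    the document ID d, the in-document frequency f_dt, and the ordinal
    word positions (counted from 1 here, which is an assumption).
    """
    entries = []
    for d in sorted(docs):
        words = docs[d].lower().split()
        offsets = [i + 1 for i, w in enumerate(words) if w == term]
        if offsets:
            entries.append((d, len(offsets), offsets))
    return len(entries), entries

# Hypothetical toy collection.
docs = {
    1: "hatful of hollow",
    2: "a hatful of rain and a hatful of hope",
}
hatful = postings(docs, "hatful")
of = postings(docs, "of")
```

Each tuple mirrors one posting: the document ID, the term's frequency in that document, and the positions at which it occurs.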

Inverted Index
- Evaluator: the open-source MG text retrieval engine, described by Witten et al. (1999).
- The inverted index for WT10g is 1,429 MB.
- The postings for the stopped words (490 stopwords) take 427 MB.
- The stopped inverted index is 1,002 MB.

Inverted Index
- Performance results for the inverted index.

Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing Combining Phrase and Inverted Indexing Experimental Result Conclusion

Phrase Indexes
- A phrase index is an inverted index in which the stored items are word sequences.
- Example: a partial phrase index with a vocabulary of five popular phrases.

Phrase Indexes
- A phrase index with L = 3 (phrases of three words) cannot be used efficiently for 2-word queries.
- Phrases with L >= 2 are stored as terms in a conventional inverted index.
- L = 2 is organized as a partial nextword index.
- Partial phrase index notation: d is the document ID; f_{d,p} is the frequency of phrase p in document d.
- Offsets are not stored.
- Storing complete phrases saves the cost of merging lists.
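A partial phrase index of this kind can be sketched as a dictionary from a popular phrase to a list of (document ID, frequency) pairs, with no offsets. The names `build_phrase_index` and `lookup`, the toy documents, and the two-phrase vocabulary are assumptions for illustration only.

```python
def build_phrase_index(docs, popular_phrases):
    """Map each popular phrase to [(d, f_dp), ...]; offsets are not stored."""
    index = {}
    for phrase in popular_phrases:
        words = phrase.lower().split()
        n = len(words)
        entries = []
        for d in sorted(docs):
            tokens = docs[d].lower().split()
            # Count the positions at which the whole phrase occurs.
            f_dp = sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)
            if f_dp:
                entries.append((d, f_dp))
        index[" ".join(words)] = entries
    return index

def lookup(index, query):
    """Return the stored postings, or None if the phrase is not in the vocabulary."""
    return index.get(" ".join(query.lower().split()))

# Hypothetical toy collection and vocabulary.
docs = {
    1: "lord of the rings and more lord of the rings",
    2: "britney spears tour",
    3: "rings of the lord",
}
phrase_index = build_phrase_index(docs, ["lord of the rings", "britney spears"])
```

A query that exactly matches a stored phrase is answered with a single lookup and no list merging; any other query returns `None` and has to be evaluated with another index.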

Phrase Indexes
- Examples from 2001: lord of the rings (19) and britney spears (59); the number in parentheses is the number of times the same query was requested.
- Given a stream of queries over a long period and a fixed volume of memory, it may also be necessary to update the vocabulary or replace the least frequently used queries.
- This research does not experiment with that approach.

Nextword Indexes
- A phrase query can never be shorter than two words.
- A nextword index is similar to an inverted index.
- Term representation:
  - f_{wp} is the document frequency.
  - d is the document ID.
  - f_{d,wp} is the frequency of the term in d.
  - o_x is a position of the term in d.
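As a rough sketch of the structure, a nextword index can be modeled as a nested mapping from firstword to nextword to positional postings. The layout, the names `build_nextword_index` and `pair_postings`, and the toy documents are assumptions, not the on-disk format from the paper.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """firstword -> nextword -> {doc_id: [positions of the firstword]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for d, text in docs.items():
        tokens = text.lower().split()
        for pos in range(len(tokens) - 1):
            first, nxt = tokens[pos], tokens[pos + 1]
            index[first][nxt].setdefault(d, []).append(pos)
    return index

def pair_postings(index, first, nxt):
    """Return the postings of the word pair, or {} if the pair never occurs."""
    if first not in index or nxt not in index[first]:
        return {}
    return index[first][nxt]

# Hypothetical toy collection.
docs = {
    1: "the who live in london",
    2: "who are the who",
}
nextword_index = build_nextword_index(docs)
the_who = pair_postings(nextword_index, "the", "who")
```

The length of the returned mapping is the pair's document frequency, and each position list gives the in-document frequency and offsets. Because every adjacent pair gets its own list, the structure is considerably larger than an inverted index over single terms.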

Nextword Indexes
- Example structure: a nextword index with two firstwords.
- Example query: boulder municipal employee credit union can be evaluated with the pairs boulder-municipal, employee-credit, and credit-union.
- Another example: historical railroads in new hampshire. The pair railroads-in is used in preference to in-new, because railroads is much less common than in.
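One way to arrive at such groupings is a greedy choice that prefers pairs with rare firstwords until every query word is covered. The sketch below is an assumption about how this could be done, not necessarily the paper's exact procedure; `choose_pairs` and all frequency values are hypothetical.

```python
def choose_pairs(words, freq):
    """Pick adjacent word pairs covering every query word, rarest firstword first."""
    # Candidate i stands for the pair (words[i], words[i + 1]).
    candidates = sorted(range(len(words) - 1), key=lambda i: (freq[words[i]], i))
    covered, chosen = set(), []
    for i in candidates:
        if len(covered) == len(words):
            break
        # Take the pair only if it covers at least one uncovered word.
        if i not in covered or i + 1 not in covered:
            chosen.append(i)
            covered.update((i, i + 1))
    return [(words[i], words[i + 1]) for i in sorted(chosen)]

# Hypothetical collection frequencies for the two example queries.
freq1 = {"boulder": 5, "municipal": 60, "employee": 20, "credit": 40, "union": 30}
freq2 = {"historical": 30, "railroads": 10, "in": 1000, "new": 500, "hampshire": 20}

pairs1 = choose_pairs("boulder municipal employee credit union".split(), freq1)
pairs2 = choose_pairs("historical railroads in new hampshire".split(), freq2)
```

With these assumed frequencies, the first query yields the three pairs named on the slide, and the second query uses railroads-in while never needing the expensive in-new pair.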

Nextword Indexes
- The nextword index for the WT10g collection is 2.75 GB in size, roughly twice the size of the inverted index file.
- Processing with the nextword index involves more complex structures than processing with the inverted index.
- Differences between the inverted index and the nextword index in queries.

Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing (Fin) Combining Phrase and Inverted Indexing Experimental Result Conclusion

Combining Nextword and Inverted Indexing
- The proposal: common words are used only as firstwords in a partial nextword index.
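A simple way to picture this combination is a planner that assigns each query word to one of the two structures. The sketch below is an assumption about how such a plan could be formed: `plan` is a hypothetical function, the set of common words is illustrative, and it assumes the inverted index is available as a fallback for any word not covered by a pair.

```python
def plan(words, common):
    """Return a list of (index_name, position, lookup_key) steps for a phrase query."""
    steps, covered = [], set()
    n = len(words)
    # A common word is looked up as a firstword, paired with the word after it.
    for i, w in enumerate(words):
        if w in common and i + 1 < n:
            steps.append(("nextword", i, (w, words[i + 1])))
            covered.update((i, i + 1))
    # Every word not covered by a pair is fetched from the inverted index.
    for i, w in enumerate(words):
        if i not in covered:
            steps.append(("inverted", i, w))
    return sorted(steps, key=lambda s: s[1])

common = {"the", "to", "of"}  # illustrative set of common words
plan_flights = plan("flights to london".split(), common)
plan_who = plan("the who".split(), common)
```

In this sketch the long postings list for to is never read on its own; only the much shorter list for the pair to-london is needed, alongside the inverted list for flights.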

Combining Phrase and Inverted Indexing
- Example: the query new york city can be resolved by using the partial phrase index to find the locations of new york, then merging them with the inverted index postings list for city.

Three-Way Index Combination
- It includes a partial nextword index, a partial phrase index, and a full inverted index.
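The order in which the three structures might be consulted can be sketched as follows. This is an assumed dispatch rule for illustration, not the paper's algorithm: `choose_indexes`, the phrase vocabulary, and the set of common words are all hypothetical.

```python
def choose_indexes(query, phrase_vocab, common):
    """Return the names of the indexes a query would touch, in lookup order."""
    words = query.lower().split()
    # A query stored verbatim in the phrase index needs nothing else.
    if " ".join(words) in phrase_vocab:
        return ["phrase"]
    covered = set()
    for i in range(len(words) - 1):
        if words[i] in common:
            covered.update((i, i + 1))  # handled by the pair (words[i], words[i + 1])
    used = []
    if covered:
        used.append("nextword")
    if len(covered) < len(words):
        used.append("inverted")
    return used

phrase_vocab = {"lord of the rings"}  # illustrative popular phrases
common = {"the", "of", "to"}          # illustrative common words
```

For example, `choose_indexes("tower of london", phrase_vocab, common)` touches the nextword index for of-london and the inverted index for tower, while a query stored in the phrase vocabulary is answered from the phrase index alone.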

Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing (Fin) Combining Phrase and Inverted Indexing (Fin) Experimental Result Conclusion

Experimental Result
- All experiments were run on an Intel 700 MHz Pentium III based server with 2 GB of memory.
- Result of Inverted and Nextword Indexing: this table includes the memory usage of the combinations.

Result of Inverted and Nextword Indexing
- Results for n-term queries with inverted and nextword indexing.

Result of Inverted Index and Phrase
- Tests used sets of the most frequent distinct queries, starting with the 100 and 1,000 most frequent.
- The phrase index was less than 0.1% of the collection.
- Phrase index sizes: 2.1 MB, 4.8 MB, and 12.8 MB.
- Common queries in the logs include an american dictionary of the english language and los angeles department of water and power.

Result of Inverted Index, Nextword Index and Phrase
- These results come from testing queries with a phrase index for common queries, a nextword index (stopped words only), and an inverted index.

Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing (Fin) Combining Phrase and Inverted Indexing (Fin) Experimental Result(Fin) Conclusion