Efficiently Encoding Term Co-occurrences in Inverted Indexes
Date: 2012/3/5  Source: Marcus Fontoura et al. (CIKM'11)  Advisor: Jia-ling Koh  Speaker: Jiun-Jia Chiou

Outline:
- Introduction
- Indexing and query evaluation strategies
- Cost function
- Index construction
- Query evaluation
- Experimental results
- Conclusion

Precomputation of common term co-occurrences has been successfully applied to improve query performance in large-scale search engines based on inverted indexes. Inverted indexes have been successfully deployed to solve scalable retrieval problems where documents are represented as bags of terms. Each term t is associated with a posting list, which encodes the documents that contain t.

Example: inverted index
D0 = "it is what it is"
D1 = "what is it"
D2 = "it is a banana"

Inverted index (term -> documents):
"a": Document 2
"banana": Document 2
"is": Documents 0, 1, 2
"it": Documents 0, 1, 2
"what": Documents 0, 1

A search for the terms "what", "is" and "it" returns {0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}.
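
A minimal Python sketch of this toy example (not from the paper): build the index over D0-D2 and intersect the posting lists of the query terms.

```python
# Minimal sketch of the example above: build an inverted index over D0-D2 and
# intersect the posting lists for the query "what is it".
from collections import defaultdict

docs = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
}

index = defaultdict(set)                  # term -> set of docids containing it
for docid, text in docs.items():
    for term in text.split():
        index[term].add(docid)

query = ["what", "is", "it"]
result = set.intersection(*(index[t] for t in query))
print(sorted(result))                     # -> [0, 1]
```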

For a selected set of terms in the index, we store bitmaps that encode term co-occurrences.
- Bitmap: a bitmap of size k for term t augments each posting of t to store the co-occurrences of t with k other terms, across every document in the index.
- Precomputed list: contains only the docids; it is typically shorter, but can only be used to evaluate queries containing all of its terms.

(Figure: a precomputed list, and an index with bitmaps (size = 2, k = 2) for the terms York and Hall, built for an example query workload.)
Had each of these term combinations been represented by a separate posting list, the memory cost, as well as the complexity of picking the right combinations during query evaluation, would have become prohibitive.

Main contributions:
1) Introduce the concept of bitmaps as a flexible way to store term co-occurrences.
2) Define the problem of selecting terms to precompute given a query workload and a memory budget, and propose an efficient solution for it.
3) Show that bitmaps and precomputed lists complement each other, and that the combination significantly outperforms each technique individually.
4) Present experimental results over the TREC WT10g corpus demonstrating the benefits of the approach in practice.

Posting: 〈docid, payload〉 represents the occurrence of a term within a document.
- docid: the document identifier.
- payload: stores arbitrary information about each occurrence of the term within the document; part of the payload is used to store the co-occurrence bitmaps.

Basic operations on posting lists:
1. first(): returns the list's first posting.
2. next(): returns the next posting, or signals the end of the list.
3. search(d): returns the first posting with docid ≥ d, or end of list if no such posting exists. This operation is typically implemented efficiently using the posting list's index structure.
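
A small sketch of a docid-sorted posting list supporting these three operations (the class and field names are illustrative; real indexes use skip pointers or block indexes to make search(d) fast):

```python
import bisect

class PostingList:
    """Docid-sorted posting list with first(), next() and search(d)."""

    def __init__(self, postings):
        self.postings = sorted(postings, key=lambda p: p[0])   # (docid, payload)
        self.docids = [d for d, _ in self.postings]
        self.pos = 0

    def first(self):
        self.pos = 0
        return self.postings[0] if self.postings else None

    def next(self):
        self.pos += 1
        return self.postings[self.pos] if self.pos < len(self.postings) else None

    def search(self, d):
        # First posting with docid >= d, or None at end of list.  bisect stands
        # in for the skip/index structure a real implementation would use.
        self.pos = bisect.bisect_left(self.docids, d, lo=self.pos)
        return self.postings[self.pos] if self.pos < len(self.postings) else None
```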

For a conjunctive query q = t1 t2 … tn, a search algorithm returns R.
- R: the set of docids of all documents that match all terms t1, t2, …, tn.
- L1, L2, …, Ln: the posting lists of terms t1, t2, …, tn.
Goal: iterate over the candidate documents from the shortest list and check whether each candidate also appears in all of the other lists.
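
A sketch of this evaluation strategy, reusing the PostingList class from the previous sketch: drive from the shortest list and probe every other list with search(d).

```python
def conjunctive_search(posting_lists):
    """Return docids present in every list, driving from the shortest list."""
    ordered = sorted(posting_lists, key=lambda pl: len(pl.postings))
    shortest, others = ordered[0], ordered[1:]
    result = []
    posting = shortest.first()
    while posting is not None:
        d = posting[0]
        # the candidate matches only if search(d) lands exactly on d in every other list
        if all((hit := pl.search(d)) is not None and hit[0] == d for pl in others):
            result.append(d)
        posting = shortest.next()
    return result

# Example with the lists from the earlier toy index (payloads unused here):
what = PostingList([(0, None), (1, None)])
is_  = PostingList([(0, None), (1, None), (2, None)])
it   = PostingList([(0, None), (1, None), (2, None)])
print(conjunctive_search([what, is_, it]))   # -> [0, 1]
```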

(Figure: posting lists L1-L5 for Hall, York, New, City, and New York.)
Query: "New York City Hall"
Result: R = {Document 2 (docid = 2)}

Cost function: measure the lengths of the accessed posting lists and the evaluation time for each query. The cost focuses on:
1) the shortest list length |L1|;
2) the random access cost 12 + log|Li| for each accessed list Li.
Suppose terms t1 and t2 frequently occur together as a subquery and |L1| ≤ |L2|.

Posting lists L1-L4: Hall, York, New, City.
Query 1: "New York"            F(q1) = 4*[(12+log4)+(12+log5)]
Query 2: "New York City"       F(q2) = 4*[(12+log4)+(12+log5)+(12+log5)]
Query 3: "New York City Hall"  F(q3) = 3*[(12+log3)+(12+log4)+(12+log5)+(12+log5)]
Query 4: "New City Hall"       F(q4) = 3*[(12+log3)+(12+log5)+(12+log5)]
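
A sketch of the cost formula as written on these slides. The log base and the exact list lengths are assumptions, chosen so that F(q1) and F(q3) match the expressions above.

```python
from math import log2

# Hypothetical list lengths consistent with the expressions above.
length = {"Hall": 3, "New": 4, "York": 5, "City": 5}

def F(query_terms):
    """Cost of a conjunctive query: |shortest list| * sum of random-access costs."""
    sizes = [length[t] for t in query_terms]
    return min(sizes) * sum(12 + log2(n) for n in sizes)

print(F(["New", "York"]))                  # F(q1) = 4*[(12+log4)+(12+log5)]
print(F(["New", "York", "City", "Hall"]))  # F(q3) = 3*[(12+log3)+(12+log4)+2*(12+log5)]
```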

Precomputed list: store the co-occurrences of t1 t2 as a new term t12. The size of t12's list is exactly |L1 ∩ L2|.
Advantages: (1) reduces the number of posting lists accessed during query evaluation; (2) reduces the size of these lists.
Bitmaps: add a bit to the payload of each posting in L1; the bit is 1 if the document also contains t2, and 0 otherwise. This allows the query evaluation algorithm to avoid accessing L2, cutting the second component of the cost function.
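
An illustrative sketch of the bitmap idea (the posting data is made up): with a t2-bit stored in t1's payload, the query "t1 t2" can be answered from L1 alone.

```python
# Postings of t1: (docid, payload bitmap).  Bit 0 records "document also contains t2".
L1 = [
    (0, 0b01),
    (3, 0b00),
    (7, 0b01),
]
T2_BIT = 0   # position of t2's bit inside t1's co-occurrence bitmap

# Evaluate the query "t1 t2" without ever touching L2.
matches = [docid for docid, bits in L1 if (bits >> T2_BIT) & 1]
print(matches)   # -> [0, 7]
```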

Bitmap space cost: the extra space required for adding a bit for term tj to term ti's list is exactly |Li|, since every posting in Li grows by one bit.
Example: terms New, York, City with |L_New| ≥ |L_City| ≥ |L_York|, and queries "New York", "City York", "New York City".
Case 1: no previous bitmaps exist. Adding a bit for term New to City's posting list improves the evaluation of query "New York City": its cost drops from |L_York|(G(|L_New|) + G(|L_City|)) to |L_York| G(|L_City|), where G(·) denotes the random access cost from the cost function above.
Case 2: the list York already has bits for terms New and City; the total latency would be |L_York|.
Define B as the association matrix: b_ij = 1 if there is a bit for term tj in list Li's bitmap. In the example above, b_{City,New} = 1.

Given a set of bitmaps B and a query q:
- F(B, q): the latency of evaluating q with the bitmaps indicated by B.
- S: the total space available for storing extra information.
- Q = {q1, q2, …}: the query workload.
1. Consider the benefit of an extra bitmap bit b_ij when a previous set B has already been selected; this is exactly the cost reduction F(B, q) − F(B ∪ {b_ij}, q).
2. When a larger set B′ ⊇ B has already been selected, the benefit is F(B′ ∪ {b_ij}, q) − F(B′, q).
The greedy selection computes, for each candidate bit, the ratio of its benefit over the workload to the increase in index size, and repeatedly adds the bit with the highest ratio.
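
A sketch of this greedy selection under a space budget S. F is treated as a black-box cost function in the spirit of the definition above; the candidate bits, list lengths and workload are whatever the caller supplies (nothing here is the paper's actual code).

```python
def select_bitmaps(candidates, workload, list_len, F, budget_bits):
    """Greedily pick bitmap bits b_ij = (i, j) with the best benefit/space ratio.

    candidates : set of (i, j) pairs, a bit for term j stored in list L_i
    workload   : iterable of queries
    list_len   : dict mapping i -> |L_i| (extra bits needed to add a bit to L_i)
    F          : F(B, q), latency of q given the selected bit set B
    """
    B, used = set(), 0
    while True:
        best, best_ratio = None, 0.0
        for b in candidates - B:
            i, _j = b
            if used + list_len[i] > budget_bits:
                continue                                   # would not fit
            gain = sum(F(B, q) - F(B | {b}, q) for q in workload)
            ratio = gain / list_len[i]                     # lambda_ij
            if ratio > best_ratio:
                best, best_ratio = b, ratio
        if best is None:
            return B                                       # no affordable, useful bit left
        B.add(best)
        used += list_len[best[0]]
```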

L1: Hall's posting list
L2: York's posting list
L3: New's posting list
L4: City's posting list
Candidate bitmap configurations B (space cost in bits): L_New; L_New + York; L_New + City; L_New + City + York.

(Figure: posting lists with bitmaps: L1 = Hall {New, City}, L2 = York {New, City}, L3 = New, L4 = City.)
Query q1: "New York City Hall"
Query q2: "New York City"

Candidate: add a bit for York to L1, giving L1 = Hall {New, City, York}, L2 = York {New, City}, L3 = New, L4 = City.
Query q1: "New York City Hall"
Query q2: "New York City"
F(B ∪ {b_L1,York}, q1) = 3 (7)
F(B ∪ {b_L1,York}, q2) = 3 (3)
λ_L1,York = [(7-3)+(3-3)]/3 = 4/3
(The number in parentheses is the cost without the added bit.)

Candidate: add a bit for Hall to L2, giving L1 = Hall {New, City}, L2 = York {New, City, Hall}, L3 = New, L4 = City.
Query q1: "New York City Hall"
Query q2: "New York City"
F(B ∪ {b_L2,Hall}, q1) = 4 (7)
F(B ∪ {b_L2,Hall}, q2) = 4 (4)
λ_L2,Hall = [(7-4)+(4-4)]/4 = 3/4

Precomputed lists: given a set of precomputed lists P = {p_ij}, where p_ij is the indicator variable representing whether the results of query t_i t_j were precomputed.
- F(P, q): the cost of evaluating query q given P.
Adding an extra precomputed list p_ij to P can only reduce F, but at the cost of storing a new list of size |L_i ∩ L_j|. The greedy procedure selects the precomputed list p_ij that maximizes λ'_ij, the analogous benefit-to-space ratio.
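
The analogous selection for precomputed lists, sketched the same way as the bitmap case: the gain of a candidate p_ij is normalized by the size of the new list |L_i ∩ L_j| (again a sketch under the stated assumptions, not the paper's code).

```python
def precomputed_list_ratio(P, p_ij, workload, F, intersection_len):
    """lambda'_ij: workload cost reduction per posting of the new list L_i ∩ L_j."""
    gain = sum(F(P, q) - F(P | {p_ij}, q) for q in workload)
    return gain / intersection_len[p_ij]

# The greedy loop mirrors select_bitmaps(): repeatedly add the p_ij with the
# largest ratio until the memory budget is exhausted.
```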

(Figure: posting lists L1-L4: Hall, York, New, City, plus the precomputed list New York; candidate precomputed list: New City.)
Query q1: "New York City Hall"
Query q2: "New York City"
Query q3: "New City Hall"
F(P ∪ {p_NewCity}, q1) = 3*[(12+log3)+(12+log3)]
F(P ∪ {p_NewCity}, q2) = 3*[(12+log3)]
F(P ∪ {p_NewCity}, q3) = 3*[(12+log3)]
λ'_NewCity = [(3log5-3log3)+(3log5-3log3)+(3log5-log3)]/3

(Figure: posting lists L1-L4: Hall, York, New, City, with the precomputed lists New York and York City; candidate precomputed list: York City.)
Query q1: "New York City Hall"
Query q2: "New York City"
Query q3: "New City Hall"
F(P ∪ {p_YorkCity}, q1) = 2*[(12+log3)+(12+log3)]
F(P ∪ {p_YorkCity}, q2) = 2*[(12+log3)]
F(P ∪ {p_YorkCity}, q3) = 2*[(12+log3)+(12+log3)]
λ'_YorkCity = [(24-log3+3log5)+(12-2log3+3log5)+(3log5-log3)]/2


(Figure: posting lists L1 = Hall {New, City}, L2 = York {New, City}, L3 = New, L4 = City, with precomputed lists L5 = New York {City} and L6 = New City {Hall}.)
Query q1: "New York City Hall"
Query q2: "New York City"
Query q3: "New City Hall"
F(P ∪ {p_NewCity}, q1) = 3*[(12+log3)+(12+log3)]
F(P ∪ {p_NewCity}, q2) = 3*[(12+log3)]
F(P ∪ {p_NewCity}, q3) = 3*[(12+log3)]
λ'_NewCity = [(3log5-3log3)+(3log5-3log3)+(3log5-log3)]/3
Normalize: λ'_NewCity / …

Candidate: add a bit for Hall to L6 (the New City precomputed list). Lists: L1 = Hall {New, City}, L2 = York {New, City}, L3 = New, L4 = City, L5 = New York {City}, L6 = New City {Hall}.
Query q1: "New York City Hall"
Query q2: "New York City"
Query q3: "New City Hall"
F(B ∪ {b_L6,Hall}, q1) = 3+3 = 6 (6)
F(B ∪ {b_L6,Hall}, q2) = 3 (3)
F(B ∪ {b_L6,Hall}, q3) = 3 (6)
λ_L6,Hall = [(6-6)+(3-3)+(6-3)]/3 = 1
Normalize: 1/1 = 1

Bitmaps:
Goal: find a subset of the lists that minimizes the query cost, i.e., find L ⊆ {L1, L2, …, Ln} that covers q and minimizes F(B, q).
L covers the query q ↔ every query term is accounted for: for each term t_i of q, either its own list L_i is in L, or some list in L carries a bitmap bit for t_i.

(Figure: Query "New York City Hall"; lists L1 = New, L2 = York {New, City}, L3 = City, L4 = Hall {New, City}.)
Greedy rewrite steps:
Step 1 (New):  L = {L1};         marked: New;                   unmarked: York, City, Hall
Step 2 (York): L = {L1, L2};     marked: New, York, City;       unmarked: Hall
Step 3 (City): L = {L1, L2};     marked: New, York, City;       unmarked: Hall
Step 4 (Hall): L = {L1, L2, L4}; marked: New, York, City, Hall; unmarked: (none)
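
A sketch of the greedy marking illustrated by this table. The list-to-term assignment below is inferred from the table (L1 = New, L2 = York with bits for New and City, L3 = City, L4 = Hall with bits for New and City) and is illustrative only.

```python
lists = {                      # list name -> (its own term, terms in its bitmap)
    "L1": ("New",  set()),
    "L2": ("York", {"New", "City"}),
    "L3": ("City", set()),
    "L4": ("Hall", {"New", "City"}),
}
query = ["New", "York", "City", "Hall"]

chosen, marked = [], set()
for term in query:
    if term in marked:                       # already covered by an earlier list
        continue
    name = next(n for n, (own, _) in lists.items() if own == term)
    own, bits = lists[name]
    chosen.append(name)
    marked |= {own} | bits                   # the list covers its term + bitmap terms
print(chosen)                                # -> ['L1', 'L2', 'L4'], as in the table
```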

Precomputed lists:
Goal: find the set of lists that minimizes the cost function and jointly covers all of the query terms.

(Figure: Query "New York City Hall"; single-term lists New, York, City, Hall plus the precomputed lists New York and New City.)
Greedy cover steps:
Step 1 (New):  L = {L_New, L_NewYork, L_NewCity};         marked: New, York, City;       unmarked: Hall
Step 2 (York): L = {L_New, L_NewYork, L_NewCity};         marked: New, York, City;       unmarked: Hall
Step 3 (City): L = {L_New, L_NewYork, L_NewCity};         marked: New, York, City;       unmarked: Hall
Step 4 (Hall): L = {L_New, L_NewYork, L_NewCity, L_Hall}; marked: New, York, City, Hall; unmarked: (none)

Hybrid:
1. Invokes Algorithm 3 to identify precomputed lists, minimizing |L1| (the length of the shortest list).
2. Invokes Algorithm 2 to remove those of these lists that are covered by bitmaps in shorter lists.

Experimental setup:
- Report in-memory list access latencies measured after query rewrite and after preloading all posting lists into memory, averaged over several runs.
- Indexed the TREC WT10g corpus consisting of 1.68 million web pages.
- Built an inverted index where each posting contains a four-byte docid and a variable-size payload containing bitmaps.
- Used the AOL query log: sorted all of the queries according to their timestamps and discarded queries containing non-alphanumeric characters, as well as all additional information contained in the log beyond the query strings.

The resulting 23.6M queries were split into training and testing sets.
- Training set: 21M queries from the AOL log, spanning 2.5 months.
- Testing set: 2.6M queries, spanning the following two weeks.
(Figure: the ratio between the average query latency when using the index with precomputed results and the average latency using the original index; labeled values: 32% and 53%.)

Evaluated two strategies of allocating a shared memory budget for bitmaps and precomputed lists:
(1) allocating a fixed fraction of the memory budget to each, first selecting precomputed lists and then bitmaps;
(2) selecting bitmaps and precomputed lists simultaneously using the hybrid algorithm.
(Figure: the ratio between the average query latency when using the index with precomputed results and the average latency using the original index.)

Minimum relative intersection size (MRIS):
Definition: for each query of at least two terms, the ratio of the size of the shortest list resulting from an intersection of two query terms to the size of the shortest single-term list.
MRIS captures the potential benefit of adding the optimal precomputed list of two terms for this particular query.
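
Written as a formula (this restatement and the notation are ours, with L_i denoting the posting list of query term t_i):

```latex
% MRIS as defined above: shortest pairwise intersection over shortest single list.
\mathrm{MRIS}(q) =
  \frac{\min_{t_i, t_j \in q,\; i \neq j} \lvert L_i \cap L_j \rvert}
       {\min_{t_i \in q} \lvert L_i \rvert}
```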

(Figure: the average query latency as a function of the precomputation budget, from 0% (the original index without precomputation) to 300% (precomputed results occupy 3/4 of the index).)

Effect of precomputation on long-tail queries: all queries in the test set that did not appear in the training set.
(Figure: the latency of all queries compared to that of the long-tail queries, with and without precomputation; labeled values: 22% and 33%.)

Query rewrite performance: evaluate how well the greedy query rewrite algorithm performs compared to the optimal rewrite. The optimal query rewrite is found by evaluating the cost function on all possible rewrites given the index and selecting the one with the lowest cost.

Conclusion:
- Introduced the concept of bitmaps for optimizing query evaluation over inverted indexes.
- Bitmaps allow for a flexible way of storing information about term co-occurrences and complement the traditional approach of precomputed lists.
- Proposed a greedy procedure for selecting bitmaps and precomputed lists that is a constant-factor approximation to the optimal algorithm.
- The analysis of bitmaps and precomputed lists over the TREC WT10g corpus shows that the hybrid approach achieves a 25% query performance improvement for a 3% growth in index size, and 71% for a 4-fold index size increase.

Thank you for listening!