Engineering a Set Intersection Algorithm for Information Retrieval Alex Lopez-Ortiz UNB / InterNAP Joint work with Ian Munro and Erik Demaine.

Slides:



Advertisements
Similar presentations
CSE Lecture 3 – Algorithms I
Advertisements

Efficient Keyword Search for Smallest LCAs in XML Database Yu Xu Department of Computer Science & Engineering University of California, San Diego Yannis.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Sorting and Searching. Searching List of numbers (5, 9, 2, 6, 3, 4, 8) Find 3 and tell me where it was.
1 Advanced Data Structures. 2 Topics Data structures (old) stack, list, array, BST (new) Trees, heaps, union-find, hash tables, spatial, string Algorithm.
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
Retrieval Evaluation. Brief Review Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Chapter 3: The Efficiency of Algorithms Invitation to Computer Science, C++ Version, Fourth Edition.
A Fast Regular Expression Indexing Engine Junghoo “John” Cho (UCLA) Sridhar Rajagopalan (IBM)
1 Summary of lectures 1.Introduction to Algorithm Analysis and Design (Chapter 1-3). Lecture SlidesLecture Slides 2.Recurrence and Master Theorem (Chapter.
Fast Set Intersection in Memory Bolin Ding Arnd Christian König UIUC Microsoft Research.
§6 B+ Trees 【 Definition 】 A B+ tree of order M is a tree with the following structural properties: (1) The root is either a leaf or has between 2 and.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
ITEC 2620A Introduction to Data Structures
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.
Chapter 3 Sec 3.3 With Question/Answer Animations 1.
CS 162 Intro to Programming II Searching 1. Data is stored in various structures – Typically it is organized on the type of data – Optimized for retrieval.
SEARCHING. Vocabulary List A collection of heterogeneous data (values can be different types) Dynamic in size Array A collection of homogenous data (values.
Efficient P2P Searches Using Result-Caching From U. of Maryland. Presented by Lintao Liu 2/24/03.
Qingqing Gan Torsten Suel CSE Department Polytechnic Institute of NYU Improved Techniques for Result Caching in Web Search Engines.
Date: 2012/3/5 Source: Marcus Fontouraet. al(CIKM’11) Advisor: Jia-ling, Koh Speaker: Jiun Jia, Chiou 1 Efficiently encoding term co-occurrences in inverted.
CSC 211 Data Structures Lecture 13
Search Engines.
Conjunctive Filter: Breaking the Entropy Barrier Daisuke Okanohara *1, *2 Yuichi Yoshida *1*3 *1 Preferred Infrastructure Inc. *2 Dept. of Computer Science,
Algorithm Analysis CS 400/600 – Data Structures. Algorithm Analysis2 Abstract Data Types Abstract Data Type (ADT): a definition for a data type solely.
Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
1 Information Retrieval LECTURE 1 : Introduction.
Hashing 1 Hashing. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
K-tree/forest: Efficient Indexes for Boolean Queries Rakesh M. Verma and Sanjiv Behl University of Houston
1/16/20161 Introduction to Graphs Advanced Programming Concepts/Data Structures Ananda Gunawardena.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Efficient.
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
Keyword search on encrypted data. Keyword search problem  Linux utility: grep  Information retrieval Basic operation Advanced operations – relevance.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
On the Intersection of Inverted Lists Yangjun Chen and Weixin Shen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba,
CS315 Introduction to Information Retrieval Boolean Search 1.
1 Keyword Search over XML. 2 Inexact Querying Until now, our queries have been complex patterns, represented by trees or graphs Such query languages are.
Lecture 2 Algorithm Analysis
CMPT 438 Algorithms.
Why indexing? For efficient searching of a document
Large Scale Search: Inverted Index, etc.
An Efficient Algorithm for Incremental Update of Concept space
Database Management System
Summary of lectures Introduction to Algorithm Analysis and Design (Chapter 1-3). Lecture Slides Recurrence and Master Theorem (Chapter 4). Lecture Slides.
RE-Tree: An Efficient Index Structure for Regular Expressions
Sorting by Tammy Bailey
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Chapter 12: Query Processing
Spatio-temporal Pattern Queries
Advanced Associative Structures
Chapter 3: The Efficiency of Algorithms
Lectures 4: Skip Pointers, Phrase Queries, Positional Indexing
ITEC 2620M Introduction to Data Structures
Overview of Query Evaluation
Hashing.
Information Retrieval and Web Design
Efficient Aggregation over Objects with Extent
Presentation transcript:

Engineering a Set Intersection Algorithm for Information Retrieval Alex Lopez-Ortiz UNB / InterNAP Joint work with Ian Munro and Erik Demaine

Overview Web Search Engine Basics Algorithms for set operations Theoretical Analysis Experimental Analysis Engineering an Improved Algorithm Conclusions

Web Search Engine Basics Crawl: sequential gathering process Document ID (DocID) for each web page Cool sites: SIGIR SIGACT SIGCOMM SIGIR SIGCOMM SIGACT

Indexing: List of entries of type E.g. SIGCOMM Cool sites: SIGIR SIGACT SIGCOMM SIGIRSIGACT

Postings set: Set of docID’s containing a word or pattern. SIGACT {1,3} SIGCOMM {1,4} SIGCOMM Cool sites: SIGIR SIGACT SIGCOMM SIGIRSIGACT

Search Engine Basics (cont.) Postings set stored implicitly/explicitly in a string matching data structure PAT tree/array Inverted word index Suffix trees KMP (grep)...

String Matching Problem Different performance characteristics for each solution Time/Space tradeoff (empirical) Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

Search Engine Basics (cont.) A user query is of the form: keyword 1  keyword 2  …  keyword n where  is one of { and,or } E.g. computer and science or internet

Evaluating a Boolean Query The interpretation of a boolean query is the mapping: keyword postings set and  (set intersection) or  (set union) E.g. {computer}  {science}  {internet }

Set Operations for Web Search Engines Average postings set size > 10 million Postings set are sorted

Intersection Time Complexity Worst case linear on size of postings sets: Θ(n) {1,3,5,7}  {1,3,5,7} On size of output? {1,3,5,7}  {2,4,6,8}

Adaptive Algorithms Assume the intersection is empty. What is the min number of comparisons needed to ascertain this fact? Examples {1,2,3,4}  {5,6,7,8}

Much ado About Nothing A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection. E.g. A={1,3,5,7} B={2,4,6,8} a 1 < b 1 < a 2 < b 2 < a 3 < b 3 < a 4 < b 4

Adaptive Algorithms In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in: k · | shortest proof of non-intersection | steps. Ideal for crawled, “bursty” data sets

How does it work? 1,_,3,... i n DocID universe set

Measuring Performance 100MB Web Crawl 5000 queries from Google

Baseline Standard Algorithm Sort sets by size Candidate answer set is smallest set For each set S in increasing order by size –For each element e in candidate set Binary search for e in S If e is not found remove from candidate set R emove elements before e in S

Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

Lower Bound: Adaptive/Shortest Proof

Middle Bound: Adaptive/ Encoding of Shortest Proof

Side by Side Lower Bound Middle Bound

Possible Improvements Adaptive performs best in two-three sets Traditional algorithm often terminates after first pair of sets Galloping seems better than binary search Adaptive keeps a dynamic definition of “smallest set” Candidate elements aggressively tested

Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9}

Experimental Results Test orthogonally each possible improvement Cyclic or Two Smallest Symmetric Update Smallest Advance on Common Element Gallop Factor/Binary Search

Binary Search vs. Gallop

Advance on Common Element

Small Adaptive Combines best of Adaptive and Two-Smallest Two-smallest Symmetric Advance on common element Update on smallest Gallop with factor 2

Small Adaptive

Small Adaptive is faster than Two-Smallest Aggregate speed-up 2.9 x comparisons Faster than Adaptive

Conclusions Faster intersection algorithm for Web Search Engines Adaptive measure for set operations Information theoretic “middle bound” Standard speed-up techniques for other settings THE END

Total # of elements in a query Number of queries for each total size Query Log

Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9, 12}