Fast Set Intersection in Memory. Bolin Ding (UIUC) and Arnd Christian König (Microsoft Research).

Outline: Introduction; Intersection via fixed-width partitions; Intersection via randomized partitions; Experiments.

Outline (next: Introduction).

Introduction. Motivation: set intersection is a general operation in many contexts – information retrieval (Boolean keyword queries), evaluation of conjunctive predicates, data mining, and web search. Goal: a preprocessed index using space linear in the set sizes, fast online processing algorithms, with the focus on an in-memory index.

Related Work (notation: |L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r).
– Sorted-list merge [Hwang and Lin 72], [Knuth 73]
– B-trees, skip lists, or treaps [Brown and Tarjan 79], [Pugh 90], [Blelloch and Reid-Miller 98], …
– Adaptive algorithms (bound the number of comparisons w.r.t. the optimum) [Demaine, Lopez-Ortiz, and Munro 00], …
– Hash-table lookup
– Word-level parallelism [Bille, Pagh, Pagh 07]: map L_1 and L_2 to a small range; effective when the intersection size r is small (w: word size)

Basic Idea: preprocessing. [Figure: lists L_1 and L_2, each split into small groups with one hash image per group.] 1. Partition each list into small groups. 2. Hash-map each group into a small range {1, …, w} (w: word size).

Basic Idea: two observations. 1. At most r = |L_1 ∩ L_2| (usually small) pairs of small groups intersect. 2. If a pair of small groups does NOT intersect, then, w.h.p., their hash images do not intersect either, so the pair can be ruled out.

Basic Idea: online processing. 1. For each possible pair of small groups I_1 ⊆ L_1 and I_2 ⊆ L_2, compute h(I_1) ∩ h(I_2). 2.1. If it is empty, skip the pair (if r = |L_1 ∩ L_2| is small, many pairs are skipped); 2.2. otherwise, compute the intersection of the two small groups, I_1 ∩ I_2.
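The slides describe this scheme only pictorially, so here is a minimal C sketch of the basic idea (not the authors' implementation): each small group carries a w-bit hash image, and a pair of groups is intersected only when the bitwise AND of their images is non-zero. The hash function h, w = 64, the Group layout, and the naive inner intersection are illustrative assumptions.

/* Basic idea: filter pairs of small groups by one bitwise AND of their
 * w-bit hash images; intersect a pair only if the AND is non-zero.     */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint32_t *elems;  /* elements of this small group               */
    size_t          len;    /* number of elements in the group            */
    uint64_t        image;  /* hash image: OR of (1 << h(x)) over group   */
} Group;

/* toy hash into {0, ..., 63}; the paper assumes a random hash function  */
static unsigned h(uint32_t x) { return (x * 2654435761u) >> 26; }

static uint64_t hash_image(const uint32_t *a, size_t n) {
    uint64_t img = 0;
    for (size_t i = 0; i < n; i++) img |= 1ull << h(a[i]);
    return img;
}

/* naive intersection of two small groups (stand-in for QuickIntersect)  */
static size_t intersect_groups(const Group *g1, const Group *g2,
                               uint32_t *out) {
    size_t r = 0;
    for (size_t i = 0; i < g1->len; i++)
        for (size_t j = 0; j < g2->len; j++)
            if (g1->elems[i] == g2->elems[j]) out[r++] = g1->elems[i];
    return r;
}

/* online processing: skip any pair whose hash images do not intersect   */
size_t intersect_lists(const Group *gs1, size_t n1,
                       const Group *gs2, size_t n2, uint32_t *out) {
    size_t r = 0;
    for (size_t i = 0; i < n1; i++)
        for (size_t j = 0; j < n2; j++)
            if (gs1[i].image & gs2[j].image)           /* cheap word AND */
                r += intersect_groups(&gs1[i], &gs2[j], out + r);
    return r;
}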

Basic Idea: two questions. 1. How to partition the lists? 2. How to compute the intersection of two small groups?

Our Results (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r). Intersection of sets of comparable sizes, and of sets of skewed sizes by reusing the same index (better when n << m, and efficient in practice); the running-time bounds appear as formulas on the slide. Generalized to k sets and to set union. Improved wall-clock performance in practice compared to existing solutions – (i) small hidden constants, (ii) random vs. scan access behavior – best in most cases, otherwise close to the best.

Outline (next: Intersection via fixed-width partitions).

Fixed-Width Partitions: outline (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r; w: word size). – Sort the two sets L_1 and L_2. – Partition each into equal-depth intervals (small groups); the number of elements in each interval, and hence the number of pairs of small groups, is a parameter (formulas on the slide). – Compute the intersection of each pair of small groups using QuickIntersect(I_1, I_2).

Subroutine QuickIntersect. (Preprocessing) Map I_1 and I_2 to {1, 2, …, w} using a hash function h; the hash images are encoded as words of size w. (Online processing) Compute h(I_1) ∩ h(I_2) by a bitwise AND; look up the 1's in h(I_1) ∩ h(I_2) and the corresponding "1-entries" in the hash tables; go back to I_1 and I_2 for each "1-entry". Bad case: two different elements of I_1 and I_2 are mapped to the same value in {1, 2, …, w}.
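A hedged C sketch of what a QuickIntersect-style subroutine could look like, assuming w = 64 and at most one element per hash bucket inside a group (the paper handles within-group collisions; the bucket array and the specific hash are placeholders, not the authors' code). It shows the single bitwise AND followed by the per-1-bit lookups described on the slide.

/* QuickIntersect sketch: one word AND, then resolve each 1-bit.        */
#include <stdint.h>
#include <stddef.h>

#define W 64

typedef struct {
    uint64_t image;          /* OR of (1 << h(x)) over the group          */
    uint32_t bucket[W];      /* bucket[b] = element of the group hashed to
                                bit b (assumes no within-group collision) */
} HashedGroup;

static unsigned h(uint32_t x) { return (x * 2654435761u) >> 26; }

/* preprocessing: build the hash image and the per-bit lookup table      */
void build(HashedGroup *g, const uint32_t *elems, size_t n) {
    g->image = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned b = h(elems[i]);
        g->image |= 1ull << b;
        g->bucket[b] = elems[i];
    }
}

/* online: bitwise AND, then go back to the groups for each 1-entry      */
size_t quick_intersect(const HashedGroup *g1, const HashedGroup *g2,
                       uint32_t *out) {
    uint64_t both = g1->image & g2->image;     /* one bitwise operation   */
    size_t r = 0;
    while (both) {
        unsigned b = (unsigned)__builtin_ctzll(both);  /* lowest set bit
                                                          (GCC/Clang)     */
        both &= both - 1;
        /* "bad case": different elements of the two groups share bit b  */
        if (g1->bucket[b] == g2->bucket[b]) out[r++] = g1->bucket[b];
    }
    return r;
}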

Analysis of QuickIntersect (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r). Computing h(I_1) ∩ h(I_2), a bitwise AND of w-bit words, costs one operation; looking up the 1's and going back to I_1 and I_2 costs |h(I_1) ∩ h(I_2)| = |I_1 ∩ I_2| + (# bad cases) operations, with one bad case per pair in expectation. Total complexity: formula on the slide.

Analysis: optimizing the parameters (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r; w: word size). Choose the total number of small groups, and hence the group sizes |I_1| = a and |I_2| = b, so that each pair of groups I_1, I_2 incurs one bad case in expectation; the precise condition on a and b and the resulting total complexity are given by the formulas on the slide.
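The exact parameter choice and total complexity are formulas on the original slide; the following back-of-envelope calculation only illustrates why "one bad case per pair in expectation" follows once the product of the group sizes is bounded by w (a sketch under a uniform-hashing assumption, not the paper's derivation).

% For a pair of groups of sizes a = |I_1| and b = |I_2| whose elements
% are hashed independently and uniformly into {1, ..., w}:
\[
  \mathbb{E}[\#\text{bad cases}]
  \;=\; \sum_{x \in I_1}\;\sum_{y \in I_2,\, y \neq x} \Pr[h(x)=h(y)]
  \;\le\; \frac{a\,b}{w},
\]
% so choosing group sizes with a*b <= w gives at most one bad case per
% pair of groups in expectation, matching the claim on the slide.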

Outline (next: Intersection via randomized partitions).

Randomized Partition: outline (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r; w: word size). – Two sets L_1 and L_2. – A grouping hash function g (different from h), the same grouping function for all sets: group the elements of L_1 and L_2 according to g(·). – Compute the intersection of the pair of small groups I_1^i = {x ∈ L_1 | g(x)=i} and I_2^i = {y ∈ L_2 | g(y)=i}, for each i, using QuickIntersect.

Randomized Partition (continued). Because the same grouping function g is used for all sets, the group with g(x)=i in L_1 is intersected only with the group with g(y)=i in L_2.
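A small C sketch of the randomized-partition preprocessing described above, assuming one shared grouping hash g for all lists; NUM_GROUPS, the particular hash, and the growable bucket are illustrative choices, not the authors' implementation.

/* Randomized partition: bucket the elements of one list by g(.); only
 * buckets with the same index i across lists are intersected online.   */
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

#define NUM_GROUPS 256   /* placeholder; the paper derives this from n, w */

static unsigned g(uint32_t x) { return (x * 40503u) & (NUM_GROUPS - 1); }

typedef struct { uint32_t *elems; size_t len, cap; } Bucket;

Bucket *randomized_partition(const uint32_t *list, size_t n) {
    Bucket *groups = calloc(NUM_GROUPS, sizeof(Bucket));
    for (size_t i = 0; i < n; i++) {
        Bucket *b = &groups[g(list[i])];
        if (b->len == b->cap) {                     /* grow the bucket    */
            b->cap = b->cap ? 2 * b->cap : 4;
            b->elems = realloc(b->elems, b->cap * sizeof(uint32_t));
        }
        b->elems[b->len++] = list[i];
    }
    return groups;
}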

Analysis. Bound the number of bad cases for each pair of groups, then follow the previous analysis (formulas on the slide). Generalized to k sets; the rigorous analysis is trickier.

Data Structures. A multi-resolution structure for online processing, using space linear in the size of each set.

A Practical Version: outline (w: word size). – Motivation: linear scans are cheap in memory. – Instead of using QuickIntersect to compute I_1^i ∩ I_2^i, simply scan the two groups linearly, in time O(|I_1^i| + |I_2^i|). – (Preprocessing) Map I_1^i and I_2^i into {1, …, w} using a hash function h. – (Online processing) Linearly scan I_1^i and I_2^i only if h(I_1^i) ∩ h(I_2^i) ≠ ∅.

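A minimal C sketch of this practical variant: the hash images act purely as a filter, and a surviving group pair is intersected by an ordinary merge scan (groups are assumed to stay in sorted order; the Group layout and the hash image follow the earlier sketches and are assumptions, not the authors' code).

/* Practical version: scan a pair of groups only if the word AND of their
 * hash images is non-zero; otherwise the pair is certainly disjoint.    */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint32_t *elems;   /* group elements, kept in sorted order      */
    size_t          len;
    uint64_t        image;   /* OR of (1 << h(x)) over the group          */
} Group;

size_t scan_if_maybe_intersecting(const Group *g1, const Group *g2,
                                  uint32_t *out) {
    if ((g1->image & g2->image) == 0)     /* filter: certainly disjoint   */
        return 0;
    size_t i = 0, j = 0, r = 0;           /* otherwise merge-scan both    */
    while (i < g1->len && j < g2->len) {
        if      (g1->elems[i] < g2->elems[j]) i++;
        else if (g1->elems[i] > g2->elems[j]) j++;
        else { out[r++] = g1->elems[i]; i++; j++; }
    }
    return r;
}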

Analysis. The probability of a false positive is bounded (expression on the slide); using p hash functions, a false positive occurs with lower probability, at most the corresponding p-fold bound. Complexity: when I_1^i ∩ I_2^i = ∅, the pair is scanned only on a false positive, i.e. with the bounded probability; when I_1^i ∩ I_2^i ≠ ∅, the pair must be scanned, but there are only r such pairs. The rigorous analysis is trickier.
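The exact bounds are on the original slide; the following is only a rough union-bound version of the false-positive probability, under the assumption of independent, uniformly random hash functions.

% For a pair of disjoint groups, a false positive requires some x in
% I_1^i and y in I_2^i with h(x) = h(y), so by a union bound
\[
  \Pr\bigl[h(I_1^i) \cap h(I_2^i) \neq \emptyset\bigr]
  \;\le\; \frac{|I_1^i|\,|I_2^i|}{w},
\]
% and with p independently chosen hash functions (a pair is scanned only
% if all p image pairs intersect) this drops to roughly the p-th power
% of that bound.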

Intersection of Short and Long Lists: outline (when |I_1^i| >> |I_2^i|). – Linearly scan I_2^i. – Binary-search each of its elements in I_1^i. – Based on the same data structure; the resulting bound is shown on the slide.
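A short C sketch of the short-long case as described on the slide: the long list's group is kept sorted, the short list's group is scanned, and each element is located by binary search. Names and signatures are illustrative, not taken from the paper.

/* Short-long intersection within one pair of aligned groups.            */
#include <stdint.h>
#include <stddef.h>

static int contains(const uint32_t *sorted, size_t n, uint32_t key) {
    size_t lo = 0, hi = n;                 /* binary search in [lo, hi)   */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if      (sorted[mid] < key) lo = mid + 1;
        else if (sorted[mid] > key) hi = mid;
        else return 1;
    }
    return 0;
}

size_t short_long_intersect(const uint32_t *long_grp, size_t nl,
                            const uint32_t *short_grp, size_t ns,
                            uint32_t *out) {
    size_t r = 0;
    for (size_t i = 0; i < ns; i++)        /* scan the short group        */
        if (contains(long_grp, nl, short_grp[i]))
            out[r++] = short_grp[i];
    return r;
}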

Outline (next: Experiments).

Experimental setup. Synthetic data: elements generated uniformly from {1, 2, …, R}. Real data: 8 million Wikipedia documents (the inverted index provides the sets) and the 10 thousand most frequent queries from Bing (the online intersection queries). Implemented in C; 64-bit 2.4 GHz PC with 4 GB RAM; time reported in milliseconds.

Experiments: baseline algorithms from related work (sorted-list merge; B-trees, skip lists, or treaps; adaptive algorithms; hash-table lookup; word-level parallelism [Bille, Pagh, Pagh 07]), labeled Merge, SkipList, Adaptive, Hash, BPP, SvS, and Lookup in the plots.

Experiments: our approaches – fixed-width partition, randomized partition, the practical version, and short-long list intersection, labeled IntGroup, RanGroup, RanGroupScan, and HashBin in the plots.

Varying the Size of the Sets (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r): n = m = 1M to 10M, r = 1% of n. [Plot on slide.]

Varying the Size of the Intersection (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r): n = m = 10M, r = 0.005% to 100% of n = 500 to 10M. [Plot on slide.]

Varying the Ratio of Set Sizes (|L_1|=n, |L_2|=m, |L_1 ∩ L_2|=r): m = 10M, m/n = 2 to 500, r = 1% of n. [Plot on slide.]

Real Data

Conclusion and Future Work. Simple and fast set intersection algorithms; novel theoretical performance guarantees; better wall-clock performance (best in most cases, otherwise close to the best). Future work: storage compression in our approaches; selecting the best algorithm and parameters by estimating the size of the intersection (RanGroup vs. HashBin, RanGroup vs. Merge, the parameter p, the sizes of the groups).

Compression (preliminary results). Storage compression in RanGroup: the grouping hash function g is a random permutation, grouping by a prefix of the permuted ID, so the short lists only need to store the suffix of the permuted ID. Compared with the Merge algorithm on a compressed inverted index: compressed Merge achieves 70% compression but is 7 times slower (800 ms); compressed RanGroup achieves 40% compression and is only 1.5 times slower (140 ms).

Remarks (cut or not?). Linear additional space for preprocessing and indexing. Generalized to k lists and to the disk I/O model. Difference between Pagh's approach and ours – Pagh: mapping + word-level parallelism; ours: grouping + mapping.

Data Structures (remove). [Figure: linear-scan layout of the data structure.]

Number of Lists (skip): set size = 10M. [Plot on slide.]

Number of Hash Functions (remove): p = 1, 2, 4, 6, 8. [Plot on slide.]

Compression (remove): n = m = 1M to 10M, r = 1% of n. [Plot on slide.]

Intersection of Short and Long Lists: outline. – (Preprocessing) Two sorted lists L_1 and L_2; a hash function g groups the elements of L_1 and L_2 according to g(·). – (Online processing) For each element of I_2^i = {y ∈ L_2 | g(y)=i}, binary-search it in I_1^i = {x ∈ L_1 | g(x)=i}.

Analysis. Online processing time (bound on the slide), using the random variable S_i = |I_2^i| together with (a) E[S_i] = 1, (b) ∑_i |I_1^i| = n, and (c) the concavity of log(·). Comments: 1. The same performance guarantee as B-trees, skip lists, and treaps. 2. Simple and efficient in practice. 3. Can be generalized to k lists.
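The full analysis is in the paper; the following only sketches how facts (b) and (c) combine via Jensen's inequality, with G denoting the number of groups (notation introduced here for the sketch, not taken from the slide).

% If there are G groups and \sum_i |I_1^i| = n, then by concavity of log:
\[
  \sum_{i=1}^{G} \log |I_1^i|
  \;\le\; G \,\log\!\Bigl(\frac{1}{G}\sum_{i=1}^{G} |I_1^i|\Bigr)
  \;=\; G \,\log\frac{n}{G},
\]
% which bounds the total cost of the per-group binary searches when each
% short-list group contributes one element in expectation (fact (a)).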

Data Structures. [Figure: linear scan vs. random access over the data structure.]