Fast Approximate Point Set Matching for Information Retrieval Raphaël Clifford and Benjamin Sach

Presentation transcript:


Contents

- What’s the problem?
- What use is it?
- Is it (3-SUM) hard?
- How have we solved it?
- How good is our solution?

The Maximal Subset Matching problem

Given a pattern, P, of size m and a text, T, of size n, we want to find the largest “match” of P in T.

This is also referred to as the “constellation” problem (originally by B. Chazelle).


What is a “match”?

A point $p_i$ in P matches a point $t_j$ in T with a shift, v, if:

$p_i + v = t_j$

A subset M of P is a subset match if there exists a shift, v, with which all points in M match points in T.

The Maximal Subset Matching problem is to find the size of the largest subset match for a given P and T.
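The definition can be made concrete with a brute-force sketch. This is not one of the algorithms presented in the talk; it is a minimal illustration, assuming 2D integer points stored as tuples and distinct points within each set. The function name `maximal_subset_match` is hypothetical.

```python
from collections import Counter


def maximal_subset_match(pattern, text):
    """Return (size, shift) of the largest subset match of pattern in text.

    Brute force: every pair (p, t) votes for the shift v = t - p. The shift
    with the most votes maps the most pattern points onto text points.
    Assumes the points within each set are distinct. O(n*m) time and space.
    """
    votes = Counter()
    for px, py in pattern:
        for tx, ty in text:
            votes[(tx - px, ty - py)] += 1
    if not votes:
        return 0, None
    shift, size = votes.most_common(1)[0]
    return size, shift


if __name__ == "__main__":
    pattern = [(0, 0), (1, 2), (3, 1)]
    text = [(5, 5), (6, 7), (9, 9), (2, 2)]
    print(maximal_subset_match(pattern, text))
```

In the usage example, two of the three pattern points land on text points under the shift (5, 5), and no other shift does better.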

Application to Music Information Retrieval

- Allows for matches shifted in time and pitch
- Intrinsically handles polyphonic music, which traditional string-based methods do not

Other applications:

- Protein structure alignment
- Pharmacophore identification
- Image registration
- Model-based object recognition

Is Maximal Subset Matching hard?

3-SUM is the following problem. Given a set T of n integers: is there a triple $a, b, c \in T$ such that $a + b + c = 0$?

- There is a simple algorithm to solve 3-SUM in $O(n^2)$ time
- No lower complexity solutions are known
- It is conjectured that this is a lower bound
- “Many fundamental geometric problems fall in this class”

Maximal Subset Matching has been proven to be 3-SUM hard.
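For reference, one common form of the simple quadratic 3-SUM algorithm is sketched below: sort, then sweep two pointers for each fixed first element. The function name `has_three_sum` is hypothetical, and the sketch assumes the three values must come from distinct positions in the input, which the slide does not specify.

```python
def has_three_sum(values):
    """Return True if three entries at distinct positions sum to zero.

    Sorting costs O(n log n); the two-pointer sweep for each fixed first
    element costs O(n), giving O(n^2) overall.
    """
    nums = sorted(values)
    n = len(nums)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = nums[i] + nums[lo] + nums[hi]
            if total == 0:
                return True
            if total < 0:
                lo += 1  # sum too small: move to a larger middle value
            else:
                hi -= 1  # sum too large: move to a smaller last value
    return False
```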

The Algorithms

The structure:

1. Randomly project the pattern and the text into 1D.
2. “Length reduce” the data to decrease sparsity.
3. Perform a cross-correlation at each alignment of the length reduced pattern and text.
4. Find the shift in the length reduced pattern that gave the largest value in the cross-correlation.
5. Using the “improved estimate”, infer the shift in the original data.
6. Return the size of the match with this shift.

MSMBP (bit-parallel implementation):

- O(nm) time
- O(n) space with very low constants
- Cross-correlation implemented via bit-sets

MSMFT (FFT based implementation):

- O(n*log(m)) time
- O(n) space
- Cross-correlation implemented via Fast Fourier Transforms

(a) Randomised Projection and (b) Length Reduction

Using hash functions:

$g(x) = ax \bmod q$, $h(x) = g(x) \bmod s$ and $h_2(x) = (g(x) + q) \bmod s$

Where:

- q = a random prime in [2N,…,4N] (N is the maximum of the projected values of P’ and T’)
- a = a random integer in [1,…,q-1]
- s = r*n, where r>1 is a constant

Projected pattern points are mapped to h(x) in the pattern binary array.
Projected text points are mapped to h(x) and h2(x) in the text binary array.
Both arrays are of length r*n, where n is the number of text points.

(See Cole and Hariharan [3])
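The length reduction step might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the projected points are taken to be non-negative integers, r is taken to be the integer 2, the prime is found by rejection sampling with trial division, and the names `length_reduce` and `random_prime` are hypothetical.

```python
import random


def is_prime(k):
    """Trial division; adequate for a sketch, slow for very large k."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def random_prime(low, high, rng):
    """Rejection-sample a prime in [low, high]; one exists when high = 2*low >= 2."""
    while True:
        candidate = rng.randint(low, high)
        if is_prime(candidate):
            return candidate


def length_reduce(pattern_1d, text_1d, r=2, seed=None):
    """Map projected points into two binary arrays of length s = r * n."""
    rng = random.Random(seed)
    big_n = max(max(pattern_1d), max(text_1d), 1)  # N on the slide
    q = random_prime(2 * big_n, 4 * big_n, rng)
    a = rng.randint(1, q - 1)
    s = r * len(text_1d)

    def g(x):
        return (a * x) % q

    p_bits = [0] * s
    t_bits = [0] * s
    for x in pattern_1d:
        p_bits[g(x) % s] = 1           # h(x)
    for x in text_1d:
        t_bits[g(x) % s] = 1           # h(x)
        t_bits[(g(x) + q) % s] = 1     # h2(x)
    return p_bits, t_bits, (a, q, s)
```

Each text point sets up to two bits, which is what lets a shifted pattern point find its partner whether or not the sum g(x) + g(y) wraps around q.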

Why does this work?

Lemma 1:

$$(h(x) + h(y)) \bmod s = \begin{cases} h(x+y) & \text{if } g(x) + g(y) < q, \\ h_2(x+y) & \text{otherwise.} \end{cases}$$

Proof:

$(h(x) + h(y)) \bmod s = (g(x) \bmod s + g(y) \bmod s) \bmod s$ (as $h(x) = g(x) \bmod s$)

$= (g(x) + g(y)) \bmod s.$

As $g(x) = ax \bmod q$:

- If $g(x) + g(y) < q$, then $g(x+y) = g(x) + g(y)$.
- If $g(x) + g(y) \ge q$, then $g(x+y) = g(x) + g(y) - q$.

In the first case the sum equals $g(x+y) \bmod s = h(x+y)$; in the second it equals $(g(x+y) + q) \bmod s = h_2(x+y)$.

Significance:

- If some point matches so that p + v = t, then (h(p) + h(v)) mod s matches either h(t) or h2(t).
- By counting the number of 1’s in common at each alignment we can estimate the true subset match in the original data.
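The lemma can also be checked numerically. The sketch below is only a sanity check over small non-negative integers, not a proof; the function name `check_lemma` and the parameter values in the usage line are arbitrary choices for illustration.

```python
def check_lemma(a, q, s, limit):
    """Exhaustively test Lemma 1 for all 0 <= x, y < limit."""

    def g(x):
        return (a * x) % q

    def h(x):
        return g(x) % s

    def h2(x):
        return (g(x) + q) % s

    for x in range(limit):
        for y in range(limit):
            lhs = (h(x) + h(y)) % s
            rhs = h(x + y) if g(x) + g(y) < q else h2(x + y)
            if lhs != rhs:
                return False
    return True


if __name__ == "__main__":
    print(check_lemma(a=7, q=101, s=16, limit=60))
```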

Estimating the Size of the Largest Subset Match

When does this work?

Estimation based on projected and length reduced matches alone has high variance, which grows linearly as the number of true matches decreases (discussed in the paper).

An improved estimate:

1. Find the best match of the length reduced pattern in the text.
2. Determine in O(m) time which points in the reduced pattern match the text at that shift.
3. Look up, by the use of a precalculated hash table, where each of the matching points was mapped from in the 1D projections, P’ and T’.
4. Now we have a shift for each pair of points in P’ and T’. These may have rare inconsistencies due to collisions. We therefore perform a count and take the most frequent shift.
5. Finally we return the size of the match at this shift.
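Steps 2 to 5 of the improved estimate might look like the sketch below. It is a loose illustration, not the authors' code: the hash functions `h` and `h2` and the best reduced shift are assumed to be supplied by the caller, the lookup table is a plain dictionary of sets, and the name `improved_estimate` is hypothetical. With many collisions the bucket scans can exceed the O(m) bound quoted on the slide.

```python
from collections import Counter, defaultdict


def improved_estimate(pattern_1d, text_1d, h, h2, s, best_reduced_shift):
    """Recover a shift in the projected data from a shift in the reduced data."""
    # Precalculated table: reduced position -> projected text points stored there.
    text_at = defaultdict(set)
    for t in text_1d:
        text_at[h(t)].add(t)
        text_at[h2(t)].add(t)

    # Each pattern point that hits an occupied text position votes for the
    # shift between the underlying projected points.
    votes = Counter()
    for p in pattern_1d:
        pos = (h(p) + best_reduced_shift) % s
        for t in text_at.get(pos, ()):
            votes[t - p] += 1
    if not votes:
        return 0, None

    # Collisions can produce inconsistent shifts, so take the most frequent.
    shift = votes.most_common(1)[0][0]

    # Return the size of the match at this shift in the projected data.
    text_set = set(text_1d)
    size = sum(1 for p in pattern_1d if p + shift in text_set)
    return size, shift
```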

Bit-Parallel Cross-correlation (MSMBP)

We store the reduced pattern and text arrays as bitsets and perform a bit-parallel correlation using ANDs and counts:

- Correlation of two architectural words can be found in constant time using an AND followed by a count of the number of 1’s in the result.
- The count is implemented by use of a look-up table.
- Each reduced array is of size r*n, so the bitset has O(n) words, which gives each correlation in O(n) time.
- We need to find the correlation at each shift.
- To shift the text we must shift every word in the text, which takes O(n) time again.

Therefore, naively, this method takes $O(n^2)$ time:

$$(\underbrace{O(n)}_{\text{correlation}} + \underbrace{O(n)}_{\text{shift}}) \cdot \underbrace{O(n)}_{\text{alignments}} = O(n^2)$$
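The naive AND-and-count correlation can be sketched with Python integers standing in for word arrays. Several things here are assumptions rather than details from the talk: alignments are treated as cyclic (positions wrap modulo the array length), the population count uses `bin(...).count` instead of a look-up table, and the names `to_bitset` and `naive_bit_correlation` are hypothetical.

```python
def to_bitset(bits):
    """Pack a list of 0/1 values into an integer, bit i holding bits[i]."""
    value = 0
    for i, b in enumerate(bits):
        if b:
            value |= 1 << i
    return value


def popcount(x):
    """Number of 1 bits (stands in for the slide's look-up table)."""
    return bin(x).count("1")


def naive_bit_correlation(p_bits, t_bits):
    """counts[k] = number of i with p_bits[i] = t_bits[(i + k) % s] = 1."""
    s = len(t_bits)
    assert len(p_bits) == s, "both reduced arrays have length r*n"
    p = to_bitset(p_bits)
    t = to_bitset(t_bits)
    doubled = t | (t << s)  # two copies of the text make the shift cyclic
    # One shift, one AND and one count per alignment.
    return [popcount(p & (doubled >> k)) for k in range(s)]
```

The alignment with the largest count is the candidate shift in the length reduced data.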

We reduce this complexity by taking advantage of the sparseness of the reduced pattern array when m << n:

- p has O(n) words but only O(m) non-zero values, so we store only the non-zero words, at worst m of them.
- This reduces each correlation computation to O(m) time.

However, we also need to reduce the number of shifts required:

- By use of pointer arithmetic, we can align the data to any constant*b alignment (where b is the byte-size) in constant time.
- A single full shift of t gives us access to alignments c*b + 1 for any c.
- So by calculating the correlations out of order, we need to perform only b shifts.

This results in an O(nm) time complexity algorithm.

FFT Cross-correlation (MSMFT)

Uses the same steps as MSMBP except that the cross-correlation step is implemented using FFTs (Fast Fourier Transforms). This uses the property of the FFT that, for numerical strings,

$$(p \cdot t^{(i)}) \overset{\text{def}}{=} \sum_{j=1}^{m} p_j \, t_{(i+j-1)}, \quad 1 \le i \le n,$$

(where $t^{(i)}$ is the m-length substring of t beginning at position i)

can be calculated accurately and efficiently in O(n*log(m)) time (thanks to the FFTW team for the implementation used, see [5]).
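The FFT route to the same sums can be sketched in pure Python. This is an illustration only: the talk used FFTW, whereas the sketch uses a small recursive radix-2 transform, zero-based indices, and a single transform over the whole text, which costs O(n log n) rather than the O(n*log(m)) obtained by processing the text in overlapping chunks of length about 2m. The names `fft` and `cross_correlate` are hypothetical.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate(p, t):
    """Return c with c[i] = sum_j p[j] * t[i + j], for 0 <= i <= len(t) - len(p)."""
    m, n = len(p), len(t)
    size = 1
    while size < m + n:
        size *= 2
    # Correlation equals convolution with the pattern reversed.
    fp = fft(list(reversed(p)) + [0] * (size - m))
    ft = fft(list(t) + [0] * (size - n))
    prod = fft([x * y for x, y in zip(fp, ft)], invert=True)
    # The unnormalised inverse needs a 1/size scale; inputs are integers, so round.
    conv = [round(v.real / size) for v in prod]
    return conv[m - 1:n]
```

For binary arrays each output entry counts the 1’s that the pattern and text have in common at that alignment, the same quantity the bit-parallel method computes with ANDs and counts.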

Speed Comparisons (1)

[Figure: increasing text size with proportional pattern size (25%, 75%).]

(P3 is the queue based method of Ukkonen et al. [7], with complexity O(n*m*log(m)).)

Speed Comparisons (2)

[Figure: increasing text size with fixed pattern size (40 points).]
[Figure: constant text size ( points) with increasing pattern size.]

Accuracy Tests

Match % | Actual | Run 1 | Run 2 | Run 3 | Avr. Diff
90% %
75% %
25% 50 100%
10% %

Match % (1st, 2nd) | Actual | Run 1 | Run 2 | Run 3 | Avr. Diff
100%,10% 200, %
100%,50% 200, %
100%,90% 200, %
100%,99% 200, %
75%,10% 150, %
75%,65% 150, %
75%,70% 150, %
75%,73% 150, %
50%,10% 100, %
50%,40% 100, %
50%,45% 100, %
25%, 5% 50, %
25%,15% 50, %
25%,20% 50, %

- Match %: the percentage of the pattern that existed in the text
- Actual: the sizes of the actual best matches
- Run 1, 2, 3: the sizes of the matches found by the algorithm in each test
- Avr. Diff: the average percentage of the largest present match that was returned

The text used was 4000 points in both cases.
Only MSMBP was used for accuracy testing, as the two algorithms differ only in performance.

Conclusions

- We have presented two algorithms, MSMBP with O(nm) and MSMFT with O(n*log(m)) time complexity, both with O(n) space.
- We have shown that these are efficient on large random point sets.
- We have also shown that the accuracy is very high, even in situations theorised in the paper to have a lower probability of success.
- We have shown experimentally speed-ups of several orders of magnitude in some cases without a significant decrease in accuracy.

The authors would like to thank Manolis Christodoulakis for the original implementation of the MSMFT algorithm and the EPSRC for the funding of the second author.

Questions? (from xkcd.com)