Fast Approximate Point Set Matching for Information Retrieval Raphaël Clifford and Benjamin Sach

ben.sach.05@bristol.ac.uk

Contents
- What’s the problem?
- What use is it?
- Is it (3-SUM) hard?
- How have we solved it?
- How good is our solution?

The Maximal Subset Matching problem
Given a pattern P of size m and a text T of size n, we want to find the largest “match” of P in T.
This is also referred to as the “constellation” problem (originally by B. Chazelle).

The Maximal Subset Matching problem: what is a “match”?
A point $p_i$ in P matches a point $t_j$ in T with a shift v if $p_i + v = t_j$.

The Maximal Subset Matching problem: subset matches
A subset M of P is a subset match if there exists a shift v with which all points in M match points in T.
The Maximal Subset Matching problem is to find the size of the largest subset match for a given P and T.
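To make the definition concrete, here is a minimal brute-force sketch in Python. It is not one of the algorithms presented in this talk, only an exact baseline: every pair (p, t) votes for the shift v = t - p, and the most popular shift gives the largest subset match. The function name `largest_subset_match` and the representation of points as tuples are assumptions made for illustration.

```python
from collections import Counter


def largest_subset_match(pattern, text):
    """Return (size, shift) of the largest subset match of pattern in text.

    Points are tuples of equal dimension. Duplicate points are ignored.
    Uses O(nm) time and space, so it is only a reference baseline.
    """
    votes = Counter(
        tuple(tc - pc for pc, tc in zip(p, t))  # shift v with p + v = t
        for p in set(pattern)
        for t in set(text)
    )
    if not votes:
        return 0, None
    shift, size = votes.most_common(1)[0]
    return size, shift
```

For example, with pattern points (0, 0), (1, 2), (3, 1) and text points (5, 5), (6, 7), (9, 9), (2, 2), two pattern points match under the shift (5, 5).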

Application to Music Information Retrieval
- Allows for matches shifted in time and pitch
- Intrinsically handles polyphonic music, which traditional string-based methods do not

Other applications:
- Protein structure alignment
- Pharmacophore identification
- Image registration
- Model-based object recognition

Is Maximal Subset Matching hard?
3-SUM is the following problem: given a set T of n integers, is there a triple $a, b, c \in T$ such that $a + b + c = 0$?
- There is a simple algorithm to solve 3-SUM in $O(n^2)$ time
- No lower complexity solutions are known
- It is conjectured that this is a lower bound
- “Many fundamental geometric problems fall in this class”
Maximal Subset Matching has been proven to be 3-SUM hard.
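The simple quadratic algorithm for 3-SUM can be sketched as follows: sort the input, then for each element scan the remainder with two pointers. The function name `has_three_sum` is hypothetical, and the sketch assumes the triple must use three distinct positions of the input, which is one common variant of the problem.

```python
def has_three_sum(values):
    """Return True if three entries at distinct positions sum to zero."""
    nums = sorted(values)
    n = len(nums)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        # Two-pointer scan over the sorted suffix: O(n) per choice of i.
        while lo < hi:
            total = nums[i] + nums[lo] + nums[hi]
            if total == 0:
                return True
            if total < 0:
                lo += 1  # need a larger sum
            else:
                hi -= 1  # need a smaller sum
    return False
```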

The Algorithms
MSMBP
- Bit-parallel implementation
- $O(nm)$ time, $O(n)$ space with very low constants
- Cross-correlation implemented via bit-sets
MSMFT
- FFT-based implementation
- $O(n \log m)$ time, $O(n)$ space
- Cross-correlation implemented via Fast Fourier Transforms

The Structure
1. Randomly project the pattern and the text into 1D.
2. “Length reduce” the data to decrease sparsity.
3. Perform a cross-correlation at each alignment of the length reduced pattern and text.
4. Find the shift in the length reduced pattern that gave the largest value in the cross-correlation.
5. Using the “improved estimate”, infer the shift in the original data.
6. Return the size of the match with this shift.

(a) Randomised Projection and (b) Length Reduction
Using hash functions (see Cole and Hariharan [3]):
$g(x) = ax \bmod q$, $h(x) = g(x) \bmod s$ and $h_2(x) = (g(x) + q) \bmod s$
where:
- q = a random prime in [2N,…,4N] (N is the maximum of the projected values of P’ and T’)
- a = a random integer in [1,…,q-1]
- s = r*n, where r > 1 is a constant
Projected pattern points are mapped to h(x) in the pattern binary array.
Projected text points are mapped to h(x) and h2(x) in the text binary array.
Both binary arrays are of length r*n, where n is the number of text points.
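A minimal Python sketch of the length reduction step, under some assumptions: the inputs are already projected to non-negative integers, r is taken to be the integer 2, and the prime q is found by rejection sampling with trial division (fine for illustration, not tuned for large N). The names `length_reduce`, `random_prime` and `is_prime` are hypothetical.

```python
import random


def is_prime(k):
    """Trial-division primality test (adequate for a small sketch)."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def random_prime(lo, hi, rng):
    """Sample uniformly from [lo, hi] until a prime is found."""
    while True:
        candidate = rng.randint(lo, hi)
        if is_prime(candidate):
            return candidate


def length_reduce(pattern_1d, text_1d, r=2, seed=None):
    """Map 1D points into two binary arrays of length s = r * n."""
    rng = random.Random(seed)
    big_n = max(max(pattern_1d), max(text_1d), 1)  # N on the slide
    q = random_prime(2 * big_n, 4 * big_n, rng)
    a = rng.randint(1, q - 1)
    s = r * len(text_1d)

    def g(x):
        return (a * x) % q

    pattern_bits = [0] * s
    text_bits = [0] * s
    for p in pattern_1d:
        pattern_bits[g(p) % s] = 1        # h(p)
    for t in text_1d:
        text_bits[g(t) % s] = 1           # h(t)
        text_bits[(g(t) + q) % s] = 1     # h2(t)
    return pattern_bits, text_bits, (a, q, s)
```

The returned tuple (a, q, s) is kept so that later steps can recompute h and h2 for the same random choices.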

Lemma 1:
$$(h(x) + h(y)) \bmod s = \begin{cases} h(x + y) & \text{if } g(x) + g(y) < q, \\ h_2(x + y) & \text{otherwise.} \end{cases}$$
Significance:
- If some point matches so that p + v = t, then (h(p) + h(v)) mod s matches either h(t) or h2(t).
- By counting the number of 1’s in common at each alignment we can estimate the true subset match in the original data.
Why does this work? Proof:
$(h(x) + h(y)) \bmod s = (g(x) \bmod s + g(y) \bmod s) \bmod s$ (as $h(x) = g(x) \bmod s$)
$= (g(x) + g(y)) \bmod s$.
If $g(x) + g(y) < q$, then $g(x + y) = g(x) + g(y)$ (as $g(x) = ax \bmod q$).
If $g(x) + g(y) \ge q$, then $g(x + y) = g(x) + g(y) - q$.
Hence $(g(x) + g(y)) \bmod s$ equals $g(x + y) \bmod s = h(x + y)$ in the first case and $(g(x + y) + q) \bmod s = h_2(x + y)$ in the second.
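Lemma 1 can also be checked numerically. The sketch below defines g, h and h2 exactly as on the previous slide and tests the identity for one pair (x, y); the helper names `make_hashes` and `lemma_holds` and any particular values of a, q and s are illustrative assumptions.

```python
def make_hashes(a, q, s):
    """Return the hash functions g, h and h2 for parameters a, q, s."""
    def g(x):
        return (a * x) % q

    def h(x):
        return g(x) % s

    def h2(x):
        return (g(x) + q) % s

    return g, h, h2


def lemma_holds(x, y, a, q, s):
    """Check (h(x) + h(y)) mod s against the case split of Lemma 1."""
    g, h, h2 = make_hashes(a, q, s)
    lhs = (h(x) + h(y)) % s
    expected = h(x + y) if g(x) + g(y) < q else h2(x + y)
    return lhs == expected
```

Because Python's `%` returns a non-negative result for a positive modulus, the check also covers negative shifts.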

Estimating the Size of the Largest Subset Match
Estimation based on projected and length reduced matches: high variance, which grows linearly as the number of true matches decreases (discussed in paper).
An improved estimate:
1. Find the best match of the length reduced pattern in the text.
2. Determine in O(m) time which points in the reduced pattern match the text at that shift.
3. Look up, by the use of a precalculated hash table, where each of the matching points was matched from in the 1D projections P’ and T’.
4. Now we have a shift for each pair of points in P’ and T’. These may have rare inconsistencies due to collisions. We therefore perform a count and take the most frequent shift.
5. Finally we return the size of the match at this shift.
When does this work?
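The improved estimate above can be sketched as a voting procedure. This is an illustration under assumptions, not the authors' implementation: the precalculated hash tables are modelled as dictionaries from array slot to the 1D points hashed there (`build_slots`), and `best_shift` is the winning alignment of the reduced arrays. All names are hypothetical.

```python
from collections import Counter


def build_slots(points, slot_functions):
    """Hash table: array slot -> list of 1D points mapped to that slot."""
    slots = {}
    for x in points:
        for f in slot_functions:
            slots.setdefault(f(x), []).append(x)
    return slots


def improved_estimate(pattern_1d, text_1d, pattern_slots, text_slots, best_shift, s):
    """Return (match size, 1D shift) inferred from the best reduced alignment."""
    votes = Counter()
    for slot, p_points in pattern_slots.items():
        # Text points whose slot lines up with this pattern slot.
        t_points = text_slots.get((slot + best_shift) % s, [])
        for p in p_points:
            for t in t_points:
                votes[t - p] += 1  # candidate shift in the 1D projection
    if not votes:
        return 0, None
    shift = votes.most_common(1)[0][0]  # most frequent shift wins
    text_set = set(text_1d)
    size = sum(1 for p in set(pattern_1d) if p + shift in text_set)
    return size, shift
```

Pattern slots would be built with h only, and text slots with both h and h2, mirroring how the two binary arrays are filled.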

Bit-Parallel Cross-correlation (MSMBP)
We store the reduced pattern and text arrays as bitsets and perform a bit-parallel correlation using ANDs and counts:
- The correlation of two architectural words can be found in constant time using an AND followed by a count of the number of 1’s in the result.
- The count is implemented by use of a look-up table.
- Each reduced array is of size r*n, so the bitset has O(n) words, which gives each correlation in O(n) time.
- We need to find the correlation at each shift.
- To shift the text we must shift every word in the text, which takes O(n) time again.
Therefore, naively, this method takes $O(n^2)$ time:
$$\bigl(\underbrace{O(n)}_{\text{Correlation}} + \underbrace{O(n)}_{\text{Shift}}\bigr) \cdot \underbrace{O(n)}_{\text{Alignments}} = O(n^2)$$
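A minimal sketch of the naive scheme in Python, assuming non-cyclic alignments and using Python's arbitrary-precision integers as a stand-in for arrays of machine words. Each alignment costs one AND, one count and one full shift of the text. The function names are hypothetical.

```python
def to_bitset(bits):
    """Pack a 0/1 list into an integer; bit i of the result is bits[i]."""
    value = 0
    for i, b in enumerate(bits):
        if b:
            value |= 1 << i
    return value


def naive_bit_parallel_correlation(pattern_bits, text_bits):
    """Count the 1's in common at every alignment of pattern against text."""
    p = to_bitset(pattern_bits)
    t = to_bitset(text_bits)
    counts = []
    for _ in range(len(text_bits) - len(pattern_bits) + 1):
        counts.append(bin(p & t).count("1"))  # AND followed by a count
        t >>= 1                               # shift the whole text by one bit
    return counts
```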

Bit-Parallel Cross-correlation (MSMBP)
We reduce this complexity by taking advantage of the sparseness of the reduced pattern array when m << n:
- p has O(n) words but only O(m) non-zero values, so we store only the non-zero words, of which there are at most m.
- This reduces each correlation computation to O(m) time.
However, we also need to reduce the number of shifts required:
|01010010|01000100|01011011|10000100|10100100|10010010|…
By use of pointer arithmetic, we can align the data to any alignment of the form c*b (where b is the byte size) in constant time.
|10100100|10001000|10110111|00001001|01001001|00100100|…
A single full shift of t gives us access to alignments c*b + 1 for any c.
So by calculating the correlations out of order, we need to perform only b shifts.
This results in an O(nm) time complexity algorithm.
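The out-of-order idea can be sketched in Python with 8-bit words (b = 8). This is an illustration under assumptions, not the authors' code: the text is packed once for each of the 8 bit shifts, list indexing plays the role of pointer arithmetic, only the non-zero pattern words are kept, and alignments are non-cyclic. The names `pack_bytes` and `sparse_correlation` are hypothetical.

```python
POPCOUNT = [bin(v).count("1") for v in range(256)]  # look-up table for counts


def pack_bytes(bits):
    """Pack a 0/1 list into 8-bit words; bit j of word k is bits[8*k + j]."""
    padded = list(bits) + [0] * (-len(bits) % 8)
    return [
        sum(padded[8 * k + j] << j for j in range(8))
        for k in range(len(padded) // 8)
    ]


def sparse_correlation(pattern_bits, text_bits):
    """Correlation at every alignment using only 8 full shifts of the text."""
    n_align = len(text_bits) - len(pattern_bits) + 1
    pattern_words = pack_bytes(pattern_bits)
    # Keep only the non-zero pattern words, with their word offsets.
    sparse_pattern = [(k, w) for k, w in enumerate(pattern_words) if w]
    result = [0] * max(n_align, 0)
    for r in range(8):
        # One full shift of the text by r bits, zero-padded at the end.
        shifted = pack_bytes(text_bits[r:]) + [0] * (len(pattern_words) + 1)
        # Alignments r, r + 8, r + 16, ... need only a word offset c.
        for i in range(r, n_align, 8):
            c = i // 8
            result[i] = sum(POPCOUNT[w & shifted[c + k]] for k, w in sparse_pattern)
    return result
```

The result agrees with a direct sum over all bit positions, but each alignment touches only the non-zero pattern words.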

FFT Cross-correlation (MSMFT)
Uses the same steps as MSMBP, except that the cross-correlation step is implemented using FFTs (Fast Fourier Transforms).
This uses the property of the FFT that, for numerical strings, the following can be calculated accurately and efficiently in $O(n \log m)$ time:
$$p \cdot t(i) \stackrel{\text{def}}{=} \sum_{j=1}^{m} p_j \, t_{i+j-1}, \quad 1 \le i \le n$$
(where t(i) is the m-length substring of t beginning at position i).
Thanks to the FFTW team for the implementation used, see [5].
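A self-contained sketch of FFT-based cross-correlation in Python. It uses a simple recursive radix-2 transform rather than FFTW, and performs one transform over the whole text, which costs O(n log n); reaching O(n log m) would additionally require splitting the text into overlapping blocks of length about 2m, which is omitted here. Positions are 0-indexed and only the alignments where the pattern fits entirely inside the text are returned. The function names are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate(pattern, text):
    """Return [sum_j pattern[j] * text[i + j]] for i = 0 .. n - m."""
    m, n = len(pattern), len(text)
    size = 1
    while size < m + n:
        size *= 2
    # Correlation is convolution with the reversed pattern.
    fp = fft([complex(v) for v in reversed(pattern)] + [0j] * (size - m))
    ft = fft([complex(v) for v in text] + [0j] * (size - n))
    product = fft([x * y for x, y in zip(fp, ft)], invert=True)
    conv = [round(v.real / size) for v in product]  # inverse needs 1/size
    return conv[m - 1:n]
```

Rounding to the nearest integer is safe here because the inputs are small 0/1 arrays; for large real-valued inputs the floating-point error would need more care.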

Speed Comparisons (1)
Increasing text size with proportional pattern size (25%, 75%).
P3 is the queue-based method of Ukkonen et al. [7], with complexity $O(nm \log m)$.

Speed Comparisons (2)
- Increasing text size with fixed pattern size (40 points)
- Constant text size (960000 points) with increasing pattern size

Accuracy Tests

Match %   Actual   Runs 1-3   Avr. Diff
90%       180      -          100%
75%       150      -          100%
25%       50       -          100%
10%       20       4, 5, 5    23%

Match % (1st, 2nd)   Actual (1st, 2nd)   Runs 1-3   Avr. Diff
100%, 10%            200, 20             200        100%
100%, 50%            200, 100            200        100%
100%, 90%            200, 180            200        100%
100%, 99%            200, 198            200        100%
75%, 10%             150, 20             150        100%
75%, 65%             150, 130            150        100%
75%, 70%             150, 140            150, 140   98%
75%, 73%             150, 146            150        100%
50%, 10%             100, 20             100        100%
50%, 40%             100, 80             100        100%
50%, 45%             100, 90             100, 90    97%
25%, 5%              50, 10              50         100%
25%, 15%             50, 30              50         100%
25%, 20%             50, 40              40, 50     93%

Match % - the percentage of the pattern that existed in the text
Actual - the sizes of the actual best matches
Run 1, 2, 3 - the sizes of the matches found by the algorithm in each test
Avr. Diff - the average percentage of the largest present match that was returned
The text used was 4000 points in both cases.
Only MSMBP was used for accuracy testing, as the two algorithms differ only in performance.

Conclusions
- We have presented two algorithms, MSMBP with $O(nm)$ and MSMFT with $O(n \log m)$ time complexity, both with $O(n)$ space.
- We have shown that these are efficient on large random point sets.
- We have also shown that the accuracy is very high, even in situations theorised in the paper to have a lower probability of success.
- We have shown experimentally speed-ups of several orders of magnitude in some cases without a significant decrease in accuracy.
The authors would like to thank Manolis Christodoulakis for the original implementation of the MSMFT algorithm and the EPSRC for the funding of the second author.

Questions? (from xkcd.com)
