Published by Ariel Wiggins. Modified over 9 years ago.
1
Top-k Set Similarity Joins
Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang
Univ. of New South Wales, Australia
ICDE ’09
9 Feb 2011
Taewhi Lee
Based on Chuan Xiao’s presentation slides in ICDE ’09
2
Outline
– Introduction
– Problem Definition
– Existing Approaches
– Top-k Join Similarity Join Algorithms
– Experiments
3
Motivation
Data Cleaning
University                      City       State       Postal Code
University of New South Wales   Sydney     NSW         2052
University of Sydney            Sydney     NSW         2006
University of Melbourne         Melbourne  Victoria    3010
University of Queensland        Brisbane   Queensland  4072
University of New South Vales   Sydney     NSW         2052
4
More Applications
Near-duplicate Web page detection
– Example: "Obama Has Busy Final Day Before Taking Office as Bush Says Farewells"
– New York Times, Jan 19th, 2009
– iht.com, Jan 20, 2009
6
(Traditional) Set Similarity Join
Each record is tokenized into a set
Given a collection of records, the set similarity join problem is to find all pairs of records <x, y> such that sim(x, y) ≥ t
Common similarity functions:
– jaccard: J(x, y) = |x ∩ y| / |x ∪ y|
– cosine: C(x, y) = |x ∩ y| / sqrt(|x| · |y|)
– dice: D(x, y) = 2 · |x ∩ y| / (|x| + |y|)
Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
– jaccard: 4/6 = 0.67
– cosine: 4/5 = 0.8
– dice: 8/10 = 0.8
What if t is unknown beforehand?
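The three similarity functions on this slide can be checked with a few lines of Python. This is a minimal sketch using the standard set-based definitions; the function names are illustrative and not taken from the slides.

```python
import math

def jaccard(x, y):
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x, y):
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    # 2 * |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

# The two records from the slide.
x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), round(cosine(x, y), 2), round(dice(x, y), 2))
```

All three functions depend only on the overlap size and the record sizes, which is what makes overlap-based filtering possible later in the talk.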
7
What If t is Unknown Beforehand?
Example – using jaccard similarity function
– w = {A, B, C, D, E}
– x = {A, B, C, E, F}
– y = {B, C, D, E, F}
– z = {B, C, F, G, H}
– If t = 0.7 ⇒ no results
– If t = 0.4 ⇒ <w,x>, <w,y>, <x,y>, <x,z>, <y,z> (too many results and long running time)
Return the top-k results ranked by their similarity values
– if k = 1, only the single most similar pair is returned
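The effect of the threshold on this example can be reproduced with a brute-force join. The sketch below compares every pair, so it illustrates only the problem statement, not an efficient algorithm; `threshold_join` and `topk_naive` are hypothetical helper names.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Records from the slide.
records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

def threshold_join(records, t):
    # All pairs whose similarity reaches the threshold t.
    return [(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= t]

def topk_naive(records, k):
    # Score every pair, then keep the k most similar ones.
    scored = [(jaccard(records[a], records[b]), a, b)
              for a, b in combinations(sorted(records), 2)]
    scored.sort(key=lambda s: -s[0])
    return scored[:k]

print(threshold_join(records, 0.7))   # empty: no pair reaches 0.7
print(threshold_join(records, 0.4))   # five of the six pairs
print(topk_naive(records, 1))
```

Note that three pairs tie at 4/6 in this data set, so which one is reported for k = 1 depends on tie-breaking.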
8
Top-k Set Similarity Join
Return the top-k pairs of records, ranked by similarity scores
Advantages over traditional similarity join
– No threshold needs to be specified
– Results are output progressively, which benefits interactive applications
– The most meaningful results are produced under limited resources/time constraints
Can be stopped at any time while still guaranteeing sim(output results) ≥ sim(unseen pairs)
10
Straightforward Solution
Start from a certain t, repeat the following steps:
– answer traditional sim-join with t as threshold
– if # of results ≥ k, stop and output the k results with highest sim
– else, decrease t
Example (jaccard, k = 2)
– w = {A, B, C, E}
– x = {A, B, C, E, F}
– y = {B, C, D, E, F}
– z = {B, C, F, G, H}
– t = 0.9 ⇒ no result
– t = 0.8 ⇒ <w,x>
– t = 0.7 ⇒ <w,x> (results don’t change!)
– t = 0.6 ⇒ <w,x>, <x,y>
Which thresholds shall we enumerate? 0.8, 0.6
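The loop on this slide can be sketched as follows. The starting threshold 0.9 and the step 0.1 are assumptions taken from the example, and the inner `sim_join` is a brute-force stand-in for a real threshold-based join.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def sim_join(records, t):
    # Brute-force stand-in for a traditional threshold join.
    return [(jaccard(records[a], records[b]), a, b)
            for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= t]

def topk_by_decreasing_threshold(records, k):
    rounds = 0
    results = []
    # Assumed schedule: t = 0.9, 0.8, ..., 0.0 (integer tenths avoid float drift).
    for tenth in range(9, -1, -1):
        rounds += 1
        results = sim_join(records, tenth / 10)
        if len(results) >= k:
            break
    return sorted(results, reverse=True)[:k], rounds

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
top, rounds = topk_by_decreasing_threshold(records, k=2)
print(top, rounds)
```

On this data the join runs four times (t = 0.9, 0.8, 0.7, 0.6), although only the rounds at 0.8 and 0.6 change the result set. That wasted work is the motivation for enumerating only the necessary thresholds.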
11
Naïve and Index-Based Algorithms
Naïve Algorithm:
– Compare every pair of objects ⇒ O(n²) time complexity
Index-based Algorithm [Sarawagi et al. SIGMOD04]:
– Record Set → Index Construction → Candidate Generation → Verification → Result Pairs
– Inverted lists (token → record_id): A → w, x, y; B → x, z, …; C → y, z, …
– Candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …
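The index-based pipeline (index construction, candidate generation, verification) can be sketched with an in-memory inverted index. The records below are the w, x, y, z sets from the earlier example rather than the data drawn on this slide, and the function names are illustrative.

```python
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b)

def candidate_pairs(records):
    index = defaultdict(list)   # token -> ids of records containing it
    candidates = set()
    for rid in sorted(records):
        for token in sorted(records[rid]):
            # Every record already in this inverted list shares the token.
            for other in index[token]:
                candidates.add((other, rid))
            index[token].append(rid)
    return index, candidates

def verify(records, candidates, t):
    # Compute the exact similarity only for candidate pairs.
    return sorted(p for p in candidates
                  if jaccard(records[p[0]], records[p[1]]) >= t)

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
index, candidates = candidate_pairs(records)
print(index["A"], len(candidates), verify(records, candidates, 0.6))
```

Because every record here contains the frequent tokens B and C, all six pairs become candidates. Indexing every token prunes nothing in this case, which is the weakness the prefix filter addresses.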
12
Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
Sort the tokens by a global ordering
– increasing order of document frequency
Only need to index the first few tokens (prefix) for each record
Example: jaccard t = 0.8 ⇒ |x ∩ y| ≥ 4 if |x| = |y| = 5
– x = A B | E F G and y = C D | E F G (tokens sorted, prefix shown before the bar)
– the prefixes share no token ⇒ upper bound O(x,y) = 3 < 4!
Must share at least one token in prefix to be a candidate pair
– For jaccard, prefix length = |x| * (1 – t) + 1 ⇒ each t is associated with a prefix length
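The prefix test on this slide can be sketched as below. Two assumptions are made: the prefix length is computed as |x| − ⌈t·|x|⌉ + 1, an integer-safe form that agrees with |x| * (1 – t) + 1 when t·|x| is an integer, and alphabetical order stands in for the global token ordering.

```python
import math

def prefix_length(size, t):
    # Assumed integer-safe form of |x| * (1 - t) + 1; the small epsilon
    # guards against floating-point error in t * size.
    return size - math.ceil(t * size - 1e-9) + 1

def may_be_candidate(x, y, t):
    # x and y are token lists already sorted by the global ordering.
    px = set(x[:prefix_length(len(x), t)])
    py = set(y[:prefix_length(len(y), t)])
    return bool(px & py)

x = list("ABEFG")
y = list("CDEFG")
print(prefix_length(5, 0.8), may_be_candidate(x, y, 0.8))
```

Here the prefixes {A, B} and {C, D} are disjoint, so the pair is pruned without computing its similarity. Pairs that pass the test still need verification, since sharing a prefix token is a necessary condition only.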
14
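Necessary Thresholds
Each prefix is associated with a threshold
– the maximum possible similarity a record can achieve with other records
Example: x with prefix tokens A, B, C ⇒ t = 1.0, 0.8, 0.6
Threshold sequences for three records x, y, z (figure):
– 1.0, 0.75, 0.5, 0.25
– 1.0, 0.8, 0.6, 0.4, 0.2
– 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1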
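The threshold sequences on this slide are consistent with a simple rule for jaccard: a record of size n that reaches its p-th token (1-based) can still achieve a similarity of at most (n − p + 1)/n. The sketch below assumes that rule; it is inferred from the slide's numbers rather than stated there.

```python
def prefix_thresholds(n):
    # Assumed jaccard bound for the p-th prefix token: (n - p + 1) / n.
    return [(n - p + 1) / n for p in range(1, n + 1)]

for size in (4, 5, 10):
    print(size, prefix_thresholds(size))
```

Only these values can ever change the result of a threshold join for the given record sizes, so they are the thresholds worth enumerating.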
15
Event-driven Model
Problem: repeated invocation of the sim-join algorithm
– t is decreasing ⇒ run the sim-join algorithm in an incremental way
Prefix Event
– Initialize the prefix length of each record to 1
– For each prefix event (record x reaching token A in its prefix):
  Probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into the temp results
  Insert x into A’s inverted list
  Extend the prefix by one token
– Maintain prefix events with a max-heap on t
– Stop when t ≤ the k-th temp result’s similarity
16
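Topk-join - Example (jaccard, k=2)
Records:
– w = A B C E
– x = A B C E F
– y = B C D E F
– z = B C F G H
Inverted list (token → record_id):
– A → w, x
– B → y, z, x, w
– C → y, z
Temporary results:
– (w,x) = 0.8
– (y,z) = 0.43
– (x,y) = 0.67
Some pairs are verified twice!
Stop at the prefix event with t = 0.6 ≤ 2nd temp result’s sim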
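The event-driven procedure can be sketched with Python's `heapq`, storing negated thresholds to get a max-heap. This is a minimal, unoptimized sketch and not the authors' implementation: it assumes the jaccard threshold (n − p + 1)/n for the p-th prefix token, uses alphabetical order as the global token ordering, and recomputes the k-th similarity by sorting for brevity.

```python
import heapq

def jaccard(a, b):
    return len(a & b) / len(a | b)

def topk_join(records, k):
    """records: id -> non-empty token list sorted by the global ordering."""
    sets = {rid: set(tokens) for rid, tokens in records.items()}
    # One prefix event per record: (-threshold, record id, token position).
    events = [(-1.0, rid, 0) for rid in sorted(records)]
    heapq.heapify(events)
    index = {}          # token -> ids of records that indexed it
    temp = {}           # temporary results: pair -> similarity
    verifications = 0   # counts repeated verifications too

    def kth_sim():
        sims = sorted(temp.values(), reverse=True)
        return sims[k - 1] if len(sims) >= k else -1.0

    while events:
        neg_t, rid, pos = heapq.heappop(events)
        if -neg_t <= kth_sim():
            break       # no unseen pair can beat the k-th temp result
        token = records[rid][pos]
        for other in index.get(token, []):
            verifications += 1
            temp[tuple(sorted((other, rid)))] = jaccard(sets[other], sets[rid])
        index.setdefault(token, []).append(rid)
        n = len(records[rid])
        if pos + 1 < n:  # extend the prefix by one token
            heapq.heappush(events, (-(n - pos - 1) / n, rid, pos + 1))
    top = sorted(temp.items(), key=lambda kv: -kv[1])[:k]
    return top, index, verifications

records = {"w": list("ABCE"), "x": list("ABCEF"),
           "y": list("BCDEF"), "z": list("BCFGH")}
top, index, verifications = topk_join(records, k=2)
print(top, index, verifications)
```

On the slide's data this stops at the event with t = 0.6 and leaves the inverted lists A → w, x; B → y, z, x, w; C → y, z. The counter reports eight verifications for six distinct pairs because (w,x) and (y,z) are each verified twice.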
17
Optimizations - Verification
In the above example, (w,x) and (y,z) have been verified twice
How to avoid repeated verification?
– Memorize all verified pairs with a hash table ⇒ too much memory consumption
– When a pair is verified for the first time, check whether it will be identified again
– Keep only those pairs that will be identified again before the algorithm stops
– This guarantees that no pair is verified twice
Example: x = A B D E F, y = A C D E F, with prefix thresholds 1.0, 0.8, 0.6
– The next shared token D is reached at t = 0.6; if the k-th temp result’s sim = 0.7 ⇒ the pair won’t be identified again!
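The check described above can be sketched for the unoptimized event loop. The reasoning assumed here: a pair is identified again only when both records have reached another shared token, which happens at the smaller of the two prefix thresholds for that token. The threshold rule (n − p + 1)/n and all function names are assumptions.

```python
def prefix_threshold(record, token):
    # Assumed jaccard bound when `record` reaches `token` in its prefix.
    n = len(record)
    return (n - record.index(token)) / n

def next_identification_threshold(x, y, shared_token):
    # Tokens after the shared one in x that also occur in y.
    later = x[x.index(shared_token) + 1:]
    common = [tok for tok in later if tok in y]
    if not common:
        return None
    # The pair resurfaces once both records have reached a common token.
    return max(min(prefix_threshold(x, tok), prefix_threshold(y, tok))
               for tok in common)

def must_remember(x, y, shared_token, kth_sim):
    # Remember the pair only if it would be found again before the stop.
    t_next = next_identification_threshold(x, y, shared_token)
    return t_next is not None and t_next > kth_sim

x = list("ABDEF")
y = list("ACDEF")
print(next_identification_threshold(x, y, "A"), must_remember(x, y, "A", 0.7))
```

Since the k-th temp similarity never decreases, a pair that does not need remembering at first verification never needs it later, which keeps the hash table small.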
18
Optimizations - Indexing
How to reduce inverted list size to save memory?
– t is decreasing ⇒ calculate the upper bound of similarity for future probings into inverted lists
– Don’t insert into the inverted list if upper bound ≤ k-th temp result’s similarity
Example: x = A C D E F, y = B C D E F, at threshold 0.8 ⇒ max. similarity = 4/6 = 0.67
20
Experiment Settings
Algorithms
– topk-join
– pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85, ...
Measure
– Compare topk-join and pptopk (candidate size, running time)
– Output results progressively
Dataset
dataset                          # of records  avg. record size
DBLP (author, title)             855k          14.0
TREC (author, title, abstract)   348k          130.1
TREC-3GRAM                       348k          868.5
UNIREF-3GRAM (protein seq.)      500k          372.9
21
Experiment Results (slides 21–23; plots not reproduced in this transcript)
24
Thank You! Any questions or comments?
25
Related Work
Index-based approaches
– S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004
– C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008
Prefix-based approaches
– S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006
– R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007
– C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008
PartEnum
– A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006