1
Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)
2
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for token sets Signature Scheme for tokens Experiment Conclusion 2011/4/13 Fast-Join @ ICDE2011 2/34
3
Background
String Similarity Join
  Find similar string pairs between two string sets
  An essential operation in many applications

Card #   | Name           | Addr                   | Phn
1234**** | Jeffery Ullman | CS Dept. Stanford, CA  | 111-1111
1018**** | Marvin Minsky  | CS Dept., MIT, MA      | 222-2222
…        | …              | …                      | …

Card #   | Name            | Email               | Tel
1205**** | David Patterson | Patterson@ucb.com   | 999-9999
0101**** | Jeffrey Ullman  | ullman@stanford.com | (650)111-1111
…        | …               | …                   | …

Perform a similarity join on the name attribute: "Jeffery Ullman" matches "Jeffrey Ullman"
4
Background
String Similarity Join
  Find similar string pairs between two string sets
  An essential operation in many applications

User Id  | Query              | Timestamp
1018**** | ICDE 2011 Hanover  | 2011-01-15 10:12:10
1234**** | NBA All Stars 2011 | 2011-01-15 11:05:06
2823**** | ICDE Hannover      | 2011-01-15 11:10:10
6345**** | weather Hanover    | 2011-01-15 12:34:10
…        | …                  | …

Perform a self similarity join on the query attribute
5
Motivation
Existing Similarity Metrics
  Token-based Similarity: Dice, Cosine, Jaccard, …
  Character-based Similarity: Edit Distance, Edit Similarity, …
  Hybrid Similarity: GED [SIGMOD 03]
Example: S1 = “nba mcgrady”, S2 = “macgrady nba”
  Jaccard(S1, S2) = 1/3
  ED(S1, S2) = 8
  GED(S1, S2) = 0
6
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for token sets Signature Scheme for tokens Experiment Conclusion
7
Token-based Similarity
  Dice similarity
  Cosine similarity
  Jaccard similarity
All three count exactly matched token pairs, i.e. T1 ∩ T2
Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}, |T1 ∩ T2| = 1
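The formulas on this slide did not survive extraction. The sketch below assumes the standard set-based definitions of Dice, cosine, and Jaccard similarity and applies them to the example token sets; the function names are illustrative and not taken from the slides.

```python
import math

def dice(t1, t2):
    # 2 * |T1 ∩ T2| / (|T1| + |T2|)
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def cosine(t1, t2):
    # |T1 ∩ T2| / sqrt(|T1| * |T2|)
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2))

def jaccard(t1, t2):
    # |T1 ∩ T2| / |T1 ∪ T2|
    return len(t1 & t2) / len(t1 | t2)

T1 = {"nba", "mcgrady"}
T2 = {"macgrady", "nba"}
print(dice(T1, T2), cosine(T1, T2), jaccard(T1, T2))
```

Only the exact match on "nba" counts here. "mcgrady" and "macgrady" differ by one character yet contribute nothing, which is the weakness the fuzzy variants are meant to address.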
8
Weighted Bipartite Graph between T1 and T2
  Tokens shown: mcgrady, nba, wnba, macgrady, nba
  Edge weights shown: 0.125, 0.75, 0.875, 0.143, 1, 0.125
3. Fuzzy Overlap: Maximum Weighted Matching (Quantify token similarity)
Better than |T1 ∩ T2| = 1
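The figure itself is lost, so the following is a reconstruction under stated assumptions rather than the authors' code. It assumes T1 = {mcgrady, nba, wnba} and T2 = {macgrady, nba}, and that each edge weight is the normalized edit similarity 1 - ED(a, b) / max(|a|, |b|); under those assumptions the six weights listed on the slide are reproduced. The matching is found by brute force, which is only reasonable for tiny token sets; a real implementation would more likely use an assignment algorithm.

```python
from itertools import permutations

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def ned(a, b):
    # Normalized edit similarity (assumed weight function).
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def fuzzy_overlap(t1, t2):
    # Maximum weighted matching by trying every assignment of the
    # smaller side's tokens to distinct tokens of the larger side.
    small, large = sorted((list(t1), list(t2)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        best = max(best, sum(ned(x, y) for x, y in zip(small, perm)))
    return best

T1 = ["mcgrady", "nba", "wnba"]
T2 = ["macgrady", "nba"]
for x in T1:
    for y in T2:
        print(x, y, round(ned(x, y), 3))
print("fuzzy overlap:", fuzzy_overlap(T1, T2))
```

The best matching pairs "nba" with "nba" and "mcgrady" with "macgrady", leaving "wnba" unmatched.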
9
Fuzzy-Token Similarity
  Fuzzy-Dice similarity
  Fuzzy-Cosine similarity
  Fuzzy-Jaccard similarity
Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}
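The definitions on this slide were images and are missing from the transcript. A hedged sketch follows, assuming each fuzzy variant simply replaces the exact overlap |T1 ∩ T2| with the fuzzy overlap (the weight of the maximum weighted matching). The overlap value 1.875 is also an assumption: it comes from matching "nba" with "nba" (weight 1) and "mcgrady" with "macgrady" (weight 0.875), two of the edge weights shown on the bipartite-graph slide.

```python
import math

def fuzzy_dice(n1, n2, overlap):
    # n1, n2 are the token-set sizes |T1| and |T2|.
    return 2 * overlap / (n1 + n2)

def fuzzy_cosine(n1, n2, overlap):
    return overlap / math.sqrt(n1 * n2)

def fuzzy_jaccard(n1, n2, overlap):
    return overlap / (n1 + n2 - overlap)

# T1 = {nba, mcgrady}, T2 = {macgrady, nba}
overlap = 1.0 + 0.875  # assumed fuzzy overlap, see lead-in
print(fuzzy_dice(2, 2, overlap))
print(fuzzy_cosine(2, 2, overlap))
print(fuzzy_jaccard(2, 2, overlap))
```

Passing the exact overlap 1 instead of 1.875 gives back the ordinary token-based values (for example, Jaccard 1/3), so under this assumption the exact measures are a special case of the fuzzy ones.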
10
Comparison with Existing Similarities
11
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for token sets Signature Scheme for tokens Experiment Conclusion
12
String Similarity Join using Fuzzy-Token Similarity
  s1 = “kobe and trancy”, s2 = “trcy macgrady mvp”, ……
  s'1 = “kobe bryant age”, s'2 = “mvp tracy mcgrady”, ……
Tokenization
  T1 = {kobe, and, trancy}, T2 = {trcy, macgrady, mvp}, ……
  T’1 = {kobe, bryant, age}, T’2 = {mvp, tracy, mcgrady}, ……
Result: (s2, s’2), …
Naive Solution: Enumerating N^2 pairs. Quite Expensive!!!
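To make the cost of the naive solution concrete, here is a minimal sketch of a nested-loop join that evaluates a similarity function on every pair. The names `naive_join` and `token_jaccard` and the threshold 0.2 are illustrative assumptions. Exact token Jaccard is used only as a stand-in so the example stays short; it is not the fuzzy-token similarity from the talk.

```python
from itertools import product

def token_jaccard(s, t):
    # Stand-in similarity: exact Jaccard over whitespace tokens.
    a, b = set(s.split()), set(t.split())
    return len(a & b) / len(a | b)

def naive_join(left, right, sim, threshold):
    # Compare every string in `left` with every string in `right`.
    results = []
    compared = 0
    for s, t in product(left, right):
        compared += 1
        if sim(s, t) >= threshold:
            results.append((s, t))
    return results, compared

left = ["kobe and trancy", "trcy macgrady mvp"]
right = ["kobe bryant age", "mvp tracy mcgrady"]
pairs, compared = naive_join(left, right, token_jaccard, 0.2)
print(compared, "comparisons")
print(pairs)
```

The number of comparisons is always the product of the two input sizes, regardless of how few pairs qualify, which is why the signature schemes that follow aim to generate a small candidate set first.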
13
Using Existing Methods
14
Our Signature Scheme
The superscript denotes which token generates the signature
15
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for Token Sets Signature Scheme for Tokens Experiment Conclusion
16
Prefix Filtering Signature Scheme
  Alphabetical Order
  Remove 2 largest signatures
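The slide's figure is missing, so the following is a generic sketch of prefix filtering for an overlap threshold, not a reconstruction of the slide's example. It assumes the usual formulation: sort each set's signatures in one global (here alphabetical) order and drop the c - 1 largest, so that any two sets sharing at least c signatures must share one in their remaining prefixes. "Remove 2 largest signatures" then corresponds to c = 3. The sample sets and function names are hypothetical.

```python
def prefix_signatures(tokens, c):
    # Keep everything except the (c - 1) largest signatures
    # under a global alphabetical order.
    ordered = sorted(set(tokens))
    return set(ordered[:max(len(ordered) - (c - 1), 0)])

def candidate_pairs(sets, c):
    # An inverted index over prefix signatures yields the candidates:
    # two sets become a candidate pair if their prefixes share a signature.
    index = {}
    candidates = set()
    for i, tokens in enumerate(sets):
        for sig in prefix_signatures(tokens, c):
            for j in index.get(sig, []):
                candidates.add((j, i))
            index.setdefault(sig, []).append(i)
    return candidates

# Hypothetical signature sets, not taken from the slides.
sets = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["w", "x", "y", "z"],
]
print(candidate_pairs(sets, 3))
```

Candidates still have to be verified with the real similarity function; the filter only guarantees that no pair with overlap of at least c is missed.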
17
Token Sensitive Signature Scheme
  Prefix Filtering: No!
  Token Sensitive: Yes!
18
Token Sensitive Signature Scheme (Cont’d)
  Alphabetical Order
  Delete the maximal number of largest signatures that contain 2 tokens
  Candidates: {(T2,T4)}
  Candidates: {(T1,T2), (T1,T3), (T1,T4), (T2,T4)}
19
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for token sets Signature Scheme for tokens Experiment Conclusion
20
Partition-NED Signature Scheme
21
Partition t’
22
Partition t
23
Partition t (Cont’d) -3 -2 2
24
Pruning Techniques
Reduce substrings from 21 to 8
25
Comparison with Partition-ED (SIGMOD 09)
26
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for token sets Signature Scheme for tokens Experiment Conclusion
27
Experiment Setup
Data sets
  DBLP Author: Author names from DBLP dataset
  AOL Query Log: Queries from AOL dataset
Environment
  C++, GCC 4.2.3, Ubuntu
  Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory
28
Result Quality
29
Evaluation on Different Signature Schemes for Tokens
30
Evaluation on Different Signature Schemes for Token Sets
31
Put Everything Together
32
Outline Introduction Fuzzy-Token Similarity String Similarity Join using Fuzzy-Token Similarity Signature Scheme for Token Sets Signature Scheme for Tokens Experiment Conclusion
33
Conclusion
Fuzzy-token similarity
  Hybrid similarity
  Subsumes many well-known similarities
  High result quality
String similarity join using fuzzy-token similarity
  Signature-based framework
  Token-sensitive signature scheme
  Partition-NED signature scheme
  Achieves higher performance than the state-of-the-art methods, both theoretically and experimentally
34
http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/fastjoin/