Group Linkage. Byung-Won On (Penn State Univ.), Nick Koudas (Univ. of Toronto), Dongwon Lee (Penn State Univ.), Divesh Srivastava (AT&T Labs – Research). ICDE
Outline: Introduction; Matching; Bipartite Graph; Group Linkage; Bipartite matching; Pre-processing step to speed up (Greedy matching, Heuristic measure); Experiment & Result; Conclusion
Introduction. Poor quality data in databases: transcription errors, lack of standards for recording, poor database design. How to identify whether two entities are approximately the same? This is the group linkage problem. Ex: “J.Ullman”, “J.D.Ullman”, “Ullman, Jeffrey”.
Group Linkage Problem. Ex: [Figure: authors Lily Hsueh, Davy Jones, Peter Pan and K.L.Hsueh, citations Paper1 to Paper5, the ACM and DBLP libraries, and the label "Implement".] Group: each author. Records: a list of citations per author.
Matching. Matching: a matching in a graph G is a set of non-loop edges with no shared endpoints. Maximum matching: a matching that contains the largest possible number of edges.
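The two definitions can be checked mechanically on a small graph. The following is a minimal Python sketch, not part of the original slides: the function names `is_matching` and `maximum_matching` and the example path graph are hypothetical, and the brute-force search is only meant for tiny graphs.

```python
from itertools import combinations

def is_matching(edges):
    """Return True if no edge is a loop and no two edges share an endpoint."""
    seen = set()
    for u, v in edges:
        if u == v or u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

def maximum_matching(edges):
    """Brute force: try edge subsets from largest to smallest (tiny graphs only)."""
    edges = list(edges)
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            if is_matching(subset):
                return list(subset)
    return []

# Hypothetical path graph a - b - c - d.
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(is_matching([("a", "b"), ("b", "c")]))  # False: both edges touch b
print(len(maximum_matching(path)))            # 2: edges a-b and c-d
```

The exhaustive search is exponential in the number of edges; it is used here only because it mirrors the definition directly.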
Bipartite Graph. Bipartite graph: a graph G is bipartite if its vertex set V is the union of two disjoint independent sets, called the partite sets of G. Bipartite matching: a matching in a bipartite graph.
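Whether a graph is bipartite can be tested by trying to split its vertices into two independent sets. The sketch below does this with a breadth-first 2-coloring; it is an illustration only, and the function name `partite_sets` and the record-style vertex labels are hypothetical.

```python
from collections import deque

def partite_sets(vertices, edges):
    """Return (left, right) partite sets if the graph is bipartite, else None."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    side = {}
    for start in vertices:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in side:
                    side[v] = 1 - side[u]  # put neighbors on the opposite side
                    queue.append(v)
                elif side[v] == side[u]:
                    return None  # an edge inside one set: it is not independent
    left = {v for v, s in side.items() if s == 0}
    right = {v for v, s in side.items() if s == 1}
    return left, right

# Hypothetical records of two groups, with edges only across the groups.
vertices = ["r11", "r12", "r21", "r22"]
edges = [("r11", "r21"), ("r11", "r22"), ("r12", "r21")]
print(partite_sets(vertices, edges))
```

In group linkage the graph is bipartite by construction, since edges only connect a record of one group to a record of the other.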
Group Linkage (1). Jaccard similarity measure between two sets s_1 and s_2: $|s_1 \cap s_2| / |s_1 \cup s_2|$. Under this measure, records from the two groups can be put into a matching only when they are identical.
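A short Python sketch of the Jaccard measure shows why exact identity is limiting. The function name `jaccard`, the paper identifiers and the convention of returning 0 for two empty sets are assumptions made for illustration.

```python
def jaccard(s1, s2):
    """Jaccard similarity: size of the intersection over size of the union."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        return 0.0  # convention chosen here for two empty sets (assumption)
    return len(s1 & s2) / len(union)

# Two groups that share two identical records out of five distinct ones.
print(jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4", "p5"}))  # 0.4

# Near-duplicate titles count as different records, so the score drops to 0.
g1 = {"Register Allocation & Spilling via graph coloring"}
g2 = {"Register Allocation and Spilling via graph coloring"}
print(jaccard(g1, g2))  # 0.0
```

The second call illustrates the motivation for a record-level similarity function: two citations that differ only in "&" versus "and" contribute nothing under set identity.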
Group Linkage (2).
Notation            Description
D                   Relation of multi-attribute records
g_1, g_2, …         Groups of records in D
r_1, r_2, …         Records in D
sim(r_i, r_j)       Arbitrary record-level similarity function
θ                   Group-level similarity threshold
ρ                   Record-level similarity threshold
M                   Maximum weight bipartite matching
BM                  Bipartite matching based group linkage
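Group Linkage (3). [Figure: bipartite graph linking records r_11 to r_14 of group g_1 with records r_21 to r_25 of group g_2; annotations: "similar records", "normalize", "group similarity".] Example of similar records: authors “K.L.Hsueh” and “Lily.Hsueh”, titles “Register Allocation & Spilling via graph coloring” and “Register Allocation and Spilling via graph coloring”.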
Bipartite Matching. Record-level similarity measure: [5] S. Chaudhuri, V. Ganti, and R. Kaushik. “A Primitive Operator for Similarity Joins in Data Cleaning”. In IEEE ICDE, 2006. Maximum weight bipartite matching (BM): [10] S. Guha, N. Koudas, A. Marathe, and D. Srivastava. “Merging the Results of Approximate Match Operations”. In VLDB, pages , . Applying this strategy to every pair of groups is infeasible, so a pre-processing step is used: greedy matching and a heuristic measure.
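To make the BM measure concrete, here is a minimal Python sketch. The exact formula did not survive in these slides, so the normalization is an assumption: the total similarity of a maximum weight matching M over record pairs with sim ≥ ρ, divided by |g_1| + |g_2| − |M|, which reduces to Jaccard when sim only recognizes identical records. The function name `bm_similarity` and the example weights are hypothetical, and the search is a small bitmask dynamic program rather than a production matching algorithm.

```python
from functools import lru_cache

def bm_similarity(g1, g2, sim, rho):
    """Assumed form: matched weight / (|g1| + |g2| - |M|), M a max weight matching."""
    n, m = len(g1), len(g2)

    def edge(a, b):
        s = sim(a, b)
        return s if s >= rho else None  # pairs below rho get no edge

    w = [[edge(a, b) for b in g2] for a in g1]

    @lru_cache(maxsize=None)
    def best(i, used):
        """Best (weight, edge count) for records i.. of g1; `used` is a
        bitmask of the g2 records that are already matched."""
        if i == n:
            return (0.0, 0)
        result = best(i + 1, used)  # option: leave record i unmatched
        for j in range(m):
            if w[i][j] is not None and not used & (1 << j):
                weight, count = best(i + 1, used | (1 << j))
                result = max(result, (weight + w[i][j], count + 1))
        return result

    total, size = best(0, 0)
    return total / (n + m - size) if n + m else 0.0

# Hypothetical record-level similarities between two small groups.
W = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.8, ("b", "y"): 0.1}
score = bm_similarity(["a", "b"], ["x", "y"], lambda a, b: W[(a, b)], rho=0.5)
print(score)  # matching a-y, b-x (weight 1.6) beats a-x alone (weight 0.9)
```

The dynamic program is exponential in the size of the second group, which illustrates why running an exact matching for every pair of groups is expensive; a polynomial algorithm such as the Hungarian method would be the usual choice in practice.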
Greedy Matching (1). S1: for each record r_i ∈ g_1, find a record r_j ∈ g_2 with the highest record-level similarity among those with sim(r_i, r_j) ≥ ρ. S2: same as S1, with the roles of g_1 and g_2 swapped. [Figure: bipartite graph between records r_11 to r_14 of g_1 and records r_21 to r_25 of g_2.] The result may not be a matching!
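The greedy step can be sketched in a few lines of Python. This is an illustration under assumptions: S2 is taken to be the same procedure started from the records of g_2, records are assumed hashable, and the names `greedy_pairs`, `is_matching` and the weights in `W` are hypothetical.

```python
def greedy_pairs(g1, g2, sim, rho):
    """For each record in g1, pair it with its most similar record in g2,
    provided that similarity is at least rho."""
    pairs = set()
    for a in g1:
        best = max(g2, key=lambda b: sim(a, b), default=None)
        if best is not None and sim(a, best) >= rho:
            pairs.add((a, best))
    return pairs

def is_matching(pairs):
    """True if no record of either group appears in two pairs."""
    lefts = [a for a, _ in pairs]
    rights = [b for _, b in pairs]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)

# Hypothetical record-level similarities; missing pairs count as 0.
W = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.8}
sim = lambda a, b: W.get((a, b), 0.0)
g1, g2 = ["a", "b"], ["x", "y"]

s1 = greedy_pairs(g1, g2, sim, rho=0.5)
# S2 runs from the g2 side; flip each pair back to (g1 record, g2 record).
s2 = {(a, b) for b, a in greedy_pairs(g2, g1, lambda b, a: sim(a, b), rho=0.5)}

print(s1, is_matching(s1))  # a and b both pick x, so S1 is not a matching
print(s2, is_matching(s2))  # x and y both pick a, so S2 is not a matching
print(is_matching(s1 & s2)) # True: pairs chosen from both sides never clash
```

Each greedy pass costs one scan over the record pairs, which is much cheaper than an exact maximum weight matching.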
Greedy Matching (2). Upper and lower bounds to BM_{sim,ρ}. [Figure: bipartite graph between records r_11 to r_14 of g_1 and records r_21 to r_25 of g_2.]
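Greedy Matching (2). BM_{sim,ρ} is bounded: LB_{sim,ρ} ≤ BM_{sim,ρ} ≤ UB_{sim,ρ}. Only when the bounds cannot decide, that is when LB_{sim,ρ} < θ ≤ UB_{sim,ρ}, would the more expensive BM_{sim,ρ} computation be needed.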
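Bounds of this kind are typically used as a filter. The Python sketch below assumes only that LB ≤ BM ≤ UB holds and that two groups are linked when BM ≥ θ; the function `linked` and the stub bound functions are hypothetical placeholders, not the formulas from the talk.

```python
def linked(g1, g2, theta, lower_bound, upper_bound, exact_bm):
    """Decide whether BM(g1, g2) >= theta, calling exact_bm only if needed."""
    if upper_bound(g1, g2) < theta:
        return False  # BM <= UB < theta: the pair cannot be linked
    if lower_bound(g1, g2) >= theta:
        return True   # BM >= LB >= theta: the pair is certainly linked
    return exact_bm(g1, g2) >= theta  # bounds undecided: pay for the exact value

# Stub measures with fixed values, used only to show which branch fires.
calls = []

def make(value, name):
    def measure(g1, g2):
        calls.append(name)
        return value
    return measure

print(linked([], [], 0.5, make(0.1, "LB"), make(0.3, "UB"), make(0.2, "BM")))
print(linked([], [], 0.5, make(0.6, "LB"), make(0.9, "UB"), make(0.7, "BM")))
print(linked([], [], 0.5, make(0.4, "LB"), make(0.7, "UB"), make(0.55, "BM")))
print(calls.count("BM"))  # the exact measure ran only for the third pair
```

Checking the upper bound first is a design choice for workloads where most group pairs are not linked, so most pairs are discarded after a single cheap test.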
Heuristic Measure. In practice, pairs of groups with a high value of BM_{sim,ρ} will share at least one pair of records with a high record-level similarity. This suggests a simpler and faster measure, MAX_{sim,ρ}.
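One way to read this heuristic is as the highest record-level similarity between any record of one group and any record of the other. The Python sketch below follows that reading, which is an assumption about the MAX measure named elsewhere in these slides; `max_similarity` and the word-level `word_jaccard` similarity are hypothetical choices made for illustration.

```python
def word_jaccard(a, b):
    """Hypothetical record-level similarity: Jaccard over lowercase words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def max_similarity(g1, g2, sim):
    """Highest record-level similarity over all record pairs (0 if a group is empty)."""
    return max((sim(a, b) for a in g1 for b in g2), default=0.0)

g1 = ["Register Allocation & Spilling via graph coloring",
      "Query optimization"]
g2 = ["Register Allocation and Spilling via graph coloring"]
print(max_similarity(g1, g2, word_jaccard))  # 0.75: 6 shared words out of 8
```

No matching is computed at all, which is what makes the measure cheap; the price is that a single similar pair of records decides the score for the whole pair of groups.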
Implementation. Implemented UB_{sim,ρ}, LB_{sim,ρ}, and MAX_{sim,ρ} in SQL. (We only discuss UB.)
Notation:
group                   author
record in a group       citations of an author
group linkage problem   linkage between authors
key to link             author names
Experiment. Real data sets: data sets from the ACM and DBLP citation digital libraries.
R1: uniform data sets
  R1a: average # of citations: left=41, right=25
  R1b: average # of citations: left=40, right=55
R2: skewed data sets
  R2 DB: average # of citations: left=30, right=9
  R2 AI: average # of citations: left=31, right=10
  R2 Net: average # of citations: left=22, right=6
Experiment. Synthetic data sets:
S1a and S1b: same as R1a, but dummy authors are injected into the right data set
  S1a: # of citations 1/3
  S1b: # of citations 3
S2: the “dbgen” tool is used to generate dummy authors with varying levels of errors, which are inserted into the right data set.
Experiment. Evaluation metric: average recall. If a_2 is included in the top-k answer window for a_1, then recall becomes 1, and 0 otherwise. Compared methods: A(k_1)|B(k_2). Step 1: A, window size k_1. Step 2: B, window size k_2. Environment: Microsoft SQL Server 2000 on a Pentium III 3 GHz/512 MB machine.
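The metric and the two-step notation can be sketched in Python. The reading of A(k_1)|B(k_2) used here is an assumption: measure A keeps its top k_1 candidates, and measure B re-ranks only those and keeps the top k_2. All names (`two_step`, `average_recall`) and scores are hypothetical.

```python
def two_step(a1, candidates, cheap, expensive, k1, k2):
    """A(k1)|B(k2): shortlist the top k1 by the cheap measure A, then
    re-rank the shortlist by the expensive measure B and keep the top k2."""
    shortlist = sorted(candidates, key=lambda a2: cheap(a1, a2), reverse=True)[:k1]
    return sorted(shortlist, key=lambda a2: expensive(a1, a2), reverse=True)[:k2]

def average_recall(rankings, truth, k):
    """Per query author a1, recall is 1 if the true a2 is in the top-k window
    and 0 otherwise; return the mean over all query authors."""
    hits = sum(1 for a1, a2 in truth.items() if a2 in rankings.get(a1, [])[:k])
    return hits / len(truth) if truth else 0.0

# Hypothetical scores of three candidate authors for one query author "a".
cheap_scores = {"x": 0.9, "y": 0.8, "z": 0.1}
exact_scores = {"x": 0.2, "y": 0.7, "z": 0.95}
window = two_step("a", ["x", "y", "z"],
                  lambda a1, a2: cheap_scores[a2],
                  lambda a1, a2: exact_scores[a2],
                  k1=2, k2=1)
print(window)  # ['y']: z scores best on B but was filtered out in step 1
print(average_recall({"a": window}, {"a": "y"}, k=1))  # 1.0
```

The example also shows the risk of the first step: a candidate that the cheap measure ranks too low never reaches the second step, so the choice of k_1 affects recall.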
Results. Uniform data set: R1 real data set.
Results. S1 and S2 synthetic data sets. JA incorrectly selects dummy authors. JA and BM are applied directly to S2; BM outperforms JA by 16-17%.
Results. R2 real data set. Pre-processing using UB and MAX: UB outperforms MAX in recall.
Results. Record-level similarity measure: cosine similarity with TF/IDF weighting. Running time against R2 (in sec).
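For readers unfamiliar with this record-level measure, the following is a minimal Python sketch of cosine similarity over TF/IDF-weighted word vectors. It is not the SQL implementation used in the experiments: the whitespace tokenization, the smoothed idf variant log(1 + N/df) and the names `tfidf_vectors` and `cosine` are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a string) to a dict of token -> tf * idf weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf (one of several common variants) so that tokens present
    # in every document keep a small positive weight.
    idf = {tok: math.log(1 + n / df[tok]) for tok in df}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

titles = ["Register Allocation & Spilling via graph coloring",
          "Register Allocation and Spilling via graph coloring",
          "Query optimization in relational databases"]
v = tfidf_vectors(titles)
print(cosine(v[0], v[1]) > cosine(v[0], v[2]))  # True: near-duplicates score higher
```

Unlike exact set identity, this measure gives the two near-duplicate titles a high but not perfect score, which is the kind of graded record-level similarity the matching-based group measure relies on.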
Results. Window size.
Conclusion. Proposed a bipartite matching based group similarity measure (BM) to solve the group linkage problem. Proved that upper and lower bounds of BM can be used for speed-up. BM is a more robust group similarity measure than the others.