Published by Stephanie McKinney. Modified over 9 years ago.
1
Group Linkage
Byung-Won On (Penn State Univ.), Nick Koudas (Univ. of Toronto), Dongwon Lee (Penn State Univ.), Divesh Srivastava (AT&T Labs – Research)
ICDE 2007
2
Outline
  Introduction
  Matching
  Bipartite Graph
  Group Linkage
    Bipartite matching
    Pre-processing step to speed up
      Greedy matching
      Heuristic measure
  Experiment & Result
  Conclusion
3
Introduction
Poor quality data in databases
  Transcription errors
  Lack of standards for recording
  Poor database design
How to identify whether two entities are approximately the same?
  Group linkage problem
  Ex: “J.Ullman”, “J.D.Ullman”, “Ullman, Jeffrey”
4
Group Linkage Problem
Ex: [Figure: authors and their papers in two digital libraries, ACM and DBLP. Labels: Lily Hsueh, K.L.Hsueh, Davy Jones, Peter Pan, Paper1 to Paper5]
Group: each author
Records: a list of citations per author
5
Matching
Matching: a matching in a graph G is a set of non-loop edges with no shared endpoints.
Maximum matching: a matching that contains the largest possible number of edges.
6
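Bipartite Graph
Bipartite graph: a graph G is bipartite if its vertex set V is the union of two disjoint independent sets, called the partite sets of G.
Bipartite matching: a matching in a bipartite graph, so every edge joins one partite set to the other.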
7
Group Linkage(1)
Jaccard similarity measure between two sets s_1 and s_2: $|s_1 \cap s_2| / |s_1 \cup s_2|$
Records from the two groups can be put into a matching only when they are identical.
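A minimal sketch of the Jaccard measure applied to groups, assuming each group is simply a collection of record strings. The group contents are hypothetical, and returning 0.0 for two empty groups is a convention chosen here, not something the slides specify.

```python
def jaccard(s1, s2):
    """Jaccard similarity |s1 ∩ s2| / |s1 ∪ s2| of two collections treated as sets."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 0.0  # assumption: two empty groups count as dissimilar
    return len(a & b) / len(a | b)


# Hypothetical groups: each author is a group, each citation title a record.
g1 = ["Paper1", "Paper2", "Paper3"]
g2 = ["Paper2", "Paper3", "Paper4"]
shared = jaccard(g1, g2)  # 2 identical records out of 4 distinct ones
print(shared)

# Near-duplicate titles contribute nothing, because only identical records match.
near = jaccard(
    ["Register Allocation & Spilling via graph coloring"],
    ["Register Allocation and Spilling via graph coloring"],
)
print(near)
```

The second call shows the limitation that motivates a record-level similarity function: two titles that differ in a single token are treated as unrelated.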
8
Group Linkage(2)
Notation          Description
D                 Relation of multi-attribute records
g_1, g_2, ...     Groups of records in D
r_1, r_2, ...     Records in D
sim(r_i, r_j)     Arbitrary record-level similarity function
θ                 Group-level similarity threshold
ρ                 Record-level similarity threshold
M                 Maximum weight bipartite matching
BM                Bipartite matching based group linkage
9
Group Linkage(3)
[Figure: bipartite graph between the records r_11 to r_14 of group g_1 and the records r_21 to r_25 of group g_2]
Group similarity: [equation lost in extraction; surviving labels: "each", "normalize"]
Similar records:
  K.L.Hsueh: Register Allocation & Spilling via graph coloring
  Lily.Hsueh: Register Allocation and Spilling via graph coloring
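The group similarity on this slide can be sketched as follows. The normalization is an assumption, since the slide's equation did not survive: the total similarity of the maximum weight matching M (using only edges with sim ≥ ρ) is divided by |g_1| + |g_2| − |M|. The matrix values are made up, and the exhaustive search stands in for a real maximum weight bipartite matching algorithm; it is only practical for tiny groups.

```python
def max_weight_matching(weights, rho):
    """Exhaustively find the matching with the largest total similarity.

    weights[i][j] is sim(r_1i, r_2j); only edges with weight >= rho may be used.
    Exponential time: an illustration, not a production algorithm.
    """
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    best_total, best_edges = 0.0, []

    def extend(i, used, total, edges):
        nonlocal best_total, best_edges
        if i == n_left:
            if total > best_total:
                best_total, best_edges = total, list(edges)
            return
        extend(i + 1, used, total, edges)  # leave left record i unmatched
        for j in range(n_right):
            if j not in used and weights[i][j] >= rho:
                edges.append((i, j))
                extend(i + 1, used | {j}, total + weights[i][j], edges)
                edges.pop()

    extend(0, frozenset(), 0.0, [])
    return best_total, best_edges


def bm_sim(weights, rho):
    """Assumed normalization: matched weight / (|g1| + |g2| - |M|)."""
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    total, edges = max_weight_matching(weights, rho)
    denom = n_left + n_right - len(edges)
    return total / denom if denom else 0.0


# Hypothetical record-level similarities between 3 records of g1 and 3 of g2.
weights = [
    [0.9, 0.2, 0.0],
    [0.8, 0.7, 0.1],
    [0.0, 0.3, 0.1],
]
total, edges = max_weight_matching(weights, rho=0.5)
print(edges)                          # [(0, 0), (1, 1)]
print(round(bm_sim(weights, 0.5), 3))  # 1.6 / (3 + 3 - 2) = 0.4
```

For realistic group sizes a polynomial-time assignment solver (for example SciPy's `linear_sum_assignment`) could replace the exhaustive search, after zeroing out edges below ρ.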
10
Bipartite Matching
Record-level similarity measure
  [5] S. Chaudhuri, V. Ganti, and R. Kaushik. “A Primitive Operator for Similarity Joins in Data Cleaning”. In IEEE ICDE, 2006.
Maximum weight bipartite matching (BM)
  [10] S. Guha, N. Koudas, A. Marathe, and D. Srivastava. “Merging the Results of Approximate Match Operations”. In VLDB, pages 636-647, 2004.
Applying this strategy to every pair of groups is infeasible, so a pre-processing step is used:
  Greedy matching
  Heuristic measure
11
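Greedy Matching(1)
S1: For each record r_i ∈ g_1, find a record r_j ∈ g_2 with the highest record-level similarity among those with sim(r_i, r_j) ≥ ρ.
S2: Same as S1, with the roles of g_1 and g_2 swapped.
[Figure: bipartite graph between the records r_11 to r_14 of g_1 and r_21 to r_25 of g_2]
S1 and S2 may not be matchings, because several records can pick the same partner.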
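The two greedy steps can be sketched as below, assuming the record-level similarities are already available as a matrix `weights[i][j]` (a hypothetical input format; the values are made up). Ties are broken by taking the first best partner, which is a choice made here rather than something the slides state.

```python
def greedy_s1(weights, rho):
    """S1: each left record picks its most similar right record, if sim >= rho."""
    edges = set()
    for i, row in enumerate(weights):
        if not row:
            continue
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= rho:
            edges.add((i, j))
    return edges


def greedy_s2(weights, rho):
    """S2: the same step from the right side; edges are still (left, right)."""
    n_right = len(weights[0]) if weights else 0
    edges = set()
    for j in range(n_right):
        i = max(range(len(weights)), key=lambda i: weights[i][j])
        if weights[i][j] >= rho:
            edges.add((i, j))
    return edges


def is_matching(edges):
    """True when no left or right record appears in more than one edge."""
    lefts = [i for i, _ in edges]
    rights = [j for _, j in edges]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


weights = [
    [0.9, 0.2, 0.0],
    [0.8, 0.7, 0.1],
    [0.0, 0.3, 0.1],
]
s1 = greedy_s1(weights, rho=0.5)
s2 = greedy_s2(weights, rho=0.5)
print(sorted(s1), is_matching(s1))  # left records 0 and 1 both pick right record 0
print(sorted(s2), is_matching(s2))
print(sorted(s1 & s2), is_matching(s1 & s2))
```

In this example S1 is not a matching, while the intersection of S1 and S2 always is, since each record has at most one edge in it.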
12
Greedy Matching(2)
Upper and lower bounds to BM_{sim,ρ}
[Figure: bipartite graph between the records r_11 to r_14 of g_1 and r_21 to r_25 of g_2]
13
Greedy Matching(2)
BM_{sim,ρ} is bounded: LB_{sim,ρ} ≤ BM_{sim,ρ} ≤ UB_{sim,ρ}.
Only when LB_{sim,ρ} < θ ≤ UB_{sim,ρ} would the more expensive BM_{sim,ρ} computation be needed.
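A sketch of how such bounds could act as a filter. The bound formulas are assumptions, because the slide's equations were lost: the upper bound sums the similarities over S1 ∪ S2 and the lower bound over S1 ∩ S2, each divided by |g_1| + |g_2| minus the number of edges used. `exact_bm` is a hypothetical callable standing in for the expensive matching computation.

```python
def greedy_edges(weights, rho):
    """Return (S1, S2): best partner with sim >= rho per left and per right record."""
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    s1, s2 = set(), set()
    if n_left == 0 or n_right == 0:
        return s1, s2
    for i in range(n_left):
        j = max(range(n_right), key=lambda j: weights[i][j])
        if weights[i][j] >= rho:
            s1.add((i, j))
    for j in range(n_right):
        i = max(range(n_left), key=lambda i: weights[i][j])
        if weights[i][j] >= rho:
            s2.add((i, j))
    return s1, s2


def bounds(weights, rho):
    """Assumed forms of (LB, UB), built from S1 ∩ S2 and S1 ∪ S2."""
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    s1, s2 = greedy_edges(weights, rho)

    def score(edges):
        denom = n_left + n_right - len(edges)
        return sum(weights[i][j] for i, j in edges) / denom if denom else 0.0

    return score(s1 & s2), score(s1 | s2)


def decide(weights, theta, rho, exact_bm):
    """Link decision that calls exact_bm only when the bounds are inconclusive."""
    lb, ub = bounds(weights, rho)
    if ub < theta:
        return False  # the exact value cannot reach theta
    if lb >= theta:
        return True   # the exact value is already at least theta
    return exact_bm(weights, rho) >= theta


weights = [
    [0.9, 0.2, 0.0],
    [0.8, 0.7, 0.1],
    [0.0, 0.3, 0.1],
]
calls = []

def stub_bm(w, rho):
    """Hypothetical stand-in for the exact measure; 0.4 was worked out by hand."""
    calls.append(1)
    return 0.4

lb, ub = bounds(weights, rho=0.5)
print(round(lb, 3), round(ub, 3))           # 0.9 / 5 and 2.4 / 3
print(decide(weights, 0.9, 0.5, stub_bm))   # pruned by the upper bound
print(decide(weights, 0.1, 0.5, stub_bm))   # accepted by the lower bound
print(decide(weights, 0.3, 0.5, stub_bm))   # falls between the bounds
print(len(calls))                           # exact measure was needed once
```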
14
Heuristic Measure
In practice, pairs of groups with a high value of BM_{sim,ρ} will share at least one pair of records with a high record-level similarity.
This suggests a simpler and faster measure, MAX_{sim,ρ}.
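One way to read this heuristic is as the largest record-level similarity over all cross-group record pairs. The sketch below assumes that reading; the slide's own formula is not recoverable. `token_jaccard` is a hypothetical record-level similarity chosen for illustration (the experiments use cosine similarity with TF/IDF weighting instead).

```python
def token_jaccard(a, b):
    """Hypothetical record-level similarity: Jaccard over lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def max_sim(g1, g2, sim):
    """Largest record-level similarity over all cross-group record pairs."""
    return max((sim(r1, r2) for r1 in g1 for r2 in g2), default=0.0)


# Two author groups; the first titles are taken from the slides, the rest are placeholders.
g1 = ["Register Allocation & Spilling via graph coloring", "Paper1"]
g2 = ["Register Allocation and Spilling via graph coloring", "Paper2"]
best = max_sim(g1, g2, token_jaccard)
print(best)  # 6 shared tokens out of 8 distinct ones -> 0.75
```

A single pass over the record pairs is enough, with no matching step, which is what makes the measure cheap enough for pre-processing.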
15
Implementation
Implemented UB_{sim,ρ}, LB_{sim,ρ}, and MAX_{sim,ρ} in SQL. (We only discuss UB.)
Notation:
  group                   author
  record in a group       citations of an author
  group linkage problem   linkage between authors
  key to link             author names
16
Experiment
Real data sets: data sets from the ACM and DBLP citation digital libraries.
R1: uniform data sets
  R1a: average # of citations: left=41, right=25
  R1b: average # of citations: left=40, right=55
R2: skewed data sets
  R2_DB: average # of citations: left=30, right=9
  R2_AI: average # of citations: left=31, right=10
  R2_Net: average # of citations: left=22, right=6
17
Experiment
Synthetic data sets:
S1a and S1b: same as R1a, but dummy authors are injected into the right data set
  S1a: # of citations 1/3
  S1b: # of citations 3
S2: dummy authors with varying levels of errors are generated with the “dbgen” tool and inserted into the right data set.
18
Experiment
Evaluation metric: average recall
  If a_2 is included in the top-k answer window for a_1, then recall is 1, and 0 otherwise.
Compared methods: A(k_1)|B(k_2)
  Step 1: A, window size k_1
  Step 2: B, window size k_2
Environment: Microsoft SQL Server 2000 on a Pentium III 3GHz/512MB machine
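The recall metric described on this slide can be sketched as follows, assuming each query author a_1 has exactly one true match a_2 and a ranked answer list. The dictionary layout and the author name "P.Pan" are hypothetical.

```python
def average_recall(ranked_answers, truth, k):
    """Mean over query authors of 1 if the true match is in the top-k window, else 0."""
    if not truth:
        return 0.0
    hits = sum(
        1 for a1, a2 in truth.items() if a2 in ranked_answers.get(a1, [])[:k]
    )
    return hits / len(truth)


# Ranked candidates returned for each left-side author (best first).
ranked = {
    "K.L.Hsueh": ["Lily Hsueh", "Davy Jones"],
    "P.Pan": ["Davy Jones", "Peter Pan"],
}
truth = {"K.L.Hsueh": "Lily Hsueh", "P.Pan": "Peter Pan"}

print(average_recall(ranked, truth, k=1))  # only the first author is found -> 0.5
print(average_recall(ranked, truth, k=2))  # both are inside the window -> 1.0
```

In a two-step method A(k_1)|B(k_2), the same function would be applied to the final ranking produced by B over the k_1 candidates that A kept.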
19
Results
Uniform data set: R1 (real data set)
20
Results
S1 and S2 synthetic data sets
JA incorrectly selects dummy authors.
JA and BM are directly applied to S2; BM outperforms JA by 16-17%.
21
Results
R2 real data set
Pre-processing using: UB, MAX
UB outperforms MAX in recall.
22
Results
Record-level similarity measure: cosine similarity with TF/IDF weighting.
Running time against R2 (in sec)
23
Results
Window size
24
Conclusion
Proposed a bipartite matching based group similarity measure to solve the group linkage problem.
Proved that upper and lower bounds of BM can be used for speed-up.
BM is a more robust group similarity measure than the others.