Linking Records with Erroneous Values
Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac
AT&T Labs
Motivation
[Figure: listings from sources V, D, A, and T feed into Cleaned Data, which feeds the Search Box]

Src | Name | Phone | Address | City
V | A-Link Wireless | | GLENDALE GALLERIA | GLENDALE
V | Abercrombie | | GLENDALE GALLERIA | GLENDALE
V | Abercrombie & Fitch | | GLENDALE GALLERIA | GLENDALE
V | Aeropostale | | GLENDALE GALLERIA | GLENDALE
V | Aerosoles | | GLENDALE GALLERIA | GLENDALE
V | Newtown Pizza Palace | | Church hill Rd | NEWTOWN
V | Pizza Palace Of Newtown | | Church hill Rd | NEWTOWN

Src | Name | Phone | Address | City
D | Aerosoles | | GLENDALE GALLERIA | GLENDALE
D | Aldo Shoes | | GLENDALE GALLERIA | GLENDALE
D | Newtown Pizza Palace | | Church hill Rd | Newtown
D | Pizza Palace of Newtown | | Church Hill Rd | Newtown

Src | Name | Phone | Address | City
A | A 24 Hour 1 A 1 Locksmith | | GLENDALE GALLERIA | GLENDALE
A | A Link Wireless | | GLENDALE GALLERIA | GLENDALE
A | Abercrombie | | GLENDALE GALLERIA | GLENDALE
A | Abercrombie & Fitch | | GLENDALE GALLERIA | GLENDALE
A | Newtown Pizza Palace | | Church hill Rd | Newtown
A | Aldo Shoes | | GLENDALE GALLERIA | GLENDALE
A | Alert Cellular | | GLENDALE GALLERIA | GLENDALE

Src | Name | Phone | Address | City
T | Newtown Pizza Palace | | Church hill Rd | Newtown
T | Aldo Shoes | | GLENDALE GALLERIA | GLENDALE
T | American Eagle Outfitters | | GLENDALE GALLERIA | GLENDALE
T | ANN TAYLOR | | GLENDALE GALLERIA | GLENDALE
T | Ann Taylor Stores | | GLENDALE GALLERIA | GLENDALE
Motivation
Which type of listings are these?
A: the same business
B: different businesses sharing the same phone#
C: different businesses, only one correctly associated with the given phone#
Current Solution
Uniqueness constraint
- Each real-world entity has a unique value for the attribute, e.g., phone, address
The data may not satisfy the constraint
- Erroneous values
- Small number of exceptions
Current two-step solution
- Step 1: Record Linkage. Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE07], [W. Winkler, Tech Report06]
- Step 2: Data Fusion. Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]
Limitations of Current Solution

SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx | Microsoft Way
s1 | Microsofe Corp. | xxx | Microsoft Way
s1 | Macrosoft Inc. | xxx | Sylvan W.
s2 | Microsoft Corp. | xxx | Microsoft Way
s2 | Microsofe Corp. | xxx | Microsoft Way
s2 | Macrosoft Inc. | xxx | Sylvan Way
s3 | Microsoft Corp. | xxx | Microsoft Way
s3 | Microsoft Corp. | xxx | Microsoft Way
s3 | Macrosoft Inc. | xxx | Sylvan Way
s4 | Microsoft Corp. | xxx | Microsoft Way
s4 | Microsoft Corp. | xxx | Microsoft Way
s4 | Macrosoft Inc. | xxx | Sylvan Way
s5 | Microsoft Corp. | xxx | Microsoft Way
s5 | Microsoft Corp. | xxx | Microsoft Way
s5 | Macrosoft Inc. | xxx | Sylvan Way
s6 | Microsoft Corp. | xxx | Microsoft Way
s6 | Macrosoft Inc. | xxx | Sylvan Way
s7 | MS Corp. | xxx | Microsoft Way
s7 | Macrosoft Inc. | xxx | Sylvan Way
s8 | MS Corp. | xxx | Microsoft Way
s8 | Macrosoft Inc. | xxx | Sylvan Way
s9 | Macrosoft Inc. | xxx | Sylvan Way
s10 | MS Corp. | xxx | Sylvan Way

- Locally resolving conflicts for linked records may overlook important global evidence
- Erroneous values may prevent correct matching
- Traditional techniques may fall short when exceptions to the uniqueness constraints exist

Desired result:
(Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.)
Our Solution
Perform linkage and fusion simultaneously
- Incorrect values can be identified from the beginning, which improves linkage
Make global decisions
- Consider all sources that associate a pair of values in the same record, which improves fusion
Allow a small number of violations to capture possible exceptions in the real world
Road Map
- Motivation and overview
- Problem definition
- Solution
- Evaluations on YP data
- Conclusions
Problem Input
A set of independent data sources, each providing a set of records
A set of (soft) uniqueness constraints
- Uniqueness constraint (hard constraint): Business Name, Business Phone, Business Address
- Soft uniqueness constraint (soft constraint): Business Phone (annotated with $1-p_1$ and $1-p_2$)
Problem Output
Real-world entities
For each (soft) uniqueness attribute of each entity
- True value (if any)
- Various representations of each true value
Example:
(Microsoft Corp., Microsofe Corp., MS Corp.) (XXX-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (XXX-0500) (2 Sylvan Way, 2 Sylvan W.)
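K-Partite Graph Encoding
[Figure: k-partite graph over the example data]
- Name nodes N1-N4: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.
- Phone nodes P1-P4: xxx-1255, xxx-2255, xxx-9400, xxx-0500
- Address nodes A1-A3: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.
- Each edge is labeled with the sources that provide the two values in the same record: s(1), s(1-2), s(1-5), s(1-5,7,8), s(2-5), s(2-6), s(3-5), s(6), s(7-8), s(10), s(1-9), s(2-9), s(2-10)
- Example record from S1: Microsofe Corp., XXX, Microsoft Way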
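The encoding can be sketched in a few lines of Python. This is only an illustration under assumptions: records are taken to be (source, name, phone, address) tuples, the function name `build_kpartite_graph` is hypothetical, and the three sample records are made up in the spirit of the running example.

```python
from collections import defaultdict
from itertools import combinations

ATTRIBUTES = ("name", "phone", "address")

def build_kpartite_graph(records):
    """Return {((attr_a, val_a), (attr_b, val_b)): set_of_sources}.

    Nodes are (attribute, value) pairs. An edge links two values of
    different attributes that co-occur in a record and is labeled with
    the sources that provide such a record.
    """
    edges = defaultdict(set)
    for source, *values in records:
        nodes = [(a, v) for a, v in zip(ATTRIBUTES, values) if v is not None]
        # combinations() keeps attribute order, so keys are consistent
        for u, v in combinations(nodes, 2):
            edges[(u, v)].add(source)
    return dict(edges)

# Hypothetical sample records (not the full table from the slides)
records = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
]
graph = build_kpartite_graph(records)
```

Because edges only connect values of different attributes, the graph is k-partite with one partition per attribute (k = 3 here).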
Solution Encoding
[Figure: the example graph with name nodes N1-N4, phone nodes P1-P4, and address nodes A1-A3]
Clustering problem & Matching problem
Solution Encoding with Hard Constraint
[Figure: the example graph partitioned into clusters C1, C2, C3, C4]
Clustering problem
Road Map
- Motivation and overview
- Problem definition
- Solution
  - Clustering w.r.t. hard constraint
  - Matching w.r.t. soft constraint
- Evaluations on YP data
- Conclusions
Clustering w.r.t. Hard Constraints
[Figure: example graph with clusters C1 and C4]
Ideal clustering:
- high cohesion within each cluster
- low correlation between different clusters
Objective function
- Davies-Bouldin index (to be minimized)
Distance between clusters: average of
- similarity distance
- association distance
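The slides name the Davies-Bouldin index but do not spell out its formula. The sketch below uses the standard form (for each cluster, the worst ratio of summed intra-cluster distances to inter-cluster distance, averaged over clusters). The function name, the input layout, and the numbers are assumptions for illustration.

```python
def davies_bouldin(intra, inter):
    """Standard Davies-Bouldin index; lower is better.

    intra: {cluster: distance within the cluster, d(Ci, Ci)}
    inter: {frozenset({Ci, Cj}): distance between clusters, d(Ci, Cj)}
    Needs at least two clusters and non-zero inter-cluster distances.
    """
    clusters = list(intra)
    total = 0.0
    for i in clusters:
        total += max(
            (intra[i] + intra[j]) / inter[frozenset((i, j))]
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Hypothetical distances for two clusters
intra = {"C1": 0.1, "C4": 0.2}
inter = {frozenset(("C1", "C4")): 0.9}
phi = davies_bouldin(intra, inter)
```

Tight clusters (small intra distances) that are far apart (large inter distances) give a small index, which matches the cohesion and correlation goals listed on the slide.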
Similarity Distance
[Figure: example graph with clusters C1 and C4]
- Measures similarity of values
- Defined for each attribute

Within C1:
$d_S^{1}(C_1, C_1) = 1 - (\ldots)/3 = 0.25$ (name)
$d_S^{2}(C_1, C_1) = 0$ (phone)
$d_S^{3}(C_1, C_1) = 0$ (address)
$d_S(C_1, C_1) = (0.25 + 0 + 0)/3 \approx 0.083$

Between C1 and C4:
$d_S^{1}(C_1, C_4) = 1 - (\ldots)/3 = 0.4$
$d_S^{2}(C_1, C_4) = 1 - 0 = 1$
$d_S^{3}(C_1, C_4) = 1 - 0 = 1$
$d_S(C_1, C_4) = (0.4 + 1 + 1)/3 = 0.8$
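One reading of these per-attribute formulas is "one minus the average pairwise similarity of the values". The Python sketch below follows that reading. The similarity functions are stand-ins (the slides do not say which string measure is used), and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations, product

def string_similarity(a, b):
    # Stand-in measure; the slides do not name the one actually used.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact_similarity(a, b):
    # Similarity 1 for identical values, 0 otherwise (e.g. for phones).
    return 1.0 if a == b else 0.0

def intra_similarity_distance(values, sim):
    """1 - average similarity over all pairs of values in one cluster."""
    pairs = list(combinations(values, 2))
    if not pairs:  # a single value is perfectly cohesive
        return 0.0
    return 1.0 - sum(sim(a, b) for a, b in pairs) / len(pairs)

def inter_similarity_distance(values_a, values_b, sim):
    """1 - average similarity over all cross-cluster pairs of values."""
    pairs = list(product(values_a, values_b))
    if not pairs:
        return 1.0
    return 1.0 - sum(sim(a, b) for a, b in pairs) / len(pairs)

c1_names = ["Microsofe Corp.", "Microsoft Corp.", "MS Corp."]
c4_names = ["Macrosoft Inc."]
d_name_c1 = intra_similarity_distance(c1_names, string_similarity)
d_name_c1_c4 = inter_similarity_distance(c1_names, c4_names, string_similarity)
d_phone_c1 = intra_similarity_distance(["xxx-1255"], exact_similarity)
d_phone_c1_c4 = inter_similarity_distance(["xxx-1255"], ["xxx-0500"], exact_similarity)
```

With the exact-match measure, the phone distances reproduce the slide's 0 within C1 and 1 between C1 and C4. The name distances depend on the chosen string measure and will not match the slide's 0.25 and 0.4 exactly.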
Association Distance
[Figure: example graph with clusters C1 and C4; edges labeled with supporting sources]
- Measures association by edges
- Defined for each pair of attributes

Within C1:
- 9 sources (S1-S8, S10) mention (N1, N2, N3, P1)
- 7 sources (S1-S5, S7, S8) support (N1, N2, N3)-P1
$d_A^{1,2}(C_1, C_1) = 1 - 7/9 = 0.22$
$d_A^{1,3}(C_1, C_1) = 1 - 8/9 = 0.11$
$d_A^{2,3}(C_1, C_1) = 1 - 7/8 = 0.125$
$d_A(C_1, C_1) = (0.22 + 0.11 + 0.125)/3 \approx 0.15$

Between C1 and C4:
- 10 sources (S1-S10) mention (N1, N2, N3, N4) (P1, P4)
- 1 source (S10) supports (N1, N2, N3)-P4
- No connection between (N4, P1)
$d_A^{1,2}(C_1, C_4) = 1 - \max(1/10, 0/10) = 0.9$
$d_A^{1,3}(C_1, C_4) = 0.9$
$d_A^{2,3}(C_1, C_4) = 1$
$d_A(C_1, C_4) = (0.9 + 0.9 + 1)/3 = 0.93$
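The arithmetic on this slide can be checked with a short Python sketch. The function names are hypothetical; the source sets are the ones quoted on the slide for the name-phone pair.

```python
def intra_association_distance(supporting, mentioning):
    """1 - (sources supporting the association) / (sources mentioning the values)."""
    return 1.0 - len(supporting) / len(mentioning)

def inter_association_distance(support_ab, support_ba, mentioning):
    """1 - the stronger of the two cross-cluster supports, as a fraction of mentioning sources."""
    return 1.0 - max(len(support_ab), len(support_ba)) / len(mentioning)

# Name-phone association within C1
mention_c1 = {1, 2, 3, 4, 5, 6, 7, 8, 10}   # S1-S8, S10 mention N1, N2, N3 or P1
support_c1 = {1, 2, 3, 4, 5, 7, 8}          # S1-S5, S7, S8 support (N1,N2,N3)-P1
d12_c1 = intra_association_distance(support_c1, mention_c1)

# Name-phone association between C1 and C4
mention_c1_c4 = set(range(1, 11))           # S1-S10
names_c1_phone_c4 = {10}                    # S10 supports (N1,N2,N3)-P4
name_c4_phone_c1 = set()                    # no edge between N4 and P1
d12_c1_c4 = inter_association_distance(names_c1_phone_c4, name_c4_phone_c1, mention_c1_c4)

# Overall distance between C1 and C4: average over the three attribute pairs
d_c1_c4 = (d12_c1_c4 + 0.9 + 1.0) / 3
```

Rounded to two decimals, these give 0.22, 0.9, and 0.93, the values shown on the slide.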
Greedy Algorithm
Obtaining the optimal clustering is intractable
- [T.F. Gonzalez, 82], [J. Sima et al., 06]
Hill-climbing approximation: CLUSTER
- Step 1: Initialization. Cluster value representations by their similarity, then use majority voting to associate clusters
- Step 2: Adjustment. Move each node to the cluster that minimizes the DB index
- Step 3: Convergence checking. Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2
The algorithm converges
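The adjustment and convergence steps can be sketched as a generic hill-climbing loop in Python. This is not the CLUSTER implementation: the objective here is a toy within-cluster sum of squares standing in for the DB index, and all names and data are hypothetical.

```python
def hill_climb(initial, clusters, objective):
    """Move one node at a time to the cluster that lowers the objective.

    Only strictly improving moves are accepted, so the loop terminates:
    the objective decreases with every move and there are finitely many
    clusterings.
    """
    assignment = dict(initial)
    best = objective(assignment)
    improved = True
    while improved:                      # Step 3: repeat until nothing changes
        improved = False
        for node in list(assignment):    # Step 2: adjust each node
            current = assignment[node]
            for candidate in clusters:
                if candidate == current:
                    continue
                assignment[node] = candidate
                score = objective(assignment)
                if score < best - 1e-12:
                    best, current, improved = score, candidate, True
            assignment[node] = current   # keep the best cluster found
    return assignment, best

# Toy stand-in objective: within-cluster sum of squared deviations
points = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}

def within_cluster_ss(assignment):
    groups = {}
    for node, cluster in assignment.items():
        groups.setdefault(cluster, []).append(points[node])
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((x - mean) ** 2 for x in values)
    return total

start = {"a": 0, "b": 1, "c": 0, "d": 1}   # Step 1 stand-in: a poor initial clustering
final, score = hill_climb(start, [0, 1], within_cluster_ss)
```

As with any hill-climbing method, the result is a local optimum and depends on the initial clustering, which is why the initialization step matters.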
[Figure: CLUSTER applied to the example graph (N1-N4, P1-P4, A1-A3) with clusters C1-C4; DB index values shown: Φ=0.94, Φ=1.16, Φ=0.93, Φ=0.89, Φ=0.71, Φ=0.45]
Matching w.r.t. Soft Constraints
Next? Matching problem
How to match?
[Figure: GRAPH TRANSFORM. The value nodes (N1-N4, P1-P4, A1-A3) are collapsed into cluster nodes NC1, NC4, PC1-PC4, AC1, AC4. Each edge between cluster nodes carries the number of supporting sources: 7 s(1-5,7,8); 1 s(6); 5 s(1-5); 1 s(10); 9 s(1-9); 9 s(1-9); 1 s(10); 8 s(1-8)]
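A minimal Python sketch of the graph transform, assuming edges are stored as a mapping from value-node pairs to sets of source ids. The function name, the node-to-cluster mapping, and the per-edge source sets are illustrative; only the resulting weights (7, 1, 9) correspond to numbers on the slides.

```python
from collections import defaultdict

def transform(edges, cluster_of):
    """Collapse value nodes into cluster nodes.

    edges: {(value_node_u, value_node_v): set_of_source_ids}
    cluster_of: {value_node: cluster_node}
    Returns {(cluster_u, cluster_v): number of distinct supporting sources}.
    """
    merged = defaultdict(set)
    for (u, v), sources in edges.items():
        merged[(cluster_of[u], cluster_of[v])] |= sources
    return {pair: len(sources) for pair, sources in merged.items()}

# Illustrative name-phone edges
edges = {
    ("N1", "P1"): {1, 2},
    ("N2", "P1"): {2, 3, 4, 5},
    ("N3", "P1"): {7, 8},
    ("N3", "P4"): {10},
    ("N4", "P4"): set(range(1, 10)),
}
cluster_of = {
    "N1": "NC1", "N2": "NC1", "N3": "NC1", "N4": "NC4",
    "P1": "PC1", "P4": "PC4",
}
weights = transform(edges, cluster_of)
```

A source that supports several value-level edges between the same two clusters is counted once, because the source sets are merged before counting.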
Matching w.r.t. Soft Constraint
Intuitions
- Largest sum of weights
- Smallest gap
- How to balance these two goals?
Optimization problem
- Maximize
- Subject to
Two-phase greedy algorithm: MATCH
[Figure: node N with candidate phones P1, P2, P3 and edge weights 1 (s1), 9 (s2-s10), 10 (s1-s10). Solution 1: Gap(N) = 1. Solution 2: Gap(N) = 9. Solution 3: Gap(N) = 0]
Experiment Settings
Dataset I
- Business listings for two zip codes (07035, Lincoln Park, NJ; 07715, Belmar, NJ) from multiple sources
[Table: Zip | Business | Source: #Sources, #Srcs/business]
[Table: Zip | Records: #Recs, #Names, #Phones, #Addresses, #(Err Ps)]
[Table: Zip | Constraint Violation: N->P, P->N, N->A, A->N. Surviving cells, row 1: %(2.6), .8%(2.7), 2%(2.3), 12.6%(5.1); row 2: %(2), 1%(3), 4%(2), 4%(8.5)]
Experiment Settings
Implementation
- MATCH (invoking CLUSTER first)
- LINK: record linkage only
- FUSE: data fusion only
- LINKFUSE: first LINK, then FUSE
Gold standard: obtained by manual checking
Measures: Precision / Recall / F-measure
- Matching of values of different attributes: computed from the matched pairs in the gold standard and the matched pairs in our results
- Clustering of values of the same attribute: computed from the clustered pairs in the gold standard and the clustered pairs in our results
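The pair-based measures can be sketched in Python as follows. The function name and the sample pairs are hypothetical; the formulas are the usual definitions of precision, recall, and F-measure over sets of pairs.

```python
def precision_recall_f(gold_pairs, result_pairs):
    """Pair-based precision, recall, and F-measure.

    gold_pairs: pairs in the gold standard
    result_pairs: pairs produced by the method under evaluation
    """
    gold, result = set(gold_pairs), set(result_pairs)
    correct = len(gold & result)
    precision = correct / len(result) if result else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical name-phone matching pairs
gold = {("N1", "P1"), ("N2", "P1"), ("N4", "P4")}
ours = {("N1", "P1"), ("N4", "P4"), ("N4", "P1")}
p, r, f = precision_recall_f(gold, ours)
```

The same function applies to clustering: use pairs of values of the same attribute that fall in the same cluster instead of matched pairs.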
Accuracy
[Figure: panels for zip codes 07035 and 07715: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME)]
- MATCH achieves the highest F-measure in most cases
- MATCH improves over LINK by 11% on name-phone matching and by 20% on name clustering
- LINK vs. FUSE vs. LINKFUSE
  - LINK: high recall in matching
  - FUSE: high precision in matching, high precision in name clustering
  - LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering
Efficiency and Scalability
Data set II
- Entire listing: 40+M records
Hadoop-based linkage framework
- Fuzzy self-join using Hadoop
- Partition records into strongly connected components
Efficiency
- Linear growth
- Execution time

Module | Execution time (hour)
Record extraction | 0.002
Fuzzy self join | 0.89
Connected component | 0.89
Linkage | 1.36
Overall | 3.26

[Figure legend: median, 95th percentile, 99th percentile, max]
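The partitioning step can be illustrated on a single machine with a union-find structure in Python. This is a sketch of the idea only, not the Hadoop implementation. It assumes the fuzzy self-join outputs undirected pairs of similar record ids, in which case the strongly connected components are simply the connected components. All names are hypothetical.

```python
def connected_components(records, similar_pairs):
    """Group record ids so that records linked by a chain of similar pairs share a group."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical output of a fuzzy self-join over five records
records = ["r1", "r2", "r3", "r4", "r5"]
similar_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
components = connected_components(records, similar_pairs)
```

Each component can then be linked independently of the others, which is what makes the partitioning useful for scaling.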
Conclusions
In the real world, we need to resolve duplicates and conflicts at the same time.
We reduce the problem to a k-partite graph clustering and matching problem
- Combine linkage and fusion
- Apply them in a global fashion
Experiments show high accuracy and scalability
Thank You!