Linking Records with Erroneous Values Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac AT&T Labs 1.


1 Linking Records with Erroneous Values
Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac
AT&T Labs

2 Motivation

Src | Name | Phone | Address | City
V | A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
V | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
V | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
V | Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA | GLENDALE
V | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
V | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN
V | Pizza Palace Of Newtown | 2034266114 | 65 Church hill Rd | NEWTOWN

[Diagram labels: sources (s), Cleaned Data, Search Box]

Src | Name | Phone | Address | City
D | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
D | Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA | GLENDALE
D | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
D | Pizza Palace of Newtown | 2034266114 | Church Hill Rd | Newtown

Src | Name | Phone | Address | City
A | A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE
A | A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
A | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
A | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
A | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
A | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
A | Alert Cellular | 8182404779 | 2148 GLENDALE GALLERIA | GLENDALE

Src | Name | Phone | Address | City
T | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
T | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
T | American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA | GLENDALE
T | ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA | GLENDALE
T | Ann Taylor Stores | 8182460350 | 1108 GLENDALE GALLERIA | GLENDALE

3 Motivation
Which type of listing are they?
A: the same business
B: different businesses sharing the same phone number
C: different businesses, only one of which is correctly associated with the given phone number

4 Current Solution
Uniqueness constraint
– Each real-world entity has a unique value for the attribute, e.g., phone, address
The data may not satisfy the constraint
– Erroneous values
– Small number of exceptions
Current two-step solution
– Step 1: Record linkage: link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE07], [W. Winkler, Tech Report06]
– Step 2: Data fusion: decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]
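As a rough illustration of the data fusion step, the sketch below resolves a conflict by majority voting over the values reported for one already-linked entity. The vote counts are read off the "Microsoft Corp." rows of the table on the next slide; the `fuse` helper is hypothetical, and majority voting is only one simple fusion rule, not necessarily the one used in the cited work.

```python
from collections import Counter

def fuse(values):
    """Return the most frequently reported value (simple majority vote)."""
    return Counter(values).most_common(1)[0][0]

# Phones reported with the name "Microsoft Corp.":
# xxx-1255 by s2-s5, xxx-9400 by s3-s5, xxx-2255 by s6.
phones = ["xxx-1255"] * 4 + ["xxx-9400"] * 3 + ["xxx-2255"]
winner = fuse(phones)
print(winner)
```

Because the vote only sees records already linked under one name spelling, it ignores evidence attached to other spellings of the same business, which is the kind of local decision the following slides argue against.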

5 Limitations of Current Solution

SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way
s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W.
s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way
s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s7 | MS Corp. | xxx-1255 | 1 Microsoft Way
s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s8 | MS Corp. | xxx-1255 | 1 Microsoft Way
s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s10 | MS Corp. | xxx-0500 | 2 Sylvan Way

Limitations
– Locally resolving conflicts for linked records may overlook important global evidence
– Erroneous values may prevent correct matching
– Traditional techniques may fall short when exceptions to the uniqueness constraints exist

Real-world entities in this example
– (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
– (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)

6 Our Solution
Perform linkage and fusion simultaneously
– Incorrect values can be identified from the beginning, which improves linkage
Make global decisions
– Consider all sources that associate a pair of values in the same record, which improves fusion
Allow a small number of violations, to capture possible exceptions in the real world

7 Road Map
– Motivation and overview
– Problem definition
– Solution
– Evaluations on YP data
– Conclusions

8 Problem Input
A set of independent data sources, each providing a set of records
A set of (soft) uniqueness constraints
– Uniqueness constraint (hard constraint): Business Name, Business Phone, Business Address
– Soft uniqueness constraint (soft constraint): Business Phone
[Figure labels: 1-p1, 1-p2]

9 Problem Output
Real-world entities
For each (soft) uniqueness attribute of each entity
– True value (if any)
– Various representations of each true value
Example
– (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
– (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)

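10 K-Partite Graph Encoding
[Figure: k-partite graph with one node per distinct value and one edge per pair of values that appear in the same record, labeled with the supporting sources]
Name nodes: N1 Microsofe Corp., N2 Microsoft Corp., N3 MS Corp., N4 Macrosoft Inc.
Phone nodes: P1 xxx-1255, P2 xxx-2255, P3 xxx-9400, P4 xxx-0500
Address nodes: A1 1 Microsoft Way, A2 2 Sylvan Way, A3 2 Sylvan W.
Edge labels shown: s(1), s(1-2), s(1-5,7,8), s(2-5), s(2-6), s(6), s(7-8), s(1-2), s(1-5), s(3-5), s(10), s(2-10), s(1-9), s(2-9), s(1)
Example record: s1: Microsofe Corp., xxx-1255, 1 Microsoft Way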
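The encoding can be sketched as follows: every distinct value becomes a node tagged with its attribute, and every pair of values that co-occur in a record becomes an edge carrying the set of sources that report it. This is a minimal sketch; the `build_graph` helper and the tuple layout are assumptions, and only a few of the example records are included.

```python
from collections import defaultdict

# A subset of the example records: (source, name, phone, address).
RECORDS = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s1", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s10", "MS Corp.", "xxx-0500", "2 Sylvan Way"),
]

def build_graph(records):
    """Map each cross-attribute value pair to the set of sources supporting it."""
    edges = defaultdict(set)
    for src, name, phone, addr in records:
        values = [("N", name), ("P", phone), ("A", addr)]
        # Connect every pair of values from different attributes of one record.
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                edges[(values[i], values[j])].add(src)
    return edges

graph = build_graph(RECORDS)
edge = (("N", "Microsofe Corp."), ("P", "xxx-9400"))
print(sorted(graph[edge]))
```

For the edge between "Microsofe Corp." and "xxx-9400" this yields sources s1 and s2, matching the s(1-2) label in the figure.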

11 Solution Encoding
[Figure: the same graph over nodes N1-N4, P1-P4, A1-A3]
Clustering problem & Matching problem

12 Solution Encoding with Hard Constraint
[Figure: the same graph over nodes N1-N4, P1-P4, A1-A3, partitioned into clusters C1, C2, C3, C4]
Clustering problem

13 Road Map
– Motivation and overview
– Problem definition
– Solution
  – Clustering w.r.t. hard constraint
  – Matching w.r.t. soft constraint
– Evaluations on YP data
– Conclusions

14 Clustering w.r.t. Hard Constraints
[Figure: clusters C1 and C4 over nodes N1-N4, P1, P4, A1-A3]
Ideal clustering:
– high cohesion within each cluster
– low correlation between different clusters
Objective function
– Davies-Bouldin index (minimization)
– Distance is the average of
  – similarity distance
  – association distance
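For reference, the standard Davies-Bouldin index over clusters C_1, ..., C_k can be written as below. Reading the slide's "average distance" as the mean of the similarity distance d_S and the association distance d_A is an assumption, as is the use of the symbol Φ for the index value.

```latex
\Phi = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
       \frac{d(C_i, C_i) + d(C_j, C_j)}{d(C_i, C_j)},
\qquad
d(C_i, C_j) = \frac{d_S(C_i, C_j) + d_A(C_i, C_j)}{2}
```

A small intra-cluster distance d(C_i, C_i) (high cohesion) and a large inter-cluster distance d(C_i, C_j) (low correlation) both push Φ down, which is why the index is minimized.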

15 Similarity Distance
Similarity of values, defined for each attribute
[Figure: clusters C1 and C4 with pairwise value similarities on the edges: 0.95, 0.65, 0.4, 0.7, 0.9, and 0 for the remaining pairs]
Within C1:
$d_S^{1}(C_1, C_1) = 1 - (0.95 + 0.65 + 0.65)/3 = 0.25$ (name)
$d_S^{2}(C_1, C_1) = 0$ (phone)
$d_S^{3}(C_1, C_1) = 0$ (address)
$d_S(C_1, C_1) = (0.25 + 0 + 0)/3 = 0.083$
Between C1 and C4:
$d_S^{1}(C_1, C_4) = 1 - (0.7 + 0.7 + 0.4)/3 = 0.4$
$d_S^{2}(C_1, C_4) = 1 - 0 = 1$
$d_S^{3}(C_1, C_4) = 1 - 0 = 1$
$d_S(C_1, C_4) = (0.4 + 1 + 1)/3 = 0.8$
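The arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch under the assumption that each per-attribute distance is one minus the mean pairwise similarity, and that an attribute with a single value in the cluster contributes distance 0; the function names are illustrative.

```python
def attr_distance(sims):
    """1 minus the mean pairwise similarity; 0.0 when there are no pairs."""
    if not sims:
        return 0.0
    return 1.0 - sum(sims) / len(sims)

def similarity_distance(per_attribute_sims):
    """Average the per-attribute distances (name, phone, address)."""
    dists = [attr_distance(s) for s in per_attribute_sims]
    return sum(dists) / len(dists)

# Within C1: three name pairs, a single phone, a single address.
intra = similarity_distance([[0.95, 0.65, 0.65], [], []])
# Between C1 and C4: three name pairs, dissimilar phones and addresses.
inter = similarity_distance([[0.7, 0.7, 0.4], [0.0], [0.0]])
print(round(intra, 3), round(inter, 3))
```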

16 Association Distance
Association by edges, defined for each pair of attributes
[Figure: clusters C1 and C4 with source labels on the edges]
Within C1:
$d_A^{1,2}(C_1, C_1) = 1 - 7/9 = 0.22$
– 9 sources (S1-S8, S10) mention (N1, N2, N3, P1)
– 7 sources (S1-S5, S7, S8) support (N1, N2, N3)-P1
$d_A^{1,3}(C_1, C_1) = 1 - 8/9 = 0.11$
$d_A^{2,3}(C_1, C_1) = 1 - 7/8 = 0.125$
$d_A(C_1, C_1) = (0.22 + 0.11 + 0.125)/3 = 0.153$
Between C1 and C4:
$d_A^{1,2}(C_1, C_4) = 1 - \max(1/10, 0/10) = 0.9$
– 10 sources (S1-S10) mention (N1, N2, N3, N4) and (P1, P4)
– 1 source (S10) supports (N1, N2, N3)-P4
– No connection between (N4, P1)
$d_A^{1,3}(C_1, C_4) = 0.9$
$d_A^{2,3}(C_1, C_4) = 1$
$d_A(C_1, C_4) = (0.9 + 0.9 + 1)/3 = 0.93$
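The association distances above follow the same pattern and can be checked numerically. The sketch assumes the distance is one minus the fraction of mentioning sources that support the association, taking the stronger direction between two clusters; the function names are illustrative.

```python
def intra_association(supporting, mentioning):
    """1 minus the share of mentioning sources that support the association."""
    return 1.0 - supporting / mentioning

def inter_association(support_ab, support_ba, mentioning):
    """Use the stronger of the two cross-cluster directions."""
    return 1.0 - max(support_ab / mentioning, support_ba / mentioning)

d12 = intra_association(7, 9)   # name-phone within C1
d13 = intra_association(8, 9)   # name-address within C1
d23 = intra_association(7, 8)   # phone-address within C1
d_intra = (d12 + d13 + d23) / 3

# Name-phone between C1 and C4: 1 of 10 sources one way, 0 of 10 the other.
d12_inter = inter_association(1, 0, 10)
d_inter = (d12_inter + 0.9 + 1.0) / 3
print(round(d_intra, 3), round(d_inter, 2))
```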

17 Greedy Algorithm
Obtaining the optimal clustering is intractable
– [T.F. Gonzales, 82], [J. Simal et al., 06]
Hill-climbing approximation: CLUSTER
– Step 1: Initialization. Cluster value representations by their similarity, then use majority voting to associate clusters
– Step 2: Adjustment. Move each node to the cluster that minimizes the DB index
– Step 3: Convergence checking. Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2
The algorithm converges
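The adjust-until-stable loop of Steps 2 and 3 might look like the sketch below. It is not the CLUSTER implementation: the objective is pluggable, and the demo swaps in a simple stand-in (total squared deviation from the cluster mean on 1-D points) in place of the DB index, with a hand-picked starting assignment in place of Step 1. All names are hypothetical.

```python
def hill_climb(nodes, assignment, cluster_ids, objective):
    """Move one node at a time to the cluster that lowers the objective most."""
    assignment = dict(assignment)
    changed = True
    while changed:                       # Step 3: stop when a pass changes nothing
        changed = False
        for node in nodes:               # Step 2: adjustment
            best_c, best_val = assignment[node], objective(assignment)
            for c in cluster_ids:
                val = objective({**assignment, node: c})
                if val < best_val - 1e-12:   # accept strict improvements only
                    best_c, best_val = c, val
            if best_c != assignment[node]:
                assignment[node] = best_c
                changed = True
    return assignment

def spread(assignment):
    """Stand-in objective: total squared deviation from each cluster mean."""
    clusters = {}
    for value, c in assignment.items():
        clusters.setdefault(c, []).append(value)
    total = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        total += sum((m - mean) ** 2 for m in members)
    return total

points = [1.0, 1.2, 0.9, 5.0, 5.1]
start = {1.0: 0, 1.2: 0, 0.9: 1, 5.0: 1, 5.1: 1}   # deliberately poor start
result = hill_climb(points, start, [0, 1], spread)
print(result)
```

Because a move is accepted only when it strictly lowers the objective and there are finitely many assignments, the loop must terminate, which mirrors the convergence claim on the slide.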

18 [Figure: the graph over nodes N1-N4, P1-P4, A1-A3 with clusters C1, C2, C3, C4; Φ values shown: 0.94, 1.16, 0.93, 0.89, 0.71, 0.45]

19 Road Map
– Motivation and overview
– Problem definition
– Solution
  – Clustering w.r.t. hard constraint
  – Matching w.r.t. soft constraint
– Evaluations on YP data
– Conclusions

20 Matching w.r.t. Soft Constraints
Next? Matching problem
How to match?
[Figure: GRAPH TRANSFORM. The value-level graph (N1-N4, P1-P4, A1-A3) is collapsed into a cluster-level graph with nodes NC1, NC4, PC1, PC2, PC3, PC4, AC1, AC4]
Edge weights in the transformed graph (number of supporting sources):
– NC1-PC1: 7, s(1-5,7,8)
– NC1-PC2: 1, s(6)
– NC1-PC3: 5, s(1-5)
– NC1-PC4: 1, s(10)
– NC4-PC4: 9, s(1-9)
– NC4-AC4: 9, s(1-9)
– NC1-AC4: 1, s(10)
– NC1-AC1: 8, s(1-8)
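The graph transform can be sketched as merging value-level edges whose endpoints fall in the same pair of clusters and weighting each merged edge by its number of distinct supporting sources. Counting distinct sources (a set union) is an assumption that happens to reproduce the weights on the slide; the data below covers only the name-phone edges touching NC1, and the helper name is hypothetical.

```python
from collections import defaultdict

VALUE_EDGES = {
    ("Microsofe Corp.", "xxx-1255"): {"s1"},
    ("Microsoft Corp.", "xxx-1255"): {"s2", "s3", "s4", "s5"},
    ("MS Corp.", "xxx-1255"): {"s7", "s8"},
    ("Microsoft Corp.", "xxx-2255"): {"s6"},
    ("MS Corp.", "xxx-0500"): {"s10"},
}
CLUSTER_OF = {
    "Microsofe Corp.": "NC1", "Microsoft Corp.": "NC1", "MS Corp.": "NC1",
    "xxx-1255": "PC1", "xxx-2255": "PC2", "xxx-0500": "PC4",
}

def transform(value_edges, cluster_of):
    """Collapse value-level edges into weighted cluster-level edges."""
    merged = defaultdict(set)
    for (u, v), sources in value_edges.items():
        merged[(cluster_of[u], cluster_of[v])] |= sources
    return {edge: len(sources) for edge, sources in merged.items()}

weights = transform(VALUE_EDGES, CLUSTER_OF)
print(weights)
```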

21 Matching w.r.t. Soft Constraint
Intuitions
– Largest sum of weights
– Smallest gap
– How to balance these two goals?
Optimization problem
– Maximize
– Subject to
Two-phase greedy algorithm: MATCH
[Figure: node N connected to P1, P2, P3 by edges with weights 1 (s1), 9 (s2-s10), 10 (s1-s10); three candidate solutions]
– Solution 1: Gap(N) = 1
– Solution 2: Gap(N) = 9
– Solution 3: Gap(N) = 0

22 Road Map
– Motivation and overview
– Problem definition
– Solution
– Evaluations on YP data
– Conclusions

23 Experiment Settings
Dataset I
– Business listings for two zip codes (07035 Lincoln Park, NJ; 07715 Belmar, NJ) from multiple sources

Zip | #Business | #Sources | #Srcs/business
07035 | 662 | 15 | 1-7
07715 | 149 | 6 | 1-3

Zip | #Recs | #Names | #Phones | #Addresses | #(Err Ps)
07035 | 1629 | 1154 | 839 | 735 | 72
07715 | 266 | 243 | 184 | 55 | 12

Constraint violation
Zip | N→P | P→N | N→A | A→N
07035 | 8% (2.6) | .8% (2.7) | 2% (2.3) | 12.6% (5.1)
07715 | 4% (2) | 1% (3) | 4% (2) | 4% (8.5)

24 Experiment Settings
Implementation
– MATCH (invoking CLUSTER first)
– LINK: record linkage only
– FUSE: data fusion only
– LINKFUSE: first LINK, then FUSE
Gold standard: obtained by manual checking
Measures: Precision/Recall/F-measure
– Matching of values of different attributes: Precision, Recall, F-measure
– Clustering of values of the same attribute: Precision, Recall, F-measure
Notation descriptions
– Matched pairs for the gold standard
– Matched pairs for our results
– Clustered pairs for the gold standard
– Clustered pairs for our results
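Pair-based precision, recall, and F-measure can be computed as in the sketch below, which uses the standard definitions over sets of pairs. The example pairs are hypothetical and are not taken from the experiments.

```python
def prf(gold_pairs, result_pairs):
    """Precision, recall, and F-measure of result pairs against gold pairs."""
    gold, result = set(gold_pairs), set(result_pairs)
    hits = len(gold & result)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical matched (name cluster, phone cluster) pairs.
gold = {("NC1", "PC1"), ("NC1", "PC3"), ("NC4", "PC4")}
ours = {("NC1", "PC1"), ("NC4", "PC4"), ("NC1", "PC4")}
precision, recall, f_measure = prf(gold, ours)
print(precision, recall, f_measure)
```

The same function applies to clustering quality when the pairs are pairs of values placed in the same cluster.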

25 Accuracy
[Figure panels: 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME), 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME)]
MATCH achieves the highest F-measure in most cases
– Improves on LINK by 11% on name-phone matching and by 20% on name clustering
LINK vs. FUSE vs. LINKFUSE
– LINK: high recall in matching
– FUSE: high precision in matching, high precision in name clustering
– LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering

26 Efficiency and Scalability
Data set II
– Entire listing: 40+M records
Hadoop-based linkage framework
– Fuzzy self-join using Hadoop
– Partition records into strongly connected components
Efficiency
– Linear growth
– Execution time

Module | Execution time (hour)
Record extraction | 0.002
Fuzzy self join | 0.89
Connected component | 0.89
Linkage | 1.36
Overall | 3.26

median 95th percentile 99th percentile max 2572103
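The partitioning step can be illustrated on a single machine with a union-find structure: records joined by the fuzzy self-join end up in the same component, and each component can then be linked independently. This is a stand-in sketch only; it does not use Hadoop, and the record indices and similar pairs are hypothetical.

```python
def connected_components(num_records, similar_pairs):
    """Group record indices into components connected by similar pairs."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in range(num_records):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())

components = connected_components(6, [(0, 1), (1, 2), (4, 5)])
print(components)
```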

27 Conclusions
In the real world, we need to resolve duplicates and conflicts at the same time.
We reduce the problem to a k-partite graph clustering and matching problem
– Combine linkage and fusion
– Apply them in a global fashion
Experiments show high accuracy and scalability

28 Thank You!

