Linking Records with Erroneous Values
Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac (AT&T Labs)

Presentation transcript:


Motivation
[Figure: several sources (s) feed Cleaned Data, which feeds the Search Box.]
Sample business listings from four sources (columns: Src, Name, Phone, Address, City; the Phone values and street numbers are missing from the transcript):

Src  Name                       Address            City
V    A-Link Wireless            GLENDALE GALLERIA  GLENDALE
V    Abercrombie                GLENDALE GALLERIA  GLENDALE
V    Abercrombie & Fitch        GLENDALE GALLERIA  GLENDALE
V    Aeropostale                GLENDALE GALLERIA  GLENDALE
V    Aerosoles                  GLENDALE GALLERIA  GLENDALE
V    Newtown Pizza Palace       Church hill Rd     NEWTOWN
V    Pizza Palace Of Newtown    Church hill Rd     NEWTOWN

D    Aerosoles                  GLENDALE GALLERIA  GLENDALE
D    Aldo Shoes                 GLENDALE GALLERIA  GLENDALE
D    Newtown Pizza Palace       Church hill Rd     Newtown
D    Pizza Palace of Newtown    Church Hill Rd     Newtown

A    A 24 Hour 1 A 1 Locksmith  GLENDALE GALLERIA  GLENDALE
A    A Link Wireless            GLENDALE GALLERIA  GLENDALE
A    Abercrombie                GLENDALE GALLERIA  GLENDALE
A    Abercrombie & Fitch        GLENDALE GALLERIA  GLENDALE
A    Newtown Pizza Palace       Church hill Rd     Newtown
A    Aldo Shoes                 GLENDALE GALLERIA  GLENDALE
A    Alert Cellular             GLENDALE GALLERIA  GLENDALE

T    Newtown Pizza Palace       Church hill Rd     Newtown
T    Aldo Shoes                 GLENDALE GALLERIA  GLENDALE
T    American Eagle Outfitters  GLENDALE GALLERIA  GLENDALE
T    ANN TAYLOR                 GLENDALE GALLERIA  GLENDALE
T    Ann Taylor Stores          GLENDALE GALLERIA  GLENDALE

Motivation
Which type of listings are they?
  A: the same business
  B: different businesses sharing the same phone number
  C: different businesses, only one of which is correctly associated with the given phone number

Current Solution
Uniqueness constraint
  – Each real-world entity has a unique value for the attribute, e.g., phone or address
The data may not satisfy the constraint
  – Erroneous values
  – Small number of exceptions
Current two-step solution
  – Step 1: Record linkage. Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE07], [W. Winkler, Tech Report06]
  – Step 2: Data fusion. Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]

Limitations of Current Solution

SOURCE  NAME             PHONE  ADDRESS
s1      Microsofe Corp.  xxx    Microsoft Way
        Microsofe Corp.  xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan W.
s2      Microsoft Corp.  xxx    Microsoft Way
        Microsofe Corp.  xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s3      Microsoft Corp.  xxx    Microsoft Way
        Microsoft Corp.  xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s4      Microsoft Corp.  xxx    Microsoft Way
        Microsoft Corp.  xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s5      Microsoft Corp.  xxx    Microsoft Way
        Microsoft Corp.  xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s6      Microsoft Corp.  xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s7      MS Corp.         xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s8      MS Corp.         xxx    Microsoft Way
        Macrosoft Inc.   xxx    Sylvan Way
s9      Macrosoft Inc.   xxx    Sylvan Way
s10     MS Corp.         xxx    Sylvan Way

Locally resolving conflicts for linked records may overlook important global evidence
Erroneous values may prevent correct matching
Traditional techniques may fall short when exceptions to the uniqueness constraints exist

Entities shown on the slide:
  (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
  (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)

Our Solution
Perform linkage and fusion simultaneously
  – Incorrect values can be identified from the beginning, which improves linkage
Make global decisions
  – Consider all sources that associate a pair of values in the same record, which improves fusion
Allow a small number of violations, to capture possible exceptions in the real world

Road Map
Motivation and overview
Problem definition
Solution
Evaluations on YP data
Conclusions

Problem Input
A set of independent data sources, each providing a set of records
A set of (soft) uniqueness constraints
  – Uniqueness constraint (hard constraint): Business Name, Business Phone, Business Address
  – Soft uniqueness constraint (soft constraint): Business Phone
[Figure labels on the soft constraint: 1-p1, 1-p2]

Problem Output
Real-world entities
For each (soft) uniqueness attribute of each entity
  – True value (if any)
  – Various representations of each true value
Example:
  (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
  (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)

K-Partite Graph Encoding
[Figure: k-partite graph built from the example records, with one node per distinct attribute value and each edge labeled with the sources that provide the two values in the same record, e.g. s(1), s(1-5,7,8), s(2-9).
  Name nodes: N1 Microsofe Corp., N2 Microsoft Corp., N3 MS Corp., N4 Macrosoft Inc.
  Phone nodes: P1 xxx-1255, P2 xxx-2255, P3 xxx-9400, P4 xxx-0500
  Address nodes: A1 1 Microsoft Way, A2 2 Sylvan Way, A3 2 Sylvan W.]
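
The encoding on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `build_kpartite_graph`, the record layout, and the three sample rows are made up for the example (the rows reuse values from the slide but are not the slide's table).

```python
from collections import defaultdict
from itertools import combinations


def build_kpartite_graph(records):
    """Encode records as a k-partite graph.

    records: iterable of (source, {attribute: value}) pairs (assumed layout).
    Returns {((attr_a, value_a), (attr_b, value_b)): set of sources}, where
    an edge connects two values that appear together in some record and its
    label is the set of sources providing that association.
    """
    edges = defaultdict(set)
    for source, values in records:
        # One node per (attribute, value); sort so edge keys are canonical.
        nodes = sorted(values.items())
        for node_a, node_b in combinations(nodes, 2):
            edges[(node_a, node_b)].add(source)
    return dict(edges)


# Hypothetical rows, for illustration only.
records = [
    ("s1", {"name": "Microsofe Corp.", "phone": "xxx-1255"}),
    ("s2", {"name": "Microsoft Corp.", "phone": "xxx-1255"}),
    ("s3", {"name": "Microsoft Corp.", "phone": "xxx-1255"}),
]
graph = build_kpartite_graph(records)
```

Edges only ever join values of different attributes, which is what makes the graph k-partite for k attributes.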

Solution Encoding
[Figure: the same graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3) with the solution marked on it.]
Clustering problem & matching problem

Solution Encoding with Hard Constraint
[Figure: the same graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3) partitioned into clusters C1, C2, C3, C4.]
Clustering problem

Road Map
Motivation and overview
Problem definition
Solution
  – Clustering w.r.t. hard constraint
  – Matching w.r.t. soft constraint
Evaluations on YP data
Conclusions

Clustering w.r.t. Hard Constraints
[Figure: example graph with clusters C1 (N1, N2, N3, P1, A1) and C4 (N4, P4, A2, A3).]
Ideal clustering
  – High cohesion within each cluster
  – Low correlation between different clusters
Objective function
  – Davies-Bouldin index (to be minimized)
Distance: the average of
  – similarity distance
  – association distance
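
As a rough illustration of the objective, the sketch below computes a Davies-Bouldin-style index from a matrix of cluster distances. The slide only names the index, so the exact form used here (the textbook ratio of summed intra-cluster distances to inter-cluster distance, averaged over clusters) is an assumption, as are the function names `db_index` and `combined_distance`.

```python
def combined_distance(d_similarity, d_association):
    """Average of similarity distance and association distance (per the slide)."""
    return (d_similarity + d_association) / 2


def db_index(dist):
    """Davies-Bouldin-style index (assumed textbook form).

    dist[i][j] is the distance between clusters i and j, and dist[i][i] is
    the intra-cluster distance of cluster i. Lower values mean tighter,
    better separated clusters.
    """
    n = len(dist)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            if j == i:
                continue
            if dist[i][j] == 0:
                ratio = float("inf")  # two clusters that are not separated at all
            else:
                ratio = (dist[i][i] + dist[j][j]) / dist[i][j]
            worst = max(worst, ratio)
        total += worst
    return total / n


# Made-up distances for two clusters, for illustration only.
phi = db_index([[0.2, 0.8], [0.8, 0.1]])
```

High cohesion shrinks the numerators and low correlation between clusters grows the denominators, so both goals on the slide push the index down.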

Similarity Distance
Similarity of values; defined for each attribute
[Figure: example graph with clusters C1 (N1, N2, N3, P1, A1) and C4 (N4, P4, A2, A3).]
Within C1:
  $d_S^{1}(C_1, C_1) = 1 - (\ldots)/3 = 0.25$ (name)
  $d_S^{2}(C_1, C_1) = 0$ (phone)
  $d_S^{3}(C_1, C_1) = 0$ (address)
  $d_S(C_1, C_1) = (0.25 + 0 + 0)/3 \approx 0.083$
Between C1 and C4:
  $d_S^{1}(C_1, C_4) = 1 - (\ldots)/3 = 0.4$
  $d_S^{2}(C_1, C_4) = 1 - 0 = 1$
  $d_S^{3}(C_1, C_4) = 1 - 0 = 1$
  $d_S(C_1, C_4) = (0.4 + 1 + 1)/3 = 0.8$
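
A minimal sketch of the per-attribute averaging on this slide follows. It assumes that the per-attribute distance is one minus the average pairwise similarity of the values involved, which matches the "/3" in the name rows but is not spelled out in the transcript. The `exact_match` similarity is a stand-in chosen for illustration; the slide's 0.25 and 0.4 imply a graded string similarity that the transcript does not name.

```python
from itertools import combinations, product


def exact_match(a, b):
    """Stand-in similarity (assumption): 1 for identical strings, else 0."""
    return 1.0 if a == b else 0.0


def attribute_distance(values_a, values_b, sim, same_cluster):
    """1 - average pairwise similarity for one attribute."""
    if same_cluster:
        pairs = list(combinations(sorted(values_a), 2))
    else:
        pairs = list(product(values_a, values_b))
    if not pairs:
        # A single value is perfectly cohesive; no values at all to compare
        # across clusters is treated as maximally distant.
        return 0.0 if same_cluster else 1.0
    return 1.0 - sum(sim(x, y) for x, y in pairs) / len(pairs)


def similarity_distance(cluster_a, cluster_b, sim=exact_match):
    """Average the per-attribute distances; clusters are {attribute: set of values}."""
    same = cluster_a is cluster_b
    attrs = sorted(set(cluster_a) | set(cluster_b))
    if not attrs:
        return 0.0
    dists = [
        attribute_distance(cluster_a.get(k, set()), cluster_b.get(k, set()), sim, same)
        for k in attrs
    ]
    return sum(dists) / len(dists)


c1 = {"name": {"Microsoft Corp.", "MS Corp."}, "phone": {"xxx-1255"}}
c4 = {"name": {"Macrosoft Inc."}, "phone": {"xxx-0500"}}
```

Passing a graded similarity function as `sim` (for example an edit-distance-based score) would give fractional name distances like those on the slide.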

Association Distance
Association by edges; defined for each pair of attributes
[Figure: example graph with clusters C1 and C4; edges are labeled with their supporting sources, e.g. s(1), s(2-5), s(7-8), s(1-5,7,8), s(10).]
Within C1:
  $d_A^{1,2}(C_1, C_1) = 1 - 7/9 = 0.22$
    – 9 sources (S1-S8, S10) mention (N1, N2, N3, P1)
    – 7 sources (S1-S5, S7, S8) support (N1, N2, N3)-P1
  $d_A^{1,3}(C_1, C_1) = 1 - 8/9 = 0.11$
  $d_A^{2,3}(C_1, C_1) = 1 - 7/8 = 0.125$
  $d_A(C_1, C_1) = (0.22 + 0.11 + 0.125)/3 \approx 0.15$
Between C1 and C4:
  $d_A^{1,2}(C_1, C_4) = 1 - \max(1/10, 0/10) = 0.9$
    – 10 sources (S1-S10) mention (N1, N2, N3, N4) and (P1, P4)
    – 1 source (S10) supports (N1, N2, N3)-P4
    – No connection between N4 and P1
  $d_A^{1,3}(C_1, C_4) = 0.9$
  $d_A^{2,3}(C_1, C_4) = 1$
  $d_A(C_1, C_4) = (0.9 + 0.9 + 1)/3 = 0.93$
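
The support-over-mention ratio on this slide can be sketched as follows for one pair of attributes. The record layout, the function name `association_distance`, and the toy rows (N1, P9, and so on) are assumptions made for the example; they are not the slide's data.

```python
def association_distance(records, attr_i, attr_j, cluster_a, cluster_b=None):
    """1 - (supporting sources / mentioning sources) for one attribute pair.

    records: list of (source, {attribute: value}) pairs (assumed layout).
    clusters: {attribute: set of values}.
    With one cluster this is the intra-cluster distance; with two, the best
    cross-cluster support in either direction is used, as in the slide's
    max(1/10, 0/10).
    """
    if cluster_b is None:
        cluster_b = cluster_a
    vals_i = cluster_a[attr_i] | cluster_b[attr_i]
    vals_j = cluster_a[attr_j] | cluster_b[attr_j]
    # Sources that mention any of the values involved.
    mention = {s for s, r in records
               if r.get(attr_i) in vals_i or r.get(attr_j) in vals_j}
    if not mention:
        return 1.0

    def support(c_i, c_j):
        # Sources with a record joining an attr_i value of c_i to an attr_j value of c_j.
        return {s for s, r in records
                if r.get(attr_i) in c_i[attr_i] and r.get(attr_j) in c_j[attr_j]}

    if cluster_b is cluster_a:
        best = len(support(cluster_a, cluster_a))
    else:
        best = max(len(support(cluster_a, cluster_b)),
                   len(support(cluster_b, cluster_a)))
    return 1.0 - best / len(mention)


# Toy rows, for illustration only.
toy_records = [
    ("s1", {"name": "N1", "phone": "P1"}),
    ("s2", {"name": "N1", "phone": "P1"}),
    ("s3", {"name": "N2", "phone": "P1"}),
    ("s4", {"name": "N1", "phone": "P9"}),
]
ca = {"name": {"N1", "N2"}, "phone": {"P1"}}
cb = {"name": {"N4"}, "phone": {"P9"}}
```

Here three of the four mentioning sources support the name-phone association inside `ca`, and a single source links `ca`'s names to `cb`'s phone.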

Greedy Algorithm
Obtaining the optimal clustering is intractable
  – [T.F. Gonzales, 82], [J. Simal et al., 06]
Hill-climbing approximation: CLUSTER
  – Step 1: Initialization. Cluster value representations by their similarity; use majority voting to associate clusters
  – Step 2: Adjustment. Move each node to the cluster that minimizes the DB index
  – Step 3: Convergence checking. Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2
The algorithm converges
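
The adjust-and-check loop of Steps 2 and 3 can be sketched generically as below. This is a hedged illustration, not the CLUSTER algorithm itself: the initialization step is skipped, the function names are made up, and the toy objective is a sum of squared errors on 1-D points standing in for the DB index.

```python
def hill_climb(assignment, num_clusters, objective):
    """Greedy adjustment: move each node to the cluster that lowers the objective most.

    assignment: list mapping node index -> cluster id (the initial clustering).
    objective: callable scoring a full assignment; lower is better.
    Repeats full passes until a pass changes nothing.
    """
    assignment = list(assignment)
    current = objective(assignment)
    changed = True
    while changed:
        changed = False
        for node in range(len(assignment)):
            original = assignment[node]
            best_cluster, best_score = original, current
            for cluster in range(num_clusters):
                if cluster == original:
                    continue
                assignment[node] = cluster
                score = objective(assignment)
                if score < best_score:  # accept strict improvements only
                    best_cluster, best_score = cluster, score
            assignment[node] = best_cluster
            if best_cluster != original:
                current = best_score
                changed = True
    return assignment


def sse_objective(points):
    """Toy objective (assumption): within-cluster sum of squared errors."""
    def objective(assignment):
        total = 0.0
        for c in set(assignment):
            members = [p for p, a in zip(points, assignment) if a == c]
            mean = sum(members) / len(members)
            total += sum((p - mean) ** 2 for p in members)
        return total
    return objective


points = [0.0, 0.1, 5.0, 5.1]
result = hill_climb([0, 1, 0, 1], 2, sse_objective(points))
```

Because only strict improvements are accepted and there are finitely many assignments, the loop must stop, which is the intuition behind the slide's convergence claim.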

[Figure: run of CLUSTER on the example graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3, clusters C1-C4). Φ values shown: 0.94, 1.16, 0.93, 0.89, 0.71, 0.45.]

Road Map
Motivation and overview
Problem definition
Solution
  – Clustering w.r.t. hard constraint
  – Matching w.r.t. soft constraint
Evaluations on YP data
Conclusions

Matching w.r.t. Soft Constraints
Next? The matching problem
How to match?
[Figure: GRAPH TRANSFORM. The value-level graph (N1-N4, P1-P4, A1-A3) is collapsed into a cluster-level graph with nodes NC1, NC4, PC1, PC2, PC3, PC4, AC1, AC4. Each cluster-level edge carries a weight equal to its number of supporting sources. Weights shown: 7 s(1-5,7,8); 1 s(6); 5 s(1-5); 1 s(10); 9 s(1-9); 9 s(1-9); 1 s(10); 8 s(1-8).]
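
One plausible reading of the graph transform is sketched below: value-level edges between two clusters are merged and the weight is the number of distinct sources across them. That reading matches the "7 s(1-5,7,8)" label but is an assumption, as are the function name `transform_graph` and the assignment of source sets to individual value-level edges in the sample data.

```python
from collections import defaultdict


def transform_graph(value_edges, cluster_of):
    """Collapse value-level edges into weighted cluster-level edges.

    value_edges: {(node_a, node_b): set of sources} on attribute values.
    cluster_of: {node: cluster id} from the clustering step.
    Returns {(cluster_a, cluster_b): number of distinct supporting sources}.
    """
    merged = defaultdict(set)
    for (node_a, node_b), sources in value_edges.items():
        merged[(cluster_of[node_a], cluster_of[node_b])] |= sources
    return {edge: len(sources) for edge, sources in merged.items()}


# Illustrative edge labels (assumed split of sources across name nodes).
value_edges = {
    ("N1", "P1"): {"s1"},
    ("N2", "P1"): {"s2", "s3", "s4", "s5"},
    ("N3", "P1"): {"s7", "s8"},
    ("N2", "P2"): {"s6"},
}
cluster_of = {"N1": "NC1", "N2": "NC1", "N3": "NC1", "P1": "PC1", "P2": "PC2"}
weights = transform_graph(value_edges, cluster_of)
```

Taking the union of source sets rather than summing edge counts avoids counting a source twice when it supports several value-level edges between the same two clusters.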

Matching w.r.t. Soft Constraint
Intuitions
  – Largest sum of weights
  – Smallest gap
  – How to balance these two goals?
Optimization problem
  – Maximize [formula missing from the transcript]
  – Subject to [formula missing from the transcript]
Two-phase greedy algorithm: MATCH
[Figure: name cluster N connected to phone clusters P1, P2, P3 by edges with weights 1 (s1), 9 (s2-s10), and 10 (s1-s10). Three candidate matchings are shown: Solution 1 with Gap(N) = 1, Solution 2 with Gap(N) = 9, and Solution 3 with Gap(N) = 0.]

Road Map
Motivation and overview
Problem definition
Solution
Evaluations on YP data
Conclusions

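Experiment Settings
Dataset I
  – Business listings for two zip codes (07035, Lincoln Park, NJ; 07715, Belmar, NJ) from multiple sources
[Tables, one row per zip code. Column groups: Business and Source (#Sources, #Srcs/business); Records (#Recs, #Names, #Phones, #Addresses, #(Err Ps)); Constraint Violation (N→P, P→N, N→A, A→N). Most cell values are missing from the transcript. The surviving constraint-violation entries read "…%(2.6), .8%(2.7), 2%(2.3), 12.6%(5.1)" for one zip code and "…%(2), 1%(3), 4%(2), 4%(8.5)" for the other.]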

Experiment Settings
Implementation
  – MATCH (invoking CLUSTER first)
  – LINK: record linkage only
  – FUSE: data fusion only
  – LINKFUSE: first LINK, then FUSE
Gold standard: obtained by manual checking
Measures: precision, recall, and F-measure, computed separately for
  – Matching of values of different attributes
  – Clustering of values of the same attribute
Notation (symbols missing from the transcript)
  – Matched pairs for the gold standard
  – Matched pairs for our results
  – Clustered pairs for the gold standard
  – Clustered pairs for our results
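
The pair-based measures can be computed as in the sketch below, which uses the standard definitions of precision, recall, and F-measure over sets of pairs. The function name and the sample pairs are made up for illustration; the same function applies to matched pairs and to clustered pairs.

```python
def precision_recall_f(gold_pairs, result_pairs):
    """Standard precision, recall, and F-measure over sets of pairs."""
    gold, result = set(gold_pairs), set(result_pairs)
    correct = len(gold & result)
    precision = correct / len(result) if result else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical name-phone matches, for illustration only.
gold = {("N1", "P1"), ("N2", "P1"), ("N4", "P4")}
ours = {("N1", "P1"), ("N4", "P4"), ("N4", "P1")}
p, r, f = precision_recall_f(gold, ours)
```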

Accuracy
[Figure panels: Matching (NAME-PHONE), Matching (NAME-ADDRESS), and Clustering (NAME), each for zip codes 07035 and 07715.]
MATCH achieves the highest F-measure in most cases
  – Improves on LINK by 11% for name-phone matching and by 20% for name clustering
LINK vs. FUSE vs. LINKFUSE
  – LINK: high recall in matching
  – FUSE: high precision in matching, high precision in name clustering
  – LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering

Efficiency and Scalability
Data set II
  – Entire listing: 40+M records
Hadoop-based linkage framework
  – Fuzzy self-join using Hadoop
  – Partition records into strongly connected components
Efficiency
  – Linear growth
  – Execution time

Module               Execution time (hours)
Record extraction    0.002
Fuzzy self-join      0.89
Connected component  0.89
Linkage              1.36
Overall              3.26

[Figure legend: median, 95th percentile, 99th percentile, max]
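
The partitioning step can be illustrated on a single machine with union-find. This is a sketch under assumptions: it is not the Hadoop implementation the slide refers to, the function name is made up, and it treats the similar-record pairs from the fuzzy self-join as undirected edges, so the groups it returns are ordinary connected components.

```python
def connected_components(num_records, similar_pairs):
    """Group record ids 0..num_records-1 that are linked by similar pairs."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in range(num_records):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```

Each component can then be linked independently, which is what makes the later stage easy to parallelize.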

Conclusions
In the real world, we need to resolve duplicates and conflicts at the same time.
We reduce the problem to a k-partite graph clustering and matching problem
  – Combine linkage and fusion
  – Apply them globally
Experiments show high accuracy and scalability

Thank You!