Record Linkage with Uniqueness Constraints and Erroneous Values

Record Linkage with Uniqueness Constraints and Erroneous Values. Zhang Xiaojian, 26 November 2010, WAMDM Group Meeting.

Data integration process
Two challenges: heterogeneous sources (schema level) and conflicting data (instance level; value-level contradiction).
Stages, from the sources (s) to the cleaned data used by Application1 and Application2:
- Schema matching [E. Rahm, VLDBJ'01]; data exchange [R. Fagin, TODS'05]
- Duplicate detection / record linkage [A.K. Elmagarmid, TKDE'07]; entity resolution [Tech Report, Stanford]
- Data fusion [X. Dong, VLDB'09], [Felix, WWW'06], [Felix, ACMC'08]
Example of duplicate records with uncertainty:
Name | Address | Age
John R Smith | 16 Main Street | 16
J R Smith | 16 Main St | NULL

Contents
- Motivation
- Problem definition
- Solution
- Experimental results
- Conclusions
- Problems raised by the paper

Motivation
Sources s1-s4 are integrated into cleaned data that backs a search box. Sample business listings from each source (blank cells: no value shown on the slide):

s1 (Src V)
Name | Phone | Address | City
A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA |
Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA |
Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA |
Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA |
Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN
Pizza Palace Of Newtown | | |

s2 (Src D)
Name | Phone | Address | City
Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA |
Newtown Pizza Palace | 2034299114 | 65 Church hill Rd | Newtown
Pizza Palace of Newtown | 2034266114 | Church Hill Rd |

s3 (Src A)
Name | Phone | Address | City
A A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE
A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA |
Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA |
Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA |
Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA |
Alert Cellular | 8182404779 | |

s4 (Src T)
Name | Phone | Address | City
Newtown Piza Palace | 2034266114 | 65 Church hill Rd | Newtown
Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA |
ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA |
Ann Taylor Stores | | 1108 GLENDALE GALLERIA |

Current Solution
Current two-step solution:
Step 1: Record linkage. Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06].
Step 2: Data fusion. Merge the linked records and decide the correct values for each result entity in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys'08].
Uniqueness constraint: many real-world entities have a unique value for an attribute, e.g., website (IP), phone, Facebook account.
The co-existence of conflicts and duplicates makes the problem hard to solve.

Limitations of Current Solution
Example: ten sources (s1-s10) provide records with columns SOURCE, NAME, PHONE, ADDRESS. Values that appear in the table:
NAME: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.
PHONE: xxx-1255, xxx-9400, xxx-2255, xxx-0500
ADDRESS: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.
Entities and their values:
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
Assume that Phone and Address satisfy uniqueness constraints.
Erroneous values may prevent correct matching.
Current solutions may fall short when uniqueness constraints exist (annotation on PHONE: 9400 missing).

Problem Definition
Input:
- A set of records provided by a set of independent data sources
- A set of (hard or soft) uniqueness constraints
Output:
- The real-world entities
- For each (hard or soft) uniqueness attribute of each entity, the true value

Concepts
Entity and attribute.
Value vs. representations, e.g., New York City → New York City, NYC, N.Y.C.
Constraints:
- Uniqueness constraint (hard constraint): D → A. Example attributes: Business Name, Business Phone, Business Address.
- Soft uniqueness constraint (soft constraint): D → A, e.g., Business Phone with p1 = 30%, p2 = 10%, where p1 is the upper bound on the probability of an entity having multiple values for A and p2 is the upper bound on the probability of a value of A being shared by multiple entities.
- Special case: key attribute.
Example entities (edges in the figure are annotated with 1-p1 and 1-p2):
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
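One way to make p1 and p2 concrete is to measure, on a set of resolved (entity, value) pairs, how often each kind of violation occurs. The sketch below is only an illustration: the function names are hypothetical, and comparing empirical fractions against p1 and p2 is an assumption, not something the slides state.

```python
from collections import defaultdict

def uniqueness_violation_rates(pairs):
    """pairs: iterable of (entity, value) observations for one attribute A.

    Returns (fraction of entities with more than one value,
             fraction of values shared by more than one entity).
    """
    values_of = defaultdict(set)    # entity -> values observed for it
    entities_of = defaultdict(set)  # value  -> entities that carry it
    for entity, value in pairs:
        values_of[entity].add(value)
        entities_of[value].add(entity)
    if not values_of:
        return 0.0, 0.0
    multi_valued = sum(len(v) > 1 for v in values_of.values()) / len(values_of)
    shared = sum(len(e) > 1 for e in entities_of.values()) / len(entities_of)
    return multi_valued, shared

def satisfies_soft_constraint(pairs, p1, p2):
    """Assumed reading: empirical violation rates must not exceed p1 and p2."""
    multi_valued, shared = uniqueness_violation_rates(pairs)
    return multi_valued <= p1 and shared <= p2

# Phones of the two example entities.
phones = [
    ("Microsoft Corp.", "xxx-1255"),
    ("Microsoft Corp.", "xxx-9400"),
    ("Macrosoft Inc.", "xxx-0500"),
]
rates = uniqueness_violation_rates(phones)
```

A hard constraint corresponds to p1 = p2 = 0 in this reading, so the example phones (one entity with two numbers) would violate it.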

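K-Partite Graph Encoding
Each distinct value representation becomes a node, and nodes are grouped by attribute (names N, phones P, addresses A). An edge connects two nodes of different attributes when a record provides both values, and is labeled with the supporting sources.
Example: the record from s1 (Microsofe Corp., xxx-1255, 1 Microsoft Way) yields nodes N1 = Microsofe Corp., P1 = xxx-1255, A1 = 1 Microsoft Way, connected by edges labeled s(1).
Target entities:
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)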
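The encoding can be sketched as a dictionary from edges to supporting sources. This is a minimal illustration, not the authors' implementation: `build_k_partite_graph` is a hypothetical name, nodes are modeled as (attribute, value) pairs, and the second record is made up for the example.

```python
from collections import defaultdict

def build_k_partite_graph(records, attributes):
    """records: list of (source, {attribute: value}).

    Returns {(node_a, node_b): set of sources}, where a node is an
    (attribute, value) pair and edges only join different attributes.
    """
    edges = defaultdict(set)
    for source, values in records:
        present = [(a, values[a]) for a in attributes if values.get(a) is not None]
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                edges[(present[i], present[j])].add(source)
    return dict(edges)

records = [
    # The s1 record from the slide.
    ("s1", {"name": "Microsofe Corp.", "phone": "xxx-1255", "address": "1 Microsoft Way"}),
    # Hypothetical second record, added only to show shared edges.
    ("s2", {"name": "Microsoft Corp.", "phone": "xxx-1255", "address": "1 Microsoft Way"}),
]
graph = build_k_partite_graph(records, ["name", "phone", "address"])
```

Here the phone-address edge is supported by both sources, while each name node has its own edges with a single supporting source.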

Encoding of the ideal solution
Name nodes N1-N4: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.
Phone nodes P1-P4, with values xxx-1255, xxx-9400, xxx-2255, xxx-0500 (P1 = xxx-1255, P4 = xxx-0500).
Address nodes A1-A3: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.
Pre-processing for the k-partite graph: clustering within each part (subset).

Clustering with Hard Constraint
Cluster the whole graph G(S), that is, the name nodes N1-N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), the phone nodes P1-P4 (values xxx-1255, xxx-9400, xxx-2255, xxx-0500), and the address nodes A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.), into clusters C1-C4.

Clustering w.r.t. hard constraint
An ideal clustering should meet two requirements:
- High cohesion within each cluster
- Low correlation between different clusters
Objective function for finding the "best" clustering: the Davies-Bouldin index [Davies and Bouldin, TPAMI'79]. The goal is to minimize it:
min Φ, where Φ = avg over i of max over j ≠ i of [d(Ci, Ci) + d(Cj, Cj)] / d(Ci, Cj)
The numerator d(Ci, Ci) + d(Cj, Cj) corresponds to the complement of cohesion (small when cohesion is high); the denominator d(Ci, Cj) corresponds to the complement of correlation (large when correlation is low).
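As a rough illustration of the objective, the standard Davies-Bouldin index can be computed from precomputed cluster distances. The function name and the numbers below are hypothetical; all between-cluster distances are assumed to be positive.

```python
def davies_bouldin(intra, inter):
    """Standard Davies-Bouldin index.

    intra[i]    = d(Ci, Ci), distance within cluster i
    inter[i][j] = d(Ci, Cj), distance between clusters i and j (i != j, > 0)
    """
    k = len(intra)
    if k < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i in range(k):
        # Worst (largest) ratio of within-cluster spread to separation.
        total += max((intra[i] + intra[j]) / inter[i][j] for j in range(k) if j != i)
    return total / k

# Made-up distances for three clusters.
intra = [0.1, 0.2, 0.3]
inter = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 0.5],
    [2.0, 0.5, 0.0],
]
phi = davies_bouldin(intra, inter)
```

Lower values mean tighter and better separated clusters, which is why the clustering step tries to minimize Φ.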

Computing cluster distance
The cluster distance function d(Ci, Cj) is built from two components:
- dS, the similarity distance, measures similarity between value representations of the same attribute.
- dA, the association distance, measures association between value representations of different attributes.
The key is how to calculate dS and dA.

Similarity Distance
How to get dS from pairwise similarities between value representations.
Clusters: C1 = {N1 Microsofe Corp., N2 Microsoft Corp., N3 MS Corp., P1 xxx-1255, A1 1 Microsoft Way}; C4 = {N4 Macrosoft Inc., P4 xxx-0500, A2 2 Sylvan Way, A3 2 Sylvan W.}.
Pairwise similarities: among N1, N2, N3: 0.95, 0.65, 0.65; between N4 and the names in C1: 0.7, 0.7, 0.4; between A2 and A3: 0.9.
Within the same cluster:
d1S(C1,C1) = 1 − (0.95+0.65+0.65)/3 = 0.25 (name)
d2S(C1,C1) = 0 (phone)
d3S(C1,C1) = 0 (address)
dS(C1,C1) = (0.25+0+0)/3 = 0.083
Between different clusters:
d1S(C1,C4) = 1 − (0.7+0.7+0.4)/3 = 0.4 (name)
d2S(C1,C4) = 1 − 0 = 1 (phone)
d3S(C1,C4) = 1 − 0 = 1 (address)
dS(C1,C4) = (0.4+1+1)/3 = 0.8
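The arithmetic on this slide can be reproduced with a short sketch. The helper names are hypothetical, and the slide does not say which name pair carries which similarity, so the assignment in `SIM` is an assumption (it does not affect the averages).

```python
from itertools import combinations, product

# Pairwise similarities from the slide; unlisted pairs count as 0.
SIM = {
    frozenset(("N1", "N2")): 0.95,
    frozenset(("N1", "N3")): 0.65,
    frozenset(("N2", "N3")): 0.65,
    frozenset(("N1", "N4")): 0.7,
    frozenset(("N2", "N4")): 0.7,
    frozenset(("N3", "N4")): 0.4,   # assumed to be the 0.4 pair
    frozenset(("A2", "A3")): 0.9,
}

def sim(x, y):
    return 1.0 if x == y else SIM.get(frozenset((x, y)), 0.0)

def one_minus_avg_sim(pairs):
    """1 - average similarity over the given pairs (0 if there are no pairs)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return 1.0 - sum(sim(x, y) for x, y in pairs) / len(pairs)

def similarity_distance(ci, cj):
    """Average over attributes of the per-attribute similarity distance."""
    per_attr = []
    for attr in ci:
        if ci is cj:
            per_attr.append(one_minus_avg_sim(combinations(ci[attr], 2)))
        else:
            per_attr.append(one_minus_avg_sim(product(ci[attr], cj[attr])))
    return sum(per_attr) / len(per_attr)

c1 = {"name": ["N1", "N2", "N3"], "phone": ["P1"], "address": ["A1"]}
c4 = {"name": ["N4"], "phone": ["P4"], "address": ["A2", "A3"]}

d_same = similarity_distance(c1, c1)   # about 0.083
d_diff = similarity_distance(c1, c4)   # about 0.8
```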

Association Distance
How to get the association distance dA from the sources that support each edge.
Within the same cluster:
d1,2A(C1,C1) = 1 − 7/9 = 0.22
d1,3A(C1,C1) = 1 − 8/9 = 0.11
d2,3A(C1,C1) = 1 − 7/8 = 0.125
dA(C1,C1) = (0.22+0.11+0.125)/3 = 0.153
Between different clusters:
d1,2A(C1,C4) = 1 − max(1/10, 0/10) = 0.9
d1,3A(C1,C4) = 0.9
d2,3A(C1,C4) = 1
dA(C1,C4) = (0.9+0.9+1)/3 = 0.93
Edge labels in the figure (supporting sources): s(1-2), s(1-5,7,8), s(2-6), s(7-8), s(2-5), s(10), s(1-9), s(1), s(2-9), s(2-10).
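The averages above can be checked with exact fractions. The sketch takes the support ratios (7/9, 8/9, 7/8, and so on) directly from the slide and does not attempt to derive them from the source sets; the function name is hypothetical, and the 1/10 for the name-address pair is read off from the stated distance 0.9.

```python
from fractions import Fraction

def association_distance(pair_support):
    """Average of (1 - support) over all attribute pairs."""
    distances = [1 - s for s in pair_support.values()]
    return sum(distances) / len(distances)

# Support ratios within C1, as printed on the slide.
within_c1 = {
    ("name", "phone"): Fraction(7, 9),
    ("name", "address"): Fraction(8, 9),
    ("phone", "address"): Fraction(7, 8),
}

# Support ratios between C1 and C4.
between_c1_c4 = {
    ("name", "phone"): max(Fraction(1, 10), Fraction(0, 10)),
    ("name", "address"): Fraction(1, 10),
    ("phone", "address"): Fraction(0),
}

d_within = association_distance(within_c1)        # 11/72, about 0.153
d_between = association_distance(between_c1_c4)   # 14/15, about 0.93
```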

Greedy Algorithm: CLUSTER
Obtaining the optimal clustering is intractable [T.F. Gonzalez, 82], [J. Sima et al., 06].
Algorithm CLUSTER:
Step 1: Initialization. Cluster value representations according to their similarity distance and association distance.
Step 2: Adjustment. For each node, move it to the cluster that minimizes the Davies-Bouldin (DB) index.
Step 3: Convergence checking. Stop if Step 2 does not change the clustering result; otherwise, repeat Step 2.
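The adjustment loop of Steps 2 and 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the CLUSTER algorithm itself: the initial assignment is passed in rather than computed, the cluster distance is a plain average of pairwise node distances instead of the combination of dS and dA, and all names are hypothetical.

```python
from itertools import combinations, product

def intra(cluster, dist):
    """d(Ci, Ci): average pairwise distance inside a cluster (0 for a singleton)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs) if pairs else 0.0

def inter(a, b, dist):
    """d(Ci, Cj): average distance over all cross-cluster pairs."""
    pairs = list(product(a, b))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)

def db_index(clusters, dist):
    clusters = [c for c in clusters if c]
    if len(clusters) < 2:
        return float("inf")  # never prefer collapsing everything into one cluster
    total = 0.0
    for i, ci in enumerate(clusters):
        ratios = []
        for j, cj in enumerate(clusters):
            if i != j:
                sep = inter(ci, cj, dist)
                spread = intra(ci, dist) + intra(cj, dist)
                ratios.append(spread / sep if sep > 0 else float("inf"))
        total += max(ratios)
    return total / len(clusters)

def groups(assign, k):
    out = [[] for _ in range(k)]
    for node, c in assign.items():
        out[c].append(node)
    return out

def greedy_cluster(initial, k, dist, max_rounds=100):
    """Move each node to the cluster that lowers the DB index until nothing changes."""
    assign = dict(initial)
    for _ in range(max_rounds):
        changed = False
        for node in list(assign):
            best_c, best_phi = assign[node], db_index(groups(assign, k), dist)
            for c in range(k):
                phi = db_index(groups({**assign, node: c}, k), dist)
                if phi < best_phi:
                    best_c, best_phi = c, phi
            if best_c != assign[node]:
                assign[node] = best_c
                changed = True
        if not changed:  # Step 3: convergence
            break
    return assign

# Toy data: 0.2 starts in the wrong cluster and should be moved.
initial = {0.0: 0, 0.1: 0, 0.2: 1, 5.0: 1, 5.1: 1}
result = greedy_cluster(initial, 2, lambda x, y: abs(x - y))
```

Because every accepted move strictly lowers the index, the loop terminates; like any greedy local search, it may stop at a local optimum.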

Example run of CLUSTER (figure not reproduced): the name nodes N1-N4, phone nodes P1-P4, and address nodes A1-A3 are adjusted step by step. Φ values shown on the slide: 0.94, 1.16, 0.93, 0.71, 1.15, 0.92, followed by 0.89, 0.71, 0.45. Resulting clusters: C1, C2, C3, C4.

Matching w.r.t. Soft Constraints
Graph transform: each cluster becomes a single node (NC1, NC4 for names; PC1-PC4 for phones; AC1, AC4 for addresses), and each edge is weighted by the number of supporting sources. Edges in the example: 7 s(1-5,7,8), 1 s(6), 5 s(1-5), s(10), 9 s(1-9), 8 s(1-8).
The next step is to find the best matching between the key attribute and the soft uniqueness attributes. How to match?

Matching w.r.t. Soft Constraints
Goals:
- Maximize the sum of the weights w(e) of the selected edges
- Minimize the gap Gap(N) for each node
How to balance these two goals? Define a score function that balances w(e) and Gap(N); the "best" matching maximizes the score function.
Greedy algorithm: MATCH
Getting Gap(N) and w(u,v), example: node N1 has edges of weight 1 (s1), 9 (s2-s10), and 7 (s4-s10) to the phone nodes P1-P3.

Continue the example
Edges of the example graph between N1, N2 and P1-P4, with weight and supporting sources: 1 (s1), 3 (s3-s5), 7 (s4-s10), 8 (s2-s9), 9 (s2-s10), 10 (s1-s10). Greedy selection is illustrated on four candidate solutions:

Solution | Selected edges | Gaps
1 | w(N1,P1) = 1, w(N2,P2) = 3 | Gap(N1) = 9, Gap(N2) = 5, Gap(P1) = 0, Gap(P2) = 4
2 | w(N1,P2) = 7, w(N2,P4) = 8 | Gap(N1) = 3, Gap(N2) = 0, Gap(P2) = 4, Gap(P4) = 2
3 | w(N1,P3) = 9, w(N2,P2) = 8 | Gap(N1) = 1, Gap(N2) = 0, Gap(P3) = 0, Gap(P4) = 2
4 | w(N1,P4) = 10, w(N2,P2) = 8 | Gap(N1) = 0, Gap(N2) = 0, Gap(P4) = 2, Gap(P4) = 2
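A greedy matcher in the spirit of the slides can be sketched as below. Several parts are assumptions rather than the MATCH algorithm itself: the gap of a node for an edge is taken as the node's largest edge weight minus that edge's weight (which agrees with most gap values in the example), the score is a hypothetical `w - lam * (gap_u + gap_v)`, and the matching is forced to be one-to-one, which ignores the slack that soft constraints allow.

```python
def greedy_match(edges, lam=1.0):
    """edges: {(u, v): weight}. Returns the selected edges as a list of (u, v)."""
    # Largest edge weight seen at each node.
    best = {}
    for (u, v), w in edges.items():
        best[u] = max(best.get(u, w), w)
        best[v] = max(best.get(v, w), w)

    def score(item):
        (u, v), w = item
        gap_u, gap_v = best[u] - w, best[v] - w
        return w - lam * (gap_u + gap_v)  # hypothetical score function

    matched, used = [], set()
    for (u, v), _ in sorted(edges.items(), key=score, reverse=True):
        if u not in used and v not in used:
            matched.append((u, v))
            used.update((u, v))
    return matched

# Edge weights from the example graph.
edges = {
    ("N1", "P1"): 1,
    ("N1", "P2"): 7,
    ("N1", "P3"): 9,
    ("N1", "P4"): 10,
    ("N2", "P2"): 3,
    ("N2", "P4"): 8,
}
selected = greedy_match(edges)
```

With these weights the sketch first picks (N1, P4), the edge with the largest weight and no gap, and then has to give N2 its only remaining option, (N2, P2). A matcher that honors soft constraints could instead let N2 share P4.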

Experiment Settings
Dataset I: business listings for two zip codes (07035, 07715) from multiple sources.

Zip | #Businesses | #Sources | #Sources/business
07035 | 662 | 15 | 1-7
07715 | 149 | 6 | 1-3

Zip | #Records | #Names | #Phones | #Addresses | #Error phones
07035 | 1629 | 1154 | 839 | 735 | 72
07715 | 266 | 243 | 184 | 55 | 12

Experiment Settings
Implementations:
- MATCH + CLUSTER
- LINK: linkage only
- FUSE: data fusion only
- LINKFUSE: first LINK, then FUSE
Gold standard: obtained by manual checking.
Measures: precision, recall, and F-measure, for
- matching of values of different attributes
- clustering of values of the same attribute
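The formulas on this slide did not survive extraction. The sketch below uses the standard definitions of the three measures over sets of pairs, which is an assumption about how the slides define them; the function name and the sample pairs are hypothetical.

```python
def precision_recall_f(found, gold):
    """Standard precision, recall, and F-measure over two sets of pairs."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up name-phone matches: two of three found pairs are in the gold standard.
gold = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Microsoft Corp.", "xxx-9400"),
    ("Macrosoft Inc.", "xxx-0500"),
}
found = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Macrosoft Inc.", "xxx-0500"),
    ("Macrosoft Inc.", "xxx-2255"),
}
p, r, f = precision_recall_f(found, gold)
```

For clustering, the same function can be applied to pairs of values placed in the same cluster; for matching, to pairs of values of different attributes assigned to the same entity.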

Accuracy (figures not reproduced)
Panels for each zip code (07035 and 07715): Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME).

Efficiency and Scalability

Conclusions
- In the real world, duplicates and conflicts need to be resolved at the same time.
- The problem is reduced to a k-partite graph clustering and matching problem, which combines linkage and fusion.
- Experiments show high efficiency and scalability.

Thank You!