Published by Sharleen Atkinson. Modified over 9 years ago.
1
Data Matching
Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
By Peter Christen
Presented by Joseph Park
2
Introduction
“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”
Also known as:
- Record or data linkage
- Entity resolution
- Object identification
- Field matching
3
Aims & Challenges
Three tasks:
- Schema matching
- Data matching
- Data fusion
Challenges:
- Lack of unique entity identifiers, and poor data quality
- Computational complexity
- Lack of training data (e.g. gold standards)
- Privacy and confidentiality (health informatics & data mining)
4
Overview of Data Matching
Five major steps:
- Data pre-processing
- Indexing
- Record pair comparison
- Classification
- Evaluation
5
Diagram
6
Data Pre-processing
- Remove unwanted characters and words
- Expand abbreviations and correct misspellings
- Segment attributes into well-defined and consistent output attributes
- Verify the correctness of attribute values
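The first three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the abbreviation table and the functions `clean_value` and `segment_address` are hypothetical, and it handles only simple street addresses.

```python
import re

# Hypothetical abbreviation table; a real system would use a curated lookup.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def clean_value(value: str) -> str:
    """Lower-case, remove unwanted characters, and expand abbreviations."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)  # drop punctuation and symbols
    tokens = value.split()                       # also collapses whitespace
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)

def segment_address(address: str) -> dict:
    """Split a cleaned address into consistent output attributes."""
    tokens = clean_value(address).split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    street = " ".join(tokens[1:] if number else tokens)
    return {"street_number": number, "street_name": street}

print(segment_address("42  Main St."))
```

Verifying attribute values (the fourth step) usually needs external reference data, such as a list of valid postcodes, so it is left out of this sketch.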
7
Example of Data Pre-processing
8
Indexing
- Reduces computational complexity by avoiding the comparison of every possible record pair
- Generates candidate record pairs
- Common technique: blocking
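A minimal sketch of blocking, assuming deduplication within a single set of records. The blocking key used here (first three letters of the surname plus the postcode) and the sample records are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Assumed key: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]

def candidate_pairs(records: dict) -> set:
    """Group records by blocking key and pair up records within each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = {
    "a1": {"surname": "Smith", "postcode": "2600"},
    "a2": {"surname": "Smithe", "postcode": "2600"},
    "a3": {"surname": "Miller", "postcode": "2600"},
    "a4": {"surname": "Millar", "postcode": "2914"},
}
print(candidate_pairs(records))
```

Only records that share a block are compared, so four records yield one candidate pair here instead of six. The trade-off is that a true match whose key values differ (for example, because of a wrong postcode) is never compared.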
9
Example of Blocking
10
Record Pair Comparison
- Comparison vector: a vector of numerical similarity values, one per compared attribute
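As a rough illustration, a comparison vector can be built by applying one similarity function per attribute. The field names and the choice of comparators below are assumptions, and `difflib.SequenceMatcher` from the Python standard library stands in for a dedicated string comparator.

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact_sim(a: str, b: str) -> float:
    """1.0 if both values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0

# Hypothetical attribute-to-comparator assignment.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]

def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    """Return one similarity value per compared attribute."""
    return [compare(rec_a[field], rec_b[field]) for field, compare in COMPARATORS]

rec_a = {"given_name": "Jon", "surname": "Smith", "postcode": "2600"}
rec_b = {"given_name": "John", "surname": "Smith", "postcode": "2601"}
print(comparison_vector(rec_a, rec_b))
```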
11
Example of Record Pair Comparison
12
Jaro and Winkler String Comparison
- Jaro: combines edit distance and q-gram based comparison
- Winkler: increases the Jaro similarity for up to four agreeing initial characters
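The two comparators can be sketched as follows. This follows the commonly published definitions of Jaro and Jaro-Winkler similarity; the prefix scaling factor of 0.1 is the value usually quoted and is an assumption here, since the slide does not state it.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on common characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as common if they lie within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: common characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Boost the Jaro similarity for up to four agreeing initial characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)

print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

For "MARTHA" and "MARHTA" all six characters are common and one transposition occurs, giving a Jaro similarity of about 0.944; the three agreeing initial characters raise the Winkler value to about 0.961.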
13
Record Pair Classification
- Two-class classification: match or non-match
- Three-class classification: match, non-match, or potential match (requires clerical review)
- Supervised and unsupervised
- Active learning
14
Example of Record Pair Classification
15
Unsupervised Classification
- Threshold-based classification
- Probabilistic classification
- Cost-based classification
- Rule-based classification
- Clustering-based classification
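Threshold-based classification is the simplest of these approaches and can be sketched as below. The comparison vector is summed into a single score and compared against two thresholds; the threshold values are hypothetical and would need tuning for real data.

```python
def classify(vector, lower: float = 1.5, upper: float = 2.5) -> str:
    """Three-class threshold classification on the summed similarities."""
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"  # left for clerical review

print(classify([1.0, 1.0, 0.9]))
print(classify([0.9, 0.6, 0.5]))
print(classify([0.2, 0.1, 0.0]))
```

Setting `lower` equal to `upper` turns this into a two-class classifier with no clerical review region.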
16
Probabilistic Classification
- Three-class based
- Different weights assigned to different attributes
- Newcombe & Kennedy: weights based on attribute cardinalities
- Comparison vectors use binary comparison (agree or disagree)
- Attributes assumed to be conditionally independent
17
Formulae
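The formulae shown on this slide are not preserved in the transcript. For orientation, the standard Fellegi-Sunter formulation of probabilistic record linkage, which relies on binary comparisons and conditionally independent attributes, is usually written as follows. The symbols $m_i$, $u_i$, $\gamma_i$, $t_{\text{lower}}$ and $t_{\text{upper}}$ are the conventional notation and are an assumption about what the slide displayed.

```latex
% m_i: probability that attribute i agrees, given the pair is a true match (M)
% u_i: probability that attribute i agrees, given the pair is a true non-match (U)
m_i = P(\gamma_i = 1 \mid M), \qquad u_i = P(\gamma_i = 1 \mid U)

% Weight contributed by attribute i for binary agreement value \gamma_i
w_i =
\begin{cases}
  \log_2 \dfrac{m_i}{u_i}         & \text{if } \gamma_i = 1 \text{ (agreement)} \\[2ex]
  \log_2 \dfrac{1 - m_i}{1 - u_i} & \text{if } \gamma_i = 0 \text{ (disagreement)}
\end{cases}

% Under conditional independence the total match weight is the sum
W = \sum_{i=1}^{n} w_i

% Three-class decision with two thresholds
\text{class} =
\begin{cases}
  \text{match}           & \text{if } W \ge t_{\text{upper}} \\
  \text{potential match} & \text{if } t_{\text{lower}} < W < t_{\text{upper}} \\
  \text{non-match}       & \text{if } W \le t_{\text{lower}}
\end{cases}
```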
18
Example of Probabilistic Classification
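The example on this slide is not preserved in the transcript. As a stand-in, the sketch below computes a match weight in the usual Fellegi-Sunter style: each attribute contributes a positive weight on agreement and a negative weight on disagreement. The m- and u-probabilities are made-up values for illustration, not figures from the presentation.

```python
import math

# Hypothetical (m, u) probabilities per attribute:
# m = P(attribute agrees | true match), u = P(attribute agrees | true non-match)
M_U = {
    "given_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "postcode": (0.90, 0.05),
}

def match_weight(agreements: dict) -> float:
    """Sum per-attribute log-likelihood ratios for binary agreements."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

pair = {"given_name": True, "surname": True, "postcode": False}
print(round(match_weight(pair), 2))
```

The summed weight would then be compared against a lower and an upper threshold to decide between match, potential match, and non-match.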
19
Active Learning
- Trains a model with a small set of seed data
- Classifies comparison vectors not in the training set as matches or non-matches
- Asks users for help on the vectors that are most difficult to classify
- Adds the manually classified vectors to the training data set
- Trains the next, improved classification model
- Repeats until a stopping criterion is met
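The loop described above can be sketched as follows. Everything here is a simplified assumption: the "model" is just a threshold on the summed similarities, the most difficult vectors are taken to be those closest to that threshold, and `oracle` is a hypothetical callback standing in for the human reviewer. The seed data must contain at least one match and one non-match.

```python
def train(labelled):
    """Fit a trivial model: a threshold midway between the mean summed
    similarity of labelled matches and of labelled non-matches."""
    match_scores = [sum(v) for v, is_match in labelled if is_match]
    non_scores = [sum(v) for v, is_match in labelled if not is_match]
    return (sum(match_scores) / len(match_scores)
            + sum(non_scores) / len(non_scores)) / 2

def active_learning(seed, unlabelled, oracle, batch_size=2, max_rounds=5):
    labelled = list(seed)
    pool = list(unlabelled)
    threshold = train(labelled)            # model trained on the seed data
    for _ in range(max_rounds):            # stopping criterion: round limit
        if not pool:                       # ... or nothing left to label
            break
        # Most difficult vectors: summed similarity closest to the threshold.
        pool.sort(key=lambda v: abs(sum(v) - threshold))
        queried, pool = pool[:batch_size], pool[batch_size:]
        # Ask the user (oracle) and add the answers to the training data.
        labelled.extend((v, oracle(v)) for v in queried)
        threshold = train(labelled)        # train the next model
    return threshold, labelled

seed = [([1.0, 1.0], True), ([0.0, 0.1], False)]
pool = [[0.9, 0.8], [0.5, 0.6], [0.2, 0.1], [0.6, 0.5]]
reviewer = lambda v: sum(v) >= 1.0         # stand-in for manual review
threshold, labelled = active_learning(seed, pool, reviewer)
print(threshold, len(labelled))
```

Classifying a new comparison vector then amounts to checking whether its summed similarity is at or above the returned threshold.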