Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, by Peter Christen. Presented by Joseph Park.


Introduction
“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”
Also known as:
- Record or data linkage
- Entity resolution
- Object identification
- Field matching

Aims & Challenges
Three tasks:
- Schema matching
- Data matching
- Data fusion
Challenges:
- Lack of unique entity identifiers, and poor data quality
- Computational complexity
- Lack of training data (e.g. gold standards)
- Privacy and confidentiality (health informatics & data mining)

Overview of Data Matching
Five major steps:
1. Data pre-processing
2. Indexing
3. Record pair comparison
4. Classification
5. Evaluation

Diagram

Data Pre-processing
- Remove unwanted characters and words
- Expand abbreviations and correct misspellings
- Segment attributes into well-defined and consistent output attributes
- Verify the correctness of attribute values
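
The first three pre-processing steps can be sketched in a few lines of Python. This is only an illustration: the abbreviation table, the ASCII-only character filter, and the assumed address layout ("number, then street name") are assumptions, not part of the presentation.

```python
import re

# Hypothetical lookup table; a real one would be larger and domain-specific.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}


def clean(value: str) -> str:
    """Lower-case, drop punctuation, collapse whitespace, expand abbreviations."""
    value = value.lower()
    # Remove unwanted characters (assumes ASCII-only input for simplicity).
    value = re.sub(r"[^a-z0-9\s]", " ", value)
    tokens = value.split()  # split() also collapses repeated whitespace
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)


def segment_address(value: str) -> dict:
    """Split an address assumed to look like '<number> <street name>' into fields."""
    tokens = clean(value).split()
    if tokens and tokens[0].isdigit():
        return {"number": tokens[0], "street": " ".join(tokens[1:])}
    return {"number": "", "street": " ".join(tokens)}


print(clean("42  Main St."))
print(segment_address("42  Main St."))
```

Verifying attribute values (the fourth step) usually needs external reference data, such as a postcode table, so it is left out of this sketch.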

Example of Data Pre-processing

Indexing
- Reduces computational complexity by avoiding a comparison of every possible record pair
- Generates candidate record pairs
- Common technique: blocking
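
A minimal sketch of blocking for deduplicating a single database is shown below. The blocking key (first letter of the surname plus the postcode) and the record fields are assumptions chosen for illustration; only records that share a key become candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Assumed key: first letter of the surname plus the postcode.
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records: dict) -> set:
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "a1": {"surname": "Smith", "postcode": "2600"},
    "a2": {"surname": "Smyth", "postcode": "2600"},
    "a3": {"surname": "Jones", "postcode": "2600"},
    "a4": {"surname": "Smith", "postcode": "3000"},
}
print(candidate_pairs(records))
```

With four records there are six possible pairs, but only ("a1", "a2") shares a blocking key. The trade-off is that a true match with an error in the key attributes (a mistyped postcode, say) is never compared.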

Example of Blocking

Record Pair Comparison
- Comparison vector: a vector of numerical similarity values, one per compared attribute
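
As a rough sketch, a comparison vector can be built by applying one similarity function per attribute. The attribute names are assumptions, and Python's `difflib.SequenceMatcher` is used here only as a convenient stand-in for an approximate string comparator.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 when the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Assumed attributes and the comparator applied to each of them.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]


def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    """Return one numerical similarity value per compared attribute."""
    return [compare(rec_a[field], rec_b[field]) for field, compare in COMPARATORS]


a = {"given_name": "Peter", "surname": "Smith", "postcode": "2600"}
b = {"given_name": "Peter", "surname": "Smyth", "postcode": "2600"}
print(comparison_vector(a, b))
```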

Example of Record Pair Comparison

Jaro and Winkler String Comparison
Jaro:
- Combines edit distance and q-gram based comparison
Winkler:
- Increases Jaro similarity for up to four agreeing initial chars
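
The two comparators can be sketched as follows, using the commonly published formulation rather than anything shown on the slide. The scaling factor `p = 0.1` is the value usually quoted for the Winkler adjustment and is an assumption here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: counts common characters within a sliding window
    and penalises transpositions among them."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Winkler adjustment: boost the Jaro score for up to four agreeing initial characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For "MARTHA" and "MARHTA" all six characters match with one transposition, so the Jaro score is (1 + 1 + 5/6) / 3, and the shared prefix "MAR" raises it further.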

Record Pair Classification
Two-class or three-class classification:
- Match or non-match
- Match or non-match or potential match (requires clerical review)
Supervised and unsupervised
Active learning

Example of Record Pair Classification

Unsupervised Classification
- Threshold-based classification
- Probabilistic classification
- Cost-based classification
- Rule-based classification
- Clustering-based classification
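
Threshold-based classification, the simplest approach in this list, can be sketched by summing a comparison vector and comparing the total against two thresholds. The threshold values below are arbitrary assumptions; in practice they are tuned to the data.

```python
def classify(vector, lower: float = 1.5, upper: float = 2.5) -> str:
    """Three-class threshold classifier on the summed similarity values.

    `lower` and `upper` are illustrative values, not recommended defaults.
    """
    score = sum(vector)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "potential match"  # left for clerical review


print(classify([1.0, 0.8, 1.0]))
print(classify([1.0, 0.5, 0.5]))
print(classify([0.2, 0.1, 0.0]))
```

Setting `lower` equal to `upper` turns this into a two-class classifier with no clerical review region.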

Probabilistic Classification
- Three-class based
- Different weights assigned to different attributes
  - Newcombe & Kennedy – cardinalities
- Comparison vectors, binary comparison
- Conditionally independent attributes assumed
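
One common way to realise these points is the log-likelihood-ratio weighting usually attributed to Fellegi and Sunter. The transcript does not show the formulae from the talk, so treating this as the intended formulation is an assumption, and the m- and u-probabilities and the thresholds below are made-up illustrative values.

```python
import math

# Hypothetical per-attribute probabilities:
#   m = P(attribute agrees | records are a true match)
#   u = P(attribute agrees | records are a true non-match)
M_U = {
    "given_name": (0.95, 0.01),
    "surname": (0.90, 0.05),
    "postcode": (0.80, 0.10),
}


def match_weight(agreements: dict) -> float:
    """Sum per-attribute log2 weights for a binary comparison vector.

    Summing is valid only under the conditional independence assumption.
    """
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify_weight(weight: float, lower: float = 0.0, upper: float = 6.0) -> str:
    """Three-class decision; the thresholds are illustrative assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"


pair = {"given_name": True, "surname": False, "postcode": False}
w = match_weight(pair)
print(round(w, 2), classify_weight(w))
```

An attribute with many distinct values has a low u-probability, so agreement on it earns a large positive weight; this is how attribute cardinality ends up influencing the weights.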

Formulae

Example of Probabilistic Classification

Active Learning
1. Trains a model with a small set of seed data
2. Classifies comparison vectors not in the training set as matches or non-matches
3. Asks users for help with the vectors that are most difficult to classify
4. Adds the manually classified vectors to the training data set
5. Trains the next, improved classification model
6. Repeats until a stopping criterion is met
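
The loop can be sketched with a deliberately tiny model: a single threshold on the summed comparison vector. Everything here is an assumption made for illustration, including the model, the "closest to the threshold" notion of difficulty, the `oracle` callback standing in for the user, and the query budget used as the stopping criterion.

```python
def train(labelled):
    """Fit a one-parameter model: a threshold midway between the mean
    summed similarity of labelled matches and of labelled non-matches.
    Requires at least one example of each class."""
    pos = [sum(v) for v, is_match in labelled if is_match]
    neg = [sum(v) for v, is_match in labelled if not is_match]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def active_learn(seed, pool, oracle, batch=1, max_queries=3):
    """Repeatedly query the oracle on the hardest vectors and retrain."""
    labelled = list(seed)
    pool = list(pool)
    threshold = train(labelled)
    queries = 0
    while pool and queries < max_queries:
        # Most difficult to classify = closest to the current threshold.
        pool.sort(key=lambda v: abs(sum(v) - threshold))
        hardest, pool = pool[:batch], pool[batch:]
        labelled += [(v, oracle(v)) for v in hardest]
        queries += len(hardest)
        threshold = train(labelled)  # the next, improved model
    return threshold, labelled


seed = [([1.0, 1.0], True), ([0.0, 0.0], False)]
pool = [[0.9, 0.8], [0.5, 0.6], [0.1, 0.2], [0.6, 0.6]]
# Simulated user who labels a pair a match when its summed similarity is at least 1.15.
threshold, labelled = active_learn(seed, pool, oracle=lambda v: sum(v) >= 1.15)
print(threshold, len(labelled))
```

A real system would replace `train` with a proper classifier and the lambda with an actual clerical review step; the control flow stays the same.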