Lecture 9: Entity Resolution

Today's Agenda
- Overview of Data Integration
- Entity Resolution (ER)
- Pairwise ER

Section 1 1. Data Integration

Section 1 Data, data, data…

Section 1 Data Integration = Value
Step 0: Source Selection
Step 1: Schema Alignment
Step 2: Entity Resolution
Step 3: Data Fusion

Section 1 Modern Data Integration

Section 2 2. Entity Resolution

Section 2 What is Entity Resolution?
The problem of identifying and linking/grouping different manifestations of the same real-world object.
Examples:
- Different ways of addressing the same person in text
- Web pages with different descriptions of the same business
- Different photos of the same person

Section 2 Entity Resolution itself has duplicate names
Record linkage, duplicate detection, fuzzy match, reference reconciliation, object consolidation, entity clustering, reference matching, merge/purge, deduplication, coreference resolution, object identification, approximate match…

Section 2 Examples

Section 2 Examples Name/attribute ambiguity

Section 2 Abstract Problem Statement

Section 2 Deduplication

Section 2 Record linkage / Entity Matching

Section 2 Reference Matching

Section 2 Reference Matching

Section 2 Solving ER

Section 2 Metrics
Pairwise metrics:
- Precision/Recall, F1
- # of predicted matching pairs
Cluster-level metrics:
- Purity, completeness, complexity
- Precision/recall/F1: cluster-level, closest cluster
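The pairwise metrics above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes both the predicted and the true resolution are given as lists of record-ID sets, and the names `matching_pairs` and `pairwise_prf` and the toy record IDs are hypothetical.

```python
from itertools import combinations


def matching_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (smaller, larger) order.
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall and F1 of a predicted clustering."""
    pred = matching_pairs(predicted)
    gold = matching_pairs(truth)
    true_pos = len(pred & gold)
    # Convention assumed here: an empty pair set scores 1.0.
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (made up): six records, three true entities.
truth = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
predicted = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]

precision, recall, f1 = pairwise_prf(predicted, truth)
print(len(matching_pairs(predicted)), precision, recall, f1)
```

Here the prediction contains 3 matching pairs, 2 of which are among the 4 true pairs, so precision is 2/3 and recall is 1/2.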

Section 2 Typical Assumptions

Section 2 ER vs. Classification

Section 2 ER vs. (Multi-relational) Clustering
- Computing entities from records is a clustering problem.
- In typical clustering algorithms (k-means, LDA, etc.) the number of clusters is a constant or sublinear in R.
- In ER, the number of clusters is linear in R and the average cluster size is a constant. A significant fraction of clusters are singletons.

Section 3 3. Pairwise ER

Section 3 Pairwise Match Score
Problem: Given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match).
Solutions:
- Weighted sum or average of component-wise similarity scores; a threshold determines match or non-match
  - Hard to pick weights
  - Hard to tune a threshold
- Rules about what constitutes a match
  - Finding the right set of rules is hard
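The weighted-average solution can be sketched as follows. This is only an illustration under stated assumptions: the field names, weights, similarity values, and the 0.75 threshold are all made up, and the resulting score is a heuristic, not a calibrated estimate of P(x and y match).

```python
def match_score(similarities, weights):
    """Weighted average of component-wise similarity scores in [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(weights[field] * similarities[field] for field in weights)
    return weighted / total_weight


def is_match(similarities, weights, threshold):
    """Declare a match when the score reaches the threshold."""
    return match_score(similarities, weights) >= threshold


# Hypothetical hand-picked weights and threshold.
weights = {"name": 0.5, "address": 0.3, "phone": 0.2}
threshold = 0.75

# Hypothetical similarity vectors for two record pairs.
pair_a = {"name": 0.9, "address": 0.8, "phone": 1.0}
pair_b = {"name": 0.4, "address": 0.2, "phone": 0.0}

for label, sims in (("pair_a", pair_a), ("pair_b", pair_b)):
    print(label, round(match_score(sims, weights), 2),
          is_match(sims, weights, threshold))
```

The difficulty named on the slide is visible even here: nothing in the data says why the name should count for 0.5 or why 0.75 is the right cut-off, and a different choice can flip borderline pairs.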

Section 3 Basic ML Approach

Section 3 Fellegi & Sunter Model

Section 3 ML Pairwise Approaches
Supervised ML algorithms:
- Decision trees
- Support vector machines
- Ensembles of classifiers
- Conditional random fields
Issues:
- Training set generation
- Imbalanced classes: many more negatives than positives (even after eliminating obvious non-matches with blocking)
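To make the class-imbalance issue concrete, here is a small sketch that counts positive and negative pairs before and after blocking. The records, the entity labels, and the blocking key (first letter of the name) are all hypothetical choices for illustration; the helper names `candidate_pairs` and `count_matches` are not from the lecture.

```python
from collections import defaultdict
from itertools import combinations

# Toy records (made up): id -> (name, true entity label).
records = {
    1: ("John Smith", "e1"),
    2: ("Jon Smith", "e1"),
    3: ("Jane Smythe", "e2"),
    4: ("Jack Stone", "e3"),
    5: ("Mary Jones", "e4"),
    6: ("M. Jones", "e4"),
    7: ("Mark Johnson", "e5"),
    8: ("Mia James", "e6"),
}


def candidate_pairs(records, block_key):
    """Pairs of record ids that fall into the same block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[block_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def count_matches(pairs, records):
    """Number of pairs whose two records share the true entity label."""
    return sum(1 for a, b in pairs if records[a][1] == records[b][1])


all_pairs = set(combinations(sorted(records), 2))
# Assumed blocking key: first letter of the name field.
blocked = candidate_pairs(records, lambda rec: rec[0][0])

for label, pairs in (("all pairs", all_pairs), ("after blocking", blocked)):
    positives = count_matches(pairs, records)
    print(label, "positives:", positives, "negatives:", len(pairs) - positives)
```

In this toy set, blocking cuts the candidates from 28 pairs to 12 while keeping both true matches, yet negatives still outnumber positives 10 to 2.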

Section 3 Creating a Training Set is a key issue

Section 3 Avoid creating a dataset
- Unsupervised / semi-supervised methods: EM, generative models
- Active learning: ensemble methods, active learning to optimize for precision/recall, crowdsourcing

Section 3 Summary
- Many algorithms for independent classification of pairs of records as match/non-match
- ML-based classification & Fellegi-Sunter
  - Pro: advanced state of the art
  - Con: building high-fidelity training sets is a hard problem
- Active learning and crowdsourcing for ER are active areas of research (next lecture)