Lecture 9: Entity Resolution

Today’s Agenda: Overview of Data Integration; Entity Resolution (ER); Pairwise ER

1. Data Integration

Data, data, data…

Data Integration = Value. Step 0: Source Selection. Step 1: Schema Alignment. Step 2: Entity Resolution. Step 3: Data Fusion.

Modern Data Integration

2. Entity Resolution

What is Entity Resolution? The problem of identifying and linking or grouping different manifestations of the same real-world object. Examples: different ways of addressing the same person in text; web pages with different descriptions of the same business; different photos of the same person.

Entity Resolution itself has duplicate names: record linkage, duplicate detection, fuzzy match, reference reconciliation, object consolidation, entity clustering, reference matching, merge/purge, deduplication, coreference resolution, object identification, approximate match, …

Examples

Examples: name/attribute ambiguity

Abstract Problem Statement

Deduplication

Record Linkage / Entity Matching

Reference Matching

Solving ER

Metrics. Pairwise metrics: precision/recall, F1; number of predicted matching pairs. Cluster-level metrics: purity, completeness, complexity; precision/recall/F1 (cluster-level, closest cluster).
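
The pairwise metrics can be sketched as follows. This is a minimal illustration, not the lecture's code: it assumes both the predicted and the true resolution are given as lists of record-ID sets, and the function names and toy clusters are made up for the example. Every pair of records that shares a cluster counts as a predicted (or true) matching pair.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_prf(predicted_clusters, true_clusters):
    """Pairwise precision, recall and F1 of a predicted clustering."""
    predicted = cluster_pairs(predicted_clusters)
    truth = cluster_pairs(true_clusters)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(truth) if truth else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: six records, three true entities.
true_clusters = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
predicted_clusters = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]
precision, recall, f1 = pairwise_prf(predicted_clusters, true_clusters)
print(precision, recall, f1)
```

Singleton clusters contribute no pairs, so pairwise metrics say nothing about records that are correctly left alone; that is one reason cluster-level metrics are reported alongside them.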

Typical Assumptions

ER vs. Classification

ER vs. (Multi-relational) Clustering. Computing entities from records is a clustering problem. In typical clustering algorithms (k-means, LDA, etc.), the number of clusters is constant or sublinear in R. In ER, the number of clusters is linear in R and the average cluster size is a constant. A significant fraction of clusters are singletons.
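
One simple way to turn pairwise match decisions into entities is to take the transitive closure of the matches with a union-find structure. The sketch below assumes that approach (the slides do not prescribe it), and the record IDs and the `resolve_entities` name are illustrative. It also shows the shape described above: most records end up in singleton clusters.

```python
def resolve_entities(records, matched_pairs):
    """Group records into entities via the transitive closure of matches."""
    parent = {r: r for r in records}

    def find(r):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy data: two match decisions over six records.
records = ["r1", "r2", "r3", "r4", "r5", "r6"]
matched_pairs = [("r1", "r2"), ("r2", "r3")]
entities = resolve_entities(records, matched_pairs)
singletons = [c for c in entities if len(c) == 1]
print(entities)
```

Transitive closure is the crudest clustering choice: a single false-positive pair merges two entities, so precision of the pairwise step matters a great deal.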

3. Pairwise ER

Pairwise Match Score. Problem: given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match). Solutions: (1) A weighted sum or average of the component-wise similarity scores, with a threshold that determines match or non-match. It is hard to pick the weights and hard to tune the threshold. (2) Rules about what constitutes a match. Finding the right set of rules is hard.
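
The weighted-average solution can be sketched as below. Everything concrete here is an assumption for illustration: the three fields, the Jaccard and exact-match comparators, the weights, and the 0.6 threshold are hand-picked, which is exactly the tuning difficulty the slide points out. Note that the result is a score in [0, 1], not a calibrated probability.

```python
def jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def exact(a, b):
    """1.0 if the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical schema, weights and threshold (all hand-picked).
COMPARATORS = {"name": jaccard, "address": jaccard, "zip": exact}
WEIGHTS = {"name": 0.5, "address": 0.3, "zip": 0.2}
THRESHOLD = 0.6


def similarity_vector(x, y):
    """Component-wise similarities for a pair of records."""
    return {field: cmp(x[field], y[field]) for field, cmp in COMPARATORS.items()}


def match_score(x, y):
    """Weighted average of the component-wise similarities."""
    sims = similarity_vector(x, y)
    return sum(WEIGHTS[f] * sims[f] for f in sims) / sum(WEIGHTS.values())


def is_match(x, y):
    return match_score(x, y) >= THRESHOLD


a = {"name": "Joe's Pizza", "address": "12 Main St", "zip": "10001"}
b = {"name": "Joe's Pizza Inc", "address": "12 Main Street", "zip": "10001"}
c = {"name": "Main Street Bakery", "address": "40 Oak Ave", "zip": "94110"}
print(match_score(a, b), is_match(a, b), is_match(a, c))
```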

Basic ML Approach

Fellegi & Sunter Model
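
A minimal sketch of the classic Fellegi-Sunter decision rule, assuming field comparisons are conditionally independent given match status. For each field, m is the probability that the field agrees on a true match and u is the probability that it agrees on a non-match; the record pair's weight is the sum of per-field log-likelihood ratios, compared against an upper and a lower threshold. The m and u values, field names, and thresholds below are made-up illustrative numbers, not taken from the lecture.

```python
import math

# Hypothetical m- and u-probabilities per field (illustrative values only).
M = {"name": 0.9, "zip": 0.95}   # P(field agrees | records match)
U = {"name": 0.1, "zip": 0.05}   # P(field agrees | records do not match)
UPPER, LOWER = 3.0, -3.0         # hypothetical decision thresholds


def match_weight(agreement):
    """Sum of per-field log2 likelihood ratios for an agreement pattern."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M[field], U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def decide(agreement):
    """Three-way decision: match, non-match, or possible match."""
    weight = match_weight(agreement)
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "possible match"


print(decide({"name": True, "zip": True}))
print(decide({"name": True, "zip": False}))
print(decide({"name": False, "zip": False}))
```

Pairs that fall between the two thresholds are the ones traditionally sent to clerical review.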

ML Pairwise Approaches. Supervised ML algorithms: decision trees, support vector machines, ensembles of classifiers, conditional random fields. Issues: training set generation; imbalanced classes, with many more negatives than positives (even after eliminating obvious non-matches with blocking).
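
To make the class-imbalance issue concrete, the sketch below counts candidate pairs and computes inverse-frequency class weights, a common heuristic for keeping the rare positive class from being ignored during training. The label counts are invented for illustration, and the weighting scheme is one option among several (downsampling negatives is another); the slides do not say which remedy the lecture prefers.

```python
from collections import Counter


def candidate_pair_count(n_records):
    """Number of unordered record pairs when no blocking is applied."""
    return n_records * (n_records - 1) // 2


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical labelled pairs after blocking: 10 matches, 990 non-matches.
labels = [1] * 10 + [0] * 990
weights = balanced_class_weights(labels)
print(candidate_pair_count(1000), weights)
```

With these weights, each match counts about a hundred times as much as each non-match in a weighted loss, so both classes contribute equally in total.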

Creating a Training Set is a key issue

Avoid creating a dataset. Unsupervised / semi-supervised methods: EM, generative models. Active learning: ensemble methods, active learning to optimize for precision/recall, crowdsourcing.
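
A minimal sketch of one active-learning step, assuming uncertainty sampling: the current model scores every unlabeled pair, and the pairs whose predicted match probability is closest to 0.5 are sent to a human (or a crowd) for labeling. The scores, record IDs, and the `select_for_labeling` name are hypothetical; the lecture may use a different selection strategy.

```python
def select_for_labeling(scored_pairs, budget):
    """Pick the `budget` pairs whose match probability is closest to 0.5."""
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair for pair, _ in ranked[:budget]]


# Hypothetical (pair, predicted match probability) tuples from a current model.
scored_pairs = [
    (("r1", "r2"), 0.97),
    (("r1", "r3"), 0.52),
    (("r2", "r4"), 0.08),
    (("r3", "r4"), 0.41),
    (("r5", "r6"), 0.65),
]
to_label = select_for_labeling(scored_pairs, budget=2)
print(to_label)
```

The confident pairs (0.97 and 0.08) are skipped because a label for them would teach the model little; the labeling budget goes to the borderline cases.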

Summary. Many algorithms exist for independent classification of pairs of records as match/non-match: ML-based classification and Fellegi-Sunter. Pro: advanced state of the art. Con: building high-fidelity training sets is a hard problem. Active learning and crowdsourcing for ER are active areas of research (next lecture).