Interactive Deduplication using Active Learning
Sunita Sarawagi and Anuradha Bhamidipaty
Presented by Doug Downey

Active Learning for De-duplication
De-duplication systems try to learn a function f : D × D → {duplicate, non-duplicate}, where D is the data set.
– f is learned using a labeled training set L_p of record pairs.
– Normally, D is large, so many different sets L_p are possible, and choosing a representative and useful L_p is hard.
Instead of using a fixed set L_p, in Active Learning the learner interactively chooses pairs from D × D to be labeled and added to L_p.

The ALIAS De-duplicator
Input:
– A set D_p of pairs of data records, each represented as a feature vector (features might include edit distance, soundex, etc.).
– An initial set L_p of some elements of D_p labeled as duplicates or non-duplicates.
Set T = L_p.
Loop until the user is satisfied:
– Train classifier C using T.
– Use C to choose a set S of instances from D_p for labeling.
– Get labels for S from the user, and set T = T ∪ S.
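A minimal Python sketch of this loop, assuming feature vectors have already been computed for every candidate pair and a scikit-learn-style classifier is used; `ask_user_for_label` and the parameter names are hypothetical stand-ins for the interactive labeling step, not part of ALIAS itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ask_user_for_label(pair_index):
    # Placeholder for the interactive step: a person inspects the record pair
    # and answers duplicate (1) / non-duplicate (0).
    return int(input(f"Is pair {pair_index} a duplicate? (1/0): "))

def alias_loop(D_p, seed_X, seed_y, n_rounds=20, n_per_round=1):
    """D_p: feature vectors of all candidate pairs (the unlabeled pool).
    seed_X, seed_y: the small initial labeled set L_p (this becomes T)."""
    T_X, T_y = list(seed_X), list(seed_y)
    pool = list(range(len(D_p)))                        # indices of still-unlabeled pairs
    clf = None
    for _ in range(n_rounds):                           # "loop until user satisfaction"
        clf = DecisionTreeClassifier().fit(T_X, T_y)    # train classifier C on T
        probs = clf.predict_proba(np.asarray(D_p)[pool])[:, 1]   # P(duplicate) for the pool
        uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)   # closest to 0.5 = most uncertain
        chosen = np.argsort(-uncertainty)[:n_per_round] # the set S to label this round
        for pos in sorted(chosen, reverse=True):
            idx = pool.pop(pos)
            T_X.append(D_p[idx])
            T_y.append(ask_user_for_label(idx))         # T = T ∪ S
    return clf
```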

The ALIAS de-duplicator

Active Learning
How do we choose the set S of instances to label?
Idea: choose the most uncertain instances.
Suppose the +'s and –'s can be separated by a single point, and assume that the probability of + (or –) varies linearly between the nearest labeled examples r and b. Then the point m midway between r and b is:
– maximally uncertain,
– also the point that shrinks our "confusion region" the most.
So choose m!
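A tiny numeric illustration of this intuition (the positions and labels below are made up for the example): with linear interpolation of P(+) between the innermost labeled – and +, the candidate closest to their midpoint is the most uncertain one.

```python
import numpy as np

# Hypothetical 1-D example: positions of labeled examples on a line.
neg_positions = np.array([0.0, 1.0, 2.0])   # labeled '-'
pos_positions = np.array([8.0, 9.0])        # labeled '+'

r, b = neg_positions.max(), pos_positions.min()   # innermost '-' and '+'
candidates = np.linspace(0.0, 9.0, 19)            # unlabeled candidate points

# Linear probability of '+' between r and b (0 left of r, 1 right of b).
p_plus = np.clip((candidates - r) / (b - r), 0.0, 1.0)
uncertainty = 1.0 - 2.0 * np.abs(p_plus - 0.5)    # peaks where P(+) = 0.5

m = candidates[np.argmax(uncertainty)]
print(f"most uncertain candidate: {m:.2f}, midpoint of r and b: {(r + b) / 2:.2f}")
```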

Measuring Uncertainty with Committees
Train a committee of several slightly different versions of a classifier.
Uncertainty(x) = entropy_committee(x), i.e. the disagreement among the committee members' predictions on x.
Form committees by:
– randomizing model parameters
– partitioning the training data
– partitioning the attributes
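A minimal sketch of the committee-entropy idea, here building the committee by resampling the training data (one of the three options above); the classifier, committee size, and helper names are illustrative, not the paper's exact settings.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def build_committee(X, y, size=5, seed=0):
    """Committee of slightly different classifiers via bootstrap samples of the data."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    members = []
    for _ in range(size):
        idx = rng.choice(len(X), size=len(X), replace=True)   # resampled training data
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return members

def committee_uncertainty(committee, x):
    """Entropy of the committee's votes on a single instance x."""
    votes = [clf.predict([x])[0] for clf in committee]
    counts = np.array(list(Counter(votes).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())   # 0 = full agreement, 1 = 50/50 split (binary case)
```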

Methods for Forming Committees

Committee Size

Representativeness of an Instance
We need informative instances, not just uncertain ones.
Solution: sample n of the kn most uncertain instances, weighted by uncertainty.
– k = 1 ⇒ no sampling (just take the n most uncertain)
– kn = all data ⇒ full sampling
Why not use information gain?
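A sketch of this weighted-sampling step, assuming uncertainty scores have already been computed for every unlabeled pair; the function name and default k, n values are illustrative.

```python
import numpy as np

def sample_representative(uncertainty, n=1, k=5, seed=0):
    """Pick n instances from the k*n most uncertain ones,
    with selection probability proportional to uncertainty."""
    rng = np.random.default_rng(seed)
    uncertainty = np.asarray(uncertainty, dtype=float)
    top = np.argsort(-uncertainty)[: k * n]          # the k*n most uncertain instances
    weights = uncertainty[top]
    if weights.sum() == 0:                           # all candidates equally certain
        return rng.choice(top, size=n, replace=False)
    return rng.choice(top, size=n, replace=False, p=weights / weights.sum())
```

With k = 1 this reduces to picking the n most uncertain pairs outright; with kn equal to the pool size it samples from the whole pool.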

Sampling for Representativeness

Evaluation – Different Classifiers
Decision Trees & Naïve Bayes:
– committees of 5, formed via parameter randomization
SVMs:
– uncertainty measured by distance from the separating hyperplane (closer = more uncertain)
Start with one duplicate and one non-duplicate pair; add one new training example each round (n = 1), with partial sampling (k = 5).
Similarity functions: 3-gram match, % overlapping words, approximate edit distance, special handling of numbers/nulls.
Data sets:
– Bibliography: citation pairs from Citeseer, 0.5% duplicates.
– Address: record pairs, 0.25% duplicates.
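A sketch of what such pairwise similarity features might look like in Python; the exact feature definitions in ALIAS differ, so treat these as illustrative stand-ins (difflib's ratio is used here in place of a true approximate edit distance).

```python
import difflib

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def pair_features(a, b):
    """Turn a pair of strings into a feature vector for the pair classifier."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    trigram_overlap = len(grams_a & grams_b) / max(len(grams_a | grams_b), 1)
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    word_overlap = len(words_a & words_b) / max(len(words_a | words_b), 1)
    # difflib's ratio as a cheap stand-in for an approximate edit-distance similarity
    edit_sim = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return [trigram_overlap, word_overlap, edit_sim]

# Example: two citation strings that likely refer to the same paper.
print(pair_features(
    "S. Sarawagi, A. Bhamidipaty. Interactive deduplication using active learning.",
    "Sarawagi S and Bhamidipaty A, Interactive Deduplication using Active Learning"))
```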

Evaluation – different classifiers

Value of Active Learning

Example Decision Tree

Conclusions
Active Learning improves performance over random selection:
– it uses two orders of magnitude less training data,
– and the gain is not due just to the change in the +/- mix of the training set.
In these experiments, Decision Trees outperformed SVMs and Naïve Bayes.