1
Interactive Deduplication using Active Learning
Sunita Sarawagi and Anuradha Bhamidipaty
Presented by Doug Downey
2
Active Learning for de-duplication
De-duplication systems try to learn a function f : D × D → {duplicate, non-duplicate}, where D is the data set.
–f is learned using a labeled training data set Lp of pairs.
–Normally, D is large, so many sets Lp are possible. Choosing a representative & useful Lp is hard.
Instead of a fixed set Lp, in Active Learning the learner interactively chooses pairs from D × D to be labeled and added to Lp.
3
The ALIAS de-duplicator
Input
–Set Dp of pairs of data records represented as feature vectors (features might include edit distance, soundex, etc.).
–Initial set Lp of some elements of Dp labeled as duplicates or non-duplicates.
Set T = Lp
Loop until user satisfaction:
–Train classifier C using T.
–Use C to choose a set S of instances from Dp for labeling.
–Get labels for S from the user, and set T = T ∪ S.
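A minimal sketch of this loop, assuming scikit-learn is available. The decision-tree classifier, the fixed number of rounds, and the select_uncertain / get_label helpers are illustrative assumptions, not the ALIAS implementation itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def alias_loop(D_p, L_X, L_y, select_uncertain, get_label, rounds=20):
    """Train on T, pick uncertain pairs S from D_p, have the user label them, repeat."""
    T_X, T_y = list(L_X), list(L_y)          # T starts as the initial labeled set Lp
    clf = None
    for _ in range(rounds):                  # stands in for "until user satisfaction"
        clf = DecisionTreeClassifier().fit(np.array(T_X), np.array(T_y))
        S = select_uncertain(clf, D_p)       # indices of pairs to label this round
        for i in S:
            T_X.append(D_p[i])
            T_y.append(get_label(D_p[i]))    # 1 = duplicate, 0 = non-duplicate
    return clf
```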
4
The ALIAS de-duplicator
5
Active Learning
How do we choose the set S of instances to label? Idea: choose the most uncertain instances.
We’re given that +’s and –’s can be separated by some point, and assume that the probability of – or + varies linearly between labeled examples r and b. The point m is
–maximally uncertain,
–also the point that reduces our “confusion region” the most.
–So choose m!
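A toy illustration of this argument, assuming a single labeled negative at r and a single labeled positive at b on a one-dimensional similarity axis, with P(+) rising linearly from 0 at r to 1 at b; the entropy (uncertainty) then peaks at the midpoint m.

```python
import numpy as np

def interval_uncertainty(r, b, num=101):
    """Entropy of P(+) at points between r and b; the argmax is the midpoint m."""
    x = np.linspace(r, b, num)
    p = np.clip((x - r) / (b - r), 1e-12, 1 - 1e-12)   # linear interpolation of P(+)
    h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return x, h

x, h = interval_uncertainty(0.0, 1.0)
m = x[np.argmax(h)]   # ~0.5, the midpoint between r and b
```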
6
Measuring Uncertainty with Committees
Train a committee of several slightly different versions of a classifier.
Uncertainty(x) = entropy of the committee’s predictions on x.
Form committees by
–Randomizing model parameters
–Partitioning training data
–Partitioning attributes
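A sketch of the entropy-of-votes measure, assuming a list of already-trained binary classifiers (labels 0/1) with a scikit-learn style predict(); the committee construction itself (parameter randomization, data or attribute partitioning) is not shown.

```python
import numpy as np

def committee_uncertainty(committee, X):
    """Entropy of the committee's vote split for each instance in X."""
    votes = np.stack([clf.predict(X) for clf in committee])  # shape (members, instances)
    p = np.clip(votes.mean(axis=0), 1e-12, 1 - 1e-12)        # fraction voting "duplicate"
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))      # 1 bit = maximal disagreement
```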
7
Methods for Forming Committees
8
Committee Size
9
Representativeness of an Instance
We need informative instances, not just uncertain ones.
Solution: sample n of the kn most uncertain instances, weighted by uncertainty (sketch below).
–k = 1 → no sampling
–kn = all data → full sampling
Why not use information gain?
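A sketch of the weighted-sampling step, assuming per-instance uncertainty scores such as the committee entropy above; the function name and the rng argument are illustrative, while the parameters n and k follow this slide.

```python
import numpy as np

def sample_representative(uncertainty, n, k, rng=None):
    """Sample n indices from the k*n most uncertain, with probability proportional to uncertainty."""
    rng = rng or np.random.default_rng()
    uncertainty = np.asarray(uncertainty, dtype=float)
    top = np.argsort(uncertainty)[::-1][:k * n]   # the k*n most uncertain instances
    w = uncertainty[top]
    # With k = 1 this returns all of top, i.e. no sampling, as on the slide.
    return rng.choice(top, size=n, replace=False, p=w / w.sum())
```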
10
Sampling for Representativeness
11
Evaluation – Different Classifiers
Decision Trees & Naïve Bayes:
–Committees of 5 via parameter randomization
SVMs:
–Uncertainty = distance from the separator
Start with one duplicate and one non-duplicate, add a new training example each round (n = 1), partial sampling (k = 5).
Similarity functions: 3-gram match, % overlapping words, approximate edit distance, special handling of numbers/nulls (see the sketch below).
Data sets:
–Bibliography: 32,131 citation pairs from Citeseer, 0.5% duplicates.
–Address: 44,850 pairs, 0.25% duplicates.
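The listed similarity functions can be approximated with simple string features. A hypothetical sketch follows: ngrams, edit_distance, and pair_features are illustrative names, and the exact functions and number/null handling used in the paper are not reproduced here.

```python
def ngrams(s, n=3):
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pair_features(rec_a, rec_b):
    """Feature vector for one pair of string records."""
    g_a, g_b = ngrams(rec_a), ngrams(rec_b)
    w_a, w_b = set(rec_a.split()), set(rec_b.split())
    return [
        len(g_a & g_b) / max(len(g_a | g_b), 1),                        # 3-gram match
        len(w_a & w_b) / max(len(w_a | w_b), 1),                        # % overlapping words
        edit_distance(rec_a, rec_b) / max(len(rec_a), len(rec_b), 1),   # normalized edit distance
    ]
```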
12
Evaluation – Different Classifiers
14
Value of Active Learning
16
Example Decision Tree
17
Conclusions
Active Learning improves performance over random selection.
–Uses two orders of magnitude less training data.
–Note: not due just to the change in the +/– mix.
In these experiments, Decision Trees outperformed SVMs and Naïve Bayes.