Re-active Learning: Active Learning with Re-labeling
Christopher H. Lin (University of Washington), Mausam (IIT Delhi), Daniel S. Weld (University of Washington)
I'm going to talk to you today about a problem that we're calling re-active learning: a generalization of active learning to the case of noisy labels that allows for the relabeling of examples. So why is generalizing active learning to allow relabeling important?
Typically, active learning assumes that labels come from a single oracle. (*Speaker not paid by Oracle Corporation.)
CROWDSOURCING
But these days, everybody is using crowdsourcing to label training data for their learning algorithms, so we can no longer assume that labels come from a single annotator.
Human (Labeling) Mistakes Were Made
Why? Because crowd workers make mistakes: they will sometimes label training data incorrectly.
Parrot / Parrot / Parakeet → Majority Vote → Parrot
So when people crowdsource their training data, because humans make mistakes, instead of getting one label per example, they ask multiple crowd workers to label each example and aggregate the answers, for instance by majority vote.
Relabel? vs. New label?
So we must generalize active learning to allow relabeling, and this introduces a tradeoff that didn't exist before: should we relabel an existing training example, or gather a label for a new one? In other words, should we denoise the existing training set, or expand it with new examples? This is re-active learning.
MORE NOISY DATA vs. LESS, BETTER DATA
That is, how do we best balance between more, noisier data and less, better data?
[Sheng et al. 2008, Lin et al. 2014]
I do want to note that this paper is not the first to notice this tradeoff: we, as well as others, have considered it before, but in a static learning setting. The biggest difference in our current work is that instead of considering the tradeoff statically, we dynamically decide, as we are training, which examples are best to label next.
Re-active Learning Contributions
Standard Active Learning Algorithms Fail: Uncertainty Sampling [Lewis and Catlett 1994], Expected Error Reduction [Roy and McCallum 2001]
Re-active Learning Algorithms: Extensions of Uncertainty Sampling, Impact Sampling
Hopefully I've convinced you that re-active learning is an important problem. Now I'll tell you about our contributions. First, I'll show why uncertainty sampling and expected error reduction fail for re-active learning. Then I'll present our new algorithms, including impact sampling, which is, surprisingly, a generalization of uncertainty sampling.
Standard active learning algorithms fail!
h* True Hypothesis
Suppose we're trying to learn a hypothesis that separates green diamonds from yellow circles; h* is the true hypothesis.
h h* Current Hypothesis
h is our current hypothesis, which hasn't yet converged to h*.
Uncertainty Sampling [Lewis and Catlett (1994)]
Consider uncertainty sampling extended to re-active learning: instead of only allowing it to pick unlabeled examples, we also allow it to relabel. The problem is that it doesn't leverage both sources of information about labels, classifier uncertainty and label uncertainty, and consequently gets trapped.
Suppose labeled many times already!
For illustrative purposes, suppose the two examples closest to the boundary have been labeled many times already, so their aggregate labels have converged to the correct labels.
Infinitely many times!
But because these two points remain the ones the classifier is least certain about, uncertainty sampling keeps relabeling them, infinitely many times.
Fundamental Problem: Does not use all sources of information
So what is the fundamental problem here? If 100 workers have already told you an example's label, another annotation adds almost nothing, yet uncertainty sampling keeps asking. It ignores the aggregate label uncertainty.
Re-active Learning Contributions
Standard Active Learning Algorithms Fail: Uncertainty Sampling [Lewis and Catlett 1994], Expected Error Reduction [Roy and McCallum 2001]
Re-active Learning Algorithms: Extensions of Uncertainty Sampling, Impact Sampling
Next: expected error reduction.
Expected Error Reduction (EER) [Roy and McCallum (2001)]
I've just shown how uncertainty sampling, a standard active learning algorithm, fails. But it's not just uncertainty sampling; other algorithms suffer from similar problems. Another common algorithm, expected error reduction, also suffers from infinite looping.
Re-active Learning Contributions
Standard Active Learning Algorithms Fail: Uncertainty Sampling [Lewis and Catlett 1994], Expected Error Reduction [Roy and McCallum 2001]
Re-active Learning Algorithms: Extensions of Uncertainty Sampling, Impact Sampling
Next: our re-active learning algorithms, starting with extensions of uncertainty sampling.
How to fix? Consider the aggregate label uncertainty!
I've just told you about our first contribution: understanding why standard algorithms fail. Now, our first algorithmic contribution: also consider the aggregate label uncertainty of each example.
High # annotations = LOW label uncertainty: an example with many annotations has low aggregate label uncertainty.
Low # annotations = HIGH label uncertainty: an example with few annotations has high aggregate label uncertainty.
Alpha-weighted uncertainty sampling
Combine classifier uncertainty and aggregate label uncertainty in a convex combination, weighted by α and (1−α). We call this alpha-weighted uncertainty sampling.
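As a concrete sketch, the alpha-weighted score might look like this. The function names, the entropy-based uncertainty measure, and the example probabilities are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def entropy(p):
    """Entropy of a Bernoulli(p) distribution, in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def alpha_weighted_score(p_classifier, p_label, alpha):
    """Convex combination of classifier uncertainty and aggregate label
    uncertainty, weighted by alpha and (1 - alpha)."""
    return alpha * entropy(p_classifier) + (1 - alpha) * entropy(p_label)

# An example the classifier is unsure about but that already has many
# agreeing annotations (p_label near 1) scores lower than a fresh,
# contested example (p_label near 0.5).
settled = alpha_weighted_score(0.5, 0.99, alpha=0.5)
contested = alpha_weighted_score(0.5, 0.6, alpha=0.5)
```

Unlike plain uncertainty sampling, a well-annotated point near the boundary no longer dominates the ranking, because its label uncertainty term is small.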
Fixed-Relabeling Uncertainty Sampling
Pick a new unlabeled example using classifier uncertainty, then get a fixed number of labels for that example. The weakness of both methods is that they require choosing a parameter. So next, I'll tell you about a new algorithm for re-active learning that elegantly avoids these problems.
Re-active Learning Contributions
Standard Active Learning Algorithms Fail: Uncertainty Sampling [Lewis and Catlett 1994], Expected Error Reduction [Roy and McCallum 2001]
Re-active Learning Algorithms: Extensions of Uncertainty Sampling, Impact Sampling
Next: impact sampling.
Impact (Ψ) Sampling
Both uncertainty sampling and EER can starve examples, so we came up with a new algorithm we call impact sampling. Impact sampling estimates how much impact an additional label would have on the predictions of the resulting classifier. A point that has already been labeled many times is unlikely to change the classifier, so it gets low impact. We use Ψ to denote impact.
h Current Hypothesis
Suppose again that we're trying to learn a 1-d threshold that separates diamonds from circles.
Labeled / Labeled / h
Impact sampling works by computing the impact a new label would have on the classifier. Suppose you've currently labeled the two points at the ends.
What is the impact of labeling this example?
Let's compute the impact of this example. To do so, we have to consider the impact of each of its two possible labels.
Impact of labeling this example a diamond
If the example is labeled a diamond, the threshold moves and some pool points change prediction; here, five examples flip, so the impact Ψ_diamond(x) = 5.
Impact of labeling this example a circle
Similarly, we compute Ψ_circle(x), the number of predictions that flip if the example is labeled a circle.
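To make the counting concrete, here is a minimal sketch of the per-label impact computation for this 1-d threshold setting. The midpoint learner and function names are my assumptions for illustration, not the paper's implementation:

```python
def fit_threshold(diamonds, circles):
    """Toy 1-d learner: the threshold is the midpoint between the
    rightmost diamond and the leftmost circle (diamonds lie left,
    circles lie right)."""
    return (max(diamonds) + min(circles)) / 2.0

def impact_of_label(x, label, diamonds, circles, pool):
    """Psi_label(x): the number of pool points whose prediction flips
    if x receives `label`."""
    before = fit_threshold(diamonds, circles)
    d, c = list(diamonds), list(circles)
    (d if label == "diamond" else c).append(x)
    after = fit_threshold(d, c)
    lo, hi = sorted((before, after))
    # A point's prediction changes iff it lies strictly between the
    # old and new thresholds.
    return sum(1 for p in pool if lo < p < hi)

# A diamond at 0 and a circle at 10 put the threshold at 5.
# Candidate point x = 2; the pool holds the unlabeled points 1..9.
pool = list(range(1, 10))
psi_circle = impact_of_label(2, "circle", [0], [10], pool)    # threshold 5 -> 1
psi_diamond = impact_of_label(2, "diamond", [0], [10], pool)  # threshold 5 -> 6
```

Labeling x = 2 a circle drags the threshold far to the left and flips several predictions, while labeling it a diamond barely moves the threshold, which is exactly the asymmetry the two per-label impacts capture.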
Total Expected Impact
Ψ(x) = P(x = diamond) · Ψ_diamond(x) + P(x = circle) · Ψ_circle(x)
Now we can compute the total expected impact of labeling x by weighting the impact of each possible label by its probability.
Ψ(x) = P(x = diamond) · Ψ_diamond(x) + P(x = circle) · Ψ_circle(x)
Where do these probabilities come from? Use the classifier's belief as the prior, then do a Bayesian update using the annotations collected so far.
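The Bayesian update and the expected-impact formula above can be sketched as follows, assuming independent annotators who are each correct with probability `accuracy` (the function and variable names are mine; the 0.75 default mirrors the label accuracy used later in the experiments):

```python
def label_posterior(prior_diamond, n_diamond, n_circle, accuracy=0.75):
    """P(true label = diamond | annotations): the classifier's belief
    `prior_diamond` serves as the prior, and each annotator is assumed
    independently correct with probability `accuracy`."""
    a = accuracy
    like_diamond = a ** n_diamond * (1 - a) ** n_circle  # P(votes | diamond)
    like_circle = (1 - a) ** n_diamond * a ** n_circle   # P(votes | circle)
    num = prior_diamond * like_diamond
    return num / (num + (1 - prior_diamond) * like_circle)

def expected_impact(prior_diamond, n_diamond, n_circle, psi_d, psi_c):
    """Psi(x) = P(diamond) * Psi_diamond(x) + P(circle) * Psi_circle(x)."""
    p = label_posterior(prior_diamond, n_diamond, n_circle)
    return p * psi_d + (1 - p) * psi_c
```

Note that as agreeing annotations accumulate, the posterior concentrates on one label, so the expected impact is dominated by a label whose acquisition would no longer move the classifier.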
Assuming annotation accuracy > 0.5: as the number of annotations of x goes to infinity, Ψ(x) goes to 0. In other words, if an example already has many labels, an additional label is highly unlikely to change the classifier.
Theorem: In many noiseless settings, when relabeling is unnecessary, impact sampling = uncertainty sampling.
Ostensibly, uncertainty sampling and impact sampling optimize two completely different objectives, yet when relabeling is unnecessary they coincide. When relabeling is necessary, impact sampling thus acts as a generalization of uncertainty sampling.
Consider an example with the following labels, aggregated via majority vote.
Now, you may have noticed that impact sampling is myopic: it doesn't consider the effect of acquiring multiple labels.
Before vs. after adding an additional label: NO CHANGE
A single extra label cannot flip the majority vote, so the myopic impact of this example is zero, even though several more labels could flip it.
Pseudolookahead
To allow impact sampling to do some lookahead, we came up with the method of pseudolookahead. Let r be the minimum number of labels needed to flip the aggregate label. In this example, r = 3.
Pseudolookahead Ψ (x) = Ψ (x) / r Redefine r
Ψ (x) = Ψ (x) / r Pseudolookahead Redefine Careful Optimism! r So we are essentially taking the future impact from labeling an example multiple times, and normalizing it by how long it will take. It’s like careful optimism. Careful Optimism!
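A minimal sketch of the pseudolookahead normalization, assuming majority-vote aggregation where the minority label must strictly outnumber the majority to flip (the tie-handling is my assumption; the paper may count ties differently):

```python
def min_labels_to_flip(n_majority, n_minority):
    """r: the fewest additional labels for the minority label so that
    it strictly outnumbers the current majority."""
    return n_majority - n_minority + 1

def pseudolookahead_impact(psi_flip, n_majority, n_minority):
    """Psi'(x) = Psi(x) / r: the impact of flipping x's aggregate
    label, normalized by how many labels it would take to get there."""
    r = min_labels_to_flip(n_majority, n_minority)
    return psi_flip / r

# E.g., with 3 majority labels vs. 1 minority label, r = 3, so the
# flip impact is spread over the 3 labels needed to realize it.
```

Dividing by r is the "careful optimism" from the slide: a big potential flip still scores well, but only in proportion to how cheaply it can be achieved.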
Experimental setup: budget = 1000 labels, label accuracy = 75%, datasets with 10, 30, 50, 70, and 90 features.
[Plot] Gaussian dataset (90 features): learning curves for EER, impact sampling, alpha-weighted uncertainty sampling, fixed-relabeling uncertainty sampling, uncertainty sampling, and passive learning. We begin with the passive baseline.
[Plot] Arrhythmia dataset (279 features): impact sampling vs. uncertainty sampling vs. passive learning.
[Plot] Relation Extraction dataset (1013 features): impact sampling vs. uncertainty sampling vs. passive learning.
Re-active Learning Contributions
Standard Active Learning Algorithms Fail: Uncertainty Sampling [Lewis and Catlett 1994], Expected Error Reduction [Roy and McCallum 2001]
Re-active Learning Algorithms: Extensions of Uncertainty Sampling, Impact Sampling
To summarize: standard active learning algorithms like uncertainty sampling and expected error reduction fail for re-active learning, and we presented new algorithms for it, including extensions of uncertainty sampling and impact sampling.