A general agnostic active learning algorithm

Presentation transcript:

A general agnostic active learning algorithm Claire Monteleoni UC San Diego Joint work with Sanjoy Dasgupta and Daniel Hsu, UCSD.

Active learning. In many machine learning applications (image classification, object recognition, document/webpage classification, speech recognition, spam filtering), unlabeled data is abundant but labels are expensive. Active learning is a useful model here: it allows for intelligent choices of which examples to label. Label complexity, the number of labeled examples required to learn via active learning, can be much lower than the sample complexity!

When is a label needed? Is a label query needed? [Figure: example points in two settings: the linearly separable case, and the agnostic case, where there may not be a perfect linear separator. In either case the answers for the three example points are NO, YES, NO: only the point near the decision boundary needs a label query.]
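The query test sketched on this slide can be made concrete: ask for a label exactly when hypotheses consistent with the data so far disagree at the point. A minimal sketch, not from the slides (the threshold grid and helper names are illustrative):

```python
def needs_label(x, labeled, hypotheses):
    """Query iff two hypotheses consistent with all labeled data so far
    disagree on the label of x, i.e., x lies in the region of uncertainty."""
    consistent = [h for h in hypotheses
                  if all(h(xi) == yi for xi, yi in labeled)]
    predictions = {h(x) for h in consistent}
    return len(predictions) > 1

# Illustrative hypothesis class: 1-D thresholds h_t(x) = sign(x - t) on a grid.
thresholds = [(lambda x, t=i / 100: 1 if x >= t else -1) for i in range(101)]
labeled = [(0.2, -1), (0.8, +1)]
```

With this data, `needs_label(0.5, labeled, thresholds)` is True (consistent thresholds disagree there), while points outside the interval of consistent thresholds, such as 0.1 or 0.9, need no query.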

Approach and contributions. Start with one of the earliest and simplest active learning schemes: selective sampling. Extend it to the agnostic setting, and generalize it, via a reduction to supervised learning, making the algorithm as efficient as the supervised version. Provide a fallback guarantee: a label complexity bound no worse than the sample complexity of the supervised problem. Show significant reductions in label complexity (vs. sample complexity) for many families of hypothesis classes. The techniques also yield an interesting, non-intuitive result: they bypass the classic active learning sampling problem.

PAC-like selective sampling framework. Framework due to [Cohn, Atlas & Ladner '94]. Distribution D over X × Y, with X some input space and Y = {±1}. PAC-like case: no prior on hypotheses is assumed (non-Bayesian). Given: a stream (or pool) of unlabeled examples x ∈ X, drawn i.i.d. from the marginal DX over X. The learner may request labels on examples in the stream/pool, with oracle access to labels y ∈ {±1} from the conditional at x, DY|x, at constant cost per label. The error rate of any classifier h is measured on the distribution D: err(h) = P(x,y)~D[h(x) ≠ y]. Goal: minimize the number of labels needed to learn the concept (with high probability) to a fixed final error rate ε on the input distribution.

Selective sampling algorithm. Region of uncertainty [CAL '94]: the subset of the data space on which there exist hypotheses in H, consistent with all previous data, that disagree. Example: hypothesis class H = {linear separators}, under the separability assumption. Algorithm: selective sampling [Cohn, Atlas & Ladner '94] (orig. NIPS 1989): for each point in the stream, if the point falls in the region of uncertainty, request its label. It is easy to represent the region of uncertainty for certain separable problems. BUT, in this work we address: What about the agnostic case? What about general hypothesis classes? The answer: a reduction to supervised learning.
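For 1-D thresholds, the region of uncertainty has a closed form, so CAL-style selective sampling fits in a few lines. A hypothetical sketch for the separable case (the stream, oracle, and function names are ours, not the slides'):

```python
import random

def cal_thresholds(stream, oracle):
    """CAL selective sampling for thresholds h_t(x) = sign(x - t), separable case.
    The version space is t in (max_neg, min_pos]; a point is uncertain
    (consistent hypotheses disagree on it) exactly when max_neg < x < min_pos."""
    max_neg, min_pos = float("-inf"), float("inf")
    queries = 0
    for x in stream:
        if max_neg < x < min_pos:        # region of uncertainty: request the label
            queries += 1
            if oracle(x) == +1:
                min_pos = x              # tightens the version space from above
            else:
                max_neg = x              # tightens it from below
        # otherwise every consistent hypothesis agrees on x: no query needed
    return (max_neg + min_pos) / 2, queries

random.seed(0)
stream = [random.random() for _ in range(500)]
estimate, queries = cal_thresholds(stream, lambda x: 1 if x >= 0.3 else -1)
```

On 500 uniform points the estimate lands close to the true threshold 0.3 after only a handful of queries (roughly logarithmic in the stream length), versus 500 labels for passive supervised learning.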

Agnostic active learning. What if the problem is not realizable (separable by some h ∈ H)? Agnostic case: the goal is to learn with error at most ν + ε, where ν is the best error rate (on D) of any hypothesis in H. Lower bound: Ω((ν/ε)²) labels [Kääriäinen '06]. [Balcan, Beygelzimer & Langford '06] prove general fallback guarantees, and label complexity bounds for some hypothesis classes and distributions, but for a computationally prohibitive scheme. Agnostic active learning via reduction: we extend selective sampling, which simply queries for labels on points that are uncertain, to the agnostic case by re-defining uncertainty via a reduction to supervised learning.

Algorithm:
Initialize empty sets S, T.
For each n ∈ {1, …, m}:
  Receive x ~ DX.
  For each ŷ ∈ {±1}, let h_ŷ = LearnH(S ∪ {(x, ŷ)}, T).
  If, for some ŷ ∈ {±1}, h_{−ŷ} does not exist, or err(h_{−ŷ}, S ∪ T) − err(h_ŷ, S ∪ T) > Δn:
    S ← S ∪ {(x, ŷ)}  %% S's labels are guessed
  Else request y from the oracle, and T ← T ∪ {(x, y)}.  %% T's labels are queried
Return hf = LearnH(S, T).
Subroutine: supervised learning (with constraints). On inputs A, B ⊆ X × {±1}: LearnH(A, B) returns h ∈ H consistent with A and with minimum error on B (or nothing if no consistent h exists); err(h, A) returns the empirical error of h ∈ H on A.
We divide the points seen so far into two labeled sets, S and T: T = {points that were uncertain at the time, so we queried the oracle}; S = {points for which we were certain enough to guess the label}. Reduction to supervised learning, to decide whether to query for a label: hallucinate each label ŷ ∈ {±1}: h+1 = the h ∈ H consistent with S ∪ {(x,+1)} with minimum error on T; h−1 = the h ∈ H consistent with S ∪ {(x,−1)} with minimum error on T. If only one exists, add x with that label to S. Otherwise, compare the errors of h+1 and h−1 and request a label only if the difference is small: if |err(h+1) − err(h−1)| ≤ Δn, query the oracle for y and add (x, y) to T. Otherwise (there is a large error difference), set ŷ to the label that yields the lower error, and add (x, ŷ) to S.
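The querying rule reduces to two constrained ERM calls per stream point. A toy instantiation for 1-D thresholds, assuming a noiseless oracle and a fixed Δn (the grid, the constant threshold 0.05, and the helper names are our own simplifications, not the paper's exact setting):

```python
import random

def predict(t, x):                      # threshold hypothesis h_t(x) = sign(x - t)
    return 1 if x >= t else -1

def err(t, pts):                        # empirical error of h_t on labeled points
    return sum(1 for x, y in pts if predict(t, x) != y) / max(len(pts), 1)

def learn(A, B, grid):
    """LearnH(A, B): a grid threshold consistent with A that minimizes
    empirical error on B, or None if no consistent threshold exists."""
    feasible = [t for t in grid if all(predict(t, x) == y for x, y in A)]
    return min(feasible, key=lambda t: err(t, B)) if feasible else None

def dhm_step(x, S, T, delta, grid, oracle):
    """One step of the reduction: hallucinate both labels for x."""
    h_pos = learn(S + [(x, +1)], T, grid)
    h_neg = learn(S + [(x, -1)], T, grid)
    if h_neg is None:                   # only +1 is consistent with S: guess it
        S.append((x, +1)); return
    if h_pos is None:
        S.append((x, -1)); return
    gap = err(h_neg, S + T) - err(h_pos, S + T)
    if gap > delta:                     # large error difference: guess the label
        S.append((x, +1))
    elif -gap > delta:
        S.append((x, -1))
    else:                               # genuinely uncertain: pay for a real label
        T.append((x, oracle(x)))

random.seed(0)
grid = [i / 100 for i in range(101)]
S, T = [], []
for x in [random.random() for _ in range(200)]:
    dhm_step(x, S, T, 0.05, grid, oracle=lambda x: 1 if x >= 0.5 else -1)
h_final = learn(S, T, grid)
```

With noiseless labels, points far from the boundary end up guessed into S, only points near the true threshold 0.5 trigger oracle queries, and the final hypothesis lands near the truth with far fewer than 200 labels.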

Bounds on label complexity. Theorem (fallback guarantee): with high probability, the algorithm returns a hypothesis in H with error at most ν + ε, after requesting at most Õ((d/ε)(1 + ν/ε)) labels. Asymptotically, this is the usual PAC sample complexity of supervised learning. Tighter label complexity bounds hold for hypothesis classes with constant disagreement coefficient θ (a label complexity measure [Hanneke '07]). Theorem (label complexity): with high probability, the algorithm returns a hypothesis with error at most ν + ε, after requesting at most Õ(θ d (log²(1/ε) + (ν/ε)²)) labels. If ν ≈ ε, this is Õ(θ d log²(1/ε)). It nearly matches the lower bound of Ω((ν/ε)²), with matching dependence on ν and ε, and has a better ε dependence than known results, e.g. [BBL '06]. E.g., for linear separators (uniform distribution): θ ∝ d^{1/2}, so Õ(d^{3/2} log²(1/ε)) labels.
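To get a feel for the gap between the two theorems, here is a back-of-the-envelope comparison, with constants and lower-order log factors dropped; the numbers are purely illustrative and not from the paper:

```python
import math

def passive_labels(d, eps):
    """Supervised (passive) sample complexity, on the order of d / eps."""
    return d / eps

def active_labels(d, eps, theta):
    """Active label complexity when nu is about eps: theta * d * log^2(1/eps)."""
    return theta * d * math.log(1 / eps) ** 2

d, eps = 20, 2 ** -10                   # eps about 0.001
theta = math.sqrt(d)                    # e.g. linear separators, uniform distribution
print(passive_labels(d, eps))           # 20480.0
print(round(active_labels(d, eps, theta)))
```

The active count grows only polylogarithmically in 1/ε, so for small target error it is orders of magnitude below the passive d/ε count.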

Setting the active learning threshold. We need to instantiate Δn: the threshold on how small the error difference between h+1 and h−1 must be in order for us to query a label. Remember: we query a label if |err(h+1, Sn ∪ Tn) − err(h−1, Sn ∪ Tn)| ≤ Δn. To be used within the algorithm, Δn must depend only on observable quantities; e.g., we do not observe the true (oracle) labels for x ∈ S. To compare hypotheses' error rates, the threshold Δn should relate empirical error to true error, e.g. via (i.i.d.) generalization bounds. However, Sn ∪ Tn (though observable) is not an i.i.d. sample: Sn has made-up labels, and Tn was filtered by active learning, so it is not i.i.d. from D. This is the classic active learning sampling problem.

Avoiding the classic AL sampling problem. S defines a realizable problem on a subset of the points: h* ∈ H is consistent with all points in S (lemma). Perform the error comparison (on S ∪ T) only over hypotheses consistent with S. Error differences can then occur only in U: the subset of X on which there exist hypotheses consistent with S that disagree. There is no need to compute U, and T ∩ U is i.i.d. (from DU: we requested every label from the i.i.d. stream falling in U). [Figure: halfspaces example with regions S+ and S−; T is all points without thick borders, all labeled so far; U is any point in X that can be labeled differently by two hypotheses consistent with S (not drawn, and never needs to be computed).] To conceptualize, consider halfspaces as the hypothesis class. The algorithm induces a realizable problem on a subset of the data; use that problem to limit the set of hypotheses over which to perform the error comparison. The subset of the data contributing to differences in error among hypotheses in this set is actually an i.i.d. sample, so we can bound error differences using i.i.d. generalization bounds. T ∩ U is the set of points whose labels are not unequivocally implied by consistency with S: it is an i.i.d. sample from the distribution confined to the region of the data space in which hypotheses consistent with S can disagree (i.e., the region that contributes to error differences for hypotheses consistent with S). It is i.i.d. because of how the algorithm works: it labels every "uncertain" point in the i.i.d. stream, and everything in T was uncertain at the time.

Experiments. Hypothesis classes in R¹. Thresholds: h*(x) = sign(x − 0.5). Intervals: h*(x) = 1(x ∈ [low, high]), with p+ = P_{x~DX}[h*(x) = +1]. [Plots: number of label queries versus points received in the stream, across panels with noise rate η ∈ {0, 0.1, 0.2} and p+ ∈ {0.1, 0.2}. Red: supervised learning. Blue: random misclassification. Green: Tsybakov boundary noise model.]

Experiments. Interval in R¹: h*(x) = 1(x ∈ [0.4, 0.6]), uniform input distribution, random misclassification rate 0.1, m = 10,000. Interval in R² (axis-parallel boxes): h*(x) = 1(x ∈ [0.15, 0.85]²), uniform input distribution, random misclassification rate 0.01, m = 1000. [Plots: temporal breakdown of label request locations. Intervals: queries 1–200, 201–400, 401–509, and label queries 1–400. Boxes: all label queries (1–2141).]

Conclusions and future work. The first positive result in active learning that applies to general concepts and distributions and need not be computationally prohibitive. The first positive answer to the open problem [Monteleoni '06] on efficient active learning under arbitrary distributions (for concepts with efficient supervised learning algorithms minimizing absolute loss, i.e. ERM). A surprising result and an interesting technique: it avoids the canonical AL sampling problem. Future work: currently we only analyze absolute 0-1 loss, which is hard to optimize for some concept classes (e.g., hardness of agnostic supervised learning of halfspaces); analyzing a convex upper bound on 0-1 loss could lead to an implementation via an SVM variant. The algorithm is extremely simple: lazily check every uncertain point's label. For specific concept classes and input distributions, more aggressive querying rules could tighten the label complexity bounds; for a general method, though, is this the best one can hope to do?

Thank you! And thanks to coauthors: Sanjoy Dasgupta and Daniel Hsu.

Some analysis details. Lemma (bounding error differences): with high probability, err(h, S∪T) − err(h′, S∪T) ≤ errD(h) − errD(h′) + βn² + βn(err(h, S∪T)^{1/2} + err(h′, S∪T)^{1/2}), with βn = Õ(((d log n)/n)^{1/2}) and d = VCdim(H). High-level proof idea: any h, h′ ∈ H consistent with S make the same errors on S*, the truly labeled version of S, so err(h, S∪T) − err(h′, S∪T) = err(h, S*∪T) − err(h′, S*∪T). S* ∪ T is an i.i.d. sample from D: it is simply the entire i.i.d. stream. So we can use a normalized uniform convergence bound [Vapnik & Chervonenkis '71], which relates empirical error on an i.i.d. sample to the true error rate, to bound error differences on S∪T. So let Δn = βn² + βn(err(h, S∪T)^{1/2} + err(h′, S∪T)^{1/2}), which we can compute! Lemma: h* = arg min_{h ∈ H} errD(h) is consistent with Sn for all n ≥ 0 (use the lemma above and induction). Thus S is a realizable problem.