Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego

Supervised learning
Given access to labeled data (drawn i.i.d. from an unknown underlying distribution P), we want to learn a classifier chosen from a hypothesis class H, with misclassification rate < ε. The sample complexity is characterized by d = the VC dimension of H. If the data is separable, roughly d/ε labeled samples are needed.

Active learning
In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?

Our result
A parameter which coarsely characterizes the label complexity of active learning in the separable setting.

Can adaptive querying really help?
[CAL92, D04]: threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}. Start with 1/ε unlabeled points. Binary search – need just log(1/ε) labels, from which the rest can be inferred! Exponential improvement in sample complexity. (Figure: a threshold w on the line, with − labels to its left and + labels to its right.)
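As a concrete illustration of the binary-search idea above, here is a minimal Python sketch (the function name and demo parameters are mine, not from the talk): it locates a threshold among a pool of 1/ε unlabeled points using about log₂(1/ε) label queries, after which every remaining pool label is determined.

```python
import random

def active_learn_threshold(pool, label):
    """Binary search over a sorted pool of unlabeled points.
    label(x) returns 1 iff x >= w for an unknown threshold w."""
    xs = sorted(pool)
    queries = 2
    if label(xs[0]) == 1:            # every pool point is positive
        return xs[0], queries
    if label(xs[-1]) == 0:           # every pool point is negative
        return xs[-1], queries
    lo, hi = 0, len(xs) - 1
    # Invariant: xs[lo] is labeled 0 and xs[hi] is labeled 1.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if label(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    # Any w in (xs[lo], xs[hi]] is consistent with the whole pool,
    # so all remaining labels can be inferred without further queries.
    return (xs[lo] + xs[hi]) / 2, queries

random.seed(0)
eps = 0.001
pool = [random.random() for _ in range(int(1 / eps))]  # 1/eps unlabeled points
w_hat, q = active_learn_threshold(pool, lambda x: int(x >= 0.37))
print(f"estimate {w_hat:.4f} using {q} labels instead of {len(pool)}")
```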

More general hypothesis classes
For a general hypothesis class with VC dimension d, is a "generalized binary search" possible?
Random choice of queries: d/ε labels.
Perfect binary search: d log(1/ε) labels.
Where in this large range does the label complexity of active learning lie? We've already handled linear separators in 1-d…

Linear separators in R^2
For linear separators in R^1, we need just log(1/ε) labels. But when H = {linear separators in R^2}, some target hypotheses require 1/ε labels to be queried! Consider any distribution over the circle in R^2: 1/ε labels are needed to distinguish between h_0, h_1, h_2, …, h_{1/ε}. (Figure: hypotheses h_0, h_1, h_2, h_3 around the circle, each differing on an ε fraction of the distribution.)

A fuller picture
For linear separators in R^2: some bad target hypotheses require 1/ε labels, but "most" require just O(log 1/ε) labels… (Figure: the circle of hypotheses, marked into good and bad regions.)

A view of the hypothesis space
H = {linear separators in R^2}. (Figure: the hypothesis space, with the all-positive and all-negative hypotheses marked, a large good region, and small bad regions.)

Geometry of hypothesis space
H = any hypothesis class of VC dimension d < ∞. P = underlying distribution of data.
(i) Non-Bayesian setting: no probability measure on H.
(ii) But there is a natural (pseudo) metric: d(h,h′) = P(h(x) ≠ h′(x)).
(iii) Each point x defines a cut through H.
(Figure: the hypothesis space H, with two hypotheses h, h′ and the cut induced by a point x.)
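The pseudo-metric in (ii) can be estimated from unlabeled data alone, since it never consults true labels. A minimal Monte Carlo sketch (the function name and the threshold example are my assumptions, not from the talk):

```python
import random

def disagreement(h1, h2, sample_P, n=100_000):
    """Estimate d(h1, h2) = P(h1(x) != h2(x)) by sampling
    unlabeled points from the data distribution P."""
    return sum(h1(x) != h2(x) for x in (sample_P() for _ in range(n))) / n

# Two thresholds on [0, 1] under the uniform distribution:
# their distance should be the gap between them, |0.45 - 0.30| = 0.15.
h_a = lambda x: int(x >= 0.30)
h_b = lambda x: int(x >= 0.45)
print(disagreement(h_a, h_b, random.random))   # ≈ 0.15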

The learning process
(h_0 = target hypothesis.) Keep asking for labels until the diameter of the remaining version space is at most ε. (Figure: the version space in H shrinking around h_0.)

Searchability index
Fix the accuracy ε, the data distribution P, and the amount of unlabeled data. Each hypothesis h ∈ H then has a "searchability index" ρ(h):
ρ(h) ∝ min(positive mass of h, negative mass of h), but never < ε.
ε ≤ ρ(h) ≤ 1; bigger is better.
Example: linear separators in R^2, data on a circle. (Figure: hypotheses around the circle annotated with searchability values 1/2, 1/3, 1/4, 1/5; the all-positive hypothesis gets the minimum value ε.)
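A small sketch of this rule of thumb (the proportionality constant c and the function name are my stand-ins; the slide only states the proportionality and the ε floor):

```python
def searchability(pos_mass, eps, c=1.0):
    """rho(h): proportional to min(positive mass, negative mass)
    of hypothesis h, but floored at the accuracy eps."""
    return max(eps, c * min(pos_mass, 1.0 - pos_mass))

eps = 0.01
for p in [0.5, 0.25, 0.2, 0.0]:  # matches the slide's 1/2, 1/4, 1/5, and ε
    print(f"pos mass {p}: rho = {searchability(p, eps)}")
```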

Searchability index
Fix the accuracy ε, the data distribution P, and the amount of unlabeled data. Each hypothesis h ∈ H has a "searchability index" ρ(h), lying in the range ε ≤ ρ(h) ≤ 1.
Upper bound. There is an active learning scheme which identifies any target hypothesis h ∈ H (within accuracy ≤ ε) with a label complexity of at most: … (bound not reproduced in the transcript).
Lower bound. For any h ∈ H, any active learning scheme for the neighborhood B(h, ρ(h)) has a label complexity of at least: … (bound not reproduced in the transcript).
[When ρ(h) ≫ ε: active learning helps a lot.]

Linear separators in R^d
Previous sample complexity results for active learning have focused on the following case: H = homogeneous (through the origin) linear separators in R^d, with data distributed uniformly over the unit sphere.
[1] Query by committee [SOS92, FSST97]. Bayesian setting: average-case over target hypotheses picked uniformly from the unit sphere.
[2] Perceptron-based active learner [DKM05]. Non-Bayesian setting: worst-case over target hypotheses.
In either case: just Õ(d log 1/ε) labels needed!

Example: linear separators in R^d
H = {homogeneous linear separators in R^d}, P = uniform distribution. ρ(h) is the same for all h, and is ≥ 1/8. This sample complexity is realized by many schemes: query by committee [SOS92, FSST97] and the perceptron-based active learner [DKM05], as before; simplest of all, [CAL92]: pick a random point whose label is not completely certain (with respect to the current version space).
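A minimal sketch of the [CAL92] rule, shown for the threshold class from earlier rather than for linear separators (the [0, 1] data support, function name, and demo values are my assumptions): maintain the interval of thresholds consistent with all answers so far, and only spend a label on points whose label that interval does not already determine.

```python
import random

def cal_thresholds(pool, label, eps):
    """CAL-style learner for H = {thresholds on [0, 1]}.
    Version space = interval (lo, hi] of thresholds consistent
    with every answer received so far."""
    lo, hi = 0.0, 1.0
    queries = 0
    random.shuffle(pool)
    for x in pool:
        if lo < x <= hi:        # label not determined by version space: query
            queries += 1
            if label(x):        # x is positive, so the threshold w <= x
                hi = x
            else:               # x is negative, so the threshold w > x
                lo = x
        if hi - lo <= eps:      # version-space diameter small enough: stop
            break
    return (lo + hi) / 2, queries

random.seed(1)
pool = [random.random() for _ in range(100_000)]
print(cal_thresholds(pool, lambda x: int(x >= 0.62), eps=1e-3))
```

In expectation the query count grows like log(1/ε), even though the pool has size on the order of 1/ε.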

Linear separators in R^d
Uniform distribution: concentrated near the equator (any equator). (Figure: the unit sphere with mass bunched near the equator, + on one side of the separator and − on the other.)

Linear separators in R^d
Instead: a distribution P with a different vertical marginal. Say that, for some λ ≥ 1, U(x)/λ ≤ P(x) ≤ λ·U(x) (U = uniform). Result: ρ ≥ 1/32, provided the amount of unlabeled data grows by … Do the schemes [CAL92, SOS92, FSST97, DKM05] achieve this label complexity? (Figure: the sphere with + and − regions under the modified marginal.)

What next
1. Make this algorithmic! Linear separators: is some kind of "querying near the current boundary" a reasonable approximation?
2. Nonseparable data: need a robust base learner! (Figure: a nonseparable cloud of + and − points around the true boundary.)

Thanks
For helpful discussions: Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni.

Star-shaped configurations
In the vicinity of the "bad" hypothesis h_0, the hypothesis space contains a star structure. (Figure: left, the hypothesis space with h_1, h_2, h_3, …, h_{1/ε} arranged in a star around h_0; right, the corresponding configuration in data space.)

Example: the 1-d line
The searchability index lies in the range ε ≤ ρ(h) ≤ 1, and the theorem brackets the number of labels needed between a lower and an upper bound in terms of ρ(h) (the bounds are not reproduced in the transcript). Example: threshold functions on the line, h_w(x) = 1(x ≥ w). Result: ρ = 1/2 for any target hypothesis and any input distribution.

Linear separators in R^d
Data lies on the rims of two slabs around the origin, distributed uniformly. Result: ρ = Ω(1) for most target hypotheses, but is ε for the hypothesis that makes one slab + and the other −… the most "natural" one! (Figure: the two slabs around the origin.)