1
Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego
2
Supervised learning Given access to labeled data (drawn iid from an unknown underlying distribution P), we want to learn a classifier chosen from a hypothesis class H, with misclassification rate < ε. Sample complexity is characterized by d = VC dimension of H. If the data is separable, roughly d/ε labeled samples are needed.
3
Active learning In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
4
Our result A parameter which coarsely characterizes the label complexity of active learning in the separable setting
5
Can adaptive querying really help? [CAL92, D04]: Threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}. Start with 1/ε unlabeled points. Binary search: need just log 1/ε labels, from which the rest can be inferred! Exponential improvement in sample complexity.
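The binary-search scheme on this slide can be sketched in a few lines. This is a minimal sketch, not the paper's algorithm verbatim: `query_label` stands in for the charged labeling oracle, and the pool of 1/ε unlabeled points is assumed given.

```python
def active_learn_threshold(xs, query_label):
    """Binary-search for a threshold w over unlabeled points xs.

    query_label(x) -> 0 or 1 is the (charged) labeling oracle for
    h_w(x) = 1 if x >= w else 0.  Uses O(log len(xs)) label queries;
    the labels of all remaining points are then inferred for free.
    """
    xs = sorted(xs)
    lo, hi = 0, len(xs)            # invariant: points before lo are 0, from hi on are 1
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 1:
            hi = mid               # threshold is at or left of xs[mid]
        else:
            lo = mid + 1           # threshold is right of xs[mid]
    # any w in (xs[lo-1], xs[lo]] is consistent with every queried label
    return xs[lo] if lo < len(xs) else xs[-1]
```

With 1/ε = 100 pool points this makes at most ⌈log₂ 100⌉ = 7 queries, versus ~100 labels for passive learning at the same resolution.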
6
More general hypothesis classes For a general hypothesis class with VC dimension d, is a "generalized binary search" possible? Random choice of queries: d/ε labels. Perfect binary search: d log 1/ε labels. Where in this large range does the label complexity of active learning lie? We've already handled linear separators in 1-d…
7
Linear separators in R^2 For linear separators in R^1, we need just log 1/ε labels. But when H = {linear separators in R^2}, some target hypotheses require 1/ε labels! Consider any distribution over the circle in R^2: with hypotheses h_0, h_1, h_2, …, h_{1/ε} each differing on only an ε fraction of the distribution, 1/ε labels are needed to distinguish between them.
8
A fuller picture For linear separators in R^2: some bad target hypotheses require 1/ε labels, but "most" require just O(log 1/ε) labels… [Figure: good vs. bad hypotheses.]
9
A view of the hypothesis space H = {linear separators in R^2}. [Figure: the hypothesis space, with the all-positive and all-negative hypotheses marked; a good region, and bad regions around those two hypotheses.]
10
Geometry of hypothesis space H = any hypothesis class of VC dimension d < ∞. P = underlying distribution of data. (i) Non-Bayesian setting: no probability measure on H. (ii) But there is a natural (pseudo) metric: d(h,h′) = P(h(x) ≠ h′(x)). (iii) Each point x defines a cut through H.
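The pseudo-metric in (ii) is just the disagreement probability, so it can be estimated from unlabeled data alone — no labels are charged. A minimal sketch (function name hypothetical):

```python
def disagreement(h1, h2, unlabeled):
    """Empirical estimate of d(h, h') = P(h(x) != h'(x)) from an
    unlabeled sample drawn from P.  Needs no labels, since it only
    asks how often the two hypotheses disagree with each other."""
    return sum(h1(x) != h2(x) for x in unlabeled) / len(unlabeled)
```

For two thresholds at 0.3 and 0.5 under a uniform sample on [0, 1), the estimate comes out near the true mass 0.2 of the disagreement region.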
11
The learning process (h_0 = target hypothesis) Keep asking for labels until the diameter of the remaining version space is at most ε.
12
Searchability index Depends on: accuracy ε, data distribution P, amount of unlabeled data. Each hypothesis h ∈ H has a "searchability index" ρ(h). Roughly, ρ(h) ∝ min(pos mass of h, neg mass of h), but never < ε; so ε ≤ ρ(h) ≤ 1, and bigger is better. Example: linear separators in R^2, data on a circle. [Figure: sample values 1/2, 1/4, 1/5, 1/4, 1/5, 1/3 for various hypotheses; the all-positive hypothesis gets ε.]
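The rule of thumb on this slide — the index scales with the smaller of a hypothesis's positive and negative mass, floored at ε — can be sketched directly. This is only a proxy for the slide's informal statement (the function name and the missing proportionality constant are assumptions, not the paper's definition):

```python
def searchability_proxy(h, unlabeled, eps):
    """Rule-of-thumb proxy from the slide: the searchability index of h
    scales with min(positive mass, negative mass) of h under P, but is
    never below eps.  Masses are estimated from an unlabeled sample."""
    pos = sum(h(x) == 1 for x in unlabeled) / len(unlabeled)
    return max(min(pos, 1 - pos), eps)
```

A hypothesis that splits the data evenly scores 1/2 (easy to search for), while the all-positive hypothesis bottoms out at ε (hard).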
13
Searchability index Each hypothesis h ∈ H has a "searchability index" ρ(h), lying in the range ε ≤ ρ(h) ≤ 1. Upper bound. There is an active learning scheme which identifies any target hypothesis h ∈ H (within accuracy ≤ ε) with a label complexity of at most: [formula on slide]. Lower bound. For any h ∈ H, any active learning scheme for the neighborhood B(h, ρ(h)) has a label complexity of at least: [formula on slide]. [When ρ(h) ≫ ε: active learning helps a lot.]
14
Linear separators in R^d Previous sample complexity results for active learning have focused on the following case: H = homogeneous (through the origin) linear separators in R^d, data distributed uniformly over the unit sphere. [1] Query by committee [SOS92, FSST97]: Bayesian setting, average-case over target hypotheses picked uniformly from the unit sphere. [2] Perceptron-based active learner [DKM05]: non-Bayesian setting, worst-case over target hypotheses. In either case: just Õ(d log 1/ε) labels needed!
15
Example: linear separators in R^d H = {homogeneous linear separators in R^d}, P = uniform distribution, as before. ρ(h) is the same for all h, and is ≥ 1/8. This sample complexity is realized by many schemes: [SOS92, FSST97] query by committee; [DKM05] perceptron-based active learner; simplest of all, [CAL92]: pick a random point whose label is not completely certain (with respect to the current version space).
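The [CAL92] strategy named on this slide can be sketched for a finite hypothesis class, where the version space can be stored explicitly (that finiteness, and all names here, are assumptions of the sketch, not the original algorithm):

```python
import random

def cal_active_learner(hypotheses, unlabeled, query_label, rng=random.Random(0)):
    """Sketch of the CAL strategy: repeatedly pick a random unlabeled
    point whose label is not determined by the current version space,
    query it, and discard the hypotheses it contradicts.  Points on
    which the whole version space agrees cost nothing."""
    version_space = list(hypotheses)
    pool = list(unlabeled)
    rng.shuffle(pool)
    queries = 0
    for x in pool:
        preds = {h(x) for h in version_space}
        if len(preds) <= 1:
            continue                 # label already inferred: no charge
        y = query_label(x)           # charged label query
        queries += 1
        version_space = [h for h in version_space if h(x) == y]
    return version_space, queries
```

Every query eliminates at least one hypothesis, so for thresholds on a grid the number of charged queries stays well below the pool size.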
16
Linear separators in R^d Uniform distribution: concentrated near the equator (any equator).
17
Linear separators in R^d Instead: a distribution P with a different vertical marginal. Say that for some λ < ∞, U(x)/λ ≤ P(x) ≤ λU(x) (U = uniform). Result: ρ ≥ 1/32, provided the amount of unlabeled data grows by … Do the schemes [CAL92, SOS92, FSST97, DKM05] achieve this label complexity?
18
What next 1. Make this algorithmic! Linear separators: is some kind of "querying near the current boundary" a reasonable approximation? 2. Nonseparable data: need a robust base learner! [Figure: true boundary separating + and − regions.]
19
Thanks For helpful discussions: Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni
20
Star-shaped configurations In the vicinity of the "bad" hypothesis h_0, we find a star structure. [Figure: hypothesis space, with h_1, h_2, h_3, … arranged in a star around h_0; data space, with regions labeled h_0, h_1, h_2, h_3, …, h_{1/ε}.]
21
Example: the 1-d line The searchability index lies in the range ε ≤ ρ(h) ≤ 1. Theorem: [lower bound] ≤ # labels needed ≤ [upper bound] (formulas on slide). Example: threshold functions on the line. Result: ρ = 1/2 for any target hypothesis and any input distribution.
22
Linear separators in R^d Data lies on the rims of two slabs around the origin, distributed uniformly. Result: ρ = Ω(1) for most target hypotheses, but ρ is ε for the hypothesis that makes one slab +, the other −… the most "natural" one!