Dasgupta, Kalai & Monteleoni, COLT 2005: Analysis of perceptron-based active learning. Sanjoy Dasgupta, UCSD; Adam Tauman Kalai, TTI-Chicago; Claire Monteleoni, MIT


1 Dasgupta, Kalai & Monteleoni COLT 2005 Analysis of perceptron-based active learning Sanjoy Dasgupta, UCSD Adam Tauman Kalai, TTI-Chicago Claire Monteleoni, MIT

2 Selective sampling, online constraints
Selective sampling framework:
Unlabeled examples $x_t$ are received one at a time.
The learner makes a prediction at each time step.
A noiseless oracle for the label $y_t$ can be queried at a cost.
Goal: minimize the number of labels needed to reach error $\epsilon$, where $\epsilon$ is the error rate (w.r.t. the target) on the sampling distribution.
Online constraints:
Space: the learner cannot store all previously seen examples (and then perform batch learning).
Time: the running time of the learner's belief update step should not scale with the number of seen examples or mistakes.

3 AC Milan v. Inter Milan

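4 Problem framework
Target: $u$. Current hypothesis: $v_t$. Error region: $\xi_t$, the set of points on which $v_t$ and $u$ disagree. $\theta_t$ is the angle between $u$ and $v_t$.
Assumptions: separability; the target hyperplane $u$ passes through the origin; $x \sim$ Uniform on $S$, the unit sphere.
Error rate: $\epsilon_t = \theta_t / \pi$.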
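The relation between the angle $\theta_t$ and the error rate under the uniform distribution can be checked numerically. The following is a minimal sketch, not taken from the slides: the helper names (`random_unit_vector`, `empirical_error`), the dimension and the sample size are all illustrative assumptions.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a point uniformly from the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sign(z):
    return 1 if z >= 0 else -1

def angle(u, v):
    """Angle between two nonzero vectors, in radians."""
    c = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.acos(max(-1.0, min(1.0, c)))

def empirical_error(u, v, n, rng):
    """Fraction of n uniform sphere points on which u and v disagree."""
    d = len(u)
    wrong = 0
    for _ in range(n):
        x = random_unit_vector(d, rng)
        if sign(dot(u, x)) != sign(dot(v, x)):
            wrong += 1
    return wrong / n

rng = random.Random(0)          # fixed seed, chosen arbitrarily
u = random_unit_vector(5, rng)  # stand-in for the target
v = random_unit_vector(5, rng)  # stand-in for a hypothesis
predicted = angle(u, v) / math.pi
estimated = empirical_error(u, v, 20000, rng)
print(f"theta/pi = {predicted:.3f}, empirical error = {estimated:.3f}")
```

The two printed numbers should agree up to sampling noise, which is why the analysis can work with the angle instead of the error rate.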

5 Related work
Analysis, under the selective sampling model, of the Query By Committee (QBC) algorithm [Seung, Opper & Sompolinsky '92]:
Theorem [Freund, Seung, Shamir & Tishby '97]: Under selective sampling from the uniform distribution, QBC can learn a half-space through the origin to generalization error $\epsilon$ using $\tilde{O}(d \log 1/\epsilon)$ labels.
BUT: the space required and the time complexity of the update both scale with the number of seen mistakes!

6 Related work
Perceptron: a simple online algorithm.
Filtering rule: if $y_t \ne \mathrm{SGN}(v_t \cdot x_t)$, then
Update step: $v_{t+1} = v_t + y_t x_t$
Distribution-free mistake bound $O(1/\gamma^2)$, if there exists a margin $\gamma$.
Theorem [Baum '89]: Perceptron, given sequential labeled examples from the uniform distribution, can converge to generalization error $\epsilon$ after $\tilde{O}(d/\epsilon^2)$ mistakes.
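The filtering rule and update step can be sketched in a few lines. This is an illustrative simulation under the deck's assumptions (uniform points on the unit sphere, noiseless labels from a hidden target `u`); the function names, the seed and the stream length are assumptions made for the example.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a point uniformly from the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sign(z):
    return 1 if z >= 0 else -1

def perceptron(u, num_examples, rng):
    """Standard Perceptron on a labeled stream; returns (v, number of mistakes)."""
    d = len(u)
    v = [0.0] * d
    mistakes = 0
    for _ in range(num_examples):
        x = random_unit_vector(d, rng)
        y = sign(dot(u, x))                  # noiseless label from the target
        if sign(dot(v, x)) != y:             # filtering rule: update only on a mistake
            v = [vi + y * xi for vi, xi in zip(v, x)]  # update step
            mistakes += 1
    return v, mistakes

rng = random.Random(1)          # fixed seed, chosen arbitrarily
u = random_unit_vector(5, rng)  # hidden target (simulation only)
v_final, mistakes = perceptron(u, 5000, rng)
print("mistakes:", mistakes, "squared norm of v:", round(dot(v_final, v_final), 3))
```

Only the current vector `v` and a counter are stored, so the space and per-step time do not grow with the number of examples seen. Each mistake adds at most 1 to the squared norm of `v`, so the squared norm never exceeds the number of mistakes.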

7 Our contributions
A lower bound of $\Omega(1/\epsilon^2)$ labels for Perceptron in the active learning context.
A modified Perceptron update with a $\tilde{O}(d \log 1/\epsilon)$ mistake bound.
An active learning rule and a label bound of $\tilde{O}(d \log 1/\epsilon)$.
A bound of $\tilde{O}(d \log 1/\epsilon)$ on total errors (labeled or not).

8 Perceptron
Perceptron update: $v_{t+1} = v_t + y_t x_t$
$\Rightarrow$ error does not decrease monotonically.
[Figure: $u$, $v_t$, $x_t$, $v_{t+1}$]

9 Lower bound on labels for Perceptron
Theorem 1: The Perceptron algorithm, using any active learning rule, requires $\Omega(1/\epsilon^2)$ labels to reach generalization error $\epsilon$ w.r.t. the uniform distribution.
Proof idea:
Lemma: For small $\theta_t$, the Perceptron update will increase $\theta_t$ unless $\|v_t\|$ is large: $\Omega(1/\sin\theta_t)$.
But $\|v_t\|$ grows only as $O(\sqrt{t})$, so we need $t \ge 1/\sin^2\theta_t$.
Under the uniform distribution, $\epsilon_t \propto \theta_t \ge \sin\theta_t$.
[Figure: $u$, $v_t$, $x_t$, $v_{t+1}$]

10 A modified Perceptron update
Standard Perceptron update: $v_{t+1} = v_t + y_t x_t$
Instead, weight the update by the "confidence" w.r.t. the current hypothesis $v_t$:
$v_{t+1} = v_t + 2 y_t |v_t \cdot x_t| x_t$ (with $v_1 = y_0 x_0$)
(similar to the update in [Blum et al. '96] for noise-tolerant learning)
Unlike Perceptron:
Error decreases monotonically: $\cos\theta_{t+1} = u \cdot v_{t+1} = u \cdot v_t + 2 |v_t \cdot x_t| |u \cdot x_t| \ge u \cdot v_t = \cos\theta_t$
$\|v_t\| = 1$ (due to the factor of 2)
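The claim that the factor of 2 keeps the hypothesis at unit length can be verified directly. A short derivation, assuming $\|x_t\| = 1$ (points lie on the unit sphere) and that the update is applied only on a mistake, so that $y_t (v_t \cdot x_t) = -|v_t \cdot x_t|$:

```latex
\begin{align*}
\|v_{t+1}\|^2
  &= \|v_t + 2 y_t |v_t \cdot x_t|\, x_t\|^2 \\
  &= \|v_t\|^2 + 4 y_t |v_t \cdot x_t| (v_t \cdot x_t) + 4 |v_t \cdot x_t|^2 \|x_t\|^2 \\
  &= \|v_t\|^2 - 4 (v_t \cdot x_t)^2 + 4 (v_t \cdot x_t)^2 \\
  &= \|v_t\|^2 .
\end{align*}
```

Since $\|v_1\| = \|y_0 x_0\| = 1$, the norm stays at 1 for all $t$, which is what lets $u \cdot v_t$ be read as $\cos\theta_t$.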

11 A modified Perceptron update
Perceptron update: $v_{t+1} = v_t + y_t x_t$
Modified Perceptron update: $v_{t+1} = v_t + 2 y_t |v_t \cdot x_t| x_t$
[Figure: $u$, $v_t$, $x_t$, $v_{t+1}$ for each update]
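The modified update is easy to simulate in the supervised setting. The sketch below is illustrative only: the function names, seed, dimension and stream length are assumptions, and the labels come from a simulated hidden target `u`. It records $u \cdot v_t$ after every update so the monotone behaviour can be inspected.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a point uniformly from the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sign(z):
    return 1 if z >= 0 else -1

def modified_perceptron(u, num_examples, rng):
    """Modified Perceptron on a labeled stream; returns (v, u.v after each update)."""
    d = len(u)
    x0 = random_unit_vector(d, rng)
    y0 = sign(dot(u, x0))
    v = [y0 * c for c in x0]                 # v_1 = y_0 x_0
    cosines = [dot(u, v)]
    for _ in range(num_examples):
        x = random_unit_vector(d, rng)
        y = sign(dot(u, x))
        vx = dot(v, x)
        if sign(vx) != y:                    # update only on a mistake
            v = [vi + 2 * y * abs(vx) * xi for vi, xi in zip(v, x)]
            cosines.append(dot(u, v))
    return v, cosines

rng = random.Random(3)          # fixed seed, chosen arbitrarily
u = random_unit_vector(5, rng)  # hidden target (simulation only)
v_final, cosines = modified_perceptron(u, 5000, rng)
print("updates:", len(cosines) - 1)
print("final cos(theta):", round(cosines[-1], 4))
print("final norm:", round(math.sqrt(dot(v_final, v_final)), 6))
```

On a mistake the update is a reflection of $v_t$ across the hyperplane orthogonal to $x_t$, so the norm is preserved and the recorded cosines never decrease.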

12 Mistake bound
Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error $\epsilon$ after $\tilde{O}(d \log 1/\epsilon)$ mistakes.
Proof idea: The exponential convergence follows from a multiplicative decrease in $\theta_t$. On an update, $\cos\theta_{t+1} = \cos\theta_t + 2|v_t \cdot x_t||u \cdot x_t|$.
$\Rightarrow$ We lower bound $2|v_t \cdot x_t||u \cdot x_t|$, with high probability, using our distributional assumption.

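13 Mistake bound
Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error $\epsilon$ after $\tilde{O}(d \log 1/\epsilon)$ mistakes.
Lemma (band): For any fixed $a$ with $\|a\| = 1$, any $\gamma \le 1$, and $x \sim U$ on $S$, the probability that $x$ falls in the band $\{x : |a \cdot x| \le k\}$ can be bounded in terms of $\gamma$.
Apply to $|v_t \cdot x|$ and $|u \cdot x|$ $\Rightarrow$ $2|v_t \cdot x_t||u \cdot x_t|$ is large enough in expectation (using the size of $\xi_t$).
[Figure: the band $\{x : |a \cdot x| \le k\}$ of half-width $k$ around the hyperplane orthogonal to $a$]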
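The probability mass of such a band can be estimated by simulation. The sketch below is a hedged illustration and does not reproduce the lemma's exact statement: the half-width $k = \gamma/\sqrt{d}$ is an assumption, chosen because a single coordinate of a uniform point on the unit sphere has standard deviation $1/\sqrt{d}$. The function names, dimension and sample size are likewise illustrative.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a point uniformly from the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def band_probability(d, gamma, n, rng):
    """Monte Carlo estimate of P[|a.x| <= gamma / sqrt(d)] for x uniform on the sphere."""
    a = random_unit_vector(d, rng)           # any fixed unit vector works, by symmetry
    k = gamma / math.sqrt(d)                 # assumed band half-width
    hits = 0
    for _ in range(n):
        x = random_unit_vector(d, rng)
        if abs(dot(a, x)) <= k:
            hits += 1
    return hits / n

rng = random.Random(2)          # fixed seed, chosen arbitrarily
estimates = {}
for gamma in (0.25, 0.5, 1.0):
    estimates[gamma] = band_probability(10, gamma, 20000, rng)
    print(f"gamma = {gamma}: estimated band probability = {estimates[gamma]:.3f}")
```

With this scaling the estimated probability grows roughly in proportion to $\gamma$, which is the kind of behaviour the proof needs when it applies the lemma to $|v_t \cdot x|$ and $|u \cdot x|$.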

14 Active learning rule
Goal: Filter to label just those points in the error region.
$\Rightarrow$ but $\theta_t$, and thus $\xi_t$, is unknown!
Define the labeling region: $L = \{x : |v_t \cdot x| \le s_t\}$.
Tradeoff in choosing the threshold $s_t$:
If too high, may wait too long for an error.
If too low, the resulting update is too small.
A threshold matched to $\theta_t$ makes the tradeoff constant. $\Rightarrow$ But $\theta_t$ is unknown!
Choose $s_t$ adaptively: start high; halve if there is no error in $R$ consecutive labels.
[Figure: $u$, $v_t$, $s_t$, $L$]
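The adaptive rule can be sketched as a loop over an unlabeled stream. This is a minimal illustration and not the authors' implementation: the starting threshold `1 / sqrt(d)`, the value of `R`, the seed and the function names are all assumptions, and the oracle is simulated from a hidden target `u`.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a point uniformly from the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sign(z):
    return 1 if z >= 0 else -1

def active_modified_perceptron(u, num_unlabeled, R, rng):
    """Query a label only when |v.x| <= s; halve s after R labels in a row with no error."""
    d = len(u)
    x0 = random_unit_vector(d, rng)
    v = [sign(dot(u, x0)) * c for c in x0]   # v_1 = y_0 x_0 (uses one label)
    s = 1.0 / math.sqrt(d)                   # assumed "high" starting threshold
    labels, updates, streak = 1, 0, 0
    for _ in range(num_unlabeled):
        x = random_unit_vector(d, rng)
        vx = dot(v, x)
        if abs(vx) > s:
            continue                         # outside the labeling region: no query
        y = sign(dot(u, x))                  # query the (simulated) oracle
        labels += 1
        if sign(vx) != y:                    # labeled mistake: modified update
            v = [vi + 2 * y * abs(vx) * xi for vi, xi in zip(v, x)]
            updates += 1
            streak = 0
        else:
            streak += 1
            if streak >= R:                  # R consecutive labels without an error
                s /= 2.0
                streak = 0
    return v, labels, updates, s

rng = random.Random(4)          # fixed seed, chosen arbitrarily
u = random_unit_vector(5, rng)  # hidden target (simulation only)
n_unlabeled = 20000
v_final, labels, updates, s_final = active_modified_perceptron(u, n_unlabeled, 20, rng)
print("labels queried:", labels, "of", n_unlabeled, "unlabeled points")
print("updates:", updates, "final threshold:", s_final)
print("final cos(theta):", round(dot(u, v_final), 4))
```

The state kept between steps is only `v`, `s` and a streak counter, which matches the online space and time constraints stated at the start of the deck.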

15 Label bound
Theorem 3: In the active learning setting, the modified Perceptron, using the adaptive filtering rule, will converge to generalization error $\epsilon$ after $\tilde{O}(d \log 1/\epsilon)$ labels.
Corollary: The total errors (labeled and unlabeled) will be $\tilde{O}(d \log 1/\epsilon)$.

16 Proof technique
Proof outline: We show that the following lemmas hold with sufficient probability:
Lemma 1. $s_t$ does not decrease too quickly.
Lemma 2. We query labels on a constant fraction of $\xi_t$.
Lemma 3. With constant probability the update is good.
By the algorithm, ~$1/R$ of the labels are mistakes, and $\exists R = \tilde{O}(1)$.
$\Rightarrow$ We can thus bound labels and total errors by mistakes.

17 Proof technique
Lemma 1. $s_t$ is large enough.
Proof (by contradiction): Let $t$ be the first time the bound on $s_t$ fails. A halving event means we saw $R$ labels with no mistakes.
Lemma 1a: For any particular $i$, this event happens w.p. $\le 3/4$.

18 Proof technique
Lemma 1a. Proof idea: Using this value of $s_t$, the band lemma in $\mathbb{R}^{d-1}$ gives a constant probability of $x'$ falling in an appropriately defined band w.r.t. $u'$, where:
$x'$: component of $x$ orthogonal to $v_t$
$u'$: component of $u$ orthogonal to $v_t$
[Figure: $u$, $v_t$, $s_t$]

19 Proof technique
Lemma 2. We query labels on a constant fraction of $\xi_t$.
Proof: Assume Lemma 1 for the lower bound on $s_t$. Apply Lemma 1a and the band lemma.
Lemma 3. With constant probability the update is good.
Proof: Assuming Lemma 1, by Lemma 2 each error is labeled with constant probability. From the mistake bound proof, each update is good (multiplicative decrease in error) with constant probability.
Finally, solve for $R$: every $R$ labels there is at least one update or we halve $s_t$, so there exists a suitable $R = \tilde{O}(1)$.

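20 Summary of contributions
Columns: samples, mistakes, labels, total errors, online?
Rows: PAC complexity [Long '03] [Long '95]; Perceptron [Baum '97]; QBC [FSST '97]; [DKM '05]
Entries, in the order they appear on the slide: $\tilde{O}(d/\epsilon)$, $\Omega(d/\epsilon)$; $\tilde{O}(d/\epsilon^3)$, $\Omega(1/\epsilon^2)$, $\tilde{O}(d/\epsilon^2)$, $\Omega(1/\epsilon^2)$; $\tilde{O}(d/\epsilon \log 1/\epsilon)$, $\tilde{O}(d \log 1/\epsilon)$; $\tilde{O}(d/\epsilon \log 1/\epsilon)$, $\tilde{O}(d \log 1/\epsilon)$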

21 Conclusions and open problems
Achieve optimal label complexity for this problem; unlike QBC, a fully online algorithm.
Matching bound on total errors (labeled and unlabeled).
Future work:
Relax distributional assumptions: uniform is sufficient but not necessary for the proof. Note: this bound is not possible under arbitrary distributions [Dasgupta '04].
Relax separability assumption: allow a "margin" of tolerated error.
Analyze margin version: for exponential convergence, without $d$ dependence.

22 Thank you!

