Analysis of perceptron-based active learning (COLT 2005). Sanjoy Dasgupta, UCSD; Adam Tauman Kalai, TTI-Chicago; Claire Monteleoni, MIT.

Selective sampling, online constraints. Selective sampling framework: unlabeled examples x_t are received one at a time; the learner makes a prediction at each time-step; a noiseless oracle for the label y_t can be queried at a cost. Goal: minimize the number of labels needed to reach error ε, where ε is the error rate (w.r.t. the target) on the sampling distribution. Online constraints: Space: the learner cannot store all previously seen examples (and then perform batch learning). Time: the running time of the learner's belief update step should not scale with the number of examples/mistakes seen.
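To make the protocol concrete, here is a minimal sketch of the selective sampling loop; the callables label_oracle, predict, want_label, and update are illustrative assumptions, not part of the talk. It keeps only the current hypothesis and a query counter, in line with the space/time constraints above.

```python
# Minimal sketch (illustrative, not from the talk) of the selective sampling protocol:
# the learner sees a stream of unlabeled points, predicts on each one, and decides
# whether to pay for the label; only constant-size state (the hypothesis) is kept.
import numpy as np

def selective_sampling(stream, label_oracle, predict, want_label, update):
    """Generic online active-learning loop.

    stream       -- iterable of unlabeled points x_t
    label_oracle -- callable x -> y in {-1, +1} (costly; noiseless)
    predict      -- callable (state, x) -> predicted label
    want_label   -- callable (state, x) -> bool, the query (filtering) rule
    update       -- callable (state, x, y) -> new state (constant-time step)
    """
    state = None
    queries = 0
    for x in stream:
        y_hat = predict(state, x)          # prediction at each time-step
        if want_label(state, x):           # query the oracle only when useful
            y = label_oracle(x)
            queries += 1
            state = update(state, x, y)    # space/time do not grow with t
    return state, queries
```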

AC Milan v. Inter Milan

Problem framework. Target: a halfspace u through the origin. Current hypothesis: v_t. Error region: the set of points x on which u and v_t disagree. Assumptions: separability; u passes through the origin; x ~ Uniform on the unit sphere S. Error rate: ε_t = θ_t/π, where θ_t is the angle between u and v_t.
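As a quick illustration (not part of the talk), the relation ε_t = θ_t/π for the uniform distribution can be checked by Monte Carlo; the dimension, sample size, and random vectors below are arbitrary choices.

```python
# Sanity check: for x uniform on the unit sphere, the disagreement probability
# between two homogeneous halfspaces u and v equals theta/pi, where theta is the
# angle between u and v.
import numpy as np

rng = np.random.default_rng(0)
d = 10

def random_unit(d):
    x = rng.standard_normal(d)
    return x / np.linalg.norm(x)

u = random_unit(d)
v = random_unit(d)
theta = np.arccos(np.clip(u @ v, -1.0, 1.0))

X = rng.standard_normal((200_000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # uniform on the sphere
disagree = np.mean(np.sign(X @ u) != np.sign(X @ v))

print(f"empirical error rate {disagree:.4f} vs theta/pi {theta/np.pi:.4f}")
```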

Related work. Analysis, under the selective sampling model, of the Query By Committee algorithm [Seung, Opper & Sompolinsky '92]. Theorem [Freund, Seung, Shamir & Tishby '97]: under selective sampling from the uniform distribution, QBC can learn a halfspace through the origin to generalization error ε using Õ(d log 1/ε) labels. BUT: the space required and the time complexity of the update both scale with the number of seen mistakes!

Related work. Perceptron: a simple online algorithm. If y_t ≠ SGN(v_t · x_t) (filtering rule), then v_{t+1} = v_t + y_t x_t (update step). It has a distribution-free mistake bound of O(1/γ²) if a margin γ exists. Theorem [Baum '89]: Perceptron, given sequential labeled examples from the uniform distribution, can converge to generalization error ε after Õ(d/ε²) mistakes.
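A minimal sketch of this standard mistake-driven update (illustrative code, not the authors'):

```python
import numpy as np

def perceptron_update(v, x, y):
    """Standard Perceptron: update only on a mistake (filtering rule)."""
    if y * np.dot(v, x) <= 0:          # prediction SGN(v.x) disagrees with label y
        v = v + y * x                  # update step: v_{t+1} = v_t + y_t x_t
    return v
```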

Our contributions. A lower bound of Ω(1/ε²) labels for Perceptron in the active learning context. A modified Perceptron update with a Õ(d log 1/ε) mistake bound. An active learning rule and a label bound of Õ(d log 1/ε). A bound of Õ(d log 1/ε) on total errors (labeled or not).

Perceptron. Perceptron update: v_{t+1} = v_t + y_t x_t. The error does not decrease monotonically. (Figure: geometry of u, v_t, x_t, and v_{t+1}.)

Lower bound on labels for Perceptron. Theorem 1: The Perceptron algorithm, using any active learning rule, requires Ω(1/ε²) labels to reach generalization error ε w.r.t. the uniform distribution. Proof idea: Lemma: for small θ_t, the Perceptron update will increase θ_t unless ||v_t|| is large: Ω(1/sin θ_t). But ||v_t|| can grow only at rate √t, so we need t ≥ 1/sin² θ_t. Under the uniform distribution, ε_t = θ_t/π and sin θ_t ≤ θ_t, which forces Ω(1/ε²) labels.

A modified Perceptron update. Standard Perceptron update: v_{t+1} = v_t + y_t x_t. Instead, weight the update by the "confidence" w.r.t. the current hypothesis v_t: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t (with v_1 = y_0 x_0), similar to the update in [Blum et al. '96] for noise-tolerant learning. Unlike Perceptron, the error decreases monotonically: cos(θ_{t+1}) = u · v_{t+1} = u · v_t + 2 |v_t · x_t| |u · x_t| ≥ u · v_t = cos(θ_t), and ||v_t|| = 1 (due to the factor of 2).

A modified Perceptron update. Perceptron update: v_{t+1} = v_t + y_t x_t. Modified Perceptron update: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t. (Figure: the two updates compared geometrically, relative to u, v_t, and x_t.)
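A minimal sketch of the confidence-weighted update (illustrative code, not the authors'):

```python
import numpy as np

def modified_perceptron_update(v, x, y):
    """Modified Perceptron: v_{t+1} = v_t + 2 y_t |v_t . x_t| x_t, only on mistakes.
    Assumes ||x|| = 1 and v initialized as v_1 = y_0 x_0 (so ||v|| = 1)."""
    margin = np.dot(v, x)
    if y * margin <= 0:                      # mistake: prediction disagrees with y
        v = v + 2.0 * y * abs(margin) * x    # the factor 2|v.x| preserves ||v|| = 1
    return v
```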

Mistake bound. Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes. Proof idea: the exponential convergence follows from a multiplicative decrease in the error θ_t. On an update, 1 − u · v_{t+1} = (1 − u · v_t) − 2 |v_t · x_t| |u · x_t|, and we lower bound 2 |v_t · x_t| |u · x_t| with high probability, using our distributional assumption.

Mistake bound (continued). Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes. Lemma (band): for any fixed a with ||a|| = 1, any γ ≤ 1, and x ~ U on S, the band {x : |a · x| ≤ γ/√d} has probability Θ(γ). Apply this to |v_t · x| and |u · x| ⇒ 2 |v_t · x_t| |u · x_t| is large enough in expectation (using the size of θ_t).
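As an illustrative check of this band picture (not from the talk; the dimension, γ values, and sample size below are arbitrary), a quick Monte Carlo estimate:

```python
# For x uniform on the unit sphere in R^d, the band {x : |a.x| <= gamma/sqrt(d)}
# should have probability on the order of gamma.
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 200_000
a = np.zeros(d); a[0] = 1.0                      # any fixed unit vector

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # uniform on the sphere

for gamma in (0.1, 0.3, 1.0):
    p = np.mean(np.abs(X @ a) <= gamma / np.sqrt(d))
    print(f"gamma={gamma:.1f}  P(|a.x| <= gamma/sqrt(d)) = {p:.3f}")
```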

Active learning rule. Goal: filter so as to label just those points in the error region, but θ_t, and thus ε_t, is unknown! Define a labeling region L = {x : |v_t · x| ≤ s_t}. Tradeoff in choosing the threshold s_t: if it is too high, we may wait too long for an error; if it is too low, the resulting update is too small. A threshold on the order of θ_t/√d would make the fraction of queried points that are errors constant, but θ_t is unknown! So choose s_t adaptively: start high, and halve it if there is no error in R consecutive labels.
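Putting the pieces together, here is a hedged sketch of the resulting active learner; the stream/oracle interfaces, the initial threshold, and R = 32 are illustrative assumptions, not the authors' constants.

```python
# Sketch: query a label only when |v.x| <= s (the labeling region), apply the
# confidence-weighted update on mistakes, and halve s whenever R consecutive
# queried labels produce no mistake.
import numpy as np

def active_modified_perceptron(stream, label_oracle, d, R=32, s0=None):
    v = None
    s = s0 if s0 is not None else 1.0 / np.sqrt(d)   # initial threshold (assumption)
    labels_since_mistake = 0

    for x in stream:                                  # unlabeled x, ||x|| = 1
        if v is None:                                 # initialize v_1 = y_0 x_0
            v = label_oracle(x) * x
            continue
        margin = np.dot(v, x)
        if abs(margin) > s:                           # outside labeling region: skip
            continue
        y = label_oracle(x)                           # query the (costly) label
        if y * margin <= 0:                           # mistake: modified update
            v = v + 2.0 * y * abs(margin) * x
            labels_since_mistake = 0
        else:
            labels_since_mistake += 1
            if labels_since_mistake >= R:             # no error in R labels: shrink s
                s /= 2.0
                labels_since_mistake = 0
    return v
```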

Label bound. Theorem 3: In the active learning setting, the modified Perceptron, using the adaptive filtering rule, converges to generalization error ε after Õ(d log 1/ε) labels. Corollary: the total number of errors (labeled and unlabeled) is Õ(d log 1/ε).

Proof technique. Proof outline: we show that the following lemmas hold with sufficient probability. Lemma 1: s_t does not decrease too quickly. Lemma 2: we query labels on a constant fraction of the error region θ_t. Lemma 3: with constant probability, the update is good. By the algorithm, roughly a 1/R fraction of the labels are mistakes, and there exists R = Õ(1). We can thus bound the labels and total errors by the number of mistakes.

Proof technique. Lemma 1: s_t is large enough. Proof (by contradiction): let t be the first time s_t falls below the required level. A halving event means we saw R consecutive labels with no mistakes. Lemma 1a: for any particular such label i, this (mistake-free) event happens with probability at most 3/4.

Proof technique. Lemma 1a, proof idea: using this value of s_t, the band lemma in R^{d−1} gives a constant probability of x′ falling in an appropriately defined band w.r.t. u′, where x′ is the component of x orthogonal to v_t and u′ is the component of u orthogonal to v_t.

Proof technique. Lemma 2: we query labels on a constant fraction of the error region θ_t. Proof: assume Lemma 1 for the lower bound on s_t, then apply Lemma 1a and the band lemma. Lemma 3: with constant probability, the update is good. Proof: assuming Lemma 1, by Lemma 2 each error is labeled with constant probability; from the mistake-bound proof, each update is good (multiplicative decrease in error) with constant probability. Finally, solve for R: every R labels there is at least one update or we halve s_t, so there exists R = Õ(1) for which the required progress is made.

Summary of contributions (samples / mistakes / labels / total errors / online?):
PAC sample complexity [Long '95, Long '03]: Õ(d/ε) upper bound, Ω(d/ε) lower bound.
Perceptron [Baum '97]: Õ(d/ε³) samples, Õ(d/ε²) mistakes, Ω(1/ε²) labels and total errors; online.
QBC [FSST '97]: Õ((d/ε) log 1/ε) samples, Õ(d log 1/ε) labels; not online.
DKM '05: Õ((d/ε) log 1/ε) samples, Õ(d log 1/ε) mistakes, labels, and total errors; online.

Conclusions and open problems. We achieve the optimal label complexity for this problem and, unlike QBC, with a fully online algorithm, together with a matching bound on the total number of errors (labeled and unlabeled). Future work: relax the distributional assumptions (uniformity is sufficient but not necessary for the proof; note that this bound is not possible under arbitrary distributions [Dasgupta '04]); relax the separability assumption (allow a "margin" of tolerated error); analyze a margin version, aiming for exponential convergence without the dependence on d.

Thank you!