Potential-Based Agnostic Boosting Varun Kanade Harvard University (joint work with Adam Tauman Kalai (Microsoft NE))
Outline
- PAC Learning, Agnostic Learning, Boosting
- Boosting Algorithm and applications
- (Some) history of boosting algorithms
Learning...
- PAC Learning [Valiant '84]
  - Learning from examples, e.g., halfspaces
  - Learning from membership queries, e.g., DNF: (x₁ ∧ x₇ ∧ x₁₃) ∨ (x₂ ∧ x₇) ∨ (x₃ ∧ x₈ ∧ x₁₀)
- Agnostic Learning [Haussler '92; Kearns, Schapire, Sellie '94]
  - No assumptions about correctness of labels
  - Challenging noise model
Different Learning Scenarios
- PAC Learning
- Random classification noise PAC learning
- Agnostic Learning
Boosting...
- Weak learning → strong learning (accuracy 51% → accuracy 99%)
- Great for proving theory results [Schapire '89]
- Potential-based algorithms work great in practice, e.g., AdaBoost [Freund and Schapire '95]
- But they suffer in the presence of noise.
Boosting at work...
[Figure sequence: successive boosting rounds build up the combined hypothesis, H = h₁, then H = a₁h₁ + a₂h₂, then H = a₁h₁ + a₂h₂ + a₃h₃, ...]
Boosting Algorithms
Repeat:
- Find a weakly accurate hypothesis
- Change weights of examples
Take a "weighted majority" of the weak classifiers to obtain a highly accurate hypothesis.
Guaranteed to work when the labelling is correct.
Our Work
- New simple boosting algorithm
- Better theoretical guarantees (agnostic setting)
- Does not change weights of examples
- Application: simplifies agnostic learning results
Independent Work
Distribution-specific agnostic boosting [Feldman, ICS 2010]
Agnostic Learning
- The assumption that data is perfectly labelled is unrealistic
- Make no assumption on how the data is labelled
- Goal: fit the data as well as the best concept from a concept class
Agnostic Learning
- Instance space X, distribution μ over X
- Labelling function f : X → [-1, 1]
- D = (μ, f) is a distribution over X × {-1, 1} with f(x) = E_{(x,y)~D}[y | x]
- Oracle access
Agnostic Learning
- err_D(c) = Pr_{(x,y)~D}[c(x) ≠ y]
- cor(c, D) = E_{(x,y)~D}[c(x)y] = E_{x~μ}[c(x)f(x)]
- opt = max_{c ∈ C} cor(c, D)
- Goal: find h satisfying cor(h, D) ≥ opt − ε
- PAC learning is the special case where the labels exactly match a concept in C (f ∈ C), so opt = 1
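As a small concrete check of these definitions (not from the slides), here is a minimal Python sketch that estimates err and cor from a labelled sample and verifies cor = 1 − 2·err for {−1, +1}-valued hypotheses; the data, the threshold hypothesis h, and the function names are hypothetical.

```python
import numpy as np

def empirical_error(h, X, y):
    """Estimate err_D(h) = Pr[h(x) != y] from a labelled sample (X, y), y in {-1,+1}."""
    return np.mean(h(X) != y)

def empirical_correlation(h, X, y):
    """Estimate cor(h, D) = E[h(x) * y] from the same sample."""
    return np.mean(h(X) * y)

# Hypothetical example: a threshold hypothesis on 1-D data with noisy labels.
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = np.where(X + 0.3 * rng.normal(size=1000) > 0, 1, -1)
h = lambda X: np.where(X > 0, 1, -1)

err = empirical_error(h, X, y)
cor = empirical_correlation(h, X, y)
assert np.isclose(cor, 1 - 2 * err)   # cor = 1 - 2*err for {-1,+1}-valued h
```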
Key def: Weak Agnostic Learning
- opt may be as low as 0
- Algorithm W is a (γ, ε₀)-weak agnostic learner for distribution μ if, for every f: X → [-1, 1] and access to D = (μ, f), it outputs a hypothesis w such that cor(w, D) ≥ γ·opt − ε₀
- A weak learner for a specific distribution gives a strong learner for that same distribution
Boosting Theorem
If C is (γ, ε₀)-weakly agnostically learnable under distribution μ over X, then AgnosticBoost returns a hypothesis h such that cor(h, D) ≥ opt − ε − (ε₀/γ).
The boosting algorithm makes O(1/(γε)²) calls to the weak learner.
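For a feel for the parameters, a purely illustrative instantiation (these numbers are assumed, not from the slides): with γ = 0.1, ε₀ = 0.01 and ε = 0.05, the output satisfies cor(h, D) ≥ opt − ε − ε₀/γ = opt − 0.15, using O(1/(γε)²) = O(4×10⁴) calls to the weak learner.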
Boosting Algorithm
Input: (x₁, y₁), ..., (x_{mT}, y_{mT})
1. Let H₀ = 0
2. For t = 1, ..., T:
   - Relabel the t-th batch (x_{(t-1)m+1}, y_{(t-1)m+1}), ..., (x_{tm}, y_{tm}) according to the weights w_t(x, y) (defined below: w_t(x, y) = −φ'(yH_{t-1}(x)))
   - Let g_t be the output of weak learner W on the relabelled data
   - Set h_t to either g_t or −sign(H_{t-1}), whichever has higher (estimated) correlation with the relabelled data
   - γ_t = (1/m) ∑_i h_t(x_i) y_i w_t(x_i, y_i)   (sum over the t-th batch)
   - H_t = H_{t-1} + γ_t h_t
(Output: the best of sign(H₁), ..., sign(H_T); cf. the analysis.)
Relabelling idea: [Kalai, K, Mansour '09]
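A minimal runnable Python sketch of the loop above, under some assumptions not in the slides: data arrives as numpy arrays split into T batches of m examples with labels in {−1, +1}, weak_learner is a black-box callable returning a {−1, +1}-valued vectorised hypothesis, and the sampling-based relabelling and the rule for choosing between g_t and −sign(H_{t−1}) are illustrative choices rather than the authors' exact implementation.

```python
import numpy as np

def phi_prime(z):
    # Derivative of the potential: phi(z) = 1 - z for z <= 0, e^{-z} for z > 0.
    return np.where(z <= 0, -1.0, -np.exp(-np.maximum(z, 0.0)))

def agnostic_boost(batches, weak_learner, rng=None):
    """batches: list of (X_t, y_t) pairs, one batch of m labelled examples per round, y in {-1,+1}.
    weak_learner(X, y): black box returning a vectorised hypothesis h with h(X) in {-1,+1}.
    Returns (hyps, coeffs) so that H(x) = sum_t coeffs[t] * hyps[t](x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    hyps, coeffs = [], []

    def H_eval(X, hs, cs):
        # Evaluate a fixed combination sum_j cs[j] * hs[j](X) (zero if empty).
        return sum(c * h(X) for c, h in zip(cs, hs)) if hs else np.zeros(len(X))

    for X, y in batches:
        hs, cs = list(hyps), list(coeffs)             # snapshot of H_{t-1}
        w = -phi_prime(y * H_eval(X, hs, cs))         # relabelling weights, in (0, 1]
        # Relabel: keep (x, y) with probability (1 + w)/2, otherwise flip the label.
        keep = rng.random(len(y)) < (1.0 + w) / 2.0
        y_relabelled = np.where(keep, y, -y)

        g_t = weak_learner(X, y_relabelled)
        h_neg = lambda Z, hs=hs, cs=cs: -np.sign(H_eval(Z, hs, cs))

        # Empirical cor(h, R_{D,w}) = mean of h(x) * y * w(x, y) over the batch.
        corr = lambda h: float(np.mean(h(X) * y * w))
        h_t = g_t if corr(g_t) >= corr(h_neg) else h_neg
        gamma_t = corr(h_t)                           # step size gamma_t

        hyps.append(h_t)
        coeffs.append(gamma_t)
    return hyps, coeffs
```

Note the contrast with reweighting boosters such as AdaBoost: here the weights w only drive the probabilistic relabelling of y; the marginal distribution over x is never changed.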
Potential Function
φ(z) = 1 − z if z ≤ 0; φ(z) = e^{−z} otherwise
Φ(H, D) = E_{(x,y)~D}[φ(yH(x))]
Φ(H₀, D) = 1; Φ ≥ 0
(Same potential function as MadaBoost)
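A small derived fact (not spelled out on the slide) that drives the relabelling weights: φ'(z) = −1 if z ≤ 0, and φ'(z) = −e^{−z} otherwise, so w(x, y) = −φ'(yH(x)) ∈ (0, 1], with w(x, y) = 1 exactly when yH(x) ≤ 0, i.e., when sign(H) does not correctly classify (x, y).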
Proof Strategy
- Relabelling weights: w(x, y) = −φ'(yH(x))
- Relabelled distribution R_{D,w}: draw (x, y) from D; return (x, y) with probability (1 + w(x, y))/2, else return (x, −y)
- cor(h, R_{D,w}) = E_{(x,y)~D}[h(x) y w(x, y)]
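The last identity is a one-line check from the definition of R_{D,w} (not on the slide): cor(h, R_{D,w}) = E_{(x,y)~D}[ ((1 + w(x,y))/2)·h(x)y + ((1 − w(x,y))/2)·h(x)(−y) ] = E_{(x,y)~D}[h(x) y w(x, y)].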
Analysis Sketch...
Lemma: For any x, δ ∈ ℝ, |φ(x + δ) − φ(x) − φ'(x)δ| ≤ δ²/2.
Let H: X → ℝ, h: X → [−1, 1], γ ∈ ℝ, and let μ be a distribution over X. Relabel according to w(x, y) = −φ'(yH(x)). Then, pointwise,
φ(yH(x)) − φ(yH(x) + γ y h(x)) ≥ γ h(x) y (−φ'(yH(x))) − γ²/2.
Taking expectations over (x, y) ~ D gives the required result:
Φ(H, D) − Φ(H + γh, D) ≥ γ cor(h, R_{D,w}) − γ²/2.
Analysis Sketch...
Claim: For a distribution D over X × {−1, 1}, c, h: X → {−1, 1}, and a relabelling function w(x, y) satisfying w(x, −h(x)) = 1:
cor(c, R_{D,w}) − cor(h, R_{D,w}) ≥ cor(c, D) − cor(h, D)
Intuition: only data points correctly classified by h are (possibly) relabelled, so the advantage of c over h can only increase.
In our setting h = sign(H_t): if y = −sign(H_t(x)) then yH_t(x) ≤ 0, so w(x, y) = −φ'(yH_t(x)) = 1 and the claim applies.
Analysis Sketch...
Suppose cor(sign(H_t), D) ≤ opt − ε − (ε₀/γ), and let c ∈ C achieve opt.
Relabel using w(x, y) = −φ'(yH_t(x)); by the previous claim, cor(c, R_{D,w}) − cor(sign(H_t), R_{D,w}) ≥ ε + (ε₀/γ).
- If cor(c, R_{D,w}) ≥ ε/2 + ε₀/γ, then weak learning gives g_t with cor(g_t, R_{D,w}) ≥ γ·cor(c, R_{D,w}) − ε₀ ≥ γε/2.
- Else cor(−sign(H_t), R_{D,w}) ≥ ε/2.
Analysis Sketch...
So we can always find a hypothesis h_t satisfying cor(h_t, R_{D,w}) ≥ εγ/2.
Each such step reduces the potential by at least Ω((εγ)²).
Hence in fewer than T = O(1/(εγ)²) steps it must be the case that cor(sign(H_t), D) ≥ opt − ε − (ε₀/γ).
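Why the potential drops by Ω((εγ)²) (a short check combining the two previous slides, ignoring estimation error in γ_t): with γ_t ≈ cor(h_t, R_{D,w}) ≥ εγ/2, the bound Φ(H_{t−1}, D) − Φ(H_{t−1} + γ_t h_t, D) ≥ γ_t·cor(h_t, R_{D,w}) − γ_t²/2 gives a drop of at least γ_t²/2 ≥ (εγ)²/8; since Φ starts at 1 and stays ≥ 0, this can happen at most 8/(εγ)² = O(1/(εγ)²) times.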
Applications
- Finding low-degree Fourier coefficients is a weak learner for halfspaces under the uniform distribution [Klivans, O'Donnell, Servedio '04] → get agnostic halfspace algorithm [Kalai, Klivans, Mansour, Servedio '05]
- The Goldreich-Levin / Kushilevitz-Mansour algorithm for parities is a weak learner for decision trees → get agnostic decision tree learning algorithm [Gopalan, Kalai, Klivans '08]
Applications
- Agnostically learning C under a fixed distribution μ gives PAC learning of disjunctions of C under the same distribution μ [Kalai, K, Mansour '09]
- Agnostically learning decision trees gives a PAC learning algorithm for DNF [Jackson '95]
(Some) History of Boosting Algorithms
- AdaBoost [Freund and Schapire '95] works in practice! Also simple, adaptive, and has other nice properties.
- Random noise worsens its performance considerably.
- MadaBoost [Domingo and Watanabe '00] corrects for this somewhat by limiting the penalty for wrong labels.
(Some) History of Boosting Algorithms
Random classification noise:
- Boosting using branching programs [Kalai and Servedio '03]
- No potential-based boosting algorithm handles it [Long and Servedio '09]
(Some) History of Boosting Algorithms
- Agnostic Boosting [Ben-David, Long, Mansour '01]
  - different definition of weak learner
  - no direct comparison; full boosting not possible
- Agnostic Boosting and Parity Learning [Kalai, Mansour and Verbin '08]
  - uses branching programs
  - algorithm to learn parity with noise under all distributions in time 2^{O(n/log n)}
Conclusion
- Simple potential-function based boosting algorithm for agnostic learning
- "Right" definition of weak agnostic learning
- Boosting without changing distributions
- Applications: simplifies agnostic learning algorithms for halfspaces and decision trees; a different way to view PAC-learning DNFs
Boosting
- W is a γ-weak learner for concept class C for all distributions μ
- Can W be used as a black box to strongly learn C under every distribution μ? [Kearns and Valiant '88]
- Yes: boosting [Schapire '89, Freund '90, Freund and Schapire '95]
Agnostic Learning Halfspaces
- The low-degree algorithm is a (1/n^d, ε₀)-agnostic weak learner for halfspaces under the uniform distribution over the Boolean cube [Klivans-O'Donnell-Servedio '04]
- Halfspaces over the Boolean cube are agnostically learnable using examples only, for constant ε [Kalai-Klivans-Mansour-Servedio '05]
Learning Decision Trees
- The Kushilevitz-Mansour (Goldreich-Levin) algorithm for learning parities using membership queries is a (1/t, ε₀)-agnostic weak learner for decision trees [Kushilevitz-Mansour '91]
- Decision trees can be agnostically learned under the uniform distribution using membership queries [Gopalan-Kalai-Klivans '08]
PAC Learning DNFs
- Theorem: If C is agnostically learnable under distribution μ, then disjunctions of C are PAC-learnable under μ [Kalai-K-Mansour '09]
- DNFs are PAC-learnable under the uniform distribution using membership queries [Jackson '95]
PAC Learning
- Instance space X, distribution μ over X
- Concept class C, target concept c ∈ C
- Oracle access
PAC Learning
- Output hypothesis h
- error = Pr_{x~μ}[h(x) ≠ c(x)]
- With high probability (1 − δ), h is approximately correct (error ≤ ε)
PAC Learning
Instance space X, distribution μ over X, concept class C over X
cor_μ(h, c) = E_{x~μ}[h(x)c(x)] = 1 − 2·err_μ(h, c)
C is (strongly) PAC learnable under distribution μ if for all c ∈ C and ε, δ > 0 there is an algorithm A which, in polynomial time, outputs h such that cor_μ(h, c) ≥ 1 − ε (equivalently, err_μ(h, c) ≤ ε/2) with probability at least 1 − δ.
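The identity above is a one-line check (not on the slide): for h, c ∈ {−1, 1}, h(x)c(x) = 1 − 2·1[h(x) ≠ c(x)], so taking expectations over x ~ μ gives cor_μ(h, c) = 1 − 2·err_μ(h, c); thus cor_μ(h, c) ≥ 1 − ε is the same as err_μ(h, c) ≤ ε/2.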
Weak PAC Learning
Instance space X, distribution μ over X, concept class C over X
Algorithm W is a γ-weak PAC learner for distribution μ if it outputs a hypothesis w such that cor_μ(w, c) ≥ γ, i.e., err_μ(w, c) ≤ (1 − γ)/2: somewhat better than random guessing.
Key def: Weak Agnostic Learning
opt = min_{c ∈ C} Pr_{(x,y)~D}[c(x) ≠ y]
cor(c, D) = E_{(x,y)~D}[c(x)y] = E_{x~μ}[c(x)f(x)]
cor(C, D) = max_{c ∈ C} cor(c, D) = 1 − 2·opt
cor(C, D) may be as low as 0
Algorithm W is a (γ, ε₀)-weak agnostic learner for distribution μ if, for every f: X → [-1, 1] and access to D = (μ, f), it outputs a hypothesis w such that cor(w, D) ≥ γ·cor(C, D) − ε₀.