
1 Potential-Based Agnostic Boosting Varun Kanade Harvard University (joint work with Adam Tauman Kalai (Microsoft NE))

2 Outline PAC learning, agnostic learning, boosting; the boosting algorithm and its applications; (some) history of boosting algorithms

3 Learning... PAC Learning [Valiant '84]: learning from examples, e.g. halfspaces; learning from membership queries, e.g. DNF: (x_1 ∧ x_7 ∧ x_13) ∨ (x_2 ∧ x_7) ∨ (x_3 ∧ x_8 ∧ x_10). Agnostic Learning [Haussler '92; Kearns, Schapire, Sellie '94]: no assumptions about the correctness of labels; a challenging noise model. [figure: labelled positive and negative examples]

4 Different Learning Scenarios PAC learning; random classification noise PAC learning; agnostic learning. [figure: labelled datasets illustrating each scenario]

5 Boosting... Weak learning → strong learning (accuracy 51% → accuracy 99%). Great for proving theory results [Schapire '89]. Potential-based algorithms work great in practice, e.g. AdaBoost [Freund and Schapire '95]. But they suffer in the presence of noise.

6 Boosting at work... H = h_1 [figure: labelled data with the first weak classifier]

7 Boosting at work... H = a_1 h_1 + a_2 h_2 [figure: labelled data after adding a second weak classifier]

8 Boosting at work... H = a_1 h_1 + a_2 h_2 + a_3 h_3 [figure: labelled data after adding a third weak classifier]

9 Boosting at work... H = a_1 h_1 + a_2 h_2 + a_3 h_3 [figure: the combined classifier on the labelled data]

10 Boosting Algorithms Repeat: find a weakly accurate hypothesis; change the weights of the examples. Finally, take a "weighted majority" of the weak classifiers to obtain a highly accurate hypothesis. Guaranteed to work when the labelling is correct. (A generic sketch appears below.)
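To make the loop above concrete, here is a minimal Python sketch of the generic boosting template, not the algorithm of this talk (that one is on slide 18). The callables weak_learner and reweight are hypothetical placeholders for whatever weak learning and reweighting scheme a particular booster uses.

```python
import numpy as np

def generic_boosting(X, y, weak_learner, reweight, T):
    """Generic boosting template: repeatedly fit a weak hypothesis on
    reweighted examples, then combine the hypotheses by a weighted vote.
    weak_learner(X, y, weights) -> callable h with h(X) in {-1, +1};
    reweight(hyps, alphas, X, y) -> new example weights (both hypothetical)."""
    m = len(y)
    weights = np.ones(m) / m                       # start with uniform example weights
    hyps, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, weights)            # find a weakly accurate hypothesis
        edge = float(np.sum(weights * h(X) * y))   # weighted correlation with the labels
        hyps.append(h)
        alphas.append(edge)
        weights = reweight(hyps, alphas, X, y)     # change the weights of the examples
    # final hypothesis: weighted majority of the weak classifiers
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hyps)))
```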

11 Our Work A new, simple boosting algorithm; better theoretical guarantees in the agnostic setting; does not change the weights of examples. Application: simplifies agnostic learning results.

12 Independent Work Distribution-specific agnostic boosting. - Feldman (ICS 2010)

13 Agnostic Learning The assumption that data is perfectly labelled is unrealistic. Make no assumption on how the data is labelled. Goal: fit the data as well as the best concept from a concept class.

14 Agnostic Learning Instance space X; distribution μ over X; labelling function f : X → [-1, 1]. Together these define a distribution D = (μ, f) over X × {-1, 1} with f(x) = E_{(x,y)~D}[y | x]. The learner has oracle access to examples drawn from D. [figure: noisily labelled examples]

15 Agnostic Learning err_D(c) = Pr_{(x,y)~D}[c(x) ≠ y]; cor(c, D) = E_{(x,y)~D}[c(x) y] = E_{x~μ}[c(x) f(x)]; opt = max_{c ∈ C} cor(c, D). Find h satisfying cor(h, D) ≥ opt − ε. PAC learning is the special case when the labels match a concept in C exactly (f ∈ C, so opt = 1). [figure: labelled examples]
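On a finite sample these quantities become averages; a small Python illustration, assuming labelled examples are NumPy arrays and hypotheses are callables returning ±1 labels:

```python
import numpy as np

def empirical_err(c, X, y):
    # err_D(c) = Pr_{(x,y)~D}[c(x) != y], estimated on a sample
    return float(np.mean(c(X) != y))

def empirical_cor(c, X, y):
    # cor(c, D) = E_{(x,y)~D}[c(x) * y], estimated on a sample; equals 1 - 2 * err
    return float(np.mean(c(X) * y))
```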

16 Key def: Weak Agnostic Learning opt may be as low as 0. Algorithm W is a (γ, ε_0)-weak agnostic learner for distribution μ if, for every f : X → [-1, 1], given access to D = (μ, f) it outputs a hypothesis w such that cor(w, D) ≥ γ·opt − ε_0. A weak learner for a specific distribution gives a strong learner for the same distribution.

17 Boosting Theorem If C is (γ, ε_0)-weakly agnostically learnable under distribution μ over X, then AgnosticBoost returns a hypothesis h such that cor(h, D) ≥ opt − ε − ε_0/γ. The boosting algorithm makes O(1/(γε)²) calls to the weak learner.

18 Boosting Algorithm
Input: (x_1, y_1), ..., (x_{mT}, y_{mT})
1. Let H_0 = 0
2. For t = 1, …, T:
   - Relabel (x_{(t-1)m+1}, y_{(t-1)m+1}), …, (x_{tm}, y_{tm}) using the weights w_t(x, y)
   - Let g_t be the output of the weak learner W on the relabelled data
   - h_t = either g_t or −sign(H_{t-1})
   - γ_t = (1/m) ∑_i h_t(x_i) y_i w_t(x_i, y_i)
   - H_t = H_{t-1} + γ_t h_t

19 Boosting Algorithm (same loop as above) Relabelling idea: [Kalai, K, Mansour '09]. (A sketch in Python follows below.)
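A Python sketch of this loop, under stated assumptions: weak_learner stands in for the (γ, ε_0)-weak agnostic learner W and returns a callable mapping instances to ±1 predictions; the relabelling weights are w(x, y) = −φ'(yH(x)) as on slides 20-21; and the choice between g_t and −sign(H_{t-1}), left implicit on the slide, is made here by taking whichever has the larger weighted correlation γ_t. All function and variable names are mine, not the paper's.

```python
import numpy as np

def phi_prime(z):
    # derivative of the MadaBoost-style potential on slide 20:
    # phi(z) = 1 - z for z <= 0 and e^{-z} for z > 0, so phi'(z) is -1 or -e^{-z}
    z = np.asarray(z, dtype=float)
    return -np.exp(-np.maximum(z, 0.0))      # equals -1 for z <= 0 and -e^{-z} for z > 0

def eval_H(terms, Xq):
    # H(x) = sum_t gamma_t * h_t(x) over the terms accumulated so far
    out = np.zeros(len(Xq))
    for gamma, h in terms:
        out += gamma * h(Xq)
    return out

def agnostic_boost(X, y, weak_learner, T, m, rng=None):
    """Sketch of the loop on slides 18-19; X has m*T rows, y has labels in {-1, +1}."""
    rng = np.random.default_rng(0) if rng is None else rng
    terms = []                                          # H_0 = 0
    for t in range(T):
        Xt, yt = X[t*m:(t+1)*m], y[t*m:(t+1)*m]
        w = -phi_prime(yt * eval_H(terms, Xt))          # relabelling weights, in (0, 1]
        # relabel: keep (x, y) with probability (1 + w)/2, otherwise flip to (x, -y)
        y_rel = np.where(rng.random(m) < (1 + w) / 2, yt, -yt)
        g_t = weak_learner(Xt, y_rel)                   # weak hypothesis on relabelled data

        # h_t is either g_t or -sign(H_{t-1}); we take whichever has the larger
        # weighted correlation gamma_t = (1/m) sum_i h(x_i) y_i w(x_i, y_i)
        frozen = list(terms)                            # snapshot of H_{t-1}
        neg_sign_H = lambda Xq, fr=frozen: -np.sign(eval_H(fr, Xq) + 1e-12)
        candidates = [g_t, neg_sign_H]
        gammas = [float(np.mean(h(Xt) * yt * w)) for h in candidates]
        best = int(np.argmax(gammas))
        terms.append((gammas[best], candidates[best]))  # H_t = H_{t-1} + gamma_t * h_t

    return lambda Xq: np.sign(eval_H(terms, Xq))        # final classifier sign(H_T)
```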

20 Potential Function φ(z) = 1 − z if z ≤ 0; φ(z) = e^{−z} otherwise. Φ(H, D) = E_{(x,y)~D}[φ(yH(x))]. Φ(H_0, D) = 1; Φ ≥ 0. (Same potential function as MadaBoost.)
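A direct transcription of this potential into Python, evaluated on margins z = y·H(x); the sample average of φ(yH(x)) estimates Φ(H, D):

```python
import numpy as np

def phi(z):
    # potential from the slide: phi(z) = 1 - z for z <= 0, e^{-z} for z > 0
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 1.0 - z, np.exp(-np.maximum(z, 0.0)))

def Phi(H, X, y):
    # Phi(H, D) = E_{(x,y)~D}[phi(y * H(x))], estimated by a sample average
    return float(np.mean(phi(y * H(X))))

# sanity checks from the slide: phi(0) = 1 (so Phi(H_0, D) = 1 for H_0 = 0) and phi >= 0
z = np.linspace(-3.0, 3.0, 13)
assert np.isclose(phi(0.0), 1.0)
assert np.all(phi(z) >= 0.0)
```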

21 Proof Strategy Relabelling weights: w(x, y) = −φ'(yH(x)). The relabelling oracle R_{D,w}: draw (x, y) from D; return (x, y) with probability (1 + w(x, y))/2, else return (x, −y). Then cor(h, R_{D,w}) = E_{(x,y)~D}[h(x) y w(x, y)]. [figure: examples before and after relabelling]
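A sample-based sketch of R_{D,w} in Python, with a Monte Carlo check of the stated identity; the hypothesis values and weights below are arbitrary, made up only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel(y, w, rng):
    # R_{D,w}: keep label y with probability (1 + w)/2, otherwise flip to -y
    return np.where(rng.random(len(y)) < (1 + w) / 2, y, -y)

# Monte Carlo check of cor(h, R_{D,w}) = E[h(x) * y * w(x, y)]
m = 200_000
y = rng.choice([-1, 1], size=m)          # original labels
h = rng.choice([-1, 1], size=m)          # arbitrary hypothesis values h(x)
w = rng.random(m)                        # arbitrary weights in [0, 1]
lhs = float(np.mean(h * relabel(y, w, rng)))   # empirical cor under relabelled labels
rhs = float(np.mean(h * y * w))                # closed-form expression from the slide
print(lhs, rhs)                          # the two agree up to sampling error
```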

22 Analysis Sketch... Lemma: for any reals x and δ, |φ(x + δ) − φ(x) − φ'(x)·δ| ≤ δ²/2. Let H : X → R, h : X → [-1, 1], γ ∈ R, and let μ be a distribution over X. Relabel according to w(x, y) = −φ'(yH(x)). Pointwise, φ(yH(x)) − φ(yH(x) + γ y h(x)) ≥ γ h(x) y (−φ'(yH(x))) − γ²/2. Taking expectations we get Φ(H, D) − Φ(H + γh, D) ≥ γ·cor(h, R_{D,w}) − γ²/2.
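Written out, the "taking expectations" step is the following short computation, using the identity cor(h, R_{D,w}) = E[h(x) y w(x, y)] from slide 21 and the lemma with δ = γ y h(x) (so δ² ≤ γ²):

```latex
\begin{align*}
\Phi(H,D) - \Phi(H+\gamma h, D)
  &= \mathbb{E}_{(x,y)\sim D}\!\left[\varphi\big(yH(x)\big) - \varphi\big(yH(x) + \gamma\, y\,h(x)\big)\right] \\
  &\ge \mathbb{E}_{(x,y)\sim D}\!\left[\gamma\, h(x)\,y\,\big({-\varphi'(yH(x))}\big)\right] - \frac{\gamma^2}{2} \\
  &= \gamma\,\mathbb{E}_{(x,y)\sim D}\!\left[h(x)\,y\,w(x,y)\right] - \frac{\gamma^2}{2}
   \;=\; \gamma\,\mathrm{cor}\big(h, R_{D,w}\big) - \frac{\gamma^2}{2}.
\end{align*}
```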

23 Analysis Sketch... We only relabel data points correctly classified by h, so the advantage of c over h only increases: for any distribution D over X × {-1, 1}, any c, h : X → {-1, 1}, and any relabelling function w(x, y) satisfying w(x, −h(x)) = 1, cor(c, R_{D,w}) − cor(h, R_{D,w}) ≥ cor(c, D) − cor(h, D). (A short verification follows.)
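A brief verification of this inequality, using cor(c, R_{D,w}) = E[c(x) y w(x, y)] from slide 21 and splitting on whether y = h(x):

```latex
\begin{align*}
\big[\mathrm{cor}(c,R_{D,w}) - \mathrm{cor}(h,R_{D,w})\big]
  - \big[\mathrm{cor}(c,D) - \mathrm{cor}(h,D)\big]
  &= \mathbb{E}_{(x,y)\sim D}\!\left[\big(c(x)-h(x)\big)\,y\,\big(w(x,y)-1\big)\right] \;\ge\; 0.
\end{align*}
```

On points with y = −h(x) we have w(x, y) = 1, so the term vanishes; on points with y = h(x), both (c(x) − h(x))·y = c(x)h(x) − 1 ≤ 0 and w(x, y) − 1 ≤ 0, so their product is nonnegative.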

24 Analysis Sketch … Suppose cor(sign(H_t), D) ≤ opt − ε − ε_0/γ, and let c ∈ C achieve opt. Relabel using w(x, y) = −φ'(yH_t(x)); by the lemma on the previous slide, cor(c, R_{D,w}) − cor(sign(H_t), R_{D,w}) ≥ ε + ε_0/γ. So either cor(c, R_{D,w}) ≥ ε/2 + ε_0/γ, in which case weak learning applies, or cor(−sign(H_t), R_{D,w}) ≥ ε/2.

25 Analysis Sketch … In either case we can find a hypothesis h satisfying cor(h, R_{D,w}) ≥ γε/2, which reduces the potential by at least Ω((γε)²). Hence in fewer than T = O(1/(εγ)²) steps it must be the case that cor(sign(H_t), D) ≥ opt − ε − ε_0/γ. (See the calculation below.)
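A sketch of the arithmetic behind these three statements (constants not optimized), combining the case analysis of slide 24 with the potential-drop bound of slide 22:

```latex
% In either case some candidate h has cor(h, R_{D,w}) >= gamma*eps/2:
%   weak learning case: cor(g_t, R_{D,w}) >= gamma*(eps/2 + eps_0/gamma) - eps_0 = gamma*eps/2;
%   otherwise:          cor(-sign(H_t), R_{D,w}) >= eps/2 >= gamma*eps/2   (as gamma <= 1).
% Taking the step size gamma_t = cor(h, R_{D,w}) (which is how the algorithm sets gamma_t)
% in the bound of slide 22:
\[
\Phi(H_t, D) - \Phi(H_t + \gamma_t h, D)
  \;\ge\; \gamma_t\,\mathrm{cor}(h, R_{D,w}) - \frac{\gamma_t^2}{2}
  \;=\; \frac{\mathrm{cor}(h, R_{D,w})^2}{2}
  \;\ge\; \frac{(\gamma\varepsilon)^2}{8}.
\]
% Since \Phi(H_0, D) = 1 and \Phi >= 0, this can happen at most
% 8/(gamma*eps)^2 = O(1/(gamma*eps)^2) times.
```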

26 Applications Finding low-degree Fourier coefficients is a weak learner for halfspaces under the uniform distribution [Klivans, O'Donnell, Servedio '04]; we get the agnostic halfspace algorithm of [Kalai, Klivans, Mansour, Servedio '05]. The Goldreich-Levin / Kushilevitz-Mansour algorithm for parities is a weak learner for decision trees; we get the agnostic decision tree learning algorithm of [Gopalan, Kalai, Klivans '08].

27 Applications Agnostically learning C under a fixed distribution μ gives PAC learning of disjunctions of C under the same distribution μ [Kalai, K, Mansour '09]. Agnostically learning decision trees gives a PAC learning algorithm for DNF [Jackson '95].

28 (Some) History of Boosting Algorithms AdaBoost [Freund and Schapire '95] works in practice! It is also simple, adaptive, and has other nice properties, but random noise worsens its performance considerably. MadaBoost [Domingo and Watanabe '00] corrects for this somewhat by limiting the penalty for wrong labels.

29 (Some) History of Boosting Algorithms Random classification noise: boosting using branching programs [Kalai and Servedio '03]; no convex potential-based boosting algorithm handles it [Long and Servedio '09].

30 (Some) History of Boosting Algorithms Agnostic boosting [Ben-David, Long, Mansour '01]: a different definition of weak learner; no direct comparison, and full boosting is not possible. Agnostic boosting and parity learning [Kalai, Mansour and Verbin '08]: uses branching programs; gives an algorithm to learn parity with noise under all distributions in time 2^{O(n/log n)}.

31 Conclusion A simple potential-function-based boosting algorithm for agnostic learning; the "right" definition of weak agnostic learning; boosting without changing the distribution over examples. Applications: simplifies agnostic learning algorithms for halfspaces and decision trees, and gives a different way to view PAC learning of DNFs.

32 Boosting W is a γ-weak learner for concept class C for all distributions μ. Can W be used as a black box to strongly learn C under every distribution μ? [Kearns and Valiant '88] Yes: boosting [Schapire '89, Freund '90, Freund and Schapire '95].

33 Agnostic Learning Halfspaces The low-degree algorithm is a (1/n^d, ε_0)-agnostic weak learner for halfspaces under the uniform distribution over the boolean cube [Klivans-O'Donnell-Servedio '04]. Halfspaces over the boolean cube are agnostically learnable, using examples only, for constant ε [Kalai-Klivans-Mansour-Servedio '05].

34 Learning Decision Trees The Kushilevitz-Mansour (Goldreich-Levin) algorithm for learning parities using membership queries is a (1/t, ε_0)-agnostic weak learner for decision trees [Kushilevitz-Mansour '91]. Decision trees can be agnostically learned under the uniform distribution using membership queries [Gopalan-Kalai-Klivans '08].

35 PAC Learning DNFs Theorem: if C is agnostically learnable under distribution μ, then disjunctions of C are PAC-learnable under μ [Kalai-K-Mansour '09]. DNFs are PAC-learnable under the uniform distribution using membership queries [Jackson '95].

36 PAC Learning Instance space X; distribution μ over X; concept class C; target concept c ∈ C; oracle access to labelled examples. [figure: examples labelled by c]

37 PAC Learning Output a hypothesis h with error = Pr_{x~μ}[h(x) ≠ c(x)]. With high probability (1 − δ), h is approximately correct (error ≤ ε). [figure: labelled examples and the learned hypothesis]

38 PAC Learning Instance space X, distribution μ over X, concept class C over X. cor_μ(h, c) = E_{x~μ}[h(x) c(x)] = 1 − 2·err_μ(h, c). C is (strongly) PAC learnable under distribution μ if for all c ∈ C and all ε, δ > 0 there exists an algorithm A which, in polynomial time, outputs h such that cor_μ(h, c) ≥ 1 − ε (equivalently, err_μ(h, c) ≤ ε/2) with probability at least 1 − δ.

39 Weak PAC Learning Instance space X, distribution μ over X, concept class C over X. Algorithm W is a γ-weak PAC learner for distribution μ if it outputs a hypothesis w such that cor_μ(w, c) ≥ γ, i.e., err_μ(w, c) ≤ (1 − γ)/2: somewhat better than random guessing.

40 Key def: Weak Agnostic Learning opt = min_{c ∈ C} Pr_{(x,y)~D}[c(x) ≠ y]; cor(c, D) = E_{(x,y)~D}[c(x) y] = E_{x~μ}[c(x) f(x)]; cor(C, D) = max_{c ∈ C} cor(c, D) = 1 − 2·opt. cor(C, D) may be as low as 0. Algorithm W is a (γ, ε_0)-weak agnostic learner for distribution μ if, for every f : X → [-1, 1], given access to D = (μ, f) it outputs a hypothesis w such that cor(w, D) ≥ γ·cor(C, D) − ε_0.


