CSCI B609: "Foundations of Data Science"
Lecture 13/14: Gradient Descent, Boosting and Learning from Experts. Grigory Yaroslavtsev
Constrained Convex Optimization
Non-convex optimization is NP-hard: $\sum_i x_i^2 (1 - x_i)^2 = 0 \iff \forall i\colon x_i \in \{0,1\}$
Knapsack: minimize $\sum_i c_i x_i$ subject to $\sum_i w_i x_i \le W$
Convex optimization can often be solved by the ellipsoid algorithm in $\mathrm{poly}(n)$ time, but that is too slow in practice.
Convex multivariate functions
Convexity: $\forall x, y \in \mathbb{R}^n\colon f(x) \ge f(y) + (x - y)^T \nabla f(y)$
$\forall x, y \in \mathbb{R}^n,\ 0 \le \lambda \le 1\colon f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$
If higher derivatives exist: $f(x) = f(y) + \nabla f(y) \cdot (x - y) + \tfrac{1}{2}(x - y)^T \nabla^2 f(y) (x - y) + \dots$
$(\nabla^2 f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is the Hessian matrix.
$f$ is convex iff its Hessian is positive semidefinite, i.e. $y^T \nabla^2 f\, y \ge 0$ for all $y$.
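To make the Hessian criterion concrete, here is a minimal Python/NumPy sketch (the quadratic function and all names are our own illustrative choices, not from the lecture) that checks positive semidefiniteness numerically:

    import numpy as np

    # f(x) = x^T A x with A = M^T M is convex: its Hessian 2A is positive semidefinite.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 3))
    A = M.T @ M                                  # p.s.d. by construction

    hessian = 2 * A                              # Hessian of f(x) = x^T A x (A symmetric)
    eigvals = np.linalg.eigvalsh(hessian)
    print("smallest eigenvalue:", eigvals.min()) # >= 0 up to rounding, so f is convex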
Examples of convex functions
$\ell_p$-norm is convex for $1 \le p \le \infty$: $\|\lambda x + (1-\lambda) y\|_p \le \|\lambda x\|_p + \|(1-\lambda) y\|_p = \lambda \|x\|_p + (1-\lambda)\|y\|_p$
$f(x) = \log(e^{x_1} + e^{x_2} + \dots + e^{x_n})$: $\max(x_1, \dots, x_n) \le f(x) \le \max(x_1, \dots, x_n) + \log n$
$f(x) = x^T A x$ where $A$ is a p.s.d. matrix; $\nabla^2 f = 2A$
Examples of constrained convex optimization:
(Linear equations with p.s.d. constraints): minimize $\tfrac{1}{2} x^T A x - b^T x$ (the solution satisfies $Ax = b$)
(Least squares regression): minimize $\|Ax - b\|_2^2 = x^T A^T A x - 2 b^T A x + b^T b$
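To make the least-squares example concrete, here is a minimal NumPy sketch (synthetic data, names ours) checking that the minimizer of $\|Ax - b\|_2^2$ satisfies the normal equations $A^T A x = A^T b$:

    import numpy as np

    # Least squares as convex optimization: minimize ||Ax - b||_2^2 on synthetic data.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 4))
    b = rng.standard_normal(50)

    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimizer of ||Ax - b||^2
    residual = A.T @ A @ x_star - A.T @ b               # normal equations A^T A x = A^T b
    print("||A^T A x* - A^T b|| =", np.linalg.norm(residual))  # ~ 0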
Constrained Convex Optimization
General formulation for convex $f$ and a convex set $K$: minimize $f(x)$ subject to $x \in K$
Example (SVMs):
Data: $e_1, \dots, e_n \in \mathbb{R}^d$ labeled by $y_1, \dots, y_n \in \{-1, 1\}$ (spam / non-spam)
Find a linear model: $w \cdot e_i \ge 1 \Rightarrow e_i$ is spam; $w \cdot e_i \le -1 \Rightarrow e_i$ is non-spam
$\forall i\colon 1 - y_i\, (w^T e_i) \le 0$
More robust version: minimize $\sum_i \mathrm{Loss}(1 - y_i\, w^T e_i) + \lambda \|w\|_2^2$, e.g. with the hinge loss $\mathrm{Loss}(t) = \max(0, t)$ (a sketch of this objective follows below)
Another regularizer: $\lambda \|w\|_1$ (favors sparse solutions)
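Here is a minimal Python sketch of the regularized hinge-loss objective and a subgradient for it (synthetic data and all names are ours; this illustrates the formulation above, not the course's reference code):

    import numpy as np

    def svm_objective(w, E, y, lam):
        """Regularized hinge loss: sum_i max(0, 1 - y_i * w.e_i) + lam * ||w||_2^2."""
        margins = 1 - y * (E @ w)
        return np.maximum(0.0, margins).sum() + lam * np.dot(w, w)

    def svm_subgradient(w, E, y, lam):
        """A subgradient of the objective above (the hinge loss is not differentiable at 0)."""
        active = ((1 - y * (E @ w)) > 0).astype(float)   # examples violating the margin
        return -(E.T @ (active * y)) + 2 * lam * w

    # Tiny synthetic example.
    rng = np.random.default_rng(2)
    E = rng.standard_normal((20, 3))                     # rows are the e_i
    y = np.sign(E[:, 0] + 0.1 * rng.standard_normal(20))
    w = np.zeros(3)
    print(svm_objective(w, E, y, lam=0.1))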
Gradient Descent for Constrained Convex Optimization
(Projection): for $x \notin K$, take $y \in K$ with $y = \arg\min_{z \in K} \|z - x\|_2$
Easy to compute for the unit $\ell_2$-ball: $y = x / \|x\|_2$
Let $\|\nabla f(x)\|_2 \le G$ and $\max_{x, y \in K} \|x - y\|_2 \le D$. Let $T = 4 D^2 G^2 / \varepsilon^2$.
Gradient descent (gradient + projection oracles): let $\eta = D / (G \sqrt{T})$ and repeat for $i = 0, \dots, T$:
$y^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})$
$x^{(i+1)} = $ projection of $y^{(i+1)}$ onto $K$
Output $z = \frac{1}{T} \sum_i x^{(i)}$ (a runnable sketch of this loop follows below)
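A minimal Python sketch of the projected gradient descent loop above, specialized (our assumption, for concreteness) to $K$ = the unit $\ell_2$-ball so that the projection oracle is the rescaling mentioned on the slide:

    import numpy as np

    def project_unit_ball(x):
        """Projection onto K = {x : ||x||_2 <= 1}: rescale if x lies outside."""
        norm = np.linalg.norm(x)
        return x if norm <= 1.0 else x / norm

    def projected_gradient_descent(grad_f, x0, D, G, eps):
        """Averaged projected GD with T = 4 D^2 G^2 / eps^2 and eta = D / (G sqrt(T))."""
        T = int(np.ceil(4 * D**2 * G**2 / eps**2))
        eta = D / (G * np.sqrt(T))
        x, iterates = x0, []
        for _ in range(T):
            iterates.append(x)
            y = x - eta * grad_f(x)              # gradient step
            x = project_unit_ball(y)             # projection step
        return np.mean(iterates, axis=0)         # z = (1/T) sum_i x^{(i)}

    # Example: minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
    c = np.array([2.0, 0.0])
    z = projected_gradient_descent(lambda x: 2 * (x - c), np.zeros(2), D=2.0, G=6.0, eps=0.5)
    print(z)                                     # near the constrained minimizer (1, 0)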
Gradient Descent for Constrained Convex Optimization
$\|x^{(i+1)} - x^*\|^2 \le \|y^{(i+1)} - x^*\|^2 = \|x^{(i)} - x^* - \eta \nabla f(x^{(i)})\|^2 = \|x^{(i)} - x^*\|^2 + \eta^2 \|\nabla f(x^{(i)})\|^2 - 2\eta\, \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$
Rearranging and using the definition of $G$:
$\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) \le \frac{1}{2\eta}\left(\|x^{(i)} - x^*\|^2 - \|x^{(i+1)} - x^*\|^2\right) + \frac{\eta}{2} G^2$
By convexity $f(x^{(i)}) - f(x^*) \le \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$, so
$f(x^{(i)}) - f(x^*) \le \frac{1}{2\eta}\left(\|x^{(i)} - x^*\|^2 - \|x^{(i+1)} - x^*\|^2\right) + \frac{\eta}{2} G^2$
Sum over $i = 1, \dots, T$ (the first terms telescope):
$\sum_{i=1}^T \left(f(x^{(i)}) - f(x^*)\right) \le \frac{1}{2\eta}\left(\|x^{(0)} - x^*\|^2 - \|x^{(T)} - x^*\|^2\right) + \frac{T\eta}{2} G^2$
Gradient Descent for Constrained Convex Optimization
$\sum_{i=1}^T \left(f(x^{(i)}) - f(x^*)\right) \le \frac{1}{2\eta}\left(\|x^{(0)} - x^*\|^2 - \|x^{(T)} - x^*\|^2\right) + \frac{T\eta}{2} G^2$
By convexity $f\!\left(\frac{1}{T}\sum_i x^{(i)}\right) \le \frac{1}{T}\sum_i f(x^{(i)})$, so
$f\!\left(\frac{1}{T}\sum_i x^{(i)}\right) - f(x^*) \le \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2$
Set $\eta = \frac{D}{G\sqrt{T}}$ $\Rightarrow$ RHS $\le \frac{DG}{\sqrt{T}} \le \varepsilon$
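As a concrete, purely illustrative instance of these bounds: with $D = 2$, $G = 1$ and target accuracy $\varepsilon = 0.1$, the prescribed number of steps is $T = 4 D^2 G^2 / \varepsilon^2 = 1600$, the step size is $\eta = D / (G\sqrt{T}) = 2/40 = 0.05$, and the guarantee is $f(z) - f(x^*) \le DG/\sqrt{T} = 2/40 = 0.05 \le \varepsilon$.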
Online Gradient Descent
Gradient descent works in a more general, online setting:
A sequence of convex functions $f_1, f_2, \dots, f_T$ arrives one at a time.
At step $i$ we need to output $x^{(i)} \in K$ before seeing $f_i$.
Let $x^*$ be the minimizer of $\sum_i f_i(w)$ over $K$.
Minimize regret: $\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)$
The same analysis as before works in the online case (a small sketch follows below).
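A minimal Python sketch of this online protocol (our choices: $K$ is the unit $\ell_2$-ball, and the stream consists of squared-loss functions $f_i(x) = (x \cdot a_i - b_i)^2$ on synthetic data):

    import numpy as np

    def project_unit_ball(x):
        norm = np.linalg.norm(x)
        return x if norm <= 1.0 else x / norm

    def online_gradient_descent(A, b, eta):
        """OGD for the stream f_i(x) = (x . a_i - b_i)^2 over the unit ball; returns per-step losses."""
        x = np.zeros(A.shape[1])
        losses = []
        for a_i, b_i in zip(A, b):
            losses.append((x @ a_i - b_i) ** 2)      # pay f_i(x^{(i)}) before updating
            grad = 2 * (x @ a_i - b_i) * a_i         # gradient of f_i at x^{(i)}
            x = project_unit_ball(x - eta * grad)    # gradient step + projection
        return losses

    rng = np.random.default_rng(3)
    T, d = 500, 3
    A = rng.standard_normal((T, d))
    b = A @ np.array([0.5, -0.2, 0.1])               # a fixed comparator inside the ball
    losses = online_gradient_descent(A, b, eta=0.05)
    print("avg loss, first vs last 100 steps:", np.mean(losses[:100]), np.mean(losses[-100:]))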
Stochastic Gradient Descent
(Expected gradient oracle): returns $g$ such that $\mathbb{E}[g] = \nabla f(x)$.
Example: for SVM, pick one term of the loss function at random (a sketch follows below).
Let $g_i$ be the gradient returned at step $i$.
Let $f_i(x) = g_i^T x$ be the function used in the $i$-th step of OGD.
Let $z = \frac{1}{T}\sum_i x^{(i)}$ and let $x^*$ be the minimizer of $f$.
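A minimal Python sketch of such an oracle for the regularized hinge-loss SVM objective (synthetic data, names ours). To make the estimate unbiased for the full sum, we scale the randomly chosen term by $n$, which is our addition:

    import numpy as np

    def full_gradient(w, E, y, lam):
        """(Sub)gradient of sum_i max(0, 1 - y_i w.e_i) + lam ||w||^2."""
        active = ((1 - y * (E @ w)) > 0).astype(float)
        return -(E.T @ (active * y)) + 2 * lam * w

    def stochastic_gradient(w, E, y, lam, rng):
        """Expected-gradient oracle: one loss term chosen uniformly at random, scaled by n."""
        n = len(y)
        i = rng.integers(n)
        g_i = -n * y[i] * E[i] if 1 - y[i] * (E[i] @ w) > 0 else np.zeros_like(w)
        return g_i + 2 * lam * w                 # expectation equals full_gradient(w, ...)

    rng = np.random.default_rng(4)
    E = rng.standard_normal((100, 3))
    y = np.sign(E[:, 0])
    w = rng.standard_normal(3)
    est = np.mean([stochastic_gradient(w, E, y, 0.1, rng) for _ in range(20000)], axis=0)
    print("averaged oracle output:", np.round(est, 2))            # roughly agrees with the
    print("true gradient:         ", np.round(full_gradient(w, E, y, 0.1), 2))  # full gradient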
Stochastic Gradient Descent
Thm. $\mathbb{E}[f(z)] \le f(x^*) + \frac{DG}{\sqrt{T}}$, where $G$ is an upper bound on the norm of any gradient output by the oracle.
$\mathbb{E}[f(z)] - f(x^*) \le \frac{1}{T} \sum_i \mathbb{E}\!\left[f(x^{(i)}) - f(x^*)\right]$  (convexity)
$\le \frac{1}{T} \sum_i \mathbb{E}\!\left[\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)\right]$  (convexity)
$= \frac{1}{T} \sum_i \mathbb{E}\!\left[g_i^T (x^{(i)} - x^*)\right]$  (gradient oracle)
$= \frac{1}{T} \sum_i \mathbb{E}\!\left[f_i(x^{(i)}) - f_i(x^*)\right] = \frac{1}{T} \mathbb{E}\!\left[\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)\right]$
The expression inside $\mathbb{E}[\cdot]$ is the regret of OGD, which by the earlier analysis is at most $DG\sqrt{T}$; dividing by $T$ gives $\frac{DG}{\sqrt{T}} \le \varepsilon$.
VC-dim of combinations of concepts
For $k$ concepts $h_1, \dots, h_k$ and a Boolean function $f$:
$\mathrm{COMB}_f(h_1, \dots, h_k) = \{x \in X : f(h_1(x), \dots, h_k(x)) = 1\}$
Ex: $H$ = linear separators, $f = \mathrm{AND}$ / $f = \mathrm{Majority}$
For a concept class $H$ and a Boolean function $f$:
$\mathrm{COMB}_{f,k}(H) = \{\mathrm{COMB}_f(h_1, \dots, h_k) : h_i \in H\}$
Lem. If VC-dim$(H) = d$, then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) \le O(kd \log(kd))$
(A small illustration of such a combined concept follows below.)
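A minimal Python sketch of a combined concept in the sense above: $k = 3$ linear separators combined with $f = \mathrm{Majority}$ (the particular separators are arbitrary illustrative choices):

    import numpy as np

    def linear_separator(w, b):
        """Concept h(x) = 1 iff w.x + b >= 0."""
        return lambda x: int(np.dot(w, x) + b >= 0)

    def comb_majority(concepts):
        """COMB_f(h_1, ..., h_k) with f = Majority: x is positive iff most h_i label it 1."""
        k = len(concepts)
        return lambda x: int(sum(h(x) for h in concepts) > k / 2)

    # Three separators in the plane; their majority defines a new, more expressive concept.
    hs = [linear_separator(np.array([1.0, 0.0]), 0.0),
          linear_separator(np.array([0.0, 1.0]), 0.0),
          linear_separator(np.array([1.0, 1.0]), -1.0)]
    g = comb_majority(hs)
    print(g(np.array([2.0, 2.0])), g(np.array([-2.0, -2.0])))   # 1 0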
VC-dim of combinations of concepts
Lem. If VC-dim$(H) = d$, then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) \le O(kd \log(kd))$
Let $n = $ VC-dim$(\mathrm{COMB}_{f,k}(H))$ $\Rightarrow$ there is a set $S$ of $n$ points shattered by $\mathrm{COMB}_{f,k}(H)$
Sauer's lemma $\Rightarrow$ at most $n^d$ ways of labeling $S$ by $H$
Each labeling in $\mathrm{COMB}_{f,k}(H)$ is determined by the $k$ labelings of $S$ by $h_1, \dots, h_k$ $\Rightarrow$ at most $(n^d)^k = n^{kd}$ labelings
$2^n \le n^{kd} \Rightarrow n \le kd \log n \Rightarrow n \le 2kd \log(kd)$
Back to the batch setting
Classification problem
Instance space $X$: $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors)
Classification: come up with a mapping $X \to \{0,1\}$
Formalization:
Assume there is a probability distribution $D$ over $X$
$c^*$ = "target concept" (the set $c^* \subseteq X$ of positive instances)
Given labeled i.i.d. samples from $D$, produce $h \subseteq X$
Goal: have $h$ agree with $c^*$ over the distribution $D$
Minimize: $\mathrm{err}_D(h) = \Pr_D[h\, \Delta\, c^*]$, the "true" or "generalization" error
(A small sketch of estimating this error follows below.)
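Since $D$ is only accessible through samples, the generalization error is typically estimated on held-out data. A minimal Python sketch, with an artificial $D$, $c^*$ and $h$ of our own choosing:

    import numpy as np

    rng = np.random.default_rng(5)

    c_star = lambda x: x[0] + x[1] > 0        # target concept: a halfplane (illustrative)
    h      = lambda x: x[0] > 0               # a hypothesis that only roughly agrees with c*

    # Estimate err_D(h) = Pr_D[h(x) != c*(x)] with fresh samples from D = standard Gaussian on R^2.
    samples = rng.standard_normal((100000, 2))
    errors = [h(x) != c_star(x) for x in samples]
    print("estimated err_D(h):", np.mean(errors))   # about 0.25 for this D, c*, h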
Boosting
Strong learner: succeeds with prob. $\ge 1 - \delta$
Weak learner: succeeds with prob. $\ge \frac{1}{2} + \gamma$
Boosting (informal): a weak learner that works under any distribution $\Rightarrow$ a strong learner
Idea: run the weak learner $A$ on sample $S$ under reweightings that focus on misclassified examples
Boosting (cont.)
$H$ = class of hypotheses produced by $A$
Apply the majority rule to $h_1, \dots, h_{t_0} \in H$: VC-dim $\le O\!\left(t_0\, \mathrm{VC\text{-}dim}(H) \log(t_0\, \mathrm{VC\text{-}dim}(H))\right)$
Algorithm:
Given $S = (x_1, \dots, x_n)$, set $w_i = 1$ in $w = (w_1, \dots, w_n)$
For $t = 1, \dots, t_0$ do:
Call the weak learner on $(S, w)$ $\Rightarrow$ hypothesis $h_t$
For each misclassified $x_i$, multiply $w_i$ by $\alpha = \left(\tfrac{1}{2} + \gamma\right) / \left(\tfrac{1}{2} - \gamma\right)$
Output: MAJ$(h_1, \dots, h_{t_0})$
(A runnable sketch of this loop follows below.)
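A minimal Python sketch of this boosting loop. The "weak learner" here is a decision stump chosen to maximize weighted accuracy, which is our own assumption (the slides leave the weak learner abstract); gamma is likewise just an illustrative value:

    import numpy as np

    def stump_weak_learner(X, y, w):
        """Weak learner (illustrative choice): best threshold on one coordinate under weights w."""
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - thr) >= 0, 1, 0)
                    acc = w[pred == y].sum() / w.sum()
                    if best is None or acc > best[0]:
                        best = (acc, j, thr, sign)
        _, j, thr, sign = best
        return lambda Z: np.where(sign * (Z[:, j] - thr) >= 0, 1, 0)

    def boost(X, y, t0, gamma):
        """Reweighting boosting from the slide: multiply weights of misclassified points by alpha."""
        w = np.ones(len(y))
        alpha = (0.5 + gamma) / (0.5 - gamma)
        hs = []
        for _ in range(t0):
            h = stump_weak_learner(X, y, w)
            hs.append(h)
            w = np.where(h(X) != y, w * alpha, w)        # boost the weight of mistakes
        return lambda Z: (np.mean([h(Z) for h in hs], axis=0) > 0.5).astype(int)  # MAJ(h_1..h_t0)

    # Tiny example: a diagonal labeling rule that no single axis-aligned stump matches.
    rng = np.random.default_rng(6)
    X = rng.standard_normal((200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = boost(X, y, t0=40, gamma=0.1)
    print("training error:", np.mean(clf(X) != y))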
Boosting: analysis
Def ($\gamma$-weak learner on a sample): for labeled examples $x_i$ weighted by $w_i$, the returned hypothesis has weight of correctly classified examples $\ge \left(\tfrac{1}{2} + \gamma\right) \sum_{i=1}^n w_i$
Thm. If $A$ is a $\gamma$-weak learner on $S$, then for $t_0 = O\!\left(\tfrac{1}{\gamma^2} \log n\right)$ boosting achieves 0 error on $S$.
Proof. Let $m$ = # mistakes of the final classifier.
Each mistaken point was misclassified $\ge t_0/2$ times $\Rightarrow$ its weight is $\ge \alpha^{t_0/2}$
Total weight $\ge m\, \alpha^{t_0/2}$
Let $W(t)$ be the total weight at time $t$:
$W(t+1) \le \left(\left(\tfrac{1}{2} + \gamma\right) + \alpha\left(\tfrac{1}{2} - \gamma\right)\right) W(t) = (1 + 2\gamma)\, W(t)$
Boosting: analysis (cont.)
$W(0) = n \Rightarrow W(t_0) \le n (1 + 2\gamma)^{t_0}$
$m\, \alpha^{t_0/2} \le W(t_0) \le n (1 + 2\gamma)^{t_0}$
$\alpha = \left(\tfrac{1}{2} + \gamma\right) / \left(\tfrac{1}{2} - \gamma\right) = (1 + 2\gamma)/(1 - 2\gamma)$
$m \le n (1 - 2\gamma)^{t_0/2} (1 + 2\gamma)^{t_0/2} = n (1 - 4\gamma^2)^{t_0/2}$
$1 - x \le e^{-x} \Rightarrow m \le n\, e^{-2\gamma^2 t_0} \Rightarrow t_0 = O\!\left(\tfrac{1}{\gamma^2} \log n\right)$ suffices for $m < 1$
Comments:
Applies even if the weak learners are adversarial
VC-dim bounds $\Rightarrow$ a sample of size $m = O\!\left(\tfrac{1}{\varepsilon} \cdot \tfrac{\mathrm{VC\text{-}dim}(H)}{\gamma^2}\right)$ (up to log factors) suffices
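For a concrete, purely illustrative instance of the bound: with $\gamma = 0.1$ and $n = 1000$, requiring $n\, e^{-2\gamma^2 t_0} < 1$ means $t_0 > \frac{\ln n}{2\gamma^2} = \frac{\ln 1000}{0.02} \approx 346$ rounds of boosting suffice for zero training error.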