CSCI B609: “Foundations of Data Science”

Presentation transcript:

CSCI B609: "Foundations of Data Science"
Lecture 13/14: Gradient Descent, Boosting and Learning from Experts
Slides at http://grigory.us/data-science-class.html
Grigory Yaroslavtsev (http://grigory.us)

Constrained Convex Optimization
- Non-convex optimization is NP-hard: the constraint $\sum_i x_i^2 (1 - x_i)^2 = 0$ is equivalent to $\forall i: x_i \in \{0, 1\}$.
- Knapsack: minimize $\sum_i c_i x_i$ subject to $\sum_i w_i x_i \le W$.
- Convex optimization can often be solved by the ellipsoid algorithm in $\mathrm{poly}(n)$ time, but this is too slow in practice.

Convex multivariate functions
- First-order condition: $\forall x, y \in \mathbb{R}^n:\ f(x) \ge f(y) + (x - y) \cdot \nabla f(y)$.
- Zeroth-order condition: $\forall x, y \in \mathbb{R}^n,\ 0 \le \lambda \le 1:\ f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$.
- If higher derivatives exist: $f(x) = f(y) + \nabla f(y) \cdot (x - y) + (x - y)^T \nabla^2 f(x) (x - y) + \dots$, where $\nabla^2 f(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is the Hessian matrix.
- $f$ is convex iff its Hessian is positive semidefinite, i.e. $y^T \nabla^2 f\, y \ge 0$ for all $y$.
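As a quick illustration of the first-order condition above, here is a minimal numerical sanity check (a sketch assuming numpy; the quadratic $f(x) = x^T A x$ built from a random p.s.d. matrix is an illustrative choice): for a convex function, the linear approximation at any point $y$ never overestimates $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M.T @ M                        # A = M^T M is positive semidefinite

def f(x):
    return x @ A @ x               # f(x) = x^T A x

def grad_f(x):
    return 2 * A @ x               # gradient of x^T A x (Hessian is 2A, which is PSD)

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    # first-order condition: f(x) >= f(y) + grad f(y) . (x - y)
    assert f(x) >= f(y) + grad_f(y) @ (x - y) - 1e-9
print("first-order convexity condition held on all random pairs")
```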

Examples of convex functions
- The $\ell_p$-norm is convex for $1 \le p \le \infty$: $\|\lambda x + (1 - \lambda) y\|_p \le \|\lambda x\|_p + \|(1 - \lambda) y\|_p = \lambda \|x\|_p + (1 - \lambda) \|y\|_p$.
- $f(x) = \log(e^{x_1} + e^{x_2} + \dots + e^{x_n})$, which satisfies $\max(x_1, \dots, x_n) \le f(x) \le \max(x_1, \dots, x_n) + \log n$.
- $f(x) = x^T A x$ where $A$ is a p.s.d. matrix; here $\nabla^2 f = 2A$.
Examples of constrained convex optimization:
- Linear equations with p.s.d. constraints: minimize $\frac{1}{2} x^T A x - b^T x$ (the solution satisfies $Ax = b$).
- Least-squares regression: minimize $\|Ax - b\|_2^2 = x^T A^T A x - 2 (Ax)^T b + b^T b$.
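The two optimization examples above can be checked numerically. The following is a small sketch (assuming numpy; the matrix and vector are arbitrary test data, not from the lecture): the minimizer of $\frac{1}{2} x^T A x - b^T x$ satisfies $Ax = b$, and the least-squares minimizer satisfies the normal equations $A^T A x = A^T b$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(6, 6))
A = M.T @ M + np.eye(6)            # positive definite, so both problems have unique minimizers
b = rng.normal(size=6)

# Minimizing (1/2) x^T A x - b^T x: setting the gradient A x - b to zero gives A x = b.
x_quad = np.linalg.solve(A, b)
assert np.allclose(A @ x_quad, b)

# Least squares min ||A x - b||_2^2: the gradient 2 A^T (A x - b) = 0 gives A^T A x = A^T b.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(A.T @ A @ x_ls, A.T @ b)
print("both optimality conditions verified")
```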

Constrained Convex Optimization
- General formulation for a convex $f$ and a convex set $K$: minimize $f(x)$ subject to $x \in K$.
- Example (SVMs): data $X_1, \dots, X_N \in \mathbb{R}^n$ labeled by $y_1, \dots, y_N \in \{-1, 1\}$ (spam / non-spam).
- Find a linear model $W$: $W \cdot X_i \ge 1 \Rightarrow X_i$ is spam, $W \cdot X_i \le -1 \Rightarrow X_i$ is non-spam, i.e. $\forall i: 1 - y_i (W \cdot X_i) \le 0$.
- More robust version: minimize $\sum_i \mathrm{Loss}(1 - y_i (W \cdot X_i)) + \lambda \|W\|_2^2$, e.g. the hinge loss $\mathrm{Loss}(t) = \max(0, t)$ (see the sketch below).
- Another regularizer: $\lambda \|W\|_1$ (favors sparse solutions).
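As a concrete reading of the robust SVM objective, here is a short sketch (assuming numpy; the function name, toy data, and parameter values are illustrative, not part of the lecture): it evaluates $\sum_i \max(0, 1 - y_i (W \cdot X_i)) + \lambda \|W\|_2^2$.

```python
import numpy as np

def svm_objective(W, X, y, lam=0.1):
    """X: (N, n) data matrix, y: labels in {-1, +1}, W: (n,) weight vector."""
    margins = 1 - y * (X @ W)                    # 1 - y_i (W . X_i)
    hinge = np.maximum(0.0, margins).sum()       # hinge loss max(0, t), summed over examples
    return hinge + lam * np.dot(W, W)            # l2 regularizer; lam * np.abs(W).sum() would favor sparsity

# toy usage (illustrative data)
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = np.where(X[:, 0] > 0, 1, -1)
print(svm_objective(np.zeros(3), X, y))          # with W = 0 every margin is violated: objective = N = 20
```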

Gradient Descent for Constrained Convex Optimization
- Projection: for $x \notin K$, map $x \to y \in K$ where $y = \arg\min_{z \in K} \|z - x\|_2$.
- Easy to compute for the unit $\ell_2$ ball: $y = x / \|x\|_2$ when $\|x\|_2 > 1$.
- Assume $\|\nabla f(x)\|_2 \le G$ and $\max_{x, y \in K} \|x - y\|_2 \le D$. Let $T = \frac{4 D^2 G^2}{\epsilon^2}$.
- Gradient descent (gradient + projection oracles): let $\eta = D / (G \sqrt{T})$. Repeat for $i = 0, \dots, T$:
  $y^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})$
  $x^{(i+1)} = $ projection of $y^{(i+1)}$ onto $K$
- Output $z = \frac{1}{T} \sum_i x^{(i)}$ (a code sketch follows below).
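A minimal implementation sketch of the projected gradient descent loop above (assuming numpy; the function `projected_gd` and the unit-ball projection are illustrative choices for $K$, not prescribed by the slides):

```python
import numpy as np

def project_unit_ball(x):
    """Projection onto K = {z : ||z||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1 else x / norm

def projected_gd(grad_f, x0, D, G, eps, project=project_unit_ball):
    T = int(np.ceil(4 * D**2 * G**2 / eps**2))   # T = 4 D^2 G^2 / eps^2
    eta = D / (G * np.sqrt(T))                   # eta = D / (G sqrt(T))
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        y = x - eta * grad_f(x)                  # gradient step
        x = project(y)                           # projection back onto K
        avg += x
    return avg / T                               # output z = average of the iterates

# toy usage: minimize ||x - c||_2^2 over the unit ball, with c outside the ball
c = np.array([2.0, 0.0])
z = projected_gd(lambda x: 2 * (x - c), np.zeros(2), D=2.0, G=6.0, eps=0.1)
print(z)  # approximately the boundary point (1, 0)
```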

Gradient Descent for Constrained Convex Optimization
- $\|x^{(i+1)} - x^*\|_2^2 \le \|y^{(i+1)} - x^*\|_2^2 = \|x^{(i)} - x^* - \eta \nabla f(x^{(i)})\|_2^2 = \|x^{(i)} - x^*\|_2^2 + \eta^2 \|\nabla f(x^{(i)})\|_2^2 - 2 \eta \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$
- Using the definition of $G$:
  $\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) \le \frac{1}{2\eta} \left( \|x^{(i)} - x^*\|_2^2 - \|x^{(i+1)} - x^*\|_2^2 \right) + \frac{\eta}{2} G^2$
- By convexity (first-order condition):
  $f(x^{(i)}) - f(x^*) \le \frac{1}{2\eta} \left( \|x^{(i)} - x^*\|_2^2 - \|x^{(i+1)} - x^*\|_2^2 \right) + \frac{\eta}{2} G^2$
- Summing over $i = 1, \dots, T$ (the middle terms telescope):
  $\sum_{i=1}^T \left( f(x^{(i)}) - f(x^*) \right) \le \frac{1}{2\eta} \left( \|x^{(0)} - x^*\|_2^2 - \|x^{(T)} - x^*\|_2^2 \right) + \frac{T \eta}{2} G^2$

Gradient Descent for Constrained Convex Optimization (cont.)
- From the previous slide: $\sum_{i=1}^T \left( f(x^{(i)}) - f(x^*) \right) \le \frac{1}{2\eta} \left( \|x^{(0)} - x^*\|_2^2 - \|x^{(T)} - x^*\|_2^2 \right) + \frac{T \eta}{2} G^2$
- By convexity, $f\left( \frac{1}{T} \sum_i x^{(i)} \right) \le \frac{1}{T} \sum_i f(x^{(i)})$, so:
  $f\left( \frac{1}{T} \sum_i x^{(i)} \right) - f(x^*) \le \frac{D^2}{2 \eta T} + \frac{\eta}{2} G^2$
- Setting $\eta = \frac{D}{G \sqrt{T}}$ makes the right-hand side $\le \frac{D G}{\sqrt{T}} \le \epsilon$.

Online Gradient Descent
- Gradient descent works in a more general setting: replace $f$ with a sequence of convex functions $f_1, f_2, \dots, f_T$.
- At step $i$ the algorithm must output $x^{(i)} \in K$ before seeing $f_i$.
- Let $x^*$ be the minimizer of $\sum_i f_i(w)$.
- Goal: minimize the regret $\sum_i \left( f_i(x^{(i)}) - f_i(x^*) \right)$.
- The same analysis as before works in the online case (a sketch follows below).
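The online setting can be simulated directly. Below is a hedged sketch (assuming numpy; the quadratic losses $f_t(x) = \|x - c_t\|_2^2$ and the unit ball $K$ are illustrative choices): the learner commits to $x^{(t)}$, observes $f_t$, takes a projected gradient step, and regret is measured against the best fixed point in hindsight.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 500, 2
centers = rng.normal(size=(T, d)) * 0.5          # f_t(x) = ||x - c_t||_2^2, revealed one at a time
eta = 1.0 / np.sqrt(T)

x = np.zeros(d)
plays = []
for c in centers:
    plays.append(x.copy())                       # commit to x^(t) before seeing f_t
    x = x - eta * 2 * (x - c)                    # gradient step on the revealed loss
    x = x / max(1.0, np.linalg.norm(x))          # project back onto the unit ball K

x_star = centers.mean(axis=0)                    # fixed minimizer of sum_t f_t (lies inside K here)
regret = sum(np.sum((p - c) ** 2) for p, c in zip(plays, centers)) \
         - sum(np.sum((x_star - c) ** 2) for c in centers)
print("regret:", regret)                         # grows sublinearly in T by the analysis above
```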

Stochastic Gradient Descent
- (Expected gradient oracle): returns $g$ such that $\mathbb{E}_g[g] = \nabla f(x)$.
- Example: for SVM, pick one term of the loss function at random.
- Let $g^{(i)}$ be the gradient returned at step $i$.
- Let $f_i = (g^{(i)})^T x$ be the linear function used in the $i$-th step of OGD.
- Let $z = \frac{1}{T} \sum_i x^{(i)}$ and let $x^*$ be the minimizer of $f$.

Stochastic Gradient Descent
- Thm. $\mathbb{E}[f(z)] \le f(x^*) + \frac{D G}{\sqrt{T}}$, where $G$ is an upper bound on the norm of any gradient output by the oracle.
- Proof:
  $f(z) - f(x^*) \le \frac{1}{T} \sum_i \left( f(x^{(i)}) - f(x^*) \right)$ (convexity)
  $\le \frac{1}{T} \sum_i \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$
  $= \frac{1}{T} \sum_i \mathbb{E}\left[ (g^{(i)})^T (x^{(i)} - x^*) \right]$ (gradient oracle)
  $= \frac{1}{T} \sum_i \mathbb{E}\left[ f_i(x^{(i)}) - f_i(x^*) \right] = \frac{1}{T} \mathbb{E}\left[ \sum_i f_i(x^{(i)}) - f_i(x^*) \right]$
- The quantity inside the expectation is the regret of OGD, so after averaging it is always $\le \epsilon$.
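Tying this back to the SVM example (one random term of the loss per step), here is a minimal SGD sketch (assuming numpy; the objective is rescaled to the mean of the hinge terms so a single sampled term gives an unbiased gradient, and the step size and toy data are illustrative assumptions):

```python
import numpy as np

def sgd_svm(X, y, lam=0.1, T=5000, eta=0.05, seed=4):
    """SGD on mean_i max(0, 1 - y_i (W . X_i)) + lam ||W||_2^2, one sampled term per step."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    W = np.zeros(n)
    avg = np.zeros(n)
    for _ in range(T):
        i = rng.integers(N)                      # expected-gradient oracle: one random example
        g = 2 * lam * W                          # gradient of the regularizer
        if 1 - y[i] * (X[i] @ W) > 0:            # sampled hinge term is active
            g = g - y[i] * X[i]                  # unbiased (sub)gradient of the mean hinge loss
        W = W - eta * g
        avg += W
    return avg / T                               # return the average iterate, as in the theorem

# toy usage on linearly separable data (illustrative)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
W = sgd_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ W) == y))
```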

VC-dimension of combinations of concepts
- For $k$ concepts $h_1, \dots, h_k$ and a Boolean function $f$:
  $\mathrm{comb}_f(h_1, \dots, h_k) = \{ x \in X : f(h_1(x), \dots, h_k(x)) = 1 \}$
- Example: $H$ = linear separators, $f = \mathrm{AND}$ or $f = \mathrm{Majority}$.
- For a concept class $H$ and a Boolean function $f$:
  $\mathrm{COMB}_{f,k}(H) = \{ \mathrm{comb}_f(h_1, \dots, h_k) : h_i \in H \}$
- Lemma. If $\mathrm{VCdim}(H) = d$ then for any $f$: $\mathrm{VCdim}(\mathrm{COMB}_{f,k}(H)) \le O(k d \log(k d))$.

VC-dimension of combinations of concepts (cont.)
- Lemma. If $\mathrm{VCdim}(H) = d$ then for any $f$: $\mathrm{VCdim}(\mathrm{COMB}_{f,k}(H)) \le O(k d \log(k d))$.
- Proof. Let $n = \mathrm{VCdim}(\mathrm{COMB}_{f,k}(H))$, so there exists a set $S$ of $n$ points shattered by $\mathrm{COMB}_{f,k}(H)$.
- By Sauer's lemma, there are at most $n^d$ ways of labeling $S$ by $H$.
- Each labeling in $\mathrm{COMB}_{f,k}(H)$ is determined by $k$ labelings of $S$ by $H$, so there are at most $(n^d)^k = n^{kd}$ labelings.
- Hence $2^n \le n^{kd}$, which gives $n \le k d \log n$ and therefore $n \le 2 k d \log(k d)$.

Back to the batch setting
- Classification problem: instance space $X$ is $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors); classification means coming up with a mapping $X \to \{0, 1\}$.
- Formalization: assume there is a probability distribution $D$ over $X$, and $c^*$ is the "target concept" (the set $c^* \subseteq X$ of positive instances).
- Given labeled i.i.d. samples from $D$, produce a hypothesis $h \subseteq X$.
- Goal: have $h$ agree with $c^*$ over the distribution $D$.
- Minimize $\mathrm{err}_D(h) = \Pr_D[h \,\Delta\, c^*]$, the "true" or "generalization" error.

Boosting
- Strong learner: succeeds with probability $\ge 1 - \epsilon$.
- Weak learner: succeeds with probability $\ge \frac{1}{2} + \gamma$.
- Boosting (informal): a weak learner that works under any distribution can be converted into a strong learner.
- Idea: run the weak learner $A$ on the sample $S$ under reweightings that focus on misclassified examples.

Boosting (cont.)
- Let $H$ be the class of hypotheses produced by $A$.
- Apply the majority rule to $h_1, \dots, h_{t_0} \in H$: the resulting class has VC-dimension $\le O\!\left( t_0 \, \mathrm{VCdim}(H) \log(t_0 \, \mathrm{VCdim}(H)) \right)$.
- Algorithm (a code sketch follows below):
  Given $S = (x_1, \dots, x_n)$, set $w_i = 1$ in $w = (w_1, \dots, w_n)$.
  For $t = 1, \dots, t_0$: call the weak learner on $(S, w)$ to get a hypothesis $h_t$; for each misclassified $x_i$, multiply $w_i$ by $\alpha = \left(\frac{1}{2} + \gamma\right) / \left(\frac{1}{2} - \gamma\right)$.
  Output: $\mathrm{MAJ}(h_1, \dots, h_{t_0})$.
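A hedged sketch of the boosting loop above (assuming numpy; the threshold-stump weak learner, the toy data with $\pm 1$ labels, and $\gamma = 0.1$ are illustrative assumptions, not part of the lecture): misclassified examples are upweighted by $\alpha$ and the final classifier is the majority vote.

```python
import numpy as np

def weak_learner(X, y, w):
    """Hypothetical weak learner: the single-feature threshold stump with the best weighted accuracy."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
                acc = w[pred == y].sum() / w.sum()   # weighted accuracy
                if best is None or acc > best[0]:
                    best = (acc, j, thr, sign)
    _, j, thr, sign = best
    return lambda Z: np.where(sign * (Z[:, j] - thr) >= 0, 1, -1)

def boost(X, y, t0=51, gamma=0.1):
    """Boosting by majority: reweight misclassified points by alpha, output MAJ(h_1, ..., h_t0)."""
    w = np.ones(len(y))                              # w_i = 1 initially
    alpha = (0.5 + gamma) / (0.5 - gamma)            # alpha = (1/2 + gamma) / (1/2 - gamma)
    hypotheses = []
    for _ in range(t0):                              # odd t0 avoids ties in the vote
        h = weak_learner(X, y, w)
        hypotheses.append(h)
        w[h(X) != y] *= alpha                        # upweight misclassified examples
    return lambda Z: np.sign(sum(h(Z) for h in hypotheses))

# toy usage (illustrative data)
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
H = boost(X, y)
print("training error:", np.mean(H(X) != y))         # small when the weak-learning assumption holds each round
```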

Boosting: analysis
- Def ($\gamma$-weak learner on a sample): for labeled examples $x_i$ weighted by $w_i$, the weight of correctly classified examples is $\ge \left(\frac{1}{2} + \gamma\right) \sum_{i=1}^n w_i$.
- Thm. If $A$ is a $\gamma$-weak learner on $S$, then for $t_0 = O\!\left(\frac{1}{\gamma^2} \log n\right)$ boosting achieves 0 error on $S$.
- Proof. Let $m$ be the number of mistakes of the final classifier.
- Each mistaken example was misclassified $\ge \frac{t_0}{2}$ times, so its weight is $\ge \alpha^{t_0/2}$; hence the total weight is $\ge m \, \alpha^{t_0/2}$.
- Let $W(t)$ be the total weight at step $t$. Then $W(t+1) \le \left( \alpha \left(\tfrac{1}{2} - \gamma\right) + \left(\tfrac{1}{2} + \gamma\right) \right) W(t) = (1 + 2\gamma) W(t)$.

Boosting: analysis (cont.)
- $W(0) = n$, so $W(t_0) \le n (1 + 2\gamma)^{t_0}$.
- Combining with the lower bound: $m \, \alpha^{t_0/2} \le W(t_0) \le n (1 + 2\gamma)^{t_0}$, where $\alpha = \left(\tfrac{1}{2} + \gamma\right) / \left(\tfrac{1}{2} - \gamma\right) = (1 + 2\gamma)/(1 - 2\gamma)$.
- Therefore $m \le n (1 - 2\gamma)^{t_0/2} (1 + 2\gamma)^{t_0/2} = n \left(1 - 4\gamma^2\right)^{t_0/2}$.
- Since $1 - x \le e^{-x}$, we get $m \le n e^{-2\gamma^2 t_0}$, so $t_0 = O\!\left(\frac{1}{\gamma^2} \log n\right)$ suffices.
- Comments: the argument applies even if the weak learners are adversarial; VC-dimension bounds give sample complexity $n = O\!\left( \frac{1}{\epsilon} \cdot \frac{\mathrm{VCdim}(H)}{\gamma^2} \right)$.