Optimization in Machine Learning
Wenlin Chen
Machine Learning
The science of getting computers to learn without being explicitly programmed.
Everyone uses it dozens of times a day without even knowing it.
Examples: search engines, spam filters, movie recommendation, face detection.
Image Captioning. Try this yourself at http://deeplearning.cs.toronto.edu/
Visual Question Answering. Try it at http://visualqa.csail.mit.edu/
Go Game: AlphaGo vs. Lee Sedol, 2:0. The matches are still ongoing. A historic moment!
Machine Learning Model
Input (an email) → Model → Output (not spam).
The model predicts the label of the input.
A model is a mathematical function that contains parameters.
Model training/learning: tune the parameters so that the model fits the training data.
This is where numerical optimization comes into play!
Machine Learning Process
Training data: labeled examples such as (email1, not_spam), (email2, spam), (email3, spam), ..., (email99, not_spam), (email100, spam).
Feature extraction: map each example to a feature vector x_i (for email, a vector of word frequencies) with label y_i, giving pairs (x_1,y_1), (x_2,y_2), \cdots, (x_{100},y_{100}).
If the data are images, the feature vector would be the pixel intensities.
Model training: learn a function f such that y_i\approx f(x_i) ~~~\forall i.
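A minimal Python sketch of the feature-extraction step for email; the vocabulary and example text are made-up illustrations, not taken from the slides:

# Bag-of-words feature extraction: map an email to a vector of word frequencies.
VOCAB = ["free", "money", "meeting", "project", "winner"]   # made-up vocabulary

def email_to_feature_vector(email_text):
    words = email_text.lower().split()
    n = max(len(words), 1)
    # x[i] = relative frequency of VOCAB[i] in the email
    return [words.count(w) / n for w in VOCAB]

x = email_to_feature_vector("Free money for the lucky winner")
# x is the feature vector that is fed to the model together with its label y (spam / not_spam)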
Empirical Risk Minimization
The loss function L(f(x_i),y_i) measures the "difference" between the prediction f(x_i) and the label y_i.
Goal: minimize the average loss on the training data:
\underset{f}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L(f(x_i),y_i)
The training of most machine learning algorithms has this form; algorithms differ in the choice of f and L.
Solve the optimization with gradient descent.
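A concrete gradient-descent sketch for this objective, assuming a linear model and the squared loss; neither choice is fixed by the slide, they are just illustrative:

import numpy as np

def train_by_gradient_descent(X, y, lr=0.1, epochs=500):
    """Minimize (1/N) * sum_i L(f(x_i), y_i) with f(x) = w.x + b and
    the squared loss L(f(x), y) = (f(x) - y)^2, by gradient descent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        residual = X @ w + b - y                  # f(x_i) - y_i for every example
        w -= lr * (2.0 / N) * (X.T @ residual)    # gradient of the empirical risk w.r.t. w
        b -= lr * (2.0 / N) * residual.sum()      # gradient w.r.t. b
    return w, b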
Linear Classification
Training data: D=\{(x_1,y_1),\cdots,(x_n,y_n)\}, with feature vectors x_i\in\mathcal{R}^d in a d-dimensional space and outputs y_i = +1 (SPAM) or y_i = -1 (HAM).
Linear Models
f(x_i)=w^\top x_i+b
The sign of y_i f(x_i) indicates whether x_i is predicted correctly: y_i f(x_i) > 0 means correct, y_i f(x_i) < 0 means wrong.
Loss Functions: Zero-one Loss
Write the loss in terms of the margin z_i=y_if(x_i), so that L\left(f(x_i),y_i\right) = L\left(y_if(x_i)\right).
Zero-one loss: L(z)=0 \text{ if } z\geq 0, and L(z)=1 \text{ otherwise}.
This is exactly the classification error, but it is not convex, not differentiable, and not continuous.
Loss Functions: Convex Surrogates
Hinge loss (support vector machine): L(z)=\max(0,1-z). Continuous and convex, but not differentiable.
Logistic loss (logistic regression): L(z)=\log_2(1+e^{-z}). Continuous and convex.
Both are upper bounds of the zero-one loss, and minimizing an upper bound also drives down the classification error.
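A small numpy sketch evaluating the three losses at a few margins z = y_i f(x_i), just to make the upper-bound claim concrete; the margin values are arbitrary:

import numpy as np

z = np.linspace(-3, 3, 7)                        # margins y_i * f(x_i)
zero_one = (z < 0).astype(float)                 # classification error
hinge    = np.maximum(0.0, 1.0 - z)              # SVM loss
logistic = np.log2(1.0 + np.exp(-z))             # logistic regression loss

# Both surrogates upper-bound the zero-one loss at every margin:
assert np.all(hinge >= zero_one) and np.all(logistic >= zero_one)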
Regularization
Add a penalty term +\lambda \Omega(f) to the objective, where \lambda is a predefined constant.
Why regularization? To prevent overfitting, e.g. a single feature dominating the prediction.
Squared L2 norm: \Omega(f)=\|w\|_2^2=\sum_{i=1}^d w_i^2.
L1 norm: \Omega(f)=\|w\|_1=\sum_{i=1}^d |w_i|. Not differentiable; induces sparsity.
Both are convex.
Support Vector Machines (SVM)
Combine the linear model, the hinge loss, and L2 regularization:
\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \lambda \|w\|_2^2
Equivalently, with C=\frac{1}{2\lambda N}:
\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ C\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \frac{1}{2}\|w\|_2^2
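A sub-gradient descent sketch for the first (lambda-weighted) form of this objective; the step size and number of iterations are illustrative choices, not part of the slides:

import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize (1/N) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2
    with sub-gradient descent (the hinge loss is not differentiable at its kink)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                          # examples inside the margin
        # sub-gradient of the hinge term: -y_i x_i for active examples, 0 otherwise
        gw = -(X[active].T @ y[active]) / N + 2.0 * lam * w
        gb = -np.sum(y[active]) / N
        w -= lr * gw
        b -= lr * gb
    return w, b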
Support Vector Machines (SVM)
Introduce slack variables \xi_i, so that at the optimum \xi_i = \max\left(0,1-y_i (w^\top x_i + b)\right):
\underset{w\in \mathcal{R}^d,b\in \mathcal{R},\xi}{\textrm{minimize}} ~~ C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2^2
subject to \xi_i \geq 1-y_i (w^\top x_i + b) and \xi_i \geq 0 for all i.
Primal Form of SVM
Lagrangian, with multipliers \alpha_i\geq 0 and \beta_i\geq 0 for the two constraints:
\begin{aligned} \mathcal{L}(w,b,\xi,\alpha,\beta) =& C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2^2 \\ &- \sum_i \alpha_i\{y_i(w^\top x_i+b)-1+\xi_i\} - \sum_i \beta_i \xi_i \end{aligned}
By Lagrangian duality we solve \underset{\alpha,\beta}{\textrm{maximize}} ~\underset{w,b,\xi}{\textrm{minimize}} ~~~\mathcal{L}(w,b,\xi,\alpha,\beta).
Dual Form of SVM
Setting \frac{\partial \mathcal{L}}{\partial w}=\frac{\partial \mathcal{L}}{\partial b}=\frac{\partial \mathcal{L}}{\partial \xi}=0 gives
w=\sum_i \alpha_i y_i x_i, ~~~ \sum_i \alpha_i y_i = 0, ~~~ \beta_i = C - \alpha_i
Substituting back yields the dual:
\underset{\alpha}{\textrm{maximize}}~~~ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j \langle x_i,x_j \rangle
subject to \sum_i \alpha_i y_i = 0 and 0\leq \alpha_i\leq C ~~\forall i.
This is a quadratic program.
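One way to solve this dual in practice is to hand the quadratic program to an off-the-shelf QP solver. The sketch below assumes the cvxopt package is available and uses the linear kernel; it is an illustration, not the method prescribed by the slides:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    """Solve max_a sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j>
    s.t. 0 <= a_i <= C and sum_i a_i y_i = 0, written as a QP in standard form."""
    N = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T            # Q_ij = y_i y_j <x_i, x_j>
    P = matrix(Q)
    q = matrix(-np.ones(N))                               # minimize 0.5 a'Qa - 1'a
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.astype(float).reshape(1, N))             # y'a = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = (alpha * y) @ X                                    # recover w = sum_i a_i y_i x_i
    return alpha, w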
Nonlinear Classification: Kernel Trick
f(x)=w^\top x+b =\sum_i \alpha_i y_i \langle x_i,x\rangle + b =\sum_i \alpha_i y_i K(x_i,x) + b
Linear kernel: K(x_i,x)=\langle x_i,x\rangle. RBF kernel: K(x_i,x)=e^{-\|x_i-x\|^2/(2h)}.
Replacing the inner product with a nonlinear kernel turns the linear classifier into a nonlinear one.
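A small sketch of the resulting kernelized decision function with the RBF kernel; the bandwidth parameter h follows the slide, and the function names are just for illustration:

import numpy as np

def rbf_kernel(xi, x, h=1.0):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2h))"""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * h))

def kernel_predict(x, support_X, support_y, alpha, b, h=1.0):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; classify by the sign of f(x)."""
    k = np.array([rbf_kernel(xi, x, h) for xi in support_X])
    return np.sign(alpha @ (support_y * k) + b)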
Content-based Recommender System
A users-by-movies table (users Sharon, Jason, Tom; movies Interstellar, The Martian, Titanic, Cinderella) records like / not_like for each pair; a question mark marks an unknown preference.
Goal: predict the question mark. This is a binary classification problem! Use logistic regression or SVM.
Training data: existing tuples {(user, movie), like_or_not}.
Possible features for (user, movie): genre (romantic, sci-fi, action, ...), user profile (gender, age, interests, ...).
Learning to Rank
f: a score function. S: a set of pairs (i,j) where instance i is more preferable than instance j.
\underset{f}{\textrm{minimize}} ~~ \sum_{(i,j)\in S} L\left(f(x_i)-f(x_j)\right) + \lambda \|w\|_2^2
with the hinge loss L(z)=\max(0,1-z).
After optimization, f becomes a score function that respects the preferences specified in S; rank all instances by f(x).
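A sub-gradient descent sketch for this pairwise objective with a linear score function f(x) = w.x; the pair list S, step size, and iteration count are illustrative:

import numpy as np

def train_ranker(X, S, lam=0.01, epochs=200, lr=0.1):
    """Minimize sum_{(i,j) in S} max(0, 1 - (f(x_i) - f(x_j))) + lam * ||w||^2
    for f(x) = w.x, using sub-gradients of the hinge loss."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(epochs):
        g = 2.0 * lam * w
        for i, j in S:                          # i is preferred over j
            if w @ (X[i] - X[j]) < 1.0:         # the pair violates the margin
                g -= X[i] - X[j]
        w -= lr * g
    return w                                     # rank all instances by X @ w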
Feature Selection
Why select features?
Efficiency: the number of features can be in the millions for some applications.
Interpretability: discover which features are related to the outcome.
Feature selection lets the data speak for itself about which features are unrelated.
For example, suppose you put "height" into the feature vector when predicting a person's salary; you want the data to tell you whether "height" is actually related to salary.
Feature selection learns a sparse weight vector.
With f(x)=\sum_{i=1}^d w_i x^{(i)} + b, where x^{(i)} is feature i, solve
\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \|w\|_1
with the L1 norm \|w\|_1=\sum_{i=1}^d |w_i|.
Features whose learned weight is exactly zero are dropped from the model.
Why L1 Induces Sparsity
Compare the L1 penalty \|w\|_1=\sum_{i=1}^d |w_i| with the squared L2 penalty \|w\|_2^2=\sum_{i=1}^d w_i^2: the L1 ball has corners on the coordinate axes, so the regularized optimum tends to land where some weights are exactly zero, whereas the gradient of w_i^2 vanishes near zero and never pushes a small weight all the way to zero.
How to Learn w
The L1 norm is not differentiable at 0.
Option 1: use a sub-gradient.
Option 2: split each weight as w_j=w_j^+-w_j^- and solve the constrained problem
\underset{w^+,w^-\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \sum_{j=1}^d (w_j^+ + w_j^-)
subject to w_j^+\geq 0 and w_j^-\geq 0.
Projected Gradient Descent
These are box constraints. At each iteration, take a gradient descent step on the objective, then project back onto the feasible set by setting any negative values to 0 (a feasible point such as (1, 3) is left unchanged, while (-2, 1) is projected to (0, 1)).
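A sketch of Option 2 with projected gradient descent; the logistic loss, step size, and iteration count are illustrative choices, and the projection step sets any negative entries of w+ and w- to 0:

import numpy as np

def train_l1_logistic(X, y, lam=0.1, epochs=300, lr=0.1):
    """L1-regularized linear model via the split w = w_plus - w_minus,
    trained by projected gradient descent under the box constraints w_plus, w_minus >= 0."""
    N, d = X.shape
    wp, wm, b = np.zeros(d), np.zeros(d), 0.0
    for _ in range(epochs):
        w = wp - wm
        z = y * (X @ w + b)
        g = -1.0 / ((1.0 + np.exp(z)) * np.log(2))       # dL/dz for L(z) = log2(1 + e^{-z})
        gw = (X.T @ (g * y)) / N                          # gradient of the data term w.r.t. w
        wp -= lr * (gw + lam)                             # d/dw_plus  = +gw + lam
        wm -= lr * (-gw + lam)                            # d/dw_minus = -gw + lam
        b  -= lr * np.sum(g * y) / N
        wp, wm = np.maximum(wp, 0.0), np.maximum(wm, 0.0) # projection: clip negatives to 0
    return wp - wm, b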
Multi-class Classification
Next move? Many possible actions!
One vs. rest: train one binary classifier f_k per class, each trained separately, and predict
y=\arg\max_k ~f_k(x)
Softmax Regression
Labels y_i\in \{1,2,\cdots,K\}, with one score function f_k per class.
A first attempt: \underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i f_{y_i}(x_i).
Problem: all f_{y_i}(x_i) can be made arbitrarily large, so the objective is unbounded. Fix it by normalizing against the other classes:
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \max_k\{f_k(x_i)\} \right)
and replace the hard max with the soft max: \max_k\{f_k(x_i)\} \approx \log\left(\sum_k e^{f_k(x_i)}\right).
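A quick numerical check of the soft-max approximation, computed in the usual numerically stable way by shifting by the hard max first; the score values are arbitrary:

import numpy as np

def logsumexp(f):
    """Soft max: log(sum_k e^{f_k}), shifted by max_k f_k for numerical stability."""
    m = np.max(f)
    return m + np.log(np.sum(np.exp(f - m)))

scores = np.array([2.0, 1.0, -0.5])
print(np.max(scores), logsumexp(scores))    # hard max 2.0 vs. soft max ~2.37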
Softmax
Softmax Regression
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \log\left(\sum_k e^{f_k(x_i)}\right) \right)
which is the same as
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(\frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}\right)
Defining p(y=y_i|x_i) = \frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}, this is
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(p(y=y_i|x_i) \right)
i.e. maximizing the log likelihood. The negative log likelihood is convex, so the optimization is tractable.
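A gradient-ascent sketch for this log likelihood, assuming linear scores f_k(x) = w_k.x and 0-indexed class labels; the step size and iteration count are illustrative:

import numpy as np

def train_softmax(X, y, K, epochs=200, lr=0.1):
    """Maximize sum_i log p(y_i | x_i) with p(k | x) = exp(w_k.x) / sum_j exp(w_j.x),
    where y holds integer class labels in {0, ..., K-1}."""
    N, d = X.shape
    W = np.zeros((K, d))                                   # one weight vector per class
    for _ in range(epochs):
        F = X @ W.T                                        # scores f_k(x_i), shape (N, K)
        F -= F.max(axis=1, keepdims=True)                  # stabilize before exponentiating
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
        Y = np.eye(K)[y]                                   # one-hot labels, shape (N, K)
        # gradient of the log likelihood: d/dw_k = sum_i (1{y_i = k} - p(k|x_i)) x_i
        W += lr * ((Y - P).T @ X) / N
    return W

# Predict with y_hat = argmax_k w_k.x, i.e. np.argmax(X @ W.T, axis=1)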
Neural Networks
Layers: input → hidden1 → hidden2 → output (e.g. the predicted class "episcia").
h_k = \sigma\left(\sum_j W_{k,j}x_j\right), ~~ h^\prime_k = \sigma\left(\sum_j W^{\prime}_{k,j}h_j\right), ~~ f_k(x) = \sigma\left(\sum_j W^{\prime \prime}_{k,j}h^\prime_j\right)
Predict y=\arg\max_k f_k(x).
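A forward-pass sketch of this two-hidden-layer network with a sigmoid activation; the layer sizes and random weights below are placeholders, not values from the slides:

import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))        # sigmoid activation

def forward(x, W1, W2, W3):
    """input -> hidden1 -> hidden2 -> output scores; predict by argmax."""
    h1 = sigma(W1 @ x)        # h_k   = sigma(sum_j W_{k,j}   x_j)
    h2 = sigma(W2 @ h1)       # h'_k  = sigma(sum_j W'_{k,j}  h_j)
    f  = sigma(W3 @ h2)       # f_k(x) = sigma(sum_j W''_{k,j} h'_j)
    return np.argmax(f), f

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                               # a 4-dimensional input
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(3, 5))
y_hat, scores = forward(x, W1, W2, W3)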
Thanks for your time!