1
Optimization in Machine Learning
Wenlin Chen
2
Machine Learning
A science of getting computers to learn without being explicitly programmed. Everyone uses it dozens of times a day without even knowing it.
3
Applications: search engines, spam filters, movie recommendation, face detection.
4
Image Captioning (try this yourself online)
5
Visual Question Answering (try it online)
6
Go: AlphaGo vs. Lee Sedol, currently 2:0. The matches are still ongoing. A historic moment!
7
Machine Learning Model
Input → Output (e.g. an email → not spam). The model predicts the label of the input. A model is a mathematical function that contains parameters. Model training/learning: tune the parameters so that the model fits the training data. This is where numerical optimization comes into play!
8
Machine Learning Process
Feature extraction: each input is mapped to a feature vector $x$, e.g. a vector of word frequencies for email (if the data are images, the feature vector would be the pixel intensities), together with a label $y$.
Training data: $(x_1, \text{not\_spam}),\ (x_2, \text{spam}),\ (x_3, \text{spam}),\ \ldots,\ (x_{97}, \text{not\_spam}),\ (x_{98}, \text{spam}),\ (x_{99}, \text{not\_spam}),\ (x_{100}, \text{spam})$, i.e. pairs $(x_1,y_1), (x_2,y_2), \cdots, (x_{100},y_{100})$.
Model training: learn a function $f$ such that $y_i \approx f(x_i) ~~\forall i$.
9
Empirical Risk Minimization
The loss function $L(f(x_i),y_i)$ measures the “difference” between $f(x_i)$ and $y_i$.
Goal: minimize the loss on the training data:
$\underset{f}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L(f(x_i),y_i)$
The training of most machine learning algorithms takes this form; the differences lie in the choice of $f$ and $L$. Solve the optimization with gradient descent.
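A minimal sketch of this recipe in Python, assuming a linear model $f(x)=w^\top x + b$ and a squared loss purely for illustration (the slides do not fix either choice at this point):

```python
import numpy as np

def train_by_gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize (1/N) * sum_i L(f(x_i), y_i) by gradient descent,
    with f(x) = w^T x + b and L the squared loss (illustrative choices)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        pred = X @ w + b              # f(x_i) for every training point
        residual = pred - y           # dL/df for the squared loss (up to a factor of 2)
        w -= lr * (X.T @ residual) / n
        b -= lr * residual.mean()
    return w, b
```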
10
Linear Classification
Training data: $D=\{(x_1,y_1),\cdots,(x_n,y_n)\}$, where each feature vector $x_i\in\mathcal{R}^d$ lives in a $d$-dimensional space and the output label is $+1$ (HAM) or $-1$ (SPAM).
11
Linear Models
Linear model: $f(x_i)=w^\top x_i+b$.
The sign of $y_i f(x_i)$ is an indicator of whether $x_i$ is predicted correctly: positive means correct, negative means wrong.
12
Loss Functions
Zero-one loss (the classification error): with $z_i=y_i f(x_i)$,
$L\left(f(x_i),y_i\right) = L\left(y_i f(x_i)\right)$, where $L(z)=0$ if $z\geq 0$ and $L(z)=1$ otherwise.
Not convex, not differentiable, not continuous.
13
Loss Functions
Hinge loss (support vector machine): $L(z)=\max(0,1-z)$. An upper bound of the zero-one loss; not differentiable.
Logistic loss (logistic regression): $L(z)=\log_2(1+e^{-z})$.
Both are continuous and convex, and minimizing an upper bound of the zero-one loss also drives down the classification error.
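A small numeric sketch of the three losses as functions of the margin $z = y_i f(x_i)$ (function names are mine):

```python
import numpy as np

def zero_one_loss(z):
    """0 if the prediction is correct (z >= 0), else 1."""
    return (z < 0).astype(float)

def hinge_loss(z):
    """Hinge loss L(z) = max(0, 1 - z), used by the SVM."""
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    """Logistic loss L(z) = log2(1 + exp(-z)), used by logistic regression."""
    return np.log2(1.0 + np.exp(-z))

z = np.linspace(-2, 2, 5)
# Both surrogates upper-bound the zero-one loss at every z.
print(zero_one_loss(z), hinge_loss(z), logistic_loss(z), sep="\n")
```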
14
Regularization
Add a penalty term $+\lambda \Omega(f)$ to the objective, where $\lambda$ is a predefined constant.
Why regularization? To prevent overfitting, e.g. a single feature dominating the prediction.
L2 norm: $\Omega(f)=\|w\|_2=\sum_{i=1}^d w_i^2$.
L1 norm: $\Omega(f)=\|w\|_1=\sum_{i=1}^d |w_i|$ (not differentiable; induces sparsity).
Both are convex.
15
Support Vector Machines (SVM)
Combine the linear model, the hinge loss, and the L2 norm:
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \lambda \|w\|_2$
Equivalently, with $C=\frac{1}{2\lambda N}$:
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ C\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \frac{1}{2}\|w\|_2$
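A minimal sketch of minimizing the first formulation by sub-gradient descent; this is an illustrative optimizer (the slides go on to solve the dual instead), with labels assumed to be in {-1, +1}:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_steps=1000):
    """Sub-gradient descent on (1/N) sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2,
    where ||w||_2 denotes sum_i w_i^2 as in the slides. y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        margins = y * (X @ w + b)
        active = margins < 1.0                              # points with nonzero hinge loss
        grad_w = -(X[active] * y[active, None]).sum(axis=0) / n + 2.0 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```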
16
Support Vector Machines (SVM)
Introduce slack variables $\xi_i = \max\left(0,1-y_i (w^\top x_i + b)\right)$, i.e. $\xi_i \geq 1-y_i (w^\top x_i + b)$ and $\xi_i \geq 0$, and rewrite the problem as
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2$
17
Primal Form of SVM
Lagrangian, with multipliers $\alpha_i\geq 0$ and $\beta_i\geq 0$:
$\mathcal{L}(w,b,\xi,\alpha,\beta) = C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2 - \sum_i \alpha_i\{y_i(w^\top x_i+b)-1+\xi_i\} - \sum_i \beta_i \xi_i$
$\underset{\alpha,\beta}{\textrm{maximize}} ~\underset{w,b,\xi}{\textrm{minimize}} ~~~\mathcal{L}(w,b,\xi,\alpha,\beta)$
18
Dual Form of SVM
Setting $\frac{\partial \mathcal{L}}{\partial w}=\frac{\partial \mathcal{L}}{\partial b}=\frac{\partial \mathcal{L}}{\partial \xi}=0$ gives
$w=\sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, $\beta_i = C - \alpha_i$.
Substituting back yields the dual:
$\underset{\alpha}{\textrm{maximize}}~~~ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j \langle x_i,x_j \rangle$ subject to $0\leq \alpha_i\leq C ~~\forall i$ and $\sum_i \alpha_i y_i = 0$.
This is a quadratic program.
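One way to hand this quadratic program to an off-the-shelf solver. This is a sketch assuming the cvxopt package (not something the slides prescribe); cvxopt's solver minimizes $\frac{1}{2}a^\top P a + q^\top a$ subject to $Ga\leq h$ and $Aa=b$, so we minimize the negated dual:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def solve_svm_dual(X, y, C=1.0):
    """Solve max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    s.t. 0 <= a_i <= C and sum_i a_i y_i = 0."""
    n = X.shape[0]
    y = y.astype(float)
    K = X @ X.T                                        # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)                     # quadratic term of the negated dual
    q = matrix(-np.ones(n))                            # linear term
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))     # encodes 0 <= a_i and a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))                       # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    return alpha
```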
19
Nonlinear Classification
Kernel trick: since $w=\sum_i \alpha_i y_i x_i$,
$f(x)=w^\top x+b =\sum_i \alpha_i y_i \langle x_i,x\rangle + b =\sum_i \alpha_i y_i K(x_i,x) + b$
Linear kernel: $K(x_i,x)=\langle x_i,x\rangle$.
RBF kernel: $K(x_i,x)=e^{-\|x_i-x\|^2/(2h)}$.
Replacing the inner product with a nonlinear kernel yields a nonlinear classifier.
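A sketch of kernelized prediction with the RBF kernel, given dual variables alpha from the SVM (function names are mine):

```python
import numpy as np

def rbf_kernel(X_train, x, h=1.0):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2h)) for every training point x_i."""
    return np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * h))

def kernel_predict(alpha, y, X_train, b, x, h=1.0):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; classify by the sign of f(x)."""
    return np.sign((alpha * y) @ rbf_kernel(X_train, x, h) + b)
```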
20
Content-based Recommender System
Users: Sharon, Jason, Tom. Movies: Interstellar, The Martian, Titanic, Cinderella. Each known (user, movie) entry is labeled like or not_like; the goal is to predict the question-mark entries.
This is a binary classification problem! Use logistic regression or SVM.
Training data: existing tuples {(user, movie), like_or_not}.
Possible features for (user, movie): genre (romantic, sci-fi, action, ...), user profile (gender, age, interest, ...).
21
Learning to Rank
$S$: a set of pairs $(i,j)$ where $i$ is more preferable than $j$. $f$: a score function (a linear model, given the $\|w\|_2$ regularizer below).
$\underset{f}{\textrm{minimize}} ~~ \sum_{(i,j)\in S} L\left(f(x_i)-f(x_j)\right) + \lambda \|w\|_2$
with $L$ the hinge loss $L(z)=\max(0,1-z)$.
After optimization, $f$ becomes a score function that respects the preferences specified in $S$. Rank all instances by $f(x)$.
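A sketch of the pairwise objective with a linear score function; `pairs` is the set S of index pairs (i, j) meaning item i should outrank item j (names are mine):

```python
import numpy as np

def ranking_objective(w, X, pairs, lam=0.01):
    """sum over (i, j) in S of max(0, 1 - (f(x_i) - f(x_j))) + lam * ||w||^2,
    with the linear score f(x) = w^T x."""
    scores = X @ w
    margins = np.array([scores[i] - scores[j] for i, j in pairs])
    return np.maximum(0.0, 1.0 - margins).sum() + lam * np.sum(w ** 2)

def rank_all(w, X):
    """Rank all instances by their score f(x) = w^T x, highest first."""
    return np.argsort(-(X @ w))
```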
22
Feature Selection
Why select features?
Efficiency: the number of features could be in the millions for some applications.
Interpretability: explore which features are related to the outcome.
Feature selection lets the data speak for itself about which features are unrelated. For example, suppose you put “height” into the feature vector when predicting a person's salary; you want the data to tell you whether “height” is actually related to salary.
23
Feature selection learns a sparse weight vector: if $w_i = 0$, feature $i$ is effectively removed from the model $f(x)=\sum_{i=1}^d w_i x^{(i)} + b$.
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \|w\|_1$
L1 norm: $\|w\|_1=\sum_{i=1}^d |w_i|$
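For intuition, a sketch using scikit-learn's L1-penalized logistic regression (an assumed convenience, not the slides' own solver): when only a few features matter, only a few weights stay nonzero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                      # 50 candidate features
y = (X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)  # only 2 matter

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero weights:", np.count_nonzero(clf.coef_))   # typically a small number
```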
24
Why L1 Induces Sparsity
L2 norm: $\|w\|_2=\sum_{i=1}^d w_i^2$. Compared with the smooth L2 ball, the L1 ball has corners on the coordinate axes, so the penalized solution tends to land where some coordinates are exactly zero.
25
How to Learn w
The L1 norm is not differentiable at 0.
Option 1: sub-gradient descent.
Option 2: split each weight as $w_d=w_d^+-w_d^-$ with $w_d^+ \geq 0$ and $w_d^- \geq 0$, and solve
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \sum_d [w_d^+ + w_d^-]$
26
Projected Gradient Descent
The constraints $w_d^+ \geq 0$, $w_d^- \geq 0$ are box constraints. At each iteration: (1) do a gradient descent step on the objective; (2) project back onto the feasible set by setting any negative values to 0, e.g. a step from (1,3) that lands at (-2,1) is projected to (0,1).
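A minimal sketch of Option 2 with projected gradient descent, assuming the logistic loss (natural log for simplicity) and labels in {-1, +1}:

```python
import numpy as np

def l1_projected_gradient_descent(X, y, lam=0.1, lr=0.1, n_steps=1000):
    """Minimize (1/N) sum_i log(1 + exp(-y_i f(x_i))) + lam * sum_d (w_d^+ + w_d^-)
    over w^+ >= 0, w^- >= 0, where f(x) = (w^+ - w^-)^T x + b."""
    n, d = X.shape
    w_plus, w_minus, b = np.zeros(d), np.zeros(d), 0.0
    for _ in range(n_steps):
        w = w_plus - w_minus
        z = y * (X @ w + b)
        g = -y / (1.0 + np.exp(z)) / n          # d(loss)/d f(x_i), averaged over the data
        grad_w = X.T @ g
        w_plus -= lr * (grad_w + lam)           # gradient step on w^+
        w_minus -= lr * (-grad_w + lam)         # gradient step on w^-
        b -= lr * g.sum()
        w_plus = np.maximum(w_plus, 0.0)        # projection: set negative values to 0
        w_minus = np.maximum(w_minus, 0.0)
    return w_plus - w_minus, b
```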
27
Multi-class Classification
Next move in the Go game? Many possible actions! Approach: one vs. rest.
28
Multi-class Classification
Next move? Many possible actions! One vs. rest: train a separate binary classifier $f_i(x)$ for each class, then predict
$y=\arg\max_i ~f_i(x)$
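A sketch of one-vs-rest using scikit-learn binary classifiers (an assumed convenience): each $f_k$ is trained separately on "class k vs. everything else", and prediction takes the argmax of the scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, n_classes):
    """Train one binary classifier per class: class k vs. the rest."""
    return [LogisticRegression().fit(X, (y == k).astype(int)) for k in range(n_classes)]

def one_vs_rest_predict(models, X):
    """y = argmax_k f_k(x), using each classifier's real-valued score."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```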
29
Softmax Regression
Labels $y_i\in \{1,2,\cdots,K\}$, with one score function $f_k$ per class. A first attempt:
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i f_{y_i}(x_i)$
Problem: all the $f_k(x_i)$ can become very large, so the objective is unbounded. Normalize by subtracting the largest score:
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \max_k\{f_k(x_i)\} \right)$
Replace the hard max with a soft max:
$\max_k\{f_k(x_i)\} \approx \log\left(\sum_k e^{f_k(x_i)}\right)$
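A quick numeric check that log-sum-exp behaves like a smooth max (shifting by the hard max keeps the exponentials from overflowing):

```python
import numpy as np

def hard_max(scores):
    return np.max(scores)

def soft_max(scores):
    """log(sum_k exp(f_k)), computed stably by shifting with the hard max."""
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))

scores = np.array([2.0, 5.0, 3.0])
print(hard_max(scores), soft_max(scores))   # 5.0 vs. roughly 5.17
```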
30
Softmax
31
Softmax Regression
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \log\left(\sum_k e^{f_k(x_i)}\right) \right)$, which is the same as
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(\frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}\right)$
Interpreting $p(y=y_i|x_i) = \frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}$, this is
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(p(y=y_i|x_i) \right)$
i.e. maximizing the log likelihood. This is a convex optimization problem (the negative log likelihood is convex in the $w_k$).
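A sketch of the log-likelihood objective with linear scores $f_k(x)=w_k^\top x$ (bias omitted, and class labels assumed 0-indexed here for convenience):

```python
import numpy as np

def softmax_log_likelihood(W, X, y):
    """sum_i log p(y_i | x_i), where p(y_i | x_i) = exp(f_{y_i}(x_i)) / sum_k exp(f_k(x_i))
    and f_k(x) = w_k^T x. W has shape (K, d); y holds class indices 0..K-1."""
    scores = X @ W.T                                          # entry [i, k] = f_k(x_i)
    shifted = scores - scores.max(axis=1, keepdims=True)      # stable log-sum-exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()
```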
32
Neural Networks
Layers: input → hidden1 → hidden2 → output (e.g. the predicted class “episcia”).
$h_k = \sigma\left(\sum_j W_{k,j}x_j\right)$, and at the output $f_k(x) = \sigma\left(\sum_j W^{\prime \prime}_{k,j}h^\prime_j\right)$; predict $y=\arg\max_k f_k(x)$.
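A minimal forward pass matching these formulas, with a sigmoid activation and biases omitted (shapes and names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    """input -> hidden1 -> hidden2 -> output: h = sigma(W1 x), h' = sigma(W2 h),
    f(x) = sigma(W3 h'); predict the class y = argmax_k f_k(x)."""
    h1 = sigmoid(W1 @ x)
    h2 = sigmoid(W2 @ h1)
    f = sigmoid(W3 @ h2)
    return int(np.argmax(f))
```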
33
Thanks for your time!