1
Optimization in Machine Learning
Wenlin Chen
2
Machine Learning
A science of getting computers to learn without being explicitly programmed. Everyone uses it dozens of times a day without even knowing it.
3
Applications: search engines, spam filters, movie recommendation, face detection.
4
Image Captioning (try this yourself online)
5
Visual Question Answering (try it online)
6
Go: AlphaGo vs. Lee Sedol, currently 2:0. The matches are still ongoing. A historic moment!
7
Machine Learning Model
Input → Output (e.g. an email → not spam). The model predicts the label of the input. A model is a mathematical function that contains parameters. Model training/learning: tune the parameters so that the model fits the training data. This is where numerical optimization comes into play!
8
Machine Learning Process
Feature extraction: each input is mapped to a feature vector $x$, e.g. a vector of word frequencies for email (if the data are images, the feature vector would be the pixel intensities), together with a label $y$.
Training data: $(x_1, \text{not\_spam}),\ (x_2, \text{spam}),\ (x_3, \text{spam}),\ \ldots,\ (x_{97}, \text{not\_spam}),\ (x_{98}, \text{spam}),\ (x_{99}, \text{not\_spam}),\ (x_{100}, \text{spam})$, i.e. pairs $(x_1,y_1), (x_2,y_2), \cdots, (x_{100},y_{100})$.
Model training: learn a function $f$ such that $y_i \approx f(x_i) ~~\forall i$.
9
Empirical Risk Minimization
The loss function $L(f(x_i),y_i)$ measures the “difference” between $f(x_i)$ and $y_i$.
Goal: minimize the loss on the training data:
$\underset{f}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L(f(x_i),y_i)$
The training of most machine learning algorithms takes this form; the differences lie in the choice of $f$ and $L$. Solve the optimization with gradient descent.
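A minimal sketch of this recipe in Python, assuming a linear model $f(x)=w^\top x + b$ and a squared loss purely for illustration (the slides do not fix either choice at this point):

```python
import numpy as np

def train_by_gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize (1/N) * sum_i L(f(x_i), y_i) by gradient descent,
    with f(x) = w^T x + b and L the squared loss (illustrative choices)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        pred = X @ w + b              # f(x_i) for every training point
        residual = pred - y           # dL/df for the squared loss (up to a factor of 2)
        w -= lr * (X.T @ residual) / n
        b -= lr * residual.mean()
    return w, b
```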
10
Linear Classification
Training data: $D=\{(x_1,y_1),\cdots,(x_n,y_n)\}$, where each feature vector $x_i\in\mathcal{R}^d$ lives in a $d$-dimensional space and the output label is $+1$ (HAM) or $-1$ (SPAM).
11
Linear Models
Linear model: $f(x_i)=w^\top x_i+b$.
The sign of $y_i f(x_i)$ is an indicator of whether $x_i$ is predicted correctly: positive means correct, negative means wrong.
12
Loss Functions
Zero-one loss (the classification error): with $z_i=y_i f(x_i)$,
$L\left(f(x_i),y_i\right) = L\left(y_i f(x_i)\right)$, where $L(z)=0$ if $z\geq 0$ and $L(z)=1$ otherwise.
Not convex, not differentiable, not continuous.
13
Loss Functions
Hinge loss (support vector machine): $L(z)=\max(0,1-z)$. An upper bound of the zero-one loss; not differentiable.
Logistic loss (logistic regression): $L(z)=\log_2(1+e^{-z})$.
Both are continuous and convex, and minimizing an upper bound of the zero-one loss also drives down the classification error.
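A small numeric sketch of the three losses as functions of the margin $z = y_i f(x_i)$ (function names are mine):

```python
import numpy as np

def zero_one_loss(z):
    """0 if the prediction is correct (z >= 0), else 1."""
    return (z < 0).astype(float)

def hinge_loss(z):
    """Hinge loss L(z) = max(0, 1 - z), used by the SVM."""
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    """Logistic loss L(z) = log2(1 + exp(-z)), used by logistic regression."""
    return np.log2(1.0 + np.exp(-z))

z = np.linspace(-2, 2, 5)
# Both surrogates upper-bound the zero-one loss at every z.
print(zero_one_loss(z), hinge_loss(z), logistic_loss(z), sep="\n")
```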
14
Regularization
Add a penalty term $+\lambda \Omega(f)$ to the objective, where $\lambda$ is a predefined constant.
Why regularization? To prevent overfitting, e.g. a single feature dominating the prediction.
L2 norm: $\Omega(f)=\|w\|_2=\sum_{i=1}^d w_i^2$.
L1 norm: $\Omega(f)=\|w\|_1=\sum_{i=1}^d |w_i|$ (not differentiable; induces sparsity).
Both are convex.
15
Support Vector Machines (SVM)
Combine the linear model, the hinge loss, and the L2 norm:
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \lambda \|w\|_2$
Equivalently, with $C=\frac{1}{2\lambda N}$:
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ C\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \frac{1}{2}\|w\|_2$
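A minimal sketch of minimizing the first formulation by sub-gradient descent; this is an illustrative optimizer (the slides go on to solve the dual instead), with labels assumed to be in {-1, +1}:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_steps=1000):
    """Sub-gradient descent on (1/N) sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||_2,
    where ||w||_2 denotes sum_i w_i^2 as in the slides. y must be in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        margins = y * (X @ w + b)
        active = margins < 1.0                              # points with nonzero hinge loss
        grad_w = -(X[active] * y[active, None]).sum(axis=0) / n + 2.0 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```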
16
Support Vector Machines (SVM)
Introduce slack variables $\xi_i = \max\left(0,1-y_i (w^\top x_i + b)\right)$, i.e. $\xi_i \geq 1-y_i (w^\top x_i + b)$ and $\xi_i \geq 0$, and rewrite the problem as
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2$
17
Primal Form of SVM
Lagrangian, with multipliers $\alpha_i\geq 0$ and $\beta_i\geq 0$:
$\mathcal{L}(w,b,\xi,\alpha,\beta) = C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2 - \sum_i \alpha_i\{y_i(w^\top x_i+b)-1+\xi_i\} - \sum_i \beta_i \xi_i$
$\underset{\alpha,\beta}{\textrm{maximize}} ~\underset{w,b,\xi}{\textrm{minimize}} ~~~\mathcal{L}(w,b,\xi,\alpha,\beta)$
18
Dual Form of SVM
Setting $\frac{\partial \mathcal{L}}{\partial w}=\frac{\partial \mathcal{L}}{\partial b}=\frac{\partial \mathcal{L}}{\partial \xi}=0$ gives
$w=\sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, $\beta_i = C - \alpha_i$.
Substituting back yields the dual:
$\underset{\alpha}{\textrm{maximize}}~~~ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j \langle x_i,x_j \rangle$ subject to $0\leq \alpha_i\leq C ~~\forall i$ and $\sum_i \alpha_i y_i = 0$.
This is a quadratic program.
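One way to hand this quadratic program to an off-the-shelf solver. This is a sketch assuming the cvxopt package (not something the slides prescribe); cvxopt's solver minimizes $\frac{1}{2}a^\top P a + q^\top a$ subject to $Ga\leq h$ and $Aa=b$, so we minimize the negated dual:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def solve_svm_dual(X, y, C=1.0):
    """Solve max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    s.t. 0 <= a_i <= C and sum_i a_i y_i = 0."""
    n = X.shape[0]
    y = y.astype(float)
    K = X @ X.T                                        # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)                     # quadratic term of the negated dual
    q = matrix(-np.ones(n))                            # linear term
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))     # encodes 0 <= a_i and a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))                       # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    return alpha
```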
19
Nonlinear Classification
Kernel trick: since $w=\sum_i \alpha_i y_i x_i$,
$f(x)=w^\top x+b =\sum_i \alpha_i y_i \langle x_i,x\rangle + b =\sum_i \alpha_i y_i K(x_i,x) + b$
Linear kernel: $K(x_i,x)=\langle x_i,x\rangle$.
RBF kernel: $K(x_i,x)=e^{-\|x_i-x\|^2/(2h)}$.
Replacing the inner product with a nonlinear kernel yields a nonlinear classifier.
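A sketch of kernelized prediction with the RBF kernel, given dual variables alpha from the SVM (function names are mine):

```python
import numpy as np

def rbf_kernel(X_train, x, h=1.0):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2h)) for every training point x_i."""
    return np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * h))

def kernel_predict(alpha, y, X_train, b, x, h=1.0):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; classify by the sign of f(x)."""
    return np.sign((alpha * y) @ rbf_kernel(X_train, x, h) + b)
```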
20
Content-based Recommender System
Users: Sharon, Jason, Tom. Movies: Interstellar, The Martian, Titanic, Cinderella. Each known (user, movie) entry is labeled like or not_like; the goal is to predict the question-mark entries.
This is a binary classification problem! Use logistic regression or SVM.
Training data: existing tuples {(user, movie), like_or_not}.
Possible features for (user, movie): genre (romantic, sci-fi, action, ...), user profile (gender, age, interest, ...).
21
Learning to Rank
$S$: a set of pairs $(i,j)$ where $i$ is more preferable than $j$. $f$: a score function (a linear model, given the $\|w\|_2$ regularizer below).
$\underset{f}{\textrm{minimize}} ~~ \sum_{(i,j)\in S} L\left(f(x_i)-f(x_j)\right) + \lambda \|w\|_2$
with $L$ the hinge loss $L(z)=\max(0,1-z)$.
After optimization, $f$ becomes a score function that respects the preferences specified in $S$. Rank all instances by $f(x)$.
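A sketch of the pairwise objective with a linear score function; `pairs` is the set S of index pairs (i, j) meaning item i should outrank item j (names are mine):

```python
import numpy as np

def ranking_objective(w, X, pairs, lam=0.01):
    """sum over (i, j) in S of max(0, 1 - (f(x_i) - f(x_j))) + lam * ||w||^2,
    with the linear score f(x) = w^T x."""
    scores = X @ w
    margins = np.array([scores[i] - scores[j] for i, j in pairs])
    return np.maximum(0.0, 1.0 - margins).sum() + lam * np.sum(w ** 2)

def rank_all(w, X):
    """Rank all instances by their score f(x) = w^T x, highest first."""
    return np.argsort(-(X @ w))
```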
22
Feature Selection
Why select features?
Efficiency: the number of features could be in the millions for some applications.
Interpretability: explore which features are related to the outcome.
Feature selection lets the data speak for itself about which features are unrelated. For example, suppose you put “height” into the feature vector when predicting a person's salary; you want the data to tell you whether “height” is actually related to salary.
23
Feature selection learns a sparse weight vector: if $w_i = 0$, feature $i$ is effectively removed from the model $f(x)=\sum_{i=1}^d w_i x^{(i)} + b$.
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \|w\|_1$
L1 norm: $\|w\|_1=\sum_{i=1}^d |w_i|$
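For intuition, a sketch using scikit-learn's L1-penalized logistic regression (an assumed convenience, not the slides' own solver): when only a few features matter, only a few weights stay nonzero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                      # 50 candidate features
y = (X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)  # only 2 matter

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero weights:", np.count_nonzero(clf.coef_))   # typically a small number
```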
24
Why L1 Induces Sparsity
L2 norm: $\|w\|_2=\sum_{i=1}^d w_i^2$. Compared with the smooth L2 ball, the L1 ball has corners on the coordinate axes, so the penalized solution tends to land where some coordinates are exactly zero.
25
How to Learn w
The L1 norm is not differentiable at 0.
Option 1: sub-gradient descent.
Option 2: split each weight as $w_d=w_d^+-w_d^-$ with $w_d^+ \geq 0$ and $w_d^- \geq 0$, and solve
$\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \sum_d [w_d^+ + w_d^-]$
26
Projected Gradient Descent
The constraints $w_d^+ \geq 0$, $w_d^- \geq 0$ are box constraints. At each iteration: (1) do a gradient descent step on the objective; (2) project back onto the feasible set by setting any negative values to 0, e.g. a step from (1,3) that lands at (-2,1) is projected to (0,1).
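A minimal sketch of Option 2 with projected gradient descent, assuming the logistic loss (natural log for simplicity) and labels in {-1, +1}:

```python
import numpy as np

def l1_projected_gradient_descent(X, y, lam=0.1, lr=0.1, n_steps=1000):
    """Minimize (1/N) sum_i log(1 + exp(-y_i f(x_i))) + lam * sum_d (w_d^+ + w_d^-)
    over w^+ >= 0, w^- >= 0, where f(x) = (w^+ - w^-)^T x + b."""
    n, d = X.shape
    w_plus, w_minus, b = np.zeros(d), np.zeros(d), 0.0
    for _ in range(n_steps):
        w = w_plus - w_minus
        z = y * (X @ w + b)
        g = -y / (1.0 + np.exp(z)) / n          # d(loss)/d f(x_i), averaged over the data
        grad_w = X.T @ g
        w_plus -= lr * (grad_w + lam)           # gradient step on w^+
        w_minus -= lr * (-grad_w + lam)         # gradient step on w^-
        b -= lr * g.sum()
        w_plus = np.maximum(w_plus, 0.0)        # projection: set negative values to 0
        w_minus = np.maximum(w_minus, 0.0)
    return w_plus - w_minus, b
```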
27
Multi-class Classification
Next move in the Go game? Many possible actions! Approach: one vs. rest.
28
Multi-class Classification
Next move? Many possible actions! One vs. rest: train a separate binary classifier $f_i(x)$ for each class, then predict
$y=\arg\max_i ~f_i(x)$
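A sketch of one-vs-rest using scikit-learn binary classifiers (an assumed convenience): each $f_k$ is trained separately on "class k vs. everything else", and prediction takes the argmax of the scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, n_classes):
    """Train one binary classifier per class: class k vs. the rest."""
    return [LogisticRegression().fit(X, (y == k).astype(int)) for k in range(n_classes)]

def one_vs_rest_predict(models, X):
    """y = argmax_k f_k(x), using each classifier's real-valued score."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```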
29
Softmax Regression
Labels $y_i\in \{1,2,\cdots,K\}$, with one score function $f_k$ per class. A first attempt:
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i f_{y_i}(x_i)$
Problem: all the $f_k(x_i)$ can become very large, so the objective is unbounded. Normalize by subtracting the largest score:
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \max_k\{f_k(x_i)\} \right)$
Replace the hard max with a soft max:
$\max_k\{f_k(x_i)\} \approx \log\left(\sum_k e^{f_k(x_i)}\right)$
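A quick numeric check that log-sum-exp behaves like a smooth max (shifting by the hard max keeps the exponentials from overflowing):

```python
import numpy as np

def hard_max(scores):
    return np.max(scores)

def soft_max(scores):
    """log(sum_k exp(f_k)), computed stably by shifting with the hard max."""
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))

scores = np.array([2.0, 5.0, 3.0])
print(hard_max(scores), soft_max(scores))   # 5.0 vs. roughly 5.17
```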
30
Softmax
31
Softmax Regression
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \log\left(\sum_k e^{f_k(x_i)}\right) \right)$, which is the same as
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(\frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}\right)$
Interpreting $p(y=y_i|x_i) = \frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}$, this is
$\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(p(y=y_i|x_i) \right)$
i.e. maximizing the log likelihood. This is a convex optimization problem (the negative log likelihood is convex in the $w_k$).
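A sketch of the log-likelihood objective with linear scores $f_k(x)=w_k^\top x$ (bias omitted, and class labels assumed 0-indexed here for convenience):

```python
import numpy as np

def softmax_log_likelihood(W, X, y):
    """sum_i log p(y_i | x_i), where p(y_i | x_i) = exp(f_{y_i}(x_i)) / sum_k exp(f_k(x_i))
    and f_k(x) = w_k^T x. W has shape (K, d); y holds class indices 0..K-1."""
    scores = X @ W.T                                          # entry [i, k] = f_k(x_i)
    shifted = scores - scores.max(axis=1, keepdims=True)      # stable log-sum-exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()
```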
32
Neural Networks
Layers: input → hidden1 → hidden2 → output (e.g. the predicted class “episcia”).
$h_k = \sigma\left(\sum_j W_{k,j}x_j\right)$, and at the output $f_k(x) = \sigma\left(\sum_j W^{\prime \prime}_{k,j}h^\prime_j\right)$; predict $y=\arg\max_k f_k(x)$.
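A minimal forward pass matching these formulas, with a sigmoid activation and biases omitted (shapes and names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    """input -> hidden1 -> hidden2 -> output: h = sigma(W1 x), h' = sigma(W2 h),
    f(x) = sigma(W3 h'); predict the class y = argmax_k f_k(x)."""
    h1 = sigmoid(W1 @ x)
    h2 = sigmoid(W2 @ h1)
    f = sigmoid(W3 @ h2)
    return int(np.argmax(f))
```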
33
Thanks for your time!