Optimization in Machine Learning
Wenlin Chen
Machine Learning
The science of getting computers to learn without being explicitly programmed.
Everyone uses it dozens of times a day without even knowing it.
Examples: search engines, spam filters, movie recommendation, face detection.
Image Captioning. Try this yourself at http://deeplearning.cs.toronto.edu/
Visual Question Answering. Try it at http://visualqa.csail.mit.edu/
Go Game: AlphaGo vs. Lee Sedol, 2:0. The matches are still ongoing. A historic moment!
Machine Learning Model
Input (an email) → Model → Output (not spam).
The model predicts the label of the input.
A model is a mathematical function that contains parameters.
Model training/learning: tune the parameters so that the model fits the training data.
This is where numerical optimization comes into play!
Machine Learning Process
Training data: labeled examples such as (email1, not_spam), (email2, spam), (email3, spam), ..., (email99, not_spam), (email100, spam).
Feature extraction: map each example to a feature vector x_i (for email, a vector of word frequencies) with label y_i, giving pairs (x_1,y_1), (x_2,y_2), \cdots, (x_{100},y_{100}).
If the data are images, the feature vector would be the pixel intensities.
Model training: learn a function f such that y_i\approx f(x_i) ~~~\forall i.
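A minimal Python sketch of the feature-extraction step for email; the vocabulary and example text are made-up illustrations, not taken from the slides:

# Bag-of-words feature extraction: map an email to a vector of word frequencies.
VOCAB = ["free", "money", "meeting", "project", "winner"]   # made-up vocabulary

def email_to_feature_vector(email_text):
    words = email_text.lower().split()
    n = max(len(words), 1)
    # x[i] = relative frequency of VOCAB[i] in the email
    return [words.count(w) / n for w in VOCAB]

x = email_to_feature_vector("Free money for the lucky winner")
# x is the feature vector that is fed to the model together with its label y (spam / not_spam)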
Empirical Risk Minimization
The loss function L(f(x_i),y_i) measures the "difference" between the prediction f(x_i) and the label y_i.
Goal: minimize the average loss on the training data:
\underset{f}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L(f(x_i),y_i)
The training of most machine learning algorithms has this form; algorithms differ in the choice of f and L.
Solve the optimization with gradient descent.
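A concrete gradient-descent sketch for this objective, assuming a linear model and the squared loss; neither choice is fixed by the slide, they are just illustrative:

import numpy as np

def train_by_gradient_descent(X, y, lr=0.1, epochs=500):
    """Minimize (1/N) * sum_i L(f(x_i), y_i) with f(x) = w.x + b and
    the squared loss L(f(x), y) = (f(x) - y)^2, by gradient descent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        residual = X @ w + b - y                  # f(x_i) - y_i for every example
        w -= lr * (2.0 / N) * (X.T @ residual)    # gradient of the empirical risk w.r.t. w
        b -= lr * (2.0 / N) * residual.sum()      # gradient w.r.t. b
    return w, b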
Linear Classification
Training data: D=\{(x_1,y_1),\cdots,(x_n,y_n)\}, with feature vectors x_i\in\mathcal{R}^d in a d-dimensional space and outputs y_i = +1 (SPAM) or y_i = -1 (HAM).
Linear Models
f(x_i)=w^\top x_i+b
The sign of y_i f(x_i) indicates whether x_i is predicted correctly: y_i f(x_i) > 0 means correct, y_i f(x_i) < 0 means wrong.
Loss Functions: Zero-one Loss
Write the loss in terms of the margin z_i=y_if(x_i), so that L\left(f(x_i),y_i\right) = L\left(y_if(x_i)\right).
Zero-one loss: L(z)=0 \text{ if } z\geq 0, and L(z)=1 \text{ otherwise}.
This is exactly the classification error, but it is not convex, not differentiable, and not continuous.
Loss Functions: Convex Surrogates
Hinge loss (support vector machine): L(z)=\max(0,1-z). Continuous and convex, but not differentiable.
Logistic loss (logistic regression): L(z)=\log_2(1+e^{-z}). Continuous and convex.
Both are upper bounds of the zero-one loss, and minimizing an upper bound also drives down the classification error.
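A small numpy sketch evaluating the three losses at a few margins z = y_i f(x_i), just to make the upper-bound claim concrete; the margin values are arbitrary:

import numpy as np

z = np.linspace(-3, 3, 7)                        # margins y_i * f(x_i)
zero_one = (z < 0).astype(float)                 # classification error
hinge    = np.maximum(0.0, 1.0 - z)              # SVM loss
logistic = np.log2(1.0 + np.exp(-z))             # logistic regression loss

# Both surrogates upper-bound the zero-one loss at every margin:
assert np.all(hinge >= zero_one) and np.all(logistic >= zero_one)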
Regularization
Add a penalty term +\lambda \Omega(f) to the objective, where \lambda is a predefined constant.
Why regularization? To prevent overfitting, e.g. a single feature dominating the prediction.
Squared L2 norm: \Omega(f)=\|w\|_2^2=\sum_{i=1}^d w_i^2.
L1 norm: \Omega(f)=\|w\|_1=\sum_{i=1}^d |w_i|. Not differentiable; induces sparsity.
Both are convex.
Support Vector Machines (SVM)
Combine the linear model, the hinge loss, and L2 regularization:
\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \lambda \|w\|_2^2
Equivalently, with C=\frac{1}{2\lambda N}:
\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ C\sum_{i=1}^N \max\left(0,1-y_i (w^\top x_i + b)\right) + \frac{1}{2}\|w\|_2^2
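A sub-gradient descent sketch for the first (lambda-weighted) form of this objective; the step size and number of iterations are illustrative choices, not part of the slides:

import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize (1/N) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2
    with sub-gradient descent (the hinge loss is not differentiable at its kink)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                          # examples inside the margin
        # sub-gradient of the hinge term: -y_i x_i for active examples, 0 otherwise
        gw = -(X[active].T @ y[active]) / N + 2.0 * lam * w
        gb = -np.sum(y[active]) / N
        w -= lr * gw
        b -= lr * gb
    return w, b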
Support Vector Machines (SVM)
Introduce slack variables \xi_i, so that at the optimum \xi_i = \max\left(0,1-y_i (w^\top x_i + b)\right):
\underset{w\in \mathcal{R}^d,b\in \mathcal{R},\xi}{\textrm{minimize}} ~~ C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2^2
subject to \xi_i \geq 1-y_i (w^\top x_i + b) and \xi_i \geq 0 for all i.
Primal Form of SVM
Lagrangian, with multipliers \alpha_i\geq 0 and \beta_i\geq 0 for the two constraints:
\begin{aligned} \mathcal{L}(w,b,\xi,\alpha,\beta) =& C\sum_{i=1}^N \xi_i + \frac{1}{2}\|w\|_2^2 \\ &- \sum_i \alpha_i\{y_i(w^\top x_i+b)-1+\xi_i\} - \sum_i \beta_i \xi_i \end{aligned}
By Lagrangian duality we solve \underset{\alpha,\beta}{\textrm{maximize}} ~\underset{w,b,\xi}{\textrm{minimize}} ~~~\mathcal{L}(w,b,\xi,\alpha,\beta).
Dual Form of SVM
Setting \frac{\partial \mathcal{L}}{\partial w}=\frac{\partial \mathcal{L}}{\partial b}=\frac{\partial \mathcal{L}}{\partial \xi}=0 gives
w=\sum_i \alpha_i y_i x_i, ~~~ \sum_i \alpha_i y_i = 0, ~~~ \beta_i = C - \alpha_i
Substituting back yields the dual:
\underset{\alpha}{\textrm{maximize}}~~~ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j \langle x_i,x_j \rangle
subject to \sum_i \alpha_i y_i = 0 and 0\leq \alpha_i\leq C ~~\forall i.
This is a quadratic program.
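One way to solve this dual in practice is to hand the quadratic program to an off-the-shelf QP solver. The sketch below assumes the cvxopt package is available and uses the linear kernel; it is an illustration, not the method prescribed by the slides:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    """Solve max_a sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j>
    s.t. 0 <= a_i <= C and sum_i a_i y_i = 0, written as a QP in standard form."""
    N = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T            # Q_ij = y_i y_j <x_i, x_j>
    P = matrix(Q)
    q = matrix(-np.ones(N))                               # minimize 0.5 a'Qa - 1'a
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.astype(float).reshape(1, N))             # y'a = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    w = (alpha * y) @ X                                    # recover w = sum_i a_i y_i x_i
    return alpha, w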
Nonlinear Classification: Kernel Trick
f(x)=w^\top x+b =\sum_i \alpha_i y_i \langle x_i,x\rangle + b =\sum_i \alpha_i y_i K(x_i,x) + b
Linear kernel: K(x_i,x)=\langle x_i,x\rangle. RBF kernel: K(x_i,x)=e^{-\|x_i-x\|^2/(2h)}.
Replacing the inner product with a nonlinear kernel turns the linear classifier into a nonlinear one.
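A small sketch of the resulting kernelized decision function with the RBF kernel; the bandwidth parameter h follows the slide, and the function names are just for illustration:

import numpy as np

def rbf_kernel(xi, x, h=1.0):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2h))"""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * h))

def kernel_predict(x, support_X, support_y, alpha, b, h=1.0):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; classify by the sign of f(x)."""
    k = np.array([rbf_kernel(xi, x, h) for xi in support_X])
    return np.sign(alpha @ (support_y * k) + b)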
Content-based Recommender System
A users-by-movies table (users Sharon, Jason, Tom; movies Interstellar, The Martian, Titanic, Cinderella) records like / not_like for each pair; a question mark marks an unknown preference.
Goal: predict the question mark. This is a binary classification problem! Use logistic regression or SVM.
Training data: existing tuples {(user, movie), like_or_not}.
Possible features for (user, movie): genre (romantic, sci-fi, action, ...), user profile (gender, age, interests, ...).
Learning to Rank
f: a score function. S: a set of pairs (i,j) where instance i is more preferable than instance j.
\underset{f}{\textrm{minimize}} ~~ \sum_{(i,j)\in S} L\left(f(x_i)-f(x_j)\right) + \lambda \|w\|_2^2
with the hinge loss L(z)=\max(0,1-z).
After optimization, f becomes a score function that respects the preferences specified in S; rank all instances by f(x).
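A sub-gradient descent sketch for this pairwise objective with a linear score function f(x) = w.x; the pair list S, step size, and iteration count are illustrative:

import numpy as np

def train_ranker(X, S, lam=0.01, epochs=200, lr=0.1):
    """Minimize sum_{(i,j) in S} max(0, 1 - (f(x_i) - f(x_j))) + lam * ||w||^2
    for f(x) = w.x, using sub-gradients of the hinge loss."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(epochs):
        g = 2.0 * lam * w
        for i, j in S:                          # i is preferred over j
            if w @ (X[i] - X[j]) < 1.0:         # the pair violates the margin
                g -= X[i] - X[j]
        w -= lr * g
    return w                                     # rank all instances by X @ w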
Feature Selection
Why select features?
Efficiency: the number of features can be in the millions for some applications.
Interpretability: discover which features are related to the outcome.
Feature selection lets the data speak for itself about which features are unrelated.
For example, suppose you put "height" into the feature vector when predicting a person's salary; you want the data to tell you whether "height" is actually related to salary.
Feature selection learns a sparse weight vector.
With f(x)=\sum_{i=1}^d w_i x^{(i)} + b, where x^{(i)} is feature i, solve
\underset{w\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \|w\|_1
with the L1 norm \|w\|_1=\sum_{i=1}^d |w_i|.
Features whose learned weight is exactly zero are dropped from the model.
Why L1 Induces Sparsity
Compare the L1 penalty \|w\|_1=\sum_{i=1}^d |w_i| with the squared L2 penalty \|w\|_2^2=\sum_{i=1}^d w_i^2: the L1 ball has corners on the coordinate axes, so the regularized optimum tends to land where some weights are exactly zero, whereas the gradient of w_i^2 vanishes near zero and never pushes a small weight all the way to zero.
How to Learn w
The L1 norm is not differentiable at 0.
Option 1: use a sub-gradient.
Option 2: split each weight as w_j=w_j^+-w_j^- and solve the constrained problem
\underset{w^+,w^-\in \mathcal{R}^d,b\in \mathcal{R}}{\textrm{minimize}} ~~ \frac{1}{N}\sum_{i=1}^N L\left(f(x_i),y_i\right) + \lambda \sum_{j=1}^d (w_j^+ + w_j^-)
subject to w_j^+\geq 0 and w_j^-\geq 0.
Projected Gradient Descent
These are box constraints. At each iteration, take a gradient descent step on the objective, then project back onto the feasible set by setting any negative values to 0 (a feasible point such as (1, 3) is left unchanged, while (-2, 1) is projected to (0, 1)).
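A sketch of Option 2 with projected gradient descent; the logistic loss, step size, and iteration count are illustrative choices, and the projection step sets any negative entries of w+ and w- to 0:

import numpy as np

def train_l1_logistic(X, y, lam=0.1, epochs=300, lr=0.1):
    """L1-regularized linear model via the split w = w_plus - w_minus,
    trained by projected gradient descent under the box constraints w_plus, w_minus >= 0."""
    N, d = X.shape
    wp, wm, b = np.zeros(d), np.zeros(d), 0.0
    for _ in range(epochs):
        w = wp - wm
        z = y * (X @ w + b)
        g = -1.0 / ((1.0 + np.exp(z)) * np.log(2))       # dL/dz for L(z) = log2(1 + e^{-z})
        gw = (X.T @ (g * y)) / N                          # gradient of the data term w.r.t. w
        wp -= lr * (gw + lam)                             # d/dw_plus  = +gw + lam
        wm -= lr * (-gw + lam)                            # d/dw_minus = -gw + lam
        b  -= lr * np.sum(g * y) / N
        wp, wm = np.maximum(wp, 0.0), np.maximum(wm, 0.0) # projection: clip negatives to 0
    return wp - wm, b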
Multi-class Classification
Next move? Many possible actions!
One vs. rest: train one binary classifier f_k per class, each trained separately, and predict
y=\arg\max_k ~f_k(x)
Softmax Regression
Labels y_i\in \{1,2,\cdots,K\}, with one score function f_k per class.
A first attempt: \underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i f_{y_i}(x_i).
Problem: all f_{y_i}(x_i) can be made arbitrarily large, so the objective is unbounded. Fix it by normalizing against the other classes:
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \max_k\{f_k(x_i)\} \right)
and replace the hard max with the soft max: \max_k\{f_k(x_i)\} \approx \log\left(\sum_k e^{f_k(x_i)}\right).
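A quick numerical check of the soft-max approximation, computed in the usual numerically stable way by shifting by the hard max first; the score values are arbitrary:

import numpy as np

def logsumexp(f):
    """Soft max: log(sum_k e^{f_k}), shifted by max_k f_k for numerical stability."""
    m = np.max(f)
    return m + np.log(np.sum(np.exp(f - m)))

scores = np.array([2.0, 1.0, -0.5])
print(np.max(scores), logsumexp(scores))    # hard max 2.0 vs. soft max ~2.37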
Softmax
Softmax Regression
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \left(f_{y_i}(x_i) - \log\left(\sum_k e^{f_k(x_i)}\right) \right)
which is the same as
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(\frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}\right)
Defining p(y=y_i|x_i) = \frac{e^{f_{y_i}(x_i)}}{\sum_k e^{f_k(x_i)}}, this is
\underset{w_k\in \mathcal{R}^d}{\textrm{maximize}} \sum_i \log\left(p(y=y_i|x_i) \right)
i.e. maximizing the log likelihood. The negative log likelihood is convex, so the optimization is tractable.
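A gradient-ascent sketch for this log likelihood, assuming linear scores f_k(x) = w_k.x and 0-indexed class labels; the step size and iteration count are illustrative:

import numpy as np

def train_softmax(X, y, K, epochs=200, lr=0.1):
    """Maximize sum_i log p(y_i | x_i) with p(k | x) = exp(w_k.x) / sum_j exp(w_j.x),
    where y holds integer class labels in {0, ..., K-1}."""
    N, d = X.shape
    W = np.zeros((K, d))                                   # one weight vector per class
    for _ in range(epochs):
        F = X @ W.T                                        # scores f_k(x_i), shape (N, K)
        F -= F.max(axis=1, keepdims=True)                  # stabilize before exponentiating
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
        Y = np.eye(K)[y]                                   # one-hot labels, shape (N, K)
        # gradient of the log likelihood: d/dw_k = sum_i (1{y_i = k} - p(k|x_i)) x_i
        W += lr * ((Y - P).T @ X) / N
    return W

# Predict with y_hat = argmax_k w_k.x, i.e. np.argmax(X @ W.T, axis=1)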
Neural Networks
Layers: input → hidden1 → hidden2 → output (e.g. the predicted class "episcia").
h_k = \sigma\left(\sum_j W_{k,j}x_j\right), ~~ h^\prime_k = \sigma\left(\sum_j W^{\prime}_{k,j}h_j\right), ~~ f_k(x) = \sigma\left(\sum_j W^{\prime \prime}_{k,j}h^\prime_j\right)
Predict y=\arg\max_k f_k(x).
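A forward-pass sketch of this two-hidden-layer network with a sigmoid activation; the layer sizes and random weights below are placeholders, not values from the slides:

import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))        # sigmoid activation

def forward(x, W1, W2, W3):
    """input -> hidden1 -> hidden2 -> output scores; predict by argmax."""
    h1 = sigma(W1 @ x)        # h_k   = sigma(sum_j W_{k,j}   x_j)
    h2 = sigma(W2 @ h1)       # h'_k  = sigma(sum_j W'_{k,j}  h_j)
    f  = sigma(W3 @ h2)       # f_k(x) = sigma(sum_j W''_{k,j} h'_j)
    return np.argmax(f), f

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                               # a 4-dimensional input
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(3, 5))
y_hat, scores = forward(x, W1, W2, W3)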
Thanks for your time!