Learning by Loss Minimization
Machine Learning: Learn a Function from Examples
Function: f : X → Y, mapping an input x to a prediction y
Examples:
– Supervised: labeled pairs (x_i, y_i)
– Unsupervised: inputs x_i without labels
– Semi-supervised: a few labeled pairs together with many unlabeled inputs
Example: Regression
Examples: pairs (x_i, y_i), i = 1, …, n
Function: a parametric model f_w(x), e.g. a linear model f_w(x) = wᵀx + b
How to find the parameters w?
Loss Functions
Least squares: L(w) = Σ_i ( f_w(x_i) − y_i )²
Least absolute deviations: L(w) = Σ_i | f_w(x_i) − y_i |
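As a concrete illustration (not from the slides), a minimal Python/NumPy sketch that fits a linear model by gradient descent on the least-squares loss; the data, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

# Illustrative data: y is roughly 3*x + 1 plus noise (assumed, not from the slides).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0   # parameters of the linear model f_w(x) = w*x + b
lr = 0.1          # step size, chosen by hand here

for _ in range(500):
    err = (w * x + b) - y
    # Gradients of the least-squares loss (1/n) * sum((f_w(x_i) - y_i)^2)
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach 3 and 1
```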
Open Questions
– How to choose the model function?
– How to choose the loss function?
– How to minimize the loss function?
Example: Binary Classification
Support Vector Machines (SVMs)
Binary classification can be viewed as the task of separating classes in feature space.
Support Vector Machines (SVMs)
Classifier: f(x) = wᵀx + b. Its sign is the predicted label; y_i ∈ {−1, +1} is the right label.
Objective (soft-margin hinge loss):
min_{w,b}  (λ/2)‖w‖² + (1/n) Σ_i max(0, 1 − y_i (wᵀx_i + b))
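A small Python sketch of this setup (illustrative only; the regularization weight λ and the data layout are assumptions): it evaluates the hinge-loss objective and predicts labels by the sign of wᵀx + b.

```python
import numpy as np

def predict(w, b, X):
    """Predicted labels: the sign of w^T x + b for each row of X."""
    return np.sign(X @ w + b)

def svm_objective(w, b, X, y, lam=0.1):
    """(lam/2)*||w||^2 + mean hinge loss over the examples (X: n x d, y in {-1, +1})."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```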
Other losses?
Can minimize using Stochastic sub-Gradient Descent (SGD):
at each step, pick an example at random and move along a sub-gradient of the objective evaluated on that single example.
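A minimal Pegasos-style SGD sketch in Python (following the Shalev-Shwartz et al. paper cited below, but simplified: no bias term and no projection step; λ, the iteration count, and the data format are assumptions):

```python
import numpy as np

def pegasos(X, y, lam=0.1, n_iters=10000, seed=0):
    """Stochastic sub-gradient descent for a linear SVM (Pegasos-style, simplified)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)            # pick one example at random
        eta = 1.0 / (lam * t)          # decreasing step size
        if y[i] * (X[i] @ w) < 1.0:    # hinge is active: sub-gradient is lam*w - y_i*x_i
            w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
        else:                          # hinge is zero: sub-gradient is lam*w
            w = (1.0 - eta * lam) * w
    return w
```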
Papers
– Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter, 2011.
– The Tradeoffs of Large Scale Learning. Léon Bottou and Olivier Bousquet, 2011.
– Stochastic Gradient Descent Tricks. Léon Bottou, 2012.
Non-Linear SVMs
Datasets that are linearly separable (1-D example along x).
Datasets that are NOT linearly separable?
Mapping to another (here higher-dimensional) space, e.g. x ↦ (x, x²), can make them separable.
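As a concrete (assumed) illustration of such a mapping in Python: 1-D data labeled by whether |x| is small is not linearly separable on the line, but becomes separable after mapping x ↦ (x, x²).

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.where(np.abs(x) <= 1.0, 1, -1)   # inner points +1, outer points -1: not separable on the line

phi = np.stack([x, x**2], axis=1)       # mapping x -> (x, x^2)

# In the mapped space the horizontal line x^2 = 2 separates the classes:
pred = np.where(phi[:, 1] <= 2.0, 1, -1)
print(np.all(pred == y))                # True
```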
What should be the mapping?
What should be the mapping in general?
Support Vector Machines (SVMs)
The Lagrangian dual:
max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j
subject to  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0   (with C = 1/(λn) for the objective above)
Where the classifier is:
f(x) = sign( Σ_i α_i y_i x_iᵀx + b )
Both the dual and the classifier depend on the data only through inner products x_iᵀx_j, which can be replaced by a kernel k(x_i, x_j).
Support Vector Machines (SVMs)
Primal with Kernels (Chapelle 06)
Popular Choices for Kernels
– Polynomial (homogeneous) kernel: k(x, x') = (xᵀx')^d
– Polynomial (inhomogeneous) kernel: k(x, x') = (xᵀx' + 1)^d
– Gaussian Radial Basis Function (RBF) kernel: k(x, x') = exp( −‖x − x'‖² / (2σ²) )
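A short Python sketch of these kernels and of the kernelized classifier from the dual above (the degree d, σ, and the dual variables α and b are placeholders, not values from the slides):

```python
import numpy as np

def poly_kernel(x, z, d=2, c=0.0):
    """Polynomial kernel (x^T z + c)^d; c=0 gives the homogeneous version."""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_classifier(x, X_train, y_train, alpha, b, kernel):
    """Kernelized SVM decision: sign( sum_i alpha_i y_i k(x_i, x) + b )."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)) + b
    return np.sign(score)
```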
Multiclass?
– One-vs-One: train K(K−1)/2 classifiers, one for each pair of classes, and classify by majority vote.
– One-vs-All: train K classifiers, each separating one class from all other classes, and classify by the largest response (sketched after this list).
– Multiclass (Crammer and Singer): train the K one-vs-all classifiers jointly.
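A minimal Python sketch of one-vs-all prediction (assumed setup: one weight vector per class as the rows of W, bias folded in or omitted):

```python
import numpy as np

def one_vs_all_predict(W, x):
    """One-vs-all: each row of W scores one class; predict the class with the largest response."""
    scores = W @ x          # shape (K,), one score per class
    return int(np.argmax(scores))
```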
Multiclass (Crammer and Singer): train the one-vs-all classifiers jointly.
Per-example loss: max(0, 1 + max_{y ≠ y_i} w_yᵀx_i − w_{y_i}ᵀx_i),
i.e. the right-class response w_{y_i}ᵀx_i should beat the wrong class that got the largest response by a margin of 1.
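A Python sketch of this multiclass hinge loss (illustrative; W again holds one weight vector per class as rows):

```python
import numpy as np

def multiclass_hinge_loss(W, x, y_true):
    """Crammer-Singer style loss: max(0, 1 + best wrong-class score - right-class score)."""
    scores = W @ x
    right = scores[y_true]                      # response of the right class
    scores_wrong = np.delete(scores, y_true)    # responses of all wrong classes
    return max(0.0, 1.0 + scores_wrong.max() - right)
```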
Complex labels – Structured Prediction
How to choose C, or σ for the Gaussian kernel, or … ?
How to evaluate performance?
Neural Nets = Deep Learning