Optimization Tutorial

Pritam Sukumar & Daphne Tsatsoulis, CS 546: Machine Learning for Natural Language Processing

What is Optimization? Find the minimum or maximum of an objective function subject to a set of constraints.
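As a point of reference, the generic constrained problem can be written in the standard form below; the symbols f (objective), g_i (inequality constraints), and h_j (equality constraints) are generic notation, not taken from the slides.

```latex
\min_{x \in \mathbb{R}^n} \ f(x)
\quad \text{subject to} \quad
g_i(x) \le 0, \ i = 1, \dots, m,
\qquad
h_j(x) = 0, \ j = 1, \dots, p.
```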

Why Do We Care? Linear classification, maximum likelihood estimation, and k-means clustering are all optimization problems.

Prefer Convex Problems. Non-convex functions can have local (non-global) minima and maxima; for a convex problem, any local minimum is a global minimum.

Convex Functions and Sets
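For reference, the standard definitions (textbook material, not transcribed from the slides):

```latex
% Convex set: the segment between any two points of C stays in C.
C \text{ convex} \iff \theta x + (1 - \theta) y \in C
\quad \forall\, x, y \in C,\ \theta \in [0, 1].

% Convex function: the graph lies on or below every chord.
f \text{ convex} \iff f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)
\quad \forall\, x, y \in \operatorname{dom} f,\ \theta \in [0, 1].
```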

Important Convex Functions: common examples include affine functions, norms (such as the L1 and L2 norms), the exponential, the negative logarithm, the maximum of linear functions, log-sum-exp, and quadratic forms x^T Q x with Q positive semidefinite.

Convex Optimization Problem: minimize a convex objective over a convex feasible set (convex inequality constraints and affine equality constraints). For such problems every local minimum is a global minimum.

Lagrangian Dual
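A sketch of the standard construction, using the generic f, g_i, and h_j from the problem form above (again standard notation, not copied from the slides): form the Lagrangian, minimize it over x to get the dual function, and maximize the dual function over the multipliers.

```latex
L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x),
\qquad \lambda \ge 0.

g(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)
\qquad\Longrightarrow\qquad
\text{dual problem:} \ \ \max_{\lambda \ge 0,\ \nu} \ g(\lambda, \nu).
```

By weak duality, the dual optimal value is always a lower bound on the primal optimal value.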

Methods covered in this tutorial, with references:
First Order Methods (Gradient Descent, Newton's Method): Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, tutorial, 2009.
Subgradient Descent and Stochastic Gradient Descent: Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML 2010.
Trust Regions: Trust Region Newton Method for Large-scale Logistic Regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008.
Dual Coordinate Descent: Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, Machine Learning Journal, 2011.
Linear Classification: Recent Advances of Large-scale Linear Classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin, Proceedings of the IEEE, 100 (2012).

Gradient Descent
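A minimal sketch of the update x_{t+1} = x_t − η ∇f(x_t); the quadratic test objective and the fixed step size are illustrative choices, not taken from the slides.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Illustrative quadratic f(x) = 0.5 * x^T A x - b^T x, so grad f(x) = A x - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_min = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2))
print(x_min, np.linalg.solve(A, b))  # the two should roughly agree
```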

Single Step Illustration

Full Gradient Descent Illustration. The ellipses are level curves (the function is quadratic). If we start on the dashed line, the gradient points straight toward the minimum (the center); otherwise the iterates bounce back and forth across the valley as they descend.

Newton’s Method. The update is x_{t+1} = x_t − H(x_t)^{-1} ∇f(x_t), i.e., the step is the inverse Hessian times the gradient. The Hessian must be invertible (positive definite) for the step to be well defined.
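A sketch of that update on a smooth convex test function; the function, starting point, and iteration count are illustrative choices.

```python
import numpy as np

def newton(grad, hess, x0, iters=20):
    """Newton's method: solve H d = g and step x <- x - d each iteration
    (requires an invertible, ideally positive definite, Hessian)."""
    x = x0
    for _ in range(iters):
        d = np.linalg.solve(hess(x), grad(x))
        x = x - d
    return x

# Illustrative convex function f(x) = sum_i (exp(x_i) - x_i), minimized at x = 0.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
print(newton(grad, hess, x0=np.array([2.0, -1.5])))  # close to [0, 0]
```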

Newton’s Method Picture

Subgradient Descent Motivation: many machine learning objectives (e.g., the hinge loss, L1 regularization) are convex but not differentiable everywhere, so the gradient may not exist at some points. A subgradient always exists for a convex function and can be used in its place.

Subgradient Descent Algorithm: identical to gradient descent except that any subgradient replaces the gradient, typically with a decaying step size; a minimal sketch follows.
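The sketch below uses a 1/sqrt(t) step size and tracks the best iterate seen, since the objective value need not decrease monotonically; the nonsmooth test objective is an illustrative choice.

```python
import numpy as np

def subgradient_descent(subgrad, x0, iters=500):
    """Subgradient method with decaying step size eta_t = 1 / sqrt(t + 1).
    subgrad(x) must return (a subgradient at x, the objective value at x)."""
    x, best, f_best = x0, x0, np.inf
    for t in range(iters):
        g, f_val = subgrad(x)
        if f_val < f_best:
            best, f_best = x, f_val
        x = x - (1.0 / np.sqrt(t + 1)) * g
    return best

# Illustrative nonsmooth objective: f(x) = sum_i |x - a_i|, minimized at the median of a.
a = np.array([1.0, 2.0, 2.5, 7.0, 9.0])
subgrad = lambda x: (np.sum(np.sign(x - a)), np.sum(np.abs(x - a)))
print(subgradient_descent(subgrad, x0=0.0))  # close to the median, 2.5
```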

A brief introduction to stochastic gradient descent.

Online Learning and Optimization. The goal of machine learning is to minimize the expected loss given samples; this is stochastic optimization. We assume the loss function is convex.
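Written out in standard notation (not transcribed from the slides), the expected loss and its empirical approximation over n samples are:

```latex
\min_{w} \ \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(w; x, y) \big]
\qquad \text{approximated by} \qquad
\min_{w} \ \frac{1}{n} \sum_{i=1}^{n} \ell(w; x_i, y_i).
```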

Batch (sub)gradient descent for ML: all examples are processed together in each step, so the entire training set is examined at every step. This is very slow when n is very large.

Stochastic (sub)gradient descent: “optimize” one example at a time, choosing examples randomly (or reshuffling and choosing in order), so the learning is representative of the example distribution.
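A minimal sketch for L2-regularized logistic regression; the learning-rate schedule, regularization value, and synthetic data are illustrative choices, not taken from the slides.

```python
import numpy as np

def sgd_logreg(X, y, lam=0.01, epochs=10, eta0=1.0, seed=0):
    """SGD on L2-regularized logistic loss, one randomly chosen example per step."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / (1.0 + lam * eta0 * t)          # decaying step size
            margin = y[i] * (X[i] @ w)
            grad = lam * w - y[i] * X[i] / (1.0 + np.exp(margin))
            w -= eta * grad
    return w

# Tiny synthetic example with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))
w = sgd_logreg(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy, should be close to 1
```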

Stochastic (sub)gradient descent is equivalent to online learning (the weight vector w changes with every example). Convergence is guaranteed for convex functions, and for a convex function any local minimum is also a global minimum.

A Hybrid: stochastic uses 1 example per iteration, batch uses all the examples. Sample Average Approximation (SAA): sample m examples at each step and perform a (sub)gradient step on them. This allows parallelization, but the choice of m is based on heuristics; a sketch follows.
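A minimal mini-batch sketch under those assumptions; the batch size, step size, and least-squares example are illustrative, not from the slides.

```python
import numpy as np

def minibatch_sgd(batch_grad, n, d, m=32, steps=500, eta=0.1, seed=0):
    """Each step: draw m example indices and step along their averaged gradient.
    batch_grad(w, idx) must return the gradient averaged over the index array idx."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=m)   # the per-batch work is easy to parallelize
        w -= eta * batch_grad(w, idx)
    return w

# Example use: least squares, averaged gradient of 0.5 * (x_i . w - y_i)^2 over a batch.
X = np.random.default_rng(1).normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5])
batch_grad = lambda w, idx: X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
print(minibatch_sgd(batch_grad, n=1000, d=3))  # roughly [2, -1, 0.5]
```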

SGD Issues: convergence is very sensitive to the learning rate (oscillations near the solution arise from the stochastic sampling), and the learning rate may need to decrease over time for the algorithm to converge. In short, SGD is well suited to machine learning with large data sets.

Problem Formulation

New Points

Limited Memory Quasi-Newton Methods

Limited Memory BFGS
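Limited-memory BFGS keeps only the most recent curvature pairs instead of a full (inverse) Hessian approximation. Below is a sketch using SciPy's L-BFGS-B implementation on an L2-regularized logistic loss; the data, regularization value, and memory size (maxcor) are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def logreg_loss_grad(w, X, y, lam):
    """L2-regularized logistic loss and gradient (labels in {-1, +1})."""
    m = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -m)) + 0.5 * lam * (w @ w)
    grad = -(X.T @ (y * expit(-m))) / len(y) + lam * w
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = np.sign(X @ rng.normal(size=10))
res = minimize(logreg_loss_grad, x0=np.zeros(10), args=(X, y, 0.01),
               jac=True, method="L-BFGS-B", options={"maxcor": 10})
print(res.fun, res.nit)  # final loss and number of iterations
```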

Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models (Yu, Huang, and Lin, 2011): the paper presents the dual coordinate descent formulation for logistic regression learning.

Coordinate Descent: minimize along each coordinate direction in turn, and repeat until the minimum is found. One complete cycle through all coordinates is comparable to a single gradient descent step. In some cases an analytical expression is available for each one-dimensional subproblem (example: the dual form of the SVM); otherwise a numerical method is needed for each inner minimization. A minimal sketch follows.
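A sketch of cyclic coordinate descent on a problem where each one-dimensional subproblem has a closed form; ridge regression is used here purely as an illustrative choice.

```python
import numpy as np

def coordinate_descent_ridge(X, y, lam=1.0, cycles=50):
    """Cyclic coordinate descent for ridge regression:
    minimize 0.5 * ||X w - y||^2 + 0.5 * lam * ||w||^2.
    Each coordinate update has a closed form."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                        # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(cycles):
        for j in range(d):
            r += X[:, j] * w[j]          # remove coordinate j's contribution
            w[j] = X[:, j] @ r / (col_sq[j] + lam)
            r -= X[:, j] * w[j]          # add it back with the updated value
    return w

X = np.random.default_rng(0).normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5])
w_cd = coordinate_descent_ridge(X, y)
w_exact = np.linalg.solve(X.T @ X + np.eye(4), X.T @ y)
print(np.max(np.abs(w_cd - w_exact)))   # should be tiny
```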

Dual coordinate descent: coordinate descent applied to the dual problem. It is commonly used to solve the dual problem for SVMs, where it allows application of the kernel trick. In this paper: dual logistic regression, optimized using coordinate descent.

Dual form of SVM. Primal: minimize over w, b, ξ the objective (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. Dual: maximize over α the objective Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
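For intuition, here is a sketch of dual coordinate descent for the bias-free L1-loss linear SVM (in the spirit of LIBLINEAR's solver, not necessarily the exact variant discussed in the cited paper). Dropping the bias removes the equality constraint, and maintaining w = Σ_i α_i y_i x_i keeps each coordinate update cheap; the data and C value are illustrative.

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, epochs=20, seed=0):
    """Dual coordinate descent sketch for the bias-free L1-loss linear SVM:
    min_a  0.5 * a^T Q a - sum(a),  Q_ij = y_i y_j x_i . x_j,  0 <= a_i <= C.
    We maintain w = sum_i a_i y_i x_i so each coordinate update is cheap."""
    n, d = X.shape
    a = np.zeros(n)
    w = np.zeros(d)
    q = (X ** 2).sum(axis=1)                     # Q_ii = x_i . x_i
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = y[i] * (X[i] @ w) - 1.0          # partial derivative of the dual in a_i
            a_new = min(max(a[i] - g / q[i], 0.0), C)
            w += (a_new - a[i]) * y[i] * X[i]
            a[i] = a_new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.sign(X @ rng.normal(size=5))
w = dual_cd_linear_svm(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy on separable data
```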

Dual form of LR. Letting Q_ij = y_i y_j (x_i·x_j), the dual of L2-regularized logistic regression can be written (up to an additive constant) as: minimize over α the objective (1/2) α^T Q α + Σ_i [ α_i log α_i + (C − α_i) log(C − α_i) ], subject to 0 ≤ α_i ≤ C.

Coordinate descent for dual LR. Along each coordinate direction: fix all other variables and minimize the resulting one-variable subproblem in α_i.

Coordinate descent for dual LR. No analytical expression is available for the one-variable subproblem, so numerical optimization (Newton's method, bisection, BFGS, ...) is used to iterate along each direction. Beware of the logarithm: the log terms cause numerical trouble as α_i approaches the boundary values 0 and C.

Coordinate descent for dual ME. Maximum Entropy (ME) is the extension of LR to multi-class problems. Each iteration solves the problem at two levels: the outer level considers one block of variables at a time, where each block contains all labels for a single example; the inner level solves the block subproblem by dual coordinate descent. The problem can also be solved in a manner similar to online CRFs (exponentiated gradient methods).

Recent Advances of Large-scale Linear Classification (Yuan, Ho, and Lin, Proceedings of the IEEE, 2012): the paper surveys recent linear methods (most of which we have covered) and compares their performance with nonlinear classifiers.

Large-scale linear classification. NLP problems usually have a large number of features and examples. Nonlinear classifiers (including kernel methods) are more accurate, but slow.

Large-scale linear classification. Linear classifiers are less accurate, but at least an order of magnitude faster, and the loss in accuracy shrinks as the number of examples increases. Speed usually depends on more than the algorithm's asymptotic order: memory and disk capacity and parallelizability also matter.

Large-scale linear classification. The choice of optimization method depends on: data properties (number of examples and features, sparsity); the formulation of the problem (differentiability, convergence properties, primal vs. dual); and whether a low-order or high-order method is used.

Comparison of performance. The performance gap between linear and nonlinear classifiers shrinks as the number of features grows, while training and testing times for linear classifiers are much faster (LIBLINEAR vs. LIBSVM).

Thank you! Questions?