Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization
Tyler B. Johnson and Carlos Guestrin, University of Washington

Optimization: very important to machine learning. Our focus is constrained convex optimization, where the number of constraints can be very large! (Diagram: models of data → optimization → optimal model.)
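For reference, the problem class can be sketched in generic form as follows (notation mine, not taken from the slides):

\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad c_i(x) \le 0, \quad i = 1, \dots, m,
\]

with f and each c_i convex and the number of constraints m potentially very large.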

Example: sparse regression, classification (many constraints in the dual problem).

Choices for Scaling Optimization:
–Stochastic methods
–Parallelization
–Subject of this talk: active sets

Active Set Motivation: an important fact (illustrated on the slide).

Convex Optimization with Active Sets (figure: a convex objective f and a feasible set).

Convex Optimization with Active Sets:
1. Choose a set of constraints
2. Set x to minimize the objective subject to the chosen constraints
Then repeat... The algorithm converges when x is feasible!

Limitations of Active Sets
Until x is feasible do
–Propose active set of important constraints
–x ← Minimizer of objective s.t. only active set
Open questions: How many iterations to expect? (x is infeasible until convergence.) Which constraints are important? How many constraints to choose? When to terminate the subproblem?

Blitz: start from a feasible point y and let x be the minimizer subject to no constraints. Then repeat:
1. Update y to be the extreme feasible point on the segment [y, x]
2. Select the top k constraints with boundaries closest to y
3. Set x to minimize the objective subject to the selected constraints
When x = y, Blitz converges!
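A minimal Python sketch of these three steps, assuming linear constraints Ax ≤ b and using SciPy's SLSQP solver for the subproblem; the function names, default parameters, and the distance-based ranking of constraints are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize


def blitz(f, grad_f, A, b, y, x, k=10, tol=1e-6, max_iter=100):
    """Sketch of Blitz for: minimize f(x) subject to A x <= b.

    y is a feasible starting point (A y <= b); x is the unconstrained minimizer.
    """
    for _ in range(max_iter):
        # 1. Update y to the extreme feasible point on the segment [y, x].
        d = x - y
        Ad = A @ d
        slack = b - A @ y                        # nonnegative because y is feasible
        with np.errstate(divide="ignore", invalid="ignore"):
            steps = np.where(Ad > 0, slack / Ad, np.inf)
        y = y + min(1.0, float(steps.min())) * d

        if np.linalg.norm(x - y) <= tol:         # when x = y, Blitz has converged
            return x

        # 2. Select the top-k constraints whose boundaries are closest to y.
        dist = (b - A @ y) / np.linalg.norm(A, axis=1)
        active = np.argsort(dist)[:k]

        # 3. Set x to minimize the objective subject to the selected constraints.
        cons = [{"type": "ineq",                 # SLSQP convention: fun(x) >= 0
                 "fun": lambda z, i=i: b[i] - A[i] @ z,
                 "jac": lambda z, i=i: -A[i]}
                for i in active]
        x = minimize(f, y, jac=grad_f, constraints=cons, method="SLSQP").x
    return x
```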

Blitz Intuition: the key to Blitz is its y-update. If the y-update is large, Blitz is near convergence. If the y-update is small, then a violated constraint greatly improves x at the next iteration: x must improve significantly.

Main Theorem (Theorem 2.1)

Active Set Size for Linear Convergence (Corollary 2.2)

Constraint Screening (Corollary 2.3)

Tuning Algorithmic Parameters. Theory guides the choice of:
–Active set size
–Subproblem termination criteria
Figure: performance with the best fixed parameter values vs. values tuned using theory.

Recap: Blitz is an active set algorithm that
–Selects theoretically justified active sets to maximize guaranteed progress
–Applies theoretical analysis to guide the choice of algorithm parameters
–Discards constraints proven to be irrelevant during optimization

Empirical Evaluation

Experiment Overview: apply Blitz to L1-regularized loss minimization. The dual is a constrained problem, and optimizing subject to the active set corresponds to solving the primal problem over a subset of variables.
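As a concrete instance (the standard lasso duality, written in my own notation rather than taken from the slides), the L1-regularized least squares problem

\[
\min_{w \in \mathbb{R}^n} \tfrac{1}{2} \lVert A w - b \rVert_2^2 + \lambda \lVert w \rVert_1
\]

has a dual of the form

\[
\max_{\theta \in \mathbb{R}^m} \; \tfrac{1}{2} \lVert b \rVert_2^2 - \tfrac{1}{2} \lVert \theta - b \rVert_2^2
\quad \text{subject to} \quad |a_j^\top \theta| \le \lambda, \quad j = 1, \dots, n,
\]

with one constraint per feature column a_j, so restricting the dual to an active set of constraints corresponds to restricting the primal to the matching subset of features.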

Single Machine, Data in Memory. Figure: relative suboptimality vs. time (s) on the high-dimensional RCV1 dataset, comparing ProxNewt, CD, L1_LR, LIBLINEAR, GLMNET, and Blitz; plot annotations contrast active sets with no prioritization.

Limited Memory Setting: data cannot always fit in memory. Active set methods require only a subset of the data at each iteration to solve the subproblem. Set-up:
–One pass over the data to load the active set
–Solve the subproblem with the active set in memory
–Repeat

Limited Memory Setting. Figure: relative suboptimality vs. time (s) on the 12 GB Webspam dataset with 1 GB of memory, comparing AdaGrad (step sizes 1.0, 10.0, 100.0), CD, the Strong Rule, and Blitz; plot annotations contrast prioritized memory usage with no prioritization.

Distributed Setting: with more than one machine, communication is costly. Blitz subproblems require communication only for active set features. Set-up:
–Solve with synchronous bulk gradient descent
–Prioritize communication using active sets

Distributed Setting. Figure: relative suboptimality vs. time (min) on the Criteo CTR dataset with 16 machines, comparing gradient descent, the KKT Filter, and Blitz; plot annotations contrast prioritized communication with no prioritization.

Takeaways: active sets are effective at exploiting structure! We have introduced Blitz, an active set algorithm that
–Provides novel, useful theoretical guarantees
–Is very fast in practice
Future work:
–Extensions to a larger variety of problems
–Modifications such as constraint sampling
Thanks!

References
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
Fan, R. E., Chen, P. H., and Lin, C. J. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005.
Fercoq, O. and Richtárik, P. Accelerated, parallel and proximal coordinate descent. arXiv technical report, 2013.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
Ghaoui, L. E., Viallon, V., and Rabbani, T. Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization, 8(4):667–698, 2012.
Kim, H. and Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.
Kim, S. J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale L1-regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 1(4):606–617, 2007.
Koh, K., Kim, S. J., and Boyd, S. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
Li, M., Smola, A., and Andersen, D. G. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems 27, 2014.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society, Series B, 74(2):245–266, 2012.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Yuan, G. X., Ho, C. H., and Lin, C. J. An improved GLMNET for L1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012.

Active Set Algorithm
1. Until x is feasible do
2. Propose active set of important constraints
3. x ← Minimizer of objective s.t. only active set
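A short Python sketch of this generic template; the placeholder functions (is_feasible, propose_active_set, solve_subproblem) are mine, standing in for the unspecified choices that Blitz makes precise.

```python
def active_set_method(is_feasible, propose_active_set, solve_subproblem, x0):
    """Generic active-set loop; x0 is the minimizer subject to no constraints."""
    x = x0
    while not is_feasible(x):              # 1. until x is feasible do
        active = propose_active_set(x)     # 2. propose important constraints
        x = solve_subproblem(active, x)    # 3. x <- minimizer s.t. only active set
    return x
```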

Computing the y Update: computing the y-update is a 1D optimization problem. In the worst case it can be solved with the bisection method; for the linear case the solution is simpler. It requires considering all constraints.
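A minimal sketch of that 1D search by bisection for the general case, assuming the feasible set is described by a function max_violation(z) returning max_i c_i(z) (nonpositive exactly when z is feasible); the function name and the fixed iteration count are illustrative assumptions.

```python
def extreme_feasible_step(max_violation, y, x, iters=50):
    """Largest alpha in [0, 1] such that y + alpha * (x - y) remains feasible."""
    if max_violation(x) <= 0:
        return 1.0                    # x itself is feasible: take the full step
    lo, hi = 0.0, 1.0                 # lo stays feasible, hi stays infeasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if max_violation(y + mid * (x - y)) <= 0:
            lo = mid                  # still feasible: move toward x
        else:
            hi = mid                  # infeasible: pull back toward y
    return lo
```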

Single Machine, Data in Memory. Figure: support set precision and support set recall vs. time (s) on the high-dimensional RCV1 dataset, comparing ProxNewt, CD, L1_LR, LIBLINEAR, GLMNET, and Blitz; plot annotations contrast active sets with no prioritization.