Scalable training of L1-regularized log-linear models


Scalable training of L1-regularized log-linear models Galen Andrew (Joint work with Jianfeng Gao) ICML, 2007

Minimizing regularized loss
Many parametric ML models are trained by minimizing a function of the form f(w) = \ell(w) + r(w), where \ell(w) is a loss function quantifying "fit to the data":
- the negative log-likelihood of the training data, or
- the distance of incorrect examples from the decision boundary.
If zero is a reasonable "default" parameter value, we can use r(w) = C\,\|w\|, where \|\cdot\| is a norm penalizing large vectors and C is a constant.
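As a concrete illustration (added here, not part of the slides), a minimal Python sketch of such an objective for binary logistic regression with an L1 penalty; the function name and the arguments X, y, and C are placeholders.

    import numpy as np

    def l1_logistic_objective(w, X, y, C):
        """f(w) = negative log-likelihood + C * ||w||_1, with labels y in {-1, +1}."""
        margins = y * (X @ w)
        loss = np.sum(np.log1p(np.exp(-margins)))   # fit-to-data term
        penalty = C * np.sum(np.abs(w))             # L1 penalty
        return loss + penalty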

Types of norms
A norm precisely defines the "size" of a vector.
(Figure: contours of the L2-norm and of the L1-norm in 2D.)
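For reference (the slide itself showed only the contour plots), the two norms in question are

    \|w\|_2 = \Bigl(\sum_i w_i^2\Bigr)^{1/2}, \qquad \|w\|_1 = \sum_i |w_i|.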

Gradients of L2- and L1-norm
(Figure: negative gradient fields of the L2- and L1-norms in 2D.)
The "negative gradient" of the L1-norm (direction of steepest descent) points toward the coordinate axes.
The negative gradient of the L2-norm always points directly toward 0.

L1 induces sparse models
(Figure: 1-D slice of an L1-regularized objective.)
The sharp bend causes the optimal value to fall at x = 0.
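A one-dimensional worked example (added for clarity) makes the "sharp bend" argument concrete: with a quadratic loss, the L1-regularized minimizer is a soft-thresholded version of the unregularized one,

    \min_x \; \tfrac{1}{2}(x - a)^2 + C\,|x|
    \quad\Longrightarrow\quad
    x^{*} = \operatorname{sign}(a)\,\max(|a| - C,\, 0),

so whenever |a| \le C the sharp bend dominates and the minimizer is exactly x = 0.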

L1 induces sparse models
- At the global optimum, many parameters have value exactly zero (L2 would give small, nonzero values).
- Thus L1 performs continuous feature selection, yielding more interpretable, computationally manageable models.
- The C parameter tunes the sparsity/accuracy tradeoff.
- In our experiments, only 1.5% of features remain.

A nasty property of L1
The sharp bend at zero is also a problem:
- The objective is non-differentiable at the bend (the gradient is undefined there).
- It cannot be solved with standard gradient-based methods.

Digression: Newton's method
To optimize a function f:
1. Form the 2nd-order Taylor expansion around x_0:
   f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2}(x - x_0)^\top H (x - x_0)
2. Jump to its minimum: x_{new} = x_0 - H^{-1} \nabla f(x_0) (actually, line search in the direction of x_{new})
3. Repeat.
This is sort of an ideal. In practice, H is too large (n × n for n parameters).
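A minimal sketch of the Newton iteration just described, assuming a problem small enough that the Hessian can be formed and solved explicitly; the callables f, grad, and hess are illustrative.

    import numpy as np

    def newton_minimize(f, grad, hess, x0, tol=1e-8, max_iter=100):
        """Newton's method with a simple backtracking line search in the Newton direction."""
        x = x0.astype(float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            # Minimizer of the 2nd-order Taylor expansion: solve H d = -g
            d = np.linalg.solve(hess(x), -g)
            # Backtracking line search along the Newton direction
            t, fx = 1.0, f(x)
            while f(x + t * d) > fx and t > 1e-12:
                t *= 0.5
            x = x + t * d
        return x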

Limited-Memory Quasi-Newton
- Approximate H^{-1} with a low-rank matrix built from changes to the gradient in recent iterations.
- Approximate H^{-1} and not H, so there is no need to invert the matrix or solve a linear system.
- The most popular limited-memory quasi-Newton method is L-BFGS:
  - storage and computation are O(# vars)
  - very good theoretical convergence properties
  - empirically, the best method for training large-scale log-linear models with L2 (Malouf '02, Minka '03)
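As a point of reference (a generic sketch, not the authors' implementation), an off-the-shelf limited-memory solver such as SciPy's L-BFGS-B can train an L2-regularized logistic model in a few lines; X, y, and C here are placeholders with synthetic data.

    import numpy as np
    from scipy.optimize import minimize

    def l2_objective_and_grad(w, X, y, C):
        """L2-regularized logistic loss and its gradient (labels y in {-1, +1})."""
        margins = y * (X @ w)
        p = 1.0 / (1.0 + np.exp(margins))            # = 1 - sigmoid(margin)
        loss = np.sum(np.log1p(np.exp(-margins))) + C * (w @ w)
        grad = -(X.T @ (y * p)) + 2 * C * w
        return loss, grad

    # Synthetic data; in the paper the features come from a parse re-ranking model.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = np.sign(rng.normal(size=100))
    w0 = np.zeros(20)
    res = minimize(l2_objective_and_grad, w0, args=(X, y, 0.1),
                   jac=True, method="L-BFGS-B", options={"maxcor": 5})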

Orthant-Wise Limited-memory Quasi-Newton algorithm
- Our algorithm (OWL-QN) uses the fact that the L1 term is differentiable on any given orthant.
- In fact, it is linear there, so it doesn't affect the Hessian.

OWL-QN (cont.)
- For a given orthant, defined by a sign vector \xi, the objective can be written
  f(w) = \ell(w) + C\,\xi^\top w,
  and the penalty term C\,\xi^\top w is a linear function of w, so its Hessian is 0.
- The Hessian of f is therefore determined by the loss alone.
- We can use the gradient of the loss at previous iterations to estimate the Hessian of the objective on any orthant.
- Steps are constrained not to cross orthant boundaries.

OWL-QN (cont.)
1. Choose an orthant.
2. Find a quasi-Newton quadratic approximation to the objective on that orthant.
3. Jump to the minimum of the quadratic (actually, line search in the direction of the minimum).
4. Project back onto the orthant.
Repeat steps 1-4 until convergence.

Choosing an orthant to explore
We use the orthant...
- in which the current point sits, and
- into which the direction of steepest descent points.
(Computing the direction of steepest descent given the gradient of the loss is easy; see the paper for details. A sketch follows.)
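Along the lines described in the paper (reconstructed here as a sketch, not the released code), the steepest-descent "pseudo-gradient" of the L1-regularized objective, the orthant choice, and the projection can be computed componentwise; grad_loss is assumed to be the gradient of the loss at the current point.

    import numpy as np

    def pseudo_gradient(x, grad_loss, C):
        """Pseudo-gradient of loss(x) + C*||x||_1; its negation is the steepest descent direction."""
        pg = np.zeros_like(x, dtype=float)
        nonzero = x != 0
        pg[nonzero] = grad_loss[nonzero] + C * np.sign(x[nonzero])
        # At x_i == 0 the right/left directional derivatives are grad_i + C and grad_i - C;
        # take a step only if one of them permits descent, otherwise leave the coordinate at zero.
        zero = ~nonzero
        right = grad_loss[zero] + C
        left = grad_loss[zero] - C
        pg[zero] = np.where(right < 0, right, np.where(left > 0, left, 0.0))
        return pg

    def choose_orthant(x, pg):
        """Sign vector of the orthant: current signs where nonzero, else sign of steepest descent."""
        xi = np.sign(x)
        zero = xi == 0
        xi[zero] = np.sign(-pg[zero])
        return xi

    def project_onto_orthant(y, xi):
        """Zero out coordinates whose sign disagrees with the chosen orthant."""
        return np.where(np.sign(y) == xi, y, 0.0)

Each iteration would then compute an L-BFGS direction from loss gradients, line-search along it, and project the resulting point with project_onto_orthant.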

Toy example
One iteration of OWL-QN:
1. Find the vector of steepest descent.
2. Choose an orthant.
3. Find the L-BFGS quadratic approximation.
4. Jump to its minimum.
5. Project back onto the orthant.
6. Update the Hessian approximation using the gradient of the loss alone.

Notes
- Variables are added to or removed from the model as orthant boundaries are hit.
- A variable can change signs in two iterations.
- Glossing over some details: there is a line search with projection at each iteration, and it is convenient to expand the notion of "orthant" to constrain some variables at zero. See the paper for complete details.
- In the paper we prove convergence to the optimum.

Experiments
- We ran experiments with the parse re-ranking model of Charniak & Johnson (2005): start with a set of candidate parses for each sentence (produced by a baseline parser) and train a log-linear model to select the correct one.
- The model uses ~1.2M features of a parse.
- Train on Sections 2-19 of the PTB (36K sentences with 50 parses each).
- Fit C to maximize F-measure on Sections 20-21 (4K sentences).

Training methods compared
We compared OWL-QN with three other methods:
- Kazama & Tsujii's (2003) paired-variable formulation for L1, implemented with AlgLib's L-BFGS-B
- L2 with our own implementation of L-BFGS (on which OWL-QN is based)
- L2 with AlgLib's implementation of L-BFGS
K&T turns L1 into a differentiable problem with bound constraints and twice the variables (sketched below). It is similar to Goodman's 2004 method, but with L-BFGS-B instead of GIS.
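For reference, the paired-variable construction (standard notation, written out here rather than copied from Kazama & Tsujii) replaces each weight by a difference of two nonnegative variables:

    w_i = w_i^{+} - w_i^{-}, \qquad w_i^{+},\, w_i^{-} \ge 0, \qquad \|w\|_1 = \sum_i \bigl(w_i^{+} + w_i^{-}\bigr),

so \ell(w^{+} - w^{-}) + C \sum_i (w_i^{+} + w_i^{-}) is differentiable, at the price of twice as many variables plus simple bound constraints, which is what makes a bound-constrained solver such as L-BFGS-B applicable.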

Comparison methodology
- For each problem (L1 and L2), run both algorithms until the value is nearly constant, and report the time to reach within 1% of the best value.
- We also report the number of function evaluations: it is an implementation-independent comparison, and function evaluation dominates runtime.
- Results are reported with the chosen value of C.
- The L-BFGS memory parameter was 5 for all runs.

Results
Number of function evaluations and CPU time in seconds to reach within 1% of the best value found. Figures in parentheses are percentages of total time.

                               # func. evals   Func eval time   L-BFGS dir time   Other time    Total time
    OWL-QN                     54              707 (97.7)       10.4 (1.4)        6.9 (1.0)     724
    K&T (AlgLib's L-BFGS-B)    > 946           16043 (91.2)     --                1555 (8.8)    > 17600
    L2 with our L-BFGS         109             1400 (97.7)      22.4 (1.5)        10 (0.7)      1433
    L2 with AlgLib's L-BFGS    107             1384 (83.4)      --                276 (16.6)    1660

Notes:
- Our L-BFGS and AlgLib's are comparable, so comparing OWL-QN against K&T with AlgLib is a fair comparison.
- In terms of both function evaluations and raw time, OWL-QN is orders of magnitude faster than K&T.
- The most expensive step of OWL-QN is computing the L-BFGS direction (not the projections, computing the steepest descent vector, etc.).
- Optimizing the L1 objective with OWL-QN is twice as fast as optimizing L2 with L-BFGS.

Objective value during training
(Figure: objective value during training for L1 with OWL-QN, L1 with Kazama & Tsujii, L2 with our L-BFGS, and L2 with AlgLib's L-BFGS.)

Sparsity during training
- Both algorithms start with ~5% of features, then gradually prune them away.
- OWL-QN finds sparse models quickly.
(Figure: sparsity during training for OWL-QN and Kazama & Tsujii.)

Extensions
- For the ACL paper, we ran OWL-QN on 3 very different log-linear NLP models with up to 8M features:
  - a CMM sequence model for POS tagging
  - a re-ranking log-linear model for LM adaptation
  - a semi-CRF for Chinese word segmentation
- It can use any smooth convex loss; we've also tried least-squares (LASSO regression).
- A small change allows a non-convex loss (only a local minimum is then guaranteed).

Software download
- We've released the C++ OWL-QN source.
- The user can specify an arbitrary convex smooth loss.
- Also included are standalone trainers for L1 logistic regression and least-squares (LASSO).
- Please visit my webpage for the download (find it with the search engine of your choice).

Thanks! Galen Andrew Microsoft Research

Summary
- L1 regularization produces sparse, interpretable models that generalize well.
- Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), based on L-BFGS, efficiently minimizes L1-regularized loss with millions of variables.
- It was faster even than L2 regularization in our experiments.
- Source code is available for download.