StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent. Tyler B. Johnson and Carlos Guestrin, University of Washington
Coordinate descent: a simple and effective optimization algorithm; fast in practice; well understood in theory; no learning rate or other parameters to tune 😃
Lasso objective: the solution is sparse, with the majority of weights equal to 0
Nonnegative Lasso objective: StingyCD can also solve the standard Lasso, and it is straightforward to extend to linear SVMs
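The equation on this slide did not survive extraction. For orientation, the nonnegative Lasso is commonly written in the following form; the symbols (design matrix A with m columns, targets b, regularization weight λ) are assumptions chosen here, not notation recovered from the slide.

```latex
\min_{x \in \mathbb{R}^m} \;\; \tfrac{1}{2}\,\lVert Ax - b \rVert_2^2 \;+\; \lambda \sum_{i=1}^{m} x_i
\quad \text{subject to} \quad x \ge 0
```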
Inside an iteration of CD: maintain the residuals vector; for the chosen coordinate, compute the gradient and the resulting update
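A minimal sketch of one such iteration, assuming the standard nonnegative Lasso objective and a residuals vector r = b - Ax kept up to date after every step. The names `dot`, `cd_update`, and `cols` (a list of columns of A) are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def cd_update(cols, r, x, i, lam):
    """Apply one coordinate update in place and return the step size.

    cols[i] is column A_i, r is the residuals vector b - A x,
    and lam is the regularization weight (all names are assumptions).
    """
    a_i = cols[i]
    g = dot(a_i, r)                      # cost grows with the length of A_i
    # Exact minimizer along coordinate i, clipped so that x[i] stays >= 0.
    delta = max(-x[i], (g - lam) / dot(a_i, a_i))
    if delta != 0.0:
        x[i] += delta
        for k, a in enumerate(a_i):      # keep r equal to b - A x
            r[k] -= delta * a
    return delta


cols = [[1.0, 0.0], [0.0, 1.0]]
x = [0.0, 0.0]
r = [3.0, 0.5]                           # r = b because x starts at zero
first = cd_update(cols, r, x, 0, 1.0)    # moves x[0] away from zero
second = cd_update(cols, r, x, 1, 1.0)   # a "zero update": nothing changes
```

In this toy run the second call still pays for an inner product even though it leaves `x` and `r` untouched.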
Major drawback of CD: “zero updates”. Zero updates are wasteful! Due to sparsity, zero updates are very common, yet computing the gradient takes time even when the update turns out to be zero
StingyCD: skip updates that are guaranteed to be zero. Checking the skip condition requires just constant time
Geometry of a zero update (figure: residuals vector)
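The figure is not recoverable, but the geometry can be stated algebraically under the usual nonnegative Lasso update rule. In the sketch below, A_i is the column for coordinate i, r the residuals vector, λ the regularization weight, and δ_i the step; this notation is assumed here rather than taken from the slide. When x_i is already 0, the update is zero whenever r lies in a halfspace.

```latex
x_i = 0 \ \text{ and } \ \langle A_i, r \rangle \le \lambda
\;\;\Longrightarrow\;\;
\delta_i = \max\!\left\{ -x_i,\ \frac{\langle A_i, r \rangle - \lambda}{\lVert A_i \rVert_2^2} \right\} = 0
```

The boundary of this region is the hyperplane ⟨A_i, r⟩ = λ, so how far r sits from that hyperplane determines how much r can move before the update could become nonzero.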
StingyCD makes 3 simple changes to CD
Change 1: Reference residuals vector. The reference is updated infrequently (once every several epochs)
Change 2: Track reference distance
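One way the distance between the current residuals and the reference could be tracked in constant time is to expand the squared norm after each step. The sketch below assumes the inner product with the current residuals is reused from the coordinate update and that the inner product with the reference and the squared column norm are cached; the function and argument names are hypothetical.

```python
def track_distance(q, delta, ai_dot_r, ai_dot_ref, ai_sq_norm):
    """Return ||(r - delta * A_i) - r_ref||^2 given q = ||r - r_ref||^2.

    ai_dot_r   : <A_i, r> before the step (already needed for the update)
    ai_dot_ref : <A_i, r_ref>, assumed cached when the reference was set
    ai_sq_norm : ||A_i||^2, assumed precomputed
    """
    return q - 2.0 * delta * (ai_dot_r - ai_dot_ref) + delta * delta * ai_sq_norm


def sq_dist(u, v):
    """Direct squared distance, used only to check the shortcut."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


r_ref = [3.0, 0.5]
r = [3.0, 0.5]                            # start at the reference, so q = 0
a_i, delta = [1.0, 2.0], 0.5
q = track_distance(0.0, delta, 4.0, 4.0, 5.0)
r_new = [rk - delta * ak for rk, ak in zip(r, a_i)]
matches = abs(q - sq_dist(r_new, r_ref)) < 1e-12
```

Only scalars enter `track_distance`, which is what keeps the bookkeeping independent of the length of the residuals vector.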
Change 3: Threshold reference distance
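A hedged sketch of how per-coordinate thresholds might be formed: each threshold is the signed squared distance from the reference residuals to the hyperplane bounding the zero-update region, and an update is skipped only if the tracked squared distance does not exceed it. The names `compute_thresholds` and `can_skip` are illustrative assumptions, not identifiers from the talk.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def compute_thresholds(cols, r_ref, lam):
    """Signed squared distance from r_ref to each hyperplane <A_i, r> = lam.

    Positive values mean r_ref lies inside the zero-update region of
    coordinate i; negative values mean it lies outside.
    """
    taus = []
    for a_i in cols:
        gap = lam - dot(a_i, r_ref)
        sign = 1.0 if gap >= 0 else -1.0
        taus.append(sign * gap * gap / dot(a_i, a_i))
    return taus


def can_skip(q, tau_i, x_i):
    """True when the update for coordinate i is guaranteed to be zero.

    q is the squared distance between current and reference residuals.
    """
    return x_i == 0.0 and q <= tau_i


cols = [[1.0, 0.0], [0.0, 2.0]]
taus = compute_thresholds(cols, [3.0, 0.25], 1.0)
skip_second = can_skip(0.0, taus[1], 0.0)   # residuals have not moved yet
```

Because `q` is nonnegative, a negative threshold can never trigger a skip, so coordinates whose reference point lies outside the zero-update region are always evaluated.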
Summary of StingyCD changes: before each iteration, check the skip condition (constant time)
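Putting the three changes together, a loop along the following lines is one possible reading of the summary. It is a sketch under assumptions (cyclic coordinate order, a fixed reference schedule `ref_every`, dense columns stored as Python lists) rather than the authors' implementation, and it ignores floating-point drift in the tracked distance.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def stingy_cd(cols, b, lam, epochs=50, ref_every=5, use_skips=True):
    """Cyclic CD for nonnegative Lasso with a constant-time skip test.

    cols[i] is column A_i. Returns (x, number_of_skipped_updates).
    All names and the reference schedule are illustrative assumptions.
    """
    m = len(cols)
    sq_norms = [dot(a, a) for a in cols]
    x = [0.0] * m
    r = list(b)                                   # residuals r = b - A x
    skipped = 0
    for epoch in range(epochs):
        if epoch % ref_every == 0:                # infrequent reference update
            ref_dots = [dot(a, r) for a in cols]  # <A_i, r_ref> for each i
            taus = []
            for i in range(m):
                gap = lam - ref_dots[i]
                sign = 1.0 if gap >= 0 else -1.0
                taus.append(sign * gap * gap / sq_norms[i])
            q = 0.0                               # ||r - r_ref||^2
        for i in range(m):
            if use_skips and x[i] == 0.0 and q <= taus[i]:
                skipped += 1                      # update guaranteed to be zero
                continue
            g = dot(cols[i], r)
            delta = max(-x[i], (g - lam) / sq_norms[i])
            if delta == 0.0:
                continue
            # Constant-time update of the tracked squared distance.
            q += -2.0 * delta * (g - ref_dots[i]) + delta * delta * sq_norms[i]
            x[i] += delta
            for k, a in enumerate(cols[i]):
                r[k] -= delta * a
    return x, skipped


cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [3.0, 0.5, -1.0]
x_stingy, n_skipped = stingy_cd(cols, b, 1.0, epochs=10)
x_plain, _ = stingy_cd(cols, b, 1.0, epochs=10, use_skips=False)
```

The sketch stores only the inner products with the reference rather than the reference vector itself, since those scalars are all the skip test and the distance update need. Both runs should reach the same `x`, because a skipped update would have been zero anyway.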
Reference update trade-off
Scheduling reference updates (plot; axis: relative time to converge)
StingyCD empirical performance (plot: relative suboptimality vs. time (s); methods: CD, CD + Safe screening, StingyCD)
Skipping more updates with StingyCD+
Probability of useful update: StingyCD+ models the probability that each update is useful (i.e., nonzero) and computes this probability efficiently with a lookup table
StingyCD+ empirical performance (plot: relative suboptimality vs. time (s); methods: CD, CD + Safe screening, StingyCD, StingyCD+)
Combining StingyCD+ with other methods: popular sparse logistic regression algorithms include approximate proximal Newton and working set algorithms. Both rely on Lasso subproblem solvers. Compare CD and StingyCD+ as subproblem solvers
Sparse logistic regression results (plot: relative suboptimality vs. time (s); methods: CD ProxNewt, CD ProxNewt w/ Working Sets, StingyCD+ ProxNewt, StingyCD+ ProxNewt w/ Working Sets)
Sparse logistic regression results (plot: relative suboptimality vs. time (min); methods: CD ProxNewt, CD ProxNewt w/ Working Sets, StingyCD+ ProxNewt, StingyCD+ ProxNewt w/ Working Sets)
Takeaways: StingyCD makes simple changes to CD; avoids wasteful computation; further gains are possible with relaxations; can be combined with other methods. Future directions: extend to more problem settings; apply “stingy updates” to other algorithms. Thank you!