Recitation 4 for Big Data
Jay Gu, Feb
LASSO and Coordinate Descent
A numerical example
Generate some synthetic data:
- N = 50, P = 200, number of nonzero coefficients = 5
- X ~ Normal(0, I)
- nonzero coefficients beta_1, beta_2, beta_3, ... ~ Normal(1, 2)
- noise sigma ~ Normal(0, 0.1 I)
- Y = X beta + sigma
- Split training vs. testing: 80/20
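A minimal sketch of this data-generating process in Python/numpy (the random seed, the choice of which coefficients are nonzero, and reading Normal(1, 2) as mean 1 and variance 2 are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

N, P, k = 50, 200, 5              # samples, features, nonzero coefficients

# Design matrix with i.i.d. standard normal entries: X ~ Normal(0, I)
X = rng.standard_normal((N, P))

# Sparse coefficient vector: k nonzero entries drawn from Normal(1, 2)
# (assuming the 2 is a variance, hence scale = sqrt(2))
beta = np.zeros(P)
beta[:k] = rng.normal(loc=1.0, scale=np.sqrt(2.0), size=k)

# Noise ~ Normal(0, 0.1 I) and response Y = X beta + noise
eps = rng.normal(scale=np.sqrt(0.1), size=N)
y = X @ beta + eps

# 80/20 train/test split
n_train = int(0.8 * N)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```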
Practicalities
- Standardize your data:
  - Center X and Y, so the intercept can be dropped.
  - Scale each column of X to unit norm, so all covariates are regularized fairly.
- Warm start: run the lambdas from large to small, starting from lambda_max = max_j |x_j' y|, at which the solution is guaranteed to have support size zero (all coefficients are zero); initialize each solve at the previous solution.
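Continuing from the synthetic data above, a sketch of the standardization step and of a decreasing lambda grid starting at lambda_max (the grid length and spacing are illustrative choices):

```python
import numpy as np

def standardize(X, y):
    """Center X and y (so the intercept can be dropped) and scale each column of X to unit norm."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    Xc = Xc / np.linalg.norm(Xc, axis=0)
    return Xc, yc

Xs, ys = standardize(X_train, y_train)

# At lambda_max = max_j |x_j' y| the lasso solution is exactly zero (support size zero),
# so run the lambdas from this value downward, warm-starting each solve.
lam_max = np.max(np.abs(Xs.T @ ys))
lambdas = lam_max * np.logspace(0, -3, num=100)   # decreasing geometric grid (illustrative)
```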
Algorithm
- Ridge regression: closed-form solution.
- LASSO: iterative algorithms:
  - Subgradient descent
  - Generalized gradient methods (ISTA)
  - Accelerated generalized gradient methods (FISTA)
  - Coordinate descent
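The ridge closed form referred to above, as a small sketch (assuming X and y are already centered; the function name and the choice of lambda are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge regression: solves (X'X + lam * I) beta = X'y, i.e. beta = (X'X + lam I)^{-1} X'y."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
```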
Subdifferentials and Coordinate Descent
Slides from Ryan Tibshirani:
- F12/slides/06-sg-method.pdf
- F12/slides/25-coord-desc.pdf
Coordinate descent: does it always find the global optimum?
- Convex and differentiable? Yes.
- Convex and non-differentiable? No, not in general.
- Convex, with the non-differentiable part separable across coordinates? Yes. Proof: see the sketch below.
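A sketch of the standard argument (notation mine): write the objective as a smooth convex part plus a coordinate-wise separable convex non-smooth part, and show that any coordinate-wise minimum is a global minimum.

```latex
\begin{align*}
&\text{Let } f(x) = g(x) + \sum_i h_i(x_i), \text{ with } g \text{ convex and differentiable and each } h_i \text{ convex.}\\
&\text{If } x \text{ minimizes } f \text{ along every coordinate, then } -\nabla_i g(x) \in \partial h_i(x_i) \text{ for all } i. \text{ For any } y,\\
&\quad f(y) - f(x) \;\ge\; \nabla g(x)^\top (y - x) + \sum_i \big[ h_i(y_i) - h_i(x_i) \big]
      \;=\; \sum_i \big[ \nabla_i g(x)\,(y_i - x_i) + h_i(y_i) - h_i(x_i) \big] \;\ge\; 0.
\end{align*}
```

The first inequality is convexity of g, and each bracketed term is nonnegative because -grad_i g(x) is a subgradient of h_i at x_i.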
CD for Linear Regression
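For linear regression, minimizing over the j-th coordinate with the others fixed gives beta_j = x_j'(y - X_{-j} beta_{-j}) / ||x_j||^2; adding the l1 penalty simply soft-thresholds this quantity. Below is a minimal sketch for standardized data (unit-norm columns, so the denominator is 1); the function names, tolerance, and iteration cap are illustrative, and lam = 0 recovers plain coordinate descent for least squares.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: solution of the one-dimensional lasso subproblem."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta0=None, max_iter=1000, tol=1e-6):
    """Coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1.

    Assumes each column of X has unit norm; beta0 allows warm starts.
    """
    _, P = X.shape
    beta = np.zeros(P) if beta0 is None else beta0.copy()
    r = y - X @ beta                      # full residual, kept up to date incrementally
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(P):
            old = beta[j]
            rho = X[:, j] @ r + old       # x_j'(y - X_{-j} beta_{-j}), using ||x_j||^2 = 1
            beta[j] = soft_threshold(rho, lam)
            if beta[j] != old:
                r -= X[:, j] * (beta[j] - old)
                max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta
```

With warm starts, sweep the lambdas from lambda_max downward and pass each solution as beta0 for the next, smaller lambda.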
Rate of convergence? Assuming the gradient is Lipschitz continuous:
- Subgradient descent: 1/sqrt(k)
- Gradient descent: 1/k
- Optimal rate for first-order methods: 1/k^2
- Coordinate descent: only known for some special cases
Summary: Coordinate Descent
- Good for large P.
- No tuning parameters (no step size to choose).
- In practice often converges much faster than the optimal first-order methods.
- Only applies to certain problem classes.
- Convergence rate unknown for general function classes.