Machine Learning & Data Mining
CS/CNS/EE 155
Lecture 3: Regularization, Sparsity & Lasso
Homework 1
Check course website!
Some coding required
Some plotting required
– I recommend Matlab
Has supplementary datasets
Submit via Moodle (due Jan 20th @ 5pm)
Recap: Complete Pipeline
Training Data → Model Class(es) → Loss Function → Cross Validation & Model Selection → Profit!
Different Model Classes?
Cross Validation & Model Selection
– Option 1: SVMs vs ANNs vs LR vs LS
– Option 2: Regularization
Notation
L0 Norm: number of non-zero entries
L1 Norm: sum of absolute values
L2 Norm & Squared L2 Norm:
– L2 Norm: sqrt(sum of squares)
– Squared L2 Norm: sum of squares
L-infinity Norm: max absolute value
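As a quick illustration (my own numpy sketch, not from the slides), these norms can be computed as:

import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0])

l0   = np.count_nonzero(w)       # L0 "norm": number of non-zero entries -> 2
l1   = np.sum(np.abs(w))         # L1 norm: sum of absolute values       -> 7.0
l2   = np.sqrt(np.sum(w ** 2))   # L2 norm: sqrt(sum of squares)         -> 5.0
l2sq = np.sum(w ** 2)            # Squared L2 norm: sum of squares       -> 25.0
linf = np.max(np.abs(w))         # L-infinity norm: max absolute value   -> 4.0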
Notation Part 2
Minimizing Squared Loss:
– "Regression" means Least-Squares regression
– (unless otherwise stated)
E.g., Logistic Regression instead minimizes Log Loss
Ridge Regression
aka L2-Regularized Regression:

argmin_{w,b}  λ||w||_2^2  +  Σ_i (y_i − (w^T x_i + b))^2
             (Regularization)      (Training Loss)

Trades off model complexity vs training loss
Each choice of λ defines a "model class"
– Will discuss this further later
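A minimal numpy sketch of ridge training via its closed-form solution (illustrative code, not the course's; b is omitted, as in the later slides):

import numpy as np

def ridge_fit(X, y, lam):
    # Solves argmin_w  lam * ||w||_2^2 + ||y - X w||_2^2  in closed form:
    #   w = (X^T X + lam * I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

Larger lam pulls the solution toward the zero vector; lam = 0 recovers ordinary least squares (when X^T X is invertible).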
Person   Age>10  Male?  Height>55"
Alice    1       0      1
Bob      0       1      0
Carol    0       0      0
Dave     1       1      1
Erin     1       0      1
Frank    0       1      1
Gena     0       0      0
Harold   1       1      1
Irene    1       0      0
John     0       1      1
Kelly    1       0      1
Larry    1       1      1
(Rows are split into train and test on the original slide.)

Learned parameters (columns: the two weights of w, then b), one row per increasingly large λ:
 0.7401   0.2441  -0.1745
 0.7122   0.2277  -0.1967
 0.6197   0.1765  -0.2686
 0.4124   0.0817  -0.4196
 0.1801   0.0161  -0.5686
 0.0001   0.0000  -0.6666
As λ grows, the learned weights shrink toward zero.
Updated Pipeline
Training Data → Model Class → Loss Function → Cross Validation & Model Selection (choosing λ!) → Profit!
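A sketch of the "choosing λ" step via K-fold cross validation (illustrative numpy code; the fold count and candidate grid are placeholders, not the course's):

import numpy as np

def cv_choose_lambda(X, y, lambdas, K=5):
    # K-fold cross validation over candidate lambdas; returns the lambda with
    # the lowest average held-out squared error.
    n, d = X.shape
    folds = np.array_split(np.random.permutation(n), K)
    avg_err = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            val = folds[k]
            train = np.hstack([folds[j] for j in range(K) if j != k])
            w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(d),
                                X[train].T @ y[train])           # ridge closed form
            errs.append(np.mean((y[val] - X[val] @ w) ** 2))     # held-out squared error
        avg_err.append(np.mean(errs))
    return lambdas[int(np.argmin(avg_err))]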
Person   Age>10  Male?  Height>55"   Model score with increasing λ →
Alice    1       0      1            0.91  0.89  0.83  0.75  0.67
Bob      0       1      0            0.42  0.45  0.50  0.58  0.67
Carol    0       0      0            0.17  0.26  0.42  0.50  0.67
Dave     1       1      1            1.16  1.06  0.91  0.83  0.67
Erin     1       0      1            0.91  0.89  0.83  0.79  0.67
Frank    0       1      1            0.42  0.45  0.50  0.54  0.67
Gena     0       0      0            0.17  0.27  0.42  0.50  0.67
Harold   1       1      1            1.16  1.06  0.91  0.83  0.67
Irene    1       0      0            0.91  0.89  0.83  0.79  0.67
John     0       1      1            0.42  0.45  0.50  0.54  0.67
Kelly    1       0      1            0.91  0.89  0.83  0.79  0.67
Larry    1       1      1            1.16  1.06  0.91  0.83  0.67
(Rows are split into train and test on the original slide; an intermediate λ achieves the best test error.)
Choice of Lambda Depends on Training Size
(Plot on slide; setup: 25-dimensional space, randomly generated linear response function + noise)
Recap: Ridge Regularization
Ridge Regression:
– L2-Regularized Least-Squares
Large λ → more stable predictions
– Less likely to overfit to training data
– Too large λ → underfit
Works with other losses
– Hinge Loss, Log Loss, etc.
Model Class Interpretation
This is not a model class!
– At least not in the sense we've discussed...
It is an optimization procedure
– Is there a connection?
Norm-Constrained Model Class
Constrain the model to ||w||_2^2 ≤ c
(Visualization on slide: nested constraint regions for c = 1, 2, 3)
Seems to correspond to λ...
Lagrange Multipliers
(Omitting b & using a single training example for simplicity)
Reference: http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm-Constrained Model Class
(Omitting b & using a single training example for simplicity)

Training:  argmin_w (y − w^T x)^2  subject to  ||w||_2^2 ≤ c
Observation about optimality: the constraint is active at the optimum (||w||_2^2 = c), assuming the unconstrained minimizer lies outside the constraint region
Lagrangian:  Λ(w, λ) = (y − w^T x)^2 + λ(||w||_2^2 − c)

Claim: solving the Lagrangian solves the norm-constrained training problem.
Two conditions must be satisfied at optimality:
– Stationarity in w: setting ∂Λ/∂w = 0 satisfies the first condition
– Feasibility: setting ∂Λ/∂λ = 0 gives ||w||_2^2 = c, which satisfies the second condition

Reference: http://en.wikipedia.org/wiki/Lagrange_multiplier
Norm-Constrained vs L2-Regularized Training
(Omitting b & using a single training example for simplicity)

Norm-constrained training:  argmin_w (y − w^T x)^2  subject to  ||w||_2^2 ≤ c
L2-regularized training:    argmin_w (y − w^T x)^2 + λ||w||_2^2
Lagrangian:                 Λ(w, λ) = (y − w^T x)^2 + λ(||w||_2^2 − c)

Lagrangian = Norm-Constrained Training (for the right multiplier λ)
Lagrangian = L2-Regularized Training:
– Hold λ fixed; the −λc term is constant in w
– Minimizing over w is then exactly L2-regularized training
– Equivalent to solving the Norm-Constrained problem for some c

Reference: http://en.wikipedia.org/wiki/Lagrange_multiplier
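A small numerical illustration of this correspondence (my own sketch, not the lecture's): each fixed λ produces a ridge solution whose squared norm plays the role of the budget c in an equivalent norm-constrained problem, and larger λ corresponds to smaller c.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)   # ridge solution for this lambda
    c = np.sum(w ** 2)                                        # the norm budget this lambda corresponds to
    print(f"lambda = {lam:7.2f}   ->   c = ||w||^2 = {c:.4f}")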
Recap #2: Ridge Regularization
Ridge Regression:
– L2-Regularized Least-Squares
– = Norm-Constrained Model
Large λ → more stable predictions
– Less likely to overfit to training data
– Too large λ → underfit
Works with other losses
– Hinge Loss, Log Loss, etc.
Hallucinating Data Points
(Omitting b for simplicity)
What if we instead hallucinate D extra data points?
– One per dimension d: input √λ · e_d (the unit vector along the d-th dimension), target 0
– Their squared-loss terms sum to Σ_d (0 − √λ w_d)^2 = λ||w||_2^2
Identical to L2 Regularization!
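A quick numpy check of this equivalence (illustrative, not from the lecture): running plain least squares on the data augmented with the D hallucinated points recovers the ridge solution.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 2.5

# Ridge closed form.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Plain least squares on augmented data: D extra rows sqrt(lam) * e_d with target 0.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(4)])
y_aug = np.concatenate([y, np.zeros(4)])
w_ls, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_ls))   # True: hallucinated points = L2 regularization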
Extension: Multi-task Learning
2 prediction tasks:
– Spam filter for Alice
– Spam filter for Bob
Limited training data for both...
– ...but Alice is similar to Bob
Extension: Multi-task Learning
(Omitting b for simplicity)
Two training sets, each with N relatively small
Option 1: Train Separately
– Fit w on task 1's data and v on task 2's data independently
Both models have high error.
Extension: Multi-task Learning
(Omitting b for simplicity)
Two training sets, each with N relatively small
Option 2: Train Jointly
– Minimize the sum of both tasks' training losses over (w, v)
Doesn't accomplish anything!
(w & v don't depend on each other in the objective)
Multi-task Regularization
Prefer w & v to be "close":

argmin_{w,v}  λ(||w||_2^2 + ||v||_2^2)  +  γ||w − v||_2^2  +  (Training Loss on both tasks)
              (Standard Regularization)    (Multi-task Regularization)

– Closeness controlled by γ
– Tasks similar → larger γ helps!
– Tasks not identical → γ not too large
(Plot on slide: Test Loss on Task 2 vs γ)
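A minimal gradient-descent sketch of this joint objective (my own illustrative code; the step size and iteration count are arbitrary placeholders, and X1,y1 / X2,y2 stand for the two tasks' training sets):

import numpy as np

def multitask_ridge(X1, y1, X2, y2, lam=0.1, gamma=1.0, step=1e-3, iters=5000):
    # Minimizes  lam*(||w||^2 + ||v||^2) + gamma*||w - v||^2
    #            + ||y1 - X1 w||^2 + ||y2 - X2 v||^2   over (w, v).
    d = X1.shape[1]
    w = np.zeros(d)   # task 1 weights
    v = np.zeros(d)   # task 2 weights
    for _ in range(iters):
        grad_w = 2 * X1.T @ (X1 @ w - y1) + 2 * lam * w + 2 * gamma * (w - v)
        grad_v = 2 * X2.T @ (X2 @ v - y2) + 2 * lam * v + 2 * gamma * (v - w)
        w = w - step * grad_w
        v = v - step * grad_v
    return w, v

With gamma = 0 this reduces to training the two ridge models separately; very large gamma forces w ≈ v, effectively pooling the two tasks.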
Lasso
L1-Regularized Least-Squares
L1-Regularized Least-Squares
(Omitting b for simplicity)

L2: argmin_w λ||w||_2^2 + Σ_i (y_i − w^T x_i)^2,  where ||w||_2^2 = Σ_d w_d^2
vs
L1: argmin_w λ||w||_1 + Σ_i (y_i − w^T x_i)^2,  where ||w||_1 = Σ_d |w_d|
Subgradient (sub-differential)
(Omitting b for simplicity)
Where the objective is differentiable, the subgradient is just the gradient.
L1: the subdifferential of |w_d| is sign(w_d) for w_d ≠ 0, and the whole interval [−1, +1] at w_d = 0
– A continuous range of subgradients at w_d = 0!
L1-Regularized Least-Squares
(Omitting b for simplicity)
L2: the regularizer's gradient is 2λw, which shrinks each weight smoothly toward 0
L1: the regularizer's subgradient is λ·sign(w_d) for w_d ≠ 0, and anything in [−λ, +λ] at w_d = 0, which can hold weights exactly at 0
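One standard way to handle the non-differentiable L1 term is proximal gradient descent (ISTA), where a soft-threshold step realizes the [−λ, +λ] subgradient range at zero and sets small coordinates exactly to 0. A minimal numpy sketch (my own, not the lecture's prescribed optimizer):

import numpy as np

def soft_threshold(z, t):
    # Shrinks each coordinate of z toward 0 by t, setting it exactly to 0 when |z_d| <= t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=2000):
    # Minimizes  lam * ||w||_1 + ||y - X w||_2^2  by proximal gradient descent.
    n, d = X.shape
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth squared-loss term
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)               # gradient of the squared-loss term only
        w = soft_threshold(w - step * grad, step * lam)
    return w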
Lagrange Multipliers
(Omitting b & using a single training example for simplicity)
With the L1 constraint ||w||_1 ≤ c, solutions tend to be at corners of the constraint region!
Reference: http://en.wikipedia.org/wiki/Lagrange_multiplier
Sparsity
(Omitting b for simplicity)
w is sparse if it is mostly 0's:
– Small L0 norm
Why not L0 regularization?
– Not continuous!
L1 induces sparsity
– And is continuous!
Why is Sparsity Important?
Computational / Memory Efficiency
– Instead of storing 1M numbers in a dense array...
– ...store 2 numbers per non-zero entry: (index, value) pairs, e.g., [ (50, 1), (51, 1) ]
– Dot products become more efficient: cost scales with the number of non-zeros (see the sketch below)
Sometimes the true w is sparse
– Want to recover the non-zero dimensions
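A tiny illustration of the (index, value) representation and its cheaper dot product (my own sketch):

def sparse_dot(w_sparse, x):
    # w_sparse: list of (index, value) pairs for the non-zero entries of w,
    # e.g., [(50, 1.0), (51, 1.0)].  Cost scales with the number of non-zeros,
    # not with the full dimensionality of x.
    return sum(value * x[index] for index, value in w_sparse)

x = [0.0] * 1_000_000
x[50], x[51] = 2.0, 3.0
print(sparse_dot([(50, 1.0), (51, 1.0)], x))   # 5.0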
Lasso Guarantee
Suppose the data are generated as y = x^T w* + noise, for a sparse w*.
Then, under suitable conditions on λ, the design, and the noise, with high probability (increasing with N):
– High precision parameter recovery: the non-zero dimensions the Lasso selects are true non-zeros of w*
– Sometimes high recall: if the non-zero entries of w* are large enough, all of them are recovered
See also:
https://www.cs.utexas.edu/~pradeepr/courses/395T-LT/filez/highdimII.pdf
http://www.eecs.berkeley.edu/~wainwrig/Papers/Wai_SparseInfo09.pdf
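A small synthetic illustration of this recovery behavior (my own sketch, assuming scikit-learn is available; it is not the experiment behind the stated guarantee):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 200, 50, 3                        # samples, dimensions, true non-zeros
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.normal(0.0, 2.0, size=k)

X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)   # sparse linear response + noise

model = Lasso(alpha=0.05).fit(X, y)         # alpha plays the role of lambda (up to scaling)
print("true support:     ", np.flatnonzero(w_true))
print("recovered support:", np.flatnonzero(model.coef_))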
Person   Age>10  Male?  Height>55"
Alice    1       0      1
Bob      0       1      0
Carol    0       0      0
Dave     1       1      1
Erin     1       0      1
Frank    0       1      1
Gena     0       0      0
Harold   1       1      1
Irene    1       0      0
John     0       1      1
Kelly    1       0      1
Larry    1       1      1
Recap: Lasso vs Ridge
Model Assumptions
– Lasso learns a sparse weight vector
Predictive Accuracy
– Lasso often not as accurate
– Fix: re-run Least Squares on the dimensions selected by Lasso (see the sketch below)
Ease of Inspection
– Sparse w's are easier to inspect
Ease of Optimization
– Lasso somewhat trickier to optimize
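A minimal sketch of the "refit least squares on the Lasso-selected dimensions" fix mentioned above (my own code, assuming scikit-learn; the lecture does not prescribe a library):

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lasso_then_ols(X, y, alpha=0.05):
    # Step 1: use Lasso purely for feature selection.
    selected = np.flatnonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    w = np.zeros(X.shape[1])
    if selected.size == 0:
        return w, selected
    # Step 2: refit unregularized least squares on the selected dimensions only.
    ols = LinearRegression(fit_intercept=False).fit(X[:, selected], y)
    w[selected] = ols.coef_
    return w, selected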
Recap: Regularization
(Omitting b for simplicity)
– L2 (Ridge)
– L1 (Lasso)
– Multi-task
– [Insert Yours Here!]
Next Lecture: Recent Applications of Lasso
– Cancer Detection
– Personalization via Twitter
Recitation on Wednesday: Probability & Statistics
Image Sources:
http://statweb.stanford.edu/~tibs/ftp/canc.pdf
https://dl.dropboxusercontent.com/u/16830382/papers/badgepaper-kdd2013.pdf