Differentiable Sparse Coding. David Bradley and J. Andrew Bagnell. NIPS 2008.
100,000 ft View: Complex Systems. [Pipeline diagram with labels: Cameras, Ladar, Voxel Grid, Voxel Features, Voxel Classifier, Column Cost, 2-D Planner, Y (path to goal); joint optimization of the whole system; initialize with "cheap" data.]
10,000 ft View: Sparse Coding = Generative Model; semi-supervised learning; KL-divergence regularization; implicit differentiation. [Diagram: Unlabeled Data, Optimization, Latent Variable, Classifier, Loss, Gradient.]
Sparse Coding: understand X
As a combination of factors B
Sparse coding uses optimization. Projection (feed-forward) gives some vector; we want to use that vector to classify x. Sparse coding instead solves an optimization over a reconstruction loss function.
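A minimal NumPy sketch of this contrast, with toy sizes and settings that are illustrative rather than the talk's: the projection is a single matrix multiply, while the optimization-based encoding minimizes the reconstruction loss by gradient descent.

```python
import numpy as np

# Toy contrast between a feed-forward projection and an optimization-based
# encoding of the same input x under the same basis B (all sizes illustrative).
rng = np.random.default_rng(0)
d, K = 20, 50                          # input dimension, number of basis vectors
B = rng.standard_normal((d, K))        # basis: columns are factors
x = rng.standard_normal(d)             # one input example

# Projection (feed-forward): some vector, computed in a single matrix multiply.
w_proj = B.T @ x

# Optimization: minimize the reconstruction loss ||x - B w||^2 by gradient
# descent (step size and iteration count are arbitrary choices).
w_opt = np.zeros(K)
for _ in range(500):
    w_opt -= 1e-3 * (2 * B.T @ (B @ w_opt - x))

print("reconstruction error, projection:  ", float(np.sum((x - B @ w_proj) ** 2)))
print("reconstruction error, optimization:", float(np.sum((x - B @ w_opt) ** 2)))
```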
Sparse vectors: only a few significant elements
Example: X = handwritten digits. [Figure: basis vectors colored by w.]
Optimization vs. Projection. [Figure: input, basis, and projection outputs.]
Optimization vs. Projection. [Figure: input, basis, and KL-regularized optimization outputs; outputs are sparse for each example.]
Generative Model: latent variables are independent in the prior; examples i are independent in the likelihood.
Sparse Approximation: the MAP estimate under the prior and likelihood.
Sparse Approximation: distance between the reconstruction and the input, plus a regularization constant times the distance between the weight vector and the prior mean.
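Written out with a generic loss L and regularizer D (the specific choices vary by slide), the sparse-approximation objective implied by the MAP estimate is

w_i^* \;=\; \arg\max_{w} \; P(x_i \mid w, B)\, P(w) \;=\; \arg\min_{w} \; L(x_i, B w) \;+\; \lambda\, D(w, w_0)

with L the reconstruction loss (negative log-likelihood), D the distance to the prior mean w_0 (negative log-prior), and \lambda the regularization constant.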
Example: Squared Loss + L1. Convex and sparse (widely studied in engineering) [Tropp, Signal Processing, 2006; Donoho and Elad, Proceedings of the National Academy of Sciences, 2002]. Sparse coding solves for B as well (non-convex for now…). Shown to generate useful features on diverse problems [Raina, Battle, Lee, Packer, Ng, ICML, 2007].
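As an illustration of this squared-loss + L1 case (not the authors' code), a short ISTA sketch for the sparse approximation w* = argmin_w ||x - Bw||^2 + lam*||w||_1; the step size comes from a standard Lipschitz bound and all sizes are arbitrary.

```python
import numpy as np

# ISTA for the L1 sparse-approximation step:
#   w* = argmin_w ||x - B w||^2 + lam * ||w||_1
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_sparse_approx(x, B, lam=0.1, n_iter=200):
    # 1/L step size, with L an upper bound on the Lipschitz constant of the
    # gradient of the quadratic term (2 * spectral norm of B squared).
    L = 2 * np.linalg.norm(B, 2) ** 2
    w = np.zeros(B.shape[1])
    for _ in range(n_iter):
        grad = 2 * B.T @ (B @ w - x)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 50))
x = rng.standard_normal(20)
w = l1_sparse_approx(x, B)
print("nonzero coefficients:", np.count_nonzero(w), "of", w.size)
```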
L1 Sparse Coding: optimize B over all examples; shown to generate useful features on diverse problems.
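A rough alternating-minimization sketch of the full sparse-coding problem (one common scheme, not necessarily the talk's particular solver): alternate between encoding every example with B fixed and taking a gradient step on B with the codes fixed, keeping basis columns unit-norm.

```python
import numpy as np

# Alternating minimization for L1 sparse coding over a data matrix X
# (columns of X are examples).  Encoding uses ISTA as in the previous snippet.
def ista(x, B, lam, n_iter=100):
    L = 2 * np.linalg.norm(B, 2) ** 2
    w = np.zeros(B.shape[1])
    for _ in range(n_iter):
        v = w - 2 * B.T @ (B @ w - x) / L
        w = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return w

def sparse_coding(X, K=50, lam=0.1, n_outer=10, lr=1e-2):
    d, n = X.shape
    rng = np.random.default_rng(0)
    B = rng.standard_normal((d, K))
    B /= np.linalg.norm(B, axis=0)                  # unit-norm basis vectors
    for _ in range(n_outer):
        # 1) Sparse approximation: encode every example with B held fixed.
        W = np.column_stack([ista(X[:, i], B, lam) for i in range(n)])
        # 2) Basis update: gradient step on the average reconstruction loss.
        B -= lr * 2 * (B @ W - X) @ W.T / n
        B /= np.linalg.norm(B, axis=0)              # re-normalize columns
    return B, W
```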
Differentiable Sparse Coding [Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008]. [Module diagram: unlabeled data X feeds an optimization module (B) with a reconstruction loss; its output W feeds a learning module (θ) with a loss function on labeled data (X, Y); sparse coding per Raina, Battle, Lee, Packer, Ng, ICML, 2007.]
L1 Regularization is Not Differentiable [Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008]. [Same sparse-coding module diagram as the previous slide.]
Why is this unsatisfying?
Problem #1: Instability. L1 MAP estimates are discontinuous, so the outputs are not differentiable. Instead use KL-divergence regularization (proven to compete with L1 in online learning).
Problem #2: No closed-form equation. At the MAP estimate, the gradient of the objective with respect to w vanishes (the first-order optimality condition), but w* itself has no closed form.
Solution: Implicit Differentiation. Differentiate both sides of the optimality condition with respect to an element of B; since w* is a function of B, apply the chain rule and solve for the derivative of w* with respect to that element.
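In symbols, with f(w, B) the sparse-approximation objective and w*(B) its minimizer (a standard statement of the implicit-differentiation step, in generic notation rather than the slides'):

\nabla_w f\big(w^*(B), B\big) = 0

\frac{\partial\, \nabla_w f}{\partial B_{kj}} \;+\; \nabla^2_{ww} f \; \frac{\partial w^*}{\partial B_{kj}} \;=\; 0
\quad\Longrightarrow\quad
\frac{\partial w^*}{\partial B_{kj}} \;=\; -\big(\nabla^2_{ww} f\big)^{-1}\, \frac{\partial\, \nabla_w f}{\partial B_{kj}}

so the downstream loss gradient follows from the chain rule, \partial \ell / \partial B_{kj} = (\nabla_{w^*} \ell)^\top \, \partial w^* / \partial B_{kj}.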
KL-Divergence Example: squared loss, KL prior.
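With a squared reconstruction loss and a KL regularizer toward a prior vector p (written here in one common unnormalized form; the exact constant terms follow the paper's definition), the objective is

w^* \;=\; \arg\min_{w \ge 0}\; \| x - B w \|_2^2 \;+\; \lambda \sum_j \Big( w_j \log \tfrac{w_j}{p_j} \;-\; w_j \;+\; p_j \Big)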
Handwritten Digit Recognition: 50,000-digit training set, 10,000-digit validation set, 10,000-digit test set.
Handwritten Digit Recognition, Step #1: unsupervised sparse coding (L2 loss and L1 prior) on the training set [Raina, Battle, Lee, Packer, Ng, ICML, 2007].
Handwritten Digit Recognition, Step #2: sparse approximation, feeding a maxent classifier and its loss function.
Handwritten Digit Recognition, Step #3: supervised sparse coding, backpropagating the maxent classifier loss function into the basis.
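A self-contained NumPy sketch of the step #3 idea: backpropagate a downstream loss through the encoder's argmin via implicit differentiation. To stay short and numerically checkable, it swaps the KL prior for an L2 prior so that w*(B) has a closed form; the paper's method applies the same implicit-gradient recipe to the KL-regularized objective. Dimensions, the toy loss, and the target vector are all made up for illustration.

```python
import numpy as np

# Backprop through w*(B) = argmin_w f(w, B) with f(w, B) = ||x - B w||^2 + lam ||w||^2.
rng = np.random.default_rng(0)
d, K = 8, 12
B = rng.standard_normal((d, K))
x = rng.standard_normal(d)
y = rng.standard_normal(K)           # arbitrary target for the toy downstream loss
lam = 0.5

def encode(B):
    # Closed-form minimizer of the L2-regularized reconstruction objective.
    return np.linalg.solve(B.T @ B + lam * np.eye(K), B.T @ x)

def downstream_loss(w):
    # Stand-in for the classifier loss (here: squared error to a target).
    return 0.5 * np.sum((w - y) ** 2)

# Forward pass.
w_star = encode(B)

# Backward pass via the implicit function theorem, specialized to this objective:
#   grad_B loss = -(r u^T + (B u) w*^T),  u = (B^T B + lam I)^{-1} g,
# with g = d loss / d w* and r = B w* - x.
g = w_star - y
u = np.linalg.solve(B.T @ B + lam * np.eye(K), g)
r = B @ w_star - x
grad_B = -(np.outer(r, u) + np.outer(B @ u, w_star))

# Finite-difference check of one entry of grad_B.
eps = 1e-6
B_pert = B.copy()
B_pert[2, 3] += eps
fd = (downstream_loss(encode(B_pert)) - downstream_loss(encode(B))) / eps
print(grad_B[2, 3], "vs finite difference", fd)
```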
KL Maintains Sparsity. [Plot, log scale: weight concentrated in a few elements.]
KL Adds Stability. [Figure comparing X with W under KL, W under backprop, and W under L1; Duda, Hart, Stork, Pattern Classification, 2001.]
Performance vs. Prior. [Results plot; arrow marks the better direction.]
Classifier Comparison. [Results plot; arrow marks the better direction.]
Comparison to Other Algorithms. [Results plot; arrow marks the better direction.]
Transfer to English Characters: ,000-character training set, 12,000-character validation set, 12,000-character test set.
Transfer to English Characters, Step #1: sparse approximation using the digits basis, feeding a maxent classifier and its loss function [Raina, Battle, Lee, Packer, Ng, ICML, 2007].
Transfer to English Characters, Step #2: supervised sparse coding with the maxent classifier loss function.
Transfer to English Characters. [Results plot; arrow marks the better direction.]
Text Application, Step #1: unsupervised sparse coding of X (KL loss + KL prior) on 5,000 movie reviews with a 10-point sentiment scale (1 = hated it, 10 = loved it) [Pang, Lee, Proceedings of the ACL, 2005].
Text Application, Step #2: sparse approximation of X, feeding linear regression with L2 loss on the 10-point sentiment scale (1 = hated it, 10 = loved it); 5-fold cross-validation.
Text Application, Step #3: supervised sparse coding of X with the linear regression L2 loss on the 10-point sentiment scale (1 = hated it, 10 = loved it); 5-fold cross-validation.
Movie Review Sentiment. [Results plot comparing the unsupervised basis, the supervised basis, and a state-of-the-art graphical model (Blei, McAuliffe, NIPS, 2007); arrow marks the better direction.]
Future Work. [System diagram with labels: RGB camera, NIR camera, ladar; sparse coding and engineered features; voxel classifier trained via MMP from labeled training data and example paths.]
Future Work: Convex Sparse Coding. Sparse approximation is convex; sparse coding is not, because the fixed-size basis is a non-convex constraint. Sparse coding is equivalent to sparse approximation on an infinitely large basis plus a non-convex rank constraint: relax that to a convex L1 rank constraint, and use boosting to perform sparse approximation directly on the infinitely large basis. [Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS, 2005; Zhao, Yu, Feature Selection for Data Mining, 2005; Rifkin, Lippert, Journal of Machine Learning Research, 2007.]
Questions?