Slide 1: Differentiable Sparse Coding. David Bradley and J. Andrew Bagnell, NIPS 2008.
Slide 2: Joint Optimization, 100,000 ft view. [Diagram of a complex perception-and-planning system: cameras and ladar, voxel features, voxel classifier, voxel grid, column cost, 2-D planner, Y (path to goal).] Initialize with "cheap" data.
Slide 3: 10,000 ft view. Sparse coding is treated as a generative model for semi-supervised learning, with KL-divergence regularization and implicit differentiation. [Diagram: unlabeled data → optimization → latent variable → classifier → loss, with the gradient flowing back.]
Slide 4: Sparse Coding. Understand X…
Slide 5: …as a combination of factors B.
Slide 6: Sparse coding uses optimization. Projection (feed-forward): compute some vector directly from x; we want to use that vector to classify x. Optimization: instead choose the vector by minimizing a reconstruction loss function.
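To make the contrast concrete, here is a minimal Python sketch (mine, not the authors' code); the basis B, the dimensions, and the small L1 penalty inside the reconstruction objective are illustrative assumptions:

    # Minimal sketch: feed-forward projection vs. optimization-based encoding.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    B = rng.standard_normal((64, 20))    # basis: 64-dim inputs, 20 factors (illustrative)
    x = rng.standard_normal(64)          # one input vector

    # Projection (feed-forward): a single matrix multiply gives "some vector".
    w_proj = B.T @ x

    # Optimization: choose the vector by minimizing a reconstruction loss
    # (squared error plus a small L1 penalty; the penalty choice is illustrative).
    def reconstruction_objective(w, lam=0.1):
        return np.sum((x - B @ w) ** 2) + lam * np.sum(np.abs(w))

    w_opt = minimize(reconstruction_objective, x0=np.zeros(20), method="Powell").x
    print(np.linalg.norm(x - B @ w_proj), np.linalg.norm(x - B @ w_opt))

The projection is cheap but ignores reconstruction quality; the optimization-based code is what the rest of the deck builds on.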
Slide 7: Sparse vectors: only a few significant elements.
Slide 8: Example: X = handwritten digits. [Figure: basis vectors colored by w.]
Slide 9: Optimization vs. Projection. [Figure: input examples, the basis, and the resulting projection outputs.]
Slide 10: Optimization vs. Projection. [Figure: input examples, the basis, and the KL-regularized optimization outputs.] The optimization outputs are sparse for each example.
Slide 11: Generative Model. Examples are independent, and the latent variables are independent; each example i combines a prior over its latent weights with a likelihood of the data given those weights.
Slide 12: Sparse Approximation. Sparse approximation is the MAP estimate of the latent weights under this prior and likelihood.
Slide 13: Sparse Approximation. The objective has two terms: a distance between the reconstruction and the input, and a distance between the weight vector and the prior mean, traded off by a regularization constant.
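Written out (my notation; the slides show these formulas as images), the MAP estimate of slides 11 and 12 turns, up to constants, into exactly the two-term objective described here:

    w^* = \arg\max_w \; P(x \mid w, B)\, P(w)
        = \arg\min_w \; \big[-\log P(x \mid w, B)\big] + \big[-\log P(w)\big]
        = \arg\min_w \; \underbrace{\mathrm{loss}(x, B w)}_{\text{reconstruction vs. input}}
          \; + \; \lambda\, \underbrace{D(w, \mu)}_{\text{weights vs. prior mean}}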
Slide 14: Example: squared loss + L1 prior. The resulting problem is convex and yields sparse solutions, and has been widely studied in engineering [Tropp, Signal Processing, 2006; Donoho and Elad, Proceedings of the National Academy of Sciences, 2002]. Sparse coding solves for B as well, which makes the problem non-convex (for now…). It has been shown to generate useful features on diverse problems [Raina, Battle, Lee, Packer, Ng, ICML 2007].
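As a concrete illustration of this squared-loss + L1 sparse approximation, here is a short ISTA-style proximal-gradient solver in Python; ISTA is a standard algorithm for this problem, not necessarily the optimizer used in the slides:

    # ISTA sketch for:  min_w  ||x - B w||^2 + lam * ||w||_1
    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(x, B, lam=0.1, n_iters=500):
        step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
        w = np.zeros(B.shape[1])
        for _ in range(n_iters):
            grad = 2.0 * B.T @ (B @ w - x)               # gradient of the squared loss
            w = soft_threshold(w - step * grad, step * lam)
        return w

    rng = np.random.default_rng(0)
    B = rng.standard_normal((64, 128))
    w_true = rng.standard_normal(128) * (rng.random(128) < 0.05)   # sparse ground truth
    w_hat = ista(B @ w_true, B)
    print("nonzero coefficients:", np.count_nonzero(w_hat))

The soft-thresholding step is what zeroes out most coefficients, and it is also where the non-differentiability discussed a few slides later comes from.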
Slide 15: L1 Sparse Coding. Optimize B over all examples; shown to generate useful features on diverse problems.
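Spelled out (my notation; the unit-norm constraint on the basis columns is the usual convention rather than something visible in the scraped slide):

    \min_{B, \{w_i\}} \; \sum_i \| x_i - B w_i \|_2^2 + \lambda \| w_i \|_1
    \qquad \text{subject to } \|b_j\|_2 \le 1 \text{ for every column } b_j,

which is convex in the w_i for fixed B and convex in B for fixed codes, but not jointly convex.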
Slide 16: Differentiable Sparse Coding [Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008]. [Diagram: sparse coding (Raina, Battle, Lee, Packer, Ng, ICML 2007) trains the optimization module (B) on unlabeled data X with a reconstruction loss; the resulting codes W feed a learning module (θ) trained on labeled data (X, Y) with its own loss function.]
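The reason differentiability matters is visible from the chain rule (my notation): to train the basis against the labeled-data loss, the gradient has to pass through the codes,

    \frac{\partial \mathcal{L}(\theta, W, Y)}{\partial B}
      = \frac{\partial \mathcal{L}}{\partial W} \cdot \frac{\partial W}{\partial B},

so the map from B (and x) to the code W produced by the optimization module must itself be differentiable. The next slides show why the L1 version fails this test and how to fix it.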
Slide 17: L1 Regularization is Not Differentiable. [Same diagram as slide 16; with an L1 prior, the map from the optimization module's inputs to its output codes W is not differentiable, so the supervised loss function cannot be passed back into the basis B.]
Slide 18: Why is this unsatisfying?
Slide 19: Problem #1: Instability. L1 MAP estimates are discontinuous, and the outputs are not differentiable. Instead, use KL-divergence regularization, which is proven to compete with L1 in online learning.
Slide 20: Problem #2: No closed-form equation. At the MAP estimate, the optimality condition holds, but it cannot be solved for the weights in closed form.
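The equation on the slide is an image; in the notation used above, the condition that holds at the MAP estimate is that the gradient of the regularized objective vanishes,

    \nabla_w \Big[ \mathrm{loss}(x, B w) + \lambda\, D(w, \mu) \Big] \Big|_{w = w^*} = 0,

which defines w^* only implicitly as a function of B rather than giving a closed-form expression.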
Slide 21: Solution: Implicit Differentiation. Differentiate both sides of the optimality condition with respect to an element of B; since w^* is a function of B, the chain rule produces a term containing the derivative of w^* with respect to B. Solve for this term.
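In standard implicit-function-theorem form (my notation, with F(w, B) the regularized objective whose gradient vanishes at w^*):

    \nabla_w F(w^*, B) = 0
    \;\Longrightarrow\;
    \nabla_w^2 F(w^*, B)\, \frac{\partial w^*}{\partial B_{jk}}
      + \frac{\partial\, \nabla_w F(w^*, B)}{\partial B_{jk}} = 0
    \;\Longrightarrow\;
    \frac{\partial w^*}{\partial B_{jk}}
      = -\big[ \nabla_w^2 F(w^*, B) \big]^{-1} \frac{\partial\, \nabla_w F(w^*, B)}{\partial B_{jk}},

so the Jacobian of the codes with respect to the basis exists whenever the Hessian at the optimum is invertible.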
Slide 22: KL-Divergence. Example: squared loss with a KL prior.
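One consistent way to write this example (my notation; the exact, possibly unnormalized, form of the KL term is my assumption, not read off the slide) is:

    w^* = \arg\min_{w \ge 0} \; \| x - B w \|_2^2
          + \lambda \sum_j \Big( w_j \log \frac{w_j}{p_j} - w_j + p_j \Big),

where p is the prior mean. Unlike the L1 penalty, this term is smooth for positive w, which is what makes the MAP estimate differentiable.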
Slide 23: Handwritten Digit Recognition. 50,000-digit training set, 10,000-digit validation set, 10,000-digit test set.
Slide 24: Handwritten Digit Recognition, Step #1: unsupervised sparse coding on the training set with L2 loss and an L1 prior [Raina, Battle, Lee, Packer, Ng, ICML 2007].
Slide 25: Handwritten Digit Recognition, Step #2: sparse approximation of each example, with the codes fed to a maxent classifier and its loss function.
Slide 26: Handwritten Digit Recognition, Step #3: supervised sparse coding, in which the maxent classifier's loss function is passed back through the codes to update the basis.
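A rough, runnable stand-in for steps 1 and 2 (not the authors' code: it uses scikit-learn's dictionary learning and logistic regression, a maxent classifier, on the small sklearn digits set in place of the 50,000-digit corpus; step 3, the supervised fine-tuning of the basis via implicit differentiation, is the part this paper adds and is not reproduced here):

    # Steps 1-2 of the digit pipeline, approximated with off-the-shelf tools.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import DictionaryLearning
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)          # small stand-in for the 50k digit set
    X = X / 16.0                                 # scale pixel values to [0, 1]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step 1: unsupervised sparse coding (squared loss + L1 penalty) to learn a basis B.
    dico = DictionaryLearning(n_components=50, alpha=1.0, max_iter=20,
                              transform_algorithm="lasso_lars", transform_alpha=1.0,
                              random_state=0)
    W_tr = dico.fit_transform(X_tr)              # sparse codes for the training examples
    W_te = dico.transform(X_te)

    # Step 2: a maxent (multinomial logistic regression) classifier on the codes.
    clf = LogisticRegression(max_iter=1000).fit(W_tr, y_tr)
    print("accuracy on sparse codes:", clf.score(W_te, y_te))

    # Step 3 (not shown): supervised sparse coding, i.e. differentiate the codes with
    # respect to B via implicit differentiation and update B with the classifier loss.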
Slide 27: KL Maintains Sparsity. [Plot, log scale: the weight is concentrated in a few elements.]
Slide 28: KL adds Stability. [Figure: inputs X and the corresponding codes W under KL regularization, backprop, and L1; reference: Duda, Hart, Stork, Pattern Classification, 2001.]
Slide 29: Performance vs. Prior. [Plot; an arrow labeled "Better" marks the preferred direction.]
Slide 30: Classifier Comparison. [Plot; an arrow labeled "Better" marks the preferred direction.]
Slide 31: Comparison to Other Algorithms. [Plot; an arrow labeled "Better" marks the preferred direction.]
Slide 32: Transfer to English Characters. 24,000-character training set, 12,000-character validation set, 12,000-character test set. [Figure: example characters.]
Slide 33: Transfer to English Characters, Step #1: sparse approximation of the characters using the basis learned on digits, with the codes fed to a maxent classifier and its loss function [Raina, Battle, Lee, Packer, Ng, ICML 2007].
Slide 34: Transfer to English Characters, Step #2: supervised sparse coding driven by the maxent classifier's loss function.
Slide 35: Transfer to English Characters. [Plot of results; an arrow labeled "Better" marks the preferred direction.]
Slide 36: Text Application, Step #1: unsupervised sparse coding of X (5,000 movie reviews rated on a 10-point sentiment scale, 1 = hated it, 10 = loved it) with KL loss and a KL prior [Pang, Lee, Proceedings of the ACL, 2005].
Slide 37: Text Application, Step #2: sparse approximation of X, with the codes used in linear regression (L2 loss) on the 10-point sentiment scale, evaluated by 5-fold cross-validation.
Slide 38: Text Application, Step #3: supervised sparse coding driven by the linear regression's L2 loss, again evaluated by 5-fold cross-validation on the 10-point sentiment scale.
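In equation form (my notation; the slide presents this only as a diagram), step #3 trains the regression weights θ and the basis B jointly through the codes:

    \min_{B, \theta} \; \sum_i \big( y_i - \theta^{\top} w_i^*(B) \big)^2,
    \qquad
    w_i^*(B) = \arg\min_w \; \mathrm{loss}(x_i, B w) + \lambda\, \mathrm{KL}(w, p),

with the gradient with respect to B supplied by the implicit differentiation of slide 21.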
Slide 39: Movie Review Sentiment. [Plot comparing the unsupervised basis, the supervised basis, and a state-of-the-art graphical model (Blei, McAuliffe, NIPS 2007); an arrow labeled "Better" marks the preferred direction.]
Slide 40: Future Work. [Diagram: RGB camera, NIR camera, and ladar data; sparse coding alongside engineered features; labeled training data and example paths; a voxel classifier trained with MMP.]
Slide 41: Future Work: Convex Sparse Coding. Sparse approximation is convex; sparse coding is not, because the fixed-size basis is a non-convex constraint. Sparse coding is equivalent to sparse approximation over an infinitely large basis plus a non-convex rank constraint: relax this to a convex L1 rank constraint, and use boosting to do sparse approximation directly on the infinitely large basis [Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS 2005; Zhao, Yu, Feature Selection for Data Mining, 2005; Rifkin, Lippert, Journal of Machine Learning Research, 2007].
Slide 42: Questions?