Slide 1: Differentiable Sparse Coding. David Bradley and J. Andrew Bagnell, NIPS 2008.
Slide 2: Joint Optimization, 100,000 ft view. [Diagram of a complex perception-and-planning system: cameras and ladar, voxel features, voxel classifier, voxel grid, column cost, 2-D planner, Y (path to goal).] Initialize with "cheap" data.
Slide 3: 10,000 ft view. Sparse coding is treated as a generative model for semi-supervised learning, with KL-divergence regularization and implicit differentiation. [Diagram: unlabeled data → optimization → latent variable → classifier → loss, with the gradient flowing back.]
Slide 4: Sparse Coding. Understand X…
Slide 5: …as a combination of factors B.
Slide 6: Sparse coding uses optimization. Projection (feed-forward): compute some vector directly from x; we want to use that vector to classify x. Optimization: instead choose the vector by minimizing a reconstruction loss function.
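To make the contrast concrete, here is a minimal Python sketch (mine, not the authors' code); the basis B, the dimensions, and the small L1 penalty inside the reconstruction objective are illustrative assumptions:

    # Minimal sketch: feed-forward projection vs. optimization-based encoding.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    B = rng.standard_normal((64, 20))    # basis: 64-dim inputs, 20 factors (illustrative)
    x = rng.standard_normal(64)          # one input vector

    # Projection (feed-forward): a single matrix multiply gives "some vector".
    w_proj = B.T @ x

    # Optimization: choose the vector by minimizing a reconstruction loss
    # (squared error plus a small L1 penalty; the penalty choice is illustrative).
    def reconstruction_objective(w, lam=0.1):
        return np.sum((x - B @ w) ** 2) + lam * np.sum(np.abs(w))

    w_opt = minimize(reconstruction_objective, x0=np.zeros(20), method="Powell").x
    print(np.linalg.norm(x - B @ w_proj), np.linalg.norm(x - B @ w_opt))

The projection is cheap but ignores reconstruction quality; the optimization-based code is what the rest of the deck builds on.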
Slide 7: Sparse vectors: only a few significant elements.
Slide 8: Example: X = handwritten digits. [Figure: basis vectors colored by w.]
Slide 9: Optimization vs. Projection. [Figure: input examples, the basis, and the resulting projection outputs.]
Slide 10: Optimization vs. Projection. [Figure: input examples, the basis, and the KL-regularized optimization outputs.] The optimization outputs are sparse for each example.
Slide 11: Generative Model. Examples are independent, and the latent variables are independent; each example i combines a prior over its latent weights with a likelihood of the data given those weights.
Slide 12: Sparse Approximation. Sparse approximation is the MAP estimate of the latent weights under this prior and likelihood.
Slide 13: Sparse Approximation. The objective has two terms: a distance between the reconstruction and the input, and a distance between the weight vector and the prior mean, traded off by a regularization constant.
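Written out (my notation; the slides show these formulas as images), the MAP estimate of slides 11 and 12 turns, up to constants, into exactly the two-term objective described here:

    w^* = \arg\max_w \; P(x \mid w, B)\, P(w)
        = \arg\min_w \; \big[-\log P(x \mid w, B)\big] + \big[-\log P(w)\big]
        = \arg\min_w \; \underbrace{\mathrm{loss}(x, B w)}_{\text{reconstruction vs. input}}
          \; + \; \lambda\, \underbrace{D(w, \mu)}_{\text{weights vs. prior mean}}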
Slide 14: Example: squared loss + L1 prior. The resulting problem is convex and yields sparse solutions, and has been widely studied in engineering [Tropp, Signal Processing, 2006; Donoho and Elad, Proceedings of the National Academy of Sciences, 2002]. Sparse coding solves for B as well, which makes the problem non-convex (for now…). It has been shown to generate useful features on diverse problems [Raina, Battle, Lee, Packer, Ng, ICML 2007].
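As a concrete illustration of this squared-loss + L1 sparse approximation, here is a short ISTA-style proximal-gradient solver in Python; ISTA is a standard algorithm for this problem, not necessarily the optimizer used in the slides:

    # ISTA sketch for:  min_w  ||x - B w||^2 + lam * ||w||_1
    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(x, B, lam=0.1, n_iters=500):
        step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
        w = np.zeros(B.shape[1])
        for _ in range(n_iters):
            grad = 2.0 * B.T @ (B @ w - x)               # gradient of the squared loss
            w = soft_threshold(w - step * grad, step * lam)
        return w

    rng = np.random.default_rng(0)
    B = rng.standard_normal((64, 128))
    w_true = rng.standard_normal(128) * (rng.random(128) < 0.05)   # sparse ground truth
    w_hat = ista(B @ w_true, B)
    print("nonzero coefficients:", np.count_nonzero(w_hat))

The soft-thresholding step is what zeroes out most coefficients, and it is also where the non-differentiability discussed a few slides later comes from.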
Slide 15: L1 Sparse Coding. Optimize B over all examples; shown to generate useful features on diverse problems.
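Spelled out (my notation; the unit-norm constraint on the basis columns is the usual convention rather than something visible in the scraped slide):

    \min_{B, \{w_i\}} \; \sum_i \| x_i - B w_i \|_2^2 + \lambda \| w_i \|_1
    \qquad \text{subject to } \|b_j\|_2 \le 1 \text{ for every column } b_j,

which is convex in the w_i for fixed B and convex in B for fixed codes, but not jointly convex.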
Slide 16: Differentiable Sparse Coding [Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008]. [Diagram: sparse coding (Raina, Battle, Lee, Packer, Ng, ICML 2007) trains the optimization module (B) on unlabeled data X with a reconstruction loss; the resulting codes W feed a learning module (θ) trained on labeled data (X, Y) with its own loss function.]
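The reason differentiability matters is visible from the chain rule (my notation): to train the basis against the labeled-data loss, the gradient has to pass through the codes,

    \frac{\partial \mathcal{L}(\theta, W, Y)}{\partial B}
      = \frac{\partial \mathcal{L}}{\partial W} \cdot \frac{\partial W}{\partial B},

so the map from B (and x) to the code W produced by the optimization module must itself be differentiable. The next slides show why the L1 version fails this test and how to fix it.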
Slide 17: L1 Regularization is Not Differentiable. [Same diagram as slide 16; with an L1 prior, the map from the optimization module's inputs to its output codes W is not differentiable, so the supervised loss function cannot be passed back into the basis B.]
Slide 18: Why is this unsatisfying?
Slide 19: Problem #1: Instability. L1 MAP estimates are discontinuous, and the outputs are not differentiable. Instead, use KL-divergence regularization, which is proven to compete with L1 in online learning.
Slide 20: Problem #2: No closed-form equation. At the MAP estimate, the optimality condition holds, but it cannot be solved for the weights in closed form.
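The equation on the slide is an image; in the notation used above, the condition that holds at the MAP estimate is that the gradient of the regularized objective vanishes,

    \nabla_w \Big[ \mathrm{loss}(x, B w) + \lambda\, D(w, \mu) \Big] \Big|_{w = w^*} = 0,

which defines w^* only implicitly as a function of B rather than giving a closed-form expression.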
Slide 21: Solution: Implicit Differentiation. Differentiate both sides of the optimality condition with respect to an element of B; since w^* is a function of B, the chain rule produces a term containing the derivative of w^* with respect to B. Solve for this term.
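In standard implicit-function-theorem form (my notation, with F(w, B) the regularized objective whose gradient vanishes at w^*):

    \nabla_w F(w^*, B) = 0
    \;\Longrightarrow\;
    \nabla_w^2 F(w^*, B)\, \frac{\partial w^*}{\partial B_{jk}}
      + \frac{\partial\, \nabla_w F(w^*, B)}{\partial B_{jk}} = 0
    \;\Longrightarrow\;
    \frac{\partial w^*}{\partial B_{jk}}
      = -\big[ \nabla_w^2 F(w^*, B) \big]^{-1} \frac{\partial\, \nabla_w F(w^*, B)}{\partial B_{jk}},

so the Jacobian of the codes with respect to the basis exists whenever the Hessian at the optimum is invertible.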
Slide 22: KL-Divergence. Example: squared loss with a KL prior.
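One consistent way to write this example (my notation; the exact, possibly unnormalized, form of the KL term is my assumption, not read off the slide) is:

    w^* = \arg\min_{w \ge 0} \; \| x - B w \|_2^2
          + \lambda \sum_j \Big( w_j \log \frac{w_j}{p_j} - w_j + p_j \Big),

where p is the prior mean. Unlike the L1 penalty, this term is smooth for positive w, which is what makes the MAP estimate differentiable.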
Slide 23: Handwritten Digit Recognition. 50,000-digit training set, 10,000-digit validation set, 10,000-digit test set.
Slide 24: Handwritten Digit Recognition, Step #1: unsupervised sparse coding on the training set with L2 loss and an L1 prior [Raina, Battle, Lee, Packer, Ng, ICML 2007].
Slide 25: Handwritten Digit Recognition, Step #2: sparse approximation of each example, with the codes fed to a maxent classifier and its loss function.
Slide 26: Handwritten Digit Recognition, Step #3: supervised sparse coding, in which the maxent classifier's loss function is passed back through the codes to update the basis.
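A rough, runnable stand-in for steps 1 and 2 (not the authors' code: it uses scikit-learn's dictionary learning and logistic regression, a maxent classifier, on the small sklearn digits set in place of the 50,000-digit corpus; step 3, the supervised fine-tuning of the basis via implicit differentiation, is the part this paper adds and is not reproduced here):

    # Steps 1-2 of the digit pipeline, approximated with off-the-shelf tools.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import DictionaryLearning
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)          # small stand-in for the 50k digit set
    X = X / 16.0                                 # scale pixel values to [0, 1]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step 1: unsupervised sparse coding (squared loss + L1 penalty) to learn a basis B.
    dico = DictionaryLearning(n_components=50, alpha=1.0, max_iter=20,
                              transform_algorithm="lasso_lars", transform_alpha=1.0,
                              random_state=0)
    W_tr = dico.fit_transform(X_tr)              # sparse codes for the training examples
    W_te = dico.transform(X_te)

    # Step 2: a maxent (multinomial logistic regression) classifier on the codes.
    clf = LogisticRegression(max_iter=1000).fit(W_tr, y_tr)
    print("accuracy on sparse codes:", clf.score(W_te, y_te))

    # Step 3 (not shown): supervised sparse coding, i.e. differentiate the codes with
    # respect to B via implicit differentiation and update B with the classifier loss.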
Slide 27: KL Maintains Sparsity. [Plot, log scale: the weight is concentrated in a few elements.]
Slide 28: KL adds Stability. [Figure: inputs X and the corresponding codes W under KL regularization, backprop, and L1; reference: Duda, Hart, Stork, Pattern Classification, 2001.]
Slide 29: Performance vs. Prior. [Plot; an arrow labeled "Better" marks the preferred direction.]
Slide 30: Classifier Comparison. [Plot; an arrow labeled "Better" marks the preferred direction.]
Slide 31: Comparison to Other Algorithms. [Plot; an arrow labeled "Better" marks the preferred direction.]
Slide 32: Transfer to English Characters. 24,000-character training set, 12,000-character validation set, 12,000-character test set. [Figure: example characters.]
Slide 33: Transfer to English Characters, Step #1: sparse approximation of the characters using the basis learned on digits, with the codes fed to a maxent classifier and its loss function [Raina, Battle, Lee, Packer, Ng, ICML 2007].
Slide 34: Transfer to English Characters, Step #2: supervised sparse coding driven by the maxent classifier's loss function.
Slide 35: Transfer to English Characters. [Plot of results; an arrow labeled "Better" marks the preferred direction.]
Slide 36: Text Application, Step #1: unsupervised sparse coding of X (5,000 movie reviews rated on a 10-point sentiment scale, 1 = hated it, 10 = loved it) with KL loss and a KL prior [Pang, Lee, Proceedings of the ACL, 2005].
Slide 37: Text Application, Step #2: sparse approximation of X, with the codes used in linear regression (L2 loss) on the 10-point sentiment scale, evaluated by 5-fold cross-validation.
Slide 38: Text Application, Step #3: supervised sparse coding driven by the linear regression's L2 loss, again evaluated by 5-fold cross-validation on the 10-point sentiment scale.
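In equation form (my notation; the slide presents this only as a diagram), step #3 trains the regression weights θ and the basis B jointly through the codes:

    \min_{B, \theta} \; \sum_i \big( y_i - \theta^{\top} w_i^*(B) \big)^2,
    \qquad
    w_i^*(B) = \arg\min_w \; \mathrm{loss}(x_i, B w) + \lambda\, \mathrm{KL}(w, p),

with the gradient with respect to B supplied by the implicit differentiation of slide 21.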
Slide 39: Movie Review Sentiment. [Plot comparing the unsupervised basis, the supervised basis, and a state-of-the-art graphical model (Blei, McAuliffe, NIPS 2007); an arrow labeled "Better" marks the preferred direction.]
Slide 40: Future Work. [Diagram: RGB camera, NIR camera, and ladar data; sparse coding alongside engineered features; labeled training data and example paths; a voxel classifier trained with MMP.]
Slide 41: Future Work: Convex Sparse Coding. Sparse approximation is convex; sparse coding is not, because the fixed-size basis is a non-convex constraint. Sparse coding is equivalent to sparse approximation over an infinitely large basis plus a non-convex rank constraint: relax this to a convex L1 rank constraint, and use boosting to do sparse approximation directly on the infinitely large basis [Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS 2005; Zhao, Yu, Feature Selection for Data Mining, 2005; Rifkin, Lippert, Journal of Machine Learning Research, 2007].
Slide 42: Questions?