Differentiable Sparse Coding
David Bradley and J. Andrew Bagnell, NIPS 2008


Joint Optimization, 100,000 ft View: complex systems. [Diagram: Cameras and Ladar → Voxel Features → Voxel Classifier → Voxel Grid → Column Cost → 2-D Planner → Y (path to goal).] Initialize with "cheap" data.

10,000 ft View: sparse coding = generative model; semi-supervised learning; KL-divergence regularization; implicit differentiation. [Diagram: Unlabeled Data → Optimization → Latent Variable → Classifier → Loss, with the gradient flowing back through the chain.]

Sparse Coding: understand the input X...

...as a combination of factors (a basis) B.

Sparse coding uses optimization. A projection (feed-forward pass) produces some vector from x; we want a representation we can use to classify x, so instead of projecting we minimize a reconstruction loss function.
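To make the contrast concrete, here is a minimal NumPy sketch (not from the slides) comparing a feed-forward projection with an optimization-based encoding; the basis B, the input x, and the smooth squared regularizer are illustrative stand-ins for whatever loss and prior are actually used.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 16))   # basis: 64-dim inputs, 16 factors (placeholder values)
x = rng.standard_normal(64)         # one input vector (placeholder)
lam = 0.1                           # regularization constant

# Projection (feed-forward): a single matrix multiply gives "some vector".
w_proj = B.T @ x

# Optimization: choose w to minimize reconstruction loss plus a regularizer
# (a smooth squared penalty here as a simple stand-in for the sparsity prior).
def objective(w):
    r = x - B @ w
    return 0.5 * r @ r + 0.5 * lam * w @ w

def grad(w):
    return -B.T @ (x - B @ w) + lam * w

w_opt = minimize(objective, np.zeros(16), jac=grad, method="L-BFGS-B").x

print("reconstruction error, projection:  ", np.linalg.norm(x - B @ w_proj))
print("reconstruction error, optimization:", np.linalg.norm(x - B @ w_opt))
```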

Sparse vectors: only a few significant elements.

Example: X = handwritten digits. [Figure: basis vectors colored by their weights w.]

Optimization vs. Projection. [Figure: input, basis, and the projection outputs.]

Optimization vs. Projection. [Figure: input, basis, and the KL-regularized optimization outputs.] The outputs are sparse for each example.

Generative Model: examples are independent (likelihood), and the latent variables are independent (prior).
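In symbols, the independence assumptions on this slide correspond to a factorized joint distribution; the notation below (examples x_k with latent weights w_k over a basis B) is my own, chosen to match the rest of the talk.

```latex
p(X, W \mid B) \;=\; \prod_{k} \underbrace{p(x_k \mid w_k, B)}_{\text{likelihood}} \; \underbrace{\prod_{i} p(w_{ki})}_{\text{prior}}
```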

Sparse Approximation: the MAP estimate combines the prior and the likelihood.

Sparse Approximation: the objective trades off the distance between the reconstruction and the input against the distance between the weight vector and the prior mean, weighted by a regularization constant.
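Written out, the sparse-approximation MAP estimate has the general form below, where L measures the reconstruction error, D measures the distance of the weights from the prior mean w_0, and λ is the regularization constant (notation mine).

```latex
w^* \;=\; \arg\min_{w}\; L\bigl(x,\, Bw\bigr) \;+\; \lambda\, D\bigl(w,\, w_0\bigr)
```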

Example: Squared Loss + L1. Convex and sparse (widely studied in engineering). Sparse coding solves for B as well (non-convex for now…). Shown to generate useful features on diverse problems. (Tropp, Signal Processing, 2006; Donoho and Elad, Proceedings of the National Academy of Sciences, 2002; Raina, Battle, Lee, Packer, Ng, ICML, 2007)
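As one concrete way to solve the squared-loss + L1 instance (the slides do not specify a solver), here is a short iterative soft-thresholding (ISTA-style) sketch in NumPy; B, x, and the hyperparameters are placeholders.

```python
import numpy as np

def l1_sparse_approx(x, B, lam=0.1, n_iters=200):
    """Minimize 0.5*||x - B w||^2 + lam*||w||_1 by iterative soft-thresholding (ISTA)."""
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant of the smooth part
    w = np.zeros(B.shape[1])
    for _ in range(n_iters):
        g = B.T @ (B @ w - x)                                     # gradient of the squared loss
        w = w - step * g                                          # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold (prox of L1)
    return w

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 128))   # overcomplete basis (placeholder)
x = rng.standard_normal(64)
w = l1_sparse_approx(x, B)
print("nonzero coefficients:", np.count_nonzero(w), "of", w.size)
```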

L1 Sparse Coding: shown to generate useful features on diverse problems; optimize B jointly over all examples.
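The basis half of that alternating optimization can be sketched as a least-squares solve over all examples followed by column normalization; this is a common choice, not necessarily the exact update used in the talk, and the arrays below are placeholders.

```python
import numpy as np

def update_basis(X, W, eps=1e-8):
    """Given inputs X (d x n) and codes W (k x n), solve min_B ||X - B W||_F^2
    and renormalize the columns of B (a standard sparse-coding B-step)."""
    B = X @ W.T @ np.linalg.pinv(W @ W.T)                       # least-squares solution
    B /= (np.linalg.norm(B, axis=0, keepdims=True) + eps)       # keep basis vectors bounded
    return B

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 1000))   # all examples as columns (placeholder data)
W = rng.standard_normal((128, 1000))  # codes from the sparse-approximation step
B = update_basis(X, W)
print(B.shape)  # (64, 128)
```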

Differentiable Sparse Coding (Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008). [Diagram: unlabeled data X trains the sparse-coding optimization module (B) with a reconstruction loss; the codes W of labeled data X feed a learning module (θ) trained with a supervised loss function on Y. Sparse coding as in Raina, Battle, Lee, Packer, Ng, ICML, 2007.]

L1 Regularization is Not Differentiable. [Same module diagram as the previous slide: with an L1 prior, the gradient of the supervised loss cannot be passed back through the sparse-coding optimization module to the basis B.]

Why is this unsatisfying?

Problem #1: Instability. L1 MAP estimates are discontinuous, so the outputs are not differentiable. Instead, use KL-divergence regularization, which is proven to compete with L1 in online learning.

Problem #2: No Closed-Form Equation. The MAP estimate has no explicit formula; it is characterized only by what holds at the MAP estimate.
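In the notation introduced above (mine, not verbatim from the slide), that characterization is the first-order optimality condition of the objective:

```latex
\nabla_{w}\Bigl[\, L\bigl(x,\, Bw\bigr) + \lambda\, D\bigl(w,\, w_0\bigr) \Bigr]\Big|_{w = w^*} \;=\; 0
```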

Solution: Implicit Differentiation. Differentiate both sides of that condition with respect to an element of B; since the MAP estimate w* is itself a function of B, the chain rule introduces the derivative of w*, which we then solve for.
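Carrying this out (with f(w, B) = L(x, Bw) + λ D(w, w_0), again my notation) gives, by the implicit function theorem:

```latex
\frac{\partial w^*}{\partial B_{ij}}
\;=\;
-\Bigl(\nabla^{2}_{ww} f(w^*, B)\Bigr)^{-1}
\,\frac{\partial}{\partial B_{ij}} \nabla_{w} f(w^*, B)
```

This requires the objective to be smooth at w*, which is exactly the role of the KL regularizer in place of L1.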

KL-Divergence Example: squared loss with a KL prior.
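One way to write the squared-loss / KL-prior objective, assuming nonnegative weights and an unnormalized KL divergence from a prior vector p (this specific form is my reading, not a quote from the slides):

```latex
w^* \;=\; \arg\min_{w \ge 0}\; \tfrac{1}{2}\,\lVert x - Bw \rVert_2^{2}
\;+\; \lambda \sum_{i} \Bigl( w_i \log \tfrac{w_i}{p_i} - w_i + p_i \Bigr)
```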

Handwritten Digit Recognition: 50,000-digit training set, 10,000-digit validation set, 10,000-digit test set.

Handwritten Digit Recognition, Step #1: unsupervised sparse coding on the training set with an L2 loss and an L1 prior (Raina, Battle, Lee, Packer, Ng, ICML, 2007).

Handwritten Digit Recognition, Step #2: sparse approximation of each example, feeding a maxent classifier trained with its loss function.

Handwritten Digit Recognition, Step #3: supervised sparse coding, back-propagating the maxent classifier's loss through the sparse coder to update the basis.
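A compact, runnable sketch of Steps #1 and #2, using scikit-learn's dictionary learning as the unsupervised sparse coder and multinomial logistic regression as the maxent classifier; random arrays stand in for the 50,000 MNIST digits, and Step #3 (the contribution of the talk) is only noted in a comment.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

# Random arrays stand in for the digit training set (images flattened to vectors).
rng = np.random.default_rng(0)
X_train = rng.random((500, 64))          # placeholder "digit" images
y_train = rng.integers(0, 10, size=500)  # placeholder labels

# Step #1: unsupervised sparse coding (L2 reconstruction loss, L1 prior on the codes).
coder = DictionaryLearning(n_components=32, alpha=1.0, max_iter=20, random_state=0)
W_train = coder.fit_transform(X_train)   # sparse codes W; coder.components_ is the basis B

# Step #2: maxent (multinomial logistic regression) classifier on the sparse codes.
clf = LogisticRegression(max_iter=1000)
clf.fit(W_train, y_train)
print("training accuracy on placeholder data:", clf.score(W_train, y_train))

# Step #3 (not shown): differentiate the classifier loss through the sparse-coding
# optimization via implicit differentiation and update the basis B -- this needs the
# KL prior rather than L1 so that the codes are differentiable in B.
```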

KL Maintains Sparsity. [Figure: weight is concentrated in a few elements; log scale.]

KL Adds Stability. [Figure: inputs X and the corresponding codes W under KL regularization, backprop, and L1. Duda, Hart, Stork, Pattern Classification, 2001.]

Performance vs. Prior. [Chart; the arrow marks the direction of better performance.]

Classifier Comparison. [Chart; the arrow marks the direction of better performance.]

Comparison to Other Algorithms. [Chart; the arrow marks the direction of better performance.]

Transfer to English Characters: a …,000-character training set, a 12,000-character validation set, and a 12,000-character test set.

Transfer to English Characters, Step #1: sparse approximation using the basis learned on digits, feeding a maxent classifier trained with its loss function (Raina, Battle, Lee, Packer, Ng, ICML, 2007).

Transfer to English Characters, Step #2: supervised sparse coding, updating the basis with the maxent classifier's loss function.

Transfer to English Characters. [Chart; the arrow marks the direction of better performance.]

Text Application, Step #1: unsupervised sparse coding of X with a KL loss and a KL prior. Data: 5,000 movie reviews on a 10-point sentiment scale (1 = hated it, 10 = loved it) (Pang, Lee, Proceedings of the ACL, 2005).

Text Application, Step #2: sparse approximation of X, feeding linear regression with an L2 loss on the 10-point sentiment scale (1 = hated it, 10 = loved it), evaluated by 5-fold cross-validation.

Text Application, Step #3: supervised sparse coding, back-propagating the linear regression's L2 loss on the 10-point sentiment scale (1 = hated it, 10 = loved it) through the sparse coder, evaluated by 5-fold cross-validation.
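Steps #2 and #3 evaluate by 5-fold cross-validation; below is a minimal sketch of that evaluation loop for Step #2 (linear regression with a squared loss on sparse-code features), with random arrays standing in for the movie-review codes and sentiment scores.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Placeholder sparse-code features and 10-point sentiment scores for 5,000 reviews.
rng = np.random.default_rng(0)
W = rng.random((5000, 128))           # codes from the sparse-approximation step (placeholder)
y = rng.uniform(1, 10, size=5000)     # sentiment scores, 1 = hated it, 10 = loved it

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(W):
    reg = LinearRegression()          # linear regression trained with the L2 (squared) loss
    reg.fit(W[train_idx], y[train_idx])
    pred = reg.predict(W[test_idx])
    errors.append(np.mean((pred - y[test_idx]) ** 2))

print("5-fold mean squared error on placeholder data:", np.mean(errors))
```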

Movie Review Sentiment. [Chart comparing the unsupervised basis, the supervised basis, and a state-of-the-art graphical model (Blei, McAuliffe, NIPS, 2007); the arrow marks the direction of better performance.]

Future Work. [Diagram: RGB camera, NIR camera, and ladar feed sparse coding and engineered features into a voxel classifier, trained from labeled training data and example paths (MMP).]

Future Work: Convex Sparse Coding. Sparse approximation is convex, but sparse coding is not, because a fixed-size basis is a non-convex constraint. Sparse coding is equivalent to sparse approximation on an infinitely large basis plus a non-convex rank constraint; relax this to a convex L1 rank constraint, and use boosting to perform sparse approximation directly on the infinitely large basis. (Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS, 2005; Zhao, Yu, Feature Selection for Data Mining, 2005; Rifkin, Lippert, Journal of Machine Learning Research, 2007)

Questions?