A Quest for a Universal Model for Signals: From Sparsity to ConvNets

A Quest for a Universal Model for Signals: From Sparsity to ConvNets. Yaniv Romano, The Electrical Engineering Department, Technion – Israel Institute of Technology. Advisor: Prof. Michael Elad. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement ERC-SPARSE 320649.

Modeling Data: Data sources are structured, whether still images, videos, voice signals, text documents, medical imaging, radar imaging, seismic data, stock-market data, biological signals, traffic info, social networks, 3D objects, or matrix data. This structure, when identified, is the engine behind our ability to process the data.

Modeling Data: Almost every task in data processing requires a model; this is true for denoising, deblurring, super-resolution, compression, recognition, prediction, and more. We will put a special emphasis on image denoising.

The Underlying Idea: A model is a mathematical description of the underlying signal of interest, describing our beliefs regarding its structure. Generative modeling of data sources enables systematic algorithm development and a theoretical analysis of the algorithms' performance.

SparseLand: We will concentrate on sparse and redundant representations, a universal model that is adaptive (via learning) to the data source at hand and has a strong intersection with machine learning. How can we treat high-dimensional signals? We face the curse of dimensionality, as we need to learn a very large number of parameters.

Part I: Local Modeling. The traditional approach to treating images is to reduce the dimensions by working locally on small image patches. In the first part of this talk we will study the limitations of this local approach and suggest different ways to overcome them. As a result, we push denoising algorithms to new heights. Improving the model → improving the performance.

Part II: Beyond Denoising. The advent of highly effective models for images has led researchers to believe that existing denoisers are touching the ceiling in terms of restoration performance. Can we leverage this impressive achievement to treat other problems in image processing? In the second part of this talk we will see that a denoiser can do much more than just… denoising.

Global Sparse Models: We proceed by revisiting the traditional local modeling approach for treating high-dimensional signals, and ask: why can we use a local prior to solve a global problem? This will take us to Convolutional Sparse Coding (CSC), a global model that is based on a local sparsity prior. The convolutional structure is the key to bypassing the curse of dimensionality.

Part III: Analyzing CNN. We then propose a multi-layered extension of the CSC model, called ML-CSC, which is tightly connected to Convolutional Neural Networks (CNN), bridging SparseLand (sparse representation theory) and CNN. The ML-CSC model enables a theoretical analysis of the performance of deep learning algorithms (e.g. the forward pass). Our analysis holds true for fully connected networks as well.

Roadmap (Denoising via Local Modeling / Global Sparse Modeling / Inverse Problems): improving local modeling → improved denoising performance; better treatment of inverse problems by improving local priors; leveraging denoising algorithms to treat other inverse problems; global sparse priors and the connection to deep learning.
- Improving K-SVD Denoising by Post-Processing its Method-Noise [IEEE-ICIP ’13]
- Single Image Interpolation via Adaptive Non-Local Sparsity-Based Modeling [IEEE-TIP ’14]
- Patch-Disagreement as a Way to Improve K-SVD Denoising [IEEE-ICASSP ’15]
- Boosting of Image Denoising Algorithms [SIAM IS ’15]
- Con-Patch: When a Patch Meets its Context [IEEE-TIP ’16]
- Gaussian Mixture Diffusion [IEEE-ICSEE ’16]
- RAISR: Rapid and Accurate Image Super Resolution [IEEE-TCI ’16]
- Turning a Denoiser into a Super-Resolver using Plug and Play Priors [IEEE-ICIP ’16]
- Example-Based Image Synthesis via Randomized Patch-Matching [IEEE-TIP ’17]
- On the Global-Local Dichotomy in Sparsity Modeling [Compressed Sensing ’17]
- Convolutional Neural Networks Analyzed via Convolutional Sparse Coding [JMLR ’17]
- Convolutional Dictionary Learning via Local Processing [ICCV ’17]
- The Little Engine that Could: Regularization by Denoising (RED) [SIAM IS ’17]

Part I: Image Denoising via Local Modeling

Image Denoising: noisy image $\mathbf{Y}$ = original image $\mathbf{X}$ + white Gaussian noise $\mathbf{E}$. Leading image denoising methods are built upon powerful patch-based local models.

The SparseLand Model: assumes that every patch $\mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$ is a linear combination of a few columns, called atoms, taken from a matrix $\boldsymbol{\Omega} \in \mathbb{R}^{n\times m}$ ($m > n$) called a dictionary. The operator $\mathbf{R}_i$ extracts the $i$-th $n$-dimensional patch from $\mathbf{X}\in\mathbb{R}^N$. Sparse coding: $(P_0):\ \min_{\boldsymbol{\gamma}_i}\|\boldsymbol{\gamma}_i\|_0 \ \text{s.t.}\ \mathbf{R}_i\mathbf{X}=\boldsymbol{\Omega}\boldsymbol{\gamma}_i$.

Patch Denoising: given a noisy patch $\mathbf{R}_i\mathbf{Y}$, solve $(P_0^\epsilon):\ \hat{\boldsymbol{\gamma}}_i=\arg\min_{\boldsymbol{\gamma}_i}\|\boldsymbol{\gamma}_i\|_0\ \text{s.t.}\ \|\mathbf{R}_i\mathbf{Y}-\boldsymbol{\Omega}\boldsymbol{\gamma}_i\|_2\le\epsilon$; the clean patch estimate is $\boldsymbol{\Omega}\hat{\boldsymbol{\gamma}}_i$. Both $(P_0)$ and $(P_0^\epsilon)$ are hard to solve, so in practice one uses greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding, or convex relaxations such as Basis Pursuit (BP): $(P_1^\epsilon):\ \min_{\boldsymbol{\gamma}_i}\|\boldsymbol{\gamma}_i\|_1+\xi\|\mathbf{R}_i\mathbf{Y}-\boldsymbol{\Omega}\boldsymbol{\gamma}_i\|_2^2$.
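
To make the pursuit step concrete, here is a minimal OMP sketch in NumPy for coding a single patch; the names `Omega`, `y`, and `eps` are illustrative placeholders rather than code from the talk.

```python
import numpy as np

def omp(Omega, y, eps):
    """Greedy pursuit: add atoms until the residual norm drops below eps."""
    n, m = Omega.shape
    support, residual = [], y.copy()
    gamma = np.zeros(m)
    while np.linalg.norm(residual) > eps and len(support) < n:
        j = int(np.argmax(np.abs(Omega.T @ residual)))  # atom most correlated with the residual
        if j in support:                                 # numerical safeguard against stalling
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)  # least squares on the support
        gamma[:] = 0.0
        gamma[support] = coef
        residual = y - Omega @ gamma
    return gamma

# Denoised patch estimate: Omega @ omp(Omega, patch, eps)
```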

Recall K-SVD Denoising [Elad & Aharon ’06]: starting from an initial dictionary and the noisy image, the algorithm alternates between denoising each patch using OMP and updating the dictionary using K-SVD, and reconstructs the image by averaging the denoised overlapping patches. Despite its simplicity, this is a very well-performing algorithm. We refer to this framework as “patch averaging”.
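
A minimal sketch of the “patch averaging” stage, shown for 1-D signals with a fixed dictionary and leaving out the K-SVD dictionary update; `pursuit` can be any patch coder (e.g. the `omp` sketch above), and all names are illustrative.

```python
import numpy as np

def patch_average_denoise(Y, Omega, n, pursuit):
    """Y: noisy 1-D signal, Omega: n x m local dictionary, pursuit: patch -> sparse code."""
    N = len(Y)
    X_hat = np.zeros(N)
    counts = np.zeros(N)
    for i in range(N - n + 1):
        patch = Y[i:i + n]                # R_i Y: extract the i-th patch
        gamma = pursuit(patch)            # sparse-code it over Omega
        X_hat[i:i + n] += Omega @ gamma   # R_i^T: return the cleaned patch to its location
        counts[i:i + n] += 1
    return X_hat / counts                 # average the overlapping estimates

# e.g. X_hat = patch_average_denoise(Y, Omega, n=8, pursuit=lambda p: omp(Omega, p, eps))
```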

What is missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what we thought of: the local-global gap, i.e. efficient independent local patch processing vs. the global need to model the entire image.

What is missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what we thought of, addressing the local-global gap (efficient independent local patch processing vs. the global need to model the entire image): whitening the residual image [Romano & Elad ’13]; exploiting self-similarities [Romano, Protter & Elad ’14]; leveraging the context of the patch [Romano & Elad ’16], [Romano, Isidoro & Milanfar ’16]; enforcing the local model on the final patches (EPLL) [Ren, Romano & Elad ’17]; and SOS-Boosting [Romano & Elad ’15].

SOS-Boosting [Romano & Elad ’15]: Given any denoiser, how can we improve its performance? Strengthen – Operate – Subtract: strengthen the noisy image by adding the previous result, denoise, and subtract the added part. Intuitively, an improvement is expected since $\mathrm{SNR}(\mathbf{y}+\hat{\mathbf{x}}) > \mathrm{SNR}(\mathbf{y})$. SOS formulation: $\hat{\mathbf{X}}^{k+1} = f(\mathbf{Y} + \rho\hat{\mathbf{X}}^{k}) - \rho\hat{\mathbf{X}}^{k}$.
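
A minimal sketch of the SOS iteration, treating the denoiser as a black box; the Gaussian filter in the usage comment is only a stand-in for a real denoiser, and all names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sos_boost(Y, denoiser, rho=1.0, n_iters=10):
    """SOS-Boosting: X_{k+1} = f(Y + rho * X_k) - rho * X_k, for any denoiser f."""
    X = denoiser(Y)                             # plain denoising as initialization
    for _ in range(n_iters):
        X = denoiser(Y + rho * X) - rho * X     # Strengthen - Operate - Subtract
    return X

# e.g. X_hat = sos_boost(Y, lambda Z: gaussian_filter(Z, sigma=1.0))
```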

SOS-Boosting [Romano & Elad ’15]: SOS formulation $\hat{\mathbf{X}}^{k+1} = f(\mathbf{Y} + \rho\hat{\mathbf{X}}^{k}) - \rho\hat{\mathbf{X}}^{k}$. We study the convergence of the SOS using only the linear part, $\hat{\mathbf{x}} = f(\mathbf{Y}) = \mathbf{W}\mathbf{Y}$; this is true for K-SVD, EPLL, NLM, BM3D and others, where the overall processing is divided into a non-linear stage of decisions followed by a linear filtering. Relation to graph theory: the SOS minimizes a Laplacian regularization functional, $\mathbf{X}^* = \arg\min_{\mathbf{X}} \|\mathbf{X}-\mathbf{W}\mathbf{Y}\|_2^2 + \rho\,\mathbf{X}^T(\mathbf{X}-\mathbf{W}\mathbf{X})$. Relation to game theory: the SOS encourages overlapping patches to reach a consensus.

Part II: A denoiser can do much more than just… denoising. “The Little Engine that Could: Regularization by Denoising (RED)”, Yaniv Romano, Michael Elad and Peyman Milanfar.

Regularization by Denoising (RED): How can we take a denoiser and use it for solving inverse problems? We minimize $\min_{\mathbf{X}} \frac{1}{2}\|\mathbf{H}\mathbf{X}-\mathbf{Y}\|_2^2 + G(\mathbf{X})$, where the first term is the relation to the measurements, $G(\mathbf{X})$ is the prior (regularizer), and $\mathbf{H}$ is a linear degradation operator (e.g. blur, decimation). RED proposes the regularizer $\frac{\rho}{2}\mathbf{X}^T(\mathbf{X} - f(\mathbf{X}))$, where $f(\mathbf{X})$ is an arbitrary denoising algorithm; for a pseudo-linear filter $\mathbf{W}$ of a denoising algorithm this becomes $\frac{\rho}{2}\mathbf{X}^T(\mathbf{X} - \mathbf{W}\mathbf{X})$. The question is: what is $\nabla_{\mathbf{X}} G(\mathbf{X})$?

Regularization by Denoising: for the objective $\min_{\mathbf{X}} \frac{1}{2}\|\mathbf{H}\mathbf{X}-\mathbf{Y}\|_2^2 + \frac{\rho}{2}\mathbf{X}^T(\mathbf{X} - f(\mathbf{X}))$, the regularizer is, surprisingly, differentiable, with $\nabla_{\mathbf{X}} G(\mathbf{X}) = \mathbf{X} - f(\mathbf{X})$, under the following assumptions: (i) $f(\mathbf{X})$ is differentiable; (ii) $c\,f(\mathbf{X}) = f(c\mathbf{X})$ for $c = 1+\epsilon$, $\epsilon\to 0$ (local homogeneity). Convexity is guaranteed if the spectral radius of $\nabla_{\mathbf{X}} f(\mathbf{X})$ is $\le 1$ (passive operator). These conditions hold true for many denoising algorithms.

Regularization by Denoising: for $\min_{\mathbf{X}} \frac{1}{2}\|\mathbf{H}\mathbf{X}-\mathbf{Y}\|_2^2 + \frac{\rho}{2}\mathbf{X}^T(\mathbf{X} - f(\mathbf{X}))$, since $\nabla_{\mathbf{X}} G(\mathbf{X}) = \mathbf{X} - f(\mathbf{X})$ we have easy access to the gradient of the objective, and it is convex under the three conditions presented before, so any reasonable optimization algorithm will reach the global minimum. Plug in a state-of-the-art denoiser, and get cutting-edge performance in image deblurring, super-resolution, and more. We proposed three different ways to minimize RED: (1) steepest descent, (2) ADMM, (3) fixed-point.

Numerical Approach: Fixed-Point. Setting the gradient of $\frac{1}{2}\|\mathbf{H}\mathbf{X}-\mathbf{Y}\|_2^2 + \frac{\rho}{2}\mathbf{X}^T(\mathbf{X} - f(\mathbf{X}))$ to zero gives $\mathbf{H}^T(\mathbf{H}\mathbf{X}-\mathbf{Y}) + \rho(\mathbf{X} - f(\mathbf{X})) = 0$. Freezing the denoiser at the previous iterate, $\mathbf{H}^T(\mathbf{H}\mathbf{X}^{k+1}-\mathbf{Y}) + \rho(\mathbf{X}^{k+1} - f(\mathbf{X}^{k})) = 0$, yields the update $\mathbf{X}^{k+1} = (\mathbf{H}^T\mathbf{H} + \rho\mathbf{I})^{-1}(\mathbf{H}^T\mathbf{Y} + \rho f(\mathbf{X}^{k}))$, which is guaranteed to converge due to the passivity of $f(\cdot)$.
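
A minimal sketch of this fixed-point scheme with $\mathbf{H}$ as an explicit matrix and the denoiser passed in as a black box; for real images $\mathbf{H}$ would be applied implicitly and the linear system solved iteratively. All names are illustrative.

```python
import numpy as np

def red_fixed_point(Y, H, denoiser, rho=1.0, n_iters=50):
    """RED fixed-point: X_{k+1} = (H^T H + rho I)^{-1} (H^T Y + rho f(X_k))."""
    N = H.shape[1]
    A_inv = np.linalg.inv(H.T @ H + rho * np.eye(N))   # explicit inverse; fine for small N
    X = H.T @ Y                                        # crude initialization
    for _ in range(n_iters):
        X = A_inv @ (H.T @ Y + rho * denoiser(X))
    return X
```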

Fixed-Point: Relation to CNN (1). The update can be written as $\mathbf{X}^{k+1} = (\mathbf{H}^T\mathbf{H} + \rho\mathbf{I})^{-1}\mathbf{H}^T\mathbf{Y} + \rho(\mathbf{H}^T\mathbf{H} + \rho\mathbf{I})^{-1} f(\mathbf{X}^{k})$, i.e. each iteration applies the denoiser $f$ to $\mathbf{X}^{k}$, multiplies by a fixed matrix $\mathbf{M} = \rho(\mathbf{H}^T\mathbf{H} + \rho\mathbf{I})^{-1}$, and adds a fixed bias $\mathbf{b} = (\mathbf{H}^T\mathbf{H} + \rho\mathbf{I})^{-1}\mathbf{H}^T\mathbf{Y}$ to obtain $\mathbf{X}^{k+1}$.

Fixed-Point: Relation to CNN (2). Unrolling the fixed-point iterations yields a layered, feed-forward chain, analogous to a network whose layers compute $\mathbf{z}^{k+1} = f(\mathbf{M}\mathbf{z}^{k} + \mathbf{b})$. While CNNs use a point-wise non-linearity $f(\cdot)$, we propose a very aggressive and image-aware denoiser, and our scheme is guaranteed to minimize a clear and relevant objective function.

Part III: (Deep) Global Sparse Model

What is missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what we thought of: the local-global gap (efficient independent local patch processing vs. the global need to model the entire image) and a missing theoretical backbone. Why can we use a local prior to solve a global problem?

Convolutional Sparse Coding. “Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding”, Vardan Papyan, Jeremias Sulam and Michael Elad; “Convolutional Dictionary Learning via Local Processing”, Vardan Papyan, Yaniv Romano, Jeremias Sulam and Michael Elad.

Intuitively… [Figure: the global signal $\mathbf{X}$ is composed as a sum of shifted copies of the first filter plus shifted copies of the second filter.]

Convolutional Sparse Coding (CSC): [Figure: the signal is written as a sum of convolutions, each local filter convolved with its own sparse feature map.]

Convolutional Sparse Representation: globally, $\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}$, where $\mathbf{D}$ is a banded convolutional dictionary. Each patch satisfies $\mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$, where $\boldsymbol{\Omega}$ is the $n\times(2n-1)m$ stripe-dictionary and $\boldsymbol{\gamma}_i$ is the corresponding stripe vector. Adjacent representations $\boldsymbol{\gamma}_i$ and $\boldsymbol{\gamma}_{i+1}$ overlap, as they skip by $m$ items as we sweep through the patches of $\mathbf{X}$.
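
A minimal sketch of the CSC synthesis in 1-D: the global product $\mathbf{X}=\mathbf{D}\boldsymbol{\Gamma}$ is equivalently a sum of convolutions, one per local filter, with its own sparse feature map. Filter and feature-map shapes here are illustrative assumptions.

```python
import numpy as np

def csc_synthesize(filters, feature_maps):
    """filters: m local filters (length n); feature_maps: m sparse maps (length N)."""
    X = np.zeros(len(feature_maps[0]))
    for d, z in zip(filters, feature_maps):
        X += np.convolve(z, d, mode="same")   # contribution of one band of the global dictionary D
    return X
```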

CSC Relation to Our Story: a clear global model in which every patch has a sparse representation w.r.t. the same local dictionary $\boldsymbol{\Omega}$, just as we have assumed, with no notion of disagreement on the patch overlaps. It is related to the current common practice of patch averaging, $\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma} = \frac{1}{n}\sum_i \mathbf{R}_i^T\boldsymbol{\Omega}\boldsymbol{\gamma}_i$, where $\mathbf{R}_i^T$ puts the patch $\boldsymbol{\Omega}\boldsymbol{\gamma}_i$ back in the $i$-th location of the global vector. What about the pursuit? “Patch averaging” performs independent sparse coding for each patch, whereas CSC should seek all the representations together. What about the theory? Until recently, little was known regarding the theoretical aspects of CSC.

Classical Sparse Theory (Noiseless): $(P_0):\ \min_{\boldsymbol{\Gamma}}\|\boldsymbol{\Gamma}\|_0\ \text{s.t.}\ \mathbf{X}=\mathbf{D}\boldsymbol{\Gamma}$. Mutual coherence: $\mu(\mathbf{D}) = \max_{i\ne j}|d_i^T d_j|$. Theorem [Donoho & Elad ’03], [Tropp ’04]: OMP and BP are guaranteed to recover the true sparse code assuming that $\|\boldsymbol{\Gamma}\|_0 < \frac{1}{2}\left(1+\frac{1}{\mu(\mathbf{D})}\right)$. Assuming that $m=2$ and $n=64$, we have $\mu(\mathbf{D}) \ge 0.063$ [Welch ’74]. As a result, uniqueness and success of pursuits is guaranteed only as long as $\|\boldsymbol{\Gamma}\|_0 < 8$: fewer than 8 non-zeros GLOBALLY are allowed! This is a very pessimistic result.
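
The mutual coherence appearing in the theorem is simple to compute; a minimal sketch, assuming the dictionary columns are normalized to unit norm:

```python
import numpy as np

def mutual_coherence(D):
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize the atoms
    G = np.abs(Dn.T @ Dn)                              # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                           # ignore the diagonal (i == j)
    return G.max()

# The classical bound then allows ||Gamma||_0 < 0.5 * (1 + 1 / mutual_coherence(D)).
```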

Classical Sparse Theory (Noiseless), continued: with fewer than 8 non-zeros allowed globally, classic SparseLand theory cannot provide good explanations for the CSC model.

Moving to Local Sparsity: Stripes. Define the stripe norm $\|\boldsymbol{\Gamma}\|_{0,\infty}^s = \max_i \|\boldsymbol{\gamma}_i\|_0$ and the problem $(P_{0,\infty}):\ \min_{\boldsymbol{\Gamma}}\|\boldsymbol{\Gamma}\|_{0,\infty}^s\ \text{s.t.}\ \mathbf{X}=\mathbf{D}\boldsymbol{\Gamma}$. A low $\|\boldsymbol{\Gamma}\|_{0,\infty}^s$ means that all the $\boldsymbol{\gamma}_i$ are sparse, i.e. every patch has a sparse representation over $\boldsymbol{\Omega}$. If $\boldsymbol{\Gamma}$ is locally sparse [Papyan, Sulam & Elad ’16], the solution of $(P_{0,\infty})$ is necessarily unique and the global OMP and BP are guaranteed to recover it. This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal.

From Ideal to Noisy Signals: in practice, $\mathbf{Y} = \mathbf{D}\boldsymbol{\Gamma} + \mathbf{E}$, where $\mathbf{E}$ is due to noise or model deviations, and we solve $(P_{0,\infty}^\epsilon):\ \min_{\boldsymbol{\Gamma}}\|\boldsymbol{\Gamma}\|_{0,\infty}^s\ \text{s.t.}\ \|\mathbf{Y}-\mathbf{D}\boldsymbol{\Gamma}\|_2\le\epsilon$. How close is the estimate $\hat{\boldsymbol{\Gamma}}$ to $\boldsymbol{\Gamma}$? If $\boldsymbol{\Gamma}$ is locally sparse and the noise is bounded [Papyan, Sulam & Elad ’16], the solution of $(P_{0,\infty}^\epsilon)$ is stable, and so is the solution obtained via the global OMP/BP: the true and estimated representations are close.

Going Deeper. “Convolutional Neural Networks Analyzed via Convolutional Sparse Coding”, Vardan Papyan, Yaniv Romano and Michael Elad; “Multi-Layer Convolutional Sparse Modeling: Pursuit and Dictionary Learning”, Jeremias Sulam, Vardan Papyan, Yaniv Romano and Michael Elad.

CSC and CNN: there is an analogy between CSC and CNN, namely the convolutional structure, the data-driven models, and the fact that ReLU is a sparsifying operator. We propose a principled way to analyze CNN through the eyes of sparse representations. But first, a short review of CNN…

Forward Pass of CNN: for a two-layer network, $f(\mathbf{Y},\{\mathbf{W}_i\},\{\mathbf{b}_i\}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$, where $\mathbf{Y}\in\mathbb{R}^N$, $\mathbf{W}_1^T\in\mathbb{R}^{Nm_1\times N}$, $\mathbf{b}_1\in\mathbb{R}^{Nm_1}$, $\mathbf{W}_2^T\in\mathbb{R}^{Nm_2\times Nm_1}$, $\mathbf{b}_2\in\mathbb{R}^{Nm_2}$, and the output is $\mathbf{Z}_2\in\mathbb{R}^{Nm_2}$.
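
A minimal sketch of this two-layer forward pass, with small dense matrices standing in for the convolutional operators (shapes are illustrative):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward_pass(Y, W1, b1, W2, b2):
    Z1 = relu(b1 + W1.T @ Y)     # first-layer feature map
    Z2 = relu(b2 + W2.T @ Z1)    # second-layer feature map
    return Z2
```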

Training Stage of CNN: consider the task of classification. Given a set of signals $\{\mathbf{Y}_j\}_j$ and their corresponding labels $\{h(\mathbf{Y}_j)\}_j$, the CNN learns an end-to-end mapping: $\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j),\mathbf{U},f(\mathbf{Y}_j,\{\mathbf{W}_i\},\{\mathbf{b}_i\})\big)$, where $h(\mathbf{Y}_j)$ is the true label, $\mathbf{U}$ is the classifier, and $f(\cdot)$ is the output of the last layer.

Multi-Layer CSC (ML-CSC): back to CSC, $\mathbf{X}\in\mathbb{R}^N$, $\mathbf{D}_1\in\mathbb{R}^{N\times Nm_1}$, $\boldsymbol{\Gamma}_1\in\mathbb{R}^{Nm_1}$. Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals. We propose to impose the same structure on the representations themselves: $\boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2$, with $\mathbf{D}_2\in\mathbb{R}^{Nm_1\times Nm_2}$ and $\boldsymbol{\Gamma}_2\in\mathbb{R}^{Nm_2}$.

Intuition: From Atoms to Molecules. Columns in $\mathbf{D}_1$ are convolutional atoms; columns in $\mathbf{D}_2$ combine the atoms in $\mathbf{D}_1$, creating more complex structures. Every atom of the effective dictionary $\mathbf{D}_1\mathbf{D}_2$ is a superposition of the atoms of $\mathbf{D}_1$, and the size of the effective atoms increases throughout the layers (receptive field).

Intuition: From Atoms to Molecules. One could chain the multiplication of all the dictionaries into one effective dictionary, $\mathbf{D}_{\mathrm{eff}} = \mathbf{D}_1\mathbf{D}_2\mathbf{D}_3\cdots\mathbf{D}_K$, and then $\mathbf{X} = \mathbf{D}_{\mathrm{eff}}\boldsymbol{\Gamma}_K$ as in SparseLand. However, a key property of this model is the sparsity of each intermediate representation (the feature maps).

A Small Taste: Model Training (MNIST). MNIST dictionary: $\mathbf{D}_1$, 32 filters of size 7×7 with stride 2 (dense); $\mathbf{D}_2$, 128 filters of size 5×5×32 with stride 1 (99.09% sparse); $\mathbf{D}_3$, 1024 filters of size 7×7×128 (99.89% sparse). Effective atoms: $\mathbf{D}_1$ (7×7), $\mathbf{D}_1\mathbf{D}_2$ (15×15), $\mathbf{D}_1\mathbf{D}_2\mathbf{D}_3$ (28×28).

A Small Taste: Pursuit. [Figure: a test digit $\mathbf{Y}$ and the recovered representations $\hat{\boldsymbol{\Gamma}}_0, \hat{\boldsymbol{\Gamma}}_1, \hat{\boldsymbol{\Gamma}}_2, \hat{\boldsymbol{\Gamma}}_3$; 94.51% sparse (213 non-zeros).]

Deep Coding Problem (DCP). Noiseless pursuit: find $\{\boldsymbol{\Gamma}_j\}_{j=1}^K$ s.t. $\mathbf{X}=\mathbf{D}_1\boldsymbol{\Gamma}_1$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s\le\lambda_1$; $\boldsymbol{\Gamma}_1=\mathbf{D}_2\boldsymbol{\Gamma}_2$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s\le\lambda_2$; …; $\boldsymbol{\Gamma}_{K-1}=\mathbf{D}_K\boldsymbol{\Gamma}_K$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s\le\lambda_K$. Noisy pursuit: find $\{\boldsymbol{\Gamma}_j\}_{j=1}^K$ s.t. $\|\mathbf{Y}-\mathbf{D}_1\boldsymbol{\Gamma}_1\|_2\le\mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s\le\lambda_1$; $\|\boldsymbol{\Gamma}_1-\mathbf{D}_2\boldsymbol{\Gamma}_2\|_2\le\mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s\le\lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1}-\mathbf{D}_K\boldsymbol{\Gamma}_K\|_2\le\mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s\le\lambda_K$.

Deep Learning Problem (DLP). Supervised dictionary learning: $\min_{\{\mathbf{D}_i\}_{i=1}^K,\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j),\mathbf{U},\mathrm{DCP}^\star(\mathbf{Y}_j,\{\mathbf{D}_i\})\big)$, where $h(\mathbf{Y}_j)$ is the true label, $\mathbf{U}$ is the classifier, and $\mathrm{DCP}^\star$ is the deepest representation obtained from the DCP; this is task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce ’12]. Unsupervised dictionary learning: find $\{\mathbf{D}_i\}_{i=1}^K$ s.t. for $j=1,\dots,J$: $\|\mathbf{Y}_j-\mathbf{D}_1\boldsymbol{\Gamma}_1^j\|_2\le\mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1^j\|_{0,\infty}^s\le\lambda_1$; $\|\boldsymbol{\Gamma}_1^j-\mathbf{D}_2\boldsymbol{\Gamma}_2^j\|_2\le\mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2^j\|_{0,\infty}^s\le\lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1}^j-\mathbf{D}_K\boldsymbol{\Gamma}_K^j\|_2\le\mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K^j\|_{0,\infty}^s\le\lambda_K$.

ML-CSC: The Simplest Pursuit. The simplest pursuit algorithm (single-layer case) is the THR algorithm: given $\mathbf{Y}=\mathbf{D}\boldsymbol{\Gamma}+\mathbf{E}$ with $\boldsymbol{\Gamma}$ sparse, estimate $\hat{\boldsymbol{\Gamma}} = \mathcal{P}_\beta(\mathbf{D}^T\mathbf{Y})$. Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model, and then ReLU = soft nonnegative thresholding.
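
A minimal sketch of this single-layer THR step; the comment in the code restates the ReLU connection made above, and all names are illustrative.

```python
import numpy as np

def soft_nn_threshold(v, beta):
    """Soft nonnegative thresholding: zero out entries below beta, shrink the rest."""
    return np.maximum(v - beta, 0.0)

def thr_pursuit(D, Y, beta):
    """Single-layer THR estimate: Gamma_hat = P_beta(D^T Y)."""
    # This equals ReLU(b + D^T Y) with bias b = -beta.
    return soft_nn_threshold(D.T @ Y, beta)
```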

Consider this for Solving the DCP. $(\mathrm{DCP}_\lambda^{\mathcal{E}})$: find $\{\boldsymbol{\Gamma}_j\}_{j=1}^K$ s.t. $\|\mathbf{Y}-\mathbf{D}_1\boldsymbol{\Gamma}_1\|_2\le\mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s\le\lambda_1$; $\|\boldsymbol{\Gamma}_1-\mathbf{D}_2\boldsymbol{\Gamma}_2\|_2\le\mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s\le\lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1}-\mathbf{D}_K\boldsymbol{\Gamma}_K\|_2\le\mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s\le\lambda_K$. Layered thresholding (LT): estimate $\boldsymbol{\Gamma}_1$ via the THR algorithm, then estimate $\boldsymbol{\Gamma}_2$ via the THR algorithm, i.e. $\hat{\boldsymbol{\Gamma}}_2 = \mathcal{P}_{\beta_2}\big(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{Y})\big)$. Forward pass of CNN: $f(\mathbf{Y}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$. Translation table (CNN language ↔ SparseLand language): ReLU ↔ soft non-negative THR; bias $\mathbf{b}$ ↔ thresholds $\beta$; weights $\mathbf{W}$ ↔ dictionary $\mathbf{D}$; forward pass $f(\cdot)$ ↔ layered soft NN THR, $\mathrm{DCP}^\star$.

Consider this for Solving the DCP. Comparing the layered thresholding, $\hat{\boldsymbol{\Gamma}}_2 = \mathcal{P}_{\beta_2}\big(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{Y})\big)$, with the forward pass of the CNN, $f(\mathbf{Y}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$: the layered (soft nonnegative) thresholding and the forward pass algorithm are the very same thing!
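
A minimal numerical check of this equivalence on random data: layered soft nonnegative thresholding with dictionaries $\mathbf{D}_1,\mathbf{D}_2$ and thresholds $\beta_1,\beta_2$ coincides with a two-layer forward pass using $\mathbf{W}_i=\mathbf{D}_i$ and $\mathbf{b}_i=-\beta_i$ (sizes are arbitrary, purely illustrative).

```python
import numpy as np

def soft_nn_thr(v, beta):
    return np.maximum(v - beta, 0.0)

rng = np.random.default_rng(1)
N, m1, m2 = 32, 64, 96
D1 = rng.standard_normal((N, m1))
D2 = rng.standard_normal((m1, m2))
Y = rng.standard_normal(N)
beta1, beta2 = 0.3, 0.2

gamma2_lt = soft_nn_thr(D2.T @ soft_nn_thr(D1.T @ Y, beta1), beta2)            # layered THR
z2_cnn = np.maximum(-beta2 + D2.T @ np.maximum(-beta1 + D1.T @ Y, 0.0), 0.0)   # forward pass
assert np.allclose(gamma2_lt, z2_cnn)
```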

Consider this for Solving the DLP. DLP (supervised): $\min_{\{\mathbf{D}_i\}_{i=1}^K,\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j),\mathbf{U},\mathrm{DCP}^\star(\mathbf{Y}_j,\{\mathbf{D}_i\})\big)$, where the thresholds for the DCP should also be learned and $\mathrm{DCP}^\star$ is estimated via the layered THR algorithm. CNN training: $\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j),\mathbf{U},f(\mathbf{Y}_j,\{\mathbf{W}_i\},\{\mathbf{b}_i\})\big)$. The problems solved by the training stage of CNN and by the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm (forward pass $f(\cdot)$ ↔ layered THR pursuit, $\mathrm{DCP}^\star$). Recall that for the ML-CSC there also exists an unsupervised avenue for training the dictionaries, which has no simple parallel in CNN.

Theoretical Questions. Suppose a signal $\mathbf{X}$ is generated by the ML-CSC model, $\mathbf{X}=\mathbf{D}_1\boldsymbol{\Gamma}_1$, $\boldsymbol{\Gamma}_1=\mathbf{D}_2\boldsymbol{\Gamma}_2$, …, $\boldsymbol{\Gamma}_{K-1}=\mathbf{D}_K\boldsymbol{\Gamma}_K$, with every $\boldsymbol{\Gamma}_i$ being $\ell_{0,\infty}$-sparse, and we observe $\mathbf{Y}$. Solving $(\mathrm{DCP}_\lambda^{\mathcal{E}})$ via layered thresholding (the forward pass), or via some other algorithm, how well do we recover $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$? Our dream is to find the true $\boldsymbol{\Gamma}$ that is then used in the task of classification.

Uniqueness of $(\mathrm{DCP}_\lambda)$. $(\mathrm{DCP}_\lambda)$: find a set of representations satisfying $\mathbf{X}=\mathbf{D}_1\boldsymbol{\Gamma}_1$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s\le\lambda_1$; $\boldsymbol{\Gamma}_1=\mathbf{D}_2\boldsymbol{\Gamma}_2$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s\le\lambda_2$; …; $\boldsymbol{\Gamma}_{K-1}=\mathbf{D}_K\boldsymbol{\Gamma}_K$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s\le\lambda_K$. Is this set unique? Theorem: if a set of solutions $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ is found for $(\mathrm{DCP}_\lambda)$ such that $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s = \lambda_i < \frac{1}{2}\left(1+\frac{1}{\mu(\mathbf{D}_i)}\right)$, with mutual coherence $\mu(\mathbf{D})=\max_{i\ne j}|d_i^T d_j|$ [Donoho & Elad ’03], then these are necessarily the unique solution to this problem. In other words, the feature maps CNN aims to recover are unique.

Stability of $(\mathrm{DCP}_\lambda^{\mathcal{E}})$. $(\mathrm{DCP}_\lambda^{\mathcal{E}})$: find a set of representations satisfying $\|\mathbf{Y}-\mathbf{D}_1\boldsymbol{\Gamma}_1\|_2\le\mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s\le\lambda_1$; $\|\boldsymbol{\Gamma}_1-\mathbf{D}_2\boldsymbol{\Gamma}_2\|_2\le\mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s\le\lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1}-\mathbf{D}_K\boldsymbol{\Gamma}_K\|_2\le\mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s\le\lambda_K$. Is this set stable? Suppose that we manage to solve $(\mathrm{DCP}_\lambda^{\mathcal{E}})$ and find a feasible set $\{\hat{\boldsymbol{\Gamma}}_i\}_{i=1}^K$ satisfying all the conditions. The question we pose is: how close is $\hat{\boldsymbol{\Gamma}}_i$ to $\boldsymbol{\Gamma}_i$?

Stability of $(\mathrm{DCP}_\lambda^{\mathcal{E}})$. Theorem: if the true representations $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ satisfy $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1+\frac{1}{\mu(\mathbf{D}_i)}\right)$, then the set of solutions $\{\hat{\boldsymbol{\Gamma}}_i\}_{i=1}^K$ obtained by solving this problem (somehow) with $\mathcal{E}_0=\|\mathbf{E}\|_2^2$ and $\mathcal{E}_i=0$ ($i\ge 1$) must obey $\|\hat{\boldsymbol{\Gamma}}_i-\boldsymbol{\Gamma}_i\|_2^2 \le 4\|\mathbf{E}\|_2^2 \prod_{j=1}^{i}\frac{1}{1-(2\lambda_j-1)\mu(\mathbf{D}_j)}$. The problem CNN aims to solve is thus stable under certain conditions; observe, though, the annoying effect of error magnification as we dive into the model.

Local Noise Assumption: our analysis relied on the local sparsity of the underlying solution $\boldsymbol{\Gamma}$, enforced through the $\ell_{0,\infty}$ norm. In what follows, we present stability guarantees that will also depend on the local energy of the noise vector $\mathbf{E}$, enforced via the $\ell_{2,\infty}$ norm, defined as $\|\mathbf{E}\|_{2,\infty}^p = \max_i\|\mathbf{R}_i\mathbf{E}\|_2$.

Stability of Layered-THR. Theorem: if $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s < \frac{1}{2}\left(1+\frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|}$ (with $|\Gamma_i^{\min}|,|\Gamma_i^{\max}|$ the smallest and largest non-zero magnitudes in $\boldsymbol{\Gamma}_i$), then the layered hard THR (with the proper thresholds) will find the correct supports and $\|\hat{\boldsymbol{\Gamma}}_i^{\mathrm{LT}}-\boldsymbol{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$, where we have defined $\varepsilon_L^0 = \|\mathbf{E}\|_{2,\infty}^p$ and $\varepsilon_L^i = \|\boldsymbol{\Gamma}_i\|_{0,\infty}^p\cdot\big(\varepsilon_L^{i-1} + \mu(\mathbf{D}_i)(\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s-1)|\Gamma_i^{\max}|\big)$. The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded. Problems: dependence on the contrast, error growth across the layers, and an error even when there is no noise.

Layered Basis Pursuit (Noisy). We chose the thresholding algorithm for solving the $(\mathrm{DCP}_\lambda^{\mathcal{E}})$ due to its simplicity, but we do know that there are better pursuit methods; how about using them? Let's use Basis Pursuit instead: $\hat{\boldsymbol{\Gamma}}_1^{\mathrm{LBP}} = \arg\min_{\boldsymbol{\Gamma}_1}\frac{1}{2}\|\mathbf{Y}-\mathbf{D}_1\boldsymbol{\Gamma}_1\|_2^2+\xi_1\|\boldsymbol{\Gamma}_1\|_1$, then $\hat{\boldsymbol{\Gamma}}_2^{\mathrm{LBP}} = \arg\min_{\boldsymbol{\Gamma}_2}\frac{1}{2}\|\hat{\boldsymbol{\Gamma}}_1^{\mathrm{LBP}}-\mathbf{D}_2\boldsymbol{\Gamma}_2\|_2^2+\xi_2\|\boldsymbol{\Gamma}_2\|_1$, and so on. This is related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus ’10].

Stability of Layered BP. Theorem: assuming that $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \frac{1}{3}\left(1+\frac{1}{\mu(\mathbf{D}_i)}\right)$, then for correctly chosen $\{\xi_i\}_{i=1}^K$ we are guaranteed that (i) the support of $\hat{\boldsymbol{\Gamma}}_i^{\mathrm{LBP}}$ is contained in that of $\boldsymbol{\Gamma}_i$; (ii) the error is bounded, $\|\hat{\boldsymbol{\Gamma}}_i^{\mathrm{LBP}}-\boldsymbol{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$, where $\varepsilon_L^i = 7.5^i\,\|\mathbf{E}\|_{2,\infty}^p\prod_{j=1}^{i}\|\boldsymbol{\Gamma}_j\|_{0,\infty}^p$; and (iii) every entry in $\boldsymbol{\Gamma}_i$ greater than $\varepsilon_L^i/\|\boldsymbol{\Gamma}_i\|_{0,\infty}^p$ will be found. Revisiting the earlier problems (contrast, error growth, error even with no noise): this bound no longer depends on the contrast and vanishes when there is no noise, yet the error growth across the layers remains.

Layered Iterative Thresholding. Layered BP: $\hat{\boldsymbol{\Gamma}}_j^{\mathrm{LBP}} = \arg\min_{\boldsymbol{\Gamma}_j}\frac{1}{2}\|\hat{\boldsymbol{\Gamma}}_{j-1}^{\mathrm{LBP}}-\mathbf{D}_j\boldsymbol{\Gamma}_j\|_2^2+\xi_j\|\boldsymbol{\Gamma}_j\|_1$ for each layer $j$. Layered iterative soft-thresholding: $\hat{\boldsymbol{\Gamma}}_j^{t} = \mathcal{S}_{\xi_j/c_j}\Big(\hat{\boldsymbol{\Gamma}}_j^{t-1} + \frac{1}{c_j}\mathbf{D}_j^T\big(\hat{\boldsymbol{\Gamma}}_{j-1}-\mathbf{D}_j\hat{\boldsymbol{\Gamma}}_j^{t-1}\big)\Big)$ over layers $j$ and iterations $t$, with $c_j > 0.5\,\lambda_{\max}(\mathbf{D}_j^T\mathbf{D}_j)$. Note that our suggestion implies that groups of layers share the same dictionaries, and it can be seen as a recurrent neural network [Gregor & LeCun ’10].
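
A minimal sketch of layered iterative soft-thresholding: plain ISTA run per layer, feeding each layer's code to the next. The step constant is set to $\lambda_{\max}(\mathbf{D}^T\mathbf{D})$, which satisfies the condition $c > 0.5\,\lambda_{\max}$; names and defaults are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def layered_ista(Y, dictionaries, xis, n_iters=100):
    prev = Y                                   # the "zeroth" representation is the signal itself
    estimates = []
    for D, xi in zip(dictionaries, xis):
        c = np.linalg.eigvalsh(D.T @ D).max()  # satisfies c > 0.5 * lambda_max(D^T D)
        gamma = np.zeros(D.shape[1])
        for _ in range(n_iters):
            gamma = soft_threshold(gamma + (D.T @ (prev - D @ gamma)) / c, xi / c)
        estimates.append(gamma)
        prev = gamma                           # this layer's code feeds the next layer
    return estimates
```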

This Talk. The underlying idea: modeling the data source in order to be able to theoretically analyze algorithms' performance. We described the limitations of independent patch processing (local sparsity) and ways to overcome some of them, and we showed that a denoiser can be used to solve other inverse problems (RED). We presented a theoretical study of Convolutional Sparse Coding and a practical algorithm that works locally. We mentioned several interesting connections between CSC and Convolutional Neural Networks, and this led us to propose and analyze a multi-layer extension of CSC (ML-CSC), shown to be tightly connected to CNN. The ML-CSC was shown to enable a theoretical study of CNN, along with new insights: a novel interpretation and theoretical understanding of CNN, and an extension of the classical sparse theory to a multi-layer setting.

Questions?

RED: $G(\mathbf{X}) = \frac{1}{2}\mathbf{X}^T(\mathbf{X}-f(\mathbf{X}))$. Let's derive the gradient. Condition I (differentiability): $\nabla_{\mathbf{X}} G(\mathbf{X}) = \mathbf{X} - \frac{1}{2}f(\mathbf{X}) - \frac{1}{2}\nabla_{\mathbf{X}} f(\mathbf{X})\,\mathbf{X}$. A closer look at the directional derivative: $\nabla_{\mathbf{X}} f(\mathbf{X})\,\mathbf{X} = \lim_{\epsilon\to 0}\frac{f(\mathbf{X}+\epsilon\mathbf{X})-f(\mathbf{X})}{\epsilon} = \lim_{\epsilon\to 0}\frac{f((1+\epsilon)\mathbf{X})-f(\mathbf{X})}{\epsilon}$, which by Condition II (local homogeneity) equals $\frac{(1+\epsilon)f(\mathbf{X})-f(\mathbf{X})}{\epsilon} = f(\mathbf{X})$; hence $\nabla_{\mathbf{X}} G(\mathbf{X}) = \mathbf{X} - f(\mathbf{X})$. Let's dare and compute the Hessian: $\nabla_{\mathbf{X}}\nabla_{\mathbf{X}} G(\mathbf{X}) = \mathbf{I} - \nabla_{\mathbf{X}} f(\mathbf{X}) \succeq 0$ by Condition III (passivity), $\max\lambda(\nabla_{\mathbf{X}} f(\mathbf{X})) \le 1$. This regularizer is convex!
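
A minimal numerical sanity check of this derivation for a toy pseudo-linear denoiser $f(\mathbf{X})=\mathbf{W}\mathbf{X}$ with a symmetric $\mathbf{W}$ of spectral radius below 1 (so it is differentiable, homogeneous, and passive): the finite-difference gradient of $G$ matches $\mathbf{X}-f(\mathbf{X})$. This is an illustrative toy, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 20
A = np.abs(rng.standard_normal((N, N)))
W = A + A.T
W /= np.linalg.eigvalsh(W).max() + 1e-3      # symmetric, spectral radius < 1 (passivity)

f = lambda X: W @ X                          # toy "denoiser"
G = lambda X: 0.5 * X @ (X - f(X))           # RED regularizer

X = rng.standard_normal(N)
grad_fd = np.array([(G(X + 1e-6 * e) - G(X - 1e-6 * e)) / 2e-6 for e in np.eye(N)])
assert np.allclose(grad_fd, X - f(X), atol=1e-4)   # grad G(X) = X - f(X)
```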