1
A Quest for a Universal Model for Signals: From Sparsity to ConvNets
Yaniv Romano, The Department of Electrical Engineering, Technion – Israel Institute of Technology. Advisor: Prof. Michael Elad. The research leading to these results has received funding from the European Union's Seventh Framework Program (FP/ ) under ERC grant agreement ERC-SPARSE.
2
Modeling Data Data-sources are structured
Seismic Data Matrix Data Text Documents Stock Market Biological Signals Data-sources are structured This structure, when identified, is the engine behind our ability to process this data Radar Imaging Videos Social Networks Still Images Traffic info Voice Signals 3D Objects Medical Imaging
3
Modeling Data Seismic Data Matrix Data Text Documents Stock Market Biological Signals Almost every task in data processing requires a model – true for denoising, deblurring, super-resolution, compression, recognition, prediction, and more… We will put a special emphasis on image denoising Radar Imaging Videos Social Networks Still Images Traffic info Voice Signals 3D Objects Medical Imaging
4
Generative Modeling of Data Sources Enables
Models. A model is a mathematical description of the underlying signal of interest, capturing our beliefs regarding its structure. The underlying idea: generative modeling of data sources enables systematic algorithm development and a theoretical analysis of the resulting algorithms' performance.
5
SparseLand: we will concentrate on sparse and redundant representations.
It is a universal model, adaptive (via learning) to the data source at hand, and it intersects with machine learning. How can we treat high-dimensional signals? We face the curse of dimensionality, as we would need to learn a huge number of parameters.
6
Improving the model → Improving the performance
Part I: Local Modeling. The traditional approach to treating images reduces the dimensionality by working locally on small image patches. In the first part of this talk we will study the limitations of this local approach and suggest different ways to overcome them. As a result, we push denoising algorithms to new heights. Improving the model → Improving the performance
7
Part II: Beyond Denoising
The advent of highly effective models for images has led researchers to believe that existing denoisers are touching the ceiling in terms of restoration performance Can we leverage this impressive achievement to treat other problems in image processing? In the second part of this talk we will see that a denoiser can do much more than just… denoising
8
Why can we use a local prior to solve a global problem?
Global Sparse Models We proceed by revisiting the traditional local modeling approach for treating high-dimensional signals, and ask: Why can we use a local prior to solve a global problem? This will take us to Convolutional Sparse Coding (CSC), a global model that is based on a local sparsity prior. The convolutional structure is the key to bypassing the curse of dimensionality.
9
Part III: Analyzing CNN
We then propose a multi-layered extension of the CSC model, called ML-CSC, which is tightly connected to Convolutional Neural Networks (CNN): SparseLand (sparse representation theory) ↔ CNN (convolutional neural networks). The ML-CSC model enables a theoretical analysis of the performance of deep-learning algorithms (e.g., the forward pass). *Our analysis holds true for fully connected networks as well.
10
Denoising via Local Modeling Global Sparse Modeling Inverse Problems
Research themes:
Improving local modeling → improved denoising performance
Better treatment of inverse problems by improving local priors
Leveraging a denoising algorithm to treat other inverse problems
Global sparse priors, and the connection to deep learning
Publications:
Improving K-SVD Denoising by Post-Processing its Method-Noise [IEEE-ICIP '13]
Single Image Interpolation via Adaptive Non-Local Sparsity-Based Modeling [IEEE-TIP '14]
Patch-Disagreement as a Way to Improve K-SVD Denoising [IEEE-ICASSP '15]
Boosting of Image Denoising Algorithms [SIAM IS '15]
Con-Patch: When a Patch Meets its Context [IEEE-TIP '16]
RAISR: Rapid and Accurate Image Super Resolution [IEEE-TCI '16]
Gaussian Mixture Diffusion [IEEE-ICSEE '16]
Turning a Denoiser into a Super-Resolver using Plug and Play Priors [IEEE-ICIP '16]
On the Global-Local Dichotomy in Sparsity Modeling [Compressed Sensing '17]
Convolutional Neural Networks Analyzed via Convolutional Sparse Coding [JMLR '17]
Convolutional Dictionary Learning via Local Processing [ICCV '17]
The Little Engine that Could: Regularization by Denoising (RED) [SIAM IS '17]
Example-Based Image Synthesis via Randomized Patch-Matching [IEEE-TIP '17]
11
Part I: Image Denoising via Local Modeling
12
Image Denoising: the noisy image 𝐘 is the original image 𝐗 corrupted by white Gaussian noise 𝐄, i.e., 𝐘 = 𝐗 + 𝐄.
Leading image denoising methods are built upon powerful patch-based local models:
13
The SparseLand Model: 𝐑_i𝐗 = 𝛀γ_i
Assumes that every patch is a linear combination of a few columns, called atoms, taken from a matrix called a dictionary 𝛀 ∈ ℝ^{n×m}, with m > n (redundant). The operator 𝐑_i extracts the i-th n-dimensional patch from 𝐗 ∈ ℝ^N. Sparse coding:
(P_0): min_{γ_i} ‖γ_i‖_0 s.t. 𝐑_i𝐗 = 𝛀γ_i
14
Patch Denoising Given a noisy patch 𝐑 i 𝐘, solve
(P_0^ε): γ̂_i = argmin_{γ_i} ‖γ_i‖_0 s.t. ‖𝐑_i𝐘 − 𝛀γ_i‖_2 ≤ ε. The clean patch estimate is 𝛀γ̂_i.
(P_0) and (P_0^ε) are hard to solve. In practice one uses greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding, or convex relaxations such as Basis Pursuit (BP):
(P_1^ε): min_{γ_i} ‖γ_i‖_1 + ξ‖𝐑_i𝐘 − 𝛀γ_i‖_2^2
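For concreteness, here is a minimal numpy sketch of greedy patch sparse coding via OMP. It is only an illustration of the idea, not the exact implementation used in the talk; the toy dictionary and noise level are assumptions.

```python
import numpy as np

def omp(Omega, y, eps):
    """Greedy OMP: approximate argmin ||gamma||_0 s.t. ||y - Omega @ gamma||_2 <= eps."""
    n, m = Omega.shape
    residual = y.copy()
    support = []
    gamma = np.zeros(m)
    while np.linalg.norm(residual) > eps and len(support) < n:
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(Omega.T @ residual)))
        support.append(j)
        # least-squares fit over the chosen support
        coeffs, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)
        gamma[:] = 0.0
        gamma[support] = coeffs
        residual = y - Omega @ gamma
    return gamma

# toy usage: "denoise" one 8x8 patch with a random redundant dictionary
rng = np.random.default_rng(0)
Omega = rng.standard_normal((64, 256))
Omega /= np.linalg.norm(Omega, axis=0)            # unit-norm atoms
y_noisy = Omega[:, :3] @ np.array([1.0, -0.5, 2.0]) + 0.05 * rng.standard_normal(64)
gamma_hat = omp(Omega, y_noisy, eps=0.05 * np.sqrt(64))
patch_clean = Omega @ gamma_hat                    # the clean-patch estimate
```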
15
Recall K-SVD Denoising [Elad & Aharon, ‘06]
Starting from an initial dictionary, alternate between denoising each patch of the noisy image using OMP and updating the dictionary using K-SVD; the reconstructed image is obtained by averaging the overlapping denoised patches. Despite its simplicity, this is a very well-performing algorithm. We refer to this framework as "patch averaging".
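A minimal sketch of the "patch averaging" framework follows, reusing the omp function from the sketch above. The dictionary-update (K-SVD) step is omitted and the dictionary Omega is assumed to be fixed; patch size and the eps parameter are assumptions.

```python
import numpy as np

def denoise_patch_averaging(Y, Omega, eps, patch=8):
    """Sparse-code every overlapping patch independently (OMP), put the
    cleaned patches back, and average the overlaps. K-SVD's dictionary
    update is omitted; Omega is a fixed local dictionary."""
    H, W = Y.shape
    acc = np.zeros_like(Y)
    weight = np.zeros_like(Y)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            y_ij = Y[i:i+patch, j:j+patch].reshape(-1)        # R_i Y
            gamma = omp(Omega, y_ij, eps)                      # local sparse coding
            acc[i:i+patch, j:j+patch] += (Omega @ gamma).reshape(patch, patch)
            weight[i:i+patch, j:j+patch] += 1.0
    return acc / weight                                        # average the overlaps
```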
16
What is missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of… The local-global gap: efficient independent local patch processing vs. the global need to model the entire image.
17
What is missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of… The local-global gap: efficient independent local patch processing vs. the global need to model the entire image.
Whitening the residual image [Romano & Elad '13]
Exploiting self-similarities [Romano, Protter & Elad '14]
Leveraging the context of the patch [Romano & Elad '16], [Romano, Isidoro & Milanfar '16]
Enforcing the local model on the final patches (EPLL) [Ren, Romano & Elad '17]
SOS-Boosting [Romano & Elad '15]
18
SOS-Boosting [Romano & Elad ’15]
Given any denoiser, how can we improve its performance? Denoise
19
SOS-Boosting [Romano & Elad ’15]
Given any denoiser, how can we improve its performance? Denoise Previous Result Strengthen
20
SOS-Boosting [Romano & Elad ’15]
Given any denoiser, how can we improve its performance? Denoise Previous Result Strengthen – Operate
21
SOS-Boosting [Romano & Elad ’15]
Given any denoiser, how can we improve its performance? Strengthen (add the previous result to the noisy image) – Operate (denoise) – Subtract the previous result. Intuitively, an improvement is expected since SNR(𝐘 + 𝐗̂) > SNR(𝐘).
SOS formulation: 𝐗̂_{k+1} = f(𝐘 + ρ𝐗̂_k) − ρ𝐗̂_k
22
SOS-Boosting [Romano & Elad ’15]
We study the convergence of the SOS using only the linear part of the denoiser, 𝐗̂ = f(𝐘) = 𝐖𝐘. This holds for K-SVD, EPLL, NLM, BM3D and others, where the overall processing is divided into a non-linear stage of decisions, followed by a linear filtering.
Relation to graph theory: the SOS minimizes a Laplacian-regularized functional, 𝐗* = argmin_𝐗 ‖𝐗 − 𝐖𝐘‖_2^2 + ρ𝐗^T(𝐗 − 𝐖𝐗).
Relation to game theory: the SOS encourages overlapping patches to reach a consensus.
SOS formulation: 𝐗̂_{k+1} = f(𝐘 + ρ𝐗̂_k) − ρ𝐗̂_k
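A minimal sketch of the SOS iteration is shown below. Any black-box denoiser can be plugged in; here a scipy Gaussian filter merely stands in for a real denoiser (K-SVD, BM3D, …), and the toy image, rho and iteration count are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter   # stand-in "denoiser" f(.)

def sos_boost(Y, denoise, rho=1.0, n_iters=10):
    """SOS-Boosting: Strengthen, Operate, Subtract:
        X_{k+1} = f(Y + rho * X_k) - rho * X_k
    where 'denoise' is an arbitrary black-box denoiser f(.)."""
    X = denoise(Y)                           # initial denoised estimate
    for _ in range(n_iters):
        X = denoise(Y + rho * X) - rho * X
    return X

# toy usage
rng = np.random.default_rng(0)
clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
boosted = sos_boost(noisy, lambda Z: gaussian_filter(Z, sigma=1.0), rho=1.0, n_iters=5)
```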
23
Part II: A denoiser can do much more than just… Denoising
The Little Engine that Could: Regularization by denoising (RED) Yaniv Romano, Michael Elad and Peyman Milanfar
24
Regularization by Denoising (RED)
How can we take a denoiser and use it for solving inverse problems? Here 𝐇 is a linear degradation operator (e.g., blur, decimation…), 𝐖 is a pseudo-linear filter of a denoising algorithm, and f(𝐗) is an arbitrary denoising algorithm.
General formulation: min_𝐗 ‖𝐇𝐗 − 𝐘‖_2^2 + G(𝐗), where the first term is the relation to the measurements and G(𝐗) is the prior (regularizer).
RED regularizer: min_𝐗 ‖𝐇𝐗 − 𝐘‖_2^2 + (ρ/2)𝐗^T(𝐗 − f(𝐗)), or with the pseudo-linear filter, min_𝐗 ‖𝐇𝐗 − 𝐘‖_2^2 + (ρ/2)𝐗^T(𝐗 − 𝐖𝐗).
The question: ∇_𝐗 G(𝐗) = ?
25
Regularization by Denoising
min_𝐗 ‖𝐇𝐗 − 𝐘‖_2^2 + G(𝐗), with the RED regularizer G(𝐗) = (ρ/2)𝐗^T(𝐗 − f(𝐗)).
Surprisingly, this regularizer is differentiable under the following assumptions:
f(𝐗) is differentiable
cf(𝐗) = f(c𝐗) for c = 1 + ε, ε → 0 (local homogeneity)
Then ∇_𝐗 G(𝐗) = 𝐗 − f(𝐗). Convexity is guaranteed if the spectral radius of ∇_𝐗 f(𝐗) is ≤ 1 (passive operator). These conditions hold true for many denoising algorithms.
26
Regularization by Denoising
min_𝐗 ‖𝐇𝐗 − 𝐘‖_2^2 + (ρ/2)𝐗^T(𝐗 − f(𝐗)), with ∇_𝐗 G(𝐗) = 𝐗 − f(𝐗).
We have easy access to the gradient of the above objective, and it is convex under the 3 conditions presented before, so any reasonable optimization algorithm will reach the global minimum. Plug in a state-of-the-art denoiser, and get cutting-edge performance in image deblurring, super-resolution, and more…
*We proposed 3 different ways to minimize RED: (1) Steepest Descent, (2) ADMM, (3) Fixed-Point.
27
Numerical Approach: Fixed-Point
min_𝐗 ‖𝐇𝐗 − 𝐘‖_2^2 + (ρ/2)𝐗^T(𝐗 − f(𝐗))
Setting the gradient to zero: 𝐇^T(𝐇𝐗 − 𝐘) + ρ(𝐗 − f(𝐗)) = 0
Fixed-point iteration: 𝐇^T(𝐇𝐗_{k+1} − 𝐘) + ρ(𝐗_{k+1} − f(𝐗_k)) = 0
⇒ 𝐗_{k+1} = (𝐇^T𝐇 + ρ𝐈)^{-1}(𝐇^T𝐘 + ρ f(𝐗_k))
Guaranteed to converge due to the passivity of f(⋅).
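A minimal sketch of this fixed-point scheme is given below. For clarity 𝐇 is an explicit matrix and the linear system is solved directly; in practice the inverse would be applied implicitly (e.g., in the Fourier domain for a convolutional blur). The denoiser, rho and iteration count are placeholders.

```python
import numpy as np

def red_fixed_point(Y, H, denoise, rho=0.1, n_iters=50):
    """RED fixed-point iteration (sketch):
        X_{k+1} = (H^T H + rho I)^{-1} (H^T Y + rho * f(X_k))
    Y is the (flattened) measurement, H an explicit degradation matrix,
    'denoise' an arbitrary denoiser f(.)."""
    n = H.shape[1]
    A = H.T @ H + rho * np.eye(n)
    b = H.T @ Y
    X = H.T @ Y                       # simple initialization
    for _ in range(n_iters):
        X = np.linalg.solve(A, b + rho * denoise(X))
    return X
```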
28
Fixed-Point: Relation to CNN (1)
𝐗_{k+1} = (𝐇^T𝐇 + ρ𝐈)^{-1}𝐇^T𝐘 + ρ(𝐇^T𝐇 + ρ𝐈)^{-1} f(𝐗_k) = 𝐛 + 𝐌 f(𝐗_k)
That is, each fixed-point iteration applies the denoiser f to 𝐗_k, multiplies by a fixed matrix 𝐌, and adds a bias 𝐛 – the structure of a single network layer.
29
Fixed-Point: Relation to CNN (2)
Unrolling the fixed-point iterations yields a feed-forward chain 𝐳_{k+1} = f(𝐌𝐳_k + 𝐛), i.e., the same structure as a deep network. While CNNs use a point-wise non-linearity f(⋅), we propose a very aggressive and image-aware denoiser, and our scheme is guaranteed to minimize a clear and relevant objective function.
30
Part III: (Deep) Global Sparse Model
31
What is missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of… The local-global gap: efficient independent local patch processing vs. the global need to model the entire image. A theoretical backbone is missing! Why can we use a local prior to solve a global problem?
32
Convolutional Sparse Coding
Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding Vardan Papyan, Jeremias Sulam and Michael Elad Convolutional Dictionary Learning via Local Processing Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad
33
Intuitively, 𝐗 is composed as a superposition of shifted local filters: the contributions of the first filter, plus the contributions of the second filter, and so on.
34
Convolutional Sparse Coding (CSC)
The global signal is a sum of convolutions: 𝐗 = Σ_j d_j ∗ 𝐳_j, where each small filter d_j is convolved with its sparse feature map 𝐳_j.
35
Convolutional Sparse Representation
𝐗 = 𝐃𝚪, where 𝐃 is a global convolutional dictionary built from shifted copies of the local filters. The i-th patch satisfies 𝐑_i𝐗 = 𝛀γ_i, where 𝛀 ∈ ℝ^{n×(2n−1)m} is the stripe dictionary and γ_i ∈ ℝ^{(2n−1)m} is the stripe vector. Adjacent representations overlap, shifted by m items as we sweep through the patches of 𝐗.
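A minimal 1-D sketch of the CSC synthesis 𝐗 = 𝐃𝚪, written as a sum of circular convolutions of small filters with their sparse feature maps; the filter length, signal length and toy values are assumptions.

```python
import numpy as np

def csc_synthesize(filters, feature_maps):
    """Global CSC synthesis X = D * Gamma, expressed as a sum of circular
    convolutions d_j * z_j of m small filters with sparse feature maps."""
    N = feature_maps[0].shape[0]
    X = np.zeros(N)
    for d, z in zip(filters, feature_maps):
        d_pad = np.zeros(N); d_pad[:d.shape[0]] = d
        # circular convolution via FFT
        X += np.real(np.fft.ifft(np.fft.fft(d_pad) * np.fft.fft(z)))
    return X

# toy usage: two filters (m = 2), sparse feature maps of length N = 128
rng = np.random.default_rng(0)
filters = [rng.standard_normal(8) for _ in range(2)]
maps = [np.zeros(128) for _ in range(2)]
maps[0][[5, 70]] = [1.0, -2.0]; maps[1][40] = 0.5
X = csc_synthesize(filters, maps)
```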
36
CSC Relation to Our Story
A clear global model: every patch has a sparse representation w.r.t. the same local dictionary 𝛀, just as we have assumed, with no notion of disagreement on the patch overlaps. It is related to the current common practice of patch averaging (𝐑_i^T puts the patch 𝛀γ_i back in the i-th location of the global vector): 𝐗 = 𝐃𝚪 = (1/n) Σ_i 𝐑_i^T 𝛀γ_i.
What about the pursuit? "Patch averaging" performs independent sparse coding for each patch, whereas CSC should seek all the representations together.
What about the theory? Until recently, little was known regarding the theoretical aspects of CSC.
37
Classical Sparse Theory (Noiseless)
(P_0): min_𝚪 ‖𝚪‖_0 s.t. 𝐗 = 𝐃𝚪. Mutual coherence: μ(𝐃) = max_{i≠j} |d_i^T d_j|.
Theorem [Donoho & Elad '03], [Tropp '04]: OMP and BP are guaranteed to recover the true sparse code assuming that ‖𝚪‖_0 < ½(1 + 1/μ(𝐃)).
Assuming that m = 2 and n = 64, we have [Welch '74] that μ(𝐃) ≥ 0.063. As a result, uniqueness and success of the pursuits are guaranteed only as long as ‖𝚪‖_0 < 8. Fewer than 8 non-zeros GLOBALLY are allowed!!! This is a very pessimistic result!
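A minimal sketch of the mutual-coherence computation and the classical bound; the random dictionary is only a toy stand-in.

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j|, computed after normalizing the columns."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()

# toy usage: a random 64 x 128 dictionary and the classical sparsity bound
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
mu = mutual_coherence(D)
bound = 0.5 * (1 + 1 / mu)   # ||Gamma||_0 must stay below this for the guarantee
print(mu, bound)
```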
38
Classical Sparse Theory (Noiseless)
(P_0): min_𝚪 ‖𝚪‖_0 s.t. 𝐗 = 𝐃𝚪. Mutual coherence: μ(𝐃) = max_{i≠j} |d_i^T d_j|.
Theorem [Donoho & Elad '03], [Tropp '04]: OMP and BP are guaranteed to recover the true sparse code assuming that ‖𝚪‖_0 < ½(1 + 1/μ(𝐃)).
Assuming that m = 2 and n = 64, we have [Welch '74] that μ(𝐃) ≥ 0.063, so uniqueness and success of the pursuits are guaranteed only as long as ‖𝚪‖_0 < 8. Fewer than 8 non-zeros GLOBALLY are allowed!!! This is a very pessimistic result!
Classic SparseLand theory cannot provide good explanations for the CSC model.
39
Moving to Local Sparsity: Stripes
Define the stripe norm ‖𝚪‖_{0,∞}^s = max_i ‖γ_i‖_0. A low ‖𝚪‖_{0,∞}^s means all the stripes γ_i are sparse, i.e., every patch has a sparse representation over 𝛀.
(P_{0,∞}): min_𝚪 ‖𝚪‖_{0,∞}^s s.t. 𝐗 = 𝐃𝚪
If 𝚪 is locally sparse [Papyan, Sulam & Elad '16]: the solution of (P_{0,∞}) is necessarily unique, and the global OMP and BP are guaranteed to recover it. This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal.
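A minimal sketch of the ℓ_{0,∞} stripe norm for a 1-D CSC representation. The stripe indexing (which coefficients can influence the i-th length-n patch) and the circular boundary handling are simplifying assumptions for illustration.

```python
import numpy as np

def l0_inf_stripe_norm(Gamma, n):
    """||Gamma||_{0,inf}^s = max_i ||gamma_i||_0.  Gamma is an (N, m) array
    of feature maps; the i-th stripe collects the (2n-1)*m coefficients
    whose length-n filters overlap the i-th patch (circular boundaries)."""
    N = Gamma.shape[0]
    worst = 0
    for i in range(N):
        idx = np.arange(i - n + 1, i + n) % N     # 2n-1 spatial positions
        stripe = Gamma[idx, :]
        worst = max(worst, int(np.count_nonzero(stripe)))
    return worst

# toy usage: N = 128 positions, m = 2 filters, 6 random non-zeros
rng = np.random.default_rng(0)
G = np.zeros((128, 2)); G[rng.choice(128, 6, replace=False), 0] = 1.0
print(l0_inf_stripe_norm(G, n=8))
```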
40
From Ideal to Noisy Signals
In practice 𝐘 = 𝐃𝚪 + 𝐄, where 𝐄 is due to noise or model deviations.
(P_{0,∞}^ε): min_𝚪 ‖𝚪‖_{0,∞}^s s.t. ‖𝐘 − 𝐃𝚪‖_2 ≤ ε
If 𝚪 is locally sparse and the noise is bounded [Papyan, Sulam & Elad '16]: the solution of (P_{0,∞}^ε) is stable, and the solution obtained via the global OMP/BP is stable as well. How close is the estimate 𝚪̂ to 𝚪? The true and estimated representations are close.
41
Going Deeper Convolutional Neural Networks Analyzed via Convolutional Sparse Coding Vardan Papyan, Yaniv Romano, and Michael Elad Multi-Layer Convolutional Sparse Modeling: Pursuit and Dictionary Learning Jeremias Sulam, Vardan Papyan, Yaniv Romano, Michael Elad
42
CSC and CNN There is an analogy between CSC and CNN:
Convolutional structure Data driven models ReLU is a sparsifying operator We propose a principled way to analyze CNN through the eyes of sparse representations But first, a short review of CNN…
43
Forward pass of a two-layer CNN: f(𝐘; {𝐖_i}, {𝐛_i}) = ReLU(𝐛_2 + 𝐖_2^T ReLU(𝐛_1 + 𝐖_1^T 𝐘)), where 𝐘 ∈ ℝ^N is the input, 𝐖_1^T ∈ ℝ^{Nm_1×N} and 𝐛_1 ∈ ℝ^{Nm_1} are the first layer's convolutional weights (filters of size n_0) and biases, 𝐖_2^T ∈ ℝ^{Nm_2×Nm_1} and 𝐛_2 ∈ ℝ^{Nm_2} are the second layer's (filters of size n_1m_1), and 𝐙_2 ∈ ℝ^{Nm_2} is the output feature map.
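A minimal sketch of this two-layer forward pass, written with dense matrices purely to expose the structure of the formula (in a real CNN the weights are convolutional/banded). The toy dimensions are assumptions.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward_pass(Y, W1, b1, W2, b2):
    """f(Y) = ReLU(b2 + W2.T @ ReLU(b1 + W1.T @ Y))."""
    Z1 = relu(b1 + W1.T @ Y)
    Z2 = relu(b2 + W2.T @ Z1)
    return Z2

# toy dimensions: N = 16, m1 = 3, m2 = 4
rng = np.random.default_rng(0)
N, m1, m2 = 16, 3, 4
Y = rng.standard_normal(N)
W1 = rng.standard_normal((N, N * m1)); b1 = rng.standard_normal(N * m1)
W2 = rng.standard_normal((N * m1, N * m2)); b2 = rng.standard_normal(N * m2)
Z2 = forward_pass(Y, W1, b1, W2, b2)      # Z2 lives in R^{N m2}
```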
44
Training Stage of CNN Consider the task of classification
Given a set of signals {𝐘_j}_j and their corresponding labels {h(𝐘_j)}_j, the CNN learns an end-to-end mapping:
min_{{𝐖_i},{𝐛_i},𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, f(𝐘_j; {𝐖_i}, {𝐛_i}))
where h(𝐘_j) is the true label, 𝐔 is the classifier, and f(𝐘_j; {𝐖_i}, {𝐛_i}) is the output of the last layer.
45
Multi-Layer CSC (ML-CSC)
Back to CSC: convolutional sparsity assumes an inherent structure is present in natural signals, 𝐗 ∈ ℝ^N = 𝐃_1𝚪_1 with 𝐃_1 ∈ ℝ^{N×Nm_1} and 𝚪_1 ∈ ℝ^{Nm_1}. In the Multi-Layer CSC (ML-CSC) we propose to impose the same structure on the representations themselves: 𝚪_1 = 𝐃_2𝚪_2 with 𝐃_2 ∈ ℝ^{Nm_1×Nm_2} and 𝚪_2 ∈ ℝ^{Nm_2}, and so on.
46
Intuition: From Atoms to Molecules
Columns in 𝐃_1 ∈ ℝ^{N×Nm_1} are convolutional atoms. Columns in 𝐃_2 ∈ ℝ^{Nm_1×Nm_2} combine the atoms of 𝐃_1, creating more complex structures: the dictionary 𝐃_1𝐃_2 is a superposition of the atoms of 𝐃_1. The size of the effective atoms increases throughout the layers (the receptive field grows).
47
Intuition: From Atoms to Molecules
One could chain the multiplication of all the dictionaries into one effective dictionary 𝐃_eff = 𝐃_1𝐃_2𝐃_3⋯𝐃_K and then write 𝐗 = 𝐃_eff𝚪_K, as in SparseLand. However, a key property of this model is the sparsity of each intermediate representation (the feature maps).
48
A Small Taste: Model Training (MNIST)
MNIST dictionaries:
𝐃_1: 32 filters of size 7×7, with stride of 2 (dense)
𝐃_2: 128 filters of size 5×5×32, with stride of ( % sparse)
𝐃_3: 1024 filters of size 7×7×128 ( % sparse)
Effective atoms: 𝐃_1 (7×7), 𝐃_1𝐃_2 (15×15), 𝐃_1𝐃_2𝐃_3 (28×28)
49
A Small Taste: Pursuit. Given 𝐘, the pursuit recovers the representations 𝚪_0, 𝚪_1, 𝚪_2, 𝚪_3 (e.g., 94.51% sparse, 213 nnz).
50
Deep Coding Problem (𝐃𝐂𝐏)
Noiseless pursuit: find {𝚪_j}_{j=1}^K s.t. 𝐗 = 𝐃_1𝚪_1, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; 𝚪_1 = 𝐃_2𝚪_2, ‖𝚪_2‖_{0,∞}^s ≤ λ_2; … ; 𝚪_{K−1} = 𝐃_K𝚪_K, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
Noisy pursuit: find {𝚪_j}_{j=1}^K s.t. ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; ‖𝚪_1 − 𝐃_2𝚪_2‖_2 ≤ ℰ_1, ‖𝚪_2‖_{0,∞}^s ≤ λ_2; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
51
Deep Learning Problem 𝐃𝐋𝐏
Supervised dictionary learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce '12]):
min_{{𝐃_i}_{i=1}^K, 𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, 𝐃𝐂𝐏*(𝐘_j, {𝐃_i}))
where h(𝐘_j) is the true label, 𝐔 is the classifier, and 𝐃𝐂𝐏*(𝐘_j, {𝐃_i}) is the deepest representation obtained from the DCP.
Unsupervised dictionary learning: find {𝐃_i}_{i=1}^K s.t. for j = 1, …, J: ‖𝐘_j − 𝐃_1𝚪_1^j‖_2 ≤ ℰ_0, ‖𝚪_1^j‖_{0,∞}^s ≤ λ_1; ‖𝚪_1^j − 𝐃_2𝚪_2^j‖_2 ≤ ℰ_1, ‖𝚪_2^j‖_{0,∞}^s ≤ λ_2; … ; ‖𝚪_{K−1}^j − 𝐃_K𝚪_K^j‖_2 ≤ ℰ_{K−1}, ‖𝚪_K^j‖_{0,∞}^s ≤ λ_K.
52
ML-CSC: The Simplest Pursuit
The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal 𝐘 = 𝐃𝚪 + 𝐄 (with 𝚪 sparse) by computing 𝚪̂ = 𝒫_β(𝐃^T𝐘). Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model, and then ReLU = soft nonnegative thresholding.
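A minimal sketch making the last identity concrete: soft nonnegative thresholding of 𝐃^T𝐘 with threshold β is exactly ReLU(𝐃^T𝐘 − β), i.e., a layer with weights 𝐃 and bias −β. The toy dictionary, sparse code and threshold are assumptions.

```python
import numpy as np

def soft_nonneg_threshold(v, beta):
    """Soft nonnegative thresholding: zero out entries below beta and shrink
    the rest by beta.  This equals ReLU(v - beta)."""
    return np.maximum(v - beta, 0.0)

# single-layer THR pursuit for Y = D @ Gamma + E with a nonnegative, sparse Gamma
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128)); D /= np.linalg.norm(D, axis=0)
Gamma = np.zeros(128); Gamma[[3, 40, 90]] = [1.5, 2.0, 1.0]
Y = D @ Gamma + 0.05 * rng.standard_normal(64)
beta = 0.6
Gamma_hat = soft_nonneg_threshold(D.T @ Y, beta)          # = ReLU(D^T Y - beta)
assert np.allclose(Gamma_hat, np.maximum(D.T @ Y + (-beta), 0.0))
```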
53
Consider this for Solving the DCP
(𝐃𝐂𝐏_λ^ℰ): find {𝚪_j}_{j=1}^K s.t. ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; ‖𝚪_1 − 𝐃_2𝚪_2‖_2 ≤ ℰ_1, ‖𝚪_2‖_{0,∞}^s ≤ λ_2; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
Layered thresholding (LT): 𝚪̂_2 = 𝒫_{β_2}(𝐃_2^T 𝒫_{β_1}(𝐃_1^T𝐘))
54
Consider this for Solving the DCP
(𝐃𝐂𝐏_λ^ℰ): find {𝚪_j}_{j=1}^K s.t. ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
Layered thresholding (LT): estimate 𝚪_1 via the THR algorithm: 𝚪̂_2 = 𝒫_{β_2}(𝐃_2^T 𝒫_{β_1}(𝐃_1^T𝐘))
55
Consider this for Solving the DCP
(𝐃𝐂𝐏_λ^ℰ): find {𝚪_j}_{j=1}^K s.t. ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
Layered thresholding (LT): estimate 𝚪_1 and then 𝚪_2 via the THR algorithm: 𝚪̂_2 = 𝒫_{β_2}(𝐃_2^T 𝒫_{β_1}(𝐃_1^T𝐘))
Forward pass of CNN: f(𝐘) = ReLU(𝐛_2 + 𝐖_2^T ReLU(𝐛_1 + 𝐖_1^T𝐘))
Translation table, CNN language ↔ SparseLand language: ReLU ↔ soft nonnegative THR; bias 𝐛 ↔ thresholds β; weights 𝐖 ↔ dictionary 𝐃; forward pass f(⋅) ↔ layered soft nonnegative THR (𝐃𝐂𝐏*).
56
Consider this for Solving the DCP
(𝐃𝐂𝐏_λ^ℰ): find {𝚪_j}_{j=1}^K s.t. ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
Layered thresholding (LT): estimate 𝚪_1 and then 𝚪_2 via the THR algorithm: 𝚪̂_2 = 𝒫_{β_2}(𝐃_2^T 𝒫_{β_1}(𝐃_1^T𝐘))
Forward pass of CNN: f(𝐘) = ReLU(𝐛_2 + 𝐖_2^T ReLU(𝐛_1 + 𝐖_1^T𝐘))
The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same thing!!!
57
Consider this for Solving the DLP
DLP (supervised): min_{{𝐃_i}_{i=1}^K, 𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, 𝐃𝐂𝐏*(𝐘_j, {𝐃_i})). The thresholds for the DCP should also be learned.
CNN language ↔ SparseLand language: forward pass f(⋅) ↔ layered soft nonnegative THR (𝐃𝐂𝐏*).
58
Consider this for Solving the DLP
DLP (supervised): min_{{𝐃_i}_{i=1}^K, 𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, 𝐃𝐂𝐏*(𝐘_j, {𝐃_i})), where the deepest representation is estimated via the layered THR algorithm and the thresholds should also be learned.
CNN language ↔ SparseLand language: forward pass f(⋅) ↔ layered soft nonnegative THR (𝐃𝐂𝐏*).
59
Consider this for Solving the DLP
DLP (supervised): min_{{𝐃_i}_{i=1}^K, 𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, 𝐃𝐂𝐏*(𝐘_j, {𝐃_i})), where the deepest representation is estimated via the layered THR algorithm and the thresholds should also be learned.
CNN training: min_{{𝐖_i},{𝐛_i},𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, f(𝐘_j; {𝐖_i}, {𝐛_i})).
CNN language ↔ SparseLand language: forward pass f(⋅) ↔ layered soft nonnegative THR (𝐃𝐂𝐏*).
60
Consider this for Solving the DLP
DLP (supervised): min_{{𝐃_i}_{i=1}^K, 𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, 𝐃𝐂𝐏*(𝐘_j, {𝐃_i})), where the deepest representation is estimated via the layered THR algorithm and the thresholds should also be learned.
CNN training: min_{{𝐖_i},{𝐛_i},𝐔} Σ_j ℓ(h(𝐘_j), 𝐔, f(𝐘_j; {𝐖_i}, {𝐛_i})).
The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm.
*Recall that for the ML-CSC there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN.
CNN language ↔ SparseLand language: forward pass f(⋅) ↔ layered THR pursuit (𝐃𝐂𝐏*).
61
Theoretical Questions
Theoretical questions: given a noisy signal 𝐘 = 𝐗 + 𝐄, where 𝐗 = 𝐃_1𝚪_1, 𝚪_1 = 𝐃_2𝚪_2, …, 𝚪_{K−1} = 𝐃_K𝚪_K, and each 𝚪_i is ℓ_{0,∞}-sparse, can the layered thresholding (the forward pass), or another algorithm, solve the (𝐃𝐂𝐏_λ^ℰ) and recover the representations {𝚪_i}_{i=1}^K? Our dream is to find the true 𝚪 that is then used in the task of classification.
62
Uniqueness of (𝐃𝐂𝐏_λ). (𝐃𝐂𝐏_λ): find a set of representations satisfying 𝐗 = 𝐃_1𝚪_1, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; 𝚪_1 = 𝐃_2𝚪_2, ‖𝚪_2‖_{0,∞}^s ≤ λ_2; … ; 𝚪_{K−1} = 𝐃_K𝚪_K, ‖𝚪_K‖_{0,∞}^s ≤ λ_K. Is this set unique?
Theorem: If a set of solutions {𝚪_i}_{i=1}^K is found for (𝐃𝐂𝐏_λ) such that ‖𝚪_i‖_{0,∞}^s = λ_i < ½(1 + 1/μ(𝐃_i)), then it is necessarily the unique solution to this problem. [Donoho & Elad '03] Mutual coherence: μ(𝐃) = max_{i≠j} |d_i^T d_j|.
The feature maps CNN aims to recover are unique.
63
Stability of (𝐃𝐂𝐏_λ^ℰ). (𝐃𝐂𝐏_λ^ℰ): find a set of representations satisfying ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; ‖𝚪_1 − 𝐃_2𝚪_2‖_2 ≤ ℰ_1, ‖𝚪_2‖_{0,∞}^s ≤ λ_2; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K. Is this set stable?
Suppose that we manage to solve the (𝐃𝐂𝐏_λ^ℰ) and find a feasible set of representations {𝚪̂_i}_{i=1}^K satisfying all the conditions. The question we pose is: how close is 𝚪̂_i to 𝚪_i?
64
The problem CNN aims to solve is stable under certain conditions
Stability of (𝐃𝐂𝐏_λ^ℰ). Theorem: If the true representations {𝚪_i}_{i=1}^K satisfy ‖𝚪_i‖_{0,∞}^s ≤ λ_i < ½(1 + 1/μ(𝐃_i)), then any set of solutions {𝚪̂_i}_{i=1}^K obtained by solving this problem (somehow) with ℰ_0 = ‖𝐄‖_2 and ℰ_i = 0 (i ≥ 1) must obey
‖𝚪̂_i − 𝚪_i‖_2^2 ≤ 4‖𝐄‖_2^2 ∏_{j=1}^{i} 1/(1 − (2λ_j − 1)μ(𝐃_j))
The problem CNN aims to solve is stable under certain conditions. Observe the annoying effect of error magnification as we dive into the model.
65
Local Noise Assumption
Our analysis so far relied on the local sparsity of the underlying solution 𝚪, which was enforced through the ℓ_{0,∞} norm. In what follows, we present stability guarantees that also depend on the local energy of the noise vector 𝐄. This is enforced via the ℓ_{2,∞} norm, defined as ‖𝐄‖_{2,∞}^p = max_i ‖𝐑_i𝐄‖_2.
66
Stability of Layered-THR
Theorem: If ‖𝚪_i‖_{0,∞}^s < ½(1 + (1/μ(𝐃_i))·|𝚪_i^min|/|𝚪_i^max|) − (1/μ(𝐃_i))·ε_L^{i−1}/|𝚪_i^max|, then the layered hard THR (with properly chosen thresholds) finds the correct supports and ‖𝚪̂_i^LT − 𝚪_i‖_{2,∞}^p ≤ ε_L^i, where ε_L^0 = ‖𝐄‖_{2,∞}^p and ε_L^i = √(‖𝚪_i‖_{0,∞}^p)·(ε_L^{i−1} + μ(𝐃_i)(‖𝚪_i‖_{0,∞}^s − 1)|𝚪_i^max|).
Problems: dependence on the contrast |𝚪_i^min|/|𝚪_i^max|, error growth across the layers, and an error even if there is no noise.
Still, the stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded.
67
Layered Basis Pursuit (Noisy)
(𝐃𝐂𝐏_λ^ℰ): find {𝚪_j}_{j=1}^K s.t. ‖𝐘 − 𝐃_1𝚪_1‖_2 ≤ ℰ_0, ‖𝚪_1‖_{0,∞}^s ≤ λ_1; … ; ‖𝚪_{K−1} − 𝐃_K𝚪_K‖_2 ≤ ℰ_{K−1}, ‖𝚪_K‖_{0,∞}^s ≤ λ_K.
We chose the thresholding algorithm due to its simplicity, but we do know that there are better pursuit methods – how about using them? Let us use the Basis Pursuit instead:
𝚪̂_1^LBP = argmin_𝚪 ‖𝐘 − 𝐃_1𝚪‖_2^2 + ξ_1‖𝚪‖_1
𝚪̂_2^LBP = argmin_𝚪 ‖𝚪̂_1^LBP − 𝐃_2𝚪‖_2^2 + ξ_2‖𝚪‖_1
This is related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus '10].
68
Stability of Layered BP
Theorem: Assuming that ‖𝚪_i‖_{0,∞}^s ≤ ⅓(1 + 1/μ(𝐃_i)), then for correctly chosen {ξ_i}_{i=1}^K we are guaranteed that: the support of 𝚪̂_i^LBP is contained in that of 𝚪_i; the error is bounded, ‖𝚪̂_i^LBP − 𝚪_i‖_{2,∞}^p ≤ ε_L^i, where ε_L^i = 7.5^i ‖𝐄‖_{2,∞}^p ∏_{j=1}^{i} √(‖𝚪_j‖_{0,∞}^p); and every entry in 𝚪_i greater than ε_L^i/√(‖𝚪_i‖_{0,∞}^p) will be found.
Revisiting the earlier problems: the dependence on the contrast and the error in the noiseless case are resolved, while the error growth across the layers remains.
69
Layered Iterative Thresholding
Layered BP: 𝚪̂_j^LBP = argmin_{𝚪_j} ‖𝚪̂_{j−1}^LBP − 𝐃_j𝚪_j‖_2^2 + ξ_j‖𝚪_j‖_1, for j = 1, …, K.
Layered Iterative Soft-Thresholding, for each layer j and iteration t:
𝚪_j^t = 𝒮_{ξ_j/c_j}(𝚪_j^{t−1} + (1/c_j)𝐃_j^T(𝚪̂_{j−1} − 𝐃_j𝚪_j^{t−1})), with c_j > 0.5·λ_max(𝐃_j^T𝐃_j).
Note that our suggestion implies that groups of layers share the same dictionaries – this can be seen as a recurrent neural network [Gregor & LeCun '10].
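A minimal sketch of the layered ISTA scheme above, with dense toy dictionaries; the choice of the step constant c_j as the largest eigenvalue of 𝐃_j^T𝐃_j and the fixed iteration count are simplifying assumptions.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def layered_ista(Y, dictionaries, xis, n_iters=100):
    """Layer by layer, approximately solve
        Gamma_j = argmin ||Gamma_{j-1} - D_j Gamma||_2^2 + xi_j ||Gamma||_1
    via ISTA steps  Gamma <- S_{xi/c}( Gamma + (1/c) D_j^T (Gamma_{j-1} - D_j Gamma) )."""
    prev = Y
    estimates = []
    for D, xi in zip(dictionaries, xis):
        c = np.linalg.eigvalsh(D.T @ D).max()      # step constant
        Gamma = np.zeros(D.shape[1])
        for _ in range(n_iters):
            Gamma = soft(Gamma + (1.0 / c) * D.T @ (prev - D @ Gamma), xi / c)
        estimates.append(Gamma)
        prev = Gamma                                # feed the next layer
    return estimates
```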
70
This Talk. The underlying idea: modeling the data source in order to theoretically analyze algorithms' performance.
We described the limitations of independent patch-processing (local sparsity) and ways to overcome some of them.
We showed that a denoiser can be used to solve other inverse problems (RED).
We presented a theoretical study of Convolutional Sparse Coding and a practical algorithm that works locally.
We mentioned several interesting connections between CSC and Convolutional Neural Networks, and this led us to propose and analyze a multi-layer extension of CSC (Multi-Layer Convolutional Sparse Coding), shown to be tightly connected to CNN.
The ML-CSC was shown to enable a theoretical study of CNN, extending the classical sparse theory to a multi-layer setting and yielding a novel interpretation, theoretical understanding, and new insights.
71
Questions?
72
RED: G(𝐗) = ½𝐗^T(𝐗 − f(𝐗)). Let us derive the gradient of this expression.
Condition I (differentiability): ∇_𝐗 G(𝐗) = 𝐗 − ½f(𝐗) − ½∇_𝐗 f(𝐗)𝐗.
A closer look at the directional derivative: ∇_𝐗 f(𝐗)𝐗 = lim_{ε→0} (f(𝐗 + ε𝐗) − f(𝐗))/ε = lim_{ε→0} (f((1+ε)𝐗) − f(𝐗))/ε = ((1+ε)f(𝐗) − f(𝐗))/ε = f(𝐗), using Condition II (local homogeneity). Hence ∇_𝐗 G(𝐗) = 𝐗 − f(𝐗).
Let us dare and compute the Hessian: ∇_𝐗(∇_𝐗 G(𝐗)) = 𝐈 − ∇_𝐗 f(𝐗) ≽ 0 by Condition III (passivity), max λ(∇_𝐗 f(𝐗)) ≤ 1. This regularizer is convex!
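A small numerical sanity check of the identity ∇G(𝐗) = 𝐗 − f(𝐗): here f(𝐗) = 𝐖𝐗 with a symmetric, passive matrix 𝐖 stands in for a pseudo-linear denoiser (this choice and the problem size are assumptions), and the analytic gradient is compared against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
W = B @ B.T
W /= (np.linalg.eigvalsh(W).max() + 1e-3)   # symmetric, spectral radius < 1 (passive)

f = lambda X: W @ X                          # stand-in denoiser (linear -> homogeneous)
G = lambda X: 0.5 * X @ (X - f(X))           # RED regularizer value

X = rng.standard_normal(n)
grad_analytic = X - f(X)

# central finite-difference gradient of G
eps = 1e-6
grad_fd = np.array([(G(X + eps * e) - G(X - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.max(np.abs(grad_analytic - grad_fd)))   # tiny: the formula holds
```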