A Quest for a Universal Model for Signals: From Sparsity to ConvNets

Presentation transcript:

A Quest for a Universal Model for Signals: From Sparsity to ConvNets. Yaniv Romano, The Electrical Engineering Department, Technion – Israel Institute of Technology. Advisor: Prof. Michael Elad. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement ERC-SPARSE-320649.

Modeling Data. Data sources are structured: still images, videos, voice signals, seismic data, matrix data, text documents, stock-market data, biological signals, radar imaging, medical imaging, social networks, traffic info, 3D objects. This structure, when identified, is the engine behind our ability to process this data.

Modeling Data. Almost every task in data processing requires a model – true for denoising, deblurring, super-resolution, compression, recognition, prediction, and more… We will put a special emphasis on image denoising.

The Underlying Idea. A model is a mathematical description of the underlying signal of interest, describing our beliefs regarding its structure. Generative modeling of data sources enables systematic algorithm development, and a theoretical analysis of the algorithms' performance.

Sparse-Land. We will concentrate on sparse and redundant representations. It is a universal model, adaptive (via learning) to the data source at hand, with an intersection with machine learning. How can we treat high-dimensional signals? We face the curse of dimensionality, as we need to learn a huge number of parameters.

Part I: Local Modeling. The traditional approach to treating images is to reduce the dimensions by working locally on small image patches. In the first part of this talk we will study the limitations of this local approach and suggest different ways to overcome them. As a result, we push denoising algorithms to new heights. Improving the model → improving the performance.

Part II: Global Sparse Models. We proceed by revisiting the traditional local modeling approach for treating high-dimensional signals, and ask: why can we use a local prior to solve a global problem? This will take us to Convolutional Sparse Coding (CSC), a global model that is based on a local sparsity prior. The convolutional structure is the key to bypassing the curse of dimensionality.

Part III: Going Deeper. We then propose a multi-layered extension of the CSC model, called ML-CSC, which is tightly connected to Convolutional Neural Networks (CNN) – the meeting point of Sparse-Land (sparse representation theory) and CNN. The ML-CSC model enables a theoretical analysis of the performance of deep learning algorithms (e.g., the forward pass). Our analysis holds true for fully connected networks as well.

Overview of the contributions, organized in three threads – denoising via local modeling (improving local modeling → improved denoising performance), global sparse modeling (global sparse priors, and the connection to deep learning), and inverse problems (leveraging denoising algorithms to treat other inverse problems, and better treatment of inverse problems by improving local priors):
Boosting of Image Denoising Algorithms [SIAM IS '15]
Convolutional Neural Networks Analyzed via Convolutional Sparse Coding [JMLR '16]
The Little Engine that Could: Regularization by Denoising (RED) [SIAM IS '17]
Patch-Disagreement as a Way to Improve K-SVD Denoising [IEEE-ICASSP '15]
Turning a Denoiser into a Super-Resolver using Plug and Play Priors [IEEE-ICIP '16]
Convolutional Dictionary Learning via Local Processing [ICCV '17]
Gaussian Mixture Diffusion [IEEE-ICSEE '16]
Con-Patch: When a Patch Meets its Context [IEEE-TIP '16]
On the Global-Local Dichotomy in Sparsity Modeling [Compressed-Sensing '17]
Improving K-SVD Denoising by Post-Processing its Method-Noise [IEEE-ICIP '13]
RAISR: Rapid and Accurate Image Super Resolution [IEEE-TCI '16]
Example-Based Image Synthesis via Randomized Patch-Matching [IEEE-TIP '17]
Single Image Interpolation via Adaptive Non-Local Sparsity-Based Modeling [IEEE-TIP '14]

Part I: Image Denoising via Local Modeling

Image Denoising. The noisy image $\mathbf{Y}$ is the original image $\mathbf{X}$ contaminated by white Gaussian noise $\mathbf{E}$. Leading image denoising methods are built upon powerful patch-based local models:

The Sparse-Land Model. Assumes that every patch is a linear combination of a few columns, called atoms, taken from an $n \times m$ matrix $\boldsymbol{\Omega}$ (with $m > n$) called a dictionary: $\mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$. The operator $\mathbf{R}_i$ extracts the $i$-th $n$-dimensional patch from $\mathbf{X} \in \mathbb{R}^N$. Sparse coding: $(\mathbf{P}_0):\ \min_{\boldsymbol{\gamma}_i} \|\boldsymbol{\gamma}_i\|_0 \ \text{s.t.}\ \mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$.

Patch Denoising. Given a noisy patch $\mathbf{R}_i\mathbf{Y}$, solve $(\mathbf{P}_0^\epsilon):\ \hat{\boldsymbol{\gamma}}_i = \arg\min_{\boldsymbol{\gamma}_i} \|\boldsymbol{\gamma}_i\|_0 \ \text{s.t.}\ \|\mathbf{R}_i\mathbf{Y} - \boldsymbol{\Omega}\boldsymbol{\gamma}_i\|_2 \le \epsilon$; the clean patch is $\boldsymbol{\Omega}\hat{\boldsymbol{\gamma}}_i$. $(\mathbf{P}_0)$ and $(\mathbf{P}_0^\epsilon)$ are hard to solve, so we use greedy methods such as Orthogonal Matching Pursuit (OMP) or Thresholding, or convex relaxations such as Basis Pursuit (BP): $(\mathbf{P}_1^\epsilon):\ \min_{\boldsymbol{\gamma}_i} \|\boldsymbol{\gamma}_i\|_1 + \xi\|\mathbf{R}_i\mathbf{Y} - \boldsymbol{\Omega}\boldsymbol{\gamma}_i\|_2^2$.
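
To make the pursuit step concrete, here is a minimal sketch (not the code used in this work) of denoising a single patch with a greedy OMP solver; the random dictionary, patch size, noise level, and the error threshold are illustrative assumptions.

```python
import numpy as np

def omp(Omega, y, eps):
    """Greedy Orthogonal Matching Pursuit for
    min ||gamma||_0  s.t.  ||y - Omega @ gamma||_2 <= eps."""
    n, m = Omega.shape
    gamma = np.zeros(m)
    support, coeffs = [], np.zeros(0)
    residual = y.astype(float).copy()
    while np.linalg.norm(residual) > eps and len(support) < n:
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(Omega.T @ residual)))
        if j in support:
            break
        support.append(j)
        # least-squares fit of the signal on the current support
        coeffs, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)
        residual = y - Omega[:, support] @ coeffs
    gamma[support] = coeffs
    return gamma

# toy example: a noisy patch that truly is sparse over a random dictionary
rng = np.random.default_rng(0)
n, m = 64, 128                                   # 8x8 patches, redundant dictionary
Omega = rng.standard_normal((n, m))
Omega /= np.linalg.norm(Omega, axis=0)           # unit-norm atoms
gamma_true = np.zeros(m)
gamma_true[rng.choice(m, size=4, replace=False)] = 1.0
noisy_patch = Omega @ gamma_true + 0.05 * rng.standard_normal(n)

gamma_hat = omp(Omega, noisy_patch, eps=0.05 * np.sqrt(n))
clean_patch = Omega @ gamma_hat                  # the denoised patch
```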

Recall K-SVD Denoising [Elad & Aharon '06]. Starting from an initial dictionary, the algorithm alternates between denoising each patch of the noisy image using OMP and updating the dictionary using K-SVD, and finally reconstructs the image by averaging the overlapping denoised patches. Despite its simplicity, this is a very well-performing algorithm. We refer to this framework as “patch averaging”.
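
The sketch below is a simplification, not the K-SVD code: the dictionary is fixed rather than updated, and a plain thresholding pursuit with an assumed sparsity k stands in for OMP. It shows the patch-averaging skeleton: sparse-code every overlapping patch, put the cleaned patches back, and average the overlaps.

```python
import numpy as np

def thr_pursuit(Omega, y, k):
    """Simple thresholding pursuit: keep the k largest correlations,
    then least-squares fit on that support."""
    corr = Omega.T @ y
    support = np.argsort(np.abs(corr))[-k:]
    coeffs, *_ = np.linalg.lstsq(Omega[:, support], y, rcond=None)
    gamma = np.zeros(Omega.shape[1])
    gamma[support] = coeffs
    return gamma

def denoise_patch_averaging(Y, Omega, k=4, patch=8):
    """Denoise by sparse-coding every overlapping patch over the local
    dictionary Omega and averaging the overlapping reconstructions."""
    H, W = Y.shape
    out = np.zeros_like(Y, dtype=float)
    weight = np.zeros_like(Y, dtype=float)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            y_ij = Y[i:i+patch, j:j+patch].ravel()
            dc = y_ij.mean()                      # remove the patch DC
            gamma = thr_pursuit(Omega, y_ij - dc, k)
            x_ij = Omega @ gamma + dc
            out[i:i+patch, j:j+patch] += x_ij.reshape(patch, patch)
            weight[i:i+patch, j:j+patch] += 1.0
    return out / weight

# usage: Omega is an n x m local dictionary with n = patch * patch
rng = np.random.default_rng(0)
Omega = rng.standard_normal((64, 128)); Omega /= np.linalg.norm(Omega, axis=0)
Y = rng.standard_normal((32, 32))                 # stand-in for a noisy image
X_hat = denoise_patch_averaging(Y, Omega)
```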

What is Missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of… The local-global gap: efficient independent local patch processing vs. the global need to model the entire image.

What is Missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of… The local-global gap: efficient independent local patch processing vs. the global need to model the entire image. Our contributions: whitening the residual image [Romano & Elad '13]; exploiting self-similarities [Romano, Protter & Elad '14]; leveraging the context of the patch [Romano & Elad '16], [Romano, Isidoro & Milanfar '16]; enforcing the local model on the final patches (EPLL) [Ren, Romano & Elad '17]; SOS-Boosting [Romano & Elad '15].

SOS-Boosting [Romano & Elad '15]. Given any denoiser, how can we improve its performance? Strengthen – Operate – Subtract: strengthen the signal by adding the previous result back to the noisy image, operate the denoiser on the strengthened image, and subtract the previous result from the outcome. SOS formulation: $\hat{\mathbf{X}}^{k+1} = f(\mathbf{Y} + \rho\hat{\mathbf{X}}^k) - \rho\hat{\mathbf{X}}^k$. Intuitively, an improvement is expected since $\mathrm{SNR}(\mathbf{y} + \hat{\mathbf{x}}) > \mathrm{SNR}(\mathbf{y})$.

SOS-Boosting [Romano & Elad '15]. We study the convergence of the SOS iteration $\hat{\mathbf{X}}^{k+1} = f(\mathbf{Y} + \rho\hat{\mathbf{X}}^k) - \rho\hat{\mathbf{X}}^k$ using only the linear part of the denoiser, $\hat{\mathbf{x}} = f(\mathbf{y}) = \mathbf{W}\mathbf{y}$. This holds for K-SVD, EPLL, NLM, BM3D and others, where the overall processing is divided into a non-linear stage of decisions, followed by a linear filtering. Relation to graph theory: the SOS minimizes a Laplacian regularization functional, $\mathbf{X}^* = \arg\min_{\mathbf{X}} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_2^2 + \rho\,\mathbf{X}^T(\mathbf{X} - \mathbf{W}\mathbf{X})$. Relation to game theory: the SOS encourages overlapping patches to reach a consensus.
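
A minimal sketch of the SOS iteration above, with a Gaussian blur standing in for a real denoiser f just so the loop runs; ρ, the iteration count, and the toy image are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sos_boost(Y, denoiser, rho=1.0, iters=5):
    """SOS boosting: Strengthen (add the previous estimate to the input),
    Operate (apply the denoiser), Subtract (remove the added part)."""
    X = denoiser(Y)                       # plain denoising as initialization
    for _ in range(iters):
        X = denoiser(Y + rho * X) - rho * X
    return X

# usage with a stand-in denoiser (any f(Y) -> X_hat would do)
rng = np.random.default_rng(0)
Y = rng.standard_normal((64, 64))
f = lambda Z: gaussian_filter(Z, sigma=1.0)
X_sos = sos_boost(Y, f, rho=1.0, iters=5)
```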

What is Missing? Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of… The local-global gap: efficient independent local patch processing vs. the global need to model the entire image. Missing a theoretical backbone! Why can we use a local prior to solve a global problem?

Part II: Convolutional Sparse Coding. Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding – Vardan Papyan, Jeremias Sulam and Michael Elad. Convolutional Dictionary Learning via Local Processing – Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad. Related work: [LeCun, Bottou, Bengio and Haffner '98] [Lewicki & Sejnowski '99] [Hashimoto & Kurata '00] [Mørup, Schmidt & Hansen '08] [Zeiler, Krishnan, Taylor & Fergus '10] [Jarrett, Kavukcuoglu, Gregor, LeCun '11] [Heide, Heidrich & Wetzstein '15] [Gu, Zuo, Xie, Meng, Feng & Zhang '15] …

Intuitively… [Figure: the signal $\mathbf{X}$ written as a sum of shifted and weighted copies of the first filter and of the second filter.]

Convolutional Sparse Coding (CSC). [Figure: the global signal written as a sum of convolutions, each local filter convolved with its own sparse feature map.]

Convolutional Sparse Representation. Globally, $\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}$, where $\mathbf{D}$ is a banded (convolutional) dictionary built from shifts of a local dictionary $\mathbf{D}_L$ of size $n \times m$. Each patch $\mathbf{R}_i\mathbf{X}$ is determined by a stripe vector $\boldsymbol{\gamma}_i$ of length $(2n-1)m$ multiplying a stripe-dictionary $\boldsymbol{\Omega}$ of size $n \times (2n-1)m$: $\mathbf{R}_i\mathbf{X} = \boldsymbol{\Omega}\boldsymbol{\gamma}_i$. Adjacent representations overlap, as they skip by $m$ items as we sweep through the patches of $\mathbf{X}$.
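
To make the construction concrete, here is a small sketch (using circular shifts, an assumption made for simplicity) that builds the banded global dictionary D from a local dictionary D_L with m filters of length n, so that a global signal can be synthesized as X = DΓ with Γ of length Nm.

```python
import numpy as np

def build_conv_dictionary(D_L, N):
    """Build the banded global dictionary D (N x N*m) whose columns are
    all circular shifts of the m local filters in D_L (n x m)."""
    n, m = D_L.shape
    D = np.zeros((N, N * m))
    for shift in range(N):
        for k in range(m):
            atom = np.zeros(N)
            idx = (shift + np.arange(n)) % N    # circular placement of the filter
            atom[idx] = D_L[:, k]
            D[:, shift * m + k] = atom
    return D

# toy check: a signal synthesized from a sparse Gamma
rng = np.random.default_rng(0)
n, m, N = 5, 2, 20
D_L = rng.standard_normal((n, m)); D_L /= np.linalg.norm(D_L, axis=0)
D = build_conv_dictionary(D_L, N)
Gamma = np.zeros(N * m); Gamma[[3, 17, 30]] = rng.standard_normal(3)
X = D @ Gamma                                    # global signal X = D * Gamma
```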

CSC Relation to Our Story. A clear global model: every patch has a sparse representation w.r.t. the same local dictionary $\boldsymbol{\Omega}$, just as we have assumed, with no notion of disagreement on the patch overlaps. Related to the current common practice of patch averaging, $\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma} = \frac{1}{n}\sum_i \mathbf{R}_i^T\boldsymbol{\Omega}\boldsymbol{\gamma}_i$ (where $\mathbf{R}_i^T$ puts the patch $\boldsymbol{\Omega}\boldsymbol{\gamma}_i$ back in the $i$-th location of the global vector). What about the pursuit? “Patch averaging” performs independent sparse coding for each patch, whereas CSC should seek all the representations together – is there a bridge between the two? We'll come back to this later… What about the theory? Until recently, little was known regarding the theoretical aspects of CSC.

Classical Sparse Theory (Noiseless). $(\mathbf{P}_0):\ \min_{\boldsymbol{\Gamma}} \|\boldsymbol{\Gamma}\|_0 \ \text{s.t.}\ \mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}$. Mutual coherence: $\mu(\mathbf{D}) = \max_{i \ne j} |\mathbf{d}_i^T\mathbf{d}_j|$. Theorem [Donoho & Elad '03], [Tropp '04]: the OMP and BP are guaranteed to recover the true sparse code assuming that $\|\boldsymbol{\Gamma}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D})}\right)$. Assuming that $m = 2$ and $n = 64$, we have that $\mu(\mathbf{D}) \ge 0.063$ [Welch '74]. As a result, uniqueness and success of pursuits are guaranteed only as long as $\|\boldsymbol{\Gamma}\|_0 < 8$ – fewer than 8 non-zeros GLOBALLY are allowed! This is a very pessimistic result: classic Sparse-Land theory cannot provide good explanations for the CSC model.
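
A tiny sketch of the two quantities used above: the mutual coherence μ(D) and the classical bound (1/2)(1 + 1/μ(D)) on the allowed number of non-zeros; the random dictionary is only an illustration, not the convolutional one discussed here.

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j| for a dictionary with unit-norm columns."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = np.abs(Dn.T @ Dn)                # Gram matrix of absolute correlations
    np.fill_diagonal(G, 0.0)             # ignore the diagonal (i == j)
    return G.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))       # illustrative random dictionary
mu = mutual_coherence(D)
bound = 0.5 * (1.0 + 1.0 / mu)           # classical bound on ||Gamma||_0
print(f"mu = {mu:.3f}, sparsity bound < {bound:.1f}")
```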

Moving to Local Sparsity: Stripes. Define the stripe norm $\|\boldsymbol{\Gamma}\|_{0,\infty}^s = \max_i \|\boldsymbol{\gamma}_i\|_0$, and the problem $(\mathbf{P}_{0,\infty}):\ \min_{\boldsymbol{\Gamma}} \|\boldsymbol{\Gamma}\|_{0,\infty}^s \ \text{s.t.}\ \mathbf{X} = \mathbf{D}\boldsymbol{\Gamma}$. A low $\|\boldsymbol{\Gamma}\|_{0,\infty}^s$ means all the stripes $\boldsymbol{\gamma}_i$ are sparse, i.e., every patch has a sparse representation over $\boldsymbol{\Omega}$. If $\boldsymbol{\Gamma}$ is locally sparse [Papyan, Sulam & Elad '16], the solution of $(\mathbf{P}_{0,\infty})$ is necessarily unique, and the global OMP and BP are guaranteed to recover it. This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal.
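
A small sketch of the stripe-norm computation, assuming a 1-D convolutional representation whose coefficients are ordered shift-by-shift (m per shift) and indexed circularly; these ordering choices are assumptions of the example.

```python
import numpy as np

def l0_inf_stripe(Gamma, N, n, m):
    """||Gamma||_{0,infty}^s = max_i ||gamma_i||_0, where gamma_i is the
    stripe of (2n-1)*m coefficients that can affect the i-th patch."""
    G = Gamma.reshape(N, m)               # coefficient map: N shifts x m filters
    worst = 0
    for i in range(N):
        rows = np.arange(i - n + 1, i + n) % N   # the 2n-1 shifts touching patch i
        stripe = G[rows, :]
        worst = max(worst, int(np.count_nonzero(stripe)))
    return worst

# toy example
rng = np.random.default_rng(0)
N, n, m = 20, 5, 2
Gamma = np.zeros(N * m); Gamma[[3, 17, 30]] = 1.0
print(l0_inf_stripe(Gamma, N, n, m))      # small value => locally sparse
```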

From Ideal to Noisy Signals. In practice, $\mathbf{Y} = \mathbf{D}\boldsymbol{\Gamma} + \mathbf{E}$, where $\mathbf{E}$ is due to noise or model deviations, and we solve $(\mathbf{P}_{0,\infty}^\epsilon):\ \min_{\boldsymbol{\Gamma}} \|\boldsymbol{\Gamma}\|_{0,\infty}^s \ \text{s.t.}\ \|\mathbf{Y} - \mathbf{D}\boldsymbol{\Gamma}\|_2 \le \epsilon$. How close is the estimate $\hat{\boldsymbol{\Gamma}}$ to $\boldsymbol{\Gamma}$? If $\boldsymbol{\Gamma}$ is locally sparse and the noise is bounded [Papyan, Sulam & Elad '16], the solution of $(\mathbf{P}_{0,\infty}^\epsilon)$ is stable, and the solution obtained via global OMP/BP is stable: the true and estimated representations are close.

Global Pursuit via Local Processing. $(\mathbf{P}_1^\epsilon):\ \hat{\boldsymbol{\Gamma}}_{\mathrm{BP}} = \arg\min_{\boldsymbol{\Gamma}} \frac{1}{2}\|\mathbf{Y} - \mathbf{D}\boldsymbol{\Gamma}\|_2^2 + \xi\|\boldsymbol{\Gamma}\|_1$. Key decomposition: $\mathbf{X} = \mathbf{D}\boldsymbol{\Gamma} = \sum_i \mathbf{R}_i^T\mathbf{D}_L\boldsymbol{\alpha}_i = \sum_i \mathbf{R}_i^T\mathbf{s}_i$, where $\mathbf{D}_L$ is the $n \times m$ local dictionary, $\boldsymbol{\alpha}_i$ are the local sparse codes, and the $\mathbf{s}_i = \mathbf{D}_L\boldsymbol{\alpha}_i$ are slices, not patches.

Global Pursuit via Local Processing (2). Using the variable splitting $\mathbf{s}_i = \mathbf{D}_L\boldsymbol{\alpha}_i$, $(\mathbf{P}_1^\epsilon)$ becomes $\min_{\{\mathbf{s}_i\},\{\boldsymbol{\alpha}_i\}} \frac{1}{2}\|\mathbf{Y} - \sum_i \mathbf{R}_i^T\mathbf{s}_i\|_2^2 + \sum_i \xi\|\boldsymbol{\alpha}_i\|_1 \ \text{s.t.}\ \mathbf{s}_i = \mathbf{D}_L\boldsymbol{\alpha}_i$. These two problems are equivalent, and convex w.r.t. their variables. The new formulation targets the local slices and their sparse representations, and can be solved via ADMM, replacing the constraint with a penalty.

Global Pursuit via Local Processing (2). The ADMM formulation is $\min_{\{\mathbf{s}_i\},\{\boldsymbol{\alpha}_i\}} \frac{1}{2}\|\mathbf{Y} - \sum_i \mathbf{R}_i^T\mathbf{s}_i\|_2^2 + \sum_i \left[\lambda\|\boldsymbol{\alpha}_i\|_1 + \frac{\rho}{2}\|\mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i + \mathbf{u}_i\|_2^2\right]$. Sparse update: $\min_{\{\boldsymbol{\alpha}_i\}} \sum_i \lambda\|\boldsymbol{\alpha}_i\|_1 + \frac{\rho}{2}\|\mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i + \mathbf{u}_i\|_2^2$ – separable, local LARS problems. Slice update: $\min_{\{\mathbf{s}_i\}} \frac{1}{2}\|\mathbf{Y} - \sum_i \mathbf{R}_i^T\mathbf{s}_i\|_2^2 + \sum_i \frac{\rho}{2}\|\mathbf{s}_i - \mathbf{D}_L\boldsymbol{\alpha}_i + \mathbf{u}_i\|_2^2$ – simple L2-based aggregation and averaging. Comment: one iteration of this procedure amounts to… the very same patch-averaging algorithm we started with. This algorithm operates locally while guaranteeing to solve the global problem.
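
Here is a compact sketch of the slice-based ADMM above for a 1-D signal with circular boundaries; it is an illustrative reading of the scheme, not the authors' implementation, and the local Lasso is solved with a few ISTA steps instead of LARS. Function names, λ, ρ, and the toy dimensions are assumptions of the example.

```python
import numpy as np

def extract(X, i, n):
    """R_i X: the i-th length-n patch of X (circular boundary)."""
    N = X.shape[0]
    return X[(i + np.arange(n)) % N]

def put_back(p, i, n, N):
    """R_i^T p: place a length-n patch at location i of a length-N vector."""
    out = np.zeros(N)
    out[(i + np.arange(n)) % N] = p
    return out

def local_lasso(D_L, v, lam, iters=50):
    """A few ISTA steps for min_a 0.5||v - D_L a||^2 + lam ||a||_1
    (a stand-in for the LARS solver mentioned on the slide)."""
    L = np.linalg.norm(D_L, 2) ** 2
    a = np.zeros(D_L.shape[1])
    for _ in range(iters):
        g = a + D_L.T @ (v - D_L @ a) / L
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a

def csc_admm(Y, D_L, lam=0.1, rho=1.0, iters=20):
    """Slice-based ADMM: alternate between a closed-form slice update
    (L2 aggregation and averaging) and independent local Lasso problems."""
    N = Y.shape[0]
    n, m = D_L.shape
    alpha = np.zeros((N, m)); S = np.zeros((N, n)); U = np.zeros((N, n))
    for _ in range(iters):
        # slice update: aggregate the targets, average against Y, redistribute
        Z = np.zeros(N)
        for i in range(N):
            Z += put_back(D_L @ alpha[i] - U[i], i, n, N)
        X_hat = (rho * Z + n * Y) / (rho + n)
        for i in range(N):
            S[i] = (D_L @ alpha[i] - U[i]) + extract(Y - X_hat, i, n) / rho
        # sparse update: separable local Lasso problems
        for i in range(N):
            alpha[i] = local_lasso(D_L, S[i] + U[i], lam / rho)
        # dual update
        for i in range(N):
            U[i] += S[i] - D_L @ alpha[i]
    X_rec = sum(put_back(D_L @ alpha[i], i, n, N) for i in range(N))
    return alpha, X_rec

# toy usage on a random 1-D signal
rng = np.random.default_rng(0)
D_L = rng.standard_normal((5, 3)); D_L /= np.linalg.norm(D_L, axis=0)
Y = rng.standard_normal(30)
alpha, X_rec = csc_admm(Y, D_L)
```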

Two Comments About this Scheme. (1) The proposed scheme can be used for dictionary ($\mathbf{D}_L$) learning: a slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing methods. (2) We work with slices and not patches. [Figure: patches extracted from natural images and their corresponding slices, for Heide et al., Wohlberg [2016], and ours – observe how the slices are far simpler, and contained by their corresponding patches.]

Part III: Going Deeper. Convolutional Neural Networks Analyzed via Convolutional Sparse Coding – joint work with Vardan Papyan and Michael Elad.

CSC and CNN. There is an analogy between CSC and CNN: a convolutional structure, data-driven models, and ReLU as a sparsifying operator. We propose a principled way to analyze CNN through the eyes of sparse representations. But first, a short review of CNN…

Forward Pass of CNN. $f(\mathbf{Y}, \{\mathbf{W}_i\}, \{\mathbf{b}_i\}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$, where the input is $\mathbf{Y} \in \mathbb{R}^N$, the first layer applies $\mathbf{W}_1^T \in \mathbb{R}^{Nm_1 \times N}$ with bias $\mathbf{b}_1 \in \mathbb{R}^{Nm_1}$, the second layer applies $\mathbf{W}_2^T \in \mathbb{R}^{Nm_2 \times Nm_1}$ with bias $\mathbf{b}_2 \in \mathbb{R}^{Nm_2}$, and the output feature map is $\mathbf{Z}_2 \in \mathbb{R}^{Nm_2}$.
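
A line-for-line numpy sketch of the two-layer forward pass formula above; dense random matrices stand in for the convolutional operators, and all dimensions are illustrative.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward_pass(Y, W1, b1, W2, b2):
    """f(Y) = ReLU(b2 + W2^T ReLU(b1 + W1^T Y))"""
    Z1 = relu(b1 + W1.T @ Y)
    Z2 = relu(b2 + W2.T @ Z1)
    return Z2

# illustrative dimensions: N = 100, m1 = 4, m2 = 8
rng = np.random.default_rng(0)
N, m1, m2 = 100, 4, 8
Y  = rng.standard_normal(N)
W1 = rng.standard_normal((N, N * m1));      b1 = np.zeros(N * m1)
W2 = rng.standard_normal((N * m1, N * m2)); b2 = np.zeros(N * m2)
Z2 = forward_pass(Y, W1, b1, W2, b2)        # output feature map, length N*m2
```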

Training Stage of CNN. Consider the task of classification: given a set of signals $\{\mathbf{Y}_j\}_j$ and their corresponding labels $\{h(\mathbf{Y}_j)\}_j$, the CNN learns an end-to-end mapping: $\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j), \mathbf{U}, f(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\})\big)$, where $h(\mathbf{Y}_j)$ is the true label, $\mathbf{U}$ is the classifier, and $f(\cdot)$ is the output of the last layer.

Multi-Layer CSC (ML-CSC). Back to CSC: convolutional sparsity assumes an inherent structure is present in natural signals, $\mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1$ with $\mathbf{X} \in \mathbb{R}^N$, $\mathbf{D}_1 \in \mathbb{R}^{N \times Nm_1}$ and $\boldsymbol{\Gamma}_1 \in \mathbb{R}^{Nm_1}$. We propose to impose the same structure on the representations themselves: $\boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2$, with $\mathbf{D}_2 \in \mathbb{R}^{Nm_1 \times Nm_2}$ and $\boldsymbol{\Gamma}_2 \in \mathbb{R}^{Nm_2}$.

Intuition: From Atoms to Molecules. Columns in $\mathbf{D}_1$ are convolutional atoms. Columns in $\mathbf{D}_2$ combine the atoms in $\mathbf{D}_1$, creating more complex structures: each column of the effective dictionary $\mathbf{D}_1\mathbf{D}_2$ is a superposition of atoms of $\mathbf{D}_1$. The size of the effective atoms is increased throughout the layers (receptive field).

Intuition: From Atoms to Molecules. One could chain the multiplication of all the dictionaries into one effective dictionary $\mathbf{D}_{\mathrm{eff}} = \mathbf{D}_1\mathbf{D}_2\mathbf{D}_3\cdots\mathbf{D}_K$ and then write $\mathbf{X} = \mathbf{D}_{\mathrm{eff}}\boldsymbol{\Gamma}_K$, as in Sparse-Land. However, a key property in this model is the sparsity of each intermediate representation (the feature maps).

A Small Taste: Model Training (MNIST). The MNIST dictionary: $\mathbf{D}_1$ – 32 filters of size 7×7, with stride of 2 (dense); $\mathbf{D}_2$ – 128 filters of size 5×5×32 with stride of 1, 99.09% sparse; $\mathbf{D}_3$ – 1024 filters of size 7×7×128, 99.89% sparse. [Figure: the effective atoms $\mathbf{D}_1$ (7×7), $\mathbf{D}_1\mathbf{D}_2$ (15×15), $\mathbf{D}_1\mathbf{D}_2\mathbf{D}_3$ (28×28).]

A Small Taste: Pursuit. [Figure: an input $\mathbf{Y}$ and the recovered representations $\boldsymbol{\Gamma}_0, \boldsymbol{\Gamma}_1, \boldsymbol{\Gamma}_2, \boldsymbol{\Gamma}_3$; e.g., 94.51% sparse (213 non-zeros).]

Deep Coding Problem (DCP). Noiseless pursuit: find $\{\boldsymbol{\Gamma}_j\}_{j=1}^K$ s.t. $\mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$; $\boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$; …; $\boldsymbol{\Gamma}_{K-1} = \mathbf{D}_K\boldsymbol{\Gamma}_K$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$. Noisy pursuit: find $\{\boldsymbol{\Gamma}_j\}_{j=1}^K$ s.t. $\|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2 \le \mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$; $\|\boldsymbol{\Gamma}_1 - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2 \le \mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1} - \mathbf{D}_K\boldsymbol{\Gamma}_K\|_2 \le \mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$.

Deep Learning Problem (DLP). Supervised dictionary learning: $\min_{\{\mathbf{D}_i\}_{i=1}^K,\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j), \mathbf{U}, \mathbf{DCP}^{\star}(\mathbf{Y}_j, \{\mathbf{D}_i\})\big)$, where $h(\mathbf{Y}_j)$ is the true label, $\mathbf{U}$ the classifier, and $\mathbf{DCP}^{\star}$ the deepest representation obtained from the DCP – in the spirit of task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce '12]. Unsupervised dictionary learning: find $\{\mathbf{D}_i\}_{i=1}^K$ s.t. for every training signal $j = 1,\dots,J$: $\|\mathbf{Y}_j - \mathbf{D}_1\boldsymbol{\Gamma}_1^j\|_2 \le \mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1^j\|_{0,\infty}^s \le \lambda_1$; $\|\boldsymbol{\Gamma}_1^j - \mathbf{D}_2\boldsymbol{\Gamma}_2^j\|_2 \le \mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2^j\|_{0,\infty}^s \le \lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1}^j - \mathbf{D}_K\boldsymbol{\Gamma}_K^j\|_2 \le \mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K^j\|_{0,\infty}^s \le \lambda_K$.

ML-CSC: The Simplest Pursuit. The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal $\mathbf{Y} = \mathbf{D}\boldsymbol{\Gamma} + \mathbf{E}$ (with $\boldsymbol{\Gamma}$ sparse) by computing $\hat{\boldsymbol{\Gamma}} = \mathcal{P}_\beta(\mathbf{D}^T\mathbf{Y})$. Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model. ReLU = soft nonnegative thresholding.
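
A quick numerical check of the stated equivalence: applying a bias and a ReLU is the same as soft thresholding restricted to nonnegative values, with threshold β = −b (a toy sketch with arbitrary values).

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def soft_nonneg_threshold(v, beta):
    """S^+_beta(v) = max(v - beta, 0): soft thresholding that keeps
    only nonnegative coefficients."""
    return np.maximum(v - beta, 0.0)

rng = np.random.default_rng(0)
v = rng.standard_normal(10)
b = -0.3                                     # a (negative) bias
assert np.allclose(relu(v + b), soft_nonneg_threshold(v, beta=-b))
```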

Consider this for Solving the DCP. $(\mathbf{DCP}_\lambda^{\mathcal{E}})$: find $\{\boldsymbol{\Gamma}_j\}_{j=1}^K$ s.t. $\|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2 \le \mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$; $\|\boldsymbol{\Gamma}_1 - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2 \le \mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1} - \mathbf{D}_K\boldsymbol{\Gamma}_K\|_2 \le \mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$. Layered thresholding (LT): estimate $\boldsymbol{\Gamma}_1$ via the THR algorithm, then estimate $\boldsymbol{\Gamma}_2$ from it, i.e. $\hat{\boldsymbol{\Gamma}}_2 = \mathcal{P}_{\beta_2}\big(\mathbf{D}_2^T\,\mathcal{P}_{\beta_1}(\mathbf{D}_1^T\mathbf{Y})\big)$. Forward pass of CNN: $f(\mathbf{Y}) = \mathrm{ReLU}\big(\mathbf{b}_2 + \mathbf{W}_2^T\,\mathrm{ReLU}(\mathbf{b}_1 + \mathbf{W}_1^T\mathbf{Y})\big)$. Translating between the two languages (CNN language ↔ Sparse-Land language): ReLU ↔ soft nonnegative THR; bias $\mathbf{b}$ ↔ thresholds $\beta$; weights $\mathbf{W}$ ↔ dictionary $\mathbf{D}$; forward pass $f(\cdot)$ ↔ layered soft nonnegative THR ($\mathbf{DCP}^*$). The layered (soft nonnegative) thresholding and the forward pass algorithm are the very same things!

Consider this for Solving the DLP*. DLP (supervised): $\min_{\{\mathbf{D}_i\}_{i=1}^K,\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j), \mathbf{U}, \mathbf{DCP}^{\star}(\mathbf{Y}_j, \{\mathbf{D}_i\})\big)$, where the inner $\mathbf{DCP}^{\star}$ is estimated via the layered THR algorithm, and the thresholds for the DCP should also be learned. CNN training: $\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\},\mathbf{U}} \sum_j \ell\big(h(\mathbf{Y}_j), \mathbf{U}, f(\mathbf{Y}_j, \{\mathbf{W}_i\}, \{\mathbf{b}_i\})\big)$. The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm. *Recall that for the ML-CSC there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN.

Theoretical Questions. Given a signal generated by the ML-CSC model – $\mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1$, $\boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2$, …, $\boldsymbol{\Gamma}_{K-1} = \mathbf{D}_K\boldsymbol{\Gamma}_K$, with each $\boldsymbol{\Gamma}_i$ being $\ell_{0,\infty}$-sparse – and its noisy version $\mathbf{Y}$, we solve the $\mathbf{DCP}_\lambda^{\mathcal{E}}$ via layered thresholding (the forward pass) or other algorithms. Our dream is to recover the true $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ that are then used in the task of classification.

Uniqueness of $\mathbf{DCP}_\lambda$. $(\mathbf{DCP}_\lambda)$: find a set of representations satisfying $\mathbf{X} = \mathbf{D}_1\boldsymbol{\Gamma}_1$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$; $\boldsymbol{\Gamma}_1 = \mathbf{D}_2\boldsymbol{\Gamma}_2$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$; …; $\boldsymbol{\Gamma}_{K-1} = \mathbf{D}_K\boldsymbol{\Gamma}_K$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$. Is this set unique? Theorem: if a set of solutions $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ is found for $(\mathbf{DCP}_\lambda)$ such that $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s = \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$, then these are necessarily the unique solution to this problem. (Mutual coherence [Donoho & Elad '03]: $\mu(\mathbf{D}) = \max_{i\ne j}|\mathbf{d}_i^T\mathbf{d}_j|$.) The feature maps CNN aims to recover are unique.

Stability of $\mathbf{DCP}_\lambda^{\mathcal{E}}$. $(\mathbf{DCP}_\lambda^{\mathcal{E}})$: find a set of representations satisfying $\|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2 \le \mathcal{E}_0$, $\|\boldsymbol{\Gamma}_1\|_{0,\infty}^s \le \lambda_1$; $\|\boldsymbol{\Gamma}_1 - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2 \le \mathcal{E}_1$, $\|\boldsymbol{\Gamma}_2\|_{0,\infty}^s \le \lambda_2$; …; $\|\boldsymbol{\Gamma}_{K-1} - \mathbf{D}_K\boldsymbol{\Gamma}_K\|_2 \le \mathcal{E}_{K-1}$, $\|\boldsymbol{\Gamma}_K\|_{0,\infty}^s \le \lambda_K$. Is this set stable? Suppose that we manage to solve the $\mathbf{DCP}_\lambda^{\mathcal{E}}$ and find a feasible set of representations $\{\hat{\boldsymbol{\Gamma}}_i\}_{i=1}^K$ satisfying all the conditions. The question we pose is: how close is $\hat{\boldsymbol{\Gamma}}_i$ to $\boldsymbol{\Gamma}_i$?

Stability of $\mathbf{DCP}_\lambda^{\mathcal{E}}$. Theorem: if the true representations $\{\boldsymbol{\Gamma}_i\}_{i=1}^K$ satisfy $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$, then the set of solutions $\{\hat{\boldsymbol{\Gamma}}_i\}_{i=1}^K$ obtained by solving this problem (somehow) with $\mathcal{E}_0 = \|\mathbf{E}\|_2^2$ and $\mathcal{E}_i = 0$ ($i \ge 1$) must obey $\|\hat{\boldsymbol{\Gamma}}_i - \boldsymbol{\Gamma}_i\|_2^2 \le 4\|\mathbf{E}\|_2^2 \prod_{j=1}^{i} \frac{1}{1 - (2\lambda_j - 1)\mu(\mathbf{D}_j)}$. The problem CNN aims to solve is stable under certain conditions. Observe this annoying effect of error magnification as we dive into the model.

Local Noise Assumption. Our analysis relied on the local sparsity of the underlying solution $\boldsymbol{\Gamma}$, which was enforced through the $\ell_{0,\infty}$ norm. In what follows, we present stability guarantees that will also depend on the local energy in the noise vector $\mathbf{E}$. This will be enforced via the $\ell_{2,\infty}$ norm, defined as $\|\mathbf{E}\|_{2,\infty}^p = \max_i \|\mathbf{R}_i\mathbf{E}\|_2$.

Stability of Layered-THR. Theorem: if $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(\mathbf{D}_i)}\cdot\frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|}$, then the layered hard THR (with the proper thresholds) will find the correct supports, and $\|\hat{\boldsymbol{\Gamma}}_i^{LT} - \boldsymbol{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$, where we have defined $\varepsilon_L^0 = \|\mathbf{E}\|_{2,\infty}^p$ and $\varepsilon_L^i = \|\boldsymbol{\Gamma}_i\|_{0,\infty}^p \cdot \left(\varepsilon_L^{i-1} + \mu(\mathbf{D}_i)\left(\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s - 1\right)|\Gamma_i^{\max}|\right)$. Problems: contrast, error growth, and error even if there is no noise. The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded.

Layered Basis Pursuit (Noisy). Recall the $\mathbf{DCP}_\lambda^{\mathcal{E}}$ from before. We chose the thresholding algorithm due to its simplicity, but we do know that there are better pursuit methods – how about using them? Let's use the Basis Pursuit instead: $\hat{\boldsymbol{\Gamma}}_1^{LBP} = \arg\min_{\boldsymbol{\Gamma}_1} \frac{1}{2}\|\mathbf{Y} - \mathbf{D}_1\boldsymbol{\Gamma}_1\|_2^2 + \xi_1\|\boldsymbol{\Gamma}_1\|_1$, followed by $\hat{\boldsymbol{\Gamma}}_2^{LBP} = \arg\min_{\boldsymbol{\Gamma}_2} \frac{1}{2}\|\hat{\boldsymbol{\Gamma}}_1^{LBP} - \mathbf{D}_2\boldsymbol{\Gamma}_2\|_2^2 + \xi_2\|\boldsymbol{\Gamma}_2\|_1$, and so on. This is related to deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus '10].

Stability of Layered BP. Theorem: assuming that $\|\boldsymbol{\Gamma}_i\|_{0,\infty}^s \le \frac{1}{3}\left(1 + \frac{1}{\mu(\mathbf{D}_i)}\right)$, then for correctly chosen $\{\xi_i\}_{i=1}^K$ we are guaranteed that: (i) the support of $\hat{\boldsymbol{\Gamma}}_i^{LBP}$ is contained in that of $\boldsymbol{\Gamma}_i$; (ii) the error is bounded, $\|\hat{\boldsymbol{\Gamma}}_i^{LBP} - \boldsymbol{\Gamma}_i\|_{2,\infty}^p \le \varepsilon_L^i$, where $\varepsilon_L^i = 7.5^i\,\|\mathbf{E}\|_{2,\infty}^p \prod_{j=1}^{i} \|\boldsymbol{\Gamma}_j\|_{0,\infty}^p$; and (iii) every entry in $\boldsymbol{\Gamma}_i$ greater than $\varepsilon_L^i / \|\boldsymbol{\Gamma}_i\|_{0,\infty}^p$ will be found. Problems: contrast, error growth, error even if no noise.

Layered Iterative Thresholding. Layered BP: $\hat{\boldsymbol{\Gamma}}_j^{LBP} = \arg\min_{\boldsymbol{\Gamma}_j} \frac{1}{2}\|\hat{\boldsymbol{\Gamma}}_{j-1}^{LBP} - \mathbf{D}_j\boldsymbol{\Gamma}_j\|_2^2 + \xi_j\|\boldsymbol{\Gamma}_j\|_1$. Layered iterative soft-thresholding: $\hat{\boldsymbol{\Gamma}}_j^{t} = \mathcal{S}_{\xi_j/c_j}\left(\hat{\boldsymbol{\Gamma}}_j^{t-1} + \frac{1}{c_j}\mathbf{D}_j^T\big(\hat{\boldsymbol{\Gamma}}_{j-1} - \mathbf{D}_j\hat{\boldsymbol{\Gamma}}_j^{t-1}\big)\right)$, with $c_i > 0.5\,\lambda_{\max}(\mathbf{D}_i^T\mathbf{D}_i)$. Note that our suggestion implies that groups of layers share the same dictionaries; this can be seen as a recurrent neural network [Gregor & LeCun '10].
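
A minimal sketch of the layered iterative soft-thresholding scheme above, unrolled for T iterations per layer; the dense random dictionaries, the ξ values, and the iteration count are illustrative assumptions.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def layered_ista(Y, dictionaries, xis, T=50):
    """Layered iterative soft-thresholding: solve each layer's Lasso
    Gamma_j = argmin 0.5||Gamma_{j-1} - D_j Gamma_j||^2 + xi_j ||Gamma_j||_1
    with T ISTA steps, feeding the result to the next layer."""
    Gammas, prev = [], Y
    for D, xi in zip(dictionaries, xis):
        c = np.linalg.norm(D, 2) ** 2            # >= lambda_max(D^T D)
        G = np.zeros(D.shape[1])
        for _ in range(T):
            G = soft(G + D.T @ (prev - D @ G) / c, xi / c)
        Gammas.append(G)
        prev = G
    return Gammas

# illustrative two-layer example
rng = np.random.default_rng(0)
D1 = rng.standard_normal((100, 200)); D2 = rng.standard_normal((200, 300))
Y = rng.standard_normal(100)
G1, G2 = layered_ista(Y, [D1, D2], xis=[0.1, 0.1])
```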

This Talk. The underlying idea: modeling the data source in order to be able to theoretically analyze algorithms' performance. We described the limitations of independent patch processing and ways to overcome some of these. We presented a theoretical study of the CSC – built on local sparsity – and a practical algorithm that works locally. We mentioned several interesting connections between CSC and CNN, and this led us to propose and analyze a multi-layer extension of CSC (ML-CSC), shown to be tightly connected to CNN. The ML-CSC was shown to enable a theoretical study of CNN, along with new insights: a novel interpretation and theoretical understanding of CNN, and an extension of the classical sparse theory to a multi-layer setting.

Questions?