Role of Stein’s Lemma in Guaranteed Training of Neural Networks

Role of Stein’s Lemma in Guaranteed Training of Neural Networks Anima Anandkumar NVIDIA and Caltech

Non-Convex Optimization
► Unique optimum: global/local
► Multiple local optima
► In high dimensions, possibly exponentially many local optima
How to deal with the challenge of non-convexity? Finding the global optimum.

Local Optima in Neural Networks: Example of Failure of Backpropagation
[Figure: a small one-hidden-layer network σ(·) on inputs x1, x2 with labels y = −1 and y = 1, showing a local optimum and the global optimum.]
Exponentially many (in the dimension) local optima for backpropagation.

Guaranteed Learning through Tensor Methods
Replace the objective function: cross entropy vs. best tensor decomposition. Preserves the global optimum (with infinite samples):
arg min_θ ‖T(x) − T(θ)‖²_F
T(x): empirical tensor, T(θ): low-rank tensor based on θ.
[Figure: Dataset 1 and Dataset 2 mapped to a model class by finding the globally optimal tensor decomposition.]
Simple algorithms succeed under mild and natural conditions for many learning problems.
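To make the objective concrete, here is a minimal numpy sketch (not the talk's code; `rank_k_tensor` and `tensor_fit_loss` are hypothetical names) of the loss ‖T(x) − Σ_j w_j v_j⊗3‖²_F for a symmetric rank-k model:

```python
import numpy as np

def rank_k_tensor(weights, vecs):
    # sum_j weights[j] * vecs[j] (x) vecs[j] (x) vecs[j], as a d x d x d array
    return np.einsum('j,ja,jb,jc->abc', weights, vecs, vecs, vecs)

def tensor_fit_loss(T_emp, weights, vecs):
    # Squared Frobenius distance between the empirical tensor and the low-rank model tensor.
    return np.sum((T_emp - rank_k_tensor(weights, vecs)) ** 2)

# Toy check: the loss is zero exactly at the generating parameters.
rng = np.random.default_rng(0)
d, k = 5, 3
w_true = rng.uniform(0.5, 1.5, size=k)
V_true = rng.standard_normal((k, d))
T_emp = rank_k_tensor(w_true, V_true)
print(tensor_fit_loss(T_emp, w_true, V_true))                       # ~ 0
print(tensor_fit_loss(T_emp, w_true, rng.standard_normal((k, d))))  # > 0
```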

Why Tensors? Method of Moments

Matrix vs. Tensor: Why Tensors are Necessary?

From Matrix to Tensor

Guaranteed Training of Neural Networks using Tensor Decomposition Majid Janzamin Hanie Sedghi

Method of Moments for a Neural Network
► Supervised setting: observing {(xᵢ, yᵢ)}
► Non-linear transformations via the activation function σ(·)
► Random x and y; moment possibilities: E[y ⊗ y], E[y ⊗ x], . . .
[Figure: a one-hidden-layer network with inputs x1, x2, x3, first-layer weights A1, hidden units σ(·), and output y.]
E[y ⊗ x] = E[σ(A1ᵀx) ⊗ x] ⇒ no linear transformation of A1.
One solution: linearization by using Stein's Lemma:
σ(A1ᵀx) —derivative→ σ′(·)A1ᵀ

Moments of a Neural Network
E[y|x] := f(x) = ⟨a2, σ(A1ᵀx)⟩
[Figure: a one-hidden-layer network with inputs x1, x2, x3, first-layer weights A1, hidden units σ(·), second-layer weights a2, and output y.]
Moments using score functions S(·):
E[y · S1(x)] = (pictured: a sum of per-neuron components)
E[y · S2(x)] = (pictured: a sum of per-neuron rank-1 matrices)
E[y · S3(x)] = (pictured: a sum of per-neuron rank-1 tensors)
► Linearization using the derivative operator (Stein's lemma); φm(x): m-th order derivative operator.
Theorem (Score function property)
When p(x) vanishes at the boundary, Sm(x) exists, and f(·) is m-times differentiable, Stein's lemma gives
E[y · Sm(x)] = E[f(x) · Sm(x)] = E[∇x^(m) f(x)].
“Score Function Features for Discriminative Learning: Matrix and Tensor Framework” by M. Janzamin, H. Sedghi, and A., Dec. 2014.
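A minimal numerical check of the first-order case of this identity, assuming Gaussian input so that S1(x) = x; the two-neuron network, sample size, and variable names below are illustrative, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 4, 3, 200_000
A1 = rng.standard_normal((d, k))        # first-layer weights (columns a_1j)
a2 = rng.standard_normal(k)             # second-layer weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal((n, d))         # x ~ N(0, I), so S1(x) = x
y = sigmoid(x @ A1) @ a2                # y = <a2, sigma(A1^T x)>

lhs = (y[:, None] * x).mean(axis=0)     # E[y * S1(x)]
s = sigmoid(x @ A1)
grad_f = (s * (1 - s) * a2) @ A1.T      # grad f(x) = sum_j a2_j sigma'(a_1j^T x) a_1j
rhs = grad_f.mean(axis=0)               # E[grad_x f(x)]

print(np.round(lhs, 2))
print(np.round(rhs, 2))                 # the two estimates agree up to Monte Carlo error
```

Note that the resulting vector is a single linear combination of the columns of A1; separating the individual weight vectors is exactly what the higher-order score functions S2 and S3 are used for.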

Stein's Lemma through Score Functions
► Continuous x ∈ Rᵈ with pdf p(·): S1(x) := −∇x log p(x), S1(x) ∈ Rᵈ
► m-th order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x), e.g. S2(x) ∈ R^{d×d}, S3(x) ∈ R^{d×d×d}
► For Gaussian x ∼ N(0, I): orthogonal Hermite polynomials, S1(x) = x, S2(x) = xxᵀ − I, . . .
Application of Stein's Lemma
► Provides derivative information: let E[y|x] := f(x); then E[y · Sm(x)] = E[∇x^(m) f(x)].
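As a sanity check one order up, the sketch below verifies E[f(x) · S2(x)] = E[∇²x f(x)] for Gaussian x, using S2(x) = xxᵀ − I; the single-neuron test function f(x) = σ(wᵀx + b) and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 500_000
w = rng.standard_normal(d)
b = 1.0

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal((n, d))                 # x ~ N(0, I)
s = sigmoid(x @ w + b)                          # f(x) = sigma(w^T x + b)

S2 = np.einsum('na,nb->nab', x, x) - np.eye(d)  # S2(x) = x x^T - I
lhs = np.einsum('n,nab->ab', s, S2) / n         # E[f(x) S2(x)]

sigma2 = s * (1 - s) * (1 - 2 * s)              # sigma''(w^T x + b)
rhs = sigma2.mean() * np.outer(w, w)            # E[grad^2 f(x)] = E[sigma''] w w^T

print(np.round(lhs, 3))
print(np.round(rhs, 3))                         # agree up to Monte Carlo error (~1e-3)
```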

NN-LIFT: Neural Network-LearnIng using Feature Tensors

Training Neural Networks with Tensors
Realizable setting: E[y · Sm(x)] has a CP tensor decomposition.
M. Janzamin, H. Sedghi, and A., “Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods,” June 2015.
A. Barron, “Approximation and Estimation Bounds for Artificial Neural Networks,” Machine Learning, 1994.

Training Neural Networks with Tensors
Realizable: E[y · Sm(x)] has a tensor decomposition.
Non-realizable:
Theorem (training neural networks). For small enough Cf:
E_x[|f̂(x) − f̃(x)|²] ≤ O(Cf²/k) + O(1/n),
with n samples and k neurons.
M. Janzamin, H. Sedghi, and A., “Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods,” June 2015.
A. Barron, “Approximation and Estimation Bounds for Artificial Neural Networks,” Machine Learning, 1994.

Training Neural Networks with Tensors First guaranteed method for training neural networks M. Janzamin, H. Sedghi, and A., “Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods,” June. 2015. A. Barron, “Approximation and Estimation Bounds for Artificial Neural Networks,” Machine Learning, 1994.

Background on optimization landscape of tensor decomposition

Notion of Tensor Contraction
Extends the notion of the matrix product.
[Figure: matrix product vs. tensor contraction.]
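For example, in numpy a contraction of a third-order tensor with vectors or matrices is a direct generalization of the matrix-vector product (illustrative sketch, names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))
T = rng.standard_normal((d, d, d))
v = rng.standard_normal(d)

Mv  = M @ v                                   # matrix product: contract one mode with v
Tvv = np.einsum('ijk,j,k->i', T, v, v)        # contract two modes with v; for symmetric T this is T(v, v, .)
TMM = np.einsum('ijk,ja,kb->iab', T, M, M)    # contract two modes with matrices, as in T(., W, W)

print(Mv.shape, Tvv.shape, TMM.shape)         # (4,) (4,) (4, 4, 4)
```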

Symmetric Tensor Decomposition
T = v1⊗3 + v2⊗3 + ···
Tensor Power Method
v → T(v, v, ·) / ‖T(v, v, ·)‖,   T(v, v, ·) = ⟨v, v1⟩² v1 + ⟨v, v2⟩² v2
Orthogonal Tensors: v1 ⊥ v2 ⇒ T(v1, v1, ·) = λ1 v1.
Exponential number of stationary points for the power method: T(v, v, ·) = λ v.
[Figure: stable, unstable, and other stationary points of the power method.]
For the power method on an orthogonal tensor, there are no spurious stable points.
A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, “Tensor Decompositions for Learning Latent Variable Models,” JMLR 2014.
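A minimal sketch of the tensor power method on a synthetic orthogonal tensor, with random restarts and a simple deflation step (the deflation and all names are my additions; this is a sketch, not the paper's implementation):

```python
import numpy as np

def power_method(T, n_iters=100, n_restarts=10, rng=None):
    """Return a (eigenvalue, eigenvector) robust eigenpair of a symmetric tensor T."""
    rng = rng or np.random.default_rng()
    d = T.shape[0]
    best_lam, best_v = -np.inf, None
    for _ in range(n_restarts):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)        # v <- T(., v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # lambda = T(v, v, v)
        if lam > best_lam:
            best_lam, best_v = lam, v
    return best_lam, best_v

# Orthogonal rank-3 tensor: T = sum_j lam_j v_j^(x3) with orthonormal v_j.
rng = np.random.default_rng(3)
d, k = 6, 3
lams = np.array([3.0, 2.0, 1.0])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
V = Q[:, :k]
T = np.einsum('j,aj,bj,cj->abc', lams, V, V, V)

recovered = []
for _ in range(k):
    lam, v = power_method(T, rng=rng)
    recovered.append(v)
    T = T - lam * np.einsum('a,b,c->abc', v, v, v)      # deflate the found component

for v in recovered:
    print(np.round(np.abs(V.T @ v), 3))                 # one entry ~1: v matches some v_j
```

Each recovered vector aligns with one of the planted components, consistent with the claim that the orthogonal case has no spurious stable points.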

Non-orthogonal Tensor Decomposition
T = v1⊗3 + v2⊗3 + ···
Orthogonalization
[Figure: the whitening map W sends the non-orthogonal components v1, v2 to orthogonal components ṽ1, ṽ2.]
T̃ = T(W, W, W) = ṽ1⊗3 + ṽ2⊗3 + ···
Find W using the SVD of a matrix slice: M = T(·, ·, θ).
Orthogonalization is invertible when the vi's are linearly independent.
⇒ Recovery of network weights under linear independence.
A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, “Tensor Decompositions for Learning Latent Variable Models,” JMLR 2014.
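The slide obtains W from an SVD of a matrix slice T(·, ·, θ); the sketch below instead whitens with a second-order moment M2 = Σ_j v_j v_jᵀ (unit weights for simplicity), which illustrates the same orthogonalization idea under the assumption that such a matrix is available and the v_j are linearly independent. Names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 6, 3
V = rng.standard_normal((d, k))               # linearly independent, non-orthogonal components v_j

# Second- and third-order "moments" built from the same components.
M2 = V @ V.T                                  # sum_j v_j v_j^T
T3 = np.einsum('aj,bj,cj->abc', V, V, V)      # sum_j v_j (x) v_j (x) v_j

# Whitening matrix W from the rank-k eigendecomposition of M2: W^T M2 W = I_k.
eigvals, eigvecs = np.linalg.eigh(M2)
idx = np.argsort(eigvals)[::-1][:k]
W = eigvecs[:, idx] / np.sqrt(eigvals[idx])   # d x k

T_tilde = np.einsum('abc,ai,bj,ck->ijk', T3, W, W, W)   # T(W, W, W)
V_tilde = W.T @ V                                        # whitened components tilde-v_j

print(np.round(V_tilde.T @ V_tilde, 3))       # ~ identity: the tilde-v_j are orthonormal
check = np.einsum('aj,bj,cj->abc', V_tilde, V_tilde, V_tilde)
print(np.allclose(T_tilde, check))            # True: T(W, W, W) = sum_j tilde-v_j^(x3)
```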

Perturbation Analysis for Tensor Decomposition
Well understood for matrix decomposition vs. hard for polynomials. Contribution: first results for tensor decomposition.
T ∈ R^{d×d×d}: orthogonal tensor. E: noise tensor.
Theorem: When the noise ‖E‖ is sufficiently small (condition involving λ), in … iterations of the power method and a linear number of restarts, the {vi} are recovered up to error ‖E‖.
[Figure: Dataset 1 and Dataset 2 mapped to a model class — requires datasets with good model fitting.]
Polynomial computational and sample complexity for tensor methods.
A., R. Ge, D. Hsu, S. Kakade, M. Telgarsky, “Tensor Decompositions for Learning Latent Variable Models,” JMLR 2014.
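A quick illustration of this robustness, assuming a planted orthogonal tensor perturbed by Gaussian noise E (a sketch with my own names, not the paper's experiments):

```python
import numpy as np

def top_eigenpair(T, n_iters=200, rng=None):
    # One run of the tensor power method: v <- T(., v, v) / ||T(., v, v)||
    rng = rng or np.random.default_rng()
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(5)
d = 8
V = np.eye(d)[:, :3]                                  # orthonormal components e_1, e_2, e_3
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum('j,aj,bj,cj->abc', lams, V, V, V)

for noise in [1e-3, 1e-2, 3e-2]:
    E = noise * rng.standard_normal((d, d, d))
    v_hat = top_eigenpair(T + E, rng=np.random.default_rng(0))
    err = min(np.linalg.norm(v_hat - V[:, j]) for j in range(3))
    print(f"noise scale {noise:.0e}: ||E||_F = {np.linalg.norm(E):.2f}, recovery error = {err:.4f}")
```

The recovery error grows roughly in proportion to the size of the perturbation, in line with the theorem.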

Implications and next steps

NN-LIFT: Neural Network-LearnIng using Feature Tensors

Overparameterization as a solution?
► Gradient descent (no stochasticity).
► Width of each layer must be exponentially large in the number of layers for fully connected networks, but polynomial for ConvNets and ResNets (still requires overparameterization in each layer).

So what is the cost? Slides from Ben Recht

Role of the Generative Process
► Continuous x ∈ Rᵈ with pdf p(·); m-th order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x), e.g. S3(x) ∈ R^{d×d×d}.
► Score functions need a generative model: shows the role of the generative process in discriminative learning.
► Estimating them is hard in general. More relevant in system-identification settings where p(x) is known: control, wireless networks (channel estimation), etc.
► Open questions: how to optimally select p(x)? How to generalize this to multi-layer networks?
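As a small illustration of why a generative model is needed: when p(x) is known, for example a Gaussian mixture, the first-order score S1(x) = −∇x log p(x) has a closed form via the mixture responsibilities. A hypothetical sketch (the mixture parameters and function name are made up):

```python
import numpy as np

def score1_gaussian_mixture(x, means, weights):
    """S1(x) = -grad log p(x) for p(x) = sum_k weights_k * N(x; means_k, I)."""
    diffs = x[None, :] - means                              # (K, d): x - mu_k
    logits = np.log(weights) - 0.5 * np.sum(diffs**2, axis=1)
    r = np.exp(logits - logits.max())
    r /= r.sum()                                            # responsibilities r_k(x)
    # grad log p(x) = -sum_k r_k(x) (x - mu_k), so S1(x) = sum_k r_k(x) (x - mu_k)
    return (r[:, None] * diffs).sum(axis=0)

means = np.array([[2.0, 0.0], [-2.0, 0.0]])
weights = np.array([0.5, 0.5])
print(score1_gaussian_mixture(np.array([1.0, 0.5]), means, weights))
```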