Presentation transcript:

Deep vs. shallow learning CIS 700-004: Lecture 4M (02/04/19)

Course Announcements Homework has not been released yet. Keep relaxing :)

Design in deep learning Deep learning is not yet a science. We don't know which networks will work for which problems; in practice, people mostly tweak what has worked before. We don't have the answers (yet). Today we'll talk about the start of a science for deep learning: intuition and theory.

The intuitive benefits of depth

How deep is the brain? (Felleman and Van Essen; Jonas and Kording 2017)

And locally (Shepherd 1994)

Compositionality in tasks. (Using a vision example here would have been more powerful.)

Expressivity

Expressivity We know that even shallow neural nets (one hidden layer) are universal approximators under various assumptions, but that can require enormous width. Given a particular architecture, we can look at its expressivity: the set of functions it can approximate. Why do we need to look at approximation here?

Expressivity gaps: number of linear pieces Sawtooth function. The sawtooth function with 2^n pieces can be expressed succinctly with ~3n neurons and depth ~2n (Telgarsky 2015). The naive shallow implementation takes exponentially more neurons.
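To make the counting concrete, here is a minimal numpy sketch (my own illustration, not from the slides): a two-ReLU "tent" layer composed k times yields a sawtooth with 2^k linear pieces, so the piece count grows exponentially with depth while the neuron count grows only linearly.

```python
import numpy as np

def tent(x):
    # One "layer": two ReLU units computing the tent map on [0, 1],
    # t(x) = 2x for x < 1/2 and 2(1 - x) for x >= 1/2.
    return 2 * np.maximum(x, 0) - 4 * np.maximum(x - 0.5, 0)

def sawtooth(x, k):
    # Compose the tent map k times: depth k, ~2k ReLU neurons in total,
    # but 2**k linear pieces.
    for _ in range(k):
        x = tent(x)
    return x

# Count linear pieces by looking for slope changes on a fine grid
# (grid size is a power of two so the breakpoints land on grid points).
grid = np.linspace(0.0, 1.0, 2**13 + 1)
for k in range(1, 8):
    slopes = np.diff(sawtooth(grid, k)) / np.diff(grid)
    pieces = 1 + np.sum(np.abs(np.diff(slopes)) > 1e-6)
    print(f"depth {k}: ~{2 * k} ReLU neurons, {pieces} linear pieces")
```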

Expressivity gaps: number of linear pieces Montufar et al. (2014) showed that the number of linear pieces that can be expressed by a deep piecewise-linear network grows exponentially in the depth and polynomially in the number of input dimensions.

Expressivity gaps: curvature Theorem: For a bounded activation function, a unit-length curve sent through a deep network can grow in length exponentially with the depth; for a shallow network, the length is only linear in the width (Poole et al. 2016).

Expressivity gaps: curvature Theorem: For a bounded activation function, a unit-length curve sent through a deep network can grow in length exponentially with the depth; for a shallow network, the length is only linear in the width (Poole et al. 2016). Empirically, they find that the curvature of the output curve grows exponentially with depth. They prove this in the infinite-width limit for random networks (i.e., at initialization). The infinite-width limit is used in many proofs, though it doesn't capture everything (as we'll see later). One example of a finite-width effect on the variance: an entire ReLU layer can get zeroed out.
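A rough numpy sketch of the curve-length statement (my own illustration; the width, depth, and weight scale sigma_w are assumed values, with sigma_w chosen in the "chaotic" regime): send a circle through a random deep tanh network and measure the Euclidean length of its image after each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, sigma_w = 400, 12, 2.5   # assumed values; sigma_w > 1 puts zero-bias tanh nets past criticality

# A circle of radius sqrt(width) in a random 2D plane of input space, sampled finely.
theta = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
u, v = rng.standard_normal((2, width))
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
x = np.sqrt(width) * (np.outer(np.cos(theta), u) + np.outer(np.sin(theta), v))

def curve_length(points):
    # Sum of distances between consecutive samples of the closed curve.
    diffs = np.diff(np.vstack([points, points[:1]]), axis=0)
    return np.linalg.norm(diffs, axis=1).sum()

h = x
for layer in range(1, depth + 1):
    W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
    h = np.tanh(h @ W)
    print(f"layer {layer:2d}: curve length = {curve_length(h):.1f}")
```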

Expressivity gaps: the multiplication problem Theorem (Lin et al. 2017). To approximate the product of n inputs x_1, x_2, ..., x_n to within arbitrary accuracy ε, a shallow network requires 2^n neurons, but a deep network requires only O(n) neurons (linear in n). Theorem (Rolnick & Tegmark 2018). More generally, the number of neurons required for a shallow network to approximate a general monomial is also exponential in n, and the number of neurons required to approximate a sum of m monomials is at least 1/m times the number required for the individual monomials. Thus there is also an exponential gap for any (sparse) polynomial.
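As a sanity check on the "linear in n" claim, here is a small sketch (mine, not Lin et al.'s construction) of why a deep network only needs O(n) pairwise product gates: arrange the multiplications in a binary tree of depth ~log2(n). A plain Python multiplication stands in for the small fixed-size subnetwork that would approximate each product xy, e.g. via xy = ((x + y)^2 - (x - y)^2) / 4.

```python
import math

def product_via_tree(values):
    """Multiply n numbers with a binary tree of pairwise products.

    Each pairwise product stands in for a small fixed-size network gate,
    so the total neuron count scales with the number of gates (n - 1).
    """
    level, gates, depth = list(values), 0, 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] * level[i + 1])   # one pairwise "gate"
            gates += 1
        if len(level) % 2:                        # odd element passes through
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], gates, depth

value, gates, depth = product_via_tree(range(1, 9))   # 1 * 2 * ... * 8
print(value, gates, depth)        # 40320, 7 gates (= n - 1), depth 3 (= log2 8)
print(math.factorial(8))          # matches
```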

Expressivity gaps: the multiplication problem Theorem (Rolnick & Tegmark 2018). When using k layers to approximate the product of n inputs, the number of neurons needed is bounded above by a quantity that shrinks as k grows, interpolating between the exponential (shallow) and linear (deep) cases. The bound is conjectured to be tight.

Learnability

Is expressivity the problem with shallow nets? Ba & Caruana (2014): A wide, shallow network can be trained to mimic a deep network (or an ensemble of deep networks), attaining significantly greater accuracy than training the same shallow network directly on the data.

Is expressivity the problem with shallow nets? Ba & Caruana (2014): A wide, shallow network can be trained to mimic a deep network (or an ensemble of deep networks), attaining significantly greater accuracy than training the same shallow network directly on the data. The mimic networks are trained to match the pre-softmax outputs (logits) of the teacher networks. Why is there more information here than in simply training on the raw data? Learnability of deeper networks may matter more than expressivity in practice.
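A hedged PyTorch sketch of the mimic setup (a minimal version of the idea, not Ba & Caruana's actual pipeline; the teacher here is an untrained stand-in and all sizes are arbitrary): the wide shallow student regresses onto the teacher's pre-softmax logits with an L2 loss rather than training on labels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in teacher (deep) and student (wide, shallow); in the real setup the
# teacher would be trained to high accuracy first.
teacher = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10),
)
student = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 10))

x = torch.randn(4096, 32)                 # unlabeled inputs are enough
with torch.no_grad():
    teacher_logits = teacher(x)           # pre-softmax outputs of the teacher

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(500):
    idx = torch.randint(0, x.shape[0], (128,))
    loss = nn.functional.mse_loss(student(x[idx]), teacher_logits[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, loss.item())
```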

Is expressivity typical or just possible? (Hanin & Rolnick 2019) Sawtooth function: weights and biases perturbed with Gaussian noise (std dev 0.1).

Linear regions in ReLU nets: plane through 3 MNIST examples, depth-3, width-64 network at initialization (Hanin & Rolnick 2019).

Linear regions in ReLU nets The number of linear regions in a ReLU network (which computes a piecewise-linear function) can be exponential in the depth (Montufar et al. 2014). Hanin & Rolnick (2019) study the regions of a typical ReLU net at initialization. Theorem 1: The expected number of regions that intersect any 1D trajectory (e.g. a line), per unit length, is linear in N, the total number of neurons. Theorem 2: The expected surface area of the total boundary between regions, per unit volume, is linear in N. Theorem 3: The expected distance to the nearest region boundary scales as 1/N. For n-dimensional input, the number of regions is conjectured to grow as (depth)^n.
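The 1D statement in Theorem 1 is easy to probe empirically. Below is a rough numpy sketch (mine, not the paper's code): walk along a line segment in input space, record the ReLU activation pattern at each sample, and count how often the pattern changes; each change marks a crossing into a new linear region. The three architectures compared share the same total neuron count N.

```python
import numpy as np

rng = np.random.default_rng(0)

def region_crossings(widths, n_in=16, samples=20000):
    """Approximate # of linear regions a random line crosses in a He-initialized ReLU net."""
    # Random line segment in input space.
    a, b = rng.standard_normal((2, n_in))
    t = np.linspace(0.0, 1.0, samples)[:, None]
    x = (1 - t) * a + t * b                      # (samples, n_in)

    patterns = []
    h, fan_in = x, n_in
    for w in widths:
        W = rng.standard_normal((fan_in, w)) * np.sqrt(2.0 / fan_in)
        bias = rng.standard_normal(w) * 0.1
        pre = h @ W + bias
        patterns.append(pre > 0)                 # activation pattern of this layer
        h, fan_in = np.maximum(pre, 0), w
    pattern = np.concatenate(patterns, axis=1)   # (samples, total_neurons)

    changes = np.any(pattern[1:] != pattern[:-1], axis=1).sum()
    return changes + 1                           # regions along the line = crossings + 1

for widths in [(32,), (16, 16), (8, 8, 8, 8)]:   # same total N = 32 neurons
    print(widths, region_crossings(widths))
```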

Linear regions in ReLU nets (Hanin & Rolnick 2019)

Linear regions in ReLU nets at initialization, epoch 1, and epoch 20 (Hanin & Rolnick 2019).

Loss landscapes of neural networks The loss landscape refers to how the loss changes over parameter space. The dimension of parameter space is very high (potentially millions). Learning aims to find a global minimum. At right is a surface plot with z = loss and the xy-plane a 2D projection of the parameters; the individual directions in the projection are normalized by the network weights (Li et al. 2018).
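A minimal numpy sketch of the visualization idea (a simplified analogue of Li et al.'s filter-normalized plots, using per-neuron normalization, an untrained toy network, and synthetic data; everything here is assumed for illustration): evaluate the loss on a 2D grid of perturbations of the first-layer weights along two random, norm-matched directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression problem and a one-hidden-layer tanh net (weights would normally be trained).
X = rng.standard_normal((256, 8))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(256)
W1, b1 = rng.standard_normal((8, 32)) * 0.5, np.zeros(32)
w2, b2 = rng.standard_normal(32) * 0.5, 0.0

def loss(W1, b1, w2, b2):
    pred = np.tanh(X @ W1 + b1) @ w2 + b2
    return np.mean((pred - y) ** 2)

def normalized_direction():
    # Random direction with each hidden unit's incoming weights rescaled to
    # match the norm of the corresponding unit in the reference weights.
    D = rng.standard_normal(W1.shape)
    D *= np.linalg.norm(W1, axis=0) / (np.linalg.norm(D, axis=0) + 1e-12)
    return D

D1, D2 = normalized_direction(), normalized_direction()
alphas = np.linspace(-1, 1, 21)
grid = np.array([[loss(W1 + a * D1 + b * D2, b1, w2, b2) for b in alphas] for a in alphas])
print(grid.round(2))   # z = loss over the 2D slice; plot with contourf/plot_surface if desired
```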

Local optima, saddle points The classic worry in optimization is falling into a local optimum. (This is why convex optimization is great: local minima are global.) But for deep networks there is actually another problem: local minima are rare, but saddle points are common (Dauphin et al. 2014). Why is this the case? The eigenvalues of the Hessian behave like those of a random matrix, shifted right by an amount determined by the loss at the point in question (Bray and Dean 2007). Saddle points look like plateaus.
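The random-matrix intuition can be illustrated in a few lines of numpy (a toy model, not the Hessian of an actual network): the eigenvalues of a random symmetric matrix form a semicircle around zero, so roughly half the directions are descent directions; shifting the spectrum, as the theory says happens as the loss changes, changes that fraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Random symmetric "Hessian": eigenvalues approximately follow Wigner's semicircle law on [-2, 2].
A = rng.standard_normal((n, n)) / np.sqrt(n)
H = (A + A.T) / np.sqrt(2)

for shift in [0.0, 0.5, 1.0, 2.0]:          # larger shift plays the role of a lower loss
    eigs = np.linalg.eigvalsh(H + shift * np.eye(n))
    frac_neg = np.mean(eigs < 0)
    print(f"shift {shift:.1f}: fraction of negative eigenvalues = {frac_neg:.2f}")
```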

Learning XOR Consider the following problem for S a subset of {1, 2, ..., d}: for each d-dimensional binary input x, compute the XOR of the coordinates of x indexed by S. Theorem (Shalev-Shwartz et al. 2017). As S varies, the gradient of the loss between a predictor and the true XOR is tightly concentrated; that is, the gradient doesn't depend strongly on S. (Formally, the variance of the gradient with respect to S is exponentially small in d.) The loss landscape is exponentially flat, except right around the minimum.
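A small numpy experiment in the spirit of this theorem (my own, under simplifying assumptions: a fixed random one-hidden-layer tanh predictor, squared loss, and gradients taken only with respect to the output weights): for many random subsets S, compute the gradient of the loss against the parity chi_S and measure how much it varies with S.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, hidden, n_subsets = 10, 32, 200

# All 2^d inputs in {-1, +1}^d; parity chi_S(x) = product of x_i over i in S.
X = np.array(list(product([-1.0, 1.0], repeat=d)))          # (2^d, d)

# Fixed random predictor f(x) = v . tanh(W x).
W = rng.standard_normal((d, hidden)) / np.sqrt(d)
v = rng.standard_normal(hidden) / np.sqrt(hidden)
H = np.tanh(X @ W)                                           # (2^d, hidden)
f = H @ v                                                    # (2^d,)

grads = []
for _ in range(n_subsets):
    S = rng.random(d) < 0.5                                  # random subset of {1, ..., d}
    target = np.prod(np.where(S, X, 1.0), axis=1)            # chi_S(x)
    # Gradient of E_x[(f(x) - chi_S(x))^2] with respect to the output weights v.
    grads.append(2 * H.T @ (f - target) / len(X))
grads = np.array(grads)                                      # (n_subsets, hidden)

print("variance of gradient across S :", grads.var(axis=0).sum())
print("squared norm of mean gradient :", (grads.mean(axis=0) ** 2).sum())
```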

Exploding & vanishing - theory and practice Hanin & Rolnick (2018): ReLU networks at initialization, with weights drawn i.i.d. with variance c/fan-in. Consider the squared length of the activation vector at layer j, normalized by the layer width: M_j = ||z_j||^2 / n_j. Theorem 1. The mean of M_j across initializations grows or decays exponentially with the depth unless c = 2. Theorem 2. Even with the right mean, the variance of the squared length between layers is exponential in the sum of reciprocals of the layer widths, sum_j 1/n_j. Hanin (2018): The variance of the gradients of the network is also exponential in this quantity.
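A short numpy check of the first failure mode (my own sketch, consistent with the setup above): forward a fixed input through deep random ReLU nets whose weights have variance c/fan-in, and look at the normalized squared activation length at the last layer for c below, at, and above 2.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50

x = rng.standard_normal(width)

for c in [1.0, 2.0, 4.0]:                      # weight variance = c / fan-in
    h = x.copy()
    for j in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(c / width)
        h = np.maximum(W @ h, 0.0)
    M = np.dot(h, h) / width                   # normalized squared activation length
    print(f"c = {c}: M_depth = {M:.3e}")       # vanishes for c < 2, stays O(1) for c = 2, explodes for c > 2
```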

Exploding & vanishing - initialization Exponential growth of the mean squared length of the output vector for many popular initializations, and the negative impact of very large and very small output lengths on early training on MNIST (Hanin & Rolnick 2018).

Exploding & vanishing - architecture Early training dynamics for a variety of architectures trained on MNIST. In the left panel, the pink curve has a smaller sum of reciprocals of layer widths at each depth, while all other curves share the same (larger) sum (Hanin & Rolnick 2018).

Exploding & vanishing - takeaways Poor initialization and poor architecture can both stop networks from learning. Initialization: use i.i.d. weights with variance 2/fan-in (e.g. He normal / He uniform). Watch out for truncated normals! Architecture: width (or the number of features in ConvNets) should grow with depth; even a single narrow layer makes training hard.
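On the "watch out for truncated normals" point, a quick numpy check (mine): truncating a normal at plus or minus two standard deviations, as common truncated-normal initializers do, silently shrinks the variance below the 2/fan-in you asked for unless you rescale.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512
target_std = np.sqrt(2.0 / fan_in)            # He initialization: variance 2 / fan-in

# Plain normal with the intended std.
w_normal = rng.normal(0.0, target_std, size=1_000_000)

# Truncated normal: resample anything beyond 2 standard deviations.
w_trunc = rng.normal(0.0, target_std, size=1_000_000)
mask = np.abs(w_trunc) > 2 * target_std
while mask.any():
    w_trunc[mask] = rng.normal(0.0, target_std, size=mask.sum())
    mask = np.abs(w_trunc) > 2 * target_std

print("intended variance :", target_std ** 2)
print("normal variance   :", w_normal.var())
print("truncated variance:", w_trunc.var())   # noticeably smaller; rescale if you truncate
```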

Summary: depth and width Depth is really useful, but with diminishing returns. Very deep networks are probably more useful because of their learning biases than because of their expressivity. In practice, deeper networks learn more complex functions but are harder to train at all (ResNets make it easier to train deep networks). Wider networks are easier to train. There is no absolute rule here, sorry!