
CIS 700-004: Lecture 3W. Challenges in training neural nets. 01/30/19

Course Announcements HW0 is due tonight. Reminder to make a private Piazza post if you'd like Canvas access to submit the homework as an auditor. We have 10 additional spots for enrollment available to people who submit the homework.

Today, we pretend to be doctors!

Today's agenda
- Finish up the anatomy of a neural net
  - Loss functions
  - Universal approximation
  - Tensorboard: a data scientist's best friend
- Challenges in training neural nets
  - Initialization challenges
  - Gradient challenges
    - Magnitude problems
    - Direction problems

Loss functions

Mean squared error Often used if the network is predicting a scalar.
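As a concrete sketch (assumed usage, not from the slides), the mean squared error is just the average of the squared differences between predictions and targets:

    import torch
    import torch.nn as nn

    predictions = torch.tensor([2.5, 0.0, 2.1])   # hypothetical scalar predictions
    targets = torch.tensor([3.0, -0.5, 2.0])      # corresponding ground-truth values

    criterion = nn.MSELoss()                      # averages (prediction - target)^2
    loss = criterion(predictions, targets)

    manual = ((predictions - targets) ** 2).mean()
    print(loss.item(), manual.item())             # both ~0.17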

Cross-entropy Most common loss function for classification tasks. It is a measure of distance between two probability distributions. For classification, the target vector is generally one-hot encoded, meaning its j-th entry is 1 if the example belongs to class j and 0 otherwise. Be careful to apply cross-entropy loss only to probabilities! (e.g. after softmax) In PyTorch, softmax and cross-entropy can be applied in a single step. Be careful in that case not to apply it twice :)
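A minimal sketch (assumed usage, not from the slides) of that single-step behavior: nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so applying softmax yourself first effectively applies it twice.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs for 3 classes
    target = torch.tensor([0])                  # index of the true class

    criterion = nn.CrossEntropyLoss()           # log-softmax + negative log-likelihood in one step
    loss = criterion(logits, target)

    # Passing probabilities instead of logits (softmax applied "twice") silently gives a different, wrong loss.
    wrong = criterion(F.softmax(logits, dim=1), target)
    print(loss.item(), wrong.item())            # ~0.24 vs ~0.70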

Cosine similarity Computes the similarity between the directions of two vectors. Often used in NLP contexts. For example, the two vectors might be the frequencies of different words.
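A small sketch (assumed, not from the slides) of cosine similarity in PyTorch, using made-up word-frequency vectors:

    import torch
    import torch.nn.functional as F

    doc_a = torch.tensor([3.0, 0.0, 1.0, 2.0])    # hypothetical word counts for document A
    doc_b = torch.tensor([1.0, 0.0, 0.5, 1.0])    # hypothetical word counts for document B

    # Compares directions only; scaling either vector leaves the result unchanged.
    similarity = F.cosine_similarity(doc_a, doc_b, dim=0)
    print(similarity.item())                      # close to 1.0: similar word proportions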

Universal Approximation

No Free Lunch Theorem “If an algorithm performs better than random search on some class of problems then it must perform worse than random search on the remaining problems.” Aka "A superior black-box optimisation strategy is impossible." Aka "If our feature space is unstructured, we can do no better than random."

No Free Lunch Theorem

Some problems are harder than others. [Figure: a spectrum of problem difficulty from easy to hard, covering linearly separable problems (solved by the perceptron), xor, continuous problems, real-life problems, and problems with no structure (unsolvable, by the No Free Lunch Theorem).]

Some problems are harder than others. [Same figure as above.] Loosely, the UAT says that neural nets can solve these problems.

Universal Approximation Theorem

Cybenko's construction Suppose we want to approximate the following function: [Figure: a curve y plotted against x]

Cybenko's construction Suppose we want to approximate the following function: [Figure: the same curve, with the x-axis marked at 1, 2, 3, 4] We can approximate it as a piecewise-constant function.

Cybenko's construction Let's make a neural net that represents the indicator function for the interval [2, 3].

Cybenko's construction: the indicator function
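A small numeric sketch (my own, under the assumption that the slide builds the indicator from two steep sigmoids): subtracting sigmoid(k(x - 3)) from sigmoid(k(x - 2)) gives approximately 1 inside [2, 3] and approximately 0 outside, and the approximation sharpens as k grows.

    import torch

    def interval_indicator(x, a=2.0, b=3.0, k=50.0):
        # Difference of two steep sigmoids: ~1 for a < x < b, ~0 elsewhere.
        return torch.sigmoid(k * (x - a)) - torch.sigmoid(k * (x - b))

    x = torch.tensor([1.0, 2.5, 3.5])
    print(interval_indicator(x))   # roughly [0, 1, 0]

Summing scaled copies of such indicators over the intervals from the previous slide reproduces the piecewise-constant approximation.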

Cybenko's construction [Figure: a bank of indicator approximators. Interval unit 1 produces y1, ..., interval unit n produces yn; the outputs are weighted by a1, ..., an and summed (Σ) to give y.]

UAT insight: any nonconstant, bounded, continuous transfer function could generate a class of universal approximators.

Training Challenges

Training Challenges: Initialization

How do we initialize our weights? Most models work best when their inputs and activations are centered around zero. All of our choices of activation functions inflect about zero. What is the simplest possible initialization strategy?

Zero initialization We set all weights and biases to be zero in every element. How will this perform?

Zero initialization: is horrible. [Figure: training curve labeled "Vanilla"]

Zero initialization: debugging why it's so horrible [Figure: the network's outputs]

Zero initialization: debugging why it's so horrible [Figure: the network's parameters]

Zero initialization: debugging why it's so horrible [Figure: the network's gradients]

Zero initialization gives you an effective layer width of one. Don't use it. Ever.
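A short sketch (my own illustration, with a sigmoid hidden layer assumed) of why this happens: zero initialization makes every hidden unit identical, and gradient descent preserves that symmetry, so the whole layer behaves like a single unit.

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(4, 8), nn.Sigmoid(), nn.Linear(8, 1))
    for p in net.parameters():
        nn.init.zeros_(p)                        # zero-initialize every weight and bias

    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    x, y = torch.randn(16, 4), torch.ones(16, 1)

    for _ in range(5):                           # a few SGD steps
        opt.zero_grad()
        (net(x) - y).pow(2).mean().backward()
        opt.step()

    # All 8 rows of the first layer's weights are still identical:
    # the hidden layer has an effective width of one.
    print(net[0].weight)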

The next simplest thing: random initialization We select all weights and biases uniformly from the interval [0, 1]. How will this perform?

Random initialization: slightly better. [Figure: training curve labeled "Vanilla"]

Random initialization scales poorly with layers.

Smart initializations The variance of a single weight should scale inversely with the layer width n. Correction (He et al.): variance = 2/n, not 1/(2n).
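A sketch (assumed torch.nn.init usage, not from the slides) of applying these scaled initializations explicitly:

    import torch.nn as nn

    tanh_layer = nn.Linear(256, 128)
    relu_layer = nn.Linear(256, 128)

    # Xavier/Glorot: variance scaled by fan-in and fan-out, suited to tanh/sigmoid layers.
    nn.init.xavier_uniform_(tanh_layer.weight)

    # He/Kaiming: variance = 2 / fan_in (the correction above), suited to ReLU layers.
    nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')

    nn.init.zeros_(tanh_layer.bias)
    nn.init.zeros_(relu_layer.bias)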

Xavier initialization: good. Also familiar. [Figure: training curve labeled "Vanilla"]

PyTorch uses Xavier initialization.

Training Challenges: Gradient Flow

Various learning curves demonstrating problems.

Training Challenges: Gradient Magnitude

Rate tuning

Advanced optimizers: Adagrad "Adaptive gradient algorithm" Adapts a learning rate for each parameter based on the accumulated size of its previous gradients.

Advanced optimizers: RMSprop "Root mean square prop" Adapts a learning rate for each parameter based on a decaying average of its previous squared gradients (unlike Adagrad, which accumulates them all).

Advanced optimizers: Adam "Adaptive moment estimation" Similar to RMSprop, but uses estimates of both the first and second moments of the gradients.
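A sketch of standard torch.optim usage for these three optimizers (assumed, not from the slides); in practice you would construct only one of them for a given model:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)   # placeholder model

    # Each optimizer adapts a per-parameter step size from the history of gradients.
    opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
    opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001)
    opt_adam    = torch.optim.Adam(model.parameters(), lr=0.001)   # tracks first and second moments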

What to do when your gradients are still enormous
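The slide body is not in the transcript, but one standard remedy for enormous gradients, sketched here as an assumption rather than as the lecture's answer, is gradient clipping before the optimizer step:

    import torch

    model = torch.nn.Linear(10, 1)                     # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()     # placeholder loss
    loss.backward()

    # Rescale all gradients so their combined norm is at most 1.0, then step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()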

What to do when your gradients are too small Find a better activation? Find a better architecture? Find a better initialization? Find a better career?

Training Challenges: Gradient Direction

A brief interlude on algorithms
A batch algorithm does a computation all in one go, given an entire dataset:
- Closed-form solution to linear regression
- Selection sort
- Quicksort
- Pretty much anything involving gradient descent
An online algorithm incrementally computes as new data becomes available:
- Heap algorithm for the kth largest element in a stream
- Perceptron algorithm
- Insertion sort
- Greedy algorithms

Minibatching
- A minibatch is a small subset of a large dataset.
- In order to perform gradient descent, we need an accurate measure of the gradient of the loss with respect to the parameters.
- The best measure is the average gradient over all of the examples (gradient descent is a batch algorithm).
- Computing over 60K examples on MNIST for a single (extremely accurate) update is stupidly expensive.
- We use minibatches (say, 50 examples) to compute a noisy estimate of the true gradient. The gradient updates are worse, but there are many of them.
- This converts neural net training into an online algorithm.

How PyTorch handles batching

    # MNIST Dataset
    dataset = dsets.MNIST(root='./data',
                          train=True,
                          transform=transforms.ToTensor(),
                          download=True)

    # Data Loader (Input Pipeline)
    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=100,
                                              shuffle=True)
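A sketch (my own, assuming the data_loader above plus a placeholder model) of how those minibatches turn into noisy gradient updates:

    import torch

    model = torch.nn.Linear(28 * 28, 10)                       # placeholder classifier
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for images, labels in data_loader:                         # one full pass = one epoch
        optimizer.zero_grad()
        outputs = model(images.view(images.size(0), -1))       # flatten 28x28 images
        loss = criterion(outputs, labels)
        loss.backward()                                        # gradient estimated from 100 examples only
        optimizer.step()                                       # one noisy update per minibatch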

Classical regularization

Dropout

Dropout
- Prevents overfitting by preventing co-adaptations
- Forces "dead branches" to learn
- Developed by Geoffrey Hinton, patented by Google

Dropout
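A minimal sketch (assumed usage, not from the slides) of dropout in PyTorch; model.train() enables it and model.eval() turns it off at test time:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),        # each hidden activation is zeroed with probability 0.5
        nn.Linear(256, 10),
    )

    x = torch.randn(32, 784)
    model.train()                 # dropout active during training
    train_out = model(x)
    model.eval()                  # dropout disabled; PyTorch rescales at train time, so eval is a plain forward pass
    test_out = model(x)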

Early stopping* *not built-in. Oh, the horror.
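Since it is not built in, early stopping is usually a few hand-written lines; a sketch under the assumption of hypothetical train_one_epoch and validate helpers:

    import torch

    best_val, patience, bad_epochs = float('inf'), 5, 0

    for epoch in range(100):
        train_one_epoch(model)                        # hypothetical helper: one epoch of training
        val_loss = validate(model)                    # hypothetical helper: returns validation loss
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), 'best.pt') # remember the best weights seen so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                # no improvement for `patience` epochs
                break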

So why do we have all this machinery?

So why is training a deep neural net hard?
- The universal approximation theorem doesn't help us learn a good approximator.
- Neural nets do not have a closed-form solution.
- Neural nets do not have a convex loss function.
- Neural nets are really expressive.

The awful, awful VC-dimension of neural nets

The optimization problem is hard, and there are no guarantees on learnability. We invented methods for manipulating gradient flow to get good parametrizations despite tricky loss landscapes.

Looking forward HW0 is due tonight. HW1 will be released later this week. Next week, we discuss architecture tradeoffs and begin computer vision.