Tips for Training Neural Networks


Tips for Training Neural Networks (scratching the surface)

Two Concerns There are two things to be concerned about. Optimization: Can we find the “best” parameter set θ* in a limited amount of time? Generalization: Is the “best” parameter set θ* good for the testing data as well?

Initialization For gradient descent, we need to pick an initial parameter set θ0. Do not set all the parameters in θ0 equal; set the parameters in θ0 randomly.
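A minimal numpy sketch of why equal initialization fails (the tiny 3-2-1 network and data below are made up for illustration): hidden units that start with identical weights receive identical gradients, so they can never become different.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 made-up examples with 3 input features
y = rng.normal(size=(5, 1))   # made-up regression targets

def forward_and_grads(W1, w2):
    """One forward/backward pass for a 3-2-1 network with sigmoid hidden units."""
    h = 1.0 / (1.0 + np.exp(-x @ W1))              # hidden activations, shape (5, 2)
    err = h @ w2 - y                               # output error, shape (5, 1)
    grad_w2 = h.T @ err                            # gradient w.r.t. output weights
    grad_W1 = x.T @ ((err @ w2.T) * h * (1 - h))   # gradient w.r.t. hidden weights
    return grad_W1, grad_w2

# Equal initialization: the two hidden units start out identical ...
W1_equal = np.full((3, 2), 0.5)
w2_equal = np.full((2, 1), 0.5)
gW1, _ = forward_and_grads(W1_equal, w2_equal)
print(np.allclose(gW1[:, 0], gW1[:, 1]))   # True: identical gradients, so the units stay identical

# Random initialization breaks the symmetry.
gW1, _ = forward_and_grads(rng.normal(scale=0.5, size=(3, 2)), w2_equal)
print(np.allclose(gW1[:, 0], gW1[:, 1]))   # False: the units can now learn different things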

Learning Rate Set the learning rate η carefully. Toy Example: Training Data (20 examples)
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]
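A minimal sketch of fitting y ≈ w·x + b to this toy data with gradient descent; the learning rates tried below are illustrative choices, not the ones used on the slides.

import numpy as np

# The 20 toy training examples from the slide.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5,
              5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5])
y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
              5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

def fit(eta, steps=100):
    """Gradient descent on the squared-error surface C(w, b) for y ≈ w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y
        w -= eta * 2 * np.mean(err * x)   # dC/dw
        b -= eta * 2 * np.mean(err)       # dC/db
    return w, b, np.mean((w * x + b - y) ** 2)

for eta in (0.001, 0.01, 0.05):
    print(eta, fit(eta))   # too small: slow progress; moderate: converges; too large: the cost blows up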

Learning Rate Toy Example (figure: the error surface C(w, b), with the starting point and the target minimum marked)

Learning Rate Toy Example (figure: gradient descent on the toy example with different learning rates η)

Gradient Descent vs. Stochastic Gradient Descent Gradient descent averages the gradient over all examples before every update; stochastic gradient descent picks one example xr and updates on its gradient alone. The two approaches update the parameters in the same direction on average, provided every example xr has an equal probability of being picked, but the stochastic approach is faster.
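A minimal sketch contrasting the two update rules on a stand-in for the toy data (here simply y ≈ x plus noise): both recover a slope near 1, but the stochastic version makes 20 updates per pass through the data instead of one.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 9.5, 20)            # stand-in for the 20-example toy data
y = x + rng.normal(scale=0.2, size=20)

eta = 0.01

# (Full) gradient descent: average over all examples, one update per pass.
w = 0.0
for _ in range(20):
    w -= eta * 2 * np.mean((w * x - y) * x)
print("gradient descent:", w)

# Stochastic gradient descent: pick one example xr at a time, 20 updates per pass.
w = 0.0
for _ in range(20 * 20):
    r = rng.integers(len(x))
    w -= eta * 2 * (w * x[r] - y[r]) * x[r]
print("stochastic gradient descent:", w)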

Gradient Descent Stochastic gradient descent processes the training data one example at a time: starting at θ0, pick x1 and update, pick x2 and update, ……, pick xR and update. Seeing all the examples once is one epoch (pronounced [ˋɛpək]); the next epoch then starts again by picking x1.

Gradient Descent Toy Example Gradient descent sees all the examples before each update, so it makes one update per epoch; stochastic gradient descent sees only one example per update, so it updates 20 times in an epoch. (figure: parameter trajectories of the two methods after 1 epoch)

Mini-Batch Gradient Descent Shuffle your data, pick B examples as a batch b (B is the batch size), and average the gradient of the examples in the batch b for each update. What is the meaning of “shuffle your data”? Reorder the training examples at random before splitting them into batches, so that each batch mixes different examples.
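A minimal sketch of mini-batch gradient descent on the same stand-in toy data; the batch size B = 4 and the learning rate are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 9.5, 20)            # stand-in for the toy data
y = x + rng.normal(scale=0.2, size=20)

def minibatch_gd(eta=0.01, B=4, epochs=10):
    """Mini-batch gradient descent on y ≈ w*x."""
    w = 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)        # shuffle: a fresh random order every epoch
        for start in range(0, n, B):
            batch = order[start:start + B]                                   # pick B examples as a batch
            w -= eta * 2 * np.mean((w * x[batch] - y[batch]) * x[batch])     # average their gradient
    return w

print(minibatch_gd())   # slope close to 1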

Gradient Descent Real Example: Handwriting Digit Classification (figure comparing full gradient descent with a batch size of 1 on this task)

Two Concerns There are two things to be concerned about. Optimization: Can we find the “best” parameter set θ* in a limited amount of time? Generalization: Is the “best” parameter set θ* good for the testing data as well?

Generalization You pick a “best” parameter set θ* using the training data. However, the training data and testing data may have different distributions, so a θ* that is best for the training data is not necessarily good for the testing data. (figure: training data vs. testing data)

Panacea Have more training data if possible …… or create more training data (?). In handwriting recognition, new examples can be created from the original training data, e.g. by shifting an image or rotating it by about 15°. In speech recognition, add noise or apply warping.
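A minimal sketch of creating extra training images, assuming 28×28 greyscale digits and using scipy.ndimage for a pixel shift and a 15° rotation; the specific offsets are illustrative choices, not taken from the slides.

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((28, 28))             # stand-in for one 28x28 handwritten digit

# Each created image keeps the label of the original one.
shifted = ndimage.shift(image, shift=(2, -1), mode="constant")              # move the digit a few pixels
rotated = ndimage.rotate(image, angle=15, reshape=False, mode="constant")   # rotate by 15 degrees

augmented = [image, shifted, rotated]
print([a.shape for a in augmented])      # all stay (28, 28)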

Reference Chapter 3 of Neural Networks and Deep Learning: http://neuralnetworksanddeeplearning.com/chap3.html

Appendix
A lot of references: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Recommended_Readings
Who is afraid of the non-convex function? http://videolectures.net/eml07_lecun_wia/
Can the structure also be learned? I do not know. Maybe Bayesian.
The story of the cat: https://www.google.com.tw/search?safe=off&biw=1242&bih=585&tbm=isch&sa=1&q=deep+feedforward+neural+network&oq=deep+feedforward+neural+network&gs_l=img.3...1794.8632.0.8774.16.15.0.1.1.0.62.840.15.15.0.msedr...0...1c.1.61.img..7.9.471.9ANzFYArdWc#imgdii=_&imgrc=R42kdSG1GGHbLM%253A%3BgPo1PHNnvXx63M%3Bhttps%253A%252F%252Fdreamtolearn.com%252Finternal%252Fdoc-asset%252F331R5HIW3VFPQPH8KUXGUTM2Y%252Fdeep3.png%3Bhttps%253A%252F%252Fdreamtolearn.com%252Fryan%252Fdata_analytics_viz%252F74%3B696%3B412
Rprop, Momentum: https://www.youtube.com/watch?v=Cy2g9_hR-5Y&list=PL29C61214F2146796&index=8 (OK, very good gradient)
Minecraft……: https://www.youtube.com/watch?v=zi6gdjLLSPE
Supplement?: B-diagram, NN VC dim

Overfitting A function that performs well on the training data does not necessarily perform well on the testing data: the picked hypothesis fits the training data well but fails to generalize to the testing data. (figure: training data vs. testing data) A different view of the problem comes from regularization: http://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization Overfitting in our daily life: memorizing the answers of previous examples ……
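As a minimal sketch of the regularization idea mentioned in the link (the weights, cost value, and λ below are made up): a penalty on the size of the weights is simply added to the training cost.

import numpy as np

w = np.array([0.8, -0.3, 0.05, 1.2])     # made-up weight vector
data_cost = 0.42                          # made-up cost on the training data
lam = 0.01                                # regularization strength

l2_cost = data_cost + lam * np.sum(w ** 2)      # L2: discourages large weights smoothly
l1_cost = data_cost + lam * np.sum(np.abs(w))   # L1: tends to push weights to exactly zero
print(l2_cost, l1_cost)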

A joke about overfitting: http://xkcd.com/1122/

Initialization For gradient descent, we need to pick an initial parameter set θ0. Do not set all the parameters in θ0 equal, or your parameters will stay equal to one another no matter how many times you update them. Pick θ0 randomly. If the previous layer has more neurons, the initialization values should be smaller, e.g. scaled according to the number of neurons Nl-1 in the previous layer.
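A minimal sketch of such an initialization; scaling the weights by 1/√Nl-1 is one common choice consistent with this rule of thumb, not necessarily the exact scheme intended on the slide.

import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Random weights for a layer whose previous layer has n_in (= Nl-1) neurons.
    Dividing by sqrt(n_in) keeps the summed input to each neuron at a similar size
    no matter how many neurons feed into it."""
    W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

# The more neurons in the previous layer, the smaller the individual weights.
for n_in in (10, 100, 1000):
    W, _ = init_layer(n_in, 50)
    print(n_in, round(W.std(), 3))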

MNIST The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images.
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
http://yann.lecun.com/exdb/mnist/
http://www.deeplearning.net/tutorial/gettingstarted.html
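A minimal sketch for reading the raw MNIST files from the yann.lecun.com page above (the idx format documented there); it assumes the .gz files have already been downloaded into the working directory.

import gzip
import numpy as np

def load_images(path):
    """Parse an MNIST image file such as train-images-idx3-ubyte.gz."""
    with gzip.open(path, "rb") as f:
        magic, n, rows, cols = np.frombuffer(f.read(16), dtype=">i4")
        assert magic == 2051                       # magic number of the image files
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    return pixels.reshape(n, rows, cols)           # (60000, 28, 28) for the training set

def load_labels(path):
    """Parse an MNIST label file such as train-labels-idx1-ubyte.gz."""
    with gzip.open(path, "rb") as f:
        magic, n = np.frombuffer(f.read(8), dtype=">i4")
        assert magic == 2049                       # magic number of the label files
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    return labels

train_x = load_images("train-images-idx3-ubyte.gz")
train_y = load_labels("train-labels-idx1-ubyte.gz")
print(train_x.shape, train_y.shape)                # (60000, 28, 28) (60000,)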

MNIST The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence.

Early Stopping (figure: error as a function of training iterations)
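A minimal sketch of the idea behind the figure: keep training while the error on a held-out validation set improves, and stop once it has not improved for a few epochs. The callables passed in are placeholders, not functions from any particular library.

def train_with_early_stopping(train_one_epoch, validation_error, patience=5, max_epochs=200):
    """Stop once the validation error has not improved for `patience` consecutive epochs."""
    best_err = float("inf")
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch()                 # placeholder: one pass of (mini-batch) gradient descent
        err = validation_error()          # placeholder: error on held-out validation data
        if err < best_err:
            best_err = err                # still improving: remember the best result so far
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                     # training error may keep dropping, but we stop here
    return best_err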

Difficulty of Deep Networks The lower layers cannot plan.