CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning Fall 2018/19 4. Deep Neural Networks (Some figures adapted from NNDL book) Noriko Tomuro

Various Approaches to Improve Neural Networks
- Deep Neural Networks: Concepts, Principles
- Challenges in Deep Neural Networks:
  - Overfitting
  - Long computation time
  - Vanishing gradient
  - Hyper-parameter tuning

1 Deep Neural Networks
Single vs. multilayer neural networks: deep networks (with a larger number of layers) are generally more expressive/powerful.

2 Challenges in Deep Networks
Though powerful, deep neural networks have many challenges, especially in training. Notable ones are:
- Overfitting
- Long computation time
- Vanishing gradient
- Too many hyper-parameters to tune

2.1 Overfitting
Deep neural networks are naturally prone to overfitting because of their very large number of nodes and parameters (i.e., degrees of freedom). To overcome overfitting, several techniques have been proposed, including:
- Regularization (e.g., L2, L1)
- Dropout
- Early stopping
- Sampling and/or clipping of training data
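
As an illustration (not from the slides), here is a minimal sketch combining the first three techniques, assuming TensorFlow/Keras; the layer sizes, regularization coefficient, and the randomly generated stand-in data are placeholders.

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for a real dataset (placeholders, not from the slides).
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 10, size=1000)
x_val, y_val = rng.normal(size=(200, 20)), rng.integers(0, 10, size=200)

# L2 weight penalties + dropout, two common defenses against overfitting,
# in one small fully connected network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of activations during training
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: quit when validation loss has not improved for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop], verbose=0)
```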

2.2 Computation Time
In addition to the large network size (the number of layers and units per layer), which already takes time to process, deep neural networks usually have many hyper-parameters (e.g., the learning rate η and the regularization parameter λ). Poor parameter values (including initial weights) can make computation even longer by preventing the weights from converging. There are ways to speed up learning:
- Mini-batching
- Early stopping
- Better parameter selection
- Hardware support, e.g., GPUs
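
Mini-batching is easy to see in code. The following NumPy sketch (an illustration, not from the slides) shows a generic mini-batch training loop; the train_step function, batch size, and data arrays are assumed placeholders.

```python
import numpy as np

def minibatch_epochs(X, y, train_step, batch_size=32, epochs=10, seed=0):
    """Run `epochs` passes over (X, y), calling train_step on each mini-batch.

    train_step(X_batch, y_batch) is assumed to compute gradients on the batch
    and apply one weight update (e.g., one SGD step).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            train_step(X[idx], y[idx])        # one weight update per mini-batch
```

Compared with full-batch gradient descent, each pass over the data now yields roughly n/batch_size weight updates instead of one, which usually reaches a good solution in far less wall-clock time.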

2.3 Vanishing Gradient
Observation: early hidden layers learn much more slowly than later hidden layers. Reason: in backpropagation the gradient is propagated backwards from the output layer to the input layer, so the early layers receive only a fraction of the gradient (vanishing gradient); in some cases the gradient instead grows as it is propagated backwards (exploding gradient). In general, the gradient is unstable in deep networks. For NNDL's simple chain of one sigmoid neuron per layer, the gradient with respect to the bias in layer l is

∂C/∂b^l = σ'(z^l) · w^(l+1) σ'(z^(l+1)) · ... · w^L σ'(z^L) · ∂C/∂a^L,

a product containing one w·σ'(z) factor for every later layer.
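
To make this instability concrete, the NumPy sketch below (an illustration, not from the slides) multiplies together the w·σ'(z) factors of a deep chain; with weights drawn in (0, 1) each factor is at most 0.25·|w| < 1, so the product, and hence the early-layer gradient, shrinks roughly exponentially with depth.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
L = 30                                   # depth of the chain
w = rng.uniform(0.0, 1.0, size=L)        # weights drawn in (0, 1)
z = rng.normal(0.0, 1.0, size=L)         # weighted inputs at each layer

factors = w * sigmoid_prime(z)           # one w * sigma'(z) factor per layer
grad_scale = np.cumprod(factors[::-1])   # product seen by layers L, L-1, ..., 1

# grad_scale[0] multiplies the last layer's gradient; grad_scale[-1] multiplies
# the earliest layer's and is typically many orders of magnitude smaller.
print(grad_scale[0], grad_scale[-1])
```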

The derivative of the sigmoid function σ(z) = 1 / (1 + e^(−z)) is σ'(z) = σ(z)·(1 − σ(z)). It attains its maximum value of 0.25 at z = 0 and approaches 0 when z is a large positive or large negative number. Each σ'(z) factor in the backpropagated product is therefore at most 0.25, and the weights are typically small as well (especially if the initial weights were drawn between 0 and 1), so each |w·σ'(z)| factor is less than 1 and the gradient shrinks with every layer it passes through: the vanishing gradient. Conversely, if the weights are large while the weighted inputs z are kept close to 0 (so that σ'(z) stays near its maximum), the factors can exceed 1 and the gradient grows as it is propagated backwards: the exploding gradient.
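
A quick numerical check of these derivative values (an illustrative sketch, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# The derivative peaks at 0.25 when z = 0 and decays toward 0 for large |z|.
for z in [0.0, 2.0, 5.0, 10.0, -10.0]:
    print(f"z = {z:6.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")
# z =    0.0   sigma'(z) = 0.250000
# z =   10.0   sigma'(z) = 0.000045   (similarly small for z = -10.0)
```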

Some Techniques to Overcome Vanishing Gradient
In deep networks trained by gradient-based cost minimization, the vanishing gradient is very difficult to circumvent unless the weights and biases happen to balance out. Some techniques that help:
- Cross-entropy cost function
- Regularization
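
Why the cross-entropy cost helps (an illustrative note based on NNDL Ch. 3, not spelled out on this slide): with a sigmoid output neuron, the quadratic cost gives an output error term δ^L = (a − y)·σ'(z), which is tiny when the neuron saturates, whereas the cross-entropy cost gives δ^L = a − y with no σ'(z) factor, so at least the output layer keeps learning even when saturated. A minimal numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = 8.0, 0.0            # a badly saturated output neuron with the wrong answer
a = sigmoid(z)             # activation close to 1, target is 0

quadratic_delta = (a - y) * a * (1.0 - a)   # includes sigma'(z): nearly zero
cross_entropy_delta = a - y                 # no sigma'(z): stays close to 1

print(quadratic_delta, cross_entropy_delta)  # ~0.00033 vs ~0.9997
```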

2.4 Hyper-Parameter Tuning
Practical tips from NNDL Ch. 3:
- Strip the problem down: reduce the data (i.e., simplify the problem) and start with a simple network (i.e., simplify the architecture)
- Speed up testing by monitoring performance frequently
- Try rough variations of the parameters first; find a good threshold for the learning rate
- Stop training when performance is good (or shows no improvement)
- Use an adaptive learning rate schedule
Several algorithms have also been applied to search systematically for good hyper-parameter values:
- Grid search
- Optimization techniques such as Monte Carlo methods and genetic algorithms
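
As an illustration of grid search (not from the slides; the parameter values and the train_and_evaluate helper are hypothetical placeholders), the sketch below trains one model per combination of learning rate and regularization strength and keeps the best validation score.

```python
import itertools

def train_and_evaluate(learning_rate, weight_decay):
    # Placeholder: in a real experiment this would train a network with the
    # given hyper-parameters and return its validation accuracy.
    return 1.0 / (1.0 + abs(learning_rate - 0.1) + 10 * weight_decay)  # dummy score

learning_rates = [0.3, 0.1, 0.03, 0.01]
weight_decays = [0.0, 1e-4, 1e-3]

best_score, best_params = -float("inf"), None
for lr, wd in itertools.product(learning_rates, weight_decays):
    score = train_and_evaluate(learning_rate=lr, weight_decay=wd)
    if score > best_score:
        best_score, best_params = score, (lr, wd)

print("best (learning rate, weight decay):", best_params, "score:", best_score)
```

Grid search is exhaustive and easy to parallelize, but its cost grows multiplicatively with each added hyper-parameter, which is why randomized and optimization-based searches (e.g., Monte Carlo methods, genetic algorithms) are often used instead for larger search spaces.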