CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning Fall 2018/19 4. Deep Neural Networks (Some figures adapted from NNDL book) Noriko Tomuro

Various Approaches to Improve Neural Networks
- Deep Neural Networks: Concepts, Principles
- Challenges in Deep Neural Networks:
  - Overfitting
  - Long computation time
  - Vanishing gradient
  - Hyper-parameter tuning

1 Deep Neural Networks
Single vs. multilayer neural networks: deep networks (with a larger number of layers) are generally more expressive/powerful.

2 Challenges in Deep Networks
Though powerful, deep neural networks have many challenges, especially in training. Notable ones are:
- Overfitting
- Long computation time
- Vanishing gradient
- Too many hyper-parameters to tune

2.1 Overfitting
Deep neural networks are naturally prone to overfitting because of their very large number of nodes and parameters (i.e., degrees of freedom). To overcome overfitting, several techniques have been proposed, including:
- Regularization (e.g., L2, L1)
- Dropout
- Early stopping
- Sampling and/or clipping of training data
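
As an illustration (not from the slides), here is a minimal sketch combining the first three techniques, assuming TensorFlow/Keras; the layer sizes, regularization coefficient, and the randomly generated stand-in data are placeholders.

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for a real dataset (placeholders, not from the slides).
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(1000, 20)), rng.integers(0, 10, size=1000)
x_val, y_val = rng.normal(size=(200, 20)), rng.integers(0, 10, size=200)

# L2 weight penalties + dropout, two common defenses against overfitting,
# in one small fully connected network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),   # randomly zero 50% of activations during training
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: quit when validation loss has not improved for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop], verbose=0)
```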

2.2 Computation Time
In addition to the large network size (the number of layers and units per layer), which already takes time to process, deep neural networks usually have many hyper-parameters (e.g., the learning rate η and the regularization parameter λ). Poor parameter values (including initial weights) can make computation even longer by preventing the weights from converging. There are ways to speed up learning:
- Mini-batching
- Early stopping
- Better parameter selection
- Hardware support, e.g., GPUs
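
Mini-batching is easy to see in code. The following NumPy sketch (an illustration, not from the slides) shows a generic mini-batch training loop; the train_step function, batch size, and data arrays are assumed placeholders.

```python
import numpy as np

def minibatch_epochs(X, y, train_step, batch_size=32, epochs=10, seed=0):
    """Run `epochs` passes over (X, y), calling train_step on each mini-batch.

    train_step(X_batch, y_batch) is assumed to compute gradients on the batch
    and apply one weight update (e.g., one SGD step).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            train_step(X[idx], y[idx])        # one weight update per mini-batch
```

Compared with full-batch gradient descent, each pass over the data now yields roughly n/batch_size weight updates instead of one, which usually reaches a good solution in far less wall-clock time.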

2.3 Vanishing Gradient
Observation: early hidden layers learn much more slowly than later hidden layers. Reason: in backpropagation the gradient is propagated backwards from the output layer to the input layer, so the early layers receive only a fraction of the gradient (vanishing gradient); in some cases the gradient instead grows as it is propagated backwards (exploding gradient). In general, the gradient is unstable in deep networks. For NNDL's simple chain of one sigmoid neuron per layer, the gradient with respect to the bias in layer l is

∂C/∂b^l = σ'(z^l) · w^(l+1) σ'(z^(l+1)) · ... · w^L σ'(z^L) · ∂C/∂a^L,

a product containing one w·σ'(z) factor for every later layer.
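
To make this instability concrete, the NumPy sketch below (an illustration, not from the slides) multiplies together the w·σ'(z) factors of a deep chain; with weights drawn in (0, 1) each factor is at most 0.25·|w| < 1, so the product, and hence the early-layer gradient, shrinks roughly exponentially with depth.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
L = 30                                   # depth of the chain
w = rng.uniform(0.0, 1.0, size=L)        # weights drawn in (0, 1)
z = rng.normal(0.0, 1.0, size=L)         # weighted inputs at each layer

factors = w * sigmoid_prime(z)           # one w * sigma'(z) factor per layer
grad_scale = np.cumprod(factors[::-1])   # product seen by layers L, L-1, ..., 1

# grad_scale[0] multiplies the last layer's gradient; grad_scale[-1] multiplies
# the earliest layer's and is typically many orders of magnitude smaller.
print(grad_scale[0], grad_scale[-1])
```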

The derivative of the sigmoid function σ(z) = 1 / (1 + e^(−z)) is σ'(z) = σ(z)·(1 − σ(z)). It attains its maximum value of 0.25 at z = 0 and approaches 0 when z is a large positive or large negative number. Each σ'(z) factor in the backpropagated product is therefore at most 0.25, and the weights are typically small as well (especially if the initial weights were drawn between 0 and 1), so each |w·σ'(z)| factor is less than 1 and the gradient shrinks with every layer it passes through: the vanishing gradient. Conversely, if the weights are large while the weighted inputs z are kept close to 0 (so that σ'(z) stays near its maximum), the factors can exceed 1 and the gradient grows as it is propagated backwards: the exploding gradient.
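
A quick numerical check of these derivative values (an illustrative sketch, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# The derivative peaks at 0.25 when z = 0 and decays toward 0 for large |z|.
for z in [0.0, 2.0, 5.0, 10.0, -10.0]:
    print(f"z = {z:6.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")
# z =    0.0   sigma'(z) = 0.250000
# z =   10.0   sigma'(z) = 0.000045   (similarly small for z = -10.0)
```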

Some Techniques to Overcome Vanishing Gradient
In deep networks trained by gradient-based cost minimization, the vanishing gradient is very difficult to circumvent unless the weights and biases happen to balance out. Some techniques that help:
- Cross-entropy cost function
- Regularization
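
Why the cross-entropy cost helps (an illustrative note based on NNDL Ch. 3, not spelled out on this slide): with a sigmoid output neuron, the quadratic cost gives an output error term δ^L = (a − y)·σ'(z), which is tiny when the neuron saturates, whereas the cross-entropy cost gives δ^L = a − y with no σ'(z) factor, so at least the output layer keeps learning even when saturated. A minimal numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = 8.0, 0.0            # a badly saturated output neuron with the wrong answer
a = sigmoid(z)             # activation close to 1, target is 0

quadratic_delta = (a - y) * a * (1.0 - a)   # includes sigma'(z): nearly zero
cross_entropy_delta = a - y                 # no sigma'(z): stays close to 1

print(quadratic_delta, cross_entropy_delta)  # ~0.00033 vs ~0.9997
```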

2.4 Hyper-Parameter Tuning
Practical tips from NNDL Ch. 3:
- Strip the problem down: reduce the data (i.e., simplify the problem) and start with a simple network (i.e., simplify the architecture)
- Speed up testing by monitoring performance frequently
- Try rough variations of the parameters first; find a good threshold for the learning rate
- Stop training when performance is good (or shows no improvement)
- Use an adaptive learning rate schedule
Several algorithms have also been applied to search systematically for good hyper-parameter values:
- Grid search
- Optimization techniques such as Monte Carlo methods and genetic algorithms
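
As an illustration of grid search (not from the slides; the parameter values and the train_and_evaluate helper are hypothetical placeholders), the sketch below trains one model per combination of learning rate and regularization strength and keeps the best validation score.

```python
import itertools

def train_and_evaluate(learning_rate, weight_decay):
    # Placeholder: in a real experiment this would train a network with the
    # given hyper-parameters and return its validation accuracy.
    return 1.0 / (1.0 + abs(learning_rate - 0.1) + 10 * weight_decay)  # dummy score

learning_rates = [0.3, 0.1, 0.03, 0.01]
weight_decays = [0.0, 1e-4, 1e-3]

best_score, best_params = -float("inf"), None
for lr, wd in itertools.product(learning_rates, weight_decays):
    score = train_and_evaluate(learning_rate=lr, weight_decay=wd)
    if score > best_score:
        best_score, best_params = score, (lr, wd)

print("best (learning rate, weight decay):", best_params, "score:", best_score)
```

Grid search is exhaustive and easy to parallelize, but its cost grows multiplicatively with each added hyper-parameter, which is why randomized and optimization-based searches (e.g., Monte Carlo methods, genetic algorithms) are often used instead for larger search spaces.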