Neural Networks II Chen Gao Virginia Tech ECE-5424G / CS-5824 Spring 2019

Neural Networks. Origins: algorithms that try to mimic the brain. What is this? Two ears, two eyes, color, coat of hair → puppy. Edges, simple patterns → ear/eye. Low-level features → high-level features.

A single neuron in the brain: inputs arrive; if the neuron is activated, it produces an output (its activation). Slide credit: Andrew Ng

An artificial neuron: Logistic unit. "Input" $x = [x_0\;\; x_1\;\; x_2\;\; x_3]^\top$, where $x_0$ is the "bias unit"; "weights"/"parameters" $\theta = [\theta_0\;\; \theta_1\;\; \theta_2\;\; \theta_3]^\top$. "Output", using the sigmoid (logistic) activation function: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$. Slide credit: Andrew Ng
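
To make the logistic unit concrete, here is a minimal NumPy sketch; the feature values and weights are made-up numbers, not anything from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# x_0 = 1 is the bias unit; x_1..x_3 are input features (made-up values).
x = np.array([1.0, 0.5, -1.2, 3.0])
# theta_0..theta_3 are the weights/parameters (made-up values).
theta = np.array([0.1, -0.4, 0.2, 0.05])

# h_theta(x) = g(theta^T x)
h = sigmoid(theta @ x)
print(h)  # a value in (0, 1)
```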

Visualization of weights, bias, activation function: the range of the output is determined by $g(\cdot)$; the bias $b$ only changes the position of the hyperplane. Slide credit: Hugo Larochelle

Activation - sigmoid. Squashes the neuron's pre-activation between 0 and 1; always positive; bounded; strictly increasing. $g(x) = \frac{1}{1 + e^{-x}}$. Slide credit: Hugo Larochelle

Activation - hyperbolic tangent (tanh). Squashes the neuron's pre-activation between -1 and 1; can be positive or negative; bounded; strictly increasing. $g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Slide credit: Hugo Larochelle

Activation - rectified linear (ReLU). Bounded below by 0 (always non-negative); not upper bounded; tends to give neurons with sparse activities. $g(x) = \mathrm{relu}(x) = \max(0, x)$. Slide credit: Hugo Larochelle

Activation - softmax. For multi-class classification we need multiple outputs (1 output per class) and we would like to estimate the conditional probability $p(y = c \mid x)$. We use the softmax activation function at the output: $g(x) = \mathrm{softmax}(x) = \left[\frac{e^{x_1}}{\sum_c e^{x_c}}\;\cdots\;\frac{e^{x_C}}{\sum_c e^{x_c}}\right]$. Slide credit: Hugo Larochelle
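
All four activation functions above are a few lines of NumPy each; the sketch below is one straightforward version (the max-subtraction in softmax is a standard numerical-stability precaution, not something stated on the slide):

```python
import numpy as np

def sigmoid(x):
    # Squashes the pre-activation into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes the pre-activation into (-1, 1).
    return np.tanh(x)

def relu(x):
    # Non-negative, not upper bounded.
    return np.maximum(0.0, x)

def softmax(x):
    # One output per class; outputs are positive and sum to 1,
    # so they can be read as p(y = c | x).
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, -2.0, 0.5])
print(sigmoid(z), tanh(z), relu(z), softmax(z))
```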

Universal approximation theorem: "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units" (Hornik, 1991). Slide credit: Hugo Larochelle

Neural network – Multilayer. Layer 1 (input): $x_0, x_1, x_2, x_3$; Layer 2 (hidden): $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; Layer 3 ("output"): $h_\Theta(x)$. Slide credit: Andrew Ng

Neural network. $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$; $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$. If there are $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, the size of $\Theta^{(j)}$ is $s_{j+1} \times (s_j + 1)$.
$a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)$
$a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)$
$a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)$
$h_\Theta(x) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$
Slide credit: Andrew Ng

Neural network - "Pre-activation". With $x = [x_0\;\; x_1\;\; x_2\;\; x_3]^\top$ and $z^{(2)} = [z_1^{(2)}\;\; z_2^{(2)}\;\; z_3^{(2)}]^\top$:
$a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) = g(z_1^{(2)})$
$a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) = g(z_2^{(2)})$
$a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3) = g(z_3^{(2)})$
$h_\Theta(x) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}) = g(z^{(3)})$
Why do we need $g(\cdot)$? Because it is non-linear: without it, a stack of layers collapses into a single linear map. Slide credit: Andrew Ng

Neural network - "Pre-activation", vectorized. With $x = [x_0\;\; x_1\;\; x_2\;\; x_3]^\top$ and $z^{(2)} = [z_1^{(2)}\;\; z_2^{(2)}\;\; z_3^{(2)}]^\top$:
$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)} = 1$), i.e. $a_i^{(2)} = g(z_i^{(2)})$
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$h_\Theta(x) = a^{(3)} = g(z^{(3)})$
Slide credit: Andrew Ng
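
A minimal sketch of this vectorized forward pass for the 3-layer network above, assuming sigmoid activations and small random weight matrices of the sizes $s_{j+1} \times (s_j + 1)$ given earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.1, 0.1, size=(3, 4))  # layer 1 (3 inputs + bias) -> layer 2 (3 units)
Theta2 = rng.uniform(-0.1, 0.1, size=(1, 4))  # layer 2 (3 units + bias) -> output

x = np.array([0.5, -1.2, 3.0])                # x_1..x_3 (made-up values)

a1 = np.concatenate(([1.0], x))               # add the bias unit x_0 = 1
z2 = Theta1 @ a1                              # z^(2) = Theta^(1) a^(1)
a2 = np.concatenate(([1.0], sigmoid(z2)))     # a^(2) = g(z^(2)), add a_0^(2) = 1
z3 = Theta2 @ a2                              # z^(3) = Theta^(2) a^(2)
h  = sigmoid(z3)                              # h_Theta(x) = a^(3) = g(z^(3))
print(h)
```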

Flow graph - Forward propagation: $x \to z^{(2)} \to a^{(2)} \to z^{(3)} \to a^{(3)} = h_\Theta(x)$, with parameters $(W^{(1)}, b^{(1)})$ feeding $z^{(2)}$ and $(W^{(2)}, b^{(2)})$ feeding $z^{(3)}$. In equations: $z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$; $a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)} = 1$); $z^{(3)} = \Theta^{(2)} a^{(2)}$; $h_\Theta(x) = a^{(3)} = g(z^{(3)})$. How do we evaluate our prediction?

Cost function.
Logistic regression: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
Neural network ($K$ output units): $J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \big[ y_k^{(i)} \log (h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k) \big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$
Slide credit: Andrew Ng
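
As an illustration, here is a sketch of the unregularized part of the neural-network cost above for a batch of outputs; the `H` and `Y` arrays are made-up values just to exercise the function:

```python
import numpy as np

def cross_entropy_cost(H, Y, eps=1e-12):
    """
    Unregularized neural-network cost
    J = -(1/m) * sum_i sum_k [ y_k log h_k + (1 - y_k) log(1 - h_k) ].
    H: (m, K) network outputs in (0, 1); Y: (m, K) one-hot / binary labels.
    """
    H = np.clip(H, eps, 1.0 - eps)       # avoid log(0)
    m = Y.shape[0]
    return -np.sum(Y * np.log(H) + (1.0 - Y) * np.log(1.0 - H)) / m

H = np.array([[0.9, 0.1], [0.2, 0.8]])   # made-up network outputs
Y = np.array([[1.0, 0.0], [0.0, 1.0]])   # made-up labels
print(cross_entropy_cost(H, Y))
```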

Gradient computation. Need to compute: the cost $J(\Theta)$ and its partial derivatives $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$. Slide credit: Andrew Ng

Gradient computation. Given one training example $(x, y)$, forward propagation:
$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)} a^{(1)}$, $a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)
$z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g(z^{(4)}) = h_\Theta(x)$
Slide credit: Andrew Ng

Gradient computation: Backpropagation. Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$. For each output unit (layer $L = 4$): $\delta^{(4)} = a^{(4)} - y$. Working backwards with the chain rule,
$\delta^{(3)} = \delta^{(4)} \, \frac{\partial a^{(4)}}{\partial z^{(4)}} \, \frac{\partial z^{(4)}}{\partial a^{(3)}} \, \frac{\partial a^{(3)}}{\partial z^{(3)}} = (\Theta^{(3)})^\top \delta^{(4)} \odot g'(z^{(4)}) \odot g'(z^{(3)})$,
where $\odot$ denotes the element-wise product and $z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$, $z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g(z^{(4)})$. Slide credit: Andrew Ng
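
The same recursion is short in code. Below is a minimal NumPy sketch with sigmoid activations and random placeholder weights and data; it uses the common simplification in which $\delta^{(4)} = a^{(4)} - y$ is already the error at the output pre-activation (as with the cross-entropy cost above), so only $g'(z^{(3)})$ appears in the hidden-layer delta:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)

rng = np.random.default_rng(0)
Theta2 = rng.uniform(-0.1, 0.1, size=(3, 4))   # layer 2 -> layer 3
Theta3 = rng.uniform(-0.1, 0.1, size=(1, 4))   # layer 3 -> layer 4 (output)

a2 = np.concatenate(([1.0], sigmoid(rng.normal(size=3))))  # placeholder a^(2), with bias
z3 = Theta2 @ a2
a3 = np.concatenate(([1.0], sigmoid(z3)))                  # a^(3), with bias a_0^(3) = 1
z4 = Theta3 @ a3
a4 = sigmoid(z4)                                           # a^(4) = h_Theta(x)
y  = np.array([1.0])                                       # placeholder label

delta4 = a4 - y                                            # output-layer "error"
# Back-propagate: (Theta^(3))^T delta^(4), drop the bias component, then multiply by g'(z^(3)).
delta3 = (Theta3.T @ delta4)[1:] * sigmoid_grad(z3)
print(delta4, delta3)
```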

Backpropagation algorithm. Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$. Initialize the weights $\Theta^{(l)}$ to small random values (see the initialization slide below).
For $i = 1$ to $m$:
  Set $a^{(1)} = x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
  Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
  Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
  Update $\Theta^{(l)} := \Theta^{(l)} - \alpha\, \delta^{(l+1)} (a^{(l)})^\top$ (the gradient for layer $l$ is the outer product $\delta^{(l+1)} (a^{(l)})^\top$)
Slide credit: Andrew Ng
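
Putting forward propagation, the deltas, and the per-example update together, here is a sketch of a stochastic-gradient version of this procedure on a made-up XOR-style toy set, with one hidden layer and sigmoid activations (the learning rate, layer sizes, and epoch count are arbitrary choices for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary dataset (made up): 4 examples, 2 features, XOR-style labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.5, 0.5, size=(4, 3))     # layer 1 (2 inputs + bias) -> 4 hidden units
Theta2 = rng.uniform(-0.5, 0.5, size=(1, 5))     # hidden (4 units + bias) -> 1 output
alpha = 0.5                                      # learning rate (arbitrary)

for epoch in range(5000):
    for x, y in zip(X, Y):
        # Forward propagation
        a1 = np.concatenate(([1.0], x))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)
        # Backpropagation
        delta3 = a3 - y                                              # output error
        delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))
        # Update: the gradient for a layer is the outer product delta^(l+1) (a^(l))^T
        Theta2 -= alpha * np.outer(delta3, a2)
        Theta1 -= alpha * np.outer(delta2, a1)

# After training, the predictions should be close to the XOR labels on this toy set.
for x in X:
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
    print(x, sigmoid(Theta2 @ a2)[0])
```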

Activation - sigmoid: partial derivative. For $g(x) = \frac{1}{1 + e^{-x}}$, $g'(x) = g(x)\,(1 - g(x))$. Slide credit: Hugo Larochelle

Activation - hyperbolic tangent (tanh): partial derivative. For $g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, $g'(x) = 1 - g(x)^2$. Slide credit: Hugo Larochelle

Activation - rectified linear (ReLU): partial derivative. For $g(x) = \mathrm{relu}(x) = \max(0, x)$, $g'(x) = \mathbf{1}_{x > 0}$ (1 if $x > 0$, else 0). Slide credit: Hugo Larochelle
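
These three derivatives translate directly into code; a small sketch, written in terms of the pre-activation $x$ as on the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    g = sigmoid(x)
    return g * (1.0 - g)          # g'(x) = g(x)(1 - g(x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # g'(x) = 1 - g(x)^2

def relu_grad(x):
    return (x > 0).astype(float)  # g'(x) = 1 if x > 0, else 0

x = np.array([-2.0, 0.5, 3.0])
print(sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```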

Initialization. For the biases: initialize all to 0. For the weights: we can't initialize all weights to the same value, since one can show that all hidden units in a layer would then always behave the same; we need to break symmetry. Recipe: sample from a uniform distribution $U[-b, b]$; the idea is to sample around 0 while breaking symmetry. Slide credit: Hugo Larochelle
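
A minimal sketch of this recipe: zero biases and weights drawn from $U[-b, b]$. The specific choice of $b$ below (based on the fan-in and fan-out) is just one common heuristic and is an assumption, not something fixed by the slide:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Zero biases; weights sampled from U[-b, b] to break symmetry."""
    b = np.sqrt(6.0 / (n_in + n_out))           # one common heuristic for b (assumption)
    W = rng.uniform(-b, b, size=(n_out, n_in))  # small values around 0, all different
    bias = np.zeros(n_out)                      # biases can safely start at 0
    return W, bias

rng = np.random.default_rng(0)
W1, b1 = init_layer(3, 4, rng)
W2, b2 = init_layer(4, 1, rng)
print(W1.shape, b1, W2.shape, b2)
```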

Putting it together: pick a network architecture. Number of input units: dimension of the features. Number of output units: number of classes. Reasonable default: 1 hidden layer, or, if more than 1 hidden layer, the same number of hidden units in every layer (usually the more the better). Tune the remaining choices by grid search. Slide credit: Hugo Larochelle

Putting it together: early stopping. Use validation-set performance to select the best configuration. To select the number of epochs, stop training when the validation-set error starts to increase. Slide credit: Hugo Larochelle
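
A sketch of early stopping as described above; `train_one_epoch` and `validation_error` are hypothetical helpers standing in for whatever training and evaluation code is already in place:

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_error,
                            max_epochs=200, patience=5):
    """Stop training when the validation error stops improving.

    train_one_epoch(model)  -> trains the model in place for one epoch (hypothetical helper)
    validation_error(model) -> scalar error on a held-out validation set (hypothetical helper)
    """
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation error has stopped improving
    return best_model, best_err
```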

Other tricks of the trade. Normalize your (real-valued) data. Decay the learning rate: as we get closer to the optimum, it makes sense to take smaller update steps. Mini-batches: averaging the gradient over a small batch gives a more accurate estimate of the risk gradient. Momentum: use an exponential average of previous gradients. Slide credit: Hugo Larochelle
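
A sketch of two of these tricks, learning-rate decay and momentum, on a toy quadratic objective; the decay schedule and the momentum coefficient are arbitrary example values:

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr, beta=0.9):
    """Momentum: keep an exponential moving average of past gradients."""
    velocity = beta * velocity + (1.0 - beta) * grad
    theta = theta - lr * velocity
    return theta, velocity

theta = np.zeros(3)
velocity = np.zeros(3)
lr0 = 0.1

for t in range(1, 101):
    grad = 2 * theta - np.array([1.0, -2.0, 0.5])   # gradient of a toy quadratic
    lr = lr0 / (1.0 + 0.01 * t)                     # decay the learning rate over time
    theta, velocity = sgd_momentum_step(theta, grad, velocity, lr)

print(theta)   # approaches the minimizer [0.5, -1.0, 0.25]
```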

Dropout. Idea: "cripple" the neural network by randomly removing hidden units during training; each hidden unit is set to 0 with probability 0.5. Hidden units then cannot co-adapt to other units and must be more generally useful. Slide credit: Hugo Larochelle
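
A sketch of dropout applied to a layer of hidden activations during training: each unit is zeroed with probability 0.5, as described on the slide (the "inverted dropout" scaling in the code is a common convention so that no rescaling is needed at test time, and is an assumption beyond the slide):

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=None):
    """Set each hidden unit to 0 with probability p_drop (training only)."""
    if not train:
        return activations                       # no dropout at test time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    # Inverted dropout: scale the kept units so the expected activation is unchanged.
    return activations * mask / (1.0 - p_drop)

a2 = np.array([0.3, 1.2, 0.7, 0.05])             # some hidden-layer activations
print(dropout(a2))                               # roughly half the units zeroed out
print(dropout(a2, train=False))                  # unchanged at test time
```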