Neural Networks II Chen Gao Virginia Tech ECE-5424G / CS-5824 Spring 2019

Neural Networks. Origins: algorithms that try to mimic the brain. What is this? Two ears, two eyes, color, coat of hair → puppy. Edges, simple patterns → ear/eye. Low-level features → high-level features.

A single neuron in the brain: inputs arrive; if the neuron is activated, it produces an output (its activation). Slide credit: Andrew Ng

An artificial neuron: Logistic unit. "Input" $x = [x_0\;\; x_1\;\; x_2\;\; x_3]^\top$, where $x_0$ is the "bias unit"; "weights"/"parameters" $\theta = [\theta_0\;\; \theta_1\;\; \theta_2\;\; \theta_3]^\top$. "Output", using the sigmoid (logistic) activation function: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$. Slide credit: Andrew Ng
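
To make the logistic unit concrete, here is a minimal NumPy sketch; the feature values and weights are made-up numbers, not anything from the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# x_0 = 1 is the bias unit; x_1..x_3 are input features (made-up values).
x = np.array([1.0, 0.5, -1.2, 3.0])
# theta_0..theta_3 are the weights/parameters (made-up values).
theta = np.array([0.1, -0.4, 0.2, 0.05])

# h_theta(x) = g(theta^T x)
h = sigmoid(theta @ x)
print(h)  # a value in (0, 1)
```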

Visualization of weights, bias, activation function: the range of the output is determined by $g(\cdot)$; the bias $b$ only changes the position of the hyperplane. Slide credit: Hugo Larochelle

Activation - sigmoid. Squashes the neuron's pre-activation between 0 and 1; always positive; bounded; strictly increasing. $g(x) = \frac{1}{1 + e^{-x}}$. Slide credit: Hugo Larochelle

Activation - hyperbolic tangent (tanh). Squashes the neuron's pre-activation between -1 and 1; can be positive or negative; bounded; strictly increasing. $g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Slide credit: Hugo Larochelle

Activation - rectified linear (ReLU). Bounded below by 0 (always non-negative); not upper bounded; tends to give neurons with sparse activities. $g(x) = \mathrm{relu}(x) = \max(0, x)$. Slide credit: Hugo Larochelle

Activation - softmax. For multi-class classification we need multiple outputs (1 output per class) and we would like to estimate the conditional probability $p(y = c \mid x)$. We use the softmax activation function at the output: $g(x) = \mathrm{softmax}(x) = \left[\frac{e^{x_1}}{\sum_c e^{x_c}}\;\cdots\;\frac{e^{x_C}}{\sum_c e^{x_c}}\right]$. Slide credit: Hugo Larochelle
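
All four activation functions above are a few lines of NumPy each; the sketch below is one straightforward version (the max-subtraction in softmax is a standard numerical-stability precaution, not something stated on the slide):

```python
import numpy as np

def sigmoid(x):
    # Squashes the pre-activation into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes the pre-activation into (-1, 1).
    return np.tanh(x)

def relu(x):
    # Non-negative, not upper bounded.
    return np.maximum(0.0, x)

def softmax(x):
    # One output per class; outputs are positive and sum to 1,
    # so they can be read as p(y = c | x).
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, -2.0, 0.5])
print(sigmoid(z), tanh(z), relu(z), softmax(z))
```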

Universal approximation theorem: "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units" (Hornik, 1991). Slide credit: Hugo Larochelle

Neural network – Multilayer. Layer 1 (input): $x_0, x_1, x_2, x_3$; Layer 2 (hidden): $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$; Layer 3 ("output"): $h_\Theta(x)$. Slide credit: Andrew Ng

Neural network. $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$; $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$. If there are $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, the size of $\Theta^{(j)}$ is $s_{j+1} \times (s_j + 1)$.
$a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)$
$a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)$
$a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)$
$h_\Theta(x) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$
Slide credit: Andrew Ng

Neural network - "Pre-activation". With $x = [x_0\;\; x_1\;\; x_2\;\; x_3]^\top$ and $z^{(2)} = [z_1^{(2)}\;\; z_2^{(2)}\;\; z_3^{(2)}]^\top$:
$a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) = g(z_1^{(2)})$
$a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) = g(z_2^{(2)})$
$a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3) = g(z_3^{(2)})$
$h_\Theta(x) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}) = g(z^{(3)})$
Why do we need $g(\cdot)$? Because it is non-linear: without it, a stack of layers collapses into a single linear map. Slide credit: Andrew Ng

Neural network - "Pre-activation", vectorized. With $x = [x_0\;\; x_1\;\; x_2\;\; x_3]^\top$ and $z^{(2)} = [z_1^{(2)}\;\; z_2^{(2)}\;\; z_3^{(2)}]^\top$:
$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)} = 1$), i.e. $a_i^{(2)} = g(z_i^{(2)})$
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$h_\Theta(x) = a^{(3)} = g(z^{(3)})$
Slide credit: Andrew Ng
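
A minimal sketch of this vectorized forward pass for the 3-layer network above, assuming sigmoid activations and small random weight matrices of the sizes $s_{j+1} \times (s_j + 1)$ given earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.1, 0.1, size=(3, 4))  # layer 1 (3 inputs + bias) -> layer 2 (3 units)
Theta2 = rng.uniform(-0.1, 0.1, size=(1, 4))  # layer 2 (3 units + bias) -> output

x = np.array([0.5, -1.2, 3.0])                # x_1..x_3 (made-up values)

a1 = np.concatenate(([1.0], x))               # add the bias unit x_0 = 1
z2 = Theta1 @ a1                              # z^(2) = Theta^(1) a^(1)
a2 = np.concatenate(([1.0], sigmoid(z2)))     # a^(2) = g(z^(2)), add a_0^(2) = 1
z3 = Theta2 @ a2                              # z^(3) = Theta^(2) a^(2)
h  = sigmoid(z3)                              # h_Theta(x) = a^(3) = g(z^(3))
print(h)
```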

Flow graph - Forward propagation: $x \to z^{(2)} \to a^{(2)} \to z^{(3)} \to a^{(3)} = h_\Theta(x)$, with parameters $(W^{(1)}, b^{(1)})$ feeding $z^{(2)}$ and $(W^{(2)}, b^{(2)})$ feeding $z^{(3)}$. In equations: $z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$; $a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)} = 1$); $z^{(3)} = \Theta^{(2)} a^{(2)}$; $h_\Theta(x) = a^{(3)} = g(z^{(3)})$. How do we evaluate our prediction?

Cost function.
Logistic regression: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
Neural network ($K$ output units): $J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \big[ y_k^{(i)} \log (h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\Theta(x^{(i)}))_k) \big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$
Slide credit: Andrew Ng
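
As an illustration, here is a sketch of the unregularized part of the neural-network cost above for a batch of outputs; the `H` and `Y` arrays are made-up values just to exercise the function:

```python
import numpy as np

def cross_entropy_cost(H, Y, eps=1e-12):
    """
    Unregularized neural-network cost
    J = -(1/m) * sum_i sum_k [ y_k log h_k + (1 - y_k) log(1 - h_k) ].
    H: (m, K) network outputs in (0, 1); Y: (m, K) one-hot / binary labels.
    """
    H = np.clip(H, eps, 1.0 - eps)       # avoid log(0)
    m = Y.shape[0]
    return -np.sum(Y * np.log(H) + (1.0 - Y) * np.log(1.0 - H)) / m

H = np.array([[0.9, 0.1], [0.2, 0.8]])   # made-up network outputs
Y = np.array([[1.0, 0.0], [0.0, 1.0]])   # made-up labels
print(cross_entropy_cost(H, Y))
```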

Gradient computation. Need to compute: the cost $J(\Theta)$ and its partial derivatives $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$. Slide credit: Andrew Ng

Gradient computation. Given one training example $(x, y)$, forward propagation:
$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)} a^{(1)}$, $a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)
$z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g(z^{(4)}) = h_\Theta(x)$
Slide credit: Andrew Ng

Gradient computation: Backpropagation. Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$. For each output unit (layer $L = 4$): $\delta^{(4)} = a^{(4)} - y$. Working backwards with the chain rule,
$\delta^{(3)} = \delta^{(4)} \, \frac{\partial a^{(4)}}{\partial z^{(4)}} \, \frac{\partial z^{(4)}}{\partial a^{(3)}} \, \frac{\partial a^{(3)}}{\partial z^{(3)}} = (\Theta^{(3)})^\top \delta^{(4)} \odot g'(z^{(4)}) \odot g'(z^{(3)})$,
where $\odot$ denotes the element-wise product and $z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$, $z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g(z^{(4)})$. Slide credit: Andrew Ng
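
The same recursion is short in code. Below is a minimal NumPy sketch with sigmoid activations and random placeholder weights and data; it uses the common simplification in which $\delta^{(4)} = a^{(4)} - y$ is already the error at the output pre-activation (as with the cross-entropy cost above), so only $g'(z^{(3)})$ appears in the hidden-layer delta:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)

rng = np.random.default_rng(0)
Theta2 = rng.uniform(-0.1, 0.1, size=(3, 4))   # layer 2 -> layer 3
Theta3 = rng.uniform(-0.1, 0.1, size=(1, 4))   # layer 3 -> layer 4 (output)

a2 = np.concatenate(([1.0], sigmoid(rng.normal(size=3))))  # placeholder a^(2), with bias
z3 = Theta2 @ a2
a3 = np.concatenate(([1.0], sigmoid(z3)))                  # a^(3), with bias a_0^(3) = 1
z4 = Theta3 @ a3
a4 = sigmoid(z4)                                           # a^(4) = h_Theta(x)
y  = np.array([1.0])                                       # placeholder label

delta4 = a4 - y                                            # output-layer "error"
# Back-propagate: (Theta^(3))^T delta^(4), drop the bias component, then multiply by g'(z^(3)).
delta3 = (Theta3.T @ delta4)[1:] * sigmoid_grad(z3)
print(delta4, delta3)
```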

Backpropagation algorithm. Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$. Initialize the weights $\Theta^{(l)}$ to small random values (see the initialization slide below).
For $i = 1$ to $m$:
  Set $a^{(1)} = x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
  Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
  Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
  Update $\Theta^{(l)} := \Theta^{(l)} - \alpha\, \delta^{(l+1)} (a^{(l)})^\top$ (the gradient for layer $l$ is the outer product $\delta^{(l+1)} (a^{(l)})^\top$)
Slide credit: Andrew Ng
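
Putting forward propagation, the deltas, and the per-example update together, here is a sketch of a stochastic-gradient version of this procedure on a made-up XOR-style toy set, with one hidden layer and sigmoid activations (the learning rate, layer sizes, and epoch count are arbitrary choices for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary dataset (made up): 4 examples, 2 features, XOR-style labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.5, 0.5, size=(4, 3))     # layer 1 (2 inputs + bias) -> 4 hidden units
Theta2 = rng.uniform(-0.5, 0.5, size=(1, 5))     # hidden (4 units + bias) -> 1 output
alpha = 0.5                                      # learning rate (arbitrary)

for epoch in range(5000):
    for x, y in zip(X, Y):
        # Forward propagation
        a1 = np.concatenate(([1.0], x))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)
        # Backpropagation
        delta3 = a3 - y                                              # output error
        delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))
        # Update: the gradient for a layer is the outer product delta^(l+1) (a^(l))^T
        Theta2 -= alpha * np.outer(delta3, a2)
        Theta1 -= alpha * np.outer(delta2, a1)

# After training, the predictions should be close to the XOR labels on this toy set.
for x in X:
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
    print(x, sigmoid(Theta2 @ a2)[0])
```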

Activation - sigmoid: partial derivative. For $g(x) = \frac{1}{1 + e^{-x}}$, $g'(x) = g(x)\,(1 - g(x))$. Slide credit: Hugo Larochelle

Activation - hyperbolic tangent (tanh): partial derivative. For $g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, $g'(x) = 1 - g(x)^2$. Slide credit: Hugo Larochelle

Activation - rectified linear (ReLU): partial derivative. For $g(x) = \mathrm{relu}(x) = \max(0, x)$, $g'(x) = \mathbf{1}_{x > 0}$ (1 if $x > 0$, else 0). Slide credit: Hugo Larochelle
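
These three derivatives translate directly into code; a small sketch, written in terms of the pre-activation $x$ as on the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    g = sigmoid(x)
    return g * (1.0 - g)          # g'(x) = g(x)(1 - g(x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # g'(x) = 1 - g(x)^2

def relu_grad(x):
    return (x > 0).astype(float)  # g'(x) = 1 if x > 0, else 0

x = np.array([-2.0, 0.5, 3.0])
print(sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```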

Initialization. For the biases: initialize all to 0. For the weights: we can't initialize all weights to the same value, since one can show that all hidden units in a layer would then always behave the same; we need to break symmetry. Recipe: sample from a uniform distribution $U[-b, b]$; the idea is to sample around 0 while breaking symmetry. Slide credit: Hugo Larochelle
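
A minimal sketch of this recipe: zero biases and weights drawn from $U[-b, b]$. The specific choice of $b$ below (based on the fan-in and fan-out) is just one common heuristic and is an assumption, not something fixed by the slide:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Zero biases; weights sampled from U[-b, b] to break symmetry."""
    b = np.sqrt(6.0 / (n_in + n_out))           # one common heuristic for b (assumption)
    W = rng.uniform(-b, b, size=(n_out, n_in))  # small values around 0, all different
    bias = np.zeros(n_out)                      # biases can safely start at 0
    return W, bias

rng = np.random.default_rng(0)
W1, b1 = init_layer(3, 4, rng)
W2, b2 = init_layer(4, 1, rng)
print(W1.shape, b1, W2.shape, b2)
```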

Putting it together: pick a network architecture. Number of input units: dimension of the features. Number of output units: number of classes. Reasonable default: 1 hidden layer, or, if more than 1 hidden layer, the same number of hidden units in every layer (usually the more the better). Tune the remaining choices by grid search. Slide credit: Hugo Larochelle

Putting it together: early stopping. Use validation-set performance to select the best configuration. To select the number of epochs, stop training when the validation-set error starts to increase. Slide credit: Hugo Larochelle
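
A sketch of early stopping as described above; `train_one_epoch` and `validation_error` are hypothetical helpers standing in for whatever training and evaluation code is already in place:

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_error,
                            max_epochs=200, patience=5):
    """Stop training when the validation error stops improving.

    train_one_epoch(model)  -> trains the model in place for one epoch (hypothetical helper)
    validation_error(model) -> scalar error on a held-out validation set (hypothetical helper)
    """
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation error has stopped improving
    return best_model, best_err
```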

Other tricks of the trade. Normalize your (real-valued) data. Decay the learning rate: as we get closer to the optimum, it makes sense to take smaller update steps. Mini-batches: averaging the gradient over a small batch gives a more accurate estimate of the risk gradient. Momentum: use an exponential average of previous gradients. Slide credit: Hugo Larochelle
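
A sketch of two of these tricks, learning-rate decay and momentum, on a toy quadratic objective; the decay schedule and the momentum coefficient are arbitrary example values:

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr, beta=0.9):
    """Momentum: keep an exponential moving average of past gradients."""
    velocity = beta * velocity + (1.0 - beta) * grad
    theta = theta - lr * velocity
    return theta, velocity

theta = np.zeros(3)
velocity = np.zeros(3)
lr0 = 0.1

for t in range(1, 101):
    grad = 2 * theta - np.array([1.0, -2.0, 0.5])   # gradient of a toy quadratic
    lr = lr0 / (1.0 + 0.01 * t)                     # decay the learning rate over time
    theta, velocity = sgd_momentum_step(theta, grad, velocity, lr)

print(theta)   # approaches the minimizer [0.5, -1.0, 0.25]
```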

Dropout. Idea: "cripple" the neural network by randomly removing hidden units during training; each hidden unit is set to 0 with probability 0.5. Hidden units then cannot co-adapt to other units and must be more generally useful. Slide credit: Hugo Larochelle
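
A sketch of dropout applied to a layer of hidden activations during training: each unit is zeroed with probability 0.5, as described on the slide (the "inverted dropout" scaling in the code is a common convention so that no rescaling is needed at test time, and is an assumption beyond the slide):

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=None):
    """Set each hidden unit to 0 with probability p_drop (training only)."""
    if not train:
        return activations                       # no dropout at test time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    # Inverted dropout: scale the kept units so the expected activation is unchanged.
    return activations * mask / (1.0 - p_drop)

a2 = np.array([0.3, 1.2, 0.7, 0.05])             # some hidden-layer activations
print(dropout(a2))                               # roughly half the units zeroed out
print(dropout(a2, train=False))                  # unchanged at test time
```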