
1 Artificial Neural Networks (Cont.), Chapter 4: Perceptron, Gradient Descent, Multilayer Networks, Backpropagation Algorithm

2 Review: The Main Idea of Gradient Descent. Goal: minimize the error:
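
The error function and update rule this slide refers to are not reproduced in the transcript; the standard squared-error objective and gradient-descent step from Mitchell, Chapter 4 are (t_d is the target and o_d the network output for training example d, η the learning rate):

E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \qquad w_i \leftarrow w_i + \Delta w_i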

3 Review: The Main Idea of Gradient Descent (cont.)

4 Gradient Descent. Exercise: derive the equations for updating the weights of a simple neuron using the gradient descent technique.
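
The slide's own derivation is not in the transcript; for a single linear unit o_d = \sum_i w_i x_{id}, the standard steps (the delta rule) are:

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 = \sum_d (t_d - o_d) \left(-\frac{\partial o_d}{\partial w_i}\right) = -\sum_d (t_d - o_d) x_{id}

\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_d (t_d - o_d) x_{id}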

5 Gradient Descent (with a linear transfer function)
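
The algorithm on this slide is not included in the transcript; the following is a minimal Python sketch of batch gradient descent for a single linear unit, assuming an input matrix X (one row per training example), a target vector t, and a learning rate eta chosen for illustration:

import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=100):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 with o = X @ w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                # outputs for all training examples
        grad = -(t - o) @ X      # gradient summed over the whole batch
        w -= eta * grad          # one weight update per pass through the data
    return w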

6 Review: Batch vs. Incremental Gradient Descent
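
In batch mode the gradient is accumulated over all training examples before the weights change; in incremental (stochastic) mode the weights are updated after each example. A minimal sketch of the incremental version, under the same illustrative assumptions as the batch sketch above:

import numpy as np

def train_linear_unit_incremental(X, t, eta=0.05, epochs=100):
    """Incremental (per-example) gradient descent for a single linear unit."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = x_d @ w
            w += eta * (t_d - o_d) * x_d   # update immediately after each example
    return w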

7 Gradient Descent (with a linear transfer function). Question: is it batch or incremental mode?

8 Forward Equations
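
The forward-pass equations are not reproduced in the transcript; for a two-layer sigmoid network they take the standard form below, where x_i are the inputs, w^{(1)}_{ji} and w^{(2)}_{kj} are the weights into hidden unit j and output unit k, and σ is the sigmoid function:

net^{(1)}_j = \sum_i w^{(1)}_{ji} x_i, \qquad o^{(1)}_j = \sigma(net^{(1)}_j) = \frac{1}{1 + e^{-net^{(1)}_j}}

net^{(2)}_k = \sum_j w^{(2)}_{kj} o^{(1)}_j, \qquad o_k = \sigma(net^{(2)}_k)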

9 Backpropagation Learning Rule. Each weight is changed by the rule below (if all the neurons are sigmoid units), where l is the layer number, η is a constant called the learning rate, t_j is the correct (teacher) output for output unit j, and δ^(l)_j is the error measure for unit j in the l-th layer. Usually the output neurons are linear and the hidden neurons are sigmoid units; then, what will change in the equations?
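
The update and error equations themselves are not included in the transcript; the standard form for sigmoid units (as in Mitchell, Chapter 4) is:

\Delta w^{(l)}_{ji} = \eta \, \delta^{(l)}_j \, o^{(l-1)}_i

\delta^{(L)}_j = o_j (1 - o_j)(t_j - o_j) \quad \text{(output layer)}

\delta^{(l)}_j = o^{(l)}_j (1 - o^{(l)}_j) \sum_k w^{(l+1)}_{kj} \, \delta^{(l+1)}_k \quad \text{(hidden layers)}

As to the question posed on the slide: if the output neurons are linear rather than sigmoid, the derivative factor o_j(1 - o_j) disappears from the output-layer term, leaving δ^{(L)}_j = t_j - o_j; the hidden-layer equations are unchanged.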

10 Backpropagation Learning Rule. Current output: o_j = 0.2; correct output: t_j = 1.0. [Figure: a three-layer network with input, hidden, and output layers.]

11 Error Backpropagation. First calculate the error of the output units and use it to change the top layer of weights. Current output: o_j = 0.2; correct output: t_j = 1.0; so δ_j = 0.2(1 - 0.2)(1 - 0.2) = 0.128. [Figure: updating the weights into output unit j.]

12 Error Backpropagation. Next calculate the error for each hidden unit based on the errors of the output units it feeds into (the hidden-layer δ equation above). [Figure: backpropagating error from the output layer to a hidden unit.]

13 Error Backpropagation. Finally update the bottom layer of weights based on the errors calculated for the hidden units. [Figure: updating the weights into hidden unit j.]
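
As a concrete illustration of slides 10 to 13, here is a minimal Python sketch of one backpropagation step for a tiny 2-2-1 sigmoid network; the layer sizes, input values, and learning rate are made up for illustration and are not from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical tiny network: 2 inputs, 2 hidden units, 1 output unit.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 2))       # weights into the hidden layer
W2 = rng.normal(scale=0.5, size=(1, 2))       # weights into the output layer
x = np.array([1.0, 0.0])
t = np.array([1.0])                           # correct (teacher) output
eta = 0.5

# Forward pass
h = sigmoid(W1 @ x)                           # hidden activations
o = sigmoid(W2 @ h)                           # network output

# Backward pass (the delta terms from slides 11 and 12)
delta_out = o * (1 - o) * (t - o)             # output-layer error
delta_hid = h * (1 - h) * (W2.T @ delta_out)  # hidden-layer error

# Weight updates (slides 11 and 13)
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)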

14 Comments on the Training Algorithm
Not guaranteed to converge to zero training error; it may converge to a local optimum or oscillate indefinitely. However, in practice it does converge to low error for many large networks on real data.
Many epochs (thousands) may be required, meaning hours or days of training for large networks.
To avoid local-minima problems, run several trials starting with different random weights (random restarts), as sketched below:
– Take the results of the trial with the lowest training-set error, or
– Build a committee from the results of multiple trials.
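
A minimal sketch of the random-restart idea, assuming a hypothetical train(seed) function that trains one network from a seeded random initialization and returns a (model, training_error) pair; neither the function nor the number of trials comes from the slides:

def random_restarts(train, n_trials=5):
    """Run several training trials from different random initial weights
    and keep the trial with the lowest training-set error."""
    results = [train(seed) for seed in range(n_trials)]
    return min(results, key=lambda r: r[1])   # best (model, training_error) pair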

15 Representational Power
Boolean functions: any Boolean function can be represented by a two-layer network with a sufficient number of hidden units.
Continuous functions: any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
– Sigmoid functions can act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis.
Arbitrary functions: any function can be approximated to arbitrary accuracy by a three-layer network.

16 How many hidden layers and hidden units per layer?
Theoretically, one hidden layer (possibly with many hidden units) is sufficient for most problems.
There are no theoretical results on the minimum necessary number of hidden units (either problem-dependent or problem-independent).
Practical rule of thumb:
– n = number of input units; p = number of hidden units
– For binary/bipolar data: p = 2n
– For real-valued data: p >> 2n
Multiple hidden layers with fewer units may train faster, with similar quality, in some applications.

17 Data sets for handling over-fitting and choosing the number of hidden neurons
Training set:
– A set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier.
Validation set:
– A set of examples used to tune the architecture of a classifier (not its weights), for example to choose the number of hidden units in a neural network or to handle over-fitting.
Test set (completely unseen data):
– A set of examples used only to assess the performance (generalization) of a fully specified classifier.
Usually the whole data set is divided 60%-20%-20% or 70%-20%-10%, as in the split sketch below.
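
A minimal Python sketch of a 60%-20%-20% split, assuming a data matrix X and label vector y; the shuffling seed and proportions are illustrative:

import numpy as np

def train_val_test_split(X, y, train=0.6, val=0.2, seed=0):
    """Shuffle the data and split it into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])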

18 Over-training / over-fitting
The meaning of over-fitting:
– The trained net fits the training samples very well (total error almost zero), but not new input patterns.
Over-training may become serious if:
– the training samples were not obtained properly, or
– the training samples contain noise.
Controlling over-training for better generalization:
– Cross-validation: divide the samples into two sets, 80% into a training set used to train the network and 20% into a test set used to validate the training results; periodically test the trained net on the test samples, stop training when the test results start to deteriorate, repeat the process many times, and report the average results.

19 Over-Training Prevention
Running too many epochs can result in over-fitting. Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase the validation error. [Figure: error versus number of training epochs, shown for training data and test data.]
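
A minimal early-stopping sketch, assuming hypothetical train_one_epoch(model) and validation_error(model) functions that are not part of the slides:

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop training when the validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                          # validation error stopped improving
    return model, best_err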

20 Determining the Best Number of Hidden Units
Too few hidden units prevent the network from adequately fitting the data; too many hidden units can result in over-fitting. Use internal cross-validation to empirically determine an optimal number of hidden units. [Figure: error versus number of hidden units, shown for training data and test data.]
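
A sketch of selecting the hidden-layer size by validation error, assuming a hypothetical train_network(n_hidden) function that returns a (model, validation_error) pair; the candidate sizes are illustrative:

def select_hidden_units(train_network, candidates=(2, 4, 8, 16, 32)):
    """Train one network per candidate size and keep the one with the lowest validation error."""
    scored = {p: train_network(p) for p in candidates}
    best_p = min(scored, key=lambda p: scored[p][1])
    return best_p, scored[best_p][0]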

21 Learning Rate. [Figure: training error as a function of the weights, with annotations "decreasing the weight changes" and "increasing the gradient changes" illustrating the effect of the learning rate.]

22 Momentum. Improves gradient descent's ability to escape local minima by adding a fraction of the previous weight movement to the current one.
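
The slide's equation is not in the transcript; the standard momentum update (α is the momentum coefficient, typically between 0 and 1, and n indexes the update step) is:

\Delta w_{ji}(n) = \eta \, \delta_j \, x_{ji} + \alpha \, \Delta w_{ji}(n-1)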

23 Typical Learning Curve

24 Typical learning with an adaptive learning rate

25 Typical learning with an adaptive learning rate plus momentum

26 Hidden Unit Representations. Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space. On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors. However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

27 Appropriate problems for neural networks
– Instances are represented by many attribute-value pairs.
– The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
– The training data may contain errors.
– Long training times are acceptable.
– Fast evaluation (the test phase) is required.
– The ability of humans to understand the learned target function is not important.

28 Successful Applications
– Text to speech (NetTalk)
– Fraud detection
– Financial applications
– Chemical plant control
– Automated vehicles
– Game playing: Neurogammon (a neural-network backgammon program)
– Handwriting recognition

29 More Issues in Neural Nets
More efficient training methods:
– Quickprop
– Conjugate gradient (exploits the 2nd derivative)
Learning the proper network architecture:
– Grow the network until it is able to fit the data
– Shrink a large network until it is unable to fit the data
Recurrent networks that use feedback and can learn finite state machines with “backpropagation through time.”

30 More Issues in Neural Nets (cont.)
More biologically plausible learning algorithms based on Hebbian learning.
Unsupervised learning:
– Self-Organizing Feature Maps (SOMs)
Reinforcement learning:
– Neural networks are frequently used as function approximators for learning value functions.

31 Assignment 2: Tom Mitchell, problems 4.1, 4.2, 4.5, 4.7, 4.8, and 4.10

32 Remaining Course Plan
– Chapter 9: Genetic Algorithms
– Chapter 13: Reinforcement Learning
– Clustering
– Dimension reduction algorithms
– Support vector machines
– Cellular automata
– Other biologically inspired optimization algorithms (ant colony, PSO, simulated annealing, …)
– Active learning

