1 CSE 4705 Artificial Intelligence, Jinbo Bi, Department of Computer Science & Engineering

2 Practical issues in training
- Underfitting
- Overfitting
Before introducing these important concepts, let us study a simple regression algorithm: linear regression.

3 Least squares
- We wish to use some real-valued input variables x to predict the value of a target y
- We collect training data of pairs (x_i, y_i), i = 1, ..., N
- Suppose we have a model f that maps each example x to a predicted value y'
- Sum-of-squares function: the sum of the squared deviations between the observed target value y and the predicted value y'

4 Least squares
- Find a function f such that the sum of squares is minimized
- For example, the function may take the linear form f(x) = w^T x
- Least squares with a function that is linear in the parameters w is called "linear regression"
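The sum-of-squares formula on the slide was shown as an image; a standard form consistent with the description above (the factor 1/2 is a common convention that simplifies the derivative) is:

```latex
E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - \mathbf{w}^{\top} \mathbf{x}_i \bigr)^{2}
```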

5 Linear regression
- Linear regression has a closed-form solution for w
- The minimum is attained where the derivative is zero
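As a minimal sketch (not from the slides), setting the derivative of E(w) to zero gives the normal equations w = (X^T X)^{-1} X^T y, where the design matrix X stacks the inputs x_i as rows; the function name below is illustrative:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return the weights w minimizing the sum of squares ||X w - y||^2.

    X : (N, d) design matrix (include a column of ones for a bias term).
    y : (N,) target vector.
    """
    # Solving the normal equations is preferred over explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage with a bias column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = fit_linear_regression(X, y)   # approximately [1.0, 2.0]
```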

6 Polynomial curve fitting
- x is evenly distributed over [0, 1]
- y = f(x) + random error
- y = sin(2πx) + ε, ε ~ N(0, σ)
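A short sketch of this data-generating process; the sample size N and noise level sigma below are illustrative choices, not values taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 10, 0.3                                           # assumed sample size and noise level
x = np.linspace(0.0, 1.0, N)                                 # inputs evenly spaced on [0, 1]
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=N)   # noisy targets
```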

7

8 Sum-of-Squares Error Function

9 0th Order Polynomial

10 1st Order Polynomial

11 3rd Order Polynomial

12 9th Order Polynomial

13 Over-fitting
Root-Mean-Square (RMS) error:
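The RMS formula on the slide was shown as an image; a standard definition, consistent with the sum-of-squares error E(w) above, is:

```latex
E_{\mathrm{RMS}} = \sqrt{\, 2 E(\mathbf{w}^{*}) / N \,}
```

Here w* denotes the fitted parameters; dividing by N allows comparison across data sets of different size, and the square root puts the error on the same scale as the target y.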

14 Polynomial Coefficients

15 Data Set Size: 9th Order Polynomial

16 Data Set Size: 9th Order Polynomial

17 Regularization
- Penalize large coefficient values
- Ridge regression
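The regularized error function on the slide was shown as an image; a standard form of the ridge (weight-decay) objective, in the notation used above, is:

```latex
\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \bigl( y_i - \mathbf{w}^{\top} \mathbf{x}_i \bigr)^{2} + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^{2}
```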

18 Regularization:

19 Regularization:

20 Regularization:

21 Polynomial Coefficients

22 Ridge regression
- Derive the analytic solution to the ridge-regression optimization problem
- Use the KKT condition: set the first-order derivative to zero
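The derivation on the slide was shown as an image; setting the derivative of the ridge objective above to zero yields the standard closed-form solution, sketched here in NumPy (the function name is illustrative):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```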

23 Neural networks
- Introduction
- Different designs of NN
- Feed-forward network (MLP)
- Network training
- Error back-propagation
- Regularization

24 Introduction
- Neuroscience studies how networks of neurons produce intellectual behavior, cognition, emotion, and physiological responses
- Computer science studies how to simulate knowledge from cognitive science, including the way neurons process signals
- Artificial neural networks simulate the connectivity of the neural system and the way it passes signals, and mimic the massively parallel operations of the human brain

25 Common features (figure: a biological neuron with its dendrites)

26 Different types of NN
- Adaptive NN: has a set of adjustable parameters that can be tuned
- Topological NN
- Recurrent NN

27 Different types of NN
- Feed-forward NN
- Multi-layer perceptron
- Linear perceptron
(figure: a network drawn with an input layer, a hidden layer, and an output layer)

28 Different types of NN
- Radial basis function NN (RBFN)

29 Multi-Layer Perceptron
- Layered perceptron networks can realize any logical function; however, there is no simple way to estimate the parameters or to generalize the (single-layer) perceptron convergence procedure
- Multi-layer perceptron (MLP) networks are a class of models formed from layers of sigmoidal nodes, and can be used for regression or classification
- They are commonly trained by gradient descent on a mean-squared-error performance function, using a technique known as error back-propagation to calculate the gradients
- They have been widely applied to many prediction and classification problems over the past 15 years

30 Linear perceptron
(figure: inputs x1, x2, ..., xt in the input layer, weighted by w1, w2, ..., wt and summed by Σ to give the output y)
y = w1*x1 + w2*x2 + ... + wt*xt
Many functions cannot be approximated using a perceptron

31 Multi-Layer Perceptron
- XOR (exclusive OR) problem: 0+0=0, 1+1=2=0 mod 2, 1+0=1, 0+1=1
- The perceptron does not work here! A single layer generates only a linear decision boundary

32 Multi-Layer Perceptron
(figure: inputs x1, x2, ..., xt feed a hidden layer of units computing f(Σ), which feeds the output y; the links carry first-layer weights w^(1) and second-layer weights w^(2))
- Each link is associated with a weight, and these weights are the tuning parameters to be learned
- Each neuron, except those in the input layer, receives inputs from the previous layer and reports an output to the next layer

33 Each neuron
(figure: inputs weighted by w1, w2, ..., wn are summed, and the sum is passed through an activation function f to produce the output o)
The activation function f can be:
- The identity function f(x) = x
- The sigmoid function
- The hyperbolic tangent
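The sigmoid and tanh formulas on the slide were shown as images; their standard definitions, sketched in NumPy:

```python
import numpy as np

def identity(x):
    """Identity activation: f(x) = x."""
    return x

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return np.tanh(x)
```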

34 Universal approximation of MLP
(figure: a network drawn with a 1st, 2nd, and 3rd layer)
Universal approximation: a three-layer network can, in principle, approximate any continuous function to arbitrary accuracy!

35 Feed-forward network function
- The output from each hidden node
- The final output
(figure: signals flow from the inputs x1, x2, ..., xt through a hidden layer to the output y; the two layers contain N and M nodes)
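The equations for the hidden outputs and the final output were shown as images; a minimal sketch of the forward pass for a one-hidden-layer network with sigmoid hidden units and a linear output, using hypothetical weight matrices W1, b1, W2, b2:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP.

    x  : (d,) input vector
    W1 : (m, d) input-to-hidden weights, b1 : (m,) hidden biases
    W2 : (k, m) hidden-to-output weights, b2 : (k,) output biases
    Returns the hidden activations z and the network output y.
    """
    a = W1 @ x + b1                    # pre-activations of the hidden nodes
    z = 1.0 / (1.0 + np.exp(-a))       # sigmoid hidden outputs
    y = W2 @ z + b2                    # linear output layer (regression)
    return z, y
```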

36 Network training
- A supervised neural network is a function h(x; w) that maps inputs x to a target y
- Usually, training a NN does not involve changing the NN structure (such as how many hidden layers or hidden nodes it has)
- Training a NN refers to adjusting the values of the connection weights so that h(x; w) adapts to the problem
- Use the sum of squares as the error metric
- Use gradient descent

37 Gradient descent
- Review of gradient descent
- An iterative algorithm consisting of many iterations
- In each iteration, the weights w receive a small update
- Terminate:
  - when the network is stable (in other words, the training error cannot be reduced further), i.e., E(w_new) < E(w) no longer holds
  - when the error on a validation set starts to climb (early stopping)
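The update rule on the slide was shown as an image; a minimal sketch of the standard gradient-descent loop, with an illustrative learning rate eta and a user-supplied gradient function grad_E:

```python
import numpy as np

def gradient_descent(w0, grad_E, eta=0.01, max_iters=1000, tol=1e-8):
    """Repeatedly move the weights a small step against the gradient of E(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_E(w)
        w = w - eta * g                  # small update in the downhill direction
        if np.linalg.norm(g) < tol:      # stop once the gradient is (nearly) zero
            break
    return w
```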

38 Error back-propagation
The update of the weights goes backwards because we have to use the chain rule to evaluate the gradient of E(w).
(figure: signals flow forwards from the inputs x1, x2, ..., xt through the hidden nodes to y = h(x; w), while learning proceeds backwards through the weights W_ij and W_jk)

39 Error back-propagation
- Update the weights in the output layer first
- Propagate errors from the higher layers to the lower layers
- Recall: learning is backwards
(figure: the same network as before, showing the weights W_ij and W_jk)

40 Evaluate the gradient
- First, compute the partial derivatives for the weights in the output layer
- Second, compute the partial derivatives for the weights in the hidden layer
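The derivatives themselves were shown as images; for a one-hidden-layer network with sigmoid hidden units z_j = σ(a_j), a single linear output y = Σ_j w_jk z_j, and squared error E = ½(y − t)² on one example with target t (an assumed setup matching the weight names W_ij and W_jk above), the chain rule gives:

```latex
\frac{\partial E}{\partial w_{jk}} = (y - t)\, z_j ,
\qquad
\frac{\partial E}{\partial w_{ij}} = (y - t)\, w_{jk}\, z_j (1 - z_j)\, x_i
```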

41 Back-propagation algorithm
- Design the structure of the NN
- Initialize all connection weights
- For t = 1 to T:
  - Present the training examples, propagate forwards from the input layer to the output layer, compute y, and evaluate the errors
  - Pass the errors backwards through the network to recursively compute the derivatives, and use them to update the weights
  - If the termination rule is met, stop; otherwise continue
- End
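Putting the pieces together, here is a minimal NumPy sketch (not from the slides) of online back-propagation for a one-hidden-layer regression network with sigmoid hidden units and a scalar linear output; the function name and hyperparameter values are illustrative:

```python
import numpy as np

def train_mlp(X, y, m=5, eta=0.1, epochs=1000, seed=0):
    """Online back-propagation for a one-hidden-layer MLP with a scalar output."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0.0, 0.1, size=(m, d))   # input-to-hidden weights
    b1 = np.zeros(m)
    W2 = rng.normal(0.0, 0.1, size=m)        # hidden-to-output weights
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):               # online scheme: one example at a time
            z = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden activations
            out = W2 @ z + b2                          # linear output
            delta = out - t                            # output error
            # Gradients from the chain rule (see the derivatives above)
            grad_W2 = delta * z
            grad_b2 = delta
            delta_hidden = delta * W2 * z * (1 - z)
            grad_W1 = np.outer(delta_hidden, x)
            grad_b1 = delta_hidden
            # Gradient-descent updates
            W2 -= eta * grad_W2
            b2 -= eta * grad_b2
            W1 -= eta * grad_W1
            b1 -= eta * grad_b1
    return W1, b1, W2, b2
```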

42 Notes on back-propagation
- These rules apply to different kinds of feed-forward networks: it is possible for connections to skip layers, or for a network to mix such connections. However, errors always start at the highest layer and propagate backwards.

43 Activation and error back-propagation

44 Two schemes of training
- There are two schemes for updating the weights:
  - Batch: update the weights after all examples have been presented (an epoch)
  - Online: update the weights after each example is presented
- Although the batch scheme implements the true gradient descent, the online scheme is often preferred since:
  - it requires less storage
  - it has more noise, and hence is less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions)
- In the online update scheme, the order of presentation matters!

45 Problems of back-propagation
- It is extremely slow, if it converges at all
- It may get stuck in a local minimum
- It is sensitive to initial conditions
- It may start oscillating

46 Over-fitting and the number of hidden units
(figure: fits to the sinusoidal data set used in the polynomial curve-fitting example)

47 Regularization (1)
- How to choose the number of hidden units to get the best performance while avoiding over-fitting?
- Add a penalty term to the error function
- The simplest regularizer is weight decay:
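The weight-decay formula on the slide was shown as an image; its standard form, consistent with the ridge penalty used earlier, is:

```latex
\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\, \mathbf{w}^{\top} \mathbf{w}
```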

48 Regularization (2): early stopping
- A method to:
  - obtain good generalization performance, and
  - control the effective complexity of the network
- Instead of iteratively reducing the error until a minimum on the training data set has been reached:
  - keep a validation data set available
  - stop when the NN achieves the smallest error w.r.t. the validation data set
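A minimal sketch of the early-stopping rule just described; the helpers train_one_epoch, validation_error, get_weights, and set_weights are hypothetical placeholders for whatever training and evaluation routines are in use:

```python
def train_with_early_stopping(model, max_epochs=1000, patience=10):
    """Stop training once the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_weights = None
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                  # hypothetical: one pass of back-propagation
        err = validation_error(model)           # hypothetical: error on the held-out set
        if err < best_error:
            best_error = err
            best_weights = model.get_weights()  # hypothetical accessor
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:   # validation error has started to climb
                break
    model.set_weights(best_weights)             # hypothetical: restore the best weights seen
    return model
```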

49 Effect of early stopping
(figure: error vs. number of iterations for the training set and the validation set, showing a slight increase in the validation-set error)