Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com.


Review: neuron analogy to linear models. The dot product w^T x is a way of combining attributes with a bias (x_0 = 1) into a scalar signal s, which is used to form a hypothesis about x. The analogy is called a "perceptron"; for classification the signal is passed through sigmoid(s).

What can a perceptron do in 1D? Fit a line to data, y = wx + w_0, or use y = wx + w_0 as a discriminant: with S = sigmoid(y), if y > 0 then S > 0.5 and we choose the green class; otherwise we choose the red class. (Slides adapted from lecture notes for E. Alpaydın, Introduction to Machine Learning 2e, © The MIT Press.)

A perceptron can do the same thing in d dimensions: x is the input vector, w is the weight vector, and y = w^T x fits a plane to the data or serves as a discriminant for binary classification. For regression the output is y; for classification the output is sigmoid(y).
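As a concrete illustration (my own, not from the slides), a minimal NumPy sketch of this forward computation, with the bias folded into the vectors as x_0 = 1:

```python
import numpy as np

def perceptron_output(w, x, classification=False):
    """Perceptron signal for one example; w and x include the bias (x[0] = 1)."""
    s = np.dot(w, x)                          # scalar signal s = w^T x
    if classification:
        return 1.0 / (1.0 + np.exp(-s))       # sigmoid(s) for binary classification
    return s                                  # raw s for regression

# 2D example with the bias attribute appended as x[0] = 1
w = np.array([-1.5, 1.0, 1.0])                # w0, w1, w2
x = np.array([1.0, 1.0, 1.0])                 # x0 = 1, x1, x2
print(perceptron_output(w, x, classification=True))   # > 0.5, so choose the positive class
```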

Boolean AND: a linearly separable 2D binary classification problem. From the truth table (r = 1 only when x_1 = x_2 = 1), x_1 + x_2 = 1.5 is an acceptable linear discriminant.

Derive the linear discriminant w^T x = 0, with y = w^T x and the rule w^T x > 0 → r = 1:

x1  x2  r   required              choice
0   0   0   w0 < 0                w0 = -1.5
0   1   0   w2 + w0 < 0           w2 = 1
1   0   0   w1 + w0 < 0           w1 = 1
1   1   1   w1 + w2 + w0 > 0      (satisfied: 1 + 1 - 1.5 > 0)

This gives the linear discriminant x_1 + x_2 - 1.5 = 0.
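A quick check (my own illustration, using the weights derived above) that (w0, w1, w2) = (-1.5, 1, 1) reproduces the AND truth table:

```python
import numpy as np

# Check that (w0, w1, w2) = (-1.5, 1, 1) implements Boolean AND
w = np.array([-1.5, 1.0, 1.0])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = w @ np.array([1.0, x1, x2])            # y = w^T x with bias x0 = 1
    print((x1, x2), "->", 1 if y > 0 else 0)   # rule: w^T x > 0 -> r = 1
# prints 0, 0, 0, 1: the AND truth table
```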

Boolean AND: linearly separable 2D binary classification problem, continued. Other linear discriminants w^T x = 0 are possible; we have not yet specified an optimization condition.

Boolean XOR: a linearly inseparable 2D binary classification problem (data table and graphical representation). Application of a perceptron is not possible in attribute space. Solution: transform to a linearly separable feature space.

XOR in Gaussian feature space: φ1 = exp(-|x - [1,1]|^2), φ2 = exp(-|x - [0,0]|^2). This transformation puts examples (0,1) and (1,0) at the same point in feature space, so a perceptron could be applied to find a linear discriminant.

x      φ1     φ2
(1,1)  1.00   0.14
(0,1)  0.37   0.37
(0,0)  0.14   1.00
(1,0)  0.37   0.37

Review: XOR in Gaussian feature space. The transformation φ1 = exp(-|x - [1,1]|^2), φ2 = exp(-|x - [0,0]|^2) puts examples (0,1) and (1,0) at the same point in feature space, so a perceptron could be applied to find a linear discriminant separating the r = 1 points from the r = 0 points of the XOR data.
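A short sketch (mine, not from the slides) that computes the Gaussian features for the four XOR points and shows why a linear discriminant becomes possible in feature space:

```python
import numpy as np

# Gaussian feature transform for XOR, with basis centers at (1,1) and (0,0) as on the slide
def phi(x):
    p1 = np.exp(-np.sum((x - np.array([1.0, 1.0]))**2))   # phi1 centered at (1,1)
    p2 = np.exp(-np.sum((x - np.array([0.0, 0.0]))**2))   # phi2 centered at (0,0)
    return p1, p2

for x, r in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    p1, p2 = phi(np.array(x, dtype=float))
    print(x, "->", (round(p1, 2), round(p2, 2)), "label", r)
# (0,1) and (1,0) both map to (0.37, 0.37); (0,0) and (1,1) map to (0.14, 1) and (1, 0.14),
# so a line such as phi1 + phi2 = 1 separates r = 1 (below) from r = 0 (above) in feature space.
```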

Add a "hidden" layer to the perceptron. Derive 2 weight vectors connecting the input to the hidden layer that define linearly separable features, and 1 weight vector connecting the hidden layer to the output that defines a linear discriminant separating those features.

Consider the hidden units z_h as features. Choose weight vectors w_h so that in feature space (0,0) and (1,1) are close to the same point (attribute space vs. ideal feature space).

Data table in feature space, with z_h ≈ 0 if w_h^T x << 0 and z_h ≈ 1 if w_h^T x >> 0:

r   z1   z2
0   ~0   ~0
1   ~0   ~1
1   ~1   ~0
0   ~0   ~0

Find weight vectors for linearly separable features.

For z1 (sign of w_1^T x):
x1  x2  z1   required              choice
0   0   ~0   w0 < 0                w0 = -0.5
0   1   ~0   w2 + w0 < 0           w2 = -1
1   0   ~1   w1 + w0 > 0           w1 = 1
1   1   ~0   w1 + w2 + w0 < 0      (satisfied)

For z2 (sign of w_2^T x):
x1  x2  z2   required              choice
0   0   ~0   w0 < 0                w0 = -0.5
0   1   ~1   w2 + w0 > 0           w2 = 1
1   0   ~0   w1 + w0 < 0           w1 = -1
1   1   ~0   w1 + w2 + w0 < 0      (satisfied)

Transformation of the input by the hidden layer: z1 = sigmoid(x1 - x2 - 0.5), z2 = sigmoid(-x1 + x2 - 0.5).

x1  x2  arg1   z1     arg2   z2     r
0   0   -0.5   0.38   -0.5   0.38   0
0   1   -1.5   0.18    0.5   0.62   1
1   0    0.5   0.62   -1.5   0.18   1
1   1   -0.5   0.38   -0.5   0.38   0

The XOR data transformed by the hidden layer are linearly separable in (z1, z2) space.
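The same transformation can be verified numerically; this sketch (my own) reproduces the table above:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hidden-layer transform from the slide: z1 = sigmoid(x1 - x2 - 0.5), z2 = sigmoid(-x1 + x2 - 0.5)
for x1, x2, r in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]:
    z1 = sigmoid(x1 - x2 - 0.5)
    z2 = sigmoid(-x1 + x2 - 0.5)
    print((x1, x2), "->", (round(z1, 2), round(z2, 2)), "label", r)
# (0,0) and (1,1) both map near (0.38, 0.38); (0,1) and (1,0) map to
# (0.18, 0.62) and (0.62, 0.18), so the classes are linearly separable in z-space.
```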

Find the weights connecting the hidden layer to the output that define a linear discriminant in feature space. Denote the weight vector by v; the output is transformed by y = sigmoid(v^T z).

z1    z2    r   required (sign of v^T z)
0.38  0.38  0   0.38 v1 + 0.38 v2 + v0 < 0
0.18  0.62  1   0.18 v1 + 0.62 v2 + v0 > 0
0.62  0.18  1   0.62 v1 + 0.18 v2 + v0 > 0
0.38  0.38  0   0.38 v1 + 0.38 v2 + v0 < 0

Any choice of v0, v1, v2 satisfying these four inequalities defines an acceptable discriminant.

This is not the only solution to the XOR classification problem. Define an optimization condition that enables learning optimum weights for both layers; the data will then determine the best transform of the input.

Training a neural network by back-propagation: initialize the weights randomly, then apply a rule that relates changes in the weights to the difference between output and target.

If the expression for the in-sample error is simple (e.g. squared residuals) and the network is not too complex (e.g. fewer than 3 hidden layers), then an analytical expression for the rate of change of error with change in weights can be derived.

Simplest case: multivariate linear regression, where the in-sample error is the sum of squared residuals and there are no hidden layers. This example is instructive but not the method of choice: the normal equations are a better way to find optimum weights in this case.
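For completeness, a small sketch (my own, with made-up data) of the normal-equation solution w = (X^T X)^{-1} X^T r, which makes iterative weight refinement unnecessary in this case:

```python
import numpy as np

# Normal-equation solution for multivariate linear regression,
# with a column of ones providing the bias term.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])   # bias + 3 attributes
true_w = np.array([0.5, 1.0, -2.0, 0.3])
r = X @ true_w + rng.normal(scale=0.1, size=100)                  # noisy targets

w_hat = np.linalg.solve(X.T @ X, X.T @ r)                         # normal equations
print(w_hat)   # close to true_w; no iterative weight refinement needed
```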

Approaches to training:
Online: weights updated based on training-set examples seen one by one in random order.
Batch: weights updated based on the whole training set, after summing the deviations from individual examples.
Weight-update formulas are simpler for the "online" approach; the "batch" formulas can be derived from the "online" ones.

Weight-update rule: perceptron regression. The contribution to the sum of squared residuals from a single example is E^t = (r^t - y^t)^2 / 2, where w_j is the j-th component of the weight vector w connecting the attribute vector x to the scalar output y. E^t depends on w_j through y^t = w^T x^t; hence use the chain rule, which gives Δw_j = η (r^t - y^t) x_j^t.

This weight-update formula is called "stochastic gradient descent". The proportionality constant η is called the "learning rate". Since Δw_j is proportional to x_j, all attributes should be roughly the same size; normalization to achieve this may be helpful.
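A minimal sketch (my own) of one online epoch of this stochastic gradient descent update; the learning rate eta and the random example order follow the description above:

```python
import numpy as np

def sgd_epoch(w, X, r, eta=0.01):
    """One 'online' epoch of stochastic gradient descent for perceptron regression.
    Update for a single example: delta_w = eta * (r^t - y^t) * x^t."""
    for t in np.random.permutation(len(r)):    # examples in random order
        y_t = w @ X[t]                         # y^t = w^T x^t
        w = w + eta * (r[t] - y_t) * X[t]      # gradient step on (r^t - y^t)^2 / 2
    return w
```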

Momentum parameter: keep part of the previous update, Δw_j^t = -η ∂E^t/∂w_j + α Δw_j^(t-1). How do learning rate and momentum affect training? As the learning rate → 1, back-propagation becomes deterministic: each example determines a set of weights optimal for itself only. As the learning rate → 0, the probability of being trapped in a local minimum → 1 because the step size of each weight change is so small. A large momentum parameter reduces trapping at small learning rates but increases the likelihood that a single outlier will dramatically affect weight optimization. Opinions differ on the best choice of learning rate and momentum.
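A sketch (mine) of a single update step with momentum, keeping a fraction alpha of the previous update; the parameter names eta and alpha are my own:

```python
import numpy as np

def sgd_step_with_momentum(w, prev_dw, x_t, r_t, eta=0.01, alpha=0.9):
    """Weight update with momentum: current gradient step plus part of the previous update."""
    y_t = w @ x_t                                      # y^t = w^T x^t
    dw = eta * (r_t - y_t) * x_t + alpha * prev_dw     # keep part of the previous update
    return w + dw, dw                                  # new weights and this update
```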

Multivariate linear dichotomizer: r^t ∈ {0,1}, y = sigmoid(w^T x), and the in-sample error is the cross entropy. A weight vector w connects the input to the output, which is transformed by the sigmoid function. This example is equivalent to logistic regression but with r^t ∈ {0,1} instead of {-1,+1}.

Weight update for optimization by back propagation. Assume that r^t is drawn from a Bernoulli distribution with parameter p_0 for the probability that r^t = 1: p(r) = p_0^r (1 - p_0)^(1-r), so p(r = 1) = p_0 and p(r = 0) = 1 - p_0. Let y = sigmoid(w^T x) be the MLE of p_0; then p(r) = y^r (1 - y)^(1-r).

Assume that r^t is drawn from a Bernoulli distribution with parameter p_0 and that y = sigmoid(w^T x) is the MLE of p_0. Let L(w|X) be the log-likelihood of weight vector w given training set X: L(w|X) = Σ_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ].

Since y^t = sigmoid(w^T x^t) is between 0 and 1, L(w|X) < 0. Therefore, to maximize L(w|X), minimize the cross entropy E = -Σ_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ]. Like the sum of squared residuals, the cross entropy depends on the weights w through y^t. Unlike fitting a plane to data, where y^t = w^T x^t, here y^t = sigmoid(w^T x^t), which puts an additional factor in the chain rule when deriving the back-propagation formula.

By much tedious algebra, you can show that the result has the same form as for online training in regression, Δw_j = η (r^t - y^t) x_j^t, but now with y^t = sigmoid(w^T x^t). The result can be generalized to multi-class classification.
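A sketch (mine) of the resulting online update for the dichotomizer; the sigmoid's derivative cancels against the cross-entropy gradient, so the update has the same form as for regression:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_sgd_epoch(w, X, r, eta=0.1):
    """Online updates minimizing cross entropy for the linear dichotomizer.
    Update for a single example: delta_w = eta * (r^t - y^t) * x^t."""
    for t in np.random.permutation(len(r)):
        y_t = sigmoid(w @ X[t])                 # y^t = sigmoid(w^T x^t), MLE of p0
        w = w + eta * (r[t] - y_t) * X[t]
    return w
```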

Perceptron classification with K > 1 classes: weight vector w_i (a column of the weight matrix W) connects the input vector x to output node y_i, and w_ij is the j-th component of w_i. Assign each example to the class with the largest y_i.

Review: multilayer perceptrons (MLP). Layers between input and output are called "hidden". Both the number of hidden layers and the number of hidden units in a layer are variables in the structure of an MLP. Less complex structures improve generalization, and more than one hidden layer is seldom necessary.

Review: MLP solution of the XOR classification problem. Nodes in the hidden layer are linearly separable transforms of the inputs; the data determine the transform that minimizes the in-sample error. The weights that connect the hidden layer to the output node define a linear discriminant in feature space, v^T z = 0. Transform the output by the sigmoid function; if S > 0.5, assign class 1.

Solution of the XOR problem by nonlinear regression: do not transform the output node. Change the labels to ±1 and fit v^T z to the class labels by nonlinear regression; if v^T z > 0, assign class 1. The data determine the best level of nonlinearity. Let the in-sample error be the sum of squared residuals, use back propagation to optimize the weights, and derive formulas for updating the weights in both layers.

Review: weight update rules for nonlinear regression. Given weights w_h and v: h weight vectors connect the input to the hidden layer, and 1 weight vector connects the hidden layer to the output. Forward pass: transform the hidden layer by the sigmoid, z_h = sigmoid(w_h^T x), and compute y = v^T z. Backward pass: Δv_h = η (r^t - y^t) z_h^t and Δw_hj = η (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t. Each pass through all the training data is called an "epoch".
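A minimal sketch (my own, using the standard one-hidden-layer update rules stated above) of one online epoch of these forward and backward passes; the rows of X are assumed to include the bias attribute x_0 = 1:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_epoch(W, v, X, r, eta=0.1):
    """One online epoch of backpropagation for regression with one hidden layer.
    W: (H, d+1) weight vectors w_h connecting input (with bias) to hidden units.
    v: (H+1,)  weight vector connecting hidden layer (with bias) to the output."""
    for t in np.random.permutation(len(r)):
        # forward pass
        z = sigmoid(W @ X[t])                  # z_h = sigmoid(w_h^T x)
        z1 = np.concatenate(([1.0], z))        # prepend bias unit z_0 = 1
        y = v @ z1                             # y = v^T z (no output transform)
        # backward pass: compute hidden-layer changes before changing v
        err = r[t] - y
        dW = eta * err * np.outer(v[1:] * z * (1 - z), X[t])   # delta w_hj
        dv = eta * err * z1                                    # delta v_h
        W += dW
        v += dv
    return W, v
```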

Review: weight update rules for nonlinear regression, continued. Consider a momentum parameter. Calculate the changes to the w_h vectors before changing v. The learning rates can be different for Δw_hj and Δv_h.

Review: weight update rules for nonlinear regression, continued. It may be helpful to normalize the attributes, e.g. to [0,1]; other normalization methods can also be used. The transforms in the hidden layer do not require normalization.

Use E_val to determine the number of nodes in the hidden layer. Above h ≈ 15, E_val increases while E_in is flat, and there is no significant decrease in E_val or E_in after h = 3. Favor small h to avoid overfitting.

Stop early to avoid overfitting. Beyond the elbow, E_val ≈ E_in for roughly 200 epochs; above e ≈ 600 there is evidence of overfitting. Expect overfitting at roughly 10× the elbow epoch, and set the stopping epoch when you see evidence of overfitting.

Use a validation set to detect overfitting, with the validation error calculated after each epoch. Example data: x^t ~ U(-0.5, 0.5), y^t = sin(6 x^t) + N(0, 0.1). The fit looks better at 300 epochs, so why stop at 200? Stop early to avoid overfitting.
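A sketch (mine) of the early-stopping loop this implies; train_epoch and val_error are assumed callables supplied by the user, and the patience parameter is my own addition:

```python
import numpy as np

def train_with_early_stopping(train_epoch, val_error, max_epochs=1000, patience=50):
    """Early-stopping loop: train_epoch() runs one epoch of weight updates,
    val_error() returns the current validation error (both assumed callables)."""
    best_err, best_epoch = np.inf, 0
    history = []
    for e in range(1, max_epochs + 1):
        train_epoch()                        # one pass through the training data
        err = val_error()                    # validation error after each epoch
        history.append(err)
        if err < best_err:
            best_err, best_epoch = err, e    # remember the best epoch so far
        elif e - best_epoch >= patience:     # no improvement for `patience` epochs
            break                            # stop early to avoid overfitting
    return best_epoch, history
```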

Review: solution of the XOR problem by nonlinear regression. A possible structure of the MLP to solve the XOR problem; should we consider a structure with 2 nodes in the hidden layer?

Edom's code for the solution of XOR by non-linear regression with 3 nodes in the hidden layer. Note: no momentum parameter and no binning of y(t) to predict the class label.
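Edom's code itself is not reproduced here (it was shown as an image). The following is my own minimal sketch of such a fit under the same setup: 3 hidden nodes, labels ±1, sum-of-squares error, no momentum; the learning rate, epoch count, and the ±0.5 initial weight range are my choices:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # bias, x1, x2
r = np.array([-1.0, 1.0, 1.0, -1.0])                                     # XOR labels as +/-1

H, eta = 3, 0.2                                   # 3 hidden nodes, learning rate
W = rng.uniform(-0.5, 0.5, size=(H, 3))           # input -> hidden weights
v = rng.uniform(-0.5, 0.5, size=H + 1)            # hidden (+ bias) -> output weights

for epoch in range(10000):
    for t in rng.permutation(4):                  # online updates, no momentum
        z = sigmoid(W @ X[t])
        z1 = np.concatenate(([1.0], z))
        y = v @ z1                                # linear output: nonlinear regression
        err = r[t] - y
        W += eta * err * np.outer(v[1:] * z * (1 - z), X[t])
        v += eta * err * z1

pred = [v @ np.concatenate(([1.0], sigmoid(W @ x))) for x in X]
print(np.sign(pred))   # typically [-1. 1. 1. -1.] when the fit succeeds
```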

Edom's solution of the XOR problem by nonlinear regression: the fit is adequate to assign class labels.

In classification by regression there is a single output (K = 1, no summation over outputs). Bin y before calculating the difference from the target; this allows the number misclassified to be used as the error, and binning y gives a class assignment.
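A tiny illustration (values made up) of binning the regression output at 0 and counting misclassifications:

```python
import numpy as np

# Bin the regression output to get a class assignment (two-class case, labels +/-1),
# then count misclassifications as the error.
y = np.array([-0.8, 0.3, 1.2, -0.1])     # raw network outputs (illustrative values)
r = np.array([-1, 1, 1, 1])              # true labels
pred = np.where(y > 0, 1, -1)            # bin y at 0
n_misclassified = np.sum(pred != r)      # usable as an error measure
print(pred, n_misclassified)
```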

Binning creates flat regions (from Heng's HW6).

Assignment 6 (due): use the randomized, shortened dataset glassdata.csv to develop a classifier for beer-bottle glass by ANN non-linear regression. Keep the class labels as 1, 2, and 6. With a validation set of 100 examples and a training set of 74 examples, select the best number of hidden nodes in a single hidden layer and the best number of epochs for weight refinement. Use all the data to optimize the weights at the selected structure and training time. Calculate the confusion matrix and the accuracy of prediction. Use 10-fold cross validation to estimate the accuracy on a test set. When should you start with random weights in [-0.01, 0.01]?

Two-class discrimination with one hidden layer: the sigmoid output models P(C_1|x). Minimize the cross entropy with batch updates; the weight-update formulas are the same as for regression. Assign examples to C_1 if the output > 0.5.

K > 2 classes with one hidden layer: v_i is the weight vector connecting the nodes of the hidden layer to the output of class i (note the sum over i). Minimize the in-sample cross entropy by batch update and assign examples to the class with the largest output.
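A sketch (mine) of the forward pass for this case; I assume a softmax output so that the K outputs sum to one, which is the usual choice when minimizing multi-class cross entropy:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(o):
    e = np.exp(o - np.max(o))              # subtract max for numerical stability
    return e / e.sum()

def forward_k_classes(W, V, x):
    """Forward pass for K > 2 classes with one hidden layer.
    W: (H, d+1) input->hidden weights; V: (K, H+1) hidden->output weights,
    row v_i connecting the hidden layer to the output of class i."""
    z = np.concatenate(([1.0], sigmoid(W @ x)))   # hidden layer with bias unit
    y = softmax(V @ z)                            # y_i models P(C_i | x)
    return np.argmax(y), y                        # assign to the class with the largest output
```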