1
Fundamentals of Artificial Neural Networks Chapter 7 in amlbook.com
2
Review: neuron analogy to linear models. The dot product w^T x combines the attributes with a bias (x_0 = 1) into a scalar signal s, which is used to form a hypothesis about x. The analogy is called a "perceptron"; the signal may be passed through sigmoid(s).
3
What can a perceptron do in 1D? Fit a line to data, y = wx + w_0, or use y = wx + w_0 as a discriminant. With S = sigmoid(y): if y > 0 then S > 0.5, so choose green; otherwise choose red. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
4
x is the input vector, w is the weight vector, and y = w^T x. A perceptron can do the same thing in d dimensions: fit a plane to data, or use a plane as a discriminant for binary classification. For regression the output is y; for classification the output is sigmoid(y). Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
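Not in the slides: a minimal Python sketch (assuming NumPy) of this perceptron computation in d dimensions; the function names and the convention of prepending x_0 = 1 are illustrative.

import numpy as np

def sigmoid(s):
    """Logistic squashing function."""
    return 1.0 / (1.0 + np.exp(-s))

def perceptron(x, w, task="regression"):
    """Perceptron output for one example.
    x: attribute vector of length d (bias not included)
    w: weight vector of length d + 1, with w[0] the bias weight
    """
    x_aug = np.concatenate(([1.0], x))   # prepend x0 = 1 for the bias
    s = w @ x_aug                        # scalar signal s = w^T x
    return s if task == "regression" else sigmoid(s)

# Example: the AND discriminant x1 + x2 - 1.5 = 0 derived on the following slides
w_and = np.array([-1.5, 1.0, 1.0])
print(perceptron(np.array([1.0, 1.0]), w_and, task="classification"))  # about 0.62 > 0.5, so class 1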
5
Boolean AND: a linearly separable 2D binary classification problem. x_1 + x_2 = 1.5 is an acceptable linear discriminant. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
6
Derive the linear discriminant x_1 + x_2 - 1.5 = 0. The discriminant is w^T x = 0, with y = w^T x and w^T x >= 0 → r = 1.

x1  x2  r   required            choice
0   0   0   w0 < 0              w0 = -1.5
0   1   0   w2 + w0 < 0         w2 = 1
1   0   0   w1 + w0 < 0         w1 = 1
1   1   1   w1 + w2 + w0 > 0

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
7
Boolean AND: a linearly separable 2D binary classification problem. Other linear discriminants are possible; we have not yet specified an optimization condition. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
8
Boolean XOR: a linearly inseparable 2D binary classification problem. A perceptron cannot be applied in attribute space. Solution: transform to a linearly separable feature space.
9
XOR in Gaussian feature space. φ1 = exp(-|x - [1,1]|²), φ2 = exp(-|x - [0,0]|²). This transformation puts examples (0,1) and (1,0) at the same point in feature space, so a perceptron could be applied to find a linear discriminant.

x       φ1      φ2
(1,1)   1       0.1353
(0,1)   0.3678  0.3678
(0,0)   0.1353  1
(1,0)   0.3678  0.3678
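A short sketch (assuming NumPy; not part of the slides) that reproduces the feature table above:

import numpy as np

def gaussian_features(X, centers):
    """phi_j(x) = exp(-||x - c_j||^2) for each center c_j."""
    return np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=float)   # XOR inputs
centers = np.array([[1, 1], [0, 0]], dtype=float)             # phi1 centered at (1,1), phi2 at (0,0)
print(np.round(gaussian_features(X, centers), 4))
# rows correspond to (1,1), (0,1), (0,0), (1,0); columns to phi1, phi2 (compare with the table above)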
10
Review: XOR in Gaussian feature space. The transformation φ1 = exp(-|x - [1,1]|²), φ2 = exp(-|x - [0,0]|²) puts examples (0,1) and (1,0) at the same point in feature space, so a perceptron could be applied to find a linear discriminant (see the feature table on the previous slide). (Figure: XOR data plotted in feature space, r = 1 vs r = 0.)
11
Add a "hidden" layer to the perceptron. Derive 2 weight vectors connecting the input to the hidden layer that define linearly separable features, and 1 weight vector connecting the hidden layer to the output that defines a linear discriminant separating those features. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
12
Consider the hidden units z_h as features. Choose weight vectors w_h so that in feature space (0,0) and (1,1) map close to the same point. (Figure: attribute space and ideal feature space, axes z_1 and z_2.)
13
Data table in feature space:

r   z1   z2
0   ~0   ~0
1   ~0   ~1
1   ~1   ~0
0   ~0   ~0

If w_h^T x << 0 → z_h ≈ 0; if w_h^T x >> 0 → z_h ≈ 1.
14
Find weight vectors for linearly separable features.

For z1 (weight vector w_1):
x1  x2  z1   w_1^T x  required             choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~0   < 0      w2 + w0 < 0          w2 = -1
1   0   ~1   > 0      w1 + w0 > 0          w1 = 1
1   1   ~0   < 0      w1 + w2 + w0 < 0

For z2 (weight vector w_2):
x1  x2  z2   w_2^T x  required             choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~1   > 0      w2 + w0 > 0          w2 = 1
1   0   ~0   < 0      w1 + w0 < 0          w1 = -1
1   1   ~0   < 0      w1 + w2 + w0 < 0
15
Transformation of the input by the hidden layer: z1 = sigmoid(x1 - x2 - 0.5), z2 = sigmoid(-x1 + x2 - 0.5).

x1  x2  arg1   z1    arg2   z2    r
0   0   -0.5   0.38  -0.5   0.38  0
0   1   -1.5   0.18   0.5   0.62  1
1   0    0.5   0.62  -1.5   0.18  1
1   1   -0.5   0.38  -0.5   0.38  0

(Figure: XOR data transformed by the hidden layer, axes z1 and z2.)
16
Find the weights connecting the hidden layer to the output that define a linear discriminant in feature space. Denote the weight vector by v; the output is transformed by y = sigmoid(v^T z).

z1    z2    r   v^T z  required                        choice
0.38  0.38  0   < 0    0.38 v1 + 0.38 v2 + v0 < 0      v0 = -0.78
0.18  0.62  1   > 0    0.18 v1 + 0.62 v2 + v0 > 0      v1 = 1
0.62  0.18  1   > 0    0.62 v1 + 0.18 v2 + v0 > 0      v2 = 1
0.38  0.38  0   < 0    0.38 v1 + 0.38 v2 + v0 < 0
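Putting the derived weights together, a sketch (assuming NumPy; not from the slides) of the hand-built two-layer XOR network:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hidden-layer weights derived above: z1 = sigmoid(x1 - x2 - 0.5), z2 = sigmoid(-x1 + x2 - 0.5)
W = np.array([[-0.5,  1.0, -1.0],    # w for z1: [w0, w1, w2]
              [-0.5, -1.0,  1.0]])   # w for z2
v = np.array([-0.78, 1.0, 1.0])      # output weights: [v0, v1, v2]

def xor_net(x):
    x_aug = np.concatenate(([1.0], x))   # prepend bias input
    z = sigmoid(W @ x_aug)               # hidden-layer features
    z_aug = np.concatenate(([1.0], z))
    return sigmoid(v @ z_aug)            # output S; S > 0.5 -> class 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(xor_net(np.array(x, dtype=float)), 3))
# S is slightly above 0.5 for (0,1) and (1,0), and slightly below 0.5 for (0,0) and (1,1)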
17
This is not the only solution to the XOR classification problem. Next, define an optimization condition that enables learning optimum weights for both layers; the data will then determine the best transform of the input. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
18
Training a neural network by back-propagation. Initialize the weights randomly. We need a rule that relates changes in the weights to the difference between output and target.
19
If the expression for the in-sample error is simple (e.g. squared residuals) and the network is not too complex (e.g. fewer than 3 hidden layers), an analytical expression for the rate of change of the error with respect to the weights can be derived.
20
Simplest case: multivariate linear regression, where the in-sample error is the sum of squared residuals and there are no hidden layers. This example is instructive but not needed in practice: the normal equations are a better way to find the optimum weights in this case.
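For comparison, a minimal sketch of the normal-equation solution mentioned here (assuming NumPy, with X carrying a leading column of ones for the bias):

import numpy as np

def normal_equation_weights(X, r):
    """w = (X^T X)^{-1} X^T r, the least-squares optimum found without iteration."""
    return np.linalg.solve(X.T @ X, X.T @ r)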
21
Approaches to training. Online: weights updated from training-set examples seen one by one in random order. Batch: weights updated from the whole training set after summing the deviations from individual examples. Weight-update formulas are simpler for the "online" approach; the "batch" formulas can be derived from the "online" formulas. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
22
Weight-update rule: perceptron regression. The contribution to the sum of squared residuals from a single example is E^t(w | x^t, r^t) = (r^t - y^t)²/2. Here w_j is the jth component of the weight vector w connecting the attribute vector x to the scalar output y. E^t depends on w_j through y^t = w^T x^t, hence use the chain rule: ∂E^t/∂w_j = -(r^t - y^t) x_j^t, giving the update Δw_j = η (r^t - y^t) x_j^t. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
23
This weight-update formula is called "stochastic gradient descent", and the proportionality constant η is called the "learning rate". Since Δw_j is proportional to x_j, all attributes should be roughly the same size; normalizing them to achieve this may be helpful.
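Not part of the slides: a minimal sketch (assuming NumPy, with inputs already augmented by x_0 = 1) of this online stochastic-gradient-descent update; the function name, learning rate, and toy data are illustrative.

import numpy as np

def sgd_epoch(X, r, w, eta=0.05):
    """One online epoch: update w after each example, visited in random order."""
    for t in np.random.permutation(len(X)):
        y = w @ X[t]                        # y^t = w^T x^t
        w = w + eta * (r[t] - y) * X[t]     # delta w_j = eta (r^t - y^t) x_j^t
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])   # bias column plus one attribute
r = 2.0 * X[:, 1] + 0.5 + rng.normal(0, 0.1, 50)
w = np.zeros(2)
for epoch in range(100):
    w = sgd_epoch(X, r, w)
print(w)   # approaches roughly [0.5, 2.0]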
24
Momentum parameter: keep part of the previous update (add α Δw_j^{t-1} to the current update). How do the learning rate and momentum affect training? As the learning rate → 1, back-propagation becomes deterministic: each example determines a set of weights optimal for itself only. As the learning rate → 0, the probability of being trapped in a local minimum → 1, because the step size of each weight change is so small. A large momentum parameter reduces trapping at small learning rates but increases the likelihood that a single outlier will dramatically affect weight optimization. Opinions differ on the best choice of learning rate and momentum. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
25
Multivariate linear dichotomizer. The weight vector w connects the input to the output, which is transformed by the sigmoid function: y = sigmoid(w^T x). The in-sample error is the cross entropy. This example is equivalent to logistic regression, but with r^t ∈ {0, 1} instead of {-1, +1}.
26
Weight update for optimization by back propagation. Assume that r^t is drawn from a Bernoulli distribution with parameter p_0, the probability that r^t = 1: p(r) = p_0^r (1 - p_0)^(1-r), so p(r = 0) = 1 - p_0. Let y = sigmoid(w^T x) be the MLE of p_0; then p(r) = y^r (1 - y)^(1-r). Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
27
Weight update for optimization by back propagation. Assume that r^t is drawn from a Bernoulli distribution with parameter p_0, the probability that r^t = 1, and that y = sigmoid(w^T x) is the MLE of p_0. Let L(w|X) be the log-likelihood of the weight vector w given the training set X: L(w|X) = Σ_t [r^t log y^t + (1 - r^t) log(1 - y^t)]. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
28
Since y^t = sigmoid(w^T x^t) is between 0 and 1, L(w|X) < 0. Therefore, to maximize L(w|X), minimize the cross entropy E(w|X) = -Σ_t [r^t log y^t + (1 - r^t) log(1 - y^t)]. Like the sum of squared residuals, the cross entropy depends on the weights w through y^t. Unlike fitting a plane to data, where y^t = w^T x^t, here y^t = sigmoid(w^T x^t), which puts an additional factor in the chain rule when deriving the back-propagation formula.
29
With some tedious algebra, you can show that the result has the same form as for online training in regression, now with y^t = sigmoid(w^T x^t); a worked sketch of that algebra follows below. The result can be generalized to multi-class classification.
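The "tedious algebra" is mainly the chain rule through the sigmoid; a worked sketch in LaTeX notation:

% Cross entropy for one example, with y^t = sigmoid(w^T x^t)
E^t = -\left[ r^t \log y^t + (1 - r^t)\log(1 - y^t) \right]

% Chain rule: dE/dy, then dy/ds with s = w^T x^t (sigmoid derivative y(1-y)), then ds/dw_j = x_j^t
\frac{\partial E^t}{\partial w_j}
  = -\left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right) y^t (1 - y^t)\, x_j^t
  = -(r^t - y^t)\, x_j^t

% So the online update has the same form as for regression:
\Delta w_j = \eta\, (r^t - y^t)\, x_j^t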
30
Perceptron classification with K > 1 classes. Weight vector w_i (a column of the weight matrix W) connects the input vector x to output node y_i; w_ij is the jth component of w_i. Assign each example to the class with the largest y_i. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
31
Review: Multilayer Perceptrons (MLP). Layers between input and output are called "hidden". Both the number of hidden layers and the number of hidden units in a layer are variables in the structure of an MLP. Less complex structures improve generalization; more than one hidden layer is seldom necessary. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
32
Review: MLP solution of the XOR classification problem. Nodes in the hidden layer are linearly separable transforms of the inputs; the data determine the transform that minimizes the in-sample error. The weights that connect the hidden layer to the output node define a linear discriminant in feature space, v^T z = 0. Transform the output by the sigmoid function; if S > 0.5, assign class 1. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
33
Solution of the XOR problem by nonlinear regression. Do not transform the output node; change the class labels to ±1 and fit v^T z to the labels by nonlinear regression. If v^T z > 0, assign class 1. The data determine the best level of nonlinearity. Let the in-sample error be the sum of squared residuals and use back propagation to optimize the weights; derive formulas for updating the weights in both layers. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
34
Review: weight update rules for nonlinear regression. Given weights w_h and v, the forward pass computes the hidden layer, transformed by the sigmoid; h weight vectors connect the input to the hidden layer, and 1 weight vector connects the hidden layer to the output. The backward pass updates the weights. Each pass through all the training data is called an "epoch". Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
35
Review: weight update rules for nonlinear regression, now with a momentum parameter. As before, h weight vectors w_h connect the input to the hidden layer (transformed by the sigmoid) and one weight vector v connects the hidden layer to the output. Calculate the changes to the w_h vectors before changing v. The learning rates can be different for the w_h and v updates. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
36
Review: weight update rules for nonlinear regression, with attributes normalized to [0, 1]. It may be helpful to normalize the attributes; other normalization methods can also be used. The transforms in the hidden layer do not require normalization. A code sketch of these forward and backward passes follows below. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
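Not from the slides: a compact Python sketch (assuming NumPy) of these update rules for a single hidden layer of H sigmoid units, a linear output, squared-error loss, online updates with learning rate eta, and a momentum parameter alpha; the function name and the initialization range are illustrative.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_mlp(X, r, H=3, eta=0.1, alpha=0.5, epochs=500, seed=0):
    """X: (N, d+1) with a leading column of ones; r: (N,) regression targets."""
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    W = rng.uniform(-0.01, 0.01, (H, d1))     # input -> hidden weight vectors w_h
    v = rng.uniform(-0.01, 0.01, H + 1)       # hidden -> output weight vector (v[0] = bias)
    dW_prev, dv_prev = np.zeros_like(W), np.zeros_like(v)
    for _ in range(epochs):                   # one pass over all examples = one epoch
        for t in rng.permutation(N):
            # forward pass
            z = sigmoid(W @ X[t])             # hidden-layer features z_h
            z_aug = np.concatenate(([1.0], z))
            y = v @ z_aug                     # linear output (regression)
            # backward pass: both deltas use the current v (changes to w_h computed before changing v)
            err = r[t] - y
            dv = eta * err * z_aug + alpha * dv_prev
            dW = eta * err * np.outer(v[1:] * z * (1 - z), X[t]) + alpha * dW_prev
            W += dW
            v += dv
            dW_prev, dv_prev = dW, dv
    return W, v

Calling train_mlp on the four XOR examples (with a bias column and targets ±1) gives one concrete instance of the procedure sketched here.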
37
Use E_val to determine the number of nodes h in the hidden layer. There is no significant decrease in E_val or E_in after h = 3, and above h ≈ 15 E_val increases while E_in stays flat. Favor small h to avoid overfitting. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
38
Stop early to avoid overfitting. Beyond the elbow, E_val ≈ E_in for roughly 200 epochs; above about 600 epochs there is evidence of overfitting. Expect overfitting at roughly 10× the elbow epoch; set the stopping epoch when you see evidence of overfitting. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
39
Stop early to avoid overfitting: use a validation set to detect overfitting, with the validation error calculated after each epoch. Toy data: x^t ~ U(-0.5, 0.5), y^t = sin(6x^t) + N(0, 0.1). The fit looks better at 300 epochs, so why stop at 200 epochs? Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
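A sketch (again assuming NumPy; the network size H = 8, learning rate, patience rule, and random seeds are illustrative choices, not from the slides) of early stopping on this toy problem, computing the validation error after every epoch and stopping once it has not improved for a while:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-0.5, 0.5, n)
    return x, np.sin(6 * x) + rng.normal(0, 0.1, n)    # y^t = sin(6 x^t) + N(0, 0.1)

x_tr, r_tr = make_data(100)
x_va, r_va = make_data(100)
X_tr = np.column_stack([np.ones(100), x_tr])           # augment with bias input x0 = 1
X_va = np.column_stack([np.ones(100), x_va])

H, eta = 8, 0.2
W = rng.uniform(-0.01, 0.01, (H, 2))                   # input -> hidden
v = rng.uniform(-0.01, 0.01, H + 1)                    # hidden -> output

def val_error():
    Z = np.column_stack([np.ones(100), sigmoid(X_va @ W.T)])
    return np.mean((r_va - Z @ v) ** 2)

best_E_val, best_epoch, patience = np.inf, 0, 100
for epoch in range(1, 2001):
    for t in rng.permutation(100):                     # one online pass = one epoch
        z = sigmoid(W @ X_tr[t])
        z_aug = np.concatenate(([1.0], z))
        err = r_tr[t] - v @ z_aug
        W += eta * err * np.outer(v[1:] * z * (1 - z), X_tr[t])
        v += eta * err * z_aug
    E_val = val_error()                                # validation error after each epoch
    if E_val < best_E_val:
        best_E_val, best_epoch = E_val, epoch
    elif epoch - best_epoch > patience:
        break                                          # stop: E_val has not improved for `patience` epochs

print(best_epoch, round(best_E_val, 4))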
40
Review: solution of the XOR problem by nonlinear regression. This is a possible structure of an MLP to solve the XOR problem. Should we consider a structure with 2 nodes in the hidden layer? Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
41
Edom's code for the solution of XOR by nonlinear regression with 3 nodes in the hidden layer. Note: no momentum parameter and no binning of y^t to predict the class label.
42
Edom's solution of the XOR problem by nonlinear regression. The fit is adequate to assign class labels.
43
In classification by regression, bin y before calculating the difference from the target; this allows the number misclassified to be used as the error. For regression there is no summation over classes (K = 1); bin y to get a class assignment. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
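A small sketch of the binning step; the labels 1, 2, 6 are taken from the assignment below, and the nearest-label rule is one simple choice, not the only one:

import numpy as np

def bin_to_labels(y, labels=(1, 2, 6)):
    """Assign each continuous output to the nearest class label."""
    labels = np.asarray(labels, dtype=float)
    return labels[np.argmin(np.abs(y[:, None] - labels[None, :]), axis=1)]

print(bin_to_labels(np.array([0.8, 2.3, 4.1, 5.7])))   # -> [1. 2. 6. 6.]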
44
Binning creates flat regions (from Heng's HW6).
45
Assignment 6, due 11-13-15. Use the randomized, shortened dataset glassdata.csv to develop a classifier for beer-bottle glass by ANN nonlinear regression. Keep the class labels as 1, 2, and 6. With a validation set of 100 examples and a training set of 74 examples, select the best number of hidden nodes in a single hidden layer and the best number of epochs for weight refinement. Use all the data to optimize the weights at the selected structure and training time. Calculate the confusion matrix and the accuracy of prediction. Use 10-fold cross validation to estimate the accuracy on a test set. When should you start with random weights in [-0.01, 0.01]?
46
Two-class discrimination with one hidden layer. The sigmoid output models P(C_1 | x). Minimize the cross entropy with batch updates; the weight-update formulas are the same as for regression. Assign examples to C_1 if the output > 0.5. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
47
K > 2 classes with one hidden layer. v_i is the weight vector connecting the nodes of the hidden layer to the output for class i (note the sum over i in the error). Minimize the in-sample cross entropy by batch update and assign examples to the class with the largest output. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
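Not from the slides: a sketch of the output layer for K > 2 classes, assuming softmax outputs and one-hot targets; the function and variable names are illustrative.

import numpy as np

def softmax(O):
    """Row-wise softmax: y_i = exp(o_i) / sum_k exp(o_k)."""
    e = np.exp(O - O.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def output_layer_batch_update(Z, R, V, eta=0.1):
    """Z: (N, H+1) hidden-layer outputs with a bias column; R: (N, K) one-hot targets;
    V: (K, H+1) weight vectors v_i. Returns V after one batch gradient step on the cross entropy."""
    Y = softmax(Z @ V.T)                  # y_i^t for every example and class
    V += eta * (R - Y).T @ Z              # delta v_i = eta * sum_t (r_i^t - y_i^t) z^t
    return V

def classify(Z, V):
    """Assign each example to the class with the largest output."""
    return np.argmax(Z @ V.T, axis=1)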