Lecture Slides for INTRODUCTION TO Machine Learning, ETHEM ALPAYDIN © The MIT Press
CHAPTER 11: Multilayer Perceptrons
Neural Networks
Networks of processing units (neurons) with connections (synapses) between them
Large number of neurons: on the order of 10^10
Large connectivity: on the order of 10^5 connections per neuron
Parallel processing
Distributed computation/memory
Robust to noise, failures
Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: from hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: update by training/experience
Perceptron (Rosenblatt, 1962)
What a Perceptron Does
Regression: y = wx + w_0
Classification: y = 1(wx + w_0 > 0)
[Figures: the linear unit, the same unit with the bias taken as an extra input x_0 = +1, and the unit with a sigmoid output s]
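A minimal sketch of these two uses of a perceptron, assuming NumPy and made-up weight values; the function names and numbers are illustrative, and the threshold and sigmoid variants differ only in how the weighted sum is post-processed.

```python
import numpy as np

def perceptron_regression(w, w0, x):
    # Linear output: y = w . x + w0
    return np.dot(w, x) + w0

def perceptron_classification(w, w0, x):
    # Threshold output: y = 1 if w . x + w0 > 0, else 0
    return 1 if np.dot(w, x) + w0 > 0 else 0

def perceptron_sigmoid(w, w0, x):
    # Smooth "soft" classification output in (0, 1)
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

# Example with arbitrary (illustrative) weights
w, w0 = np.array([0.5, -0.3]), 0.1
x = np.array([1.0, 2.0])
print(perceptron_regression(w, w0, x), perceptron_classification(w, w0, x))
```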
K Outputs
Regression: y_i = w_i^T x = Σ_j w_ij x_j + w_i0, i.e., y = Wx
Classification: o_i = w_i^T x, y_i = exp(o_i) / Σ_k exp(o_k); choose C_i if y_i = max_k y_k
Training
Online (instances seen one by one) vs batch (whole sample) learning:
No need to store the whole sample
Problem may change in time
Wear and degradation in system components
Stochastic gradient descent: update after a single pattern
Generic update rule (LMS rule): Δw_j^t = η (r^t − y^t) x_j^t, i.e., update = learning factor · (desired output − actual output) · input (see the sketch after the next slide)
Training a Perceptron: Regression
Regression (linear output): E^t(w | x^t, r^t) = (1/2) (r^t − y^t)^2, where y^t = w^T x^t
Update: Δw_j^t = η (r^t − y^t) x_j^t
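A sketch of one stochastic (online) LMS step for the linear-output case, with made-up data; each pattern (x^t, r^t) updates the weights immediately, so the whole sample never needs to be stored.

```python
import numpy as np

def lms_update(w, w0, x, r, eta=0.1):
    """One online (stochastic gradient descent) step of the LMS rule:
    update = learning factor * (desired output - actual output) * input."""
    y = np.dot(w, x) + w0           # actual (linear) output
    w = w + eta * (r - y) * x       # Delta w_j = eta * (r - y) * x_j
    w0 = w0 + eta * (r - y)         # bias uses a constant input x_0 = 1
    return w, w0

# Illustrative data: learn y ~ 2*x1 - x2 online, one pattern at a time
rng = np.random.default_rng(0)
w, w0 = np.zeros(2), 0.0
for _ in range(1000):
    x = rng.uniform(-1, 1, size=2)
    r = 2 * x[0] - x[1]
    w, w0 = lms_update(w, w0, x, r)
print(w, w0)   # should approach [2, -1] and 0
```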
Classification
Single sigmoid output: y^t = sigmoid(w^T x^t) estimates P(C_1 | x^t); with cross-entropy error the update is again Δw_j^t = η (r^t − y^t) x_j^t
K > 2: softmax outputs y_i^t = exp(o_i^t) / Σ_k exp(o_k^t) with o_i^t = w_i^T x^t, and update Δw_ij^t = η (r_i^t − y_i^t) x_j^t
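A sketch of the two output types, assuming cross-entropy error; the names are illustrative. Conveniently, the resulting gradient step has the same (desired − actual) · input form as the LMS rule above.

```python
import numpy as np

def sigmoid_output(w, x):
    # Two classes: y = sigmoid(w . x), interpreted as P(C1 | x)
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def softmax_outputs(W, x):
    # K > 2 classes: y_i = exp(o_i) / sum_k exp(o_k), with o = W x
    o = W @ x
    e = np.exp(o - o.max())          # subtract max for numerical stability
    return e / e.sum()

def classification_update(W, x, r, eta=0.1):
    # With cross-entropy error the gradient step is eta * (r_i - y_i) * x_j
    y = softmax_outputs(W, x)
    return W + eta * np.outer(r - y, x)
```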
Sigmoid Unit
net = Σ_{i=0..n} w_i x_i, with x_0 = 1
o = σ(net) = 1 / (1 + e^{−net})
σ(x) is the sigmoid function 1 / (1 + e^{−x}), with derivative dσ(x)/dx = σ(x) (1 − σ(x))
Gradient descent rule for a single sigmoid unit: ∂E/∂w_i = − Σ_d (t_d − o_d) o_d (1 − o_d) x_{i,d}
Multilayer networks of sigmoid units are trained by backpropagation.
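A small sketch of the sigmoid, its derivative, and the single-unit gradient stated above, evaluated on illustrative random data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # d sigma / dz = sigma(z) (1 - sigma(z))

def unit_gradient(w, X, t):
    """Gradient of E = 1/2 sum_d (t_d - o_d)^2 for one sigmoid unit:
    dE/dw_i = - sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d}."""
    o = sigmoid(X @ w)                        # outputs for all examples
    return -X.T @ ((t - o) * o * (1.0 - o))   # one component per weight

# Illustrative use on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
t = rng.integers(0, 2, size=5).astype(float)
print(unit_gradient(np.zeros(3), X, t))
```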
Learning Boolean AND
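AND is linearly separable, so a single perceptron suffices. The weights below are hand-set illustrative values (not the ones found by training on the slide): the unit fires only when both inputs are 1.

```python
def and_perceptron(x1, x2):
    # y = 1(x1 + x2 - 1.5 > 0): true only for x1 = x2 = 1
    return int(x1 + x2 - 1.5 > 0)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_perceptron(x1, x2))
```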
XOR
No w_0, w_1, w_2 satisfy (Minsky and Papert, 1969):
w_0 ≤ 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 ≤ 0
Multi-Layer Networks
Input layer, hidden layer, output layer
Multilayer Perceptrons (Rumelhart et al., 1986)
Hidden units: z_h = sigmoid(w_h^T x) = 1 / [1 + exp(−(Σ_j w_hj x_j + w_h0))], h = 1, …, H
Outputs: y_i = v_i^T z = Σ_h v_ih z_h + v_i0
x_1 XOR x_2 = (x_1 AND ~x_2) OR (~x_1 AND x_2)
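A sketch of this decomposition with hand-set weights (illustrative values, not the ones in the figure): one hidden unit detects x_1 AND ~x_2, the other ~x_1 AND x_2, and the output unit ORs them.

```python
def step(z):
    return int(z > 0)

def xor_mlp(x1, x2):
    h1 = step(x1 - x2 - 0.5)       # fires only for (1, 0): x1 AND ~x2
    h2 = step(-x1 + x2 - 0.5)      # fires only for (0, 1): ~x1 AND x2
    return step(h1 + h2 - 0.5)     # OR of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))
```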
Backpropagation
Backpropagation Algorithm
Initialize each w_{i,j} to some small random value
Until the termination condition is met, do:
  For each training example, do:
    Input the instance (x_1, …, x_n) to the network and compute the network outputs o_k
    For each output unit k: δ_k = o_k (1 − o_k) (t_k − o_k)
    For each hidden unit h: δ_h = o_h (1 − o_h) Σ_k w_{h,k} δ_k
    For each network weight w_{i,j}, do: w_{i,j} = w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}
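A compact sketch of this procedure for a single-hidden-layer network of sigmoid units, updated online on one example per call; the names, learning rate, and iteration count are illustrative, and (as the next slide notes) the run may settle in a local minimum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, t, eta=0.25):
    """One stochastic backpropagation update. W_hid: (H, d+1), W_out: (K, H+1).
    Inputs and hidden activations are augmented with a constant 1 for the bias."""
    x_aug = np.append(x, 1.0)
    o_hid = sigmoid(W_hid @ x_aug)                 # hidden outputs o_h
    h_aug = np.append(o_hid, 1.0)
    o_out = sigmoid(W_out @ h_aug)                 # network outputs o_k

    delta_out = o_out * (1 - o_out) * (t - o_out)                     # delta_k
    delta_hid = o_hid * (1 - o_hid) * (W_out[:, :-1].T @ delta_out)   # delta_h

    W_out = W_out + eta * np.outer(delta_out, h_aug)   # Delta w = eta * delta_j * x_ij
    W_hid = W_hid + eta * np.outer(delta_hid, x_aug)
    return W_hid, W_out

# Illustrative use: try to learn XOR with 2 hidden units
rng = np.random.default_rng(1)
W_hid = rng.uniform(-0.5, 0.5, (2, 3))
W_out = rng.uniform(-0.5, 0.5, (1, 3))
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)
for _ in range(10000):
    i = rng.integers(4)
    W_hid, W_out = backprop_step(W_hid, W_out, X[i], T[i])
```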
Backpropagation
Gradient descent over the entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights)
Often a weight momentum term is included: Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n−1)
Minimizes error over the training examples; will it generalize well to unseen instances (overfitting)?
Training can be slow, typically thousands of iterations (Levenberg-Marquardt can be used instead of plain gradient descent)
Using the network after training is fast
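A minimal sketch of the momentum term, under the assumption that the plain gradient step has already been computed: each weight change blends the current step with a fraction α of the previous change, which smooths the trajectory.

```python
def momentum_update(w, grad_step, prev_delta, alpha=0.9):
    """Delta w(n) = eta * delta_j * x_ij + alpha * Delta w(n-1).
    grad_step is the plain gradient-descent step (eta * delta_j * x_ij)."""
    delta = grad_step + alpha * prev_delta
    return w + delta, delta          # also return delta for the next call
```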
Regression
Forward: y^t = Σ_h v_h z_h^t + v_0, where z_h^t = sigmoid(w_h^T x^t)
Error: E(W, v | X) = (1/2) Σ_t (r^t − y^t)^2
Backward: Δv_h = η Σ_t (r^t − y^t) z_h^t
          Δw_hj = η Σ_t (r^t − y^t) v_h z_h^t (1 − z_h^t) x_j^t
Regression with Multiple Outputs
y_i^t = Σ_h v_ih z_h^t + v_i0
E(W, V | X) = (1/2) Σ_t Σ_i (r_i^t − y_i^t)^2
Δv_ih = η Σ_t (r_i^t − y_i^t) z_h^t
Δw_hj = η Σ_t [Σ_i (r_i^t − y_i^t) v_ih] z_h^t (1 − z_h^t) x_j^t
Binary Encoder-Decoder
8 inputs, 3 hidden units, 8 outputs: the network is trained to reproduce each of the eight one-hot input patterns at its outputs, and the learned hidden values form a compact code for the inputs.
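A sketch of the 8-3-8 experiment using the same kind of backpropagation updates as above (here in batch form, with illustrative learning rate and iteration count); the targets equal the inputs, so after training the three hidden values act as a learned, roughly binary code for the eight patterns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.eye(8)                                   # the eight one-hot patterns
T = X                                           # targets = inputs (auto-association)
Xb = np.hstack([X, np.ones((8, 1))])            # append constant 1 for the bias
W1 = rng.uniform(-0.3, 0.3, (9, 3))             # 8 inputs + bias -> 3 hidden units
W2 = rng.uniform(-0.3, 0.3, (4, 8))             # 3 hidden + bias -> 8 outputs
eta = 0.5
for _ in range(20000):
    H = sigmoid(Xb @ W1)                        # hidden values, shape (8, 3)
    Hb = np.hstack([H, np.ones((8, 1))])
    Y = sigmoid(Hb @ W2)                        # reconstructed outputs
    d_out = Y * (1 - Y) * (T - Y)               # output deltas
    d_hid = H * (1 - H) * (d_out @ W2[:3].T)    # hidden deltas (bias row skipped)
    W2 += eta * Hb.T @ d_out
    W1 += eta * Xb.T @ d_hid
print(np.round(sigmoid(Xb @ W1), 2))            # learned hidden codes
```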
Sum of Squared Errors for the Output Units
Hidden Unit Encoding for Input
Convergence of Backprop
Gradient descent to some local minimum; perhaps not the global minimum
  Add momentum
  Stochastic gradient descent
  Train multiple nets with different initial weights
Nature of convergence
  Initialize weights near zero
  Therefore, initial networks are near-linear
  Increasingly non-linear functions become possible as training progresses
Expressive Capabilities of ANN
Boolean functions
  Every Boolean function can be represented by a network with a single hidden layer
  But it might require a number of hidden units exponential in the number of inputs
Continuous functions
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik 1989]
  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
[Figure: each hidden unit computes a sigmoid response z_h = sigmoid(w_h x + w_h0); the output sums the weighted responses v_h z_h to approximate the target function]
Two-Class Discrimination
One sigmoid output y^t estimates P(C_1 | x^t), and P(C_2 | x^t) ≡ 1 − y^t
Cross-entropy error: E(W, v | X) = − Σ_t [ r^t log y^t + (1 − r^t) log (1 − y^t) ]
K > 2 Classes
Softmax outputs: y_i^t = exp(o_i^t) / Σ_k exp(o_k^t), estimating P(C_i | x^t)
Cross-entropy error: E(W, V | X) = − Σ_t Σ_i r_i^t log y_i^t
Multiple Hidden Layers
MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks
Improving Convergence
Momentum: Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^{t−1}
Adaptive learning rate: increase η while the error keeps decreasing, decrease it when the error increases
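A sketch of a simple adaptive learning-rate rule of the kind meant here (the constants a and b are illustrative): grow η additively while the error keeps falling, and cut it multiplicatively when the error rises.

```python
def adapt_learning_rate(eta, err, prev_err, a=0.01, b=0.5):
    """Increase eta additively if the error decreased since the last check,
    otherwise decrease it multiplicatively (eta <- eta - b*eta)."""
    if err < prev_err:
        return eta + a
    return eta * (1.0 - b)
```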
Overfitting/Overtraining
Number of weights: H(d + 1) + (H + 1)K, for d inputs, H hidden units, and K outputs
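As a worked example of this count (using the 8-3-8 encoder-decoder sizes from earlier as illustrative values): with d = 8, H = 3, K = 8, the network has 3·(8 + 1) + (3 + 1)·8 = 27 + 32 = 59 free weights, which must be supported by enough training data to avoid overfitting.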
Structured MLP (Le Cun et al., 1989)
Weight Sharing
Hints (Abu-Mostafa, 1995)
Invariance to translation, rotation, size
Virtual examples
Augmented error: E' = E + λ_h E_h
If x' and x are the "same": E_h = [g(x | θ) − g(x' | θ)]^2
Approximation hint
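A sketch of the augmented error for an invariance hint, where x_prime is assumed to be a transformed copy of x (e.g. a small translation) that should map to the same output, and g is whatever network function is being trained; names and the λ value are illustrative.

```python
def augmented_error(E, g, x, x_prime, lam_h=0.1):
    """E' = E + lambda_h * E_h, with E_h = (g(x) - g(x_prime))**2:
    penalise the network if two 'equivalent' inputs get different outputs."""
    E_h = (g(x) - g(x_prime)) ** 2
    return E + lam_h * E_h
```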
Tuning the Network Size
Destructive: weight decay, Δw_i = −η ∂E/∂w_i − λ w_i, equivalently E' = E + (λ/2) Σ_i w_i^2
Constructive: growing networks (Ash, 1989; Fahlman and Lebiere, 1989)
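A minimal sketch of the destructive (weight decay) update: besides the error gradient, each weight is pulled toward zero, so weights that are not needed fade away.

```python
def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    # Delta w_i = -eta * dE/dw_i - lambda * w_i
    return w - eta * grad_E - lam * w
```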
Bayesian Learning
Consider the weights w_i as random variables with prior p(w_i)
p(w | X) = p(X | w) p(w) / p(X); the MAP estimate maximizes log p(X | w) + log p(w)
Weight decay, ridge regression, regularization: cost = data misfit + λ · complexity
Dimensionality Reduction
Learning Time
Applications:
  Sequence recognition: speech recognition
  Sequence reproduction: time-series prediction
  Sequence association
Network architectures:
  Time-delay networks (Waibel et al., 1989)
  Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
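A sketch of the time-delay idea: the sequence is cut into overlapping windows of a fixed number of past values, and each window is fed to an ordinary feed-forward network as one input vector (the window length T below is an illustrative parameter).

```python
import numpy as np

def time_delay_inputs(sequence, T=3):
    """Turn a 1-D sequence into rows of T consecutive (delayed) values,
    each row serving as one input vector for a feed-forward network."""
    seq = np.asarray(sequence, dtype=float)
    return np.array([seq[t:t + T] for t in range(len(seq) - T + 1)])

print(time_delay_inputs([1, 2, 3, 4, 5], T=3))
# [[1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]]
```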
Recurrent Networks
Unfolding in Time
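A sketch of a simple recurrent unit: the hidden state at time t depends on the current input and the previous hidden state, using the same weights at every step. Unfolding in time means viewing this loop as a deep feed-forward network with those weights shared across steps, so backpropagation can be applied through time. Weight names, sizes, and the tanh nonlinearity (any squashing function such as the sigmoid would do) are illustrative.

```python
import numpy as np

def rnn_forward(x_seq, W_in, W_rec, W_out):
    """Run a simple recurrent network over a sequence.
    The same W_in, W_rec, W_out are reused at every time step."""
    h = np.zeros(W_rec.shape[0])                 # initial hidden state
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_in @ x_t + W_rec @ h)      # new state from input + old state
        outputs.append(W_out @ h)                # output at this step
    return np.array(outputs)

# Illustrative sizes: 2 inputs, 3 hidden units, 1 output, sequence of length 4
rng = np.random.default_rng(0)
y = rnn_forward(rng.normal(size=(4, 2)),
                rng.normal(size=(3, 2)), rng.normal(size=(3, 3)),
                rng.normal(size=(1, 3)))
print(y.shape)   # (4, 1)
```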