Published by Claude Buck Gallagher. Modified over 9 years ago.
1
Lecture Slides for INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN, © The MIT Press, 2004
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml
2
CHAPTER 11: Multilayer Perceptrons
3
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Neural Networks
- Networks of processing units (neurons) with connections (synapses) between them
- Large number of neurons: $10^{10}$
- Large connectivity: $10^{5}$
- Parallel processing
- Distributed computation/memory
- Robust to noise, failures
4
Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: from hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: update by training/experience
5
Perceptron (Rosenblatt, 1962)
6
What a Perceptron Does
Regression: $y = wx + w_0$
Classification: $y = 1(wx + w_0 > 0)$
[Figure: perceptron diagrams with input $x$, bias input $x_0 = +1$, weights $w$ and $w_0$, output $y$, and a unit labeled $s$]
7
K Outputs
Classification:
Regression:
8
Training
Online (instances seen one by one) vs batch (whole sample) learning:
- No need to store the whole sample
- Problem may change in time
- Wear and degradation in system components
Stochastic gradient descent: update after a single pattern
Generic update rule (LMS rule):
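The update rule itself did not survive in this transcript. As a minimal sketch, assuming the usual LMS form (update = learning factor × (desired output − actual output) × input) for a single linear unit, an online learner might look like the following. The names `lms_update` and `eta`, and the toy target $r = 2x + 1$, are illustrative assumptions and not taken from the slides.

```python
import numpy as np

def lms_update(w, x, r, eta=0.1):
    """One online (stochastic gradient) step for a linear unit y = w . x.

    x[0] is fixed at +1, so w[0] plays the role of the bias weight w_0.
    """
    y = w @ x                      # actual output for this single pattern
    return w + eta * (r - y) * x   # eta * (desired - actual) * input

# Toy stream of instances seen one by one; nothing is stored.
# Assumed target for illustration only: r = 2x + 1.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(2000):
    x_val = rng.uniform(-1.0, 1.0)
    x = np.array([1.0, x_val])
    w = lms_update(w, x, r=2.0 * x_val + 1.0)
```

Because each step uses only the current pattern, the same loop keeps tracking the target if the underlying problem drifts over time.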
9
Training a Perceptron: Regression
Regression (linear output):
10
Classification
- Single sigmoid output
- K>2 softmax outputs
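The formulas for this slide are missing from the transcript. The sketch below assumes the standard pairing of softmax outputs with a cross-entropy error, which gives the same "desired minus actual times input" update as the linear case. The function names and the 3-class toy example are hypothetical.

```python
import numpy as np

def softmax(o):
    """Softmax over K linear outputs; the max is subtracted for numerical stability."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def softmax_update(W, x, r, eta=0.1):
    """One online step for a K-output perceptron.

    W has shape (K, d+1), x includes the bias input x[0] = +1, and r is a
    one-hot vector of desired outputs. Assumes cross-entropy error.
    """
    y = softmax(W @ x)
    return W + eta * np.outer(r - y, x)

# Usage: one step raises the probability assigned to the labeled class.
W = np.zeros((3, 2))
x = np.array([1.0, 0.5])
r = np.array([0.0, 1.0, 0.0])
p_before = softmax(W @ x)[1]
p_after = softmax(softmax_update(W, x, r) @ x)[1]
```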
11
Sigmoid Unit
[Figure: a unit with inputs $x_1, x_2, \dots, x_n$, weights $w_1, w_2, \dots, w_n$, and bias weight $w_0$ on $x_0 = 1$]
$net = \sum_{i=0}^{n} w_i x_i$
$o = \sigma(net) = 1/(1 + e^{-net})$
$\sigma(x)$ is the sigmoid function: $1/(1 + e^{-x})$
$d\sigma(x)/dx = \sigma(x)(1 - \sigma(x))$
Derive gradient descent rules to train:
- one sigmoid unit: $\partial E/\partial w_i = -\sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_i$
- multilayer networks of sigmoid units: backpropagation
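A small sketch of the sigmoid unit and its gradient follows. The slide does not state the error function, so the code assumes $E = \frac{1}{2}\sum_d (t_d - o_d)^2$, which is the form that produces the gradient shown above. The function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def unit_error_gradient(w, examples):
    """dE/dw_i for one sigmoid unit, assuming E = 1/2 * sum_d (t_d - o_d)^2.

    examples is a list of (x, t) pairs; each x starts with the bias input 1.
    """
    grad = [0.0] * len(w)
    for x, t in examples:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            grad[i] += -(t - o) * o * (1.0 - o) * xi
    return grad
```

Gradient descent then moves each weight against this gradient, $w_i \leftarrow w_i - \eta\, \partial E/\partial w_i$, with a learning rate $\eta$ chosen by the user.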
13
Learning Boolean AND
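The figure for this slide is not in the transcript. As an illustration only, the sketch below trains a threshold perceptron on the AND truth table with the classic perceptron learning rule. The learning rate, the zero initialization, and the epoch cap are assumptions, not values from the slides.

```python
def step(a):
    """Threshold output: 1 if the weighted sum is positive, else 0."""
    return 1 if a > 0 else 0

def train_perceptron(data, eta=1, max_epochs=100):
    """Perceptron rule on 2-input Boolean data; w = [w0, w1, w2] with bias input +1."""
    w = [0, 0, 0]
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), r in data:
            x = (1, x1, x2)
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y != r:
                errors += 1
                w = [wi + eta * (r - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:          # every pattern classified correctly
            break
    return w

def predict(w, x1, x2):
    return step(w[0] + w[1] * x1 + w[2] * x2)

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(AND_DATA)
```

AND is linearly separable, so the rule stops after a finite number of corrections. A hand-picked solution such as $w_0 = -1.5$, $w_1 = w_2 = 1$ separates the same four points.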
14
XOR
No $w_0$, $w_1$, $w_2$ satisfy:
(Minsky and Papert, 1969)
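The inequalities were lost with the slide image. Assuming the classification convention $y = 1(w_1 x_1 + w_2 x_2 + w_0 > 0)$ used for the perceptron earlier in the deck, the four rows of the XOR truth table would require the following.

```latex
\begin{align*}
(x_1, x_2) = (0, 0),\; r = 0 &: \quad w_0 \le 0 \\
(x_1, x_2) = (0, 1),\; r = 1 &: \quad w_2 + w_0 > 0 \\
(x_1, x_2) = (1, 0),\; r = 1 &: \quad w_1 + w_0 > 0 \\
(x_1, x_2) = (1, 1),\; r = 0 &: \quad w_1 + w_2 + w_0 \le 0
\end{align*}
```

Adding the second and third lines gives $w_1 + w_2 + 2 w_0 > 0$, while adding the first and fourth gives $w_1 + w_2 + 2 w_0 \le 0$, so no single perceptron can satisfy all four.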
15
Multi-Layer Networks
[Figure: network with an input layer, a hidden layer, and an output layer]
16
Multilayer Perceptrons (Rumelhart et al., 1986)
17
$x_1$ XOR $x_2$ = ($x_1$ AND ~$x_2$) OR (~$x_1$ AND $x_2$)
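The decomposition above can be sketched as a two-layer network of threshold units: two hidden units compute the AND terms and the output unit ORs them. The specific weights below are one hand-picked choice for illustration and are not taken from the slides.

```python
def step(a):
    """Threshold unit: fires when the weighted sum is positive."""
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)      # x1 AND ~x2
    z2 = step(-x1 + x2 - 0.5)     # ~x1 AND x2
    return step(z1 + z2 - 0.5)    # z1 OR z2

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer maps the four inputs into a space where the two classes become linearly separable, which is exactly what a single perceptron cannot do on the raw inputs.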
18
Backpropagation
19
Backpropagation Algorithm
Initialize each $w_i$ to some small random value
Until the termination condition is met, Do
  For each training example, Do
    Input the instance $(x_1, \dots, x_n)$ to the network and compute the network outputs $o_k$
    For each output unit $k$: $\delta_k = o_k (1 - o_k)(t_k - o_k)$
    For each hidden unit $h$: $\delta_h = o_h (1 - o_h) \sum_k w_{h,k}\, \delta_k$
    For each network weight $w_{i,j}$, Do: $w_{i,j} = w_{i,j} + \Delta w_{i,j}$ where $\Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$
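The steps above can be sketched in NumPy for a network with one hidden layer of sigmoid units. This is a minimal illustration under stated assumptions: the bias is handled by prepending a constant 1 to each layer's input, the termination condition is a fixed epoch count, and the names `forward`, `backprop_updates`, and `train` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hid, W_out):
    """Compute hidden and output activations; index 0 of each input is the bias."""
    x_aug = np.concatenate(([1.0], x))
    o_h = sigmoid(W_hid @ x_aug)
    h_aug = np.concatenate(([1.0], o_h))
    o_k = sigmoid(W_out @ h_aug)
    return x_aug, h_aug, o_k

def backprop_updates(x, t, W_hid, W_out, eta=0.5):
    """Return (dW_hid, dW_out) for one training example."""
    x_aug, h_aug, o_k = forward(x, W_hid, W_out)
    o_h = h_aug[1:]
    delta_k = o_k * (1 - o_k) * (t - o_k)                    # output units
    delta_h = o_h * (1 - o_h) * (W_out[:, 1:].T @ delta_k)   # hidden units
    return eta * np.outer(delta_h, x_aug), eta * np.outer(delta_k, h_aug)

def train(X, T, n_hidden=2, eta=0.5, epochs=1000, seed=0):
    """Online backpropagation; stops after a fixed number of epochs (assumption)."""
    rng = np.random.default_rng(seed)
    W_hid = rng.uniform(-0.05, 0.05, (n_hidden, X.shape[1] + 1))
    W_out = rng.uniform(-0.05, 0.05, (T.shape[1], n_hidden + 1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            dW_hid, dW_out = backprop_updates(x, t, W_hid, W_out, eta)
            W_hid += dW_hid
            W_out += dW_out
    return W_hid, W_out
```

With $E = \frac{1}{2}\sum_k (t_k - o_k)^2$, each returned update equals $-\eta$ times the gradient of $E$ with respect to the corresponding weight matrix, which can be confirmed with a finite-difference check.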
20
Backpropagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights)
- Often includes a weight momentum term: $\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$
- Minimizes error over the training examples; will it generalize well to unseen instances (over-fitting)?
- Training can be slow, typically 1000-10000 iterations (use Levenberg-Marquardt instead of gradient descent)
- Using the network after training is fast
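To make the momentum term concrete, here is a minimal sketch on a one-dimensional quadratic error rather than on a network. In it, `-eta * grad(w)` plays the role of $\eta\, \delta_j\, x_{i,j}$ and `alpha * dw` is the momentum contribution from the previous step. The toy error $E(w) = \frac{1}{2}(w - 3)^2$ and all parameter values are assumptions for illustration.

```python
def momentum_descent(grad, w0, eta=0.1, alpha=0.9, steps=500):
    """Gradient descent where each step adds alpha times the previous step."""
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw   # current gradient step + momentum
        w += dw
    return w

# Toy error E(w) = 0.5 * (w - 3)^2, so dE/dw = w - 3 and the minimum is at w = 3.
w_min = momentum_descent(lambda w: w - 3.0, w0=0.0)
```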
21
Regression
[Figure: forward pass from input $x$ to the output, and backward pass of the error]
22
Regression with Multiple Outputs
[Figure: network with inputs $x_j$, first-layer weights $w_{hj}$, hidden units $z_h$, second-layer weights $v_{ih}$, and outputs $y_i$]
24
8-3-8 Binary Encoder-Decoder
8 inputs, 3 hidden, 8 outputs
Hidden values:
.89 .04 .08
.01 .11 .88
.01 .97 .27
.99 .97 .71
.03 .05 .02
.22 .99 .99
.80 .01 .98
.60 .94 .01
25
Sum of Squared Errors for the Output Units
26
Hidden Unit Encoding for Input 0100000
27
Convergence of Backprop
Gradient descent to some local minimum:
- Perhaps not the global minimum
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights
Nature of convergence:
- Initialize weights near zero
- Therefore, initial networks are near-linear
- Increasingly non-linear functions become possible as training progresses
28
Expressive Capabilities of ANN
Boolean functions:
- Every Boolean function can be represented by a network with a single hidden layer
- But this might require a number of hidden units exponential in the number of inputs
Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
30
[Figure: panels showing $w_h x + w_0$, $z_h$, and $v_h z_h$]
31
Two-Class Discrimination
One sigmoid output $y^t$ for $P(C_1|x^t)$, and $P(C_2|x^t) \equiv 1 - y^t$
32
K>2 Classes
33
Multiple Hidden Layers
An MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks
34
Improving Convergence
- Momentum
- Adaptive learning rate
35
Overfitting/Overtraining
Number of weights: $H(d+1) + (H+1)K$
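As a quick worked example of the weight count, the helper below evaluates the formula for a network with $d$ inputs, $H$ hidden units, and $K$ outputs, where the "+1" terms account for the bias weights. The function name is illustrative.

```python
def mlp_weight_count(d, H, K):
    """Weights in a one-hidden-layer MLP: H*(d+1) first-layer plus (H+1)*K second-layer."""
    return H * (d + 1) + (H + 1) * K

# An 8-3-8 network: 3 * 9 + 4 * 8 = 27 + 32 = 59 weights.
n_weights = mlp_weight_count(d=8, H=3, K=8)
```

The count grows linearly in $H$, so adding hidden units quickly increases the number of free parameters relative to a fixed training set.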
37
Structured MLP (Le Cun et al., 1989)
38
Weight Sharing
39
Hints
- Invariance to translation, rotation, size
- Virtual examples
- Augmented error: $E' = E + \lambda_h E_h$
- If $x'$ and $x$ are the "same": $E_h = [g(x|\theta) - g(x'|\theta)]^2$
- Approximation hint:
(Abu-Mostafa, 1995)
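The augmented error can be sketched as follows. The slide does not define $E$ or the model $g$, so the code assumes a sum-of-squares data error and a hypothetical linear model. The names `augmented_error`, `pairs`, and `lam_h` are illustrative.

```python
import numpy as np

def g(x, theta):
    """Hypothetical model for illustration: a linear function of the input."""
    return float(theta @ x)

def augmented_error(theta, X, r, pairs, lam_h=1.0):
    """E' = E + lam_h * E_h.

    E is assumed to be the sum of squared errors on (X, r). E_h sums
    [g(x) - g(x')]^2 over pairs (x, x') that the hint says are the "same".
    """
    E = sum((r_t - g(x_t, theta)) ** 2 for x_t, r_t in zip(X, r))
    E_h = sum((g(x, theta) - g(x_p, theta)) ** 2 for x, x_p in pairs)
    return E + lam_h * E_h

X = [np.array([1.0, 2.0])]
r = [1.0]
pairs = [(np.array([1.0, 2.0]), np.array([1.0, 5.0]))]
e_invariant = augmented_error(np.array([1.0, 0.0]), X, r, pairs)   # ignores 2nd input
e_violating = augmented_error(np.array([1.0, 1.0]), X, r, pairs)   # depends on 2nd input
```

A model that already respects the invariance pays no penalty, while one that distinguishes $x$ from $x'$ is pushed toward treating them alike.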
40
Tuning the Network Size
Destructive:
- Weight decay:
Constructive:
- Growing networks (Ash, 1989) (Fahlman and Lebiere, 1989)
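The weight decay formula is missing from the transcript. A minimal sketch, assuming the common penalized error $E' = E + \frac{\lambda}{2}\sum_i w_i^2$, is given below: every step follows the data gradient and also shrinks each weight toward zero. The function name and parameter values are assumptions.

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    """Gradient step on E' = E + (lam / 2) * sum(w**2).

    grad_E is the gradient of the data error E alone; lam * w is the decay term.
    """
    return w - eta * (grad_E + lam * w)

# With no data gradient, each weight shrinks by the factor (1 - eta * lam).
w_next = weight_decay_step(np.array([1.0, -2.0]), grad_E=np.zeros(2), eta=0.1, lam=0.5)
```

Weights that the data error does not actively support decay toward zero, which is why this is grouped with the destructive approaches to network size.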
41
Bayesian Learning
- Consider weights $w_i$ as random variables with prior $p(w_i)$
- Weight decay, ridge regression, regularization
- cost = data-misfit + $\lambda$ · complexity
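The link between a prior on the weights and weight decay can be made explicit with a short derivation. This is a sketch under assumptions that are not stated on the slide: independent zero-mean Gaussian priors with variance $s^2$, and the data misfit $E$ taken to be the negative log likelihood.

```latex
\begin{align*}
\hat{w}_{\mathrm{MAP}} &= \arg\max_{w} \left[ \log p(\mathcal{X} \mid w) + \log p(w) \right] \\
p(w_i) = \mathcal{N}(0, s^2) \;&\Rightarrow\; -\log p(w) = \frac{1}{2 s^2} \sum_i w_i^2 + \text{const} \\
E' &= E + \frac{\lambda}{2} \sum_i w_i^2,
\qquad E = -\log p(\mathcal{X} \mid w), \quad \lambda = \frac{1}{s^2}
\end{align*}
```

Maximizing the posterior is therefore the same as minimizing data misfit plus $\lambda$ times a complexity term, and a tighter prior (smaller $s^2$) means a larger $\lambda$.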
42
Dimensionality Reduction
44
Learning Time
Applications:
- Sequence recognition: speech recognition
- Sequence reproduction: time-series prediction
- Sequence association
Network architectures:
- Time-delay networks (Waibel et al., 1989)
- Recurrent networks (Rumelhart et al., 1986)
45
Time-Delay Neural Networks
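The figure for this slide is not available. As a rough sketch of the input side only, a time-delay network can be viewed as an ordinary feedforward network applied to a sliding window holding the current input and a fixed number of delayed inputs. The helper below builds those windows; its name and the choice of window layout are assumptions.

```python
def delayed_windows(sequence, n_delays):
    """Return windows [x(t - n_delays), ..., x(t)] for every t with a full history."""
    return [sequence[t - n_delays : t + 1] for t in range(n_delays, len(sequence))]

# Each window would be presented to an MLP as one input vector.
windows = delayed_windows([1, 2, 3, 4, 5], n_delays=2)
```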
46
Recurrent Networks
47
Unfolding in Time