Feed-Forward Artificial Neural Networks
AMIA 2003, Machine Learning Tutorial
Constantin F. Aliferis & Ioannis Tsamardinos
Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University
Binary Classification Example
[Scatter plot: value of predictor 1 vs. value of predictor 2]
Example: classifying breast tumors as malignant or benign from mammograms.
Predictor 1: lump thickness
Predictor 2: single epithelial cell size
Possible Decision Areas
[Three scatter plots of predictor 1 vs. predictor 2, showing increasingly complex decision areas separating the red-circle class from the green-triangle class]
Binary Classification Example
[Scatter plot of predictor 1 vs. predictor 2 with a linear decision boundary]
The simplest non-trivial decision function is the straight line (in general, a hyperplane).
A single decision surface partitions the space into two subspaces.
Specifying a Line
Line equation: w2x2 + w1x1 + w0 = 0
Classifier: if w2x2 + w1x1 + w0 > 0, output one class; else output the other class.
[Plot: the line w2x2 + w1x1 + w0 = 0 divides the (x1, x2) plane into a region where w2x2 + w1x1 + w0 > 0 and a region where w2x2 + w1x1 + w0 < 0]
Classifying with Linear Surfaces
Use output 1 for one class and -1 for the other.
The classifier becomes sgn(w2x2 + w1x1 + w0) = sgn(w2x2 + w1x1 + w0x0), with x0 = 1 always.
[Plot: the line w2x2 + w1x1 + w0 = 0 separating the regions where the sum is positive and negative]
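To make the linear classifier concrete, here is a minimal Python sketch of the sgn-of-weighted-sum rule; the weight values and the input below are made up for illustration and are not taken from the slides.

    import numpy as np

    def sgn(z):
        return 1 if z > 0 else -1

    def linear_classify(w, x):
        """Output sgn(w . x), where x is augmented with the constant x0 = 1."""
        x_aug = np.concatenate(([1.0], x))      # prepend x0 = 1 for the bias weight w0
        return sgn(np.dot(w, x_aug))

    # hypothetical weights (w0, w1, w2) and a two-predictor input
    w = np.array([2.0, 4.0, -2.0])
    print(linear_classify(w, np.array([0.5, 3.0])))   # sgn(2 + 2 - 6) = -1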
The Perceptron
[Diagram: a single unit with inputs x0-x4, weights w0 = 2, w1 = 4, w2 = -2, w3 = 0, w4 = 3, a summation node, and a sign threshold]
Input: the attributes of the patient to classify.
The unit computes the weighted sum of its inputs and passes it through the sign function; in the example the sum is 3, so the output is sgn(3) = 1.
Output: the classification of the patient (malignant or benign).
Learning with Perceptrons
Hypothesis space: the set of all possible weight vectors.
Inductive bias: prefer hypotheses that do not misclassify any of the training instances (or that minimize an error function).
Search method: perceptron training rule, gradient descent, etc.
Remember: the problem is to find "good" weights.
Training Perceptrons
Start with random weights and update them in an intelligent way to improve them.
[Diagram: the perceptron from the previous example outputs sgn(3) = 1, but the true output is -1]
Intuitively, since the sum is too large here: decrease the weights that increase the sum and increase the weights that decrease the sum.
Repeat for all training instances until convergence.
Perceptron Training Rule
w_i <- w_i + η (t − o) x_i
t: the output the current example should have
o: the output of the perceptron
x_i: the value of predictor variable i
η: the learning rate
If t = o: no change (for correctly classified examples).
Perceptron Training Rule (continued)
If t = -1 and o = 1: the update decreases the weighted sum.
If t = 1 and o = -1: the update increases the weighted sum.
Perceptron Training Rule (continued)
In vector form: w <- w + η (t − o) x
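The rule above can be written as a short Python function; the learning rate of 0.1 and the example weights and instance are assumptions for illustration only.

    import numpy as np

    def perceptron_update(w, x, t, eta=0.1):
        """One application of the rule w <- w + eta * (t - o) * x.
        x must already include the constant component x0 = 1."""
        o = 1 if np.dot(w, x) > 0 else -1       # current perceptron output
        return w + eta * (t - o) * x            # unchanged when t == o

    # hypothetical weights and one misclassified instance (true class t = -1)
    w = np.array([2.0, 4.0, -2.0])              # w0, w1, w2
    x = np.array([1.0, 1.0, 0.0])               # x0 = 1, x1 = 1, x2 = 0
    print(perceptron_update(w, x, t=-1))        # [1.8, 3.8, -2.0]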
Example of Perceptron Training: The OR Function
Training data (OR): (x1=0, x2=0) has class -1; (0,1), (1,0), and (1,1) have class 1.
The initial random weights define the line x1 = 0.5: the classifier outputs 1 in the area where x1 > 0.5 and -1 where x1 < 0.5.
The only misclassified example is x1 = 0, x2 = 1.
Applying the training rule to this example moves the decision line: the new line crosses x1 = -0.5 (for x2 = 0) and x2 = -0.5 (for x1 = 0).
With this new line one example is still misclassified, so the next iteration updates the weights again.
After that update the classification is perfect, and no further change occurs.
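The whole worked example can be reproduced with a few lines of Python. The initial weights are chosen so that the starting decision line is x1 = 0.5, and the learning rate of 0.25 is an assumption (the slides do not state one).

    import numpy as np

    def sgn(z):
        return 1 if z > 0 else -1

    # OR training set with -1/1 targets; first column is x0 = 1
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, 1, 1, 1])

    # initial weights chosen so the starting decision line is x1 = 0.5 (assumption)
    w = np.array([-0.5, 1.0, 0.0])
    eta = 0.25

    for epoch in range(20):
        changed = False
        for x_i, t_i in zip(X, t):
            o = sgn(np.dot(w, x_i))
            if o != t_i:
                w = w + eta * (t_i - o) * x_i   # perceptron training rule
                changed = True
        if not changed:                         # convergence: nothing misclassified
            break

    print(w, [sgn(np.dot(w, x_i)) for x_i in X])   # predictions become [-1, 1, 1, 1]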
Analysis of the Perceptron Training Rule
The algorithm will always converge within a finite number of iterations if the data are linearly separable.
Otherwise, it may oscillate and never converge.
Training by Gradient Descent
Similar, but:
- Always converges
- Generalizes to training networks of perceptrons (neural networks)
Idea: define an error function, then search for weights that minimize the error, i.e., find weights that zero the error gradient.
Setting Up the Gradient Descent
Mean squared error over the training examples d: E(w) = 1/2 Σ_d (t_d − o_d)²
Minima exist where the gradient is zero: ∂E/∂w_i = 0 for every weight w_i.
The Sign Function is not Differentiable
[Plot: the sign-function classifier over the OR example with decision line x1 = 0.5; its output jumps discontinuously across the line]
Use Differentiable Transfer Functions
Replace the sign function with the sigmoid: σ(y) = 1 / (1 + e^(−y))
Calculating the Gradient
For a sigmoid unit with output o_d = σ(w · x_d), using σ'(y) = σ(y)(1 − σ(y)):
∂E/∂w_i = −Σ_d (t_d − o_d) o_d (1 − o_d) x_{i,d}
Updating the Weights with Gradient Descent
w_i <- w_i − η ∂E/∂w_i
Each weight update goes through all the training instances, so each update is more expensive but more accurate.
The procedure always converges to a local minimum of the error, regardless of the data.
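A minimal Python sketch of this batch gradient descent for a single sigmoid unit, assuming a learning rate of 0.5 and using 0/1 targets (rather than the -1/1 labels used with the sgn classifier) so that the targets lie in the sigmoid's output range:

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    def gradient_descent(X, t, eta=0.5, epochs=1000):
        """Batch gradient descent for one sigmoid unit minimizing
        E(w) = 1/2 * sum_d (t_d - o_d)^2.  X includes the column x0 = 1."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            o = sigmoid(X @ w)                        # outputs for all examples
            grad = -((t - o) * o * (1 - o)) @ X       # dE/dw, summed over examples
            w -= eta * grad                           # step against the gradient
        return w

    # hypothetical data: the OR function with 0/1 targets
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([0.0, 1.0, 1.0, 1.0])
    w = gradient_descent(X, t)
    print(np.round(sigmoid(X @ w), 2))   # low output for (0,0), high for the rest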
Multiclass Problems
Example: 4 classes. Two possible solutions:
1. Use one perceptron (output unit) and interpret its value by range: output 0-0.25 is class 1, 0.25-0.5 is class 2, 0.5-0.75 is class 3, 0.75-1 is class 4.
2. Use four output units and a binary encoding of the output (1-of-m encoding).
1-of-M Encoding
[Diagram: inputs x0-x4 feeding four output units, one per class]
Assign the instance to the class whose unit has the largest output.
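A small sketch of 1-of-M encoding and the largest-output decision rule; the class count and the sample output values are made up for illustration.

    import numpy as np

    def one_of_m(label, m):
        """Encode a class index (0..m-1) as an m-dimensional 0/1 target vector."""
        target = np.zeros(m)
        target[label] = 1.0
        return target

    def predict_class(outputs):
        """Assign to the class whose output unit is largest."""
        return int(np.argmax(outputs))

    print(one_of_m(2, 4))                        # [0. 0. 1. 0.]
    print(predict_class([0.1, 0.7, 0.15, 0.3]))  # 1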
Feed-Forward Neural Networks
[Diagram: inputs x0-x4 in the input layer, feeding hidden layer 1, then hidden layer 2, then the output layer]
Increased Expressiveness Example: Exclusive OR
[Plot of the XOR training examples in the (x1, x2) plane, and a single unit with inputs x0, x1, x2 and weights w0, w1, w2]
No line (no set of three weights) can separate the training examples (learn the true function).
Increased Expressiveness Example (continued)
[Diagram: a network with inputs x0, x1, x2, two hidden units H1 and H2 (each with a threshold weight on x0), and an output unit O combining H1 and H2]
Feeding the four training examples through the network fills in a truth table over (x1, x2, H1, H2, O):
T1: x1=0, x2=0 -> O = 0
T2: x1=0, x2=1 -> O = 1
T3: x1=1, x2=0 -> O = 1
T4: x1=1, x2=1 -> O = 0
Each hidden unit computes a linearly separable sub-function of the inputs, and the output unit combines H1 and H2 to reproduce the XOR function.
From the Viewpoint of the Output Layer
[Diagram: the training examples T1-T4 plotted in the original (x1, x2) space and, after being mapped by the hidden layer, in the (H1, H2) space]
Each hidden layer maps the data to a new instance space, and each hidden node is a new constructed feature.
In the new space the original problem may become separable (or easier): in the (H1, H2) space, XOR is linearly separable.
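The following Python sketch illustrates this remapping for XOR with threshold units; the particular weights are one hypothetical choice that solves XOR, not necessarily the ones shown on the slides.

    import numpy as np

    def step(y):
        return (y > 0).astype(float)

    # XOR inputs
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

    # One hypothetical weight setting that solves XOR:
    # H1 fires for (x1 AND NOT x2), H2 fires for (x2 AND NOT x1), O fires for (H1 OR H2).
    W_hidden = np.array([[ 1.0, -1.0],     # weights into H1 (from x1, x2)
                         [-1.0,  1.0]])    # weights into H2
    b_hidden = np.array([-0.5, -0.5])      # threshold weights (the x0 = 1 inputs)
    w_out = np.array([1.0, 1.0])
    b_out = -0.5

    H = step(X @ W_hidden.T + b_hidden)    # hidden layer: the new feature space
    O = step(H @ w_out + b_out)            # output layer sees a separable problem

    print(H)   # each row is (H1, H2) for one training example
    print(O)   # [0. 1. 1. 0.] -- XOR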
How to Train Multi-Layered Networks
- Select a network structure (number of hidden layers, hidden nodes, and connectivity).
- Select transfer functions that are differentiable.
- Define a (differentiable) error function.
- Search for weights that minimize the error function, using gradient descent or another optimization method.
This procedure is BACKPROPAGATION.
[Diagram: a network with inputs x0, x1, x2, two hidden units, and one output, with weights w_{k,j} into the hidden layer and w'_{j,i} into the output unit]
Back-Propagating the Error
Notation: w_{j,i} is the weight from unit j to unit i; x_j is the output of unit j; i indexes the output layer, j the hidden layer, k the input layer.
Output unit(s): δ_i = o_i (1 − o_i)(t_i − o_i)
Hidden unit(s): δ_j = o_j (1 − o_j) Σ_i w_{j,i} δ_i
Weight updates: w_{j,i} <- w_{j,i} + η δ_i x_j, and similarly w_{k,j} <- w_{k,j} + η δ_j x_k
The error terms δ are computed at the output layer first and then propagated backwards through the hidden layer(s).
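A compact Python sketch of these update equations for a network with one hidden layer of sigmoid units, trained on XOR; the learning rate, the random initialization, and the number of epochs are assumptions for illustration.

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    def backprop_epoch(X, T, W_h, W_o, eta=0.5):
        """One epoch of backpropagation for one hidden layer of sigmoid units.
        X rows include the constant input x0 = 1; W_h and W_o are updated in place."""
        for x, t in zip(X, T):
            # forward pass
            h = sigmoid(W_h @ x)                    # hidden-layer outputs
            h1 = np.concatenate(([1.0], h))         # constant 1 plays the bias role
            o = sigmoid(W_o @ h1)                   # output-layer outputs
            # backward pass: error terms
            delta_o = o * (1 - o) * (t - o)                     # output units
            delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)    # hidden units
            # weight updates: w <- w + eta * delta * input
            W_o += eta * np.outer(delta_o, h1)
            W_h += eta * np.outer(delta_h, x)

    # hypothetical run: learn XOR with two hidden units
    rng = np.random.default_rng(0)
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    T = np.array([[0.0], [1.0], [1.0], [0.0]])
    W_h = rng.uniform(-0.5, 0.5, (2, 3))
    W_o = rng.uniform(-0.5, 0.5, (1, 3))
    for _ in range(5000):
        backprop_epoch(X, T, W_h, W_o)
    outputs = sigmoid(np.c_[np.ones(4), sigmoid(X @ W_h.T)] @ W_o.T)
    print(np.round(outputs, 2))   # should approach the XOR targets [0, 1, 1, 0];
                                  # a different seed or more epochs may be needed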
Training with Backpropagation
- Going once through all training examples and updating the weights is one epoch.
- Iterate until a stopping criterion is satisfied.
- The hidden layers learn new features and map the data to new spaces.
- Training reaches a local minimum of the error surface.
Overfitting with Neural Networks
- If the number of hidden units (and weights) is large, it is easy to memorize the training set (or parts of it) and fail to generalize.
- Typically, the optimal number of hidden units is much smaller than the number of input units, so each hidden layer maps to a space of smaller dimension.
- Training is stopped before a local minimum is reached.
Typical Training Curve
[Plot: error vs. epoch; the error on the training set keeps decreasing, while the real error (or the error on an independent validation set) eventually starts increasing. The ideal point to stop training is where the validation error is lowest.]
Example of Training Stopping Criteria
- Split the data into train, validation, and test sets.
- Train on the training set until the error on the validation set has been increasing (by more than epsilon over the last m iterations), or until a maximum number of epochs is reached.
- Evaluate final performance on the test set.
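A sketch of such a stopping criterion in Python; train_step and validation_error stand for hypothetical callbacks (for example, one backpropagation epoch and the validation-set error), and the default epsilon, m, and maximum epoch count are arbitrary choices.

    import numpy as np

    def train_with_early_stopping(train_step, validation_error, max_epochs=1000,
                                  epsilon=1e-4, m=10):
        """Run train_step() once per epoch; stop when the validation error has
        increased by more than epsilon over the last m epochs, or when
        max_epochs is reached.  Returns the epoch with the lowest validation error."""
        history = []
        best_epoch, best_error = 0, np.inf
        for epoch in range(max_epochs):
            train_step()                          # one pass over the training set
            err = validation_error()
            history.append(err)
            if err < best_error:
                best_epoch, best_error = epoch, err
            if len(history) > m and history[-1] - history[-1 - m] > epsilon:
                break                             # validation error is rising
        return best_epoch, best_error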
Classifying with Neural Networks
Determine the representation of the input. Example: Religion ∈ {Christian, Muslim, Jewish} can be represented either as one input taking three different values (e.g., 0.2, 0.5, 0.8) or as three inputs taking 0/1 values.
Determine the representation of the output (for multiclass problems): a single output unit vs. multiple binary units.
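The two input representations for a three-valued attribute might look like this in Python; the numeric codes 0.2/0.5/0.8 follow the slide, and everything else is illustrative.

    import numpy as np

    RELIGION = ["Christian", "Muslim", "Jewish"]

    def encode_single_input(value):
        """Option 1: one input taking three different numeric values."""
        levels = {"Christian": 0.2, "Muslim": 0.5, "Jewish": 0.8}
        return np.array([levels[value]])

    def encode_three_inputs(value):
        """Option 2: three 0/1 inputs, one per category."""
        return np.array([1.0 if value == r else 0.0 for r in RELIGION])

    print(encode_single_input("Muslim"))   # [0.5]
    print(encode_three_inputs("Muslim"))   # [0. 1. 0.]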
Classifying with Neural Networks (continued)
Select the network structure: number of hidden layers, number of hidden units, and connectivity. Typically: one hidden layer, a number of hidden units that is a small fraction of the number of input units, and full connectivity.
Select the error function. Typically: minimize the mean squared error or maximize the log likelihood of the data.
Classifying with Neural Networks (continued)
Select a training method. Typically gradient descent (which corresponds to vanilla backpropagation), but other optimization methods can be used: backpropagation with momentum, trust-region methods, line-search methods, conjugate gradient methods, Newton and quasi-Newton methods.
Select a stopping criterion.
Classifying with Neural Networks (continued)
The training method may also include searching for an optimal structure, as well as extensions to avoid getting stuck in local minima, such as simulated annealing or random restarts with different weights.
Classifying with Neural Networks (continued)
Split the data into:
- Training set: used to update the weights
- Validation set: used in the stopping criterion
- Test set: used to evaluate the generalization error (performance)
Other Error Functions in Neural Networks
- Adding a penalty term for the weight magnitude.
- Minimizing the cross entropy with respect to the target values; the network outputs are then interpretable as probability estimates.
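A sketch of these two error functions in Python; the regularization constant lam and the sample targets, outputs, and weights are assumptions for illustration.

    import numpy as np

    def cross_entropy(t, o):
        """Cross entropy between 0/1 targets t and sigmoid outputs o."""
        eps = 1e-12                               # guard against log(0)
        o = np.clip(o, eps, 1 - eps)
        return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

    def penalized_error(t, o, weights, lam=0.01):
        """Cross entropy plus a penalty on weight magnitude (lam is assumed)."""
        penalty = lam * sum(np.sum(w ** 2) for w in weights)
        return cross_entropy(t, o) + penalty

    # hypothetical targets, outputs, and weight matrices
    t = np.array([0.0, 1.0, 1.0, 0.0])
    o = np.array([0.1, 0.8, 0.7, 0.2])
    w_example = [np.array([[0.5, -1.0], [2.0, 0.3]])]
    print(round(cross_entropy(t, o), 3))
    print(round(penalized_error(t, o, w_example), 3))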
Representational Power
- Perceptron: can learn only linearly separable functions.
- Boolean functions: learnable by a neural network with one hidden layer.
- Continuous functions: learnable by a neural network with one hidden layer of sigmoid units.
- Arbitrary functions: learnable by a neural network with two hidden layers of sigmoid units.
- The required number of hidden units is in all cases unknown.
Issues with Neural Networks
No principled method for selecting the number of layers and units. Heuristics include:
- Tiling: start with a small network and keep adding units.
- Optimal brain damage: start with a large network and keep removing weights and units.
- Evolutionary methods: search the space of structures for one that generalizes well.
There is no principled method for most other design choices either.
Important but not Covered in This Tutorial
- It is very hard to understand the classification logic from direct examination of the weights; there is a large recent body of work on extracting symbolic rules and information from neural networks.
- Recurrent networks, associative networks, self-organizing maps, committees of networks, adaptive resonance theory, etc.
Why the Name "Neural Networks"?
- The initial models simulated real neurons for use in classification.
- Later efforts improved and understood classification independently of any similarity to biological neural networks.
- Other efforts aim to simulate and understand biological neural networks to a larger degree.
Conclusions
- Neural networks can deal with both real-valued and discrete domains, and can also perform density or probability estimation.
- Classification time is very fast; training time is relatively slow (does not scale to thousands of inputs).
- One of the most successful classifiers yet, but successful design choices are still a black art, and it is easy to overfit or underfit if care is not applied.