Feed-Forward Artificial Neural Networks
AMIA 2003, Machine Learning Tutorial
Constantin F. Aliferis & Ioannis Tsamardinos
Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University
Binary Classification Example
[Scatter plot: value of predictor 1 vs. value of predictor 2]
Example: classifying breast tumors as malignant or benign from mammograms.
Predictor 1: lump thickness
Predictor 2: single epithelial cell size
Possible Decision Areas
[Three scatter plots of predictor 1 vs. predictor 2, showing increasingly complex decision areas separating the red-circle class from the green-triangle class]
Binary Classification Example
[Scatter plot of predictor 1 vs. predictor 2 with a linear decision boundary]
The simplest non-trivial decision function is the straight line (in general, a hyperplane).
A single decision surface partitions the space into two subspaces.
Specifying a Line
Line equation: w2x2 + w1x1 + w0 = 0
Classifier: if w2x2 + w1x1 + w0 > 0, output one class; else output the other class.
[Plot: the line w2x2 + w1x1 + w0 = 0 divides the (x1, x2) plane into a region where w2x2 + w1x1 + w0 > 0 and a region where w2x2 + w1x1 + w0 < 0]
Classifying with Linear Surfaces
Use output 1 for one class and -1 for the other.
The classifier becomes sgn(w2x2 + w1x1 + w0) = sgn(w2x2 + w1x1 + w0x0), with x0 = 1 always.
[Plot: the line w2x2 + w1x1 + w0 = 0 separating the regions where the sum is positive and negative]
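To make the linear classifier concrete, here is a minimal Python sketch of the sgn-of-weighted-sum rule; the weight values and the input below are made up for illustration and are not taken from the slides.

    import numpy as np

    def sgn(z):
        return 1 if z > 0 else -1

    def linear_classify(w, x):
        """Output sgn(w . x), where x is augmented with the constant x0 = 1."""
        x_aug = np.concatenate(([1.0], x))      # prepend x0 = 1 for the bias weight w0
        return sgn(np.dot(w, x_aug))

    # hypothetical weights (w0, w1, w2) and a two-predictor input
    w = np.array([2.0, 4.0, -2.0])
    print(linear_classify(w, np.array([0.5, 3.0])))   # sgn(2 + 2 - 6) = -1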
The Perceptron
[Diagram: a single unit with inputs x0-x4, weights w0 = 2, w1 = 4, w2 = -2, w3 = 0, w4 = 3, a summation node, and a sign threshold]
Input: the attributes of the patient to classify.
The unit computes the weighted sum of its inputs and passes it through the sign function; in the example the sum is 3, so the output is sgn(3) = 1.
Output: the classification of the patient (malignant or benign).
Learning with Perceptrons
Hypothesis space: the set of all possible weight vectors.
Inductive bias: prefer hypotheses that do not misclassify any of the training instances (or that minimize an error function).
Search method: perceptron training rule, gradient descent, etc.
Remember: the problem is to find "good" weights.
Training Perceptrons
Start with random weights and update them in an intelligent way to improve them.
[Diagram: the perceptron from the previous example outputs sgn(3) = 1, but the true output is -1]
Intuitively, since the sum is too large here: decrease the weights that increase the sum and increase the weights that decrease the sum.
Repeat for all training instances until convergence.
Perceptron Training Rule
w_i <- w_i + η (t − o) x_i
t: the output the current example should have
o: the output of the perceptron
x_i: the value of predictor variable i
η: the learning rate
If t = o: no change (for correctly classified examples).
Perceptron Training Rule (continued)
If t = -1 and o = 1: the update decreases the weighted sum.
If t = 1 and o = -1: the update increases the weighted sum.
Perceptron Training Rule (continued)
In vector form: w <- w + η (t − o) x
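The rule above can be written as a short Python function; the learning rate of 0.1 and the example weights and instance are assumptions for illustration only.

    import numpy as np

    def perceptron_update(w, x, t, eta=0.1):
        """One application of the rule w <- w + eta * (t - o) * x.
        x must already include the constant component x0 = 1."""
        o = 1 if np.dot(w, x) > 0 else -1       # current perceptron output
        return w + eta * (t - o) * x            # unchanged when t == o

    # hypothetical weights and one misclassified instance (true class t = -1)
    w = np.array([2.0, 4.0, -2.0])              # w0, w1, w2
    x = np.array([1.0, 1.0, 0.0])               # x0 = 1, x1 = 1, x2 = 0
    print(perceptron_update(w, x, t=-1))        # [1.8, 3.8, -2.0]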
Example of Perceptron Training: The OR Function
Training data (OR): (x1=0, x2=0) has class -1; (0,1), (1,0), and (1,1) have class 1.
The initial random weights define the line x1 = 0.5: the classifier outputs 1 in the area where x1 > 0.5 and -1 where x1 < 0.5.
The only misclassified example is x1 = 0, x2 = 1.
Applying the training rule to this example moves the decision line: the new line crosses x1 = -0.5 (for x2 = 0) and x2 = -0.5 (for x1 = 0).
With this new line one example is still misclassified, so the next iteration updates the weights again.
After that update the classification is perfect, and no further change occurs.
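The whole worked example can be reproduced with a few lines of Python. The initial weights are chosen so that the starting decision line is x1 = 0.5, and the learning rate of 0.25 is an assumption (the slides do not state one).

    import numpy as np

    def sgn(z):
        return 1 if z > 0 else -1

    # OR training set with -1/1 targets; first column is x0 = 1
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, 1, 1, 1])

    # initial weights chosen so the starting decision line is x1 = 0.5 (assumption)
    w = np.array([-0.5, 1.0, 0.0])
    eta = 0.25

    for epoch in range(20):
        changed = False
        for x_i, t_i in zip(X, t):
            o = sgn(np.dot(w, x_i))
            if o != t_i:
                w = w + eta * (t_i - o) * x_i   # perceptron training rule
                changed = True
        if not changed:                         # convergence: nothing misclassified
            break

    print(w, [sgn(np.dot(w, x_i)) for x_i in X])   # predictions become [-1, 1, 1, 1]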
Analysis of the Perceptron Training Rule
The algorithm will always converge within a finite number of iterations if the data are linearly separable.
Otherwise, it may oscillate and never converge.
Training by Gradient Descent
Similar, but:
- Always converges
- Generalizes to training networks of perceptrons (neural networks)
Idea: define an error function, then search for weights that minimize the error, i.e., find weights that zero the error gradient.
Setting Up the Gradient Descent
Mean squared error over the training examples d: E(w) = 1/2 Σ_d (t_d − o_d)²
Minima exist where the gradient is zero: ∂E/∂w_i = 0 for every weight w_i.
The Sign Function is not Differentiable
[Plot: the sign-function classifier over the OR example with decision line x1 = 0.5; its output jumps discontinuously across the line]
Use Differentiable Transfer Functions
Replace the sign function with the sigmoid: σ(y) = 1 / (1 + e^(−y))
Calculating the Gradient
For a sigmoid unit with output o_d = σ(w · x_d), using σ'(y) = σ(y)(1 − σ(y)):
∂E/∂w_i = −Σ_d (t_d − o_d) o_d (1 − o_d) x_{i,d}
Updating the Weights with Gradient Descent
w_i <- w_i − η ∂E/∂w_i
Each weight update goes through all the training instances, so each update is more expensive but more accurate.
The procedure always converges to a local minimum of the error, regardless of the data.
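A minimal Python sketch of this batch gradient descent for a single sigmoid unit, assuming a learning rate of 0.5 and using 0/1 targets (rather than the -1/1 labels used with the sgn classifier) so that the targets lie in the sigmoid's output range:

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    def gradient_descent(X, t, eta=0.5, epochs=1000):
        """Batch gradient descent for one sigmoid unit minimizing
        E(w) = 1/2 * sum_d (t_d - o_d)^2.  X includes the column x0 = 1."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            o = sigmoid(X @ w)                        # outputs for all examples
            grad = -((t - o) * o * (1 - o)) @ X       # dE/dw, summed over examples
            w -= eta * grad                           # step against the gradient
        return w

    # hypothetical data: the OR function with 0/1 targets
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([0.0, 1.0, 1.0, 1.0])
    w = gradient_descent(X, t)
    print(np.round(sigmoid(X @ w), 2))   # low output for (0,0), high for the rest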
Multiclass Problems
Example: 4 classes. Two possible solutions:
1. Use one perceptron (output unit) and interpret its value by range: output 0-0.25 is class 1, 0.25-0.5 is class 2, 0.5-0.75 is class 3, 0.75-1 is class 4.
2. Use four output units and a binary encoding of the output (1-of-m encoding).
1-of-M Encoding
[Diagram: inputs x0-x4 feeding four output units, one per class]
Assign the instance to the class whose unit has the largest output.
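A small sketch of 1-of-M encoding and the largest-output decision rule; the class count and the sample output values are made up for illustration.

    import numpy as np

    def one_of_m(label, m):
        """Encode a class index (0..m-1) as an m-dimensional 0/1 target vector."""
        target = np.zeros(m)
        target[label] = 1.0
        return target

    def predict_class(outputs):
        """Assign to the class whose output unit is largest."""
        return int(np.argmax(outputs))

    print(one_of_m(2, 4))                        # [0. 0. 1. 0.]
    print(predict_class([0.1, 0.7, 0.15, 0.3]))  # 1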
Feed-Forward Neural Networks
[Diagram: inputs x0-x4 in the input layer, feeding hidden layer 1, then hidden layer 2, then the output layer]
Increased Expressiveness Example: Exclusive OR
[Plot of the XOR training examples in the (x1, x2) plane, and a single unit with inputs x0, x1, x2 and weights w0, w1, w2]
No line (no set of three weights) can separate the training examples (learn the true function).
Increased Expressiveness Example (continued)
[Diagram: a network with inputs x0, x1, x2, two hidden units H1 and H2 (each with a threshold weight on x0), and an output unit O combining H1 and H2]
Feeding the four training examples through the network fills in a truth table over (x1, x2, H1, H2, O):
T1: x1=0, x2=0 -> O = 0
T2: x1=0, x2=1 -> O = 1
T3: x1=1, x2=0 -> O = 1
T4: x1=1, x2=1 -> O = 0
Each hidden unit computes a linearly separable sub-function of the inputs, and the output unit combines H1 and H2 to reproduce the XOR function.
From the Viewpoint of the Output Layer
[Diagram: the training examples T1-T4 plotted in the original (x1, x2) space and, after being mapped by the hidden layer, in the (H1, H2) space]
Each hidden layer maps the data to a new instance space, and each hidden node is a new constructed feature.
In the new space the original problem may become separable (or easier): in the (H1, H2) space, XOR is linearly separable.
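The following Python sketch illustrates this remapping for XOR with threshold units; the particular weights are one hypothetical choice that solves XOR, not necessarily the ones shown on the slides.

    import numpy as np

    def step(y):
        return (y > 0).astype(float)

    # XOR inputs
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

    # One hypothetical weight setting that solves XOR:
    # H1 fires for (x1 AND NOT x2), H2 fires for (x2 AND NOT x1), O fires for (H1 OR H2).
    W_hidden = np.array([[ 1.0, -1.0],     # weights into H1 (from x1, x2)
                         [-1.0,  1.0]])    # weights into H2
    b_hidden = np.array([-0.5, -0.5])      # threshold weights (the x0 = 1 inputs)
    w_out = np.array([1.0, 1.0])
    b_out = -0.5

    H = step(X @ W_hidden.T + b_hidden)    # hidden layer: the new feature space
    O = step(H @ w_out + b_out)            # output layer sees a separable problem

    print(H)   # each row is (H1, H2) for one training example
    print(O)   # [0. 1. 1. 0.] -- XOR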
How to Train Multi-Layered Networks
- Select a network structure (number of hidden layers, hidden nodes, and connectivity).
- Select transfer functions that are differentiable.
- Define a (differentiable) error function.
- Search for weights that minimize the error function, using gradient descent or another optimization method.
This procedure is BACKPROPAGATION.
[Diagram: a network with inputs x0, x1, x2, two hidden units, and one output, with weights w_{k,j} into the hidden layer and w'_{j,i} into the output unit]
Back-Propagating the Error
Notation: w_{j,i} is the weight from unit j to unit i; x_j is the output of unit j; i indexes the output layer, j the hidden layer, k the input layer.
Output unit(s): δ_i = o_i (1 − o_i)(t_i − o_i)
Hidden unit(s): δ_j = o_j (1 − o_j) Σ_i w_{j,i} δ_i
Weight updates: w_{j,i} <- w_{j,i} + η δ_i x_j, and similarly w_{k,j} <- w_{k,j} + η δ_j x_k
The error terms δ are computed at the output layer first and then propagated backwards through the hidden layer(s).
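A compact Python sketch of these update equations for a network with one hidden layer of sigmoid units, trained on XOR; the learning rate, the random initialization, and the number of epochs are assumptions for illustration.

    import numpy as np

    def sigmoid(y):
        return 1.0 / (1.0 + np.exp(-y))

    def backprop_epoch(X, T, W_h, W_o, eta=0.5):
        """One epoch of backpropagation for one hidden layer of sigmoid units.
        X rows include the constant input x0 = 1; W_h and W_o are updated in place."""
        for x, t in zip(X, T):
            # forward pass
            h = sigmoid(W_h @ x)                    # hidden-layer outputs
            h1 = np.concatenate(([1.0], h))         # constant 1 plays the bias role
            o = sigmoid(W_o @ h1)                   # output-layer outputs
            # backward pass: error terms
            delta_o = o * (1 - o) * (t - o)                     # output units
            delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)    # hidden units
            # weight updates: w <- w + eta * delta * input
            W_o += eta * np.outer(delta_o, h1)
            W_h += eta * np.outer(delta_h, x)

    # hypothetical run: learn XOR with two hidden units
    rng = np.random.default_rng(0)
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    T = np.array([[0.0], [1.0], [1.0], [0.0]])
    W_h = rng.uniform(-0.5, 0.5, (2, 3))
    W_o = rng.uniform(-0.5, 0.5, (1, 3))
    for _ in range(5000):
        backprop_epoch(X, T, W_h, W_o)
    outputs = sigmoid(np.c_[np.ones(4), sigmoid(X @ W_h.T)] @ W_o.T)
    print(np.round(outputs, 2))   # should approach the XOR targets [0, 1, 1, 0];
                                  # a different seed or more epochs may be needed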
Training with Backpropagation
- Going once through all training examples and updating the weights is one epoch.
- Iterate until a stopping criterion is satisfied.
- The hidden layers learn new features and map the data to new spaces.
- Training reaches a local minimum of the error surface.
Overfitting with Neural Networks
- If the number of hidden units (and weights) is large, it is easy to memorize the training set (or parts of it) and fail to generalize.
- Typically, the optimal number of hidden units is much smaller than the number of input units, so each hidden layer maps to a space of smaller dimension.
- Training is stopped before a local minimum is reached.
Typical Training Curve
[Plot: error vs. epoch; the error on the training set keeps decreasing, while the real error (or the error on an independent validation set) eventually starts increasing. The ideal point to stop training is where the validation error is lowest.]
Example of Training Stopping Criteria
- Split the data into train, validation, and test sets.
- Train on the training set until the error on the validation set has been increasing (by more than epsilon over the last m iterations), or until a maximum number of epochs is reached.
- Evaluate final performance on the test set.
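A sketch of such a stopping criterion in Python; train_step and validation_error stand for hypothetical callbacks (for example, one backpropagation epoch and the validation-set error), and the default epsilon, m, and maximum epoch count are arbitrary choices.

    import numpy as np

    def train_with_early_stopping(train_step, validation_error, max_epochs=1000,
                                  epsilon=1e-4, m=10):
        """Run train_step() once per epoch; stop when the validation error has
        increased by more than epsilon over the last m epochs, or when
        max_epochs is reached.  Returns the epoch with the lowest validation error."""
        history = []
        best_epoch, best_error = 0, np.inf
        for epoch in range(max_epochs):
            train_step()                          # one pass over the training set
            err = validation_error()
            history.append(err)
            if err < best_error:
                best_epoch, best_error = epoch, err
            if len(history) > m and history[-1] - history[-1 - m] > epsilon:
                break                             # validation error is rising
        return best_epoch, best_error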
Classifying with Neural Networks
Determine the representation of the input. Example: Religion ∈ {Christian, Muslim, Jewish} can be represented either as one input taking three different values (e.g., 0.2, 0.5, 0.8) or as three inputs taking 0/1 values.
Determine the representation of the output (for multiclass problems): a single output unit vs. multiple binary units.
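The two input representations for a three-valued attribute might look like this in Python; the numeric codes 0.2/0.5/0.8 follow the slide, and everything else is illustrative.

    import numpy as np

    RELIGION = ["Christian", "Muslim", "Jewish"]

    def encode_single_input(value):
        """Option 1: one input taking three different numeric values."""
        levels = {"Christian": 0.2, "Muslim": 0.5, "Jewish": 0.8}
        return np.array([levels[value]])

    def encode_three_inputs(value):
        """Option 2: three 0/1 inputs, one per category."""
        return np.array([1.0 if value == r else 0.0 for r in RELIGION])

    print(encode_single_input("Muslim"))   # [0.5]
    print(encode_three_inputs("Muslim"))   # [0. 1. 0.]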
Classifying with Neural Networks (continued)
Select the network structure: number of hidden layers, number of hidden units, and connectivity. Typically: one hidden layer, a number of hidden units that is a small fraction of the number of input units, and full connectivity.
Select the error function. Typically: minimize the mean squared error or maximize the log likelihood of the data.
Classifying with Neural Networks (continued)
Select a training method. Typically gradient descent (which corresponds to vanilla backpropagation), but other optimization methods can be used: backpropagation with momentum, trust-region methods, line-search methods, conjugate gradient methods, Newton and quasi-Newton methods.
Select a stopping criterion.
Classifying with Neural Networks (continued)
The training method may also include searching for an optimal structure, as well as extensions to avoid getting stuck in local minima, such as simulated annealing or random restarts with different weights.
Classifying with Neural Networks (continued)
Split the data into:
- Training set: used to update the weights
- Validation set: used in the stopping criterion
- Test set: used to evaluate the generalization error (performance)
Other Error Functions in Neural Networks
- Adding a penalty term for the weight magnitude.
- Minimizing the cross entropy with respect to the target values; the network outputs are then interpretable as probability estimates.
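A sketch of these two error functions in Python; the regularization constant lam and the sample targets, outputs, and weights are assumptions for illustration.

    import numpy as np

    def cross_entropy(t, o):
        """Cross entropy between 0/1 targets t and sigmoid outputs o."""
        eps = 1e-12                               # guard against log(0)
        o = np.clip(o, eps, 1 - eps)
        return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

    def penalized_error(t, o, weights, lam=0.01):
        """Cross entropy plus a penalty on weight magnitude (lam is assumed)."""
        penalty = lam * sum(np.sum(w ** 2) for w in weights)
        return cross_entropy(t, o) + penalty

    # hypothetical targets, outputs, and weight matrices
    t = np.array([0.0, 1.0, 1.0, 0.0])
    o = np.array([0.1, 0.8, 0.7, 0.2])
    w_example = [np.array([[0.5, -1.0], [2.0, 0.3]])]
    print(round(cross_entropy(t, o), 3))
    print(round(penalized_error(t, o, w_example), 3))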
Representational Power
- Perceptron: can learn only linearly separable functions.
- Boolean functions: learnable by a neural network with one hidden layer.
- Continuous functions: learnable by a neural network with one hidden layer of sigmoid units.
- Arbitrary functions: learnable by a neural network with two hidden layers of sigmoid units.
- The required number of hidden units is in all cases unknown.
Issues with Neural Networks
No principled method for selecting the number of layers and units. Heuristics include:
- Tiling: start with a small network and keep adding units.
- Optimal brain damage: start with a large network and keep removing weights and units.
- Evolutionary methods: search the space of structures for one that generalizes well.
There is no principled method for most other design choices either.
Important but not Covered in This Tutorial
- It is very hard to understand the classification logic from direct examination of the weights; there is a large recent body of work on extracting symbolic rules and information from neural networks.
- Recurrent networks, associative networks, self-organizing maps, committees of networks, adaptive resonance theory, etc.
Why the Name "Neural Networks"?
- The initial models simulated real neurons for use in classification.
- Later efforts improved and understood classification independently of any similarity to biological neural networks.
- Other efforts aim to simulate and understand biological neural networks to a larger degree.
Conclusions
- Neural networks can deal with both real-valued and discrete domains, and can also perform density or probability estimation.
- Classification time is very fast; training time is relatively slow (does not scale to thousands of inputs).
- One of the most successful classifiers yet, but successful design choices are still a black art, and it is easy to overfit or underfit if care is not applied.