Concept Map for Ch. 3
[Concept map: feedforward networks, nonlayered vs. layered; single-layer (ALC, Ch. 1–2) vs. multilayer (Ch. 3) with sigmoid units; the multilayer perceptron computes y = F(x, W) ≈ f(x); learning {(x_i, f(x_i)) | i = 1 ~ N} → W by backpropagation (BP): gradient descent minimizes E(W) by comparing the actual output with the desired output and updating old W → new W, in scalar (w_ij) or matrix-vector (W) form.]
Chapter 3. Multilayer Perceptron (MLP)
1. MLP Architecture – extension of the perceptron to many layers with sigmoidal activation functions, for real-valued mapping/classification
Learning: from the discrete training samples, find W* such that the continuous function F(x, W*) ≈ f(x).
[Figure: sigmoidal activation function s(u_j) – the logistic function (output between 0 and 1) and the hyperbolic tangent (output between -1 and 1).]
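As a concrete reference, here is a minimal Python sketch of the two sigmoidal activations named above together with their derivatives (the derivatives are what backpropagation needs); the function names are my own, not from the slides.

```python
import numpy as np

def logistic(u):
    """Logistic sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def logistic_deriv(u):
    """Derivative of the logistic sigmoid: s(u) * (1 - s(u))."""
    s = logistic(u)
    return s * (1.0 - s)

def tanh(u):
    """Hyperbolic tangent: output in (-1, 1)."""
    return np.tanh(u)

def tanh_deriv(u):
    """Derivative of tanh: 1 - tanh(u)^2."""
    return 1.0 - np.tanh(u) ** 2
```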
2. Weight Learning Rule – Backpropagation of Error
(1) Training Data {(x_i, d_i)} → Weights (W): curve (data) fitting (modeling, nonlinear regression); the NN approximating function F(x, W) is fitted to the true function f(x).
(2) Mean Squared Error E, for a 1-D function as an example
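For reference, a common way to write the mean squared error over the N training pairs is sketched below (the 1/2 factor is a convention that cancels when differentiating; the slides may use a slightly different normalization):

```latex
E(W) = \frac{1}{2N} \sum_{i=1}^{N} \bigl( d_i - F(x_i, W) \bigr)^2
```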
(3) Gradient Descent Learning
(4) Learning Curve: E{W(n), along the weight track} plotted against the number of iterations n, where one iteration = one scan of the training set (an epoch).
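A minimal sketch of batch gradient-descent training, assuming the network's forward pass and per-sample gradient are supplied as callables (forward and gradient are hypothetical helpers passed in as arguments, eta is the learning rate); it records the mean squared error after each epoch, which, plotted against n, gives the learning curve:

```python
import numpy as np

def train(W, X, D, forward, gradient, eta=0.1, epochs=100):
    """Batch gradient descent; returns final weights and the learning curve E(n)."""
    learning_curve = []
    for n in range(epochs):            # one epoch = one scan of the training set
        grad = np.zeros_like(W)
        error = 0.0
        for x, d in zip(X, D):
            y = forward(W, x)          # user-supplied forward pass (hypothetical)
            grad += gradient(W, x, d)  # user-supplied dE/dW for one sample (hypothetical)
            error += 0.5 * np.sum((d - y) ** 2)
        W = W - eta * grad / len(X)    # gradient-descent update: W_new = W_old - eta * dE/dW
        learning_curve.append(error / len(X))
    return W, learning_curve
```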
(5) Backpropagation Learning Rule
Features: locality of computation, no centralized control, 2-pass (forward and backward).
A. Output-Layer Weights: Δw_jk = η·δ_k·y_j, where δ_k = (d_k − y_k)·s′(u_k)
B. Inner-Layer Weights: Δw_ij = η·δ_j·x_i, where δ_j = s′(u_j)·Σ_k δ_k·w_jk (credit assignment)
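The two cases can be written as small Python helpers; this is a sketch in the notation above (s_deriv is the activation derivative s′, eta the learning rate; the function names and array-shape conventions are my own):

```python
import numpy as np

def delta_output(d, y, u, s_deriv):
    """Output-layer error signal: delta_k = (d_k - y_k) * s'(u_k)."""
    return (d - y) * s_deriv(u)

def delta_hidden(u, W_jk, delta_k, s_deriv):
    """Hidden-layer error signal (credit assignment):
    delta_j = s'(u_j) * sum_k w_jk * delta_k; W_jk has shape (n_hidden, n_output)."""
    return s_deriv(u) * (W_jk @ delta_k)

def weight_update(eta, x, delta):
    """Delta rule: dw_ij = eta * x_i * delta_j; returns a (n_in, n_out) matrix."""
    return eta * np.outer(x, delta)
```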
Water-Flow Analogy to Backpropagation
[Figure: a river with many branching flows (weights w_1 ... w_l); an object is dropped at the input and fetched at the output.]
- Many weights (flows).
- If the error is very sensitive to a change in a weight, change that weight a lot, and vice versa → gradient descent, minimum-disturbance principle.
(6) Computation Example: MLP(2-1-2)
A. Forward Processing – compute the function signals. No desired response is needed for the hidden nodes. The derivative s′ must exist; s = sigmoid (tanh or logistic). For classification, use d = ±0.9 for tanh and d = 0.1, 0.9 for logistic.
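A minimal numeric sketch of the forward processing for an MLP(2-1-2) with tanh activations; the weights, biases, and input values below are illustrative, not the numbers used on the slide:

```python
import numpy as np

# Illustrative parameters (not the slide's numbers)
x  = np.array([0.5, -0.3])            # 2 inputs
W1 = np.array([[0.4, -0.2]])          # input -> hidden weights, shape (1, 2)
b1 = np.array([0.1])                  # hidden bias
W2 = np.array([[0.3], [-0.5]])        # hidden -> output weights, shape (2, 1)
b2 = np.array([0.2, -0.1])            # output biases

# Forward processing: compute the function signals layer by layer
u = W1 @ x + b1                       # hidden net input
h = np.tanh(u)                        # hidden output (no desired response needed here)
v = W2 @ h + b2                       # output net inputs
y = np.tanh(v)                        # the 2 network outputs

print("hidden output:", h, "outputs:", y)
```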
B. Backward Processing – compute the error signals. For each output k (k = 1, 2): e_k = d_k − y_k and δ_k = e_k·s′(v_k). The hidden node's error signal is obtained by credit assignment back through the output-layer weights. The sums and activations needed here have already been computed in the forward processing.
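Continuing the same illustrative sketch (reusing x, u, h, v, y, W1, b1, W2, b2 from the forward-processing block above, plus a made-up desired output d and learning rate eta), the backward processing computes the error signals and weight updates:

```python
# Backward processing: compute error signals, reusing forward-pass quantities
d = np.array([0.9, -0.9])                  # illustrative desired outputs (tanh targets)
e = d - y                                  # output errors e_k = d_k - y_k

delta_out = e * (1.0 - y ** 2)             # output error signals: e_k * s'(v_k), with s = tanh
delta_hid = (1.0 - h ** 2) * (W2.T @ delta_out)   # credit assignment to the hidden node

eta = 0.1                                  # learning rate
W2 += eta * np.outer(delta_out, h)         # output-layer update: dw_jk = eta * delta_k * h_j
b2 += eta * delta_out
W1 += eta * np.outer(delta_hid, x)         # inner-layer update:  dw_ij = eta * delta_j * x_i
b1 += eta * delta_hid
```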
If we knew f(x,y), it would be a lot faster to use it to calculate the output than to use the NN.
Student Questions:
- Does the output error become more uncertain for a complex multilayer network than for a single-layer one?
- Should we use only up to 3 layers?
- Why can oscillation occur in the learning curve?
- Do we use the old weights when calculating the error signal δ?
- What does ANN mean?
- Which makes more sense, the error gradient or the weight gradient, considering the equation for the weight change?
- What serves as the error signal to train the weights in the forward pass?