Neural Networks and Backpropagation. Sebastian Thrun, Fall 2000
Outline: Perceptrons; Learning Hidden Layer Representations; Speeding Up Training; Bias, Overfitting and Early Stopping; (Example: Face Recognition)
ALVINN drives 70 mph on highways (Dean Pomerleau, CMU)
Human Brain
Neurons
Human Learning. Number of neurons: ~10^10. Connections per neuron: ~10^4 to 10^5. Neuron switching time: ~0.001 second. Scene recognition time: ~0.1 second. 100 inference steps doesn't seem like much.
The “Bible” (1986)
Perceptron: inputs x_1 … x_n with weights w_1 … w_n and bias weight w_0 (x_0 = 1); net = Σ_i w_i x_i; output o = 1 if net > 0, 0 otherwise.
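A minimal Python sketch of this threshold unit, assuming the weights are passed as a plain list with w_0 as the bias; the function name and the example weights are illustrative, not taken from the slides.

```python
# Minimal sketch of the threshold unit above; the example weights are illustrative.
def perceptron(x, w):
    """x: inputs x_1..x_n; w: weights w_0..w_n, where w_0 is the bias weight (x_0 = 1)."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else 0

# A unit with w_0 = -0.5, w_1 = w_2 = 1 outputs 1 exactly when x_1 + x_2 > 0.5.
print(perceptron((1, 0), (-0.5, 1, 1)))   # -> 1
print(perceptron((0, 0), (-0.5, 1, 1)))   # -> 0
```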
Inverter: input x_1 → output NOT x_1 (0 → 1, 1 → 0); weights w_1 = -1, w_0 = 0.5.
Boolean OR: inputs x_1, x_2 → output x_1 OR x_2; weights w_1 = 1, w_2 = 1, w_0 = -0.5.
Boolean AND: inputs x_1, x_2 → output x_1 AND x_2; weights w_1 = 1, w_2 = 1, w_0 = -1.5.
Boolean XOR: inputs x_1, x_2 → output x_1 XOR x_2. Eeek! No single perceptron can represent it.
Linear Separability: OR is linearly separable in the (x_1, x_2) plane.
Linear Separability: AND is linearly separable in the (x_1, x_2) plane.
Linear Separability: XOR is not linearly separable: no single line separates its positive and negative examples.
Boolean XOR with a hidden layer: h_1 = AND(x_1, x_2) (w_1 = w_2 = 1, w_0 = -1.5), h_2 = OR(x_1, x_2) (w_1 = w_2 = 1, w_0 = -0.5); the output unit fires when h_2 = 1 and h_1 = 0, which is exactly XOR (see the sketch below).
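A sketch of this hand-built two-layer XOR network, assuming the threshold units defined above; the output-unit weights (1 for the OR unit, -1 for the AND unit, bias -0.5) are one standard choice, shown here as an illustration rather than taken verbatim from the slide.

```python
# Sketch of the two-layer XOR construction: XOR(x1, x2) = OR(x1, x2) AND NOT AND(x1, x2).
def step(net):
    return 1 if net > 0 else 0

def xor(x1, x2):
    h_and = step(1 * x1 + 1 * x2 - 1.5)      # hidden unit h_1: Boolean AND
    h_or  = step(1 * x1 + 1 * x2 - 0.5)      # hidden unit h_2: Boolean OR
    return step(1 * h_or - 1 * h_and - 0.5)  # output: OR but not AND (assumed weights)

assert [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```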
Perceptron Training Rule: w_i ← w_i + Δw_i, with Δw_i = η (t - o) x_i, where η = step size, t = target, o = perceptron output, x_i = input.
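A runnable sketch of this rule, assuming a 0/1 threshold unit; the function name train_perceptron, the toy OR dataset, and the hyperparameters are illustrative choices.

```python
import random

# Sketch of the perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (inputs, target) pairs; returns weights w_0..w_n (w_0 = bias)."""
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in data:
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else 0
            w[0] += eta * (t - o)                # bias update, since x_0 = 1
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w

# Boolean OR is linearly separable, so the rule converges on it (next slide).
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(or_data))
```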
Converges, if… … training data linearly separable … step size sufficiently small … no “hidden” units
How To Train Multi-Layer Perceptrons? Gradient descent.
Sigmoid Squashing Function: the same unit (inputs x_1 … x_n, weights w_0 … w_n, x_0 = 1, net = Σ_i w_i x_i), but the output is o = σ(net) instead of a hard threshold.
Sigmoid Squashing Function: σ(x) = 1 / (1 + e^(-x)).
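A small sketch of the squashing function and its derivative; sigmoid_prime uses the identity σ'(x) = σ(x)(1 - σ(x)), which is where the o_d (1 - o_d) factor in the update rules on the following slides comes from.

```python
import math

# The sigmoid squashing function and its derivative sigma'(x) = sigma(x) * (1 - sigma(x)).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)
```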
Gradient Descent: learn the w_i that minimize the squared error E(w) = ½ Σ_{d∈D} (t_d - o_d)², where D = training data.
Gradient Descent. Gradient: ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]. Training rule: Δw = -η ∇E(w), i.e. Δw_i = -η ∂E/∂w_i.
Gradient Descent (single layer): for a sigmoid unit o_d = σ(Σ_i w_i x_{i,d}), the gradient is ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}.
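A sketch of the derivation behind this gradient, assuming the sigmoid unit from the previous slides; the result matches the batch update on the next slide.

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2
   = \sum_{d \in D}(t_d - o_d)\,\frac{\partial(-o_d)}{\partial w_i} \\
  &= -\sum_{d \in D}(t_d - o_d)\,o_d(1 - o_d)\,x_{i,d}
   \quad\text{since } o_d = \sigma\Big(\textstyle\sum_i w_i x_{i,d}\Big)
   \text{ and } \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr), \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
   = \eta \sum_{d \in D}(t_d - o_d)\,o_d(1 - o_d)\,x_{i,d}.
\end{align*}
```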
Batch Learning
Initialize each w_i to a small random value
Repeat until termination:
  Δw_i ← 0
  For each training example d:
    o_d ← σ(Σ_i w_i x_{i,d})
    Δw_i ← Δw_i + η (t_d - o_d) o_d (1 - o_d) x_{i,d}
  w_i ← w_i + Δw_i
Incremental (Online) Learning
Initialize each w_i to a small random value
Repeat until termination:
  For each training example d:
    Δw_i ← 0
    o_d ← σ(Σ_i w_i x_{i,d})
    Δw_i ← Δw_i + η (t_d - o_d) o_d (1 - o_d) x_{i,d}
    w_i ← w_i + Δw_i
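A runnable sketch of the incremental rule for a single sigmoid unit; train_sigmoid_unit and the hyperparameter defaults are illustrative. The batch variant on the previous slide would instead accumulate the Δw_i over all examples before applying them.

```python
import math
import random

# Sketch of the incremental (online) rule for one sigmoid unit:
# after each example d, w_i <- w_i + eta * (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sigmoid_unit(data, eta=0.5, epochs=1000):
    """data: list of (inputs, target in [0, 1]); returns weights w_0..w_n (w_0 = bias)."""
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in data:
            xd = (1.0,) + tuple(x)                              # prepend x_0 = 1
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, xd)))
            delta = eta * (t - o) * o * (1.0 - o)
            for i, xi in enumerate(xd):
                w[i] += delta * xi                              # update right away
    return w
```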
Backpropagation Algorithm Generalization to multiple layers and multiple output units
Backpropagation Algorithm
Initialize all weights to small random numbers
For each training example do:
 – For each hidden unit h: compute its activation o_h = σ(Σ_i w_{hi} x_i), then the outputs o_k = σ(Σ_h w_{kh} o_h)
 – For each output unit k: δ_k ← o_k (1 - o_k) (t_k - o_k)
 – For each hidden unit h: δ_h ← o_h (1 - o_h) Σ_k w_{kh} δ_k
 – Update each network weight w_{ij}: w_{ij} ← w_{ij} + Δw_{ij}, with Δw_{ij} = η δ_j x_{ij} (x_{ij} = the input that w_{ij} multiplies)
Backpropagation Algorithm: activations flow forward through the network, errors (the δ's) flow backward.
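A compact sketch of the algorithm for one hidden layer of sigmoid units; train_backprop, the network size, the learning rate, and the XOR training task are illustrative choices, not from the slides.

```python
import math
import random

# Minimal backpropagation sketch: one hidden layer of sigmoid units, online updates.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(data, n_hidden=2, eta=0.5, epochs=10000):
    n_in, n_out = len(data[0][0]), len(data[0][1])
    rnd = lambda: random.uniform(-0.5, 0.5)
    # w_h[h][i]: input i -> hidden h; w_o[k][h]: hidden h -> output k (index 0 = bias)
    w_h = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rnd() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in data:
            xd = [1.0] + list(x)
            # forward pass: activations
            o_h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(wh, xd))) for wh in w_h]
            o_k = [sigmoid(sum(w * hi for w, hi in zip(wk, o_h))) for wk in w_o]
            # backward pass: errors
            d_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o_k, t)]
            d_h = [o_h[h + 1] * (1 - o_h[h + 1]) *
                   sum(w_o[k][h + 1] * d_k[k] for k in range(n_out))
                   for h in range(n_hidden)]
            # weight updates: delta_w_ij = eta * delta_j * x_ij
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * d_k[k] * o_h[j]
            for h in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[h][i] += eta * d_h[h] * xd[i]
    return w_h, w_o

# Example: XOR, which a single perceptron cannot represent.
xor_data = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_h, w_o = train_backprop(xor_data)
```

As a later slide notes, gradient descent can occasionally get stuck in a local minimum, so a rerun with fresh random weights may be needed.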
Can This Be Learned? Input → Output: eight one-hot patterns (10000000, 01000000, …, 00000001) must be mapped to themselves through a hidden layer of only three units.
Learned Hidden Layer Representation: the Input → Output identity is learned, and the three hidden units settle on roughly binary values, i.e. the network invents a compact 3-bit code for the eight inputs.
Training: Internal Representation
Training: Error
Training: Weights
ANNs in Speech Recognition [Huang/Lippmann 1988]
Speeding It Up: Momentum. Keep a fraction of the previous weight change: Δw_{ij}(n) = -η ∂E/∂w_{ij} + α Δw_{ij}(n-1). (Figure: error E as a function of weight w_{ij}, old vs. new w_{ij}, comparing plain gradient descent with GD with momentum.)
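A sketch of the momentum update, assuming a generic gradient_fn that returns ∂E/∂w_i for the current weights; the function name, defaults, and the toy quadratic demo are illustrative.

```python
# Gradient descent with momentum: the weight change keeps a fraction alpha
# of the previous step, dw(n) = -eta * grad E(w) + alpha * dw(n-1).
def gd_with_momentum(w, gradient_fn, eta=0.1, alpha=0.9, steps=1000):
    """w: list of weights; gradient_fn(w) returns dE/dw_i for each weight."""
    prev_dw = [0.0] * len(w)
    for _ in range(steps):
        grad = gradient_fn(w)
        dw = [-eta * g + alpha * pd for g, pd in zip(grad, prev_dw)]
        w = [wi + dwi for wi, dwi in zip(w, dw)]
        prev_dw = dw
    return w

# Toy demo: minimizing E(w) = w_1^2 + w_2^2, whose gradient is 2w.
print(gd_with_momentum([1.0, -2.0], lambda w: [2 * wi for wi in w]))  # -> near [0, 0]
```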
Convergence: may get stuck in local minima; weights may diverge; …but works well in practice.
Overfitting in ANNs
Early Stopping (Important!!!) Stop training when the error on a held-out validation set starts going up.
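A sketch of early stopping, assuming hypothetical train_epoch and validation_error callables; the patience heuristic and all names are illustrative.

```python
import copy

# Early stopping sketch: keep the weights that did best on a held-out validation
# set and stop once validation error has not improved for `patience` epochs.
def train_with_early_stopping(w, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_w, best_err, since_best = copy.deepcopy(w), float("inf"), 0
    for _ in range(max_epochs):
        w = train_epoch(w)                 # one pass of backprop over the training set
        err = validation_error(w)          # error on data never used for training
        if err < best_err:
            best_w, best_err, since_best = copy.deepcopy(w), err, 0
        else:
            since_best += 1
            if since_best >= patience:     # validation error keeps going up: stop
                break
    return best_w
```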
Sigmoid Squashing Function: when the units operate in the near-linear range of σ(x), the number of hidden units doesn't really matter, since a composition of (nearly) linear functions is still (nearly) linear.
ANNs for Face Recognition. Typical input images with four head poses (left, straight, right, up). Head pose (1-of-4): 90% accuracy. Face recognition (1-of-20): 90% accuracy.
Recurrent Networks