1
Backpropagation
2
Last time…
3
Correlational learning: Hebb rule
What Hebb actually said: "When an axon of cell A is near enough to excite a cell B and repeatedly and consistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficacy, as one of the cells firing B, is increased."
The minimal version of the Hebb rule: when there is a synapse between cell A and cell B, increment the strength of the synapse whenever A and B fire together (or in close succession).
The minimal Hebb rule as implemented in a network:
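A minimal sketch of that rule in code (NumPy is assumed; the learning rate and the example activations are illustrative, not from the slide):

```python
import numpy as np

def hebb_update(W, pre, post, lr=0.1):
    """Minimal Hebb rule: increment w_ij whenever pre-unit i and post-unit j
    are active together. W has shape (n_post, n_pre)."""
    return W + lr * np.outer(post, pre)

# Example: two input units, one output unit; input 1 and the output co-fire
W = np.zeros((1, 2))
W = hebb_update(W, pre=np.array([1.0, 0.0]), post=np.array([1.0]))
print(W)   # the weight from the co-active input grows; the other stays at 0
```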
4
Limitations of Hebbian learning
Many association problems it cannot solve, especially where similar input patterns must produce quite different outputs. Without further constraints, weights grow without bound. Each weight is learned independently of all others. Weight changes are exactly the same from pass to pass.
5
Predictive error-driven learning with linear units…
6
[Diagram: input units 1 and 2 connected to output unit 3 by weights w1 and w2.]
7
Why don’t we just use two-layer perceptrons?
8
$a_j = b_w + \sum_i a_i w_{ij}$
[Diagram: input unit i connects to unit j with weight $w_{ij}$; a bias unit b connects to unit j with weight $b_w$.]
9
[Diagram: input units 1 and 2 and a bias unit b feed output unit 3 through weights w1 and w2; the accompanying table lists Input 1, Input 2, and Output for each training pattern.]
$a_3 = b_w + a_1 w_1 + a_2 w_2$
11
Linear separability problem: with linear units (and thresholded linear units), *no solutions exist* for input/output mappings that are not linearly separable. What if we add an extra layer between input and output?
12
[Diagram: inputs 1 and 2 feed hidden units 3 and 4 through weights w1–w4; hidden units 3 and 4 feed output unit 5 through weights w5 and w6.]
$a_5 = a_3 w_5 + a_4 w_6$
$\;\;\;= (a_1 w_1 + a_2 w_3)\,w_5 + (a_1 w_2 + a_2 w_4)\,w_6$
$\;\;\;= a_1 (w_1 w_5 + w_2 w_6) + a_2 (w_3 w_5 + w_4 w_6)$
$\;\;\;= a_1 w_x + a_2 w_y$
Same as a linear network without any hidden layer!
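A quick numerical check of that algebra (a sketch; the particular weight values are arbitrary):

```python
import numpy as np

# Arbitrary weights for the 2-2-1 linear network on the slide
w1, w2, w3, w4 = 0.5, -1.2, 2.0, 0.7   # input -> hidden
w5, w6 = 1.5, -0.3                     # hidden -> output

def two_layer_linear(a1, a2):
    a3 = a1 * w1 + a2 * w3
    a4 = a1 * w2 + a2 * w4
    return a3 * w5 + a4 * w6

# Equivalent single-layer weights
wx = w1 * w5 + w2 * w6
wy = w3 * w5 + w4 * w6

for a1, a2 in [(0, 0), (0, 1), (1, 0), (1, 1), (0.3, -2.0)]:
    assert np.isclose(two_layer_linear(a1, a2), a1 * wx + a2 * wy)
```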
13
What if we use thresholded units?
14
[Diagram: the same 2-2-1 network, now with thresholded units.]
$net_j = \sum_i a_i w_{ij}$; if $net_j >$ threshold, $a_j = 1$, else $a_j = 0$.
15
$net_j = \sum_i a_i w_{ij}$; if $net_j > 9.9$, $a_j = 1$, else $a_j = 0$.
[Diagram: both inputs connect to hidden unit 3 with weight 10 and to hidden unit 4 with weight 5; unit 3 connects to output unit 5 with weight 10, and unit 4 with weight -10. Inset: the four input patterns 00, 01, 10, 11 plotted with the resulting activations of units 3 and 4.]
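As a check, a short sketch that runs this hand-set network on all four input patterns (the assignment of the 10/10 and 5/5 weights to units 3 and 4 follows the reading of the diagram given above):

```python
import numpy as np

def step(net, thresh=9.9):
    """Thresholded unit: 1 if the net input exceeds the threshold, else 0."""
    return (net > thresh).astype(float)

W_hidden = np.array([[10.0, 10.0],   # weights into unit 3 (acts like OR)
                     [ 5.0,  5.0]])  # weights into unit 4 (acts like AND)
w_out = np.array([10.0, -10.0])      # unit 3 -> 5 is +10, unit 4 -> 5 is -10

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden = step(W_hidden @ np.array(x, dtype=float))
    output = step(w_out @ hidden)
    print(x, "->", int(output))      # prints the XOR of the two inputs
```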
16
[Figure: the four input patterns 00, 01, 10, 11 plotted in input space, and the same patterns re-plotted in hidden space (unit 3 activation vs. unit 4 activation); in hidden space, the patterns that require output 1 can be separated from those that require output 0 by a straight line.]
17
So with thresholded units and a hidden layer, solutions exist…
…and solutions can be viewed as “re-representing” the inputs, so as to make the mapping to the output unit learnable. BUT, how can we learn the correct weights instead of just setting them by hand?
18
The simple delta rule again, for linear units ($a_j = \sum_i a_i w_{ij}$), with error $E = \tfrac{1}{2}\sum_j (t_j - a_j)^2$:
$\dfrac{dE}{dw_{ij}} = \dfrac{dE}{da_j}\dfrac{da_j}{dw_{ij}} = -(t_j - a_j)\,a_i \qquad \Delta w_{ij} = \alpha\,(t_j - a_j)\,a_i$
But what if the unit is nonlinear, $a_j = f(net_j)$ with $net_j = \sum_i a_i w_{ij}$? Then
$\dfrac{dE}{dw_{ij}} = \dfrac{dE}{da_j}\dfrac{da_j}{dnet_j}\dfrac{dnet_j}{dw_{ij}}$
…What function should we use for $a_j$?
19
Can’t use a threshold function. Why not?
[Figure: the step (threshold) activation function plotted against net input; its derivative is 0 everywhere except at the threshold, where it is infinite.]
20
Sigmoid function: $a_j = \dfrac{1}{1 + e^{-net_j}}$
[Figure: the sigmoid activation (ranging from 0 to 1) and its change in activation (slope) plotted against net input from -10 to 10; the slope is greatest near a net input of 0.]
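A minimal sketch of the sigmoid and its slope (NumPy is assumed):

```python
import numpy as np

def sigmoid(net):
    """Logistic (sigmoid) activation: squashes any net input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_slope(net):
    """Slope of the sigmoid, da/dnet = a(1 - a): largest at net = 0."""
    a = sigmoid(net)
    return a * (1.0 - a)

net = np.linspace(-10.0, 10.0, 9)
print(np.round(sigmoid(net), 3))
print(np.round(sigmoid_slope(net), 3))
```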
21
With the sigmoid $a_j = \dfrac{1}{1 + e^{-net_j}}$, the slope is $\dfrac{da_j}{dnet_j} = a_j(1 - a_j)$. Keeping $E = \tfrac{1}{2}\sum_j (t_j - a_j)^2$ and $net_j = \sum_i a_i w_{ij}$:
$\dfrac{dE}{dw_{ij}} = \dfrac{dE}{da_j}\dfrac{da_j}{dnet_j}\dfrac{dnet_j}{dw_{ij}} = -(t_j - a_j)\,a_j(1 - a_j)\,a_i$
$\Delta w_{ij} = \alpha\,(t_j - a_j)\,a_j(1 - a_j)\,a_i$
Compare the simple delta rule for linear units: $\dfrac{dE}{dw_{ij}} = \dfrac{dE}{da_j}\dfrac{da_j}{dw_{ij}} = -(t_j - a_j)\,a_i$, so $\Delta w_{ij} = \alpha\,(t_j - a_j)\,a_i$.
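A sketch of that update for a single sigmoid unit (the learning rate, inputs, and target are illustrative):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_rule_update(w, a_in, target, alpha=0.5):
    """One generalized-delta-rule step for a single sigmoid unit."""
    a_out = sigmoid(a_in @ w)
    delta = (target - a_out) * a_out * (1.0 - a_out)  # (t - a) * a(1 - a)
    return w + alpha * delta * a_in

w = np.zeros(2)
for _ in range(1000):
    w = delta_rule_update(w, a_in=np.array([1.0, 1.0]), target=1.0)
print(w, sigmoid(np.array([1.0, 1.0]) @ w))  # the output approaches the target of 1
```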
22
At the output unit (unit 5), define $\delta_5 = \dfrac{dE}{da_5}\dfrac{da_5}{dnet_5}$, so that
$\dfrac{dE}{dw_5} = \dfrac{dE}{da_5}\dfrac{da_5}{dnet_5}\dfrac{dnet_5}{dw_5} = \delta_5 \times a_3$
Now consider a weight one layer back, say $w_1$ (input 1 to hidden unit 3). With $net_j = \sum_i a_i w_{ij}$:
$\dfrac{dnet_5}{da_3} = w_5$, so $\dfrac{dE}{da_3} = \delta_5 \times w_5$
$\dfrac{da_3}{dnet_3} = a_3(1 - a_3)$
$\dfrac{dnet_3}{dw_1} = a_1$
$\dfrac{dE}{dw_1} = \delta_5 \times \dfrac{dnet_5}{da_3} \times \dfrac{da_3}{dnet_3} \times \dfrac{dnet_3}{dw_1}$
[Diagram: the 2-2-1 network (weights w1–w6) with these quantities labeled on the corresponding connections.]
23
[Diagram: a multilayer network with an input layer, a hidden layer, and an output layer; targets are supplied for the output units.]
For output units, the delta is computed directly from the error. The delta is stored at each unit and used directly to adjust each of that unit's incoming weights.
For hidden units there are no targets; the "error" signal is instead the sum of the output-unit deltas, each weighted by the connecting weight. From this, deltas are computed for the hidden units, again stored with the unit and used directly to change its incoming weights.
Deltas, and hence the error signal at the output, can propagate backward through the network, across many layers, until they reach the input.
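A sketch of this whole procedure for the 2-2-1 network used throughout these slides, trained here on XOR; the random initialization, biases, learning rate, and number of epochs are illustrative choices rather than values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output, trained on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

W_h = rng.normal(0.0, 0.5, size=(2, 2))  # input -> hidden weights
b_h = np.zeros(2)                        # hidden biases
w_o = rng.normal(0.0, 0.5, size=2)       # hidden -> output weights
b_o = 0.0                                # output bias
alpha = 0.5                              # learning rate

for epoch in range(20000):
    for x, t in zip(X, T):
        # Forward pass
        h = sigmoid(W_h @ x + b_h)
        y = sigmoid(w_o @ h + b_o)
        # Output delta: computed directly from the error at the output
        delta_o = (t - y) * y * (1.0 - y)
        # Hidden deltas: the output delta passed back through the weights w_o
        delta_h = delta_o * w_o * h * (1.0 - h)
        # Every delta adjusts its own unit's incoming weights
        w_o += alpha * delta_o * h
        b_o += alpha * delta_o
        W_h += alpha * np.outer(delta_h, x)
        b_h += alpha * delta_h

for x in X:
    h = sigmoid(W_h @ x + b_h)
    print(x, float(sigmoid(w_o @ h + b_o)))
```

With only two hidden units and this kind of pattern-by-pattern updating, training can occasionally settle into a local minimum; rerunning with a different random seed usually resolves it.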
24
Alternative error functions.
25
Sum-squared error: $E = \tfrac{1}{2}(t - a)^2$, so $\dfrac{dE}{da} = -(t - a)$ and, for a sigmoid unit, $\dfrac{dE}{dnet} = -(t - a)\,a(1 - a)$.
Cross-entropy error: $E = -t \ln a - (1 - t)\ln(1 - a)$, so $\dfrac{dE}{dnet} = a - t$.
[Diagram: the 2-2-1 network again.]
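A small numerical sketch of the difference between the two gradients for a single sigmoid output unit (the activation values and target are illustrative):

```python
import numpy as np

a = np.array([0.01, 0.5, 0.99])  # activations of a sigmoid output unit
t = 1.0                          # target

# Sum-squared error: the gradient carries an extra a(1 - a) factor,
# so it nearly vanishes when the unit is confidently wrong (a near 0, t = 1)
dE_dnet_sse = -(t - a) * a * (1.0 - a)

# Cross-entropy error: the gradient is just a - t,
# so it stays large whenever the activation is far from the target
dE_dnet_xent = a - t

print(dE_dnet_sse)
print(dE_dnet_xent)
```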
26
[Diagram: the 2-2-1 network: input units 1 and 2, hidden units 3 and 4 (weights w1–w4), output unit 5 (weights w5 and w6).]
27
[Diagram and table: the two-input network (units 1 and 2 feeding unit 3 through w1 and w2) shown with an additional "new input"; the table lists Input 1, Input 2, New input, and Output for each pattern.]