Lecture 25: Backprop and convnets
CS5670: Computer Vision, Noah Snavely
Image credit: Aphex34 (CC BY-SA 4.0). Slides from Andrej Karpathy and Fei-Fei Li
Review: Setup

- The network is a chain of functions: the input $x$ passes through layers $h^{(1)}, h^{(2)}, \ldots$ to produce scores $s$ and finally a loss $L$. Each function has parameters, e.g. $h^{(1)} = W^{(1)} x + b^{(1)}$ with parameters $\theta^{(1)} = (W^{(1)}, b^{(1)})$.
- Goal: find a value for the parameters $(\theta^{(1)}, \theta^{(2)}, \ldots)$ so that the loss $L$ is small.
- Toy example: plot the loss $L$ as a function of a single weight $W^{(1)}_{12}$, a weight somewhere in the network.
- The gradient $\partial L / \partial W^{(1)}_{12}$ tells us how the loss changes as we change that weight, so we take a step in the direction that decreases the loss.
- How do we get the gradient? Backpropagation.
Backprop: it's just the chain rule.

Backpropagation [Rumelhart, Hinton, Williams. Nature 1986]
Chain rule recap

I hope everyone remembers the chain rule:
$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial x}$

Forward propagation: $x \rightarrow h \rightarrow L$
Backward propagation: $\frac{\partial L}{\partial h} \rightarrow \frac{\partial L}{\partial x}$

(This extends easily to multi-dimensional inputs and outputs.)
[Six image-only slides: worked backprop example. Slides from Karpathy 2016.]
Gradients add at branches
- When an activation feeds more than one branch in the forward pass, the gradients arriving from those branches add (+) during the backward pass.

Gradients copy through sums
- When activations are combined with a sum (+) in the forward pass, the gradient is copied to both branches during the backward pass: it flows through both branches at "full strength".

Symmetry between forward and backward
- Forward: copy, Backward: add. Forward: add, Backward: copy.
Forward Propagation:
$x \rightarrow h^{(1)} \rightarrow \cdots \rightarrow s \rightarrow L$, where each function has parameters $\theta^{(1)}, \ldots, \theta^{(n)}$.

Backward Propagation:
Starting from the loss, compute $\frac{\partial L}{\partial s}$, then propagate backward through each function to get $\frac{\partial L}{\partial h^{(1)}}, \ldots, \frac{\partial L}{\partial x}$, collecting the parameter gradients $\frac{\partial L}{\partial \theta^{(n)}}, \ldots, \frac{\partial L}{\partial \theta^{(1)}}$ along the way.
What to do for each layer

Consider layer $n$, which takes input $h^{(n-1)}$, has parameters $\theta^{(n)}$, and produces output $h^{(n)}$ (which feeds layer $n+1$).

This is what we want for each layer: $\frac{\partial L}{\partial \theta^{(n)}}$. To compute it, we also need to propagate the gradient from $\frac{\partial L}{\partial h^{(n)}}$ to $\frac{\partial L}{\partial h^{(n-1)}}$.

For each layer:
$\frac{\partial L}{\partial \theta^{(n)}} = \frac{\partial L}{\partial h^{(n)}}\,\frac{\partial h^{(n)}}{\partial \theta^{(n)}}$   (what we want)
$\frac{\partial L}{\partial h^{(n-1)}} = \frac{\partial L}{\partial h^{(n)}}\,\frac{\partial h^{(n)}}{\partial h^{(n-1)}}$   (what we pass on to layer $n-1$)

Here $\frac{\partial L}{\partial h^{(n)}}$ is given to us (it arrives from layer $n+1$ during backprop), and $\frac{\partial h^{(n)}}{\partial \theta^{(n)}}$ and $\frac{\partial h^{(n)}}{\partial h^{(n-1)}}$ are just the local gradients of layer $n$.
Summary

For each layer, we compute:
Propagated gradient to the left = Propagated gradient from the right x Local gradient
- Local gradient: can compute immediately.
- Propagated gradient from the right: received during backprop.
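To make this per-layer contract concrete, here is a minimal sketch of a toy layer in NumPy (not from the slides; the class and attribute names are assumptions). Its backward method multiplies the gradient received from the right by the local gradient to get the gradient passed to the left, and also accumulates the parameter gradient:

```python
import numpy as np

class ScaleLayer:
    """Toy layer computing h = a * x, to illustrate the forward/backward contract."""
    def __init__(self, a):
        self.a = a

    def forward(self, x):
        self.x = x                     # cache the input: needed for the local gradients
        return self.a * x

    def backward(self, grad_h):
        # grad_h is dL/dh, the propagated gradient received from the right.
        self.grad_a = np.sum(grad_h * self.x)   # dL/da = sum_i dL/dh_i * x_i  (parameter gradient)
        return grad_h * self.a                  # dL/dx = dL/dh * a  (propagated gradient to the left)
```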
30s cat picture break
Backprop in N dimensions: just add more subscripts and more summations.

$x, h$ scalars ($L$ is always a scalar):
$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial h}\,\frac{\partial h}{\partial x}$

$x, h$ 1D arrays (vectors):
$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial h_i}\,\frac{\partial h_i}{\partial x_j}$

$x, h$ 2D arrays:
$\frac{\partial L}{\partial x_{ab}} = \sum_i \sum_j \frac{\partial L}{\partial h_{ij}}\,\frac{\partial h_{ij}}{\partial x_{ab}}$

$x, h$ 3D arrays:
$\frac{\partial L}{\partial x_{abc}} = \sum_i \sum_j \sum_k \frac{\partial L}{\partial h_{ijk}}\,\frac{\partial h_{ijk}}{\partial x_{abc}}$
Examples
Example: Mean Subtraction (for a single input)

Example layer: mean subtraction:
$h_i = x_i - \frac{1}{D} \sum_k x_k$   (here, "i" and "k" index channels)

Always start with the chain rule (this one is for 1D):
$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial h_i}\,\frac{\partial h_i}{\partial x_j}$

Note: be very careful with your subscripts! Introduce new variables and don't re-use letters.

Taking the derivative of the layer (the local gradient):
$\frac{\partial h_i}{\partial x_j} = \delta_{ij} - \frac{1}{D}$,   where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise.

Substituting into the chain rule (backprop, aka the chain rule):
$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial h_i}\left(\delta_{ij} - \frac{1}{D}\right) = \sum_i \frac{\partial L}{\partial h_i}\,\delta_{ij} - \frac{1}{D}\sum_i \frac{\partial L}{\partial h_i} = \frac{\partial L}{\partial h_j} - \frac{1}{D}\sum_i \frac{\partial L}{\partial h_i}$

Done!
Example: Mean Subtraction (for a single input)

Forward:  $h_i = x_i - \frac{1}{D}\sum_k x_k$
Backward: $\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial h_i} - \frac{1}{D}\sum_k \frac{\partial L}{\partial h_k}$

- In this case, they're identical operations!
- Usually the forward pass and the backward pass are similar, but not the same.
- Derive it by hand, and check it numerically.
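For the "check it numerically" step, here is a generic finite-difference gradient checker (a sketch, not from the slides; the function and variable names are assumptions):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of dL/dx for a scalar-valued loss f(x)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old                      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Sanity check on a function whose gradient we know: f(x) = sum(x^2), gradient 2x.
x = np.random.randn(5)
assert np.allclose(numerical_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-4)
```

Compare its output to your hand-derived backward pass on a small random input; the two should agree to several decimal places.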
Example: Mean Subtraction (for a single input)

Let's code this up in NumPy. A direct translation of the forward pass runs into a dimension mismatch; you need to broadcast properly (and there is more than one way to write it that works). The backward pass is easy, and here it happens to look just like the forward pass (remember: they're usually not the same).
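A minimal NumPy sketch of the layer (the slides' actual code is not reproduced in this transcript, so the variable names and the assumption that x holds a batch of shape (N, D) are mine):

```python
import numpy as np

def mean_sub_forward(x):
    # x: shape (N, D). Naively writing  x - np.mean(x, axis=1)  raises a dimension
    # mismatch, because the (N,) mean cannot broadcast against the (N, D) input.
    # keepdims=True keeps the mean as shape (N, 1) so the subtraction broadcasts properly.
    return x - np.mean(x, axis=1, keepdims=True)
    # This also works:  x - np.mean(x, axis=1)[:, np.newaxis]

def mean_sub_backward(grad_h):
    # dL/dx_i = dL/dh_i - (1/D) * sum_k dL/dh_k : identical in form to the forward pass.
    return grad_h - np.mean(grad_h, axis=1, keepdims=True)
```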
Example: Euclidean Loss

Euclidean loss layer: takes predictions $z$ and targets $y$ and produces the loss
$L_i = \frac{1}{2} \sum_j (z_{i,j} - y_{i,j})^2$   ("i" is the batch index, "j" is the channel)

The total loss is the average over N examples:
$L = \frac{1}{N} \sum_i L_i$

- Used for regression, e.g. predicting an adjustment to box coordinates when detecting objects (bounding-box regression from the R-CNN object detector [Girshick 2014]).
- Note: it can be unstable, and other losses often work better. Alternatives: L1 distance (instead of L2), or discretizing into category bins and using softmax.
Example: Euclidean Loss

Forward (per example):  $L_i = \frac{1}{2} \sum_j (z_{i,j} - y_{i,j})^2$
Backward: $\frac{\partial L_i}{\partial z_{i,j}} = z_{i,j} - y_{i,j}$,   $\frac{\partial L_i}{\partial y_{i,j}} = y_{i,j} - z_{i,j}$
(note that these are with respect to $L_i$, not $L$)

Q: If you scale the loss by C, what happens to the gradient computed in the backward pass?

Forward pass, for a batch of N inputs:
$L = \frac{1}{N} \sum_i L_i$,   $L_i = \frac{1}{2} \sum_j (z_{i,j} - y_{i,j})^2$

Backward pass:
$\frac{\partial L}{\partial z_{i,j}} = \frac{z_{i,j} - y_{i,j}}{N}$,   $\frac{\partial L}{\partial y_{i,j}} = \frac{y_{i,j} - z_{i,j}}{N}$
(You should be able to derive this.)
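A minimal NumPy sketch of this forward/backward pair (names are assumptions; the slides' own code is not shown here):

```python
import numpy as np

def euclidean_loss(z, y):
    """z: predictions, y: targets, both of shape (N, D). Returns (L, dL/dz, dL/dy)."""
    N = z.shape[0]
    diff = z - y
    loss = 0.5 * np.sum(diff ** 2) / N    # L = (1/N) * sum_i (1/2) * sum_j (z_ij - y_ij)^2
    grad_z = diff / N                     # dL/dz_ij = (z_ij - y_ij) / N
    grad_y = -diff / N                    # dL/dy_ij = (y_ij - z_ij) / N
    return loss, grad_z, grad_y
```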
Example: Softmax (for N inputs)

Remember softmax? It's a loss function for predicting categories.

(input) $x_i \rightarrow$ (scores) $s_i \rightarrow$ Softmax $\rightarrow$ (probabilities) $p_i \rightarrow$ Cross-Entropy with ground-truth labels $y_i \rightarrow$ (loss) $L_i$
(here, "i" indexes the different examples)

Softmax:  $p_{i,j} = \dfrac{e^{s_{i,j}}}{\sum_k e^{s_{i,k}}}$
Cross-entropy:  $L_i = -\log p_{i, y_i}$
Average over examples:  $L = \frac{1}{N} \sum_i L_i$

Derivative:  $\frac{\partial L}{\partial s_{i,j}} = \frac{p_{i,j} - t_{i,j}}{N}$,  where $t_i$ is the one-hot vector with entry $y_i$ set to 1.
(You will derive this in PA5.)

Now we can continue backpropagating into the layers that produced the scores.
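A minimal NumPy sketch of softmax cross-entropy with its gradient (names are my own; the max-subtraction trick for numerical stability is standard practice but not stated on the slides):

```python
import numpy as np

def softmax_cross_entropy(s, y):
    """s: scores of shape (N, C); y: integer ground-truth labels of shape (N,).
    Returns (loss, dL/ds)."""
    N = s.shape[0]
    shifted = s - s.max(axis=1, keepdims=True)      # softmax is shift-invariant; this avoids overflow
    exp_s = np.exp(shifted)
    p = exp_s / exp_s.sum(axis=1, keepdims=True)    # p_ij = e^{s_ij} / sum_k e^{s_ik}
    loss = -np.mean(np.log(p[np.arange(N), y]))     # L = (1/N) * sum_i -log p_{i, y_i}
    t = np.zeros_like(p)
    t[np.arange(N), y] = 1.0                        # one-hot targets
    grad_s = (p - t) / N                            # dL/ds_ij = (p_ij - t_ij) / N
    return loss, grad_s
```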
What about the weights?

To get the derivative of the weights, use the chain rule again!

Example: 2D weights $W$, 1D bias $b$, 1D hidden activations, with the layer computing $h = h(x; W, b)$:

$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial h_k}\,\frac{\partial h_k}{\partial W_{ij}}$,   $\frac{\partial L}{\partial b_i} = \sum_k \frac{\partial L}{\partial h_k}\,\frac{\partial h_k}{\partial b_i}$

(The number of subscripts and summations changes depending on your layer and parameter sizes.)
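As a concrete instance of these sums, here is a minimal sketch of the gradients for a fully-connected layer h = Wx + b on a single 1D input (the layer form and names are assumptions used for illustration):

```python
import numpy as np

def linear_forward(x, W, b):
    # x: (D_in,), W: (D_out, D_in), b: (D_out,)  ->  h: (D_out,)
    return W @ x + b

def linear_backward(grad_h, x, W):
    # grad_h = dL/dh, shape (D_out,)
    grad_W = np.outer(grad_h, x)    # dL/dW_ij = dL/dh_i * x_j   (since dh_k/dW_ij = delta_ki * x_j)
    grad_b = grad_h.copy()          # dL/db_i = dL/dh_i          (since dh_k/db_i = delta_ki)
    grad_x = W.T @ grad_h           # propagated gradient for the previous layer
    return grad_W, grad_b, grad_x
```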
ConvNets
They're just neural networks with 3D activations and weight sharing.

What shape should the activations have?
- The input is an image, which is 3D (RGB channels, height, width).
- We could flatten it to a 1D vector, but then we lose structure.
- What about keeping everything in 3D?
3D Activations
- The activations x, h1, h2, ... were 1D vectors; now we make them 3D arrays. (Figures: Andrej Karpathy)
- For example, a CIFAR-10 image is a 3x32x32 volume (depth 3, the RGB channels; height 32; width 32).
- With a 3x32x32 input, a single output neuron depends on a 3x5x5 chunk of the input, and it has a 3x5x5 set of weights and a bias (scalar).
3D Activations
- Example: consider a region $x^r$ of the input, with output neuron $h^r$. (Figure: Andrej Karpathy)
- Then the output is $h^r = \sum_{ijk} x^r_{ijk} W_{ijk} + b$   (a sum over all 3 axes).
- With 2 output neurons:
  $h^r_1 = \sum_{ijk} x^r_{ijk} W_{1,ijk} + b_1$
  $h^r_2 = \sum_{ijk} x^r_{ijk} W_{2,ijk} + b_2$
3D Activations
- We can keep adding more outputs. These form a column in the output volume: [depth x 1 x 1].
- Each neuron has its own 3D filter and its own (scalar) bias.
- Now repeat this across the input: D sets of weights (also called filters).
- Weight sharing: each filter uses the same weights at every spatial position (but each depth index has its own set of weights).
- With weight sharing, this is called convolution. Without weight sharing, it is called a locally connected layer.
(Figures: Andrej Karpathy)
3D Activations
- One set of weights gives one slice in the output (the output of one filter).
- To get a 3D output of depth D, use D different filters.
- In practice, ConvNets use many filters (~64 to 1024).
- All together, the weights are 4-dimensional: (output depth, input depth, kernel height, kernel width).
3D Activations
Let's code this up in NumPy. For the nth example and the first filter, one output position is computed by taking the corresponding input region across all input channels, multiplying it elementwise with that filter's weights (over all channels), summing, and adding the filter's bias; then we repeat over all output positions and all filters.
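A minimal sketch of that loop for a single example (the shapes, the stride-1/no-padding choice, and the names are my assumptions; the slides' actual code is not reproduced here):

```python
import numpy as np

def conv_forward_naive(x, W, b):
    """x: one input example, shape (C_in, H, W_img).
    W: filters, shape (C_out, C_in, kH, kW).  b: biases, shape (C_out,).
    Stride 1, no padding."""
    C_in, H, W_img = x.shape
    C_out, _, kH, kW = W.shape
    H_out, W_out = H - kH + 1, W_img - kW + 1
    h = np.zeros((C_out, H_out, W_out))
    for f in range(C_out):                         # each filter produces one output slice
        for i in range(H_out):                     # every output position...
            for j in range(W_out):
                region = x[:, i:i + kH, j:j + kW]  # input region, across all input channels
                h[f, i, j] = np.sum(region * W[f]) + b[f]   # sum over all 3 axes, add the bias
    return h
```

Real implementations vectorize this (e.g. with im2col), but the triple loop makes the weight sharing explicit: the same W[f] is applied at every spatial position.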
3D Activations
We can unravel the 3D cube and show each layer separately: (Input) (32 filters, each 3x5x5). Figure: Andrej Karpathy
Questions?