Lecture 25: Backprop and convnets

1 Lecture 25: Backprop and convnets
CS5670: Computer Vision, Noah Snavely. Image credit: Aphex34 (CC BY-SA 4.0). Slides from Andrej Karpathy and Fei-Fei Li.

2 Review: Setup
[Diagram: x → Function(θ(1)) → h(1) → Function(θ(2)) → h(2) → … → s → L, with the label y feeding the loss]
Goal: find a value for the parameters (θ(1), θ(2), …) so that the loss L is small.

3 Review: Setup
Toy example: the first layer computes h(1) = W(1)x + b(1), with parameters W(1) and b(1).

4–7 Review: Setup
Toy example: plot the loss L as a function of a single weight W(1)_12, a weight somewhere in the network.

8–11 Review: Setup
Toy example: the slope ∂L/∂W(1)_12 of the loss with respect to that weight is the gradient; we use the gradient to take a step that reduces the loss.
How do we get the gradient? Backpropagation.

12 Backprop It’s just the chain rule

13 Backpropagation [Rumelhart, Hinton, Williams. Nature 1986]

14 Chain rule recap
I hope everyone remembers the chain rule: ∂L/∂x = (∂L/∂h)(∂h/∂x)

15–16 Chain rule recap
I hope everyone remembers the chain rule: ∂L/∂x = (∂L/∂h)(∂h/∂x)
Forward propagation: x → h → L.  Backward propagation: ∂L/∂h → ∂L/∂x.
(extends easily to multi-dimensional x and y)
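As a concrete illustration of this recap (my own sketch in plain Python, not part of the slides), here is the chain rule on a tiny scalar chain, checked against a finite difference:

# Toy chain (assumed example): x -> h = w * x -> L = (h - y)**2, all scalars.
x, w, y = 2.0, 3.0, 10.0

# Forward propagation
h = w * x
L = (h - y) ** 2

# Backward propagation via the chain rule: dL/dx = (dL/dh) * (dh/dx)
dL_dh = 2.0 * (h - y)   # local gradient of the loss
dh_dx = w               # local gradient of the layer h = w * x
dL_dx = dL_dh * dh_dx

# Finite-difference check of dL/dx
eps = 1e-6
L_eps = (w * (x + eps) - y) ** 2
print(dL_dx, (L_eps - L) / eps)   # both approximately -24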

17–22 [Worked backprop example on a small expression; slides from Karpathy 2016]

23–25 Gradients add at branches
[Figure: an activation feeds two downstream branches; during backprop, the gradients arriving from the branches are added (+)]

26–29 Gradients copy through sums
[Figure: two activations are combined by a sum (+) node; during backprop, the incoming gradient is copied to both inputs]
The gradient flows through both branches at "full strength".

30 Symmetry between forward and backward
Branch point: forward copies the value, backward adds the gradients.
Sum (+): forward adds the values, backward copies the gradient.
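A small sketch of these two rules (my own example, not from the slides): a value used by two branches accumulates their gradients during backprop, while a sum node copies its incoming gradient to both inputs unchanged.

x = 2.0

# Forward: x is copied into two branches, and the branch outputs are summed.
a = 3.0 * x        # branch 1
b = x ** 2         # branch 2
out = a + b        # sum node

# Backward, starting from d(out)/d(out) = 1:
d_out = 1.0
d_a = d_out                      # sum node: copy the gradient...
d_b = d_out                      # ...to both of its inputs
d_x = 3.0 * d_a + 2.0 * x * d_b  # branch point: gradients from both uses add
print(d_x)                       # 3 + 4 = 7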

31 Forward Propagation: x → Function(θ(1)) → h(1) → … → Function(θ(n)) → s → L

32–37 Forward and Backward Propagation
Forward Propagation: x → Function(θ(1)) → h(1) → … → Function(θ(n)) → s → L
Backward Propagation (built up step by step): start from the loss L; compute ∂L/∂s; the last function uses ∂L/∂s to produce ∂L/∂θ(n) and ∂L/∂h(1); the first function uses ∂L/∂h(1) to produce ∂L/∂θ(1) and ∂L/∂x.

38 What to do for each layer

39–48 What to do for each layer
[Diagram: Layer n receives ∂L/∂h(n) from Layer n+1 and must pass ∂L/∂h(n−1) to its left]
This is what we want for each layer: ∂L/∂θ(n). To compute it, we need the gradient ∂L/∂h(n) propagated back from layer n+1.
For each layer:
∂L/∂θ(n) = (∂L/∂h(n)) · (∂h(n)/∂θ(n))   ← what we want
∂L/∂h(n−1) = (∂L/∂h(n)) · (∂h(n)/∂h(n−1))   ← what gets propagated to the previous layer
Here ∂L/∂h(n) is given to us (received during backprop), and ∂h(n)/∂θ(n), ∂h(n)/∂h(n−1) are just the local gradients of layer n.

49–51 Summary
For each layer, we compute:
Propagated gradient to the left = Propagated gradient from the right × Local gradient
The propagated gradient from the right is received during backprop; the local gradient can be computed immediately.
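A sketch of the "check it numerically" habit the lecture recommends later (my own helper with assumed names, not from the slides), using a centered finite difference to verify a hand-derived gradient:

import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Centered finite-difference estimate of dL/dx for a scalar-valued f(x).
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps; f_plus = f(x)
        x[idx] = old - eps; f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Example: the analytic gradient of L = sum(x**2) is 2*x.
x = np.random.randn(3, 4)
print(np.max(np.abs(2 * x - numerical_grad(lambda v: np.sum(v ** 2), x))))  # tiny, ~1e-9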

52 30s cat picture break

53–57 Backprop in N-dimensions
Just add more subscripts and more summations (L is always a scalar):
x, h scalars:  ∂L/∂x = (∂L/∂h)(∂h/∂x)
x, h 1D arrays (vectors):  ∂L/∂x_j = Σ_i (∂L/∂h_i)(∂h_i/∂x_j)
x, h 2D arrays:  ∂L/∂x_ab = Σ_i Σ_j (∂L/∂h_ij)(∂h_ij/∂x_ab)
x, h 3D arrays:  ∂L/∂x_abc = Σ_i Σ_j Σ_k (∂L/∂h_ijk)(∂h_ijk/∂x_abc)

58 Examples

59–64 Example: Mean Subtraction (for a single input)
Example layer, mean subtraction: h_i = x_i − (1/D) Σ_k x_k   (here, "i" and "k" are channels)
Always start with the chain rule (this one is for 1D): ∂L/∂x_j = Σ_i (∂L/∂h_i)(∂h_i/∂x_j)
Note: be very careful with your subscripts! Introduce new variables and don't re-use letters.

65–74 Example: Mean Subtraction (for a single input)
Forward: h_i = x_i − (1/D) Σ_k x_k
Local gradient of the layer: ∂h_i/∂x_j = δ_ij − 1/D, where δ_ij = 1 if i = j and 0 otherwise
Backprop (aka the chain rule):
∂L/∂x_j = Σ_i (∂L/∂h_i)(∂h_i/∂x_j)
        = Σ_i (∂L/∂h_i)(δ_ij − 1/D)
        = Σ_i (∂L/∂h_i) δ_ij − (1/D) Σ_i (∂L/∂h_i)
        = ∂L/∂h_j − (1/D) Σ_i (∂L/∂h_i)
Done!

75–79 Example: Mean Subtraction (for a single input)
Forward:  h_i = x_i − (1/D) Σ_k x_k
Backward: ∂L/∂x_i = ∂L/∂h_i − (1/D) Σ_k (∂L/∂h_k)
In this case, they're identical operations!
Usually the forward pass and backward pass are similar but not the same.
Derive it by hand, and check it numerically.

80–84 Example: Mean Subtraction (for a single input)
Forward: h_i = x_i − (1/D) Σ_k x_k
Let's code this up in NumPy. [The slide code is not in the transcript: a naive version hits a dimension mismatch, so you need to broadcast properly; an alternative broadcasting form also works.]
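Here is a sketch of what the forward pass presumably looks like (my reconstruction, not the original slide code), assuming a batch x of shape (N, D) with the mean taken over the D channels. The commented-out line shows the dimension mismatch; keepdims (or an explicit new axis) fixes the broadcasting.

import numpy as np

x = np.random.randn(4, 10)                # assumed shape: (N examples, D channels)

# Naive attempt: x.mean(axis=1) has shape (N,), which does not broadcast
# against (N, D), so this raises a dimension-mismatch error:
# h = x - x.mean(axis=1)

# Broadcast properly by keeping the reduced axis:
h = x - x.mean(axis=1, keepdims=True)

# This also works: insert the missing axis explicitly.
h_alt = x - x.mean(axis=1)[:, np.newaxis]
assert np.allclose(h, h_alt)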

85 Example: Mean Subtraction (for a single input)
The backward pass is easy: it has exactly the same form, applied to the incoming gradient. (Remember, the two passes are usually not the same.)
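And a matching sketch of the backward pass (again my reconstruction): given the upstream gradient dL/dh with the same shape as h, the operation is identical.

import numpy as np

dh = np.random.randn(4, 10)                   # upstream gradient dL/dh, shape (N, D)
dx = dh - dh.mean(axis=1, keepdims=True)      # dL/dx: same operation as the forward pass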

86–92 Example: Euclidean Loss
Euclidean loss layer: inputs z and y go into the Euclidean loss, which outputs L.
L_i = (1/2) Σ_j (z_ij − y_ij)²   ("i" is the batch index, "j" is the channel)
The total loss is the average over N examples: L = (1/N) Σ_i L_i

93–96 Example: Euclidean Loss
Used for regression, e.g. predicting an adjustment to box coordinates when detecting objects (bounding-box regression from the R-CNN object detector [Girshick 2014]).
Note: it can be unstable, and other losses often work better. Alternatives: L1 distance (instead of L2), or discretizing into category bins and using softmax.

97–103 Example: Euclidean Loss
Forward:  L_i = (1/2) Σ_j (z_ij − y_ij)²
Backward: ∂L_i/∂z_ij = z_ij − y_ij
          ∂L_i/∂y_ij = y_ij − z_ij
(note that these are with respect to L_i, not L)
Q: If you scale the loss by C, what happens to the gradient computed in the backward pass?

104–109 Example: Euclidean Loss
Forward pass, for a batch of N inputs:
L_i = (1/2) Σ_j (z_ij − y_ij)²,   L = (1/N) Σ_i L_i
Backward pass:
∂L/∂z_ij = (z_ij − y_ij) / N,   ∂L/∂y_ij = (y_ij − z_ij) / N
(You should be able to derive this.)
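A brief NumPy sketch of this forward/backward pair (my own, assuming z and y are arrays of shape (N, D)):

import numpy as np

def euclidean_loss(z, y):
    # z: predictions, y: targets, both of shape (N, D)
    N = z.shape[0]
    diff = z - y
    L = 0.5 * np.sum(diff ** 2) / N   # forward: average of the per-example losses
    dz = diff / N                     # backward: dL/dz
    dy = -diff / N                    # backward: dL/dy
    return L, dz, dy

z, y = np.random.randn(8, 4), np.random.randn(8, 4)
L, dz, dy = euclidean_loss(z, y)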

110–116 Example: Softmax (for N inputs)
Remember softmax? It's a loss function for predicting categories.
Pipeline: x_i (input) → s_i (scores) → Softmax → p_i (probabilities) → Cross-Entropy with the ground-truth labels y_i → L_i (loss). (Here, "i" indexes different examples.)
p_ij = e^{s_ij} / Σ_k e^{s_ik}   (Softmax)
L_i = −log p_{i,y_i}   (Cross-entropy)
L = (1/N) Σ_i L_i   (Average over examples)

117–122 Example: Softmax (for N inputs)
Derivative with respect to the scores:
∂L/∂s_ij = (p_ij − t_ij) / N,   where t_i is the one-hot vector with entry y_i set to 1
(You will derive this in PA5.)
Now we can continue backpropagating to the layer before "f".
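A NumPy sketch of these formulas (my own, assuming scores s of shape (N, C) and integer labels y of shape (N,)); subtracting the per-row max is the usual numerical-stability trick and does not change the result.

import numpy as np

def softmax_cross_entropy(s, y):
    # s: scores of shape (N, C); y: integer class labels of shape (N,)
    N = s.shape[0]
    shifted = s - s.max(axis=1, keepdims=True)        # numerical stability
    exp = np.exp(shifted)
    p = exp / exp.sum(axis=1, keepdims=True)          # softmax probabilities
    L = -np.mean(np.log(p[np.arange(N), y]))          # average cross-entropy
    t = np.zeros_like(p)
    t[np.arange(N), y] = 1.0                          # one-hot targets
    ds = (p - t) / N                                  # dL/ds
    return L, ds

s = np.random.randn(5, 3)
y = np.array([0, 2, 1, 1, 0])
L, ds = softmax_cross_entropy(s, y)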

123–128 What about the weights?
To get the derivative of the weights, use the chain rule again!
Example: 2D weights, 1D bias, 1D hidden activations. A layer with parameters W, b maps x to h = h(x; W):
∂L/∂W_ij = Σ_k (∂L/∂h_k)(∂h_k/∂W_ij)
∂L/∂b_i = Σ_k (∂L/∂h_k)(∂h_k/∂b_i)
(The number of subscripts and summations changes depending on your layer and parameter sizes.)
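For the concrete case of a fully connected layer h = Wx + b (my choice of example; the slides leave the layer generic), those sums reduce to the familiar matrix forms below, shown for a batch x of shape (N, D_in):

import numpy as np

N, D_in, D_out = 6, 4, 3
x = np.random.randn(N, D_in)
W = np.random.randn(D_in, D_out)
b = np.random.randn(D_out)

# Forward
h = x @ W + b                    # shape (N, D_out)

# Backward, given the upstream gradient dL/dh of shape (N, D_out):
dh = np.random.randn(N, D_out)
dW = x.T @ dh                    # dL/dW: the chain-rule sum over hidden units and the batch
db = dh.sum(axis=0)              # dL/db
dx = dh @ W.T                    # dL/dx, passed back to the previous layer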

129 ConvNets They’re just neural networks with 3D activations and weight sharing

130–132 What shape should the activations have?
The input is an image, which is 3D (RGB channels, height, width).
We could flatten it to a 1D vector, but then we lose structure.
What about keeping everything in 3D?

133–138 3D Activations
[Figures: Andrej Karpathy. A network drawn with 1D activations (vectors) next to one drawn with 3D activations (volumes) x, h1, h2.]
For example, a CIFAR-10 image is a 3x32x32 volume (3 depth: RGB channels, 32 height, 32 width).

139–146 3D Activations
The input is 3x32x32. Each neuron in the next layer depends on a 3x5x5 chunk of the input, and has its own 3x5x5 set of weights and a bias (scalar).
Example: consider a region x^r of the input with output neuron h^r. Then the output is
h^r = Σ_ijk x^r_ijk W_ijk + b   (sum over 3 axes)
With 2 output neurons:
h^r_1 = Σ_ijk x^r_ijk W_1ijk + b_1
h^r_2 = Σ_ijk x^r_ijk W_2ijk + b_2
Figure: Andrej Karpathy

147–154 3D Activations
We can keep adding more outputs; these form a column in the output volume: [depth x 1 x 1]. Each neuron has its own 3D filter and its own (scalar) bias.
Now repeat this across the input, using D sets of weights (also called filters).
Weight sharing: each filter uses the same weights at every position (but each depth index has its own set of weights).
With weight sharing, this is called convolution. Without weight sharing, it is called a locally connected layer.
Figure: Andrej Karpathy

155–156 3D Activations
One set of weights gives one slice in the output (the output of one filter). To get a 3D output of depth D, use D different filters. In practice, ConvNets use many filters (~64 to 1024).
All together, the weights are 4-dimensional: (output depth, input depth, kernel height, kernel width).

157–170 3D Activations
Let's code this up in NumPy. [The slides step through the code, highlighting the nth example, the first filter, the nth output position, the input region x^r across all input channels, all positions, all channels, and the bias.]
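The slide code isn't reproduced in this transcript; below is a hedged reconstruction of the naive convolution loop it appears to describe (loop over examples, filters, and output positions; each output is a sum over a 3D input region across all input channels, plus a bias). The shapes, names, stride of 1, and lack of padding are my assumptions.

import numpy as np

def conv_forward_naive(x, w, b):
    # x: input,   shape (N, C, H, W)
    # w: filters, shape (F, C, KH, KW)   i.e. (output depth, input depth, kernel h, kernel w)
    # b: biases,  shape (F,)
    # returns h of shape (N, F, H-KH+1, W-KW+1); stride 1, no padding
    N, C, H, W = x.shape
    F, _, KH, KW = w.shape
    H_out, W_out = H - KH + 1, W - KW + 1
    h = np.zeros((N, F, H_out, W_out))
    for n in range(N):                  # nth example
        for f in range(F):              # one filter at a time
            for i in range(H_out):      # output position (row)
                for j in range(W_out):  # output position (col)
                    region = x[n, :, i:i+KH, j:j+KW]           # input region, all channels
                    h[n, f, i, j] = np.sum(region * w[f]) + b[f]
    return h

x = np.random.randn(2, 3, 8, 8)
w = np.random.randn(4, 3, 5, 5)
b = np.random.randn(4)
h = conv_forward_naive(x, w, b)   # shape (2, 4, 4, 4)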

171–174 3D Activations
We can unravel the 3D cube and show each layer separately: the input, and the output of 32 filters, each 3x5x5.
Figure: Andrej Karpathy

175 Questions?

