Deep learning: Recurrent Neural Networks CV192


1 Deep learning: Recurrent Neural Networks CV192
Lecturer: Oren Freifeld. TA: Ron Shapira Weber. Photo by Aphex34 / CC BY-SA 4.0

2 Contents
Review of previous lecture: CNN
Recurrent Neural Networks
Backpropagation through RNN
LSTM

3 Example – Object recognition and localization
[Karpathy, A., & Fei-Fei, L. (2015). Deep Visual-Semantic Alignments for Generating Image Descriptions]

4 MLP with one hidden layer
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
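As a concrete illustration of this slide, here is a minimal NumPy sketch of a one-hidden-layer MLP forward pass, $h = \sigma(Wx+b)$, $y = g(W'h+b')$; the layer sizes, random weights, and the sigmoid/softmax choices are assumptions made for the example, not taken from the lecture.

```python
# Minimal sketch (illustrative sizes and weights): MLP with one hidden layer.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed sizes: 4-dimensional input, 8 hidden units, 3 output classes.
W,  b  = rng.normal(size=(8, 4)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # hidden -> output

x = rng.normal(size=4)
h = sigmoid(W @ x + b)      # hidden layer activation
y = softmax(W2 @ h + b2)    # class probabilities
print(y, y.sum())           # probabilities sum to 1
```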

5 How big should our hidden layer be?
Taken from Stanford notes

6 Convolution
Discrete convolution: slide a small kernel over the input matrix (image) and, at each position, sum the element-wise products (input matrix * kernel).
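A small sketch of this operation, assuming a single-channel input, no padding, and stride 1 ("valid" mode); as in most CNN implementations, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
# Hedged sketch of a discrete 2-D "valid" convolution (no padding, stride 1).
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    # CNNs usually implement cross-correlation (no kernel flip), as done here.
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])   # simple vertical-edge filter
print(conv2d_valid(image, edge_kernel).shape)  # (3, 3)
```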

7 CNN – Convolutional Layer
Filters learned by AlexNet (2012): 96 convolution kernels of size 11x11x3 learned by the first convolutional layer on the 224x224x3 input images (ImageNet).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1–9.

8 CNN – Pooling Layer
Reduces the spatial size of the next layer.
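For illustration, here is a sketch of 2x2 max pooling with stride 2, one common way to reduce the spatial size; the slide does not fix a specific pooling type, so max pooling is an assumption.

```python
# Hedged sketch: 2x2 max pooling with stride 2 halves each spatial dimension.
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    h2, w2 = h // 2, w // 2
    # Reshape into non-overlapping 2x2 blocks and take the max over each block.
    return x[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))   # 2x2 output
```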

9 CNN – Fully connected layer
Usually the last layer before the output layer (e.g., softmax).

10 Feature Visualization
Visualizing via optimization of the input.
Olah, et al., "Feature Visualization", Distill, 2017.

11 Recurrent Neural Networks

12 Sequence modeling
Examples:
Image classification
Image captioning
Sentiment analysis
Machine translation
Video frame-by-frame classification
[http://karpathy.github.io/2015/05/21/rnn-effectiveness/]

13 DRAW: A recurrent neural network for image generation
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint.

14 Recurrent Neural Networks vs Feedforward Networks
Feedforward network: $h = \sigma(Wx + b)$, $y = g(W'h + b')$; each input is mapped to an output independently, and the loss is computed on that single output.
Recurrent network: $h_n = \sigma(W[x_n, h_{n-1}] + b)$, $y_n = g(W'h_n + b')$; the hidden state $h_n$ depends on the previous hidden state $h_{n-1}$, so past inputs influence the current output and its loss.
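To make the contrast concrete, here is a hedged NumPy sketch of the two update rules above; the dimensions, random weights, and two-dimensional output are illustrative assumptions, not values from the slides.

```python
# Sketch: feedforward layer vs. RNN step, following
# h_n = sigmoid(W [x_n, h_{n-1}] + b), y = g(W' h_n + b').
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d_x, d_h = 3, 5                          # illustrative sizes
W  = rng.normal(size=(d_h, d_x + d_h))   # acts on the concatenation [x, h_prev]
b  = np.zeros(d_h)
W2 = rng.normal(size=(2, d_h))           # hidden -> output
b2 = np.zeros(2)

def feedforward(x):
    # Feedforward: no dependence on previous inputs.
    h = sigmoid(W[:, :d_x] @ x + b)
    return W2 @ h + b2

def rnn(xs):
    # RNN: the hidden state h carries information across time steps.
    h = np.zeros(d_h)
    ys = []
    for x in xs:
        h = sigmoid(W @ np.concatenate([x, h]) + b)
        ys.append(W2 @ h + b2)
    return ys

xs = [rng.normal(size=d_x) for _ in range(4)]
print(len(rnn(xs)))  # one output (and potentially one loss) per time step
```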

15 Recurrent Neural Networks
The same cell is unrolled over time: $h_n = \sigma(W[x_n, h_{n-1}] + b)$ and $y_n = g(W'h_n + b')$, with input-to-hidden weights $w_{xh}$ and hidden-to-hidden weights $w_{hh}$ shared across all time steps. Each output $y_n$ has its own loss $L_n$, and the gradient flows backward through the chain of hidden states $h_0, h_1, \dots, h_n$.

16 Recurrent Neural Networks
Backpropagation "through time": applying the chain rule,
$\frac{\partial\,\mathrm{loss}_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial\,\mathrm{loss}_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W}$
and
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$,
where $h_t = \sigma(W[x_t, h_{t-1}] + b)$ and $W = [w_{hh}, w_{xh}]$.
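Below is a minimal sketch of BPTT for such a vanilla RNN. To keep it short it assumes a squared-error loss on the last time step only, and the names W_hh, W_xh, W_hy are illustrative stand-ins for the slide's $w_{hh}$/$w_{xh}$ plus an output matrix; a finite-difference check confirms the unrolled chain rule.

```python
# Hedged sketch of backpropagation through time for a vanilla RNN with
# h_t = sigmoid(W_hh h_{t-1} + W_xh x_t + b) and a loss on the final step.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d_x, d_h, T = 2, 4, 5
W_hh = rng.normal(size=(d_h, d_h)) * 0.5
W_xh = rng.normal(size=(d_h, d_x)) * 0.5
b    = np.zeros(d_h)
W_hy = rng.normal(size=(1, d_h))
xs     = rng.normal(size=(T, d_x))
target = np.array([1.0])

def loss_and_grad(W_hh):
    # Forward pass, storing all hidden states.
    hs = [np.zeros(d_h)]
    for t in range(T):
        hs.append(sigmoid(W_hh @ hs[-1] + W_xh @ xs[t] + b))
    err  = W_hy @ hs[-1] - target
    loss = 0.5 * float(err @ err)
    # Backward pass: unroll the chain rule over time (the sum over k above).
    dW_hh = np.zeros_like(W_hh)
    dh = W_hy.T @ err                       # dL/dh_T
    for t in range(T, 0, -1):
        da = dh * hs[t] * (1 - hs[t])       # sigmoid derivative
        dW_hh += np.outer(da, hs[t - 1])    # step-t contribution to dL/dW_hh
        dh = W_hh.T @ da                    # propagate to h_{t-1}
    return loss, dW_hh

loss, dW = loss_and_grad(W_hh)

# Finite-difference check of one entry of dL/dW_hh.
eps = 1e-6
W_pert = W_hh.copy(); W_pert[0, 1] += eps
num = (loss_and_grad(W_pert)[0] - loss) / eps
print(dW[0, 1], num)   # should be close
```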

17 Recurrent Neural Networks
Truncated backpropagation "through time": break the backward pass every K steps instead of unrolling over the entire sequence. Forward propagation stays the same. Downside? Dependencies longer than K steps cannot contribute to the gradient.

18 Recurrent Neural Networks
Example: an RNN that predicts the next character. The corpus (vocabulary) is {h, e, l, o}; each character is "one-hot" encoded, e.g. h = [1, 0, 0, 0].
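A small sketch of this setup, just to show the one-hot encoding and the per-character prediction loop; the hidden size, the tanh nonlinearity, and the random (untrained) weights are assumptions, so the predictions are meaningless until training.

```python
# Sketch of the {h,e,l,o} character model: one-hot inputs, one RNN step per
# character, and unnormalized scores over the next character.
import numpy as np

vocab = ['h', 'e', 'l', 'o']
one_hot = {c: np.eye(len(vocab))[i] for i, c in enumerate(vocab)}
print(one_hot['h'])   # [1. 0. 0. 0.]

rng = np.random.default_rng(0)
d_h = 3                                   # illustrative hidden size
W_xh = rng.normal(size=(d_h, 4)); W_hh = rng.normal(size=(d_h, d_h))
W_hy = rng.normal(size=(4, d_h)); h = np.zeros(d_h)

for c in "hell":                          # feed "hell"; training would aim for "ello"
    h = np.tanh(W_xh @ one_hot[c] + W_hh @ h)
    scores = W_hy @ h                     # scores over {h, e, l, o}
    print(c, '->', vocab[int(np.argmax(scores))])
```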

19 Recall the vanishing gradient problem…
Sigmoid function: $\sigma(x) = \frac{1}{1+e^{-x}}$, with derivative $\frac{d\sigma}{dx} = \frac{e^{-x}}{(1+e^{-x})^{2}} = \sigma(x)(1-\sigma(x))$ and $\max_x \frac{d\sigma}{dx} = \frac{1}{4}$. Hence $\left(\frac{1}{4}\right)^{t} \to 0$ when $t \gg 1$.
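A quick numeric sanity check of these two facts (pure illustration, not part of the slides):

```python
# The sigmoid derivative peaks at 1/4, so a product of t such factors
# decays roughly like (1/4)^t.
import numpy as np

x = np.linspace(-10, 10, 10001)
sig = 1.0 / (1.0 + np.exp(-x))
dsig = sig * (1 - sig)                      # = e^{-x} / (1 + e^{-x})^2
print(dsig.max())                           # ~0.25, attained at x = 0
print(0.25 ** np.array([1, 5, 10, 20]))     # rapidly -> 0
```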

20 Gradient behavior in RNN
Consider the gradient w.r.t. $h_1$:
$\frac{\partial\,\mathrm{loss}_4}{\partial h_1} = \frac{\partial\,\mathrm{loss}_4}{\partial y_4}\,\frac{\partial y_4}{\partial h_4}\,\frac{\partial h_4}{\partial h_3}\,\frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial h_1}$,
where $h_t = \sigma(W[x_t, h_{t-1}] + b)$. Each factor $\frac{\partial h_j}{\partial h_{j-1}}$ involves the sigmoid derivative, bounded by $\frac{1}{4}$, so $\left(\frac{1}{4}\right)^{t} \to 0$ when $t \gg 1$. (The input $x$ is omitted from the diagram for simplicity.)

21 Vanishing / exploding gradient
Optimization becomes tricky when the computational graph becomes extremely deep. Suppose we need to repeatedly multiply by a weight matrix $W$ with eigendecomposition $W = V\,\mathrm{diag}(\lambda)\,V^{-1}$. Then
$W^{n} = \left(V\,\mathrm{diag}(\lambda)\,V^{-1}\right)^{n} = V\,\mathrm{diag}(\lambda)^{n}\,V^{-1}$.
Any eigenvalue $\lambda_i$ whose absolute value is not near 1 will either explode (if $|\lambda_i| > 1$) or vanish (if $|\lambda_i| < 1$).
[Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Vol. 1).]
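A tiny NumPy demonstration of this argument, with made-up eigenvalues 1.1 and 0.9:

```python
# W^n = V diag(lambda)^n V^{-1}: repeated multiplication amplifies
# eigen-directions with |lambda| > 1 and kills those with |lambda| < 1.
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(2, 2))
lam = np.array([1.1, 0.9])                  # one eigenvalue above 1, one below
W = V @ np.diag(lam) @ np.linalg.inv(V)

v = rng.normal(size=2)
for n in [1, 10, 50, 100]:
    print(n, np.linalg.norm(np.linalg.matrix_power(W, n) @ v))
# The norm is dominated by 1.1**n (explodes); with both eigenvalues below 1
# it would instead shrink toward zero (vanishes).
```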

22 Vanishing / exploding gradient
On a more intuitive note: when working with a sequence of length t, we effectively have t copies of our network. In the scalar case, we backpropagate the gradient through the weight w t times until we reach $h_0$: if $w > 1$ the gradient will explode, and if $w < 1$ it will vanish.

23 Long Short Term Memory (LSTM)
Long Short Term Memory networks (LSTMs) are designed to counter the vanishing gradient problem. They introduce a "cell state" $c_t$ that allows almost uninterrupted gradient flow through the network. The LSTM module is composed of four gates (layers) that interact with one another:
Input gate
Forget gate
Output gate
tanh gate (candidate cell state)
Specifically, the LSTM introduces a special set of units (LSTM cells) that are linear and have a self-recurrent connection with weight fixed to 1. The flow of information into and out of the unit is guarded by input and output gates, whose behaviour is learned.
[Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory]

24 LSTM vs regular RNN
The repeating module in a standard RNN contains a single layer; the repeating module in an LSTM contains four interacting layers.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8).

25 LSTM
The cell state $c_t$ holds information about previous states; it acts as a memory cell: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. The update of $c_t$ is additive in $w$ rather than repeatedly multiplicative, which counters the vanishing gradient problem.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8).

26 LSTM – Gradient Flow
$\frac{\partial\,\mathrm{loss}_t}{\partial c_1} = \frac{\partial\,\mathrm{loss}_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\,\frac{\partial h_t}{\partial c_t}\,\prod_{j=2}^{t} \frac{\partial c_j}{\partial c_{j-1}}$
Because each factor $\frac{\partial c_j}{\partial c_{j-1}}$ flows through the additive cell-state update, the product does not shrink the way the product of RNN hidden-state Jacobians does.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8).
Chen, G. (2016). A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation. arXiv preprint.

27 Input gate
$c_t^{*} = c_{t-1} + i_t \odot \tilde{c}_t$
The input gate layer $i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$ decides what information should go into the current state $c_t$, given the current input $(h_{t-1}, x_t)$. When $i_t = 0$ we ignore the current time step. The gate decides which hidden units contribute at each time step; for instance, some may be ON only when there is a noun or a verb.

28 tanh gate
$c_t^{*} = c_{t-1} + i_t \odot \tilde{c}_t$
The tanh layer creates the candidate values $\tilde{c}_t$ for the current state $c_t$. Together with the input gate, it determines which hidden units contribute at each time step; for instance, some may be ON only when there is a noun or a verb.

29 Forget Gate
$c_t^{*} = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
The forget gate layer $f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$ decides what information to discard from the previous state $c_{t-1}$, given the current input $(h_{t-1}, x_t)$. It scales the previous state with a sigmoid: when $f_t = 0$ we "forget" the previous state, and when $f_t = 1$ we keep it.

30 Output Gate
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $h_t = o_t \odot \tanh(c_t)$
Finally, we need to decide what to output. The output gate layer $o_t$ filters the cell state $c_t$: it decides what information goes "out" of the cell (into $h_t$) and what remains hidden from the rest of the network.

31 LSTM
$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$
$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$
$o_t = \sigma(w_o[h_{t-1}, x_t] + b_o)$
$\tilde{c}_t = \tanh(w_c[h_{t-1}, x_t] + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$
where $h_t$ is the next hidden state. The gradient w.r.t. $W$ is computed locally at every time step when computing the current cell state and hidden state; $W$ is decomposed into the different gates.
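Putting the equations together, here is a hedged NumPy sketch of a single LSTM step; the dimensions are arbitrary and the weights are random, so this only illustrates the data flow, not a trained model.

```python
# Sketch of one LSTM step implementing the slide's equations.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d_x, d_h = 3, 5
d_in = d_x + d_h
# One weight matrix and bias per gate, each acting on [h_{t-1}, x_t].
W_f, b_f = rng.normal(size=(d_h, d_in)), np.zeros(d_h)
W_i, b_i = rng.normal(size=(d_h, d_in)), np.zeros(d_h)
W_o, b_o = rng.normal(size=(d_h, d_in)), np.zeros(d_h)
W_c, b_c = rng.normal(size=(d_h, d_in)), np.zeros(d_h)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)                 # forget gate
    i = sigmoid(W_i @ z + b_i)                 # input gate
    o = sigmoid(W_o @ z + b_o)                 # output gate
    c_tilde = np.tanh(W_c @ z + b_c)           # candidate cell state
    c = f * c_prev + i * c_tilde               # new cell state (additive path)
    h = o * np.tanh(c)                         # new hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(4, d_x)):            # run over a short sequence
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```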

32 LSTM Summary
LSTM counters the vanishing gradient problem by introducing the memory cell $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, whose update is defined mostly by addition and element-wise multiplication. The gate system filters what information to keep from previous states and what information to add from the current state. The gradient w.r.t. $W$ is computed locally at every time step when computing the current cell state and hidden state; $W$ is decomposed into the different gates.

33 Increasing model memory
Increasing the size of the hidden layer increases the model size and computation quadratically, since consecutive hidden states ($h_n \to h_{n+1}$) are fully connected: the recurrent weight matrix grows with the square of the hidden size.

34 Increasing model memory
Adding more hidden layers increases the model capacity while computation scales linearly in the number of layers, and it increases representational power via additional non-linearities. Each added layer contributes its own linear transformation (matrix multiplication), as opposed to increasing the size of a single matrix.
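A back-of-the-envelope check of the scaling claims in the last two slides, counting only recurrent hidden-to-hidden weights (the sizes are illustrative):

```python
# Doubling the hidden size roughly quadruples the recurrent parameters,
# while stacking a second layer only adds one more d_h x d_h block.
def recurrent_params(d_h, n_layers=1):
    return n_layers * d_h * d_h

print(recurrent_params(512), recurrent_params(1024))   # ~4x more
print(recurrent_params(512, n_layers=2))               # only ~2x more
```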

35 Increasing model memory
Adding depth through "time": insert extra non-linear transformations between consecutive hidden states (e.g. $h_1 \to h_{12} \to h_2 \to h_{23} \to h_3$). This does not increase memory, but it increases representational power via additional non-linearities.

36 Neural Image Caption Generation with Visual Attention
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. CVPR 2015.
[Xu et al. (2015), Show, Attend and Tell: Neural Image Caption Generation with Visual Attention]

37 Neural Image Caption Generation with Visual Attention
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. CVPR 2015.
[Xu et al. (2015), Show, Attend and Tell: Neural Image Caption Generation with Visual Attention]

38 Neural Image Caption Generation with Visual Attention
"In order to obtain a correspondence between the feature vectors and portions of the 2-D image, we extract features from a lower convolutional layer, unlike previous work which instead used a fully connected layer."
[Xu et al. (2015), Show, Attend and Tell: Neural Image Caption Generation with Visual Attention]

39 Neural Image Caption Generation with Visual Attention
Word generation is conditioned on:
the context vector (CNN features),
the previous hidden state,
the previously generated word.
[Xu et al. (2015), Show, Attend and Tell: Neural Image Caption Generation with Visual Attention]

40 Shakespeare http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Char-rnn: a 3-layer RNN with 512 hidden nodes on each layer.

41 Visualizing the predictions and the “neuron” firings in the RNN
About 5% of them turn out to have learned quite interesting and interpretable algorithms.

42 Visualizing the predictions and the “neuron” firings in the RNN

43 Visualizing the predictions and the “neuron” firings in the RNN
About 5% of them turn out to have learned quite interesting and interpretable algorithms.

44 Used CNN image captioning with RNN trained on Reddit
What's the problem with this? Biased data!

45 [

46 [

47 Summary
Deep learning is a field of machine learning that uses artificial neural networks to learn representations in a hierarchical manner.
Convolutional neural networks achieved state-of-the-art results in computer vision by:
Reducing the number of parameters via shared weights and local connectivity.
Using a hierarchical model.
Recurrent neural networks are used for sequence modelling.
The vanishing gradient problem is addressed by a model called LSTM.

48 Thanks!

