1
RNN and LSTM Using MXNet Cyrus M Vahid, Principal Solutions Architect
May 2017
2
Sequences and Long-Term Dependency
The Spanish King officially abdicated in … of his …, Felipe. Felipe will be confirmed tomorrow as the new Spanish … .
3
Sequences and Long-Term Dependency
The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish King.
4
Recurrent Neural Networks
Applications: image labeling, image captioning, sentiment analysis, machine translation, video labeling.
5
Computational Graphs
[Diagram: two computational graphs — one computing z from x and y through a dot node, and one computing ŷ and the weight-decay penalty from x and w through dot, sqr, and sum nodes u(1), u(2), u(3), scaled by λ.]
Let x and y be matrices and z their dot product: $z = x \cdot y$. Similarly, $\hat{y} = x \cdot w$, and the weight-decay penalty is $\lambda \sum_i w_i^2$.
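These formulas translate directly into a couple of array operations. A minimal NumPy sketch (the shapes and the variable names x, y, w, lam are illustrative, not taken from the slides):

```python
import numpy as np

x = np.random.randn(4, 3)   # input matrix
y = np.random.randn(3, 2)   # second operand
w = np.random.randn(3, 2)   # weights

z = x.dot(y)                       # z = x . y
y_hat = x.dot(w)                   # prediction from the weights
lam = 0.01                         # weight-decay coefficient (lambda)
penalty = lam * np.sum(w ** 2)     # lambda * sum_i w_i^2
```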
6
Recurrent Neural Networks – Training
[Diagram: a recurrent network cell — the input x feeds the hidden state h through weights U, h feeds the output o through weights V, and h feeds back into itself through recurrent weights W; o is compared against the observed y to give the loss L. Legend: y: observed, L: loss, o: output, h: hidden, x: input.]
7
Recurrent Neural Networks – Training
[Diagram: the same network shown folded and unfolded — after unfolding, step t−1 receives x^(t−1) through U and the previous hidden state through W, and produces o^(t−1) through V, with loss L^(t−1) against y^(t−1). Same legend as before.]
8
Recurrent Neural Networks – Training
[Diagram: the unfolded network extended by one step — h^(t) receives x^(t) through U and h^(t−1) through W, and produces o^(t) through V, with loss L^(t) against y^(t).]
9
Recurrent Neural Networks – Training in Time
[Diagram: the network unrolled through time over steps t−1, t, t+1 — the same weights U, V, W are shared across every time step; each step produces an output o^(t) and a loss L^(t) against the observed y^(t).]
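The unrolled diagram corresponds to a simple forward recurrence over shared weights. A minimal NumPy sketch of the forward pass (the tanh nonlinearity, the bias terms, and the shapes are assumptions; the slides only name the weight matrices U, V, W):

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Run a vanilla RNN over a sequence xs of input vectors.

    h^(t) = tanh(U x^(t) + W h^(t-1) + b)
    o^(t) = V h^(t) + c
    """
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in xs:                       # the same U, W, V are reused at every step
        h = np.tanh(U.dot(x_t) + W.dot(h) + b)
        outputs.append(V.dot(h) + c)
    return outputs, h
```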
10
N→1 Recurrent Neural Networks – Summarizing
[Diagram: an RNN that reads the whole input sequence x^(t−1), x^(t), …, x^(τ) through U and W but produces a single output o^(τ) through V only at the final step τ, with loss L^(τ) against y^(τ).]
11
1→M Recurrent Neural Networks – Expanding
[Diagram: a single fixed-length vector x is fed (through weights R) into every step of an RNN unrolled over h^(t−1), h^(t), h^(t+1), producing o^(t−1), o^(t), o^(t+1) with losses against y^(t−1), y^(t), y^(t+1); weights U, V, W as before.]
1→M: x as a fixed-length vector is used as the input to a whole output sequence; this is used for use cases such as image captioning.
12
N→M Encoder-Decoder (seq2seq)
[Diagram: an encoder RNN reads x^(1), x^(2), …, x^(n_x) and emits a context C; a decoder RNN reads C and generates y^(1), …, y^(n_y).]
N→M: the input and output sequences do not need to be of the same size. The encoder takes the input sequence X = (x^(1), …, x^(n_x)) and summarizes it into a context C; the decoder then takes the fixed-length context vector C and generates the output sequence Y = (y^(1), …, y^(n_y)). The two RNNs are trained jointly to maximize the average of P(Y | X). The final hidden state h^(n_x) is often used as the representation C.
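A minimal sketch of the encode/decode flow, reusing the same plain-NumPy recurrence as above (using the final encoder state as C, and a decoder that does not feed its own outputs back in, are simplifying assumptions for illustration; the slide does not fix these details):

```python
import numpy as np

def encode(xs, U, W, b):
    """Summarize the input sequence into a fixed-length context C."""
    h = np.zeros(W.shape[0])
    for x_t in xs:
        h = np.tanh(U.dot(x_t) + W.dot(h) + b)
    return h                          # C: here simply the final hidden state h^(n_x)

def decode(C, n_steps, W_dec, V, b, c):
    """Generate n_steps outputs from the context C (no output feedback, for brevity)."""
    h = C                             # initialize the decoder with the context
    ys = []
    for _ in range(n_steps):
        h = np.tanh(W_dec.dot(h) + b)
        ys.append(V.dot(h) + c)
    return ys
```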
13
Challenge of Long-Term Dependencies
Deep RNNs have a deep computational graph: they apply the same operation at every time step of a long temporal sequence. Multiplying the weight matrix $W$ $t$ times to reach time step $t$ gives $W^t$. Suppose $W$ has the eigendecomposition $W = V \operatorname{diag}(\lambda) V^{-1}$; then $W^t = (V \operatorname{diag}(\lambda) V^{-1})^t = V \operatorname{diag}(\lambda)^t V^{-1}$. When the eigenvalues $\lambda$ are not equal to 1, $W^t$ can become too large or too small.
More specifically, in RNNs $h^{(t)} = W^\top h^{(t-1)}$, so $h^{(t)} = (W^t)^\top h^{(0)}$. If $W$ admits the eigendecomposition $Q \Lambda Q^\top$, then $h^{(t)} = Q^\top \Lambda^t Q\, h^{(0)}$. If $\Lambda^t$ becomes too small, the gradients become negligible (they vanish).
The problem is especially severe in RNNs because the same weight matrix $W$ is used in every multiplication; feed-forward networks use a different weight matrix at each layer and mostly avoid the problem.
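The effect is easy to see numerically by building a weight matrix with chosen eigenvalues and applying it repeatedly, as an unrolled RNN does. A small NumPy sketch (the dimensions and eigenvalues are made up for illustration):

```python
import numpy as np

def repeated_state(eigvals, t=50):
    """Apply the same weight matrix t times, as an unrolled RNN does."""
    dim = len(eigvals)
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))   # random orthogonal basis
    W = Q @ np.diag(eigvals) @ Q.T                          # W = Q diag(lambda) Q^T
    h = np.ones(dim)
    for _ in range(t):
        h = W.T @ h                                         # h^(t) = W^T h^(t-1)
    return np.linalg.norm(h)

print(repeated_state([0.5, 0.9, 0.95]))   # all |lambda| < 1: the state shrinks toward 0
print(repeated_state([1.1, 1.2, 0.9]))    # some |lambda| > 1: the state blows up
```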
14
LSTM – Internal Structure
[Diagram: an LSTM memory cell — a linear self-loop carries the cell state; the forget gate (sigmoid) controls the value of the state self-loop, the input gate (non-linear) controls the input, and the output gate (sigmoid) controls the output.]
15
LSTM – The Math
$f_i^{(t)} = \sigma\left(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\right)$ (forget gate)
$g_i^{(t)} = \sigma\left(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)}\right)$ (external input gate)
$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\left(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)}\right)$ (cell state)
$q_i^{(t)} = \sigma\left(b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)}\right)$ (output gate)
$h_i^{(t)} = \tanh\left(s_i^{(t)}\right) q_i^{(t)}$ (hidden state)
W: recurrent weights, U: input weights, b: bias, q: output gate, g: external input gate, f: forget gate.
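A direct transcription of these equations into NumPy for a single time step, vectorized over the cell index i (the parameter shapes and the parameter dictionary are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM step following the slide's equations.

    p holds the parameters: U_* / W_* (input / recurrent weights) and b_* (biases)
    for the forget gate (f), external input gate (g), cell update, and output gate (o).
    """
    f = sigmoid(p['b_f'] + p['U_f'] @ x_t + p['W_f'] @ h_prev)   # forget gate
    g = sigmoid(p['b_g'] + p['U_g'] @ x_t + p['W_g'] @ h_prev)   # external input gate
    q = sigmoid(p['b_o'] + p['U_o'] @ x_t + p['W_o'] @ h_prev)   # output gate
    s = f * s_prev + g * sigmoid(p['b'] + p['U'] @ x_t + p['W'] @ h_prev)  # cell state
    h = np.tanh(s) * q                                           # hidden state
    return h, s
```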
16
Backpropagation Through a Memory Cell (from Hinton's lecture)
17
LSTM – The Solution
The LSTM architecture consists of a set of recurrently connected subnets, known as memory blocks. An LSTM network is structured like a normal RNN, except that the non-linear units in the hidden layers are replaced by memory blocks. Hidden layers can be connected to any form of output layer. Each block has three gates: input, output, and forget. These gates protect the information inside the block and allow its value to change dynamically.
18
Handwriting Recognition by Alex Graves
19
RNN In MXNet
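The original slide showed MXNet code that is not preserved in this transcript. As a placeholder, here is a minimal sketch of building and running an LSTM with MXNet's Gluon API (Gluon postdates this May 2017 deck, so this is almost certainly not the exact code that was shown; shapes and hyperparameters are illustrative):

```python
import mxnet as mx
from mxnet import nd
from mxnet.gluon import rnn

# 2-layer LSTM with 100 hidden units; default layout is (seq_len, batch, features)
layer = rnn.LSTM(hidden_size=100, num_layers=2)
layer.initialize()

x = nd.random.uniform(shape=(35, 32, 50))       # 35 steps, batch of 32, 50 features
states = layer.begin_state(batch_size=32)       # initial hidden and cell states
output, states = layer(x, states)               # output shape: (35, 32, 100)
```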
20
RNN In MXNet
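Likewise, a sketch of what a training step on top of such a layer might look like in Gluon, rather than the exact code from the original slide (the stacked Dense output layer, the loss, the optimizer, and the dummy data are all assumptions):

```python
from mxnet import autograd, gluon, nd

net = gluon.nn.Sequential()
net.add(gluon.rnn.LSTM(hidden_size=100, num_layers=2))
net.add(gluon.nn.Dense(10, flatten=False))      # per-step predictions over 10 classes
net.initialize()

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})

x = nd.random.uniform(shape=(35, 32, 50))       # (seq_len, batch, features)
y = nd.zeros((35, 32))                          # dummy per-step class labels

with autograd.record():
    out = net(x)                                # (35, 32, 10)
    loss = loss_fn(out, y).mean()
loss.backward()
trainer.step(batch_size=1)                      # gradients already averaged by mean()
```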
21
RNN In MXNet
22
References
How ANN Predicts Stock Market Indices?, by Vidya Sagar Reddy Gopala and Deepak Ramu
Deep Learning, O'Reilly Media Inc., by Josh Patterson and Adam Gibson
Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks, CreateSpace Independent Publishing Platform, by Jeff Heaton
Deep Learning, MIT Press (deeplearningbook.org)
Supervised Sequence Labelling with Recurrent Neural Networks, by Alex Graves
Neural Networks for Machine Learning, Coursera course by Geoffrey Hinton, University of Toronto
23
Thank You Cyrus M Vahid