Other Classification Models: Recurrent Neural Network (RNN)


COMP5331 Other Classification Models: Recurrent Neural Network (RNN)
Prepared by Raymond Wong, presented by Raymond Wong (raywong@cse)

Other Classification Models
- Support Vector Machine (SVM)
- Neural Network
- Recurrent Neural Network (RNN)

Neural Network
A neural network maps the input attributes (x1, x2) of a record to the output attribute y; the table also stores the target value d for each record.
We train the model starting from the first record, and then with each subsequent record in turn. Later, we train the model with the first record again.

Neural Network
Here, training the model with one record is "independent" of training the model with another record. This means that we assume the records in the table are "independent".

In some cases, the current record is "related" to the "previous" records in the table; that is, the records in the table are "dependent". We also want to capture this "dependency" in the model. We could use a new model called the "recurrent neural network" for this purpose.

Neural Network
In the neural network, the input attributes of record 1 (x1,1 and x1,2) can be grouped into an input vector x1, and the network maps this input vector to the output attribute y1.

Recurrent Neural Network
A recurrent neural network (RNN) is a neural network with a loop. Like the neural network, it maps the input vector x1 to the output attribute y1, but its memory unit also feeds information forward to the next timestamp.

Unfolded representation of RNN
The same RNN cell is applied at every timestamp: at timestamp 1 it maps x1 to y1, at timestamp 2 it maps x2 to y2, at timestamp 3 it maps x3 to y3, ..., and at timestamp t it maps xt to yt.
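As a rough sketch (not from the slides), the unfolded view can be written as a loop that applies one cell function at every timestamp; `cell` here is a placeholder for any of the memory units described below.

```python
def unroll(cell, inputs, initial_state=0.0):
    """Apply the same cell at every timestamp, threading the internal state through."""
    state = initial_state
    outputs = []
    for x_t in inputs:                  # x_t is the input vector at timestamp t
        y_t, state = cell(x_t, state)   # the cell returns (output yt, new internal state st)
        outputs.append(y_t)
    return outputs
```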

Internal state variable
Each memory unit passes an internal state variable to the next timestamp: the unit at timestamp t-1 produces yt-1 and the state st-1, the unit at timestamp t consumes st-1 and produces yt and st, the unit at timestamp t+1 consumes st, and so on.

Limitation
The RNN may have to "memorize" a lot of past events/values. Due to its complex structure, it is more time-consuming to train.

RNN
- Basic RNN
- Traditional LSTM
- GRU

Basic RNN
The basic RNN is very simple. It contains only a single activation function (e.g., "tanh" or "ReLU").

Basic RNN memory unit
At timestamp t, the memory unit of the basic RNN takes the input vector xt and the previous internal state st-1, applies the activation function (usually "tanh" or "ReLU"), and produces the new internal state st and the output yt:

st = tanh(W · [xt, st-1] + b)
yt = st

In the running example, W = [0.7  0.3  0.4] and b = 0.4.
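As a minimal sketch (not part of the slides), the memory unit above can be written as a small Python function; the scalar state and the weight layout [weight for xt,1, weight for xt,2, weight for st-1] are assumptions that match the running example.

```python
import math

W = [0.7, 0.3, 0.4]   # assumed layout: [weight for xt,1, weight for xt,2, weight for st-1]
b = 0.4

def basic_rnn_step(x_t, s_prev):
    """One basic RNN step: st = tanh(W . [xt, st-1] + b), yt = st."""
    net = W[0] * x_t[0] + W[1] * x_t[1] + W[2] * s_prev + b
    s_t = math.tanh(net)
    return s_t, s_t   # yt = st, so the output and the new state coincide
```

For example, basic_rnn_step([0.1, 0.4], 0.0) returns approximately (0.5299, 0.5299), matching the worked computation below.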

In the following, we want to compute the (weight) values in the basic RNN. Similar to the neural network, the basic RNN model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
Here, we focus on "Input Forward Propagation". In the basic RNN, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

Consider this example with two timestamps (t = 1 and t = 2). We use the basic RNN to do the training.

Time   xt,1   xt,2   y
t=1    0.1    0.4    0.3
t=2    0.7    0.9    0.5

When t = 1, the memory unit takes x1 and the initial state s0 as inputs (y0 and s0 are initialized to 0).

Step 1 (Input Forward Propagation), t = 1:
s1 = tanh(W · [x1, s0] + b)
   = tanh(0.7 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.4)
   = tanh(0.59)
   = 0.5299
y1 = s1 = 0.5299
Error = y1 - y = 0.5299 - 0.3 = 0.2299

Step 1 (Input Forward Propagation), t = 2 (carrying s1 = 0.5299 forward):
s2 = tanh(W · [x2, s1] + b)
   = tanh(0.7 · 0.7 + 0.3 · 0.9 + 0.4 · 0.5299 + 0.4)
   = tanh(1.3720)
   = 0.8791
y2 = s2 = 0.8791
Error = y2 - y = 0.8791 - 0.5 = 0.3791
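A short, self-contained check of the two forward steps above (a sketch, not from the slides; the same weight-layout assumption applies):

```python
import math

W, b = [0.7, 0.3, 0.4], 0.4            # weights from the example
records = [([0.1, 0.4], 0.3),           # (input vector x1, target y) at t = 1
           ([0.7, 0.9], 0.5)]           # (input vector x2, target y) at t = 2

s = 0.0                                 # initial internal state s0
for t, (x, target) in enumerate(records, start=1):
    s = math.tanh(W[0] * x[0] + W[1] * x[1] + W[2] * s + b)   # st = tanh(W . [xt, st-1] + b)
    y = s                                                      # yt = st
    print(f"t={t}: y={y:.4f}  error={y - target:+.4f}")
# Expected (rounded): t=1: y=0.5299  error=+0.2299, t=2: y=0.8791  error=+0.3791
```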

RNN
- Basic RNN
- Traditional LSTM
- GRU

Traditional LSTM
Disadvantage of the basic RNN: the basic RNN model is too "simple". It does not simulate our human brain very closely, and it is not easy for the basic RNN model to converge (i.e., it may take a very long time to train the model).

Before we give the details of how the model simulates our brain, we want to emphasize that there is an internal state variable (i.e., variable st) to store our memory (i.e., a value). The next RNN to be described is called the LSTM (Long Short-Term Memory) model.

Traditional LSTM
It could simulate the brain process.
- Forget feature: it could "decide" to forget a portion of the internal state variable.
- Input feature: it could "decide" to input a portion of the input variable for the model, and it could "decide" the strength of the input for the model (i.e., the activation function), called the "weight" of the input.
- Output feature: it could "decide" to output a portion of the output for the model, and it could "decide" the strength of the output for the model (i.e., the activation function), called the "weight" of the output.

Our brain includes the following steps, each modelled by a gate in the LSTM:
- Forget component → forget gate
- Input component → input gate
- Input activation component → input activation gate
- Internal state component → internal state gate
- Output component → output gate
- Final output component → final output gate

Traditional LSTM memory unit
At timestamp t, the memory unit of the traditional LSTM takes the input vector xt, the previous output yt-1, and the previous internal state st-1. Its gates are computed with the sigmoid function σ(net) = 1 / (1 + e^(-net)) and tanh(net) = (e^(2·net) - 1) / (e^(2·net) + 1):

Forget gate:           ft = σ(Wf · [xt, yt-1] + bf)
Input gate:            it = σ(Wi · [xt, yt-1] + bi)
Input activation gate: at = tanh(Wa · [xt, yt-1] + ba)
Internal state gate:   st = ft · st-1 + it · at
Output gate:           ot = σ(Wo · [xt, yt-1] + bo)
Final output gate:     yt = ot · tanh(st)

In the running example, Wf = [0.7  0.3  0.4], bf = 0.4; Wi = [0.2  0.3  0.4], bi = 0.2; Wa = [0.4  0.2  0.1], ba = 0.5; Wo = [0.8  0.9  0.2], bo = 0.3.
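As a minimal sketch of the memory unit above (not from the slides; the scalar state/output and the weight layout [weight for xt,1, weight for xt,2, weight for yt-1] are assumptions matching the running example):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def net_input(W, b, x_t, y_prev):
    """W . [xt, yt-1] + b for a 2-attribute input vector and a scalar previous output."""
    return W[0] * x_t[0] + W[1] * x_t[1] + W[2] * y_prev + b

# Weights from the running example.
Wf, bf = [0.7, 0.3, 0.4], 0.4
Wi, bi = [0.2, 0.3, 0.4], 0.2
Wa, ba = [0.4, 0.2, 0.1], 0.5
Wo, bo = [0.8, 0.9, 0.2], 0.3

def lstm_step(x_t, y_prev, s_prev):
    f_t = sigmoid(net_input(Wf, bf, x_t, y_prev))    # forget gate
    i_t = sigmoid(net_input(Wi, bi, x_t, y_prev))    # input gate
    a_t = math.tanh(net_input(Wa, ba, x_t, y_prev))  # input activation gate
    s_t = f_t * s_prev + i_t * a_t                   # internal state gate
    o_t = sigmoid(net_input(Wo, bo, x_t, y_prev))    # output gate
    y_t = o_t * math.tanh(s_t)                       # final output gate
    return y_t, s_t
```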

In the following, we want to compute the (weight) values in the traditional LSTM. Similar to the neural network, the traditional LSTM model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
Here, we focus on "Input Forward Propagation". In the traditional LSTM, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

Consider this example with two timestamps (t = 1 and t = 2). We use the traditional LSTM to do the training.

Time   xt,1   xt,2   y
t=1    0.1    0.4    0.3
t=2    0.7    0.9    0.5

When t = 1, the memory unit takes x1, y0, and s0 as inputs (y0 and s0 are initialized to 0).

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 f1 = (Wf [x1, y0] + bf) = ( 0.7 0.3 0.4 0.1 0.4 0 + 0.4) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = (0.7 . 0.1 + 0.3 . 0.4 + 0.4 . 0 + 0.4) = (0.59) = 0.6434 f1 = 0.6434 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 i1 = (Wi [x1, y0] + bi) = ( 0.2 0.3 0.4 0.1 0.4 0 + 0.2) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = (0.2 . 0.1 + 0.3 . 0.4 + 0.4 . 0 + 0.2) = (0.34) = 0.5842 f1 = 0.6434 i1 = 0.5842 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 a1 = tanh(Wa [x1, y0] + ba) = tanh( 0.4 0.2 0.1 0.1 0.4 0 + 0.5) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = tanh(0.4 . 0.1 + 0.2 . 0.4 + 0.1 . 0 + 0.5) = tanh(0.62) = 0.5511 f1 = 0.6434 i1 = 0.5842 a1 = 0.5511 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 s1 = f1 . s0 + i1 . a1 = 0.6434 . 0 + 0.5842 . 0.5511 Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = 0.3220 f1 = 0.6434 i1 = 0.5842 a1 = 0.5511 s1 = 0.3220 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 o1 = (Wo [x1, y0] + bo) = ( 0.8 0.9 0.2 0.1 0.4 0 + 0.3) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = (0.8 . 0.1 + 0.9 . 0.4 + 0.2 . 0 + 0.3) = (0.74) = 0.6770 f1 = 0.6434 i1 = 0.5842 a1 = 0.5511 s1 = 0.3220 o1 = 0.6770 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 = o1 . tanh(s1) = 0.6770 . tanh(0.3220) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = 0.2107 f1 = 0.6434 i1 = 0.5842 a1 = 0.5511 s1 = 0.3220 o1 = 0.6770 RNN y1 = 0.2107

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 Error = y1 - y = 0.2107 – 0.3 Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = -0.0893 f1 = 0.6434 i1 = 0.5842 a1 = 0.5511 s1 = 0.3220 o1 = 0.6770 RNN y1 = 0.2107

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 0.2107 0.3220 Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 f1 = 0.6434 i1 = 0.5842 a1 = 0.5511 s1 = 0.3220 o1 = 0.6770 RNN y1 = 0.2107

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 f2 = (Wf [x2, y1] + bf) 0.2107 0.3220 = ( 0.7 0.3 0.4 0.7 0.9 0.2107 + 0.4) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = (0.7 . 0.7 + 0.3 . 0.9 + 0.4 . 0.2107 + 0.4) = (1.2443) = 0.7763 f2 = 0.7763 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 i2 = (Wi [x2, y1] + bi) 0.2107 0.3220 = ( 0.2 0.3 0.4 0.7 0.9 0.2107 + 0.2) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = (0.2 . 0.7 + 0.3 . 0.9 + 0.4 . 0.2107 + 0.2) = (0.6943) = 0.6669 f2 = 0.7763 i2 = 0.6669 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 a2 = tanh(Wa [x2, y1] + ba) 0.2107 0.3220 = tanh( 0.4 0.2 0.1 0.7 0.9 0.2107 + 0.5) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = tanh(0.4 . 0.7 + 0.2 . 0.9 + 0.1 . 0.2107 + 0.5) = tanh(0.9811) = 0.7535 f2 = 0.7763 i2 = 0.6669 a2 = 0.7535 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 s2 = f2 . s1 + i2 . a2 0.2107 0.3220 = 0.7763 . 0.3220 + 0.6669 . 0.7535 Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = 0.7525 f2 = 0.7763 i2 = 0.6669 a2 = 0.7535 s2 = 0.7525 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 o2 = (Wo [x2, y1] + bo) 0.2107 0.3220 = ( 0.8 0.9 0.2 0.7 0.9 0.2107 + 0.3) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = (0.8 . 0.7 + 0.9 . 0.9 + 0.2 . 0.2107 + 0.3) = (1.7121) = 0.8471 f2 = 0.7763 i2 = 0.6669 a2 = 0.7535 s2 = 0.7525 o2 = 0.8471 RNN

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 y2 = o2 . tanh(s2) 0.2107 0.3220 = 0.8471 . tanh(0.7525) Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = 0.5393 f2 = 0.7763 i2 = 0.6669 a2 = 0.7535 s2 = 0.7525 o2 = 0.8471 RNN y2 = 0.5393

ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 0.2107 0.3220 Error = y2 - y = 0.5393 – 0.5 Wf = 0.7 0.3 0.4 Wi = 0.2 0.3 0.4 Wa = 0.4 0.2 0.1 Wo = 0.8 0.9 0.2 bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = 0.0393 f2 = 0.7763 i2 = 0.6669 a2 = 0.7535 s2 = 0.7525 o2 = 0.8471 RNN y2 = 0.5393

Similar to the “neural network”, the LSTM model (and the basic RNN model) could also have multiple layers and have multiple memory units in each layer. RNN

Multi-layer RNN
Each node in the network is a memory unit. As in an ordinary neural network, the memory units are organized into an input layer, a hidden layer, and an output layer, with multiple memory units per layer (e.g., inputs x1, ..., x5 feeding a hidden layer whose units produce the outputs y1, ..., y4).
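As an illustrative sketch (not from the slides), a layer can be modelled as a list of memory units, each with its own weights and its own internal state; the weights and layer sizes below are hypothetical.

```python
import math

def unit_step(W, b, inputs, s_prev):
    """One basic-RNN memory unit: st = tanh(W . [inputs, st-1] + b), yt = st."""
    net = sum(w * v for w, v in zip(W, inputs)) + W[-1] * s_prev + b
    return math.tanh(net)

def layer_step(layer, inputs, states):
    """Feed the same inputs to every memory unit of a layer; each unit keeps its own state."""
    return [unit_step(W, b, inputs, s) for (W, b), s in zip(layer, states)]

# Hypothetical 2-layer network: 2 hidden memory units fed by (x1, x2),
# and 1 output memory unit fed by the 2 hidden outputs.
hidden_layer = [([0.7, 0.3, 0.4], 0.4), ([0.2, 0.3, 0.4], 0.2)]
output_layer = [([0.4, 0.2, 0.1], 0.5)]
hidden_states, output_states = [0.0, 0.0], [0.0]

for x in ([0.1, 0.4], [0.7, 0.9]):                     # the two records of the running example
    hidden_states = layer_step(hidden_layer, x, hidden_states)
    output_states = layer_step(output_layer, hidden_states, output_states)
    print(output_states[0])                            # the network's output y at this timestamp
```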

RNN
- Basic RNN
- Traditional LSTM
- GRU

GRU
The GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM, but "simpler". Before we introduce the GRU, let us see the properties of the traditional LSTM.

Properties of the traditional LSTM
The traditional LSTM model has greater power to capture the properties of the data. Thus, the result generated by this model is usually more accurate. Besides, it could "remember" or "memorize" longer sequences.

Since the structure of the GRU is simpler than that of the traditional LSTM model, it has the following advantages:
- The training time is shorter.
- It requires fewer data points to capture the properties of the data.

Different from the traditional LSTM model, the GRU model does not have an internal state variable (i.e., variable st) to store our memory (i.e., a value). Instead, it regards the "predicted" target attribute value of the previous record (with an internal operation called "reset") as a reference to store our memory.

Similarly, the GRU model simulates the brain process.
- Reset feature: it could regard the "predicted" target attribute value of the previous record as a reference to store the memory.
- Input feature: it could "decide" the strength of the input for the model (i.e., the activation function).
- Output feature: it could "combine" a portion of the "predicted" target attribute value of the previous record and a portion of the "processed" input variable. The ratio of these two portions is determined by the update feature.

Our brain includes the following steps, each modelled by a gate in the GRU:
- Reset component → reset gate
- Input activation component → input activation gate
- Update component → update gate
- Final output component → final output gate

GRU memory unit
At timestamp t, the memory unit of the GRU takes the input vector xt and the previous output yt-1 (there is no separate internal state variable), and computes:

Reset gate:            rt = σ(Wr · [xt, yt-1] + br)
Input activation gate: at = tanh(Wa · [xt, rt · yt-1] + ba)
Update gate:           ut = σ(Wu · [xt, yt-1] + bu)
Final output gate:     yt = (1 - ut) · yt-1 + ut · at

In the running example, Wr = [0.7  0.3  0.4], br = 0.4; Wa = [0.2  0.3  0.4], ba = 0.3; Wu = [0.4  0.2  0.1], bu = 0.5.
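As a minimal sketch of the GRU memory unit above (not from the slides; the scalar previous output and the weight layout [weight for xt,1, weight for xt,2, weight for yt-1] are assumptions matching the running example):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Weights from the running example.
Wr, br = [0.7, 0.3, 0.4], 0.4
Wa, ba = [0.2, 0.3, 0.4], 0.3
Wu, bu = [0.4, 0.2, 0.1], 0.5

def gru_step(x_t, y_prev):
    r_t = sigmoid(Wr[0] * x_t[0] + Wr[1] * x_t[1] + Wr[2] * y_prev + br)            # reset gate
    a_t = math.tanh(Wa[0] * x_t[0] + Wa[1] * x_t[1] + Wa[2] * (r_t * y_prev) + ba)  # input activation gate
    u_t = sigmoid(Wu[0] * x_t[0] + Wu[1] * x_t[1] + Wu[2] * y_prev + bu)            # update gate
    y_t = (1.0 - u_t) * y_prev + u_t * a_t                                          # final output gate
    return y_t
```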

In the following, we want to compute the (weight) values in the GRU. Similar to the neural network, the GRU has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
Here, we focus on "Input Forward Propagation". In the GRU, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

Consider this example with two timestamps (t = 1 and t = 2). We use the GRU to do the training.

Time   xt,1   xt,2   y
t=1    0.1    0.4    0.3
t=2    0.7    0.9    0.5

When t = 1, the memory unit takes x1 and y0 as inputs (y0 is initialized to 0).

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 r1 = (Wr [x1, y0] + br) = ( 0.7 0.3 0.4 0.1 0.4 0 + 0.4) Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = (0.7 . 0.1 + 0.3 . 0.4 + 0.4 . 0 + 0.4) = (0.59) = 0.6434 r1 = 0.6434 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 a1 = tanh(Wa [x1, r1 . y0] + ba) = tanh( 0.2 0.3 0.4 0.1 0.4 0.6434 ∙0 + 0.3) Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = tanh(0.2 . 0.1 + 0.3 . 0.4 + 0.4 . 0 + 0.3) = tanh(0.44) = 0.4136 r1 = 0.6434 a1 = 0.4136 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 u1 = (Wu [x1, y0] + bu) = ( 0.4 0.2 0.1 0.1 0.4 0 + 0.5) Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = (0.4 . 0.1 + 0.2 . 0.4 + 0.1 . 0 + 0.5) = (0.62) = 0.6502 r1 = 0.6434 a1 = 0.4136 u1 = 0.6502 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 = (1 – u1) . y0 + u1 . a1 = (1 – 0.6502) . 0 + 0.6502 . 0.4136 Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = 0.2690 r1 = 0.6434 a1 = 0.4136 u1 = 0.6502 y1 = 0.2690 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 Error = y1 - y = 0.2690 – 0.3 Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = -0.0310 r1 = 0.6434 a1 = 0.4136 u1 = 0.6502 y1 = 0.2690 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 0.2690 Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 r1 = 0.6434 a1 = 0.4136 u1 = 0.6502 y1 = 0.2690 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 r2 = (Wr [x2, y1] + br) 0.2690 = ( 0.7 0.3 0.4 0.7 0.9 0.2690 + 0.4) Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = (0.7 . 0.7 + 0.3 . 0.9 + 0.4 . 0.2690 + 0.4) = (1.2676) = 0.7803 r2 = 0.7803 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 a2 = tanh(Wa [x2, r2 . y1] + ba) 0.2690 = tanh( 0.2 0.3 0.4 0.7 0.9 0.7803 ∙0.2690 + 0.3) Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = tanh(0.2 . 0.7 + 0.3 . 0.9 + 0.4 . 0.2099 + 0.3) = tanh(0.7940) = 0.6606 r2 = 0.7803 a2 = 0.6606 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 u2 = (Wu [x2, y1] + bu) 0.2690 = ( 0.4 0.2 0.1 0.7 0.9 0.2690 + 0.5) Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = (0.4 . 0.7 + 0.2 . 0.9 + 0.1 . 0.2690 + 0.5) = (0.9869) = 0.7285 r2 = 0.7803 a2 = 0.6606 u2 = 0.7285 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 y2 = (1 – u2) . y1 + u2 . a2 0.2690 = (1 – 0.7285) . 0.2690 + 0.7285 . 0.6606 Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = 0.5543 r2 = 0.7803 a2 = 0.6606 u2 = 0.7285 y2 = 0.5543 RNN

rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 0.2690 Error = y2 - y = 0.5543 – 0.5 Wr = 0.7 0.3 0.4 br = 0.4 Wa = 0.2 0.3 0.4 ba = 0.3 Wu = 0.4 0.2 0.1 bu = 0.5 = 0.0543 r2 = 0.7803 a2 = 0.6606 u2 = 0.7285 y2 = 0.5543 RNN

Similar to the “neural network”, GRU could also have multiple layers and have multiple memory units in each layer. RNN

Multi-layer RNN
As before, the memory units can be organized into an input layer, a hidden layer, and an output layer, with multiple memory units per layer.
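In practice, multi-layer versions of all three models are usually built with a deep-learning framework rather than by hand. The sketch below assumes the PyTorch library is available; the layer sizes and sequence length are arbitrary, and a final linear layer (not shown) would map each hidden vector to the output attribute.

```python
import torch
import torch.nn as nn

seq = torch.randn(1, 3, 2)   # (batch, timestamps, input attributes per timestamp)

# Two stacked layers, 4 memory units per layer, for each RNN variant.
rnn  = nn.RNN(input_size=2, hidden_size=4, num_layers=2, batch_first=True)
lstm = nn.LSTM(input_size=2, hidden_size=4, num_layers=2, batch_first=True)
gru  = nn.GRU(input_size=2, hidden_size=4, num_layers=2, batch_first=True)

out_rnn, h_rnn = rnn(seq)                 # outputs at every timestamp, final hidden state
out_lstm, (h_lstm, c_lstm) = lstm(seq)    # the LSTM also returns the internal (cell) state
out_gru, h_gru = gru(seq)
print(out_rnn.shape, out_lstm.shape, out_gru.shape)   # each: torch.Size([1, 3, 4])
```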