Other Classification Models: Recurrent Neural Network (RNN)


1 Other Classification Models: Recurrent Neural Network (RNN)
COMP5331. Prepared and presented by Raymond Wong.

2 Other Classification Models
Support Vector Machine (SVM)
Neural Network
Recurrent Neural Network (RNN)

3 Neural Network
We train the model starting from the first record.
[Diagram: a neural network with input attributes x1 and x2 and output attribute y, alongside a training table with columns x1, x2 and d.]

4 Neural Network
[Diagram: a record's input attributes are fed into the network, which produces output y.]

5 Neural Network
[Diagram: the neural network and the training table.]

6 Neural Network
[Diagram: the neural network and the training table.]

7 Neural Network
We train the model with the first record again.
[Diagram: the same network and training table.]

8 Neural Network
Here, training the model with one record is "independent" of training the model with another record. This means that we assume the records in the table are "independent".

9 In some cases, the current record is "related" to the "previous" records in the table. Thus, the records in the table are "dependent". We want to capture this "dependency" in the model, and we can use a new model called the "recurrent neural network" for this purpose.

10 Neural Network
[Diagram: the neural network with input attributes x1, x2 and output attribute y.]

11 Neural Network
[Diagram: record 1 is represented as an input vector (x1,1, x1,2) fed to the network, producing output y1.]

12 Neural Network
[Diagram: the network redrawn with a single input vector x1 and output attribute y1.]

13 Recurrent Neural Network
A recurrent neural network (RNN) is a neural network with a loop.
[Diagram: the network with input vector x1, output y1, and a feedback loop.]

14 Recurrent Neural Network
[Diagram: the RNN drawn as a single block taking input vector x1 and producing output y1.]

15 Unfolded representation of RNN
[Diagram: the RNN unfolded over time; at timestamp 1 it maps x1 to y1, at timestamp 2 it maps x2 to y2, at timestamp 3 it maps x3 to y3, and in general at timestamp t it maps xt to yt.]

16 Internal state variable
[Diagram: the unfolded RNN with an internal state variable passed between units; the unit at timestamp t receives st-1 from timestamp t-1, produces output yt, and passes st on to timestamp t+1.]
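In code, this unfolded view is simply the same memory unit applied at every timestamp, with the internal state carried forward. A minimal sketch (the step function and names here are illustrative, not taken from the slides):

```python
def run_rnn(step, xs, s0=0.0):
    """Apply the same memory unit `step` at every timestamp, carrying the
    internal state forward -- this is exactly the unfolded view above."""
    s = s0
    ys = []
    for x in xs:            # x1, x2, ..., xt
        y, s = step(x, s)   # the unit takes xt and st-1 and returns yt and st
        ys.append(y)
    return ys
```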

17 [Diagram: two consecutive RNN units; the unit at timestamp t-1 takes xt-1 and produces yt-1 and state st-1, which is passed to the unit at timestamp t, which takes xt and produces yt and st.]

18 Limitation
Because the RNN may "memorize" a lot of past events/values and therefore has a more complex structure, it is more time-consuming to train.

19 RNN
Basic RNN
Traditional LSTM
GRU

20 Basic RNN
The basic RNN is very simple. It contains only a single activation function (e.g., "tanh" or "ReLU").

21 [Diagram: two consecutive RNN units passing the state st-1 from timestamp t-1 to timestamp t.]

22 [Diagram: the same picture with the unit at each timestamp labelled "Basic RNN".]

23 [Diagram: the basic RNN unit at timestamp t shown as a memory unit that takes xt and st-1 and produces yt and st.]

24 Basic RNN
[Diagram: inside the memory unit there is a single activation function, usually "tanh" or "ReLU".]

25 Basic RNN
The memory unit computes
st = tanh(W · [xt, st-1] + b)
yt = st
Example weights: W = [0.7, 0.3, 0.4], b = 0.4.

26 In the following, we want to compute the (weight) values in the basic RNN.
Similar to the neural network, the basic RNN model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the basic RNN, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

27 Consider this example with two timestamps.
Time | xt,1 | xt,2 | y
t=1  | 0.1  | 0.4  | 0.3
t=2  | 0.7  | 0.9  | 0.5
We use the basic RNN to do the training.

28 Basic RNN
When t = 1:
[Diagram: the unfolded basic RNN with x1 fed in at timestamp 1 (using y0 and s0 as the previous output and state) to produce y1 and s1, and x2 fed in at timestamp 2 to produce y2.]
st = tanh(W · [xt, st-1] + b)
yt = st

29 Step 1 (Input Forward Propagation)
st = tanh(W · [xt, st-1] + b), yt = st, with W = [0.7, 0.3, 0.4], b = 0.4, and y0 = s0 = 0.
s1 = tanh(W · [x1, s0] + b) = tanh(0.7 × 0.1 + 0.3 × 0.4 + 0.4 × 0 + 0.4) = tanh(0.59) = 0.5299
y1 = s1 = 0.5299
Error = y1 - y = 0.5299 - 0.3 = 0.2299

30 Step 1 (Input Forward Propagation)
So far: y1 = 0.5299, s1 = 0.5299.

31 Step 1 (Input Forward Propagation)
s2 = tanh(W · [x2, s1] + b) = tanh(0.7 × 0.7 + 0.3 × 0.9 + 0.4 × 0.5299 + 0.4) = tanh(1.3720) ≈ 0.8791
y2 = s2 ≈ 0.8791
Error = y2 - y ≈ 0.8791 - 0.5 = 0.3791
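A short script in the same spirit reproduces the two forward-propagation steps above, starting from y0 = s0 = 0; it should match the slides' intermediate values s1 = tanh(0.59) = 0.5299 and s2 = tanh(1.3720) (a sketch of the slides' computation, not a training procedure):

```python
import math

W, b = [0.7, 0.3, 0.4], 0.4          # example weights from the slides
data = [([0.1, 0.4], 0.3),           # (x1, target y) at t = 1
        ([0.7, 0.9], 0.5)]           # (x2, target y) at t = 2

s = 0.0                              # s0 = 0 (and y0 = 0)
for t, (x, target) in enumerate(data, start=1):
    s = math.tanh(W[0] * x[0] + W[1] * x[1] + W[2] * s + b)   # st
    y = s                                                     # yt = st
    print(f"t={t}: s={s:.4f}, y={y:.4f}, error={y - target:.4f}")
```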

32 RNN
Basic RNN
Traditional LSTM
GRU

33 Traditional LSTM
Disadvantage of the basic RNN:
The basic RNN model is too "simple", so it cannot simulate our human brain very much. It is also not easy for the basic RNN model to converge (i.e., it may take a very long time to train the RNN model).

34 Traditional LSTM
Before we give the details of this brain-like model, we want to emphasize that there is an internal state variable (i.e., variable st) to store our memory (i.e., a value). The next RNN to be described is called the LSTM (Long Short-Term Memory) model.

35 Traditional LSTM
It could simulate the brain process.
Forget feature: it could "decide" to forget a portion of the internal state variable.
Input feature: it could "decide" to input a portion of the input variable for the model, and "decide" the strength of the input for the model (i.e., the activation function), called the "weight" of the input.

36 Traditional LSTM
Output feature: it could "decide" to output a portion of the output for the model, and "decide" the strength of the output for the model (i.e., the activation function), called the "weight" of the output.

37 Traditional LSTM
Our brain includes the following steps (each implemented as a gate):
Forget component (forget gate)
Input component (input gate)
Input activation component (input activation gate)
Internal state component (internal state gate)
Output component (output gate)
Final output component (final output gate)

38 [Diagram: two consecutive RNN units passing the state from timestamp t-1 to timestamp t.]

39 [Diagram: the same picture with the unit at each timestamp labelled "Traditional LSTM".]

40 [Diagram: the traditional LSTM unit at timestamp t shown as a memory unit that takes xt, yt-1 and st-1 and produces yt and st.]

41 Traditional LSTM
Forget gate: ft = σ(Wf · [xt, yt-1] + bf)
Sigmoid function: σ(net) = 1 / (1 + e^(-net))
Example bias: bf = 0.4.

42 Traditional LSTM
Input gate: it = σ(Wi · [xt, yt-1] + bi)
Example weights: Wi = [0.2, 0.3, 0.4], bi = 0.2.

43 Traditional LSTM
Input activation gate: at = tanh(Wa · [xt, yt-1] + ba)
tanh function: tanh(net) = (e^(2 net) - 1) / (e^(2 net) + 1)
Example bias: ba = 0.5.
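The two activation functions used by the gates can be written down directly from these definitions (a trivial sketch; math.tanh could of course be used instead of the explicit formula):

```python
import math

def sigmoid(net):
    """Sigmoid: 1 / (1 + e^(-net)), used by the forget, input and output gates."""
    return 1.0 / (1.0 + math.exp(-net))

def tanh(net):
    """tanh: (e^(2 net) - 1) / (e^(2 net) + 1), used by the input activation gate."""
    return (math.exp(2 * net) - 1) / (math.exp(2 * net) + 1)
```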

44 Traditional LSTM
[Diagram: the memory unit so far, with the forget gate ft, the input gate it and the input activation gate at.]

45 Traditional LSTM
Internal state gate: st = ft · st-1 + it · at

46 Traditional LSTM
[Diagram: the memory unit so far, now also updating the internal state st.]

47 Traditional LSTM
Output gate: ot = σ(Wo · [xt, yt-1] + bo)
Example bias: bo = 0.3.

48 Traditional LSTM
Final output gate: yt = ot · tanh(st)

49 Traditional LSTM
Putting the gates together, the memory unit computes:
ft = σ(Wf · [xt, yt-1] + bf)
it = σ(Wi · [xt, yt-1] + bi)
at = tanh(Wa · [xt, yt-1] + ba)
st = ft · st-1 + it · at
ot = σ(Wo · [xt, yt-1] + bo)
yt = ot · tanh(st)
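Putting the six equations together, one step of the traditional LSTM memory unit can be sketched as follows (the helper names and the assumption of a two-value input vector are ours, not the slides'):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def affine(w, b, x_t, y_prev):
    """Evaluate w . [xt,1, xt,2, yt-1] + b for a two-value input vector."""
    return w[0] * x_t[0] + w[1] * x_t[1] + w[2] * y_prev + b

def lstm_step(x_t, y_prev, s_prev, Wf, bf, Wi, bi, Wa, ba, Wo, bo):
    """One step of the traditional LSTM, following the slide's six equations."""
    f_t = sigmoid(affine(Wf, bf, x_t, y_prev))    # forget gate
    i_t = sigmoid(affine(Wi, bi, x_t, y_prev))    # input gate
    a_t = math.tanh(affine(Wa, ba, x_t, y_prev))  # input activation gate
    s_t = f_t * s_prev + i_t * a_t                # internal state gate
    o_t = sigmoid(affine(Wo, bo, x_t, y_prev))    # output gate
    y_t = o_t * math.tanh(s_t)                    # final output gate
    return y_t, s_t
```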

50 In the following, we want to compute the (weight) values in the traditional LSTM.
Similar to the neural network, the traditional LSTM model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the traditional LSTM, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

51 Consider this example with two timestamps.
Time | xt,1 | xt,2 | y
t=1  | 0.1  | 0.4  | 0.3
t=2  | 0.7  | 0.9  | 0.5
We use the traditional LSTM to do the training.

52 Traditional LSTM
When t = 1:
[Diagram: the traditional LSTM memory unit at timestamp 1, taking x1 together with y0 and s0 and computing f1, i1, a1, o1, s1 and y1; the unit at timestamp 2 then takes x2.]

53 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 f1 = (Wf [x1, y0] + bf) = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.59) = f1 = RNN

54 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 i1 = (Wi [x1, y0] + bi) = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.34) = f1 = i1 = RNN

55 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 a1 = tanh(Wa [x1, y0] + ba) = tanh( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = tanh( ) = tanh(0.62) = f1 = i1 = a1 = RNN

56 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 s1 = f1 . s0 + i1 . a1 = Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f1 = i1 = a1 = s1 = RNN

57 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 o1 = (Wo [x1, y0] + bo) = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.74) = f1 = i1 = a1 = s1 = o1 = RNN

58 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 = o1 . tanh(s1) = tanh(0.3220) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f1 = i1 = a1 = s1 = o1 = RNN y1 =

59 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 Error = y1 - y = – 0.3 Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f1 = i1 = a1 = s1 = o1 = RNN y1 =

60 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 0.2107 0.3220 Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 f1 = i1 = a1 = s1 = o1 = RNN y1 =

61 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 f2 = (Wf [x2, y1] + bf) 0.2107 0.3220 = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (1.2443) = f2 = RNN

62 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 i2 = (Wi [x2, y1] + bi) 0.2107 0.3220 = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.6943) = f2 = i2 = RNN

63 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 a2 = tanh(Wa [x2, y1] + ba) 0.2107 0.3220 = tanh( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = tanh( ) = tanh(0.9811) = f2 = i2 = a2 = RNN

64 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 s2 = f2 . s1 + i2 . a2 0.2107 0.3220 = Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f2 = i2 = a2 = s2 = RNN

65 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 o2 = (Wo [x2, y1] + bo) 0.2107 0.3220 = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (1.7121) = f2 = i2 = a2 = s2 = o2 = RNN

66 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 y2 = o2 . tanh(s2) 0.2107 0.3220 = tanh(0.7525) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f2 = i2 = a2 = s2 = o2 = RNN y2 =

67 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 0.2107 0.3220 Error = y2 - y = – 0.5 Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f2 = i2 = a2 = s2 = o2 = RNN y2 =

68 Similar to the "neural network", the LSTM model (and the basic RNN model) could also have multiple layers and have multiple memory units in each layer.

69 Multi-layer RNN
[Diagram: the RNN with inputs x1, x2 and output y, drawn with its memory unit.]

70 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

71 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

72 Multi-layer RNN
[Diagram: a multi-layer RNN with an input layer (x1, ..., x5), a hidden layer of memory units, and an output layer (y1, ..., y4); see the sketch below.]
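As a rough sketch, "multiple layers" just means that the output sequence of one layer of memory units becomes the input sequence of the next layer (the function names and the single-value signals here are illustrative simplifications, not the slides' notation):

```python
def run_layer(step, xs, s0=0.0):
    """Run one layer's memory unit over the whole input sequence."""
    s, ys = s0, []
    for x in xs:
        y, s = step(x, s)     # each unit returns its output and its new state
        ys.append(y)
    return ys

def run_multilayer(layer_steps, xs):
    """Feed the output sequence of each layer into the next layer."""
    seq = xs
    for step in layer_steps:  # e.g. [layer1_step, layer2_step, ...]
        seq = run_layer(step, seq)
    return seq
```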

73 RNN
Basic RNN
Traditional LSTM
GRU

74 GRU
GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM model, but "simpler". Before we introduce GRU, let us look at the properties of the traditional LSTM.

75 GRU
Properties of the traditional LSTM:
The traditional LSTM model has a greater power of capturing the properties of the data, so the result generated by this model is usually more accurate. Besides, it could "remember" or "memorize" longer sequences.

76 GRU
Since the structure of GRU is simpler than that of the traditional LSTM model, it has the following advantages:
The training time is shorter.
It requires fewer data points to capture the properties of the data.

77 GRU
Different from the traditional LSTM model, the GRU model does not have an internal state variable (i.e., variable st) to store our memory (i.e., a value). Instead, it regards the "predicted" target attribute value of the previous record (with an internal operation called "reset") as a reference to store our memory.

78 GRU
Similarly, the GRU model simulates the brain process.
Reset feature: it could regard the "predicted" target attribute value of the previous record as a reference to store the memory.
Input feature: it could "decide" the strength of the input for the model (i.e., the activation function).

79 GRU
Output feature: it could "combine" a portion of the "predicted" target attribute value of the previous record and a portion of the "processed" input variable. The ratio of these two portions is determined by the update feature.

80 GRU
Our brain includes the following steps (each implemented as a gate):
Reset component (reset gate)
Input activation component (input activation gate)
Update component (update gate)
Final output component (final output gate)

81 [Diagram: two consecutive RNN units passing the state from timestamp t-1 to timestamp t.]

82 [Diagram: the same picture with the units labelled "Traditional LSTM".]

83 [Diagram: the same picture with the units labelled "GRU"; note that no internal state variable st is passed between timestamps, only the output yt.]

84 [Diagram: the GRU unit at timestamp t shown as a memory unit that takes xt and yt-1 and produces yt.]

85 GRU
Reset gate: rt = σ(Wr · [xt, yt-1] + br)
Example weights: Wr = [0.7, 0.3, 0.4], br = 0.4.

86 GRU
Input activation gate: at = tanh(Wa · [xt, rt · yt-1] + ba)
Example bias: ba = 0.3.

87 GRU
Update gate: ut = σ(Wu · [xt, yt-1] + bu)
Example bias: bu = 0.5.

88 GRU
Final output gate: yt = (1 - ut) · yt-1 + ut · at

89 GRU
Putting the gates together, the memory unit computes:
rt = σ(Wr · [xt, yt-1] + br)
at = tanh(Wa · [xt, rt · yt-1] + ba)
ut = σ(Wu · [xt, yt-1] + bu)
yt = (1 - ut) · yt-1 + ut · at
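One step of the GRU memory unit, written directly from the four equations above (a sketch; the two-value input vector and helper names are our assumptions):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def gru_step(x_t, y_prev, Wr, br, Wa, ba, Wu, bu):
    """One step of the GRU, following the slide's four equations."""
    r_t = sigmoid(Wr[0] * x_t[0] + Wr[1] * x_t[1] + Wr[2] * y_prev + br)            # reset gate
    a_t = math.tanh(Wa[0] * x_t[0] + Wa[1] * x_t[1] + Wa[2] * (r_t * y_prev) + ba)  # input activation gate
    u_t = sigmoid(Wu[0] * x_t[0] + Wu[1] * x_t[1] + Wu[2] * y_prev + bu)            # update gate
    y_t = (1 - u_t) * y_prev + u_t * a_t                                            # final output gate
    return y_t
```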

90 In the following, we want to compute the (weight) values in GRU.
Similar to the neural network, GRU has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In GRU, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

91 Consider this example with two timestamps.
Time | xt,1 | xt,2 | y
t=1  | 0.1  | 0.4  | 0.3
t=2  | 0.7  | 0.9  | 0.5
We use GRU to do the training.

92 GRU
When t = 1:
[Diagram: the GRU memory unit at timestamp 1, taking x1 together with y0 and computing r1, a1, u1 and y1; the unit at timestamp 2 then takes x2.]

93 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 r1 = (Wr [x1, y0] + br) = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (0.59) = r1 = RNN

94 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 a1 = tanh(Wa [x1, r1 . y0] + ba) = tanh( ∙ ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = tanh( ) = tanh(0.44) = r1 = a1 = RNN

95 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 u1 = (Wu [x1, y0] + bu) = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (0.62) = r1 = a1 = u1 = RNN

96 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 = (1 – u1) . y0 + u1 . a1 = (1 – ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r1 = a1 = u1 = y1 = RNN

97 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 Error = y1 - y = – 0.3 Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r1 = a1 = u1 = y1 = RNN

98 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 0.2690 Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 r1 = a1 = u1 = y1 = RNN

99 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 r2 = (Wr [x2, y1] + br) 0.2690 = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (1.2676) = r2 = RNN

100 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 a2 = tanh(Wa [x2, r2 . y1] + ba) 0.2690 = tanh( ∙ ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = tanh( ) = tanh(0.7940) = r2 = a2 = RNN

101 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 u2 = (Wu [x2, y1] + bu) 0.2690 = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (0.9869) = r2 = a2 = u2 = RNN

102 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 y2 = (1 – u2) . y1 + u2 . a2 0.2690 = (1 – ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r2 = a2 = u2 = y2 = RNN

103 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 0.2690 Error = y2 - y = – 0.5 Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r2 = a2 = u2 = y2 = RNN

104 Similar to the "neural network", GRU could also have multiple layers and have multiple memory units in each layer.

105 Multi-layer RNN
[Diagram: the RNN with inputs x1, x2 and output y, drawn with its memory unit.]

106 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

107 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

108 Multi-layer RNN
[Diagram: a multi-layer RNN with an input layer (x1, ..., x5), a hidden layer of memory units, and an output layer (y1, ..., y4).]

