Other Classification Models: Recurrent Neural Network (RNN)
COMP5331
Prepared and presented by Raymond Wong (raywong@cse)
Other Classification Models
- Support Vector Machine (SVM)
- Neural Network
- Recurrent Neural Network (RNN)
Neural Network
[Figure: a table of records with input attributes x1 and x2 and target attribute d, and a neural network that takes the input attributes x1 and x2 of a record and produces the output attribute y.]
We train the model starting from the first record: its input attributes are fed into the neural network, which produces the output.
We then train the model with the first record again.
Neural Network
Here, we know that training the model with one record is "independent" of training the model with another record. This means that we assume that the records in the table are "independent".
In some cases, the current record is "related" to the "previous" records in the table; that is, the records in the table are "dependent". We also want to capture this "dependency" in the model. We could use a new model called the "recurrent neural network" for this purpose.
Neural Network
[Figure: the neural network redrawn as a single box that maps the input vector x1 of record 1 (with components x1,1 and x1,2) to the output attribute y1.]
Recurrent Neural Network
A recurrent neural network (RNN) is a neural network with a loop.
[Figure: the RNN drawn as a box that maps the input vector x1 to the output attribute y1, with a loop from the box back to itself.]
Unfolded representation of the RNN
[Figure: the loop unrolled over time. At timestamp 1 the RNN maps x1 to y1, at timestamp 2 it maps x2 to y2, at timestamp 3 it maps x3 to y3, ..., and at timestamp t it maps xt to yt.]
Internal state variable
[Figure: each RNN unit also keeps an internal state variable. The unit at timestamp t-1 passes st-1 to the unit at timestamp t, which in turn passes st to the unit at timestamp t+1.]
[Figure: two consecutive RNN units. The unit at timestamp t-1 takes xt-1 and produces yt-1 and the state st-1; the unit at timestamp t takes xt and st-1 and produces yt and st.]
Limitation
It may need to "memorize" a lot of past events/values, and due to its complex structure it is more time-consuming to train.
RNN
- Basic RNN
- Traditional LSTM
- GRU
Basic RNN
The basic RNN is very simple. It contains only one single activation function (e.g., "tanh" or "ReLU").
Basic RNN
[Figure: the unit at timestamp t is now a "Basic RNN" memory unit. It takes xt and st-1 and produces yt and st, which is passed on to timestamp t+1.]
Inside the memory unit there is a single activation function, usually "tanh" or "ReLU".
With example weights W = [0.7, 0.3, 0.4] and bias b = 0.4, the memory unit computes:
st = tanh(W · [xt, st-1] + b)
yt = st
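As a small illustration (the slides give only the equations, so the NumPy code below and the name basic_rnn_step are my own sketch), the memory unit can be written as:

```python
# A minimal sketch of the basic RNN memory unit, using the example
# weights W = [0.7, 0.3, 0.4] and bias b = 0.4 from above.
import numpy as np

W = np.array([0.7, 0.3, 0.4])   # weights for [xt,1, xt,2, st-1]
b = 0.4

def basic_rnn_step(x_t, s_prev):
    """One timestamp: st = tanh(W . [xt, st-1] + b), yt = st."""
    s_t = np.tanh(W @ np.append(x_t, s_prev) + b)
    y_t = s_t
    return y_t, s_t
```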
In the following, we want to compute the (weight) values in the basic RNN. Similar to the neural network, the basic RNN model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the basic RNN, "Error Backward Propagation" could be solved by an existing optimization tool (as in the "Neural Network").
Consider this example with two timestamps. We use the basic RNN to do the training.
Time    xt,1    xt,2    y
t = 1   0.1     0.4     0.3
t = 2   0.7     0.9     0.5
[Figure: the basic RNN unrolled on this example. At timestamp 1 the memory unit takes x1 and s0 and produces y1 and s1; at timestamp 2 it takes x2 and s1 and produces y2 and s2.]
Step 1 (Input Forward Propagation), t = 1 (with s0 = 0; recall W = [0.7, 0.3, 0.4], b = 0.4):
s1 = tanh(W · [x1, s0] + b)
   = tanh(0.7 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.4)
   = tanh(0.59) = 0.5299
y1 = s1 = 0.5299
Error = y1 - y = 0.5299 - 0.3 = 0.2299
Step 1 (Input Forward Propagation), t = 2 (carrying forward s1 = 0.5299):
s2 = tanh(W · [x2, s1] + b)
   = tanh(0.7 · 0.7 + 0.3 · 0.9 + 0.4 · 0.5299 + 0.4)
   = tanh(1.3720) = 0.8791
y2 = s2 = 0.8791
Error = y2 - y = 0.8791 - 0.5 = 0.3791
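For reference, a self-contained sketch of Step 1 (Input Forward Propagation) on this two-timestamp example (variable names are my own); it reproduces s1 = 0.5299, y2 = 0.8791 and the errors 0.2299 and 0.3791:

```python
import numpy as np

W, b = np.array([0.7, 0.3, 0.4]), 0.4
# (xt, target y) for t = 1 and t = 2, taken from the example table
records = [(np.array([0.1, 0.4]), 0.3),
           (np.array([0.7, 0.9]), 0.5)]

s = 0.0                                      # s0 = 0
for t, (x_t, target) in enumerate(records, start=1):
    s = np.tanh(W @ np.append(x_t, s) + b)   # st = tanh(W . [xt, st-1] + b)
    y = s                                    # yt = st
    print(f"t={t}: s{t}={s:.4f}  y{t}={y:.4f}  error={y - target:.4f}")
```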
RNN
- Basic RNN
- Traditional LSTM
- GRU
Traditional LSTM
Disadvantage of the basic RNN: the basic RNN model is too "simple" and could not simulate our human brain very much. It is also not easy for the basic RNN model to converge (i.e., it may take a very long time to train the RNN model).
Traditional LSTM
Before we give the details of our brain, we want to emphasize that there is an internal state variable (i.e., variable st) to store our memory (i.e., a value). The next RNN to be described is called the LSTM (Long Short-Term Memory) model.
Traditional LSTM
It could simulate the brain process.
- Forget feature: it could "decide" to forget a portion of the internal state variable.
- Input feature: it could "decide" to input a portion of the input variable for the model, and it could "decide" the strength of the input for the model (i.e., the activation function), called the "weight" of the input.
Traditional LSTM
- Output feature: it could "decide" to output a portion of the output for the model, and it could "decide" the strength of the output for the model (i.e., the activation function), called the "weight" of the output.
Traditional LSTM
Our brain includes the following steps, each realized by a gate in the LSTM:
- Forget component: forget gate
- Input component: input gate
- Input activation component: input activation gate
- Internal state component: internal state gate
- Output component: output gate
- Final output component: final output gate
Traditional LSTM
[Figure: the unit at timestamp t is now a "Traditional LSTM" memory unit. It takes xt, yt-1 and st-1 and produces yt and st, which are passed on to timestamp t+1.]
Forget gate (with example weights Wf = [0.7, 0.3, 0.4] and bias bf = 0.4):
ft = σ(Wf · [xt, yt-1] + bf)
where σ is the sigmoid function, σ(net) = 1 / (1 + e^(-net)).
Input gate (with example weights Wi = [0.2, 0.3, 0.4] and bias bi = 0.2):
it = σ(Wi · [xt, yt-1] + bi)
Input activation gate (with example weights Wa = [0.4, 0.2, 0.1] and bias ba = 0.5):
at = tanh(Wa · [xt, yt-1] + ba)
where tanh(net) = (e^(2·net) - 1) / (e^(2·net) + 1).
Internal state gate:
st = ft · st-1 + it · at
Output gate (with example weights Wo = [0.8, 0.9, 0.2] and bias bo = 0.3):
ot = σ(Wo · [xt, yt-1] + bo)
Final output gate:
yt = ot · tanh(st)
[Figure: the complete LSTM memory unit at timestamp t, combining the forget, input, input activation, internal state, output and final output gates.]
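Putting the six gates together, here is a minimal NumPy sketch of one traditional LSTM memory unit (the weights are the example values above; the function name lstm_step and the use of NumPy are my own choices, not from the slides):

```python
import numpy as np

Wf, bf = np.array([0.7, 0.3, 0.4]), 0.4   # forget gate
Wi, bi = np.array([0.2, 0.3, 0.4]), 0.2   # input gate
Wa, ba = np.array([0.4, 0.2, 0.1]), 0.5   # input activation gate
Wo, bo = np.array([0.8, 0.9, 0.2]), 0.3   # output gate

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def lstm_step(x_t, y_prev, s_prev):
    v = np.append(x_t, y_prev)              # [xt, yt-1]
    f_t = sigmoid(Wf @ v + bf)               # forget gate
    i_t = sigmoid(Wi @ v + bi)               # input gate
    a_t = np.tanh(Wa @ v + ba)               # input activation gate
    s_t = f_t * s_prev + i_t * a_t           # internal state gate
    o_t = sigmoid(Wo @ v + bo)               # output gate
    y_t = o_t * np.tanh(s_t)                 # final output gate
    return y_t, s_t
```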
In the following, we want to compute the (weight) values in the traditional LSTM. Similar to the neural network, the traditional LSTM model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the traditional LSTM, "Error Backward Propagation" could be solved by an existing optimization tool (as in the "Neural Network").
Consider this example with two timestamps. We use the traditional LSTM to do the training.
Time    xt,1    xt,2    y
t = 1   0.1     0.4     0.3
t = 2   0.7     0.9     0.5
[Figure: the LSTM memory unit at timestamp 1. It takes x1, y0 and s0, computes f1, i1, a1, o1, s1 and y1, and passes y1 and s1 on to timestamp 2.]
Step 1 (Input Forward Propagation), t = 1 (with y0 = 0, s0 = 0; recall Wf = [0.7, 0.3, 0.4], bf = 0.4; Wi = [0.2, 0.3, 0.4], bi = 0.2; Wa = [0.4, 0.2, 0.1], ba = 0.5; Wo = [0.8, 0.9, 0.2], bo = 0.3):
f1 = σ(Wf · [x1, y0] + bf) = σ(0.7 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.4) = σ(0.59) = 0.6434
i1 = σ(Wi · [x1, y0] + bi) = σ(0.2 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.2) = σ(0.34) = 0.5842
a1 = tanh(Wa · [x1, y0] + ba) = tanh(0.4 · 0.1 + 0.2 · 0.4 + 0.1 · 0 + 0.5) = tanh(0.62) = 0.5511
s1 = f1 · s0 + i1 · a1 = 0.6434 · 0 + 0.5842 · 0.5511 = 0.3220
o1 = σ(Wo · [x1, y0] + bo) = σ(0.8 · 0.1 + 0.9 · 0.4 + 0.2 · 0 + 0.3) = σ(0.74) = 0.6770
y1 = o1 · tanh(s1) = 0.6770 · tanh(0.3220) = 0.2107
Error = y1 - y = 0.2107 - 0.3 = -0.0893
Step 1 (Input Forward Propagation), t = 2 (carrying forward y1 = 0.2107, s1 = 0.3220):
f2 = σ(Wf · [x2, y1] + bf) = σ(0.7 · 0.7 + 0.3 · 0.9 + 0.4 · 0.2107 + 0.4) = σ(1.2443) = 0.7763
i2 = σ(Wi · [x2, y1] + bi) = σ(0.2 · 0.7 + 0.3 · 0.9 + 0.4 · 0.2107 + 0.2) = σ(0.6943) = 0.6669
a2 = tanh(Wa · [x2, y1] + ba) = tanh(0.4 · 0.7 + 0.2 · 0.9 + 0.1 · 0.2107 + 0.5) = tanh(0.9811) = 0.7535
s2 = f2 · s1 + i2 · a2 = 0.7763 · 0.3220 + 0.6669 · 0.7535 = 0.7525
o2 = σ(Wo · [x2, y1] + bo) = σ(0.8 · 0.7 + 0.9 · 0.9 + 0.2 · 0.2107 + 0.3) = σ(1.7121) = 0.8471
y2 = o2 · tanh(s2) = 0.8471 · tanh(0.7525) = 0.5393
Error = y2 - y = 0.5393 - 0.5 = 0.0393
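The whole of Step 1 for the two timestamps can be reproduced with a short self-contained script (it restates the gate equations; the names are my own); it reproduces y1 = 0.2107, y2 = 0.5393 and the errors -0.0893 and 0.0393:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

Wf, bf = np.array([0.7, 0.3, 0.4]), 0.4
Wi, bi = np.array([0.2, 0.3, 0.4]), 0.2
Wa, ba = np.array([0.4, 0.2, 0.1]), 0.5
Wo, bo = np.array([0.8, 0.9, 0.2]), 0.3

records = [(np.array([0.1, 0.4]), 0.3),   # t = 1
           (np.array([0.7, 0.9]), 0.5)]   # t = 2

y, s = 0.0, 0.0                           # y0 = 0, s0 = 0
for t, (x_t, target) in enumerate(records, start=1):
    v = np.append(x_t, y)                 # [xt, yt-1]
    f = sigmoid(Wf @ v + bf)              # forget gate
    i = sigmoid(Wi @ v + bi)              # input gate
    a = np.tanh(Wa @ v + ba)              # input activation gate
    s = f * s + i * a                     # internal state gate
    o = sigmoid(Wo @ v + bo)              # output gate
    y = o * np.tanh(s)                    # final output gate
    print(f"t={t}: y{t}={y:.4f}  s{t}={s:.4f}  error={y - target:.4f}")
```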
Similar to the "neural network", the LSTM model (and the basic RNN model) could also have multiple layers and multiple memory units in each layer.
Multi-layer RNN
[Figure: the single memory unit is replaced by a network of memory units arranged in layers: an input layer (x1, ..., x5), a hidden layer of memory units, and an output layer (y1, ..., y4).]
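The slides do not tie the multi-layer architecture to any particular library. As one hedged illustration, deep-learning frameworks such as PyTorch expose it directly; the sizes below (4 memory units per layer, 2 layers) are arbitrary choices for illustration, assuming PyTorch is installed:

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers; hidden_size is the number of memory units per layer,
# input_size=2 matches the two input attributes xt,1 and xt,2.
rnn = nn.LSTM(input_size=2, hidden_size=4, num_layers=2, batch_first=True)

x = torch.tensor([[[0.1, 0.4], [0.7, 0.9]]])   # one sequence with two timestamps
output, (h_n, c_n) = rnn(x)                    # output has shape (1, 2, 4)
print(output.shape)
```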
RNN
- Basic RNN
- Traditional LSTM
- GRU
GRU
GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM model, but it is "simpler". Before we introduce GRU, let us see the properties of the traditional LSTM.
GRU
Properties of the traditional LSTM: the traditional LSTM model has a greater power of capturing the properties in the data, so the result generated by this model is usually more accurate. Besides, it could "remember" or "memorize" longer sequences.
GRU
Since the structure of GRU is simpler than that of the traditional LSTM model, it has the following advantages:
- The training time is shorter.
- It requires fewer data points to capture the properties of the data.
GRU
Different from the traditional LSTM model, the GRU model does not have an internal state variable (i.e., variable st) to store our memory (i.e., a value). Instead, it regards the "predicted" target attribute value of the previous record (with an internal operation called "reset") as a reference to store our memory.
GRU
Similarly, the GRU model simulates the brain process.
- Reset feature: it could regard the "predicted" target attribute value of the previous record as a reference to store the memory.
- Input feature: it could "decide" the strength of the input for the model (i.e., the activation function).
GRU
- Output feature: it could "combine" a portion of the "predicted" target attribute value of the previous record and a portion of the "processed" input variable. The ratio of these two portions is determined by the update feature.
GRU
Our brain includes the following steps, each realized by a gate in the GRU:
- Reset component: reset gate
- Input activation component: input activation gate
- Update component: update gate
- Final output component: final output gate
GRU
[Figure: the unit at timestamp t is now a "GRU" memory unit. It takes xt and yt-1 and produces yt; unlike the LSTM, there is no internal state variable st passed between timestamps.]
Reset gate (with example weights Wr = [0.7, 0.3, 0.4] and bias br = 0.4):
rt = σ(Wr · [xt, yt-1] + br)
Input activation gate (with example weights Wa = [0.2, 0.3, 0.4] and bias ba = 0.3):
at = tanh(Wa · [xt, rt · yt-1] + ba)
Update gate (with example weights Wu = [0.4, 0.2, 0.1] and bias bu = 0.5):
ut = σ(Wu · [xt, yt-1] + bu)
Final output gate:
yt = (1 - ut) · yt-1 + ut · at
[Figure: the complete GRU memory unit at timestamp t, combining the reset, input activation, update and final output gates.]
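Again as a sketch, the four GRU gates can be collected into a small NumPy function (the weights are the example values above; gru_step is my own name). Note that, unlike the LSTM sketch, it carries no internal state st, only the previous output yt-1:

```python
import numpy as np

Wr, br = np.array([0.7, 0.3, 0.4]), 0.4   # reset gate
Wa, ba = np.array([0.2, 0.3, 0.4]), 0.3   # input activation gate
Wu, bu = np.array([0.4, 0.2, 0.1]), 0.5   # update gate

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def gru_step(x_t, y_prev):
    r_t = sigmoid(Wr @ np.append(x_t, y_prev) + br)         # reset gate
    a_t = np.tanh(Wa @ np.append(x_t, r_t * y_prev) + ba)   # input activation gate
    u_t = sigmoid(Wu @ np.append(x_t, y_prev) + bu)         # update gate
    y_t = (1.0 - u_t) * y_prev + u_t * a_t                  # final output gate
    return y_t
```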
In the following, we want to compute the (weight) values in GRU. Similar to the neural network, GRU has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In GRU, "Error Backward Propagation" could be solved by an existing optimization tool (as in the "Neural Network").
Consider this example with two timestamps. We use GRU to do the training.
Time    xt,1    xt,2    y
t = 1   0.1     0.4     0.3
t = 2   0.7     0.9     0.5
[Figure: the GRU memory unit at timestamp 1. It takes x1 and y0, computes r1, a1, u1 and y1, and passes y1 on to timestamp 2.]
Step 1 (Input Forward Propagation), t = 1 (with y0 = 0; recall Wr = [0.7, 0.3, 0.4], br = 0.4; Wa = [0.2, 0.3, 0.4], ba = 0.3; Wu = [0.4, 0.2, 0.1], bu = 0.5):
r1 = σ(Wr · [x1, y0] + br) = σ(0.7 · 0.1 + 0.3 · 0.4 + 0.4 · 0 + 0.4) = σ(0.59) = 0.6434
a1 = tanh(Wa · [x1, r1 · y0] + ba) = tanh(0.2 · 0.1 + 0.3 · 0.4 + 0.4 · (0.6434 · 0) + 0.3) = tanh(0.44) = 0.4136
u1 = σ(Wu · [x1, y0] + bu) = σ(0.4 · 0.1 + 0.2 · 0.4 + 0.1 · 0 + 0.5) = σ(0.62) = 0.6502
y1 = (1 - u1) · y0 + u1 · a1 = (1 - 0.6502) · 0 + 0.6502 · 0.4136 = 0.2690
Error = y1 - y = 0.2690 - 0.3 = -0.0310
Step 1 (Input Forward Propagation), t = 2 (carrying forward y1 = 0.2690):
r2 = σ(Wr · [x2, y1] + br) = σ(0.7 · 0.7 + 0.3 · 0.9 + 0.4 · 0.2690 + 0.4) = σ(1.2676) = 0.7803
a2 = tanh(Wa · [x2, r2 · y1] + ba) = tanh(0.2 · 0.7 + 0.3 · 0.9 + 0.4 · (0.7803 · 0.2690) + 0.3) = tanh(0.7940) = 0.6606
u2 = σ(Wu · [x2, y1] + bu) = σ(0.4 · 0.7 + 0.2 · 0.9 + 0.1 · 0.2690 + 0.5) = σ(0.9869) = 0.7285
y2 = (1 - u2) · y1 + u2 · a2 = (1 - 0.7285) · 0.2690 + 0.7285 · 0.6606 = 0.5543
Error = y2 - y = 0.5543 - 0.5 = 0.0543
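A self-contained sketch of Step 1 on the two-timestamp example (names are my own); it reproduces y1 = 0.2690, y2 = 0.5543 and the errors -0.0310 and 0.0543:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

Wr, br = np.array([0.7, 0.3, 0.4]), 0.4
Wa, ba = np.array([0.2, 0.3, 0.4]), 0.3
Wu, bu = np.array([0.4, 0.2, 0.1]), 0.5

records = [(np.array([0.1, 0.4]), 0.3),   # t = 1
           (np.array([0.7, 0.9]), 0.5)]   # t = 2

y = 0.0                                   # y0 = 0
for t, (x_t, target) in enumerate(records, start=1):
    r = sigmoid(Wr @ np.append(x_t, y) + br)        # reset gate
    a = np.tanh(Wa @ np.append(x_t, r * y) + ba)    # input activation gate
    u = sigmoid(Wu @ np.append(x_t, y) + bu)        # update gate
    y = (1.0 - u) * y + u * a                       # final output gate
    print(f"t={t}: y{t}={y:.4f}  error={y - target:.4f}")
```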
Similar to the "neural network", GRU could also have multiple layers and multiple memory units in each layer.
Multi-layer RNN
[Figure: the same multi-layer architecture as before, now with GRU memory units: an input layer (x1, ..., x5), a hidden layer of memory units, and an output layer (y1, ..., y4).]
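As with the LSTM, a hedged PyTorch illustration of a multi-layer GRU (again, the sizes are arbitrary choices, assuming PyTorch is installed):

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=2, hidden_size=4, num_layers=2, batch_first=True)
x = torch.tensor([[[0.1, 0.4], [0.7, 0.9]]])   # one sequence with two timestamps
output, h_n = gru(x)                           # output has shape (1, 2, 4)
```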