Other Classification Models: Recurrent Neural Network (RNN)
COMP5331. Prepared by Raymond Wong. Presented by Raymond Wong.
Other Classification Models
Support Vector Machine (SVM)
Neural Network
Recurrent Neural Network (RNN)
Neural Network
A neural network takes the input attributes x1 and x2 of a record and produces the output attribute y.
We train the model record by record, starting from the first record, then the second, and so on. After the last record, we train the model with the first record again.
Here, training the model with one record is "independent" of training the model with another record. This means that we assume the records in the table are "independent".
In some cases, the current record is "related" to the "previous" records in the table; that is, the records in the table are "dependent". We also want to capture this "dependency" in the model. We can use a new model called the "recurrent neural network" for this purpose.
Recall the neural network, which takes input attributes x1 and x2 and produces the output attribute y. For record 1, we write the input attributes as a vector x1 = (x1,1, x1,2) and the output attribute as y1. In this notation, the neural network takes an input vector x1 and produces an output y1.
Recurrent Neural Network
A recurrent neural network (RNN) is a neural network with a loop. The RNN takes an input vector x1 and produces an output attribute y1, and the loop feeds information from one timestamp to the next.
Unfolded Representation of RNN
Unfolding the loop over time gives one copy of the RNN per timestamp: at timestamp 1 the RNN takes x1 and produces y1, at timestamp 2 it takes x2 and produces y2, ..., and at timestamp t it takes xt and produces yt.
Internal State Variable
Each copy of the RNN keeps an internal state variable. At timestamp t, the RNN takes the input xt and the previous internal state st-1, and produces the output yt and the new internal state st, which is passed on to timestamp t+1.
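To make the unfolded picture concrete, here is a minimal sketch in Python of applying a memory unit over the timestamps. The names are illustrative only, and the state s is simply whatever the particular memory unit carries from one timestamp to the next.

```python
def run_rnn(step, inputs, s0=0.0):
    """Apply one memory unit over a sequence of input vectors.

    step(x_t, s_prev) must return (y_t, s_t); s can be a single value
    or a tuple, depending on the RNN variant.
    Returns the outputs y_1, ..., y_T.
    """
    outputs = []
    s = s0                      # internal state passed between timestamps
    for x_t in inputs:
        y_t, s = step(x_t, s)
        outputs.append(y_t)
    return outputs
```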
Limitation
The RNN may need to "memorize" a lot of past events/values. Due to its more complex structure, it is more time-consuming to train.
There are three RNN models covered here:
Basic RNN
Traditional LSTM
GRU
Basic RNN
The basic RNN is very simple. It contains only one single activation function (e.g., "tanh" or "ReLU").
The memory unit of the basic RNN at timestamp t takes the input xt and the previous internal state st-1, applies a single activation function (usually "tanh" or "ReLU"), and produces the output yt and the new internal state st, which is passed on to timestamp t+1.
With parameters W = [0.7, 0.3, 0.4] and b = 0.4, the basic RNN computes
st = tanh(W . [xt, st-1] + b)
yt = st
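As an illustration, here is a minimal sketch of this memory unit in Python. The weight layout, with W acting on [xt,1, xt,2, st-1], is an assumption that matches the worked example below.

```python
import math

W = [0.7, 0.3, 0.4]   # weights for [x_t1, x_t2, s_{t-1}]
b = 0.4               # bias

def basic_rnn_step(x_t, s_prev):
    """One timestamp of the basic RNN: s_t = tanh(W . [x_t, s_{t-1}] + b), y_t = s_t."""
    net = W[0] * x_t[0] + W[1] * x_t[1] + W[2] * s_prev + b
    s_t = math.tanh(net)
    y_t = s_t
    return y_t, s_t
```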
In the following, we want to compute the (weight) values in the basic RNN. Similar to the neural network, the basic RNN model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the basic RNN, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").
Consider this example with two timestamps. We use the basic RNN to do the training.

Time   xt,1   xt,2   y
t=1    0.1    0.4    0.3
t=2    0.7    0.9    0.5
Step 1 (Input Forward Propagation), with y0 = 0, s0 = 0, W = [0.7, 0.3, 0.4], and b = 0.4.

When t = 1:
s1 = tanh(W . [x1, s0] + b) = tanh(0.7(0.1) + 0.3(0.4) + 0.4(0) + 0.4) = tanh(0.59) ≈ 0.5299
y1 = s1 ≈ 0.5299
Error = y1 - y = 0.5299 - 0.3 = 0.2299

When t = 2:
s2 = tanh(W . [x2, s1] + b) = tanh(0.7(0.7) + 0.3(0.9) + 0.4(0.5299) + 0.4) ≈ tanh(1.3720) ≈ 0.8791
y2 = s2 ≈ 0.8791
Error = y2 - y ≈ 0.8791 - 0.5 = 0.3791
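The short sketch below reproduces this forward pass, assuming y0 = s0 = 0 and the W and b given above.

```python
import math

W, b = [0.7, 0.3, 0.4], 0.4
records = [([0.1, 0.4], 0.3),   # (x_t, target y) for t = 1
           ([0.7, 0.9], 0.5)]   # for t = 2

s = 0.0                          # assume s_0 = 0
for t, (x_t, target) in enumerate(records, start=1):
    s = math.tanh(W[0] * x_t[0] + W[1] * x_t[1] + W[2] * s + b)   # s_t
    y_t = s                                                       # y_t = s_t
    print(f"t={t}: y_t = {y_t:.4f}, error = {y_t - target:+.4f}")
# Prints roughly y_1 = 0.5299 (error +0.2299) and y_2 = 0.8791 (error +0.3791).
```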
Traditional LSTM
Disadvantage of the basic RNN: the basic RNN model is too "simple"; it does not simulate our human brain very closely. It is also not easy for the basic RNN model to converge (i.e., it may take a very long time to train the model).
Before we give the details of this "brain", we want to emphasize that there is an internal state variable (i.e., variable st) to store our memory (i.e., a value). The next RNN to be described is the LSTM (Long Short-Term Memory) model.
The LSTM simulates the brain process with the following features.
Forget Feature: it can "decide" to forget a portion of the internal state variable.
Input Feature: it can "decide" to input a portion of the input variable into the model, and it can "decide" the strength of the input (via the activation function), called the "weight" of the input.
Output Feature: it can "decide" to output a portion of the output of the model, and it can "decide" the strength of the output (via the activation function), called the "weight" of the output.
This "brain" includes the following steps (gates):
Forget component (forget gate)
Input component (input gate)
Input activation component (input activation gate)
Internal state component (internal state gate)
Output component (output gate)
Final output component (final output gate)
Like the basic RNN, the traditional LSTM has a memory unit at each timestamp t. The unit takes the input xt, the previous output yt-1, and the previous internal state st-1, and produces the output yt and the new internal state st, which are passed on to timestamp t+1.
Forget gate: ft = σ(Wf . [xt, yt-1] + bf), with bf = 0.4.
Here σ is the sigmoid function: σ(net) = 1 / (1 + e^(-net)).
Input gate: it = σ(Wi . [xt, yt-1] + bi), with Wi = [0.2, 0.3, 0.4] and bi = 0.2.
Input activation gate: at = tanh(Wa . [xt, yt-1] + ba), with ba = 0.5.
Here tanh(net) = (e^(2 net) - 1) / (e^(2 net) + 1).
Internal state gate: st = ft . st-1 + it . at
Output gate: ot = σ(Wo . [xt, yt-1] + bo), with bo = 0.3.
Final output gate: yt = ot . tanh(st)
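Putting the six gate equations together, here is a minimal sketch of one LSTM timestamp in Python. The parameter dictionary and the weight layout, with each weight vector acting on [xt,1, xt,2, yt-1], are assumptions consistent with the formulas above; concrete weight values would come from training.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def lstm_step(x_t, y_prev, s_prev, p):
    """One timestamp of the traditional LSTM.

    p holds weight vectors Wf, Wi, Wa, Wo (each of length 3, acting on
    [x_t1, x_t2, y_{t-1}]) and biases bf, bi, ba, bo.
    """
    v = [x_t[0], x_t[1], y_prev]
    f_t = sigmoid(dot(p["Wf"], v) + p["bf"])    # forget gate
    i_t = sigmoid(dot(p["Wi"], v) + p["bi"])    # input gate
    a_t = math.tanh(dot(p["Wa"], v) + p["ba"])  # input activation gate
    s_t = f_t * s_prev + i_t * a_t              # internal state gate
    o_t = sigmoid(dot(p["Wo"], v) + p["bo"])    # output gate
    y_t = o_t * math.tanh(s_t)                  # final output gate
    return y_t, s_t
```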
In the following, we want to compute the (weight) values in the traditional LSTM. Similar to the neural network, the traditional LSTM model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the traditional LSTM, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").
Consider the same example with two timestamps. We use the traditional LSTM to do the training.

Time   xt,1   xt,2   y
t=1    0.1    0.4    0.3
t=2    0.7    0.9    0.5
When t = 1, the LSTM memory unit computes f1, i1, a1, s1, o1, and y1 from x1, y0, and s0.
Step 1 (Input Forward Propagation), with y0 = 0, s0 = 0, and the parameters above (Wi = [0.2, 0.3, 0.4], bf = 0.4, bi = 0.2, ba = 0.5, bo = 0.3).

When t = 1:
f1 = σ(Wf . [x1, y0] + bf) = σ(0.59) ≈ 0.6434
i1 = σ(Wi . [x1, y0] + bi) = σ(0.2(0.1) + 0.3(0.4) + 0.4(0) + 0.2) = σ(0.34) ≈ 0.5842
a1 = tanh(Wa . [x1, y0] + ba) = tanh(0.62) ≈ 0.5511
s1 = f1 . s0 + i1 . a1 = 0.6434(0) + 0.5842(0.5511) ≈ 0.3220
o1 = σ(Wo . [x1, y0] + bo) = σ(0.74) ≈ 0.6770
y1 = o1 . tanh(s1) = 0.6770 . tanh(0.3220) ≈ 0.2107
Error = y1 - y = 0.2107 - 0.3 = -0.0893
When t = 2:
f2 = σ(Wf . [x2, y1] + bf) = σ(1.2443) ≈ 0.7763
i2 = σ(Wi . [x2, y1] + bi) = σ(0.2(0.7) + 0.3(0.9) + 0.4(0.2107) + 0.2) ≈ σ(0.6943) ≈ 0.6669
a2 = tanh(Wa . [x2, y1] + ba) = tanh(0.9811) ≈ 0.7535
s2 = f2 . s1 + i2 . a2 = 0.7763(0.3220) + 0.6669(0.7535) ≈ 0.7525
o2 = σ(Wo . [x2, y1] + bo) = σ(1.7121) ≈ 0.8471
y2 = o2 . tanh(s2) = 0.8471 . tanh(0.7525) ≈ 0.5393
Error = y2 - y ≈ 0.5393 - 0.5 = 0.0393
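The sketch below reproduces these numbers by plugging in the gate pre-activations listed above (0.59, 0.34, 0.62, 0.74 for t = 1 and 1.2443, 0.6943, 0.9811, 1.7121 for t = 2), since the full weight vectors Wf, Wa, and Wo are not shown in the text.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Gate pre-activations (W . [x_t, y_{t-1}] + b) taken from the worked example.
nets = {1: dict(f=0.59,   i=0.34,   a=0.62,   o=0.74),
        2: dict(f=1.2443, i=0.6943, a=0.9811, o=1.7121)}
targets = {1: 0.3, 2: 0.5}

y, s = 0.0, 0.0                     # y_0 = 0, s_0 = 0
for t in (1, 2):
    n = nets[t]
    f_t = sigmoid(n["f"])           # forget gate
    i_t = sigmoid(n["i"])           # input gate
    a_t = math.tanh(n["a"])         # input activation gate
    s = f_t * s + i_t * a_t         # internal state gate
    o_t = sigmoid(n["o"])           # output gate
    y = o_t * math.tanh(s)          # final output gate
    print(f"t={t}: s_t = {s:.4f}, y_t = {y:.4f}, error = {y - targets[t]:+.4f}")
# Prints roughly s_1 = 0.3220, y_1 = 0.2107 and s_2 = 0.7525, y_2 = 0.5393.
```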
Multi-layer RNN
Similar to the neural network, the LSTM model (and the basic RNN model) can also have multiple layers, with multiple memory units in each layer. In a multi-layer RNN, the input layer feeds one or more hidden layers of memory units, and the hidden layers feed the output layer (e.g., inputs x1, ..., x5 and outputs y1, ..., y4).
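To illustrate the multi-layer idea, here is a rough sketch in Python that feeds the outputs of one layer of basic-RNN memory units into the next layer. The layer sizes, the random weights, and the choice of the basic RNN as the unit are assumptions for illustration only.

```python
import math, random

def rnn_unit_step(inputs, s_prev, W, b):
    """One basic-RNN memory unit: y_t = s_t = tanh(W . [inputs, s_{t-1}] + b)."""
    net = sum(w * v for w, v in zip(W, inputs + [s_prev])) + b
    return math.tanh(net)

def multilayer_step(x_t, states, layers):
    """One timestamp of a stacked RNN: each layer's outputs feed the next layer."""
    inputs = x_t
    new_states = []
    for (weights, biases), layer_state in zip(layers, states):
        outputs = [rnn_unit_step(inputs, s, W, b)
                   for W, b, s in zip(weights, biases, layer_state)]
        new_states.append(outputs)        # for the basic RNN, y_t = s_t
        inputs = outputs
    return inputs, new_states             # outputs of the last layer, new states

# Hypothetical sizes: 2 input attributes -> hidden layer of 3 units -> 1 output unit.
random.seed(0)
layers, in_size = [], 2
for layer_size in (3, 1):
    weights = [[random.uniform(-0.5, 0.5) for _ in range(in_size + 1)]
               for _ in range(layer_size)]
    biases = [0.0] * layer_size
    layers.append((weights, biases))
    in_size = layer_size

states = [[0.0] * 3, [0.0]]               # one internal state per memory unit
y, states = multilayer_step([0.1, 0.4], states, layers)
print(y)                                  # output of the single output unit
```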
GRU
GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM model, but simpler. Before we introduce GRU, let us look at the properties of the traditional LSTM.
Properties of the traditional LSTM: the traditional LSTM model has greater power to capture the properties of the data, so the result generated by this model is usually more accurate. Besides, it can "remember" (or "memorize") longer sequences.
Since the structure of GRU is simpler than that of the traditional LSTM model, it has the following advantages:
The training time is shorter.
It requires fewer data points to capture the properties of the data.
Different from the traditional LSTM model, the GRU model does not have an internal state variable (i.e., a variable st) to store our memory (i.e., a value). Instead, it uses the "predicted" target attribute value of the previous record (with an internal operation called "reset") as a reference to store our memory.
Similarly, the GRU model simulates the brain process with the following features.
Reset Feature: it uses the "predicted" target attribute value of the previous record as a reference to store the memory.
Input Feature: it can "decide" the strength of the input for the model (via the activation function).
Output Feature: it can "combine" a portion of the "predicted" target attribute value of the previous record with a portion of the "processed" input variable; the ratio of these two portions is determined by the update feature.
This "brain" includes the following steps (gates):
Reset component (reset gate)
Input activation component (input activation gate)
Update component (update gate)
Final output component (final output gate)
Unlike the LSTM memory unit, the GRU memory unit at timestamp t takes only the input xt and the previous output yt-1 (there is no internal state variable st), and produces the output yt, which is passed on to timestamp t+1.
Reset gate: rt = σ(Wr . [xt, yt-1] + br), with Wr = [0.7, 0.3, 0.4] and br = 0.4.
Input activation gate: at = tanh(Wa . [xt, rt . yt-1] + ba), with ba = 0.3.
Update gate: ut = σ(Wu . [xt, yt-1] + bu), with bu = 0.5.
Final output gate: yt = (1 - ut) . yt-1 + ut . at
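Combining the four gate equations, here is a minimal sketch of one GRU timestamp in Python. The parameter names and the assumption that each weight vector acts on [xt,1, xt,2, yt-1] (with yt-1 scaled by rt for the input activation gate) follow the formulas above.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

def gru_step(x_t, y_prev, p):
    """One timestamp of the GRU.

    p holds weight vectors Wr, Wa, Wu (each of length 3) and biases br, ba, bu.
    """
    v = [x_t[0], x_t[1], y_prev]
    r_t = sigmoid(dot(p["Wr"], v) + p["br"])          # reset gate
    v_reset = [x_t[0], x_t[1], r_t * y_prev]
    a_t = math.tanh(dot(p["Wa"], v_reset) + p["ba"])  # input activation gate
    u_t = sigmoid(dot(p["Wu"], v) + p["bu"])          # update gate
    y_t = (1.0 - u_t) * y_prev + u_t * a_t            # final output gate
    return y_t
```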
In the following, we want to compute the (weight) values in GRU. Similar to the neural network, GRU has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In GRU, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").
Consider the same example with two timestamps. We use GRU to do the training.

Time   xt,1   xt,2   y
t=1    0.1    0.4    0.3
t=2    0.7    0.9    0.5
When t = 1, the GRU memory unit computes r1, a1, u1, and y1 from x1 and y0.
Step 1 (Input Forward Propagation), with y0 = 0 and the parameters above (Wr = [0.7, 0.3, 0.4], br = 0.4, ba = 0.3, bu = 0.5).

When t = 1:
r1 = σ(Wr . [x1, y0] + br) = σ(0.7(0.1) + 0.3(0.4) + 0.4(0) + 0.4) = σ(0.59) ≈ 0.6434
a1 = tanh(Wa . [x1, r1 . y0] + ba) = tanh(0.44) ≈ 0.4136
u1 = σ(Wu . [x1, y0] + bu) = σ(0.62) ≈ 0.6502
y1 = (1 - u1) . y0 + u1 . a1 = (1 - 0.6502)(0) + 0.6502(0.4136) ≈ 0.2690
Error = y1 - y = 0.2690 - 0.3 = -0.0310
When t = 2:
r2 = σ(Wr . [x2, y1] + br) = σ(0.7(0.7) + 0.3(0.9) + 0.4(0.2690) + 0.4) = σ(1.2676) ≈ 0.7803
a2 = tanh(Wa . [x2, r2 . y1] + ba) = tanh(0.7940) ≈ 0.6607
u2 = σ(Wu . [x2, y1] + bu) = σ(0.9869) ≈ 0.7285
y2 = (1 - u2) . y1 + u2 . a2 = (1 - 0.7285)(0.2690) + 0.7285(0.6607) ≈ 0.554
Error = y2 - y ≈ 0.554 - 0.5 = 0.054
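As with the LSTM, the sketch below reproduces the GRU forward pass from the gate pre-activations listed above (0.59, 0.44, 0.62 for t = 1 and 1.2676, 0.7940, 0.9869 for t = 2), since Wa and Wu are not shown in the text.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Gate pre-activations taken from the worked example above.
nets = {1: dict(r=0.59,   a=0.44,   u=0.62),
        2: dict(r=1.2676, a=0.7940, u=0.9869)}
targets = {1: 0.3, 2: 0.5}

y = 0.0                                   # y_0 = 0
for t in (1, 2):
    n = nets[t]
    r_t = sigmoid(n["r"])                 # reset gate (already folded into the
                                          # given pre-activation of a_t below)
    a_t = math.tanh(n["a"])               # input activation gate
    u_t = sigmoid(n["u"])                 # update gate
    y = (1.0 - u_t) * y + u_t * a_t       # final output gate
    print(f"t={t}: y_t = {y:.4f}, error = {y - targets[t]:+.4f}")
# Prints roughly y_1 = 0.2690 and y_2 = 0.554.
```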
Similar to the neural network, GRU can also have multiple layers, with multiple memory units in each layer, as in the multi-layer RNN shown earlier: an input layer, one or more hidden layers of memory units, and an output layer.