Other Classification Models: Recurrent Neural Network (RNN)


1 Other Classification Models: Recurrent Neural Network (RNN)
COMP5331. Prepared and presented by Raymond Wong.

2 Other Classification Models
Support Vector Machine (SVM)
Neural Network
Recurrent Neural Network (RNN)

3 Neural Network
We train the model starting from the first record.
[Diagram: a neural network with input attributes x1 and x2 and output attribute y, alongside a training table with columns x1, x2 and d.]

4 Neural Network
[Diagram: a record's input attributes are fed into the network, which produces output y.]

5 Neural Network
[Diagram: the neural network and the training table.]

6 Neural Network
[Diagram: the neural network and the training table.]

7 Neural Network
We train the model with the first record again.
[Diagram: the same network and training table.]

8 Neural Network
Here, training the model with one record is "independent" of training the model with another record. This means that we assume the records in the table are "independent".

9 In some cases, the current record is "related" to the "previous" records in the table. Thus, the records in the table are "dependent". We want to capture this "dependency" in the model, and we can use a new model called the "recurrent neural network" for this purpose.

10 Neural Network
[Diagram: the neural network with input attributes x1, x2 and output attribute y.]

11 Neural Network
[Diagram: record 1 is represented as an input vector (x1,1, x1,2) fed to the network, producing output y1.]

12 Neural Network
[Diagram: the network redrawn with a single input vector x1 and output attribute y1.]

13 Recurrent Neural Network
A recurrent neural network (RNN) is a neural network with a loop.
[Diagram: the network with input vector x1, output y1, and a feedback loop.]

14 Recurrent Neural Network
[Diagram: the RNN drawn as a single block taking input vector x1 and producing output y1.]

15 Unfolded representation of RNN
[Diagram: the RNN unfolded over time; at timestamp 1 it maps x1 to y1, at timestamp 2 it maps x2 to y2, at timestamp 3 it maps x3 to y3, and in general at timestamp t it maps xt to yt.]

16 Internal state variable
[Diagram: the unfolded RNN with an internal state variable passed between units; the unit at timestamp t receives st-1 from timestamp t-1, produces output yt, and passes st on to timestamp t+1.]
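In code, this unfolded view is simply the same memory unit applied at every timestamp, with the internal state carried forward. A minimal sketch (the step function and names here are illustrative, not taken from the slides):

```python
def run_rnn(step, xs, s0=0.0):
    """Apply the same memory unit `step` at every timestamp, carrying the
    internal state forward -- this is exactly the unfolded view above."""
    s = s0
    ys = []
    for x in xs:            # x1, x2, ..., xt
        y, s = step(x, s)   # the unit takes xt and st-1 and returns yt and st
        ys.append(y)
    return ys
```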

17 [Diagram: two consecutive RNN units; the unit at timestamp t-1 takes xt-1 and produces yt-1 and state st-1, which is passed to the unit at timestamp t, which takes xt and produces yt and st.]

18 Limitation
Because the RNN may "memorize" a lot of past events/values and therefore has a more complex structure, it is more time-consuming to train.

19 RNN
Basic RNN
Traditional LSTM
GRU

20 Basic RNN
The basic RNN is very simple. It contains only a single activation function (e.g., "tanh" or "ReLU").

21 [Diagram: two consecutive RNN units passing the state st-1 from timestamp t-1 to timestamp t.]

22 [Diagram: the same picture with the unit at each timestamp labelled "Basic RNN".]

23 [Diagram: the basic RNN unit at timestamp t shown as a memory unit that takes xt and st-1 and produces yt and st.]

24 Basic RNN
[Diagram: inside the memory unit there is a single activation function, usually "tanh" or "ReLU".]

25 Basic RNN
The memory unit computes
st = tanh(W · [xt, st-1] + b)
yt = st
Example weights: W = [0.7, 0.3, 0.4], b = 0.4.

26 In the following, we want to compute the (weight) values in the basic RNN.
Similar to the neural network, the basic RNN model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the basic RNN, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

27 Consider this example with two timestamps.
Time | xt,1 | xt,2 | y
t=1  | 0.1  | 0.4  | 0.3
t=2  | 0.7  | 0.9  | 0.5
We use the basic RNN to do the training.

28 Basic RNN
When t = 1:
[Diagram: the unfolded basic RNN with x1 fed in at timestamp 1 (using y0 and s0 as the previous output and state) to produce y1 and s1, and x2 fed in at timestamp 2 to produce y2.]
st = tanh(W · [xt, st-1] + b)
yt = st

29 Step 1 (Input Forward Propagation)
st = tanh(W · [xt, st-1] + b), yt = st, with W = [0.7, 0.3, 0.4], b = 0.4, and y0 = s0 = 0.
s1 = tanh(W · [x1, s0] + b) = tanh(0.7 × 0.1 + 0.3 × 0.4 + 0.4 × 0 + 0.4) = tanh(0.59) = 0.5299
y1 = s1 = 0.5299
Error = y1 - y = 0.5299 - 0.3 = 0.2299

30 Step 1 (Input Forward Propagation)
So far: y1 = 0.5299, s1 = 0.5299.

31 Step 1 (Input Forward Propagation)
s2 = tanh(W · [x2, s1] + b) = tanh(0.7 × 0.7 + 0.3 × 0.9 + 0.4 × 0.5299 + 0.4) = tanh(1.3720) ≈ 0.8791
y2 = s2 ≈ 0.8791
Error = y2 - y ≈ 0.8791 - 0.5 = 0.3791
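A short script in the same spirit reproduces the two forward-propagation steps above, starting from y0 = s0 = 0; it should match the slides' intermediate values s1 = tanh(0.59) = 0.5299 and s2 = tanh(1.3720) (a sketch of the slides' computation, not a training procedure):

```python
import math

W, b = [0.7, 0.3, 0.4], 0.4          # example weights from the slides
data = [([0.1, 0.4], 0.3),           # (x1, target y) at t = 1
        ([0.7, 0.9], 0.5)]           # (x2, target y) at t = 2

s = 0.0                              # s0 = 0 (and y0 = 0)
for t, (x, target) in enumerate(data, start=1):
    s = math.tanh(W[0] * x[0] + W[1] * x[1] + W[2] * s + b)   # st
    y = s                                                     # yt = st
    print(f"t={t}: s={s:.4f}, y={y:.4f}, error={y - target:.4f}")
```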

32 RNN
Basic RNN
Traditional LSTM
GRU

33 Traditional LSTM
Disadvantage of the basic RNN:
The basic RNN model is too "simple", so it cannot simulate our human brain very much. It is also not easy for the basic RNN model to converge (i.e., it may take a very long time to train the RNN model).

34 Traditional LSTM
Before we give the details of this brain-like model, we want to emphasize that there is an internal state variable (i.e., variable st) to store our memory (i.e., a value). The next RNN to be described is called the LSTM (Long Short-Term Memory) model.

35 Traditional LSTM
It could simulate the brain process.
Forget feature: it could "decide" to forget a portion of the internal state variable.
Input feature: it could "decide" to input a portion of the input variable for the model, and "decide" the strength of the input for the model (i.e., the activation function), called the "weight" of the input.

36 Traditional LSTM
Output feature: it could "decide" to output a portion of the output for the model, and "decide" the strength of the output for the model (i.e., the activation function), called the "weight" of the output.

37 Traditional LSTM
Our brain includes the following steps (each implemented as a gate):
Forget component (forget gate)
Input component (input gate)
Input activation component (input activation gate)
Internal state component (internal state gate)
Output component (output gate)
Final output component (final output gate)

38 [Diagram: two consecutive RNN units passing the state from timestamp t-1 to timestamp t.]

39 [Diagram: the same picture with the unit at each timestamp labelled "Traditional LSTM".]

40 [Diagram: the traditional LSTM unit at timestamp t shown as a memory unit that takes xt, yt-1 and st-1 and produces yt and st.]

41 Traditional LSTM
Forget gate: ft = σ(Wf · [xt, yt-1] + bf)
Sigmoid function: σ(net) = 1 / (1 + e^(-net))
Example bias: bf = 0.4.

42 Traditional LSTM
Input gate: it = σ(Wi · [xt, yt-1] + bi)
Example weights: Wi = [0.2, 0.3, 0.4], bi = 0.2.

43 Traditional LSTM
Input activation gate: at = tanh(Wa · [xt, yt-1] + ba)
tanh function: tanh(net) = (e^(2 net) - 1) / (e^(2 net) + 1)
Example bias: ba = 0.5.
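The two activation functions used by the gates can be written down directly from these definitions (a trivial sketch; math.tanh could of course be used instead of the explicit formula):

```python
import math

def sigmoid(net):
    """Sigmoid: 1 / (1 + e^(-net)), used by the forget, input and output gates."""
    return 1.0 / (1.0 + math.exp(-net))

def tanh(net):
    """tanh: (e^(2 net) - 1) / (e^(2 net) + 1), used by the input activation gate."""
    return (math.exp(2 * net) - 1) / (math.exp(2 * net) + 1)
```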

44 Traditional LSTM
[Diagram: the memory unit so far, with the forget gate ft, the input gate it and the input activation gate at.]

45 Traditional LSTM
Internal state gate: st = ft · st-1 + it · at

46 Traditional LSTM
[Diagram: the memory unit so far, now also updating the internal state st.]

47 Traditional LSTM
Output gate: ot = σ(Wo · [xt, yt-1] + bo)
Example bias: bo = 0.3.

48 Traditional LSTM
Final output gate: yt = ot · tanh(st)

49 Traditional LSTM
Putting the gates together, the memory unit computes:
ft = σ(Wf · [xt, yt-1] + bf)
it = σ(Wi · [xt, yt-1] + bi)
at = tanh(Wa · [xt, yt-1] + ba)
st = ft · st-1 + it · at
ot = σ(Wo · [xt, yt-1] + bo)
yt = ot · tanh(st)
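Putting the six equations together, one step of the traditional LSTM memory unit can be sketched as follows (the helper names and the assumption of a two-value input vector are ours, not the slides'):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def affine(w, b, x_t, y_prev):
    """Evaluate w . [xt,1, xt,2, yt-1] + b for a two-value input vector."""
    return w[0] * x_t[0] + w[1] * x_t[1] + w[2] * y_prev + b

def lstm_step(x_t, y_prev, s_prev, Wf, bf, Wi, bi, Wa, ba, Wo, bo):
    """One step of the traditional LSTM, following the slide's six equations."""
    f_t = sigmoid(affine(Wf, bf, x_t, y_prev))    # forget gate
    i_t = sigmoid(affine(Wi, bi, x_t, y_prev))    # input gate
    a_t = math.tanh(affine(Wa, ba, x_t, y_prev))  # input activation gate
    s_t = f_t * s_prev + i_t * a_t                # internal state gate
    o_t = sigmoid(affine(Wo, bo, x_t, y_prev))    # output gate
    y_t = o_t * math.tanh(s_t)                    # final output gate
    return y_t, s_t
```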

50 In the following, we want to compute the (weight) values in the traditional LSTM.
Similar to the neural network, the traditional LSTM model has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In the traditional LSTM, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

51 Consider this example with two timestamps.
Time | xt,1 | xt,2 | y
t=1  | 0.1  | 0.4  | 0.3
t=2  | 0.7  | 0.9  | 0.5
We use the traditional LSTM to do the training.

52 Traditional LSTM
When t = 1:
[Diagram: the traditional LSTM memory unit at timestamp 1, taking x1 together with y0 and s0 and computing f1, i1, a1, o1, s1 and y1; the unit at timestamp 2 then takes x2.]

53 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 f1 = (Wf [x1, y0] + bf) = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.59) = f1 = RNN

54 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 i1 = (Wi [x1, y0] + bi) = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.34) = f1 = i1 = RNN

55 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 a1 = tanh(Wa [x1, y0] + ba) = tanh( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = tanh( ) = tanh(0.62) = f1 = i1 = a1 = RNN

56 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 s1 = f1 . s0 + i1 . a1 = Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f1 = i1 = a1 = s1 = RNN

57 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 o1 = (Wo [x1, y0] + bo) = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.74) = f1 = i1 = a1 = s1 = o1 = RNN

58 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 = o1 . tanh(s1) = tanh(0.3220) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f1 = i1 = a1 = s1 = o1 = RNN y1 =

59 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 Error = y1 - y = – 0.3 Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f1 = i1 = a1 = s1 = o1 = RNN y1 =

60 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 0.2107 0.3220 Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 f1 = i1 = a1 = s1 = o1 = RNN y1 =

61 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 f2 = (Wf [x2, y1] + bf) 0.2107 0.3220 = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (1.2443) = f2 = RNN

62 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 i2 = (Wi [x2, y1] + bi) 0.2107 0.3220 = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (0.6943) = f2 = i2 = RNN

63 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 a2 = tanh(Wa [x2, y1] + ba) 0.2107 0.3220 = tanh( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = tanh( ) = tanh(0.9811) = f2 = i2 = a2 = RNN

64 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 s2 = f2 . s1 + i2 . a2 0.2107 0.3220 = Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f2 = i2 = a2 = s2 = RNN

65 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 o2 = (Wo [x2, y1] + bo) 0.2107 0.3220 = ( ) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = ( ) = (1.7121) = f2 = i2 = a2 = s2 = o2 = RNN

66 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 y2 = o2 . tanh(s2) 0.2107 0.3220 = tanh(0.7525) Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f2 = i2 = a2 = s2 = o2 = RNN y2 =

67 ft = (Wf [xt, yt-1] + bf) it = (Wi [xt, yt-1] + bi) at = tanh(Wa [xt, yt-1] + ba) st = ft . st-1 + it . at ot = (Wo [xt, yt-1] + bo) yt = ot . tanh(st) Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 s0 y1 s1 0.2107 0.3220 Error = y2 - y = – 0.5 Wf = Wi = Wa = Wo = bf = 0.4 bi = 0.2 ba = 0.5 bo = 0.3 = f2 = i2 = a2 = s2 = o2 = RNN y2 =

68 Similar to the "neural network", the LSTM model (and the basic RNN model) could also have multiple layers and have multiple memory units in each layer.

69 Multi-layer RNN
[Diagram: the RNN with inputs x1, x2 and output y, drawn with its memory unit.]

70 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

71 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

72 Multi-layer RNN
[Diagram: a multi-layer RNN with an input layer (x1, ..., x5), a hidden layer of memory units, and an output layer (y1, ..., y4); see the sketch below.]
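As a rough sketch, "multiple layers" just means that the output sequence of one layer of memory units becomes the input sequence of the next layer (the function names and the single-value signals here are illustrative simplifications, not the slides' notation):

```python
def run_layer(step, xs, s0=0.0):
    """Run one layer's memory unit over the whole input sequence."""
    s, ys = s0, []
    for x in xs:
        y, s = step(x, s)     # each unit returns its output and its new state
        ys.append(y)
    return ys

def run_multilayer(layer_steps, xs):
    """Feed the output sequence of each layer into the next layer."""
    seq = xs
    for step in layer_steps:  # e.g. [layer1_step, layer2_step, ...]
        seq = run_layer(step, seq)
    return seq
```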

73 RNN
Basic RNN
Traditional LSTM
GRU

74 GRU
GRU (Gated Recurrent Unit) is a variation of the traditional LSTM model. Its structure is similar to that of the traditional LSTM model, but "simpler". Before we introduce GRU, let us look at the properties of the traditional LSTM.

75 GRU
Properties of the traditional LSTM:
The traditional LSTM model has a greater power of capturing the properties of the data, so the result generated by this model is usually more accurate. Besides, it could "remember" or "memorize" longer sequences.

76 GRU
Since the structure of GRU is simpler than that of the traditional LSTM model, it has the following advantages:
The training time is shorter.
It requires fewer data points to capture the properties of the data.

77 GRU
Different from the traditional LSTM model, the GRU model does not have an internal state variable (i.e., variable st) to store our memory (i.e., a value). Instead, it regards the "predicted" target attribute value of the previous record (with an internal operation called "reset") as a reference to store our memory.

78 GRU
Similarly, the GRU model simulates the brain process.
Reset feature: it could regard the "predicted" target attribute value of the previous record as a reference to store the memory.
Input feature: it could "decide" the strength of the input for the model (i.e., the activation function).

79 GRU
Output feature: it could "combine" a portion of the "predicted" target attribute value of the previous record and a portion of the "processed" input variable. The ratio of these two portions is determined by the update feature.

80 GRU
Our brain includes the following steps (each implemented as a gate):
Reset component (reset gate)
Input activation component (input activation gate)
Update component (update gate)
Final output component (final output gate)

81 [Diagram: two consecutive RNN units passing the state from timestamp t-1 to timestamp t.]

82 [Diagram: the same picture with the units labelled "Traditional LSTM".]

83 [Diagram: the same picture with the units labelled "GRU"; note that no internal state variable st is passed between timestamps, only the output yt.]

84 [Diagram: the GRU unit at timestamp t shown as a memory unit that takes xt and yt-1 and produces yt.]

85 GRU
Reset gate: rt = σ(Wr · [xt, yt-1] + br)
Example weights: Wr = [0.7, 0.3, 0.4], br = 0.4.

86 GRU
Input activation gate: at = tanh(Wa · [xt, rt · yt-1] + ba)
Example bias: ba = 0.3.

87 GRU
Update gate: ut = σ(Wu · [xt, yt-1] + bu)
Example bias: bu = 0.5.

88 GRU
Final output gate: yt = (1 - ut) · yt-1 + ut · at

89 GRU
Putting the gates together, the memory unit computes:
rt = σ(Wr · [xt, yt-1] + br)
at = tanh(Wa · [xt, rt · yt-1] + ba)
ut = σ(Wu · [xt, yt-1] + bu)
yt = (1 - ut) · yt-1 + ut · at
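One step of the GRU memory unit, written directly from the four equations above (a sketch; the two-value input vector and helper names are our assumptions):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def gru_step(x_t, y_prev, Wr, br, Wa, ba, Wu, bu):
    """One step of the GRU, following the slide's four equations."""
    r_t = sigmoid(Wr[0] * x_t[0] + Wr[1] * x_t[1] + Wr[2] * y_prev + br)            # reset gate
    a_t = math.tanh(Wa[0] * x_t[0] + Wa[1] * x_t[1] + Wa[2] * (r_t * y_prev) + ba)  # input activation gate
    u_t = sigmoid(Wu[0] * x_t[0] + Wu[1] * x_t[1] + Wu[2] * y_prev + bu)            # update gate
    y_t = (1 - u_t) * y_prev + u_t * a_t                                            # final output gate
    return y_t
```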

90 In the following, we want to compute the (weight) values in GRU.
Similar to the neural network, GRU has two steps:
Step 1 (Input Forward Propagation)
Step 2 (Error Backward Propagation)
In the following, we focus on "Input Forward Propagation". In GRU, "Error Backward Propagation" could be solved by an existing optimization tool (as for the "Neural Network").

91 Consider this example with two timestamps.
Time | xt,1 | xt,2 | y
t=1  | 0.1  | 0.4  | 0.3
t=2  | 0.7  | 0.9  | 0.5
We use GRU to do the training.

92 GRU
When t = 1:
[Diagram: the GRU memory unit at timestamp 1, taking x1 together with y0 and computing r1, a1, u1 and y1; the unit at timestamp 2 then takes x2.]

93 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 r1 = (Wr [x1, y0] + br) = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (0.59) = r1 = RNN

94 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 a1 = tanh(Wa [x1, r1 . y0] + ba) = tanh( ∙ ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = tanh( ) = tanh(0.44) = r1 = a1 = RNN

95 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 u1 = (Wu [x1, y0] + bu) = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (0.62) = r1 = a1 = u1 = RNN

96 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 = (1 – u1) . y0 + u1 . a1 = (1 – ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r1 = a1 = u1 = y1 = RNN

97 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 Error = y1 - y = – 0.3 Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r1 = a1 = u1 = y1 = RNN

98 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 0.2690 Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 r1 = a1 = u1 = y1 = RNN

99 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 r2 = (Wr [x2, y1] + br) 0.2690 = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (1.2676) = r2 = RNN

100 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 a2 = tanh(Wa [x2, r2 . y1] + ba) 0.2690 = tanh( ∙ ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = tanh( ) = tanh(0.7940) = r2 = a2 = RNN

101 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 u2 = (Wu [x2, y1] + bu) 0.2690 = ( ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = ( ) = (0.9869) = r2 = a2 = u2 = RNN

102 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 y2 = (1 – u2) . y1 + u2 . a2 0.2690 = (1 – ) Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r2 = a2 = u2 = y2 = RNN

103 rt = (Wr [xt, yt-1] + br) at = tanh(Wa [xt, rt . yt-1] + ba) ut = (Wu [xt, yt-1] + bu) yt = (1 - ut) . yt-1 + ut . at Time xt, 1 xt, 2 y t=1 0.1 0.4 0.3 t=2 0.7 0.9 0.5 Step 1 (Input Forward Propagation) y0 y1 0.2690 Error = y2 - y = – 0.5 Wr = br = 0.4 Wa = ba = 0.3 Wu = bu = 0.5 = r2 = a2 = u2 = y2 = RNN

104 Similar to the "neural network", GRU could also have multiple layers and have multiple memory units in each layer.

105 Multi-layer RNN
[Diagram: the RNN with inputs x1, x2 and output y, drawn with its memory unit.]

106 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

107 Multi-layer RNN
[Diagram: the RNN and its memory unit.]

108 Multi-layer RNN
[Diagram: a multi-layer RNN with an input layer (x1, ..., x5), a hidden layer of memory units, and an output layer (y1, ..., y4).]

