Deep Learning Neural Network with Memory (1). Hung-yi Lee. Preparation notes: motivate with the application of generating sentences. Claim: can be used to predict the future. Cover Elman-type RNN and Jordan-type RNN. Missing: use memory as the beginning(?); how an RNN considers the context information; generating something with an RNN. Possible extra topic: should I mention CTC?
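Memory is important. Task: add two numbers column by column, 144 + 177 = 321, feeding the digits from the least significant column first. Input (2 dimensions): $x^1 = (4, 7)$, $x^2 = (4, 7)$, $x^3 = (1, 1)$. Output (1 dimension): $y^1 = 1$, $y^2 = 2$, $y^3 = 3$. The same input (4, 7) must give 1 at the first step and 2 at the second because of the carry, so this is hard for an ordinary neural network to learn. The network needs memory to achieve this.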
Memory is important: Network with Memory, step 1. Input $x^1 = (4, 7)$; memory cell c1 holds 0. The hidden neuron adds both inputs and the memory, each with weight 1: $4 + 7 + 0 = 11$. Since $11 \ge 10$, a carry of 1 is stored in the memory cell, and the output is $y^1 = 11 - 10 \times 1 = 1$. Even with the same input, the output may be different, due to the memory!
Memory is important: Network with Memory, step 2. Input $x^2 = (4, 7)$; memory cell c1 holds 1. Sum: $4 + 7 + 1 = 12$. Since $12 \ge 10$, the carry 1 is stored again, and the output is $y^2 = 12 - 10 \times 1 = 2$.
Memory is important: Network with Memory, step 3. Input $x^3 = (1, 1)$; memory cell c1 holds 1. Sum: $1 + 1 + 1 = 3$. Since $3 < 10$, there is no carry, and the output is $y^3 = 3$.
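The three slides above can be read as column-wise addition with a carry (144 + 177 = 321, least significant digit first). The following is a minimal Python sketch under that reading. The function name `add_with_memory` and the explicit threshold test for the carry are illustrative assumptions, not the exact network drawn on the slide.

```python
def add_with_memory(inputs):
    """Add two numbers digit by digit, keeping the carry in a memory cell.

    inputs: list of (digit_a, digit_b) pairs, least significant column first.
    Returns the list of output digits, one per step.
    """
    memory = 0  # the memory cell; assumed to start at 0
    outputs = []
    for a, b in inputs:
        total = a + b + memory             # weights of 1 on both inputs and the memory
        carry = 1 if total >= 10 else 0    # value to store for the next step
        outputs.append(total - 10 * carry) # the stored carry subtracts 10 from the output
        memory = carry
    return outputs


sequence = [(4, 7), (4, 7), (1, 1)]
print(add_with_memory(sequence))
```

The first two inputs are identical, yet the outputs differ, because the memory holds 0 at step 1 and 1 at step 2.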
Outline: Recurrent Neural Network (RNN); Long short-term memory (LSTM); Application on language modeling. Today we simply assume that the parameters of the network are already known. Training will be discussed next week.
Recurrent Neural Network (RNN). General idea: the outputs of the hidden neurons are stored (copied) in the memory cells, so the number of memory cells is the same as the number of hidden neurons. The layers are fully connected, and their weights are learned from data. (Variants: Elman-type and Jordan-type.)
Recurrent Neural Network (RNN), first input. The input $x$ and the memory $h$ are vectors. Hidden output: $a = \sigma(W_i x + W_h h)$. Network output: $y = \mathrm{Softmax}(W_o a)$. The outputs of the hidden neurons, $a$, are then copied into the memory cells.
Recurrent Neural Network (RNN), next input. The memory now holds $a$. For the next input $x'$: $a' = \sigma(W_i x' + W_h a)$ and $y' = \mathrm{Softmax}(W_o a')$. The hidden output $a'$ is copied into the memory cells in turn.
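Recurrent Neural Network (RNN), unrolled over time. Input data: $x^1, x^2, x^3, \ldots$ (each $x^t$ is a vector). At step $t$ the network computes $a^t = \sigma(W_i x^t + W_h a^{t-1})$ and $y^t = \mathrm{Softmax}(W_o a^t)$, then copies $a^t$ into the memory. The same network, with the same $W_i$, $W_h$ and $W_o$, is used again and again, so the output $y^t$ depends on $x^1, x^2, \ldots, x^t$.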
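The recurrence on the RNN slides, $a = \sigma(W_i x + W_h h)$ and $y = \mathrm{Softmax}(W_o a)$ with $a$ copied into the memory, can be sketched in plain Python as follows. The toy weight values, the zero initial memory and the helper names (`matvec`, `rnn_forward`) are assumptions made for illustration; in practice the weights are learned from data.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in v]
    s = sum(exps)
    return [e / s for e in exps]


def rnn_forward(xs, W_i, W_h, W_o):
    """Apply the same network to every input, carrying the memory forward."""
    h = [0.0] * len(W_h)  # memory cells, assumed to start at zero
    ys = []
    for x in xs:
        pre = [p + q for p, q in zip(matvec(W_i, x), matvec(W_h, h))]
        a = sigmoid_vec(pre)                 # a = sigma(W_i x + W_h h)
        ys.append(softmax(matvec(W_o, a)))   # y = Softmax(W_o a)
        h = a                                # copy hidden output into memory
    return ys


# Hypothetical toy weights: 2 inputs, 2 hidden neurons, 3 outputs.
W_i = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [-0.4, 0.6]]
W_o = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
xs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ys = rnn_forward(xs, W_i, W_h, W_o)
for t, y in enumerate(ys, start=1):
    print(t, [round(p, 4) for p in y])
```

The first two inputs are the same vector, but the two outputs differ because the memory differs between the two steps.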
Outline: Recurrent Neural Network (RNN); Long short-term memory (LSTM), simplified version; Application on language modeling
Long Short-term Memory (LSTM). An LSTM unit is a special neuron with 4 inputs and 1 output, whereas a normal neuron has 1 input and 1 output. It consists of a memory cell, an input gate, a forget gate and an output gate. All 4 inputs come from other parts of the network: the value to be written, a signal that controls the input gate, a signal that controls the forget gate, and a signal that controls the output gate. The single output goes to other parts of the network.
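LSTM computation. Inputs: $z$ (the value to be written) and $z_i$, $z_f$, $z_o$ (the signals for the input, forget and output gates); $c$ is the current value of the memory cell. The gate activation function $f$ is usually a sigmoid function, so its value is between 0 and 1 and mimics an open or closed gate. The input $g(z)$ is multiplied by the input gate $f(z_i)$, the old memory $c$ is multiplied by the forget gate $f(z_f)$, and the memory is updated to $c' = g(z) f(z_i) + c f(z_f)$. The output is $h(c')$ multiplied by the output gate: $a = h(c') f(z_o)$.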
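The two update equations, $c' = g(z) f(z_i) + c f(z_f)$ and $a = h(c') f(z_o)$, translate directly into a few lines of Python. This is a minimal sketch for a single scalar cell. Using tanh for $g$ and $h$ is an assumption (the slide only says that $f$ is usually a sigmoid), and the function name `lstm_cell` is illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell(c, z, z_i, z_f, z_o, g=math.tanh, h=math.tanh, f=sigmoid):
    """One step of a single LSTM cell.

    c   : current memory value
    z   : input to be written
    z_i : signal controlling the input gate
    z_f : signal controlling the forget gate
    z_o : signal controlling the output gate
    Returns (new memory c', output a).
    """
    c_new = g(z) * f(z_i) + c * f(z_f)  # c' = g(z) f(z_i) + c f(z_f)
    a = h(c_new) * f(z_o)               # a  = h(c') f(z_o)
    return c_new, a


# All gates open: the input is added to the memory and the result is output.
print(lstm_cell(c=0.5, z=1.0, z_i=20.0, z_f=20.0, z_o=20.0))
# Input and forget gates closed: the memory is wiped and nothing is written.
print(lstm_cell(c=0.5, z=1.0, z_i=-20.0, z_f=-20.0, z_o=20.0))
```

Large positive or negative gate signals push the sigmoid close to 1 or 0, which is what makes the gates behave as open or closed.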
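Original network: each hidden neuron takes one weighted sum of the input $x_1, x_2$ (here $z_1$, $z_2$) and produces one output ($a_1$, $a_2$). To build an LSTM network, simply replace the neurons with LSTM units. (Speaker note: note about the memory.)
LSTM network: each LSTM unit needs 4 weighted sums (the value to be written and the three gate signals). All of them use the same set of information, the input $x_1, x_2$, but each has its own weights, so the network has 4 times as many parameters.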
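The "4 times of parameters" remark can be checked with a small count. This sketch counts only the weights from the input to the hidden units plus one bias per weighted sum. Both the bias and the function name `count_params` are assumptions for illustration, and recurrent connections are left out.

```python
def count_params(n_in, n_hidden, lstm=False):
    """Count input-to-hidden parameters for a layer of plain neurons or LSTM units."""
    per_sum = n_in + 1                 # one weight per input plus a bias (assumed)
    sums_per_unit = 4 if lstm else 1   # LSTM: cell input + input/forget/output gates
    return n_hidden * sums_per_unit * per_sum


plain = count_params(n_in=2, n_hidden=2)
with_lstm = count_params(n_in=2, n_hidden=2, lstm=True)
print(plain, with_lstm, with_lstm / plain)
```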
LSTM - Example. Can you find the rule?
step                1   2   3   4   5   6   7   8   9
x1                  1   3   2   4   2   1   3   6   1
x2                  0   1   0   1   0   0  -1   1   0
x3                  0   0   0   0   0   1   0   0   1
memory before step  0   0   3   3   7   7   7   0   6
y                   0   0   0   0   0   7   0   0   6
When x2 = 1, add the value of x1 to the memory. When x2 = -1, reset the memory. When x3 = 1, output the number in the memory.
LSTM example with hand-set weights. Each of the 4 inputs of the LSTM unit is a weighted sum of $x_1$, $x_2$, $x_3$ and a bias input 1. Cell input: $z = x_1$. Input gate: $z_i = 100 x_2 - 10$. Forget gate: $z_f = 100 x_2 + 10$. Output gate: $z_o = 100 x_3 - 10$. Here $g$ and $h$ are taken as linear, which matches the values shown. Look at the sequence first: $(x_1, x_2, x_3)$ = (3, 1, 0), (4, 1, 0), (2, 0, 0), (1, 0, 1), (3, -1, 0).
Step 1, input (3, 1, 0): input gate ≈ 1 and forget gate ≈ 1, so the memory becomes 0 + 3 = 3. Output gate ≈ 0, so y ≈ 0.
Step 2, input (4, 1, 0): input gate ≈ 1 and forget gate ≈ 1, so the memory becomes 3 + 4 = 7. Output gate ≈ 0, so y ≈ 0.
Step 3, input (2, 0, 0): input gate ≈ 0 and forget gate ≈ 1, so the memory stays at 7. Output gate ≈ 0, so y ≈ 0.
Step 4, input (1, 0, 1): input gate ≈ 0 and forget gate ≈ 1, so the memory stays at 7. Output gate ≈ 1, so y ≈ 7.
Step 5, input (3, -1, 0): input gate ≈ 0 and forget gate ≈ 0, so the memory is reset to 0. Output gate ≈ 0, so y ≈ 0.
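The walk-through above can be reproduced numerically. The sketch below assumes the weights read off the slides (cell input $z = x_1$, input gate $100 x_2 - 10$, forget gate $100 x_2 + 10$, output gate $100 x_3 - 10$), linear $g$ and $h$, and an initial memory of 0. The function name `lstm_step` is illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(c, x1, x2, x3):
    """One LSTM step with the hand-set weights (as read off the slides)."""
    z = x1                 # cell input: weight 1 on x1
    z_i = 100 * x2 - 10    # input gate opens only when x2 = 1
    z_f = 100 * x2 + 10    # forget gate closes only when x2 = -1
    z_o = 100 * x3 - 10    # output gate opens only when x3 = 1
    c_new = z * sigmoid(z_i) + c * sigmoid(z_f)  # g is linear (assumed)
    y = c_new * sigmoid(z_o)                     # h is linear (assumed)
    return c_new, y


sequence = [(3, 1, 0), (4, 1, 0), (2, 0, 0), (1, 0, 1), (3, -1, 0)]
c = 0.0  # initial memory, assumed to be 0
memories, outputs = [], []
for x1, x2, x3 in sequence:
    c, y = lstm_step(c, x1, x2, x3)
    memories.append(c)
    outputs.append(y)
    print((x1, x2, x3), "memory = %.3f" % c, "y = %.3f" % y)
```

The gates are sigmoids rather than hard switches, so the memory and output are only approximately 7 or 0, which is why the slides write ≈.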
Language model (LM). A language model estimates the probability of a word sequence $w_1, w_2, w_3, \ldots, w_n$, that is, $P(w_1, w_2, w_3, \ldots, w_n)$. It is useful in speech recognition, where different word sequences can have the same pronunciation, for example "recognize speech" or "wreck a nice beach" (wreck: to destroy, damage or thwart). If P(recognize speech) > P(wreck a nice beach), then Output = "recognize speech". Language models are used not only in speech recognition but also in translation.
Language model: how to estimate $P(w_1, w_2, w_3, \ldots, w_n)$? Collect a large amount of text data as training data. However, the word sequence $w_1, w_2, \ldots, w_n$ may not appear in the training data. N-gram language model (shown here for 2-grams): $P(w_1, w_2, w_3, \ldots, w_n) = P(w_1|\text{START}) P(w_2|w_1) \cdots P(w_n|w_{n-1})$. For example, P("wreck a nice beach") = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice). Each factor is estimated from the training data by counting: $P(\text{beach}|\text{nice}) = C(\text{nice beach}) / C(\text{nice})$, where $C(\text{nice beach})$ is the count of "nice beach" and $C(\text{nice})$ is the count of "nice" in the training data.
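The count-based estimate can be sketched in a few lines of Python. The three-sentence corpus and the function names are made up for illustration. One simplification to note: `unigram` counts a word only when it is followed by another word, which differs slightly from the raw count of the word for sentence-final positions.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs, with START prepended to every sentence."""
    unigram = Counter()  # how often each word appears as a context
    bigram = Counter()   # how often each (previous, next) pair appears
    for sentence in sentences:
        words = ["START"] + sentence.split()
        for prev, nxt in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, nxt)] += 1
    return unigram, bigram


def prob(nxt, prev, unigram, bigram):
    """P(nxt | prev) = C(prev nxt) / C(prev); 0 if prev was never seen."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, nxt)] / unigram[prev]


def sentence_prob(sentence, unigram, bigram):
    """Multiply the bigram probabilities along the sentence."""
    words = ["START"] + sentence.split()
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= prob(nxt, prev, unigram, bigram)
    return p


# Hypothetical toy corpus.
corpus = ["wreck a nice beach", "a nice day", "recognize speech"]
unigram, bigram = train_bigram(corpus)
print(prob("beach", "nice", unigram, bigram))
print(sentence_prob("wreck a nice beach", unigram, bigram))
print(sentence_prob("a nice speech", unigram, bigram))
```

The last sentence gets probability 0 because "nice speech" never occurs in the toy corpus, which is exactly the problem that smoothing addresses.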
Language model - Smoothing. Training data: "The dog ran ……", "The cat jumped ……". Estimated by counting, P( jumped | dog ) = 0 and P( ran | cat ) = 0, because these word pairs never appear in the training data. These probabilities are not accurate: the zeros arise only because we cannot collect all the possible text in the world as training data. Instead of 0, give such pairs some small probability, for example 0.0001. This is called language model smoothing.
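One common way to "give some small probability" is add-k smoothing, sketched below. This is a stand-in chosen for illustration, not necessarily the scheme the slide has in mind, and the value k = 0.01 and the helper names are assumptions.

```python
from collections import Counter

# Toy training data from the slide, lower-cased.
sentences = ["the dog ran", "the cat jumped"]

unigram = Counter()  # counts of each word as a context
bigram = Counter()   # counts of each (previous, next) pair
vocab = set()
for sentence in sentences:
    words = sentence.split()
    vocab.update(words)
    for prev, nxt in zip(words, words[1:]):
        unigram[prev] += 1
        bigram[(prev, nxt)] += 1


def smoothed_prob(nxt, prev, k=0.01):
    """Add-k estimate: every pair gets k pseudo-counts, so no probability is 0."""
    return (bigram[(prev, nxt)] + k) / (unigram[prev] + k * len(vocab))


print(smoothed_prob("ran", "dog"))     # seen pair: still close to 1
print(smoothed_prob("jumped", "dog"))  # unseen pair: small but non-zero
print(sum(smoothed_prob(w, "dog") for w in vocab))  # still a distribution
```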
Neural-network based LM. P("wreck a nice beach") = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice), where each P(b|a) comes not from counts but from a neural network that predicts the next word. The network takes the 1-of-N encoding of the previous word as input and outputs the probability of each word being the next word: 1-of-N encoding of "START" gives P(next word is "wreck"); 1-of-N encoding of "wreck" gives P(next word is "a"); 1-of-N encoding of "a" gives P(next word is "nice"); 1-of-N encoding of "nice" gives P(next word is "beach").
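To make the pipeline concrete, here is a minimal sketch of a next-word predictor fed with 1-of-N encodings, chained to score a sentence. The single linear layer with softmax, the five-word vocabulary and the hand-picked weight matrix `W` are all assumptions for illustration. A real model would have hidden layers and learned weights.

```python
import math

VOCAB = ["START", "wreck", "a", "nice", "beach"]


def one_hot(word):
    """1-of-N encoding: 1 at the word's index, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def softmax(v):
    m = max(v)
    exps = [math.exp(z - m) for z in v]
    s = sum(exps)
    return [e / s for e in exps]


def next_word_probs(word, W):
    """Linear layer + softmax on the 1-of-N input (a stand-in for the network)."""
    x = one_hot(word)
    n = len(VOCAB)
    logits = [sum(x[i] * W[i][j] for i in range(n)) for j in range(n)]
    return softmax(logits)


def sentence_prob(words, W):
    """P(w1|START) P(w2|w1) ... using the network for every factor."""
    p = 1.0
    prev = "START"
    for w in words:
        p *= next_word_probs(prev, W)[VOCAB.index(w)]
        prev = w
    return p


# Hypothetical weights that favour the word following each word in VOCAB order.
W = [[2.0 if j == i + 1 else 0.0 for j in range(len(VOCAB))] for i in range(len(VOCAB))]
print(sentence_prob(["wreck", "a", "nice", "beach"], W))
print(sentence_prob(["beach", "nice", "a", "wreck"], W))
```

Because the input is 1-of-N, multiplying it by `W` simply selects one row of `W`, so each word gets its own set of next-word scores.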
Neural-network based LM. The hidden-layer representations (h1, h2) of related words, such as dog, cat and rabbit, are close to each other. (This is true; to be discussed later.) So if P(jump|dog) is large, then P(jump|cat) increases accordingly, even if there is no "… cat jump …" in the training data. Smoothing is done automatically.
RNN-based LM. An RNN can also be used to predict the next word. Input: the 1-of-N encoding of the word $w_{i-1}$. Output: the probability of each word being the next word $w_i$. The hidden-layer output is copied into the memory and used at the next step.
RNN-based LM. 1-of-N encoding of "START" gives P(next word is "wreck"); 1-of-N encoding of "wreck" gives P(next word is "a"); 1-of-N encoding of "a" gives P(next word is "nice"); 1-of-N encoding of "nice" gives P(next word is "beach"). The RNN models long-term information: each bi-gram probability differs depending on the context, which can reach very far back, all the way to the beginning of the sentence. LSTM can also be considered.
Other Applications
Composer: http://people.idsia.ch/~juergen/blues/
Sentence generation: http://www.cs.toronto.edu/~ilya/rnn.html
Generated samples:
The meaning of life is that all purer Pont Lent claims were were: the Scarborough the Continent k
The meaning of life is both employees. Since some social justice rights were likewise, irregularl
The meaning of life is exaggerated using for piercing pushers. There is no elderly obtaining only
The meaning of life is actually worshipped within such part of the study, until all eighty-six ba
The meaning of life is that over a similar meaning, he is NOHLA's release). The first examples c
Machine learning and having it deep and structured to communicate with other Argentings markets for a regular artist
Machine learning and having it deep and structured to help the practice of bad keitus.
Machine learning and having it deep and structured to acknowledge the message of a decision Beersher much of her husband's past.
Machine learning and having it deep and structured and weighing many years before Clinton frequently reveiling ^^^, you get airborne
Machine learning and having it deep and structured during enemy lines.
haruhiniba, shaped with their hysteria-friendly chasts from an incidence.
In an misaka any way, while a crowning of the image is simply called a Core Impaliarik
Harry Potter was still called "Keeper'," Lionel Goodin's "Our Happiest Urban Girl" wer
Summary – Network with Memory Recurrent Neural Network (RNN) Long short-term memory (LSTM) Application on language modeling
Appendix
Long term memory. [Figure: an RNN unrolled over inputs x1, x2, x3, x4 with outputs y1, y2, y3, y4, labelled "Wait for a second" and "Say something".]
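RNN - Training. Training data: an input sequence $x^1, x^2, x^3, \ldots$ with a target sequence $\hat{y}^1, \hat{y}^2, \hat{y}^3, \ldots$. General idea: feed $x^1, x^2, x^3$ through the network (copying the hidden output into the memory at each step) to get the outputs $y^1, y^2, y^3$, and train the parameters so that each $y^t$ is close to its target $\hat{y}^t$.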
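UNROLL. The output $y^1$ is computed from $x^1$ alone and compared with its target $\hat{y}^1$. The output $y^2$ is computed from $x^1$ and $x^2$, which unrolls into a deeper feedforward network, and is compared with $\hat{y}^2$.
Backpropagation through time (BPTT). UNROLL: computing $y^3$ from $x^1, x^2, x^3$ gives a network with 3 hidden layers, in which some weights are shared. When the sequence is long, the network can be extremely deep!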
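The idea of unrolling and sharing weights can be illustrated on a deliberately tiny case. The sketch below uses a scalar RNN $h_t = \tanh(w_i x_t + w_h h_{t-1})$ whose output is the hidden state itself, with a squared-error loss. All of these are simplifying assumptions, not the softmax network from the earlier slides. The gradients from BPTT are compared against finite differences.

```python
import math


def forward(xs, w_i, w_h):
    """Return hidden states; hs[0] is the initial memory (assumed 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_i * x + w_h * hs[-1]))
    return hs


def loss(xs, targets, w_i, w_h):
    hs = forward(xs, w_i, w_h)
    return 0.5 * sum((h - t) ** 2 for h, t in zip(hs[1:], targets))


def bptt(xs, targets, w_i, w_h):
    """Gradients of the loss w.r.t. the shared weights, summed over all time steps."""
    hs = forward(xs, w_i, w_h)
    grad_w_i, grad_w_h = 0.0, 0.0
    dh_next = 0.0  # gradient arriving at h_t from step t + 1
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + dh_next  # from the loss at t and from the future
        dpre = dh * (1.0 - hs[t] ** 2)           # back through tanh
        grad_w_i += dpre * xs[t - 1]             # same w_i used at every step
        grad_w_h += dpre * hs[t - 1]             # same w_h used at every step
        dh_next = dpre * w_h                     # pass gradient to the previous step
    return grad_w_i, grad_w_h


# Hypothetical toy data and weights.
xs, targets = [1.0, 0.5, -1.0], [0.5, 0.0, -0.5]
w_i, w_h = 0.7, -0.4
g_i, g_h = bptt(xs, targets, w_i, w_h)

eps = 1e-5  # finite-difference check
num_i = (loss(xs, targets, w_i + eps, w_h) - loss(xs, targets, w_i - eps, w_h)) / (2 * eps)
num_h = (loss(xs, targets, w_i, w_h + eps) - loss(xs, targets, w_i, w_h - eps)) / (2 * eps)
print(g_i, num_i)
print(g_h, num_h)
```

Because the weights are shared, each weight's gradient is the sum of its contributions from every layer of the unrolled network.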
The tasks that need memory: Structured Learning vs. RNN. POS tagging: the input x1, x2, x3, x4 is "John saw the saw." and the output y1, y2, y3, y4 is PN V D N. Speech recognition: the input is x1, x2, x3, x4, …… and the output is y1, y2, y3, y4, …… (TSI I N). Structured learning can also deal with these problems. What is the difference?
Outlook
Speech recognition, Structured + Deep: http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf
Structured + Deep: http://research.microsoft.com/pubs/210167/rcrf_v9.pdf
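Neural-network based LM: Training. Training data: "Hello how are you ……". Each word is an input to the neural network, and the word that follows it is the target. Input "Hello": target "how" should have the largest probability. Input "how": target "are" should have the largest probability. Input "are": target "you" should have the largest probability. And so on.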
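Building the (input, target) pairs described above amounts to shifting the sentence by one word. A minimal sketch, with `make_training_pairs` as an illustrative name:

```python
def make_training_pairs(sentence):
    """Pair every word with the word that follows it."""
    words = sentence.split()
    return list(zip(words, words[1:]))


for inp, target in make_training_pairs("Hello how are you"):
    print("input:", inp, "-> target:", target)
```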