Deep Learning Neural Network with Memory (1)


Deep Learning Neural Network with Memory (1) Hung-yi Lee
(Speaker notes: prepare the motivation, e.g. the application of generating sentences; claim: use the past to predict the future; cover Elman-type RNN and Jordan-type RNN; missing: use memory as the beginning(?), how an RNN considers context information, generating something with an RNN; possible extra topic: should I mention CTC?)

Memory is important
(Figure: a sequence of 2-dimensional inputs x1, x2, x3 and the corresponding 1-dimensional outputs y1, y2, y3; the same input vector appears at different positions but must map to different outputs.)
Input: 2 dimensions. Output: 1 dimension. This mapping is hard to learn with an ordinary feedforward neural network; the network needs memory to achieve this.

Network with Memory
(Figures, three time steps: a small network with a memory cell c1 whose stored value changes from step to step, e.g. 11, then 12, then 3 in the figures.) Even with the same input, the output may be different, due to the memory!

Outline: Recurrent Neural Network (RNN); Long Short-term Memory (LSTM); application to language modeling. Today we will simply assume that we already know the parameters in the network; training will be discussed next week.

Outline: Recurrent Neural Network (RNN); Long Short-term Memory (LSTM); application to language modeling.

Recurrent Neural Network (RNN): the outputs of the hidden neurons are stored in memory cells and copied back as additional inputs at the next time step. The number of hidden neurons and the number of memory cells are the same, the connections are fully connected, and the weights are learned from data. (Speaker note: general idea; Elman-type vs. Jordan-type RNN.)

Recurrent Neural Network (RNN): at one time step, the hidden activations are a = σ(Wi x + Wh h), where x is the input vector and h is the content of the memory cells, and the output is y = Softmax(Wo a). The activations a are then copied into the memory cells.

At the next time step, the new input x′ is combined with the stored activations a: a′ = σ(Wi x′ + Wh a) and y′ = Softmax(Wo a′); a′ is then copied into the memory cells in turn.

For input data x1, x2, x3, ... (each xi is a vector), the same network, i.e. the same Wi, Wh and Wo, is used again and again: at = σ(Wi xt + Wh a(t-1)) and yt = Softmax(Wo at). Output yi therefore depends on x1, x2, ..., xi.
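A minimal sketch of this unrolled forward pass in Python; the sizes and random weights are placeholders, since the slides assume the parameters are already known:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, Wi, Wh, Wo):
    # at = sigmoid(Wi xt + Wh a(t-1)), yt = softmax(Wo at);
    # the same Wi, Wh, Wo are reused at every time step
    a = np.zeros(Wh.shape[0])
    ys = []
    for x in xs:
        a = sigmoid(Wi @ x + Wh @ a)
        ys.append(softmax(Wo @ a))
    return ys

rng = np.random.default_rng(0)
Wi, Wh, Wo = rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), rng.normal(size=(3, 4))
xs = [rng.normal(size=2) for _ in range(3)]    # x1, x2, x3 (2-dimensional inputs)
ys = rnn_forward(xs, Wi, Wh, Wo)               # y3 depends on x1, x2 and x3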

Outline: Recurrent Neural Network (RNN); Long Short-term Memory (LSTM), in a simplified version; application to language modeling.

Long Short-term Memory (LSTM): a special neuron with 4 inputs and 1 output, whereas a normal neuron has 1 input and 1 output. It consists of a memory cell guarded by three gates: an input gate, a forget gate and an output gate. The signals that control the gates, like the input to the cell itself, come from other parts of the network.

The cell update is c′ = g(z) f(zi) + c f(zf), and the output is a = h(c′) f(zo), where z is the input to the cell, zi, zf and zo are the control signals of the input, forget and output gates, and c is the current memory value. The gate activation function f is usually a sigmoid, so its value lies between 0 and 1, mimicking an open or closed gate.
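A minimal sketch of one update of this cell, directly following the two formulas above; the choice of tanh for g and h is an assumption (in the worked example on the following slides they behave like the identity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(c, z, zi, zf, zo, g=np.tanh, h=np.tanh):
    # c' = g(z) f(zi) + c f(zf), a = h(c') f(zo), with f the sigmoid gate function
    c_new = g(z) * sigmoid(zi) + c * sigmoid(zf)
    a = h(c_new) * sigmoid(zo)
    return c_new, a

# a gate control of +100 is effectively "open" (sigmoid ≈ 1), -100 "closed" (≈ 0)
print(lstm_cell_step(c=0.0, z=3.0, zi=100.0, zf=-100.0, zo=100.0))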

Original network: simply replace the neurons with LSTM units. (Figure: a feedforward layer with inputs x1, x2 and activations a1, a2 computed from z1, z2.)

After the replacement, each LSTM unit sees the same input vector four times, once for the cell input and once for each of the three gates, each time with its own weights, so the layer needs 4 times as many parameters as before.
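As a rough sanity check of the "4 times" claim, a small count; the layer sizes n_in and n_units are made up for illustration:

n_in, n_units = 100, 50                       # illustrative layer sizes

plain_params = (n_in + 1) * n_units           # one weight vector + bias per neuron
lstm_params  = 4 * (n_in + 1) * n_units       # cell input + input/forget/output gates

print(plain_params, lstm_params, lstm_params // plain_params)   # ratio is 4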

LSTM - Example. Each time step has a 3-dimensional input (x1, x2, x3) and a 1-dimensional output y. Can you find the rule? When x2 = 1, add the number in x1 into the memory. When x2 = -1, reset the memory. When x3 = 1, output the number stored in the memory. Part of the sequence (the steps walked through on the following slides):
x1:  3   4   2   1   3
x2:  1   1   0   0  -1
x3:  0   0   0   1   0
y:   0   0   0   7   0

(Figure: an LSTM cell with hand-set weights that implement the rule. The cell input is x1 with weight 1; the input gate is controlled by 100·x2 - 10, the forget gate by 100·x2 + 10, and the output gate by 100·x3 - 10.) Look at how the sequence is processed step by step.

Step 1: input (x1, x2, x3) = (3, 1, 0). The input gate and forget gate are open (≈1) and the output gate is closed (≈0), so the memory becomes 0 + 3 = 3 and y ≈ 0.

Step 2: input (4, 1, 0). The memory becomes 3 + 4 = 7 and y ≈ 0.

Step 3: input (2, 0, 0). The input gate is closed (≈0) while the forget gate stays open (≈1), so the memory keeps the value 7 and y ≈ 0.

Step 4: input (1, 0, 1). The output gate opens (≈1), so the stored value is released and y ≈ 7.

Step 5: input (3, -1, 0). The forget gate closes (≈0), so the memory is reset to ≈0 and y ≈ 0.
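The whole worked example can be reproduced with a few lines of Python; the only assumption is that the cell-input and cell-output nonlinearities are the identity, which is what the numbers on the slides suggest:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(c, x1, x2, x3):
    z  = 1.0 * x1                 # cell input
    zi = 100.0 * x2 - 10.0        # input-gate control
    zf = 100.0 * x2 + 10.0        # forget-gate control
    zo = 100.0 * x3 - 10.0        # output-gate control
    c_new = z * sigmoid(zi) + c * sigmoid(zf)
    y = c_new * sigmoid(zo)
    return c_new, y

c = 0.0
for x1, x2, x3 in [(3, 1, 0), (4, 1, 0), (2, 0, 0), (1, 0, 1), (3, -1, 0)]:
    c, y = step(c, x1, x2, x3)
    print(f"x=({x1},{x2},{x3})  memory ≈ {c:.2f}  y ≈ {y:.2f}")
# the memory goes 3, 7, 7, 7, ~0 and y ≈ 7 only at the step where x3 = 1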

Outline: Recurrent Neural Network (RNN); Long Short-term Memory (LSTM); application to language modeling.

Language model (LM): a language model estimates the probability of a word sequence w1, w2, w3, ..., wn, i.e. P(w1, w2, w3, ..., wn). This is useful in speech recognition, because different word sequences can have the same pronunciation, e.g. "recognize speech" vs. "wreck a nice beach" (wreck: to destroy, damage or frustrate). If P(recognize speech) > P(wreck a nice beach), the recognizer outputs "recognize speech". Language models are useful not only for speech recognition but also for translation.

How do we estimate P(w1, w2, w3, ..., wn)? Collect a large amount of text as training data; however, the exact word sequence w1, w2, ..., wn may never appear in the training data. An N-gram language model factorizes the probability, e.g. with bigrams: P(w1, w2, w3, ..., wn) = P(w1|START) P(w2|w1) ... P(wn|wn-1), so P("wreck a nice beach") = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice). Each factor is estimated from the training data by counting, e.g. P(beach|nice) = C(nice beach) / C(nice), where C(nice beach) is the count of "nice beach" and C(nice) is the count of "nice" in the training data.

Language model - Smoothing. Suppose the training data contains "The dog ran ......" and "The cat jumped ......". The counts then give P(jumped | dog) = 0 and P(ran | cat) = 0, which is not accurate; this happens only because we cannot collect all the possible text in the world as training data. Giving such unseen events some small probability (e.g. 0.0001) instead of zero is called language model smoothing.
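A minimal sketch of the count-based bigram estimate and of one simple way to smooth it (add-alpha smoothing; the slide only says to give some small probability, so the exact scheme here is an assumption):

from collections import Counter

corpus = ["the dog ran away", "the cat jumped high"]     # toy training data

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev, alpha=0.0001):
    # P(w | prev) = C(prev w) / C(prev); adding a small alpha keeps
    # unseen bigrams like "dog jumped" from getting probability exactly 0
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * len(unigrams))

print(p_bigram("ran", "dog"))      # seen bigram: close to 1
print(p_bigram("jumped", "dog"))   # unseen bigram: small but non-zero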

Neural-network based LM: P("wreck a nice beach") = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice) as before, but each conditional probability P(b|a) now comes not from counts but from a neural network that predicts the next word. The network takes the 1-of-N encoding of the current word ("START", "wreck", "a", "nice") and outputs P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), P(next word is "beach"), and so on.

Neural-network based LM: the hidden-layer representations of related words such as "dog", "cat" and "rabbit" are close to each other. So if P(jump | dog) is large, then P(jump | cat) increases accordingly, even if "... cat jump ..." never appears in the training data. Smoothing is therefore done automatically. (Why the representations end up close is discussed later.)

RNN-based LM: an RNN can also be used to predict the next word. The input is the 1-of-N encoding of the word wi-1, the hidden activations are copied into the memory cells across time steps, and the output gives the probability of each word in the vocabulary being the next word wi.

RNN-based LM: the inputs are the 1-of-N encodings of "START", "wreck", "a", "nice", and the outputs are P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), P(next word is "beach"). Because the hidden state is carried forward, each "bi-gram" probability depends on the whole context, potentially reaching back to the very beginning of the sequence, so the model captures long-term information. An LSTM can also be used here.
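A minimal sketch of how such an RNN LM would score a whole sentence; the toy vocabulary, the random weights and the helper names are made up for illustration, standing in for a trained network:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["START", "wreck", "a", "nice", "beach"]       # toy vocabulary
V, H = len(vocab), 8                                   # vocabulary size, hidden size
Wi, Wh, Wo = rng.normal(size=(H, V)), rng.normal(size=(H, H)), rng.normal(size=(V, H))

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def sentence_log_prob(words):
    # P(w1, ..., wn) = P(w1|START) P(w2|START, w1) ...; the hidden state a
    # carries the whole history, not just the previous word
    a, logp = np.zeros(H), 0.0
    for prev, cur in zip(["START"] + words[:-1], words):
        a = sigmoid(Wi @ one_hot(vocab.index(prev), V) + Wh @ a)
        p = softmax(Wo @ a)                            # distribution over the next word
        logp += np.log(p[vocab.index(cur)])
    return logp

print(sentence_log_prob(["wreck", "a", "nice", "beach"]))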

Other applications: Composer http://people.idsia.ch/~juergen/blues/ ; Sentence generation http://www.cs.toronto.edu/~ilya/rnn.html . Sample machine-generated continuations of given prompts:
The meaning of life is that all purer Pont Lent claims were were: the Scarborough the Continent k
The meaning of life is both employees. Since some social justice rights were likewise, irregularl
The meaning of life is exaggerated using for piercing pushers. There is no elderly obtaining only
The meaning of life is actually worshipped within such part of the study, until all eighty-six ba
The meaning of life is that over a similar meaning, he is NOHLA's release). The first examples c
Machine learning and having it deep and structured to communicate with other Argentings markets for a regular artist
Machine learning and having it deep and structured to help the practice of bad keitus.
Machine learning and having it deep and structured to acknowledge the message of a decision Beersher much of her husband's past.
Machine learning and having it deep and structured and weighing many years before Clinton frequently reveiling ^^^, you get airborne
Machine learning and having it deep and structured during enemy lines. haruhiniba, shaped with their hysteria-friendly chasts from an incidence. In an misaka any way, while a crowning of the image is simply called a Core Impaliarik
Harry Potter was still called "Keeper'," Lionel Goodin's "Our Happiest Urban Girl" wer

Summary – Network with Memory: Recurrent Neural Network (RNN); Long Short-term Memory (LSTM); application to language modeling.

Appendix

Long-term memory. (Figure: inputs x1, x2, x3, x4 and outputs y1, y2, y3, y4; the annotations read "Wait for a second" and "Say something", i.e. the network has to hold information for a while before producing its outputs.)

RNN - Training. Training data: input sequences x1, x2, x3, ... with target outputs ŷ1, ŷ2, ŷ3, ... . Train the parameters so that each network output y is close to its target ŷ.

Unroll. (Figure: the network unrolled over the first time steps, producing y1 from x1 and y2 from x1 and x2, compared against the targets ŷ1 and ŷ2.)

Backpropagation through time (BPTT): unroll the network over the sequence (for example, y3 is computed through 3 hidden layers from x1, x2, x3) and apply backpropagation, with some weights shared across time steps. When the sequence is long, the unrolled network can be extremely deep!

Structured learning vs. RNN: tasks that need memory, such as POS tagging ("John saw the saw." → PN V D N) and speech recognition (acoustic inputs x1, x2, x3, x4, ... mapped to label outputs y1, y2, y3, y4, ...), can also be solved by structured learning. What is the difference between structured learning and RNNs?

Outlook: speech recognition with structured + deep models: http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf and http://research.microsoft.com/pubs/210167/rcrf_v9.pdf

Neural-network based LM - Training. Training data: "Hello how are you ......". Input "Hello" with target "how" (i.e. "how" should get the largest output probability); input "how" with target "are"; input "are" with target "you"; and so on.
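A small sketch of how the (input, target) pairs on this slide would be constructed from the sentence; the helper names are just illustrative:

sentence = "Hello how are you".split()

# next-word prediction pairs: input word -> target word
pairs = list(zip(sentence[:-1], sentence[1:]))
print(pairs)   # [('Hello', 'how'), ('how', 'are'), ('are', 'you')]

# with a vocabulary, each input becomes a 1-of-N vector and the target is the
# index whose softmax output should get the largest probability
vocab = sorted(set(sentence))
def one_hot(word):
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v

training_data = [(one_hot(x), vocab.index(y)) for x, y in pairs]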