Recurrent Neural Networks


Recurrent Neural Networks (deeplearning.ai): Gated Recurrent Unit (GRU)

Motivation: Not all problems can be converted into one with fixed-length inputs and outputs. Problems such as speech recognition or time-series prediction require a system to store and use context information. Simple case: output YES if the number of 1s in a bit string is even, else NO. 1000010101 – YES, 100011 – NO, … It is hard or impossible to choose a fixed context window, because there can always be a new sample longer than anything seen before.
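A minimal sketch (plain Python, not from the slides) of why a carried state sidesteps the fixed-window problem: a single bit of state is enough to answer the parity question for a sequence of any length, which is exactly the kind of information an RNN's hidden state can store.

```python
def parity_is_even(bits):
    """Return True if the number of 1s is even, scanning left to right."""
    state = True                  # "even so far": the only context we carry
    for b in bits:
        if b == "1":
            state = not state     # each 1 flips the parity state
    return state

assert parity_is_even("1000010101")      # four 1s -> even -> YES
assert not parity_is_even("100011")      # three 1s -> odd  -> NO
```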

Recurrent Neural Networks (RNNs): Recurrent neural networks take the previous output or hidden state as an additional input. The composite input at time t therefore carries some historical information about what happened at times τ < t. RNNs are useful because their intermediate values (the state) can store information about past inputs for a duration that is not fixed a priori.

Sample feed-forward network: a single time step (t = 1) mapping input x1 through hidden layer h1 to output y1, with no connection between time steps.

Sample RNN: the network unrolled for t = 1, 2, 3; h2 receives x2 and h1, h3 receives x3 and h2, and each hidden state ht produces an output yt.

Sample RNN (continued): the same unrolled network with an explicit initial hidden state h0 feeding the first step, so h1 is computed from x1 and h0.

Sentiment Classification: Classify a restaurant review from Yelp, a movie review from IMDB, etc. as positive or negative. Inputs: multiple words, one or more sentences. Outputs: positive / negative classification. Examples: "The food was really good." "The chicken crossed the road because it was uncooked."

Sentiment Classification: feed the review into the RNN one word at a time; "The" produces h1, "food" produces h2, and so on until the last word ("good") produces hn.

Sentiment Classification (option 1): ignore the intermediate states h1 … hn-1 and run a linear classifier on the final hidden state hn.

Sentiment Classification (option 2): pool the states instead, h = Sum(h1, …, hn), and run the linear classifier on the pooled vector h. http://deeplearning.net/tutorial/lstm.html
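A minimal numpy sketch (not from the slides) of the two readout strategies above; the rnn_step function, the sizes, the random word vectors, and the untrained weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                            # word-vector and hidden sizes (assumed)
Wx = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)
w_clf = rng.normal(size=d_h)                 # linear classifier weights (untrained)

def rnn_step(h_prev, x):
    # one recurrent update: mix the previous state with the current word vector
    return np.tanh(Wx @ x + Wh @ h_prev + b)

# stand-in word vectors for "The food was really good" (random, for illustration)
words = [rng.normal(size=d_in) for _ in range(5)]

h = np.zeros(d_h)
states = []
for x in words:
    h = rnn_step(h, x)
    states.append(h)

score_last = w_clf @ states[-1]              # option 1: use only the final state hn
score_sum = w_clf @ np.sum(states, axis=0)   # option 2: use h = Sum(h1, ..., hn)
print("positive" if score_sum > 0 else "negative")
```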

RNN unit: $a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$. At the first step the cell takes the initial activation $a^{<0>}$ and the input $x^{<1>}$, and produces the output activation value $a^{<1>}$, from which the prediction $\hat{y}^{<1>}$ is computed.
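A minimal numpy sketch of this unit, assuming $g = \tanh$ and stacking $[a^{<t-1>}, x^{<t>}]$ into one vector so a single matrix $W_a$ acts on both; the sizes and the zero-initialized weights are illustrative, not from the slides.

```python
import numpy as np

def rnn_unit(a_prev, x_t, Wa, ba):
    """a<t> = g(Wa [a<t-1>, x<t>] + ba), with g = tanh (an assumption)."""
    concat = np.concatenate([a_prev, x_t])   # stack a<t-1> on top of x<t>
    return np.tanh(Wa @ concat + ba)

n_a, n_x = 5, 3                              # illustrative sizes
Wa = np.zeros((n_a, n_a + n_x))              # untrained placeholder weights
ba = np.zeros(n_a)

a = np.zeros(n_a)                            # a<0>
for x_t in np.zeros((4, n_x)):               # four dummy inputs x<1>..x<4>
    a = rnn_unit(a, x_t, Wa, ba)             # output activation value a<t>
```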

GRU unit (diagram)

Same dimension, for example 100: the memory cell $c^{<t>}$, the candidate $\tilde{c}^{<t>}$, and the update gate $\Gamma_u$ are vectors of the same size (e.g. 100), and the $*$ in the update equation (see the GRU equations below) is an element-wise product, not a dot product.

GRU (simplified): example of a long-range dependency, "The cat, which already ate …, was full." The memory cell $c^{<t>}$ can carry the fact that the subject ("cat") is singular across the intervening words, and the update gate $\Gamma_u$ decides when to overwrite that memory, so that "was" agrees with "cat". [Cho et al., 2014. On the properties of neural machine translation: Encoder-decoder approaches] [Chung et al., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling]

Relevance gate: the full GRU adds a gate $\Gamma_r$ that controls how relevant $c^{<t-1>}$ is when computing the candidate $\tilde{c}^{<t>}$ (see the equations below).

Recurrent Neural Networks (deeplearning.ai): LSTM (long short-term memory) unit

Introduction: An RNN (recurrent neural network) is a form of neural network that feeds its outputs back to its inputs during operation. An LSTM (long short-term memory) is a form of RNN designed to mitigate the vanishing-gradient problem of the original RNN. Application: sequence-to-sequence models using LSTMs for machine translation. Materials are mainly based on links found in https://www.tensorflow.org/tutorials

LSTM (long short-term memory): In a standard RNN, the input is concatenated with the previous output and fed back in, and the repeating module has a simple structure. In an LSTM, the repeating module is more complicated: it maintains a separate cell state controlled by gates.

GRU and LSTM: the LSTM is a more powerful and more general version of this kind of gated unit. Full GRU:
$\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)$
$\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$
$\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$
$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$
$a^{<t>} = c^{<t>}$
[Hochreiter & Schmidhuber 1997. Long short-term memory]
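A minimal numpy sketch of one full-GRU step implementing the equations above; the sigmoid helper, the parameter names, and the weight shapes are assumptions, and all gate products are element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, bc, Wu, bu, Wr, br):
    """One full-GRU step; gate products are element-wise."""
    concat = np.concatenate([c_prev, x_t])                  # [c<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                     # update gate
    gamma_r = sigmoid(Wr @ concat + br)                     # relevance gate
    candidate_in = np.concatenate([gamma_r * c_prev, x_t])  # [Gamma_r * c<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ candidate_in + bc)               # candidate memory
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev      # keep or overwrite memory
    return c_t                                              # and a<t> = c<t>
```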

LSTM in pictures:
$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$
$\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$ (update gate)
$\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$ (forget gate)
$\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$ (output gate)
$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$
$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$
(Diagram: three LSTM cells unrolled over t = 1, 2, 3; each cell takes $c^{<t-1>}$, $a^{<t-1>}$, and $x^{<t>}$, applies the forget, update, and output gates together with the tanh candidate, passes $c^{<t>}$ and $a^{<t>}$ to the next cell, and a softmax produces $\hat{y}^{<t>}$.)

Core idea of LSTM: C is the cell state (C_{t-1} = state at time t-1, C_t = state at time t). Using gates, the LSTM can add or remove information from the state, which helps avoid the long-term dependency problem (Bengio et al., 1994). A gate is controlled by a sigmoid function σ: the sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through." An LSTM has three of these gates to protect and control the cell state. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

First step: forget gate layer. Decide what to throw away from the cell state: "It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents 'completely keep this' while a 0 represents 'completely get rid of this.'" For the language-model example, the cell state might include the gender of the present subject, so that the correct pronouns can be used; when we see a new subject, we want to forget the gender of the old subject. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Second step (a): input gate layer. Decide what new information to store in the cell state: "Next, a tanh layer creates a vector of new candidate values, ~C_t, that could be added to the state. In the next step, we'll combine these two to create an update to the state." For the language-model example, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Second step (b): update the old cell state, C_{t-1} → C_t. "We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t * ~C_t. This is the new candidate values, scaled by how much we decided to update each state value." For the language-model example, this is where we'd actually drop the information about the old subject's gender and add the new information, as decided in the previous steps. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Third step: output gate layer. Decide what to output (h_t): "Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to." For the language-model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next; for example, whether the subject is singular or plural, so that we know what form the verb should be conjugated into. All of these steps are collected in the sketch below. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
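A minimal numpy sketch collecting the forget, input, update, and output steps in the blog's notation (f_t, i_t, ~C_t, o_t, h_t); the sigmoid helper, the parameter names, and the weight shapes are assumptions, and each weight matrix acts on the concatenation [h_{t-1}, x_t].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x_t, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    """One LSTM step following the forget / input / update / output steps above."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)              # first step: forget gate
    i_t = sigmoid(Wi @ z + bi)              # second step (a): input gate ...
    C_tilde = np.tanh(WC @ z + bC)          # ... and candidate values ~C_t
    C_t = f_t * C_prev + i_t * C_tilde      # second step (b): update the cell state
    o_t = sigmoid(Wo @ z + bo)              # third step: output gate
    h_t = o_t * np.tanh(C_t)                # filtered view of the cell state
    return h_t, C_t
```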

Dimensions (from the LSTM diagram): x_t is of size n×1 and h_{t-1} is of size m×1, so the concatenation of x_t and h_{t-1} has size (n+m)×1. The vectors C_{t-1}, C_t, U, i_t, f_t, o_t, h_{t-1}, and h_t are all of size m×1. http://kvitajakub.github.io/2016/04/14/rnn-diagrams/
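A short numpy check of this bookkeeping with illustrative sizes n = 4 and m = 3; the weight shape (m, n+m) is the standard consequence of acting on the concatenation and is an assumption here rather than something stated on the slide.

```python
import numpy as np

n, m = 4, 3                           # x_t is n x 1, h_{t-1} is m x 1 (illustrative)
x_t = np.zeros((n, 1))
h_prev = np.zeros((m, 1))
z = np.vstack([x_t, h_prev])          # concatenation: (n + m) x 1
assert z.shape == (n + m, 1)

W_gate = np.zeros((m, n + m))         # each gate's weight maps (n + m) -> m
gate = W_gate @ z                     # e.g. pre-activation of f_t, i_t, or o_t
assert gate.shape == (m, 1)           # matches C_t and h_t, which are m x 1
```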

Recurrent Neural Networks (deeplearning.ai): Bidirectional RNN

Getting information from the future: He said, "Teddy bears are on sale!" versus He said, "Teddy Roosevelt was a great President!" At the word "Teddy" ($x^{<3>}$), a forward-only RNN has seen only $x^{<1>}$ through $x^{<3>}$ and cannot tell whether "Teddy" is part of a person's name, so information from later words in the sentence is needed. (Diagram: a unidirectional RNN unrolled over $x^{<1>} \ldots x^{<7>}$ with activations $a^{<1>} \ldots a^{<7>}$ and outputs $\hat{y}^{<1>} \ldots \hat{y}^{<7>}$, starting from $a^{<0>}$.)

Bidirectional RNN (BRNN)
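A minimal numpy sketch of the bidirectional idea (not from the slides): one simple RNN reads $x^{<1>} \ldots x^{<T>}$ left to right, a second RNN reads right to left, and the prediction at each step combines both activations; the helper rnn_pass, the sizes, and the random untrained weights are illustrative assumptions.

```python
import numpy as np

def rnn_pass(xs, Wa, ba, n_a):
    """Run a simple tanh RNN over a list of input vectors; return all activations."""
    a = np.zeros(n_a)
    acts = []
    for x in xs:
        a = np.tanh(Wa @ np.concatenate([a, x]) + ba)
        acts.append(a)
    return acts

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 6, 4, 2, 7                    # illustrative sizes, e.g. 7 words
xs = [rng.normal(size=n_x) for _ in range(T)]

Wa_f = rng.normal(size=(n_a, n_a + n_x))         # forward RNN weights
ba_f = np.zeros(n_a)
Wa_b = rng.normal(size=(n_a, n_a + n_x))         # backward RNN weights
ba_b = np.zeros(n_a)
Wy = rng.normal(size=(n_y, 2 * n_a))             # readout over both directions
by = np.zeros(n_y)

fwd = rnn_pass(xs, Wa_f, ba_f, n_a)              # forward activations for t = 1..T
bwd = rnn_pass(xs[::-1], Wa_b, ba_b, n_a)[::-1]  # backward activations, re-aligned

# y<t> depends on both directions, so a word like "Teddy" can use later context
ys = [Wy @ np.concatenate([f, b]) + by for f, b in zip(fwd, bwd)]
```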