Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNN) Giuseppe Attardi Some slides from Arun Mallya

RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. RNNs have a “memory” which captures information about what has been calculated so far. In theory, RNNs can make use of information from arbitrarily long sequences.

Recurrent Neural Network

The hidden state s_t represents the memory of the network: it captures information about what happened in all the previous time steps. The output o_t is calculated solely based on the memory at time t. In practice, s_t cannot capture information from too many time steps ago. Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters (U, V, W above) across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs, and it greatly reduces the total number of parameters we need to learn.

RNN advantages:
- Can process input of any length
- Model size does not increase for longer input
- Computation for step t can (in theory) use information from many steps back
- Weights are shared across time steps, so representations are shared
RNN disadvantages:
- Recurrent computation is slow
- In practice, it is difficult to access information from many steps back
Slide by Richard Socher

How do RNNs reduce complexity? Given a function f with h', y = f(h, x), where h and h' are vectors with the same dimension, the same f is applied at every step: starting from h0, f maps (h0, x1) to (h1, y1), then (h1, x2) to (h2, y2), then (h2, x3) to (h3, y3), and so on. No matter how long the input/output sequence is, we only need one function f. If the f's were different at each step, it would become a feedforward NN.

Deep RNN: RNNs with multiple hidden layers stacked on top of each other. Slide by Arun Mallya

Bidirectional RNN: RNNs can process the input sequence in both the forward and the reverse direction, combining a forward and a backward hidden state at each time step. Popular in speech recognition.
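
A minimal numpy sketch of the idea (all function names, weight shapes, and dimensions here are illustrative assumptions, not from the slides): run one vanilla RNN left-to-right, another right-to-left, and concatenate the two hidden states at each time step.

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Run a simple tanh RNN over a list of input vectors, returning all hidden states."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

def bidirectional_rnn(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states for each time step."""
    h_fwd = rnn_pass(xs, *fwd_params)              # processes x1 ... xT
    h_bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]  # processes xT ... x1, then re-align
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]

# Illustrative dimensions: 6 time steps, 4-dim inputs, 3-dim hidden state per direction
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(6)]
params = lambda: (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
outputs = bidirectional_rnn(xs, params(), params())
print(len(outputs), outputs[0].shape)  # 6 time steps, each a 6-dim (3 + 3) vector
```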

Pyramid RNN: stacked bidirectional RNNs that reduce the number of time steps from one layer to the next, which significantly speeds up training. W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” ICASSP, 2016.

Vanilla RNN. Given the input x and the previous hidden state h:
h' = σ(W_h h + W_i x + b)
y = σ(W_o h')
Note: y is computed from h'.

Unfolding the network in time. Vanilla RNN:
h_t = σ(W_ih x_t + U h_{t-1} + b)
y_t = σ(W_o h_t)
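
The two equations above map directly onto a short loop; a sketch in numpy, where the weight shapes, the zero initial state, and the sigmoid output layer are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_forward(xs, W_ih, U, b, W_o):
    """Unfold the vanilla RNN in time: h_t = sigmoid(W_ih x_t + U h_{t-1} + b), y_t = sigmoid(W_o h_t)."""
    h = np.zeros(U.shape[0])          # h_{-1} initialised to zeros
    ys = []
    for x in xs:                      # the same W_ih, U, b, W_o are reused at every step
        h = sigmoid(W_ih @ x + U @ h + b)
        ys.append(sigmoid(W_o @ h))
    return ys

# Illustrative sizes: 5 steps, 3-dim inputs, 4-dim hidden state, 2-dim outputs
rng = np.random.default_rng(1)
xs = [rng.normal(size=3) for _ in range(5)]
ys = rnn_forward(xs, rng.normal(size=(4, 3)), rng.normal(size=(4, 4)),
                 np.zeros(4), rng.normal(size=(2, 4)))
print([y.shape for y in ys])          # five 2-dim outputs, one per time step
```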

RNN Training

RNN Training
Target: obtain the network parameters that optimize the cost function.
Cost functions: log loss, root mean squared error, etc.
Tasks:
- For each time step of the input sequence x, predict an output y (synchronously)
- For the input sequence x, predict a single scalar value y (e.g., at the end of the sequence)
- For an input sequence x of length Lx, generate an output sequence y of a different length Ly
Methods:
- Backpropagation: reliable and controlled convergence, supported by most ML frameworks
- Evolutionary methods, expectation maximization, non-parametric methods, particle swarm optimization: research

Back-propagation Through Time
h_t = σ(W_ih x_t + U h_{t-1} + b)
y_t = σ(W_o h_t)
Cross-entropy loss: E(y, ŷ) = − Σ_t y_t log ŷ_t
Unfold the network, then repeat over the training data. Given some input sequence x, for t in 0 … N−1:
- Initialize the hidden state to the past value h_{t−1} and forward-propagate
- Obtain the output ŷ_t and compute the next hidden state value h_t
- Calculate the error E(y, ŷ)
- Back-propagate the error across the unfolded network
- Average the weight gradients over the unfolded copies (the parameters are shared)
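
A compact numpy sketch of this procedure for a vanilla RNN, using tanh hidden units and a softmax output with cross-entropy loss instead of the sigmoids above; all shapes and names are illustrative assumptions. The point to notice is that the gradients of the shared parameters are accumulated over all unfolded time steps.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def bptt(xs, targets, W_ih, U, b, W_o):
    """Forward pass over the whole sequence, then back-propagate through the unfolded network."""
    # Forward: unfold in time, remembering every hidden state
    hs, ps = [np.zeros(U.shape[0])], []
    for x in xs:
        hs.append(np.tanh(W_ih @ x + U @ hs[-1] + b))
        ps.append(softmax(W_o @ hs[-1]))
    loss = -sum(np.log(p[t]) for p, t in zip(ps, targets))   # E = -sum_t log p_t[target_t]

    # Backward: accumulate gradients of the *shared* parameters across all time steps
    dW_ih, dU, db, dW_o = (np.zeros_like(M) for M in (W_ih, U, b, W_o))
    dh_next = np.zeros(U.shape[0])
    for t in reversed(range(len(xs))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0                  # d loss / d (pre-softmax output)
        dW_o += np.outer(dy, hs[t + 1])
        dh = W_o.T @ dy + dh_next              # gradient from this step's output + from the future
        dpre = (1.0 - hs[t + 1] ** 2) * dh     # back through tanh
        dW_ih += np.outer(dpre, xs[t])
        dU += np.outer(dpre, hs[t])
        db += dpre
        dh_next = U.T @ dpre                   # pass gradient to the previous time step
    return loss, (dW_ih, dU, db, dW_o)

# Illustrative toy sequence: 4 steps, 3-dim inputs, 5-dim hidden state, 2 output classes
rng = np.random.default_rng(2)
xs = [rng.normal(size=3) for _ in range(4)]
loss, grads = bptt(xs, [0, 1, 1, 0], rng.normal(size=(5, 3)) * 0.1,
                   rng.normal(size=(5, 5)) * 0.1, np.zeros(5), rng.normal(size=(2, 5)) * 0.1)
print(loss)
```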

Back-propagation Through Time (network unfolded for time steps 0, 1, 2, with hidden states h_0, h_1, h_2, predictions ŷ_0, ŷ_1, ŷ_2 and errors E_0, E_1, E_2). The chain rule gives ∂h_2/∂h_0 = (∂h_2/∂h_1)(∂h_1/∂h_0). For the error at time 2, with θ the network parameters:
∂E_2/∂θ = Σ_{k=0}^{2} (∂E_2/∂ŷ_2) · (∂ŷ_2/∂h_2) · (∂h_2/∂h_k) · (∂h_k/∂θ)

Back-propagation Through Time (figure: the same unfolded network; each per-step error contributes a gradient ∂E_0/∂h_0, ∂E_1/∂h_1, ∂E_2/∂h_2, which flows backwards through ∂h_2/∂h_1 and ∂h_1/∂h_0 to earlier time steps).

Problem: vanishing gradients
Saturated neurons have gradients close to 0. The gradient with respect to a distant hidden state is a product of per-step factors:
∂h_t/∂h_0 = (∂h_t/∂h_{t-1}) · ⋯ · (∂h_3/∂h_2) · (∂h_2/∂h_1) · (∂h_1/∂h_0)
This product decays exponentially, driving the gradients of earlier layers to 0 (especially for distant time steps). The network stops learning and cannot update, so it is impossible to learn correlations between temporally distant events. This is a known problem for deep feed-forward networks; for recurrent networks (even shallow ones) it makes learning long-term dependencies impossible. Smaller weight parameters lead to faster gradient vanishing; very large initial parameters make gradient descent diverge quickly (explode).
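
A tiny numerical illustration of the exponential decay, using a scalar tanh RNN with made-up numbers: each factor ∂h_k/∂h_{k-1} equals w · (1 − tanh²), which is below 1 for small weights and saturated units, so the product shrinks towards 0 as the distance in time grows.

```python
import numpy as np

w = 0.9                      # a small recurrent weight (scalar RNN, purely illustrative)
h = 0.0
grad = 1.0                   # running product d h_t / d h_0
for t in range(1, 31):
    h = np.tanh(w * h + 1.0)         # constant input of 1.0, just to drive the state
    grad *= w * (1.0 - h ** 2)       # factor d h_t / d h_{t-1} for a tanh unit
    if t % 10 == 0:
        print(f"t={t:2d}  |dh_t/dh_0| ~ {abs(grad):.2e}")
# The gradient magnitude drops by orders of magnitude every few steps,
# so correlations with distant inputs become invisible to gradient descent.
```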

Problem: exploding gradients
A large increase in the norm of the gradient during training. Diagnostics: NaNs; large fluctuations of the cost function; the network cannot converge and the weight parameters do not stabilize. Pascanu R. et al., On the difficulty of training recurrent neural networks, arXiv (2012).
Solutions: use gradient clipping; try reducing the learning rate; change the loss function by setting constraints on the weights (L1/L2 norms).
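
A minimal sketch of the gradient-clipping fix, rescaling all gradients when their global norm exceeds a threshold (the threshold value, function name, and example arrays are illustrative assumptions):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads, total_norm

# Example: an "exploded" gradient gets scaled back to norm ~5 before the update step
grads = [np.full((4, 4), 100.0), np.full(4, 100.0)]
clipped, norm = clip_by_global_norm(grads)
print(norm, np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # large norm before, ~5.0 after
```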

Problems with the vanilla RNN: when dealing with a time series, it tends to forget old information. When there is a distant dependency of unknown length, we would like the network to have a “memory” for it. This is the vanishing gradient problem.

LSTM

The sigmoid layer outputs numbers between 0 and 1 that determine how much of each component should be let through. The pink × gate is point-wise multiplication.

LSTM
- Forget gate: a sigmoid gate that determines how much of the previous cell state C_{t-1} is let through.
- Input gate: a sigmoid gate that decides what information is to be added to the cell state.
- Output gate: a sigmoid gate that controls what goes into the output h_t.
The core idea is the cell state C_t: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.
Why sigmoid or tanh? Sigmoid gives 0–1 gating, acting as a switch. Since the vanishing gradient problem is already handled inside the LSTM, could ReLU replace tanh?

The input gate i_t decides which components are to be updated, and C't provides the candidate contents for the change. The cell state is updated accordingly, and finally we decide what part of the cell state to output.

RNN vs LSTM

LSTM – Forward/Backward Explore the equations for LSTM forward and backward computations: Illustrated LSTM Forward and Backward Pass

Information Flow in an LSTM. The input x_t and the previous hidden state h_{t-1} are stacked and multiplied by four weight matrices:
z = tanh(W [x_t; h_{t-1}]) (updating information)
z_i = σ(W_i [x_t; h_{t-1}]) (controls the input gate)
z_f = σ(W_f [x_t; h_{t-1}]) (controls the forget gate)
z_o = σ(W_o [x_t; h_{t-1}]) (controls the output gate)
These four matrix computations can be done concurrently.

Information flow of LSTM (⊙ denotes element-wise multiplication):
c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
h_t = z_o ⊙ tanh(c_t)
y_t = σ(W' h_t)
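
Putting the gate activations of the previous slide together with these update equations gives one LSTM step. A numpy sketch: the z / z_i / z_f / z_o naming follows the slides, while the function name, shapes, and the W_out matrix standing in for W' are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, W_i, W_f, W_o, W_out):
    """One LSTM time step using the z / z_i / z_f / z_o notation from the slides."""
    v = np.concatenate([x_t, h_prev])       # stack input and previous hidden state
    z   = np.tanh(W @ v)                    # candidate update
    z_i = sigmoid(W_i @ v)                  # input gate
    z_f = sigmoid(W_f @ v)                  # forget gate
    z_o = sigmoid(W_o @ v)                  # output gate
    c_t = z_f * c_prev + z_i * z            # slow, mostly linear cell-state update
    h_t = z_o * np.tanh(c_t)
    y_t = sigmoid(W_out @ h_t)              # y_t = sigma(W' h_t)
    return y_t, h_t, c_t

# Illustrative sizes: 3-dim input, 4-dim hidden/cell state, 2-dim output
rng = np.random.default_rng(3)
W, W_i, W_f, W_o = (rng.normal(size=(4, 7)) for _ in range(4))
y, h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4),
                    W, W_i, W_f, W_o, rng.normal(size=(2, 4)))
print(y.shape, h.shape, c.shape)
```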

LSTM information flow over time (figure: the same cell unrolled for consecutive time steps; the cell state c_t and the hidden state h_t produced at each step are passed on to the next step).

GRU (Gated Recurrent Unit): it has a reset gate and an update gate. It combines the forget and input gates into a single update gate, and it also merges the cell state and the hidden state. This is simpler than the LSTM. There are many other variants too.

Both take x_t and h_{t-1} as inputs and compute h_t to pass along. GRUs do not need a separate cell state to pass values along. The calculations within each iteration ensure that the h_t values being passed along either retain a large amount of old information or are jump-started with a large amount of new information.
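
For comparison with the LSTM step above, a GRU step in the same style; this follows the standard Cho et al. formulation, and the weight names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_u, W_h):
    """One GRU time step: reset gate r, update gate u, no separate cell state."""
    v = np.concatenate([x_t, h_prev])
    r = sigmoid(W_r @ v)                                        # reset gate
    u = sigmoid(W_u @ v)                                        # update gate (merged forget + input)
    h_cand = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))   # candidate state uses the reset old state
    h_t = (1.0 - u) * h_prev + u * h_cand                       # keep old info vs. jump to new info
    return h_t

# Illustrative sizes: 3-dim input, 4-dim hidden state
rng = np.random.default_rng(4)
h = gru_step(rng.normal(size=3), np.zeros(4),
             rng.normal(size=(4, 7)), rng.normal(size=(4, 7)), rng.normal(size=(4, 7)))
print(h.shape)   # the single h_t carries all the state between time steps
```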

Feed-forward vs Recurrent Network
A feedforward network does not receive an input at each step and has different parameters for each layer:
a_t = f_t(a_{t-1}) = σ(W_t a_{t-1} + b_t), where t is the layer index.
A recurrent network applies the same function f at every step and receives a new input x_t at each step:
a_t = f(a_{t-1}, x_t) = σ(W_h a_{t-1} + W_i x_t + b_i), where t is the time step.

Applications of LSTM / RNN

Neural machine translation LSTM

Sequence to sequence chat model

Chat with context (figure: example exchanges between the machine M and the user U). Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models”.

Baidu’s speech recognition using RNN

Attention Mechanism

Image caption generation using attention (from CY Lee's lecture). A CNN produces a vector for each image region. The initial state z0 is a parameter that is also learned; it is matched against each region vector to produce an attention weight (e.g., 0.7 for one region).

Image Caption Generation (figure: the attention weights, e.g., 0.7, 0.1, 0.1, 0.1, 0.0, 0.0, give a weighted sum of the region vectors; attending to a region, the decoder emits Word 1 and the next state z1).

Image Caption Generation (figure: the new state z1 is matched against the region vectors again, giving new attention weights, e.g., 0.0, 0.8, 0.2, 0.0, 0.0, 0.0; their weighted sum is used to emit Word 2 and the next state z2).
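
The attention step itself is a softmax over match scores followed by a weighted sum of the region vectors. A numpy sketch: the simple dot-product match and all dimensions are illustrative assumptions; real models often use a small learned network for the match.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attend(z, regions):
    """Match the decoder state z against every region vector and return the weighted sum."""
    scores = np.array([z @ r for r in regions])   # one match score per image region
    weights = softmax(scores)                     # e.g., 0.0, 0.8, 0.2, ... over the regions
    context = np.sum(weights[:, None] * regions, axis=0)
    return context, weights

# Illustrative: 6 CNN region vectors of dimension 4, and a decoder state z of the same dimension
rng = np.random.default_rng(5)
regions = rng.normal(size=(6, 4))
context, weights = attend(rng.normal(size=4), regions)
print(weights.round(2), context.shape)   # weights sum to 1; context feeds the caption RNN
```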

Image Caption Generation: captions can also be generated in a story-like style (neural storyteller). http://www.cs.toronto.edu/~mbweb/ https://github.com/ryankiros/neural-storyteller Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

Image Caption Generation (example: a generated caption involving a skateboard). Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

Image Caption Generation: Deep Visual-Semantic Alignments for Generating Image Descriptions. Source: http://cs.stanford.edu/people/karpathy/deepimagesent/

Another application is summarization: Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015