Lecture 16: Recurrent Neural Networks (RNNs)

Lecture 16: Recurrent Neural Networks (RNNs). CS480/680: Intro to ML. Some slides are adapted from the Stanford CS231n course slides. 11/27/18, Yaoliang Yu

Recap: MLPs and CNNs
MLPs: fixed length of input and output; no parameter sharing; units fully connected to the previous layer.
CNNs: parameter sharing; units connected only to a small region of the previous layer.

RNN: examples (1) Variable length of input/output, e.g., image captioning.

RNN: examples (2) Variable length of input/output, e.g., sentiment classification: "There is nothing to like in this movie." → 0; "This movie is fantastic!" → 1.

RNN: examples (3) Variable length of input/output, e.g., machine translation: 你想和我一起唱歌吗? → Do you want to sing with me?

RNN: examples (4) Variable length of input/output, e.g., frame-level video classification.

RNN: parameter sharing
Parameter sharing: the state at each time step is produced by the same update rule applied to the previous state, h(t) = f(h(t-1), x(t); θ), where h(t-1) is the old state at time t-1, x(t) is the new input at time t, and θ are the parameters; the same function f is used for every time step. Through these recurrent connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far apart in the data.
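A minimal sketch (not from the slides) of this shared update rule in NumPy, with illustrative state/input sizes: the same f and the same parameters θ are reused at every time step.

import numpy as np

def f(h_prev, x_t, theta):
    # one recurrence step: new state from old state, new input, and shared parameters theta
    W_hh, W_xh = theta
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

rng = np.random.default_rng(0)
theta = (rng.normal(size=(4, 4)) * 0.1,   # W_hh, shared across all time steps
         rng.normal(size=(4, 3)) * 0.1)   # W_xh, shared across all time steps
h = np.zeros(4)                           # h(0)
for x_t in rng.normal(size=(10, 3)):      # a length-10 input sequence
    h = f(h, x_t, theta)                  # h(t) = f(h(t-1), x(t); theta)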

Unfolding computational graphs (1): an RNN with no output; the recurrent edge carries the state forward with a one-step time delay.

Unfolding computational graphs (2)
Example: an RNN that produces a (discrete) output at each time step and has recurrent connections between hidden units (many to many); the same parameters are used for every step. Vanilla RNN: a single hidden vector h.

Language model: training
At each step the hidden state is updated as h_t = tanh(W_hh h_{t-1} + W_xh x_t), and an output layer predicts the next character; during training some of these predictions are wrong, and the loss pushes them toward the correct targets.
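A hedged sketch of one training-time forward pass of a character-level language model with a vanilla RNN, following h_t = tanh(W_hh h_{t-1} + W_xh x_t); the vocabulary, sizes, and variable names below are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(1)
V, H = 4, 8                                   # vocabulary size, hidden size (illustrative)
W_xh = rng.normal(size=(H, V)) * 0.01
W_hh = rng.normal(size=(H, H)) * 0.01
W_hy = rng.normal(size=(V, H)) * 0.01

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

# training pair: input characters and their next-character targets, e.g. "hell" -> "ello"
inputs, targets = [0, 1, 2, 2], [1, 2, 2, 3]
h = np.zeros(H)
loss = 0.0
for x_i, y_i in zip(inputs, targets):
    h = np.tanh(W_hh @ h + W_xh @ one_hot(x_i, V))   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    logits = W_hy @ h                                # unnormalized next-character scores
    p = np.exp(logits - logits.max()); p /= p.sum()  # softmax over the vocabulary
    loss += -np.log(p[y_i])                          # cross-entropy; wrong predictions raise the loss
print(loss)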

Language model: testing
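At test time the standard recipe is to generate text one character at a time, sampling from the output distribution and feeding the sample back in as the next input. A sketch reusing the (hypothetical) weights and helpers from the training sketch above:

h = np.zeros(H)
x_i = 0                                    # seed character index (illustrative)
sampled = [x_i]
for _ in range(10):
    h = np.tanh(W_hh @ h + W_xh @ one_hot(x_i, V))
    logits = W_hy @ h
    p = np.exp(logits - logits.max()); p /= p.sum()
    x_i = rng.choice(V, p=p)               # sample the next character from the output distribution
    sampled.append(int(x_i))               # the sample becomes the next input
print(sampled)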

Vanilla RNN gradient problems
The challenge of long-term backprop: backpropagation through time repeatedly multiplies by the matrix W (and by tanh derivatives). Assume W is a real symmetric matrix with eigendecomposition W = QΛQ^T (Q orthogonal); then W^t = QΛ^tQ^T.
Largest eigenvalue > 1: exploding gradient. Largest eigenvalue < 1: vanishing gradient. Either way it becomes practically impossible to learn correlations between temporally distant events (e.g., "Jack, who is …., is a great guy.").
One proposed remedy: a gradient-norm clipping strategy for exploding gradients and a soft constraint for the vanishing-gradient problem.
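A small numeric illustration of this point (an assumption-laden toy, not from the slides): with W = QΛQ^T, the factor W^t = QΛ^tQ^T applied during backprop explodes or vanishes depending on whether the largest eigenvalue magnitude is above or below 1.

import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))           # orthonormal eigenvector basis

for lam_max in (1.1, 0.9):                             # largest eigenvalue above / below 1
    Lam = np.diag([lam_max, 0.5, 0.3, 0.1])
    W = Q @ Lam @ Q.T                                  # real symmetric, W = Q Lambda Q^T
    g = np.ones(4)
    for t in (10, 50, 100):
        # norm of the backprop factor applied t times
        print(lam_max, t, np.linalg.norm(np.linalg.matrix_power(W, t) @ g))
# norms blow up for 1.1 (exploding) and shrink toward 0 for 0.9 (vanishing)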

Exploding gradient
Common solution: gradient clipping. Let g be the gradient and v a threshold: if the norm of the gradient exceeds the threshold, rescale it, g ← (v / ||g||) g. Without clipping, a single huge gradient step can overshoot.
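A minimal sketch of this clipping rule (PyTorch users could instead call torch.nn.utils.clip_grad_norm_, which implements the same idea); the threshold value is illustrative.

import numpy as np

def clip_gradient(g, v):
    # if the gradient norm exceeds the threshold v, scale g back onto the ball of radius v
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g

g = np.array([30.0, 40.0])        # ||g|| = 50
print(clip_gradient(g, v=5.0))    # rescaled to norm 5, direction preserved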

Vanishing gradient
Common solution: create paths along which the product of gradients stays near 1, e.g., LSTM and GRU. The gates use a sigmoid (values near 0 or 1, acting like a switch that is off or on) and a tanh (values in (-1, 1), carrying the useful information); gating is an element-wise product. The flow of information into and out of the unit is guarded by learned input and output gates.

Long short-term memory (LSTM)
c_t is the cell state (the memory cell, which is able to remember information over time) and h_t is the hidden state; h_t is a function of c_t, so we can focus on how gradients flow through the cell state.
Backprop through the cell state is element-wise, and the forget gate f is different at each time step, so we do not multiply by the same factor over and over again. Since f comes from a sigmoid it lies between 0 and 1, so this path will not explode; taken together, these factors also make extreme vanishing much less likely (though still possible), giving a smoother gradient flow.
Gradient w.r.t. a parameter w: at each time step, first take the gradient w.r.t. the hidden state and cell state, then the local gradient (of the hidden and cell state w.r.t. w), and finally add the contributions from all time steps together.
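A hedged sketch of one LSTM step in NumPy matching the description above: sigmoid gates in (0, 1) guard what enters and leaves the memory cell, the cell state is updated additively and element-wise, and the hidden state is a gated function of the cell state. Weight layout, names, and sizes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: (4H, H + X) stacked weights for input, forget, output gates and candidate; b: (4H,)
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:H])            # input gate: how much new information enters the cell
    f = sigmoid(z[H:2*H])          # forget gate: how much of the old cell state is kept
    o = sigmoid(z[2*H:3*H])        # output gate: how much of the cell is exposed
    g = np.tanh(z[3*H:4*H])        # candidate values, in (-1, 1)
    c_t = f * c_prev + i * g       # element-wise, additive cell-state update
    h_t = o * np.tanh(c_t)         # hidden state is a gated function of the cell state
    return h_t, c_t

rng = np.random.default_rng(3)
H, X = 5, 3
W = rng.normal(size=(4 * H, H + X)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, W, b)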

LSTM: cell state
The update of the cell state can simulate a counter: the ability to add to the previous value means these units can act as counters. Exercise: show that if the forget gate is on and the input and output gates are off, the LSTM passes the memory cell (and its gradients) through unmodified at each time step. The LSTM architecture is therefore resistant to exploding and vanishing gradients, although mathematically both phenomena are still possible.
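A quick numeric check of this claim, reusing lstm_step (and H, X) from the sketch above, with gate biases chosen so the forget gate saturates near 1 and the input and output gates near 0; the cell state then passes through essentially unchanged.

b2 = np.zeros(4 * H)
b2[0:H] = -20.0       # input gate ~ 0
b2[H:2*H] = +20.0     # forget gate ~ 1
b2[2*H:3*H] = -20.0   # output gate ~ 0
c0 = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
h_t, c_t = lstm_step(np.zeros(X), np.zeros(H), c0, np.zeros((4 * H, H + X)), b2)
print(np.allclose(c_t, c0, atol=1e-6))   # True: the memory cell passes through unmodified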

Gated recurrent unit (GRU)
GRU: 2 gates (reset r and update z). LSTM: 3 gates (input i, forget f, output o).
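A similar hedged sketch of one GRU step, using one common convention for the update: two gates, reset r and update z, and the new hidden state interpolates between the old state and a candidate. Names and sizes are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))  # candidate state; reset gate masks the old state
    return (1.0 - z) * h_prev + z * h_tilde            # interpolate between old state and candidate

rng = np.random.default_rng(4)
H, X = 5, 3
Wx = [rng.normal(size=(H, X)) * 0.1 for _ in range(3)]   # W_z, W_r, W_h
Uh = [rng.normal(size=(H, H)) * 0.1 for _ in range(3)]   # U_z, U_r, U_h
h = gru_step(rng.normal(size=X), np.zeros(H), Wx[0], Uh[0], Wx[1], Uh[1], Wx[2], Uh[2])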

Bidirectional RNNs
Idea: combine an RNN that moves forward through time with another RNN that moves backward through time.
Motivation: the prediction may need future information, e.g., named entity recognition:
He said: "Teddy bears are on sale!" (not a person's name)
He said: "Teddy Roosevelt was a great President!" (a person's name)

Illustration: h(t) propagates information forward, g(t) propagates information backward, and the output o(t) uses both h(t) and g(t).
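A sketch of this idea (plain tanh cells and illustrative shapes are assumptions): run one recurrence forward and one backward over the same sequence, and form each output from both the forward state h(t) and the backward state g(t).

import numpy as np

def run_rnn(xs, W_hh, W_xh):
    # plain tanh RNN; returns the hidden state at every time step
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return states

rng = np.random.default_rng(5)
H, X, T = 4, 3, 6
xs = rng.normal(size=(T, X))
fwd = run_rnn(xs, rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, X)) * 0.1)              # h(t), forward in time
bwd = run_rnn(xs[::-1], rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, X)) * 0.1)[::-1]  # g(t), backward in time
W_o = rng.normal(size=(2, 2 * H)) * 0.1
outputs = [W_o @ np.concatenate([h_t, g_t]) for h_t, g_t in zip(fwd, bwd)]                   # o(t) uses both h(t) and g(t)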

Deep RNNs
Usually 2-3 hidden layers; l indexes the l-th hidden layer and t the t-th time step.
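A sketch of a stacked ("deep") RNN with 2 hidden layers, using the indexing above: the state of layer l at time t depends on the same layer's previous state and on the layer below's current state. Sizes are illustrative.

import numpy as np

rng = np.random.default_rng(6)
L, H, X, T = 2, 4, 3, 5                                    # 2 hidden layers (typical is 2-3)
W_hh = [rng.normal(size=(H, H)) * 0.1 for _ in range(L)]   # recurrent weights, one set per layer
W_in = [rng.normal(size=(H, X if l == 0 else H)) * 0.1 for l in range(L)]  # input from the layer below
xs = rng.normal(size=(T, X))

h = [np.zeros(H) for _ in range(L)]                        # one hidden state per layer
for x_t in xs:
    below = x_t
    for l in range(L):
        # layer l at time t depends on its own state at t-1 and on the layer below at time t
        h[l] = np.tanh(W_hh[l] @ h[l] + W_in[l] @ below)
        below = h[l]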

Questions?