Image Captions With Deep Learning Yulia Kogan & Ron Shiff
Lecture outline
Part 1 – NLP and RNN Introduction
  "The Unreasonable Effectiveness of Recurrent Neural Networks"
  Basic Recurrent Neural Network NLP example
  Long Short-Term Memory RNNs
Part 2 – Image Captioning Algorithms using RNNs
The Unreasonable Effectiveness of Recurrent Neural Networks (taken from Andrej Karpathy's blog)
So far – "old school" neural networks: fixed-length inputs and outputs
RNNs operate over sequences of vectors (input and/or output)
Examples: image captions, sentiment analysis, machine translation, "word prediction"
The Unreasonable Effectiveness of Recurrent Neural Networks – sample output: RNN-generated algebraic geometry (LaTeX)
The Unreasonable Effectiveness of Recurrent Neural Networks – sample output: RNN-generated Shakespeare
Word Vectors
Classical word representation is "one hot": each word is represented by a sparse binary vector of vocabulary size $|V|$, with a single 1 at the word's index.
Word Vectors
A more modern approach: represent each word by a dense vector $x \in \mathbb{R}^d$ with $d \ll |V|$.
"Semantically" close words are close in the vector space.
Semantic relations are preserved in the vector space: "king" + "woman" - "man" ≈ "queen"
Word Vectors
A word vector can be written as $x = L\,w$, where $L \in \mathbb{R}^{d \times |V|}$ is the embedding matrix and $w$ is a "one hot" vector.
Beneficial for most deep learning tasks.
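As a concrete illustration (not from the original slides), here is a minimal numpy sketch of the lookup $x = L\,w$; the toy vocabulary, embedding dimension, and random matrix $L$ are all illustrative assumptions.

```python
import numpy as np

# A one-hot vector selects a column of the embedding matrix L,
# giving the dense word vector x = L @ w.
vocab = ["king", "queen", "man", "woman", "hotel"]   # hypothetical toy vocabulary
V, d = len(vocab), 3                                 # vocabulary size, embedding dim

rng = np.random.default_rng(0)
L = rng.standard_normal((d, V))                      # embedding matrix (d x |V|)

w = np.zeros(V)
w[vocab.index("king")] = 1.0                         # "one hot" vector for "king"

x = L @ w                                            # dense word vector
assert np.allclose(x, L[:, vocab.index("king")])     # same as a column lookup
print(x)
```

In practice the matrix product is never formed explicitly; frameworks implement it as an index lookup into $L$.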
RNN – Language Model (based on Richard Socher's lecture, Deep Learning in NLP, Stanford)
A language model computes a probability for a sequence of words:
$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$
Examples:
Word ordering: $P(\text{the cat is small}) > P(\text{small the is cat})$
Word choice: $P(\text{walking home after school}) > P(\text{walking house after school})$
Useful for machine translation and speech recognition
Recurrent Neural Networks – Language Model
Each output depends on all previous inputs.
RNN – Language Model
Input: word vectors $x_1, \dots, x_T$
At each time step, compute:
$h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$
Output: $\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)$
Recurrent Neural Networks – Language Model
Total objective is to maximize the log-likelihood w.r.t. the parameters.
Log-likelihood: $J(\theta) = \sum_{t} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$
where $y_t$ is the "one hot" vector containing the true word at time $t$.
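To make the recurrence and the per-step objective concrete, here is a minimal numpy sketch of one forward pass (not from the original slides); the dimensions, weight scales, and random "sentence" are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative sizes: d-dim word vectors, Dh hidden units, |V| vocabulary words.
d, Dh, V = 4, 5, 10
rng = np.random.default_rng(1)
W_hh = rng.standard_normal((Dh, Dh)) * 0.1   # hidden-to-hidden weights
W_hx = rng.standard_normal((Dh, d)) * 0.1    # input-to-hidden weights
W_S  = rng.standard_normal((V, Dh)) * 0.1    # hidden-to-output weights

# One forward pass over a random "sentence" of T word vectors.
T = 6
xs = rng.standard_normal((T, d))             # word vectors x_1..x_T
true_ids = rng.integers(0, V, size=T)        # indices of the true words
h = np.zeros(Dh)
J = 0.0
for t in range(T):
    h = sigmoid(W_hh @ h + W_hx @ xs[t])     # h_t = sigma(W_hh h_{t-1} + W_hx x_t)
    y_hat = softmax(W_S @ h)                 # y_hat_t = softmax(W_S h_t)
    J += np.log(y_hat[true_ids[t]])          # log-likelihood of the true word
print("total log-likelihood:", J)
```

Training maximizes this quantity (equivalently, minimizes the summed cross-entropy) with backpropagation through time.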
RNNs – HARD TO TRAIN!
Vanishing/Exploding gradient problem
For stochastic gradient descent we calculate the derivative of the loss w.r.t. the parameters: $\frac{\partial J}{\partial W}$
Reminder: $h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$, where $\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)$
Applying the chain rule:
$\frac{\partial J^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial J^{(t)}}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$
Vanishing/Exploding gradient problem
Update equation: $h_j = \sigma\!\left(W^{(hh)} h_{j-1} + W^{(hx)} x_j\right)$
By the chain rule:
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \left(W^{(hh)}\right)^{\top} \mathrm{diag}\!\left[\sigma'\!\left(W^{(hh)} h_{j-1} + W^{(hx)} x_j\right)\right]$
Vanishing/Exploding gradient problem
The repeated factors of $W^{(hh)}$ make the gradients very small or very large:
"small $W$" – vanishing gradient: long-time dependencies contribute almost nothing to the update
"large $W$" – exploding gradient (bad for optimization)
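The geometric behaviour of the Jacobian product $\frac{\partial h_t}{\partial h_k}$ can be seen numerically. The sketch below (not from the original slides) multiplies random factors $\left(W^{(hh)}\right)^{\top}\mathrm{diag}[\sigma']$ for a growing number of steps; the matrix sizes, weight scales, and random $\sigma'$ values are illustrative assumptions.

```python
import numpy as np

# The gradient term dh_t/dh_k is a product of (t-k) Jacobians W_hh^T diag(sigma'(.)).
# Its norm shrinks or grows geometrically with the number of steps,
# depending on the scale of W_hh.
def jacobian_product_norm(scale, steps, Dh=10, seed=0):
    rng = np.random.default_rng(seed)
    W_hh = rng.standard_normal((Dh, Dh)) * scale
    J = np.eye(Dh)
    for _ in range(steps):
        sig_prime = rng.uniform(0.0, 0.25, size=Dh)   # sigmoid' is at most 0.25
        J = J @ (W_hh.T * sig_prime)                  # equals W_hh^T @ diag(sigma')
    return np.linalg.norm(J)

for scale in (0.5, 5.0):                              # "small W" vs "large W"
    norms = [jacobian_product_norm(scale, k) for k in (1, 5, 20)]
    print(f"scale {scale}: norm after 1/5/20 steps = {norms}")
```

With the small scale the norm collapses toward zero (vanishing gradient); with the large scale it blows up (exploding gradient).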
LSTMs – Long Short-Term Memory
Introduced in 1997 by Hochreiter and Schmidhuber
Address the vanishing/exploding gradient problem using gating
Figures taken from Christopher Olah's blog
LSTMs – Equations:
$f_t = \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right)$  (forget gate)
$i_t = \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right)$  (input gate)
$\tilde{C}_t = \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right)$  (candidate values)
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$  (cell state update)
$o_t = \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right)$  (output gate)
$h_t = o_t \odot \tanh(C_t)$  (output)
Note the different notation from the RNN slides: $h_t$ here plays the role of $y_t$, and the cell state $C_t$ plays the role of $h_t$.
LSTMs – "Forget Gate"
$f_t = 0$: forget; $f_t = 1$: keep
Examples: a period "." ends a sentence; when a new subject appears, the old subject's gender can be forgotten
LSTMs – "Input gate layer"
What information goes into the new candidate $\tilde{C}_t$?
The "input gate layer" $i_t$ decides which values we'll update; a tanh layer creates the candidate values $\tilde{C}_t$.
LSTMs – Updating the memory cell
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
The dependence on earlier cells is no longer exponential in $W$ – information can flow:
e.g. $C_t$ reaches back to $C_{t-2}$ only through the forget gates, $f_t \odot f_{t-1} \odot C_{t-2}$, with no repeated multiplication by $W$.
LSTMs – Setting the output
Finally, we need to decide what to output:
$o_t = \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right)$, $h_t = o_t \odot \tanh(C_t)$
The output is a filtered version of the cell state.
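Putting the gate equations together, here is a minimal numpy sketch of a single LSTM step (not from the original slides); the dimensions, weight scales, and random inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate: 0 = forget, 1 = keep
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde       # additive cell update (no repeated W on C)
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden state / output
    return h_t, C_t

# Illustrative sizes: 4-dim inputs, 3-dim hidden/cell state.
d, Dh = 4, 3
rng = np.random.default_rng(2)
Wf, Wi, Wc, Wo = (rng.standard_normal((Dh, Dh + d)) * 0.1 for _ in range(4))
bf, bi, bc, bo = (np.zeros(Dh) for _ in range(4))

h, C = np.zeros(Dh), np.zeros(Dh)
for x_t in rng.standard_normal((5, d)):      # run 5 time steps
    h, C = lstm_step(x_t, h, C, Wf, Wi, Wc, Wo, bf, bi, bc, bo)
print(h, C)
```

Note how the cell state is only ever scaled by gates and added to, which is what keeps the gradient path through time well behaved.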
Conclusions
RNNs are very powerful
RNNs are hard to train
Nowadays, gating (LSTMs) is the way to go!
Acknowledgments:
Andrej Karpathy – http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Richard Socher – http://cs224d.stanford.edu/
Christopher Olah – http://colah.github.io/posts/2015-08-Understanding-LSTMs/