1
Image Captions with Deep Learning
Yulia Kogan & Ron Shiff
2
Lecture outline
Part 1 – NLP and RNN introduction: "The Unreasonable Effectiveness of Recurrent Neural Networks", a basic recurrent neural network NLP example, and Long Short-Term Memory (LSTM) RNNs
Part 2 – Image captioning algorithms using RNNs
3
The Unreasonable Effectiveness of Recurrent Neural Networks
Taken from Andrej Karpathy's blog
So far – "old school" neural networks: fixed-length inputs and outputs
RNNs operate over sequences of vectors (input or output), e.g.:
Image captions
Sentiment analysis
Machine translation
Word prediction
4
The Unreasonable Effectiveness of Recurrent Neural Networks
Example: algebraic geometry LaTeX source generated character by character by an RNN.
5
The Unreasonable Effectiveness of Recurrent Neural Networks
Example: Shakespeare-like text generated by a character-level RNN.
7
Word Vectors – the classical word representation is "one hot":
Each word is represented by a sparse vector: all zeros except for a single 1 at the word's index.
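As a minimal sketch of a one-hot representation (the toy vocabulary below is made up for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary; in practice |V| is tens of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat"]

def one_hot(word, vocab):
    """Return a |V|-dimensional vector that is all zeros except for a 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("cat", vocab))  # [0. 1. 0. 0. 0.]
```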
8
Word Vectors – a more modern approach: represent each word as a dense vector.
"Semantically" close words are close in the vector space, and semantic relations are preserved as vector arithmetic: "king" + "woman" - "man" ≈ "queen".
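A hedged sketch of how this analogy can be checked with pretrained embeddings via gensim; the file path is a placeholder, and the exact nearest neighbour depends on the pretrained model used:

```python
from gensim.models import KeyedVectors

# Placeholder path to pretrained word2vec vectors (e.g., a word2vec binary file).
kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# "king" + "woman" - "man" should land near "queen" in the embedding space.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```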
9
Word Vectors – a word vector can be written as $x = L\,w$,
where $w$ is a "one hot" vector and $L$ is the learned embedding matrix (one column per vocabulary word). This representation is beneficial for most deep learning tasks.
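A minimal NumPy sketch of this: multiplying a one-hot vector by an embedding matrix (random here, standing in for a learned one) just selects a column:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 5                      # toy embedding dimension and vocabulary size
L = rng.normal(size=(d, V))      # embedding matrix (random placeholder for a learned L)

w = np.zeros(V)
w[1] = 1.0                       # "one hot" vector for the word with index 1
x = L @ w                        # dense word vector

assert np.allclose(x, L[:, 1])   # the matrix product is simply a column lookup
```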
10
RNN Language Model (based on Richard Socher's Stanford lecture, Deep Learning for NLP). A language model computes a probability for a sequence of words, $P(w_1, \ldots, w_T)$. Examples of what it captures: word ordering (a grammatical ordering should score higher than a scrambled one) and word choice (the likelier word in context should score higher). Useful for machine translation and speech recognition.
11
Recurrent Neural Networks Language Model
Each output depends on all previous inputs
12
RNN Language Model. Input: word vectors $x_1, \ldots, x_T$. At each time step, compute the hidden state
$h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$
Output: a distribution over the next word,
$\hat{y}_t = \operatorname{softmax}\!\left(W^{(S)} h_t\right), \qquad \hat{P}(x_{t+1} = v_j \mid x_1, \ldots, x_t) = \hat{y}_{t,j}$
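A minimal NumPy sketch of one forward step of this model, assuming $\sigma = \tanh$ for the hidden nonlinearity; sizes and initialization are illustrative only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, Dh, V = 50, 100, 10000                      # toy embedding, hidden, vocabulary sizes
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(Dh, Dh))   # hidden-to-hidden weights W^(hh)
W_hx = rng.normal(scale=0.01, size=(Dh, d))    # input-to-hidden weights W^(hx)
W_s  = rng.normal(scale=0.01, size=(V, Dh))    # hidden-to-output weights W^(S)

def rnn_step(x_t, h_prev):
    """One time step: new hidden state and a distribution over the next word."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)
    y_hat = softmax(W_s @ h_t)                 # \hat{y}_t: P(next word | history)
    return h_t, y_hat

h = np.zeros(Dh)
x_t = rng.normal(size=d)                       # stand-in for a word vector
h, y_hat = rnn_step(x_t, h)
print(y_hat.shape, round(y_hat.sum(), 3))      # (10000,) 1.0
```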
13
Recurrent Neural Networks Language Model
The total objective is to maximize the log-likelihood with respect to the parameters. With $y_t$ the "one hot" vector containing the true word at time $t$, this is equivalent to minimizing the cross-entropy loss
$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$
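Continuing the sketch above, the per-step term of this objective is just the negative log-probability assigned to the true next word (the helper below is hypothetical, for illustration):

```python
import numpy as np

def step_loss(y_hat, true_index):
    """Cross-entropy with a one-hot target: -log of the probability of the true word."""
    return -np.log(y_hat[true_index])

# The total objective averages this over all time steps; maximizing the log-likelihood
# is the same as minimizing this loss with respect to W_hh, W_hx, W_s.
```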
14
RNNs – HARD TO TRAIN!
15
Vanishing/Exploding gradient problem
For stochastic gradient descent we calculate the derivative of the loss with respect to the parameters:
$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
Reminder: $h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$ and $\hat{y}_t = \operatorname{softmax}\!\left(W^{(S)} h_t\right)$.
Applying the chain rule:
$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$
16
Vanishing/Exploding gradient problem
From the update equation $h_j = \sigma\!\left(W^{(hh)} h_{j-1} + W^{(hx)} x_j\right)$, by the chain rule:
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \left(W^{(hh)}\right)^{\top} \operatorname{diag}\!\left[\sigma'(\cdot)\right]$
17
Vanishing/Exploding gradient problem
This product of Jacobians can become very small or very large. "Small W" (norm below 1) gives a vanishing gradient, so long time dependencies are lost; "large W" gives an exploding gradient, which is bad for optimization.
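A small numerical illustration of this effect, multiplying random Jacobians of the form $W^{\top}\operatorname{diag}[\sigma']$ over many time steps (the scales and sizes below are arbitrary choices for the sketch):

```python
import numpy as np

def gradient_norm_through_time(scale, steps=50, Dh=100, seed=0):
    """Norm of a product of `steps` random Jacobians ~ W^T diag(sigma') for a W of the given scale."""
    rng = np.random.default_rng(seed)
    W = scale * rng.normal(size=(Dh, Dh)) / np.sqrt(Dh)
    J = np.eye(Dh)
    for _ in range(steps):
        # sigma' is at most 1 for tanh/sigmoid; random values in (0, 1) stand in for it here
        J = (W.T * rng.uniform(0.0, 1.0, size=Dh)) @ J
    return np.linalg.norm(J)

print(gradient_norm_through_time(scale=0.5))   # tiny  -> vanishing gradient ("small W")
print(gradient_norm_through_time(scale=5.0))   # huge  -> exploding gradient ("large W")
```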
18
LSTMs – Long Short-Term Memory
Introduced by Hochreiter and Schmidhuber (1997). LSTMs mitigate vanishing and exploding gradients using gating. Figures taken from Christopher Olah's blog.
19
LSTM equations (note the change of notation: here $h_t$ plays the role of the earlier output $y_t$, and the cell state $C_t$ plays the role of the earlier hidden state $h_t$):
$f_t = \sigma\!\left(W_f \left[h_{t-1}, x_t\right] + b_f\right)$
$i_t = \sigma\!\left(W_i \left[h_{t-1}, x_t\right] + b_i\right)$
$\tilde{C}_t = \tanh\!\left(W_C \left[h_{t-1}, x_t\right] + b_C\right)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$o_t = \sigma\!\left(W_o \left[h_{t-1}, x_t\right] + b_o\right)$
$h_t = o_t \odot \tanh\!\left(C_t\right)$
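A minimal NumPy sketch of one step implementing these equations (random weight placeholders, no peephole connections, all gates share the concatenated input $[h_{t-1}, x_t]$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Dh, d = 64, 32                                  # toy hidden and input sizes
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = [rng.normal(scale=0.1, size=(Dh, Dh + d)) for _ in range(4)]
b_f = b_i = b_c = b_o = np.zeros(Dh)

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM step: returns the new output h_t and memory cell C_t."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)                  # forget gate
    i = sigmoid(W_i @ z + b_i)                  # input gate
    C_tilde = np.tanh(W_c @ z + b_c)            # candidate memory \tilde{C}_t
    C = f * C_prev + i * C_tilde                # additive cell update
    o = sigmoid(W_o @ z + b_o)                  # output gate
    h = o * np.tanh(C)
    return h, C

h, C = np.zeros(Dh), np.zeros(Dh)
h, C = lstm_step(rng.normal(size=d), h, C)      # one step on a random input vector
```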
20
LSTM "forget gate": $f_t = 0$ means forget the old cell content, $f_t = 1$ means keep it.
Examples: at a period ".", forget information tied to the finished sentence; when a new subject appears, forget the old subject's gender.
21
LSTM "input gate layer": decides which values we will update, i.e., what information goes into the new candidate $\tilde{C}_t$.
22
LSTM memory cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. The update is additive, so the dependence on earlier cells is no longer an exponential in the weight matrix and information can flow across many time steps:
unrolling one more step, the contribution of $C_{t-2}$ is $f_t \odot f_{t-1} \odot C_{t-2}$, a product of forget gates rather than of weight matrices.
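As a sketch of why this helps (treating the gates as constants with respect to earlier cell states, which ignores their own dependence on $h_{t-1}$):

```latex
% Gradient flow through the memory cell (gates treated as constants w.r.t. earlier cells):
\[
\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial C_t}{\partial C_k} \approx \prod_{j=k+1}^{t} \operatorname{diag}(f_j)
\]
% A product of gate values in [0, 1] chosen by the network, rather than repeated
% multiplications by the same weight matrix W, so gradients need not vanish or explode.
```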
23
LSTM output: finally, we need to decide what we are going to output. The output gate $o_t = \sigma\!\left(W_o \left[h_{t-1}, x_t\right] + b_o\right)$ selects which parts of the squashed cell state are exposed: $h_t = o_t \odot \tanh(C_t)$.
24
Conclusions: RNNs are very powerful, but hard to train.
Nowadays, gating (LSTMs) is the way to go!
Acknowledgments: Andrej Karpathy ("The Unreasonable Effectiveness of Recurrent Neural Networks"), Richard Socher (Deep Learning for NLP, Stanford), Christopher Olah ("Understanding LSTMs").