1
Image Captions with Deep Learning
Yulia Kogan & Ron Shiff
2
Lecture outline
Part 1 – NLP and RNN introduction: "The Unreasonable Effectiveness of Recurrent Neural Networks", a basic recurrent neural network NLP example, and Long Short-Term Memory (LSTM) RNNs
Part 2 – Image captioning algorithms using RNNs
3
The Unreasonable Effectiveness of Recurrent Neural Networks
Taken from Andrej Karpathy's blog
So far – "old school" neural networks: fixed-length inputs and outputs
RNNs operate over sequences of vectors (input or output), e.g.:
Image captions
Sentiment analysis
Machine translation
Word prediction
4
The Unreasonable Effectiveness of Recurrent Neural Networks
Example: algebraic geometry LaTeX source generated character by character by an RNN.
5
The Unreasonable Effectiveness of Recurrent Neural Networks
Example: Shakespeare-like text generated by a character-level RNN.
7
Word Vectors – the classical word representation is "one hot":
Each word is represented by a sparse vector: all zeros except for a single 1 at the word's index.
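As a minimal sketch of a one-hot representation (the toy vocabulary below is made up for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary; in practice |V| is tens of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat"]

def one_hot(word, vocab):
    """Return a |V|-dimensional vector that is all zeros except for a 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("cat", vocab))  # [0. 1. 0. 0. 0.]
```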
8
Word Vectors – a more modern approach: represent each word as a dense vector.
"Semantically" close words are close in the vector space, and semantic relations are preserved as vector arithmetic: "king" + "woman" - "man" ≈ "queen".
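A hedged sketch of how this analogy can be checked with pretrained embeddings via gensim; the file path is a placeholder, and the exact nearest neighbour depends on the pretrained model used:

```python
from gensim.models import KeyedVectors

# Placeholder path to pretrained word2vec vectors (e.g., a word2vec binary file).
kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# "king" + "woman" - "man" should land near "queen" in the embedding space.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```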
9
Word Vectors – a word vector can be written as $x = L\,w$,
where $w$ is a "one hot" vector and $L$ is the learned embedding matrix (one column per vocabulary word). This representation is beneficial for most deep learning tasks.
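A minimal NumPy sketch of this: multiplying a one-hot vector by an embedding matrix (random here, standing in for a learned one) just selects a column:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 5                      # toy embedding dimension and vocabulary size
L = rng.normal(size=(d, V))      # embedding matrix (random placeholder for a learned L)

w = np.zeros(V)
w[1] = 1.0                       # "one hot" vector for the word with index 1
x = L @ w                        # dense word vector

assert np.allclose(x, L[:, 1])   # the matrix product is simply a column lookup
```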
10
RNN Language Model (based on Richard Socher's Stanford lecture, Deep Learning for NLP). A language model computes a probability for a sequence of words, $P(w_1, \ldots, w_T)$. Examples of what it captures: word ordering (a grammatical ordering should score higher than a scrambled one) and word choice (the likelier word in context should score higher). Useful for machine translation and speech recognition.
11
Recurrent Neural Networks Language Model
Each output depends on all previous inputs
12
RNN Language Model. Input: word vectors $x_1, \ldots, x_T$. At each time step, compute the hidden state
$h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$
Output: a distribution over the next word,
$\hat{y}_t = \operatorname{softmax}\!\left(W^{(S)} h_t\right), \qquad \hat{P}(x_{t+1} = v_j \mid x_1, \ldots, x_t) = \hat{y}_{t,j}$
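A minimal NumPy sketch of one forward step of this model, assuming $\sigma = \tanh$ for the hidden nonlinearity; sizes and initialization are illustrative only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, Dh, V = 50, 100, 10000                      # toy embedding, hidden, vocabulary sizes
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(Dh, Dh))   # hidden-to-hidden weights W^(hh)
W_hx = rng.normal(scale=0.01, size=(Dh, d))    # input-to-hidden weights W^(hx)
W_s  = rng.normal(scale=0.01, size=(V, Dh))    # hidden-to-output weights W^(S)

def rnn_step(x_t, h_prev):
    """One time step: new hidden state and a distribution over the next word."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)
    y_hat = softmax(W_s @ h_t)                 # \hat{y}_t: P(next word | history)
    return h_t, y_hat

h = np.zeros(Dh)
x_t = rng.normal(size=d)                       # stand-in for a word vector
h, y_hat = rnn_step(x_t, h)
print(y_hat.shape, round(y_hat.sum(), 3))      # (10000,) 1.0
```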
13
Recurrent Neural Networks Language Model
The total objective is to maximize the log-likelihood with respect to the parameters. With $y_t$ the "one hot" vector containing the true word at time $t$, this is equivalent to minimizing the cross-entropy loss
$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$
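Continuing the sketch above, the per-step term of this objective is just the negative log-probability assigned to the true next word (the helper below is hypothetical, for illustration):

```python
import numpy as np

def step_loss(y_hat, true_index):
    """Cross-entropy with a one-hot target: -log of the probability of the true word."""
    return -np.log(y_hat[true_index])

# The total objective averages this over all time steps; maximizing the log-likelihood
# is the same as minimizing this loss with respect to W_hh, W_hx, W_s.
```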
14
RNNs – HARD TO TRAIN!
15
Vanishing/Exploding gradient problem
For stochastic gradient descent we calculate the derivative of the loss with respect to the parameters:
$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
Reminder: $h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$ and $\hat{y}_t = \operatorname{softmax}\!\left(W^{(S)} h_t\right)$.
Applying the chain rule:
$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$
16
Vanishing/Exploding gradient problem
From the update equation $h_j = \sigma\!\left(W^{(hh)} h_{j-1} + W^{(hx)} x_j\right)$, by the chain rule:
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \left(W^{(hh)}\right)^{\top} \operatorname{diag}\!\left[\sigma'(\cdot)\right]$
17
Vanishing/Exploding gradient problem
This product of Jacobians can become very small or very large. "Small W" (norm below 1) gives a vanishing gradient, so long time dependencies are lost; "large W" gives an exploding gradient, which is bad for optimization.
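A small numerical illustration of this effect, multiplying random Jacobians of the form $W^{\top}\operatorname{diag}[\sigma']$ over many time steps (the scales and sizes below are arbitrary choices for the sketch):

```python
import numpy as np

def gradient_norm_through_time(scale, steps=50, Dh=100, seed=0):
    """Norm of a product of `steps` random Jacobians ~ W^T diag(sigma') for a W of the given scale."""
    rng = np.random.default_rng(seed)
    W = scale * rng.normal(size=(Dh, Dh)) / np.sqrt(Dh)
    J = np.eye(Dh)
    for _ in range(steps):
        # sigma' is at most 1 for tanh/sigmoid; random values in (0, 1) stand in for it here
        J = (W.T * rng.uniform(0.0, 1.0, size=Dh)) @ J
    return np.linalg.norm(J)

print(gradient_norm_through_time(scale=0.5))   # tiny  -> vanishing gradient ("small W")
print(gradient_norm_through_time(scale=5.0))   # huge  -> exploding gradient ("large W")
```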
18
LSTMs – Long Short-Term Memory
Introduced by Hochreiter and Schmidhuber (1997). LSTMs mitigate vanishing and exploding gradients using gating. Figures taken from Christopher Olah's blog.
19
LSTM equations (note the change of notation: here $h_t$ plays the role of the earlier output $y_t$, and the cell state $C_t$ plays the role of the earlier hidden state $h_t$):
$f_t = \sigma\!\left(W_f \left[h_{t-1}, x_t\right] + b_f\right)$
$i_t = \sigma\!\left(W_i \left[h_{t-1}, x_t\right] + b_i\right)$
$\tilde{C}_t = \tanh\!\left(W_C \left[h_{t-1}, x_t\right] + b_C\right)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$o_t = \sigma\!\left(W_o \left[h_{t-1}, x_t\right] + b_o\right)$
$h_t = o_t \odot \tanh\!\left(C_t\right)$
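A minimal NumPy sketch of one step implementing these equations (random weight placeholders, no peephole connections, all gates share the concatenated input $[h_{t-1}, x_t]$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Dh, d = 64, 32                                  # toy hidden and input sizes
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = [rng.normal(scale=0.1, size=(Dh, Dh + d)) for _ in range(4)]
b_f = b_i = b_c = b_o = np.zeros(Dh)

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM step: returns the new output h_t and memory cell C_t."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)                  # forget gate
    i = sigmoid(W_i @ z + b_i)                  # input gate
    C_tilde = np.tanh(W_c @ z + b_c)            # candidate memory \tilde{C}_t
    C = f * C_prev + i * C_tilde                # additive cell update
    o = sigmoid(W_o @ z + b_o)                  # output gate
    h = o * np.tanh(C)
    return h, C

h, C = np.zeros(Dh), np.zeros(Dh)
h, C = lstm_step(rng.normal(size=d), h, C)      # one step on a random input vector
```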
20
LSTM "forget gate": $f_t = 0$ means forget the old cell content, $f_t = 1$ means keep it.
Examples: at a period ".", forget information tied to the finished sentence; when a new subject appears, forget the old subject's gender.
21
LSTM "input gate layer": decides which values we will update, i.e., what information goes into the new candidate $\tilde{C}_t$.
22
LSTM memory cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. The update is additive, so the dependence on earlier cells is no longer an exponential in the weight matrix and information can flow across many time steps:
unrolling one more step, the contribution of $C_{t-2}$ is $f_t \odot f_{t-1} \odot C_{t-2}$, a product of forget gates rather than of weight matrices.
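As a sketch of why this helps (treating the gates as constants with respect to earlier cell states, which ignores their own dependence on $h_{t-1}$):

```latex
% Gradient flow through the memory cell (gates treated as constants w.r.t. earlier cells):
\[
\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial C_t}{\partial C_k} \approx \prod_{j=k+1}^{t} \operatorname{diag}(f_j)
\]
% A product of gate values in [0, 1] chosen by the network, rather than repeated
% multiplications by the same weight matrix W, so gradients need not vanish or explode.
```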
23
LSTM output: finally, we need to decide what we are going to output. The output gate $o_t = \sigma\!\left(W_o \left[h_{t-1}, x_t\right] + b_o\right)$ selects which parts of the squashed cell state are exposed: $h_t = o_t \odot \tanh(C_t)$.
24
Conclusions: RNNs are very powerful, but hard to train.
Nowadays, gating (LSTMs) is the way to go!
Acknowledgments: Andrej Karpathy ("The Unreasonable Effectiveness of Recurrent Neural Networks"), Richard Socher (Deep Learning for NLP, Stanford), Christopher Olah ("Understanding LSTMs").