1
Attention-based Caption Generation
Mun Jonghwan
2
Caption generation
3
Caption generation
INPUT: an image. OUTPUT: "A man skiing down the snow covered mountain with a dark sky in the background."
This problem requires:
- identifying and detecting objects, scenes, people, etc.
- reasoning about relationships, properties, and activities of objects
- combining several sources of information into a coherent sentence
4
Contents
- Encoder-Decoder
- Attention-based caption generation
- Discussion
- Future plan
5
Encoder-Decoder (E-D)
- Encode an image into a representation
- Decode the representation into a caption, inferring the words of the caption step by step
6
Decoder with LSTM
[Figure: a CNN encodes the image; at step 1 the LSTM takes the START token x_1 and predicts the first word "A" from P(y_1 | img). Example captions: "A group of people shopping at an outdoor market." / "There are many vegetables at the fruit stand."]
7
Decoder with LSTM
[Figure: at step 2 the LSTM takes the previous word "A" as input x_2 and predicts "group" from P(y_2 | y_1, img).]
8
Decoder with LSTM
[Figure: at step 3 the LSTM takes "group" as input x_3 and predicts "of" from P(y_3 | y_2, y_1, img).]
9
Decoder with LSTM
[Figure: the loop continues until the LSTM emits END after "market", yielding the full caption "A group of people shopping at an outdoor market".]
The LSTM shares its parameters across all time steps.
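The slides above unroll this loop pictorially. Below is a minimal PyTorch sketch of the same greedy decoding; all sizes, the model weights, and the names (Decoder, greedy_decode) are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the slides do not specify them.
VOCAB, EMBED, HIDDEN, FEAT = 10000, 512, 512, 4096

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.init_h = nn.Linear(FEAT, HIDDEN)  # map the CNN feature to the initial state
        self.init_c = nn.Linear(FEAT, HIDDEN)
        self.lstm = nn.LSTMCell(EMBED, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)    # logits over the vocabulary

    def greedy_decode(self, cnn_feat, start_id, end_id, max_len=20):
        h, c = self.init_h(cnn_feat), self.init_c(cnn_feat)
        word = torch.tensor([start_id])
        caption = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))  # one unrolled step
            word = self.out(h).argmax(dim=-1)           # y_t = argmax P(y_t | y_<t, img)
            if word.item() == end_id:                   # stop at END, as on the slide
                break
            caption.append(word.item())
        return caption
```

In practice the first input is the START token and every later input is the previously predicted word, exactly as the unrolled diagrams show.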
10
LSTM
Based on the previous state and word, predict the next word:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m,n} \begin{pmatrix} E x_t \\ m_{t-1} \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$m_t = o_t \odot \tanh(c_t)$$
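A minimal NumPy sketch of these gate equations, assuming the bias is folded into T and the embedding is D-dimensional while the memory m is n-dimensional; all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(T, E, x_t, m_prev, c_prev, n):
    """One LSTM step following the slide's equations.
    T: (4n, D+n) stacked affine map (slide's T_{D+m,n}, bias folded in);
    E: embedding matrix; x_t: one-hot word vector."""
    u = T @ np.concatenate([E @ x_t, m_prev])   # stacked gate pre-activations
    i, f, o = (sigmoid(u[k*n:(k+1)*n]) for k in range(3))
    g = np.tanh(u[3*n:4*n])
    c_t = f * c_prev + i * g                    # c_t = f ⊙ c_{t-1} + i ⊙ g
    m_t = o * np.tanh(c_t)                      # m_t = o ⊙ tanh(c_t)
    return m_t, c_t
```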
11
Limitation of E-D
E-D must compress all the necessary information of the whole image into a single representation:
- difficult to capture details of the image
- difficult to describe compositionally novel images
12
Attention-based E-D (A-E-D)
- Encoder: encode an image into several representations. How should the image be encoded?
- Decoder: predict each word based on the previous state, the previous word, and a relevant context. How should the context be computed?
[Figure: unrolled LSTM decoder generating "A group of men playing Frisbee in the park" word by word, from START to END.]
13
How to encode the image?
- Use features from the 4th convolutional layer of the Oxford VGGNet (19 layers)
- Each annotation vector a_i corresponds to a sub-region of the image
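A sketch of extracting such annotation vectors with torchvision's VGG19; the exact cut index into vgg.features is an assumption, chosen so the output is a 14x14 map of 512-dimensional vectors (L = 196 sub-regions).

```python
import torch
from torchvision import models

# Pretrained VGG19 in eval mode; we only use its convolutional trunk.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
conv_features = vgg.features[:36]  # truncate before the final max-pool (assumed cut point)

img = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed 224x224 image
with torch.no_grad():
    fmap = conv_features(img)      # (1, 512, 14, 14)

# Flatten the spatial grid: one annotation vector a_i per sub-region.
annotations = fmap.flatten(2).transpose(1, 2)  # (1, L=196, 512)
```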
14
How to compute the context?
Compute the weight of each annotation for the next word based on the previous state:

$$e_{ti} = f_{att}(a_i, m_{t-1}), \qquad f_{att}(a_i, m_{t-1}) = U_{att} \cdot \tanh(V a_i + W m_{t-1})$$

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

The context is the weighted sum of the annotations:

$$z_t = \sum_i \alpha_{ti} a_i$$

The LSTM input is then extended with the context, turning $T_{D+m,n}$ into $T_{D+m+n,n}$:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,n} \begin{pmatrix} E x_t \\ m_{t-1} \\ z_t \end{pmatrix}$$
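A minimal NumPy sketch of this soft-attention context, under assumed shapes (A stacks the L annotation vectors row-wise; V, W, U_att are the learned projections from the slide); the function name is illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attention_context(A, m_prev, V, W, U_att):
    """A: (L, D) annotations a_i; m_prev: (n,) previous LSTM state;
    V: (k, D), W: (k, n), U_att: (k,) learned parameters."""
    e = np.tanh(A @ V.T + m_prev @ W.T) @ U_att  # e_ti = U_att · tanh(V a_i + W m_{t-1})
    alpha = softmax(e)                            # α_ti over the L sub-regions
    z_t = alpha @ A                               # z_t = Σ_i α_ti a_i
    return z_t, alpha
```

The returned alpha is exactly the per-region attention map visualized in the discussion slides below.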
15
Reproducing attention-based E-D
- Only basic tokenization (vocabulary size 31,572)
- Early stopping based on BLEU-1 score
- Instead of center-cropping to 224x224, images are just resized
- Train (82,782) / Validation (40,504) splits from COCO; 5,000 images as validation and 40,504 as test

          leaderboard   paper   reproduction
BLEU-1    68.9          70.7    65.5
16
Discussion: annotations from the 4th convolutional layer
- This is a low-level representation (compared with the high-level representation of later layers), so captions are generated with general or common words
- Example: generated "a stop sign on the side of a road" vs. ground truth "Stop sign at the intersection of two rather rural roads"
17
Discussion: adjacent words attend to similar annotations
[Figure: per-word attention maps for "Two giraffes standing next to each other on a field".]
18
Discussion: adjacent words attend to similar annotations
[Figure: per-word attention maps for "People sitting at a table with plate of food".]
19
Discussion: adjacent words attend to similar annotations
- Representative sub-regions are attended
- Only a small number of distinct words is actually attended: the effective vocabulary shrinks from 31,572 to 1,209
- The context is a weighted sum of the annotations
[Figure: attention maps for "a herd of sheep".]
20
Future Plan: attention-based E-D + Visual Concepts
[Figure: a query image (565x565) passes through convolutions to an output map (12x12x1000) and an output vector (1x1x1000); the top-K detected words (dog, man, sitting, ...) and their saliency regions feed the LSTM decoder.]
21
Thank you
22
Encoder-Decoder: Pros and Cons
- Pro: caption length is unbounded (RNN decoder)
[Figure: whole flow of the Encoder-Decoder; conv and fc layers encode the image, and the RNN generates "A group of people shopping at an outdoor market" from START to END.]