Attention-based Caption Description 2015.6.30 Mun Jonghwan.

Caption generation

Caption generation INPUT: an image. OUTPUT: "A man skiing down the snow covered mountain with a dark sky in the background." This problem requires identifying and detecting objects, scenes, people, etc.; reasoning about the relationships, properties, and activities of objects; and combining several sources of information into a coherent sentence.

Contents
Encoder-Decoder
Attention-based caption generation
Discussion
Future plan

Encoder-Decoder (E-D) Encode an image into a representation; decode the representation into a caption, inferring the words of the caption step by step.

Decoder with LSTM
[Figure: a CNN encodes the image and initializes an unrolled LSTM decoder. At step 1 the LSTM takes the start token x_1 and predicts the first word y_1 = "A" with probability P(y_1 | img); at each later step t it takes the previously generated word x_t and predicts y_t with probability P(y_t | y_{t-1}, ..., y_1, img), i.e. start -> "A" -> "group" -> "of" -> ... -> "market" -> END, yielding "A group of people shopping at an outdoor market". The example image's reference captions are "A group of people shopping at an outdoor market" and "There are many vegetables at the fruit stand". The LSTM shares its parameters across all steps.]
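As a concrete illustration, a minimal Python sketch of this greedy decoding loop; cnn_encode, lstm_step, embed, W_out, and vocab are hypothetical placeholders, not the authors' code.

import numpy as np

def greedy_decode(image, cnn_encode, lstm_step, embed, W_out, vocab, max_len=20):
    # The CNN representation initializes the LSTM state (m: hidden, c: cell).
    m, c = cnn_encode(image)
    word = vocab["<START>"]
    caption = []
    for _ in range(max_len):
        x = embed[word]                # embedding of the previously emitted word
        m, c = lstm_step(x, m, c)      # one step of the shared-parameter LSTM
        logits = W_out @ m             # project the hidden state onto the vocabulary
        word = int(np.argmax(logits))  # greedy choice of the most probable next word
        if word == vocab["<END>"]:
            break
        caption.append(word)
    return caption                     # word ids of the generated caption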

LSTM Based on the previous state and word, predict the next word:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m,n} \begin{pmatrix} E x_t \\ m_{t-1} \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad m_t = o_t \odot \tanh(c_t)$$
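A numpy sketch of one such LSTM step, expanding the lstm_step used in the decoding loop above with the weight matrix passed explicitly (shapes and parameter names are assumptions, not the paper's code; bias terms omitted).

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(Ex_t, m_prev, c_prev, T):
    # Ex_t: embedded previous word, shape (D,); m_prev, c_prev: previous
    # hidden and cell state, shape (n,); T: affine map of shape (4n, D + n)
    # playing the role of T_{D+m,n} on the slide.
    n = m_prev.shape[0]
    u = T @ np.concatenate([Ex_t, m_prev])
    i = sigmoid(u[:n])          # input gate i_t
    f = sigmoid(u[n:2*n])       # forget gate f_t
    o = sigmoid(u[2*n:3*n])     # output gate o_t
    g = np.tanh(u[3*n:])        # candidate g_t
    c = f * c_prev + i * g      # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    m = o * np.tanh(c)          # m_t = o_t ⊙ tanh(c_t)
    return m, c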

Limitation of E-D The E-D must compress all the necessary information about a whole image into a single representation, which makes it difficult to capture the details of the image and difficult to describe compositionally novel images.

Attention-based E-D (A-E-D)
Encoder: encode an image into several representations (how to encode the image?)
Decoder: predict each word based on the previous state, the previous word, and a relevant context (how to compute the context?)
[Figure: unrolled LSTM generating "A group of men playing Frisbee in the park" one word per step, from START to END.]

How to encode the image? Annotation vectors a_i are taken from the 4th convolutional layer of the Oxford VGGnet (19 layers); each annotation corresponds to a sub-region of the image.
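To make the shapes concrete, a sketch of what the annotations are; the 14x14x512 feature-map size follows the "Show, Attend and Tell" paper and is an assumption, since the slide gives no exact dimensions.

import numpy as np

# One image's conv-layer output: a spatial grid of feature vectors
# (14x14x512 assumed; the slide only says "4th conv layer of VGG-19").
feature_map = np.random.randn(14, 14, 512)

# Flattening the grid gives L = 196 annotation vectors a_i of dimension
# D = 512, each describing one sub-region of the image.
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)   # rows are a_1, ..., a_L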

How to compute context? Compute the weight of each annotation for the next word based on the previous state:

$$e_{ti} = f_{att}(a_i, m_{t-1}), \qquad f_{att}(a_i, m_{t-1}) = U_{att} \cdot \tanh(V \cdot a_i + W \cdot m_{t-1})$$

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

The context is the weighted sum of the annotations: $z_t = \sum_i \alpha_{ti} a_i$. Feeding the context into the LSTM extends its input, turning

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m,n} \begin{pmatrix} E x_t \\ m_{t-1} \end{pmatrix} \quad \text{into} \quad \begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,n} \begin{pmatrix} E x_t \\ m_{t-1} \\ z_t \end{pmatrix}$$
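A direct numpy transcription of these attention equations (a sketch; parameter shapes are assumptions).

import numpy as np

def softmax(e):
    e = e - e.max()              # subtract max for numerical stability
    w = np.exp(e)
    return w / w.sum()

def compute_context(annotations, m_prev, U_att, V, W):
    # annotations: (L, D) rows a_i; m_prev: previous LSTM state (n,);
    # V: (k, D), W: (k, n), U_att: (k,) are the attention parameters.
    # e_ti = U_att . tanh(V a_i + W m_{t-1}), one score per sub-region.
    e = np.array([U_att @ np.tanh(V @ a + W @ m_prev) for a in annotations])
    alpha = softmax(e)           # alpha_ti: weights over the L sub-regions
    z = alpha @ annotations      # z_t = sum_i alpha_ti a_i
    return z, alpha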

Reproducing attention-based E-D
Only basic tokenization (vocabulary size 31,572)
Early stopping based on the BLEU-1 score
Center-cropped 224x224 image -> just resized
Train (82,782) / Validation (40,504) from COCO, with 5,000 as validation and 40,504 as test

BLEU-1: leaderboard 68.9, paper 70.7, reproduction 65.5

Discussion Annotations from the 4th convolutional layer give a low-level representation (rather than a high-level one), so captions are generated with general or common words. Example: generated "a stop sign on the side of a road" vs. ground truth "Stop sign at the intersection of two rather rural roads".

Discussion Adjacent words attend to similar annotations: "Two giraffes standing next to each other on a field."

Discussion Adjacent words attend to similar annotations: "People sitting at a table with plate of food."

Discussion Adjacent words attend to similar annotations, and representative sub-regions are attended. Vocabulary (31,572 -> 1,209): only a small number of words are ever attended. The context is the weighted sum of the annotations. Example: "a herd of sheep".

Future Plan Attention-based E-D + Visual Concept
[Figure: a query image (565x565) is convolved into an output map (12x12x1000) and an output vector (1x1x1000); the top-K concept words (dog, man, sitting, ...) and a saliency region feed the LSTM decoder, e.g. "dog sitting".]

Thank you

Encoder-Decoder Pros: caption length is unbounded. Cons: the whole image must be compressed into a single representation (see "Limitation of E-D" above).
[Figure: whole flow of the Encoder-Decoder: conv/fc CNN features feed an RNN that emits START -> "A" -> "group" -> "of" -> ... -> "market" -> END, i.e. "A group of people shopping at an outdoor market".]