Attention-based Caption Description 2015.6.30 Mun Jonghwan.

Caption generation

Caption generation INPUT: an image. OUTPUT: "A man skiing down the snow covered mountain with a dark sky in the background." This problem requires identifying and detecting objects, scenes, people, etc.; reasoning about the relationships, properties, and activities of objects; and combining several sources of information into a coherent sentence.

Contents
Encoder-Decoder
Attention-based caption generation
Discussion
Future plan

Encoder-Decoder (E-D) Encode an image into a representation; decode the representation into a caption, inferring the words of the caption step by step.

Decoder with LSTM
[Figure: a CNN encodes the image and initializes an unrolled LSTM decoder. At step 1 the LSTM takes the start token x_1 and predicts the first word y_1 = "A" with probability P(y_1 | img); at each later step t it takes the previously generated word x_t and predicts y_t with probability P(y_t | y_{t-1}, ..., y_1, img), i.e. start -> "A" -> "group" -> "of" -> ... -> "market" -> END, yielding "A group of people shopping at an outdoor market". The example image's reference captions are "A group of people shopping at an outdoor market" and "There are many vegetables at the fruit stand". The LSTM shares its parameters across all steps.]
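As a concrete illustration, a minimal Python sketch of this greedy decoding loop; cnn_encode, lstm_step, embed, W_out, and vocab are hypothetical placeholders, not the authors' code.

import numpy as np

def greedy_decode(image, cnn_encode, lstm_step, embed, W_out, vocab, max_len=20):
    # The CNN representation initializes the LSTM state (m: hidden, c: cell).
    m, c = cnn_encode(image)
    word = vocab["<START>"]
    caption = []
    for _ in range(max_len):
        x = embed[word]                # embedding of the previously emitted word
        m, c = lstm_step(x, m, c)      # one step of the shared-parameter LSTM
        logits = W_out @ m             # project the hidden state onto the vocabulary
        word = int(np.argmax(logits))  # greedy choice of the most probable next word
        if word == vocab["<END>"]:
            break
        caption.append(word)
    return caption                     # word ids of the generated caption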

LSTM Based on the previous state and word, predict the next word:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m,n} \begin{pmatrix} E x_t \\ m_{t-1} \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad m_t = o_t \odot \tanh(c_t)$$
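A numpy sketch of one such LSTM step, expanding the lstm_step used in the decoding loop above with the weight matrix passed explicitly (shapes and parameter names are assumptions, not the paper's code; bias terms omitted).

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(Ex_t, m_prev, c_prev, T):
    # Ex_t: embedded previous word, shape (D,); m_prev, c_prev: previous
    # hidden and cell state, shape (n,); T: affine map of shape (4n, D + n)
    # playing the role of T_{D+m,n} on the slide.
    n = m_prev.shape[0]
    u = T @ np.concatenate([Ex_t, m_prev])
    i = sigmoid(u[:n])          # input gate i_t
    f = sigmoid(u[n:2*n])       # forget gate f_t
    o = sigmoid(u[2*n:3*n])     # output gate o_t
    g = np.tanh(u[3*n:])        # candidate g_t
    c = f * c_prev + i * g      # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    m = o * np.tanh(c)          # m_t = o_t ⊙ tanh(c_t)
    return m, c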

Limitation of E-D The E-D must compress all the necessary information about a whole image into a single representation, which makes it difficult to capture the details of the image and difficult to describe compositionally novel images.

Attention-based E-D (A-E-D)
Encoder: encode an image into several representations (how to encode the image?)
Decoder: predict each word based on the previous state, the previous word, and a relevant context (how to compute the context?)
[Figure: unrolled LSTM generating "A group of men playing Frisbee in the park" one word per step, from START to END.]

How to encode the image? Annotation vectors a_i are taken from the 4th convolutional layer of the Oxford VGGnet (19 layers); each annotation corresponds to a sub-region of the image.
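To make the shapes concrete, a sketch of what the annotations are; the 14x14x512 feature-map size follows the "Show, Attend and Tell" paper and is an assumption, since the slide gives no exact dimensions.

import numpy as np

# One image's conv-layer output: a spatial grid of feature vectors
# (14x14x512 assumed; the slide only says "4th conv layer of VGG-19").
feature_map = np.random.randn(14, 14, 512)

# Flattening the grid gives L = 196 annotation vectors a_i of dimension
# D = 512, each describing one sub-region of the image.
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)   # rows are a_1, ..., a_L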

How to compute context? Compute the weight of each annotation for the next word based on the previous state:

$$e_{ti} = f_{att}(a_i, m_{t-1}), \qquad f_{att}(a_i, m_{t-1}) = U_{att} \cdot \tanh(V \cdot a_i + W \cdot m_{t-1})$$

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

The context is the weighted sum of the annotations: $z_t = \sum_i \alpha_{ti} a_i$. Feeding the context into the LSTM extends its input, turning

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m,n} \begin{pmatrix} E x_t \\ m_{t-1} \end{pmatrix} \quad \text{into} \quad \begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,n} \begin{pmatrix} E x_t \\ m_{t-1} \\ z_t \end{pmatrix}$$
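A direct numpy transcription of these attention equations (a sketch; parameter shapes are assumptions).

import numpy as np

def softmax(e):
    e = e - e.max()              # subtract max for numerical stability
    w = np.exp(e)
    return w / w.sum()

def compute_context(annotations, m_prev, U_att, V, W):
    # annotations: (L, D) rows a_i; m_prev: previous LSTM state (n,);
    # V: (k, D), W: (k, n), U_att: (k,) are the attention parameters.
    # e_ti = U_att . tanh(V a_i + W m_{t-1}), one score per sub-region.
    e = np.array([U_att @ np.tanh(V @ a + W @ m_prev) for a in annotations])
    alpha = softmax(e)           # alpha_ti: weights over the L sub-regions
    z = alpha @ annotations      # z_t = sum_i alpha_ti a_i
    return z, alpha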

Reproducing attention-based E-D
Only basic tokenization (vocabulary size 31,572)
Early stopping based on the BLEU-1 score
Center-cropped 224x224 image -> just resized
Train (82,782) / Validation (40,504) from COCO, with 5,000 as validation and 40,504 as test

BLEU-1: leaderboard 68.9, paper 70.7, reproduction 65.5

Discussion Annotations from the 4th convolutional layer give a low-level representation (rather than a high-level one), so captions are generated with general or common words. Example: generated "a stop sign on the side of a road" vs. ground truth "Stop sign at the intersection of two rather rural roads".

Discussion Adjacent words attend to similar annotations: "Two giraffes standing next to each other on a field."

Discussion Adjacent words attend to similar annotations: "People sitting at a table with plate of food."

Discussion Adjacent words attend to similar annotations, and representative sub-regions are attended. Vocabulary (31,572 -> 1,209): only a small number of words are ever attended. The context is the weighted sum of the annotations. Example: "a herd of sheep".

Future Plan Attention-based E-D + Visual Concept
[Figure: a query image (565x565) is convolved into an output map (12x12x1000) and an output vector (1x1x1000); the top-K concept words (dog, man, sitting, ...) and a saliency region feed the LSTM decoder, e.g. "dog sitting".]

Thank you

Encoder-Decoder Pros: caption length is unbounded. Cons: the whole image must be compressed into a single representation (see "Limitation of E-D" above).
[Figure: whole flow of the Encoder-Decoder: conv/fc CNN features feed an RNN that emits START -> "A" -> "group" -> "of" -> ... -> "market" -> END, i.e. "A group of people shopping at an outdoor market".]