Week 3 Presentation Ngoc Ta Aidean Sharghi.

Presentation transcript:

Week 3 Presentation Ngoc Ta Aidean Sharghi

Research Papers
- Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
- End-to-End Dense Video Captioning with Masked Transformer
- Video Captioning via Hierarchical Reinforcement Learning
- M3: Multimodal Memory Modelling for Video Captioning

End-to-End Dense Video Captioning with Masked Transformer
Code available: https://github.com/salesforce/densecap
Datasets: ActivityNet Captions and YouCookII
Method: An end-to-end model composed of an encoder and two decoders (a proposal decoder and a captioning decoder), using self-attention for dense video captioning; self-attention enables an efficient non-recurrent structure during encoding.
1. The encoder encodes the input video into proper visual representations.
2. The proposal decoder then decodes from this representation with different anchors to form video event proposals.
3. The captioning decoder employs a differentiable masking network to restrict its attention to the proposal event, which ensures consistency between the proposal and the captioning during training.
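As a rough illustration of attention restricted to a proposal window, the minimal sketch below applies a hard binary mask inside scaled dot-product attention. The paper's masking network is differentiable; the hard mask, function names, and tensor shapes here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: attention of caption-token queries over video frames,
# restricted to frames inside an event proposal [start, end).
import torch
import torch.nn.functional as F

def proposal_masked_attention(queries, keys, values, start, end):
    """queries: (T_txt, d); keys, values: (T_vid, d)."""
    d = queries.size(-1)
    scores = queries @ keys.t() / d ** 0.5           # (T_txt, T_vid)
    mask = torch.zeros(keys.size(0), dtype=torch.bool)
    mask[start:end] = True                           # frames inside the proposal
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)                 # attends only to proposal frames
    return attn @ values                             # (T_txt, d)

# Example: 8 caption tokens attending to a 50-frame video,
# restricted to the proposal covering frames 10..30.
q, k, v = torch.randn(8, 64), torch.randn(50, 64), torch.randn(50, 64)
context = proposal_masked_attention(q, k, v, start=10, end=30)
print(context.shape)  # torch.Size([8, 64])
```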

Video Captioning via Hierarchical Reinforcement Learning
Code: N/A
Datasets: MSR-VTT and Charades
Method: Aims at improving the fine-grained generation of video descriptions with rich activities, and gains larger improvements on detailed descriptions of longer videos.
1. Video frame features are first extracted by a pretrained CNN model.
2. The features are passed through a low-level Bi-LSTM encoder and a high-level LSTM encoder to obtain low-level and high-level encoder outputs.
3. The HRL module works as the decoder and is composed of three components: a low-level worker, a high-level manager, and an internal critic.
4. The whole pipeline terminates once an end-of-sentence token is reached.
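The two-level encoding in steps 1-2 can be sketched as follows. This is a minimal sketch under assumed dimensions and a fixed subsampling stride for the high-level encoder, not the paper's exact configuration or the HRL decoder itself.

```python
# Minimal sketch: low-level Bi-LSTM over per-frame CNN features, then a
# high-level LSTM that reads every `stride`-th low-level state.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, stride=8):
        super().__init__()
        self.low = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.high = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.stride = stride

    def forward(self, frame_feats):              # (B, T, feat_dim) from a pretrained CNN
        low_out, _ = self.low(frame_feats)       # (B, T, 2*hidden) low-level context
        high_in = low_out[:, ::self.stride, :]   # subsample for the high-level encoder
        high_out, _ = self.high(high_in)         # (B, T//stride, hidden)
        return low_out, high_out                 # consumed by the worker and manager

enc = HierarchicalEncoder()
low, high = enc(torch.randn(2, 64, 2048))        # 2 videos, 64 frames each
print(low.shape, high.shape)
```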

M3: Multimodal Memory Modelling for Video Captioning
Code: N/A
Datasets: MSVD and MSR-VTT
Method: M3 contains three components:
- A CNN-based video encoder, which first extracts video frame/clip features using pretrained 2D/3D CNNs.
- A multimodal memory, which contains a memory matrix Mem that interacts with the video and the sentence.
- An LSTM-based text decoder, which models sentence generation and then writes the updated representation back to the memory.
Procedure:
1. Write hidden states to update the memory.
2. Read the updated memory content to perform soft attention.
3. Write the selected visual information to update the memory again.
4. Read the updated memory content for next-word prediction.
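The read/write cycle above can be illustrated with a toy memory. The slot count, the soft-attention read, and the simple gated write below are illustrative assumptions, not the paper's actual update equations.

```python
# Minimal sketch of a soft-attention read and a blended write over a memory matrix.
import torch
import torch.nn.functional as F

def memory_read(mem, query):
    """mem: (S, d) slots, query: (d,) -> (d,) attended content."""
    weights = F.softmax(mem @ query / mem.size(-1) ** 0.5, dim=0)   # (S,)
    return weights @ mem                                            # weighted slot average

def memory_write(mem, content, gate=0.5):
    """Blend new content into every slot; a real model would learn per-slot gates."""
    return (1 - gate) * mem + gate * content

mem = torch.randn(16, 256)            # memory matrix Mem with 16 slots
hidden = torch.randn(256)             # decoder hidden state
mem = memory_write(mem, hidden)       # step 1: write hidden state
visual = memory_read(mem, hidden)     # step 2: read for soft attention
mem = memory_write(mem, visual)       # step 3: write selected visual information
context = memory_read(mem, hidden)    # step 4: read for next-word prediction
```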

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
Code available: https://github.com/JaywongWang/DenseVideoCaptioning
Dataset: ActivityNet Captions (20k untrimmed YouTube videos from real life, 120 seconds long on average)
Goal: to automatically localize events in a video and describe each one with a sentence.
Method: Uses a novel bidirectional proposal framework (Bidirectional SST) to encode both past and future contexts, motivated by the observation that both past and future contexts help better localize the current event.

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
In the forward pass, k binary classifiers corresponding to k anchors are learned densely at each time step. In the backward pass, the video sequence input is reversed and proposals are predicted backward. The forward pass therefore encodes past context together with current event information, while the backward pass encodes future context together with current event information. Finally, the proposal scores for the same predictions from the two passes are merged to output the final proposals.
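A toy version of the score merging might look like the sketch below, where per-anchor scores from the reversed pass are re-aligned to forward time and fused with the forward scores. The product fusion, the threshold, and k are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: merge forward- and backward-pass anchor scores into final proposals.
import numpy as np

T, k = 100, 4                          # time steps, anchors per step
fwd = np.random.rand(T, k)             # forward-pass scores (past + current context)
bwd = np.random.rand(T, k)[::-1]       # backward-pass scores, re-aligned to forward time
merged = fwd * bwd                     # fuse the two confidence estimates
keep = np.argwhere(merged > 0.5)       # (time step, anchor) pairs kept as final proposals
print(keep[:5])
```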

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
To construct a more discriminative proposal representation, the proposal state information and the detected video content are fused together, which helps discriminate highly overlapped events. To output more confident results, a joint ranking technique selects high-confidence proposal-caption pairs by taking both the proposal score and the caption confidence into consideration.
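One simple way to realize such a joint ranking is sketched below: each proposal's score is combined with a caption confidence derived from its per-word log-probabilities, and the top pairs are kept. The combination weight and the confidence definition are illustrative assumptions.

```python
# Minimal sketch: rank proposal-caption pairs by a weighted combination of
# proposal score and caption confidence.
import numpy as np

def joint_rank(proposal_scores, caption_logprobs, alpha=0.5, top_n=10):
    """proposal_scores: (N,) in [0,1]; caption_logprobs: list of per-word log-probs."""
    caption_conf = np.array([np.exp(np.mean(lp)) for lp in caption_logprobs])
    joint = alpha * np.asarray(proposal_scores) + (1 - alpha) * caption_conf
    return np.argsort(-joint)[:top_n]          # indices of the highest-confidence pairs

order = joint_rank([0.9, 0.4, 0.7],
                   [[-0.2, -0.3], [-1.5, -2.0], [-0.4, -0.6]])
print(order)                                   # e.g. [0 2 1]
```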

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning

THANK YOU