Week 3 Presentation Ngoc Ta Aidean Sharghi.

Presentation transcript:

Week 3 Presentation Ngoc Ta Aidean Sharghi

Research Papers
- Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
- End-to-End Dense Video Captioning with Masked Transformer
- Video Captioning via Hierarchical Reinforcement Learning
- M3: Multimodal Memory Modelling for Video Captioning

End-to-End Dense Video Captioning with Masked Transformer
Code available: https://github.com/salesforce/densecap
Datasets: ActivityNet Captions and YouCookII
Method: An end-to-end model composed of an encoder and two decoders (a proposal decoder and a captioning decoder), using self-attention for dense video captioning; self-attention enables an efficient non-recurrent structure during encoding.
1. The encoder encodes the input video into proper visual representations.
2. The proposal decoder then decodes from this representation with different anchors to form video event proposals.
3. The captioning decoder employs a differentiable masking network to restrict its attention to the proposal event, which ensures consistency between the proposal and the captioning during training.
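As a rough illustration of attention restricted to a proposal window, the minimal sketch below applies a hard binary mask inside scaled dot-product attention. The paper's masking network is differentiable; the hard mask, function names, and tensor shapes here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: attention of caption-token queries over video frames,
# restricted to frames inside an event proposal [start, end).
import torch
import torch.nn.functional as F

def proposal_masked_attention(queries, keys, values, start, end):
    """queries: (T_txt, d); keys, values: (T_vid, d)."""
    d = queries.size(-1)
    scores = queries @ keys.t() / d ** 0.5           # (T_txt, T_vid)
    mask = torch.zeros(keys.size(0), dtype=torch.bool)
    mask[start:end] = True                           # frames inside the proposal
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)                 # attends only to proposal frames
    return attn @ values                             # (T_txt, d)

# Example: 8 caption tokens attending to a 50-frame video,
# restricted to the proposal covering frames 10..30.
q, k, v = torch.randn(8, 64), torch.randn(50, 64), torch.randn(50, 64)
context = proposal_masked_attention(q, k, v, start=10, end=30)
print(context.shape)  # torch.Size([8, 64])
```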

Video Captioning via Hierarchical Reinforcement Learning
Code: N/A
Datasets: MSR-VTT and Charades
Method: Aims at improving the fine-grained generation of video descriptions with rich activities, and gains larger improvements on detailed descriptions of longer videos.
1. Video frame features are first extracted by a pretrained CNN model.
2. The features are passed through a low-level Bi-LSTM encoder and a high-level LSTM encoder to obtain low-level and high-level encoder outputs.
3. The HRL module works as the decoder and is composed of three components: a low-level worker, a high-level manager, and an internal critic.
4. The whole pipeline terminates once an end-of-sentence token is reached.
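The two-level encoding in steps 1-2 can be sketched as follows. This is a minimal sketch under assumed dimensions and a fixed subsampling stride for the high-level encoder, not the paper's exact configuration or the HRL decoder itself.

```python
# Minimal sketch: low-level Bi-LSTM over per-frame CNN features, then a
# high-level LSTM that reads every `stride`-th low-level state.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, stride=8):
        super().__init__()
        self.low = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.high = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.stride = stride

    def forward(self, frame_feats):              # (B, T, feat_dim) from a pretrained CNN
        low_out, _ = self.low(frame_feats)       # (B, T, 2*hidden) low-level context
        high_in = low_out[:, ::self.stride, :]   # subsample for the high-level encoder
        high_out, _ = self.high(high_in)         # (B, T//stride, hidden)
        return low_out, high_out                 # consumed by the worker and manager

enc = HierarchicalEncoder()
low, high = enc(torch.randn(2, 64, 2048))        # 2 videos, 64 frames each
print(low.shape, high.shape)
```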

M3: Multimodal Memory Modelling for Video Captioning
Code: N/A
Datasets: MSVD and MSR-VTT
Method: M3 contains three components:
- A CNN-based video encoder, which first extracts video frame/clip features using pretrained 2D/3D CNNs.
- A multimodal memory, which contains a memory matrix Mem that interacts with the video and the sentence.
- An LSTM-based text decoder, which models sentence generation and then writes the updated representation back to the memory.
Procedure:
1. Write hidden states to update the memory.
2. Read the updated memory content to perform soft attention.
3. Write the selected visual information to update the memory again.
4. Read the updated memory content for next-word prediction.
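The read/write cycle above can be illustrated with a toy memory. The slot count, the soft-attention read, and the simple gated write below are illustrative assumptions, not the paper's actual update equations.

```python
# Minimal sketch of a soft-attention read and a blended write over a memory matrix.
import torch
import torch.nn.functional as F

def memory_read(mem, query):
    """mem: (S, d) slots, query: (d,) -> (d,) attended content."""
    weights = F.softmax(mem @ query / mem.size(-1) ** 0.5, dim=0)   # (S,)
    return weights @ mem                                            # weighted slot average

def memory_write(mem, content, gate=0.5):
    """Blend new content into every slot; a real model would learn per-slot gates."""
    return (1 - gate) * mem + gate * content

mem = torch.randn(16, 256)            # memory matrix Mem with 16 slots
hidden = torch.randn(256)             # decoder hidden state
mem = memory_write(mem, hidden)       # step 1: write hidden state
visual = memory_read(mem, hidden)     # step 2: read for soft attention
mem = memory_write(mem, visual)       # step 3: write selected visual information
context = memory_read(mem, hidden)    # step 4: read for next-word prediction
```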

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
Code available: https://github.com/JaywongWang/DenseVideoCaptioning
Dataset: ActivityNet Captions (20k untrimmed YouTube videos from real life, 120 seconds long on average)
Goal: to automatically localize events in a video and describe each one with a sentence.
Method: Uses a novel bidirectional proposal framework (Bidirectional SST) to encode both past and future contexts, motivated by the observation that both past and future contexts help better localize the current event.

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
In the forward pass, k binary classifiers corresponding to k anchors are learned densely at each time step. In the backward pass, the video sequence input is reversed and proposals are predicted backward. The forward pass therefore encodes past context together with current event information, while the backward pass encodes future context together with current event information. Finally, the proposal scores for the same predictions from the two passes are merged to output the final proposals.
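A toy version of the score merging might look like the sketch below, where per-anchor scores from the reversed pass are re-aligned to forward time and fused with the forward scores. The product fusion, the threshold, and k are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: merge forward- and backward-pass anchor scores into final proposals.
import numpy as np

T, k = 100, 4                          # time steps, anchors per step
fwd = np.random.rand(T, k)             # forward-pass scores (past + current context)
bwd = np.random.rand(T, k)[::-1]       # backward-pass scores, re-aligned to forward time
merged = fwd * bwd                     # fuse the two confidence estimates
keep = np.argwhere(merged > 0.5)       # (time step, anchor) pairs kept as final proposals
print(keep[:5])
```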

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
To construct a more discriminative proposal representation, the proposal state information and the detected video content are fused together, which helps discriminate highly overlapped events. To output more confident results, a joint ranking technique selects high-confidence proposal-caption pairs by taking both the proposal score and the caption confidence into consideration.
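One simple way to realize such a joint ranking is sketched below: each proposal's score is combined with a caption confidence derived from its per-word log-probabilities, and the top pairs are kept. The combination weight and the confidence definition are illustrative assumptions.

```python
# Minimal sketch: rank proposal-caption pairs by a weighted combination of
# proposal score and caption confidence.
import numpy as np

def joint_rank(proposal_scores, caption_logprobs, alpha=0.5, top_n=10):
    """proposal_scores: (N,) in [0,1]; caption_logprobs: list of per-word log-probs."""
    caption_conf = np.array([np.exp(np.mean(lp)) for lp in caption_logprobs])
    joint = alpha * np.asarray(proposal_scores) + (1 - alpha) * caption_conf
    return np.argsort(-joint)[:top_n]          # indices of the highest-confidence pairs

order = joint_rank([0.9, 0.4, 0.7],
                   [[-0.2, -0.3], [-1.5, -2.0], [-0.4, -0.6]])
print(order)                                   # e.g. [0 2 1]
```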

Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning

THANK YOU