Visual Question Answering

Presentation transcript:

Visual Question Answering
Aaron Honculada, Aisha Urooj
Dr. Mubarak Shah, Dr. Niels Lobo

TVQA Dataset
  460 hours of video
  152,545 question-answer pairs
  21,793 clips (60-90 sec each)
  Multimodal
  Compositionality
  Video-QA
  Associated natural language (subtitles)
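As a quick sanity check on the dataset figures above, the averages they imply can be worked out directly. This is only a back-of-the-envelope sketch: the three constants are the slide's numbers, the function name is made up for illustration, and the 460-hour total is presumably rounded, so the results are approximate.

```python
# Figures as given on the TVQA Dataset slide.
TOTAL_HOURS = 460
NUM_QA_PAIRS = 152_545
NUM_CLIPS = 21_793


def dataset_averages(total_hours, num_qa_pairs, num_clips):
    """Return (average clip length in seconds, average questions per clip)."""
    avg_clip_seconds = total_hours * 3600 / num_clips
    avg_questions_per_clip = num_qa_pairs / num_clips
    return avg_clip_seconds, avg_questions_per_clip


avg_seconds, avg_questions = dataset_averages(TOTAL_HOURS, NUM_QA_PAIRS, NUM_CLIPS)
print(f"average clip length: {avg_seconds:.1f} s")
print(f"average questions per clip: {avg_questions:.2f}")
```

Both averages line up with the slide: the mean clip length falls inside the stated 60-90 second range, and the mean question count is about 7 per clip.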

Questions
  Main question part
  Grounding part
  Temporal localization
  Each clip has 7 questions
  Each question has 5 multiple-choice answers
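The question structure described on this slide could be represented roughly as follows. This is a minimal sketch, not the dataset's actual schema: every field name is hypothetical, and the example question is invented purely for illustration. The one number taken from the slide is the five answer choices, which also fixes the accuracy of random guessing.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_CHOICES = 5  # from the slide: five multiple-choice answers per question


@dataclass
class TVQAQuestion:
    # All field names are hypothetical, chosen for illustration only.
    main_part: str                     # the main question part
    grounding_part: str                # the part that points at a moment in the clip
    answers: List[str]                 # the candidate answers
    correct_index: int                 # index of the correct answer
    localization: Tuple[float, float]  # (start_sec, end_sec) of the relevant span

    def __post_init__(self) -> None:
        if len(self.answers) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} answers, got {len(self.answers)}")
        if not 0 <= self.correct_index < NUM_CHOICES:
            raise ValueError("correct_index must point at one of the answers")
        start, end = self.localization
        if start > end:
            raise ValueError("localization start must not exceed its end")

    def text(self) -> str:
        """Join the two question parts into the full question string."""
        return f"{self.main_part} {self.grounding_part}"


def chance_accuracy(num_choices: int = NUM_CHOICES) -> float:
    """Accuracy (%) of picking one of the choices uniformly at random."""
    return 100.0 / num_choices


# Invented example, not taken from the dataset.
example = TVQAQuestion(
    main_part="What is the person holding",
    grounding_part="when they walk into the room?",
    answers=["A cup", "A book", "A phone", "A bag", "Nothing"],
    correct_index=1,
    localization=(12.0, 18.5),
)
print(example.text())
print(chance_accuracy())
```

With five choices, random guessing scores 20%, which is a useful floor when reading the accuracy figures later in the deck.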

TVQA

TVQA
  Subtitles
  Visual Concepts: object detection, concatenate, remove duplicates
  Video Features: ResNet
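The "concatenate, remove duplicates" step for visual concepts might look something like the sketch below. It assumes an object detector has already produced a list of labels for each frame; the function name, the order-preserving behavior, and the sample labels are assumptions made for illustration, not details confirmed by the slides.

```python
def merge_visual_concepts(per_frame_labels):
    """Concatenate per-frame detection labels, keeping each label once.

    Labels are kept in order of first appearance (an assumption; the
    slides only say "concatenate" and "remove duplicates").
    """
    seen = set()
    merged = []
    for frame_labels in per_frame_labels:
        for label in frame_labels:
            if label not in seen:
                seen.add(label)
                merged.append(label)
    return merged


# Invented detector output for three frames of one clip.
frames = [["man", "sofa"], ["sofa", "cup"], ["man"]]
print(merge_visual_concepts(frames))
```

The merged list can then be treated as a short piece of text describing the clip, alongside the subtitles.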

Model Used

Baseline Models
  LSTM
  BiLSTM
  Baseline CNN+LSTM

Results

Model Used: TVQA (accuracy, %)

              TVQA + S   TVQA + V   TVQA + IMG   TVQA + V + IMG
Reported      65.15      45.03      43.78        N/A
Replication   65.74      45.25      44.42        45.52

Baseline models (accuracy, %)

          Q       S + Q   V + Q   (FC) V + Q   S + V + Q
LSTM      42.74   42.71   42.61   42.85        42.39
BiLSTM    42.48   42.67   -       42.86        42.84
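To make the Results figures easier to compare, the sketch below computes the gap between each replicated and reported accuracy and finds the strongest baseline configuration relative to random guessing. The numbers are copied from the Results slide; the function names are made up for illustration, and the 20% chance level assumes uniform guessing over the five answer choices. These are single reported numbers, so the gaps say nothing about statistical significance.

```python
# Accuracy (%) figures copied from the Results slide.
REPORTED = {"TVQA + S": 65.15, "TVQA + V": 45.03, "TVQA + IMG": 43.78}
REPLICATION = {
    "TVQA + S": 65.74,
    "TVQA + V": 45.25,
    "TVQA + IMG": 44.42,
    "TVQA + V + IMG": 45.52,  # no reported figure to compare against
}
BASELINES = {
    "LSTM": {"Q": 42.74, "S + Q": 42.71, "V + Q": 42.61,
             "(FC) V + Q": 42.85, "S + V + Q": 42.39},
    # The slide gives no BiLSTM figure for V + Q.
    "BiLSTM": {"Q": 42.48, "S + Q": 42.67,
               "(FC) V + Q": 42.86, "S + V + Q": 42.84},
}
CHANCE = 100.0 / 5  # five answer choices per question


def replication_gaps(reported, replication):
    """Replication minus reported accuracy, in percentage points."""
    return {name: round(replication[name] - reported[name], 2) for name in reported}


def best_baseline(baselines):
    """Return (model, inputs, accuracy) for the highest-scoring baseline."""
    return max(
        ((model, inputs, acc)
         for model, scores in baselines.items()
         for inputs, acc in scores.items()),
        key=lambda item: item[2],
    )


gaps = replication_gaps(REPORTED, REPLICATION)
model, inputs, acc = best_baseline(BASELINES)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")
print(f"best baseline: {model} on {inputs}, {acc - CHANCE:.2f} points above chance")
```

All three comparable replications land within one percentage point of the reported figures, while every baseline stays in a narrow band around 42-43%, regardless of which inputs it receives.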

Summary and Next Steps
  Reproduced results
  Baseline results
  Look into network mistakes and address them
  Main goal: boost performance by using visual cues effectively