Visual Question Answering

Visual Question Answering. Aaron Honculada, Aisha Urooj, Dr. Mubarak Shah, Dr. Niels Lobo

TVQA Dataset: 460 hours of video; 152,545 question-answer pairs; 21,793 clips (60-90 sec each); multimodal; compositionality; video QA; associated natural language (subtitles)
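
As a quick sanity check on the dataset figures above, the short Python sketch below derives the average number of questions per clip and the average clip length from the stated totals. The function and variable names are illustrative only, and the averages are derived values rather than figures taken from the slides.

```python
# Totals as stated on the TVQA Dataset slide.
TOTAL_VIDEO_HOURS = 460
TOTAL_QA_PAIRS = 152_545
TOTAL_CLIPS = 21_793


def dataset_averages(hours, qa_pairs, clips):
    """Return (questions per clip, seconds per clip) from dataset totals."""
    questions_per_clip = qa_pairs / clips
    seconds_per_clip = hours * 3600 / clips
    return questions_per_clip, seconds_per_clip


q_per_clip, sec_per_clip = dataset_averages(
    TOTAL_VIDEO_HOURS, TOTAL_QA_PAIRS, TOTAL_CLIPS
)
print(f"{q_per_clip:.2f} questions per clip, {sec_per_clip:.1f} s per clip")
```

Both derived values are consistent with the slides: about 7 questions per clip, and an average clip length inside the stated 60-90 second range.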

Questions: main question part; grounding part; temporal localization; each clip has 7 questions; each question has 5 multiple-choice answers
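
The multiple-choice setup described above can be sketched as a small data structure plus an accuracy routine. This is a minimal sketch, not the TVQA file format or the authors' code: the class, its field names, the toy scoring function, and the example question are all hypothetical. Only the five-answer format comes from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

NUM_CHOICES = 5  # each question has 5 multiple-choice answers (from the slide)
CHANCE_ACCURACY = 100.0 / NUM_CHOICES  # accuracy (%) of uniform random guessing


@dataclass
class MultipleChoiceQuestion:
    """Hypothetical container; field names are assumptions, not the TVQA schema."""
    clip_id: str
    question: str
    answers: List[str]
    correct_idx: int

    def __post_init__(self):
        if len(self.answers) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} candidate answers")
        if not 0 <= self.correct_idx < NUM_CHOICES:
            raise ValueError("correct_idx out of range")


def predict(q: MultipleChoiceQuestion, score: Callable[[str, str], float]) -> int:
    """Score every candidate answer and return the index of the best one."""
    scores = [score(q.question, a) for a in q.answers]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(questions: Sequence[MultipleChoiceQuestion],
             score: Callable[[str, str], float]) -> float:
    """Percentage of questions whose top-scored answer is the correct one."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(predict(q, score) == q.correct_idx for q in questions)
    return 100.0 * correct / len(questions)


def toy_overlap_score(question: str, answer: str) -> float:
    """Toy stand-in for a trained model: counts shared lowercase words."""
    return len(set(question.lower().split()) & set(answer.lower().split()))


# Made-up example question, for illustration only.
example = MultipleChoiceQuestion(
    clip_id="clip_0001",
    question="What is on the table when the door opens?",
    answers=["A red cup is on the table", "Nothing", "A dog", "Two phones", "A hat"],
    correct_idx=0,
)
print(predict(example, toy_overlap_score), accuracy([example], toy_overlap_score))
```

With five candidates per question, random guessing gives 20% accuracy, which is a useful floor when reading the accuracy numbers for any model on this task.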

TVQA

TVQA inputs: subtitles; visual concepts (object detection, concatenate, remove duplicates); video features (ResNet)

Model Used

Baseline Models: LSTM; BiLSTM

Baseline Models: baseline CNN+LSTM

Results (accuracy, %)

Model Used        Reported   Replication
TVQA + S          65.15      65.74
TVQA + V          45.03      45.25
TVQA + IMG        43.78      44.42
TVQA + V + IMG    N/A        45.52

Baseline   Q       S + Q   V + Q   (FC) V + Q   S + V + Q
LSTM       42.74   42.71   42.61   42.85        42.39
BiLSTM     42.48   42.67   -       42.86        42.84
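
To make the comparison between reported and replicated accuracy concrete, the sketch below computes the gap for each model configuration that has both numbers. The accuracy values are copied from the results above; the dictionary layout and function name are illustrative assumptions.

```python
# Accuracy (%) as listed in the results above.
REPORTED = {"TVQA + S": 65.15, "TVQA + V": 45.03, "TVQA + IMG": 43.78}
REPLICATION = {
    "TVQA + S": 65.74,
    "TVQA + V": 45.25,
    "TVQA + IMG": 44.42,
    "TVQA + V + IMG": 45.52,  # no reported value (N/A), so no gap is computed
}


def replication_gaps(reported, replication):
    """Replication minus reported accuracy, in percentage points."""
    return {
        name: round(replication[name] - reported[name], 2)
        for name in reported
        if name in replication
    }


gaps = replication_gaps(REPORTED, REPLICATION)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points vs. reported")
```

On these numbers, every replicated configuration lands within one percentage point of the reported accuracy.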

Summary and Next Steps: reproduced results; baseline results; look into network mistakes and address them. Main goal: boost performance by using visual cues effectively.