Visual Question Answering


Presentation on theme: "Visual Question Answering"— Presentation transcript:

1 Visual Question Answering
Aaron Honculada, Aisha Urooj, Dr. Mubarak Shah, Dr. Niels Lobo

2 TVQA Dataset
460 hours of video
152,545 question-answer pairs
21,793 clips (60-90 sec)
Multimodal
Compositionality
Video-QA
Associated natural language (subtitles)

3 Questions
Main question part
Grounding part
Temporal localization
Each clip has 7 questions
Each question has 5 multiple-choice answers
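The two-part question structure and the five-way answer format can be illustrated with a minimal Python sketch. The connective words used to find the grounding part (`when`, `before`, `after`), the function names, and the sample question are assumptions made for illustration; the slides do not say how the two parts are separated.

```python
from typing import Tuple

# Assumed connectives that introduce the grounding part of a question.
CONNECTIVES = ("when", "before", "after")


def split_question(text: str) -> Tuple[str, str]:
    """Split a question into (main part, grounding part).

    The first word is skipped so that a leading "When ..." question word
    is not mistaken for a connective. If no connective is found, the
    grounding part is returned as an empty string.
    """
    words = text.rstrip("?").split()
    for i, word in enumerate(words):
        if i > 0 and word.lower() in CONNECTIVES:
            return " ".join(words[:i]), " ".join(words[i:])
    return " ".join(words), ""


def chance_accuracy(num_answers: int) -> float:
    """Accuracy of uniform random guessing over the candidate answers."""
    return 1.0 / num_answers


main, grounding = split_question("What is the man holding when he opens the door?")
print(main)                # main question part
print(grounding)           # grounding part, used for temporal localization
print(chance_accuracy(5))  # random-guess accuracy with 5 candidate answers
```

With 5 candidate answers per question, random guessing gives 20% accuracy, which is a useful floor when reading the accuracy figures in the results.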

4 TVQA

5 TVQA
Subtitles
Visual Concepts: object detection, concatenate, remove duplicates
Video Features: ResNet
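The "concatenate, remove duplicates" step for visual concepts can be sketched as follows. This is a rough illustration only: it assumes the object detector has already produced a list of label strings per frame, and the function name, lowercasing, and first-seen ordering are assumptions rather than details given on the slide.

```python
from typing import Iterable, List


def merge_visual_concepts(per_frame_labels: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate per-frame detection labels and drop repeated labels.

    Labels are lowercased and stripped before comparison (an assumption),
    and the first occurrence of each label determines its position.
    """
    seen = set()
    merged: List[str] = []
    for labels in per_frame_labels:
        for label in labels:
            key = label.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical detector output for three frames of one clip.
frames = [["man", "sofa"], ["sofa", "laptop"], ["Man"]]
print(merge_visual_concepts(frames))
```

The result is a single list of concept words per clip, which can then be treated like text alongside the subtitles.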

6 Model Used

7 Baseline Models
LSTM
BiLSTM

8 Baseline Models
Baseline CNN+LSTM

9-17 Results: Accuracy (%)

Model Used     TVQA + S   TVQA + V   TVQA + IMG   TVQA + V + IMG
Reported       65.15%     45.03%     43.78%       N/A
Replication    65.74%     45.25%     44.42%       45.52%

               Q          S + Q      V + Q        (FC) V + Q   S + V + Q
LSTM           42.74%     42.71%     42.61%       42.85%       42.39%
BiLSTM         42.48%     42.67%                  42.86%       42.84%
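To make the comparison between reported and replicated accuracies concrete, the short Python sketch below computes the gap for each input configuration that has both numbers. The figures are copied from the results slides; the dictionary layout and function name are illustrative choices, not part of the presentation.

```python
# Accuracies (%) taken from the results slides.
reported = {"TVQA + S": 65.15, "TVQA + V": 45.03, "TVQA + IMG": 43.78}
replication = {
    "TVQA + S": 65.74,
    "TVQA + V": 45.25,
    "TVQA + IMG": 44.42,
    "TVQA + V + IMG": 45.52,  # no reported value (N/A on the slide)
}

# Random guessing over 5 candidate answers.
CHANCE = 100.0 / 5


def replication_gaps(reported: dict, replication: dict) -> dict:
    """Replication minus reported accuracy, in percentage points."""
    return {
        name: round(replication[name] - reported[name], 2)
        for name in reported
        if name in replication
    }


for name, gap in replication_gaps(reported, replication).items():
    print(f"{name}: {gap:+.2f} points vs. reported")

# Margin of the subtitle-only model over the video-feature model.
print(round(replication["TVQA + S"] - replication["TVQA + V"], 2))
print(f"chance level: {CHANCE:.0f}%")
```

All three replicated numbers land within one percentage point of the reported ones, while the subtitle-based model stays roughly 20 points ahead of the visual-only variants.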

18 Results

19 Results

20 Results

21 Results

22 Results

23 Results

24 Summary and Next Steps
Reproduced results
Baseline results
Look into network mistakes and address them
Main goal: boost performance by using visual cues effectively

