1
Visual Question Answering
Aaron Honculada, Aisha Urooj, Dr. Mubarak Shah, Dr. Niels Lobo
2
TVQA Dataset
460 hours of video
152,545 question-answer pairs
21,793 clips (60-90 sec each)
Multimodal
Compositionality
Video-QA
Associated natural language (subtitles)
3
Questions
Main question part
Grounding part
Temporal localization
Each clip has 7 questions
Each question has 5 multiple-choice answers
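The question format on this slide can be sketched as a small record type. This is only an illustration: the class name, field names, and the sample question are hypothetical and not taken from the TVQA release. The one thing that follows directly from the slide is that five answer choices put random guessing at 20% accuracy, which is a useful floor when reading the results.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_CHOICES = 5  # from the slide: five multiple-choice answers per question
CHANCE_ACCURACY = 100.0 / NUM_CHOICES  # accuracy (%) of uniform random guessing


@dataclass
class TVQAQuestion:
    """One multiple-choice question (hypothetical field names)."""
    clip_id: str
    question: str        # main question part followed by the grounding part
    answers: List[str]   # exactly NUM_CHOICES candidate answers
    correct_idx: int     # index of the correct answer in `answers`
    span_sec: Optional[Tuple[float, float]] = None  # temporal localization (start, end)

    def __post_init__(self) -> None:
        if len(self.answers) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} answers, got {len(self.answers)}")
        if not 0 <= self.correct_idx < NUM_CHOICES:
            raise ValueError("correct_idx is out of range")


def accuracy(predictions: List[int], questions: List[TVQAQuestion]) -> float:
    """Percentage of questions whose predicted index matches the correct one."""
    if not questions or len(predictions) != len(questions):
        raise ValueError("need one prediction per question and at least one question")
    correct = sum(p == q.correct_idx for p, q in zip(predictions, questions))
    return 100.0 * correct / len(questions)


# A made-up example, not an actual dataset entry.
example = TVQAQuestion(
    clip_id="example_clip_000",
    question="What is she holding when she walks into the kitchen?",
    answers=["A cup", "A book", "A phone", "A bag", "Nothing"],
    correct_idx=0,
    span_sec=(12.0, 18.5),
)
print(accuracy([0], [example]))  # 100.0
print(CHANCE_ACCURACY)           # 20.0
```

As a side check, 21,793 clips times 7 questions gives 152,551, slightly more than the 152,545 pairs quoted for the dataset, so the seven-per-clip figure presumably holds for nearly all clips rather than every one.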
4
TVQA
5
TVQA
Subtitles
Visual Concepts: object detection, concatenate, remove duplicates
Video Features: ResNet
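The "concatenate, remove duplicates" step for visual concepts might look roughly like the sketch below. It assumes an object detector has already produced a list of labels for each frame; the function name and the lowercasing of labels are illustrative choices, not details given on the slide.

```python
from typing import Iterable, List


def clip_visual_concepts(per_frame_labels: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate per-frame detector labels and drop duplicates.

    Labels are lowercased (an assumption) and the first-seen order is kept,
    so the result is one list of unique concept words for the whole clip.
    """
    seen = set()
    concepts: List[str] = []
    for frame_labels in per_frame_labels:
        for label in frame_labels:
            key = label.strip().lower()
            if key and key not in seen:
                seen.add(key)
                concepts.append(key)
    return concepts


# Hypothetical detector output for three frames of one clip.
frames = [["man", "couch"], ["couch", "cup"], ["Man", "door"]]
print(clip_visual_concepts(frames))  # ['man', 'couch', 'cup', 'door']
```

Keeping first-seen order is optional; a plain set would also remove duplicates, but an ordered list is easier to feed into a word-embedding layer in a reproducible way.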
6
Model Used
7
Baseline Models
LSTM
BiLSTM
8
Baseline Models
Baseline CNN+LSTM
17
Results
Accuracy (%)

Model Used     TVQA + S   TVQA + V   TVQA + IMG   TVQA + V + IMG
Reported       65.15%     45.03%     43.78%       N/A
Replication    65.74%     45.25%     44.42%       45.52%

Baseline       Q          S + Q      V + Q        (FC) V + Q   S + V + Q
LSTM           42.74%     42.71%     42.61%       42.85%       42.39%
BiLSTM         42.48%     42.67%     -            42.86%       42.84%
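To make the replication comparison on this slide concrete, the short sketch below computes the gap between reported and replicated accuracy for each TVQA configuration, plus the margin of one baseline over random guessing. The numbers are copied from the slide; the variable and function names are illustrative, and reading the "Q" column as a question-only input is an assumption.

```python
# Accuracies (%) copied from the results slide.
REPORTED = {"TVQA + S": 65.15, "TVQA + V": 45.03, "TVQA + IMG": 43.78}
REPLICATED = {
    "TVQA + S": 65.74,
    "TVQA + V": 45.25,
    "TVQA + IMG": 44.42,
    "TVQA + V + IMG": 45.52,  # no reported figure (N/A on the slide)
}
LSTM_Q = 42.74         # LSTM baseline, "Q" column (assumed: question-only input)
CHANCE = 100.0 / 5     # five answer choices per question


def replication_gaps(reported: dict, replicated: dict) -> dict:
    """Replicated minus reported accuracy, in percentage points.

    Configurations without a reported number are skipped.
    """
    return {
        name: round(replicated[name] - reported[name], 2)
        for name in reported
        if name in replicated
    }


gaps = replication_gaps(REPORTED, REPLICATED)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")

print(f"LSTM (Q) above chance: {LSTM_Q - CHANCE:.2f} points")
```

All three replicated numbers land within one percentage point of the reported ones, while the baselines sit a little over 22 points above the 20% chance level.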
18
Results
24
Summary and Next Steps
Reproduced results
Baseline results
Look into network mistakes and address them
Main goal: boost performance by using visual cues effectively