Visual Question Answering
Aaron Honculada, Aisha Urooj
Dr. Mubarak Shah, Dr. Niels Lobo
TVQA Dataset
- 460 hours of video
- 152,545 question-answer pairs
- 21,793 clips (60-90 sec each)
- Multimodal
- Compositionality
- Video-QA
- Associated natural language (subtitles)
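As a quick sanity check, the dataset totals above can be turned into per-clip averages. This is a minimal sketch that uses only the numbers on the slide; the variable names are illustrative, and the 460-hour figure is presumably rounded, so the averages are approximate.

```python
# Dataset totals as listed on the slide.
TOTAL_HOURS = 460
NUM_QA_PAIRS = 152_545
NUM_CLIPS = 21_793

# Average clip length in seconds (the slide gives a 60-90 sec range).
avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS

# Average number of questions per clip.
avg_questions_per_clip = NUM_QA_PAIRS / NUM_CLIPS

print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"average questions per clip: {avg_questions_per_clip:.2f}")
```

The average clip length comes out at roughly 76 seconds, inside the stated 60-90 sec range, and the average number of questions per clip is very close to 7.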
Questions
- Main question part
- Grounding part
- Temporal localization
- Each clip has 7 questions
- Each question has 5 multiple-choice answers
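One way to picture a single question under this scheme is as a small record with a question, five candidate answers, one correct index, and a localized time span. The class name, field names, and the example question below are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_CHOICES = 5  # each question has 5 multiple-choice answers


@dataclass
class QAItem:
    """One multiple-choice question about a clip (hypothetical layout)."""
    clip_id: str
    question: str                   # main question part + grounding part
    answers: List[str]              # exactly NUM_CHOICES candidates
    correct_idx: int                # index of the right answer
    time_span: Tuple[float, float]  # temporal localization, in seconds

    def __post_init__(self) -> None:
        if len(self.answers) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} answers")
        if not 0 <= self.correct_idx < NUM_CHOICES:
            raise ValueError("correct_idx out of range")


# Made-up example, not taken from the dataset.
item = QAItem(
    clip_id="example_clip",
    question="What is the character holding when she enters the room?",
    answers=["A cup", "A book", "A phone", "A bag", "Nothing"],
    correct_idx=0,
    time_span=(12.0, 18.5),
)

# Accuracy expected from guessing uniformly at random among the choices.
chance_accuracy = 1 / NUM_CHOICES
print(f"chance accuracy: {chance_accuracy:.0%}")  # chance accuracy: 20%
```

With five choices per question, uniform random guessing gives 20% accuracy, which is a useful floor when reading any accuracy figure on this task.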
TVQA
- Subtitles
- Visual Concepts: object detection, concatenate, remove duplicates
- Video Features: ResNet
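The visual-concept steps listed above (detect objects, concatenate, remove duplicates) could look roughly like the following. The detector output is hard-coded as a stand-in, the function name and sample labels are assumptions for illustration, and keeping first-seen order is a choice made here, not something stated on the slide.

```python
from typing import Iterable, List


def merge_visual_concepts(per_frame_labels: Iterable[List[str]]) -> List[str]:
    """Concatenate per-frame detection labels and drop duplicates.

    First-seen order is kept (an assumption made for readability).
    """
    seen = set()
    merged: List[str] = []
    for labels in per_frame_labels:      # concatenate across frames
        for label in labels:
            if label not in seen:        # remove duplicates
                seen.add(label)
                merged.append(label)
    return merged


# Stand-in for object-detector output on three frames of one clip.
frames = [
    ["man", "sofa", "cup"],
    ["man", "cup", "table"],
    ["woman", "sofa"],
]

concepts = merge_visual_concepts(frames)
print(" ".join(concepts))  # man sofa cup table woman
```

The resulting label sequence can then be treated like a short piece of text and embedded the same way as the subtitles or the question.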
Model Used
Baseline Models
- LSTM
- BiLSTM
- Baseline CNN+LSTM
Results

Accuracy (%) by model used

Model Used     TVQA + S   TVQA + V   TVQA + IMG   TVQA + V + IMG
Reported       65.15%     45.03%     43.78%       N/A
Replication    65.74%     45.25%     44.42%       45.52%

Baseline       Q          S + Q      V + Q        (FC) V + Q   S + V + Q
LSTM           42.74%     42.71%     42.61%       42.85%       42.39%
BiLSTM         42.48%     42.67%     -            42.86%       42.84%
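To make the replication gap concrete, the reported and replicated accuracies can be compared directly. This is a minimal sketch that uses only the numbers from the results above; the dictionary layout is just for illustration.

```python
# Accuracy (%) from the results: (reported, replicated).
results = {
    "TVQA + S": (65.15, 65.74),
    "TVQA + V": (45.03, 45.25),
    "TVQA + IMG": (43.78, 44.42),
}

# Difference in percentage points between replication and reported numbers.
deltas = {name: round(rep - ref, 2) for name, (ref, rep) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

All three replicated numbers land within one percentage point of the reported ones (+0.59, +0.22, and +0.64 points). TVQA + V + IMG is left out because no reported figure is available for it.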
Summary and Next Steps
- Reproduced results
- Baseline results
- Look into network mistakes and address them
- Main goal: boost performance using visual cues effectively