1
VQA: Visual Question Answering
Presented by: Vivek Pradhan
2
Problem Definition
Given an image, we want to answer an open-ended question about the image.
3
Best Performing Model
Question and image encodings are fed to an LSTM to generate the answer. Scene-level and object-level information appears to be available to the model, since it works well on questions that rely on that information.
4
Experiments
Why don’t humans have perfect accuracy?
What does the model really learn?
5
Human Performance
Humans do not do a perfect job.
Captions give a lot of information. The model does not outperform humans with captions. It would be interesting to see whether the set of answers that humans get right with captions overlaps significantly with the model’s correct answers.
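The overlap check suggested above could be sketched as follows. This is a minimal illustration, assuming we already have the IDs of the questions each side answered correctly; the function name `overlap_stats` and the toy IDs are hypothetical and not from the presentation.

```python
def overlap_stats(human_correct, model_correct):
    """Compare two collections of question IDs that were answered correctly."""
    human_correct = set(human_correct)
    model_correct = set(model_correct)
    both = human_correct & model_correct
    either = human_correct | model_correct
    return {
        "both": len(both),
        "human_only": len(human_correct - model_correct),
        "model_only": len(model_correct - human_correct),
        # Jaccard index: shared correct answers over all correct answers.
        "jaccard": len(both) / len(either) if either else 0.0,
    }


# Toy question IDs, invented for illustration only.
stats = overlap_stats({1, 2, 3, 4}, {3, 4, 5})
print(stats)
```

A high Jaccard value would suggest that the model and caption-only humans succeed on largely the same questions; a low value would suggest they rely on different cues.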
6
Annotator Accuracy
7
Annotator Accuracy
8
Annotator Accuracy
9
Annotator Accuracy
10
Naïve Baselines - Performance
Method                                        Accuracy
Most popular answer by question type          36.18%
Answer chosen from the nearest neighbor (NN)  40.61%
Clearly, it is possible to learn some shortcuts to solve this task.
Captions
Co-Occurrence
Likely Answer by Scene Category
Likely Answer by Object Detection
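The two baselines reported above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say how question types are assigned or which similarity the nearest-neighbor lookup uses, so the question type is taken as given and word overlap (Jaccard) stands in for the similarity. The function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_type_prior(train):
    """Map each question type to its most frequent training answer.

    train: iterable of (question_type, answer) pairs.
    """
    counts = defaultdict(Counter)
    for qtype, answer in train:
        counts[qtype][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def nearest_neighbor_answer(question, train_qa):
    """Return the answer of the most similar training question.

    Similarity here is word-overlap Jaccard, an assumed stand-in.
    train_qa: list of (question, answer) pairs.
    """
    words = set(question.lower().split())

    def similarity(item):
        other = set(item[0].lower().split())
        union = words | other
        return len(words & other) / len(union) if union else 0.0

    return max(train_qa, key=similarity)[1]


# Toy data, invented for illustration only.
prior = fit_type_prior([
    ("what color", "white"), ("what color", "white"), ("what color", "red"),
    ("is there", "yes"), ("is there", "no"), ("is there", "yes"),
])
nn_answer = nearest_neighbor_answer(
    "what color is the bus",
    [("what color is the car", "red"), ("is there a dog", "yes")],
)
print(prior, nn_answer)
```

Neither baseline looks at the image, which is why their accuracy is a useful floor for judging how much a full model gains from visual input.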
11
Possible Extensions
Will the addition of object labels and scene categories help? Can we use an object detection model and a scene classification model to provide better signals than VGG?
12
What does the model learn?
13
What does the model learn?
It seems to be using the scene category label.
14
What does the model learn?
15
Model Learns Scene Label
In a baseball game, a ball is the most likely object to be hit. At a beach, the most likely activity is surfing.
16
What does the model learn?
17
What does the model learn?
18
Model Learns Co-Occurrence of objects
The model learns which objects are likely to be on top of a car.
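To make the idea of co-occurrence statistics concrete, here is a minimal sketch that counts how often pairs of object labels appear in the same image. It assumes each image comes with a list of object labels; the function name and toy labels are hypothetical, and the sketch ignores spatial relations such as "on top of".

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(images):
    """Count how many images contain each unordered pair of object labels.

    images: iterable of label lists, one list per image.
    """
    counts = Counter()
    for labels in images:
        # Sort and deduplicate so each pair has one canonical key per image.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


# Toy annotations, invented for illustration only.
images = [
    ["car", "surfboard"],
    ["car", "surfboard", "person"],
    ["car", "dog"],
]
counts = cooccurrence_counts(images)
print(counts.most_common(3))
```

A model that has absorbed statistics like these could guess "surfboard" for a question about a car without inspecting the relevant image region.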
19
What does the model learn?
20
What does the model learn?
21
Model learns actions associated with objects
A dog is likely to be eating. A pizza is likely to be eaten by a man.
22
Conclusion
The model is learning many shortcut ways of answering questions. It might benefit from being given a list of objects and the scene category of the image. This can be tested using ground-truth annotations.
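One way such a test could be set up is to encode the ground-truth annotations as an extra input vector. The sketch below is an assumption about how that might look, not the presenter's method: object labels become a multi-hot vector and the scene category a one-hot vector. The vocabularies and the function name are hypothetical.

```python
def encode_annotations(objects, scene, object_vocab, scene_vocab):
    """Return a multi-hot object vector concatenated with a one-hot scene vector."""
    present = set(objects)
    object_vec = [1.0 if name in present else 0.0 for name in object_vocab]
    scene_vec = [1.0 if name == scene else 0.0 for name in scene_vocab]
    return object_vec + scene_vec


# Hypothetical vocabularies, invented for illustration only.
OBJECT_VOCAB = ["ball", "bat", "dog", "pizza"]
SCENE_VOCAB = ["baseball field", "beach"]

extra = encode_annotations(["bat", "ball"], "baseball field", OBJECT_VOCAB, SCENE_VOCAB)
print(extra)
```

The resulting vector could be concatenated with the image encoding before it reaches the model. Comparing accuracy with and without it would indicate whether explicit object and scene labels help.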