VQA: Visual Question Answering Presented by: Vivek Pradhan
Problem Definition Given an image, we want to answer an open-ended question about it.
Best Performing Model Question and image encodings are fed to an LSTM to generate the answer. Scene-level and object-level information appears to be available to the model, since it performs well on questions that require it.
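As a rough illustration of this kind of architecture, the sketch below encodes the question with an LSTM, projects VGG-style image features into the same space, fuses the two by element-wise multiplication, and classifies over a fixed answer set. The dimensions, the fusion choice, and the class/parameter names are assumptions for illustration, not the exact published model.

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_dim=4096, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)         # project VGG-style features
        self.classifier = nn.Linear(hidden_dim, num_answers)   # scores over top-K answers

    def forward(self, img_feats, question_tokens):
        # question_tokens: (batch, seq_len) word indices
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q_enc = h[-1]                                           # final LSTM hidden state
        fused = torch.tanh(self.img_proj(img_feats)) * q_enc    # element-wise fusion
        return self.classifier(fused)

# Example with random inputs: batch of 2 images and 12-token questions
model = VQABaseline()
scores = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
print(scores.shape)  # torch.Size([2, 1000])
```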
Experiments Why don’t humans have perfect accuracy? What does the model really learn?
Human Performance Humans do not achieve perfect accuracy. Captions alone give a lot of information: the model does not outperform humans who see only the caption. It will be interesting to see whether the set of questions that humans answer correctly from captions significantly overlaps with the model's correct answers.
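One simple way to quantify that overlap, assuming we have the sets of question IDs each gets right (the IDs below are placeholders):

```python
# Hypothetical overlap check: question IDs answered correctly by humans who saw
# only the caption vs. by the model.
human_caption_correct = {101, 102, 105, 110, 112}
model_correct = {101, 105, 110, 111, 120}

shared = human_caption_correct & model_correct
jaccard = len(shared) / len(human_caption_correct | model_correct)
print(f"shared correct answers: {len(shared)}, Jaccard overlap: {jaccard:.2f}")
```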
Annotator Accuracy
Naïve Baselines - Performance
Method | Accuracy
Most popular answer by question type | 36.18%
Answer of the nearest-neighbor (NN) question | 40.61%
Clearly it is possible to learn some shortcuts to solve this task. Other possible shortcut signals: caption co-occurrence, likely answer by scene category, likely answer by object detection.
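The first of these baselines is easy to state in code. This is a minimal sketch assuming the training data is available as (question_type, answer) pairs; the helper names and the fallback answer are illustrative, not from the paper.

```python
from collections import Counter, defaultdict

def fit_popular_answer(train_pairs):
    """train_pairs: iterable of (question_type, answer) tuples."""
    counts = defaultdict(Counter)
    for qtype, answer in train_pairs:
        counts[qtype][answer] += 1
    # most frequent answer per question type
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

def predict(popular, question_type, fallback="yes"):
    # fall back to a global prior when the question type is unseen
    return popular.get(question_type, fallback)

popular = fit_popular_answer([
    ("how many", "2"), ("how many", "2"), ("how many", "3"),
    ("what color", "white"), ("is the", "yes"),
])
print(predict(popular, "how many"))   # -> "2"
```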
Possible Extensions Will the addition of object labels and scene categories help? Can we use an object detection model and a scene classification model to provide better signals than VGG features alone? A rough sketch of this idea follows.
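One way to test the extension is to concatenate a multi-hot object vector and a one-hot scene-category vector to the image features before fusion. The class counts and dimensions below are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

NUM_OBJECT_CLASSES = 80   # e.g. COCO object categories (assumption)
NUM_SCENE_CLASSES = 365   # e.g. Places scene categories (assumption)

class AugmentedImageEncoder(nn.Module):
    """Concatenates CNN features with multi-hot object and one-hot scene vectors."""
    def __init__(self, img_dim=4096, hidden_dim=512):
        super().__init__()
        in_dim = img_dim + NUM_OBJECT_CLASSES + NUM_SCENE_CLASSES
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, img_feats, object_multihot, scene_onehot):
        x = torch.cat([img_feats, object_multihot, scene_onehot], dim=-1)
        return torch.tanh(self.proj(x))   # drop-in replacement for the image projection

enc = AugmentedImageEncoder()
out = enc(torch.randn(2, 4096),
          torch.zeros(2, NUM_OBJECT_CLASSES),
          torch.zeros(2, NUM_SCENE_CLASSES))
print(out.shape)  # torch.Size([2, 512])
```

Using ground-truth object and scene annotations in place of detector outputs gives an upper bound on how much this extra signal could help.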
What does the model learn? It seems to be using the scene category label.
Model Learns Scene Label In a baseball game, a ball is the most likely object to be hit. On a beach, the activity is likely to be surfing.
Model Learns Co-Occurrence of Objects It learns which objects are likely to be on top of a car.
Model Learns Actions Associated with Objects A dog is likely to be eating. A pizza is likely to be eaten by a man.
Conclusion The model is learning many shortcut ways of answering questions. It might benefit from being given a list of objects and the scene category of the image; this can be tested using ground-truth annotations.