Analyzing the Behavior of Deep Models. Dhruv Batra, Devi Parikh, Aishwarya Agrawal (EMNLP 2016)

Visual Question Answering (VQA): What is Visual Question Answering?

VQA Task

What is the mustache made of?

VQA Task: "What is the mustache made of?" -> AI System

VQA Task: "What is the mustache made of?" -> AI System -> "bananas"

Papers using VQA … and many more

Papers using VQA

VQA CVPR16

~30 teams

Observations. Current machine performance is around 60-66%, while human performance is at 83%. How to identify where we need progress? How to compare strengths and weaknesses? How to develop insights into failure modes? We need to understand the behavior of VQA models.

Outline: Do VQA models generalize to novel instances? Do VQA models ‘listen’ to the entire question? Do VQA models really ‘look’ at the image?

Models. Without attention (baseline model): CNN + LSTM (Lu et al. 2015), accuracy = 54.13% on the VQA validation split. With attention: Hierarchical Co-attention (Lu et al. 2016), accuracy = 57.02% on the VQA validation split.

Without attention model [Lu et al. 2015] (slide credit: Devi Parikh). Architecture diagram: the question ("How many horses are in this image?") is encoded by an LSTM into a 1024-d vector; the image is passed through convolution + non-linearity + pooling layers (VGG-19) and the normalized 4096-d output of the last hidden layer is mapped by a fully-connected layer to 1024-d; the two 1024-d vectors are combined by point-wise multiplication, followed by a fully-connected layer and a softmax over the 1000 most frequent answers, yielding the answer "2".
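
The fusion in this baseline can be sketched roughly as follows (PyTorch-style, assuming precomputed VGG-19 features and an LSTM question encoding; the class name, argument names, and tanh non-linearities are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class CnnLstmVqaBaseline(nn.Module):
    """Rough sketch of the no-attention VQA baseline (Lu et al. 2015):
    project image and question features to a common space, fuse them by
    point-wise multiplication, and classify over the 1000 most frequent answers."""

    def __init__(self, img_dim=4096, q_dim=1024, common_dim=1024, n_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)  # FC on normalized VGG-19 features
        self.q_proj = nn.Linear(q_dim, common_dim)      # FC on the LSTM question encoding
        self.classifier = nn.Linear(common_dim, n_answers)

    def forward(self, img_feat, q_feat):
        # img_feat: (batch, 4096) L2-normalized CNN features of the image
        # q_feat:   (batch, 1024) LSTM encoding of the question
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(fused)  # logits over the 1000 most frequent answers
```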

With attention model [Lu et al. 2016] (slide credit: Devi Parikh, adapted from Jiasen Lu). Architecture diagram: for the question "What is the color of the bird?", the model computes co-attention between the CNN image features and the question at multiple levels (words, phrases, and the full question), and predicts the answer "white".

Outline: Do VQA models generalize to novel instances? Do VQA models ‘listen’ to the entire question? Do VQA models really ‘look’ at the image?

Generalization to Novel Instances. Do VQA models make mistakes because test instances are too different from training ones? (1) Is test accuracy lower because test QI pairs are too different from training QI pairs? (2) Is test accuracy lower because test QI pairs are “familiar” but the test labels are too different from the training labels?

Generalization to Novel Instances: Experiment. (1) For each test QI pair, find its k-NN training QI pairs. (2) Compute the average distance between the k-NN training QI pairs and the test QI pair. (3) Measure the correlation between this average distance and test accuracy.
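
A rough sketch of this computation, assuming test_emb and train_emb are arrays of combined Q+I embeddings and test_acc holds per-example VQA accuracies; the function name, Euclidean distance, and Pearson correlation are illustrative choices, not necessarily the paper's exact ones:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

def knn_distance_vs_accuracy(test_emb, train_emb, test_acc, k=50):
    """For each test QI embedding, average the distances to its k nearest
    training QI embeddings, then correlate that distance with test accuracy."""
    dists = cdist(test_emb, train_emb, metric="euclidean")  # (n_test, n_train)
    knn_dists = np.sort(dists, axis=1)[:, :k]               # k smallest distances per test pair
    avg_dist = knn_dists.mean(axis=1)
    corr, p_value = pearsonr(avg_dist, np.asarray(test_acc))
    return avg_dist, corr, p_value
```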

Generalization to Novel Instances. K-NN space: a combined Q+I embedding (question and image embedded jointly).

Generalization to Novel Instances: Results. Significant negative correlation between average k-NN distance and test accuracy: -0.41 (k=50) without attention, -0.42 (k=15) with attention. VQA models are not very good at generalizing to novel test QI pairs; VQA models are “myopic”.

Generalization to Novel Instances: Results. A significant percentage of mistakes can be successfully predicted: 67.5% without attention, 66.7% with attention. The analysis provides a way for models to predict their own oncoming failures, a step toward more human-like models.
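
One simple way to turn the neighbor distance into such a failure predictor, shown only as an illustration (the thresholding criterion is an assumption, not necessarily the paper's exact procedure):

```python
import numpy as np

def fraction_of_mistakes_caught(avg_dist, test_acc, threshold):
    """Flag test QI pairs whose average k-NN distance exceeds a threshold
    (tuned on held-out data) as likely failures, and report what fraction of
    the model's actual mistakes were flagged in advance."""
    avg_dist = np.asarray(avg_dist)
    mistakes = np.asarray(test_acc) == 0   # pairs the model gets wrong
    flagged = avg_dist > threshold         # pairs predicted to fail
    return (mistakes & flagged).sum() / max(mistakes.sum(), 1)
```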

Test sample: Q: What type of reception is being attended? GT Ans: wedding. Predicted Ans: cake. Nearest-neighbor training samples: Q: What type of exercise equipment is shown? GT Ans: bike. Q: What type of dessert is this man having? GT Ans: cake. Q: What dessert is on the table? GT Ans: cake.

Generalization to Novel Instances. Second question: is test accuracy lower because test QI pairs are “familiar” but the test labels are too different from the training labels?

Generalization to Novel Instances: Experiment. (1) For each test QI pair, find its k-NN training QI pairs. (2) Compute the average distance (in Word2Vec space) between the GT answers of the k-NN training QI pairs and the GT answer of the test QI pair. (3) Measure the correlation between this average distance and test accuracy.
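
A sketch of the answer-space distance from step 2, assuming a dict-like word-vector lookup w2v (e.g., loaded Word2Vec vectors) and single-token answers; multi-word answers and out-of-vocabulary handling are left out:

```python
import numpy as np

def answer_space_distance(test_answer, neighbor_answers, w2v):
    """Average cosine distance, in word-vector space, between the GT answer of a
    test QI pair and the GT answers of its k-NN training QI pairs."""
    def unit_vec(word):
        v = np.asarray(w2v[word], dtype=float)
        return v / np.linalg.norm(v)

    t = unit_vec(test_answer)
    dists = [1.0 - float(np.dot(t, unit_vec(a))) for a in neighbor_answers]
    return float(np.mean(dists))

# The same correlation as before (e.g., scipy.stats.pearsonr) can then be computed
# between this answer-space distance and per-example test accuracy.
```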

Generalization to Novel Instances: Results. Significant negative correlation between answer-space distance and test accuracy: -0.62 (k=50) without attention, -0.62 (k=15) with attention. VQA models tend to regurgitate answers seen during training.

Test sample: Q: What color are the safety cones? GT Ans: green. Predicted Ans: orange. Nearest-neighbor training samples: Q: What color are the cones? GT Ans: orange. Q: What color is the cone? GT Ans: orange. Q: What color are the cones? GT Ans: orange.

Outline: Do VQA models generalize to novel instances? Do VQA models ‘listen’ to the entire question? Do VQA models really ‘look’ at the image?

Listening to the Entire Question. Full question: Q: How many horses are on the beach? Predicted Ans: 2. Partial questions: Q: How ... Predicted Ans? Q: How many ... Predicted Ans? Q: How many horses ... Predicted Ans? Q: How many horses are ... Predicted Ans? Q: How many horses are on ... Predicted Ans? Q: How many horses are on the ... Predicted Ans?

Listening to the Entire Question: Experiment. (1) Test the model with partial questions of increasing lengths. (2) Compute the percentage of questions for which the responses to partial questions are the same as the response to the full question.
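
A sketch of this test, assuming a vqa_model(image, question_tokens) callable that returns a predicted answer string; the names and the 50% cutoff are illustrative:

```python
def fraction_converged_early(vqa_model, dataset, fraction=0.5):
    """Fraction of questions for which the answer predicted from only the first
    `fraction` of the question words already matches the full-question answer."""
    same = 0
    for image, question_tokens in dataset:  # question_tokens: list of words
        full_answer = vqa_model(image, question_tokens)
        cutoff = max(1, int(len(question_tokens) * fraction))
        partial_answer = vqa_model(image, question_tokens[:cutoff])
        same += int(partial_answer == full_answer)
    return same / len(dataset)
```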

Listening to the Entire Question [plot]: 41% of questions converge to the final answer after half the question (without-attention model).

Listening to the Entire Question [plot]: 49% of questions converge to the final answer after half the question (attention model).

Listening to the Entire Question: Result. VQA models converge on their predicted answer after half the question for a significant percentage of questions: 41% without attention, 49% with attention. VQA models often “jump to conclusions”.

Listening to the Entire Question [plot: 37% / 54%]

Listening to the Entire Question [plot: 42% / 57%]

Correct Response. GT Ans: yes

Incorrect Response. GT Ans: 6

Incorrect Response. GT Ans: yes

Incorrect Response. GT Ans: spring

Outline: Do VQA models generalize to novel instances? Do VQA models ‘listen’ to the entire question? Do VQA models really ‘look’ at the image?

Looking at the Image. Q: How many zebras? Predicted Ans: 2. Same question, different image: Q: How many zebras? Predicted Ans?

Looking at the Image: Experiment. (1) Compute the percentage of times (say X) that the response does not change across images for a given question. (2) Plot the histogram of X across questions.
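
A sketch of this experiment, assuming the same vqa_model callable as above and a pool of images per question; reading X as the share of images that receive the most common answer is an assumption about the exact definition:

```python
import matplotlib.pyplot as plt
from collections import Counter

def answer_constancy(vqa_model, question_tokens, images):
    """X for one question: % of images whose prediction equals the most common
    prediction for that question (i.e., the answer did not change)."""
    answers = [vqa_model(img, question_tokens) for img in images]
    _, count = Counter(answers).most_common(1)[0]
    return 100.0 * count / len(answers)

def plot_constancy_histogram(vqa_model, question_image_pools, bins=20):
    # question_image_pools: list of (question_tokens, list_of_images) pairs
    xs = [answer_constancy(vqa_model, q, imgs) for q, imgs in question_image_pools]
    plt.hist(xs, bins=bins)
    plt.xlabel("X: % of images with unchanged answer")
    plt.ylabel("number of questions")
    plt.show()
    return xs
```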

Looking at the Image [histogram of X across questions]

Looking at the Image [histogram]: 56% (without-attention model)

Looking at the Image [histogram]: 42% (attention model)

Looking at the Image: Results. VQA models do not change their answer across images for a significant percentage of questions: 56% without attention, 42% with attention. VQA models are “stubborn”; attention-based models are less “stubborn” than non-attention-based models.

Looking at the Image. Q: What does the red sign say? Predicted Ans: stop, across images; correct for one image, incorrect for the others.

Looking at the Image. Q: How many zebras? Predicted Ans: 2, across images; correct for one image, incorrect for the others.

Looking at the Image [plot: 56% / 54%]

Looking at the Image [plot: 60% / 57%]

Looking at the Image. Q: What covers the ground? Predicted Ans: snow, across images; all responses correct.

Looking at the Image: Observations. (1) Producing the same response across images seems to be statistically favorable. (2) There are label biases in the dataset.

Conclusion. Novel techniques for characterizing the behavior of deep VQA models. Today’s VQA models are “myopic”, often “jump to conclusions”, and are “stubborn”.

To be noted. Whether this behavior is correct may depend on the dataset. Is the behavior desired? Either way, it is good to know the current behavior. The anthropomorphic adjectives are purely pedagogical.

Thanks! Questions?