Faithful Multimodal Explanation for VQA

Presentation transcript:

Faithful Multimodal Explanation for VQA. Jialin Wu, Raymond Mooney. Department of Computer Science, University of Texas at Austin.

Visual Question Answering (Agrawal et al., 2016) Answer natural language questions about an image.

Textual Explanations Generate a natural-language sentence that justifies the answer (Hendricks et al., 2016). Use an LSTM to generate an explanatory sentence given embeddings of the image, the question, and the answer. Train this LSTM on human-provided explanatory sentences.
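As a concrete illustration, a conditional LSTM decoder of this kind might be sketched in PyTorch as below. This is a minimal sketch, not the authors' exact model; the layer names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExplanationDecoder(nn.Module):
    """Minimal sketch of an LSTM that generates an explanation
    conditioned on fused image, question, and answer embeddings.
    Layer names and sizes are illustrative, not the paper's."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Project the fused (image, question, answer) embedding
        # into the LSTM's initial hidden state.
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context, tokens):
        # context: (batch, feat_dim) fused image/question/answer embedding
        # tokens:  (batch, T) gold explanation tokens (teacher forcing)
        h = torch.tanh(self.init_h(context))
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):
            x = self.word_embed(tokens[:, t])
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, T, vocab_size)
```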

Sample Textual Explanations (Park et al., 2018)

Multimodal Explanations Combine both a visual and textual explanation to justify an answer to a question (Park et al., 2018).

Post Hoc Rationalizations Previous textual and multimodal explanations for VQA (Park et al., 2018) are not “faithful” or “introspective.” They do not reflect any details of the internal processing of the network or how it actually computed the answer. They are just trained to mimic human explanations in an attempt to “justify” the answer and get humans to trust it.

Faithful Multimodal Explanations We are attempting to produce more faithful explanations that actually reflect important aspects of the VQA system's internal processing. Focus the explanation on detected objects that are highly attended to during the VQA network's generation of the answer. The generator is trained on human explanations, but explicitly biased to include references to these objects.
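For reference, the attention weights in question come from question-guided attention over detected object features. The following PyTorch sketch shows a generic form of such a module; it is an illustrative assumption, not the paper's actual VQA network, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAttentionVQA(nn.Module):
    """Generic sketch of question-guided attention over detected object
    features; the real model is more elaborate.  The attention weights
    `alpha` are what the explanation module later tries to cover."""

    def __init__(self, q_dim=512, obj_dim=2048, hidden=512, n_answers=3000):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(q_dim + obj_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(q_dim + obj_dim, n_answers)

    def forward(self, q, objs):
        # q:    (batch, q_dim)       question embedding
        # objs: (batch, K, obj_dim)  features for K detected objects/segments
        q_exp = q.unsqueeze(1).expand(-1, objs.size(1), -1)
        scores = self.att(torch.cat([q_exp, objs], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=1)              # (batch, K) attention
        v = (alpha.unsqueeze(-1) * objs).sum(dim=1)   # attended visual feature
        logits = self.classifier(torch.cat([q, v], dim=-1))
        return logits, alpha
```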

Sample Faithful Multimodal Explanation

High-Level VQA Architecture

Textual Explanation Generator Finally, we train an LSTM to generate an explanatory sentence from embeddings of the segmented objects. It is trained on supervised data to produce human-like textual explanations, but it is also trained to encourage the explanation to cover the segments highly attended to by the VQA network, so that it faithfully reflects the focus of the network that computed the answer.
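A minimal sketch of how this bias might be imposed as an auxiliary loss, assuming the explanation decoder also produces an attention distribution over the same segments. The specific coverage penalty below is an illustrative assumption, not the paper's exact objective.

```python
import torch.nn.functional as F

def explanation_loss(word_logits, gold_tokens, expl_att, vqa_att, coverage_weight=1.0):
    """Sketch of a combined objective: cross-entropy against the human
    explanation plus a term encouraging the explanation's attention over
    segments (expl_att) to cover the segments the VQA model attended to
    (vqa_att).  The coverage term's form is an illustrative assumption.

    word_logits: (batch, T, vocab)   decoder outputs
    gold_tokens: (batch, T)          human explanation tokens
    expl_att:    (batch, K)          explanation attention over K segments
    vqa_att:     (batch, K)          VQA attention over the same segments
    """
    ce = F.cross_entropy(word_logits.flatten(0, 1), gold_tokens.flatten())
    # Penalize explanation attention that ignores segments VQA relied on.
    coverage = (vqa_att * (vqa_att - expl_att).clamp(min=0)).sum(dim=1).mean()
    return ce + coverage_weight * coverage
```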

Multimodal Explanation Generator Words generated while attending to a particular visual segment are highlighted and linked to the corresponding segmentation in the visual explanation by depicting them both in the same color.
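A small sketch of that word-to-segment linking, under the assumption that each generated word carries an attention distribution over the K segments; the threshold and data layout are hypothetical.

```python
def link_words_to_segments(words, word_seg_attention, threshold=0.3):
    """Assign each generated word the segment it attended to most,
    provided that attention weight exceeds a threshold.

    words:              list of T generated words
    word_seg_attention: (T, K) attention of each word over K segments
    Returns (word, segment_index or None) pairs; words mapped to the
    same segment would be rendered in that segment's color.
    """
    links = []
    for word, att in zip(words, word_seg_attention):
        best = int(att.argmax())
        links.append((word, best if att[best] >= threshold else None))
    return links
```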

High-Level System Architecture

Sample Explanation

Evaluating Textual Explanations Compare system explanations to “gold standard” human explanations using standard machine translation metrics for judging sentence similarity. Ask human judges on Mechanical Turk to compare a system explanation to a human explanation and judge which is better (allowing for ties). Report the percentage of the time the algorithm beats or ties the human explanation.
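As one example of such a metric, sentence-level BLEU between a system explanation and a human reference can be computed with NLTK as below; BLEU stands in here for whatever full set of automated captioning metrics is actually reported.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(human_explanation, system_explanation):
    """Sentence-level BLEU-4 between a human reference explanation and a
    system explanation (one representative automated metric)."""
    reference = [human_explanation.lower().split()]
    hypothesis = system_explanation.lower().split()
    smooth = SmoothingFunction().method1
    return sentence_bleu(reference, hypothesis,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

# Example (hypothetical explanations):
# bleu4("the man is holding a tennis racket",
#       "the man is holding a racket")
```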

Textual Evaluation Results (automated metrics and human evaluation).

Evaluating Multimodal Explanations Ask human judges on Mechanical Turk to qualitatively evaluate the final multimodal explanations by answering two questions: “How well do the highlighted image regions support the answer to the question?” “How well do the colored image segments highlight the appropriate regions for the corresponding colored words in the explanation?”

Multimodal Explanation Results

Conclusions Multimodal explanations for VQA that integrate both textual and visual information are particularly useful. Our approach, which uses attention over high-level object segmentations to drive both VQA and human-like explanation, is promising and superior to previous “post hoc rationalizations.”