Faithful Multimodal Explanation for VQA

Presentation transcript:

Faithful Multimodal Explanation for VQA. Jialin Wu, Raymond Mooney. Department of Computer Science, University of Texas at Austin.

Visual Question Answering (Agrawal et al., 2016) Answer natural language questions about an image.

Textual Explanations Generate a natural-language sentence that justifies the answer (Hendricks et al., 2016). Use an LSTM to generate an explanatory sentence given embeddings of the image, the question, and the answer. Train this LSTM on human-provided explanatory sentences.
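As a concrete illustration, a conditional LSTM decoder of this kind might be sketched in PyTorch as below. This is a minimal sketch, not the authors' exact model; the layer names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExplanationDecoder(nn.Module):
    """Minimal sketch of an LSTM that generates an explanation
    conditioned on fused image, question, and answer embeddings.
    Layer names and sizes are illustrative, not the paper's."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Project the fused (image, question, answer) embedding
        # into the LSTM's initial hidden state.
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context, tokens):
        # context: (batch, feat_dim) fused image/question/answer embedding
        # tokens:  (batch, T) gold explanation tokens (teacher forcing)
        h = torch.tanh(self.init_h(context))
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):
            x = self.word_embed(tokens[:, t])
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, T, vocab_size)
```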

Sample Textual Explanations (Park et al., 2018)

Multimodal Explanations Combine both a visual and textual explanation to justify an answer to a question (Park et al., 2018).

Post Hoc Rationalizations Previous textual and multimodal explanations for VQA (Park et al., 2018) are not “faithful” or “introspective.” They do not reflect any details of the internal processing of the network or how it actually computed the answer. They are just trained to mimic human explanations in an attempt to “justify” the answer and get humans to trust it.

Faithful Multimodal Explanations We are attempting to produce more faithful explanations that actually reflect important aspects of the VQA system's internal processing. Focus the explanation on detected objects that are highly attended to during the VQA network's generation of the answer. The generator is trained on human explanations, but explicitly biased to include references to these objects.
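For reference, the attention weights in question come from question-guided attention over detected object features. The following PyTorch sketch shows a generic form of such a module; it is an illustrative assumption, not the paper's actual VQA network, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAttentionVQA(nn.Module):
    """Generic sketch of question-guided attention over detected object
    features; the real model is more elaborate.  The attention weights
    `alpha` are what the explanation module later tries to cover."""

    def __init__(self, q_dim=512, obj_dim=2048, hidden=512, n_answers=3000):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(q_dim + obj_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(q_dim + obj_dim, n_answers)

    def forward(self, q, objs):
        # q:    (batch, q_dim)       question embedding
        # objs: (batch, K, obj_dim)  features for K detected objects/segments
        q_exp = q.unsqueeze(1).expand(-1, objs.size(1), -1)
        scores = self.att(torch.cat([q_exp, objs], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=1)              # (batch, K) attention
        v = (alpha.unsqueeze(-1) * objs).sum(dim=1)   # attended visual feature
        logits = self.classifier(torch.cat([q, v], dim=-1))
        return logits, alpha
```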

Sample Faithful Multimodal Explanation

High-Level VQA Architecture

Textual Explanation Generator Finally, we train an LSTM to generate an explanatory sentence from embeddings of the segmented objects. It is trained on supervised data to produce human-like textual explanations, but it is also trained to encourage the explanation to cover the segments highly attended to by the VQA network, so that it faithfully reflects the focus of the network that computed the answer.
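A minimal sketch of how this bias might be imposed as an auxiliary loss, assuming the explanation decoder also produces an attention distribution over the same segments. The specific coverage penalty below is an illustrative assumption, not the paper's exact objective.

```python
import torch.nn.functional as F

def explanation_loss(word_logits, gold_tokens, expl_att, vqa_att, coverage_weight=1.0):
    """Sketch of a combined objective: cross-entropy against the human
    explanation plus a term encouraging the explanation's attention over
    segments (expl_att) to cover the segments the VQA model attended to
    (vqa_att).  The coverage term's form is an illustrative assumption.

    word_logits: (batch, T, vocab)   decoder outputs
    gold_tokens: (batch, T)          human explanation tokens
    expl_att:    (batch, K)          explanation attention over K segments
    vqa_att:     (batch, K)          VQA attention over the same segments
    """
    ce = F.cross_entropy(word_logits.flatten(0, 1), gold_tokens.flatten())
    # Penalize explanation attention that ignores segments VQA relied on.
    coverage = (vqa_att * (vqa_att - expl_att).clamp(min=0)).sum(dim=1).mean()
    return ce + coverage_weight * coverage
```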

Multimodal Explanation Generator Words generated while attending to a particular visual segment are highlighted and linked to the corresponding segmentation in the visual explanation by depicting them both in the same color.
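A small sketch of that word-to-segment linking, under the assumption that each generated word carries an attention distribution over the K segments; the threshold and data layout are hypothetical.

```python
def link_words_to_segments(words, word_seg_attention, threshold=0.3):
    """Assign each generated word the segment it attended to most,
    provided that attention weight exceeds a threshold.

    words:              list of T generated words
    word_seg_attention: (T, K) attention of each word over K segments
    Returns (word, segment_index or None) pairs; words mapped to the
    same segment would be rendered in that segment's color.
    """
    links = []
    for word, att in zip(words, word_seg_attention):
        best = int(att.argmax())
        links.append((word, best if att[best] >= threshold else None))
    return links
```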

High-Level System Architecture

Sample Explanation

Evaluating Textual Explanations Compare system explanations to “gold standard” human explanations using standard machine translation metrics for judging sentence similarity. Ask human judges on Mechanical Turk to compare a system explanation to a human explanation and judge which is better (allowing for ties). Report the percentage of the time the algorithm beats or ties the human explanation.
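As one example of such a metric, sentence-level BLEU between a system explanation and a human reference can be computed with NLTK as below; BLEU stands in here for whatever full set of automated captioning metrics is actually reported.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(human_explanation, system_explanation):
    """Sentence-level BLEU-4 between a human reference explanation and a
    system explanation (one representative automated metric)."""
    reference = [human_explanation.lower().split()]
    hypothesis = system_explanation.lower().split()
    smooth = SmoothingFunction().method1
    return sentence_bleu(reference, hypothesis,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

# Example (hypothetical explanations):
# bleu4("the man is holding a tennis racket",
#       "the man is holding a racket")
```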

Textual Evaluation Results (automated metrics and human evaluation).

Evaluating Multimodal Explanations Ask human judges on Mechanical Turk to qualitatively evaluate the final multimodal explanations by answering two questions: “How well do the highlighted image regions support the answer to the question?” “How well do the colored image segments highlight the appropriate regions for the corresponding colored words in the explanation?”

Multimodal Explanation Results

Conclusions Multimodal explanations for VQA that integrate both textual and visual information are particularly useful. Our approach, which uses attention over high-level object segmentations to drive both VQA and human-like explanation, is promising and superior to previous “post hoc rationalizations.”