Attention is not Explanation. NAACL 2019. Sarthak Jain, Byron C. Wallace (Northeastern University). NAACL acceptance: 2018: 207/647 (31%); 2019: 424/1198 (22.6%). 5-5-5, 5-5-3
Background: Attention Mechanism
Background: Attention
Given encoder hidden states h (one per input token) and a query Q, compute the attention distribution $\hat{\alpha} = \mathrm{softmax}(\phi(h, Q))$, where the scoring function $\phi$ is either
additive: $\phi(h, Q) = v^{\top}\tanh(W_1 h + W_2 Q)$, or
scaled dot-product: $\phi(h, Q) = \frac{hQ}{\sqrt{m}}$.
The attention vector is the weighted sum of hidden states: $h_\alpha = \sum_t \hat{\alpha}_t\, h_t$.
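A minimal sketch of the two scoring functions and the resulting attention vector, in plain PyTorch. The tensor shapes, parameter names (W1, W2, v), and the toy dimensions are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): additive and scaled dot-product
# attention over encoder states h and a query Q, as defined on the slide.
import torch
import torch.nn.functional as F

def additive_scores(h, Q, W1, W2, v):
    # phi(h, Q) = v^T tanh(W1 h + W2 Q);  h: (T, m), Q: (m,)
    return torch.tanh(h @ W1.T + Q @ W2.T) @ v              # (T,)

def scaled_dot_scores(h, Q):
    # phi(h, Q) = h . Q / sqrt(m)
    return (h @ Q) / (h.shape[-1] ** 0.5)                    # (T,)

def attend(h, scores):
    alpha = F.softmax(scores, dim=-1)                        # attention distribution
    h_alpha = alpha @ h                                      # weighted sum of states
    return alpha, h_alpha

# Toy usage with made-up dimensions
T, m = 5, 8
h, Q = torch.randn(T, m), torch.randn(m)
W1, W2, v = torch.randn(m, m), torch.randn(m, m), torch.randn(m)
alpha, h_alpha = attend(h, additive_scores(h, Q, W1, W2, v))
```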
Question: Does the attention mechanism really capture semantically meaningful attention?
Does attention provide transparency? Do attention weights correlate with measures of feature importance? Would alternative attention weights necessarily yield different predictions?
Experiment Model: one-hot tokens -> embedding -> encoder (BiRNN) -> attention over hidden states h (query Q) -> dense layer -> prediction y
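A minimal sketch of this pipeline, assuming a binary classifier with additive attention and no external query; the class name AttnClassifier and all dimensions are made up for illustration.

```python
# Minimal sketch of the experimental model (assumed architecture, illustrative sizes):
# one-hot tokens -> embedding -> BiLSTM encoder -> additive attention -> dense -> y
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.enc = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid_dim, 1)      # additive scoring, no external query
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, tokens):                      # tokens: (B, T) integer ids
        h, _ = self.enc(self.emb(tokens))           # (B, T, 2*hid_dim)
        scores = self.attn(torch.tanh(h)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)           # (B, T) attention weights
        h_alpha = torch.einsum('bt,bth->bh', alpha, h)
        return F.softmax(self.out(h_alpha), dim=-1), alpha
```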
Dataset
Correlation with Feature Importance: compute the Kendall-τ correlation between attention weights and two feature-importance measures: (i) a gradient-based measure, and (ii) leave-one-feature-out (the change in output when a token is removed). Sketch below.
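A hedged sketch of how the two importance measures and their correlation with attention could be computed, reusing the AttnClassifier sketch above. The gradient measure here is an L1 norm of the gradient at the token embeddings and leave-one-out uses total variation distance over the output distribution; the paper's exact formulations differ in details, and scipy's kendalltau stands in for the paper's Kendall-τ statistic.

```python
# Sketch of the two importance measures and their Kendall-tau correlation
# with attention, for a single example (batch size 1), using AttnClassifier.
import torch
from scipy.stats import kendalltau

def gradient_importance(model, tokens, target_class):
    # L1 norm of d y[target_class] / d embedding_t, one score per token
    emb = model.emb(tokens).detach().requires_grad_(True)
    h, _ = model.enc(emb)
    alpha = torch.softmax(model.attn(torch.tanh(h)).squeeze(-1), dim=-1)
    y = torch.softmax(model.out(torch.einsum('bt,bth->bh', alpha, h)), dim=-1)
    y[0, target_class].backward()
    return emb.grad.abs().sum(-1).squeeze(0)        # (T,)

def loo_importance(model, tokens):
    # total variation distance between the prediction and the prediction
    # obtained after leaving token t out of the input
    with torch.no_grad():
        base, _ = model(tokens)
        diffs = []
        for t in range(tokens.shape[1]):
            keep = [i for i in range(tokens.shape[1]) if i != t]
            y_t, _ = model(tokens[:, keep])
            diffs.append(0.5 * (base - y_t).abs().sum().item())
    return torch.tensor(diffs)

def attention_correlation(alpha, importance):
    # Kendall-tau rank correlation between attention weights and importance
    tau, _ = kendalltau(alpha.detach().flatten().numpy(), importance.flatten().numpy())
    return tau
```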
Result for Correlation: Gradients. [Figure legend: Orange => Positive, Purple => Negative; Orange/Purple/Green => Neutral/Contradiction/Entailment (SNLI).]
Result for Correlation: Leave-One-Out
Statistically Significant
Random Attention Weights: randomly permute the learned attention weights while keeping the rest of the model fixed, and measure the change in output (total variation distance). Sketch below.
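A sketch of the permutation experiment under the same AttnClassifier assumptions: hold the encoder states fixed, shuffle the learned attention weights, and measure how far the prediction moves in total variation distance (the paper reports the median over permutations). The permutation count is an arbitrary choice.

```python
# Sketch of the random-permutation experiment: shuffle the learned attention
# weights and measure the resulting shift in the output distribution.
import torch

def tvd(p, q):
    # total variation distance between two output distributions
    return 0.5 * (p - q).abs().sum(-1)

@torch.no_grad()
def permutation_effect(model, tokens, n_perm=100):
    h, _ = model.enc(model.emb(tokens))                      # encoder states stay fixed
    alpha = torch.softmax(model.attn(torch.tanh(h)).squeeze(-1), dim=-1)
    y = torch.softmax(model.out(torch.einsum('bt,bth->bh', alpha, h)), dim=-1)
    deltas = []
    for _ in range(n_perm):                                  # the paper reports the median
        perm = torch.randperm(alpha.shape[1])
        y_p = torch.softmax(model.out(torch.einsum('bt,bth->bh', alpha[:, perm], h)), dim=-1)
        deltas.append(tvd(y, y_p))
    return torch.stack(deltas).median()
```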
Result for Random Permutation. [Figure legend: Orange => Positive, Purple => Negative; Orange/Purple/Green => Neutral/Contradiction/Entailment (SNLI).]
Adversarial Attention: search for attention distributions that are maximally different (by JSD) from the learned ones while keeping the model output within ε (TVD) of the original; optimize a relaxed version of this objective with Adam (SGD). Sketch below.
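A sketch of the relaxed adversarial search under the same AttnClassifier assumptions: maximize the Jensen-Shannon divergence from the learned attention while penalizing output TVD above ε, optimized with Adam. The penalty weight lam, learning rate, and step count are assumptions, not the paper's hyperparameters.

```python
# Sketch of the relaxed adversarial-attention search: keep the prediction
# within epsilon TVD of the original while pushing the attention as far as
# possible (JSD) from the learned weights. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence with natural log (bounded by ln 2 ~ 0.693)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.add(eps).log() - b.add(eps).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adversarial_attention(model, tokens, epsilon=0.01, lam=500.0, steps=500):
    for p in model.parameters():
        p.requires_grad_(False)                              # freeze the trained model
    with torch.no_grad():
        h, _ = model.enc(model.emb(tokens))
        scores = model.attn(torch.tanh(h)).squeeze(-1)
        alpha_hat = torch.softmax(scores, dim=-1)            # learned attention
        y_hat = torch.softmax(model.out(torch.einsum('bt,bth->bh', alpha_hat, h)), dim=-1)
    logits = scores.clone().requires_grad_(True)             # only these are optimized
    opt = torch.optim.Adam([logits], lr=0.05)
    for _ in range(steps):
        alpha = torch.softmax(logits, dim=-1)
        y = torch.softmax(model.out(torch.einsum('bt,bth->bh', alpha, h)), dim=-1)
        tvd_out = 0.5 * (y - y_hat).abs().sum(-1)
        loss = -jsd(alpha, alpha_hat) + lam * F.relu(tvd_out - epsilon)
        opt.zero_grad()
        loss.mean().backward()
        opt.step()
    return torch.softmax(logits, dim=-1).detach()
```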
Result for Adversarial Attention. (The 0.69 marked in the plot is the maximum attainable JSD between two distributions.)
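For reference (not from the slides), the 0.69 ceiling comes from the fact that the Jensen-Shannon divergence with natural log between any two distributions is bounded by ln 2:

```latex
% JSD is an average of two KL terms, each bounded by log 2,
% since p/(p+q) <= 1 implies log(2p/(p+q)) <= log 2.
\mathrm{JSD}(p \parallel q)
  = \tfrac{1}{2}\,\mathrm{KL}\!\left(p \,\middle\|\, \tfrac{p+q}{2}\right)
  + \tfrac{1}{2}\,\mathrm{KL}\!\left(q \,\middle\|\, \tfrac{p+q}{2}\right)
  \le \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log 2
  = \log 2 \approx 0.693
```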
Conclusion
The correlation between feature importance measures and learned attention weights is weak.
Counterfactual attention distributions often have no effect on model output.
Limitations
Only a handful of attention variants are considered.
Only tasks with unstructured output spaces are evaluated (no seq2seq).
Adversarial Heatmap Examples (original vs. adversarially constructed attention over the same input; figures omitted)