Attention is not Explanation


Attention is not Explanation. NAACL 2019. Sarthak Jain, Byron C. Wallace, Northeastern University. (Presenter's note: 2018 - 207/647, 31%; 2019 - 424/1198, 22.6%; 5-5-5, 5-5-3.)

Background: Attention Mechanism

Background: Attention. Given a sequence of encoder hidden states h = (h_1, ..., h_T) and a query Q, compute an unnormalized score for each position with either an additive function, score(h_t, Q) = v^T tanh(W_1 h_t + W_2 Q), or a scaled dot-product function, score(h_t, Q) = h_t^T Q / sqrt(m). Normalize the scores with a softmax to get the attention distribution alpha, then take the attention vector as the weighted sum c = sum_t alpha_t h_t.
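A minimal NumPy sketch of these two scoring functions and the resulting attention vector (array shapes and the weight names W1, W2, v are illustrative assumptions, not the authors' code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_scores(H, Q, W1, W2, v):
    # H: (T, m) hidden states, Q: (m,) query -> (T,) unnormalized scores
    return np.tanh(H @ W1.T + Q @ W2.T) @ v

def scaled_dot_scores(H, Q):
    # scaled dot-product scoring: h_t . Q / sqrt(m)
    return (H @ Q) / np.sqrt(H.shape[1])

def attend(H, scores):
    alpha = softmax(scores)   # attention distribution over the T positions
    context = alpha @ H       # attention vector: weighted sum of hidden states
    return alpha, context

# toy usage
rng = np.random.default_rng(0)
T, m = 5, 8
H, Q = rng.normal(size=(T, m)), rng.normal(size=m)
W1, W2, v = rng.normal(size=(m, m)), rng.normal(size=(m, m)), rng.normal(size=m)
alpha, c = attend(H, additive_scores(H, Q, W1, W2, v))
```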

Question: Does the attention mechanism really capture semantically meaningful attention?

Does attention provide transparency? Do attention weights correlate with measures of feature importance? Would alternative attention weights necessarily yield different predictions?

Experiment: Model. Architecture (from the slide figure): one-hot tokens -> embedding -> BiRNN encoder producing hidden states h -> attention over h with query Q -> dense layer -> prediction y.
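A hedged PyTorch sketch of this architecture (embedding, BiRNN encoder, attention, dense output head); the hyperparameters, class name, and choice of an LSTM encoder are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class AttnClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hid_dim, 1)  # additive-style scorer with the query folded in
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, tokens):                          # tokens: (B, T) integer ids
        h, _ = self.rnn(self.emb(tokens))               # (B, T, 2*hid_dim) hidden states
        alpha = torch.softmax(self.score(h).squeeze(-1), dim=-1)  # (B, T) attention weights
        context = torch.einsum("bt,btd->bd", alpha, h)  # attention vector
        return self.out(context), alpha                 # logits and attention weights
```

Returning the attention weights alongside the logits makes the later experiments (correlation, random permutation, adversarial search) straightforward to run against the same model.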

Dataset

Correlation with Feature Importance. Two feature-importance measures are compared against the learned attention weights: a gradient-based measure (the magnitude of the gradient of the output with respect to each input token) and leave one feature out (the change in the output when a single token is removed). Agreement is measured with Kendall's tau rank correlation.
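A hedged sketch of the leave-one-out measure and its rank correlation with attention; `model` is assumed to return (logits, alpha) as in the sketch above, masking with a pad id stands in for removing a token, and kendalltau comes from SciPy:

```python
import torch
from scipy.stats import kendalltau

def loo_importance(model, tokens, pad_id=0):
    # tokens: (1, T). Importance of position t = total change in the output
    # distribution when token t is masked out.
    with torch.no_grad():
        logits, alpha = model(tokens)
        base = torch.softmax(logits, dim=-1)
        deltas = []
        for t in range(tokens.shape[1]):
            masked = tokens.clone()
            masked[0, t] = pad_id
            probs = torch.softmax(model(masked)[0], dim=-1)
            deltas.append((base - probs).abs().sum().item())
    return torch.tensor(deltas), alpha.squeeze(0)

def attention_loo_correlation(model, tokens):
    loo, alpha = loo_importance(model, tokens)
    tau, _ = kendalltau(loo.numpy(), alpha.numpy())
    return tau
```

The gradient-based measure is analogous, ranking tokens by the magnitude of the gradient of the prediction with respect to each input instead of the leave-one-out deltas.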

Result for Correlation: Gradients. (Figure legend: orange = positive, purple = negative; orange/purple/green mark neutral, contradiction, entailment for the NLI task.)

Result for Correlation: Leave One Out

Statistically Significant

Random Attention Weights: randomly permute the learned attention weights and measure how much the model output changes.
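A hedged sketch of this permutation test: shuffle the attention weights across positions and record the change in the output distribution (total variation distance). `predict_from_attention` is an assumed hook that recomputes the model output from a supplied attention vector:

```python
import torch

def tvd(p, q):
    # total variation distance between two output distributions
    return 0.5 * (p - q).abs().sum(dim=-1)

def permutation_effect(predict_from_attention, alpha, n_perm=100):
    base = predict_from_attention(alpha)
    deltas = []
    for _ in range(n_perm):
        perm = alpha[torch.randperm(alpha.shape[-1])]   # shuffle weights across tokens
        deltas.append(tvd(base, predict_from_attention(perm)).item())
    return torch.tensor(deltas).median()                # summarize with the median change
```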

Result for Random Permutation. (Figure legend: orange = positive, purple = negative; orange/purple/green mark neutral, contradiction, entailment for the NLI task.)

Adversarial Attention: search for attention distributions that differ maximally from the learned ones while leaving the prediction (nearly) unchanged; a relaxed version of this objective is optimized with Adam.
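A hedged sketch of this adversarial search: maximize the Jensen-Shannon divergence from the learned attention while penalizing any output shift (TVD) beyond a small epsilon. The penalty form, the hyperparameters, and the (assumed differentiable) `predict_from_attention` hook are illustrative assumptions:

```python
import torch

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adversarial_attention(predict_from_attention, alpha, epsilon=1e-2,
                          lam=500.0, steps=500, lr=0.05):
    base = predict_from_attention(alpha).detach()
    logits = torch.log(alpha.detach() + 1e-12).clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        adv = torch.softmax(logits, dim=-1)
        out = predict_from_attention(adv)
        tvd = 0.5 * (out - base).abs().sum(dim=-1)
        # maximize divergence from the original attention, keep the output within epsilon
        loss = -js_divergence(adv, alpha) + lam * torch.clamp(tvd - epsilon, min=0.0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=-1).detach()
```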

Result for Adversarial Attention 0.69

Conclusion. The correlation between feature-importance measures and learned attention weights is weak, and counterfactual attention distributions often have no effect on model output. Limitations: the paper only considers a handful of attention variants and only evaluates tasks with unstructured output spaces (no seq2seq).

Adversarial Heatmaps Example

Adversarial Heatmaps Example (continued)