
Learning to Explain: An Information-theoretic Framework on Model Interpretation
Jianbo Chen*, Le Song†✦, Martin J. Wainwright*◇, Michael I. Jordan*
UC Berkeley*, Georgia Tech†, Ant Financial✦, and Voleon Group◇

Motivations for Model Interpretation
Applications of machine learning: medicine, financial markets, criminal justice.
Complex models: deep neural networks, random forests, kernel methods.

Instancewise Feature Selection
Inputs: a model and a sample (a sentence, an image, etc.).
Outputs: importance scores for each feature (word, pixel, etc.).
Feature importance is allowed to vary across instances.

Existing Work
Parzen window approximation + gradient [Baehrens et al., 2010]
Saliency map [Simonyan et al., 2013]
LRP [Bach et al., 2015]
LIME [Ribeiro et al., 2016]
Kernel SHAP [Lundberg & Lee, 2017]
Integrated Gradients [Sundararajan et al., 2017]
DeepLIFT [Shrikumar et al., 2017]
…

Properties
Four axes for comparing methods: training-required, efficient, additive, model-agnostic.

Properties of different methods

Our Approach (L2X)
Globally learns a local explainer.
Removes the constraint of local feature additivity.

Notation
Input $X \in \mathbb{R}^d$: a vector of $d$ features.
Model $\mathbb{P}_m(\cdot \mid x)$: the conditional distribution of the response $Y$ given the input $x$.
$S$: a feature subset of size $k$.
Explainer $\mathcal{E}$: maps an input $x$ to a (distribution over) size-$k$ feature subsets $S$.
$X_S$: the sub-vector of chosen features.

Our Framework
Maximize the mutual information between the selected features $X_S$ and the response variable $Y$, over the explainer $\mathcal{E}$:
$$\max_{\mathcal{E}} \; I(X_S; Y) \quad \text{subject to} \quad S \sim \mathcal{E}(X).$$

Mutual Information
A measure of dependence between two random variables: how much knowledge of $X$ reduces the uncertainty about $Y$.
Definition:
$$I(X; Y) = \mathbb{E}_{X,Y}\left[\log \frac{p_{XY}(X, Y)}{p_X(X)\, p_Y(Y)}\right].$$

An Information-theoretic Interpretation
Theorem 1: Letting $\mathbb{E}[\cdot \mid X = x]$ denote the expectation of $Y$ over $\mathbb{P}_m(\cdot \mid x)$, define
$$\mathcal{E}^*(x) \in \arg\max_{|S| = k} \mathbb{E}\left[\log \mathbb{P}_m(Y \mid X_S) \,\middle|\, X = x\right].$$
Then $\mathcal{E}^*$ is a global optimum of the following problem:
$$\max_{\mathcal{E}} \; I(X_S; Y) \quad \text{subject to} \quad S \sim \mathcal{E}(X).$$

Intractability of the Objective
The objective is intractable: the conditional $\mathbb{P}_m(Y \mid X_S)$ requires marginalizing over the unselected features, and finding the best subset requires summing over all $\binom{d}{k}$ choices of $S$ (for example, $\binom{100}{10} \approx 1.7 \times 10^{13}$).

Approximations of the Objective
A variational lower bound.
A neural network for parametrizing distributions.
Continuous relaxation of subset sampling.

A Tractable Variational Formulation
For any conditional distribution $\mathbb{Q}(Y \mid X_S)$,
$$I(X_S; Y) \;\geq\; \mathbb{E}\left[\log \mathbb{Q}(Y \mid X_S)\right] + \text{const},$$
with equality when $\mathbb{Q}(Y \mid X_S) = \mathbb{P}_m(Y \mid X_S)$.
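The bound follows from the non-negativity of the KL divergence; a short derivation (a standard argument, not spelled out on the slide):
$$I(X_S; Y) = H(Y) - H(Y \mid X_S) = H(Y) + \mathbb{E}\left[\log \mathbb{P}_m(Y \mid X_S)\right] \geq H(Y) + \mathbb{E}\left[\log \mathbb{Q}(Y \mid X_S)\right],$$
since $\mathbb{E}\left[\log \mathbb{P}_m(Y \mid X_S) - \log \mathbb{Q}(Y \mid X_S)\right] = \mathbb{E}\left[D_{\mathrm{KL}}\big(\mathbb{P}_m(\cdot \mid X_S) \,\|\, \mathbb{Q}(\cdot \mid X_S)\big)\right] \geq 0$ and $H(Y)$ does not depend on the explainer.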

Maximizing the Variational Lower Bound
Objective:
$$\max_{\mathcal{E},\, \mathbb{Q}} \; \mathbb{E}\left[\log \mathbb{Q}(Y \mid X_S)\right] \quad \text{subject to} \quad S \sim \mathcal{E}(X).$$

A Single Neural Network for Parametrizing $\mathbb{Q}$
Parametrize $\mathbb{Q}$ by a neural network $g_\alpha$, such that
$$\mathbb{Q}(Y \mid x_S) = g_\alpha(\tilde{x}_S),$$
where $\tilde{x}_S \in \mathbb{R}^d$ keeps the entries of $x$ in $S$ and zeroes out the rest.

Summing over Combinations
The expectation over $S \sim \mathcal{E}(X)$ still sums over all $\binom{d}{k}$ subsets, so the objective cannot be optimized by gradient methods directly.

Continuous Relaxation of Subset Sampling
Parametrize the explainer by a neural network $w_\theta$, such that $w_\theta(X) \in \mathbb{R}^d$ gives the weights of a Categorical distribution over the $d$ features.
Approximation of Categorical: the Concrete (Gumbel-softmax) distribution. With i.i.d. Gumbel noise $G_j = -\log(-\log U_j)$, $U_j \sim \mathrm{Uniform}(0,1)$, and temperature $\tau > 0$,
$$C_j = \frac{\exp\big((\log w_\theta(X)_j + G_j)/\tau\big)}{\sum_{i=1}^d \exp\big((\log w_\theta(X)_i + G_i)/\tau\big)}, \quad j = 1, \dots, d.$$
Sample $k$ out of $d$ features: draw $k$ independent Concrete vectors $C^{(1)}, \dots, C^{(k)}$ and take the element-wise maximum,
$$V_j = \max_{l \in \{1, \dots, k\}} C_j^{(l)}, \quad j = 1, \dots, d.$$
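A minimal NumPy sketch of this relaxed subset sampling; the function name `sample_subset` and its defaults are illustrative, not taken from the paper's code:

```python
import numpy as np

def sample_subset(log_weights, k, tau=0.5, rng=None):
    """Relaxed k-hot vector V: element-wise max of k Concrete samples.

    log_weights: shape (d,), the log of w_theta(x).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = log_weights.shape[0]
    u = rng.uniform(1e-9, 1.0, size=(k, d))
    gumbel = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    logits = (log_weights[None, :] + gumbel) / tau  # one row per Concrete draw
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    c = np.exp(logits)
    c /= c.sum(axis=1, keepdims=True)               # softmax per row
    return c.max(axis=0)                            # approximate k-hot mask

# Example: 10 features, ~4 selected; entries of v near 1 mark chosen features.
v = sample_subset(np.log(np.ones(10) / 10), k=4)
```

As $\tau \to 0$ each row approaches a one-hot sample, so $V$ approaches an exact $k$-hot mask, while for $\tau > 0$ it stays differentiable in $w_\theta$.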

Final Objective
Reduce the previous problem to
$$\max_{\theta,\, \alpha} \; \mathbb{E}_{X, Y, \zeta}\left[\log g_\alpha\big(X \odot V(\theta, \zeta),\; Y\big)\right],$$
where $g_\alpha(\tilde{x}, y)$ is the probability the variational network assigns to $y$ given the masked input $\tilde{x}$.
$\zeta$: auxiliary (Gumbel) random variables.
$\theta$: parameters of the explainer.
$\alpha$: parameters of the variational distribution.

L2X
Training stage: use stochastic gradient methods to optimize the objective above over $(\theta, \alpha)$; a sketch of one training step follows.
Explaining stage: rank features according to the probability vector $w_\theta(x)$ output by the explainer and select the top $k$.
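A hypothetical PyTorch sketch of one training step, assuming an explainer network `w_theta` that outputs log feature weights of shape (batch, d) and a variational network `g_alpha` that maps masked inputs to class logits; both names are illustrative stand-ins, not the paper's code:

```python
import torch
import torch.nn.functional as F

def relaxed_subset(log_w, k, tau=0.5):
    """Differentiable approximate k-hot mask: element-wise max of k Concrete samples."""
    b, d = log_w.shape
    u = torch.rand(b, k, d).clamp(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))
    concrete = F.softmax((log_w.unsqueeze(1) + gumbel) / tau, dim=-1)
    return concrete.max(dim=1).values               # shape (b, d)

def l2x_step(w_theta, g_alpha, opt, x, y_model, k=4):
    """One SGD step; y_model holds the black-box model's probabilities P_m(y | x)."""
    opt.zero_grad()
    v = relaxed_subset(w_theta(x), k)
    log_q = F.log_softmax(g_alpha(x * v), dim=-1)   # log Q(Y | X_S)
    loss = -(y_model * log_q).sum(dim=-1).mean()    # cross-entropy to P_m
    loss.backward()
    opt.step()
    return loss.item()
```

At explanation time no sampling is needed: one simply ranks the entries of $w_\theta(x)$.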

Synthetic Experiments
Orange skin (4 out of 10 features).
XOR (2 out of 10 features).
Nonlinear additive model (4 out of 10 features).
Switch feature (important features switch based on the sign of the first feature).
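For concreteness, a hypothetical data-generation sketch for the XOR setting, assuming 10 i.i.d. Gaussian features with the label driven only by the product of the first two; the exact generative form here is an assumption, not taken from the paper:

```python
import numpy as np

def make_xor(n, d=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    p = 1.0 / (1.0 + np.exp(-x[:, 0] * x[:, 1]))  # only features 1 and 2 matter
    y = rng.binomial(1, p)
    return x, y                                   # ground-truth features: {1, 2}

x, y = make_xor(1000)
```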

Median Rank of True Features

Time Complexity
The training time of L2X is shown in translucent bars.

Real-world Experiments
IMDB movie review with word-based CNN.
IMDB movie review with hierarchical LSTM.
MNIST with CNN.

IMDB Movie Review with word-based CNN

IMDB Movie Review with Hierarchical LSTM

MNIST with CNN

Quantitative Results
Post-hoc accuracy: alignment between the model's prediction on the selected features and its prediction on the full original sample.
Human accuracy: alignment between human evaluation of the selected features and the model's prediction on the full original sample.
Human accuracy given selected words: 84.4%.
Human accuracy given original samples: 83.7%.
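A minimal sketch of post-hoc accuracy under the definition above; `model` and `masks` are hypothetical stand-ins (the model returns class probabilities, and each row of `masks` is the k-hot explanation for that sample):

```python
import numpy as np

def post_hoc_accuracy(model, x, masks):
    """Fraction of samples where the model's class agrees on masked vs. full input."""
    full_pred = model(x).argmax(axis=-1)
    masked_pred = model(x * masks).argmax(axis=-1)
    return float(np.mean(full_pred == masked_pred))
```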

Links to Code and Current Work
Generation of adversarial examples: https://arxiv.org/abs/1805.12316
Efficient Shapley-based model interpretation.
Poster: #63

Learning to Explain: An Information-theoretic Framework on Model Interpretation
Jianbo Chen*, Le Song†✦, Martin J. Wainwright*◇, Michael I. Jordan*
UC Berkeley*, Georgia Tech†, Ant Financial✦, and Voleon Group◇