Multivariate analyses and decoding Kay H. Brodersen Computational Neuroeconomics Group Institute of Empirical Research in Economics University of Zurich.

Slides:

Advertisements

Similar presentations

Hierarchical Models and

Advertisements

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.

Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?

SVM - Support Vector Machines A new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training.

Biointelligence Laboratory, Seoul National University

Machine learning continued Image source:

A (very) brief introduction to multivoxel analysis “stuff” Jo Etzel, Social Brain Lab

Chapter 4: Linear Models for Classification

Classical inference and design efficiency Zurich SPM Course 2014

Image classification Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?

Support Vector Machines (SVMs) Chapter 5 (Duda et al.)

Chapter 2: Pattern Recognition

Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.

Supervised classification performance (prediction) assessment Dr. Huiru Zheng Dr. Franscisco Azuaje School of Computing and Mathematics Faculty of Engineering.

Ensemble Learning: An Introduction

Machine Learning CMPT 726 Simon Fraser University

Multi-voxel Pattern Analysis (MVPA) and “Mind Reading” By: James Melrose.

Linear Discriminant Functions Chapter 5 (Duda et al.)

Modeling fMRI data generated by overlapping cognitive processes with unknown onsets using Hidden Process Models Rebecca A. Hutchinson (1) Tom M. Mitchell.

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Learning to Identify Overlapping and Hidden Cognitive Processes from fMRI Data Rebecca Hutchinson, Tom Mitchell, Indra Rustandi Carnegie Mellon University.

Experimental Design Tali Sharot & Christian Kaul With slides taken from presentations by: Tor Wager Christian Ruff.

EE513 Audio Signals and Systems Statistical Pattern Classification Kevin D. Donohue Electrical and Computer Engineering University of Kentucky.

Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.

Digital Camera and Computer Vision Laboratory Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan, R.O.C.

Multivariate models for fMRI data Klaas Enno Stephan (with 90% of slides kindly contributed by Kay H. Brodersen) Translational Neuromodeling Unit (TNU)

How To Do Multivariate Pattern Analysis

This week: overview on pattern recognition (related to machine learning)

Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.

1 / 41 Inference and Computation with Population Codes 13 November 2012 Inference and Computation with Population Codes Alexandre Pouget, Peter Dayan,

1 SUPPORT VECTOR MACHINES İsmail GÜNEŞ. 2 What is SVM? A new generation learning system. A new generation learning system. Based on recent advances in.

ECE 8443 – Pattern Recognition Objectives: Error Bounds Complexity Theory PAC Learning PAC Bound Margin Classifiers Resources: D.M.: Simplified PAC-Bayes.

Kernel Methods A B M Shawkat Ali 1 2 Data Mining ¤ DM or KDD (Knowledge Discovery in Databases) Extracting previously unknown, valid, and actionable.

Bayesian Inference and Posterior Probability Maps Guillaume Flandin Wellcome Department of Imaging Neuroscience, University College London, UK SPM Course,

1 E. Fatemizadeh Statistical Pattern Recognition.

CROSS-VALIDATION AND MODEL SELECTION Many Slides are from: Dr. Thomas Jensen -Expedia.com and Prof. Olga Veksler - CS Learning and Computer Vision.

Distributed Representative Reading Group. Research Highlights 1Support vector machines can robustly decode semantic information from EEG and MEG 2Multivariate.

START OF DAY 5 Reading: Chap. 8. Support Vector Machine.

SemiBoost : Boosting for Semi-supervised Learning Pavan Kumar Mallapragada, Student Member, IEEE, Rong Jin, Member, IEEE, Anil K. Jain, Fellow, IEEE, and.

MVPD – Multivariate pattern decoding Christian Kaul MATLAB for Cognitive Neuroscience.

Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.

1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.

Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.

Guest lecture: Feature Selection Alan Qi Dec 2, 2004.

C O R P O R A T E T E C H N O L O G Y Information & Communications Neural Computation Machine Learning Methods on functional MRI Data Siemens AG Corporate.

CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel:

Spatial Smoothing and Multiple Comparisons Correction for Dummies Alexa Morcom, Matthew Brett Acknowledgements.

Ch. 5 Bayesian Treatment of Neuroimaging Data Will Penny and Karl Friston Ch. 5 Bayesian Treatment of Neuroimaging Data Will Penny and Karl Friston 18.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 04: GAUSSIAN CLASSIFIERS Objectives: Whitening.

6. Population Codes Presented by Rhee, Je-Keun © 2008, SNU Biointelligence Lab,

Lecture 3: MLE, Bayes Learning, and Maximum Entropy

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

1 st level analysis: Design matrix, contrasts, and inference Stephane De Brito & Fiona McNabe.

APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION Amir Harati and Joseph PiconeMarc Sobel Institute for Signal and Information Processing,

Naïve Bayes Classifier April 25 th, Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.

Next, this study employed SVM to classify the emotion label for each EEG segment. The basic idea is to project input data onto a higher dimensional feature.

SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.

Predictive Automatic Relevance Determination by Expectation Propagation Y. Qi T.P. Minka R.W. Picard Z. Ghahramani.

Non-separable SVM's, and non-linear classification using kernels Jakob Verbeek December 16, 2011 Course website:

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 1: INTRODUCTION.

Multivariate analysis (Machine learning) Supervised True answer is known Classification Answer is categorical Regression Answer is continuous (ordered.

Alan Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

Generally Discriminant Analysis

Bayesian Methods in Brain Imaging

Hierarchical Models and

Parametric Methods Berlin Chen, 2005 References:

Parietal and Frontal Cortex Encode Stimulus-Specific Mnemonic Representations during Visual Working Memory Edward F. Ester, Thomas C. Sprague, John T.

Multivariate Methods Berlin Chen

Will Penny Wellcome Trust Centre for Neuroimaging,

Presentation transcript:

Multivariate analyses and decoding Kay H. Brodersen Computational Neuroeconomics Group Institute of Empirical Research in Economics University of Zurich Machine Learning and Pattern Recognition Group Department of Computer Science ETH Zurich

1 Introduction

3 Why multivariate? Haxby et al Science

4 Why multivariate? Kriegeskorte et al NeuroImage  Multivariate approaches can reveal information jointly encoded by several voxels.

5  Multivariate approaches can exploit a sampling bias in voxelized images. Why multivariate? Boynton 2005 Nature Neuroscience

6  Mass-univariate approaches treat each voxel independently of all other voxels such that the implicit likelihood factorises over voxels:  Spatial dependencies between voxels are introduced after estimation, during inference, through random field theory. This allows us to make multivariate inferences over voxels (i.e., cluster-level or set-level inference).  Multivariate approaches, by contrast, relax the assumption about independence and enable inference about distributed responses without requiring focal activations or certain topological response features. They can therefore be more powerful than mass-univariate analyses.  The key challenge for all multivariate approaches is the high dimensionality of multivariate brain data. Mass-univariate vs. multivariate analyses

7 Models & terminology stimulus behaviour encoding of properties of stimulus and behaviour 1 Encoding or decoding? 2 Univoxel or multivoxel? 3 Classification or regression? context 0 Prediction or inference?  The goal of prediction is to maximize the accuracy with which brain states can be decoded from fMRI data.  The goal of inference is to decide between competing hypotheses about structure- function mappings in the brain. For example: compare a model that links distributed neuronal activity to a cognitive state with a model that does not.

8  Encoding or decoding?  An encoding model (or generative model) relates context (independent variable) to brain activity (dependent variable).  A decoding model (or recognition model) relates brain activity (independent variable) to context (dependent variable).  Univoxel or multivoxel?  In a univoxel model, brain activity is the signal measured in one voxel. (Special case: mass-univariate.)  In a multivoxel model, brain activity is the signal measured in many voxels.  Regression or classification?  In a regression model, the dependent variable is continuous.  In a classification model, the dependent variable is categorical (typically binary). Models & terminology e.g., or

2 Classification

10 Classification Training examples Test examples AABABAABAAABA ??? Accuracy estimate [% correct] Trials fMRI timeseries Feature extraction Feature selection Classification Voxels e.g., voxels

11 Most classification algorithms are based on a linear model that discriminates the two classes. Linear vs. nonlinear classifiers If the data are not linearly separable, a nonlinear classifier may still be able to tell different classes apart. here: discriminative point classifiers

12  We need to train and test our classifier on separate datasets. Why?  Using the same examples for training and testing means overfitting may remain unnoticed, implying an optimistic accuracy estimate.  Instead, what we are interested in is generalizability: the ability of our algorithm to correctly classify previously unseen examples.  An efficient splitting procedure is cross-validation. Training and testing Examples Training examples Test examples Folds

13 Target questions for decoding studies (c) Temporal pattern localization(d) Pattern characterization Accuracy [%] 50 % 100 % Intra-trial time Accuracy rises above chance Participant indicates decision (a) Pattern discrimination (overall classification) (b) Spatial pattern localization 80% 55% Accuracy [%] 50 % 100 % Classification task Truth or lie? Left or right button? Inferring a representational space and extrapolation to novel classes Healthy or diseased? Brodersen et al. (2009) The New Collection Mitchell et al Science

14 Performance evaluation – example  Given 100 trials, leave-10-out cross-validation, we measure performance by counting the number of correct predictions on each fold:  How probable is it to get 64 out of 100 correct if we had been guessing?  Thus, we have made a Binomial assumption about the Null model to show that our result is statistically significant at the 0.01 level.  Problem: accuracy is not a good performance measure. (a) Overall classification out of 10 test examples correct Brodersen et al. (2010) ICPR

15 The support vector machine  Intuitively, the support vector machine finds a hyperplane that maximizes the margin between the plane and the nearest examples on either side.  For nonlinear mappings, the kernel converts a low-dimensional nonlinear problem into a high-dimensional linear problem.

16 Deconvolved BOLD signal  result: one beta value per trial, phase, and voxel Temporal feature extraction time [volumes] trials and trial phases trial 1 decide phase trial 1 result phase trial 2 result phase... trial-by-trial design matrix plus confounds

17 (b) Spatial information mapping METHOD 1 Consider the entire brain, and find out which voxels are jointly discriminative  e.g., based on a classifier with a constraint on sparseness in features Hampton & O’Doherty (2007); Grosenick et al. (2008, 2009) METHOD 2 At each voxel, consider a small local environment, and compute a distance score  e.g., based on a CCA Nandy & Cordes (2003) Magn. Reson. Med.  e.g., based on a classifier  e.g., based on Euclidean distances  e.g., based on Mahalanobis distances Kriegeskorte et al. (2006, 2007a, 2007b) Serences & Boynton (2007) J Neuroscience  e.g., based on the mutual information

18 (b) Spatial information mapping outcome t ≥ 5 t = 3 decision t ≥ 5 t = 3 x = 12 mm Example 2 – decoding which option was chosen Brodersen et al. (2009) HBM Hampton & O‘Doherty (2007) PNAS Example 1 – decoding whether a subject will switch or stay

19 (c) Temporal information mapping Soon et al. (2008) Nature Neuroscience Example – decoding which button was pressed motor cortex frontopolar cortex classification accuracy decision response

20 (c) Pattern characterization Example – decoding which vowel a subject heard, and which speaker had uttered it Formisano et al. (2008) Science voxel 1 voxel 2... fingerprint plot (one plot per class)

21  Constraints on experimental design  When estimating trial-wise Beta values, we need longer ITIs (typically 8 – 15 s).  At the same time, we need many trials (typically 100+).  Classes should be balanced.  Computationally expensive  e.g., fold-wise feature selection  e.g., permutation testing  Classification accuracy is a surrogate statistic  Classification algorithms involve many heuristics Limitations

3 Multivariate Bayesian decoding

23  Multivariate analyses in SPM are not implemented in terms of the classification schemes outlined in the previous section.  Instead, SPM brings classification into the conventional inference framework of hierarchical models and their inversion.  MVB can be used to address two questions:  Overall classification – using a cross-validation scheme (as seen earlier)  Inference on different forms of structure-function mappings – e.g., smooth or sparse coding (new) Multivariate Bayesian decoding (MVB)

24 Model Encoding models X as a cause Decoding models X as a consequence

25  Decoding models are typically ill-posed: there is an infinite number of equally likely solutions. We therefore require constraints or priors to estimate the voxel weights.  SPM specifies several alternative coding hypotheses in terms of empirical spatial priors on voxel weights. Empirical priors on voxel weights Null: Spatial vectors: Smooth vectors: Singular vectors: Support vectors: Friston et al. (2008) NeuroImage

26  MVB can be illustrated using SPM’s attention-to-motion example dataset. Buechel & Friston 1999 Cerebral Cortex Friston et al NeuroImage MVB – example  This dataset is based on a simple block design. Each block is a combination of some of the following three factors:  photic– there is some visual stimulus  motion– there is motion  attention– subjects are paying attention  We form a design matrix by convolving box-car functions with a canonical haemodynamic response function. design matrix photic motion attention constant blocks of 10 scans

27 MVB – example

28  MVB-based predictions closely match the observed responses. But crucially, they don’t perfectly match them. Perfect match would indicate overfitting. MVB – example

29  The highest model evidence is achieved by a model that recruits 4 partitions. The weights attributed to each voxel in the sphere are sparse and multimodal. This suggests sparse coding. MVB – example log BF = 3

4 Further model-based approaches

31  Challenge 1 – feature selection and weighting to make the ill-posed many-to-one mapping tractable  Challenge 2 – neurobiological interpretability of models to improve the usefulness of insights that can be gained from multivariate analysis results Challenges for all decoding approaches

32  Approach 1 – identification (inferring a representational space) 1. estimation of an encoding model 2. nearest-neighbour classification or voting Further model-based approaches (1) Mitchell et al. (2008) Science

33  Approach 2 – reconstruction / optimal decoding 1. estimation of an encoding model 2. model inversion Further model-based approaches (2) Paninski et al. (2007) Progr Brain Res Pillow et al. (2008) Nature Miyawaki et al. (2009) Neuron

34  Approach 3 – decoding with model-based feature construction Further model-based approaches (3) Brodersen et al. (2010) NeuroImage Brodersen et al. (2010) HBM

35  Multivariate analyses can make use of information jointly encoded by several voxels and may therefore offer higher sensitivity than mass-univariate analyses.  There is some confusion about terminology in current publications. Remember the distinction between prediction vs. inference, encoding vs. decoding, univoxel vs. multivoxel, and classification vs. regression.  The main target questions in classification studies are (i) pattern discrimination, (ii) spatial information mapping, (iii) temporal information mapping, and (iv) pattern characterization.  Multivariate Bayes offers an alternative scheme that maps multivariate patterns of activity onto brain states within the conventional statistical framework.  The future is likely to see more model-based approaches. Summary

5 Supplementary slides

37  Classification is the most common type of multivariate fMRI analysis to date. By classification we mean: to decode a categorical label from multivoxel activity.  Lautrup et al. (1994) reported the first classification scheme for functional neuroimaging data.  Classification was then reintroduced by Haxby et al. (2001). In their study, the overall spatial pattern of activity was found to be more informative in distinguishing object categories than any brain region on its own. The most common multivariate analysis is classification Haxby et al Science

38 Temporal unit of classification  The temporal unit of classification specifies the amount of data that forms an individual example. Typical units are:  one trial  trial-by-trial classification  one block  block-by-block classification  one subject  across-subjects classification  Choosing a temporal unit of classification reveals a trade-off:  smaller units mean noisier examples but a larger training set  larger units mean cleaner examples but a smaller training set  The most common temporal unit of classification is an individual trial.

39 Temporal unit of classification Subject 1 time [s] n = 24 subjects, 180 trials each time [s] Normalized BOLD response from ventral striatum rewarded trials unrewarded trials Brodersen, Hunt, Walton, Rushworth, Behrens 2009 HBM

40 Interpolated raw BOLD signal  result: any desired number of sampling points per trial and voxel Alternative temporal feature extraction delayresultdecision microtime signal (a.u.) averaged signal across all trials subject 1 subject 2 subject 3

41 Alternative temporal feature extraction Deconvolved BOLD signal, expressed in terms of 3 basis functions  Step 1: sample many HRFs from given parameter intervals  Step 2: find set of 3 orthogonal basis functions that can be used to approximate the sampled functions  result: three values per trial, phase, and voxel Basis fn 1 Basis fn 2 Basis fn 3 Step 1 Step 2

42 Classification of methods for feature selection  A priori structural feature selection  A priori functional feature selection  Fold-wise univariate feature selection  Scoring  Choosing a number of features  Fold-wise multivariate feature selection  Filtering methods  Wrapper methods  Embedded methods  Fold-wise hybrid feature selection  Searchlight feature selection  Recursive feature elimination  Sparse logistic regression  Unsupervised feature-space compression

43 Training and testing a classifier  Training phase  The classifier is given a set of n labelled training samples from some data space, where is a d-dimensional attribute vector is its corresponding class.  The goal of the learning algorithm is to find a function that adequately describes the underlying attributes/class relation.  For example, a linear learning machine finds a function which assigns a given point x to the class such that some performance measure is maximized, for example:

44 Training and testing a classifier  Test phase  The classifier is now confronted with a test set of unlabelled examples and assigns each example x to an estimated class  We could then measure generalization performance in terms of the relative number of correctly classified test examples:

45  Nonlinear prediction problems can be turned into linear problems by using a nonlinear projection of the data onto a high-dimensional feature space.  This technique is used by a class of prediction algorithms called kernel machines.  The most popular kernel method is the support vector machine (SVM).  SVMs make training and testing computationally efficient.  We can easily reconstruct feature weights:  However, SVM predictions do not have a probabilistic interpretation. The support vector machine

46  Evaluating the performance of a classification algorithm critically requires a measure of the degree to which unseen examples have been identified with their correct class labels.  The procedure of averaging across accuracies obtained on individual cross-validation folds is flawed in two ways. First, it does not allow for the derivation of a meaningful confidence interval. Second,it leads to an optimistic estimate when a biased classifier is tested on an imbalanced dataset.  Both problems can be overcome by replacing the conventional point estimate of accuracy by an estimate of the posterior distribution of the balanced accuracy. Performance evaluation Brodersen, Ong, Buhmann, Stephan (2010) ICPR

47  The simulations show how a biased classifier applied to an imbalanced test set leads to a hugely optimistic estimate of generalizability when measured in terms of the accuracy rather than the balanced accuracy. Performance evaluation Brodersen, Ong, Buhmann, Stephan (2010) ICPR

48 Multivariate Bayes – maximization of the model evidence

49  MVB can be illustrated using SPM’s attention-to-motion example dataset. Buechel & Friston 1999 Cerebral Cortex Friston et al NeuroImage  This dataset is based on a simple block design. Each block belongs to one of the following conditions:  fixation– subjects see a fixation cross  static– subjects see stationary dots  no attention– subjects see moving dots  attention– subjects monitor moving dots for changes in velocity  We wish to decode whether or not subjects were exposed to motion. We begin by recombining the conditions into three orthogonal conditions:  photic– there is some form of visual stimulus  motion– there is motion  attention– subjects are required to pay attention Multivariate Bayesian decoding – example

50  Approach 1 – identification (inferring a representational space) Further model-based approaches Kay et al Science

51 On classification  Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview. NeuroImage, 45(1, Supplement 1), S199-S209.  O'Toole, A. J., Jiang, F., Abdi, H., Penard, N., Dunlop, J. P., & Parent, M. A. (2007). Theoretical, Statistical, and Practical Perspectives on Pattern-based Classification Approaches to the Analysis of Functional Neuroimaging Data. Journal of Cognitive Neuroscience, 19(11),  Haynes, J., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nature Reviews Neuroscience, 7(7),  Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multivoxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9),  Brodersen, K. H., Haiss, F., Ong, C., Jung, F., Tittgemeyer, M., Buhmann, J., Weber, B., et al. (2010). Model-based feature construction for multivariate decoding. NeuroImage (in press). On multivariate Bayesian decoding  Friston, K., Chu, C., Mourao-Miranda, J., Hulme, O., Rees, G., Penny, W., et al. (2008). Bayesian decoding of brain images. NeuroImage, 39(1), Further reading