An introduction to machine learning for fMRI. Francisco Pereira, Botvinick Lab, Princeton Neuroscience Institute, Princeton University.


2 what is machine learning?  how to build computer systems that automatically improve with experience

3 what is machine learning?  how to build computer systems that automatically improve with experience  building models of data for  predicting numeric variables (regression)  predicting categoric variables (classification)  grouping data points (clustering) ...

4 what is machine learning?  how to build computer systems that automatically improve with experience  building models of data for  predicting numeric variables (regression)  predicting categoric variables (classification)  grouping data points (clustering) ...  overlaps with applied statistics

5 why use it at all? to tell a story about data [adapted from slides by Russ Poldrack]

6 once upon a time there was a sample... [adapted from slides by Russ Poldrack]

7... and then came a beautiful model... fit model by estimating parameters [adapted from slides by Russ Poldrack]

8 Is RT really related to age? very suggestive, but... [adapted from slides by Russ Poldrack]

9 is RT related to age? Model: RT = β0 + β1*age + ε (parameters in the population) [adapted from slides by Russ Poldrack]

10 is RT related to age? Model: RT = β0 + β1*age + ε (parameters in the population); β_est = (X'X)^-1 X'y (parameters estimated from the sample) [adapted from slides by Russ Poldrack]
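
As an aside (not part of the original slides), here is a minimal numpy sketch of the estimator above, β_est = (X'X)^-1 X'y, applied to a synthetic sample; the variable names age and rt and all the numbers are purely illustrative.

```python
# Minimal sketch of beta_est = (X'X)^-1 X'y on synthetic, illustrative data.
import numpy as np

rng = np.random.default_rng(0)
n = 12
age = rng.uniform(20, 80, size=n)                  # illustrative predictor
rt = 500 + 2.0 * age + rng.normal(0, 40, size=n)   # assumed relationship plus noise

X = np.column_stack([np.ones(n), age])             # design matrix: intercept and age
beta_est = np.linalg.solve(X.T @ X, X.T @ rt)      # solves (X'X) beta = X'y
print(beta_est)                                    # [beta0_est, beta1_est]
```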

11 is RT related to age? Model: RT = β0 + β1*age + ε (parameters in the population); β_est = (X'X)^-1 X'y (parameters estimated from the sample). Hypothesis: is β1 different from 0? [adapted from slides by Russ Poldrack]

12 is RT related to age? Null hypothesis: β1 = 0. Alternative: β1 ~= 0. [adapted from slides by Russ Poldrack]

13 is RT related to age? Null hypothesis: β1 = 0. Alternative: β1 ~= 0. How likely is the parameter estimate (β̂1 = -0.53) if the null hypothesis is true? [adapted from slides by Russ Poldrack]

14 is RT related to age? Null hypothesis: β1 = 0. Alternative: β1 ~= 0. How likely is the parameter estimate (β̂1 = -0.53) if the null hypothesis is true? We need a statistic with a known distribution to determine this! [adapted from slides by Russ Poldrack]

15 but we don’t know is RT related to age? the CLT tells us [adapted from slides by Russ Poldrack]

16 but we don’t know we do know is RT related to age? the CLT tells us [adapted from slides by Russ Poldrack]

17 t(10) = How likely is this value in this t distribution ? p <.001 is RT related to age? [adapted from slides by Russ Poldrack]

18 what can we conclude? In this sample: p < .001, so there is a relationship between age and RT, which would be very unlikely if there were no relationship in the population; R² shows that age accounts for 69% of the variance in RT. The test does not tell us how well we can predict RT from age in the population. [adapted from slides by Russ Poldrack]

19 what happens with a new sample? Draw a new sample from the same population and compute the R² using the parameters estimated in the original sample. [adapted from slides by Russ Poldrack]

20 what happens with a new sample? Repeat this 100 times, using the model parameters estimated from the original sample. The average R² (value shown on the slide) is a measure of how good the model learned from a single sample is. [adapted from slides by Russ Poldrack]

21 the learning perspective When we estimate parameters from a sample, we are learning about the population from training data. [adapted from slides by Russ Poldrack]

22 the learning perspective When we estimate parameters from a sample, we are learning about the population from training data. How well can we measure the prediction ability of our learned model? Use a new sample as test data. [adapted from slides by Russ Poldrack]

23 test data and cross-validation: if you can't collect a new sample, split the entire dataset in two; estimate parameters from the training data and test accuracy on the untrained (held-out) data. [adapted from slides by Russ Poldrack]

24 test data and cross-validation: if you can't collect a new sample, split the entire dataset in two; estimate parameters from the training data and test accuracy on the untrained data. k-fold cross-validation: split into k folds, train on k-1 and test on the left-out fold, then average the prediction measure over all k folds; several variants exist (all possible splits, leave-one-out). [adapted from slides by Russ Poldrack]
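
A minimal sketch (not from the slides) of the k-fold procedure just described, applied to the illustrative regression above; it assumes numpy arrays x and y such as the age and rt from the earlier snippet.

```python
# k-fold cross-validation for simple linear regression (illustrative sketch).
import numpy as np

def kfold_r2(x, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # split into k folds
    r2s = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_tr = np.column_stack([np.ones(len(train_idx)), x[train_idx]])
        beta = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y[train_idx])  # train on k-1 folds
        X_te = np.column_stack([np.ones(len(test_idx)), x[test_idx]])
        pred = X_te @ beta                                            # test on the left-out fold
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        r2s.append(1.0 - ss_res / ss_tot)
    return float(np.mean(r2s))   # average prediction measure over the k folds
```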

25 leave-one-out cross-validation: regression lines fit on each training set (figure on slide); R² on the original sample, leave-one-out R² on the original sample, and mean R² over 100 new samples (values shown on the slide). [adapted from slides by Russ Poldrack]

26 model complexity As model complexity goes up, we can always fit the training data better What does this do to our predictive ability? [adapted from slides by Russ Poldrack]

27 model complexity polynomials of higher degree fit the training data better... [adapted from slides by Russ Poldrack]

28 model complexity polynomials of higher degree fit the training data better but they do worse on test data: overfitting [adapted from slides by Russ Poldrack]

29 model complexity if the relationship in the population were more complicated [adapted from slides by Russ Poldrack]

30 model complexity: if the relationship in the population were more complicated, we could use CV to determine adequate model complexity (this would need to be done with nested CV, i.e. CV inside the training set)
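
As an illustration (not from the slides), here is a hedged sketch of nested CV: CV inside the training set selects the polynomial degree, and the held-out data is scored only once. It assumes scikit-learn and the illustrative age/rt arrays from the earlier snippets.

```python
# Nested CV sketch: inner CV picks model complexity, outer split measures it.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = age.reshape(-1, 1)          # from the earlier illustrative sample
x_tr, x_te, y_tr, y_te = train_test_split(x, rt, test_size=0.5, random_state=0)

model = make_pipeline(PolynomialFeatures(), LinearRegression())
search = GridSearchCV(model, {"polynomialfeatures__degree": [1, 2, 3, 5, 9]}, cv=5)
search.fit(x_tr, y_tr)                                   # inner CV selects the degree
print(search.best_params_, search.score(x_te, y_te))     # R^2 on the held-out data
```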

31 what is machine learning, redux  generalization: make predictions about a new individual  a model that generalizes captures the relationship between the individual and what we want to predict  cross-validation is a good way of  measuring generalization  doing model selection (there are others)

32 what is machine learning, redux  generalization: make predictions about a new individual  a model that generalizes captures the relationship between the individual and what we want to predict  cross-validation is a good way of  measuring generalization  doing model selection (there are others)  “all models are wrong but some are useful” George Box

33 what does this have to do with fMRI? In this talk:  prediction is classification  generalization is within subject (population of trials)  how to draw conclusions with statistical significance  what has it been used for?

34 two questions. GLM: are there voxels that reflect the stimulus? (stimulus → fMRI activation in a single voxel, tested via a contrast of interest)

35 two questions. GLM: are there voxels that reflect the stimulus? (stimulus → fMRI activation in a single voxel, tested via a contrast of interest). Classifier: do voxels contain information to predict what the subject is seeing? (fMRI activation in multiple voxels → stimulus)

36 case study: two categories. Subjects read concrete nouns in 2 categories; the words name either tools or building types. Trial: see a word, think about its properties and use, visualize it, then a blank screen (3 seconds and 8 seconds shown on the slide timeline). [data from Rob Mason and Marcel Just, CCBI, CMU]

37 case study: two categories. Subjects read concrete nouns in 2 categories; the words name either tools or building types. Trial: see a word, think about its properties and use, visualize it, then a blank screen (3 seconds and 8 seconds shown on the slide timeline). Goal: can the two categories be distinguished? Average the images around the trial peak to get one labelled image (e.g. tools). [data from Rob Mason and Marcel Just, CCBI, CMU]

38 the name(s) of the game: an example is an average trial image, a row of voxel values (features) with a class label (e.g. tools).

39 the name(s) of the game: an example is an average trial image, a row of voxel values (features) with a class label; examples collected together form a group, here 14 examples with their labels.

40 the name(s) of the game: an example is an average trial image, a row of voxel values (features) with a class label; 14 examples and their labels form a group, and several groups (group 1, group 2, ...) together form the dataset.

41 classifying two categories: the examples and labels from groups 1, 3 and 5 form the training data (42 examples); groups 2, 4 and 6 form the test data (42 examples).

42 classifying two categories: a classifier is trained on the training data (42 examples, groups 1, 3, 5) and its labels; the test data (42 examples, groups 2, 4, 6) are held out.

43 classifying two categories: the trained classifier applied to the test data produces predicted labels, which are compared against the true labels.

44 classifying two categories: the trained classifier applied to the test data produces predicted labels, which are compared against the true labels; accuracy estimate = #correct/42 = 0.65.
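
A minimal sketch (not from the slides) of the procedure just described. It assumes an examples matrix X (rows = averaged trial images, columns = voxels), string labels y, and a groups vector giving each example's acquisition group, with groups 1, 3, 5 used for training and 2, 4, 6 for testing as in the diagram; the linear SVM is one possible classifier choice.

```python
# Train on half the groups, test on the other half, report accuracy.
import numpy as np
from sklearn.svm import LinearSVC

train = np.isin(groups, [1, 3, 5])          # training data: odd groups
test = np.isin(groups, [2, 4, 6])           # test data: even groups

clf = LinearSVC().fit(X[train], y[train])   # parameters learned from training data
pred = clf.predict(X[test])                 # predicted labels for the test data
accuracy = np.mean(pred == y[test])         # accuracy estimate = #correct / #test
```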

45 a classifier  is a function from data to labels  parameters learned from training data

46 a classifier  is a function from data to labels  parameters learned from training data  want to estimate its true accuracy  “probability of labelling a new example correctly”  estimation is done on the test data  it is finite, hence an estimate with uncertainty

47 a classifier  is a function from data to labels  parameters learned from training data  want to estimate its true accuracy  “probability of labelling a new example correctly”  estimation is done on the test data  it is finite, hence an estimate with uncertainty  null hypothesis: “the classifier learned nothing”

48 what questions can be tackled?  is there information? (pattern discrimination)  where/when is information present? (pattern localization)  how is information encoded? (pattern characterization)

49 “is there information?”  what is inside the black box?  how to test results?  from a study to examples

50 what is inside the box? The simplest function is no function at all: "nearest neighbour" (figure: tools and buildings examples plotted over voxel 1 vs voxel 2).

51 what is inside the box? The simplest function is no function at all: "nearest neighbour"; a new, unlabelled example (?) is plotted among the training examples.

52 what is inside the box? The simplest function is no function at all: "nearest neighbour"; the new example gets the label of its closest training example (ding!).

53 what is inside the box? The simplest function is no function at all: "nearest neighbour"; it requires a measure of similarity between examples (euclidean, correlation, ...).
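
For concreteness (not from the slides), a sketch of this "no function at all" classifier: a new example takes the label of its most similar training example. X_train, y_train and x_new are assumed to exist; the two similarity options mirror the euclidean/correlation choices mentioned above.

```python
# Nearest-neighbour classification with a choice of similarity measure.
import numpy as np

def nearest_neighbour(X_train, y_train, x_new, similarity="euclidean"):
    if similarity == "euclidean":
        scores = -np.linalg.norm(X_train - x_new, axis=1)    # closer = higher score
    else:  # "correlation"
        scores = np.array([np.corrcoef(row, x_new)[0, 1] for row in X_train])
    return y_train[np.argmax(scores)]        # label of the most similar example
```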

54 what is inside the box? Next simplest: learn a linear discriminant (figure: linear discriminant A separating tools from buildings over voxel 1 vs voxel 2).

55 what is inside the box? Next simplest: learn a linear discriminant; note that there are many solutions (linear discriminants A and B both separate the training examples).

56 what is inside the box? Next simplest: learn a linear discriminant; note that there are many solutions. Where does a new example (?) fall?

57 what is inside the box? Next simplest: learn a linear discriminant; note that there are many solutions (linear discriminants A and B), and different solutions can classify a new example differently.

58 linear classifiers: if weight0 + weight1 x voxel1 + weight2 x voxel2 + ... + weightn x voxeln > 0, predict tools; otherwise predict buildings.

59 linear classifiers: if weight0 + weight1 x voxel1 + ... + weightn x voxeln > 0, predict tools; otherwise buildings. Various kinds (Gaussian Naive Bayes, Logistic Regression, Linear SVM) differ on how the weights are chosen.

60 linear classifiers: the same decision rule with linear SVM weights (weight map shown on the slide).

61 linear classifiers: the same decision rule with linear SVM weights; some weights pull towards tools and others pull towards buildings.
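
Written out as code (an illustration, not the authors' implementation), the decision rule above is a weighted sum of voxel values plus a bias, thresholded at zero; the weights and bias come from whatever training procedure (GNB, logistic regression, linear SVM, ...) was used.

```python
# The linear decision rule from the slide: weight0 + sum_i weight_i * voxel_i > 0.
import numpy as np

def linear_classify(example, weights, weight0):
    score = weight0 + np.dot(weights, example)   # weighted sum of voxel values plus bias
    return "tools" if score > 0 else "buildings"
```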

62 nonlinear classifiers  linear on a transformed feature space

63 nonlinear classifiers: linear on a transformed feature space; in neural networks, the new features are learnt (figure: a network from voxel 1 and voxel 2 inputs to a tools vs buildings output).

64 nonlinear classifiers: linear on a transformed feature space! In neural networks the new features are learnt; in SVMs the new features are (implicitly) determined by a kernel (figure: a quadratic SVM operating on voxel 1, voxel 2 and voxel 1 x voxel 2, tools vs buildings).

65 nonlinear classifiers reasons to be careful:  too few examples, too many features  harder to interpret

66 nonlinear classifiers, reasons to be careful: too few examples, too many features; harder to interpret; overfitting (figure from [Hastie et al, 2001]: as model complexity increases, prediction error keeps falling on the training set but rises on the test set beyond the best tradeoff).

67 how do we test predictions? (recap figure: classifier trained on the training data, predicted labels on the test data compared against the true labels)

68 how do we test predictions? The classifier outputs a predicted label for each test example (tools, buildings, tools, buildings, tools, ...).

69 how do we test predictions? Compare the predicted labels against the true labels, marking errors, and count #correct out of #test.

70 how do we test predictions? Null hypothesis: the classifier "learnt nothing" and "predicts randomly".

71 how do we test predictions? Null hypothesis: the classifier "learnt nothing", "predicts randomly". Intuition: a result is significant if it is very unlikely under the null.

72 how do we test predictions? Let X = #correct; P(X|null) is binomial(#test, 0.5); the p-value is P(X >= observed result | null) (figure: distribution under the null, with the 0.05 p-value cut-off).

73 how do we test predictions? Let X = #correct; P(X|null) is binomial(#test, 0.5); the p-value is P(X >= observed result | null). Lots of caveats: accuracy is an estimate, and with few examples it is very uncertain (you can get a confidence interval); you must also correct for multiple comparisons.
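
A small sketch (not from the slides) of the test just described: under the null, #correct follows a Binomial(#test, 0.5) distribution and the p-value is P(X >= observed). The numbers 42 and 27 are illustrative, roughly matching the 0.65 accuracy from the earlier example.

```python
# One-sided binomial test against chance (0.5) for a two-class classifier.
from scipy.stats import binom

n_test, n_correct = 42, 27                        # illustrative numbers (~0.65 accuracy)
p_value = binom.sf(n_correct - 1, n_test, 0.5)    # P(X >= n_correct | null)
print(p_value)
```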

74 what questions can be tackled?  is there information? (pattern discrimination)  where/when is information present? (pattern localization)  how is information encoded? (pattern characterization)

75 case study: orientation subjects see gratings in one of 8 orientations [Kamitani&Tong, 2005]

76 case study: orientation. Subjects see gratings in one of 8 orientations; individual voxels in visual cortex respond similarly to the different orientations (figure: voxel responses by orientation). [Kamitani&Tong, 2005]

77 case study: orientation. Subjects see gratings in one of 8 orientations; individual voxels in visual cortex respond similarly to the different orientations; yet, voxels can be combined (e.g. by a linear SVM) to predict the orientation of the grating being seen! [Kamitani&Tong, 2005]

78 features  case study #1, features are voxels  case study #2, features are voxels in visual cortex

79 features: in case study #1 the features are voxels; in case study #2 they are voxels in visual cortex. What else could they be? For instance, voxels at particular times in a trial (figure from a syntactic ambiguity study: voxel activation difference over time, in seconds, for ambiguous versus unambiguous sentences).

80 features: you can also synthesize features, e.g. with Singular Value Decomposition (SVD) or Independent Component Analysis (ICA) (figure: the examples-by-voxels dataset is factored into new features Z and basis images over voxels).

81 features: you can also synthesize features with SVD or ICA; this reduces the data to #features < #examples, each feature has a spatial extent (its basis image), and the decomposition is learnt on the training set and then used to convert the test set.
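
As an illustration (not from the slides), a numpy sketch of synthesizing features with SVD: the basis is learnt on the training set only and then used to convert both training and test sets. X_train and X_test (examples x voxels) and the number of components are assumptions.

```python
# Synthesize features with SVD learnt on the training set only.
import numpy as np

n_components = 20                                  # keep #features < #examples
U, S, Vt = np.linalg.svd(X_train, full_matrices=False)
basis = Vt[:n_components]                          # basis images (components x voxels)
Z_train = X_train @ basis.T                        # new features for training examples
Z_test = X_test @ basis.T                          # convert the test set with the same basis
```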

82 example construction  an example  can be created from one or more brain images  needs to be amenable to labelling

83 example construction  an example  can be created from one or more brain images  needs to be amenable to labelling  some possibilities  the brain image from a single TR  the average image in a trial or a block  the image of beta coefficients from deconvolution

84 example construction: an example can be created from one or more brain images and needs to be amenable to labelling. Some possibilities: the brain image from a single TR; the average image in a trial or a block; the image of beta coefficients from deconvolution. Caveats: remember the haemodynamic response time-to-peak; if the images for two examples are not separated "enough", then within the test set this lowers the effective #examples in the statistical test, and between training and test sets it amounts to "peeking"/"circularity"; read [Kriegeskorte et al. 2009] ("double dipping").
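
For illustration only (not the slides' pipeline), a sketch of one of the possibilities above: averaging a few volumes around each trial's expected haemodynamic peak. The arrays bold (time x voxels) and onsets (trial onsets in TRs), and the peak offset, are all assumptions.

```python
# Build one averaged image per trial, allowing for the haemodynamic time-to-peak.
import numpy as np

def trial_examples(bold, onsets, peak_offset=3, width=2):
    examples = []
    for onset in onsets:
        start = onset + peak_offset                # shift by the assumed time-to-peak (in TRs)
        examples.append(bold[start:start + width].mean(axis=0))
    return np.vstack(examples)                     # examples x voxels matrix
```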

85 localization  key idea #1 test conclusions pertain to whatever is fed to the classifier

86 localization  key idea #1 test conclusions pertain to whatever is fed to the classifier  key idea #2 one can predict anything that can be labelled: stimuli, subject percepts, behaviour, response,...

87 localization  key idea #1 test conclusions pertain to whatever is fed to the classifier  key idea #2 one can predict anything that can be labelled: stimuli, subject percepts, behaviour, response,... so what criteria can we use?  location  time  voxel behaviour or relationship to label  aka “feature selection”

88 feature (voxel) selection: what does it look like? (figure: the earlier training/test diagram, with a voxel selection step added)

89 feature (voxel) selection: what does it look like? (figure build continues: a subset of voxels is selected and only those are passed to the classifier)

90 feature (voxel) selection: what does it look like? (figure build continues: the same selected voxels are used in both training and test data)

91 feature (voxel) selection: what does it look like? It is great for improving prediction accuracy, but the selected voxels often come from all over the place, and there is very little overlap in the voxels selected across training sets.

92 feature (voxel) selection  look at the training data and labels

93 feature (voxel) selection: look at the training data and labels. A few criteria: difference from baseline (figure: a voxel responding to classes A and B versus baseline).

94 feature (voxel) selection: a few criteria: difference from baseline; difference between classes (e.g. ANOVA).

95 feature (voxel) selection: a few criteria: difference from baseline; difference between classes (e.g. ANOVA); preferential response to one class.

96 feature (voxel) selection: a few criteria: difference from baseline; difference between classes (e.g. ANOVA); preferential response to one class; stability across runs (e.g. the same A/B/C response profile in run 1 and run 2); ...
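
A minimal sketch (not from the slides) of the "difference between classes" criterion: an ANOVA F-score per voxel computed on the training data only, with the same selection then applied to the test data. X_train, y_train and X_test are assumed; the cutoff of 1000 voxels echoes the "top 1000 voxels" used later in the deck.

```python
# ANOVA-based voxel selection fit on training data, applied to both sets.
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=1000).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)    # top-k voxels chosen on the training data
X_test_sel = selector.transform(X_test)      # the same voxels applied to the test data
```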

97 case study: category reinstatement. Items from 3 categories: faces, locations, objects. A classifier is trained during study of the grouped items and used to detect category reinstatement during free recall (test). [Polyn et al 2005]

98 temporal localization (figure: classifier estimates of reinstatement for faces, locations and objects over time, and voxel influence on the reinstatement estimate). [Polyn et al 2005]

99 temporal localization key ideas:  trained classifiers may be used as detectors to localize the time at which information is present  test examples can be temporally overlapping  it is feasible to decode endogenous events

100 what questions can be tackled?  is there information? (pattern discrimination)  where/when is information present? (pattern localization)  how is information encoded? (pattern characterization)

101 classifier dissection  a classifier learns to relate features to labels  we can consider not just what gets selected but also how the classifier uses it  in linear classifiers, look at voxel weights

102 classifier dissection: weights depend on classifier assumptions; this is less of an issue if feature selection is used; here the weight maps are similar, but the accuracies differ by 15%.

103 classifier dissection: effects of classifier assumptions on synthetic data (figure: SVM and GNB weights across "voxels"); see [Pereira, Botvinick 2010].

104 case study: 8 categories subjects see photographs of objects in 8 categories  faces, houses, cats, bottles, scissors, shoes, chairs, scrambled  block: series of photographs of the same category [ Haxby et al., 2001 ]

105 case study: 8 categories nearest neighbour classifier  all category pair distinctions  selects voxels by location  fusiform gyrus  rest of temporal cortex  selects voxels by behaviour  responsive to single category  responsive to multiple  logic:  restrict by location/behaviour  see if there is still information [ Haxby et al., 2001 ]

106 classifier dissection 1) whole-brain logistic regression weights faces

107 classifier dissection 1) whole-brain logistic regression weights faces houses

108 classifier dissection 1) whole-brain logistic regression weights faces houses chairs

109 classifier dissection 2) feature selection faces

110 classifier dissection 2) feature selection faces top 1000 voxels

111 classifier dissection whole brain classifier  accuracy 40% in this case  many more features than examples => simple classifier  messy maps (can bootstrap to threshold)

112 classifier dissection whole brain classifier  accuracy 40% in this case  many more features than examples => simple classifier  messy maps (can bootstrap to threshold) voxel selection  accuracy 80% in this case  sparse, non-reproducible maps  different methods pick different voxels

113 classifier dissection whole brain classifier  accuracy 40% in this case  many more features than examples => simple classifier  messy maps (can bootstrap to threshold) voxel selection  accuracy 80% in this case  sparse, non-reproducible maps  different methods pick different voxels a lot of work [Mitchell et al 2004],[Norman et al 2006], [Haynes et al 2006], [Pereira 2007], [De Martino et al 2008], [Carrol et al 2009], [Pereira et al 2009]

114 classifier dissection, neural network: a one-of-8 classifier over temporal cortex; the learned model has hidden units, and activation patterns across those units reflect category similarity (figure: dendrogram of distances between face, cat, house, bottle, scissors, shoe, chair). [Hanson et al., 2004]

115 classifier dissection, neural network: a one-of-8 classifier over temporal cortex; the learned model has hidden units whose activation patterns reflect category similarity. Sensitivity analysis: add noise to voxels and ask which ones lead to classification error. [Hanson et al., 2004]

116 classifier dissection conclusions  if linear works, you can look at weights  but know what your classifier does (or try several)  read about bootstrapping (and [Strother et al. 2002] )

117 classifier dissection conclusions  if linear works, you can look at weights  but know what your classifier does (or try several)  read about bootstrapping (and [Strother et al. 2002] )  voxel selection  may be necessary in multiclass situations  try multiple methods and look at the voxels they pick  voxels picked may be a small subset of informative ones  report all #s of voxels selected or use cross-validation on training set to pick a # to use

118 classifier dissection conclusions  if linear works, you can look at weights  but know what your classifier does (or try several)  read about bootstrapping (and [Strother et al. 2002] )  voxel selection  may be necessary in multiclass situations  try multiple methods and look at the voxels they pick  voxels picked may be a small subset of informative ones  report all #s of voxels selected or use cross-validation on training set to pick a # to use  nonlinear classifiers  worth trying, but try linear + voxel selection first  look at [Hanson et al. 2004] and [Rasmussen et al. 2011] for ways of gauging voxel influence on classifier

119 information-based mapping (searchlights): focus the classifier on small voxel neighbourhoods; with more examples than features it can learn voxel relationships (e.g. a covariance matrix) and nonlinear classifiers can be trained; each neighbourhood's classifier yields predicted labels and an accuracy assigned to its voxel. [Kriegeskorte 2006]
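
A very simplified sketch (not the original toolbox code) of the searchlight idea: for each voxel, train and test a classifier on its small neighbourhood and store the cross-validated accuracy. X (examples x voxels), y, and a neighbourhoods list giving each voxel's sphere of voxel indices are assumed; LDA is used here because it can learn the local covariance mentioned above.

```python
# Searchlight sketch: one cross-validated accuracy per voxel neighbourhood.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

accuracy_map = np.zeros(X.shape[1])
for v, neigh in enumerate(neighbourhoods):
    scores = cross_val_score(LinearDiscriminantAnalysis(), X[:, neigh], y, cv=6)
    accuracy_map[v] = scores.mean()           # accuracy assigned to the centre voxel
```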

120 information mapping  on 8 categories, yields an accuracy map  also local information: covariance, voxel weights

121 information mapping  on 8 categories, yields an accuracy map  also local information: covariance, voxel weights  can be thresholded for statistical significance

122 information mapping: "H0: chance level" deems many voxels significant. What does accuracy mean in multi-class settings? Use a confusion matrix: for each class, what do the examples belonging to it get labelled as? (figure: 8-class accuracy and confusion matrix over faces, houses, cats, shoes, bottles, chairs, scissors, scrambled).
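
For concreteness (not from the slides), a small sketch of the confusion matrix just described: for each true class, count how often its test examples receive each predicted label. y_test and pred are assumed to be lists of class labels.

```python
# Confusion matrix: rows are true classes, columns are predicted classes.
import numpy as np

classes = sorted(set(y_test))
conf = np.zeros((len(classes), len(classes)), dtype=int)
for true, guess in zip(y_test, pred):
    conf[classes.index(true), classes.index(guess)] += 1
print(conf)
```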

123 information mapping: contrast each pair of classes directly (face v house, face v cats, ..., scrambled v chairs); threshold each accuracy map to a binary image and count, for every voxel, the number of pairs for which it is significant.

124 information mapping: each voxel then has a binary profile across class pairs; how many different profiles are there?

125 information mapping: a binary profile is a kind of confusion matrix; there are only a few hundred profiles, many of them similar, so cluster them! (example clusters: "faces and cats versus all else", "houses versus all else"; classes: face, house, cat, bottle, scissors, chairs, shoes, scrambled). [Pereira&Botvinick, KDD 2011]

126 information mapping: a map of accuracy works well in a 2-class situation, and some classifiers seem consistently better (see [Pereira&Botvinick 2010] for details); it is easy to get above chance with multiway prediction, so reporting accuracy or the number of significant voxels is not enough; consider reporting common profiles or grouping classes into distinctions to test.

127 what questions can be tackled?  is there information? (pattern discrimination)  where/when is information present? (pattern localization)  how is information encoded? (pattern characterization)

128 to get started  MVPA toolbox (in MATLAB)  PyMVPA (in Python)  support the whole workflow  cross-validation, voxel selection, multiple classifiers,...  helpful mailing lists (most people are on both)

129 to get started  MVPA toolbox (in MATLAB)  PyMVPA (in Python)  support the whole workflow  cross-validation, voxel selection, multiple classifiers,...  helpful mailing lists (most people are on both)  Searchmight (in MATLAB, shameless plug)   special purpose toolbox for information mapping  can be used with MVPA toolbox

130 Thank you! questions?