Machine Learning for Analyzing Brain Activity Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 2006 Collaborators: Rebecca Hutchinson, Marcel Just, Mark Palatucci, Francisco Pereira, Rob Mason, Indra Rustandi, Svetlana Shinkareva, Wei Wang
Learning = improving performance at some task through experience
Learning to Predict Emergency C-Sections 9714 patient records, each with 215 features [Sims et al., 2000]
Learning to detect objects in images Example training images for each orientation (Prof. H. Schneiderman)
Learning to classify text documents Company home page vs Personal home page vs University home page vs …
Reinforcement Learning [Sutton and Barto 1981; Samuel 1957]
Machine Learning - Practice Object recognition Mining Databases Speech Recognition Control learning Reinforcement learning Supervised learning Bayesian networks Hidden Markov models Unsupervised clustering Explanation-based learning.... Text analysis
Machine Learning - Theory PAC Learning Theory (for supervised concept learning) relates: # examples (m), representational complexity (H), error rate (ε), failure probability (δ). Similar theories for: reinforcement skill learning, unsupervised learning, active student querying, … also relating: # of mistakes during learning, learner’s query strategy, convergence rate, asymptotic performance, …
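For reference, one standard textbook PAC bound that ties these four quantities together is sketched below. It applies to a learner that outputs any hypothesis consistent with the training data, drawn from a finite hypothesis space H, with m training examples, error rate ε and failure probability δ. The formula on the original slide did not survive, so treating this as the intended one is an assumption.

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

Read as: with at least this many examples, the probability that a consistent hypothesis has true error above ε is at most δ.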
Functional MRI
Brain scans can track activation with precision and sensitivity [from Walt Schneider]
Human Brain Imaging. fMRI: location with millimeter precision; 1 mm^3 = (value missing) % of cortex. ERP: time course with millisecond precision; 10 ms = 10 % of human production cycle. DTI: connection tracing with millimeter precision; 1 mm connection ~10k fibers, or (value missing) % of neurons. [from Walt Schneider]
Can we train classifiers of mental state?
Can we train a program to classify what you’re thinking about? “reading a word about tools” or “reading a word about buildings”? Observed fMRI: … time … Train classifiers of the form: fMRI(t, t+1, ..., t+d) → CognitiveProcess, e.g., fMRI(t, t+1, ..., t+4) = {tools, buildings, fish, vegetables, ...}
Reading a noun (15 sec) [Rustandi et al., 2005]
Representing Meaning in the Brain Study brain activation associated with different semantic categories of words and pictures Categories: vegetables, tools, trees, fish, dwellings, building parts Some experiments use block stimulus design Present sequence of 20 words from same category, classify the block of words Some experiments use single stimuli Present single words/pictures for 3 sec, classify brain activity for single word/picture
Classifying the Semantic Category of Word Blocks Learn fMRI(t, ..., t+32) → word-category(t, ..., t+32) –fMRI(t1...t2) = 10^4 voxels, mean activation of each during interval [t1, t2] Training methods: –train single-subject classifiers –Gaussian Naïve Bayes P(fMRI | word-category) –Nearest neighbor with spatial correlation as distance –SVM, Logistic regression, ... Feature selection: Select n voxels –Best accuracy: reduce 10^4 voxels to 10^2
Mean Activation per Voxel for Word Categories [Figure panels: Tools, Dwellings; Presentation 1, Presentation 2; one horizontal slice, from one subject, ventral temporal cortex] [Pereira et al., 2004] Classification accuracy 1.0 (tools vs. dwellings) on each of 7 human subjects (trained on individual human subjects)
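Gaussian Naïve Bayes (GNB) classifier* for ⟨f_1, ..., f_n⟩ → C. Assume the f_j are conditionally independent given C. Training: 1. For each class value c_i, estimate P(C = c_i). 2. For each feature F_j, estimate P(F_j | C = c_i), modeled as a Normal distribution N(μ_ji, σ_ji). Classify new instance: use Bayes rule, P(C = c_i | f_1, ..., f_n) ∝ P(C = c_i) ∏_j P(f_j | C = c_i). [Diagram: class node C with arrows to feature nodes F_1, F_2, …, F_n] *assumes feature values are conditionally independent given the class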
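The two training steps and the Bayes-rule classification can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the studies: the class name `GaussianNaiveBayes`, the small variance floor, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal GNB sketch: one Normal distribution per (class, feature) pair."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Step 1: class priors P(C = c_i), stored as logs
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Step 2: per-class mean and variance of every feature
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # The 1e-6 variance floor is an assumption, added to avoid division by zero
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-6
        return self

    def log_posterior(self, X):
        """Unnormalized log posterior: log P(c) + sum_j log N(x_j; mu_cj, var_cj)."""
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.mean_[None, :, :]          # (n, classes, features)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None] + diff**2 / self.var_[None])
        return self.log_prior_[None, :] + log_lik.sum(axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self.log_posterior(X), axis=1)]

# Synthetic two-class data standing in for voxel activations (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 20)),
               rng.normal(3.0, 1.0, size=(50, 20))])
y = np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(X, y)
train_acc = (model.predict(X) == y).mean()
```

Working in log space keeps the product over thousands of voxel likelihoods from underflowing.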
Results predicting word block semantic category Mean pairwise prediction accuracy averaged over 8 subjects: Random guess: 0.5 expected accuracy. Ventral temporal cortex classifiers averaged over 8 subjects: Best pair: Dwellings vs. Tools (1.00 accuracy); Worst pair: Tools vs. Fish (.40 accuracy); Average over all pairs: .75. Averaged over all subjects, all pairs: Full brain: .75 (individual subjects: .57 to .83); Ventral temporal: .75 (individuals: .57 to .88); Parietal: .70 (individuals: .62 to .77); Frontal: .67 (individuals: .48 to .78)
Question: Are there consistently distinguishable and consistently confusable categories across subjects?
Six-Category Study: Pairwise Classification Errors (ventral temporal cortex) [Table: columns Fish, Vegetables, Tools, Dwellings, Trees, Bldg Parts; rows Subj 1 to Subj 7, Mean, Worst, Best; the cell values did not survive extraction]
Question: Can we classify single, 3-second word presentation?
Classifying individual word presentations Accuracy of up to 80% for classifying whether a word is about a “tool” or a “dwelling”. Rank accuracy of up to 68% for classifying which of 14 individual words (6 presentations of each word). Category classification accuracy is above chance for all subjects. Individual word classification accuracy is not consistent across subjects.
Question: Where in the brain is the activity that discriminates word category?
Learned Logistic Regression Weights: Tools (red) vs Buildings (blue)
Accuracy of searchlights: Bayes classifier Accuracy at each voxel with a radius 1 searchlight
Regions that encode ‘tools’ vs. ‘dwellings’ Accuracy at each significant searchlight [ ], “tools” vs. “dwellings”. The “searchlight” classifier at each voxel uses only the voxel and its immediate neighbors
The distinguishing voxels occur in sensible areas: dwellings activate parahippocampal place area tools activate motor and premotor areas
What is the relation between the neural representation of a word in two different languages in the brain of a bilingual? Tested 10 Portuguese-English bilinguals in English and in Portuguese, using the same words
Identifying categories (tool or dwelling) within languages for individual subjects using naïve Bayes classifier [Table: one row per subject; columns English, Portuguese, Eng->Port, Port->Eng; subject IDs and accuracy values did not survive extraction]
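Identifying categories across languages for individual subjects (rank accuracy) [Table: one row per subject; columns English, Portuguese, Eng->Port, Port->Eng, with the last two grouped as “Across Languages”; subject IDs and accuracy values did not survive extraction]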
What is the relation between the neural representation of an object when it is referred to by a word versus when it is depicted by a line drawing?
Schematic representation of experimental design for (A) pictures and (B) words experiments
It is easier to identify the semantic category of a picture a subject is viewing than a word he/she is reading Pictures accuracy Words accuracy
Can a classifier be trained in one modality, and then accurately identify activation patterns in the other modality?
Cross-Modal identification accuracy is high, in both directions Word to picture Picture to word
Can a classifier be trained on a group of human subjects, then be successfully applied to a new person?
Picture category accuracy within and between subjects The classifiers work well across subjects; for “bad” subjects, the identification is even better across than within subjects Within subject accuracy Between subject accuracy
Locations of diagnostic voxels across subjects Tool voxels are shown in blue, dwellings voxels are shown in red. L IPL indicated with a yellow circle; activates during imagined or actual grasping (Crafton et al., 1996) [Figure panels: Subj 1, Subj 5, Subj 4, Subj 3, Subj 2]
Voxel Locations are Similar for Pictures and Words Pictures Tools: L IPL L postcentral L middle temporal Cuneus Dwellings (positive weights): L/R Parahippocampal gyrus Cuneus Words Tools: L IPL L postcentral L precentral L middle temporal Dwellings (positive weights): L/R Parahippocampal gyrus Interpretation: L IPL – imaginary grasping (of tools, here) (Crafton et al., 1996) Parahippocampal gyrus – formation and retrieval of topographical memory; plays a role in perception of landmarks or scenes
Lessons Learned Yes, one can train machine learning classifiers to distinguish a variety of cognitive states/processes –Picture vs. Sentence –Ambiguous sentence vs. unambiguous –Nouns about “tools” vs. nouns about “dwellings”: train on Portuguese words, test on English; train on words, test on pictures; train on some human subjects, test on others Failures too: –True vs. false sentences –Negative sentence (containing “not”) vs. affirmative ML methods: –Logistic regression, NNbr, Naïve Bayes, SVMs, … –Feature selection matters: searchlights, contrast to fixation, ... –Case study in high dimensional, noisy classification [MLJ 2004]
[Science, 2001] [Machine Learning Journal, 2004] [Nature Neuroscience, 2006]
2. How can we model overlapping mental processes?
Can we learn to classify/track multiple overlapping processes (with unknown timing)? [Figure: input stimuli; processes Read sentence, View picture, Decide whether consistent, and an unlabeled process (?); observed fMRI; observed button press]
Hidden Process Models [Figure: a process, e.g. ReadSentence, has a duration d (11 sec.), a prior P(Process = ReadSent), a distribution P(Offset times), and a response signature W. A configuration C is a sequence of process instances. Each process instance (the example shown is instance 4) has a process h (ReadSentence), a timing landmark (landmark 3) and an offset time O (1 sec); its start time is the timing landmark + O. The input stimuli (sentence, picture, sentence) define the timing landmarks; Y is the observed data. Quantities drawn in red are to be learned.]
The HPM Graphical Model Probabilistically generate data Y_{t,v} using a configuration of N process instances. [Graphical model: each process instance has nodes Stimulus, ProcessType, StartTime and Offset, and makes a contribution to Y_{t,v}; the contributions combine into the observed data Y_{t,v} for voxel v, time t; the legend distinguishes observed from unobserved nodes]
Learning HPMs Model: Y = X h + ε. Known process IDs, start times: –Least squares regression, e.g., Dale [HBM, 1999] –Ordinary least squares if noise is assumed independent over time –Generalized least squares if noise is assumed autocorrelated Unknown start times: EM algorithm (iteratively reweighted least squares) –Repeat: E: estimate distribution over latent variables; M: choose parameters to maximize expected log full data likelihood
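When process IDs and start times are known, fitting Y = X h + ε reduces to a regression problem. The sketch below builds a design matrix from (process, start time) pairs and recovers the response signatures by ordinary least squares. The function name `design_matrix`, the problem sizes, and the synthetic data are hypothetical; only the independent-noise case is shown, not the generalized least squares variant.

```python
import numpy as np

def design_matrix(T, starts, n_processes, d):
    """X[t, k*d + j] counts instances of process k that started at time t - j."""
    X = np.zeros((T, n_processes * d))
    for k, s in starts:                  # (process id, start time) pairs
        for j in range(d):
            if s + j < T:
                X[s + j, k * d + j] += 1.0
    return X

rng = np.random.default_rng(0)
T, V, d = 60, 3, 5                       # time points, voxels, response length (hypothetical)
starts = [(0, 2), (1, 4), (0, 20), (1, 30), (0, 41), (1, 43)]   # some instances overlap
X = design_matrix(T, starts, n_processes=2, d=d)

h_true = rng.normal(size=(2 * d, V))     # stacked response signatures, one column per voxel
Y = X @ h_true + 0.01 * rng.normal(size=(T, V))

# Ordinary least squares, appropriate if noise is independent over time
h_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
max_err = np.abs(h_hat - h_true).max()
```

Because overlapping instances simply add their rows in X, the same regression separates the responses of processes that are active at the same time.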
HPM: Synthetic Noise-Free Data Example [Figure: process responses for Process 1, Process 2, Process 3; process instances (ProcessID=1, S=1), (ProcessID=2, S=17), (ProcessID=3, S=21); observed data over time]
Figure 1. The learner was given 80 training examples with known start times for only the first two processes. It chooses the correct start time (26) for the third process, in addition to learning the HDRs for all three processes. [Figure legend: true signal, observed noisy signal, true response W, learned W; panels: Process 1, Process 2, Process 3]
Inference with HPMs Given an HPM and data set –Assign the Interpretation (process IDs and timings) that maximizes data likelihood Classification = assigning the maximum likelihood process IDs y = X h + ε
2-process HPM for Picture-Sentence Study Read sentence View picture Cognitive processes: Observed fMRI: cortical region 1: cortical region 2:
ViewPicture in Visual Cortex Offset = P(Offset)
ReadSentence in Visual Cortex Offset = P(Offset)
[Trial timeline labels: View Picture or Read Sentence (two stimulus slots), Fixation, Press Button, Rest; time marks t=0, 4 sec., 8 sec., 16 sec.] Task for both GNB and HPM: picture or sentence? HPMs improve classification accuracy over Gaussian Naïve Bayes by 15% on average.
trial 25 Models learned (with known onset times) Comprehend sentence Comprehend picture
How can we use HPMs to resolve between competing cognitive models?
Is the subject using two or three cognitive processes? Train 2-process HPM 2 on training data Train 3-process HPM 3 on training data Test HPM 2 and HPM 3 on separate test data –Which predicts known process identities better? –Which has higher probability given the test data? –(use n-fold cross-validation for test)
3-process HPM model for Picture-Sentence Study [Figure: cognitive processes Read sentence, View picture, Decide whether consistent (?); observed fMRI in cortical region 1 and cortical region 2; observed button press]
3-process HPM model for Picture-Sentence Study [Figure: input stimuli; processes Read sentence, View picture, Decide whether consistent (?); observed fMRI; observed button press]
Learned HPM with 3 processes (S, P, D), and R=13 sec (TR=500 msec). [Figure: observed and reconstructed signals with process instances P, S and D marked; learned models for S, P, D; D start time chosen by program as t+18]
Which HPM Model Works Best?
Which HPM Model Works Best? 3-process HPM
Parameter Sharing in HPMs [Niculescu, Mitchell, Rao, JMLR 2006] Problem: Many, many parameters to estimate: 4698 voxels × 26 parameters/voxel × 3 processes = 366,444, but only dozens of training trials. Sometimes neighboring voxels exhibit similar W_{v,t,h} (indexed by voxel v, time t, process h). Learn which subregions share, then for each voxel v in region r: W_{v,t,h} = C_v · W_{r,t,h}
Which Parameters to Share? Learn shared regions via a greedy, top-down algorithm: Initialize Regions ← set of anatomically-defined regions. Loop until all r ∈ Regions are finalized: –Choose an unfinalized region, R, from Regions –SR ← divide R rectilinearly into 2x2x4 subregions –Train HPM_R and HPM_SR, using nested cross-validation to determine which is more accurate –If HPM_SR is more accurate than HPM_R, then replace R by SR; else mark R finalized
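The control flow of this greedy, top-down search can be sketched independently of the HPM training itself. In the toy version below, `split` and `cv_score` are hypothetical stand-ins: voxels lie on a line, a region is halved rather than divided 2x2x4, and the score rewards homogeneous regions while penalizing extra regions. In the setting on the slide, the score would instead come from nested cross-validation of the trained HPMs.

```python
def learn_shared_regions(initial_regions, split, cv_score):
    """Greedy top-down refinement: keep a split only if it scores strictly better."""
    finalized, pending = [], list(initial_regions)
    while pending:
        region = pending.pop()                  # choose an unfinalized region R
        subregions = split(region)              # SR <- divide R
        if subregions and cv_score(subregions) > cv_score([region]):
            pending.extend(subregions)          # replace R by SR
        else:
            finalized.append(region)            # mark R finalized
    return finalized

# Toy problem (hypothetical): voxels 0-3 share one amplitude, voxels 4-7 another
truth = [1, 1, 1, 1, 5, 5, 5, 5]

def split(region):
    mid = len(region) // 2
    return [region[:mid], region[mid:]] if len(region) > 1 else []

def cv_score(regions):
    misfit = 0.0
    for r in regions:
        mean = sum(truth[v] for v in r) / len(r)
        misfit += sum((truth[v] - mean) ** 2 for v in r)
    return -misfit - 0.1 * len(regions)         # small penalty per extra region

regions = learn_shared_regions([tuple(range(8))], split, cv_score)
```

On this toy input the search splits the initial region once and then stops, because further splits add regions without reducing the misfit.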
[Figure: shared parameter regions; amplitude coefficients C_v; shared parameters W_{t,S} for the S process, plotted against time t] W_{v,t,S} = C_v · W_{t,S}
Results of Parameter Sharing in HPMs Parameter-sharing model needs only 35% as much training data as the non-sharing model (to achieve same accuracy). Reduces 4698 voxels to 299 regions. Reduces number of estimated parameters from 366,444 to 38,232. Improves cross-validated data likelihood of learned model. Parameter-sharing model currently learnable only when process onset times are given (future work...)
Goal: General Models of Cognitive Processing [Diagram: three task models built from shared processes] Read word, Decide category, Push button; Read word, Comprehend sentence, Decide truth, Push button; Read word, Comprehend picture, Comprehend sentence, Decide pic=?=sent, Push button
Summary Conclusions Can studies of human and artificial intelligence inform each other? –Up to now, not much –This may be about to change Can we understand knowledge representation in the brain? –fMRI provides sufficient data to distinguish interesting semantic representations Will we be able to track processes in the brain? –HPMs provide a machine learning approach to learning most probable models given observed data (and linearity assumption)
Thank you
[Figure: shared parameter regions; amplitude coefficients C_v; shared parameters W_{r,t,S} for the S process, plotted against time t] W_{v,t,S} = C_v · W_{r,t,S}
Univariate analysis (e.g., SPM): “Is the activity of voxel v sensitive to the experimental conditions?” Multivariate analysis (e.g., learned classifiers): “Can voxel set S = {v_1, ..., v_n} successfully predict the experimental condition?” [Figure labels: Tool words, Dwelling words]
Why Multivariate Classifiers? 1.Discover distributed patterns of activation 2.Determine statistical significance with fewer modeling assumptions (e.g., no need for t-test assumptions of Gaussianity) –Cross validation tests assume only iid examples 3.Determine whether there is a statistically significant difference, AND magnitude of the difference 4.Better handling of signal-to-noise problems: –Univariate combine signal across images –Multivariate combine signal across images and voxels
Imagine two voxels, and their P(activation|class) for c_1 and c_2. [Figure: first voxel, n = 10^6, p < (value missing); second voxel, n = 10^2, p < (value missing)] Both depend on the class. We can get the same p-values if we collect more data for the first. p-values yield confidence in existence of effect, not magnitude of effect. Magnitude of the effect is obtained by training a GNB classifier. Its cross validated prediction error is an unbiased estimate of the Bayes error (the area under the intersection), which is the magnitude of the effect
Then how do we get p-values for a classifier using voxel set S, which predicts m out of n correctly on a cross validation set? Null hypothesis: true classifier accuracy = .50. Under the null, the number correct follows Binomial(n, p=0.5), which gives P(m correct | true acc = 0.5). p-value = P(at least m correct | true acc = 0.5). [Figure labels: Tool words, Dwelling words]
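The p-value described here is the upper tail of a binomial distribution, which can be computed directly with the standard library. The function name `binomial_p_value` and the count of 60 correct are hypothetical; n = 84 matches the number of word presentations mentioned elsewhere in the talk.

```python
from math import comb

def binomial_p_value(m, n, p0=0.5):
    """P(at least m of n correct | true accuracy = p0), the one-sided p-value."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(m, n + 1))

# Hypothetical outcome: 60 of 84 held-out word presentations classified correctly
p = binomial_p_value(60, 84)
significant = p < 0.001
```

This test assumes the n held-out predictions are independent, which is the iid assumption the talk attributes to cross-validation tests.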
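Gaussian Naïve Bayes (GNB) classifier* for ⟨f_1, ..., f_n⟩ → C. Assume the f_j are conditionally independent given C. Training: 1. For each class value c_i, estimate P(C = c_i). 2. For each feature F_j, estimate P(F_j | C = c_i). Classify new instance: use Bayes rule, P(C = c_i | f_1, ..., f_n) ∝ P(C = c_i) ∏_j P(f_j | C = c_i). [Diagram: class node C with arrows to feature nodes F_1, F_2, …, F_n] * Same model assumptions as GLM!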
Linear Decision Surfaces This form of GNB learns a linear decision surface: predict one class when w_0 + Σ_i w_i x_i > 0, the other class otherwise. Logistic regression learns the same linear form, but estimates w_i to maximize conditional data likelihood P(C|X). Linear Discriminant Analysis learns the same form, but estimates w_i to maximize the ratio of between-class variance to within-class variance. Linear SVM learns the same form, but estimates w_i to maximize margin between classes
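To see why GNB can give a linear decision surface, a short worked derivation helps. It assumes two classes (labelled 0 and 1) and, for each feature i, a variance σ_i² shared by both classes. That shared variance is presumably what "this form of GNB" refers to, and it is what makes the quadratic terms in x_i cancel. The symbols μ_{i0}, μ_{i1}, σ_i are introduced here for illustration.

```latex
\ln\frac{P(C=1\mid x)}{P(C=0\mid x)}
  = \ln\frac{P(C=1)}{P(C=0)}
    + \sum_i \frac{(x_i-\mu_{i0})^2 - (x_i-\mu_{i1})^2}{2\sigma_i^2}
  = w_0 + \sum_i w_i x_i,
\qquad
w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2},
\quad
w_0 = \ln\frac{P(C=1)}{P(C=0)} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
```

The classifier predicts class 1 exactly when this log-odds is positive, so the boundary is the hyperplane w_0 + Σ_i w_i x_i = 0.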
Learning an HPM: Picture and Sentence Study Each trial: determine whether sentence correctly describes picture. 40 trials per subject: Picture first in 20 trials, Sentence first in other 20. Images acquired every 0.5 seconds. [Trial timeline labels: Read Sentence, View Picture, Fixation, Press Button, Rest; time marks t=0, 4 sec., 8 sec.]
Goal: Use brain imaging to study how people think What details can be observed with imaging? –Physically: 1 mm, 1 msec, axon bundle connectivity –Functionally: surprisingly subtle (e.g., ‘tools’ vs. ‘dwellings’) –Controlled experiments difficult: humans think what they want! What form of cognitive models makes sense? –High level production system models?: SOAR, ACT-R, 4CAPS, ... –Intermediate level: Hidden Process Models –Connectionist neural network models?: e.g., Plaut language models How can we analyze the data to find models? –Machine learning classifiers: find predictive spatial/temporal patterns –Hidden process models: model overlapping processes with unknown timing –Can we build a library of cognitive processes and their signatures?
Lessons Learned Yes, one can train machine learning classifiers to distinguish a variety of cognitive states/processes –Nouns about “tools” vs. nouns about “building parts” –Ambiguous sentence vs. unambiguous –Picture vs. Sentence Failures too: –True vs. false sentences –Negative sentence (containing “not”) vs. affirmative ML methods: –Logistic regression, NNbr, Naïve Bayes, SVMs, LDA, NNets, … –Feature selection matters: searchlights, contrast to fixation,... –Case study in high dimensional, noisy classif [MLJ 2004]
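HPMs More Precisely… Process h (e.g., ‘Read’) and Process Instance (e.g., “Read ‘The dog ran’ ”): the tuple definitions did not survive extraction. Configuration c = set of Process Instances. Hidden Process Model HPM = a 4-tuple consisting of: H, the set of processes; parameters defining the distribution over the process h(·) of each instance; C, the set of partially specified candidate configurations; and a voxel noise model with one component per voxel 1 … V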
Learning HPMs: with unknown timing, known processes EM (Expectation-Maximization) algorithm E-step –Estimate the conditional distribution over start times of the process instances given observed data and current HPM: P(O(π_1) … O(π_N) | Y, h(π_1) … h(π_N), HPM), where π_1 … π_N denote the process instances. M-step –Use the distribution from the E step to determine new maximum (expected) likelihood estimates of the HPM parameters: distributions governing timing offsets and response signatures. ** In real problems, some timings are often known * Special case of DBNs with built-in assumptions
Observed fMRI: … time … Can set S of voxels successfully predict the experimental condition? Reading a word about ‘tools’ or ‘buildings’?
Example 2: Word Categories – Individual word presentations Two categories (tools, dwellings): Presented 7 tool words, 7 dwelling words, 6 times each (84 word presentations in total) Inter-trial interval: 10 sec Train classifier to predict category given single word presentation, using 4 sec of data (starting 4 sec after stimulus) [with Marcel Just, Rob Mason, Francisco Pereira, Svetlana Shinkareva, Wei Wang ]
Learning task formulation Learn Mean(fMRI(t+4), ..., fMRI(t+7)) → WordCategory –Leave-one-out cross validation over 84 word presentations Preprocessing: –Convert each image x to standard normal image Learning algorithms tried: –kNN (spatial correlation) –Gaussian Naïve Bayes (best on average) –Regularized logistic regression (best on average) –Support Vector Machine Feature selection methods tried: –Logistic regression weights, activity relative to fixation, spotlights, ... Results: for 4 of 8 subjects, classifier accuracy > .80; others .5 to .8
Question: How can we tell which locations allow classifier to succeed? Classifiers answer: “Can set S of voxels successfully predict the experimental condition?” Try all possible subsets S? Examine learned classifier weights? Examine class-conditional means?...
Linear discriminant weights GNB (accuracy 0.65) [Slice orientation labels: posterior/anterior, left/right]
Linear discriminant weights GNB (accuracy 0.65) Logistic Regression (accuracy 0.75) correlation 0.8
Searchlight classifiers Idea 1 [Kriegeskorte 2002]: Examine ability to discriminate inside a small region –Train a classifier for every small region Idea 2: Use this for voxel selection, within the training set –Compute accuracy inside all searchlights –Rank voxels by the accuracy of their searchlights
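Both ideas can be sketched together: score a small classifier on the neighbourhood of every voxel, then rank voxels by that score. The version below is an illustration under stated assumptions, not the pipeline used in the talk. It uses a two-class GNB scored by leave-one-out accuracy, a cube-shaped radius-1 neighbourhood clipped at the volume edge, and synthetic data; the function names are hypothetical.

```python
import numpy as np

def gnb_loo_accuracy(X, y):
    """Leave-one-out accuracy of a two-class Gaussian Naive Bayes classifier."""
    n, correct = len(y), 0
    for i in range(n):
        train = np.arange(n) != i
        scores = []
        for c in (0, 1):
            Xc = X[train & (y == c)]
            mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-6   # variance floor (assumption)
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X[i] - mu) ** 2 / var).sum()
            scores.append(np.log(len(Xc) / train.sum()) + log_lik)
        correct += int(np.argmax(scores) == y[i])
    return correct / n

def searchlight_accuracy(data, y, radius=1):
    """data: (n_examples, nx, ny, nz). Returns an (nx, ny, nz) map of accuracies."""
    _, nx, ny, nz = data.shape
    acc = np.zeros((nx, ny, nz))
    for x in range(nx):
        for yy in range(ny):
            for z in range(nz):
                # Cube neighbourhood around the voxel, clipped at the volume boundary
                block = data[:, max(x - radius, 0):x + radius + 1,
                                max(yy - radius, 0):yy + radius + 1,
                                max(z - radius, 0):z + radius + 1]
                acc[x, yy, z] = gnb_loo_accuracy(block.reshape(len(y), -1), y)
    return acc

rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
data = rng.normal(size=(20, 4, 4, 4))
data[labels == 1, 1, 1, 1] += 10.0            # plant one strongly informative voxel
acc_map = searchlight_accuracy(data, labels)
ranked = np.argsort(acc_map.ravel())[::-1]    # flat voxel indices, best searchlight first
```

When the ranking is used for feature selection, the accuracy map should be computed on training data only, as Idea 2 states; otherwise the selected voxels leak information from the test set.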
Accuracy of single-voxel classifiers Accuracy at each voxel by itself
Accuracy of searchlights: Bayes classifier Accuracy at each voxel with a radius 1 searchlight (significant voxels FDR 0.01)
Accuracies of significant searchlights Accuracy at each significant searchlight [ ]
Voxel selection based on searchlights Conclusions: –GNB accuracy using searchlight-selected voxels ~80% –Locations identified are plausible: they include parahippocampal gyrus and pre/post central gyri –Similar results in accuracy/location for 3 other subjects “Spatial Searchlights for Feature Selection and Classification”, Francisco Pereira et al., in preparation
Word stimuli in word-picture study Dwellings: Castle, House, Hut, Apartment, Igloo. Tools: Drill, Saw, Screwdriver, Pliers, Hammer.
Picture stimuli were presented as white lines on black background
functional Magnetic Resonance Imaging (fMRI): ~1 mm resolution; ~2 images per sec.; 15,000 voxels/image; non-invasive, safe; measures Blood Oxygen Level Dependent (BOLD) response. [Figure: typical fMRI response to impulse of neural activity, time scale 10 sec]
General Linear Model Y = X h + ε, where Y is the observations (T x V), X is the design matrix, h holds the response signatures for all stimuli, and ε is Gaussian noise. The ‘design matrix’ X describes timing of processes (for [Dale, HBM 1999], this is the stimulus timing). HPMs correspond to making X an unobserved random variable