1
Mixed effects models for a hierarchical analysis of fMRI data, and Bayesian model selection applied to activation detection
Keith Worsley 1,2, Chuanhong Liao 1, John Aston 1,2,3, Jean-Baptiste Poline 4, Gary Duncan 5, Vali Petre 2, Frank Morales 6, Alan Evans 2, Ed George 7
1 Department of Mathematics and Statistics, McGill University; 2 Brain Imaging Centre, Montreal Neurological Institute; 3 Imperial College, London; 4 Service Hospitalier Frédéric Joliot, CEA, Orsay; 5 Centre de Recherche en Sciences Neurologiques, Université de Montréal; 6 Cuban Neuroscience Centre; 7 Wharton School
2
fMRI data: 120 scans, 3 scans each of hot, rest, warm, rest, hot, rest, …
T = (hot − warm effect) / S.d. ~ t with 110 df if there is no effect
4
Results from 4 runs on the same subject. [Figure: for Run 1–Run 4, the effect E_i ('_mag_ef', 0 to 1), its sd S_i ('_mag_sd', 0 to 0.2), and the T statistic E_i / S_i ('_mag_t', −5 to 5).]
5
MULTISTAT: mixed effects linear model for combining effects from different runs/sessions/subjects:
E_i = effect for run/session/subject i (from FMRILM)
S_i = standard error of the effect (from FMRILM)
Mixed effects model: E_i = covariates_i c + S_i WN_i^F + σ WN_i^R
- S_i WN_i^F: 'fixed effects' error, due to variability within the same run
- σ WN_i^R: random effect, due to variability from run to run
- covariates_i: usually just 1, but could add group, treatment, age, sex, ...
Unknown parameters: c and σ (the random effects sd).
6
MULTISTAT results for Run 1–Run 4: effect E_i ('_mag_ef'), sd S_i ('_mag_sd'), T statistic E_i / S_i ('_mag_t'). Problem: with 4 runs there are only 3 df for the random effects sd, so the sd is very noisy, the threshold is T > 15.96 for P < 0.05 (corrected), and no response is detected.
7
REML estimation using the EM algorithm
Slow to converge (10 iterations by default). Stable (keeps the estimate σ̂² > 0), but σ̂² is biased when σ² (the random effect variance) is small, so re-parameterize the variance model:
Var(E_i) = S_i² + σ²
         = (S_i² − min_j S_j²) + (σ² + min_j S_j²)
         = S_i*² + σ*²
σ̂² = σ̂*² − min_j S_j² (less biased estimate)
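To make the re-parameterization concrete, here is a minimal per-voxel sketch in Python (an assumption on my part: fmristat's MULTISTAT is MATLAB and fits by REML, whereas this uses a plain ML-style EM, and every name in it is illustrative):

```python
import numpy as np

def mixed_effects_em(E, S, X, n_iter=10):
    """EM-style estimate of the random effects variance sigma^2 in
    E_i ~ N(X_i' beta, S_i^2 + sigma^2), using the re-parameterization
    S*_i^2 = S_i^2 - min_j S_j^2 to reduce bias when sigma^2 is small.
    Illustrative ML version; MULTISTAT itself uses REML."""
    E, S2 = np.asarray(E, float), np.asarray(S, float) ** 2
    S2_min = S2.min()
    S2_star = S2 - S2_min                    # re-parameterized fixed effects variances
    sigma2_star = max(np.var(E), S2_min)     # crude positive starting value for sigma*^2
    for _ in range(n_iter):                  # 10 iterations, as in the slide
        w = 1.0 / (S2_star + sigma2_star)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], E * sw, rcond=None)[0]  # weighted LS
        r = E - X @ beta
        shrink = sigma2_star / (S2_star + sigma2_star)
        m = shrink * r                       # E-step: posterior mean of the random part
        v = shrink * S2_star                 # E-step: posterior variance of the random part
        sigma2_star = np.mean(m ** 2 + v)    # M-step: update sigma*^2
    return beta, sigma2_star - S2_min        # sigma^2 = sigma*^2 - min_j S_j^2

# toy usage: 4 runs, intercept-only design (covariates = 1)
E = np.array([1.0, 0.8, 1.3, 0.9])
S = np.array([0.15, 0.20, 0.10, 0.12])
beta, sigma2 = mixed_effects_em(E, S, np.ones((4, 1)))
print(beta, sigma2)
```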
8
Solution: spatial regularization of the sd
Basic idea: increase the df by spatial smoothing (local pooling) of the sd. We can't smooth the random effects sd directly, since it has too much anatomical structure. Instead smooth the ratio random effects sd / fixed effects sd, which removes the anatomical structure before smoothing:
sd = smooth( random effects sd / fixed effects sd ) × fixed effects sd
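A short sketch of this regularization step, assuming the sd maps are 3-D numpy arrays and using a Gaussian smoother (names and the kernel are illustrative; fmristat chooses the FWHM to reach a target df, as shown two slides below):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def regularize_sd(sd_random, sd_fixed, fwhm_mm, voxel_mm):
    """Smooth the random/fixed sd ratio (the anatomy largely cancels in the ratio),
    then multiply the fixed effects sd back in to get the regularized sd."""
    ratio = np.ones_like(sd_fixed)
    np.divide(sd_random, sd_fixed, out=ratio, where=sd_fixed > 0)
    sigma_vox = fwhm_mm / (2 * np.sqrt(2 * np.log(2))) / voxel_mm  # FWHM -> Gaussian sd, in voxels
    return gaussian_filter(ratio, sigma_vox) * sd_fixed
```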
9
[Figure: the random effects sd (3 df) is divided by the fixed effects sd (440 df) to give the sd ratio ('_sdratio', random effect ratio ~1.3); the smoothed ratio is then multiplied by the fixed effects sd (the average S_i) to give the mixed effects sd (~100 df). Grey scales: sd 0–0.2, ratio 0.5–1.5.]
10
Effective df depends on smoothing:
df_ratio = df_random (2 FWHM_ratio² / FWHM_data² + 1)^(3/2)
1 / df_eff = 1 / df_ratio + 1 / df_fixed
e.g. df_random = 3, df_fixed = 4 × 110 = 440, FWHM_data = 8 mm. [Figure: df_eff vs FWHM_ratio from 0 to infinity; a random effects analysis (FWHM_ratio = 0) has df_eff = 3, a fixed effects analysis (FWHM_ratio = infinity) has df_eff = 440; the target of 100 df is reached at FWHM_ratio = 19 mm.]
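The formula above can be inverted directly to find the smoothing needed for a target df; a small sketch (the formula is the reconstruction shown above, and the function name is illustrative):

```python
import numpy as np

def fwhm_for_target_df(df_random, df_fixed, fwhm_data, df_target):
    """Invert df_ratio = df_random * (2*(FWHM_ratio/FWHM_data)^2 + 1)^(3/2)
    and 1/df_eff = 1/df_ratio + 1/df_fixed to find the sd-ratio smoothing FWHM
    that gives the target effective df."""
    df_ratio = 1.0 / (1.0 / df_target - 1.0 / df_fixed)
    r2 = ((df_ratio / df_random) ** (2.0 / 3.0) - 1.0) / 2.0
    return fwhm_data * np.sqrt(r2)

# e.g. 4 runs (df_random = 3), df_fixed = 4*110 = 440, 8 mm data smoothness:
print(fwhm_for_target_df(3, 440, 8.0, 100))   # ~19 mm, as on the slide
```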
11
Final result: 19 mm smoothing of the sd ratio, 100 effective df. [Figure: for Run 1–Run 4 and MULTISTAT, effect ('_mag_ef' / '_ef'), sd ('_mag_sd' / '_sd'), T statistic ('_mag_t' / '_t').] The sd is much less noisy, the threshold drops to T > 4.93 for P < 0.05 (corrected), and now we can detect a response!
12
Estimating the delay of the response
Delay, or latency to the peak of the HRF, is approximated by a linear combination of two optimally chosen basis functions:
HRF(t + shift) ≈ basis1(t) w1(shift) + basis2(t) w2(shift)
Convolve the bases with the stimulus, then add them to the linear model. [Figure: HRF, basis1 and basis2 over t = −5 to 25 seconds; the shift of the HRF determines the delay.]
13
Fit the linear model and estimate w1 and w2. Equate w2 / w1 to its estimate, then solve for the shift (Henson et al., 2002). [Figure: w1, w2 and w2 / w1 as functions of the shift, −5 to 5 seconds.]
To reduce bias when the magnitude is small, use shift / (1 + 1/T²), where T = w1 / Sd(w1) is the T statistic for the magnitude; this shrinks the shift to 0 where there is little evidence for a response.
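A rough numerical sketch of the procedure (an approximation for illustration only: the second basis here is just the derivative of the HRF rather than the optimally chosen bases of Liao et al., the HRF coefficients are made up, and all names are assumptions):

```python
import numpy as np
from scipy.stats import gamma

def hrf(tt):
    # an illustrative difference of two gamma densities
    return gamma.pdf(tt, 6) - 0.35 * gamma.pdf(tt, 16)

t = np.arange(0, 30, 0.1)
basis1 = hrf(t)
basis2 = np.gradient(basis1, t)                   # crude second basis: d(HRF)/dt
B = np.column_stack([basis1, basis2])

# lookup curve: w2/w1 as a function of the shift, from projecting shifted HRFs on the bases
shifts = np.linspace(-5, 5, 101)
ratio = []
for d in shifts:
    w = np.linalg.lstsq(B, hrf(t + d), rcond=None)[0]
    ratio.append(w[1] / w[0])
ratio = np.array(ratio)
order = np.argsort(ratio)                         # np.interp needs an increasing grid

def estimate_shift(w1, w2, sd_w1):
    """Invert w2/w1 to a shift, then shrink towards 0 when the magnitude T is small."""
    d = np.interp(w2 / w1, ratio[order], shifts[order])
    T = w1 / sd_w1
    return d / (1.0 + 1.0 / T ** 2)               # the bias reduction from the slide

print(estimate_shift(w1=2.0, w2=1.0, sd_w1=0.4))  # strong response: little shrinkage
print(estimate_shift(w1=0.5, w2=0.25, sd_w1=0.4)) # weak response: shrunk towards 0
```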
14
Shift of the hot stimulus. [Figure: T stat for magnitude ('_mag_t'), shift in seconds ('_del_ef'), sd of the shift in seconds ('_del_sd'), T stat for the shift ('_del_t').]
15
Shift of the hot stimulus. Where the T stat for magnitude ('_mag_t') is > 4, the shift ('_del_ef') is ~1 sec with sd ('_del_sd') of +/- 0.5 sec, and the T stat for the shift ('_del_t') is ~2.
16
Combining shifts of the hot stimulus (contours are T stat for magnitude > 4): per-run '_del_ef', '_del_sd', '_del_t' are combined into '_ef', '_sd', '_t'.
17
Shift of the hot stimulus: shift in seconds ('_del_ef') shown where the T stat for magnitude ('_mag_t') is > 4.93.
19
FWHM – the local smoothness of the noise
FWHM = (2 log 2)^(1/2) × voxel size / (1 − correlation)^(1/2)
(If the noise is modeled as white noise smoothed with a Gaussian kernel, this would be its FWHM.)
Resels = volume / FWHM³
P-values depend on resels. [Figure: P-value of a local maximum T = 4.5 as a function of the resels of the search volume (0–1000); P-value of a cluster above t = 3.0 as a function of the resels of the cluster (0–2), for a search volume of 500 resels.]
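In code, the smoothness and resel count are one-liners (a sketch based on the formulas above; the numbers in the example are made up):

```python
import numpy as np

def fwhm_from_correlation(voxel_size, rho):
    """FWHM of the noise, from the correlation between neighbouring residuals."""
    return np.sqrt(2 * np.log(2)) * voxel_size / np.sqrt(1 - rho)

def resels(volume, fwhm):
    """Resolution elements (resels) of a search volume."""
    return volume / fwhm ** 3

fwhm = fwhm_from_correlation(voxel_size=3.0, rho=0.8)   # ~7.9 mm
print(fwhm, resels(volume=1.0e6, fwhm=fwhm))            # volume in mm^3
```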
20
STAT_SUMMARY threshold selection: at low FWHM use Bonferroni, at high FWHM use random field theory, and in between use Discrete Local Maxima (DLM). [Figure: Gaussianized threshold vs FWHM of the smoothing kernel (0–10 voxels) for the true threshold, Bonferroni (N = resels), random field theory and DLM, for Gaussian, T (20 df) and T (10 df) statistics.]
21
The same comparison in terms of P-values: DLM can halve the P-value when the FWHM is ~3 voxels. [Figure: P-value vs FWHM of the smoothing kernel (0–10 voxels) for the true value, Bonferroni (N = resels), random field theory and DLM, for Gaussian, T (20 df) and T (10 df) statistics.]
23
STAT_SUMMARY example: single run, hot − warm. [Figure: one peak is detected by DLM but not by Bonferroni (BON) or random field theory (RFT); another is detected by BON and DLM but not by RFT.]
24
Bayesian Model Selection (thanks to Ed George)
The Z-statistic SPM at voxel i is Z_i ~ N(m_i, 1), i = 1, …, n. Most of the m_i are zero (unactivated voxels) and a few are non-zero (activated voxels), but we do not know which voxels are activated, and by how much (m_i). This is a model selection problem, where we add an extra model parameter (m_i) for the mean of each activated voxel.
Simple Bayesian set-up:
- each voxel is independently active with probability p
- the activation is itself drawn independently from a Gaussian distribution: m_i ~ N(0, c)
The hyperparameter p controls the expected proportion of activated voxels, and c controls their expected activation.
25
Surprise! This prior setup is related to the canonical penalized sum-of-squares criterion
A_F = Σ_{activated voxels} Z_i² − F q
where
- q is the number of activated voxels, and
- F is a fixed penalty for adding an activated voxel.
Popular model selection criteria simply entail maximizing A_F for a particular choice of F, which is equivalent to thresholding the image at √F.
Some choices of F:
- F = 0: all voxels activated
- F = 2: Mallows' C_p and AIC
- F = log n: BIC
- F = 2 log n: RIC
- P(Z > √F) = 0.05/n: Bonferroni (almost the same as RIC!)
26
The Bayesian relationship with A_F is obtained by re-expressing the posterior of the activated voxels, given the data:
P(activated voxels | Z's) ∝ exp( [c / (2(1+c))] A_F )
where F = [(1+c)/c] {2 log[(1−p)/p] + log(1+c)}.
Since p and c control the expected number and size of the activations, the dependence of F on p and c provides an implicit connection between the penalty F and the sorts of models for which its value may be appropriate.
27
The awful truth: p and c are unknown.
Empirical Bayes idea: use the p and c that maximize the marginal likelihood, which simplifies to
L(p, c | Z's) ∝ Π_i [ (1−p) exp(−Z_i²/2) + p (1+c)^(−1/2) exp(−Z_i²/(2(1+c))) ]
This is identical to fitting a classic mixture model with
- probability (1−p) that Z_i ~ N(0, 1)
- probability p that Z_i ~ N(0, 1+c)
- √F is the value of Z where the two components are equal.
Using these estimated values of p and c gives an adaptive penalty F, or equivalently a threshold √F, that is implicitly based on the SPM. All we have to do is fit the mixture model … but does it work?
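A small sketch of this empirical Bayes step: an EM fit of the two-component mixture followed by the adaptive threshold √F (the starting values, iteration count and the simulated data are all made up for illustration):

```python
import numpy as np

def fit_mixture(Z, n_iter=200):
    """Fit (1-p) N(0,1) + p N(0,1+c) to the SPM by EM, then return the
    penalty F; the adaptive threshold is sqrt(F) (if F > 0)."""
    Z = np.asarray(Z, float)
    p, c = 0.1, 4.0                                       # crude starting values
    for _ in range(n_iter):
        f0 = (1 - p) * np.exp(-Z ** 2 / 2)
        f1 = p * (1 + c) ** -0.5 * np.exp(-Z ** 2 / (2 * (1 + c)))
        r = f1 / (f0 + f1)                                # P(activated | Z_i)
        p = r.mean()
        c = max((r * Z ** 2).sum() / r.sum() - 1, 1e-6)   # since Var(Z | activated) = 1 + c
    F = (1 + c) / c * (2 * np.log((1 - p) / p) + np.log(1 + c))
    return p, c, F

# toy data: 95% null voxels and 5% activated voxels with c = 9
rng = np.random.default_rng(0)
Z = np.concatenate([rng.standard_normal(19000),
                    rng.standard_normal(1000) * np.sqrt(1 + 9)])
p, c, F = fit_mixture(Z)
print(p, c, np.sqrt(F) if F > 0 else "no finite threshold: all voxels activated")
```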
28
Same data as before (hot − warm stimulus, four runs):
- proportion of activated voxels p = 0.57
- variance of activated voxels c = 5.8 (sd = 2.4)
- penalty F = 1.59 (a bit like AIC)
- threshold √F = 1.26 (?) seems a bit low …
[Figure: histogram of the SPM (n = 30786) with the fitted mixture: 43% unactivated voxels N(0, 1) and 57% activated voxels with variance 5.8; the √F threshold is where the two components are equal. For comparison: AIC √F = 2, FDR (0.05) √F = 2.67, BIC √F = 3.21, RIC √F = 4.55, Bonferroni (0.05) √F = 4.66.]
29
Same data, but a single run:
- proportion of activated voxels p = 0.80
- variance of activated voxels c = 1.55
- penalty F = −3.02 (?)
- so all voxels are activated!!! What is going on?
[Figure: histogram of the SPM (n = 30768) with the fitted mixture: 20% unactivated voxels N(0, 1) and 80% activated voxels with variance 1.55; the two components are never equal, so there is no finite threshold. For comparison: AIC √F = 2, FDR (0.05) √F = 2.67, BIC √F = 3.21, RIC √F = 4.55, Bonferroni (0.05) √F = 4.66.]
30
Difference with SPM
- FMRISTAT's second level analysis is univariate (it can only lead to a T test); SPM's is multivariate (it can lead to an F test). To do this, SPM has to assume various parameters are global.
- It is not clear that allowing for correlated contrasts at the second level improves inference for a single contrast, i.e. gives better T statistics (in the end, most statistics of interest are T statistics, not F); in fact, if the correlations and models are equal across subjects, nothing is gained.
- FMRISTAT uses spatial information to boost the df; SPM is mass univariate.
31
T > 4.86
32
T > 4.93 (P < 0.05, corrected)
33
T > 4.86 and T > 4.93 (P < 0.05, corrected)
34
T > 4.86
35
Conjunction: minimum of the T_i > threshold.
Minimum of T_i ('_conj'): for P = 0.05, threshold = 1.82.
Average of T_i ('_mag_t'): for P = 0.05, threshold = 4.93.
Efficiency = 82%.
36
Efficiency: optimum block design. [Figure: sd of the hot stimulus effect and of the hot − warm effect, for magnitude and delay, as functions of stimulus duration and interstimulus interval (5–20 s each); X marks the optimum design in each panel, and very short durations/intervals give not enough signal.]
37
Efficiency: optimum event design. [Figure: sd of the effect (seconds for delays) vs average time between events (5–20 s), for uniform, random and concentrated event timing; solid lines are magnitudes, dotted lines are delays; very short intervals give not enough signal.]
38
How many subjects? The largest portion of the variance comes from the last stage, i.e. combining over subjects:
Var = sd_run² / (n_run n_sess n_subj) + sd_sess² / (n_sess n_subj) + sd_subj² / n_subj
If you want to optimize total scanner time, take more subjects. What you do at the early stages doesn't matter very much!
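A toy illustration of the formula (the sd values and the two allocations are made up; only the comparison matters):

```python
def var_of_group_effect(sd_run, sd_sess, sd_subj, n_run, n_sess, n_subj):
    """Variance of the final (over-subjects) effect estimate."""
    return (sd_run ** 2 / (n_run * n_sess * n_subj)
            + sd_sess ** 2 / (n_sess * n_subj)
            + sd_subj ** 2 / n_subj)

# the same total of 16 runs of scanner time, spent two ways:
print(var_of_group_effect(1, 1, 1, n_run=4, n_sess=4, n_subj=1))    # ~1.31: few subjects
print(var_of_group_effect(1, 1, 1, n_run=1, n_sess=1, n_subj=16))   # ~0.19: many subjects
```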
39
Comparison of SPM'99 and fmristat:
- Different slice acquisition times: SPM'99 adds a temporal derivative; fmristat shifts the model.
- Drift removal: SPM'99 uses low frequency cosines (flat at the ends); fmristat uses splines (free at the ends).
- Temporal correlation: SPM'99 uses AR(1) with a global parameter, no bias reduction necessary; fmristat uses AR(p) with voxel-wise parameters and bias reduction.
- Estimation of effects: SPM'99 band-pass filters, then least squares, then corrects for temporal correlation; fmristat pre-whitens, then least squares (no further corrections needed).
- Rationale: SPM'99 is more robust but has lower df; fmristat is more accurate with higher df.
- Random effects: SPM'99 has no regularization, low df, no conjunctions; fmristat has regularization, high df, conjunctions.
- Map of the delay: SPM'99 no; fmristat yes.
40
References
http://www.math.mcgill.ca/keith/fmristat
Worsley et al. (2002). A general statistical analysis for fMRI data. NeuroImage, 15:1-15.
Liao et al. (2002). Estimating the delay of the response in fMRI data. NeuroImage, 16:593-606.
41
Functional connectivity
Measured by the correlation between residuals at every pair of voxels (6D data!). Local maxima are larger than all 12 neighbours; the P-value can be calculated using random field theory. Good at detecting focal connectivity, but PCA of residuals × voxels is better at detecting large regions of co-correlated voxels. [Figure: scatter plots of voxel 1 vs voxel 2 illustrating 'activation only' vs 'correlation only'.]
42
First principal component > threshold; |correlations| > 0.7, P < 10^-10 (corrected).
43
False Discovery Rate (FDR)
Benjamini and Hochberg (1995), Journal of the Royal Statistical Society; Benjamini and Yekutieli (2001), Annals of Statistics; Genovese et al. (2001), NeuroImage.
FDR controls the expected proportion of false positives amongst the discoveries, whereas Bonferroni / random field theory controls the probability of any false positives, and no correction (uncorrected thresholding) controls the expected proportion of false positives in the whole volume.
44
[Figure: signal + Gaussian white noise, thresholded three ways: P < 0.05 uncorrected (T > 1.64, 5% of the volume is false +), FDR 0.05 (T > 2.82, 5% of discoveries are false +), and P < 0.05 corrected (T > 4.22, 5% probability of any false +); true + and false + voxels are marked against the underlying signal.]
45
Comparison of thresholds
FDR depends on the ordered P-values P_(1) < P_(2) < … < P_(n): to control the FDR at α, find K = max{ i : P_(i) < (i/n) α } and threshold the P-values at P_(K). The resulting T threshold depends on the proportion of true positives:
  Proportion of true +:  1     0.1   0.01  0.001  0.0001
  Threshold T:           1.64  2.56  3.28  3.88   4.41
Bonferroni thresholds the P-values at α/n, so the threshold depends on the number of voxels:
  Number of voxels:      1     10    100   1000   10000
  Threshold T:           1.64  2.58  3.29  3.89   4.42
Random field theory depends on the number of resels (resels = volume / FWHM³):
  Number of resels:      0     1     10    100    1000
  Threshold T:           1.64  2.82  3.46  4.09   4.65
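A minimal sketch of the Benjamini–Hochberg step-up rule described above (the function name is an assumption; fmristat's own FDR code may differ):

```python
import numpy as np

def fdr_p_threshold(p_values, alpha=0.05):
    """Largest P_(K) with P_(K) <= (K/n) * alpha; threshold the map at this
    P-value (returns None if no voxel survives)."""
    p = np.sort(np.asarray(p_values, float))
    n = p.size
    below = p <= np.arange(1, n + 1) / n * alpha
    return p[below].max() if below.any() else None
```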
46
T > 1.64: P < 0.05 uncorrected, 5% of the volume is false +
47
T > 2.67: FDR 0.05, 5% of discoveries are false +
48
T > 4.93: P < 0.05 corrected, 5% probability of any false +
50
PCA_IMAGE: PCA of time × space
Component 1: exclude the first frames. Component 2: drift. Component 3: long-range correlation or an anatomical effect, removed by converting to % of brain. Component 4: signal?
51
FMRILM: fits a linear model for fMRI time series with AR(p) errors
Linear model: Y_t = (stimulus_t * HRF) b + drift_t c + error_t
AR(p) errors: error_t = a_1 error_(t−1) + … + a_p error_(t−p) + s WN_t
Unknown parameters: b, c, a_1, …, a_p, s.
53
FMRIDESIGN example: pain perception. [Figure: alternating hot and warm stimuli separated by rest (9 seconds each); the hemodynamic response function, modeled as the difference of two gamma densities; and the responses = stimuli * HRF, sampled every 3 seconds, over 0–350 seconds.]
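A sketch of building such a design in Python (the gamma-density coefficients and the time grid are assumptions; FMRIDESIGN itself is a MATLAB function with its own HRF parameters):

```python
import numpy as np
from scipy.stats import gamma

TR, n_scans, dt = 3.0, 120, 0.1
t = np.arange(0, n_scans * TR, dt)                 # fine time grid

def hrf(tt):
    # an illustrative difference of two gamma densities
    return gamma.pdf(tt, 6) - 0.35 * gamma.pdf(tt, 16)

def boxcar(onsets, duration, tt):
    return sum(((tt >= on) & (tt < on + duration)).astype(float) for on in onsets)

# hot, rest, warm, rest, ... in 9 s blocks (period 36 s)
hot = boxcar(np.arange(0, n_scans * TR, 36), 9, t)
warm = boxcar(np.arange(18, n_scans * TR, 36), 9, t)

# responses = stimuli * HRF, sampled every 3 seconds
h = hrf(np.arange(0, 30, dt)) * dt
step = int(TR / dt)
X = np.column_stack([np.convolve(hot, h)[:t.size][::step],
                     np.convolve(warm, h)[:t.size][::step]])
print(X.shape)                                     # (120, 2) regressors for hot and warm
```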
55
FMRILM first step: estimate the autocorrelation
AR(1) model: error_t = a_1 error_(t−1) + s WN_t. Fit the linear model using least squares, take residuals error_t = Y_t − fitted Y_t, and set â_1 = correlation(error_t, error_(t−1)). Estimating the error_t's changes their correlation structure slightly, so â_1 is slightly biased (a bias of about −0.05 where it should be ~0); FMRILM bias corrects, then smooths spatially (12.4 mm). which_stats = '_cor'. [Figure: raw, smoothed (12.4 mm) and bias corrected autocorrelation maps.]
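For a single voxel's time series, the first step is only a few lines (a sketch; the bias correction and the spatial smoothing of the autocorrelation map are omitted):

```python
import numpy as np

def lag1_autocorrelation(Y, X):
    """Least squares fit of the linear model, then the lag-1 autocorrelation
    of the residuals (the raw, slightly biased estimate a1_hat)."""
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ beta
    return np.corrcoef(resid[:-1], resid[1:])[0, 1]
```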
56
Effective df depends on smoothing the autocorrelation
Variability in acor lowers the df, and the df depends on the contrast; smoothing acor brings the df back up:
df_acor = df_residual (2 FWHM_acor² / FWHM_data² + 1)^(3/2)
1 / df_eff = 1 / df_residual + 2 acor(contrast of data)² / df_acor
e.g. residual df = 110, FWHM_data = 8.79 mm, acor of the contrast of the data = 0.79 and 0.61 for the two contrasts, target = 100 df, reached with acor smoothing of 10.3 mm and 12.4 mm. [Figure: df_eff vs FWHM_acor (0–30 mm) for the hot and the hot − warm stimuli.]
57
FMRILM second step: refit the linear model
Pre-whiten: Y_t* = Y_t − â_1 Y_(t−1), then fit using least squares. which_stats = '_mag_ef _mag_sd _mag_t'. [Figure: hot − warm effect in % ('_mag_ef', −0.5 to 1), sd of the effect in % ('_mag_sd', 0 to 0.25), and T = effect / sd with 110 df ('_mag_t', −6 to 6), thresholded at T > 4.93 (P < 0.05, corrected).]
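And the second step, again for one voxel (a sketch under the same assumptions; the effective df correction described on the previous slide is not shown):

```python
import numpy as np

def prewhiten_and_refit(Y, X, a1):
    """Whiten with Y*_t = Y_t - a1 * Y_{t-1} (same transform on the design),
    then ordinary least squares; returns a function giving effect, sd, T for a contrast."""
    Ys = Y[1:] - a1 * Y[:-1]
    Xs = X[1:] - a1 * X[:-1]
    beta = np.linalg.lstsq(Xs, Ys, rcond=None)[0]
    df = Xs.shape[0] - Xs.shape[1]
    sigma2 = ((Ys - Xs @ beta) ** 2).sum() / df
    XtX_inv = np.linalg.pinv(Xs.T @ Xs)

    def t_stat(contrast):                          # e.g. hot - warm: contrast = [1, -1, 0, ...]
        effect = contrast @ beta
        sd = np.sqrt(sigma2 * contrast @ XtX_inv @ contrast)
        return effect, sd, effect / sd
    return t_stat
```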
58
Higher order AR model? Try AR(3) ('_AR'). [Figure: maps of a_1, a_2, a_3 and of the T statistic under AR(1), AR(2) and AR(3).] The higher order terms have little effect on the T statistics, so AR(1) seems to be adequate. Ignoring the correlation entirely biases T up by ~12% → more false positives.
59
Non-isotropic data (spatially varying FWHM)
fMRI data is smoother in GM than in WM; VBM data is highly non-isotropic. This has little effect on P-values for local maxima (use the 'average' FWHM inside the search region), but it has a big effect on P-values for spatial extents: smooth regions → big clusters, rough regions → small clusters. So replace the cluster volume by cluster resels = volume / FWHM³.
60
[Figure ('_fwhm'): FWHM (mm) of the scans (110 df), FWHM (mm) of the effects (3 df), FWHM of the effects (smoothed), 0–20 mm, and the ratio of effects to scans FWHM (smoothed), 0.5–1.5. Two example clusters: resels = 1.90, P = 0.007; resels = 0.57, P = 0.387.]