Statistics – Understanding your findings (Chris Rorden)

Presentation transcript:

1 Statistics – Understanding your findings
Chris Rorden
1. Modeling data: signal, error and covariates; statistical contrasts
2. Thresholding results: statistical power and statistical errors; the multiple comparison problem; familywise error and Bonferroni thresholding; permutation thresholding; false discovery rate thresholding
Implications: null results are uninterpretable

2 The fMRI signal
Last lecture: we predict that areas involved with a task will become brighter (after a delay). Consider a task with 12 sec on and 12 sec rest. Our expected (model) signal should look like this: (Figure: expected model signal.)
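
Since the figure is lost, here is a minimal sketch of what such a model could look like for the 12-sec on, 12-sec rest task. The repetition time of 2 sec and the function name `boxcar_model` are assumptions for illustration, and the sketch ignores the hemodynamic delay mentioned above.

```python
import numpy as np

def boxcar_model(n_volumes, tr=2.0, on_sec=12.0, off_sec=12.0):
    """Return a 0/1 block regressor: on_sec of task followed by off_sec of rest."""
    t = np.arange(n_volumes) * tr      # acquisition time of each volume (sec)
    period = on_sec + off_sec          # one full task + rest cycle
    return ((t % period) < on_sec).astype(float)

# Two full cycles: 6 task volumes, 6 rest volumes, repeated (TR = 2 sec is assumed).
model = boxcar_model(n_volumes=24)
print(model)
```

In a real analysis this block function would usually be shifted or convolved to account for the delayed response before being compared with the data.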

3 Calculating statistics
Strong predictor: model predicts virtually all the variability in the observed data.
Mediocre predictor: weaker correlation between model and observed data.

4 Calculating statistics
Does our model reliably explain observed data?
a. Top: model good predictor (strong signal, little noise)
b. Middle: mediocre predictor (strong signal, lots of noise)
c. Bottom: mediocre predictor (little signal, little noise)
Statistical probability is based on the ratio of effect amplitude to error (signal/noise).

5 General Linear Model
The observed data are composed of a signal that is predicted by our model and unexplained noise (Boynton et al., 1996).
Measured data = Amplitude (solve for) × Design model + Noise
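
The equation above can be sketched with ordinary least squares. This is only an illustration on simulated data: the block regressor, the true amplitude of 2.0, the baseline of 100 and the noise level are all made-up values, and real packages add steps (filtering, autocorrelation modeling) that are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 96
design = (np.arange(n) % 12 < 6).astype(float)   # hypothetical block regressor
X = np.column_stack([design, np.ones(n)])        # design model plus constant baseline
true_amplitude = 2.0                             # assumed value for the simulation
y = X @ np.array([true_amplitude, 100.0]) + rng.normal(0.0, 1.0, n)  # measured data

# Solve for the amplitudes that best explain the measured data.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Whatever the model does not explain is treated as noise.
residuals = y - X @ beta
dof = n - np.linalg.matrix_rank(X)
sigma2 = residuals @ residuals / dof

# t = estimated amplitude divided by its standard error (signal / noise).
c = np.array([1.0, 0.0])
se = np.sqrt(sigma2 * (c @ np.linalg.pinv(X.T @ X) @ c))
t_stat = (c @ beta) / se
print(f"amplitude={beta[0]:.2f}  t={t_stat:.1f}")
```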

6 What is your design model?
Model is predicted effect. Consider a block design experiment:
–Three conditions, each for 11.2 sec
1. Press left index finger when you see ←
2. Press right index finger when you see →
3. Do nothing when you see …
(Figure axes: intensity, time.)

7 FSL/SPM display of model
Analysis programs display the model as a grid.
Each column is a regressor
–e.g. left / right arrows.
Each row is a volume of data
–for within-subject fMRI = time
Brightness of row is model’s predicted intensity.
(Figure axes: intensity, time.)

8 Statistical Contrasts
fMRI inference is based on contrasts. Consider a study with left arrow and right arrow as regressors:
1. [1 0] identifies activation correlated with left arrows: we could expect visual and motor effects.
2. [1 -1] identifies regions that show more response to left arrows than right arrows. Visual effects should be similar, so this should selectively identify contralateral motor cortex.
Choice of contrast is crucial to inference.

9 Statistical Contrasts
The t-test is one-tailed, the F-test is two-tailed.
–t-test: [1 -1] is mutually exclusive of [-1 1]: left>right vs right>left.
–F-test: [1 -1] = [-1 1]: difference between left and right.
Choice of test is crucial to inference.
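
To make the one-tailed versus two-tailed point concrete, here is a sketch on simulated data. The helper `contrast_t`, the block timing and the response sizes are assumptions for illustration; the contrasts carry a third zero only because this toy model also includes a constant baseline column.

```python
import numpy as np

def contrast_t(X, y, c):
    """t-statistic for contrast vector c in the model y = X @ beta + noise."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof
    var_c = sigma2 * (c @ np.linalg.pinv(X.T @ X) @ c)
    return (c @ beta) / np.sqrt(var_c)

rng = np.random.default_rng(1)
n = 120
phase = np.arange(n) % 30
left = (phase < 10).astype(float)                      # hypothetical left-arrow blocks
right = ((phase >= 10) & (phase < 20)).astype(float)   # hypothetical right-arrow blocks
X = np.column_stack([left, right, np.ones(n)])
y = 3.0 * left + 1.0 * right + rng.normal(0.0, 1.0, n)  # voxel prefers left

t_left_gt_right = contrast_t(X, y, np.array([1.0, -1.0, 0.0]))
t_right_gt_left = contrast_t(X, y, np.array([-1.0, 1.0, 0.0]))
f_left_vs_right = t_left_gt_right ** 2   # F for a single contrast row equals t squared

print(t_left_gt_right, t_right_gt_left, f_left_vs_right)
```

Flipping the contrast flips the sign of t, so only one of the two directional tests can be positive, while the squared value used for F is the same either way.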

10 How many regressors?
We collected data during a block design, where the participant completed 3 tasks:
–Left hand movement
–Right hand movement
–Rest
We are only interested in the brain areas involved with left hand movement. Should we include uninteresting right hand movement as a regressor in our statistical model?
–I.e. is a [1] analysis the same as a [1 0]?
–Is a [1 0] analysis identical, better, worse or different from a [1] analysis?

11 Meaningful regressors decrease noise Meaningful regressors can explain some of the variability. Adding a meaningful regressor can reduce the unexplained noise from our contrast.

12 Single factor…
Consider a test to see how well height predicts weight.
t = Explained variance / Unexplained variance
Small t-score: height only weakly predicts weight.
High t-score: height strongly predicts weight.
(Figure labels: Weight, Height.)

13 Adding a second factor…
How does an additional factor influence our test? E.g. we can add waist diameter as a regressor. Does this regressor influence the t-test regarding how well height predicts weight? Consider the ratio of cyan to green.
Increased t: waist explains a portion of weight not predicted by height.
Decreased t: waist explains a portion of weight predicted by height.
(Figure labels: Weight, Height, Waist.)

14 Regressors and statistics
Our analysis identifies three classes of variability:
1. Signal: predicted effect of interest
2. Noise (aka error): unexplained variance
3. Covariates: predicted effects that are not relevant.
–E.g. regressors with a weight of zero
Statistical significance is the ratio: t = Signal / Noise
Covariates will
–Improve sensitivity if they reduce error (explain otherwise unexplained variance).
–Reduce sensitivity if they reduce signal (explain variance that is also predicted by our effect of interest).
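
Both effects of a covariate can be shown with simulated height, waist and weight values. Everything here is invented for illustration (sample size, effect sizes, the helper `t_for_first_regressor`); it is a sketch of the idea, not an analysis of real data.

```python
import numpy as np

def t_for_first_regressor(X, y):
    """t-statistic for the first column of X in an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.pinv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])

rng = np.random.default_rng(0)
n = 100
ones = np.ones(n)
height = rng.normal(size=n)
noise = rng.normal(size=n)

# Case A: waist is unrelated to height but explains otherwise unexplained weight.
waist_a = rng.normal(size=n)
weight_a = 1.0 * height + 3.0 * waist_a + noise
t_a_without = t_for_first_regressor(np.column_stack([height, ones]), weight_a)
t_a_with = t_for_first_regressor(np.column_stack([height, waist_a, ones]), weight_a)

# Case B: waist is almost a copy of height, so it competes for the same variance.
waist_b = height + 0.1 * rng.normal(size=n)
weight_b = 1.0 * height + noise
t_b_without = t_for_first_regressor(np.column_stack([height, ones]), weight_b)
t_b_with = t_for_first_regressor(np.column_stack([height, waist_b, ones]), weight_b)

print(f"independent covariate: t {t_a_without:.1f} -> {t_a_with:.1f}")
print(f"correlated covariate:  t {t_b_without:.1f} -> {t_b_with:.1f}")
```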

15 Correlated regressors decrease signal If a regressor is strongly correlated with our effect, it can reduce the residual signal –Our signal is excluded as the regressor explains this variability. –Example: responses highly correlated to visual stimuli

16 Summary
Regressors should be orthogonal
–Each regressor describes independent variance.
–Variance should not be explained by more than one regressor.
E.g. we will see that including temporal derivatives as regressors tends to help event related designs (temporal processing lecture).

17 Inferring effect size from probability maps
The claim “the parietal lobe is more active than the frontal lobe” may be wrong. Statistical significance is based on amplitude relative to error. Instead: “the trend is for the parietal activity to be more reliable than in the frontal lobe.” If you want to infer effect size, examine percent signal change, not statistical significance.

18 Statistical thresholding
E.g. erythropoietin (EPO) doping in athletes
–In endurance athletes, EPO improves performance ~10%
–Races are often won by less than 1%
–Without testing, athletes are forced to dope to be competitive
–Dangers: carcinogenic and can cause heart attacks
–Therefore: measure hematocrit level to identify drug users.
–Problem: hematocrit levels vary even in people who are not doping.
If we set the threshold too low, we will accuse innocent people (high rate of false alarms). If we set the threshold too high, we will fail to detect dopers (high rate of misses).
(Figure axes: hematocrit, 30% to 50%.)

19 Alpha level (α)
Statistics allow us to estimate our confidence. α is our statistical threshold: it measures our chance of Type I error. An alpha level of 5% means only a 1/20 chance of false alarm (we will only accept p < 0.05). An alpha level of 1% means only a 1/100 chance of false alarm (p < 0.01). Therefore, a 1% alpha is more conservative than a 5% alpha.

20 Errors With noisy data, we will make mistakes. Statistics allows us to –Estimate our confidence –Bias the type of mistake we make (e.g. we can decide whether we will tend to make false alarms or misses) We can be liberal: avoiding misses We can be conservative: avoiding false alarms. We want liberal tests for airport weapons detection (X-ray often leads to innocent baggage being opened). Our society wants conservative tests for criminal conviction: avoid sending innocent people to jail.

21 Statistical Power
Statistical power is our probability of making a hit. It reflects our ability to detect real effects.
Decision vs. reality:
–Reject Ho, Ho false: hit
–Reject Ho, Ho true: Type I error (false alarm)
–Accept Ho, Ho false: Type II error (miss)
–Accept Ho, Ho true: correct rejection
To make new discoveries, we need to optimize power. There are several ways to increase power…

22 Increasing power…
1. Adjust statistical threshold (e.g. p < 0.05 instead of 0.01). However, we increase the chance of a Type I error!
2. Increase signal/noise ratio
–Increase signal: block design, higher field magnet, higher dose
–Decrease noise: meaningful regressors, temporal and spatial processing (future lectures)
3. Increase number of observations
–Scan individuals for longer, scan more individuals
–Disadvantage: time and money
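
How threshold, effect size and number of observations trade off can be sketched with a textbook one-tailed z-test. This is a simplification chosen for illustration (fMRI analyses use t-tests on far messier data), and the standardized effect size of 0.5 is an arbitrary assumption.

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha):
    """Power of a one-tailed z-test for a standardized mean shift of effect_size."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha)    # threshold set by alpha
    return 1.0 - NormalDist().cdf(z_crit - effect_size * n ** 0.5)

# Power rises with a more liberal alpha and with more observations.
for alpha in (0.01, 0.05):
    for n in (10, 20, 40):
        power = z_test_power(0.5, n, alpha)
        print(f"alpha={alpha:.2f}  n={n:2d}  power={power:.2f}")
```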

23 Multiple Comparisons
Assume a 1% alpha for drug testing.
–An innocent athlete only has a 1% chance of being accused.
–Problem: 10,500 athletes in the Olympics.
–If all are innocent, and α = 1%, we will wrongly accuse 105 athletes (0.01 × 10,500)!
–This is the ‘multiple comparison problem’.
The gray matter volume is ~900cc (900,000 mm³)
–Typical fMRI voxel is 3x3x3 mm (27 mm³)
–Therefore, we will conduct >30,000 tests
–With 5% alpha, we will make >1500 false alarms!
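
The arithmetic on this slide can be checked in a few lines. The variable names are only illustrative, and the numbers are the ones quoted above.

```python
# Drug testing: expected number of innocent athletes accused at a 1% alpha.
athletes_accused = 0.01 * 10_500

# fMRI: number of voxelwise tests in ~900,000 cubic mm of gray matter.
gray_matter_mm3 = 900_000
voxel_mm3 = 3 * 3 * 3
n_tests = gray_matter_mm3 / voxel_mm3

# Expected false alarms at a 5% alpha if no voxel is truly active.
expected_false_alarms = 0.05 * n_tests

print(round(athletes_accused), round(n_tests), round(expected_false_alarms))
```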

24 Multiple Comparison Problem
If we conduct 20 tests with α = 5%, we will on average make one false alarm (20 × 0.05). If we make twenty comparisons, we may be making 0, 1, 2 or in rare cases even more errors. The chance we will make at least one error is given by the formula $1 - (1 - \alpha)^C$, where C is the number of comparisons: if we make twenty comparisons at p < .05, we have a $1 - (0.95)^{20}$ = 64% chance that we are reporting at least one erroneous finding. This is our familywise error (FWE) rate.
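
The familywise error formula is easy to evaluate directly. Note that it assumes the comparisons are independent, which the slide does not state explicitly; the function name is an illustrative choice.

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one false alarm across independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons

fwe_20 = familywise_error(0.05, 20)
print(f"{fwe_20:.2f}")   # about 0.64, matching the 64% quoted above
```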

25 Bonferroni Correction Bonferroni Correction: controls FWE. For example: if we conduct 10 tests, and want a 5% chance of any errors, we will adjust our threshold to be p < (0.05/10). Problem: Very conservative = very little chance of detecting real effects = low power.
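
As a quick check of the example above, the sketch below computes the adjusted threshold for 10 tests and confirms that the resulting familywise error stays just under 5%. It assumes independent tests, and the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-test p-value threshold that keeps the familywise error near alpha."""
    return alpha / n_comparisons

per_test = bonferroni_threshold(0.05, 10)

# Familywise error actually achieved with this per-test threshold.
achieved_fwe = 1.0 - (1.0 - per_test) ** 10

print(per_test, round(achieved_fwe, 4))
```

The achieved rate is slightly below the nominal 5%, which is one way to see why the correction is described as conservative.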

26 Random Field Theory
We spatially smooth our data – peaks due to noise should be attenuated by neighbors.
–Worsley et al., HBM 4:58-73.
RFT uses resolution elements (resels) instead of voxels.
–If we smooth our data with 8mm FWHM, then resel size is 8mm.
SPM uses RFT for FWE correction: only requires statistical map, smoothness and cluster size threshold.
–Euler characteristic: unsmoothed noise will have high peaks but few clusters, smoothed data will have lower peaks but show clustering.
RFT has many unchecked assumptions (Nichols).
Works best for heavily smoothed data (x3 voxel size).
(Figure: smoothing at 5mm, 10mm and 15mm. Image from Nichols.)

27 Permutation Thresholding
Prediction: the labels ‘Group 1’ and ‘Group 2’ mean something. Null hypothesis (Ho): labels are meaningless. If Ho is true, we should get similar t-scores if we randomly scramble the order.
(Figure labels: Group 1, Group 2.)

28 Permutation Thresholding
Step 1: compute 1000 random permutations of the data, and record the maximum T-score observed for each permutation.
–Observed data: max T = 4.1
–Permutation 1: max T = 3.2
–Permutation 2: max T = 2.9
–Permutation 3: max T = 3.3
–Permutation 4: max T = 2.8
–Permutation 5: max T = 3.5
–…
–Permutation 1000: max T = 3.1
Step 2: rank order the 1000 maximum T-scores; the 50th most significant max T is the 5% threshold (here T = 3.9).
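
The two steps can be sketched on simulated data as follows. This is a toy illustration, not the method used by any particular package: the group sizes, the 500 voxels, the single truly different voxel, the equal-variance two-sample t-score and the one-sided test (Group 1 > Group 2) are all assumptions.

```python
import numpy as np

def t_map(data, labels):
    """Two-sample t-score (pooled variance) at every voxel; data is subjects x voxels."""
    g1, g2 = data[labels == 1], data[labels == 2]
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(axis=0, ddof=1)
                  + (n2 - 1) * g2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return (g1.mean(axis=0) - g2.mean(axis=0)) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

def max_t_threshold(data, labels, n_perm=1000, alpha=0.05, seed=0):
    """Step 1: max T per permutation. Step 2: take the k-th largest as the threshold."""
    rng = np.random.default_rng(seed)
    max_ts = np.array([t_map(data, rng.permutation(labels)).max()
                       for _ in range(n_perm)])
    k = max(1, round(alpha * n_perm))   # 50 when n_perm=1000 and alpha=0.05
    return np.sort(max_ts)[-k]

rng = np.random.default_rng(42)
labels = np.repeat([1, 2], 10)            # 10 subjects per group
data = rng.normal(size=(20, 500))         # pure noise at 500 voxels
data[labels == 1, 0] += 6.0               # one voxel with a real group difference

threshold = max_t_threshold(data, labels)
observed = t_map(data, labels)
significant = np.flatnonzero(observed > threshold)
print(threshold, significant)
```

Because the threshold comes from the maximum across all voxels, a voxel that exceeds it is significant after correction for the whole map, not just for a single test.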

29 Permutation Thresholding
Permutation thresholding offers the same protection against false alarms as Bonferroni. Typically, it is much more powerful than Bonferroni. Implementations include SnPM, FSL’s randomise, and my own NPM. Disadvantage: computing 1000 permutations means it takes about 1000 times longer than a typical analysis!
Simulation data from Nichols et al.: permutation always optimal. Bonferroni typically conservative. Random fields only accurate with high DF and heavy smoothing.

30 False Discovery Rate
Traditional statistics attempts to control the false alarm rate. ‘False Discovery Rate’ controls the proportion of reported discoveries that are false alarms. It often provides much more power than Bonferroni correction.
Consider the distribution of hematocrit for our population of Olympic athletes…
–No dopers: Z-scores form a standard distribution around zero.
–Some dopers: the dopers will form a distribution with Z-scores > 0.
–5% Bonferroni: only a 5% chance an innocent athlete will be accused.
–5% FDR: only 5% of expelled athletes are innocent.
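
The slides do not say which FDR procedure is meant. One common choice is the Benjamini-Hochberg step-up rule, sketched here on ten made-up p-values as an assumption, to show how it can admit more findings than Bonferroni.

```python
import numpy as np

def fdr_threshold(p_values, q=0.05):
    """Largest p-value passing the Benjamini-Hochberg rule at rate q (0.0 if none)."""
    p_sorted = np.sort(np.asarray(p_values, dtype=float))
    m = len(p_sorted)
    # The i-th smallest p-value is compared against q * i / m.
    passing = p_sorted <= q * np.arange(1, m + 1) / m
    if not passing.any():
        return 0.0
    return p_sorted[np.flatnonzero(passing).max()]

# Hypothetical p-values for ten athletes (or ten voxels).
p = np.array([0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.50, 0.62, 0.80, 0.95])

fdr_cutoff = fdr_threshold(p, q=0.05)
n_fdr = int((p <= fdr_cutoff).sum())          # findings surviving 5% FDR
n_bonferroni = int((p <= 0.05 / len(p)).sum())  # findings surviving 5% Bonferroni
print(fdr_cutoff, n_fdr, n_bonferroni)
```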

31 Controlling for multiple comparisons
Bonferroni correction
–We will often fail to find real results.
RFT correction
–Typically less conservative than Bonferroni.
–Requires large DF and broad smoothing.
Permutation Thresholding
–Offers same inference as Bonferroni correction.
–Typically much less conservative than Bonferroni.
–Computationally very slow
FDR correction (though see …)
–At FDR of .05, about 5% of ‘activated’ voxels will be false alarms.
–If signal is only a tiny proportion of data, FDR will be similar to Bonferroni.

32 Alternatives to voxelwise analysis
Conventional fMRI statistics compute one statistical comparison per voxel.
–Advantage: can discover effects anywhere in brain.
–Disadvantage: low statistical power due to multiple comparisons.
Small Volume Comparison (SVC): only test a small proportion of voxels. (Still have to adjust for RFT).
Region of Interest (ROI): pool data across anatomical region for single statistical test.
Example: how many comparisons on this slice? (Figure panels: SPM, SVC, ROI.)
–voxelwise: 1600
–SVC: 57
–ROI: 1

33 ROI analysis
In voxelwise analysis, we conduct an independent test for every voxel
–Each voxel is noisy
–Huge number of tests, so severe penalty for multiple comparisons
Alternative: pool data from region of interest.
–Averaging across meaningful region should reduce noise.
–One test per region, so FWE adjustment less severe.
Region must be selected independently of statistical contrast!
–Anatomically predefined
–Defined based on previous localizer session
–Selected based on combination of conditions you will contrast.
(Figure labels: M1: movement; S1: sensation.)
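
The noise reduction from pooling can be sketched with simulated voxels. The region size of 50 voxels, the shared block response and the assumption that noise is independent across voxels are all made up for illustration; neighbouring voxels in real, smoothed data are correlated, so the gain would be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_voxels = 200, 50          # hypothetical region of 50 voxels

# Every voxel shares the same block response but has its own noise.
signal = (np.arange(n_timepoints) % 12 < 6).astype(float)
voxels = signal[:, None] + rng.normal(0.0, 1.0, (n_timepoints, n_voxels))

# Pool the region into a single time course, giving one test instead of 50.
roi_timecourse = voxels.mean(axis=1)

voxel_noise_sd = (voxels[:, 0] - signal).std(ddof=1)
roi_noise_sd = (roi_timecourse - signal).std(ddof=1)
print(f"single voxel noise={voxel_noise_sd:.2f}  ROI noise={roi_noise_sd:.2f}")
```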

34 Inference from fMRI statistics
fMRI studies have very low power.
–Correction for multiple comparisons
–Poor signal to noise
–Variability in functional anatomy between people.
Null results are impossible to interpret. (Hard to say an area is not involved with task).