Issues with analysis and interpretation - Type I/ Type II errors & double dipping - Madeline Grade & Suz Prejawa Methods for Dummies 2013.

Presentation transcript:

Issues with analysis and interpretation - Type I/ Type II errors & double dipping - Madeline Grade & Suz Prejawa Methods for Dummies 2013

Review: Hypothesis Testing
– Null Hypothesis (H0): the observations are the result of random chance
– Alternative Hypothesis (HA): there is a real effect contributing to activation
– Test statistic (T)
– P-value: the probability of obtaining a test statistic at least as extreme as T if H0 is true
– Significance level (α): set a priori, usually 0.05
(XKCD cartoon)

                              True physiological activation?
                              Yes                                No
Experimental      Yes         HA (correct)                       Type I Error ("False Positive")
finding?          No          Type II Error ("False Negative")   H0 (correct)

Type I/II Errors

Not just one t-test…

60,000 of them!

Inference on t-maps (from the 2013 MfD Random Field Theory slides: panels show a t-map thresholded at t > 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, and 6.5)
Around 60,000 voxels to image the brain. 60,000 t-tests with α = 0.05 → around 3,000 expected Type I errors! We must adjust the threshold.
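(A quick sanity check of that arithmetic – a minimal Python sketch; the voxel count and α come from the slide, everything else is illustrative:)

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, alpha = 60_000, 0.05

# Under H0 every voxel's p-value is uniform on [0, 1], so the expected
# number of false positives is simply n_voxels * alpha.
print("expected false positives:", n_voxels * alpha)           # 3000.0

# Simulated null experiment: how many voxels cross threshold by chance?
p_values = rng.uniform(size=n_voxels)
print("simulated false positives:", (p_values < alpha).sum())  # ~3000
```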

Type I Errors
“In fMRI, you have 60,000 darts, and so just by random chance, by the noise that’s inherent in the fMRI data, you’re going to have some of those darts hit a bull’s-eye by accident.” – Craig Bennett, Dartmouth (Bennett et al. 2010)

Correcting for Multiple Comparisons
Family-wise Error Rate (FWER)
– Simultaneous inference
– Probability of observing 1+ false positives after carrying out multiple significance tests
– Ex: FWER = 0.05 means a 5% chance of one or more Type I errors
– Controlled by Bonferroni correction or Gaussian Random Field Theory
Downside: loss of statistical power
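(For illustration, a minimal sketch of Bonferroni correction on the same pure-noise p-values – not the slides' code; the FWER guarantee follows from the union bound:)

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels, alpha = 60_000, 0.05

p_values = rng.uniform(size=n_voxels)   # pure-noise p-values under H0

# Bonferroni: test each voxel at alpha / n_voxels, so that
# P(at least one false positive) <= alpha by the union bound.
survivors = (p_values < alpha / n_voxels).sum()
print("voxels surviving Bonferroni:", survivors)  # almost always 0
```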

Correcting for Multiple Comparisons
False Discovery Rate (FDR)
– Selective inference
– Less conservative: places a limit on the expected proportion of false positives among the reported results
– Ex: FDR = 0.05 means at most 5% of the reported results are expected to be false positives
Greater statistical power; may represent a more ideal balance
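(A minimal sketch of the Benjamini–Hochberg step-up procedure, the standard FDR-controlling method; the function and variable names are mine, not from the slides:)

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of discoveries, controlling FDR at level q."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    # Find the largest k with p_(k) <= (k/m) * q; reject hypotheses 1..k.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        mask[order[: k + 1]] = True
    return mask

# Example: 90% null voxels, 10% true effects with small p-values
rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(size=900), rng.uniform(0, 0.001, size=100)])
print("discoveries:", benjamini_hochberg(p, q=0.05).sum())
```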

Salmon experiment with corrections? No significant voxels remained, even at the relaxed thresholds of FDR = 0.25 and FWER = 0.25. The dead salmon in fact had no brain activity during the social perspective-taking task.

Not limited to fMRI studies “After adjusting the significance level to account for multiple comparisons, none of the identified associations remained significant in either the derivation or validation cohort.”

How often are corrections made? Percentage of 2008 journal articles that included multiple comparisons correction in fMRI analysis:
– 74% (193/260) in NeuroImage
– 67.5% (54/80) in Cerebral Cortex
– 60% (15/25) in Social Cognitive and Affective Neuroscience
– 75.4% (43/57) in Human Brain Mapping
– 61.8% (42/68) in Journal of Cognitive Neuroscience
Not to mention poster sessions! (Bennett et al. 2010)

“Soft control” Uncorrected statistics may instead use:
– a stricter voxelwise α (0.001 < p < 0.005), and
– a minimum cluster size (6 < k < 20 voxels)
This helps, but is an inadequate replacement for proper correction. Vul et al. (2009) simulation:
– data comprised of pure random noise
– α = 0.005 and a 10-voxel minimum cluster size
– significant clusters were found 100% of the time
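(The flavour of that simulation is easy to reproduce – a sketch under my own assumptions about volume size and smoothness, not Vul et al.'s exact code:)

```python
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(3)

# A pure-noise "brain": 40x40x40 voxels of spatially smoothed Gaussian noise
noise = ndimage.gaussian_filter(rng.standard_normal((40, 40, 40)), sigma=2)
noise /= noise.std()                       # re-standardise after smoothing

z_thresh = stats.norm.isf(0.005)           # voxelwise p < .005, one-sided
above = noise > z_thresh
labels, n_clusters = ndimage.label(above)  # connected suprathreshold clusters
sizes = ndimage.sum(above, labels, range(1, n_clusters + 1))
print("clusters of >= 10 voxels:", int((sizes >= 10).sum()))  # usually > 0
```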

Effect of Decreasing α on Type I/II Errors

Type II Errors
Power analyses
– can estimate the likelihood of Type II errors in future samples, given a true effect of a certain size
Type II errors may arise from use of Bonferroni correction
– the value of one voxel is highly correlated with its surrounding voxels (due to the smoothness of the BOLD signal and Gaussian smoothing), so Bonferroni is overly conservative
FDR and Gaussian Random Field estimation are good alternatives with higher power
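(As an illustration of a prospective power analysis – a sketch using statsmodels; the effect size and thresholds are invented for the example:)

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Subjects needed for 80% power to detect a medium one-sample effect
# (Cohen's d = 0.5) at a strict, Bonferroni-style alpha.
n = analysis.solve_power(effect_size=0.5, alpha=0.001, power=0.8)
print(f"required n: {n:.1f} subjects")

# Conversely, the power (and implied Type II error rate) with only 15 subjects:
achieved = analysis.solve_power(effect_size=0.5, alpha=0.001, nobs=15)
print(f"power with n=15: {achieved:.2f} (Type II rate ≈ {1 - achieved:.2f})")
```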

Don’t overdo it! Unintended negative consequences of “single-minded devotion” to avoiding Type I errors:
– increased Type II errors (missing true effects)
– bias towards studying large effects over small ones
– bias towards sensory/motor processes rather than complex cognitive/affective processes
– deficient meta-analyses
(Lieberman et al. 2009)

Other considerations
Increasing statistical power
– greater # of subjects or scans
– designing behavioral tasks that take into account the slow nature of the fMRI signal
Value of meta-analyses
– “We recommend a greater focus on replication and meta-analysis rather than emphasizing single studies as the unit of analysis for establishing scientific truth. From this perspective, Type I errors are self-erasing because they will not replicate, thus allowing for more lenient thresholding to avoid Type II errors.” (Lieberman et al. 2009)

It’s All About Balance: Type I Errors vs. Type II Errors

Double Dipping Suz Prejawa

Double Dipping – a common stats problem
– Auctioneering: “the winner’s curse”
– Machine learning: “testing on training data”, “data snooping”
– Modeling: “overfitting”
– Survey sampling: “selection bias”
– Logic: “circularity”
– Meta-analysis: “publication bias”
– fMRI: “double dipping”, “non-independence”


Kriegeskorte et al (2009)
Circular analysis / non-independence / double dipping:
– “data are first analyzed to select a subset and then the subset is reanalyzed to obtain the results”
– “the use of the same data for selection and selective analysis”
– “… leads to distorted descriptive statistics and invalid statistical inference whenever the test statistics are not inherently independent of the selection criteria under the null hypothesis. Nonindependent selective analysis is incorrect and should not be acceptable in neuroscientific publications.”*
* It is epidemic in publications – see Vul and Kriegeskorte

Kriegeskorte et al (2009)
Results reflect the data only indirectly: through the lens of an often complicated analysis, in which assumptions are not always fully explicit. Assumptions influence which aspect of the data is reflected in the results – they may even pre-determine the results.

Example 1: Pattern-information analysis (Simmons et al.)
(Design figure: STIMULUS (object category) crossed with TASK (property judgment: “Animate?” or “Pleasant?”))

Pattern-information analysis
– define the ROI by selecting ventral-temporal voxels for which any pairwise condition contrast is significant at p < .001 (uncorrected)
– perform nearest-neighbor classification based on activity-pattern correlation
– use odd runs for training and even runs for testing

Results (bar chart): decoding accuracy for task (judged property) and stimulus (object category), both apparently above chance level.

Where did it go wrong??
– define the ROI by selecting ventral-temporal voxels for which any pairwise condition contrast is significant at p < .001 (uncorrected) → based on ALL data sets
– perform nearest-neighbor classification based on activity-pattern correlation
– use odd runs for training and even runs for testing

(Bar chart: decoding accuracy for task and stimulus, for real fMRI data and for data from a Gaussian random generator. Using all data to select the ROI voxels, even pure noise decodes above chance; using only the training data to select the ROI voxels gives cleanly independent training and test data, and the noise falls back to chance level.)

Conclusion for pattern-information analysis
The test data must not be used in either:
– training a classifier (continuous weighting), or
– defining the ROI (binary weighting)
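(The point is easy to demonstrate on pure noise – a simulation sketch of my own, not the authors' analysis. Selecting “ROI” voxels with all of the data lets even random numbers classify well above chance; selecting with the training half only brings accuracy back to chance:)

```python
import numpy as np

rng = np.random.default_rng(4)
n_runs, n_vox, n_sel = 20, 5000, 100
# data[condition, run, voxel]: pure Gaussian noise, no real effect anywhere
data = rng.standard_normal((2, n_runs, n_vox))
odd, even = np.arange(0, n_runs, 2), np.arange(1, n_runs, 2)  # train / test runs

def accuracy(selection_runs):
    # "ROI" = voxels with the largest A-vs-B difference in the given runs
    diff = (data[0, selection_runs] - data[1, selection_runs]).mean(axis=0)
    roi = np.argsort(np.abs(diff))[-n_sel:]
    train = data[:, odd][:, :, roi].mean(axis=1)  # mean pattern per condition
    correct = 0
    for cond in (0, 1):
        for r in even:  # nearest-neighbour correlation classifier on test runs
            test = data[cond, r, roi]
            r_a = np.corrcoef(test, train[0])[0, 1]
            r_b = np.corrcoef(test, train[1])[0, 1]
            correct += (r_a > r_b) == (cond == 0)
    return correct / (2 * len(even))

print("ROI from ALL runs (double dipping):", accuracy(np.arange(n_runs)))
print("ROI from training runs only:       ", accuracy(odd))  # ~0.5 (chance)
```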

Happy so far?

Example 2: Regional activation analysis – a simulated fMRI experiment
– Experimental conditions: A, B, C, D
– “Truth”: a region equally active for A and B, not for C and D (blue: the true region)
– Time series: preprocessed and smoothed, then a whole-brain search on the entire time series (FWE-corrected):
1. contrast [A > D] → identifies an ROI (red) that is skewed/“overfitted” relative to the true region
2. now you test within the (red) ROI, using the same time series, for [A > B]… and

Where did it go wrong??
The ROI was defined by a contrast favouring condition A*, using all of the time-series data. Any subsequent search within that ROI using the same time series will find spuriously strong effects for A > B (since A gave you the ROI in the first place).
* Because the ROI was based on [A > D], the region was selected with a bias towards condition A, so any contrast involving either condition A or condition D is biased. Such biased contrasts include A, A−B, A−C, and A+B.

Saving the ROI – with independence
Make the selective analysis independent, either through independent test data (green) or by using selection and test statistics that are inherently independent. […] However, selection bias can arise even for orthogonal contrast vectors.

A note on orthogonal vectors
Does selection by an orthogonal contrast vector ensure unbiased analysis?
– ROI-definition contrast: A+B, c_selection = [1 1]^T
– ROI-average analysis contrast: A−B, c_test = [1 −1]^T
→ orthogonal contrast vectors (c_selection^T · c_test = 0)

A note on orthogonal vectors II
Does selection by an orthogonal contrast vector ensure unbiased analysis? No – there can still be bias. Orthogonality is not sufficient: the design and the noise dependencies matter.
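(A small simulation of my own showing why orthogonality is not sufficient: under the null, with unequal noise variances for the two condition estimates, selecting voxels on A+B biases the ROI-average of A−B:)

```python
import numpy as np

rng = np.random.default_rng(5)
n_vox = 100_000

# Null data: no true effects, but the estimate of A is noisier than B
# (e.g. fewer trials for condition A) -> unequal noise variances.
a = rng.normal(0, 2.0, n_vox)   # noisy estimates of condition A effect
b = rng.normal(0, 1.0, n_vox)   # estimates of condition B effect

# Select the "ROI" with the contrast A+B (c_selection = [1, 1])
roi = (a + b) > np.percentile(a + b, 99)

# Test the orthogonal contrast A-B (c_test = [1, -1]) inside the ROI
print("mean A-B in ROI:", (a[roi] - b[roi]).mean())  # clearly > 0, despite H0
```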

To avoid selection bias, we can…
– …perform a nonselective analysis, e.g. whole-brain mapping (no ROI analysis), OR
– …make sure that the selection and results statistics are independent under the null hypothesis, because they are either:
  – inherently independent (e.g. independent contrasts), or
  – computed on independent data

Generalisations (from Vul)
Whenever the same data and measure are used to select voxels and later assess their signal:
– effect sizes will be inflated (e.g., correlations)
– data plots will be distorted and misleading
– null-hypothesis tests will be invalid
– only the selection step may be used for inference
If corrections for multiple comparisons are inadequate, results may be produced from pure noise (see the sketch below).
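(A sketch of the first point – inflated correlations – in the spirit of Vul et al.'s argument; the sample sizes and threshold are invented for the example:)

```python
import numpy as np

rng = np.random.default_rng(6)
n_subj, n_vox = 20, 10_000

behaviour = rng.standard_normal(n_subj)
activity = rng.standard_normal((n_vox, n_subj))   # noise: true r = 0 everywhere

# Correlate every voxel with behaviour and keep the "significant" ones
b = (behaviour - behaviour.mean()) / behaviour.std()
a = (activity - activity.mean(axis=1, keepdims=True)) / activity.std(axis=1, keepdims=True)
r = a @ b / n_subj                # Pearson r per voxel
picked = np.abs(r) > 0.5

# Reporting the mean |r| of the picked voxels on the SAME data is inflated:
print("mean |r| in selected voxels:", np.abs(r[picked]).mean())  # ~0.55

# On independent replication data, the same voxels collapse to chance level:
replication = rng.standard_normal((n_vox, n_subj))
a2 = (replication - replication.mean(axis=1, keepdims=True)) / replication.std(axis=1, keepdims=True)
r2 = a2 @ b / n_subj
print("same voxels, new data:", np.abs(r2[picked]).mean())  # ~0.18 (chance for n=20)
```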

So… we don’t want any of this!!

Because …

And if you are unsure… … ask our friends Kriegeskorte et al (2009)…

QUESTIONS?

References
– MfD 2013: “Random Field Theory” slides
– Bennett, Baird, Miller & Wolford (2010). “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument for Proper Multiple Comparisons Correction.” JSUR, 1(1):1-5
– Vul, Harris, Winkielman & Pashler (2009). “Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition.” Perspectives on Psychological Science, 4(3)
– Lieberman & Cunningham (2009). “Type I and Type II error concerns in fMRI research: re-balancing the scale.” SCAN 4:423-8
– Kriegeskorte, N., Simmons, W.K., Bellgowan, P.S.F. & Baker, C.I. (2009). “Circular analysis in systems neuroscience: the dangers of double dipping.” Nat Neurosci 12
– Vul, E. & Kanwisher, N. “Begging the Question: The Non-Independence Error in fMRI Data Analysis” (in-press PDF; original URL truncated)
– Kriegeskorte’s teaching slides: cbu.cam.ac.uk/people/nikolaus.kriegeskorte/Circular%20analysis_teaching%20slides.ppt

Voodoo Correlations