Image from Russian Newsweek Varieties of “Voodoo” Ed Vul MIT.

Slides:

Advertisements

Similar presentations

SPM 2002 C1C2C3 X =  C1 C2 Xb L C1 L C2  C1 C2 Xb L C1  L C2 Y Xb e Space of X C1 C2 Xb Space X C1 C2 C1  C3 P C1C2  Xb Xb Space of X C1 C2 C1 

Advertisements

Reliability and Validity

Myers’ PSYCHOLOGY (7th Ed)

Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.

Measurement Reliability and Validity

INTRODUCTION Assessing the size of objects rapidly and accurately clearly has survival value. Thus, a central multi-sensory module for magnitude assessment.

HUDM4122 Probability and Statistical Inference March 30, 2015.

Critical Thinking.

Effect size: Why what we teach psychologists is wrong Dr Thom Baguley, Psychology, Nottingham Trent University

Chapter 3 Producing Data 1. During most of this semester we go about statistics as if we already have data to work with. This is okay, but a little misleading.

Multiple testing Justin Chumbley Laboratory for Social and Neural Systems Research Institute for Empirical Research in Economics University of Zurich With.

458 More on Model Building and Selection (Observation and process error; simulation testing and diagnostics) Fish 458, Lecture 15.

Circular analysis in systems neuroscience – with particular attention to cross-subject correlation mapping Nikolaus Kriegeskorte Laboratory of Brain and.

Comparison of Parametric and Nonparametric Thresholding Methods for Small Group Analyses Thomas Nichols & Satoru Hayasaka Department of Biostatistics U.

Lecture 10 Comparison and Evaluation of Alternative System Designs.

Experimental Evaluation

FMRI: Biological Basis and Experiment Design Lecture 26: Significance Review of GLM results Baseline trends Block designs; Fourier analysis (correlation)

Why ROI ? (Region Of Interest) 1.To explore one’s data (difficult to discern patterns across whole brain) To control for Type 1 error and limit the number.

Signal and Noise in fMRI fMRI Graduate Course October 15, 2003.

I NTRODUCTION The use of rapid event related designs is becoming more widespread in fMRI research. The most common method of modeling these events is by.

Introduction to Regression Analysis, Chapter 13,

Issues with analysis and interpretation - Type I/ Type II errors & double dipping - Madeline Grade & Suz Prejawa Methods for Dummies 2013.

1 CSI5388 Data Sets: Running Proper Comparative Studies with Large Data Repositories [Based on Salzberg, S.L., 1997 “On Comparing Classifiers: Pitfalls.

General Linear Model & Classical Inference

Spreadsheet Modeling & Decision Analysis A Practical Introduction to Management Science 5 th edition Cliff T. Ragsdale.

Studying Visual Attention with the Visual Search Paradigm Marc Pomplun Department of Computer Science University of Massachusetts at Boston

Student Engagement Survey Results and Analysis June 2011.

Basics of fMRI Inference Douglas N. Greve. Overview Inference False Positives and False Negatives Problem of Multiple Comparisons Bonferroni Correction.

With a focus on task-based analysis and SPM12

PSY2004 Research Methods PSY2005 Applied Research Methods Week Eleven Stephen Nunn.

7/16/2014Wednesday Yingying Wang

Multiple testing Justin Chumbley Laboratory for Social and Neural Systems Research Institute for Empirical Research in Economics University of Zurich With.

SPM Course Zurich, February 2015 Group Analyses Guillaume Flandin Wellcome Trust Centre for Neuroimaging University College London With many thanks to.

FMRI Methods Lecture7 – Review: analyses & statistics.

10.2 Tests of Significance Use confidence intervals when the goal is to estimate the population parameter If the goal is to.

Introduction to Statistics Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.

Research Process Parts of the research study Parts of the research study Aim: purpose of the study Aim: purpose of the study Target population: group whose.

Managerial Economics Demand Estimation & Forecasting.

FMRI ROI Analysis 7/18/2014 Friday Yingying Wang

Center for Radiative Shock Hydrodynamics Fall 2011 Review Assessment of predictive capability Derek Bingham 1.

CROSS-VALIDATION AND MODEL SELECTION Many Slides are from: Dr. Thomas Jensen -Expedia.com and Prof. Olga Veksler - CS Learning and Computer Vision.

Online social network size is reflected in human brain structure R. Kanai, B. Bahrami, R. Roylance and G. Rees published online 19 October 2011 Proceedings.

Stat 112 Notes 9 Today: –Multicollinearity (Chapter 4.6) –Multiple regression and causal inference.

- 1 - Overall procedure of validation Calibration Validation Figure 12.4 Validation, calibration, and prediction (Oberkampf and Barone, 2004 ). Model accuracy.

Inferential Statistics Introduction. If both variables are categorical, build tables... Convention: Each value of the independent (causal) variable has.

Week 6. Statistics etc. GRS LX 865 Topics in Linguistics.

A significance test or hypothesis test is a procedure for comparing our data with a hypothesis whose truth we want to assess. The hypothesis is usually.

Education 793 Class Notes Inference and Hypothesis Testing Using the Normal Distribution 8 October 2003.

Methods for Dummies Second level Analysis (for fMRI) Chris Hardy, Alex Fellows Expert: Guillaume Flandin.

Statistical Analysis An Introduction to MRI Physics and Analysis Michael Jay Schillaci, PhD Monday, April 7 th, 2007.

Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 6, 2013.

Hypothesis Testing. Statistical Inference – dealing with parameter and model uncertainty  Confidence Intervals (credible intervals)  Hypothesis Tests.

Uncertainty and confidence Although the sample mean,, is a unique number for any particular sample, if you pick a different sample you will probably get.

Hypothesis Tests. An Hypothesis is a guess about a situation that can be tested, and the test outcome can be either true or false. –The Null Hypothesis.

Definition Slides Unit 2: Scientific Research Methods.

Definition Slides Unit 1.2 Research Methods Terms.

Kriegeskorte N, Simmons WK, Bellgowan PSF, Baker CI. Circular analysis in systems neuroscience – the dangers of double dipping slides by Nikolaus Kriegeskorte,

Group Analyses Guillaume Flandin SPM Course London, October 2016

Statistics for fMRI.

J. A. Landsheer & G. van den Wittenboer

Issues with analysis and interpretation

Contrasts & Statistical Inference

Signal and Noise in fMRI

Disruption of Large-Scale Brain Systems in Advanced Aging

Volume 53, Issue 6, Pages (March 2007)

Statistical Challenges in “Big Data” Human Neuroimaging

Yoni K. Ashar, Jessica R. Andrews-Hanna, Sona Dimidjian, Tor D. Wager

Thinking critically with psychological science

Contrasts & Statistical Inference

Contrasts & Statistical Inference

Presentation transcript:

Image from Russian Newsweek Varieties of “Voodoo” Ed Vul MIT

neurocritic bumpybrains “What is the most we can learn about the mind by studying the brain?” Reception “an unrepentant Vul” “I haven't had so much fun since Watergate!” “there are people who would otherwise be dead if we adopted Vul’s opinions” “grossly over-strong pronouncements” “they look like idiots because they're out of their statistical depth” “such interchanges, in my experience, lead to more heat than light”

Reception “such interchanges, in my experience, lead to more heat than light” From SEED magazine

Varieties of “voodoo” r = 0.96 Historical perspective What captured our attention? The “voodoo correlation” method and generalizations Objections Solutions

B-Projective Psychokinesis Test Edward Cureton (1950) Can we use the same data to make an item analysis and to validate the test? Thanks to: Dirk Vorberg! This analysis produces high validity from pure noise ) Noise data: 29 students & grades 85 random flips/student 2) Find tags predicting GPA 3) Evaluate validity on same data Result: validity = 0.82

Cureton (1950)

A new name in each field Auctioneering: “the winner’s curse” Machine learning: “testing on training data” “data snooping” Modeling: “overfitting” Survey sampling: “selection bias” Logic: “circularity” Meta-analysis: “publication bias” fMRI: “double-dipping” “non-independence”

Varieties of “voodoo” r = 0.96 Historical perspective What captured our attention? The “voodoo correlation” method and generalizations Objections Solutions

Large Correlations Eisenberger, Lieberman, & Satpute (2005) r = 0.86

Large Correlations Takahashi, Kato, Matsuura, Koeda, Yahata, Suhara, & Okubo (2008) r = 0.82

Large Correlations Hooker, Verosky, Miyakawa, Knight, & D’Esposito (2008) r = 0.88

Large Correlations Eisenberger, Lieberman, & Williams (2003) r = 0.88

How common are these correlations? Phil Nguyen surveyed soc/aff/person fMRI, in search of these correlations. Vul, Harris, Winkielman, & Pashler (2009)

What’s wrong with large correlations? (Expected) observed correlations are limited by reliability: Maximum expected observed correlation should be the geo. mean. of reliabilities… Expected observed correlation Geometric mean of measure reliabilities True underlying correlation

Reliability of personality measures Viswesvaran and Ones (2000) reliability of the Big Five:.73 to.78. Hobbs and Fowler (1974) MMPI: (mean=.84) We say: 0.8 optimistic guess for ad hoc scales

Reliability of fMRI measures (a lot of variability depending on measure) Choose estimates most relevant for across-subject correlations – Kong et al. (2006) – 0.23 – 0.93 (mean=0.60) Manoach et al (2001) – ~0.8 Aron, Gluck, and Poldrack (2006) – Grinband (2008) We say: not often over 0.7

Maximum expected correlation Expected observed correlation Geometric mean of measure reliabilities True underlying correlation fMRI reliability Personality reliability Max. true corr. Maximum expected correlation

“Upper bound”? Many exceeding the maximum possible expected correlation. (variability is possible, but this struck us as exceedingly unlikely) Vul, Harris, Winkielman, & Pashler (2009)

Varieties of “voodoo” r = 0.96 Historical perspective What captured our attention? The “voodoo correlation” method and generalizations Objections Solutions

How are these correlations produced? Would you please be so kind as to answer a few very quick questions about the analysis that produced, i.e., the correlations on page XX?... Response? “No response” The data plotted reflect the percent signal change or difference in parameter estimates (according to some contrast) of… 1 …the average of a number of voxels 2 …one peak voxel that was most significant according to some functional measure 1) What is the brain measure: One peak voxel, or average of several voxels? If 1: The voxels whose data were plotted (i.e., the “region of interest”) were selected based on… 1a …only anatomical constraints 1b …only functional constraints 1c … anatomical and functional constraints 1.1) How was this set of voxels selected? Go to question 2 “Independent”

How are these correlations produced? The functional measure used to select the voxel(s) plotted in the figure was… A …a contrast within individual subjects B …the result of running a regression, across subjects, of the behavioral measure of interest against brain activation (or a contrast) at each voxel C …something else? 2) What sort of functional selection? Finally: the fMRI data (runs/blocks/trials) displayed in the figure were… A …the same data as those employed in the analysis used to select voxels B …different data from those employed in the analysis used to select voxels. 3) Same or different data? “Independent” “Non-Independent” 54% said correlations were computed in voxels that were selected for containing high correlations.

54% did what? Vul, Harris, Winkielman, & Pashler (2009)

What would this do on noise? Simulate 10 subjects, 10,000 voxels each. This analysis produces high correlations from pure noise. Vul, Harris, Winkielman, & Pashler (2009)

False alarms and inflation Vul, Harris, Winkielman, & Pashler (2009) Multiple comparisons correction: 1) Minimizes false alarms 2) Exacerbates effect-size inflation

Source of high correlations Our “upper bound” estimate Color coded by analysis method ascertained from survey. Vul, Harris, Winkielman, & Pashler (2009) Biased method produced the majority of the “surprisingly high” correlations

Varieties of “voodoo” r = 0.96 Historical perspective What captured our attention? The “voodoo correlation” method and generalizations Objections Solutions

Same problem in different guises Null-hypothesis tests on non-independent data. Computing meaningless effect sizes (e.g., correlations). Showing uninformative/misleading graphs… (with error bars?)

Testing null-hypotheses non-independently Somerville et al, 2006 r = 0.48 p<0.038 Partially non-independent null-hypotheses

Baker, Hutchison, & Kanwisher (2007) High selectivity from pure noise.

Computing effect sizes non-independently “r 2 = 0.74” Eisenberger, Lieberman, & Satpute (2005) A “voodoo” correlation

Modulating motion perception non-independently Low load High load Motion No motionMotion No motion ! Non-independent selection produces bizarre spurious patterns Rees, Frith, & Lavie (1997)

Data Analysis voxels M 2 (data 2 ) M 1 (data 1 ) > α Reporting peak voxel, cluster mean, etc. amounts to two analysis steps.

Data Analysis voxels M 2 (data 2 ) M 1 (data 1 ) > α M 1 = M 2 & data 1 = data 2 Non-Independent: M 1 ≠ M 2 or data 1 ≠ data 2 Independent:

Data Analysis voxels M 2 (data 2 ) M 1 (data 1 ) > α Non-Independent: Independent: Conclusions can only be based on selection step M 2 is biased Conclusions can be based on both measures M 2 provides unbiased estimates of effect size

Data Analysis voxels M 2 (data 2 ) M 1 (data 1 ) > α Non-Independent: Independent: Requires the most stringent multiple comparisons corrections! (and reduced power) Selection step need not have the most stringent corrections. Inferences from M 2 amount to a single test: greater power

Worst possible combination: Inadequate multiple comparisons (often used justifiably in independent analyses) Non-independent null-hypothesis testing. No conclusions may be drawn from either step!

p < uncorrected peak voxel: F (2,14) = 22.1; p <.0004 Significant results, potentially out of nothing! Summerfield et al, 2006 Uncorrected thresholds and non-independence

Significant results, potentially out of nothing! Eisenberger et al, 2003 The “Forman” error and non-independence p < cluster-size = 10 r = 0.88 p < 0.005

Generalizations Whenever the same data and measure are used to select voxels and later assess their signal: –Effect sizes will be inflated (e.g., correlations) –Data plots will be distorted and misleading –Null-hypothesis tests will be invalid –Only the selection step may be used for inference If multiple comparisons are inadequate, results may be produced from pure noise.

Varieties of “voodoo” r = 0.96 Historical perspective What captured our attention? The “voodoo correlation” method and generalizations Objections Solutions

This is not specific to social neuroscience! We completely agree: the problem is very general Some fields are more susceptible: –fMRI, genetics, finance, immunology, etc. Non-independence arises when –A whole lot of data are available –Only some contains relevant signal –Researchers try to find relevant bits, and estimate signal without extra costs of data

Inflation isn’t so bad Cannot estimate inflation in general – must reanalyze data. We know of one comparison (in real data) of independent – non-independent correlations Poldrack and Mumford (submitted) ~0.8 down to ~0.5 (non-indep. R 2 inflated by more than a factor of 2)

Location, not magnitude, matters Do we want to study anatomy for its own sake? Only location matters, but… Localization seems to be overestimated by underpowered studies: Studies with adequate power yield small widespread correlations (Yarkoni and Braver, in press) Do we want to use measures/manipulations of the brain to predict/alter behavior? Effect size really matters!

Restriction of range? Lieberman et al: Independent studies are underestimating correlations due to restricted range No evidence of such an effect in our surveyed data 95% CI: – 0.06

Inflated correlations not used for inference? Eisenberger, Lieberman, & Satpute (2005)

Varieties of “voodoo” r = 0.96 Historical perspective What captured our attention? The “voodoo correlation” method and generalizations Objections Solutions

Suggestions for practitioners 1: Independent localizers Use different data or orthogonal measure to select voxels.

Suggestions for practitioners 1: Independent localizers 2: Cross-validation methods Use one portion of data to select, another to obtain a validation measure, repeat. What is “left out” determines generalization: - leave one run out: generalize to new runs of same subjects - leave one subject out: generalize to new subjects!

Suggestions for practitioners 1: Independent localizers 2: Cross-validation methods 3: Random/Permutation simulations to evaluate independence If orthogonality of measures or independence of data are not obvious, try the analysis on random data (Generates empirical null-hypothesis distributions.)

Warnings for consumers/reviewers If analysis is not independent: Do not be seduced by biased graphs, statistics Rely only on the multiple comparisons correction in the voxel selection step for conclusions. Heuristics for spotting non-independence –Heat-map, arrow, plot –New coordinates for each analysis Read the methods!

Questions for statisticians How can we quantify spatial uncertainty given anatomical variability? –…to assess “replications” –…to cross-validate across subjects –…to generalize results Is it possible to jointly estimate signal location and strength without bias? ?

Closing Reusing the same data and measure to (a) select data and (b) estimate effect sizes leads to overestimation: “Baloney” - Cureton (1950) Highest correlations in social neuroscience individual differences research are obtained in this manner: “Voodoo” Many fMRI papers contain variants of this error 42% in 2008 (Kriegeskorte et al, 2009; Vul & Kanwisher, in press) Simple solutions exist to avoid these problems!

Thanks! Thanks to: Nancy Kanwisher, Hal Pashler, Chris Baker, Niko Kriegeskorte, Russ Poldrack and many others for very engaging discussions. Vul, Harris, Winkielman, Pashler (2009) Puzzlingly high correlations in fMRI studies of emotion, personality and social cognition, Perspectives on Psychological Science. Kriegeskorte, Simmons, Bellgowan, Baker (2009) Circular analysis in systems neuroscience: the dangers of double dipping, Nature Neuroscience. Yarkoni (2009) Big correlations in little studies…, Perspectives on Psychological Science. Lieberman, Berkman, Wager (2009) Correlations in social neuroscience aren’t voodoo…, Perspectives on Psychological Science. Vul, Harris, Winkielman, Pashler (2009) Reply to comments on “Puzzlingly high correlations…”, Perspectives on Psychological Science. Cureton (1950) Validity, Reliability, and Baloney. Educational and Psychological Measurement. Vul, Kanwisher (in press) Begging the question: The non-independence error in fMRI Data Analysis, Foundational issues for human brain mapping. Poldrack, Mumford (submitted) Independence in ROI analyses: Where’s the voodoo? Nichols, Poline (2009) Commentary on Vul et al…, Perspectives on Psychological Science. Yarkoni, Braver (in press) Cognitive neuroscience approaches to individial differences in working memory… Handbook of individual differences in cognition.