Preserving Statistical Validity in Adaptive Data Analysis. Vitaly Feldman (IBM Research - Almaden), Cynthia Dwork (Microsoft Res.), Moritz Hardt (Google Res.), Toni Pitassi (U. of Toronto), Omer Reingold (Samsung Res.), Aaron Roth (Penn, CS).
Analysis. Findings: parameter estimates, correlations, predictive model, classifier, clustering, etc.
Data Science 101: Does student nutrition affect academic performance? Outcome: normalized grade.
Check correlations
Pick candidate foods
Fit linear function of 3 selected foods. Freedman’s Paradox: “Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.” (1983). Regression summary output (100 observations; predictors: Mushroom, Pumpkin, Nutella; Significance F of order E-05): FALSE DISCOVERY.
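The three steps above (check correlations, pick candidates, fit a linear function) can be sketched as a small simulation. This is an illustration only, assuming pure-noise data: the feature count `d=100`, the seed, and the function name `freedman_demo` are hypothetical choices, not values from the talk. It compares the R² of the fitted model on the data used for selection with its R² on fresh data.

```python
import numpy as np

def freedman_demo(n=100, d=100, k=3, seed=0):
    """Select the k features most correlated with a pure-noise target,
    fit OLS on them, and return (in-sample R^2, fresh-data R^2)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))   # hypothetical "food" variables: pure noise
    y = rng.standard_normal(n)        # "normalized grade": independent of X

    # Step 1: check correlations of every feature with the target.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = Xc.T @ yc / n

    # Step 2: pick the k candidates with the largest |correlation|.
    top = np.argsort(-np.abs(corr))[:k]

    # Step 3: fit a linear function (with intercept) of the selected features.
    A = np.column_stack([np.ones(n), X[:, top]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def r_squared(design, target):
        resid = target - design @ coef
        return 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)

    # Evaluate the same fitted model on fresh data from the same distribution.
    X_new = rng.standard_normal((n, d))
    y_new = rng.standard_normal(n)
    A_new = np.column_stack([np.ones(n), X_new[:, top]])
    return r_squared(A, y), r_squared(A_new, y_new)

r2_selected, r2_fresh = freedman_demo()
print(f"R^2 on the data used for selection: {r2_selected:.3f}")
print(f"R^2 on fresh data:                  {r2_fresh:.3f}")
```

Because the target is independent of every feature, any apparent fit on the selection data is an artifact of choosing the features after looking at the data.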
Statistical inference: “Fresh” data, then a procedure (hypothesis tests, regression, learning), then a result with statistical guarantees (p-values, confidence intervals, prediction intervals).
Data analysis is adaptive: exploratory data analysis, variable selection, hyper-parameter tuning, shared data (findings inform others).
Is this a real problem? “In the course of collecting and analyzing data, researchers have many decisions to make […] It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields ‘statistical significance’, and to then report only what ‘worked’.” [Simmons, Nelson, Simonsohn 11]. 1,000,000+ downloads; citations. “Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 15]. “Why Most Published Research Findings Are False” [Ioannidis 05]. Adaptive data analysis is one of the causes.
Evaluating adaptive queries: data analyst(s) query a statistical query (SQ) oracle [Kearns 93]. Can measure correlations, moments, accuracy/error, parameters, and run any SQ-based algorithm!
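As a minimal sketch of what "statistical query" means here, assume the standard formulation: a query is a function phi mapping a record to [0, 1], and the oracle returns an estimate of its expectation over the data distribution. The record layout, the toy data, and the names below are hypothetical. Moments and classifier accuracy are each expressed as one such phi.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, int]         # hypothetical record: (feature in [0, 1], label in {0, 1})
Query = Callable[[Point], float]  # statistical query phi: record -> [0, 1]

def answer_sq(phi: Query, data: Sequence[Point]) -> float:
    """Answer a statistical query with its empirical mean on the dataset."""
    return sum(phi(z) for z in data) / len(data)

def first_moment(z: Point) -> float:
    return z[0]

def second_moment(z: Point) -> float:
    return z[0] ** 2

def threshold_accuracy(z: Point) -> float:
    """1.0 if the classifier 'feature > 0.5' predicts the label correctly, else 0.0."""
    return float((z[0] > 0.5) == bool(z[1]))

data = [(0.2, 0), (0.9, 1), (0.6, 1), (0.1, 0)]  # toy data, for illustration only
for phi in (first_moment, second_moment, threshold_accuracy):
    print(phi.__name__, answer_sq(phi, data))
```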
Answering non-adaptive SQs
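For queries fixed before the data is seen, the textbook argument is Hoeffding's inequality plus a union bound: the empirical mean of each of m queries with values in [0, 1] is within tau of its expectation, simultaneously with probability at least 1 - beta, once n >= ln(2m/beta) / (2 tau^2). The sketch below computes that bound; the function name and parameter names are assumptions for illustration, not notation from the slides.

```python
import math

def nonadaptive_sample_size(m: int, tau: float, beta: float) -> int:
    """Samples sufficient so that m fixed [0, 1]-valued queries are all answered
    within tau by their empirical means, with probability >= 1 - beta.

    Hoeffding: P(|empirical - true| > tau) <= 2 * exp(-2 * n * tau**2) per query;
    a union bound over the m queries gives the requirement below.
    """
    if m < 1 or not 0 < tau <= 1 or not 0 < beta < 1:
        raise ValueError("need m >= 1, 0 < tau <= 1, 0 < beta < 1")
    return math.ceil(math.log(2 * m / beta) / (2 * tau ** 2))

# The requirement grows only logarithmically in the number of queries.
for m in (10, 1_000, 1_000_000):
    print(m, nonadaptive_sample_size(m, tau=0.1, beta=0.05))
```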
Answering adaptive SQs
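The difficulty with adaptive queries can be illustrated with a small simulation in which the oracle naively returns empirical means. The setup is hypothetical (uniform ±1 features, an independent ±1 label, n = d = 200): the analyst first asks one query per feature, then builds a single final query from those answers. Since the label is independent of the features, the true value of the final query is exactly 0.5, yet its empirical answer is far higher.

```python
import random

def make_sample(n, d, rng):
    """n labelled points: d uniform +/-1 features and an independent +/-1 label."""
    return [([rng.choice((-1, 1)) for _ in range(d)], rng.choice((-1, 1)))
            for _ in range(n)]

def empirical_answer(query, sample):
    """Naive oracle: answer a statistical query with its empirical mean."""
    return sum(query(x, y) for x, y in sample) / len(sample)

def adaptive_analyst(sample, d):
    # Round 1: d queries, "how often does feature j agree with the label?"
    signs = []
    for j in range(d):
        agree = empirical_answer(lambda x, y, j=j: float(x[j] == y), sample)
        signs.append(1 if agree >= 0.5 else -1)

    # Round 2: one query that depends on the earlier answers:
    # the accuracy of a sign-weighted majority vote over all features.
    def final_query(x, y):
        vote = sum(s * xj for s, xj in zip(signs, x))
        prediction = 1 if vote >= 0 else -1
        return float(prediction == y)

    return final_query

rng = random.Random(0)
n, d = 200, 200
sample = make_sample(n, d, rng)
final_query = adaptive_analyst(sample, d)

on_sample = empirical_answer(final_query, sample)
on_fresh = empirical_answer(final_query, make_sample(2000, d, rng))
print(f"answer on the reused sample: {on_sample:.3f}")
print(f"answer on fresh data:        {on_fresh:.3f}  (true value is 0.5)")
```

Only d + 1 queries were asked, far fewer than the non-adaptive bound would tolerate for this sample size, which is why adaptivity needs separate treatment.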
Our results
Tool: differential privacy
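Differential Privacy [Dwork, McSherry, Nissim, Smith 06]: an algorithm is run on dataset S (pictured: Cynthia, Frank, Chris, Kobbi, Adam, Aaron); the ratio of its output probabilities on datasets differing in one record is bounded.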
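One standard way to answer a statistical query with differential privacy is the Laplace mechanism, sketched below. For a query with values in [0, 1], replacing one record changes the empirical mean by at most 1/n, so Laplace noise of scale 1/(n * epsilon) gives epsilon-differential privacy. This is a generic illustration, not necessarily the mechanism analysed in the talk; the function names are hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_answer(phi, data, epsilon: float, rng: random.Random) -> float:
    """epsilon-DP answer to a statistical query phi: record -> [0, 1].

    Sensitivity of the empirical mean is 1/n, so the noise scale is 1/(n * epsilon).
    The returned value is not clipped and may fall slightly outside [0, 1].
    """
    n = len(data)
    empirical = sum(phi(z) for z in data) / n
    return empirical + laplace_noise(1.0 / (n * epsilon), rng)

rng = random.Random(0)
data = [0.0, 1.0] * 500                       # toy dataset, n = 1000
print(private_answer(lambda z: z, data, epsilon=1.0, rng=rng))
```

With n = 1000 and epsilon = 1 the noise scale is 0.001, so the private answer stays close to the empirical mean while limiting how much any single record can influence it.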
Why DP? DP composes adaptively (algorithms A and B).
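Adaptive composition can be made concrete with the basic composition rule: if each step is (epsilon_i, delta_i)-differentially private, and later steps may be chosen based on earlier outputs, the whole interaction is (sum of epsilon_i, sum of delta_i)-differentially private. The accountant below is a minimal sketch of that bookkeeping; the class and its fields are hypothetical, and tighter (advanced) composition bounds exist but are not shown.

```python
from dataclasses import dataclass

@dataclass
class PrivacyBudget:
    """Tracks privacy spent under basic composition (parameters simply add up)."""
    epsilon: float
    delta: float = 0.0
    spent_epsilon: float = 0.0
    spent_delta: float = 0.0

    def charge(self, epsilon: float, delta: float = 0.0) -> None:
        """Record one (epsilon, delta)-DP step; refuse it if the budget would be exceeded."""
        new_eps = self.spent_epsilon + epsilon
        new_delta = self.spent_delta + delta
        if new_eps > self.epsilon + 1e-12 or new_delta > self.delta + 1e-12:
            raise ValueError("privacy budget exceeded")
        self.spent_epsilon, self.spent_delta = new_eps, new_delta

    @property
    def remaining_epsilon(self) -> float:
        return self.epsilon - self.spent_epsilon

budget = PrivacyBudget(epsilon=1.0)
budget.charge(0.4)   # algorithm A
budget.charge(0.4)   # algorithm B, possibly chosen after seeing A's output
print(f"remaining epsilon: {budget.remaining_epsilon:.2f}")
```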
Why DP? DP composes adaptively. DP implies generalization.
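The link between privacy and generalization can be seen in a simple in-expectation form. The derivation below is a standard argument for pure epsilon-differential privacy, given as a sketch; the notation is an assumption (the slide's own statement did not survive), and the high-probability guarantees discussed in this line of work are stronger than what is shown here.

```latex
% Setup (notation assumed for illustration):
%   S = (x_1, ..., x_n) ~ P^n,  M is \varepsilon-DP and outputs a query \phi = M(S): X -> [0,1],
%   E_S[\phi] = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i),   P[\phi] = E_{x \sim P}[\phi(x)].
% Let x' ~ P be fresh and let S^{(i)} be S with x_i replaced by x'.
\begin{align*}
\mathbb{E}_{S,M}\bigl[M(S)(x_i)\bigr]
  &\le e^{\varepsilon}\, \mathbb{E}_{S,x',M}\bigl[M(S^{(i)})(x_i)\bigr]
     && \text{($\varepsilon$-DP; $S$ and $S^{(i)}$ differ in one record)} \\
  &=   e^{\varepsilon}\, \mathbb{E}_{S,x',M}\bigl[M(S)(x')\bigr]
     && \text{($(S^{(i)}, x_i)$ and $(S, x')$ are identically distributed)} \\
  &=   e^{\varepsilon}\, \mathbb{E}_{S,M}\bigl[P[M(S)]\bigr].
\end{align*}
% Averaging over i, and repeating the argument with the roles of x_i and x' swapped:
\[
\Bigl|\, \mathbb{E}_{S,M}\bigl[E_S[\phi]\bigr] - \mathbb{E}_{S,M}\bigl[P[\phi]\bigr] \Bigr|
  \;\le\; e^{\varepsilon} - 1 \;\approx\; \varepsilon
  \quad \text{for small } \varepsilon .
\]
```

In words: a query selected by a differentially private procedure cannot, on average, look much better on the sample than it does on the underlying distribution.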
Back to queries
Further developments