Preserving Statistical Validity in Adaptive Data Analysis Vitaly Feldman IBM Research - Almaden Cynthia Dwork Moritz Hardt Toni Pitassi Omer Reingold Aaron.


1 Preserving Statistical Validity in Adaptive Data Analysis. Vitaly Feldman (IBM Research - Almaden), Cynthia Dwork (Microsoft Res.), Moritz Hardt (Google Res.), Toni Pitassi (U. of Toronto), Omer Reingold (Samsung Res.), Aaron Roth (Penn, CS)

2 Analysis. Findings: parameter estimates, correlations, predictive model, classifier, clustering, etc.

3 Data Science 101. Does student nutrition affect academic performance? [Figure: plot with axis "Normalized grade"]

4 Check correlations

5 Pick candidate foods

6 Fit linear function of 3 selected foods
Freedman’s Paradox: “Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.” (1983)

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.4453533
  R Square             0.1983396
  Adjusted R Square    0.1732877
  Standard Error       1.0041891
  Observations         100

ANOVA
               df    SS             MS          F           Significance F
  Regression    3    23.95086544    7.983622    7.917151    8.98706E-05
  Residual     96    96.80600126    1.008396
  Total        99    120.7568667

               Coefficients    Standard Error    t Stat      P-value
  Intercept    -0.044248       0.100545016       -0.44008    0.660868
  Mushroom     -0.296074       0.10193011        -2.90468    0.004563
  Pumpkin       0.255769       0.108443069        2.358555   0.020373
  Nutella       0.2671363      0.095186165        2.806462   0.006066

FALSE DISCOVERY
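The effect on slides 4 to 6 can be reproduced with a small simulation. The sketch below is illustrative only and is not the data behind the slide: it assumes 100 observations and 50 pure-noise candidate variables (the number of foods in the talk is not stated), picks the 3 variables most correlated with the response, fits a linear model on them, and reports how often the overall F statistic exceeds roughly 2.70, the approximate 5% critical value for F(3, 96). All function names are hypothetical.

```python
import random

def center(v):
    """Subtract the mean so that fitting without an intercept matches OLS with one."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def f_statistic_after_selection(rng, n=100, d=50, k=3):
    """Select the k noise columns most correlated with y, then return the regression F."""
    y = center([rng.gauss(0, 1) for _ in range(n)])
    cols = [center([rng.gauss(0, 1) for _ in range(n)]) for _ in range(d)]

    def abs_corr(c):
        return abs(dot(c, y)) / (dot(c, c) * dot(y, y)) ** 0.5

    chosen = sorted(cols, key=abs_corr, reverse=True)[:k]

    # Orthogonalize the chosen columns (Gram-Schmidt) so the explained sum of
    # squares is the sum of squared projections of y onto each direction.
    basis = []
    for c in chosen:
        q = c[:]
        for b in basis:
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        basis.append(q)

    ss_tot = dot(y, y)
    ss_reg = sum(dot(b, y) ** 2 / dot(b, b) for b in basis)
    ss_res = ss_tot - ss_reg
    return (ss_reg / k) / (ss_res / (n - k - 1))

def false_discovery_rate(trials=100, seed=0, f_crit=2.70):
    """Fraction of pure-noise trials declared significant at the nominal 5% level."""
    rng = random.Random(seed)
    hits = sum(f_statistic_after_selection(rng) > f_crit for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    print("nominal level: 0.05, observed rate:", false_discovery_rate())
```

Because the response is independent of every candidate variable, any "significant" fit here is a false discovery; the observed rate is far above the nominal 5% because the test ignores the selection step.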

7 Statistical inference. “Fresh” data → Procedure (hypothesis tests, regression, learning) → Result and statistical guarantees (p-values, confidence intervals, prediction intervals)

8 Data analysis is adaptive. Data → Result. Exploratory data analysis; variable selection; hyper-parameter tuning; shared data (findings inform others)

9 Is this a real problem?
In the course of collecting and analyzing data, researchers have many decisions to make […] It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance”, and to then report only what “worked”. [Simmons,Nelson,Simonsohn 11]
1,000,000+ downloads; 1400+ citations
“Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman,Cockburn,Simcoe 15]
“Why Most Published Research Findings Are False” [Ioannidis 05]
Adaptive data analysis is one of the causes

10 Evaluating adaptive queries. Data analyst(s) ↔ Statistical query oracle [Kearns 93]. Can measure correlations, moments, accuracy/error, parameters and run any SQ-based algorithm!
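As a rough illustration of the statistical query (SQ) interface, the sketch below assumes the usual formulation: a query is a function phi mapping a data point to [0, 1], and the oracle should return a value close to the expectation of phi over the unknown distribution. The naive oracle here simply returns the empirical mean on the dataset; the names `empirical_sq` and `accuracy_query` are hypothetical.

```python
import random

def empirical_sq(data, phi):
    """Answer a statistical query with the empirical mean of phi over the dataset."""
    return sum(phi(x) for x in data) / len(data)

def accuracy_query(classifier):
    """Express the accuracy of a classifier on labeled points (x, label) as an SQ."""
    return lambda point: 1.0 if classifier(point[0]) == point[1] else 0.0

if __name__ == "__main__":
    rng = random.Random(0)
    # Labeled points: the label is 1 when x > 0.5, flipped with probability 0.1.
    data = []
    for _ in range(10_000):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label
        data.append((x, label))

    print("mean of x:", empirical_sq(data, lambda p: p[0]))
    print("accuracy of threshold rule:",
          empirical_sq(data, accuracy_query(lambda x: int(x > 0.5))))
```

Moments, correlations and error rates all fit this form, which is why the SQ interface covers such a wide range of analyses.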

11 Answering non-adaptive SQs
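For queries fixed before the data is seen, the standard argument is Hoeffding's inequality plus a union bound: with n i.i.d. samples, all m empirical means are within tau of their expectations with probability at least 1 - beta once n ≥ ln(2m/beta) / (2 tau²). The sketch below just evaluates that textbook bound; it is an assumption about what this slide covers, and the function name is hypothetical.

```python
import math

def samples_for_nonadaptive_queries(m, tau, beta):
    """Samples sufficient for m fixed [0,1]-valued queries, each accurate to
    within tau, with overall failure probability at most beta (Hoeffding + union bound)."""
    if m < 1 or not (0 < tau < 1) or not (0 < beta < 1):
        raise ValueError("need m >= 1, 0 < tau < 1, 0 < beta < 1")
    return math.ceil(math.log(2 * m / beta) / (2 * tau ** 2))

if __name__ == "__main__":
    for m in (1, 1_000, 1_000_000):
        print(m, samples_for_nonadaptive_queries(m, tau=0.1, beta=0.05))
```

The sample size grows only logarithmically in the number of queries, which is the guarantee that breaks down once queries are chosen adaptively.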

12 Answering adaptive SQs

13 Our results

14 Tool: differential privacy DATA

15 Differential Privacy [Dwork,McSherry,Nissim,Smith 06] [Figure: dataset S of individuals (Cynthia, Frank, Chris, Kobbi, Adam, Aaron) fed to an algorithm; ratio bounded]
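A minimal sketch of one standard differentially private algorithm, the Laplace mechanism, applied to a [0, 1]-valued query. It assumes the usual setup: replacing one of the n data points changes the empirical mean by at most 1/n, so adding Laplace noise of scale 1/(epsilon · n) bounds the ratio of output densities on neighboring datasets by e^epsilon. The function names are hypothetical, and this is not claimed to be the exact mechanism analyzed in the talk.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(data, phi, epsilon, rng):
    """Epsilon-DP estimate of the mean of phi over data; phi must map into [0, 1]."""
    n = len(data)
    true_mean = sum(phi(x) for x in data) / n
    sensitivity = 1.0 / n  # effect of replacing a single data point
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.random() for _ in range(1_000)]
    print("noisy fraction above 0.5:",
          private_mean(data, lambda x: float(x > 0.5), epsilon=0.5, rng=rng))
```

With n = 1000 and epsilon = 0.5 the noise has scale 0.002, so privacy costs little accuracy on a dataset of this size.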

16 Why DP? DP composes adaptively A B

17 Why DP? DP composes adaptively A B

18 Why DP? DP composes adaptively. DP implies generalization
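Adaptive composition can be made concrete with simple privacy accounting. The sketch below uses two textbook bounds, stated here as assumptions about the setting rather than as the talk's own analysis: basic composition (the epsilons and deltas of the individual steps add up) and advanced composition for k steps that are each (epsilon, delta)-DP, which gives total epsilon sqrt(2k ln(1/delta')) · epsilon + k · epsilon · (e^epsilon - 1) and total delta k · delta + delta'. Function names are hypothetical.

```python
import math

def basic_composition(epsilons, deltas=None):
    """Total (epsilon, delta) when the privacy parameters of each step simply add."""
    deltas = deltas if deltas is not None else [0.0] * len(epsilons)
    return sum(epsilons), sum(deltas)

def advanced_composition(epsilon, delta, k, delta_slack):
    """Total (epsilon, delta) for k adaptively chosen (epsilon, delta)-DP steps."""
    total_eps = (math.sqrt(2 * k * math.log(1 / delta_slack)) * epsilon
                 + k * epsilon * (math.exp(epsilon) - 1))
    return total_eps, k * delta + delta_slack

if __name__ == "__main__":
    k, eps = 10_000, 0.01
    print("basic:   ", basic_composition([eps] * k))
    print("advanced:", advanced_composition(eps, 0.0, k, delta_slack=1e-6))
```

For many small-epsilon steps the advanced bound grows roughly like the square root of k rather than linearly, which is what makes it possible to answer many adaptive queries.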

19 Back to queries

20 Further developments
