Preserving Validity in Adaptive Data Analysis

1 Preserving Validity in Adaptive Data Analysis
Vitaly Feldman, Accelerated Discovery Lab, IBM Research - Almaden

2 Setting
An unknown distribution $P$ over a domain $X$; a dataset $S = (x_1, \ldots, x_n) \sim P^n$ is fed to an analysis, which produces results: a parameter estimate, a classifier, a clustering, etc.

3 Data Analysis 101 Does student nutrition predict academic performance?
[Figure: scatter plot of normalized grades, 0-100]

4 Check correlations

5 Pick candidate foods

6 Fit linear function of 3 selected foods
[Excel regression summary output: 100 observations, 3 selected foods (Mushroom, Pumpkin, Nutella) as regressors; ANOVA with 3 regression and 96 residual degrees of freedom; Significance F ≈ 0.0001.]
A FALSE DISCOVERY: under the true $P$, $x_i, y \sim N(0,1)$ are uncorrelated. Freedman's Paradox [1983].
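To make the paradox concrete, here is a minimal sketch (not from the talk; the dataset sizes and the use of NumPy/SciPy are illustrative assumptions): with purely random grades and foods, picking the three most correlated foods and then regressing on the same data typically yields a "significant" F-test.

```python
import numpy as np
from scipy import stats

# Purely random data: 100 "students", 50 candidate foods, grades independent of everything.
rng = np.random.default_rng(0)
n, d, k = 100, 50, 3
X = rng.standard_normal((n, d))      # food consumption (pure noise)
y = rng.standard_normal(n)           # normalized grades (uncorrelated with X)

# Step 1: pick the k foods most correlated with the grade -- on this same data.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)])
top = np.argsort(-corr)[:k]

# Step 2: regress the grade on the selected foods and report the overall F-test
# p-value as if the selection step had never happened.
Xk = np.column_stack([np.ones(n), X[:, top]])
beta, rss, *_ = np.linalg.lstsq(Xk, y, rcond=None)
r2 = 1 - rss[0] / np.sum((y - y.mean()) ** 2)
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print("Significance F:", stats.f.sf(F, k, n - k - 1))   # typically well below 0.05
```

The p-value is invalid because the selection step is ignored, which is exactly Freedman's point.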

7 Statistical inference
A procedure (hypothesis test, regression, learning) is applied to "fresh" i.i.d. samples from the data distribution; the result (e.g. an estimate of a parameter $\mu$) comes with generalization guarantees.

8 Data analysis is adaptive
Steps depend on previous analyses of the same dataset: data cleaning, exploratory data analysis, variable selection, hyper-parameter tuning, shared datasets, and so on.
The data analyst(s) issue a sequence of analyses $A_1, A_2, \ldots, A_m$ with $A_i : X^n \to Y_i$ and $v_i = A_i(S)$, where each $A_i$ may depend on the previous outcomes $v_1, \ldots, v_{i-1}$.
(Note: with an unlimited amount of data this would not be an issue. There is plenty of empirical evidence for the problem, from selection effects to papers demonstrating overfitting to the test set.)

9 It's an old problem
"Thou shalt not test hypotheses suggested by data."
"Quiet scandal of statistics" [Leo Breiman, 1992]

10 Is this a real problem?
"Why Most Published Research Findings Are False" [Ioannidis 2005]
"Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss" [Freedman, Cockburn, Simcoe 2015]
Adaptive data analysis is one of the causes: $p$-hacking; researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011]; the garden of forking paths [Gelman, Loken 2015].

11 Existing approaches I
Abstinence; pre-registration.
(Image © Center for Open Science)

12 Existing approaches II
Selective inference: the selection step (A) and the inference step (B) are run on the same data, with the inference adjusted for the selection. Examples: model selection + parameter inference; variable selection + regression. Survey: [Taylor, Tibshirani 2015].

13 Existing approaches III
Sample splitting: the data is split into disjoint parts (A, B, C, ...), one for each stage of the analysis. Might be necessary for standard approaches.

14 New approach: differential privacy [Dwork,McSherry,Nissim,Smith 06]

15 Differential privacy as outcome stability
A randomized algorithm $A$ is $\epsilon$-differentially private (DP) if for any two datasets $S, S'$ with $\mathrm{dist}(S, S') = 1$ (differing in a single element), the output distributions have bounded ratio: for all $Z \subseteq \mathrm{range}(A)$, $\Pr_A[A(S) \in Z] \le e^{\epsilon} \cdot \Pr_A[A(S') \in Z]$.
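For intuition, here is a minimal sketch of a standard $\epsilon$-DP primitive, the Laplace mechanism, applied to releasing a mean (the helper name and its parameters are illustrative assumptions, not part of the talk):

```python
import numpy as np

def laplace_mean(S, eps, lo=0.0, hi=1.0, rng=None):
    """Release the mean of values in [lo, hi] with eps-DP (a minimal sketch;
    the function name and defaults are illustrative, not from the talk)."""
    rng = rng or np.random.default_rng()
    S = np.clip(np.asarray(S, dtype=float), lo, hi)
    # Changing one record moves the clipped mean by at most (hi - lo) / n,
    # so Laplace noise with scale sensitivity/eps bounds the probability
    # ratio in the definition above by e^eps.
    sensitivity = (hi - lo) / len(S)
    return float(S.mean() + rng.laplace(0.0, sensitivity / eps))

# Example: neighboring datasets produce statistically similar outputs.
data = np.random.default_rng(0).random(1000)
print(laplace_mean(data, eps=0.5))
```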

16 DP composes adaptively
If $A(S)$ is $\epsilon_1$-DP and, for every $y \in Y$, $B(S, y)$ is $\epsilon_2$-DP, then the composed algorithm $S \mapsto B(S, A(S))$ is $(\epsilon_1 + \epsilon_2)$-DP.

17 DP implies generalization
DP is a strong form of algorithmic stability.
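A small illustrative sketch of adaptive composition (toy data and names are assumptions): the first $\epsilon_1$-DP release is used to choose the second query, and the whole interaction is $(\epsilon_1 + \epsilon_2)$-DP by the composition property above.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = rng.integers(18, 90, size=1000).astype(float)    # toy dataset
n, eps1, eps2 = len(ages), 0.5, 0.5

def laplace_mech(value, sensitivity, eps):
    # Standard Laplace mechanism: eps-DP for a query with the given sensitivity.
    return value + rng.laplace(0.0, sensitivity / eps)

# A(S): an eps1-DP noisy mean age (one record shifts the clipped mean by <= (90-18)/n).
noisy_mean = laplace_mech(np.clip(ages, 18, 90).mean(), (90 - 18) / n, eps1)

# B(S, y): the analyst chooses the next query adaptively, based on A's output y.
noisy_frac = laplace_mech(np.mean(ages > noisy_mean), 1.0 / n, eps2)

# By adaptive composition, the whole interaction is (eps1 + eps2)-DP, i.e. 1.0-DP here.
print(noisy_mean, noisy_frac)
```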

18 Approaches
(Approximate) max-information; differential privacy; description length. These can be used to analyze any algorithm; the goal is primarily to lay the theoretical groundwork.
Additional approaches: mutual information [Russo, Zou 2016]; KL-stability [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 2016]; typical stability [Bassily, Freund 2016].

19 Application: holdout validation
The data is split into a training set, where learning happens, and a holdout/testing set, where only inference happens: it is used to estimate the test error $\widehat{\mathrm{Err}}(f)$ of the predictor $f$.

20 Reusable Holdout algorithm
The data is split into a training set $T$ and a holdout set $H$. The analyst(s) submit adaptively chosen predictors $f_1, f_2, \ldots, f_m$ to the reusable holdout algorithm, which has access to $H$ and returns error estimates $\widehat{\mathrm{Err}}(f_1), \widehat{\mathrm{Err}}(f_2), \ldots, \widehat{\mathrm{Err}}(f_m)$.

21 Thresholdout
Theorem: given a holdout set of $n$ i.i.d. samples, Thresholdout can estimate the true error of a "large" number of adaptively chosen predictors, provided that "few" of them overfit to the training set. (We know reasonably well what a test set is used for: primarily to estimate the error.) The same idea can be applied to maintaining an accurate leaderboard in machine learning competitions [Blum, Hardt 2015].
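A simplified sketch of the Thresholdout mechanism for a single query (the threshold, noise scales, and parameter names are illustrative choices rather than the paper's exact tuning, and the overfitting budget bookkeeping is omitted):

```python
import numpy as np

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query, e.g. the 0/1 losses of a predictor f on each example.

    Simplified sketch of Thresholdout; parameters are illustrative assumptions
    and budget accounting is omitted.
    """
    rng = rng or np.random.default_rng()
    train_mean, holdout_mean = np.mean(train_vals), np.mean(holdout_vals)
    # Consult the holdout only when the training estimate looks like it overfits,
    # i.e. it disagrees with the holdout by more than a noisy threshold.
    if abs(train_mean - holdout_mean) > threshold + rng.laplace(0.0, 2 * sigma):
        return holdout_mean + rng.laplace(0.0, sigma)   # noisy holdout answer
    return train_mean                                   # training estimate is fine
```

Because the holdout is consulted, and noise added, only when the training and holdout estimates disagree, the answers reveal little about the holdout set, which is what allows it to be reused many times.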

22 Illustration
Given points in $\mathbf{R}^d \times \{-1,+1\}$: pick the variables most correlated with the label on the training set, validate them on the holdout set, and check the prediction error of the linear classifier $\mathrm{sign}\big(\sum_{i \in V} s_i x_i\big)$, where $s_i$ is the sign of the correlation of variable $i$. Data distribution: 10,000 points drawn from $N(0,1)^{10{,}000}$ with random labels (no true signal).
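A minimal sketch of this experiment at reduced scale (the sizes are assumptions; the point is that the apparent training accuracy is inflated by the selection step while the holdout stays near chance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 2000, 20                     # points, variables, variables kept (scaled down)
X_train, X_hold = rng.standard_normal((n, d)), rng.standard_normal((n, d))
y_train, y_hold = rng.choice([-1, 1], n), rng.choice([-1, 1], n)   # random labels: no signal

# Select the k variables most correlated with the label -- on the training set.
corr = X_train.T @ y_train / n
V = np.argsort(-np.abs(corr))[:k]
s = np.sign(corr[V])

# Linear classifier sign(sum_{i in V} s_i * x_i).
def accuracy(X, y):
    return np.mean(np.sign(X[:, V] @ s) == y)

print("training accuracy:", accuracy(X_train, y_train))   # inflated by the selection step
print("holdout accuracy: ", accuracy(X_hold, y_hold))     # close to 0.5, the truth
```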

23 Conclusions
Adaptive data analysis is ubiquitous and useful, but it leads to false discovery and overfitting; it is possible to model it and improve on standard approaches.
Further work: new and stronger theory; tuning to specific applications; practical applications.

24 The cast
Cynthia Dwork (Harvard), Moritz Hardt (Google Research), Toni Pitassi (U. of Toronto), Omer Reingold (Stanford U.), Aaron Roth (U. Penn).
The reusable holdout: Preserving validity in adaptive data analysis. Science, 2015.
Preserving Statistical Validity in Adaptive Data Analysis. Symposium on the Theory of Computing (STOC), 2015.
Generalization in Adaptive Data Analysis and Holdout Reuse. Neural Information Processing Systems (NIPS), 2015.

