1
Preserving Validity in Adaptive Data Analysis
Vitaly Feldman, Accelerated Discovery Lab, IBM Research - Almaden
2
Distribution $P$ over a domain $X$.
Dataset $S = (x_1, \dots, x_n) \sim P^n$.
Analysis of $S$ produces results: a parameter estimate, a classifier, a clustering, etc.
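To make the notation concrete, here is a tiny sketch of this non-adaptive pipeline; the Gaussian distribution and the mean estimate are placeholders chosen for illustration, not part of the slide.

```python
# Sketch of the basic setting: S = (x_1, ..., x_n) ~ P^n, then one analysis of S.
import numpy as np

rng = np.random.default_rng(0)
true_mean, n = 3.0, 1_000
S = rng.normal(loc=true_mean, scale=1.0, size=n)   # placeholder P = N(3, 1)

estimate = S.mean()                                # analysis: a parameter estimate
print(f"estimate from S: {estimate:.3f} (true parameter: {true_mean})")
```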
3
Data Analysis 101: Does student nutrition predict academic performance?
[Plot: normalized grade, axis range 50 to 100.]
4
Check correlations
5
Pick candidate foods
6
Fit linear function of 3 selected foods
[Excel regression summary output: 100 observations; ANOVA df = 3 (regression), 96 (residual), 99 (total); Significance F on the order of 1e-05; coefficients, t-stats, and p-values for Intercept, Mushroom, Pumpkin, Nutella; p-value around 0.0001.]
FALSE DISCOVERY: the true data are uncorrelated, $x_i, y \sim N(0,1)$. Freedman's Paradox [1983].
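A minimal simulation of this effect, assuming numpy and statsmodels; the food names and sizes are placeholders. Selecting the strongest correlates and then fitting them on the same pure-noise data produces apparently significant coefficients.

```python
# Freedman's-paradox-style false discovery on pure noise:
# every candidate "food" and the grade are independent N(0,1).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, d = 100, 200                       # 100 students, 200 candidate foods (placeholders)
foods = rng.standard_normal((n, d))   # x_j ~ N(0,1), independent of the label
grade = rng.standard_normal(n)        # y ~ N(0,1), uncorrelated with every food

# Step 1: check correlations and pick the 3 most correlated foods on the SAME data.
corr = np.array([np.corrcoef(foods[:, j], grade)[0, 1] for j in range(d)])
top3 = np.argsort(-np.abs(corr))[:3]

# Step 2: fit a linear function of the 3 selected foods, again on the SAME data.
fit = sm.OLS(grade, sm.add_constant(foods[:, top3])).fit()
print(fit.summary())  # the selected coefficients look "significant": a false discovery
```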
7
Statistical inference
Data ("fresh" i.i.d. samples) → procedure (hypothesis tests, regression, learning) → result + generalization guarantees. (Speaker note: illustrate separately.)
8
Data analysis is adaptive
Steps depend on previous analyses of the same dataset: data cleaning, exploratory data analysis, variable selection, hyper-parameter tuning, shared datasets, ….
Model: data analyst(s) issue analyses $A_1, A_2, \dots, A_m$ and receive results $v_1, v_2, \dots, v_m$, where $A_i : X^n \to Y_i$ may depend on $v_1, \dots, v_{i-1}$ and $v_i = A_i(S)$. (A sketch of this interaction follows below.)
(Speaker note: mention that with an unlimited amount of data this would not be an issue. There is plenty of empirical evidence for the problem, from effects of selection to papers demonstrating overfitting to the test set.)
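A schematic of this interaction in code; the analyst's strategy and the queries below are hypothetical placeholders, only the shape of the protocol ($v_i = A_i(S)$, with $A_i$ chosen after seeing $v_1, \dots, v_{i-1}$) is the point.

```python
# Sketch of the adaptive model: each analysis A_i is chosen after seeing v_1..v_{i-1},
# yet every A_i is answered on the same sample S.
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((1000, 20))        # one fixed dataset S ~ P^n

def choose_next_analysis(previous_results):
    # Hypothetical analyst: round 1 asks for all column means; later rounds
    # zoom in on whichever column looked most extreme so far.
    if not previous_results:
        return lambda data: data.mean(axis=0)
    j = int(np.argmax(np.abs(previous_results[0])))
    return lambda data: float(data[:, j].mean())

results = []
for i in range(3):
    A_i = choose_next_analysis(results)    # adaptivity: A_i depends on v_1..v_{i-1}
    results.append(A_i(S))                 # v_i = A_i(S)
```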
9
It's an old problem: Thou shalt not test hypotheses suggested by data.
"Quiet scandal of statistics" [Leo Breiman, 1992]
10
Is this a real problem?
"Why Most Published Research Findings Are False" [Ioannidis 2005]
"Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss" [Freedman, Cockburn, Simcoe 2015]
Adaptive data analysis is one of the causes:
p-hacking
Researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011]
Garden of forking paths [Gelman, Loken 2015]
(Speaker note: add learning references.)
11
Existing approaches I: Abstinence; Pre-registration
© Center for Open Science
12
Existing approaches II
Selective inference: the same data is used for both steps A and B (selection and inference), with the inference adjusted to account for the selection. Examples: model selection + parameter inference; variable selection + regression. Survey: [Taylor, Tibshirani 2015]
13
Existing approaches III
Sample splitting: the data is split into disjoint parts (A, B, C), one per analysis step. Might be necessary for standard approaches.
14
New approach: differential privacy [Dwork, McSherry, Nissim, Smith 2006]
15
Differential privacy as outcome stability
A randomized algorithm $A$ is $\epsilon$-differentially private (DP) if for any two datasets $S, S'$ with $\mathrm{dist}(S, S') = 1$:
$$\forall Z \subseteq \mathrm{range}(A): \quad \Pr_A[A(S) \in Z] \le e^{\epsilon} \cdot \Pr_A[A(S') \in Z]$$
That is, the ratio of the output probabilities under $S$ and $S'$ is bounded.
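As a concrete (textbook, not from the talk) way to meet this definition for a single averaging query, one can add Laplace noise calibrated to the query's sensitivity; the function name and parameters below are illustrative.

```python
# Minimal sketch of an eps-DP answer to one query q(S) = average of per-record
# values in [0, 1]; changing a single record moves the average by at most 1/n,
# so Laplace noise of scale (1/n)/eps suffices (standard Laplace mechanism).
import numpy as np

def dp_mean(values, eps, rng=None):
    rng = rng or np.random.default_rng()
    n = len(values)
    sensitivity = 1.0 / n
    return float(np.mean(values)) + rng.laplace(loc=0.0, scale=sensitivity / eps)
```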
16
DP composes adaptively: if $A(S)$ is $\epsilon_1$-DP and $B(S, y)$ is $\epsilon_2$-DP for every $y \in Y$, then the composed algorithm $S \mapsto B(S, A(S))$ is $(\epsilon_1 + \epsilon_2)$-DP.
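The composition claim follows by bounding the two probability ratios separately; a sketch of the standard argument in the slide's notation (assuming $A$ has discrete range and $A$, $B$ use independent randomness):

```latex
% Let C(S) = B(S, A(S)). For any outcome set Z and any neighboring S, S':
\Pr[C(S) \in Z]
  = \sum_{y \in Y} \Pr[A(S) = y] \,\Pr[B(S, y) \in Z]
  \le \sum_{y \in Y} e^{\epsilon_1} \Pr[A(S') = y] \; e^{\epsilon_2} \Pr[B(S', y) \in Z]
  = e^{\epsilon_1 + \epsilon_2} \,\Pr[C(S') \in Z].
```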
17
DP composes adaptively: the composition above is $(\epsilon_1 + \epsilon_2)$-DP.
DP implies generalization: DP is a strong form of algorithmic stability.
18
Approaches:
Differential privacy
(Approximate) max-information
Description length
These can be used to analyze any algorithm; the goal is primarily to lay the theoretical groundwork.
Additional approaches:
Mutual information [Russo, Zou 2016]
KL-stability [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 2016]
Typical stability [Bassily, Freund 2016]
19
Application: holdout validation
The data is split into a training part and a holdout/testing part. Learning happens only on the training data; on the holdout, only inference happens: estimating the test error $\mathrm{Err}(f)$ of a predictor $f$.
20
Reusable Holdout algorithm
The data is split into a training set $T$ and a holdout set $H$. The analyst(s) submit adaptively chosen predictors $f_1, f_2, \dots, f_m$ to the reusable holdout algorithm, which holds $H$ and returns error estimates $\mathrm{Err}(f_1), \mathrm{Err}(f_2), \dots, \mathrm{Err}(f_m)$.
21
Thresholdout Theorem: given a holdout set of $n$ i.i.d. samples, Thresholdout can estimate the true error of a "large" number of adaptively chosen predictors, as long as only "few" of them overfit to the training set.
(Speaker note: we know reasonably well what the test set is used for: primarily to estimate the error.)
Can also be applied to maintaining an accurate leaderboard in machine learning competitions [Blum, Hardt 2015].
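A minimal sketch of a Thresholdout-style reusable holdout, following the high-level description in [Dwork et al., Science 2015]; the class name, default parameters, noise scales, and budget handling here are simplifications for illustration rather than the paper's exact algorithm.

```python
# Thresholdout-style reusable holdout (simplified sketch, not the exact published tuning).
import numpy as np

class Thresholdout:
    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, budget=1000, seed=0):
        self.train, self.holdout = train, holdout
        self.T, self.sigma, self.budget = threshold, sigma, budget
        self.rng = np.random.default_rng(seed)

    def estimate(self, phi):
        """phi maps a dataset to a value in [0, 1], e.g. the error of a predictor."""
        train_val, hold_val = phi(self.train), phi(self.holdout)
        if abs(hold_val - train_val) <= self.T + self.rng.laplace(scale=2 * self.sigma):
            return train_val                 # no evidence of overfitting: reuse training value
        if self.budget <= 0:
            raise RuntimeError("overfitting budget exhausted")
        self.budget -= 1                     # pay only when the holdout is actually consulted
        return hold_val + self.rng.laplace(scale=self.sigma)
```

The design intuition: the (noisy) holdout value is released only when it visibly disagrees with the training value, which is what lets one holdout set answer many adaptively chosen queries.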
22
Illustration: given points in $\mathbb{R}^d \times \{-1, +1\}$.
Pick variables correlated with the label on the training set and validate them on the holdout set. Check the prediction error of the linear classifier $\mathrm{sign}\!\left(\sum_{i \in V} s_i x_i\right)$, where $V$ is the selected set and $s_i$ is the sign of the correlation.
Data distribution: 10,000 points from $N(0,1)^{10{,}000}$, labeled randomly.
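A sketch of this experiment using a naively reused holdout (the behavior Thresholdout is designed to prevent); the sizes below are reduced from the slide's 10,000 so the sketch runs quickly, and the selection rule is one plausible reading of "validate on the holdout set".

```python
# Data and labels are pure noise, so the true error of any classifier is 50%;
# reusing the holdout during variable selection still produces spurious "skill".
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2_000, 2_000, 500
X_tr, y_tr = rng.standard_normal((n, d)), rng.choice([-1.0, 1.0], n)
X_ho, y_ho = rng.standard_normal((n, d)), rng.choice([-1.0, 1.0], n)

c_tr = X_tr.T @ y_tr / n                                # per-variable correlations (training)
c_ho = X_ho.T @ y_ho / n                                # per-variable correlations (holdout)

top = np.argsort(-np.abs(c_tr))[:k]                     # strongest on the training set
V = top[np.sign(c_ho[top]) == np.sign(c_tr[top])]       # "validated" on the reused holdout
s = np.sign(c_tr[V])                                    # s_i = sign of the correlation

def error(X, y):
    return float(np.mean(np.sign(X[:, V] @ s) != y))    # classifier sign(sum_{i in V} s_i x_i)

print("training error:", error(X_tr, y_tr))  # well below 0.5 (selected on training)
print("holdout  error:", error(X_ho, y_ho))  # also below 0.5, although the truth is 0.5
```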
23
Conclusions
Adaptive data analysis:
Ubiquitous and useful
Leads to false discovery and overfitting
Possible to model and to improve on standard approaches
Further work:
New and stronger theory
Tuning to specific applications
Practical applications
24
The cast: Cynthia Dwork (Harvard), Moritz Hardt (Google Research), Toni Pitassi (U. of Toronto), Omer Reingold (Stanford), Aaron Roth (U. Penn).
The reusable holdout: Preserving validity in adaptive data analysis. Science, 2015.
Preserving Statistical Validity in Adaptive Data Analysis. Symposium on the Theory of Computing (STOC), 2015.
Generalization in Adaptive Data Analysis and Holdout Reuse. Neural Information Processing Systems (NIPS), 2015.