Slide 1: Understanding Generalization in Adaptive Data Analysis
Vitaly (the West Coast) Feldman
Slide 2: Overview
- Adaptive data analysis: motivation, framework, basic results (with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15])
- New results (with Thomas Steinke)
- Open problems
Slide 3 (diagram): A probability distribution $P$ over a domain $X$; a dataset $S = (x_1,\dots,x_n) \sim P^n$; an analysis $A$ producing results $v = A(S)$. The question: $\mathbb{E}_{x\sim P}[\mathrm{Loss}(f,x)] = \,?$
Slide 4: Statistical inference
Data: $S$, $n$ i.i.d. samples from $P$. Algorithm $A$: a hypothesis test, parameter estimator, or classifier, with output $v = A(S)$.
Theory provides generalization guarantees for $v$, i.e. bounds on $\mathrm{Loss}_P(v)$: concentration/CLT, model complexity (Rademacher complexity), stability, online-to-batch conversions.
Slide 5: Data analysis is adaptive
Steps depend on previous analyses of the same dataset: the analyst(s) run $A_1$ on $S$ to get $v_1$, then choose $A_2$ and get $v_2$, and so on up to $A_k$ and $v_k$. Examples: data pre-processing, exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, …
Slide 6: Thou shalt not test hypotheses suggested by data
The “quiet scandal of statistics” [Leo Breiman, 1992]
Slide 7: Reproducibility crisis?
- “Why Most Published Research Findings Are False” [Ioannidis 2005]
- “Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 2015]
Adaptive data analysis is one of the causes:
- p-hacking
- Researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011]
- Garden of forking paths [Gelman, Loken 2015]
Slide 8: Existing approaches
- Sample splitting: model selection + parameter estimation; variable selection + regression
- Selective inference
- Pre-registration
Slide 9: Adaptive data analysis [DFHPRR 14]
The algorithm holds $S = (x_1,\dots,x_n) \sim P^n$; the data analyst(s) adaptively issue analyses $A_1, A_2, \dots, A_k$ and receive answers $v_1, v_2, \dots, v_k$.
Goal: given $S$, compute $v_i$'s “close” to running $A_i$ on fresh samples. Each analysis is a query, so the task is to design an algorithm for answering adaptively-chosen queries.
Slide 10: Adaptive statistical queries
Statistical query oracle [Kearns 93]: each query is a function $\phi_i : X \to [0,1]$, answered from the data by $A_i(S) \equiv \frac{1}{n}\sum_{x\in S}\phi_i(x)$. Example: $\phi(x) = \mathrm{Loss}(f,x)$.
Requirement: $|v_i - \mathbb{E}_{x\sim P}[\phi_i(x)]| \le \tau$ with prob. $1-\beta$.
SQs can measure correlations, moments, and accuracy/loss, and suffice to run any statistical-query algorithm.
Slide 11: Answering non-adaptive SQs
Given $k$ non-adaptive query functions $\phi_1,\dots,\phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbb{E}_{x\sim P}[\phi_i(x)]$.
Use the empirical mean $\mathcal{E}_S[\phi_i] = \frac{1}{n}\sum_{x\in S}\phi_i(x)$; by a Chernoff bound and a union bound over the $k$ queries, $n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$ samples suffice.
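A minimal sketch of this baseline in Python (the names `n_required` and `answer_sqs` and the moment-query example are mine, not from the talk):

```python
import numpy as np

def n_required(k, tau, beta):
    # Hoeffding + union bound over k queries, each phi in [0,1]:
    # Pr[|mean - E[phi]| > tau] <= 2*exp(-2*n*tau^2), so
    # n = log(2k/beta) / (2*tau^2) gives overall confidence 1 - beta.
    return int(np.ceil(np.log(2 * k / beta) / (2 * tau**2)))

def answer_sqs(sample, queries):
    # Empirical mean of each query phi : X -> [0,1] over the same sample.
    return [np.mean([phi(x) for x in sample]) for phi in queries]

# Example: estimate the first 10 moments of the uniform distribution on [0, 1].
rng = np.random.default_rng(0)
queries = [lambda x, j=j: x**j for j in range(1, 11)]   # k = 10 fixed queries
n = n_required(k=10, tau=0.05, beta=0.05)
sample = rng.uniform(0, 1, size=n)
print(answer_sqs(sample, queries))   # approx [1/2, 1/3, ..., 1/11]
```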
Slide 12: Answering adaptively-chosen SQs
What if we simply reuse the empirical mean $\mathcal{E}_S[\phi_i]$? For adaptively-chosen queries it can fail badly: for some constant $\beta > 0$, $n \ge \frac{k}{\tau^2}$ samples are needed. This kind of adaptivity arises in variable selection, boosting, step-wise regression, … (see the simulation below).
Sample splitting (a fresh subsample per query): $n = O\!\left(\frac{k\log k}{\tau^2}\right)$.
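To make the failure concrete, here is a small simulation in the spirit of the variable-selection example (my construction; the constants are arbitrary). The label is independent of every feature, yet after selecting features by their empirical correlations, the aggregated classifier looks highly accurate on the same sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5000                      # n samples; d feature queries + 1 more
X = rng.choice([-1, 1], size=(n, d))   # features: independent fair coins
y = rng.choice([-1, 1], size=n)        # label: independent of all features

# Round 1: d statistical queries, the empirical correlations of x_j with y.
corr = X.T @ y / n                     # true value of each is 0

# Round 2 (adaptive): aggregate the features that looked predictive.
w = np.sign(corr) * (np.abs(corr) > 1 / np.sqrt(n))
f = np.sign(X @ w + 0.5)               # classifier suggested by round 1
print("empirical accuracy:", np.mean(f == y))   # far above 0.5
```

The true accuracy of any such classifier is exactly $1/2$ (the label is independent of the features), so the entire gap printed is adaptivity-driven overfitting.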
Slide 13: Answering adaptive SQs
[DFHPRR 14] There is an algorithm that answers $k$ adaptively-chosen SQs with accuracy $\tau$ using $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right)$ samples (vs. $\frac{k}{\tau^2}$ for data splitting).
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15] $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2}}\right)$.
The approach generalizes to low-sensitivity analyses, those with $|A_i(S) - A_i(S')| \le \frac{1}{n}$ whenever $S$ and $S'$ differ in a single element, and estimates $\mathbb{E}_{S\sim P^n}[A_i(S)]$ within $\tau$.
Slide 14: Value perturbation
Answer a low-sensitivity query $A$ with $A(S) + \zeta$, where $\zeta$ is Laplace or Gaussian noise.
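A minimal sketch of the mechanism (the Gaussian scale uses the standard calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$; the parameter choices here are illustrative, not the tuned ones from the papers):

```python
import numpy as np

rng = np.random.default_rng(2)

def perturbed_answer(sample, phi, epsilon, noise="laplace"):
    """Answer a statistical query phi : X -> [0,1] with its empirical mean
    plus noise calibrated to the replace-one sensitivity 1/n."""
    n = len(sample)
    value = np.mean([phi(x) for x in sample])
    sensitivity = 1.0 / n
    if noise == "laplace":                      # epsilon-DP
        return value + rng.laplace(scale=sensitivity / epsilon)
    else:                                       # (epsilon, delta)-DP, Gaussian
        delta = 1e-6
        sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        return value + rng.normal(scale=sigma)

sample = rng.uniform(0, 1, size=10_000)
print(perturbed_answer(sample, lambda x: x, epsilon=0.1))   # approx 0.5
```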
Slide 15: Differential privacy [Dwork, McSherry, Nissim, Smith 06]
DP implies generalization. Differential privacy is a stability property: if $M$ is $(\varepsilon,\delta)$-DP and outputs a function $\phi : X \to [0,1]$, then for every $S$, $S'$, $x$:
$$\left|\,\mathbb{E}_{\phi=M(S)}[\phi(x)] - \mathbb{E}_{\phi=M(S')}[\phi(x)]\,\right| \lesssim \varepsilon + \delta.$$
Uniform replace-one stability implies generalization in expectation [Bousquet, Elisseeff 02]:
$$\left|\,\mathbb{E}_{S\sim P^n,\,\phi=M(S)}\big[\mathcal{E}_S[\phi]\big] - \mathbb{E}_{S\sim P^n,\,\phi=M(S)}\big[\mathbb{E}_{x\sim P}[\phi(x)]\big]\,\right| \lesssim \varepsilon + \delta.$$
DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].
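A one-step sketch of the expectation argument, hiding constants in $O(\cdot)$: write the empirical mean coordinate-wise and apply the stability bound to the dataset $S^{i\to x'}$ in which $x_i$ is replaced by a fresh sample $x' \sim P$:
$$\mathbb{E}_{S,\,\phi=M(S)}\big[\mathcal{E}_S[\phi]\big]
= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[\phi(x_i)\big]
\le \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{S,\,x'\sim P,\,\phi=M(S^{i\to x'})}\big[\phi(x_i)\big] + O(\varepsilon+\delta)\Big)
= \mathbb{E}_{S,\,\phi=M(S)}\big[\mathbb{E}_{x\sim P}[\phi(x)]\big] + O(\varepsilon+\delta),$$
where the last equality holds because $x_i$ is independent of $M(S^{i\to x'})$ and $(S^{i\to x'}, x_i)$ has the same distribution as $(S, x)$ with a fresh $x \sim P$.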
Slide 16: Differential privacy [DMNS 06]
DP implies generalization because it limits the information learned about the dataset. Max-information: for an algorithm $M : X^n \to Y$ and a dataset $S \sim \mathcal{D}$ over $X^n$, $I_\infty^{\beta}(S; M(S)) \le k$ if for every event $E \subseteq X^n \times Y$:
$$\Pr_{S\sim\mathcal{D}}\big[(S, M(S)) \in E\big] \;\le\; e^{k}\cdot \Pr_{S,T\sim\mathcal{D}}\big[(T, M(S)) \in E\big] + \beta.$$
$\varepsilon$-DP bounds max-information [DFHPRR 15]; $(\varepsilon,\delta)$-DP bounds max-information for product distributions $\mathcal{D} = P^n$ [Rogers, Roth, Smith, Thakkar 16].
Slide 17: Differential privacy [DMNS 06]
DP composes adaptively: for every $\delta' > 0$ and $\varepsilon \le 1/\sqrt{k}$, the adaptive composition of $k$ $(\varepsilon,\delta)$-DP algorithms is $\left(O\!\big(\varepsilon\sqrt{k\log(1/\delta')}\big),\; \delta' + k\delta\right)$-DP [Dwork, Rothblum, Vadhan 10].
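The bound can be evaluated explicitly; a small helper (my naming) using the explicit form of the advanced composition theorem from [Dwork, Rothblum, Vadhan 10]:

```python
import math

def advanced_composition(epsilon, delta, k, delta_prime):
    """k adaptive runs of an (epsilon, delta)-DP algorithm are
    (eps_total, delta_prime + k*delta)-DP with
    eps_total = eps*sqrt(2k*ln(1/delta')) + k*eps*(e^eps - 1)."""
    eps_total = (epsilon * math.sqrt(2 * k * math.log(1 / delta_prime))
                 + k * epsilon * (math.exp(epsilon) - 1))
    return eps_total, delta_prime + k * delta

# 1000 adaptive runs of a (0.01, 1e-7)-DP mechanism:
print(advanced_composition(0.01, 1e-7, 1000, delta_prime=1e-6))
# eps_total approx 1.76, i.e. roughly eps*sqrt(k) up to the log factor
```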
Slide 18: Differential privacy [DMNS 06]
Putting the two together: if $M$ is “accurate” when fresh samples are used to answer each query, and $M$ is differentially private, then $M$ remains “accurate” when the same dataset is reused for adaptively-chosen queries.
Slide 19: Value perturbation [DMNS 06]
Answer a low-sensitivity query $A$ with $A(S) + \zeta$. Given $n$ samples, this achieves error $\approx \Delta(A)\cdot\sqrt{n}\cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity
$$\Delta(A) = \max_{S,S'}\,|A(S) - A(S')|$$
over $S$, $S'$ differing in a single element; for SQs, $\Delta(A) \le 1/n$. Note that $\Delta(A)\cdot\sqrt{n}$ could be much larger than the standard deviation of $A$ on $P$.
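A back-of-the-envelope check (constants hidden) that this recovers the $\tilde{O}(\sqrt{k}/\tau^2)$ bound for SQs: plugging in $\Delta(A) \le 1/n$,
$$\tau \;\approx\; \Delta(A)\cdot\sqrt{n}\cdot k^{1/4} \;\le\; \frac{k^{1/4}}{\sqrt{n}} \quad\Longrightarrow\quad n \;\approx\; \frac{\sqrt{k}}{\tau^{2}}.$$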
Slide 20: Beyond low-sensitivity
[F, Steinke 17] There is an algorithm that, for any adaptively-chosen sequence of analyses $A_1,\dots,A_k : X^t \to \mathbb{R}$, given $n = \tilde{O}(\sqrt{k})\cdot t$ i.i.d. samples from $P$, outputs values $v_1,\dots,v_k$ such that w.h.p. for all $i$:
$$\left|\,\mathbb{E}_{S\sim P^t}[A_i(S)] - v_i\,\right| \le 2\sigma_i, \qquad \sigma_i = \sqrt{\operatorname{Var}_{S\sim P^t}[A_i(S)]}.$$
Slide 21: Stable median
Split the dataset $S$ ($n = tm$ samples) into $m$ blocks $S_1, S_2, \dots, S_m$ of size $t$; compute $y_j = A_i(S_j)$ on each block. Then find an approximate median of $y_1, \dots, y_m$ with (weak) DP relative to the $y_j$'s: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of the values.
Slide 22: Median algorithms
DP median requires discretization: a ground set $T$ with $|T| = r$.
- Sample-complexity bounds: upper bound $2^{O(\log^* r)}$; lower bound $\Omega(\log^* r)$ [Bun, Nissim, Stemmer, Vadhan 15].
- Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with probability $\propto e^{-\varepsilon\left|\#\{y \in S \,:\, v \le y\} - m/2\right|}$; uses $O\!\left(\frac{\log r}{\varepsilon}\right)$ samples.
Stability and confidence amplification for the price of one log factor!
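A toy end-to-end sketch of the last two slides in Python (my implementation; the paper's algorithm tunes $\varepsilon$, $t$, and the ground set, which are arbitrary here): split the data into blocks, evaluate the analysis on each block, and release a DP approximate median of the block values via the exponential mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_median(ys, grid, epsilon):
    # Exponential mechanism [McSherry, Talwar 07]: score each candidate v by
    # how far its rank among the ys is from the middle, m/2.
    m = len(ys)
    ranks = np.array([np.sum(ys >= v) for v in grid])
    scores = -np.abs(ranks - m / 2)     # changing one y shifts each rank by
    probs = np.exp(epsilon * scores)    # at most 1, so scores have sens. 1
    return rng.choice(grid, p=probs / probs.sum())

def answer_query(data, analysis, t, epsilon, grid):
    # Split n = t*m samples into m blocks of size t, run the analysis on
    # each block, and return a stable (DP) approximate median of the values.
    m = len(data) // t
    ys = np.array([analysis(data[j * t:(j + 1) * t]) for j in range(m)])
    return dp_median(ys, grid, epsilon)

data = rng.normal(loc=0.3, scale=1.0, size=10_000)
grid = np.linspace(-1, 1, 201)          # discretized ground set T, r = 201
print(answer_query(data, np.mean, t=100, epsilon=0.5, grid=grid))  # ~0.3
```

Because the mechanism only needs DP with respect to the block values $y_j$, its answer concentrates around values between the empirical 1/3- and 2/3-quantiles, which is exactly the weak-median guarantee the previous slide asks for.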
Slide 23: Limits
Any algorithm for answering $k$ adaptively-chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15].
*in sufficiently high dimension or under cryptographic assumptions.
What about analysts without side information about $P$? For instance, queries that depend only on previous answers, or a fixed “natural” analyst/learning algorithm such as gradient descent for stochastic convex optimization.
Open: does there exist an analyst whose statistical queries require more than $O(\log k)$ samples to answer (with $0.1$ accuracy/confidence)?
Slide 24: ML practice
Data is split into training, validation, and test sets. A model $f$ is built from the training and validation data (e.g., with XGBoost, SVRG, TensorFlow); the test set then estimates its quality: test error of $f \approx \mathbb{E}_{x\sim P}[\mathrm{Loss}(f,x)]$.
Slide 25: Reusable holdout [DFHPRR 15]
Data is split into a training set $T$ and a holdout $H$. The analyst (an “AI guru”) adaptively submits functions $f_1, f_2, \dots, f_k$, and the reusable-holdout algorithm answers with estimates of $\mathrm{Loss}_P(f_1), \mathrm{Loss}_P(f_2), \dots, \mathrm{Loss}_P(f_k)$.
Slide 26: Reusable holdout [DFHPRR 15, FS 17]
There is an algorithm that accurately estimates the loss of $k$ adaptively-chosen functions as long as at most $\ell$ of them overfit to the training set, using $n \sim \sqrt{\ell}\cdot\log k$ holdout samples. Here $f$ overfits when its training loss deviates from its true loss, i.e. $\mathbb{E}_{x\sim T}[\mathrm{Loss}(f,x)] - \mathbb{E}_{x\sim P}[\mathrm{Loss}(f,x)]$ is large.
The reason this is possible: verifying mostly-correct answers with DP is cheap, via the sparse vector technique [Dwork, Naor, Reingold, Rothblum, Vadhan 09] (see the sketch below).
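A minimal sketch of the Thresholdout mechanism from [DFHPRR 15] (the threshold, noise scale, and budget below are illustrative, and the paper's version adds noise in a few more places): answers come from the training set unless the holdout disagrees beyond a noisy threshold, in which case a noisy holdout value is released.

```python
import numpy as np

rng = np.random.default_rng(4)

class Thresholdout:
    """Sketch of the reusable holdout from [DFHPRR 15]."""
    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, budget=100):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma = threshold, sigma
        self.budget = budget          # number of overfitting events allowed

    def estimate(self, phi):
        # phi : X -> [0, 1], e.g. phi(x) = Loss(f, x) for a candidate model f.
        t = np.mean([phi(x) for x in self.train])
        h = np.mean([phi(x) for x in self.holdout])
        noisy_gap = abs(t - h) - rng.laplace(scale=self.sigma)
        if noisy_gap <= self.threshold:
            return t                  # training answer agrees; costs nothing
        if self.budget == 0:
            raise RuntimeError("holdout exhausted")
        self.budget -= 1              # overfitting detected: pay, answer from H
        return h + rng.laplace(scale=self.sigma)

data = rng.uniform(0, 1, size=2_000)
th = Thresholdout(train=data[:1_000], holdout=data[1_000:])
print(th.estimate(lambda x: x < 0.5))   # a non-overfit query: ~0.5, from T
```

Only queries that actually overfit consume holdout budget, which is how the sample cost ends up scaling with the number of overfitting functions $\ell$ rather than with $k$.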
Slide 27: Conclusions
- Datasets are reused adaptively; this calls for a new conceptual framework.
- Deep connections to DP: privacy and generalization are aligned, and data “freshness” is a limited resource.
- Real-valued analyses can be handled without any assumptions.
Directions: going beyond adversarial adaptivity; connections to stability and selective inference; using these techniques in practice.