1
Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman
2
Overview
Adaptive data analysis: motivation, definitions, and basic techniques (with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15]).
New results [F, Steinke 17].
Open problems.
3
E_{x∼P}[Loss(f, x)] = ?
Learning problem: a distribution P over a domain X; data S = (x_1, …, x_n) ∼ P^n; an analysis A (e.g. XGBoost, SVRG, Adagrad, SVM) outputs a model f = A(S).
4
Statistical inference
Data: S, n i.i.d. samples from P. Algorithm A outputs f = A(S).
Theory (model complexity, Rademacher complexity, stability, online-to-batch, …) gives generalization guarantees for f = A(S).
5
Data analysis is adaptive
Steps depend on previous analyses of the same dataset: the data analyst(s) run A_1 on S to get v_1, then A_2 to get v_2, …, then A_k to get v_k.
Examples: exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, …
6
Thou shalt not test hypotheses suggested by data
"Quiet scandal of statistics" [Leo Breiman, 1992]
7
ML practice
Split the data into a training set and a test set. Train f (Lasso, k-NN, SVM, C4.5, kernels, …) on the training set; the test error of f ≈ E_{x∼P}[Loss(f, x)].
8
ML practice now
Split the data into training, validation, and test sets. Models (XGBoost, SVRG, TensorFlow, …) are trained and repeatedly tuned against the validation set; the test error of the final f ≈ E_{x∼P}[Loss(f, x)].
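The split itself is simple; here is a minimal NumPy sketch of the three-way split described above (my illustration, not from the slides; the split fractions and example data are arbitrary assumptions).

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out validation and test sets.

    The test set is meant to be touched only once, to estimate
    E_{x~P}[Loss(f, x)] for the final model f.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test, n_val = int(test_frac * len(X)), int(val_frac * len(X))
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Example: tune hyper-parameters against the validation set, report test error once.
X = np.random.default_rng(1).normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
train_xy, val_xy, test_xy = train_val_test_split(X, y)
```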
9
Adaptive data analysis [DFHPRR 14]
The algorithm holds S = (x_1, …, x_n) ∼ P^n. The data analyst(s) adaptively issue analyses A_1, A_2, …, A_k and receive answers v_1, v_2, …, v_k.
Goal: given S, compute v_i's "close" to running A_i on fresh samples.
Each analysis is a query; design an algorithm for answering adaptively-chosen queries.
10
Adaptive statistical queries
Statistical query oracle [Kearns 93]: queries are functions φ_i : X → [0, 1], e.g. φ_i(x) = Loss(f, x).
On S = (x_1, …, x_n) ∼ P^n the empirical value is A_i(S) ≡ (1/n) Σ_{x∈S} φ_i(x).
Requirement: |v_i − E_{x∼P}[φ_i(x)]| ≤ τ with probability 1 − β.
Can measure correlations, moments, accuracy/loss; can run any statistical query algorithm.
11
Answering non-adaptive SQs
Given k non-adaptive query functions φ_1, …, φ_k and n i.i.d. samples from P, estimate E_{x∼P}[φ_i(x)].
Use the empirical mean: E_S[φ_i] = (1/n) Σ_{x∈S} φ_i(x). Then n = O(log(k/β)/τ²) samples suffice.
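A small sketch of the empirical-mean answers (my illustration; the example queries are assumptions), with the n = O(log(k/β)/τ²) guarantee noted in a comment:

```python
import numpy as np

def answer_nonadaptive_sqs(samples, queries):
    """Answer each fixed query phi_i : X -> [0, 1] by its empirical mean on S.

    For k non-adaptive queries, n = O(log(k/beta) / tau^2) i.i.d. samples make
    every answer tau-accurate with probability 1 - beta (Chernoff bound plus a
    union bound over the k queries).
    """
    return [float(np.mean([phi(x) for x in samples])) for phi in queries]

# Example: estimate the first two moments of the uniform distribution on [0, 1].
samples = np.random.default_rng(0).uniform(0.0, 1.0, size=2000)
print(answer_nonadaptive_sqs(samples, [lambda x: x, lambda x: x ** 2]))  # ~[0.5, 0.33]
```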
12
Answering adaptively-chosen SQs
What if we use the empirical mean E_S[φ_i]? With adaptively-chosen queries it can overfit: for some constant β > 0, accuracy τ requires n ≥ k/τ² (e.g. variable selection, boosting, bagging, step-wise regression, …).
Data splitting: answer each query from a fresh chunk of the data, giving n = O(k·log(k)/τ²); see the sketch below.
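A minimal sketch of the data-splitting baseline (my own illustration; equal-sized chunks and [0, 1]-valued queries are assumptions):

```python
import numpy as np

def split_data_answerer(samples, k):
    """Answer up to k adaptively chosen SQs by spending a fresh chunk per query.

    Each chunk holds n/k points, so tau-accuracy for all k answers needs
    roughly n = k * log(k) / tau^2 samples overall: linear in k.
    """
    chunks = np.array_split(np.asarray(samples), k)
    state = {"next": 0}

    def answer(phi):
        chunk = chunks[state["next"]]
        state["next"] += 1
        return float(np.mean([phi(x) for x in chunk]))

    return answer

# Example: the second query's threshold depends on the first answer (adaptivity).
answer = split_data_answerer(np.random.default_rng(1).normal(size=10000), k=2)
v1 = answer(lambda x: float(x > 0.0))
v2 = answer(lambda x: float(x > v1))
```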
13
Answering adaptive SQs
[DFHPRR 14] There exists an algorithm that can answer k adaptively-chosen SQs with accuracy τ using n = Õ(√k/τ^{2.5}) samples (vs. k/τ² for data splitting).
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15] n = Õ(√k/τ²).
Generalizes to low-sensitivity analyses, i.e. |A_i(S) − A_i(S′)| ≤ 1/n whenever S, S′ differ in a single element: estimates E_{S∼P^n}[A_i(S)] within τ.
14
Differential privacy [Dwork,McSherry,Nissim,Smith 06]
Randomized algorithm M is (ε, δ)-differentially private if for any two datasets S, S′ that differ in one element and every Z ⊆ range(M):
Pr[M(S) ∈ Z] ≤ e^ε · Pr[M(S′) ∈ Z] + δ
(the output distributions on S and S′ have bounded ratio).
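To make the definition concrete, here is a standard randomized-response sketch (my illustration, not part of the talk); flipping each bit with probability 1/(e^ε + 1) satisfies (ε, 0)-DP:

```python
import numpy as np

def randomized_response(bits, eps, seed=0):
    """(eps, 0)-DP release of a vector of private bits.

    Each bit is reported truthfully with probability e^eps / (e^eps + 1) and
    flipped otherwise, so changing one input bit changes the probability of
    any output by a factor of at most e^eps.
    """
    rng = np.random.default_rng(seed)
    bits = np.asarray(bits)
    keep_prob = np.exp(eps) / (np.exp(eps) + 1.0)
    flip = rng.random(bits.shape) > keep_prob
    return np.where(flip, 1 - bits, bits)

print(randomized_response([0, 1, 1, 0, 1], eps=np.log(3)))  # keep prob = 3/4
```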
15
Differential privacy is stability
Differential privacy is stability: it implies strongly uniform replace-one stability and generalization in expectation.
DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].
DP composes adaptively: for every δ > 0, the composition of k ε-DP algorithms is (O(ε·√(k·log(1/δ))), δ)-DP [Dwork, Rothblum, Vadhan 10].
16
Value perturbation [DMNS 06]
Answer a low-sensitivity query A with A(S) + ζ, where ζ is Gaussian noise N(0, σ).
Given n samples this achieves error ≈ Δ(A)·√n·k^{1/4}, where Δ(A) is the worst-case sensitivity max_{S,S′} |A(S) − A(S′)| over datasets differing in one element (for statistical queries, Δ(A) ≤ 1/n).
But Δ(A)·√n can be much larger than the standard deviation of A(S): the noise is calibrated to worst-case sensitivity rather than to how much A actually fluctuates.
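A sketch of value perturbation for a single statistical query (my own; the Gaussian-mechanism noise calibration shown is the standard one, and the parameter names are hypothetical):

```python
import numpy as np

def perturbed_sq_answer(samples, phi, eps, delta, seed=0):
    """Answer a statistical query with value perturbation: A(S) + Gaussian noise.

    For phi : X -> [0, 1] the empirical mean A(S) has sensitivity Delta = 1/n,
    and sigma = Delta * sqrt(2 * ln(1.25/delta)) / eps gives (eps, delta)-DP.
    The noise scale is set by the worst-case sensitivity, not by how much A(S)
    actually fluctuates, which is the gap the next slide closes.
    """
    samples = np.asarray(samples)
    empirical = float(np.mean([phi(x) for x in samples]))
    sensitivity = 1.0 / len(samples)
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return empirical + np.random.default_rng(seed).normal(0.0, sigma)

v = perturbed_sq_answer(np.random.default_rng(0).uniform(size=5000),
                        lambda x: x, eps=0.5, delta=1e-6)
```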
17
Beyond low-sensitivity
[F, Steinke 17] There exists an algorithm that, for any adaptively-chosen sequence A_1, …, A_k : X^t → R, given n = Õ(√k)·t i.i.d. samples from P, outputs values v_1, …, v_k such that w.h.p. for all i:
|E_{S∼P^t}[A_i(S)] − v_i| ≤ 2σ_i, where σ_i² = Var_{S∼P^t}[A_i(S)].
For statistical queries φ_i : X → [−B, B], given n samples the error scales as √(Var_{x∼P}[φ_i(x)])·k^{1/4}/√n, whereas value perturbation gives B·k^{1/4}/√n.
18
Stable Median
Split S into m chunks S_1, S_2, …, S_m of size t (n = t·m) and run A_i on each chunk to get values y_1, y_2, …, y_m.
Find an approximate median with DP relative to S: a value v greater than the bottom 1/3 and smaller than the top 1/3 of the chunk values.
19
Median algorithms
Require discretization: a ground set T with |T| = r.
General DP median: upper bound 2^{O(log* r)} samples; lower bound Ω(log* r) samples [Bun, Nissim, Stemmer, Vadhan 15].
Exponential mechanism [McSherry, Talwar 07]: output v ∈ T with probability ∝ exp(−ε·|#{y ∈ S : v ≤ y} − m/2|); uses O(log(r)/ε) samples.
Stability and confidence amplification for the price of one log factor!
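A compact sketch combining the last two slides (my illustration; the discretization grid, ε, and chunk count are assumptions): evaluate the analysis on m disjoint chunks and release an approximate median of the chunk values with the exponential mechanism.

```python
import numpy as np

def dp_median(values, grid, eps, rng):
    """Exponential mechanism for an approximate median over a finite grid T.

    Score of a grid point v is -|#{y : y <= v} - m/2| (sensitivity 1); sampling
    with probability proportional to exp(eps * score / 2) is the standard
    exponential-mechanism scaling and, w.h.p., returns a value between the
    bottom-1/3 and top-1/3 quantiles of the chunk values.
    """
    values = np.asarray(values)
    m = len(values)
    scores = -np.abs(np.array([(values <= v).sum() for v in grid]) - m / 2.0)
    probs = np.exp(eps * (scores - scores.max()) / 2.0)
    return float(rng.choice(grid, p=probs / probs.sum()))

def answer_via_stable_median(samples, analysis, m, grid, eps, seed=0):
    """Split S into m chunks, run the analysis on each, release a DP median."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(np.asarray(samples), m)
    return dp_median([analysis(c) for c in chunks], grid, eps, rng)

# Example: answer a mean query through the stable-median interface.
samples = np.random.default_rng(0).normal(loc=0.3, size=20000)
grid = np.linspace(-1.0, 1.0, 201)  # discretized ground set T, |T| = r = 201
print(answer_via_stable_median(samples, np.mean, m=50, grid=grid, eps=1.0))
```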
20
Analysis
Differential privacy approximately preserves quantiles: if v lies within the middle empirical quantiles of the chunk values, then it lies within (slightly wider) true quantiles, and hence within the mean ± 2σ.
If A is well-concentrated on D, it is then easy to prove high-probability bounds.
[F, Steinke 17] Let M be a DP algorithm that on input S ∈ X^n outputs a function f : X → R and a value v ∈ R. Then w.h.p. over S ∼ D^n and (f, v) ← M(S):
Pr_{y∼D}[v ≤ f(y)] ≈ Pr_{y∼S}[v ≤ f(y)]
21
Limits
Any algorithm for answering k adaptively-chosen SQs with accuracy τ requires* n = Ω(√k/τ) samples [Hardt, Ullman 14; Steinke, Ullman 15].
*in sufficiently high dimension or under crypto assumptions.
Verification of responses to queries: n = O(√m·log k), where m is the number of queries that failed verification.
Data splitting if overfitting [DFHPRR 14]; the reusable holdout [DFHPRR 15]; maintaining a public leaderboard in a competition [Blum, Hardt 15].
22
Open problems
Analysts without side information about P: queries depend only on previous answers.
A fixed "natural" analyst / learning algorithm, e.g. gradient descent for stochastic convex optimization.
Does there exist an SQ analyst whose queries require more than O(log k) samples to answer (with 0.1 accuracy/confidence)?
23
Stochastic convex optimization
Convex body K = B_2^d(1) = {x : ‖x‖_2 ≤ 1}; class F of convex 1-Lipschitz functions, F = {f convex : ∀x ∈ K, ‖∇f(x)‖_2 ≤ 1}.
Given f_1, …, f_n sampled i.i.d. from an unknown distribution P over F, minimize the true (expected) objective F_P(x) ≐ E_{f∼P}[f(x)] over K: find x̄ such that F_P(x̄) ≤ min_{x∈K} F_P(x) + ε.
24
Gradient descent
ERM via projected gradient descent on F_S(x) ≐ (1/n) Σ_i f_i(x):
Initialize x_1 ∈ K; for t = 1 to T, set x_{t+1} = Project_K(x_t − η·∇F_S(x_t)); output (1/T) Σ_t x_t.
Each step uses the averaged gradient (1/n) Σ_i ∇f_i(x_t). On fresh samples ‖∇F_S(x_t) − ∇F_P(x_t)‖_2 ≤ 1/√n, but the sample complexity of reusing the same S across iterations is unknown.
Uniform convergence: O(d/ε²) samples (tight [F. 16]). SGD solves the problem with O(1/ε²) samples [Robbins, Monro 51; Polyak 90].
Overall: d/ε² statistical queries with accuracy ε in 1/ε² adaptive rounds. Sample splitting: O(log(d)/ε⁴) samples. DP-based answering: Õ(√d/ε³) samples. (A sketch follows below.)
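A sketch of the projected-gradient-descent loop above, framed as adaptive statistical queries (my illustration; the quadratic losses, step size, and iteration count are assumptions):

```python
import numpy as np

def project_to_ball(x):
    """Projection onto K = {x : ||x||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def projected_gd(grad_fns, d, steps, eta):
    """ERM by projected gradient descent on F_S(x) = (1/n) * sum_i f_i(x).

    Each iterate issues d statistical queries (one per coordinate of the
    averaged gradient), and the queries at step t depend on all earlier
    answers: these are the adaptive rounds counted on the slide.
    """
    x = np.zeros(d)
    iterates = []
    for _ in range(steps):
        grad = np.mean([g(x) for g in grad_fns], axis=0)  # gradient of F_S at x_t
        x = project_to_ball(x - eta * grad)
        iterates.append(x)
    return np.mean(iterates, axis=0)  # average of the iterates

# Example: quadratic losses f_i(x) = 0.5 * ||x - a_i||^2 with small random centers a_i.
rng = np.random.default_rng(0)
d, n = 10, 500
centers = rng.normal(scale=0.1, size=(n, d))
grad_fns = [lambda x, a=a: x - a for a in centers]
x_bar = projected_gd(grad_fns, d=d, steps=100, eta=0.1)
```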
25
Conclusions
Real-valued analyses (without any assumptions).
Going beyond tools from DP: other notions of stability for outcomes; max/mutual information.
Generalization beyond uniform convergence.
Using these techniques in practice.