Understanding Generalization in Adaptive Data Analysis


1 Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman

2 Overview
Adaptive data analysis: motivation, definitions, basic techniques (with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15])
New results [F, Steinke 17]
Open problems

3 𝐄 π‘₯βˆΌπ‘ƒ [πΏπ‘œπ‘ π‘ (𝑓,π‘₯)]=? Learning problem Model Data 𝑓=𝐴(𝑆) Analysis 𝐴
Distribution 𝑃 over domain 𝑋 XGBoost SVRG Adagrad SVM Analysis 𝑓=𝐴(𝑆) 𝑆= π‘₯ 1 ,…, π‘₯ 𝑛 ∼ 𝑃 𝑛 𝐴

4 Statistical inference
Data: $S$, $n$ i.i.d. samples from $P$; algorithm $A$ outputs $f = A(S)$.
Theory provides generalization guarantees for $f$ via model complexity, Rademacher complexity, stability, online-to-batch conversion, ...

5 Data analysis is adaptive
Steps depend on previous analyses of the same dataset: the data analyst(s) run $A_1$ on $S$ and obtain $v_1$, then $A_2$ (chosen using $v_1$) and obtain $v_2$, ..., then $A_k$ and obtain $v_k$.
Examples: exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, ...

6 Thou shalt not test hypotheses suggested by data
“Quiet scandal of statistics” [Leo Breiman, 1992]

7 ML practice
Split the data into a training set and a testing set. Training (Lasso, k-NN, SVM, C4.5, kernels, ...) produces $f$; then the test error of $f$ $\approx \mathbf{E}_{x \sim P}[\mathrm{Loss}(f, x)]$.

8 ML practice now
Split the data into training, validation, and testing sets. Training (XGBoost, SVRG, Tensorflow, ...) produces $f_\theta$; validation tunes the hyper-parameters $\theta$; then the test error of $f_\theta$ $\approx \mathbf{E}_{x \sim P}[\mathrm{Loss}(f_\theta, x)]$.

9 Adaptive data analysis [DFHPRR 14]
An algorithm holds $S = (x_1, \ldots, x_n) \sim P^n$ and interacts with the data analyst(s): query $A_1$ is answered with $v_1$, query $A_2$ with $v_2$, ..., query $A_k$ with $v_k$, where each query may depend on the previous answers.
Goal: given $S$, compute $v_i$'s "close" to the result of running $A_i$ on fresh samples.
Each analysis is a query; the task is to design an algorithm for answering adaptively-chosen queries, as sketched below.
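To make the interaction model concrete, here is a minimal sketch (not from the slides; `mechanism` and `analyst` are hypothetical stand-ins) of the query-answering loop:

    def interact(mechanism, analyst, k):
        """Run k rounds of adaptive data analysis.

        mechanism.answer(query) computes an estimate v_i from its private
        sample S; analyst(answers) chooses the next query A_i from all
        previous answers, which is exactly what makes the analysis adaptive.
        """
        answers = []
        for _ in range(k):
            query = analyst(answers)          # A_i may depend on v_1..v_{i-1}
            answers.append(mechanism.answer(query))
        return answers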

10 Adaptive statistical queries
Statistical query oracle [Kearns 93]: each query is given by a function $\phi_i : X \to [0,1]$ and asks for $A_i(S) \equiv \frac{1}{n} \sum_{x \in S} \phi_i(x)$; for example, $\phi_i = \mathrm{Loss}(f, \cdot)$.
Accuracy requirement: $|v_i - \mathbf{E}_{x \sim P}[\phi_i(x)]| \le \tau$ with probability $1 - \beta$.
Statistical queries can measure correlations, moments, and accuracy/loss, and suffice to run any statistical query algorithm.

11 Answering non-adaptive SQs
Given π‘˜ non-adaptive query functions πœ™ 1 ,… ,πœ™ π‘˜ and 𝑛 i.i.d. samples from 𝑃 estimate 𝐄 π‘₯βˆΌπ‘ƒ πœ™ 𝑖 π‘₯ Use empirical mean: 𝐄 𝑆 πœ™ 𝑖 = 1 𝑛 π‘₯βˆˆπ‘† πœ™ 𝑖 π‘₯ 𝑛=𝑂 log (π‘˜/𝛽) 𝜏 2

12 Answering adaptively-chosen SQs
What if we use $\mathbf{E}_S[\phi_i]$ for adaptive queries? For some constant $\beta > 0$ this requires $n \ge k/\tau^2$: adaptive procedures such as variable selection, boosting, bagging, and step-wise regression can overfit the empirical means.
Data splitting (a fresh chunk per query, sketched below): $n = O\!\left(k \cdot \log(k)/\tau^2\right)$.
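The data-splitting baseline is easy to state in code (a minimal sketch; `analyst` is a hypothetical callable that picks the next query from past answers):

    import numpy as np

    def answer_by_splitting(sample, analyst, k):
        """Answer k adaptively-chosen SQs, each from its own fresh chunk.

        Each query is answered on a chunk that no earlier query has seen,
        so adaptivity cannot bias the estimates; the cost is needing
        n = O(k log(k) / tau^2) samples in total.
        """
        answers = []
        for chunk in np.array_split(np.asarray(sample), k):
            phi = analyst(answers)            # may depend on previous answers
            answers.append(np.mean([phi(x) for x in chunk]))
        return answers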

13 Answering adaptive SQs
[DFHPRR 14] There exists an algorithm that can answer $k$ adaptively chosen SQs with accuracy $\tau$ for $n = \tilde{O}\!\left(\sqrt{k}/\tau^{2.5}\right)$; compare data splitting at $O\!\left(k/\tau^2\right)$.
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: $n = \tilde{O}\!\left(\sqrt{k}/\tau^2\right)$.
Generalizes to low-sensitivity analyses, i.e., $|A_i(S) - A_i(S')| \le 1/n$ whenever $S, S'$ differ in a single element, and estimates $\mathbf{E}_{S \sim P^n}[A_i(S)]$ within $\tau$.

14 Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm $M$ is $(\epsilon, \delta)$-differentially private if for any two datasets $S, S'$ that differ in one element, the output distributions have bounded ratio: $\forall Z \subseteq \mathrm{range}(M)$, $\Pr_M[M(S) \in Z] \le e^{\epsilon} \cdot \Pr_M[M(S') \in Z] + \delta$.
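For intuition, a standard instantiation (a sketch, not specific to this talk): the Laplace mechanism, which releases an SQ's empirical mean with $\epsilon$-DP by adding noise calibrated to its sensitivity.

    import numpy as np

    def laplace_mechanism(sample, phi, epsilon, rng):
        """epsilon-DP release of an empirical mean via the Laplace mechanism.

        phi maps a point into [0, 1], so replacing one of the n elements
        moves the empirical mean by at most 1/n (its sensitivity);
        Laplace noise of scale 1/(n * epsilon) then gives epsilon-DP.
        """
        n = len(sample)
        return np.mean([phi(x) for x in sample]) + rng.laplace(scale=1.0 / (n * epsilon))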

15 Differential privacy is stability
DP implies generalization: it implies strongly uniform replace-one stability and generalization in expectation, and also generalization with high probability [DFHPRR 14, BNSSSU 15].
DP composes adaptively: for every $\delta > 0$, the composition of $k$ $\epsilon$-DP algorithms is $\left(\epsilon\sqrt{k \log(1/\delta)}, \delta\right)$-DP [Dwork, Rothblum, Vadhan 10].
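The composition bound doubles as a privacy calculator (a sketch using the slide's form of the bound, constants omitted):

    import math

    def composed_epsilon(epsilon, k, delta):
        """Privacy of k adaptively composed epsilon-DP steps [DRV 10]:
        the composition is (epsilon * sqrt(k * log(1/delta)), delta)-DP."""
        return epsilon * math.sqrt(k * math.log(1.0 / delta))

    # E.g., 10,000 steps of 0.001-DP compose to roughly (0.37, 1e-6)-DP:
    print(composed_epsilon(0.001, 10_000, 1e-6))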

16 Value perturbation [DMNS 06]
Answer a low-sensitivity query $A$ with $A(S) + \zeta$, where $\zeta$ is Gaussian noise $N(0, \tau)$.
Given $n$ samples this achieves error $\approx \Delta(A) \cdot \sqrt{n} \cdot k^{1/4}$, where $\Delta(A) = \max_{S, S'} |A(S) - A(S')| \le 1/n$ is the worst-case sensitivity.
Note: $\Delta(A) \cdot \sqrt{n}$ could be much larger than the standard deviation of $A$.
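A minimal sketch of value perturbation for SQ empirical means (the exact noise calibration from [DMNS 06] and [BNSSSU 15] is omitted; `tau` is taken as a parameter):

    import numpy as np

    def perturbed_answers(sample, analyst, k, tau, rng):
        """Answer k adaptive SQs by Gaussian-perturbing empirical means.

        The empirical mean of phi: X -> [0, 1] has sensitivity 1/n, so
        Gaussian noise of standard deviation tau (calibrated on the order
        of k^(1/4)/sqrt(n)) keeps the whole transcript differentially
        private, which is what protects against adaptivity.
        """
        answers = []
        for _ in range(k):
            phi = analyst(answers)
            answers.append(np.mean([phi(x) for x in sample]) + rng.normal(scale=tau))
        return answers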

17 Beyond low-sensitivity
[F, Steinke 17] There exists an algorithm such that, for any adaptively-chosen sequence $A_1, \ldots, A_k : X^t \to \mathbb{R}$, given $n = \tilde{O}(\sqrt{k}) \cdot t$ i.i.d. samples from $P$, it outputs values $v_1, \ldots, v_k$ such that w.h.p. for all $i$: $\left|\mathbf{E}_{S \sim P^t}[A_i(S)] - v_i\right| \le 2\sigma_i$, where $\sigma_i = \sqrt{\mathbf{Var}_{S \sim P^t}[A_i(S)]}$.
For statistical queries $\phi_i : X \to [-B, B]$, given $n$ samples the error scales as $\sqrt{\mathbf{Var}_{x \sim P}[\phi_i(x)]/n} \cdot k^{1/4}$; value perturbation gives $(B/\sqrt{n}) \cdot k^{1/4}$.

18 Stable median
Split $S$ ($n = tm$) into disjoint subsamples $S_1, S_2, \ldots, S_m$ and apply $A_i$ to each, obtaining $U = (y_1, y_2, \ldots, y_m)$.
Find an approximate median of $U$ with DP relative to $U$: a value $v$ greater than the bottom 1/3 and smaller than the top 1/3 of the values in $U$.

19 Median algorithms
Requires discretization: a ground set $T$ with $|T| = r$.
Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with probability $\propto e^{-\epsilon \left|\#\{y \in U \,:\, v \le y\} - m/2\right|}$; uses $O\!\left(\log(r)/\epsilon\right)$ samples.
Upper bound: $2^{O(\log^* r)}$ samples; lower bound: $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15].
Stability and confidence amplification for the price of one log factor!
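A minimal sketch of the exponential mechanism for the median (usable as the `dp_median` above; the grid discretization and $\epsilon$ are illustrative assumptions):

    import numpy as np

    def dp_median(values, rng, epsilon=0.1, grid=None):
        """DP approximate median of `values` via the exponential mechanism.

        Scores each candidate v in the ground set T by how far its rank
        #{y in U : v <= y} is from m/2; changing one element of U shifts
        any rank by at most 1, so sampling v with probability proportional
        to exp(-epsilon * score) is differentially private.
        """
        if grid is None:
            grid = np.linspace(0.0, 1.0, 1001)   # ground set T, |T| = r
        m = len(values)
        ranks = np.array([np.sum(values >= v) for v in grid])
        scores = -epsilon * np.abs(ranks - m / 2.0)
        probs = np.exp(scores - scores.max())    # numerically stable softmax
        return rng.choice(grid, p=probs / probs.sum())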

20 Analysis
Differential privacy approximately preserves quantiles. [F, Steinke 17] Let $M$ be a DP algorithm that on input $U \in Y^m$ outputs a function $\phi : Y \to \mathbb{R}$ and a value $v \in \mathbb{R}$. Then w.h.p. over $U \sim D^m$ and $(\phi, v) \leftarrow M(U)$: $\Pr_{y \sim D}[v \le \phi(y)] \approx \Pr_{y \sim U}[v \le \phi(y)]$.
So if $v$ lies between the empirical 1/3 and 2/3 quantiles of $U$, then $v$ lies (approximately) between the true 1/4 and 3/4 quantiles of $D$, and a value between those quantiles is within the mean $\pm 2\sigma$ by Chebyshev's inequality. If $\phi$ is well-concentrated on $D$, high-probability bounds are then easy to prove.

21 Limits
Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15] (*in sufficiently high dimension or under crypto assumptions).
Verification of responses to queries is cheaper: $n = O(\sqrt{c} \log k)$, where $c$ is the number of queries that failed verification.
Applications: data splitting when overfitting is detected [DFHPRR 14]; the reusable holdout [DFHPRR 15], sketched below; maintaining a public leaderboard in a competition [Blum, Hardt 15].
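A minimal sketch of the reusable-holdout idea (Thresholdout from [DFHPRR 15], with simplified threshold and noise scales):

    import numpy as np

    def thresholdout(train, holdout, analyst, k, threshold, sigma, rng):
        """Reusable holdout [DFHPRR 15], simplified.

        Answers each query from the training set as long as it agrees
        with the holdout; only detected overfitting consumes the holdout's
        budget, so it supports far more queries than naive holdout reuse.
        """
        answers = []
        for _ in range(k):
            phi = analyst(answers)
            a_train = np.mean([phi(x) for x in train])
            a_hold = np.mean([phi(x) for x in holdout])
            if abs(a_train - a_hold) > threshold + rng.laplace(scale=sigma):
                answers.append(a_hold + rng.laplace(scale=sigma))  # overfit: pay
            else:
                answers.append(a_train)                            # free answer
        return answers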

22 Open problems
Analysts without side information about $P$: queries depend only on previous answers.
Fixed "natural" analysts/learning algorithms, e.g., gradient descent for stochastic convex optimization.
Does there exist an SQ analyst whose queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?

23 Stochastic convex optimization
Convex body $K = \mathbb{B}_2^d(1) \doteq \{x : \|x\|_2 \le 1\}$ and class $F = \{f \text{ convex} : \forall x \in K,\ \|\nabla f(x)\|_2 \le 1\}$ of convex 1-Lipschitz functions.
Given $f_1, \ldots, f_n$ sampled i.i.d. from an unknown $P$ over $F$, minimize the true (expected) objective $f_P(x) \doteq \mathbf{E}_{f \sim P}[f(x)]$ over $K$: find $\bar{x}$ such that $f_P(\bar{x}) \le \min_{x \in K} f_P(x) + \epsilon$.

24 Gradient descent
ERM via projected gradient descent on $f_S(x) \doteq \frac{1}{n} \sum_i f_i(x)$: initialize $x_1 \in K$; for $t = 1$ to $T$, set $x_{t+1} = \mathrm{Project}_K\!\left(x_t - \eta \cdot \nabla f_S(x_t)\right)$ with $\nabla f_S(x_t) = \frac{1}{n} \sum_i \nabla f_i(x_t)$; output $\frac{1}{T} \sum_t x_t$ (sketch below).
With fresh samples, $\|\nabla f_S(x_t) - \nabla f_P(x_t)\|_2 \le 1/\sqrt{n}$; overall this is $d/\epsilon^2$ statistical queries with accuracy $\epsilon$ in $1/\epsilon^2$ adaptive rounds. The sample complexity of this approach is unknown.
Uniform convergence: $O(d/\epsilon^2)$ samples (tight [F 16]). SGD solves the problem using $O(1/\epsilon^2)$ samples [Robbins, Monro 51; Polyak 90]. Sample splitting: $O(\log(d)/\epsilon^4)$ samples. DP: $\tilde{O}(\sqrt{d}/\epsilon^3)$ samples.

25 Conclusions
Real-valued analyses (without any assumptions).
Going beyond tools from DP: other notions of stability for outcomes; max/mutual information.
Generalization beyond uniform convergence.
Using these techniques in practice.

