Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman
Overview
- Adaptive data analysis: motivation, definitions, basic techniques (with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15])
- New results [F, Steinke 17]
- Open problems
Learning problem
- Distribution $P$ over domain $X$; data $S = x_1,\dots,x_n \sim P^n$
- Analysis $A$ (e.g., XGBoost, SVRG, Adagrad, SVM) outputs a model $f = A(S)$
- Question: what is the true loss $\mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$?
Statistical inference
- Data: $S$, $n$ i.i.d. samples from $P$; algorithm $A$ outputs $f = A(S)$
- Theory gives generalization guarantees for $f$: model complexity, Rademacher complexity, stability, online-to-batch conversions, ...
Data analysis is adaptive
- Steps depend on previous analyses of the same dataset: data analyst(s) run $A_1$ on $S$ to get $v_1$, then $A_2$ to get $v_2$, ..., then $A_k$ to get $v_k$
- Examples: exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, ...
Thou shalt not test hypotheses suggested by data
"Quiet scandal of statistics" [Leo Breiman, 1992]
ML practice
- Data is split into a training set and a test (holdout) set
- A model $f$ (Lasso, k-NN, SVM, C4.5, kernels, ...) is fit on the training set
- Test error of $f \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f,x)]$, since the test set is independent of $f$
ML practice now
- Data is split into training, validation, and test sets
- Parameters $\theta$ are tuned on the validation set (XGBoost, SVRG, TensorFlow, ...), producing $f_\theta$
- Test error of $f_\theta \approx \mathbf{E}_{x\sim P}[\mathrm{Loss}(f_\theta,x)]$
Adaptive data analysis [DFHPRR 14]
- An algorithm holds $S = x_1,\dots,x_n \sim P^n$; data analyst(s) adaptively submit analyses $A_1, A_2, \dots, A_k$ and receive answers $v_1, v_2, \dots, v_k$
- Goal: given $S$, compute $v_i$'s "close" to the result of running $A_i$ on fresh samples
- Each analysis is a query: design an algorithm for answering adaptively-chosen queries
Adaptive statistical queries
- Statistical query oracle [Kearns 93]: each query is a function $\phi_i : X \to [0,1]$; on $S = x_1,\dots,x_n \sim P^n$ the empirical answer is $A_i(S) \equiv \frac{1}{n}\sum_{x\in S}\phi_i(x)$
- Example: $\phi_i(x) = \mathrm{Loss}(f,x)$
- Requirement: $|v_i - \mathbf{E}_{x\sim P}[\phi_i(x)]| \le \tau$ with probability $1-\beta$
- SQs can measure correlations, moments, accuracy/loss; can run any statistical query algorithm
Answering non-adaptive SQs
- Given $k$ non-adaptive query functions $\phi_1,\dots,\phi_k$ and $n$ i.i.d. samples from $P$, estimate $\mathbf{E}_{x\sim P}[\phi_i(x)]$
- Use the empirical mean $\mathbf{E}_S[\phi_i] = \frac{1}{n}\sum_{x\in S}\phi_i(x)$
- $n = O\!\left(\frac{\log(k/\beta)}{\tau^2}\right)$ samples suffice (Chernoff bound plus a union bound over the $k$ fixed queries)
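To make the non-adaptive baseline concrete, here is a minimal sketch (not from the talk; names such as `answer_sqs` and the threshold queries are illustrative): answer $k$ fixed statistical queries with empirical means on a single sample and check that the worst error is small.

```python
import numpy as np

def answer_sqs(sample, queries):
    """Answer fixed (non-adaptive) statistical queries with empirical means.

    sample  : array of shape (n, d), n i.i.d. draws from P
    queries : list of functions phi mapping a data point to [0, 1]
    """
    return [np.mean([phi(x) for x in sample]) for phi in queries]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k = 10_000, 20, 100
    sample = rng.normal(size=(n, d))                       # P = N(0, I_d)
    # k fixed threshold queries phi_j(x) = 1[x_j > 0], cycling over coordinates
    queries = [lambda x, j=j % d: float(x[j] > 0.0) for j in range(k)]
    answers = answer_sqs(sample, queries)
    # The true expectation of every query is 0.5; the errors concentrate
    # at roughly sqrt(log(k)/n), matching n = O(log(k/beta)/tau^2).
    print("max error:", max(abs(a - 0.5) for a in answers))
```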
Answering adaptively-chosen SQs
- What if we still answer with the empirical mean $\mathbf{E}_S[\phi_i]$? For some constant $\beta > 0$, the answers can be $\tau$-inaccurate unless $n \ge \frac{k}{\tau^2}$
- Such adaptivity arises in variable selection, boosting, bagging, step-wise regression, ...
- Data splitting (a fresh chunk per query): $n = O\!\left(\frac{k\log k}{\tau^2}\right)$
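A standard way to see why the empirical mean fails under adaptivity (an illustrative sketch, not from the slides): ask for the empirical correlation of each of $k$ random features with a random label, then adaptively combine the features using the signs of the answers. The combined query looks strongly correlated on $S$ even though its true expectation is 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1_000, 2_000
X = rng.choice([-1.0, 1.0], size=(n, k))   # k random +/-1 features, independent of...
y = rng.choice([-1.0, 1.0], size=n)        # ...a random +/-1 label: all true correlations are 0

# Queries 1..k: correlation of each feature with the label, answered by empirical means.
empirical_corr = (X * y[:, None]).mean(axis=0)

# Adaptive query k+1: combine features using the signs of the previous answers.
phi = np.sign(empirical_corr)              # chosen *after* seeing the k answers
combined = (X @ phi) / np.sqrt(k)
print("empirical value of combined query:", float(np.mean(combined * y)))
print("true expectation on fresh data:   ", 0.0)
# The empirical value is roughly sqrt(2k/(pi*n)) ~ 1.1 here, while the truth is 0:
# the empirical mean badly overestimates once the query depends on the data.
```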
Answering adaptive SQs [DFHPRR 14]
- There exists an algorithm that answers $k$ adaptively chosen SQs with accuracy $\tau$ using $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2.5}}\right)$ samples (vs. data splitting: $O\!\left(\frac{k}{\tau^2}\right)$)
- [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: $n = \tilde{O}\!\left(\frac{\sqrt{k}}{\tau^{2}}\right)$
- Generalizes to low-sensitivity analyses: $|A_i(S) - A_i(S')| \le \frac{1}{n}$ whenever $S, S'$ differ in a single element; estimates $\mathbf{E}_{S\sim P^n}[A_i(S)]$ within $\tau$
Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm $M$ is $(\epsilon,\delta)$-differentially private if for any two datasets $S, S'$ that differ in one element:
$\forall Z \subseteq \mathrm{range}(M),\quad \Pr_M[M(S) \in Z] \le e^{\epsilon} \cdot \Pr_M[M(S') \in Z] + \delta$
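For intuition, a minimal sketch (mine, not from the talk) of the standard Laplace mechanism applied to a statistical query: the empirical mean of a $[0,1]$-valued query has sensitivity $1/n$, so adding Laplace noise of scale $1/(n\epsilon)$ makes the answer $\epsilon$-differentially private.

```python
import numpy as np

def laplace_sq(sample, phi, eps, rng):
    """eps-DP answer to a statistical query phi : X -> [0, 1].

    Replacing one of the n points changes the empirical mean by at most 1/n,
    so Laplace noise with scale 1/(n*eps) suffices for eps-DP.
    """
    n = len(sample)
    empirical = np.mean([phi(x) for x in sample])
    return empirical + rng.laplace(scale=1.0 / (n * eps))

rng = np.random.default_rng(2)
sample = rng.normal(size=(5_000, 10))
answer = laplace_sq(sample, lambda x: float(x[0] > 0.0), eps=0.1, rng=rng)
print("noisy answer:", answer, "(truth: 0.5)")
```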
Differential privacy is stability
- DP implies strongly uniform replace-one stability and generalization in expectation
- DP implies generalization with high probability [DFHPRR 14, BNSSSU 15]
- DP composes adaptively: the composition of $k$ $\epsilon$-DP algorithms is, for every $\delta > 0$, $\left(\epsilon\sqrt{k\log(1/\delta)},\, \delta\right)$-DP [Dwork, Rothblum, Vadhan 10]
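A small worked example (mine) of the composition bound: given a per-query $\epsilon$ and a target $\delta$, compute the privacy of $k$ composed $\epsilon$-DP steps via the standard form of the advanced composition theorem of [Dwork, Rothblum, Vadhan 10].

```python
import math

def advanced_composition(eps, k, delta):
    """(eps', delta)-DP guarantee for the composition of k eps-DP algorithms.

    Standard form of the advanced composition theorem:
      eps' = eps * sqrt(2*k*ln(1/delta)) + k * eps * (e^eps - 1).
    For small eps this is roughly eps * sqrt(k log(1/delta)), as on the slide.
    """
    return eps * math.sqrt(2 * k * math.log(1 / delta)) + k * eps * (math.exp(eps) - 1)

# e.g. 10,000 queries, each answered by a 0.005-DP mechanism:
print(advanced_composition(eps=0.005, k=10_000, delta=1e-6))  # ~2.9, vs. 50 from basic composition
```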
Value perturbation [DMNS 06]
- Answer a low-sensitivity query $A$ (with $\max_{S,S'}|A(S) - A(S')| \le 1/n$) by releasing $A(S) + \zeta$, where $\zeta \sim N(0, \tau^2)$ is Gaussian noise
- Given $n$ samples this achieves error $\approx \Delta(A)\cdot\sqrt{n}\cdot k^{1/4}$, where $\Delta(A)$ is the worst-case sensitivity $\max_{S,S'}|A(S) - A(S')|$
- But $\Delta(A)\cdot\sqrt{n}$ could be much larger than the standard deviation of $A$
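Value perturbation in code (a rough sketch of mine; the calibration is schematic and omits constants): answer each low-sensitivity query with its empirical value plus Gaussian noise whose scale is tied to the worst-case sensitivity and the number of adaptive queries.

```python
import numpy as np

def perturbed_answer(value_on_S, sensitivity, n, k, rng):
    """Answer a low-sensitivity query A with A(S) + Gaussian noise.

    The noise standard deviation is set to the target accuracy
    tau ~ Delta(A) * sqrt(n) * k^{1/4}, balancing the noise added per query
    against the privacy budget consumed over k adaptive queries.
    (Illustrative calibration only; constants and log factors are omitted.)
    """
    tau = sensitivity * np.sqrt(n) * k ** 0.25
    return value_on_S + rng.normal(scale=tau)

rng = np.random.default_rng(3)
n, k = 10_000, 100
sample = rng.normal(size=n)
# A(S) = empirical mean of a [0,1]-valued query; worst-case sensitivity is 1/n
value = float(np.mean(sample > 0.0))
print(perturbed_answer(value, sensitivity=1.0 / n, n=n, k=k, rng=rng))
```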
Beyond low-sensitivity [F, Steinke 17]
- There exists an algorithm that, for any adaptively-chosen sequence $A_1,\dots,A_k : X^t \to \mathbb{R}$, given $n = \tilde{O}(\sqrt{k})\cdot t$ i.i.d. samples from $P$, outputs values $v_1,\dots,v_k$ such that w.h.p. for all $i$: $|\mathbf{E}_{S\sim P^t}[A_i(S)] - v_i| \le 2\sigma_i$, where $\sigma_i = \sqrt{\mathbf{Var}_{S\sim P^t}[A_i(S)]}$
- For statistical queries $\phi_i : X \to [-B,B]$: given $n$ samples, the error scales as $\sqrt{\mathbf{Var}_{x\sim P}[\phi_i(x)]}\cdot\frac{k^{1/4}}{\sqrt{n}}$, vs. $B\cdot\frac{k^{1/4}}{\sqrt{n}}$ for value perturbation
Stable median
- Split $S$ into $m$ disjoint subsamples $S_1, S_2, \dots, S_m$ of size $t$ each ($n = tm$)
- Apply $A_i$ to each subsample to obtain values $U = (y_1, y_2, \dots, y_m)$
- Find an approximate median of $U$ with DP relative to $U$: a value $v$ greater than the bottom $1/3$ and smaller than the top $1/3$ of $U$ (sketched below)
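The structure of this step as a rough sketch (mine): split $S$ into $m$ subsamples, run the query on each, and release a DP approximate median of the $m$ values. The `dp_median` argument is a hypothetical placeholder; one concrete instantiation via the exponential mechanism is sketched after the next slide.

```python
import numpy as np

def answer_via_stable_median(S, A, m, dp_median, rng):
    """Answer one real-valued analysis A : X^t -> R via subsample-and-median.

    S         : array of n = t*m i.i.d. samples from P
    A         : the analyst's query, applied to each subsample of size t
    dp_median : a differentially private approximate-median routine
                (hypothetical placeholder; e.g. the exponential-mechanism
                 median sketched after the next slide)
    """
    chunks = np.array_split(S, m)                  # S_1, ..., S_m, each of size ~t
    U = np.array([A(chunk) for chunk in chunks])   # y_1, ..., y_m
    return dp_median(U, rng)                       # w.h.p. between the 1/3 and 2/3 quantiles of U
```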
Median algorithms
- Require discretization: values come from a ground set $T$, $|T| = r$
- Best known: upper bound $2^{O(\log^* r)}$ samples, lower bound $\Omega(\log^* r)$ samples [Bun, Nissim, Stemmer, Vadhan 15]
- Exponential mechanism [McSherry, Talwar 07]: output $v \in T$ with probability $\propto e^{-\epsilon\,\left|\#\{y\in U :\, v\le y\} - \frac{m}{2}\right|}$; uses $O\!\left(\frac{\log r}{\epsilon}\right)$ samples
- Stability and confidence amplification for the price of one log factor!
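A minimal sketch (mine) of the exponential-mechanism median over a discretized ground set $T$: each candidate is scored by how far it sits, in ranks, from the middle of $U$, and candidates are sampled with probability exponential in the score.

```python
import numpy as np

def exp_mech_median(U, T, eps, rng):
    """DP approximate median of the values in U over a finite ground set T.

    Each candidate v in T gets score -|#{y in U : v <= y} - m/2|; changing one
    element of U changes every score by at most 1. Exponent eps*score follows
    the slide's formula (the textbook exponential mechanism divides the
    exponent by 2 to get exactly eps-DP).
    """
    U = np.asarray(U, dtype=float)
    m = len(U)
    scores = np.array([-abs(np.sum(v <= U) - m / 2) for v in T])
    weights = np.exp(eps * (scores - scores.max()))   # shift scores for numerical stability
    return rng.choice(T, p=weights / weights.sum())

rng = np.random.default_rng(4)
U = rng.normal(loc=0.3, scale=0.1, size=200)          # m values y_1, ..., y_m
T = np.linspace(-1.0, 1.0, 2001)                      # discretized ground set, |T| = r
print(exp_mech_median(U, T, eps=1.0, rng=rng))        # close to 0.3 w.h.p.
```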
Analysis
- Differential privacy approximately preserves quantiles: if $v$ is within the $\left[\frac{1}{3},\frac{2}{3}\right]$ empirical quantiles of $U$, then $v$ is within the $\left[\frac{1}{4},\frac{3}{4}\right]$ true quantiles, and hence within mean $\pm 2\sigma$
- If $\phi$ is well-concentrated on $D$ then high-probability bounds are easy to prove
- [F, Steinke 17]: let $M$ be a DP algorithm that on input $U \in Y^m$ outputs a function $\phi : Y \to \mathbb{R}$ and a value $v \in \mathbb{R}$. Then w.h.p. over $U \sim D^m$ and $(\phi, v) \leftarrow M(U)$: $\Pr_{y\sim D}[v \le \phi(y)] \approx \Pr_{y\sim U}[v \le \phi(y)]$
Limits
- Any algorithm for answering $k$ adaptively chosen SQs with accuracy $\tau$ requires* $n = \Omega(\sqrt{k}/\tau)$ samples [Hardt, Ullman 14; Steinke, Ullman 15] (*in sufficiently high dimension or under crypto assumptions)
- Verification of responses to queries: $n = O(\sqrt{c}\,\log k)$, where $c$ is the number of queries that failed verification
  - Data splitting if overfitting [DFHPRR 14]
  - Reusable holdout [DFHPRR 15] (simplified sketch below)
  - Maintaining a public leaderboard in a competition [Blum, Hardt 15]
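A rough sketch of the reusable-holdout idea [DFHPRR 15] (simplified, my code; parameter choices are illustrative, not the paper's): answer each query from the training set, and consult the holdout only when the two estimates disagree by more than a noisy threshold, so the holdout "pays" only for queries that overfit.

```python
import numpy as np

def thresholdout(train, holdout, phi, threshold, sigma, rng):
    """One query of a simplified Thresholdout-style reusable holdout.

    Returns the training-set estimate unless it disagrees with the holdout
    estimate by more than a noisy threshold, in which case a noisy holdout
    estimate is returned. Only the latter case consumes the overfitting budget.
    """
    est_train = np.mean([phi(x) for x in train])
    est_hold = np.mean([phi(x) for x in holdout])
    if abs(est_train - est_hold) > threshold + rng.normal(scale=sigma):
        return est_hold + rng.normal(scale=sigma), True    # overfitting budget consumed
    return est_train, False
```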
Open problems
- Analysts without side information about $P$: queries depend only on previous answers
- Fixed "natural" analyst / learning algorithm: e.g., gradient descent for stochastic convex optimization
- Does there exist an SQ analyst whose queries require more than $O(\log k)$ samples to answer (with 0.1 accuracy/confidence)?
Stochastic convex optimization
- Convex body $K = \mathbb{B}_2^d(1) \doteq \{x : \|x\|_2 \le 1\}$
- Class $F$ of convex 1-Lipschitz functions: $F = \{f \text{ convex} : \forall x \in K,\ \|\nabla f(x)\|_2 \le 1\}$
- Given $f_1,\dots,f_n$ sampled i.i.d. from an unknown distribution $P$ over $F$, minimize the true (expected) objective $f_P(x) \doteq \mathbf{E}_{f\sim P}[f(x)]$ over $K$: find $\bar{x}$ such that $f_P(\bar{x}) \le \min_{x\in K} f_P(x) + \epsilon$
Gradient descent
- ERM via projected gradient descent on $f_S(x) \doteq \frac{1}{n}\sum_i f_i(x)$ (see the sketch below):
  - Initialize $x_1 \in K$
  - For $t = 1$ to $T$: $x_{t+1} = \mathrm{Project}_K\!\left(x_t - \eta\cdot\nabla f_S(x_t)\right)$, where $\nabla f_S(x_t) = \frac{1}{n}\sum_i \nabla f_i(x_t)$
  - Output $\frac{1}{T}\sum_t x_t$
- With fresh samples at each step: $\|\nabla f_S(x_t) - \nabla f_P(x_t)\|_2 \le 1/\sqrt{n}$
- Sample complexity is unknown. Uniform convergence: $O\!\left(\frac{d}{\epsilon^2}\right)$ samples (tight [F. 16]); SGD solves the problem using $O\!\left(\frac{1}{\epsilon^2}\right)$ samples [Robbins, Monro 51; Polyak 90]
- Overall: $d/\epsilon^2$ statistical queries with accuracy $\epsilon$ in $1/\epsilon^2$ adaptive rounds. Sample splitting: $O\!\left(\frac{\log d}{\epsilon^4}\right)$ samples; DP-based answering: $O\!\left(\frac{\sqrt{d}}{\epsilon^3}\right)$ samples
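The ERM procedure on this slide in code (my sketch, using absolute losses $f_i(x) = |\langle w_i, x\rangle - b_i|$ with $\|w_i\|_2 \le 1$ as a simple member of $F$; the data-generation details are illustrative): projected gradient descent over the unit ball, where each iteration's empirical gradient plays the role of $d$ statistical queries and the iterations are the adaptive rounds.

```python
import numpy as np

def project_ball(x):
    """Euclidean projection onto K = {x : ||x||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def erm_projected_gd(W, b, T=200, eta=0.05):
    """ERM via projected gradient descent on f_S(x) = (1/n) sum_i |<w_i, x> - b_i|.

    Each f_i is convex and 1-Lipschitz when ||w_i||_2 <= 1, so it lies in the
    class F from the previous slide. Every iteration computes the empirical
    gradient (1/n) sum_i grad f_i(x_t) -- d statistical queries -- and the T
    iterations are adaptive rounds, since x_{t+1} depends on earlier answers.
    """
    n, d = W.shape
    x = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        residual_signs = np.sign(W @ x - b)                # subgradient of |.| per sample
        grad = (residual_signs[:, None] * W).mean(axis=0)
        x = project_ball(x - eta * grad)
        avg += x / T
    return avg

rng = np.random.default_rng(5)
n, d = 2_000, 50
W = rng.normal(size=(n, d))
W /= np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1.0)   # ensure ||w_i||_2 <= 1
b = W @ (np.ones(d) / np.sqrt(d)) + 0.01 * rng.normal(size=n)    # planted near-optimum
x_hat = erm_projected_gd(W, b)
print("empirical objective:", float(np.mean(np.abs(W @ x_hat - b))))
```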
Conclusions
- Real-valued analyses (without any assumptions)
- Going beyond tools from DP: other notions of stability for outcomes, max/mutual information
- Generalization beyond uniform convergence
- Using these techniques in practice