Understanding Generalization in Adaptive Data Analysis
Vitaly Feldman
Overview
- Motivation
- Framework and basic results, with Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15]
- New results (with Thomas Steinke)
- Open problems
Setup
- Probability distribution P over domain X
- Data: S = (x_1, …, x_n) ∼ P^n
- Analysis A produces a result f = A(S)
- Question: E_{x∼P}[Loss(f, x)] = ?
Statistical inference
- Data: S, n i.i.d. samples from P
- Algorithm A: hypothesis test, parameter estimator, classification; f = A(S)
- Theory gives generalization guarantees for Loss_P(f): concentration/CLT, model complexity, Rademacher complexity, stability, online-to-batch
Data analysis is adaptive
- Steps depend on previous analyses of the same dataset: the analyst(s) run A_1 on S to get v_1, choose A_2 based on v_1 to get v_2, and so on up to A_k and v_k
- Examples: data pre-processing, exploratory data analysis, feature selection, model stacking, hyper-parameter tuning, shared datasets, …
Thou shalt not test hypotheses suggested by data “Quiet scandal of statistics” [Leo Breiman, 1992]
Reproducibility crisis?
- “Why Most Published Research Findings Are False” [Ioannidis 2005]
- “Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 2015]
- Adaptive data analysis is one of the causes: p-hacking, researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011], the garden of forking paths [Gelman, Loken 2015]
Existing approaches
- Sample splitting
- Selective inference: model selection + parameter estimation, variable selection + regression
- Pre-registration (© Center for Open Science)
Adaptive data analysis [DFHPRR 14]
- An algorithm holds S = (x_1, …, x_n) ∼ P^n; data analyst(s) adaptively issue analyses A_1, …, A_k and receive answers v_1, …, v_k
- Each analysis is a query; goal: given S, compute v_i's “close” to running A_i on fresh samples
- Design an algorithm for answering adaptively-chosen queries; we need both tools for analysis and ways to make algorithms that compose better
Adaptive statistical queries
- Statistical query oracle [Kearns 93]: each query is a function φ_i : X → [0, 1]
- Empirical answer: A_i(S) ≡ (1/n) Σ_{x∈S} φ_i(x); example: φ_i(x) = Loss(f, x)
- Requirement: |v_i − E_{x∼P}[φ_i(x)]| ≤ τ with probability 1 − β
- SQs can measure correlations, moments, accuracy/loss, and can run any statistical query algorithm
Answering non-adaptive SQs
- Given k non-adaptive query functions φ_1, …, φ_k and n i.i.d. samples from P, estimate E_{x∼P}[φ_i(x)]
- Use the empirical mean E_S[φ_i] = (1/n) Σ_{x∈S} φ_i(x): a Chernoff bound plus a union bound over the k queries gives n = O(log(k/β) / τ²)
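As a concrete illustration of the non-adaptive case, the empirical-mean oracle can be sketched as follows; the CDF-threshold queries and sample size below are my own arbitrary choices, not from the talk:

```python
import numpy as np

def answer_nonadaptive_sqs(samples, queries):
    """Answer non-adaptive statistical queries with plain empirical means.

    samples: i.i.d. draws from P; queries: functions phi: X -> [0, 1]
    that were fixed BEFORE looking at the data.
    """
    return [np.mean([phi(x) for x in samples]) for phi in queries]

# A Chernoff bound plus a union bound over the k queries gives
# |answer_i - E_P[phi_i]| <= tau for all i with prob. 1 - beta
# once n = O(log(k/beta) / tau^2).
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)                                     # P = N(0, 1)
queries = [lambda x, t=t: float(x <= t) for t in (-1.0, 0.0, 1.0)]  # CDF queries
answers = answer_nonadaptive_sqs(samples, queries)
```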
Answering adaptively-chosen SQs
- What if we keep using the empirical mean E_S[φ_i]? Adaptive attacks (as in variable selection, boosting, step-wise regression) force n ≥ k/τ² for some constant β > 0
- Sample splitting (fresh samples for each query): n = O(k · log k / τ²)
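A small simulation (my own construction, in the spirit of the variable-selection attack) of how adaptivity breaks the plain empirical mean: k single-feature queries followed by one adaptively-built query report accuracy far above the true value of 1/2:

```python
import numpy as np

# n samples, d = k single-feature queries; the label y is pure noise,
# independent of every feature, so every query has true value 0 and any
# classifier built from X has true accuracy exactly 1/2.
rng = np.random.default_rng(1)
n, d = 100, 1000
X = rng.choice([-1.0, 1.0], size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Round 1: k statistical queries phi_i(x, y) = x_i * y, answered with the
# empirical mean. True answers are all 0; empirical ones are ~ 1/sqrt(n).
corrs = (X * y[:, None]).mean(axis=0)

# Round 2 (adaptive): aggregate the features using the signs just learned.
f = np.sign(corrs)
empirical_acc = float((np.sign(X @ f) == y).mean())
# empirical_acc looks impressive, yet the true accuracy on P is 1/2.
```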
Answering adaptive SQs [DFHPRR 14]
- There exists an algorithm that can answer k adaptively chosen SQs with accuracy τ for n = Õ(√k / τ^{2.5}); compare data splitting at O(k / τ²)
- [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: n = Õ(√k / τ²)
- Generalizes to low-sensitivity analyses: |A_i(S) − A_i(S′)| ≤ 1/n when S, S′ differ in a single element; estimates E_{S∼P^n}[A_i(S)] within τ
Value perturbation
- Answer a low-sensitivity query A with A(S) + ζ, where ζ is Laplace or Gaussian noise
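A minimal sketch of Laplace value perturbation for an empirical-mean query, assuming sensitivity Δ = 1/n (φ maps into [0, 1]); noise of scale Δ/ε is the standard Laplace-mechanism calibration:

```python
import numpy as np

def perturbed_answer(samples, phi, epsilon, rng):
    """Laplace value perturbation for an empirical-mean query.

    A(S) = mean of phi over S has worst-case sensitivity Delta = 1/n
    when phi maps into [0, 1], so Laplace noise of scale Delta / epsilon
    makes the answer epsilon-differentially private.
    """
    n = len(samples)
    value = np.mean([phi(x) for x in samples])
    return value + rng.laplace(scale=1.0 / (n * epsilon))

rng = np.random.default_rng(2)
samples = rng.normal(size=10_000)                 # P = N(0, 1)
ans = perturbed_answer(samples, lambda x: float(x <= 0.0), epsilon=1.0, rng=rng)
```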
Differential privacy [Dwork, McSherry, Nissim, Smith 06]
- DP implies generalization: differential privacy is a form of stability
- If M is (ε, δ)-DP and outputs a function X → [0, 1], then for every S, S′, x: |E_{φ=M(S)}[φ(x)] − E_{φ=M(S′)}[φ(x)]| ≲ ε + δ
- Uniform replace-one stability implies generalization in expectation [Bousquet, Elisseeff 02]: |E_{S∼P^n, φ=M(S)}[E_S[φ]] − E_{S∼P^n, φ=M(S)}[E_P[φ]]| ≲ ε + δ
- DP implies generalization with high probability [DFHPRR 14, BNSSSU 15]
Differential privacy [DMNS 06]
- DP implies generalization: differential privacy limits the information learned about the dataset
- Max-information: for an algorithm M : X^n → Y and dataset S ∼ D over X^n, I_∞^β(S; M(S)) ≤ k if for every event Z ⊆ X^n × Y: Pr_{S∼D}[(S, M(S)) ∈ Z] ≤ e^k · Pr_{S,T∼D}[(T, M(S)) ∈ Z] + β
- ε-DP bounds max-information [DFHPRR 15]; (ε, δ)-DP bounds max-information for D = P^n [Rogers, Roth, Smith, Thakkar 16]
Differential privacy [DMNS 06]
- DP composes adaptively: the adaptive composition of k (ε, δ)-DP algorithms is (ε√(k · log(1/δ′)), δ′ + kδ)-DP, for every δ′ > 0 and ε ≤ 1/√k [Dwork, Rothblum, Vadhan 10]
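The composed parameters from the slide's simplified bound can be computed directly; the constants below are illustrative only:

```python
import math

def advanced_composition(epsilon, delta, k, delta_prime):
    """Simplified advanced-composition bound (constants suppressed, as on
    the slide): k adaptive (epsilon, delta)-DP steps combine to
    (epsilon * sqrt(k * log(1/delta')), delta' + k * delta)-DP,
    valid for delta' > 0 and epsilon <= 1/sqrt(k)."""
    assert delta_prime > 0 and epsilon <= 1.0 / math.sqrt(k)
    eps_total = epsilon * math.sqrt(k * math.log(1.0 / delta_prime))
    delta_total = delta_prime + k * delta
    return eps_total, delta_total

eps_t, delta_t = advanced_composition(0.1, 1e-6, k=16, delta_prime=1e-5)
# eps_t grows like sqrt(k), much better than basic composition's k * epsilon.
```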
Differential privacy [DMNS 06]
- Putting it together: if M answers each query accurately when fresh samples are used, and M is differentially private, then M remains accurate when the same dataset is reused for adaptively-chosen queries
Value perturbation [DMNS 06]
- Answer a low-sensitivity query A with A(S) + ζ; given n samples this achieves error ≈ Δ(A) · √n · k^{1/4}, where Δ(A) = max_{S,S′} |A(S) − A(S′)| is the worst-case sensitivity
- For low-sensitivity queries max_{S,S′} |A(S) − A(S′)| ≤ 1/n, but Δ(A) · √n could still be much larger than the standard deviation of A on P
Beyond low-sensitivity [F, Steinke 17]
- There exists an algorithm such that, for any adaptively-chosen sequence A_1, …, A_k : X^t → ℝ, given n = Õ(√k · t) i.i.d. samples from P, it outputs values v_1, …, v_k such that w.h.p. for all i: |E_{S∼P^t}[A_i(S)] − v_i| ≤ 2σ_i, where σ_i = √(Var_{S∼P^t}[A_i(S)])
Stable Median
- Split S (with n = t·m) into disjoint chunks S_1, S_2, …, S_m of size t; compute y_j = A_i(S_j) to obtain U = (y_1, …, y_m)
- Find an approximate median of U with (weak) DP relative to U: a value v greater than the bottom 1/3 and smaller than the top 1/3 of U
Median algorithms
- Exponential mechanism [McSherry, Talwar 07]: requires discretization to a ground set T, |T| = r; output v ∈ T with probability ∝ exp(−ε · |#{y ∈ U : v ≤ y} − m/2|); uses O(log r / ε) samples
- Beyond the exponential mechanism: upper bound 2^{O(log* r)} samples, lower bound Ω(log* r) samples [Bun, Nissim, Stemmer, Vadhan 15]
- Stability and confidence amplification for the price of one log factor!
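A sketch of the exponential-mechanism median on a discretized grid, following the score from the slide; the grid, ε, and synthetic chunk values below are my own illustrative choices:

```python
import numpy as np

def dp_median(values, grid, epsilon, rng):
    """Exponential-mechanism median over a discretized ground set `grid`.

    The score of a candidate v is -|#{y in values : v <= y} - m/2|, so
    points that split `values` roughly in half are exponentially preferred.
    """
    values = np.asarray(values)
    m = len(values)
    above = np.array([(v <= values).sum() for v in grid])
    scores = -epsilon * np.abs(above - m / 2.0)
    probs = np.exp(scores - scores.max())    # shift for numerical stability
    probs /= probs.sum()
    return rng.choice(grid, p=probs)

# U = chunk-wise answers y_1, ..., y_m from the Stable Median construction.
rng = np.random.default_rng(3)
values = rng.normal(0.5, 0.05, size=50)      # m = 50 chunk answers near 0.5
grid = np.linspace(0.0, 1.0, 101)            # ground set T with r = 101
v = dp_median(values, grid, epsilon=5.0, rng=rng)
```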
Limits
- Any algorithm for answering k adaptively chosen SQs with accuracy τ requires* n = Ω(√k / τ) samples [Hardt, Ullman 14; Steinke, Ullman 15] (*in sufficiently high dimension or under crypto assumptions)
- Open: analysts without side information about P, i.e. queries depend only on previous answers; a fixed “natural” analyst/learning algorithm, e.g. gradient descent for stochastic convex optimization
- Does there exist an analyst whose statistical queries require more than O(log k) samples to answer (with 0.1 accuracy/confidence)?
ML practice
- Data is split into training, validation and testing sets
- Training (e.g. XGBoost, SVRG, TensorFlow) fits parameters θ; validation guides the choice of the final model f
- The test set estimates the test error of f ≈ E_{x∼P}[Loss(f, x)]
Reusable holdout [DFHPRR 15]
- Split the data into a training set T and a holdout set H
- The analyst (“AI guru”) trains functions f_1, f_2, …, f_k on T and repeatedly queries the reusable holdout algorithm, which returns estimates Loss(f_1), Loss(f_2), …, Loss(f_k) computed from H
Reusable holdout [DFHPRR 15, FS 17]
- There exists an algorithm that can accurately estimate the loss of k adaptively chosen functions, as long as at most ℓ of them overfit to the training set, for n ~ √ℓ · log k
- Overfitting: E_{x∼T}[Loss(f, x)] ≉ E_{x∼P}[Loss(f, x)]
- Verifying mostly-correct answers with DP is cheap: the sparse vector technique [Dwork, Naor, Reingold, Rothblum, Vadhan 09]
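A simplified sketch of the reusable-holdout (Thresholdout) idea; the actual algorithm in DFHPRR 15 also tracks an overfitting budget and refreshes the noisy threshold, and the parameter values below are illustrative:

```python
import numpy as np

def thresholdout(train, holdout, queries, threshold=0.04, sigma=0.01, rng=None):
    """Simplified reusable-holdout loop.

    For each query phi into [0, 1]: if the training and holdout means agree
    up to a noisy threshold, return the training mean (the holdout leaks
    nothing); otherwise return a noisy holdout mean. Only queries that
    overfit pay for holdout access, as in the sparse vector technique.
    """
    if rng is None:
        rng = np.random.default_rng()
    answers = []
    for phi in queries:
        t_mean = np.mean([phi(x) for x in train])
        h_mean = np.mean([phi(x) for x in holdout])
        if abs(t_mean - h_mean) > threshold + rng.laplace(scale=sigma):
            answers.append(h_mean + rng.laplace(scale=sigma))  # noisy holdout answer
        else:
            answers.append(t_mean)                             # training answer is safe
    return answers

rng = np.random.default_rng(4)
train = rng.normal(size=2000)                                  # P = N(0, 1)
holdout = rng.normal(size=2000)
queries = [lambda x, t=t: float(x <= t) for t in (-1.0, 0.0, 1.0)]
answers = thresholdout(train, holdout, queries, rng=rng)
```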
Conclusions
- Datasets are reused adaptively; this gives a new conceptual framework with deep connections to DP
- Privacy and generalization are aligned; data “freshness” is a limited resource
- Real-valued analyses can be handled without any assumptions
- Ahead: going beyond adversarial adaptivity; connections to stability and selective inference; using these techniques in practice