Preserving Validity in Adaptive Data Analysis

Presentation transcript:

Preserving Validity in Adaptive Data Analysis
Vitaly Feldman, Accelerated Discovery Lab, IBM Research - Almaden

Setup: a distribution P over a domain X, and a dataset S = (x_1, …, x_n) ∼ P^n. The analysis takes S and produces results: a parameter estimate, a classifier, a clustering, etc.

Data Analysis 101. Does student nutrition predict academic performance? (Scatter plot: normalized grade on the vertical axis, range 50 to 100.)

Check correlations

Pick candidate foods

Fit a linear function of the 3 selected foods.

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.4453533
  R Square             0.1983396
  Adjusted R Square    0.1732877
  Standard Error       1.0041891
  Observations         100

ANOVA
               df   SS            MS         F          Significance F
  Regression    3    23.95086544   7.983622   7.917151   8.98706E-05
  Residual     96    96.80600126   1.008396
  Total        99   120.7568667

               Coefficients   Standard Error   t Stat      P-value
  Intercept      -0.044248      0.100545016    -0.44008    0.660868
  Mushroom       -0.296074      0.10193011     -2.90468    0.004563
  Pumpkin         0.255769      0.108443069     2.358555   0.020373
  Nutella         0.2671363     0.095186165     2.806462   0.006066

FALSE DISCOVERY: the regression looks significant (p ≈ 0.0001), yet under the true distribution P every x_i and y are drawn i.i.d. from N(0,1) and are uncorrelated. This is Freedman's paradox [1983].
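To make the false-discovery mechanism concrete, here is a minimal simulation of Freedman's paradox. It is not the slide's dataset; the sizes, variable names, and the choice of 50 candidate "foods" are illustrative. Every predictor is independent of the response, yet selecting the most correlated ones and refitting on the same sample routinely produces a "significant" F-statistic.

```python
# Minimal Freedman's-paradox simulation (illustrative sizes, not the slide's data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, k = 100, 50, 3                    # samples, candidate foods, foods kept

X = rng.standard_normal((n, d))         # "food" variables: pure noise
y = rng.standard_normal(n)              # grades: independent of every food

# Step 1: check correlations on the sample and pick the k strongest foods
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
top = np.argsort(-np.abs(corr))[:k]

# Step 2: fit a linear model of the selected foods on the SAME sample
Xs = np.column_stack([np.ones(n), X[:, top]])
beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
resid = y - Xs @ beta
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())

# Standard overall F-test, which ignores that the foods were data-selected
f_stat = ((tss - rss) / k) / (rss / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")   # frequently "significant"
```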

Statistical inference, the classical picture: a procedure (hypothesis test, regression, learning algorithm) is applied to "fresh" i.i.d. samples from the data distribution and returns a result together with generalization guarantees.

Data analysis is adaptive: steps depend on previous analyses of the same dataset. Examples: data cleaning, exploratory data analysis, variable selection, hyper-parameter tuning, shared datasets, and so on. Formally, the data analyst(s) issue procedures A_1, A_2, …, A_m with A_i : X^n → Y_i and v_i = A_i(S), where each A_i may depend on the earlier results v_1, …, v_{i-1}. With an unlimited supply of data this would not be an issue; there is plenty of empirical evidence that it is one in practice, from selection effects to demonstrations of overfitting to reused test sets.
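A schematic of this interaction model in code (the names and the toy "analyst" are hypothetical, purely to show the shape of the loop): each step may depend on all previous answers, but every answer is computed from the same sample S.

```python
# Schematic of the adaptive-analysis loop: A_i may depend on v_1, ..., v_{i-1},
# yet every v_i = A_i(S) is computed on the same dataset S.
import numpy as np

def adaptive_analysis(S, choose_next_step, m):
    results = []
    for i in range(m):
        A_i = choose_next_step(i, results)   # analyst picks A_i from past results
        results.append(A_i(S))               # ... but evaluates it on the same S
    return results

# Toy analyst: each step estimates the mean of the data above the previous answer
rng = np.random.default_rng(0)
S = rng.standard_normal(10_000)

def choose_next_step(i, results):
    cutoff = results[-1] if results else 0.0
    return lambda data: float(data[data > cutoff].mean())

print(adaptive_analysis(S, choose_next_step, m=4))
```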

It's an old problem: "Thou shalt not test hypotheses suggested by the data." The "quiet scandal of statistics" [Leo Breiman, 1992].

Is this a real problem? "Why Most Published Research Findings Are False" [Ioannidis 2005]. "Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss" [Freedman, Cockburn, Simcoe 2015]. Adaptive data analysis is one of the causes, under names such as p-hacking, researcher degrees of freedom [Simmons, Nelson, Simonsohn 2011], and the garden of forking paths [Gelman, Loken 2015].

Existing approaches I: abstinence and pre-registration (e.g., as promoted by the Center for Open Science).

Existing approaches II: selective inference. The same data are used for a selection stage A and an inference stage B, and the stage-B inference accounts for the selection made in stage A. Examples: model selection + parameter inference; variable selection + regression. Survey: [Taylor, Tibshirani 2015].

Existing approaches III: sample splitting. The data are partitioned into disjoint parts (A, B, C), one per stage of the analysis, so no stage reuses data that earlier choices depended on. This might be necessary for standard approaches.
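A minimal sketch of sample splitting (the stage names and the toy select/fit/evaluate pipeline are hypothetical): each stage gets its own disjoint part of the data.

```python
# Sample splitting: selection, fitting, and evaluation each use a disjoint part.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((3000, 11))        # 10 predictors + 1 response column
rng.shuffle(data)
part_A, part_B, part_C = np.array_split(data, 3)

# Stage A: variable selection on part_A only
X, y = part_A[:, :-1], part_A[:, -1]
corr = np.abs(X.T @ y) / len(y)
selected = np.argsort(-corr)[:3]

# Stage B: fit a model using the selected variables on part_B only
Xb = np.column_stack([np.ones(len(part_B)), part_B[:, selected]])
beta = np.linalg.lstsq(Xb, part_B[:, -1], rcond=None)[0]

# Stage C: evaluate the fitted model on part_C only
Xc = np.column_stack([np.ones(len(part_C)), part_C[:, selected]])
mse = np.mean((part_C[:, -1] - Xc @ beta) ** 2)
print(f"test MSE estimate from the untouched part: {mse:.3f}")
```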

New approach: differential privacy [Dwork, McSherry, Nissim, Smith 2006].

Differential privacy as outcome stability (a bounded ratio between output distributions): a randomized algorithm A is ε-differentially private (DP) if for any two datasets S, S′ with dist(S, S′) = 1 (they differ in a single element), for all Z ⊆ range(A): Pr_A[A(S) ∈ Z] ≤ e^ε · Pr_A[A(S′) ∈ Z].
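As a concrete, standard example of an ε-DP primitive (the function name and parameter choices are mine, not from the talk): the Laplace mechanism for releasing the mean of values in [0, 1]. Changing one of the n records moves the clipped mean by at most 1/n, so adding Laplace noise of scale 1/(n·ε) satisfies the definition above.

```python
# A minimal sketch of an epsilon-DP mean for values in [0, 1] via the Laplace mechanism.
import numpy as np

def dp_mean(values, eps, rng):
    values = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    n = len(values)
    sensitivity = 1.0 / n                      # max change from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps)
    return float(values.mean() + noise)

rng = np.random.default_rng(0)
data = rng.uniform(size=1000)
print(dp_mean(data, eps=0.5, rng=rng))         # noisy, but close to 0.5
```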

DP composes adaptively: suppose A(S) is ε_1-DP with output y ∈ Y, and B(S, y) is ε_2-DP for every fixed y ∈ Y.

Then the composed algorithm S ↦ B(S, A(S)) is (ε_1 + ε_2)-DP. DP implies generalization: DP is a strong form of algorithmic stability.
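One basic form of this connection, paraphrased and stated only in expectation (the papers cited at the end give the precise high-probability versions): if an ε-DP algorithm A is run on S ∼ P^n and outputs a predicate φ = A(S), then the empirical mean of φ on S cannot exceed its true mean under P by much.

```latex
% Expectation version of "DP implies generalization" (paraphrased).
% phi(S) = (1/n) \sum_i phi(x_i) is the empirical mean; phi(P) = E_{x \sim P}[phi(x)].
\mathbb{E}_{S \sim P^n,\; \phi = A(S)}\!\left[\phi(S)\right]
  \;\le\; e^{\epsilon} \cdot \mathbb{E}_{S \sim P^n,\; \phi = A(S)}\!\left[\phi(P)\right]
```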

Approaches: differential privacy, (approximate) max-information, and description length. These tools can be used to analyze any algorithm; the goal at this stage is primarily to lay the theoretical groundwork. Additional approaches: mutual information [Russo, Zou 2016], KL-stability [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 2016], typical stability [Bassily, Freund 2016].

Application: holdout validation. The data are split into a training set and a holdout/testing set. Learning happens only on the training set and produces a predictor f; the holdout set is used only for inference, namely estimating the test error Err(f).
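A minimal sketch of ordinary, non-reusable holdout validation on hypothetical synthetic data (the toy least-squares "learner" is just a stand-in): because the holdout is touched exactly once, its error estimate is unbiased.

```python
# Ordinary holdout validation: learn on one half, estimate error once on the other.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = (X[:, 0] + 0.1 * rng.standard_normal(1000) > 0).astype(int)

split = 500
X_train, y_train = X[:split], y[:split]        # learning happens here
X_hold, y_hold = X[split:], y[split:]          # inference only

w = np.linalg.lstsq(X_train, 2 * y_train - 1, rcond=None)[0]   # toy learner
predict = lambda Z: (Z @ w > 0).astype(int)

err_train = np.mean(predict(X_train) != y_train)
err_hold = np.mean(predict(X_hold) != y_hold)  # unbiased estimate of the true error
print(f"training error {err_train:.3f}, holdout error {err_hold:.3f}")
```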

Reusable Holdout algorithm: the data are split into a training set T and a holdout set H. The analyst(s) work directly with T and submit predictors f_1, f_2, …, f_m to the reusable holdout algorithm, which holds H and returns error estimates Err(f_1), Err(f_2), …, Err(f_m).

Thresholdout. Theorem: given a holdout set of n i.i.d. samples, Thresholdout can estimate the true error of a "large" number of adaptively chosen predictors, provided that only "few" of them overfit to the training set. This setting is favorable because we know reasonably well what a test set is used for: primarily to estimate the error. The same idea can be applied to maintaining an accurate leaderboard in machine learning competitions [Blum, Hardt 2015].
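A simplified sketch of the Thresholdout mechanism from the papers cited below; the specific threshold, noise scales, and budget accounting here are illustrative rather than the papers' exact parameters. Queries are bounded functions φ(example) ∈ [0, 1] (for instance, the loss of a predictor on one example), and the mechanism answers from the training set whenever the two sets already agree up to a noisy threshold.

```python
# Simplified sketch of a Thresholdout-style reusable holdout (illustrative parameters).
import numpy as np

class Thresholdout:
    def __init__(self, train, holdout, threshold=0.04, sigma=0.01,
                 budget=1000, seed=0):
        self.train, self.holdout = train, holdout
        self.T, self.sigma, self.budget = threshold, sigma, budget
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        """Return an estimate of E_P[phi] for a query phi(example) -> [0, 1]."""
        if self.budget <= 0:
            raise RuntimeError("holdout budget exhausted")
        train_val = np.mean([phi(x) for x in self.train])
        hold_val = np.mean([phi(x) for x in self.holdout])
        eta = self.rng.laplace(scale=2 * self.sigma)
        if abs(train_val - hold_val) > self.T + eta:
            # The two sets disagree: charge the budget, answer noisily from the holdout
            self.budget -= 1
            return hold_val + self.rng.laplace(scale=self.sigma)
        # The two sets agree: it is safe to answer from the training set
        return train_val
```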

Illustration. Given points in R^d × {−1, +1}: pick the variables most correlated with the label on the training set, validate the selection on the holdout set, and check the prediction error of the linear classifier sign(Σ_{i∈V} s_i x_i), where V is the set of selected variables and s_i is the sign of the correlation of variable i with the label. Data distribution: 10,000 points from N(0,1)^d with d = 10,000, randomly labeled, so no variable is truly correlated with the label.
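A scaled-down, hypothetical re-creation of this experiment (smaller n and d so it runs quickly; variable names are mine). The labels are random, so the true error of every classifier is 0.5; selecting the variables that look good on both the training and the holdout set makes the naive holdout estimate over-optimistic, which is exactly the kind of reuse a Thresholdout-style mechanism, as sketched above, is designed to protect against.

```python
# Scaled-down version of the variable-selection illustration with random labels.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 1000, 20
X_tr, X_ho = rng.standard_normal((n, d)), rng.standard_normal((n, d))
y_tr, y_ho = rng.choice([-1, 1], n), rng.choice([-1, 1], n)   # no real signal

corr_tr = X_tr.T @ y_tr / n
corr_ho = X_ho.T @ y_ho / n
# Adaptive step: keep variables whose correlation sign "validates" on the holdout
agree = np.sign(corr_tr) == np.sign(corr_ho)
V = np.argsort(-np.abs(corr_tr) * agree)[:k]
s = np.sign(corr_tr[V])

predict = lambda Z: np.sign(Z[:, V] @ s)
print("training error:", np.mean(predict(X_tr) != y_tr))
print("naive holdout error:", np.mean(predict(X_ho) != y_ho))   # below 0.5: too optimistic
# A Thresholdout-style mechanism would instead return an estimate close to 0.5.
```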

Conclusions. Adaptive data analysis is ubiquitous and useful, but it leads to false discovery and overfitting; it is possible to model it and improve on standard approaches. Further work: new and stronger theory, tuning to specific applications, practical applications.

The cast: Cynthia Dwork (Harvard), Moritz Hardt (Google Research), Toni Pitassi (U. of Toronto), Omer Reingold (Stanford), Aaron Roth (U. Penn).
References:
The reusable holdout: Preserving validity in adaptive data analysis. Science, 2015.
Preserving Statistical Validity in Adaptive Data Analysis. Symposium on the Theory of Computing (STOC), 2015.
Generalization in Adaptive Data Analysis and Holdout Reuse. Neural Information Processing Systems (NIPS), 2015.