Preserving Statistical Validity in Adaptive Data Analysis

Presentation transcript:

Preserving Statistical Validity in Adaptive Data Analysis
Vitaly Feldman (IBM Research - Almaden), Cynthia Dwork (Microsoft Res.), Moritz Hardt (Google Res.), Toni Pitassi (U. of Toronto), Omer Reingold (Samsung Res.), Aaron Roth (Penn, CS)

Analysis → Findings: param. estimates, correlations, predictive model, classifier, clustering, etc.

Data Science 101: Does student nutrition affect academic performance? [Plot axis: normalized grade]

Check correlations

Pick candidate foods

Fit linear function of 3 selected foods
[Regression summary output: Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error, Observations 100); ANOVA (df, SS, MS, F, Significance F …E-05) for Regression, Residual, Total; Coefficients, Standard Error, t Stat, P-value for Intercept, Mushroom, Pumpkin, Nutella]
FALSE DISCOVERY
Freedman’s Paradox: “Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.” (1983)
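The effect on this slide can be reproduced with a small simulation. The sketch below is illustrative only, not the authors' experiment: it assumes 50 candidate "foods" that are pure Gaussian noise (the slides only state 100 observations and 3 selected foods), and the function names `freedman_experiment` and `average_r2` are hypothetical. It selects the 3 most correlated columns, fits least squares on the same data, and compares the in-sample R² with the R² of the same fitted model on fresh data.

```python
import numpy as np

def freedman_experiment(n=100, d=50, k=3, rng=None):
    """One run: select top-k noise features by correlation, then fit OLS.

    d=50 candidate features is an assumption made for illustration.
    Returns (in_sample_r2, fresh_r2).
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, d))   # candidate "foods": pure noise
    y = rng.standard_normal(n)        # "grade": independent of X

    # Check correlations and pick the k strongest, using the same data.
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
    selected = np.argsort(-np.abs(corr))[:k]

    # Fit a linear function (with intercept) of the selected features.
    A = np.column_stack([np.ones(n), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def r2(design, target):
        resid = target - design @ coef
        return 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)

    # Evaluate the same fitted model on fresh, independent data.
    X_new = rng.standard_normal((n, d))
    y_new = rng.standard_normal(n)
    A_new = np.column_stack([np.ones(n), X_new[:, selected]])
    return r2(A, y), r2(A_new, y_new)

def average_r2(trials=200, seed=0):
    """Average (in-sample R^2, fresh-data R^2) over independent runs."""
    rng = np.random.default_rng(seed)
    runs = np.array([freedman_experiment(rng=rng) for _ in range(trials)])
    return runs.mean(axis=0)
```

Calling `average_r2()` should show a clearly positive in-sample R² even though no feature is related to the response, while the fresh-data R² is around zero or below.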

Statistical inference: “Fresh” data → Procedure (hypothesis tests, regression, learning) → Result and statistical guarantees (p-values, confidence intervals, prediction intervals)

Data analysis is adaptive (Data → Result): exploratory data analysis; variable selection; hyper-parameter tuning; shared data - findings inform others

Is this a real problem?
“In the course of collecting and analyzing data, researchers have many decisions to make […] It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance”, and to then report only what “worked”.” [Simmons, Nelson, Simonsohn 11] 1,000,000+ downloads; citations
“Irreproducible preclinical research exceeds 50%, resulting in approximately US$28B/year loss” [Freedman, Cockburn, Simcoe 15]
“Why Most Published Research Findings Are False” [Ioannidis 05]
Adaptive data analysis is one of the causes

Evaluating adaptive queries
Data analyst(s) and statistical query oracle [Kearns 93]
Can measure correlations, moments, accuracy/error, parameters and run any SQ-based algorithm!
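For readers unfamiliar with the model: in the standard formulation, a statistical query is a function φ mapping a data point to [0, 1], and the oracle returns an estimate of the expectation of φ over the unknown distribution. A minimal sketch of the naive oracle, which simply answers with the empirical mean on the sample, is below; the name `empirical_sq_answer` is hypothetical.

```python
import numpy as np

def empirical_sq_answer(sample, phi):
    """Answer a statistical query with the empirical mean of phi on the sample.

    phi is assumed to map each data point into [0, 1].
    """
    values = np.array([phi(x) for x in sample], dtype=float)
    if values.min() < 0.0 or values.max() > 1.0:
        raise ValueError("phi must map into [0, 1]")
    return float(values.mean())

# Example query: what fraction of normalized grades exceed 0.5?
grades = [0.2, 0.8, 0.5, 0.9]
fraction_above = empirical_sq_answer(grades, lambda g: float(g > 0.5))
```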

Answering non-adaptive SQs
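As background for the non-adaptive case, a standard calculation (Hoeffding's inequality plus a union bound, not taken from the slide itself) says that if m queries with values in [0, 1] are fixed before seeing the data, empirical means are all within τ of the true expectations with probability at least 1 − β once n ≥ ln(2m/β) / (2τ²). The helper name `hoeffding_sample_size` is hypothetical.

```python
import math

def hoeffding_sample_size(m, tau, beta):
    """Samples sufficient for m fixed [0, 1]-valued queries to be tau-accurate.

    Uses P(|empirical - true| > tau) <= 2 * exp(-2 * n * tau**2) per query,
    with a union bound over the m queries and failure probability beta.
    """
    return math.ceil(math.log(2 * m / beta) / (2 * tau ** 2))

# The dependence on the number of queries is only logarithmic:
n_one = hoeffding_sample_size(m=1, tau=0.1, beta=0.05)       # 185
n_many = hoeffding_sample_size(m=1000, tau=0.1, beta=0.05)   # 530
```

This logarithmic dependence on m relies on the queries being chosen independently of the sample; it does not apply once later queries depend on earlier answers.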

Answering adaptive SQs

Our results

Tool: differential privacy DATA

Differential Privacy [Dwork, McSherry, Nissim, Smith 06]
[Figure: dataset S (Cynthia, Frank, Chris, Kobbi, Adam, Aaron) → Algorithm; ratio bounded]
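"Ratio bounded" refers to the usual definition: for any two datasets differing in one element, the probability of any output changes by a factor of at most e^ε. One common way to answer a [0, 1]-valued query with ε-differential privacy is the Laplace mechanism; the sketch below is a generic illustration and is not claimed to be the exact algorithm analyzed in the talk. The function name `laplace_sq_answer` is hypothetical.

```python
import numpy as np

def laplace_sq_answer(sample, phi, epsilon, rng=None):
    """Empirical mean of phi plus Laplace noise (epsilon-DP for this one query).

    Changing one of the n data points moves the mean of a [0, 1]-valued
    function by at most 1/n, so Laplace noise with scale 1/(n * epsilon)
    suffices for epsilon-differential privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.clip([phi(x) for x in sample], 0.0, 1.0)  # enforce the range
    n = len(values)
    sensitivity = 1.0 / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(values.mean() + noise)
```

With n = 1000 and ε = 1 the noise scale is 0.001, so the private answer stays close to the empirical mean.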

Why DP? DP composes adaptively [Diagram: algorithms A and B]

Why DP? DP composes adaptively; DP implies generalization
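Adaptive composition can be made concrete with the basic composition rule for pure differential privacy: running an ε₁-DP algorithm and then an ε₂-DP algorithm, where the second may depend on the first one's output, is (ε₁ + ε₂)-DP overall. The hypothetical `PrivacyBudget` class below is a minimal sketch of tracking that sum; tighter (advanced) composition bounds exist and are not shown here.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic composition of pure DP."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one epsilon-DP query; return the remaining budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small slack guards against floating-point rounding in the sum.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total_epsilon - self.spent

# Ten adaptively chosen queries at epsilon = 0.1 each use a total of 1.0.
budget = PrivacyBudget(total_epsilon=1.0)
for _ in range(10):
    remaining = budget.spend(0.1)
```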

Back to queries

Further developments