Differentially Private Verification of Regression Model Results

Slides:



Advertisements
Similar presentations
11-1 Empirical Models Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis.
Advertisements

Confidentiality risks of releasing measures of data quality Jerry Reiter Department of Statistical Science Duke University
The Regression Equation  A predicted value on the DV in the bi-variate case is found with the following formula: Ŷ = a + B (X1)
Irwin/McGraw-Hill © Andrew F. Siegel, 1997 and l Chapter 12 l Multiple Regression: Predicting One Factor from Several Others.
HSRP 734: Advanced Statistical Methods July 24, 2008.
Unit 2: Research Methods in Psychology
1 1 Slide © 2003 South-Western/Thomson Learning™ Slides Prepared by JOHN S. LOUCKS St. Edward’s University.
11-1 Empirical Models Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis.
1 1 Slide © 2008 Thomson South-Western. All Rights Reserved Slides by JOHN LOUCKS & Updated by SPIROS VELIANITIS.
Hypothesis Testing II The Two-Sample Case.
Copyright © 2014, 2011 Pearson Education, Inc. 1 Chapter 25 Categorical Explanatory Variables.
A.Murari 1 (24) Frascati 27 th March 2012 Residual Analysis for the qualification of Equilibria A.Murari 1, D.Mazon 2, J.Vega 3, P.Gaudio 4, M.Gelfusa.
1 1 Slide © 2016 Cengage Learning. All Rights Reserved. The equation that describes how the dependent variable y is related to the independent variables.
1 1 Slide © 2005 Thomson/South-Western Slides Prepared by JOHN S. LOUCKS St. Edward’s University Slides Prepared by JOHN S. LOUCKS St. Edward’s University.
1 1 Slide © 2007 Thomson South-Western. All Rights Reserved OPIM 303-Lecture #9 Jose M. Cruz Assistant Professor.
1 1 Slide © 2007 Thomson South-Western. All Rights Reserved Chapter 13 Multiple Regression n Multiple Regression Model n Least Squares Method n Multiple.
1 1 Slide © 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole.
JSM, Boston, August 8, 2014 Privacy, Big Data and The Public Good: Statistical Framework Stefan Bender (IAB)
Multiple regression - Inference for multiple regression - A case study IPS chapters 11.1 and 11.2 © 2006 W.H. Freeman and Company.
Tobacco Use Supplement to the Current Population Survey User’s Workshop June 9, 2009 Tips and Tricks Analyzing TUS-CPS Data Lloyd Hicks Westat.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTIONS 9.2, 10.1, 10.2 Multiple explanatory variables.
October 15. In Chapter 19: 19.1 Preventing Confounding 19.2 Simpson’s Paradox 19.3 Mantel-Haenszel Methods 19.4 Interaction.
Shane Lloyd, MPH 2011, 1,2 Annie Gjelsvik, PhD, 1,2 Deborah N. Pearlman, PhD, 1,2 Carrie Bridges, MPH, 2 1 Brown University Alpert Medical School, 2 Rhode.
1 11 Simple Linear Regression and Correlation 11-1 Empirical Models 11-2 Simple Linear Regression 11-3 Properties of the Least Squares Estimators 11-4.
Multiple Linear Regression ● For k>1 number of explanatory variables. e.g.: – Exam grades as function of time devoted to study, as well as SAT scores.
Calculus from The Student’s Viewpoint David Bressoud St. Paul, MN CBMS Forum Hyatt Regency, Reston, VA October 7, 2014 NSF # A pdf file of this.
Chapter 13 Multiple Regression
Chelsie Guild, Taylor Larsen, Mary Magee, David Smith, Curtis Wilcox TERM PROJECT- VISUAL PRESENTATION.
BUS 362 Marketing Research SPSS Exam Spring 2014 Name: Emilija Naumoska Time of the exam start: digit/letter code:
Statistical Discrimination Statistical Discrimination: –Discrimination in absence of prejudice. –Employers use actual average labor market attachment differences.
Foundations of Sociological Inquiry Statistical Analysis.
Gillian Raab, Chris Dibben, & Paul Burton UNECE-Eurostat Work Session on Statistical Data Confidentiality, Helsinki, 2015 Running an analysis of combined.
28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.
Overview of Regression Analysis. Conditional Mean We all know what a mean or average is. E.g. The mean annual earnings for year old working males.
1 1 Slide © 2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 11/20/12 Multiple Regression SECTIONS 9.2, 10.1, 10.2 Multiple explanatory.
Heart Disease Example Male residents age Two models examined A) independence 1)logit(╥) = α B) linear logit 1)logit(╥) = α + βx¡
Developing a Hiring System Measuring Applicant Qualifications or Statistics Can Be Your Friend!
Jerry Reiter Department of Statistical Science and the Information Initiative at Duke Duke University.
BUS 308 Entire Course (Ash Course) For more course tutorials visit BUS 308 Week 1 Assignment Problems 1.2, 1.17, 3.3 & 3.22 BUS 308.
BUS 308 Entire Course (Ash Course) FOR MORE CLASSES VISIT BUS 308 Week 1 Assignment Problems 1.2, 1.17, 3.3 & 3.22 BUS 308 Week 1.
Reconciling Confidentiality Risk Measures from Statistics and Computer Science Jerry Reiter Department of Statistical Science Duke University.
Chapter 15 Inference for Regression. How is this similar to what we have done in the past few chapters?  We have been using statistics to estimate parameters.
Expanding the Role of Synthetic Data at the U.S. Census Bureau 59 th ISI World Statistics Congress August 28 th, 2013 By Ron S. Jarmin U.S. Census Bureau.
BUS 362 Marketing Research SPSS Exam Prep Fall’16
Degree of Commitment among Students at a Technological University – Testing a New Research Instrument Hannu Vanharanta, Jarno Einolander Industrial Management.
BUS 308 mentor innovative education/bus308mentor.com
Private Data Management with Verification
11-1 Empirical Models Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis.
A Comparison of Two Nonprobability Samples with Probability Samples
April 18 Intro to survival analysis Le 11.1 – 11.2
Creation of synthetic microdata in 2021 Census Transformation Programme (proof of concept) Robert Rendell.
Carina Omoeva, FHI 360 Wael Moussa, FHI 360
Lecture 13 Preview: Dummy and Interaction Variables
Information networks: Is the public ready?
John Loucks St. Edward’s University . SLIDES . BY.
Lecture 12 More Examples for SLR More Examples for MLR 9/19/2018
BUS 308 MENTOR Education for Service--bus308mentor.com.
Racial Disproportionality in Identification of Behavioral Disorders: A Longitudinal Analysis of Contextual Factors AERA Annual Meting April 12, 2016.
Implementation of a more efficient way of collecting data SBS: use of administrative data Statistics Belgium June 2009.
Hypothesis Testing Make a tentative assumption about a parameter
Soc 3306a: ANOVA and Regression Models
Multiple Regression Analysis with Qualitative Information
Logistic Regression.
Soc 3306a Lecture 11: Multivariate 4
Correlational Research
CS639: Data Management for Data Science
Decision trees MARIO REGIN.
Table 2. Regression statistics for independent and dependent variables
Jerome Reiter Department of Statistical Science Duke University
Presentation transcript:

Differentially Private Verification of Regression Model Results Jerry Reiter Department of Statistical Science Information Initiative at Duke Duke University jerry@stat.duke.edu

Acknowledgements Research supported by National Science Foundation ACI 14-43014, SES-11-31897, CNS-10-12141 Alfred P. Sloan Foundation: G-2015-20166003 Any views expressed are those of the author and not necessarily of NSF or the Sloan Foundation.

A vision we are working towards Integrated system for access to confidential data including unrestricted access to fully synthetic data, coupled with means for approved researchers to access confidential data via remote access (or FSRDC), glued together by verification servers that allow users to assess quality of inferences from the synthetic data.

Synergies of integrated system Use synthetic data to develop code, explore data, determine right questions to ask User saves time and resources when synthetic data good enough for her purpose If not, user can apply for special access to data This user has not wasted time Exploration with synthetic data results in more efficient use of the real data Explorations done offline free resources (cycles and staff) for final analyses

Verification Servers Verification servers (Reiter et al. 2009, Computational Statistics and Data Analysis) Separate system with confidential and synthetic data User submits query to system for verification of particular analysis Server reports back measure of similarity of analysis on confidential and synthetic data User can decide to publish if quality sufficient But quality measures can leak information

Where are we now? Allowable verifications depend on user characteristics We have developed verification measures that satisfy differential privacy Plots of residuals versus predicted values for regression ROC curves in logistic regression Statistical significance of regression coefficients Tests that coefficients exceed user-defined thresholds R software package in development

Illustrative application: The OPM Synthetic Data Project Created fully synthetic version of the OPM CPDF-EHRI status file Longitudinal work histories of civil servants from 1988 to 2011 Simulate careers, demographics, grades and steps, salaries, …. Only available to OPM and Duke IRB approved researchers at the moment

Illustrative application: Verification of regression Regress log(basic pay) per employee-year on demographics including race. Modeled by gender. Some interesting results from the synthetic data for the analysis pooling all years Median pay for Asian men about 2.8% lower than median pay for White men, holding all else constant Black women have higher median pay than White women but not statistically significant Are the results from the synthetic data believable?

Illustrative application: Verification of regression User defines a threshold that represents a result of practical significance Test if true coefficient for Asian male B < -.01 Verification algorithm returns differentially private answer that reflects uncertainty due to noise Goal: estimate the probability, r = Pr( B < -.01 ) Output: 95% credible interval or posterior mode for r Examples: interval for r is (.92, 1.0), conclude synthetic data result valid, interval for r is (.00, .20), don’t trust synthetic data result.

Verification of males’ regression (ε = 1, threshold = -.01 for each B) Variable Synthetic Mode of p Confidential AI/AN -.006 (4) .76 -.019 (12) Asian -.028 (30) .99 -.040 (43) Black -.021 (39) .99 -.036 (61) Hispanic -.014 (22) .99 -.029 (42) Age .033 (365) .043 (480) Age Sq. -.00027 (269) -.00036 (352) Education .013 (122) .021 (180)

Verification of females’ regression (ε = 1, threshold = - Verification of females’ regression (ε = 1, threshold = -.01 for each B) Variable Synthetic Mode of p Confidential AI/AN -.009 (7) .97 -.027 (19) Asian -.011 (13) .42 -.010 (11) Black .00013 (.3) .003 -.003 (8) Hispanic -.013 (19) .99 -.021 (30) Age .023 (286) .032 (404) Age Sq. -.00019 (205) -.00027 (295) Education .014 (130) .023 (198)

Illustrative application: Summary of verification results Males: synthetic data suggest all coefficients except for AI/AN less than -.01. Verification confirms! Females: Synthetic data suggest coefficients for AI/AN near -.01 and Hispanic less than -.01. Verification confirms! Synthetic data suggest coefficient for Black not less than -.01. Verification confirms! Synthetic data suggest coefficient for Asian close to -.01. Synthetic data agree with .42.

Longitudinal version Adapt idea for longitudinal analyses Estimate regression in each year Does trend in OLS coefficients in confidential data match trend in synthetic data (or other posited trend)? Compute match indicators with differential privacy Illustration for OPM data Run same regressions as before but separately by year. Plot coefficients by year for synthetic and confidential data Figures displayed on pdf of draft paper….

Future directions Finding ways to get high quality results without high privacy budgets. Considering multivariate quantities Global privacy budget versus individual privacy budget versus analysis privacy budget – interactions with trust Accounting for uncertainty when estimates based on complex, modest-sized samples Developing software to implement the idea Paper coming soon….