Statistical Methodology for the Automatic Confidentialisation of Remote Servers at the ABS Session 1 UNECE Work Session on Statistical Data Confidentiality.

Presentation transcript:

Statistical Methodology for the Automatic Confidentialisation of Remote Servers at the ABS Session 1 UNECE Work Session on Statistical Data Confidentiality October 2013 Daniel Elazar

Confidentiality Risks for Remote Server Outputs
Known Types of Attack from the Literature
Tabular attacks: averaging, differencing, scope coverage, sparsity
Regression attacks: tabular attacks as above, plus leverage, high $R^2$ (saturated or ideal model fit), influence, solving model equations

TableBuilder Functionality
Statistics: Counts, Estimates, Means, Quantiles
Columns: Weighted, RSEs (the per-statistic marks are not preserved in this transcript)

TableBuilder Protections
Protection: Description
Perturbation: Statistical noise added to values
Custom Ranges: min, max, min interval width
Field Exclusion Rules: Certain combinations of variables that increase identification risk are prohibited
Additivity: Restores additivity of inner cells to margins
Sparsity checks: Tables with too high a proportion of cells with a small number of contributors are not released
RSEs: Further adjusted; quality cutoff
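The sparsity check listed above can be illustrated with a minimal sketch. The slide does not give the thresholds or say how empty cells are treated, so `min_contributors`, `max_sparse_share`, and the choice to count only non-empty small cells as sparse are all assumptions made for illustration, not the ABS rule.

```python
def sparse_share(cell_counts, min_contributors=3):
    """Share of cells that have at least one but fewer than
    min_contributors contributors (hypothetical definition)."""
    if not cell_counts:
        return 0.0
    sparse = sum(1 for c in cell_counts if 0 < c < min_contributors)
    return sparse / len(cell_counts)


def release_allowed(cell_counts, min_contributors=3, max_sparse_share=0.5):
    """Refuse release when too large a share of cells is sparse.
    Both thresholds are illustrative assumptions."""
    return sparse_share(cell_counts, min_contributors) <= max_sparse_share


dense_table = [25, 40, 1, 30]   # one small cell out of four
sparse_table = [1, 2, 1, 40]    # three small cells out of four

print(release_allowed(dense_table))
print(release_allowed(sparse_table))
```

In this sketch the decision depends only on unweighted contributor counts; a production rule might also take table dimensions or weights into account.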

DataAnalyser Functionality
Written in R; Full User Authentication; Audit System
Exploratory Data Analysis: Summary statistics (sums, counts); Summary Tables; Graphics (side-by-side box plots); Summary statistics (count); Graphics
Transformations / Derivations: Logical derivations; Categorical / Dummy variables; Category collapsing; Expression Editor for categ. vars; Drop variables / records; Action List
Analysis Procedures / Specifications: Robust Linear Regression; Binomial logistic; Probit; Multinomial; Poisson; Diagnostics; Weighted Analysis
Outputs: R-squared; Pseudo R-squared; Coefficients; Standard errors; Other Diagnostics
Output Formats: CSV
Storage of intermediate datasets; Workflow Control; Data Repository Interface; Metadata Handler

DataAnalyser Protections (additional to TB)
Protection: Description
Perturbation: Statistical noise added to regression score function
Linear Robust: Huber Mallows robustness incorporating perturbation for outliers and leverage points
Hex Bin Plots: Replaces scatter plots
Coverage and scope based Perturbation: Perturbation controlled by the specific units included in scope and the definition of scope
Drop k units: One record is dropped for each category of each explanatory categorical variable
Explanatory Only Variables: Demographic variables not allowed in the response variable field
Sparsity: Regressions based on too few units are not released
Leverage: Regressions on data containing units with excessive leverage are not released

So where's the Risk in Regressions?
Saturated Model: $x_1, x_2, \ldots, x_n$
Sparse Model: $x_1$
The Perfect Model: $x_1, x_2, \ldots, x_k$
Leverage Attack: [plot with axes x and y; label c]
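To make the leverage risk concrete, here is a small sketch for simple linear regression using the standard hat-value formula $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The data are invented, and the cutoff of twice the average leverage is a common rule of thumb used here as an assumption; the slides do not state the threshold the ABS applies.

```python
def leverages(x):
    """Hat values for a simple linear regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def fit_line(x, y):
    """Ordinary least squares intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented data: five ordinary units plus one unit far out in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
y = [2.1, 1.9, 2.0, 2.2, 1.8, 57.0]

h = leverages(x)
intercept, slope = fit_line(x, y)
fitted_outlier = intercept + slope * x[-1]

# Rule-of-thumb cutoff (assumption): twice the average leverage, 2p/n with p = 2.
cutoff = 2 * 2 / len(x)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

print(flagged)
print(round(fitted_outlier, 2), y[-1])
```

Because the last unit's leverage is close to 1, the fitted line is pulled almost exactly through it, so an unprotected fitted value at that point comes close to disclosing the unit's response.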

Scope-Coverage (Differencing) Attack
[Two diagrams of sets A and B, each with axes Age and Other Characteristics]
Case 1: Confidentialised outputs from requests A and B differ slightly: unit(s) (in red) exist in set B excluding A and are likely to be rare/unique.
Case 2: Confidentialised outputs from requests A and B are exactly the same: there are no units in set B excluding A.
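The differencing idea can be shown on unprotected outputs with a short sketch. The records, the `income` variable, and the two age scopes are invented for illustration; the point is only that two nearly identical requests isolate whatever units lie in B but not in A.

```python
# Invented microdata; only one unit is aged exactly 65.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 47, "income": 61000},
    {"id": 3, "age": 58, "income": 48000},
    {"id": 4, "age": 65, "income": 97000},
    {"id": 5, "age": 71, "income": 30000},
]


def in_scope_a(record):
    return record["age"] < 65       # request A


def in_scope_b(record):
    return record["age"] <= 65      # request B, a slightly wider scope


def count(records, scope):
    return sum(1 for r in records if scope(r))


def total_income(records, scope):
    return sum(r["income"] for r in records if scope(r))


count_diff = count(records, in_scope_b) - count(records, in_scope_a)
income_diff = total_income(records, in_scope_b) - total_income(records, in_scope_a)

print(count_diff)    # number of units in B excluding A
print(income_diff)   # their combined income
```

With exact outputs, the difference of the two totals is the income of the single unit aged 65. Perturbation that depends on the units in scope is intended to make such differences unreliable.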

Perturbation of Unweighted Counts
Unweighted Count (UWC)
Perturbation Table lookup: $p = \mathrm{pTable}[p_{\mathrm{row\_index}}, p_{\mathrm{col\_index}}]$
Perturbed count: $\mathrm{pUWC} = \mathrm{UWC} + p$
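A minimal sketch of the table lookup pUWC = UWC + p follows. The slide does not say how the row and column indices are derived or what the table contains, so everything here is an assumption: the row index is taken from the sum of hypothetical per-record keys, the column index from the count itself, and the table holds small integers clipped so that a count can never go negative. The sketch makes no attempt to reproduce the unbiasedness of the real table.

```python
import random


def build_ptable(n_rows=256, max_col=10, seed=0):
    """Hypothetical perturbation table: ptable[row][col] is the noise p
    for a cell with row index `row` and (capped) count `col`."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_rows):
        row = [0]  # column 0: zero cells are never perturbed
        for col in range(1, max_col + 1):
            noise = rng.randint(-2, 2)
            row.append(max(noise, -col))  # keep count + noise >= 0
        table.append(row)
    return table


def perturb_count(record_keys, ptable):
    """Return pUWC = UWC + p for the cell made up of these records."""
    uwc = len(record_keys)
    if uwc == 0:
        return 0
    row_index = sum(record_keys) % len(ptable)   # assumed: depends on contributors
    col_index = min(uwc, len(ptable[0]) - 1)     # assumed: depends on the count
    p = ptable[row_index][col_index]
    return uwc + p


ptable = build_ptable()
cell = [1017, 2203, 3391]   # hypothetical record keys of the contributors
print(perturb_count(cell, ptable))
```

Because the indices depend only on the contributing records, the same cell receives the same perturbation however the request is phrased.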

Perturbation of Unweighted Counts

The Perturbation Algorithm:
Protects against differencing
Ensures that the same cell value receives the same perturbation (prevents averaging)
Does not perturb zero cells
Will not produce negative values for counts
Applies relatively more noise to smaller values
Does not add bias
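The averaging point can be illustrated by comparing fresh noise on every request with noise fixed by the cell. This is a sketch under assumptions: the true count, the noise range, and the use of a seeded generator keyed on a hypothetical `cell_key` are invented for the example.

```python
import random

TRUE_COUNT = 7


def fresh_noise_answer(rng):
    """Independent noise on every request: vulnerable to averaging."""
    return TRUE_COUNT + rng.randint(-2, 2)


def fixed_noise_answer(cell_key):
    """Noise determined by the cell: repeating the request gives the same answer."""
    rng = random.Random(cell_key)
    return TRUE_COUNT + rng.randint(-2, 2)


rng = random.Random(12345)
fresh = [fresh_noise_answer(rng) for _ in range(10_000)]
fresh_mean = sum(fresh) / len(fresh)

fixed = {fixed_noise_answer(cell_key=987654) for _ in range(10_000)}

print(round(fresh_mean, 2))   # settles near the true count
print(len(fixed))             # number of distinct answers received
```

Repeating the request averages fresh noise away, whereas the fixed-noise answers are all identical and so reveal nothing more on the second request than on the first.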

Perturbation of Weighted Continuous Values
[Formula not preserved in this transcript; its terms were annotated: direction, magnitude, noise]

Perturbation of Regression Estimates

Future Directions