Disclosure detection & control in research environments Felix Ritchie.

Why are research environments special?
Little disclosure control on input
Few limits on processing
Unpredictable, complex outputs
–an infinity of “special cases”
=> Manual review for disclosiveness required

Problems of reviewing research outputs
Limited application of rules
How do we ensure
–consistency?
–transparency?
–security?
How do we do this with few resources?

Classifying the research zoo
Some outputs inherently “safe”
Some inherently “unsafe”
Concentrate on the unsafe
–Focus training
–Define limits
–Discourage use

Safe versus unsafe
Safe outputs
–Will be released unless certain conditions arise
Unsafe outputs
–Won’t be released unless demonstrated to be safe
Examples (* = conditions for release apply):
Unsafe: Quantiles; Graphs; Aggregated tables; Cross-product matrices
Safe: Linear regression*; Panel data estimates; Estimated covariances*
Indeterminate: Herfindahl indexes

Determining safety
Key is to understand whether the underlying functional form is safe or unsafe
Each output type assessed for risk of
–Primary disclosure
–Disclosure by differencing

Example: linear aggregates of data are unsafe
Inherent disclosiveness:
–Differencing is feasible
=> each data point needs to be assessed for threshold/dominance limits
=> resource problem for large datasets
Disclosure by differencing:
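A minimal sketch of disclosure by differencing on a linear aggregate, assuming two totals are released that differ by a single unit. The firm labels, the turnover figures and the `total` helper are hypothetical and only illustrate the mechanism.

```python
# Hypothetical turnover figures for five firms; firm "E" is the target.
turnover = {"A": 120.0, "B": 95.0, "C": 410.0, "D": 60.0, "E": 275.0}


def total(values, exclude=()):
    """Sum a linear aggregate over all units except those in `exclude`."""
    return sum(v for k, v in values.items() if k not in exclude)


# Two released totals that each look harmless on their own.
total_all = total(turnover)
total_without_e = total(turnover, exclude={"E"})

# Differencing the two aggregates recovers firm E's value exactly.
recovered_e = total_all - total_without_e
print(recovered_e)
```

Because any pair of totals that differ by one unit behaves this way, every cell has to be checked against the others, which is where the resource problem for large datasets comes from.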

Example: linear regression coefficients are safe
Let [equation not preserved] => can’t identify single data point
But [equation not preserved] => No risk of differencing
Exceptions
–All right hand variables public and an excellent fit (easily tested, can generate automatic limits on prediction)
–All observations on a single person/company
–Must be a valid regression
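The "public right hand variables and an excellent fit" exception can be illustrated with a small simulation. This is a sketch under stated assumptions, not the slide's own derivation: the data are synthetic, the `ols` helper is hypothetical, and the two noise levels are chosen only to contrast an ordinary fit with a near-perfect one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)                 # stand-in for a public right-hand variable
X = np.column_stack([np.ones(n), x])   # design matrix with a constant


def ols(X, y):
    """Ordinary least squares coefficients and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    centred = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centred @ centred)
    return beta, r2


# Ordinary case: noisy relationship, so fitted values are poor guesses of y.
y_noisy = 1.0 + 2.0 * x + rng.normal(scale=3.0, size=n)
beta_noisy, r2_noisy = ols(X, y_noisy)
err_noisy = np.max(np.abs(y_noisy - X @ beta_noisy))

# Exception: near-perfect fit, so public X plus released coefficients
# predict each confidential y value closely.
y_tight = 1.0 + 2.0 * x + rng.normal(scale=0.001, size=n)
beta_tight, r2_tight = ols(X, y_tight)
err_tight = np.max(np.abs(y_tight - X @ beta_tight))

print(r2_noisy, err_noisy)
print(r2_tight, err_tight)
```

An automatic check along these lines could compare R-squared against a limit before release; the limit itself is a policy choice that the slides do not specify.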

Example: cross-product/variance-covariance matrices
Cross product matrix M = (X’X) is unsafe
–Frequencies/totals identified by interaction with constant
–And for any other categorical variables
What about variance-covariance matrices?
–V is unsafe: can be inverted to produce M
–But in the more general case [equation not preserved]
–Can’t create a table for X unless Z=X and W=I
=> weighted covariance matrix is safe
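A short sketch of why M = (X’X) leaks tabular information, assuming X contains a constant and a 0/1 categorical dummy. The variable names and values are hypothetical, and the unweighted form V = s2 * inv(X’X) with a known s2 is the textbook OLS case, used here as an assumption rather than taken from the slides.

```python
import numpy as np

# Hypothetical data: a constant, a 0/1 category dummy, and a continuous variable.
dummy = np.array([1, 0, 0, 1, 0, 0], dtype=float)
wage = np.array([30.0, 22.0, 25.0, 41.0, 19.0, 27.0])
X = np.column_stack([np.ones(6), dummy, wage])

M = X.T @ X  # cross-product matrix

n_obs = M[0, 0]        # constant x constant = number of observations
n_in_group = M[0, 1]   # constant x dummy    = frequency of the category
total_wage = M[0, 2]   # constant x wage     = overall total
group_wage = M[1, 2]   # dummy x wage        = total within the category

# Unweighted covariance matrix: inverting V returns M once s2 is known.
s2 = 4.0  # assumed residual variance, usually reported with the regression
V = s2 * np.linalg.inv(M)
M_recovered = s2 * np.linalg.inv(V)

print(n_obs, n_in_group, total_wage, group_wage)
```

Each recovered entry is a cell of an aggregated table, so releasing M or an invertible V is equivalent to releasing those table cells.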

Example: Herfindahl indices
Composite index of industrial concentration
Safe as long as at least 3 firms in the industry? No:
–Quadratic term exacerbates dominance
–If second-largest share is much smaller, √H ≈ share of largest firm
–Standard dominance rule of largest unit < 45% share doesn’t prevent this
Current tests for safety not very satisfactory
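A minimal numerical sketch of the dominance problem, assuming the usual definition of the Herfindahl index as the sum of squared market shares. The industry below is hypothetical: one firm with a 44% share, which passes a 45% dominance rule, and 56 firms with 1% each.

```python
import math


def herfindahl(shares):
    """Herfindahl index: sum of squared market shares (shares sum to 1)."""
    return sum(s * s for s in shares)


# Hypothetical industry: largest firm just under a 45% dominance limit.
largest = 0.44
shares = [largest] + [0.01] * 56

H = herfindahl(shares)

# The square of the largest share dominates the sum, so sqrt(H) sits just
# above the largest firm's share and gives a close estimate of it.
estimate = math.sqrt(H)
print(H, estimate, estimate - largest)
```

Since every term in the sum is non-negative, sqrt(H) is never below the largest share, so a released H always bounds that share from above and the bound tightens as the remaining firms shrink.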

Questions?
Felix Ritchie
Microdata Analysis and User Support
Office for National Statistics