Workshop on Synthetic Data, 1st December 2014, ONS, Titchfield.

What is synthpop?
• A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis and preparing code

Administrative Data Research Centre - Scotland | Beata Nowok | 1 December 2014

Real (input)

Sex     Age  Education             Marital status  Income  Life satisfaction
FEMALE  57   VOCATIONAL/GRAMMAR    MARRIED         800     PLEASED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
FEMALE  18   VOCATIONAL/GRAMMAR    UNMARRIED       NA      PLEASED
FEMALE  78   PRIMARY/NO EDUCATION  WIDOWED         900     MIXED
FEMALE  54   VOCATIONAL/GRAMMAR    MARRIED         1500    MOSTLY SATISFIED
MALE    20   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  39   SECONDARY             MARRIED         2000    MOSTLY SATISFIED
MALE    39   SECONDARY             MARRIED         1197    MIXED
FEMALE  38   VOCATIONAL/GRAMMAR    MARRIED         NA      MOSTLY DISSATISFIED
FEMALE  73   VOCATIONAL/GRAMMAR    WIDOWED         1700    PLEASED
FEMALE  54   SECONDARY             WIDOWED         2000    MOSTLY SATISFIED
MALE    30   VOCATIONAL/GRAMMAR    UNMARRIED       900     MOSTLY SATISFIED
MALE    68   SECONDARY             MARRIED         -8      DELIGHTED
MALE    61   PRIMARY/NO EDUCATION  MARRIED         -8      MIXED

Synthetic (output)

Sex     Age  Education             Marital status  Income  Life satisfaction
MALE    81   PRIMARY/NO EDUCATION  MARRIED         2100    PLEASED
MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1700    PLEASED
FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
FEMALE  98   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED

Data that look (structurally) like original data but contain artificial units only

Data that behave (statistically) like original data

synthpop package: Generating synthetic versions of sensitive microdata for statistical disclosure control

Generating synthetic data
Sequentially replacing original data values with synthetic values generated from conditional probability distributions:

    $Y_j \sim (Y_0, Y_1, \ldots, Y_{j-1})$

[Diagram: models are fitted to the real data, and synthetic values are drawn from the fitted models.]
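The sequential idea on this slide can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not synthpop's implementation: the choice of models (a normal distribution per group, a straight-line regression with normal noise) and the names `fit_line` and `synthesise` are assumptions made for illustration. The toy records are the sex, age and income values of the rows with a valid income in the slide's real (input) table. Each model is fitted on the real data, while the predictors used for drawing are the values already synthesised.

```python
import random
import statistics

# Rows with a valid income from the slide's real (input) table.
REAL = [
    {"sex": "FEMALE", "age": 57, "income": 800},
    {"sex": "MALE", "age": 41, "income": 1500},
    {"sex": "FEMALE", "age": 78, "income": 900},
    {"sex": "FEMALE", "age": 54, "income": 1500},
    {"sex": "FEMALE", "age": 39, "income": 2000},
    {"sex": "MALE", "age": 39, "income": 1197},
    {"sex": "FEMALE", "age": 73, "income": 1700},
    {"sex": "FEMALE", "age": 54, "income": 2000},
    {"sex": "MALE", "age": 30, "income": 900},
]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def synthesise(real, n, seed=0):
    """Hypothetical helper: synthesise sex, then age | sex, then income | sex, age."""
    rng = random.Random(seed)
    groups = {}
    for row in real:
        groups.setdefault(row["sex"], []).append(row)

    # FIT step (real data only): one model per variable, given the earlier ones.
    age_fit = {s: (statistics.mean([r["age"] for r in rows]),
                   statistics.stdev([r["age"] for r in rows]))
               for s, rows in groups.items()}
    income_fit = {s: fit_line([r["age"] for r in rows],
                              [r["income"] for r in rows])
                  for s, rows in groups.items()}

    # DRAW step: each value is drawn given the already-synthesised values.
    observed_sex = [r["sex"] for r in real]
    synthetic = []
    for _ in range(n):
        sex = rng.choice(observed_sex)          # Y0: sample from observed values
        mu, sd = age_fit[sex]
        age = rng.gauss(mu, sd)                 # Y1 | Y0
        a, b, rsd = income_fit[sex]
        income = rng.gauss(a + b * age, rsd)    # Y2 | Y0, Y1
        synthetic.append({"sex": sex, "age": round(age), "income": round(income)})
    return synthetic


synthetic = synthesise(REAL, n=20, seed=1)
```

Nothing in this sketch keeps ages or incomes within plausible ranges; the "common data problems" slide lists the options syn() offers for that kind of issue.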

Generating synthetic data
[Diagram: real data -> syn() -> synthetic data]

Overview of synthpop functions
[Diagram: real data -> syn() -> synthetic data]
• Data in and out: read.real(), write.syn()
• Synthesis: syn()
• Disclosure control: sdc()
• Descriptive: compare.synds(), summary.synds()
• Models: glm.synds(), compare.fit.synds(), summary.fit.synds()

syn() & common data problems
• Missing-data codes: contNA
  - categorical variables: additional factor level(s)
  - continuous variables: specified by contNA and modelled separately
• Semi-continuous variables: semicont
• Restricted values (interrelationships between variables): rules & rvalues
• Linear constraints: denom
• Non-negativity / non-normality: method set to 'lognorm', 'sqrtnorm' or 'cubertnorm'
• Deterministic relations: method set to "~I(…)"
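Two of the items above, missing-data codes in a continuous variable and non-negativity, can be sketched together. The Python below is an illustration only and does not reproduce how syn() handles them: it assumes a two-step approach (first decide whether a value is a missing-data code, then draw valid values from a log-normal fit), and `synthesise_income` is a hypothetical name. The input is the income column of the slide's real (input) table, with NA written as `None`.

```python
import math
import random
import statistics

# Income column of the slide's real (input) table; NA is written as None.
REAL_INCOME = [800, 1500, None, 900, 1500, -8, 2000, 1197, None,
               1700, 2000, 900, -8, -8]

# Values treated as missing-data codes rather than as real incomes.
MISSING_CODES = {None, -8}


def synthesise_income(real_income, n, seed=0):
    """Hypothetical helper: model missing codes and valid values separately."""
    rng = random.Random(seed)
    codes = [v for v in real_income if v in MISSING_CODES]
    valid = [v for v in real_income if v not in MISSING_CODES]

    # Part 1: how often a missing-data code occurs, and which one.
    p_missing = len(codes) / len(real_income)

    # Part 2: fit a normal distribution to log(income), so that
    # back-transformed draws can never be negative.
    logs = [math.log(v) for v in valid]
    mu, sigma = statistics.mean(logs), statistics.stdev(logs)

    out = []
    for _ in range(n):
        if rng.random() < p_missing:
            out.append(rng.choice(codes))       # keep the code as a category
        else:
            out.append(round(math.exp(rng.gauss(mu, sigma))))
    return out


synthetic_income = synthesise_income(REAL_INCOME, n=50, seed=1)
```

Treating the codes as categories keeps -8 and NA from distorting the fitted distribution of the valid incomes.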

sdc() & statistical disclosure control
• Data labelling: label
• Removing replicated uniques: rm.replicated.uniques
• Bottom- and top-coding: recode.vars, bottom.top.coding, recode.exclude
• syn(): smoothing, minbucket

    sdc(syn.obj, real, label = "false data",
        rm.replicated.uniques = TRUE,
        recode.vars = c("age", "income"),
        bottom.top.coding = list(c(NA, 85), c(NA, 1500)))
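The three measures used in the call above (labelling, removal of replicated uniques, top-coding) can be sketched in plain Python. This is an illustration under stated assumptions, not the sdc() code: the helper names `clamp` and `protect` are hypothetical, the order of the steps is a choice made for this sketch, and a "replicated unique" is taken to mean a record that occurs exactly once in the synthetic data and exactly once in the real data. The records are a few rows taken from the slide's tables.

```python
from collections import Counter

FIELDS = ("sex", "age", "education", "marital", "income", "life")

# A few rows from the slide's real (input) and synthetic (output) tables.
REAL = [
    ("FEMALE", 57, "VOCATIONAL/GRAMMAR", "MARRIED", 800, "PLEASED"),
    ("MALE", 41, "SECONDARY", "UNMARRIED", 1500, "MIXED"),
    ("FEMALE", 78, "PRIMARY/NO EDUCATION", "WIDOWED", 900, "MIXED"),
]
SYNTHETIC = [
    ("MALE", 81, "PRIMARY/NO EDUCATION", "MARRIED", 2100, "PLEASED"),
    ("FEMALE", 98, "PRIMARY/NO EDUCATION", "MARRIED", 800, "MOSTLY DISSATISFIED"),
    ("MALE", 41, "SECONDARY", "UNMARRIED", 1500, "MIXED"),
    ("FEMALE", 50, "PRIMARY/NO EDUCATION", "MARRIED", None, "MOSTLY SATISFIED"),
]


def clamp(value, bottom, top):
    """Bottom- and top-code one value; None means 'no bound' or 'missing'."""
    if value is None:
        return None
    if bottom is not None and value < bottom:
        return bottom
    if top is not None and value > top:
        return top
    return value


def protect(synthetic, real, label, bounds):
    """Hypothetical helper combining the three measures."""
    real_counts, syn_counts = Counter(real), Counter(synthetic)
    # 1. Drop synthetic records that are unique and replicate a unique real record.
    kept = [rec for rec in synthetic
            if not (syn_counts[rec] == 1 and real_counts[rec] == 1)]
    protected = []
    for rec in kept:
        row = dict(zip(FIELDS, rec))
        # 2. Bottom- and top-code the selected variables.
        for name, (bottom, top) in bounds.items():
            row[name] = clamp(row[name], bottom, top)
        # 3. Label every released record as synthetic.
        protected.append({"label": label, **row})
    return protected


protected = protect(SYNTHETIC, REAL, label="false data",
                    bounds={"age": (None, 85), "income": (None, 1500)})
```

With these inputs the record that also appears in the real data is dropped, age 98 becomes 85 and income 2100 becomes 1500, which matches the pattern in the slide's table of data after sdc().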

Real (input) -> Synthetic (output) -> sdc()

The real (input) and synthetic (output) tables are those shown on the earlier slide. Synthetic data after sdc():

            Sex     Age  Education             Marital status  Income  Life satisfaction
false data  MALE    81   PRIMARY/NO EDUCATION  MARRIED         1500    PLEASED
false data  MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1500    PLEASED
false data  FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
false data  FEMALE  85   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
false data  FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
false data  FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
false data  MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
false data  FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
false data  MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
false data  FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
false data  MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
false data  MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
false data  FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED


synthpop: future developments

Disclosure control: providing sufficient disclosure protection
• Disclosure control measures
• Watermarking
• Partially synthetic data

Data synthesis: handling various data types, data structures and real data problems
• Stratified synthesis
• Value bounds
• Multiple event data
• Household and other hierarchical data
• Complex survey design
• Small geographic areas

Package usability: making synthpop flexible and accessible to a wider range of users
• A graphical user interface (GUI)
• Dealing with computational limitations
• Support for LSs projects
• Training workshops

Quality of synthetic data: measuring and improving analytical validity
• Tests of synthesising approaches (parametric vs CART models)
• CART extensions
• Case studies for ADRC-S projects
• Guidelines for best practice