Perturbative methods for ESS census tables

Presentation transcript:

Perturbative methods for ESS census tables
Current status of SGA N° 2018.0108 under the FPA N° 11112.2014.005-2014.533
Co-financed by the European Commission
Peter-Paul de Wolf, 14 March 2019

Introduction
- Population census has a long history
- Eurostat: census hub (https://ec.europa.eu/CensusHub2/), data from 32 European countries
- Big step: harmonized census output
- But: no harmonized statistical disclosure control approaches (Member States used their own methods)

Introduction
- Census 2021: also harmonized SDC?
- SGA "Harmonized protection of census data in the ESS"
- Recommend methods for protecting census hypercubes
- Grid squares (1 km²) AND administrative areas [new]

Introduction
- Two suggested methods, to be used separately and/or together:
  - Targeted Record Swapping (TRS)
  - Adding noise using the Cell Key Method (CKM)

Targeted Record Swapping
- Pre-tabular method (changes in the microdata)
- Preparation:
  - Specify variables that define risk (k-anonymity)
  - Specify variables that define the regional hierarchy
  - Calculate risk for all households at each regional level
  - Specify variables that define "similar" households
  - Specify minimum swap rate
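The risk step in the preparation can be illustrated with a minimal Python sketch. It assumes that a household counts as high risk when fewer than k households in the same region share its combination of risk variables. The field names (hid, region, size, tenure) and the function risky_households are hypothetical and are not taken from the TRS library.

```python
from collections import Counter

def risky_households(households, risk_vars, region_var, k):
    """Return ids of households whose combination of risk variables
    occurs fewer than k times within their own region."""
    def key(h):
        return (h[region_var],) + tuple(h[v] for v in risk_vars)

    counts = Counter(key(h) for h in households)
    return [h["hid"] for h in households if counts[key(h)] < k]

# Hypothetical toy data: two identical households in R1, two unique ones.
households = [
    {"hid": 1, "region": "R1", "size": 2, "tenure": "own"},
    {"hid": 2, "region": "R1", "size": 2, "tenure": "own"},
    {"hid": 3, "region": "R1", "size": 5, "tenure": "rent"},
    {"hid": 4, "region": "R2", "size": 2, "tenure": "own"},
]
flagged = risky_households(households, ["size", "tenure"], "region", k=2)
print(flagged)
```

In this sketch the check runs at a single regional level; the slides describe repeating the risk calculation at every level of the regional hierarchy.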

Targeted Record Swapping
- Go from highest regional level to lowest:
  - Make donor set of households: "similar" households of the high-risk households
  - Draw a donor household for each high-risk household: same regional level, different region
  - Swap all regional variables
- If the minimum swap rate is not reached, swap additional households at the lowest regional level
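The donor step can be sketched in Python for a single regional level. This is a simplified illustration and not the C++ implementation: it ignores the hierarchy and the minimum swap rate, and the names swap_regions, hid, region and size are hypothetical.

```python
import random

def swap_regions(households, risky_ids, similar_vars, region_var, seed):
    """Swap the region of each high-risk household with a randomly drawn
    'similar' donor household from a different region (one level only)."""
    rng = random.Random(seed)
    by_id = {h["hid"]: dict(h) for h in households}  # work on copies
    swapped = set()
    for hid in risky_ids:
        if hid in swapped:
            continue
        target = by_id[hid]
        # Donor set: not yet swapped, different region, same 'similar' values.
        donors = [
            d for d in by_id.values()
            if d["hid"] not in swapped
            and d["hid"] != hid
            and d[region_var] != target[region_var]
            and all(d[v] == target[v] for v in similar_vars)
        ]
        if not donors:
            continue  # no suitable donor; household stays where it is
        donor = rng.choice(donors)
        target[region_var], donor[region_var] = donor[region_var], target[region_var]
        swapped.update({hid, donor["hid"]})
    return list(by_id.values()), swapped

# Hypothetical toy data: household 2 is high risk, household 3 is similar.
households = [
    {"hid": 1, "region": "R1", "size": 2},
    {"hid": 2, "region": "R1", "size": 5},
    {"hid": 3, "region": "R2", "size": 5},
    {"hid": 4, "region": "R2", "size": 2},
]
result, swapped = swap_regions(households, [2], ["size"], "region", seed=1)
regions = {h["hid"]: h["region"] for h in result}
print(regions, swapped)
```

Because only regional variables are exchanged between two households, the number of households per region is unchanged by a swap in this sketch.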

TRS Implementation
- C++ code (library)
- Callable from µ-Argus
- Callable from R using e.g. package* recordSwapping:
    swapdata <- recordSwap(data,similar,hierarchy,risk,hid,th,swaprate,seed)
- All available through subpages of TargetedRecordSwapping on https://github.com/sdcTools/protoTestCensus
*TRS will become part of the sdcMicro package

Noise addition using Cell Key Method
- Post-tabular method (noise added to table cells)
1. Determine p-table
2. Draw a 𝒰(0,1) value for each record = record key
3. Sum the record keys of the records in each table cell
4. Assign the fractional part of that sum as cell key to each table cell
5. Use cell value AND cell key AND p-table to get the amount of noise to add to that cell

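1. Determine p-table
- p-table: transition probabilities (R package ptable)
- p_ij = P(cell value i is changed into value j)
- E.g., probabilities of v = j − i, shown in a table indexed by i and j (labels 1 to 6) and in cumulative form
- Values surviving in the transcript, grouped into runs that sum to 1 (column positions not recoverable):
    0.5133  0.4600  0.0267
    0.1656  0.5463  0.2449  0.0432
    0.4208  0.2776  0.1824  0.1192
    0.0739  0.2442  0.3637  (remaining entries missing)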
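The properties of one p-table row can be checked with a short Python sketch. It assumes that the four values 0.4208, 0.2776, 0.1824, 0.1192 form the row for i = 3 and correspond to noise values v = −1, 0, +1, +2. That assignment is an assumption, because the transcript lost the column layout, but it is consistent with the row summing to 1 and with the later example for cell (M, B).

```python
from itertools import accumulate

# Assumption: p-table row for i = 3 with noise values v = -1, 0, +1, +2
# (the column layout is not recoverable from the transcript).
noise_values = [-1, 0, 1, 2]
probabilities = [0.4208, 0.2776, 0.1824, 0.1192]

# A valid row is a probability distribution over the possible noise values.
total = sum(probabilities)

# Under the assumed layout the expected noise is zero (unbiased perturbation).
expected_noise = sum(v * p for v, p in zip(noise_values, probabilities))

# Cumulative form, as used when looking up a cell key.
cumulative = [round(c, 4) for c in accumulate(probabilities)]

print(round(total, 4), round(abs(expected_noise), 4), cumulative)
```

The cumulative form is what makes the lookup possible: a cell key in [0, 1) falls into exactly one interval between consecutive cumulative values.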

2.–4. Draw record keys and make cell keys

Record keys:

ID  Sex  Age  Record Key
 1  M    A    0.34582249
 2  F    B    0.68438579
 3            0.95880618
 4       C    0.62902289
 5            0.86598721
 6            0.36307981
 7            0.91420393
 8            0.69629390
 9            0.53460054
10            0.68511663
11            0.03426370
12            0.33696811
13            0.11181613
14            0.56526973
15            0.01047942

Cell keys (T = total):

Sex  Age  i   Cell Key
T    T    15  0.73611646
T    A     5  0.53206947
T    B     8  0.21194429
T    C     2  0.99210270
M    T     7  0.70435560
M    A     4  0.96679974
M    B     3  0.73755586
F    T     8  0.03176086
F    A     1  0.56526973
F    B     5  0.47438843
F    C     2  0.99210270

Example, cell Sex = M, Age = B: sum of record keys = 1.73755586, so cell key = 0.73755586
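The cell-key arithmetic can be reproduced with a small Python sketch that uses the 15 record keys from this slide. Decimal arithmetic is used here so that the eight-digit keys add up exactly; the function name cell_key is hypothetical.

```python
from decimal import Decimal

# The 15 record keys listed on the slide, in ID order.
record_keys = [Decimal(k) for k in (
    "0.34582249", "0.68438579", "0.95880618", "0.62902289", "0.86598721",
    "0.36307981", "0.91420393", "0.69629390", "0.53460054", "0.68511663",
    "0.03426370", "0.33696811", "0.11181613", "0.56526973", "0.01047942",
)]

def cell_key(keys):
    """Cell key = fractional part of the sum of the record keys in the cell."""
    return sum(keys, Decimal(0)) % 1

# Total cell (all 15 records).
total_key = cell_key(record_keys)

# Cell (M, B): the slide gives the sum of its three record keys directly.
mb_key = Decimal("1.73755586") % 1

print(total_key, mb_key)
```

Because the cell key depends only on the records in the cell, the same cell receives the same key, and hence the same noise, in every table in which it appears.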

5. Determine amount of noise to add
- Look up the cell key in the cumulative distribution of v = j − i that the p-table gives for the cell value i
- Cell (M, B): i = 3, cell key 0.73755586
- Result for (M, B): j = i + 1 = 4
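The lookup for cell (M, B) can be sketched in Python as follows. The row used for i = 3 (probabilities 0.4208, 0.2776, 0.1824, 0.1192 for v = −1, 0, +1, +2) is an assumed reading of the p-table values in this transcript, and the function perturb is hypothetical and is not the cellKey or τ-Argus API.

```python
from bisect import bisect_right
from itertools import accumulate

def perturb(i, cell_key, noise_values, probabilities):
    """Return the perturbed value j = i + v, where v is chosen by locating
    the cell key within the cumulative probabilities of the row for i."""
    cumulative = [round(c, 10) for c in accumulate(probabilities)]
    # Guard against a cell key at or above the last cumulative value.
    idx = min(bisect_right(cumulative, cell_key), len(noise_values) - 1)
    return i + noise_values[idx]

# Assumed row for i = 3: v = -1, 0, +1, +2.
noise_values = [-1, 0, 1, 2]
probabilities = [0.4208, 0.2776, 0.1824, 0.1192]

j = perturb(3, 0.73755586, noise_values, probabilities)
print(j)
```

Under this assumed row the cell key 0.73755586 lies between the cumulative values 0.6984 and 0.8808, which selects v = +1 and agrees with the slide's result j = 4.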

CKM implementation
- Part of τ-Argus
- R package cellKey:
    inp <- ck_create_input(…)
    new_table <- perturbTable(inp,dimList)
- All available through subpages of CellKey on https://github.com/sdcTools/protoTestCensus

Preliminary tests by project partners
- CKM implementations:
  - Test set of up to 10 million records
  - 2011 census hypercubes 9.1-9.4 and 11.1-11.2
  - Acceptable runtime
  - Record keys and p-table need to be generated outside of τ-Argus, to force the same record keys and p-table between runs
  - Easy to use

Preliminary tests by project partners
- TRS implementations:
  - Only tested with relatively small datasets
  - Very fast
  - Easy to use

Further tests
- Open for census teams to test the implementations (announced in Eurostat email on 17 December 2018)
- Already some feedback:
  - Some questions on how to install
  - Some questions on how to choose parameters

Next steps
- Extend risk and utility measures
- Guidelines for appropriate parameter values
- Communication of the method to the public (users):
  - What parameters / quality measures can be given?
  - How does the method protect census information?
- Extend the idea of CKM to more general situations

Next steps
- Need (more) feedback from census teams on usage
- Need feedback from users of protected tables
- Assess risk related to:
  - Grid cells versus administrative regions
  - European grid versus national grids
- Any help/input is welcome!