1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton.

Presentation transcript:

1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton

2 Topics:
- Introduction
- Disclosure risk
- SDC methods for protecting Census frequency tables
- Disclosure risk and data utility measures
- Description of table
- Risk-Utility analysis
- Summary of Analysis
- Discussion and future work

3 Introduction
Disclosure risk in Census tables:
- Need to protect many tables from one dataset containing population counts, which can be linked and differenced
- Need to consider output strategies for standard tables and web-based table-generating applications
- Need to interact with users and develop an SDC framework with a focus on both disclosure risk and data utility
Diagram labels: Identification; Individual Attribute Disclosure

4 Disclosure Risk
For Census tables:
- 1’s and 2’s in cells are disclosive, since these cells lead to identification
- 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)
Consideration of disclosure risk:
- Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
- Proportion of high-risk cells (1 or 2)
- Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K if all K cells are equal)
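The last two considerations on this slide can be illustrated with a minimal Python sketch. The function names and the choice of the natural logarithm are assumptions made for illustration; the slide does not specify either.

```python
import math

def proportion_small_cells(cells):
    """Share of cells holding a count of 1 or 2 (the high-risk cells)."""
    return sum(1 for c in cells if c in (1, 2)) / len(cells)

def entropy(cells):
    """Shannon entropy of the distribution of counts across cells.

    Assumption: natural log, so the maximum is log K for K equal cells.
    Zero cells contribute nothing to the sum.
    """
    total = sum(cells)
    return -sum((c / total) * math.log(c / total) for c in cells if c > 0)
```

A row with a single non-zero cell gives an entropy of 0 (everyone in the row shares one attribute), while a row with K equal cells gives log K.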

5 SDC Methods for Protecting Frequency Tables
1. Pre-tabular methods (special case of PRAM)
- Random Record Swapping
- Targeted Record Swapping
In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias.
Implementation:
- Randomly select p% of the households
- Draw a household matching on a set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables
- Can target records for swapping that are in high-risk cells of size 1 or 2
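The random swapping steps above might be sketched as follows. This is an illustrative sketch only, not the production method: the record layout (dictionaries with hypothetical `key` and `area` fields), the rule that each household is swapped at most once, and the requirement that the donor lives in a different area are all assumptions.

```python
import random

def random_record_swap(households, rate, rng):
    """Swap the geography of about rate * N households with a matched donor.

    households: list of dicts with a hypothetical "key" (matching variables)
    and "area" (geography); all other fields stay with the household.
    """
    swapped = [dict(h) for h in households]  # leave the input untouched
    n_select = round(rate * len(swapped))
    used = set()
    for i in rng.sample(range(len(swapped)), n_select):
        if i in used:
            continue
        # Donors: same key variables, different area, not yet swapped.
        donors = [j for j, h in enumerate(swapped)
                  if j != i and j not in used
                  and h["key"] == swapped[i]["key"]
                  and h["area"] != swapped[i]["area"]]
        if not donors:
            continue
        j = rng.choice(donors)
        swapped[i]["area"], swapped[j]["area"] = swapped[j]["area"], swapped[i]["area"]
        used.update((i, j))
    return swapped
```

Because donors match on the key variables, the joint counts of key variables by area are unchanged, while every other variable moves between areas along with its household's new geography.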

6 SDC Methods for Protecting Frequency Tables
2. Rounding
Unbiased random rounding:
- Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw
- Example: for unbiased random rounding to base 3:
  1 -> 0 w.p. 2/3
  1 -> 3 w.p. 1/3
  2 -> 0 w.p. 1/3
  2 -> 3 w.p. 2/3
- Expectation of the rounding perturbation is 0
- Margins and internal cells are rounded separately
Small cell rounding: internal cells are aggregated to obtain margins
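The base-3 example generalizes to any base: an entry with residual r is rounded up with probability r/base and down otherwise, so the expected rounded value equals the original count. A minimal sketch, with the function name and the explicit random generator argument being illustrative assumptions:

```python
import random

def random_round(count, base, rng):
    """Unbiased random rounding of a non-negative integer count."""
    residual = count % base
    if residual == 0:
        return count  # already a multiple of the base
    lower = count - residual
    # Round up with probability residual/base, down otherwise, so that
    # E[rounded] = lower + base * (residual / base) = count.
    return lower + base if rng.random() < residual / base else lower
```

For base 3, a count of 1 goes to 3 with probability 1/3 and to 0 with probability 2/3, matching the example on the slide.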

7 SDC Methods for Protecting Frequency Tables
2. Rounding (cont.)
Semi-controlled unbiased random rounding:
- Control the selection strategy for entries to round, i.e. use a “without replacement” strategy
Implementation:
- Calculate the expected number of entries to round up
- Draw a simple random sample without replacement (srswor) from among the entries and round up; the rest round down
- Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)
- Eliminates extra variance as a result of the rounding
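One way to read the implementation steps is sketched below. Two details are assumptions not stated on the slide: the sample is drawn separately within each residual class (so that each entry keeps its round-up probability of residual/base), and a non-integer expected number is rounded to the nearest integer.

```python
import random
from collections import defaultdict

def semi_controlled_round(counts, base, rng):
    """Round a list of counts to the base with a fixed number of round-ups."""
    rounded = [c - c % base for c in counts]  # start with everything rounded down
    by_residual = defaultdict(list)
    for i, c in enumerate(counts):
        if c % base:
            by_residual[c % base].append(i)
    for residual, idx in by_residual.items():
        # Expected number of round-ups in this residual class.
        n_up = round(len(idx) * residual / base)
        # Simple random sample without replacement of the entries to round up.
        for i in rng.sample(idx, n_up):
            rounded[i] += base
    return rounded
```

Applied to one row at a time, this fixes the number of round-ups in advance, which is what removes the extra variance of independent draws and keeps the row total close to the original.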

8 SDC Methods for Protecting Frequency Tables
2. Rounding (cont.)
Controlled rounding:
- Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)
- Uses linear programming techniques to round entries up or down; results are similar to deterministic rounding
- All rounded entries add up to rounded margins
- Method is not unbiased, and entries can jump a base

9 SDC Methods for Protecting Frequency Tables
3. Cell Suppression
Hypercube method (Giessing, 2004):
- Feature in Tau-Argus and suited for large tables
- Uses a heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell, with optimality conditions
Imputing suppressed cells for utility evaluation:
- Replace each suppressed cell by the average information loss in its row/column
- Example: two cells are suppressed in a row whose known margin is 500. The total of the non-suppressed cells is 400, so the suppressed total is 100 and each suppressed cell is replaced with a value of 50
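The imputation rule in the example amounts to spreading the suppressed total evenly over the suppressed cells of a row. A minimal sketch, assuming suppressed cells are marked with `None` (a representation chosen here for illustration):

```python
def impute_suppressed(row, margin):
    """Fill suppressed cells (None) with an equal share of the suppressed total.

    row: published cell values, with None for suppressed cells.
    margin: the known row total.
    """
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    published_total = sum(v for v in row if v is not None)
    fill = (margin - published_total) / n_suppressed
    return [fill if v is None else v for v in row]
```

With a margin of 500 and published cells summing to 400, two suppressed cells each receive (500 - 400) / 2 = 50, as in the example on the slide.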

10 Disclosure Risk Measures
Need to determine output strategies and SDC together:
- Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
- Web-based tables and flexible categories and geographies: need to add noise or round for every query
Disclosure risk measures:
- Proportion of high-risk cells (C_1 and C_2) not protected
- Percent true zeros out of total zeros
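The two risk measures could be computed from an original and a protected table along the following lines. The sketch assumes that a high-risk cell counts as "not protected" when its value is unchanged after the SDC method, and that a "true zero" is a zero in the protected table that was also zero originally; the slide does not spell out either definition.

```python
def unprotected_small_cell_share(orig, pert):
    """Proportion of original cells of size 1 or 2 left unchanged (assumed definition)."""
    small = [i for i, c in enumerate(orig) if c in (1, 2)]
    return sum(1 for i in small if pert[i] == orig[i]) / len(small)

def true_zero_share(orig, pert):
    """Proportion of zeros in the protected table that were zero originally."""
    zeros = [i for i, c in enumerate(pert) if c == 0]
    return sum(1 for i in zeros if orig[i] == 0) / len(zeros)
```

Both functions take flattened lists of cell counts in the same order. A lower true-zero share means more ambiguity about whether a published zero is real.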

11 Utility Measures
Distance metric - distortion to distributions (Gomatam and Karr, 2003):
- Internal cells: defined in terms of the table for each row k, the number of rows, and the cell frequency for each cell c [symbols and formula not preserved in the transcript]
- Margins: defined in terms of the margin M, the number of categories, and the number of persons in each category [formula not preserved in the transcript]
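The formulas on this slide did not survive in the transcript, so the following is only a sketch of two distance metrics commonly used for this purpose, the Hellinger distance and the average absolute distance per cell, each computed per row and then averaged over rows. Whether these are the exact metrics shown on the slide is an assumption.

```python
import math

def hellinger_distance(orig_row, pert_row):
    """Hellinger-type distance between original and perturbed cell counts of a row."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig_row, pert_row)))

def average_absolute_distance(orig_row, pert_row):
    """Mean absolute change per cell in a row."""
    return sum(abs(p - o) for o, p in zip(orig_row, pert_row)) / len(orig_row)

def mean_over_rows(metric, orig_table, pert_table):
    """Average a row-level metric across all rows of the table."""
    return sum(metric(o, p) for o, p in zip(orig_table, pert_table)) / len(orig_table)
```

The same row-level functions can be applied to a margin by passing the vector of category totals before and after protection.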

12 Utility Measures
Impact on Tests for Independence:
- Cramer’s V measure of association: $V = \sqrt{\dfrac{\chi^2}{n \min(r-1,\, c-1)}}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $r$ and $c$ are the numbers of rows and columns
- Same utility measure for entropy and the Pearson chi-square statistics
- Impact on log linear analysis for multi-dimensional tables, i.e. deviance
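Cramer's V can be computed directly from a two-way table of counts. In the sketch below, the standard definition of V is used; the percent relative change between the original and protected tables is an assumed form of the utility measure, since the slide's own formula is not preserved. The table is assumed to have at least two rows and two columns.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (obs - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    n = sum(sum(r) for r in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(pearson_chi_square(table) / (n * k))

def relative_change_in_v(orig, pert):
    """Percent change in Cramer's V after protection (assumed utility measure)."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)
```

A negative value indicates attenuation (movement towards independence); a positive value indicates that the protected table shows stronger association than the original.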

13 Utility Measures
“Between” Variance:
- Define a target proportion for a cell c in row k, and the overall proportion for that cell across all rows of the table
- The “between” variance is defined from these two proportions, and the utility measure is derived from it [formulas not preserved in the transcript]
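Since the slide's formulas are missing from the transcript, the sketch below shows one plausible form of the "between" variance: the unweighted variance of the row-level proportions for a target cell around the overall proportion, with divisor (number of rows - 1). The divisor, the lack of weighting, and the use of a perturbed-to-original ratio as the utility measure are all assumptions. Rows are assumed to be non-empty.

```python
def between_variance(table, c):
    """Variance of row proportions for column c around the overall proportion."""
    grand_total = sum(sum(row) for row in table)
    overall = sum(row[c] for row in table) / grand_total
    row_props = [row[c] / sum(row) for row in table]
    return sum((p - overall) ** 2 for p in row_props) / (len(table) - 1)

def between_variance_ratio(orig, pert, c):
    """Ratio of perturbed to original between variance (assumed utility measure)."""
    return between_variance(pert, c) / between_variance(orig, c)
```

A ratio below 1 means the row proportions have moved towards the overall proportion, which is the attenuation effect associated with swapping across geographies.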

14 Utility Measures
Variance of Cell Counts:
- The variance of the cell count is computed for each row k over its columns
- The average variance is taken across all rows
- The utility measure is derived from the average variance [formulas not preserved in the transcript]
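A sketch of this measure, under stated assumptions: the row variance is taken as the sample variance of the cell counts in the row (divisor number of columns - 1), and the utility measure is taken as the ratio of the average variance after protection to the average variance before. Neither detail is recoverable from the transcript.

```python
from statistics import variance

def average_row_variance(table):
    """Average, over rows, of the sample variance of the cell counts in each row."""
    return sum(variance(row) for row in table) / len(table)

def variance_ratio(orig, pert):
    """Ratio of perturbed to original average row variance (assumed utility measure)."""
    return average_row_variance(pert) / average_row_variance(orig)
```

A ratio below 1 corresponds to counts "flattening" out within rows; a ratio above 1 corresponds to "sharper" cell counts.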

15 Description of Table
2001 UK Census Table:
- Rows: Output Areas (1,487)
- Columns: Economic Activity (9) * Sex (2) * Long-Term Illness (2)
- Table includes 317,064 persons [age range missing] in 53,532 internal cells
- Average cell size: 5.92, although the table is skewed
- Number of zeros: 17,915 (33.5%)
- Number of small cells: 14,726 (27.5%)

16-20 [no text transcribed for these slides]

21 Summary of Analysis
- Rounding eliminates small cells, but need to protect against disclosure by differencing and linking when random rounding
- Rounding adds more ambiguity into the zero counts
- Random rounding to base 5 has the greatest impact on distortions to distribution
- Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells
- Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
- Cell suppression with simple imputation method has the highest utility (no perturbation on large cells) but is difficult to implement in a Census

22 Summary of Analysis
- High percent of true small cells in record swapping and less ambiguity of zero cells
- Record swapping has less distortion to internal cells than rounding; the distortion increases with higher swapping rates
- Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
- Column margins of the table have no distortion because of controls in swapping
- Combining record swapping with rounding results in more distortion but provides added protection

23 Summary of Analysis
Record swapping across geographies attenuates:
- loss of association (moving towards independence)
- counts “flattening” out
- proportions moving to the overall proportion
Attenuation increases with higher swapping rates
Targeted record swapping has less attenuation than random swapping
Rounding introduces more zeros:
- levels of association are higher
- cell counts “sharper”
Effects less severe for controlled rounding
Combining record swapping and rounding cancels out opposing effects, depending on the direction and magnitude of each procedure separately

24 Discussion
- Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data
- Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, i.e. ABS developed microdata keys for consistency in rounding
- Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
- Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)
- Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)

25 Natalie Shlomo