
WP 33 Information Loss Measures for Frequency Tables
Natalie Shlomo (University of Southampton, Office for National Statistics)
Caroline Young (University of Southampton, Office for National Statistics)

Topics of Discussion
1. Introduction
2. Methods for perturbing frequency tables containing whole population counts
3. Information loss measures for assessing the impact of SDC methods on utility and quality
4. Data description and definition of tables
5. Examples and analysis of results
6. Conclusions and future research

Introduction
1. Focus on frequency tables containing whole population counts: the UK Neighbourhood Statistics (NeSS) website, which disseminates small area statistics from census and administrative data
2. Tables are intentionally perturbed for statistical disclosure control (SDC), causing information loss
3. Develop quantitative information loss measures for choosing optimal SDC methods that preserve high utility in the tables
4. Information loss depends on the SDC method, the characteristics of the table and the use of the data

SDC Methods for Frequency Tables
SDC for frequency tables containing population counts:
- Small Cell Adjustments (SCA): random rounding to base 3 of small cells. The perturbation has a mean of zero and a variance of 2. Marginal totals are obtained by adding perturbed and non-perturbed cells.
- Full Random Rounding (RaRo): random rounding to base 3 for all entries. Same method as described above, applied after converting all entries to residuals of 3. Marginal totals are rounded separately, so tables are not additive. Utility can be improved by semi-controlling for marginal totals.
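The mean-zero, variance-2 property of random rounding to base 3 can be checked with a small sketch. It assumes the usual unbiased scheme (round up with probability residual/3, otherwise round down) and reads "small cells" as counts of 1 or 2. The function names are hypothetical and are not taken from the slides or from any SDC tool.

```python
import random
from fractions import Fraction

def random_round(count, base=3, rng=random):
    """Unbiased random rounding of a non-negative integer to a multiple of base."""
    residual = count % base
    if residual == 0:
        return count  # already a multiple of the base, left unchanged
    # Round up with probability residual/base so the expected change is zero.
    if rng.random() < residual / base:
        return count + (base - residual)
    return count - residual

def small_cell_adjust(count, base=3, rng=random):
    """Assumed reading of SCA: only cells with 0 < count < base are rounded."""
    return random_round(count, base, rng) if 0 < count < base else count

def perturbation_moments(residual, base=3):
    """Exact mean and variance of the perturbation for a given residual."""
    p_up = Fraction(residual, base)
    up, down = base - residual, -residual
    mean = p_up * up + (1 - p_up) * down
    variance = p_up * up**2 + (1 - p_up) * down**2 - mean**2
    return mean, variance

if __name__ == "__main__":
    for r in (1, 2):
        print(r, perturbation_moments(r))
```

For residuals 1 and 2 the exact moments come out as mean 0 and variance 2, which matches the figures quoted on the slide.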

SDC Methods for Frequency Tables
SDC for frequency tables containing population counts (cont.):
- Controlled Rounding (CR3): all entries are rounded to base 3 according to the solution of a linear programme, while ensuring that the aggregated rounded internal cells equal the rounded margins. Controlled rounding via Tau-Argus (standard tool for NeSS tables).
- Cell suppression: small cells (ones and twos) are suppressed, and secondary suppressions are found to protect against recalculation through the margins. Cell suppression via Tau-Argus and the hyper-cube method.

SDC Methods for Frequency Tables
SDC for frequency tables containing population counts (cont.):
Imputation methods for cell suppression:
- Margins are known, and the total of the suppressed cells is known
- Impute by the average of the total of the suppressed cells in each row (S-A)
- Impute by a weighted average of the total of the suppressed cells in each row, where the weights are the column totals (S-WA)
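The two imputation rules can be sketched as follows. The sketch assumes a two-way table in which suppressed cells are marked `None` and both margins are published. The function name `impute` and its arguments are hypothetical.

```python
def impute(published, row_totals, col_totals, weighted=False):
    """Fill suppressed cells (None) row by row.

    weighted=False: S-A, equal split of the row's suppressed total.
    weighted=True:  S-WA, split in proportion to the column totals.
    """
    result = []
    for row, row_total in zip(published, row_totals):
        missing = [j for j, value in enumerate(row) if value is None]
        # Total of the suppressed cells = row margin minus the published cells.
        remainder = row_total - sum(v for v in row if v is not None)
        if weighted:
            weight_sum = sum(col_totals[j] for j in missing)
            fill = {j: remainder * col_totals[j] / weight_sum for j in missing}
        else:
            fill = {j: remainder / len(missing) for j in missing}
        result.append([fill[j] if v is None else v for j, v in enumerate(row)])
    return result

if __name__ == "__main__":
    published = [[10, None, None], [5, 4, 6]]
    print(impute(published, [13, 15], [15, 5, 8]))                 # S-A
    print(impute(published, [13, 15], [15, 5, 8], weighted=True))  # S-WA
```

Both variants preserve the row totals by construction. Neither is guaranteed to reproduce the column totals.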

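Information Loss Measures
Measuring distortion to distributions:
- Distance metrics between original and perturbed cells in each geography (i.e., ward (NUTS5)), averaged across all wards
- Let $D^k$ be a table for ward $k$, $|D^k|$ the number of cells in the ward, $n_W$ the number of wards, and $D^k(c)$ the cell frequency for cell $c$:
Hellinger's Distance (HD): $HD(D_{pert}, D_{orig}) = \frac{1}{n_W} \sum_{k=1}^{n_W} \sqrt{\frac{1}{2} \sum_{c \in k} \left( \sqrt{D^k_{pert}(c)} - \sqrt{D^k_{orig}(c)} \right)^2}$
Relative Absolute Distance (RAD): $RAD(D_{pert}, D_{orig}) = \frac{1}{n_W} \sum_{k=1}^{n_W} \sum_{c \in k} \frac{|D^k_{pert}(c) - D^k_{orig}(c)|}{D^k_{orig}(c)}$
Average Absolute Distance per Cell (AAD): $AAD(D_{pert}, D_{orig}) = \frac{1}{n_W} \sum_{k=1}^{n_W} \frac{\sum_{c \in k} |D^k_{pert}(c) - D^k_{orig}(c)|}{|D^k|}$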
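A minimal sketch of the three distance metrics, computed per ward and then averaged across wards. It assumes the standard Hellinger form (square root of half the sum of squared differences of square roots). Skipping cells whose original count is zero in the relative distance is also an assumption, made to avoid division by zero. All function names are hypothetical.

```python
import math

def hellinger(orig, pert):
    """Hellinger-type distance between two vectors of cell counts."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig, pert)))

def relative_absolute(orig, pert):
    """Sum of |pert - orig| / orig; zero original cells skipped (assumption)."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o > 0)

def average_absolute(orig, pert):
    """Mean absolute change per cell."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)

def ward_average(metric, orig_wards, pert_wards):
    """Average a per-ward metric across all wards."""
    values = [metric(o, p) for o, p in zip(orig_wards, pert_wards)]
    return sum(values) / len(values)

if __name__ == "__main__":
    orig = [[4, 9], [1, 2]]   # two wards, two cells each
    pert = [[9, 9], [1, 2]]
    for metric in (hellinger, relative_absolute, average_absolute):
        print(metric.__name__, ward_average(metric, orig, pert))
```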

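Information Loss Measures
Aggregation of perturbed cells and effects on sub-totals:
- Users aggregate perturbed lower-level geographies to obtain non-standard geographies
- Calculate the sub-total $N_G(c) = \sum_{k \in G} D^k(c)$, where $G$ is the set of wards being aggregated
Impact on Tests for Independence:
- Cramér's V measure of association for an $R \times C$ table with $n$ observations: $CV = \sqrt{\frac{\chi^2 / n}{\min(R-1,\, C-1)}}$, where $\chi^2$ is the Pearson chi-square statistic
- Information loss measure: $RCV(D_{pert}, D_{orig}) = 100 \times \frac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})}$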
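The effect of perturbation on Cramér's V can be sketched as below, assuming a two-way table and the textbook definition of V from the Pearson chi-square statistic. The percent relative difference is signed, so positive values indicate an increase in association. The function names are hypothetical.

```python
import math

def cramers_v(table):
    """Cramér's V for a two-way table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / n / k)

def relative_change_pct(orig, pert):
    """Signed percent relative difference in Cramér's V after perturbation."""
    v_orig = cramers_v(orig)
    return 100 * (cramers_v(pert) - v_orig) / v_orig

if __name__ == "__main__":
    original = [[8, 2], [2, 8]]
    perturbed = [[10, 0], [0, 10]]
    print(cramers_v(original), relative_change_pct(original, perturbed))
```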

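Information Loss Measures
Impact on Variance:
- Little impact on the variance of cell counts
- Between variance of target variables for proportions in wards:
Let the proportion in a ward $k$ be $P^k(c) = \frac{D^k(c)}{\sum_{c' \in k} D^k(c')}$ and the overall proportion be $P(c) = \frac{\sum_{k=1}^{n_W} D^k(c)}{\sum_{k=1}^{n_W} \sum_{c' \in k} D^k(c')}$
Between variance: $BV(P) = \frac{1}{n_W - 1} \sum_{k=1}^{n_W} \left( P^k(c) - P(c) \right)^2$
Information loss measure: $BVR(P_{pert}, P_{orig}) = 100 \times \frac{BV(P_{pert}) - BV(P_{orig})}{BV(P_{orig})}$
- Mixed effects for this information loss measure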
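A minimal sketch of the between-ward variance of a target proportion and its relative change after perturbation. It assumes an unweighted variance with divisor (number of wards minus 1), taken around the overall proportion, which may differ from the authors' exact definition. The names are hypothetical.

```python
def between_variance(ward_tables, cell):
    """Variance across wards of the proportion of persons in one cell."""
    proportions = [table[cell] / sum(table) for table in ward_tables]
    overall = (sum(table[cell] for table in ward_tables)
               / sum(sum(table) for table in ward_tables))
    return (sum((p - overall) ** 2 for p in proportions)
            / (len(ward_tables) - 1))

def between_variance_change_pct(orig_wards, pert_wards, cell):
    """Signed percent change in the between variance after perturbation."""
    bv_orig = between_variance(orig_wards, cell)
    return 100 * (between_variance(pert_wards, cell) - bv_orig) / bv_orig

if __name__ == "__main__":
    orig = [[1, 9], [5, 5]]   # two wards, target category in position 0
    pert = [[0, 9], [6, 6]]   # rounded to base 3
    print(between_variance(orig, 0))
    print(between_variance_change_pct(orig, pert, 0))
```

Because rounding can push ward proportions either towards or away from the overall proportion, the sign of the change can differ from table to table.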

Information Loss Measures
Impact on Rank Correlations:
- Sort the original cell counts and define deciles $v_{orig}(k)$ for each ward $k$
- Repeat on the perturbed cell counts to obtain $v_{pert}(k)$
- Information loss measure: $RC = 100 \times \frac{\sum_{k=1}^{n_W} I\left( v_{orig}(k) \neq v_{pert}(k) \right)}{n_W}$, where $I$ is the indicator function and $n_W$ the number of wards
Log Linear Analysis:
- Information loss measure based on the ratio of the deviance (likelihood ratio test statistic) $L^2$ between the perturbed table and the original table for a given model: $DR = \frac{L^2(D_{pert})}{L^2(D_{orig})}$
- Need to also compare different models, since the model for the original table may differ from the model of the perturbed table
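Both measures can be sketched in a few lines. The decile assignment assumes ties are broken by position (a stable sort), which is one of several reasonable conventions. The deviance uses the independence model of a two-way table as a stand-in for "a given model", since that model has closed-form fitted values. All names are hypothetical.

```python
import math

def decile_groups(values):
    """Assign each value to a decile group 0..9 by its rank (stable sort)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, index in enumerate(order):
        groups[index] = rank * 10 // len(values)
    return groups

def pct_changed_decile(orig, pert):
    """Percentage of cells that fall in a different decile after perturbation."""
    changed = sum(a != b for a, b in zip(decile_groups(orig), decile_groups(pert)))
    return 100 * changed / len(orig)

def independence_deviance(table):
    """Likelihood ratio statistic of the independence model for a two-way table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # zero cells contribute nothing to the sum
                g2 += observed * math.log(observed * n / (row_totals[i] * col_totals[j]))
    return 2 * g2

def deviance_ratio(orig, pert):
    """Deviance of the perturbed table divided by that of the original table."""
    return independence_deviance(pert) / independence_deviance(orig)

if __name__ == "__main__":
    print(pct_changed_decile(list(range(10)), [1, 0, 2, 3, 4, 5, 6, 7, 8, 9]))
    print(deviance_ratio([[8, 2], [2, 8]], [[10, 0], [0, 10]]))
```

A ratio above 1 means the perturbed table fits the chosen model worse than the original table does.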

Data Used
Estimation Area Southwest England: 437,744 persons, 182,337 households, 70 wards (on average 6,250 persons to a ward)
The tables were the following:
- Tenure (3) × Age (7) × Health (4) × Ward
- Ethnicity (17) × Ward
- Economic Activity (9) × Sex (2) × Long-Term Illness (2) × Ward

Data Used
Table: Tenure | Ethnicity | Employment
Number of cells: 5,880 | 1,120 | 2,520
Average cell size and SE: 73.8 (3.3) | … (51.3) | … (6.6)
% of small cells: 12%, 9%, …
% of zero cells: 26% | 23% | 17%

[Figure] Distance Metrics by SDC method (CR3, RaRo, SA, SCA, SWA): (Left) Hellinger's Distance, (Centre) Relative Absolute Difference and (Right) Absolute Distance per Cell

[Figure] Box Plots: Difference between Perturbed and Original Subtotals of Three Consecutive Wards (ADs) PAs for Number of Unemployed Females with Long-Term Illness. X-axis: Perturbation Method (Internal cells)

[Figure] Change in Cramér's V Measure of Association after Perturbation. Axis: Percent Relative Difference (positive: increase in association; negative: decrease in association)

[Figure] Percentage of Cells in a Different Decile after Perturbation, Students with Long-Term Illness. Panels: Male Students (column 1), Female Students (column 2). Methods: CR3, RaRo, SA, SCA, SWA. Y-axis: Percentage of cells
N.B. The selected columns are very sparse, with approx. 70% of cells having counts < 4.

Log-Linear Models: Effect of Perturbation on Model Selection
Original Model:
Choose a better model?
Method: Deviance | Ratio (/Orig)
SCA: 5,…
RaRo: 5,…
CR3: 5,…
SA: 6,…
SWA: 4,…
Original: 4,486

Conclusions
Inconsistent results for some of the information loss measures (Cramér's V, between variance), showing that stochastic processes for SDC will have varying effects on the quality of the data
Emergence of some guidelines:
- skewed tables (one or two large columns and the rest small columns): prefer rounding to cell suppression
- uniform tables: less information loss due to SDC methods, so choose the method with the least changes to the table
- sparse tables: need to have benchmarked totals, so control round (if possible) or semi-control random round
Improve utility by:
- designing tables to avoid disclosive cells
- controlling for totals when random or small cell rounding
- giving clear guidance to users on how best to impute suppressed cells

Future Research
- Determine optimal methods of SDC depending on the use of the data and the characteristics of the table (skewed, sparse, uniform)
- Generalize and expand information loss measures for all types of statistical data (tabular and microdata) and statistical analysis
- Develop software to give to suppliers of data for assessing information loss under different SDC methods and choosing the optimal method, which gives high-utility tables