New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System
Laura Zayatz

Presentation transcript:

1 New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System
Laura Zayatz
U.S. Census Bureau
4600 Silver Hill Road
Washington, DC Fax

2 Legal Requirements and the Balancing Act
- Title 13, U.S. Code and the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002
- Publish as much valuable statistical information as possible without violating the confidentiality of respondents
- Preserve data utility while avoiding disclosure

3 This Presentation
1. Noise for Tabular Magnitude Data
2. Synthetic Tabular Frequency and Microdata
3. Remote Microdata Analysis System

4 This Presentation
A. Introduction to the method
B. What happened with real data
C. How we altered the method
D. Current uses of the method on real data products

5 Noise for Tabular Magnitude Data: Introduction to the Method
- Perturb each establishment’s underlying microdata by a small amount, e.g. 10%, randomly up or down, prior to table creation
- Sensitive cells needing protection end up being changed by a large amount
- Non-sensitive cells end up being changed by a small amount
- Simple procedure: values can be shown for all cells, additivity is guaranteed, and there are no coordination problems for related (overlapping) tables

6 Noise for Tabular Magnitude Data: Introduction to the Method
- To perturb an establishment’s value by about 10%, multiply that value by a random number close to 1.1 or 0.9
- The distribution of multipliers must be symmetric about 1 so that no bias is introduced
- All establishments within the same company are perturbed in the same direction
- The increase in variance can be incorporated into published coefficients of variation
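A minimal sketch of the multiplier scheme described on slides 5 and 6, not the Census Bureau's production code. The function name `add_noise`, the spread `delta`, the uniform draw around 1.1 and 0.9, and the record layout are assumptions made for illustration; the slides only require a distribution symmetric about 1 and a shared direction within each company.

```python
import random
from collections import defaultdict

def add_noise(establishments, center=0.10, delta=0.02, seed=None):
    """Perturb establishment values and tabulate noisy cell totals.

    establishments: list of dicts with keys 'company', 'cell', 'value'.
    Each company gets one random direction (+1 or -1). Each establishment
    gets a multiplier 1 + direction * m, with m drawn uniformly from
    [center - delta, center + delta] (an assumed distribution).
    """
    rng = random.Random(seed)
    direction = {}               # company -> +1 or -1, fixed per company
    noisy = []
    totals = defaultdict(float)  # cell -> sum of noisy values
    for est in establishments:
        company = est["company"]
        if company not in direction:
            direction[company] = rng.choice((1, -1))
        magnitude = rng.uniform(center - delta, center + delta)
        multiplier = 1 + direction[company] * magnitude
        noisy_value = est["value"] * multiplier
        noisy.append({**est, "multiplier": multiplier, "noisy_value": noisy_value})
        totals[est["cell"]] += noisy_value
    return noisy, dict(totals)

# Hypothetical data: cell X is dominated by company A.
data = [
    {"company": "A", "cell": "X", "value": 1000.0},
    {"company": "A", "cell": "Y", "value": 500.0},
    {"company": "B", "cell": "X", "value": 20.0},
    {"company": "C", "cell": "X", "value": 30.0},
]
noisy, totals = add_noise(data, seed=1)
```

Because the table is built from the perturbed microdata, every cell total is simply the sum of its noisy contributions, which is why additivity holds. A cell dominated by one establishment moves by close to 10%, while perturbations from many similar-sized contributors in different companies tend to partially cancel.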

7 Noise for Tabular Magnitude Data: What Happened with Real Data
- Because of randomness, the method can occasionally add excessive amounts of noise to some non-sensitive cells (a problem that cell suppression avoids for non-suppressed cells and controlled tabular adjustment can minimize)
- Is there anything we can do to avoid this problem or at least improve results?

8 Noise for Tabular Magnitude Data: What Happened with Real Data
- US Census Bureau magnitude data is almost always published in rounded form (integer form representing thousands or millions)
- Noise changes individual response values by a small percentage
- Rounding can remove the effect of noise on small response values
- Is that OK (does rounding provide enough protection), or should additional steps be taken to protect such small values?
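A small worked example of the rounding concern, using made-up figures. The helper `published_value`, the round-half-up rule, and the dollar amounts are assumptions for illustration only.

```python
def published_value(dollars, unit=1000):
    """Round a non-negative dollar amount to the publication unit (half up)."""
    return int(dollars / unit + 0.5)

small = 1_200        # hypothetical small response, in dollars
large = 1_200_000    # hypothetical large response, in dollars

# Publish the true value and the value perturbed 10% up and 10% down.
small_published = [published_value(small * m) for m in (1.0, 1.1, 0.9)]
large_published = [published_value(large * m) for m in (1.0, 1.1, 0.9)]
```

For the small response all three versions publish as 1 (thousand), so the noise is invisible after rounding; for the large response the published values are 1200, 1320, and 1080, so the noise survives.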

9 Noise for Tabular Magnitude Data: How we Altered the Method
- Balanced Noise (see Massell and Funk)
- Experiment and choose one or more tables to balance; quite often a lower-level table (in the hierarchy) is a good choice and has a trickle-up effect
- Use random noise for establishments in sensitive cells and in companies represented in more than one cell
- For the others, use a sort to choose noise directions that minimize the change to non-sensitive cells
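The slide does not spell out how the sort is used, so the sketch below makes an assumption: within one non-sensitive cell, the establishments whose direction is free are processed from largest to smallest, and each one is pushed against the running net change. It illustrates the balancing idea only and is not the Massell and Funk algorithm itself; the fixed 10% rate and the function name are hypothetical.

```python
def balanced_directions(values, rate=0.10):
    """Choose +1/-1 per value so the net change to the cell total stays small.

    Greedy rule (assumed for illustration): visit values in descending order
    and perturb each one in the direction opposite to the running net change.
    Returns (directions in the original order, final net change).
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    directions = [0] * len(values)
    net = 0.0
    for i in order:
        d = -1 if net > 0 else 1
        directions[i] = d
        net += d * rate * values[i]
    return directions, net

# Hypothetical cell with four single-cell establishments.
values = [400.0, 300.0, 200.0, 100.0]
directions, net = balanced_directions(values)
```

If all four establishments were perturbed upward, the cell total would move by 100 (10%). The balanced assignment here is `[1, -1, -1, 1]`, which leaves the total essentially unchanged while every individual value is still perturbed.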

10 Noise for Tabular Magnitude Data: How we Altered the Method
- Currently testing various modifications to standard rounding techniques
- Options include rounding the underlying microdata values and rounding the tabulated cell values
- Want to ensure standard rounding does not undo the protection provided by the noise
- Ceiling/floor techniques seem to work well, but results differ for different data products

11 Noise for Tabular Magnitude Data: Current Uses on Real Data Products
- Done: Quarterly Workforce Indicators; Non-Employer Data Products
- Near Future: Commodity Flow Survey; Census of Island Areas; Survey of Business Owners
- Under Study: County Business Patterns

12 Synthetic Tabular Frequency and Microdata: Introduction to the Method
- Posterior predictive models generate synthetic data with many of the same statistical properties as the original data
- Sequential regression imputation, one variable in one record at a time (blank and impute variables causing a disclosure risk for a given record)
- Full or partial synthesis, demographic or economic, tables or microdata, one or more implicates
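To make "posterior predictive" concrete, here is a minimal sketch of partial synthesis for one sensitive variable. It assumes a simple linear regression of that variable on a single predictor with a noninformative prior; the variable names (`age`, `income`), the model, and the choice of at-risk records are all assumptions, since the slide does not specify the models used in practice. A sequential scheme would repeat a step like this variable by variable.

```python
import math
import random

def synthesize(x, y, at_risk, seed=None):
    """Replace y[i] for at-risk records with posterior predictive draws
    from the regression y ~ x (noninformative prior). Needs len(x) > 2."""
    rng = random.Random(seed)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - y_bar - slope * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))

    # Draw the parameters from their posterior.
    # sigma^2 ~ RSS / chi-square(n - 2); chi-square(k) is Gamma(k/2, scale 2).
    sigma = math.sqrt(rss / rng.gammavariate((n - 2) / 2, 2))
    # With x centered, intercept and slope are independent given sigma.
    alpha_draw = rng.gauss(y_bar, sigma / math.sqrt(n))
    slope_draw = rng.gauss(slope, sigma / math.sqrt(sxx))

    # Draw synthetic values only for the records flagged as at risk.
    synthetic = list(y)
    for i in at_risk:
        mean_i = alpha_draw + slope_draw * (x[i] - x_bar)
        synthetic[i] = rng.gauss(mean_i, sigma)
    return synthetic

# Hypothetical data: income in thousands; record 5 is flagged as at risk.
age = [25, 32, 41, 47, 53, 60, 38, 29]
income = [31.0, 40.0, 52.0, 58.0, 66.0, 71.0, 47.0, 35.0]
synthetic_income = synthesize(age, income, at_risk=[5], seed=7)
```

Each call produces one implicate; calling it again with a different seed gives another. Passing every index in `at_risk` would correspond to full synthesis of this variable.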

13 Synthetic Tabular Frequency and Microdata: What Happened with Real Data
- Problems with relationships between variables within a data set
- Records of households are linked to records of all people within the household (father, mother, son, daughter, etc.)
- Structurally missing (blank) values because of skip patterns in the survey instrument
- Examples: people under age 15 cannot have income; a mother cannot be only 6 years older than her child

14 Synthetic Tabular Frequency and Microdata: How we Altered the Method
- Impute some of the structurally missing values, but then restore them to missing for standard imputation and edits
- For one product, an additional layer of programming became a nine-level collection of parent-child relationships to enforce all constraints
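The "fill, then restore to missing" step can be sketched as follows for the under-15 income rule from the previous slide. The record layout, the placeholder value, and the `toy_synthesizer` stand-in are hypothetical, and the sketch assumes age itself is not synthesized.

```python
def fill_then_restore(records, synthesizer, placeholder=0.0, min_income_age=15):
    """Fill structurally missing income so a model can run on complete data,
    then set those values back to missing (None) in the synthetic output."""
    missing = [
        r["age"] < min_income_age and r["income"] is None for r in records
    ]
    # Temporarily complete the data; the originals are left untouched.
    completed = [
        {**r, "income": placeholder} if m else dict(r)
        for r, m in zip(records, missing)
    ]
    synthetic = synthesizer(completed)
    # Restore the skip pattern so later edits and imputation see it as blank.
    return [
        {**r, "income": None} if m else r for r, m in zip(synthetic, missing)
    ]

def toy_synthesizer(records):
    """Stand-in for a real model: shifts every income by 1.0."""
    return [{**r, "income": r["income"] + 1.0} for r in records]

people = [
    {"age": 9, "income": None},    # structurally missing: under 15
    {"age": 40, "income": 52.0},
]
result = fill_then_restore(people, toy_synthesizer)
```

The child's income comes back as `None` while the adult's value is the synthesizer's output, so the skip pattern of the survey instrument survives synthesis.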

15 Synthetic Tabular Frequency and Microdata: Current Uses on Real Data Products
- Done: SSA Earnings and CB SIPP Data; “On The Map”; ACS Group Quarters Data
- Under Study: ACS Household Data; Special Tabs for Veterans

16 Remote Microdata Analysis System: Introduction to the Method
- The Advanced Query System allows users to generate tables from Census 2000 data
- A request passes through 2 firewalls to previously swapped, recoded, and topcoded files; tables are generated and electronically reviewed for disclosure problems; if none are found, the results are sent to the user
- Can we extend this to data from demographic surveys and other types of statistical analyses?

17 Remote Microdata Analysis System: What Happened with Real Data
- Enabled or disabled system? We chose enabled
- Disabled is more flexible for the user but may require “babysitting”
- Enabled is more restricted in types of analyses but can be available to more people without strict monitoring
- Users choose from lists of data sets, geographic areas, universes, analyses, and variables (system writes the code)

18 Remote Microdata Analysis System: How we Altered the Method
- In looking for disclosure problems, we first focused on the model statements, but later realized the need to look at the underlying data tables (marginal totals of size 1 in particular) in various types of analyses
- Working on methods to best identify “cut points” in the detail of short, medium, and long lists of continuous variables that need to be categorized
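A check on marginal totals of the kind mentioned above might look like the sketch below. It assumes the data underlying an analysis can be summarized as a two-way table of counts and that "size 1" means a marginal total equal to 1; the function name, the table layout, and the `threshold` parameter are hypothetical and not taken from the system itself.

```python
def unsafe_margins(table, threshold=1):
    """Return (row labels, column labels) whose marginal totals are
    nonzero but no larger than `threshold`.

    table: dict mapping (row_label, col_label) -> count.
    """
    rows, cols = {}, {}
    for (r, c), n in table.items():
        rows[r] = rows.get(r, 0) + n
        cols[c] = cols.get(c, 0) + n

    def flagged(totals):
        return sorted(k for k, v in totals.items() if 0 < v <= threshold)

    return flagged(rows), flagged(cols)

# Hypothetical table underlying a requested analysis.
table = {
    ("owner", "urban"): 12,
    ("owner", "rural"): 7,
    ("renter", "urban"): 1,
    ("renter", "rural"): 0,
}
bad_rows, bad_cols = unsafe_margins(table)
```

Here the "renter" row has a marginal total of 1 and is flagged, even though the model statement that requested the analysis would look harmless on its own.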

19 Remote Microdata Analysis System: Current Uses on Real Data Products
- Done: Advanced Query System available to Census Bureau State Data Centers and Census Information Centers and researchers who request an account
- Under Study: Extended Microdata Analysis System being tested with American Community Survey and Current Population Survey

20 Conclusion
- Many recent developments in disclosure avoidance at the US Census Bureau
- Using the noise technique for several tabular magnitude data products
- Releasing several products based on partially synthetic data
- AQS is being used widely and work continues on the MAS
- It takes time, but it is worth the effort