1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau BTS Confidentiality Seminar Series, April.

Slides:



Advertisements
Similar presentations
Estimating Identification Risks for Microdata Jerome P. Reiter Institute of Statistics and Decision Sciences Duke University, Durham NC, USA.
Advertisements

Statistical Disclosure Control (SDC) for 2011 Census Progress Update Keith Spicer – ONS SDC Methodology 23 April 2009.
Statistical Disclosure Control (SDC) at SURS Andreja Smukavec General Methodology and Standards Sector.
1 State & County Characteristics: Overview The basics State –The general method –July 1, 2000 beginning population –Domestic migration IRS pre-processing.
Confidentiality risks of releasing measures of data quality Jerry Reiter Department of Statistical Science Duke University
National Center for Health Statistics DCC CENTERS FOR DISEASE CONTROL AND PREVENTION Changes in Race Differentials: The Impact of the New OMB Standards.
Exploring Administrative Records Use for Race and Hispanic Origin Item Non-Response Sonya Rastogi, Leticia Fernandez, James Noon, Ellen Zapata and Renuka.
What are Wage Records? Wage records are an administrative database used to calculate Unemployment Insurance benefits for employees who have been laid-off.
Results from the 2010 Census Race and Hispanic Origin Alternative Questionnaire Experiment Nicholas Jones  Roberto Ramirez U.S. Census Bureau Presentation.
11 ACS Public Use Microdata Samples of 2005 and 2006 – How to Use the Replicate Weights B. Dale Garrett and Michael Starsinic U.S. Census Bureau AAPOR.
Access routes to 2001 UK Census Microdata: Issues and Solutions Jo Wathan SARs support Unit, CCSR University of Manchester, UK
CE Overview Jay T. Ryan Chief, Division of Consumer Expenditure Survey December 8, 2010.
1 U.S. Census Bureau Data Availability for Geographic Areas March 25, 2008.
Methods of Geographical Perturbation for Disclosure Control Division of Social Statistics And Department of Geography Caroline Young Supervised jointly.
Recent Advances In Confidentiality Protection – Synthetic Data John M. Abowd April 2007.
Public Aggregate Reporting – DHCS Business Reports Overview
1 The American Community Survey (ACS) 2005 Data Release.
REVIEW OF VITAL STATISTICS Brady E. Hamilton, Ph.D. Reproductive Statistics Branch and Elizabeth Arias, Ph.D. Mortality Statistics Branch Division of Vital.
1 Commuting and Migration Data Products from the American Community Survey Journey-to-Work and Migration Statistics Branch U.S. Census Bureau State Data.
U.S. Census Bureau Demographic Census 2000 July 8, 2003.
VITAL STATISTICS ANALYSIS RESULTS FENGQING (ZOE) ZHANG COMMUNITY HEALTH INTERN 2012.
2014 SDC and CIC Annual Training Conference: Accessing ACS PUMS Data Tim Gilbert U.S. Census Bureau April 2, 2014.
Orange County Race and Ethnic Profile: Social and Economic Data, 2000 Social and Economic Data, 2000 Eui-Young Yu (Cal State L.A. and KAC-CIC) and Peter.
11 The American Community Survey Steve Murdock, Ph.D. Director, Hobby Center for the Study of Texas Rice University.
Household Surveys ACS – CPS - AHS INFO 7470 / ECON 8500 Warren A. Brown University of Georgia February 22,
American Community Survey Presented at the Meeting of the National Neighborhood Indicators Partnership Susan Schechter May
Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck Jennifer Huckett Iowa State University June 20, 2007.
Introduction to PUMS: Public Use Microdata Sample Presentation is based on material a. borrowed from Chuck Purvis MTC, Oakland, California b. PUMS 2000.
The Application of the Concept of Uniqueness for Creating Public Use Microdata Files Jay J. Kim, U.S. National Center for Health Statistics Dong M. Jeong,
C2ER 52 nd Annual Conference & LMI Training Institute Annual Forum Regional Socioeconomic Statistics Update on U.S. Census Bureau Programs June 8, 2012.
National Projections Program, 2005 Population Projections Branch Population Division U.S. Census Bureau.
Data on: Race and Hispanic Origin Data on: Race and Hispanic Origin Census 2000:
1 Overview of the 2010 Census Two Hundred Twenty Years and Counting …
Introduction to GIS. » Should the census collect racial data at all?
Ed Christopher Resource Center near Chicago Federal Highway Administration Governors Drive Olympia Fields, IL 60461
Using IPUMS.org Katie Genadek Minnesota Population Center University of Minnesota The IPUMS projects are funded by the National Science.
Introduction to the Public Use Microdata Sample (PUMS) File from the American Community Survey Updated February 2013.
Introduction to the US Census. Historical Context Article I, Section 2 of the U.S. Constitution adopted in 1787 approved that Representatives and Taxes.
Using the ACS: Issues with studying small areas and change over time Presented to Association of Public Data Users January 20, 2011.
American Community Survey Overview September 4, 2013 Tim Gilbert American Community Survey Office.
HHS Data Standards for Race, Ethnicity, Sex, Primary Language and Disability Status Rashida Dorsey, PhD, MPH Department of Health and Human Services Office.
1 Projecting Race and Hispanic Origin for the U.S. Population and an Examination of the Impact of Net International Migration David G. Waddington Victoria.
American Community Survey Maryland State Data Center Affiliate Meeting September 16, 2010.
1 Language Data from the American Community Survey.
1 New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System Laura Zayatz.
The use of protected microdata in tabulation: case of SDC-methods microaggregation and PRAM Researcher Janika Konnu Manchester, United Kingdom December.
American Community Survey “It Don’t Come Easy”, Ringo Starr Jane Traynham Maryland State Data Center March 15, 2011.
WP 19 Assessment of Statistical Disclosure Control Methods for the 2001 UK Census Natalie Shlomo University of Southampton Office for National Statistics.
Disclosure Avoidance at Statistics Canada INFO747 Session on Confidentiality Protection April 19, 2007 Jean-Louis Tambay, Statistics Canada
1 IPAM 2010 Privacy Protection from Sampling and Perturbation in Surveys Natalie Shlomo and Chris Skinner Southampton Statistical Sciences Research Institute.
Disclosure Limitation in Microdata with Multiple Imputation Jerry Reiter Institute of Statistics and Decision Sciences Duke University.
Determinants of Race Reporting by Hispanics in a National Health Survey Jacqueline Wilson Lucas, M.P.H. Division of Health Interview Statistics Elizabeth.
Creating Open Data whilst maintaining confidentiality Philip Lowthian, Caroline Tudor Office for National Statistics 1.
The U.S. Census Bureau’s Postcensal and Intercensal Population Estimates Alexa Jones-Puthoff Population Division National Conference on Health Statistics.
AASHTO & FHWA Appeal re: DRB “rule of three” decision before the Data Stewardship Executive Policy Committee 8/28/2008.
ETHNICITIES CHAPTER 7 | p Feb 17 – 27.
CENSUS 2000 DATA AND HOW WE CAN USE IT Claudette Bennett Population Division U.S. Census Bureau January 2003.
ASDC Annual Meeting November 10, 2011 Kathleen Gabler Socioeconomic Research Associate Center for Business and Economic Research Culverhouse College of.
Map of Pacific Islands California: 347,501 NHPI Second fastest growing racial group- 29% growth between Overview CA Demographics By 2060, NHPI.
CENSUS 2000 DATA ON RACE, HISPANIC ORIGIN, AND ANCESTRY Nancy M. Gordon Associate Director for Demographic Programs U.S. Census Bureau March 2001.
Expanding the Role of Synthetic Data at the U.S. Census Bureau 59 th ISI World Statistics Congress August 28 th, 2013 By Ron S. Jarmin U.S. Census Bureau.
Coding and Editing Multiple Race and Ethnicity
Census 2010: Data on Race and Ethnicity
Data Confidentiality and the Common Good.
The National Health Interview Survey: Celebrating 50 years of success
Begin with watching video above.
Federal Statistical Office Germany Research Data Centre
Imputation as a Practical Alternative to Data Swapping
Status Update on 2020 Census Data Products Plan
Presentation transcript:

1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau BTS Confidentiality Seminar Series, April 2003

2 Microdata ëRespondent level data ëEach record represents one responding household or person ëUsually demographic data

3 Microdata Disclosure Risk ëHigh visibility records ëLinkable external files

4 Addressing Risk ëCan we measure it? (file level – function of ever-changing external database environment) ëGetting specific (record level) ëReduce the amount of information ëDistort the information

5 Making the Data Safe ëRemove obvious identifiers ëSet a geographic threshold ëKnow external data ëKnow your data ëLook for problems

6 Look for Problems: External Files ëGet to know external files State and local government agencies Other federal agencies Private firms ($) ëPerform reidentification experiments Hunt and peck Data fusion

7 Look for Problems: Your File ëSpecial uniques 2 way cross tabulations of all variables Work your way up ëVariables that you know exist on external databases (even if you cannot access them)

8 Be Careful with …… ëGeographic detail ëContextual variables/Sampling information ëEstablishment data ëLongitudinal data (life events often generate records on external files) ëAdministrative data ëData that will also be published in the form of tables at smaller geographic levels

9 When You Find a Problem….. ëReduce risk by reducing the amount of information ëReduce risk by perturbing the data

10 Reducing the Amount of Information ëDelete a variable ëRecode a categorical variable into larger categories (perhaps using thresholds) ëRecode a continuous variable into categories ëRound continuous variables ëUse top and bottom codes (provide means,…) ëUse local suppression ëEnlarge geographic areas

11 Perturbing the Data ëNoise addition ëSwapping ëRank Swapping ëBlanking and imputation ëMicroaggregation ëMultiple imputation/modeling to generate synthetic data

12 Census 2000 Public Use Microdata Samples (PUMS) --- Decrease in Detail ëInternal reidentification study looking at all products (microdata and tables) from 1990 census ëIncreased concern about external data linking capabilities

13 The Files ë17% of households receive the long form ëMaximum of 6% of the population appears on PUMS ë2 mutually exclusive files ë5% state file, PUMAs contain 100,000 people, less variable detail, 2003 release ë1% characteristics file, SuperPUMAs contain 400,000 people, same variable detail, 2003 release

14 Changes for the 5% and the 1% Files ëRound all dollar amounts $1-7 = $4 $8-$999 round to nearest $10 $1,000-$49,000 round to nearest $100 $50,000+ round to nearest $1,000

15 Changes for the 5% and the 1% Files ëRound departure time in 30-minute intervals in 10-minute intervals in 5 minute intervals in 10-minute intervals

16 Changes for the 5% and the 1% Files ëNoise added to ages of people in households with 10 or more people ëAges must stay within certain groupings ëBlank original ages ëNew ages generated from a given distribution of ages in that grouping

17 Small Amount of Additional Noise ëCertain characteristics of small, unusual subgroups of people and housing units ëVulnerable to disclosure via publicly available datasets ëNot disclosing the details ëResulted from reidentification studies

18 Changes for the 5% File Only Categories Must Have 10,000 People Nationwide ëLanguage ëAncestry ëBirthplace ëTribe2723 ëHispanic O.4829 ëOccupation506443

19 Multiple Races ëWhite ëBlack ëAmerican Indian or Alaska Native ëAsian: Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese, Other Asian ëNative Hawaiian or Pacific Islander: Native Hawaiian, Guamanian or Chamorro, Samoan, Other Pacific Islander ëSome Other Race

20 Race on the PUMS ëYes/No for each major race grouping (6) ëDesignation of 1 of about 70 single race groups (includes some write-ins) ëDesignation of multiple race groups ëAt least 10,000 (5% file) and 8,000 (1% file) people nationwide

21 Data Swapping ë5% and 1% files ëSwapping of “special uniques” --- high risk records ëPairs of households that agree on certain demographic characteristics but are in difference geographic areas are swapped across PUMAs and SuperPUMAs

22 Issues that Have Not Changed ëTopcoding (half-percent/three-percent rule) ëProperty taxes (categorization)

23 Conclusion ëKnow your data ëKnow external data ëProactively look for problems ëIf possible, if you find a disclosure problem, let users help you choose the best method for reducing risk of disclosure

24 Thank You for Coming ë