Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau BTS Confidentiality Seminar Series, April.

Similar presentations


Presentation on theme: "1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau BTS Confidentiality Seminar Series, April."— Presentation transcript:

1 1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau laura.zayatz@census.gov BTS Confidentiality Seminar Series, April 2003

2 2 Microdata ëRespondent level data ëEach record represents one responding household or person ëUsually demographic data

3 3 Microdata Disclosure Risk ëHigh visibility records ëLinkable external files

4 4 Addressing Risk ëCan we measure it? (file level – function of ever-changing external database environment) ëGetting specific (record level) ëReduce the amount of information ëDistort the information

5 5 Making the Data Safe ëRemove obvious identifiers ëSet a geographic threshold ëKnow external data ëKnow your data ëLook for problems

6 6 Look for Problems: External Files ëGet to know external files State and local government agencies Other federal agencies Private firms ($) ëPerform reidentification experiments Hunt and peck Data fusion

7 7 Look for Problems: Your File ëSpecial uniques 2 way cross tabulations of all variables Work your way up ëVariables that you know exist on external databases (even if you cannot access them)

8 8 Be Careful with …… ëGeographic detail ëContextual variables/Sampling information ëEstablishment data ëLongitudinal data (life events often generate records on external files) ëAdministrative data ëData that will also be published in the form of tables at smaller geographic levels

9 9 When You Find a Problem….. ëReduce risk by reducing the amount of information ëReduce risk by perturbing the data

10 10 Reducing the Amount of Information ëDelete a variable ëRecode a categorical variable into larger categories (perhaps using thresholds) ëRecode a continuous variable into categories ëRound continuous variables ëUse top and bottom codes (provide means,…) ëUse local suppression ëEnlarge geographic areas

11 11 Perturbing the Data ëNoise addition ëSwapping ëRank Swapping ëBlanking and imputation ëMicroaggregation ëMultiple imputation/modeling to generate synthetic data

12 12 Census 2000 Public Use Microdata Samples (PUMS) --- Decrease in Detail ëInternal reidentification study looking at all products (microdata and tables) from 1990 census ëIncreased concern about external data linking capabilities

13 13 The Files ë17% of households receive the long form ëMaximum of 6% of the population appears on PUMS ë2 mutually exclusive files ë5% state file, PUMAs contain 100,000 people, less variable detail, 2003 release ë1% characteristics file, SuperPUMAs contain 400,000 people, same variable detail, 2003 release

14 14 Changes for the 5% and the 1% Files ëRound all dollar amounts $1-7 = $4 $8-$999 round to nearest $10 $1,000-$49,000 round to nearest $100 $50,000+ round to nearest $1,000

15 15 Changes for the 5% and the 1% Files ëRound departure time 2400-0259 in 30-minute intervals 0300-0459 in 10-minute intervals 0500-1059 in 5 minute intervals 1100-2359 in 10-minute intervals

16 16 Changes for the 5% and the 1% Files ëNoise added to ages of people in households with 10 or more people ëAges must stay within certain groupings ëBlank original ages ëNew ages generated from a given distribution of ages in that grouping

17 17 Small Amount of Additional Noise ëCertain characteristics of small, unusual subgroups of people and housing units ëVulnerable to disclosure via publicly available datasets ëNot disclosing the details ëResulted from reidentification studies

18 18 Changes for the 5% File Only Categories Must Have 10,000 People Nationwide 1990 2000 ëLanguage 30574 ëAncestry292143 ëBirthplace312167 ëTribe2723 ëHispanic O.4829 ëOccupation506443

19 19 Multiple Races ëWhite ëBlack ëAmerican Indian or Alaska Native ëAsian: Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese, Other Asian ëNative Hawaiian or Pacific Islander: Native Hawaiian, Guamanian or Chamorro, Samoan, Other Pacific Islander ëSome Other Race

20 20 Race on the PUMS ëYes/No for each major race grouping (6) ëDesignation of 1 of about 70 single race groups (includes some write-ins) ëDesignation of multiple race groups ëAt least 10,000 (5% file) and 8,000 (1% file) people nationwide

21 21 Data Swapping ë5% and 1% files ëSwapping of “special uniques” --- high risk records ëPairs of households that agree on certain demographic characteristics but are in difference geographic areas are swapped across PUMAs and SuperPUMAs

22 22 Issues that Have Not Changed ëTopcoding (half-percent/three-percent rule) ëProperty taxes (categorization)

23 23 Conclusion ëKnow your data ëKnow external data ëProactively look for problems ëIf possible, if you find a disclosure problem, let users help you choose the best method for reducing risk of disclosure

24 24 Thank You for Coming ëlaura.zayatz@census.gov ë301-457-4955


Download ppt "1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau BTS Confidentiality Seminar Series, April."

Similar presentations


Ads by Google