SAFE – a method for anonymising the German Census


SAFE – a method for anonymising the German Census 26-28 October 2011 Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality (Tarragona, Spain)

SAFE - a pretabular method (1)
SAFE creates an anonymous micro data file.
What are anonymous micro data? Micro data are anonymous (confidential) “if they cannot be matched to the concerned person” (German law of Statistics §16).
Identification can be avoided by making the records in the micro data file ambiguous.
SAFE micro data files are confidential because the records in the file are ambiguous.
26-28 October 2011 State Statistical Institute Berlin-Brandenburg

SAFE - a pretabular method (2)
Advantages:
- Solution in one step
- Analyses can re-identify only records in the anonymous micro data file
- All analyses are based on the same source and are consistent
- No cell suppression (primary or secondary) necessary
Disadvantages:
- Change from cell suppression to uncertainty in cell values
- New interpretation of tables
- No easy extension of analyses to tables that are not controlled
- Computational effort may be relatively high

Idea of SAFE - solution (1)
[Figure: two panels, "original" and "anonymous", showing persons by age group (0-5 to 96 and over), sex (male, female) and region (1 to 4).]
In the anonymous data set every statistical object exists at least three times identically (triplets).
Because every tabulated combination consists only of triplets, data attacks (re-identification) always lead to more than two objects.
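The triplet property can be checked mechanically. The following is a minimal sketch, not the SAFE program itself: it assumes records are tuples of categorical values, and the helper names `smallest_group` and `is_triplet_safe` as well as the toy records are made up for illustration.

```python
from collections import Counter

def smallest_group(records):
    """Return the size of the smallest group of identical records."""
    counts = Counter(tuple(r) for r in records)
    return min(counts.values())

def is_triplet_safe(records, k=3):
    """True if every combination that occurs is present at least k times."""
    return bool(records) and smallest_group(records) >= k

# Hypothetical toy records: (region, sex, age group)
original = [(1, "male", "0-5"), (1, "male", "0-5"), (2, "female", "6-10")]
anonymous = [(1, "male", "0-5")] * 3 + [(2, "female", "6-10")] * 3

print(is_triplet_safe(original))   # the original contains a unique record
print(is_triplet_safe(anonymous))  # every record has at least two twins
```

Any attempt to match an outside record against `anonymous` returns either no record or at least three identical ones, which is the ambiguity the slide describes.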

Idea of SAFE - solution (2)
[Figure: the "original" and "anonymous" panels by age group (0-5 to 96-100), sex (male, female) and region (1 to 4).]
[Figure: inhabitants by region and sex (left: original, right: anonymous); persons by region 1 to 5, female and male; male 48%, female 52%.]
[Figure: inhabitants by age and sex (left: original, right: anonymous); persons by age in 5-year groups, female and male.]
The anonymous micro data set does not look realistic, because it contains no real data, only triplets. Analyses of it should nevertheless be as similar as possible to analyses of the original micro data file. This similarity is controlled for a set of predefined tables.

Mathematical model of SAFE (1)
Using the original micro data y results in no deviation, but leaves confidential records in the cases y_j = 1 and y_j = 2.
The problem is a huge non-linear integer optimization problem; an efficient heuristic approach is used.
Types of combinations:
- unique original combination (y_j^o = 1)
- identical original combinations (y_j^o > 1)
- synthetic objects (x_j^o = 0)
with:
y - vector of frequencies of category combinations in the micro data
a - vector of original frequencies in controlled table cells (result of tabulating the original micro data)
A - linear relation matrix (a_ij = 1 if object i is counted in table cell j, otherwise a_ij = 0)
d - vector of deviations that arise when the anonymised micro data are tabulated instead of the original data (one deviation per table cell)
w - vector of weights for table cells (allows different deviations depending on cell size)
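The symbols above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the SAFE heuristic: the name `x` for the anonymised frequency vector, the admissibility rule "0 or at least 3", and the weighted maximum as a quality measure are assumptions, since the transcript does not preserve the objective function. The toy data are hypothetical.

```python
def tabulate(A, freq):
    """Cell j sums freq[i] over all combinations i with A[i][j] == 1."""
    n_cells = len(A[0])
    return [sum(A[i][j] * freq[i] for i in range(len(freq)))
            for j in range(n_cells)]

def deviations(A, y_original, x_anonymous):
    """d = tabulated anonymised data minus original cell frequencies a."""
    a = tabulate(A, y_original)
    return [xa - ao for xa, ao in zip(tabulate(A, x_anonymous), a)]

def admissible(freq, k=3):
    """Assumed rule: a combination is absent or occurs at least k times."""
    return all(v == 0 or v >= k for v in freq)

def weighted_max_deviation(d, w):
    """Assumed quality measure: largest weighted absolute deviation."""
    return max(wi * abs(di) for wi, di in zip(w, d))

# Hypothetical toy problem: combinations (region, sex) in the rows,
# controlled cells [region 1, region 2, male, female] in the columns.
A = [[1, 0, 1, 0],   # region 1, male
     [1, 0, 0, 1],   # region 1, female
     [0, 1, 1, 0],   # region 2, male
     [0, 1, 0, 1]]   # region 2, female
y = [1, 5, 2, 3]     # original frequencies, contains cases 1 and 2
x = [0, 6, 3, 3]     # a candidate anonymised frequency vector
d = deviations(A, y, x)
print(admissible(y), admissible(x), d, weighted_max_deviation(d, [1, 1, 1, 1]))
```

With `x = y` every deviation is zero, but `admissible(y)` is false, which mirrors the first statement on the slide.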

Adaptation for the census
Different units: persons, housing and buildings are separate units of the census with hierarchical dependencies.
The micro data set is split into different variable blocks (persons: region, age, sex, profession, ...; housing: region, heating (oil, gas, ...), bath, ...; building: construction year, size, ...).
Each unit (person, housing, building) gets a counting variable, and the control tables are created with these counting variables.
Persons were split from the housing and building information because the person information comes from other sources (register information).
Housing and building information was collected together in one survey. It is kept in one data set with different counting variables and anonymised by means of these counting variables.
Size of the problem: the program was adapted to 64-bit code and to multi-core programming.
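How counting variables let one file serve two units can be sketched as follows. This is an assumption about the mechanics, not a description of the census files: the field names `count_housing` and `count_building`, the rule that `count_building` is 1 on exactly one housing record per building, and the records themselves are hypothetical.

```python
from collections import defaultdict

# Hypothetical combined housing/building file: one record per housing unit.
# Assumption: count_building is 1 on exactly one housing unit per building
# and 0 on the others, so that summing it counts buildings.
records = [
    {"region": 1, "heating": "gas", "count_housing": 1, "count_building": 1},
    {"region": 1, "heating": "gas", "count_housing": 1, "count_building": 0},
    {"region": 1, "heating": "oil", "count_housing": 1, "count_building": 0},
    {"region": 1, "heating": "oil", "count_housing": 1, "count_building": 1},
    {"region": 2, "heating": "gas", "count_housing": 1, "count_building": 1},
    {"region": 2, "heating": "gas", "count_housing": 1, "count_building": 0},
]

def tabulate(records, by, counting_variable):
    """Sum the chosen counting variable within each category of `by`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[by]] += record[counting_variable]
    return dict(totals)

housing_by_region = tabulate(records, "region", "count_housing")
buildings_by_region = tabulate(records, "region", "count_building")
print(housing_by_region)
print(buildings_by_region)
```

The same records yield a housing table and a building table, depending only on which counting variable is summed.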

SAFE – tests with census 1987 West Germany (1): micro data file for persons register
Records (persons): 63 202 834
Variables: 21 (28 including hierarchies)
Controlled tables: 430
Table cells: 10 219 270
Maximal deviation: 9

Table cells by size:
      from          to   number of table cells   mean deviation
         1           9               3 787 507             1.64
        10          99               3 246 990             2.22
       100         999               2 022 537             2.29
     1 000       9 999                 887 972             2.33
    10 000      99 999                 233 991             2.38
   100 000     999 999                  37 507             2.45
 1 000 000   9 999 999                   2 727             2.61
10 000 000     or more                      39             1.31
Maximal deviation by size class: only the values 8 and 4 survive in the transcript; their rows cannot be recovered.

The 430 controlled tables correspond to the current publication plan, with one to five variables in combination.
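A summary of this kind can be computed from pairs of original and anonymised cell values. The sketch below assumes that cells are classified by their original value and that "mean deviation" is the mean absolute deviation; neither is stated on the slide, and the cell values are hypothetical toy data.

```python
import bisect

# Lower bounds of the size classes used on the slide.
LOWER_BOUNDS = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]

def size_class(value):
    """Index of the size class for an original cell value >= 1."""
    return bisect.bisect_right(LOWER_BOUNDS, value) - 1

def summarise(original, anonymised):
    """Per size class: (number of cells, maximal deviation, mean deviation)."""
    stats = {}
    for o, a in zip(original, anonymised):
        if o < 1:
            continue  # assumption: empty original cells are not classified
        dev = abs(a - o)
        n, mx, total = stats.get(size_class(o), (0, 0, 0))
        stats[size_class(o)] = (n + 1, max(mx, dev), total + dev)
    return {c: (n, mx, total / n) for c, (n, mx, total) in sorted(stats.items())}

# Hypothetical cell values before and after anonymisation.
original = [1, 4, 9, 12, 50, 1500]
anonymised = [0, 3, 9, 13, 48, 1503]
result = summarise(original, anonymised)
for c, (n, mx, mean) in result.items():
    print(LOWER_BOUNDS[c], n, mx, round(mean, 2))
```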

SAFE – tests with census 1987 West Germany (2): micro data file for persons register
Left block: number of table cells by table dimension. Right block: share of cells (%) with at most this deviation, by table dimension.

Deviation        1         2           3      4 or 5        1       2       3  4 or 5
0            7 730    41 472     429 058     764 471     56.7    12.0    11.9    12.2
1            4 600    86 008   1 053 292   2 034 970     90.4    36.8    41.1    44.8
2            1 210    73 607     840 969   1 523 732     99.2    58.0    64.4    69.2
3              104    60 243     594 461     954 158    100.0    75.4    80.9    84.4
4                -    47 035     411 153     608 160             89.0    92.3    94.1
5                     26 310     202 569     274 847             96.5    97.9    98.5
6                     10 077      66 662      81 715             99.5    99.7    99.8
7                      1 823       8 954       9 489
8                         80         155         154
9
10
Other
all         13 644   346 657   3 607 273   6 251 696
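The percentage columns on this slide are cumulative shares of the cell counts. A short sketch, using the one-dimensional persons tables from this slide (7 730, 4 600, 1 210 and 104 cells for the first four deviation rows), reproduces the first percentage column; the function name `cumulative_share` is chosen here for illustration.

```python
from itertools import accumulate

def cumulative_share(counts):
    """Percentage of cells whose deviation is at most each successive level."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in accumulate(counts)]

# Cells of one-dimensional tables in the first four deviation rows.
dim1 = [7730, 4600, 1210, 104]
print(cumulative_share(dim1))  # [56.7, 90.4, 99.2, 100.0]
```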

SAFE – tests with census 1987 West Germany (1): micro data file for housing and building
Records (housing): 26 624 252
Variables: 7 (15)
Controlled tables: 119
Table cells: 2 906 234
Maximal deviation: 6

Table cells by size:
     from          to   number of table cells
        1           9               1 358 071
       10          99                 915 183
      100         999                 471 606
    1 000       9 999                 130 859
   10 000      99 999                  26 334
  100 000     999 999                   3 838
1 000 000   9 999 999                     343
Maximal deviation by size class: only the values 5 and 6 survive in the transcript.
Mean deviation by size class: only the values 1.38, 1.43, 1.39, 1.44 and 1.46 survive; their rows cannot be recovered.

SAFE – tests with census 1987 West Germany (2): micro data file for housing and building
Left block: number of table cells by table dimension. Right block: share of cells (%) with at most this deviation, by table dimension.

Deviation        1         2           3      4 or 5        1       2       3  4 or 5
0            4 391    37 980     155 866     278 551     44.5    17.0    16.4    16.2
1            5 250    84 913     420 560     752 125     97.7    55.0    60.5    59.9
2              223    64 154     260 389     471 625    100.0    83.8    87.9    87.3
3                     27 290      86 197     165 421             96.0    96.9    96.9
4                -     8 555      27 912      51 211             99.8    99.9    99.9
5                        425       1 195       1 996
6
7
Other
all (sum of the rows above)
             9 864   223 317     952 119   1 720 929

Interpretation of results
SAFE solutions:
- They act like “noise” added to table cells.
- The maximal deviation is known and documented.
- The relative deviation decreases as table cell values grow and increases as they shrink.
- Missing combinations are unlikely to exist in the original data, but their absence is not certain.
- Unique combinations in a table row or column do not allow an information gain (group disclosure problem, table sum problem), because uniqueness in the original data is not certain.
- The structure of the data is well preserved.
- No information is lost through complementary cell suppression.
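The statement about relative deviation follows from the fixed bound on the absolute deviation. As a worked example, the sketch below uses the maximal deviation of 9 reported for the persons file; the cell values are arbitrary illustrations.

```python
def max_relative_deviation(cell_value, max_deviation):
    """Upper bound on the relative deviation of a cell with the given value."""
    return max_deviation / cell_value

# Maximal deviation 9 as reported for the persons file.
for value in (10, 100, 1_000, 1_000_000):
    print(value, f"{max_relative_deviation(value, 9):.4%}")
```

The bound shrinks in proportion to the cell value, so large cells are almost unaffected while small cells carry most of the uncertainty.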

Thank you for your attention!