Anonymization of longitudinal surveys in the presence of outliers Hans-Peter Hafner HTW Saar – Saarland University of Applied Sciences

Slides:



Advertisements
Similar presentations
Katherine Jenny Thompson
Advertisements

Evaluating the Effects of Business Register Updates on Monthly Survey Estimates Daniel Lewis.
Innovation data collection: Advice from the Oslo Manual South East Asian Regional Workshop on Science, Technology and Innovation Statistics.
Jörg Drechsler (Institute for Employment Research, Germany) NTTS 2009 Brussels, 20. February 2009 Disclosure Control in Business Data Experiences with.
Combining Classification and Model Trees for Handling Ordinal Problems D. Anyfantis, M. Karagiannopoulos S. B. Kotsiantis, P. E. Pintelas Educational Software.
 New reforms have been passed to modernize the Swiss electricity sector.  Legal (transmission and generation are separate legal entities) and functional.
Sampling Strategy for Establishment Surveys International Workshop on Industrial Statistics Beijing, China, 8-10 July 2013.
1 A REVIEW OF QUME 232  The Statistical Analysis of Economic (and related) Data.
PSY 307 – Statistics for the Behavioral Sciences
Increasing Survey Statistics Precision Using Split Questionnaire Design: An Application of Small Area Estimation 1.
Discriminant Analysis Testing latent variables as predictors of groups.
UNECE Workshop on Confidentiality Manchester, December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control.
THE MANAGEMENT AND CONTROL OF QUALITY, 5e, © 2002 South-Western/Thomson Learning TM 1 Chapter 12 Statistical Process Control.
©2006 Prentice Hall Business Publishing, Auditing 11/e, Arens/Beasley/Elder Audit Sampling for Tests of Details of Balances Chapter 17.
©2003 Prentice Hall Business Publishing, Auditing and Assurance Services 9/e, Arens/Elder/Beasley Audit Sampling for Tests of Details of Balances.
©2010 Prentice Hall Business Publishing, Auditing 13/e, Arens//Elder/Beasley Audit Sampling for Tests of Details of Balances Chapter 17.
©2012 Pearson Education, Auditing 14/e, Arens/Elder/Beasley Audit Sampling for Tests of Details of Balances Chapter 17.
Simple Linear Regression Models
Rudi Seljak, Metka Zaletel Statistical Office of the Republic of Slovenia TAX DATA AS A MEANS FOR THE ESSENTIAL REDUCTION OF THE SHORT-TERM SURVEYS RESPONSE.
Model Building III – Remedial Measures KNNL – Chapter 11.
© Federal Statistical Office, Research Data Centre, Maurice Brandt Folie 1 Analytical validity and confidentiality protection of anonymised longitudinal.
Combining prevalence estimates from multiple sources Julian Flowers.
User-focused Threat Identification For Anonymised Microdata Hans-Peter Hafner HTW Saar – Saarland University of Applied Sciences
Edoardo PIZZOLI, Chiara PICCINI NTTS New Techniques and Technologies for Statistics SPATIAL DATA REPRESENTATION: AN IMPROVEMENT OF STATISTICAL DISSEMINATION.
Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Natalya Kryvulina, Andrey Kashyn June, 2009 Astana, Kazakhstan.
Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.
Tenth Meeting of Working Groups on Macroeconomic Aspects of Intergenerational Transfer: International Symposium on Demographic Change and Policy Response.
Examining Relationships in Quantitative Research
1 Review of ANOVA & Inferences About The Pearson Correlation Coefficient Heibatollah Baghi, and Mastee Badii.
Summary Five numbers summary, percentiles, mean Box plot, modified box plot Robust statistic – mean, median, trimmed mean outlier Measures of variability.
Evaluating Transportation Impacts of Forecast Demographic Scenarios Using Population Synthesis and Data Simulation Joshua Auld Kouros Mohammadian Taha.
DPG Meeting National Panel Survey (NPS) 2008/09 – 2010/11 December 4, 2012.
UTOPPS—Fall 2004 Teaching Statistics in Psychology.
Spatial Interpolation III
Evaluating generalised calibration / Fay-Herriot model in CAPEX Tracy Jones, Angharad Walters, Ria Sanderson and Salah Merad (Office for National Statistics)
Some aspects concerning analytical validity and disclosure risk of CART generated synthetic data Hans-Peter Hafner and Rainer Lenz Research Data Centre.
1 Inferences About The Pearson Correlation Coefficient.
IAB homepage: Institut für Arbeitsmarkt- und Berufsforschung/Institute for Employment Research A New Approach for Disclosure Control in the.
© Federal Statistical Office Germany, Division IB, Institute for Research and Development in Federal Statistics Sheet 1 Surveys, administrative data or.
Disclosure Limitation in Microdata with Multiple Imputation Jerry Reiter Institute of Statistics and Decision Sciences Duke University.
Simple linear regression Tron Anders Moger
European Conference on Quality in Official Statistics, Rome, July 2008 Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata.
Slide 1 Copyright © 2004 Pearson Education, Inc..
28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.
Statistics in Applied Science and Technology Chapter14. Nonparametric Methods.
Can't Type? press F11 Can’t Hear? Check: Speakers, Volume or Re-Enter Seminar Put ? in front of Questions so it is easier to see them. 1 Graded Projects.
Estimation of the Probit Model From Anonymized Micro Data Gerd Ronning and Martin Rosemann Universität Tübingen & IAW Tübingen UNECE Work Session on Statistical.
1 Testing of Hypothesis Two Sample test Dr. T. T. Kachwala.
IE241: Introduction to Design of Experiments. Last term we talked about testing the difference between two independent means. For means from a normal.
© Statistisches Bundesamt, Division IB, Institute for Research and Development in Federal Statistics Is the utilization of administrative data in short.
© 2008 McGraw-Hill Higher Education The Statistical Imagination Chapter 5. Measuring Dispersion or Spread in a Distribution of Scores.
ANOVA Overview of Major Designs. Between or Within Subjects Between-subjects (completely randomized) designs –Subjects are nested within treatment conditions.
D/RS 1013 Discriminant Analysis. Discriminant Analysis Overview n multivariate extension of the one-way ANOVA n looks at differences between 2 or more.
Workshop on MDG, Bangkok, Jan.2009 MDG 3.2: Share of women in wage employment in the non-agricultural sector National and global data.
©2012 Prentice Hall Business Publishing, Auditing 14/e, Arens/Elder/Beasley Audit Sampling for Tests of Details of Balances Chapter 17.
WLTP-DHC Analysis of in-use driving behaviour data, influence of different parameters By Heinz Steven
Computer aided teaching of statistics: advantages and disadvantages
Screen Stage Lecturer’s desk Gallagher Theater Row A Row A Row A Row B
Quality of Electronic Emergency Department Data: How Good Are They?
Audit Sampling for Tests of Details of Balances
Audit Sampling for Tests of Details of Balances
Teaching Statistics in Psychology
CHAPTER 29: Multiple Regression*
Hand out z tables.
INTEGRATED LEARNING CENTER
PAIRWISE t-TEST AND ANALYSIS OF VARIANCE
Statistical Thinking and Applications
Regression Part II.
Presentation transcript:

Anonymization of longitudinal surveys in the presence of outliers Hans-Peter Hafner HTW Saar – Saarland University of Applied Sciences Rainer Lenz HTW Saar – Saarland University of Applied Sciences Technical University of Dortmund Work Session on Statistical Data Confidentiality Helsinki, 6 October

Motivation Anonymization longitudinal surveys 2 High demand for micro data of business surveys NSI cannot respond to requests in adequate time (confidentiality requirements, staff shortage) Synthetic data generated by CART models ► Not satisfactory for data containing outliers Other anonymization methods are needed!

Overview Linear Mixed Models and Extensions Data: German Cost Structure Survey Tested models Results analytical potential Results disclosure risk Conclusion and future work Overview Anonymization longitudinal surveys 3

Linear Mixed Models and Extensions Linear Mixed Model (LMM) (1) Y i = X i β + Z i b i + ε i (2) b i ~ N(0, D) (3) ε i ~ N(0, Σ i ) (4) b 1,…,b N, ε 1,…, ε N independent β fixed effects (constant across units) b i random effects (varying across units) Robust Linear Mixed Model (Koller 2014) Lower weights for outliers Generalized Linear Mixed Model (GLMM) Exponential family instead of normal distribution Linear Mixed Models and Extensions Anonymization longitudinal surveys 4

German Cost Structure Survey Information on enterprises of the manufacturing sector ► Branch of economic activity, location of headquarter, number of employees, turnover, raw material consumption … Sample at most enterprises per year, 20+ employees 500+ employees: Complete survey Panel 1999 to 2002: enterprises German Cost Structure Survey Anonymization longitudinal surveys 5

Tested models – 2 approaches Approach 1 Same units for model building and estimation of anonymized values. Approach 2 Two subsamples: Estimate model with sample 1 and compute synthetic values using values of sample 2 and model coefficients from „nearest“ record of sample 1 (and the other way around) „Nearest“ record: Divide dataset in layers of 17 economic groups and old / new federal states. Sort each layer by squaresum of the 4 values of the attribute to be anonymized Unit with highest value -> Sample 1, unit with second highest -> sample 2 … Assignment between samples: Units with same rank in layer Tested Models 1 Anonymization longitudinal surveys 6

Tested models – 5 variations For each approach 5 variations: Variation 1 Common LMM for whole dataset respective whole sample. Variation 2 Separate LMM for „normal“ data and outlier (Hampel rule). Variation 3 Separate LMM for each of the 17 economic groups. Variation 4 GLMM with Poisson distribution (for 17 economic groups). Variation 5 Robust LMM (for 17 economic groups) Tested Models 2 Anonymization longitudinal surveys 7

Analytical Potential: Criteria Anonymization longitudinal surveys 8 Criteria I)Deviations of single values II)Deviations of means, standard deviations above 10% and deviations of correlations above 0,1 between waves for 17 economic groups III)Deviations of means, standard deviations above 10% and correlations of change rates between waves above 0,1 for 17 economic groups IV)Deviations of trends (increase / decrease) for 17 economic groups

Results: Analytical Potential 1 Anonymization longitudinal surveys 9 Approach 1 Variation 5 aborted because of memory problems I)About 80% deviation less than 5 percent II)0 – 2% III)80 – 90% IV)12 – 18% single trends, 25 – 34% trends for combination of turnover and employees I) – III) No significant differences between variations IV) Variation 1 best results

Results: Analytical Potential 2 Anonymization longitudinal surveys 10 Approach 2 I)12,5 – 22% deviation less than 5 percent (best results: variation 5) II)About 13 – 50% (best results: variation 5) III)About 50 – 85% (Variation 5 than all variations of approach 1) IV)18 – 26% single trends, 31 – 41% trends for combination of turnover and employees (best results: variation 5) Variation 5 (robust LMM) has best analytical potential, but for criteria I) and II) higher deviations than for all variations of approach 1.

Disclosure Risk: Methodology Anonymization longitudinal surveys 11 Scenario: Database Crossmatch Anonymized dataset against original data Minimization of distance measure between records Blocks: 17 economic groups, old / new federal states, 6 employment size classes Hit rate: Proportion of correct matched units in block Useful value: Deviation from original value at most 10% Disclosure risk in block = (Hit rate * #useful values) / #values in block

Disclosure Risk: Results Anonymization longitudinal surveys 12 Approach 1 For all blocks disclosure risk above 20 percent. Number of correct matched enterprises in these blocks between 62 (variation 1) and 67 percent (variation 4) of all enterprises. Approach 2 Disclosure risk above 20 percent for 3.5 to 4.6 percent of the blocks. Number of correct matched enterprises in these blocks between 0.1 (variations 1 and 4) and 0.4 percent (variation 5) of all enterprises.

Conclusion Anonymization longitudinal surveys 13 Approach 1: High analytical potential, high disclosure risk. Approach 2: Low analytical potential, low disclosure risk Most promising model: Approach 1, robust LMM Possible modifications: 1)Assignment of units between the two samples. 2)Include correlation structure between waves into model.

Anonymization longitudinal surveys 14 THANK YOU FOR YOUR ATTENTION