European Conference on Quality in Official Statistics, Rome, July 2008
Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata Files for Research
Daniela Ichim

Outline
Dissemination of Microdata Files for Research
Risk assessment
Disclosure limitation
Data quality
–Record linkage
–Data utility

Confidentiality against Dissemination
Find the right balance!
Disclosure scenarios

Community Innovation Survey
IDENTIFYING VARIABLES (STRUCTURAL VARIABLES)
–Nace
–Nuts
–Size
–Turnover (TURN)
CONFIDENTIAL VARIABLES (VARIABLES INVOLVED IN ANALYSES)
–Expenditures in innovation (RTOT, …)
–Number of patents, …

Confounding
–Categorical
–Numerical
[Figure: units classified as safe or unsafe; labels A … A]
k-anonymity
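
For the categorical identifying variables, the safe/unsafe split under k-anonymity can be sketched as below. This is a minimal illustration, not the procedure used in the survey: the record layout, the field names `nace`, `nuts` and `size`, and the helper `units_at_risk` are all assumptions made for the example.

```python
from collections import Counter

def units_at_risk(records, keys, k):
    """Return the indices of records whose combination of key values
    is shared by fewer than k records (i.e. not k-anonymous)."""
    combos = [tuple(r[key] for key in keys) for r in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] < k]

# Hypothetical toy data: three enterprises share one combination, one is alone.
records = [
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "B", "nuts": "ITF", "size": "large"},
]
unsafe = units_at_risk(records, ["nace", "nuts", "size"], k=3)
print(unsafe)
```

With k=3 only the last record is flagged, since its combination of categories occurs once.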

a) Given a threshold (on units)
b) Local Outlier Factor as a measure of the difference in density between a unit and its nearest neighbours
General risk function
–Distance between two units
–Density around a unit
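
The Local Outlier Factor mentioned in b) can be sketched in plain Python as follows. The code follows the standard textbook definition of LOF (k-distance, reachability distance, local reachability density); whether the risk function on the slide uses exactly this form is an assumption. The function name, the Euclidean distance and the toy points are illustrative, and the sketch assumes all points are distinct.

```python
import math

def local_outlier_factor(points, k):
    """Standard LOF score for each point, using Euclidean distance.
    Scores near 1 mean the point is as dense as its neighbours;
    scores well above 1 mean it lies in a sparser region."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])                # the k nearest neighbours of i
        k_distance.append(dist[i][order[k - 1]])    # distance to the k-th neighbour
    lrd = []
    for i in range(n):
        # reachability distance of i from each neighbour j
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))         # local reachability density
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: four units close together and one isolated unit.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factor(points, k=2)
print(scores)
```

The isolated unit receives a score far above 1, while the four clustered units score 1, which is the density contrast the risk measure relies on.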

Parameters
Threshold: dissemination policy
Cut-off point for density (LOF)
–quantiles
–automatic
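
A quantile-based cut-off on density scores might look like the sketch below. The nearest-rank quantile rule, the function names and the example scores are assumptions for illustration; the slide does not specify how the quantile or the automatic cut-off is computed.

```python
import math

def quantile_cutoff(scores, q):
    """Nearest-rank quantile of the scores, for 0 < q <= 1."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def flag_above_cutoff(scores, q):
    """Indices of units whose score exceeds the q-quantile cut-off."""
    cutoff = quantile_cutoff(scores, q)
    return [i for i, s in enumerate(scores) if s > cutoff]

# Hypothetical LOF scores for ten units.
lof_scores = [1.0, 1.1, 0.9, 1.2, 5.0, 1.05, 0.95, 1.0, 3.0, 1.1]
cutoff = quantile_cutoff(lof_scores, 0.75)
flagged = flag_above_cutoff(lof_scores, 0.75)
print(cutoff, flagged)
```

Raising q flags fewer units and lowering it flags more, so the quantile acts as the tuning knob for the cut-off.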

Stratification variables
Analysis by Nace
[Figure panels: Nace A; all Nace]

Disclosure limitation
MFR
–Selective masking
–k-anonymity
–Nearest neighbour
–Micro-aggregation on tails
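
Micro-aggregation on tails can be illustrated with a univariate sketch: values above a cut-off are sorted, grouped into sets of at least k units, and each value is replaced by its group mean. The choice of the upper tail, the fixed-size grouping rule and the function name are assumptions for this example; the slide does not describe the exact variant used.

```python
def microaggregate_upper_tail(values, cutoff, k):
    """Replace each value above `cutoff` by the mean of its group.
    Groups are formed from the sorted tail in blocks of k; a short
    last block is merged into the previous one. If the whole tail
    has fewer than k units it forms a single, smaller group."""
    out = list(values)
    tail = sorted((i for i, v in enumerate(values) if v > cutoff),
                  key=lambda i: values[i])
    groups = [tail[s:s + k] for s in range(0, len(tail), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out

# Hypothetical turnover values; only the four large ones are masked.
turnover = [10, 12, 11, 500, 520, 540, 900, 13]
masked = microaggregate_upper_tail(turnover, cutoff=100, k=2)
print(masked)
```

Because each group is replaced by its own mean, the unweighted total is unchanged and values below the cut-off are left as they were.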

Quality assessment
Dissemination
Confidentiality

Risk measure assessment
Quality of the external database D_E
Chambers of Commerce database
Record linkage

Record linkage

M*=3
           1 equal unit   less than 3 units   less than 3 units   less than 3 units
           within 10%     within 10%          within 20%          within 30%
NACE       88%            84%                 97%                 100%
NACE EMP   63%            60% a               74% a               87% a

M*=5
           1 equal unit   less than 5 units   less than 5 units   less than 5 units
           within 10%     within 10%          within 20%          within 30%
NACE       88%            73%                 87%                 96%
NACE EMP   63%            58% a               70% a               80% a

a) 100% for enterprises with more than 250 employees
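
The "units within 10%" counts in the tables above suggest a linkage check of the following kind: for a released record, count the external-database units in the same category whose value lies within a relative tolerance. The sketch below is an assumption about that check, not the authors' procedure; the compared variable (`turn`), the field names and the helper `candidates_within` are hypothetical.

```python
def candidates_within(released, external, tolerance, keys=("nace",), var="turn"):
    """External units that match `released` on all `keys` and whose `var`
    lies within `tolerance` (relative) of the released value."""
    target = released[var]
    return [e for e in external
            if all(e[key] == released[key] for key in keys)
            and abs(e[var] - target) <= tolerance * abs(target)]

# Hypothetical released record and external register.
released = {"nace": "A", "turn": 1000}
external = [
    {"nace": "A", "turn": 950},
    {"nace": "A", "turn": 1080},
    {"nace": "A", "turn": 1250},
    {"nace": "B", "turn": 1000},
]
m_star = 3
n_10 = len(candidates_within(released, external, 0.10))
n_30 = len(candidates_within(released, external, 0.30))
print(n_10 < m_star, n_30 < m_star)
```

Here the record has fewer than M*=3 candidates within 10% but not within 30%, which is the kind of distinction the table columns report.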

Information content analysis
Information preservation
Selective masking
–Data utility
–Only identifying and confidential variables were modified.
–Only records at risk were modified. The weights were not modified.
–Weighted totals (coherence with the already published information)
Some statistical indicators were slightly modified: variances
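
The two checks named on this slide, preserved weighted totals and slightly changed variances, can be sketched as follows. The data are invented, the weighted variance is the simple population-type formula, and the masking shown (averaging two equally weighted records) is only one assumed way to keep the weighted total fixed.

```python
def weighted_total(values, weights):
    return sum(v * w for v, w in zip(values, weights))

def weighted_variance(values, weights):
    """Population-type weighted variance around the weighted mean."""
    wsum = sum(weights)
    mean = weighted_total(values, weights) / wsum
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / wsum

# Hypothetical values; the last two records are the ones at risk.
weights = [2, 2, 1, 1]                       # sampling weights, left untouched
original = [100.0, 200.0, 900.0, 1100.0]
masked = [100.0, 200.0, 1000.0, 1000.0]      # the two at-risk records averaged

total_before = weighted_total(original, weights)
total_after = weighted_total(masked, weights)
var_before = weighted_variance(original, weights)
var_after = weighted_variance(masked, weights)
print(total_before, total_after, var_before, var_after)
```

The weighted total is identical before and after masking, while the variance shrinks somewhat, matching the slide's remark that variances were slightly modified.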

Information content analysis
Data utility
Assessment of the perturbation impact on ratios like RTOT/TURN
[Figure panels: Original; Selective masking; Individual ranking]
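
One simple way to quantify the perturbation impact on a ratio such as RTOT/TURN is the relative change of each unit's ratio after masking. The sketch below assumes unit-level ratios and invented values; the slide's own assessment may use a different summary.

```python
def unit_ratios(numerators, denominators):
    return [n / d for n, d in zip(numerators, denominators)]

def relative_changes(before, after):
    """Relative change of each ratio, (after - before) / before."""
    return [(a - b) / b for b, a in zip(before, after)]

# Hypothetical data: RTOT is unchanged, TURN is masked for the last two units.
rtot = [10.0, 20.0, 90.0]
turn_original = [100.0, 400.0, 500.0]
turn_masked = [100.0, 450.0, 450.0]

ratios_original = unit_ratios(rtot, turn_original)
ratios_masked = unit_ratios(rtot, turn_masked)
changes = relative_changes(ratios_original, ratios_masked)
print(changes)
```

Units that were not masked show a change of zero, so the non-zero entries isolate the effect of the protection method on the ratio.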

Conclusions
1. Confidentiality: risk measure based on the k-anonymity principle
   Flexible
   a) continuous and categorical variables
   b) easy to implement
   c) consistent for extreme choices
2. Data utility: selective protection to achieve k-anonymity
3. Comparable dissemination: control both the risk of re-identification and information loss
QUALITY DIMENSIONS