Disclosure scenario and risk assessment: Structure of Earnings Survey
Daniela Ichim, Luisa Franconi
Istat – DCMT – Methodology
ichim@istat.it, franconi@istat.it
Outline
1. Objectives of the anonymisation
2. Disclosure scenarios
3. Risk assessment
4. Confidentiality protection
5. Information content analysis
Objectives
Requirements: Member States, Users
Dissemination policy (Nace, Citizenship, Number of Employees, etc.)
Coherence
Users
  High-priority variables: NACE, NUTS, ISCO
  Minimum level of detail (NACE 2 digits, NUTS1, ISCO 2 digits, …)
  Kinds of analysis
    Estimating the difference in annual earnings between two categories of the regional detail (estimating differences between regional policies)
    Variation of weighted totals
MICRODATA FILE FOR RESEARCH
Disclosure scenarios
Mimic the intruder's knowledge and interests.
POSSIBLE INTRUDER = RESEARCHER
  No external register scenario
  No nosy colleague scenario
MICRODATA FILE FOR RESEARCH: ONLY SPONTANEOUS IDENTIFICATION
Enterprise spontaneous identification
Key variables
  Structural variables: NACE, NUTS, SIZE
A sampled enterprise is considered at risk when the population frequency and the sample frequency of its combination of key variables are both below the given threshold.
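The risk rule on this slide can be sketched as a simple frequency count. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (dictionaries with hypothetical fields `id`, `nace`, `nuts`, `size`), the function name and the use of one threshold for both frequencies are all assumptions made for the example.

```python
from collections import Counter

KEYS = ("nace", "nuts", "size")  # structural key variables (field names are hypothetical)


def cell(record):
    """Combination of key variables that a record falls into."""
    return tuple(record[k] for k in KEYS)


def enterprises_at_risk(population, sample, threshold):
    """Return sampled enterprises whose cell is rare in both population and sample."""
    pop_freq = Counter(cell(e) for e in population)
    smp_freq = Counter(cell(e) for e in sample)
    # "At risk" only when BOTH frequencies are strictly below the threshold.
    return [
        e for e in sample
        if pop_freq[cell(e)] < threshold and smp_freq[cell(e)] < threshold
    ]
```

An enterprise in a cell that is rare in the sample but common in the population is not flagged, because the rule requires both frequencies to be low.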
Enterprise protection
Structural key variables are all categorical.
Protection is achieved by recoding classes of the categorical key variable with the lowest priority:
1. Nace 2-digits
2. NUTS1
3. SIZE
a) Recoding with respect to the population frequencies results in lower information loss.
b) If needed, recode another variable.
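Recoding can be illustrated by merging classes of one key variable and recounting the cells. The sketch below is only an example under assumptions: the class labels, the mapping and the field names are hypothetical, and the choice of which classes to merge is left to the analyst.

```python
from collections import Counter

KEYS = ("nace", "nuts", "size")  # hypothetical field names


def recode(records, variable, mapping):
    """Return copies of the records with `variable` replaced by its coarser class.

    Classes not listed in `mapping` are left unchanged.
    """
    return [{**r, variable: mapping.get(r[variable], r[variable])} for r in records]


def cell_counts(records):
    """Frequency of each combination of key variables."""
    return Counter(tuple(r[k] for k in KEYS) for r in records)


# Hypothetical recoding of SIZE: the two largest classes become one class.
SIZE_MAPPING = {"250-499": "250+", "500+": "250+"}
```

Applying `recode` to the population frame first and checking `cell_counts` there corresponds to point a) on the slide; if cells remain below the threshold, the same function can be applied to another variable, as in point b).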
Employee spontaneous identification
  Information on the enterprise (Nace x Nuts x Size)
  Social variables (Gender x Age)
  Extremely high earnings related to large enterprises
MICRODATA FILE FOR RESEARCH
Employees at risk (use the scenario!)
High annual earnings (AnnEarn): greater than the 99% quantile (T).
For each combination of Nace, Nuts, Size, Gender and Age, the number of sampled employees with annual earnings greater than T was counted.
If a combination contained a single such employee, that employee was considered at risk of identification.
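The counting rule above can be sketched as follows. Several points are assumptions made for the example: the field names are hypothetical, T is taken as the 99% quantile over all sampled employees (the slide does not say whether T is global or computed per cell), and the quantile uses the "inclusive" method of Python's `statistics.quantiles`.

```python
from collections import defaultdict
from statistics import quantiles

KEYS = ("nace", "nuts", "size", "gender", "age")  # hypothetical field names


def earnings_threshold(employees):
    """99% quantile T of annual earnings over all sampled employees (assumed global)."""
    values = [e["ann_earn"] for e in employees]
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
    return quantiles(values, n=100, method="inclusive")[98]


def employees_at_risk(employees, T):
    """Return high earners (earnings > T) who are alone in their key-variable cell."""
    cells = defaultdict(list)
    for e in employees:
        if e["ann_earn"] > T:
            cells[tuple(e[k] for k in KEYS)].append(e)
    return [members[0] for members in cells.values() if len(members) == 1]
```

Passing T explicitly keeps the two steps separate, so a per-cell or per-stratum threshold could be swapped in without changing the counting logic.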
Employees: selective protection
Only records of employees at risk of identification should be perturbed.
Only numerical key variables are perturbed.
MICRODATA FILE FOR RESEARCH
Constrained regression
Controlled perturbation
Weighted total variation below 0.5%.
Can easily be adapted to any stratification.
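The slide does not detail the constrained regression itself, so the sketch below covers only the acceptance check it implies: after perturbation, the weighted total must move by less than 0.5%, overall or within each stratum. Function names, field names and the strict inequality are assumptions made for this illustration.

```python
def weighted_total(values, weights):
    """Sum of values multiplied by their sampling weights."""
    return sum(v * w for v, w in zip(values, weights))


def relative_variation(original, perturbed, weights):
    """Relative change of the weighted total caused by the perturbation."""
    before = weighted_total(original, weights)
    after = weighted_total(perturbed, weights)
    return abs(after - before) / abs(before)


def within_tolerance(original, perturbed, weights, tol=0.005):
    """True when the weighted total varies by less than `tol` (0.5% by default)."""
    return relative_variation(original, perturbed, weights) < tol


def within_tolerance_by_stratum(records, tol=0.005):
    """Apply the same check separately in each stratum.

    Each record is a dict with hypothetical fields
    'stratum', 'original', 'perturbed' and 'weight'.
    """
    strata = {}
    for r in records:
        strata.setdefault(r["stratum"], []).append(r)
    return {
        s: within_tolerance(
            [r["original"] for r in rs],
            [r["perturbed"] for r in rs],
            [r["weight"] for r in rs],
            tol,
        )
        for s, rs in strata.items()
    }
```

Because the check takes the stratum as an ordinary grouping field, changing the stratification only changes how records are labelled, not the check itself.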
Information content
User requirements:
Information preservation
  Weighted totals
  Sampling weights
  Only key and confidential variables are modified.
Information loss
  Statistical indicators (correlations, summary statistics)
  Order relationships
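Two of the information-loss checks listed above can be sketched in a few lines: whether the perturbation preserves the ordering of the records, and how much a summary statistic (here the mean) changes. These particular indicators and function names are illustrative assumptions; the slide does not specify which measures were computed.

```python
from statistics import mean


def ranking(values):
    """Indices of the records sorted by increasing value (ties keep input order)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def order_preserved(original, perturbed):
    """True when the perturbed values rank the records exactly as the originals do."""
    return ranking(original) == ranking(perturbed)


def relative_mean_change(original, perturbed):
    """Relative change of the mean after perturbation."""
    return abs(mean(perturbed) - mean(original)) / abs(mean(original))
```

A perturbation that lowers the top earner but keeps that record at the top passes `order_preserved`, while one that moves the record below another fails it.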
Code        Variable                                                      Status
A.1.1       Geographical location                                         not changed
A.1.2       Size of enterprise                                            changed
A.1.3       Principal economic activity
A.1.4       Form of economic and financial control
A.1.5       Existence of collective pay agreements
A.1.6       Total number of employees                                     removed
A.4.1       Enterprise sample weights
B.2.1       Gender
B.2.2       Employee's age
B.2.3       Occupation
B.2.4       Management position or supervisory position
B.2.5       Education
B.2.6       Length of service in the enterprise
B.2.7       Full-time or part-time
B.2.7.1     Share of a full-time
B.2.8       Type of contract of employment
B.3.0       Average gross hourly earnings in the representative month
B.3.1       Total gross earnings for a representative month
B.3.1.1     Earnings related to overtime
B.3.1.2     Special payment for shift work
B.3.2       Total gross annual earnings in the reference year
B.3.2.1     Number of weeks to which the gross annual earnings relate
B.3.2.2     Total annual bonuses
B.3.2.2.2   Annual bonuses based on productivity
B.3.4       Number of paid hours during the representative month
B.3.4.1     Number of overtime hours paid in the reference month
B.3.5       Annual days of absence
B.3.5.1     Annual days of holiday leave
B.3.5.1.1   Holiday entitlement or number of holidays
B.4.2       Employee sample weights
CONCLUSIONS
Consider the dissemination features.
Consider the data features.
Ensure confidentiality while minimizing information loss.