Published byψυχή Νικολάκος Modified over 6 years ago
1
ESSnet on common tools and harmonized methodology for statistical data confidentiality
Daniela Ichim, Luisa Franconi
2
ESSnet on SDC harmonisation
ESSnet on common tools and harmonised methodology for SDC in the ESS
December 2010 – April 2012
CBS, Istat, Destatis, Statistics Austria, SCB
Task 1: Harmonisation of microdata release in multiple countries
Task 2: Case studies on tabular data
Task 3: Future directions of SDC software tools
Tasks on dissemination and management
3
ESSnet on SDC harmonisation
Task 1
Task 1-1: Choice of measure for the output. Definition of the set of objective measures to be maintained by all possible candidate methods.
Task 1-2: User needs. Analysis of the projects undertaken by researchers on the data; definition of the benchmarking statistics and prioritisation of features.
Task 1-3: Definition of methodologies. Study of anonymisation taking into account the new framework and the benchmarking statistics.
Task 1-4: Implementation and reporting. A report on the implementation process: pros and cons and critical points.
4
Dissemination strategy
Microdata risk assessment
Apply SDL to reduce risk while maintaining some utility
Evaluate utility
[Diagram labels: Original microdata, SDL methods, Anonymised microdata, Disclosure risk (R), Utility (U)]
5
Comparability
HOW to achieve it? Bounded utility comparability
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to determine when action is needed
3. Setting a process for choosing acceptable practices
6
SES-Benchmarking
Setting of objectives:
1. Member States
a) Dissemination policy (NACE, size, etc.)
b) Coherence
2. Users
a) High-priority variables (e.g. NACE, size, region, salary, etc.)
b) Minimum level of detail (NACE 2 digits)
c) Types of analyses: ratios, weighted totals, salary change, etc.
7
SES: which predefined statistics?
ESSnet SDC Harmonisation Deliverable 1: Focus on consequences on SDL
Part A. Survey structure: a) quality b) relationships between variables c) classifications, etc.
Part B. Scientific research on SES data: a) models b) methods c) breakdowns d) minimum level of detail, etc.
[Diagram labels: Input, Output]
8
Optional variables What to do???
SES: Requirements stated in the legislative framework and its implementing measures
Reg. (EC) No 1738/2005
5 themes: information on local units, employees, working period, earnings, grossing-up factors
9
SES2006: required characteristics
Theme: Required characteristics
Data source: Tailored questionnaires, existing surveys, administrative sources or a combination of such sources; the information obtained must be of acceptable quality and be comparable between European countries.
Reference period: Year 2006, month: October. In some countries the accounting year does not coincide with the calendar year; for these countries the financial year is the best match with the calendar year 2006. Choice of another month is acceptable if appropriately justified.
Sampling design: Based on a sample of employees drawn from a stratified sample of local units. Reporting unit: the local unit or the enterprise. Observation unit: local unit.
Economic activities: Sections C-O excluding L of NACE Rev.1.1.
Population, enterprises: Enterprises with at least 10 employees in the covered economic activities.
Population, employees: Employees in the observation unit who have an employment contract in the reference month.
10
SES2006: minimum requirements
Theme 1
1.1. Geographical location of the local unit: NUTS 1 level.
1.2. Size of the enterprise to which the local unit belongs: 1-9*, 10-49, , , , and more employees. *This first band is optional for the 2006 SES.
1.3. Principal economic activity of the local unit: 2-digit level of NACE Rev.1.1 for sections C to O. NACE section L is optional for the 2006 SES.
Theme 2
2.3. Occupation in the reference month: to be coded according to the International Standard Classification of Occupations, 1988 version (ISCO-88 (COM)) at the two-digit level and, if possible, at the three-digit level.
2.5. Highest successfully completed level of education and training: six levels coded according to the International Standard Classification of Education, 1997 version (ISCED 97).
Share of a full-timer’s normal hours: for a part-time employee, the hours contractually worked should be expressed as a percentage of the number of normal hours worked by a full-time employee in the local unit.
Theme 3
3.1. Number of weeks in the reference year to which the gross annual earnings relate: should correspond to the actual gross annual earnings (variable 4.1).
Theme 4
4.2. Gross earnings in the reference month: should be re-calculated so that it reflects the exclusion of such employees from the sample.
4.3. Average gross hourly earnings in the reference month: average gross earnings per hour paid to the employee in the reference month.
Theme 5
5.1. Grossing-up factor for the local unit: within each sampling stratum, (Variable 5.1) = (Number of local units in the population) / (Number of local units in the sample).
5.2. Grossing-up factor for the employees: (Variable 5.2) = (Variable 5.1) * (Number of employees in the local unit / Number of employees in the sample).
[Slide annotations: hierarchical classification, “independent”, SDC, sampling, relationship, formula]
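The two grossing-up formulas (variables 5.1 and 5.2) can be illustrated with a minimal Python sketch. The function names and the stratum figures are hypothetical and serve only as an illustration.

```python
def local_unit_factor(units_in_population, units_in_sample):
    """Variable 5.1: grossing-up factor for local units within one stratum."""
    return units_in_population / units_in_sample


def employee_factor(factor_5_1, employees_in_unit, employees_in_sample):
    """Variable 5.2: grossing-up factor for employees sampled in one local unit."""
    return factor_5_1 * (employees_in_unit / employees_in_sample)


# Hypothetical stratum: 200 local units in the population, 20 in the sample.
f51 = local_unit_factor(200, 20)

# Hypothetical local unit: 50 employees, 10 of them sampled.
f52 = employee_factor(f51, 50, 10)

print(f51, f52)
```

Each sampled employee in this made-up unit then stands for `f52` employees in the population.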
11
SES2006: constraints among variables
Variables that need to be consistent, followed by the types of checks required:
Variable 3.2, Number of hours actually paid during the reference month, should be consistent with variable 4.2, Monthly earnings.
Check: If the employee’s paid hours are affected by unpaid absence, then they should be adjusted to obtain paid hours for a full month. Where necessary, provide an approximate estimate of paid hours using: Adjusted 3.2 = Unadjusted 3.2 * (Adjusted 4.2 / Unadjusted 4.2). Where it is not feasible to adjust variable 4.2, then this employee should be excluded from the sample and the grossing-up factor (variable 5.2) re-calculated.
Variable 3.2.1, Number of overtime hours paid in the reference month, and variable 4.2.1, Overtime earnings.
Check: If the employee’s overtime hours are affected by unpaid absence, then they should be adjusted to obtain the paid overtime hours for a full month. Where necessary, provide a rough estimate of paid overtime hours using: Adjusted = Unadjusted * (Adjusted 4.2.1 / Unadjusted 4.2.1). Where it is not feasible to adjust variable 4.2 or 4.2.1, then this employee should be excluded from the sample and the grossing-up factor (variable 5.2) re-calculated.
Variable 4.3, Average gross hourly earnings in the reference month.
Check: Average gross hourly earnings derived from gross earnings for the reference month (variable 4.2), divided by the number of hours paid during the same period (variable 3.2).
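A small Python sketch of two of these checks: the proportional adjustment of paid hours (variable 3.2) and the derivation of hourly earnings (variable 4.3 = variable 4.2 / variable 3.2). The function names and the figures are hypothetical.

```python
def adjust_paid_hours(unadjusted_hours, unadjusted_earnings, adjusted_earnings):
    """Adjusted 3.2 = Unadjusted 3.2 * (Adjusted 4.2 / Unadjusted 4.2)."""
    return unadjusted_hours * (adjusted_earnings / unadjusted_earnings)


def hourly_earnings(monthly_earnings, paid_hours):
    """Variable 4.3 derived as variable 4.2 divided by variable 3.2."""
    return monthly_earnings / paid_hours


# Hypothetical employee with unpaid absence in the reference month:
# 128 paid hours and 1000 in earnings, adjusted to 1250 for a full month.
hours_full_month = adjust_paid_hours(128.0, 1000.0, 1250.0)
rate = hourly_earnings(1250.0, hours_full_month)

print(hours_full_month, rate)
```

The adjustment scales hours and earnings by the same ratio, so the hourly rate before and after adjustment is identical.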
12
SES2006: deviations from EU Regulation (quality reports)
Inclusion of the employees in the sample
Due to the calculation of the average month, some consistency checks between variables are not met
The definition of the variable gross annual earnings in the reference year
Classification of occupation and education
Enterprises, not local units
13
SES: main outputs
Data producers
a) what is already published by MS and Eurostat
b) to be coherent
c) to avoid identification
d) don’t look for ways to increase the info to be published
Users/researchers
a) how SES data is used in scientific research (data, models, methods)
b) obstacles (not administrative)
c) without evaluating the scientific merit!
14
SES2006: indicators (Chronos)
1. Numbers of employees
a) characteristics of the observation unit and employee
b) specified bands of hours paid, annual holidays and of hourly/monthly/annual earnings
2. Gross earnings, paid hours and annual days of leave
a) hourly, monthly and annual earnings, monthly paid hours and annual days of leave
b) several measures of location and of dispersion
15
SES2006: breakdowns (Chronos)
- Region: restricted to the national level
- Economic activity: restricted to NACE
- Size of the enterprise: 1-9, 10-49, , , , 1000
- Age: restricted to 5 size classes
- Occupation: one-digit level of the ISCO 88 (COM) classification
16
Researchers/analysts: general
1. We do not evaluate the scientific merit!
2. More than 80 papers were consulted.
3. National and international comparisons.
4. With or without use of the hierarchical employer-employee structure.
5. Some absence of information from the enterprise side is reported (e.g. financial).
6. Few longitudinal studies (for the moment).
7. “Home-made” harmonisation (when a MS does not survey/disseminate info on a NACE division, the entire (EU-level) info on the division is excluded from analyses).
8. With or without sample weights.
17
Studies
Wage differentials/wage dispersion
Labour market policy
Determinants/decomposition
“Classical” average gross earnings per enterprise or employee
Low(high)-pay dynamics
Bargaining regimes
18
Breakdowns
Wage differentials, gender pay gap
Gini coefficient, quintile share ratio
Breakdowns: region, education, age, gender, occupation
Employer (enterprise) impact: economic activity, size, productivity, policy, etc.
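As a reminder of what the two dispersion indicators measure, here is a minimal unweighted Python sketch. It ignores sampling weights and assumes the number of observations is a multiple of five; the earnings values are made up.

```python
def gini(values):
    """Unweighted Gini coefficient computed from the sorted values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


def quintile_share_ratio(values):
    """S80/S20: total of the top 20% of values over total of the bottom 20%."""
    xs = sorted(values)
    k = len(xs) // 5  # simplification: assumes len(xs) is a multiple of 5
    return sum(xs[-k:]) / sum(xs[:k])


# Made-up earnings for ten employees.
earnings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
g = gini(earnings)
s80_s20 = quintile_share_ratio(earnings)

print(g, s80_s20)
```

A weighted version would be needed for published SES indicators, since the survey uses grossing-up factors.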
19
Models and methods
Linear models: mixed-effects, multi-level, ANOVA, quantile
Log(earnings) as response variable
Assumption of normally distributed errors
Method: ordinary least squares
Sometimes in two stages (enterprise and employee)
Role of local units?
No sampling weights – earnings on enterprise level (exception???)
20
Selected benchmarking statistics
European dissemination
Breakdowns: NUTS, gender, education, age, occupation
Weighted means
Linear models
Relationships of earnings
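A weighted mean by breakdown, one of the selected benchmarking statistics, can be sketched as follows. The record layout (`gender`, `earnings`, `w`) and the figures are hypothetical; `w` stands for the employee grossing-up factor.

```python
from collections import defaultdict


def weighted_means(records, by, value="earnings", weight="w"):
    """Weighted mean of `value` for each category of the breakdown variable `by`."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for r in records:
        numerator[r[by]] += r[value] * r[weight]
        denominator[r[by]] += r[weight]
    return {group: numerator[group] / denominator[group] for group in numerator}


# Made-up records for illustration only.
records = [
    {"gender": "F", "earnings": 1000.0, "w": 2.0},
    {"gender": "F", "earnings": 2000.0, "w": 2.0},
    {"gender": "M", "earnings": 3000.0, "w": 1.0},
    {"gender": "M", "earnings": 2000.0, "w": 3.0},
]
means = weighted_means(records, by="gender")

print(means)
```

Computing the same statistic on the original and on the anonymised file gives one simple utility comparison.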
21
Prioritization
Questionnaire to the LAMAS WG national representatives
11 questions aimed at collecting information on preferences regarding the dissemination of an EU anonymised microdata file
19 MS answered (70%)
22
Prioritization
100%: existence of a legal dissemination framework
58%: national requirements
100%: standard classifications (NACE, NUTS, ISCO, ISCED)
23
Prioritization
32%: removal of all the optional variables, independently of their identification power
58%: only the removal of the optional identifying variables
24
Prioritization
1. Principal economic activity
2. Number of employees
3. Geographical location

1. Gender
2. Occupation
3. Education
4. Age
5. Length of stay in service
25
SDC Methodologies
Data structure – hierarchical, relationships, etc.
Disclosure risk assessment
Disclosure risk limitation
Individual ranking
Constrained regression
Flexibility
26
SDC Methodologies
Part A: Risk assessment
Part B: Protection
Part C: Audit
ENTERPRISE: Population frequencies, Recoding, Quality indicators, Sample frequencies, Preliminary recoding
EMPLOYEE: Only outliers, Constrained regression, All records, Individual ranking
27
SDC Methodologies - enterprises
Name: brief description (default value)
St: Threshold for the sample frequencies (default: 2)
Pt: Threshold for the population frequencies
ThresholdRiskStrata: Threshold for the percentage of the admissible strata at risk (default: 0.02)
ChooseDirectlyTheMostDetailedCombination: The most detailed combination satisfying the criteria is considered (default: “y”)
[Slide annotations: Increase – severity; Large values – only sample information; 0 means no risk; “n” – explorative analysis]
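How the St and ThresholdRiskStrata parameters might interact can be shown with a minimal Python sketch. It is not the project's implementation: the rule "a stratum is at risk when its sample frequency is below St" and the comparison of the share of strata at risk with ThresholdRiskStrata are assumptions, and the enterprise records are made up.

```python
from collections import Counter


def assess_strata(enterprises, keys, st=2, threshold_risk_strata=0.02):
    """Flag strata (combinations of key variables) with few sampled enterprises.

    Assumption: a stratum is at risk when its sample frequency is below `st`.
    Returns the strata at risk, their share, and whether that share is acceptable.
    """
    counts = Counter(tuple(e[k] for k in keys) for e in enterprises)
    at_risk = [stratum for stratum, n in counts.items() if n < st]
    share = len(at_risk) / len(counts)
    return at_risk, share, share <= threshold_risk_strata


# Made-up enterprises described by the key variables Nace, Nuts and Size.
enterprises = [
    {"nace": "C", "nuts": "IT1", "size": "small"},
    {"nace": "C", "nuts": "IT1", "size": "small"},
    {"nace": "C", "nuts": "IT1", "size": "small"},
    {"nace": "D", "nuts": "IT1", "size": "large"},
    {"nace": "C", "nuts": "IT2", "size": "small"},
    {"nace": "C", "nuts": "IT2", "size": "small"},
]
at_risk, share, acceptable = assess_strata(enterprises, keys=("nace", "nuts", "size"))

print(at_risk, share, acceptable)
```

When the share is not acceptable, a coarser combination (for example fewer size classes or a higher NUTS level) would be tried and the check repeated.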
28
SDC Methodologies – enterprises Testing
Country: number of enterprises, number of strata
IT: 19899, 1029
NL: 36762, 867
AT: 14368, 105
29
SDC Methodologies – enterprises Testing
Country: number of strata at risk, number of enterprises at risk
IT: 139 (13.51%), 201 (1.01%)
NL: 69 (7.96%), 80 (0.22%)
AT: 49 (46.67%), 53 (0.37%)
Size: 3 classes; Region: NUTS0; Nace: 2 digit
30
Employees at Risk
Frequency criteria for each combination of key categorical variables:
Info on enterprise (Nace, Nuts, Size)
Demographic variables (Age, Gender)
High AnnualEarnings: greater than a threshold T = quantile
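The employee risk rule can be sketched in Python as "annual earnings above a quantile and a rare combination of key variables". This is a minimal illustration, not the project's code: the nearest-rank quantile definition, the field names and the records are assumptions; `qq` and `threshold` mirror the parameter names used for the employee tests.

```python
import math
from collections import Counter


def quantile(values, q):
    """Nearest-rank empirical quantile (one of several possible definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def employees_at_risk(records, keys, qq=0.99, threshold=1):
    """Records with AE above the qq-quantile whose key combination
    occurs at most `threshold` times in the sample."""
    t = quantile([r["AE"] for r in records], qq)
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [
        r for r in records
        if r["AE"] > t and counts[tuple(r[k] for k in keys)] <= threshold
    ]


# Made-up employees; AE stands for annual earnings.
records = [
    {"id": 1, "nace": "C", "gender": "F", "AE": 10},
    {"id": 2, "nace": "C", "gender": "F", "AE": 20},
    {"id": 3, "nace": "C", "gender": "F", "AE": 30},
    {"id": 4, "nace": "C", "gender": "M", "AE": 40},
    {"id": 5, "nace": "C", "gender": "M", "AE": 50},
    {"id": 6, "nace": "C", "gender": "M", "AE": 60},
    {"id": 7, "nace": "D", "gender": "F", "AE": 70},
    {"id": 8, "nace": "C", "gender": "M", "AE": 80},
]
# qq=0.75 is used here only because the toy data set is tiny.
risky = employees_at_risk(records, keys=("nace", "gender"), qq=0.75, threshold=1)

print([r["id"] for r in risky])
```

Employee 8 has high earnings but shares its combination with three others, so only the unique high earner (employee 7) is flagged.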
31
SDC Methodologies – employees Testing
Name: brief description (default value)
SynthesizeAll: SynthesizeAll="y", how to evaluate the risk (default: “n”)
MinNbEmployeesPerStrata: More than MinNbEmployees per strata (default: 5)
qq: Quantile value for the definition of the employees at risk (default: 0.99)
qqOverall: How to compute the quantile (default: “y”)
threshold: Number of employees that could be considered at risk (default: 1)
32
SDC Methodologies – employees Testing
SynthesizeAll “y”: no risk evaluation, all units at risk.
33
SDC Methodologies – employees Testing
SynthesizeAll “n”, threshold 1: unique cases
SynthesizeAll “n”, threshold 2: unique and double cases
34
SDC Methodologies – employees Testing
SynthesizeAll “n”, qqOverall: risk threshold, by strata
SynthesizeAll “n”, qqOverall “y”: risk threshold, no stratification
35
SDC Methodologies – employees Testing
36
SDC Methodologies – employees Testing
SynthesizeAll “y”, qq: all units at risk.
SynthesizeAll “y”, qq 0.95, 0.99: if AE > qqV and if unique.
SynthesizeAll “y”, qq 100: no units at risk.
37
SDC Methodologies – employees Testing
38
SDC Methodologies – employees Testing
39
Protection Minimal Requirements
Protect with respect to the assumed scenarios.
Protect if needed.
Dependency on the disclosure scenario
Probabilistic method.
40
Dissemination strategy
Microdata risk assessment
Apply SDL to reduce risk while maintaining some utility
Evaluate utility
[Diagram labels: Original microdata, SDL methods, Anonymised microdata, Disclosure risk (R), Utility (U)]
41
Controlled Selective Masking
Perturb, but generate (control) quality:
coherence (already released statistics)
utility (users’ needs)
Add more (linear) constraints:
Weighted totals variation
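The idea of perturbing while controlling a linear constraint can be illustrated with a deliberately simple stand-in: multiplicative noise followed by a proportional rescaling that restores the weighted total. This is not the constrained regression used in the project; the function names, the noise level and the data are hypothetical.

```python
import random


def weighted_total(values, weights):
    """Sum of value * weight over all records."""
    return sum(v * w for v, w in zip(values, weights))


def perturb_keep_total(values, weights, noise=0.1, seed=0):
    """Add multiplicative noise, then rescale so the weighted total is unchanged."""
    rng = random.Random(seed)
    noisy = [v * (1 + rng.uniform(-noise, noise)) for v in values]
    scale = weighted_total(values, weights) / weighted_total(noisy, weights)
    return [v * scale for v in noisy]


# Made-up earnings and grossing-up factors.
earnings = [21000.0, 24000.0, 30000.0, 33000.0, 41000.0]
weights = [10.0, 12.0, 8.0, 15.0, 5.0]
masked = perturb_keep_total(earnings, weights)

print(weighted_total(earnings, weights), weighted_total(masked, weights))
```

With several constraints (for example totals by NACE and by region at the same time) a single rescaling factor is no longer enough, which is one motivation for a regression-based approach with explicit constraints.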
42
SDL - implementation
Individual ranking
Model-based
Parameters: IR.param, stratification
Re-use
Ease of implementation
Flexibility
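Individual ranking is a form of univariate microaggregation: the values of one variable are sorted, grouped into consecutive sets of at least k records, and replaced by the group mean. A minimal Python sketch follows; the function name, the handling of the last group and the data are assumptions, and `k` plays the role of the group size (3 or 5 in the tests reported here).

```python
def individual_ranking(values, k=3):
    """Replace each value by the mean of its rank-based group of at least k records."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    masked = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:
            # Fewer than k records would remain: merge them into this group.
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked


# Made-up values of one numerical variable.
values = [10, 50, 20, 40, 30, 60, 70]
masked = individual_ranking(values, k=3)

print(masked)
```

The unweighted total is preserved by construction, but weighted totals generally are not, and applying the function separately within each stratum corresponds to the "by strata" variant.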
43
SDC Methodologies – employees Testing
SynthesizeAll “y”: IR, 3 or 5, by strata, on all
SynthesizeAll “n”, Method IR: IR, 3 or 5, by strata, on risk
sdcMicro: IR, no strata, on all
IR: no control on
44
Model-based Disclosure Limitation
Assume a model. Estimate the parameters. Release the fitted values.
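The three steps can be sketched with the simplest possible model: ordinary least squares of log(earnings) on a single covariate, with the back-transformed fitted values released instead of the observed earnings. The single covariate, the function names and the data are assumptions made for illustration; the models considered in the project use several covariates.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def release_fitted(x, earnings):
    """Assume log(earnings) = a + b * x, estimate a and b, release fitted earnings."""
    log_earnings = [math.log(e) for e in earnings]
    a, b = fit_line(x, log_earnings)
    return [math.exp(a + b * xi) for xi in x]


# Made-up data: years of service and annual earnings.
service_years = [1, 3, 5, 7, 9]
annual_earnings = [21000.0, 24000.0, 30000.0, 33000.0, 41000.0]
released = release_fitted(service_years, annual_earnings)

print([round(r) for r in released])
```

Released values follow the assumed model exactly, so analyses that rely on relationships outside the model can lose validity.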
45
Model-based Disclosure Limitation
ConReg: control on
46
Utility Weighted Totals Analytical validity Correlations Variance
Linear models
47
Weighted Totals Constrained Regression
48
Weighted Totals IR
49
Correlation AE_AB
50
Correlation ME_OVER
51
Correlation ME_SHIFT
52
Variance
53
Linear Models
By combinations of Nace, Nuts and Size
Log(AE) = f(B21, B22, B23, B25, B26, B27)
Log(AE.pert) = f(B21, B22, B23, B25, B26, B27)
Compare the coefficients.
The same for:
B21 + B22
B23 + B25 + B26 + B27
54
Linear Models
55
Confidence intervals overlapping
IT NL Model Method Strat qq OWithinP PWithinO B21 B22 IR strat 0.992 0.984 0.985 B21 B22 B23 B25 B26 B27 0.998 0.952 B23 B25 B26 B27 0.968 no strat 0.945 0.929 0.936 0.947 0.960 0.958 0.949 0.970 0.946 0.956 ConReg 0.99 0.997 0.953 0.988 0.996 0.986 0.950 0.989 0.983 0.965 0.963
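One possible reading of the OWithinP and PWithinO columns is the share of coefficients estimated on the original data that fall inside the confidence intervals from the perturbed data, and the reverse. The Python sketch below implements that reading; the interpretation, the function name and all numbers are assumptions, not taken from the slides.

```python
def share_within(estimates, intervals):
    """Share of point estimates lying inside the corresponding (low, high) interval."""
    inside = sum(low <= e <= high for e, (low, high) in zip(estimates, intervals))
    return inside / len(estimates)


# Made-up coefficient estimates and confidence intervals for four regressors.
original_estimates = [0.10, 0.25, -0.05, 0.40]
original_intervals = [(0.06, 0.14), (0.21, 0.29), (-0.09, -0.01), (0.36, 0.44)]
perturbed_estimates = [0.11, 0.24, 0.01, 0.39]
perturbed_intervals = [(0.05, 0.15), (0.20, 0.28), (-0.02, 0.04), (0.35, 0.45)]

o_within_p = share_within(original_estimates, perturbed_intervals)
p_within_o = share_within(perturbed_estimates, original_intervals)

print(o_within_p, p_within_o)
```

Values close to 1 in both directions would indicate that the perturbed file supports the same conclusions about the coefficients as the original file.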
56
Problems
Data structure
Data format – missing values, categories
(Open source) software knowledge
Data knowledge
Documentation is a must!
57
Final issues
Data structure – hierarchical, relationships, etc.
Disclosure risk assessment – national and subjective
Disclosure risk limitation – protect w.r.t. the scenario
Flexibility
58
Final issues
Collaboration is necessary.
Consultation is necessary.
Testing is necessary.
Comparability may be achieved – development of bounded-utility methods.
Governance structure should be defined.
59
THANK YOU!