ESSnet on common tools and harmonized methodology for statistical data confidentiality Daniela Ichim, Luisa Franconi.


ESSnet on SDC harmonisation
ESSnet on common tools and harmonised methodology for SDC in the ESS
December 2010 – April 2012
Partners: CBS, Istat, Destatis, Statistics Austria, SCB
Task 1: Harmonisation of microdata release in multiple countries
Task 2: Case studies on tabular data
Task 3: Future directions of SDC software tools
Tasks on dissemination and management

ESSnet on SDC harmonisation: Task 1
Task 1-1: Choice of measure for the output. Definition of the set of objective measures to be maintained by all possible candidate methods.
Task 1-2: User needs. Analysis of the projects undertaken by researchers on the data; definition of the benchmarking statistics and prioritisation of features.
Task 1-3: Definition of methodologies. Study of anonymisation taking into account the new framework and the benchmarking statistics.
Task 1-4: Implementation and reporting. A report on the implementation process: pros and cons and critical points.

Dissemination strategy
Microdata → risk assessment
Apply SDL to reduce risk while maintaining some utility
Evaluate utility
Diagram labels: SDL methods, Disclosure risk (R), Utility (U), Original microdata, Anonymised microdata

Comparability: how to achieve it?
Bounded utility comparability
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to determine when action is needed
3. Setting of a process for choosing acceptable practices

SES benchmarking
Setting of objectives:
1. Member States
  a) Dissemination policy (NACE, size, etc.)
  b) Coherence
2. Users
  a) High-priority variables (e.g. NACE, size, region, salary)
  b) Minimum level of detail (NACE 2 digits)
  c) Types of analyses: ratios, weighted totals, salary change, etc.

SES: which predefined statistics?
ESSnet SDC Harmonisation Deliverable 1: focus on consequences on SDL
Part A. Survey structure
  a) quality
  b) relationships between variables
  c) classifications, etc.
Part B. Scientific research on SES data
  a) models
  b) methods
  c) breakdowns
  d) minimum level of detail, etc.
Diagram labels: Input, Output

SES: requirements stated in the legislative framework and its implementing measures
Reg. (EC) No 1738/2005
5 themes: information on local units, employees, working period, earnings, grossing-up factors
Optional variables → what to do?

SES2006: required characteristics
Theme: required characteristics
Data source: Tailored questionnaires, existing surveys, administrative sources or a combination of such sources; the information obtained must be of acceptable quality and be comparable between European countries.
Reference period: Year 2006, month: October. In some countries the accounting year does not coincide with the calendar year; for these countries the financial year is the best match with the calendar year 2006. Choice of another month is acceptable if appropriately justified.
Sampling design: Based on a sample of employees drawn from a stratified sample of local units. Reporting unit: the local unit or the enterprise. Observation unit: local unit.
Economic activities: Sections C-O excluding L of NACE Rev.1.1.
Population, enterprises: Enterprises with at least 10 employees in the covered economic activities.
Population, employees: Employees in the observation unit who have an employment contract in the reference month.

SES2006: minimum requirements
Theme 1
  1.1. Geographical location of the local unit: NUTS 1 level.
  1.2. Size of the enterprise to which the local unit belongs: 1-9*, 10-49, 50-249, 250-499, 500-999, 1000 and more employees. *This first band is optional for the 2006 SES.
  1.3. Principal economic activity of the local unit: 2-digit level of NACE Rev.1.1 for sections C to O. NACE section L is optional for the 2006 SES.
Theme 2
  2.3. Occupation in the reference month: to be coded according to the International Standard Classification of Occupations, 1988 version (ISCO-88 (COM)) at the two-digit level and, if possible, at the three-digit level.
  2.5. Highest successfully completed level of education and training: six levels coded according to the International Standard Classification of Education, 1997 version (ISCED 97).
  2.7.1. Share of a full-timer’s normal hours: for a part-time employee, the hours contractually worked should be expressed as a percentage of the number of normal hours worked by a full-time employee in the local unit.
Theme 3
  3.1. Number of weeks in the reference year to which the gross annual earnings relate: should correspond to the actual gross annual earnings (variable 4.1).
Theme 4
  4.2. Gross earnings in the reference month: should be re-calculated so that it reflects the exclusion of such employees from the sample.
  4.3. Average gross hourly earnings in the reference month: average gross earnings per hour paid to the employee in the reference month.
Theme 5
  5.1. Grossing-up factor for the local unit: within each sampling stratum, (Variable 5.1) = (Number of local units in the population) / (Number of local units in the sample).
  5.2. Grossing-up factor for the employees: (Variable 5.2) = (Variable 5.1) * (Number of employees in the local unit / Number of employees in the sample).
Slide annotations: Hierarchical classification “Independent” SDC sampling relationship formula
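The two grossing-up formulas (variables 5.1 and 5.2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the survey's production code: the stratum labels, the counts and the function names are all hypothetical.

```python
def unit_grossing_up(population_units, sample_units):
    """Variable 5.1 per stratum: local units in population / local units in sample."""
    return {
        stratum: population_units[stratum] / n_sampled
        for stratum, n_sampled in sample_units.items()
    }


def employee_grossing_up(var_5_1, employees_in_unit, employees_in_sample):
    """Variable 5.2 = Variable 5.1 * (employees in local unit / employees sampled)."""
    return var_5_1 * (employees_in_unit / employees_in_sample)


# Hypothetical counts of local units per sampling stratum.
population = {"NACE15_10-49": 200, "NACE24_50-249": 60}
sample = {"NACE15_10-49": 20, "NACE24_50-249": 15}

var_5_1 = unit_grossing_up(population, sample)
# A sampled local unit with 40 employees, 8 of whom are in the employee sample.
var_5_2 = employee_grossing_up(var_5_1["NACE15_10-49"], 40, 8)
print(var_5_1, var_5_2)
```

Because variable 5.2 is a deterministic function of variable 5.1 and the employee counts, any SDC step that drops or resamples records has to recompute both factors together.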

SES2006: constraints among variables
Variables that need to be consistent, with the types of checks required:

Variable 3.2 (number of hours actually paid during the reference month) should be consistent with variable 4.2 (monthly earnings).
Check: If the employee’s paid hours are affected by unpaid absence, then they should be adjusted to obtain paid hours for a full month. Where necessary, provide an approximate estimate of paid hours using: Adjusted 3.2 = Unadjusted 3.2 * (Adjusted 4.2 / Unadjusted 4.2). Where it is not feasible to adjust variable 4.2, then this employee should be excluded from the sample and the grossing-up factor (variable 5.2) re-calculated.

Variable 3.2.1 (number of overtime hours paid in the reference month) should be consistent with variable 4.2.1 (overtime earnings).
Check: If the employee’s overtime hours are affected by unpaid absence, then they should be adjusted to obtain the paid overtime hours for a full month. Where necessary, provide a rough estimate of paid overtime hours using: Adjusted 3.2.1 = Unadjusted 3.2.1 * (Adjusted 4.2.1 / Unadjusted 4.2.1). Where it is not feasible to adjust variable 4.2 or 4.2.1, then this employee should be excluded from the sample and the grossing-up factor (variable 5.2) re-calculated.

Variable 4.3 (average gross hourly earnings in the reference month).
Check: Average gross hourly earnings are derived from variable 4.2, gross earnings for the reference month, divided by variable 3.2, the number of hours paid during the same period.
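The adjustment rule for paid hours and the derivation of hourly earnings can be written out as follows. This is a small sketch of the arithmetic only; the function names and the example figures are hypothetical.

```python
def adjust_paid_hours(unadjusted_hours, unadjusted_earnings, adjusted_earnings):
    """Adjusted 3.2 = Unadjusted 3.2 * (Adjusted 4.2 / Unadjusted 4.2)."""
    if unadjusted_earnings <= 0:
        raise ValueError("unadjusted earnings must be positive to rescale hours")
    return unadjusted_hours * (adjusted_earnings / unadjusted_earnings)


def hourly_earnings(monthly_earnings, paid_hours):
    """Variable 4.3 derived as Variable 4.2 / Variable 3.2."""
    return monthly_earnings / paid_hours


# Hypothetical employee with unpaid absence: 128 paid hours and 1600 in earnings,
# where full-month earnings (adjusted 4.2) would have been 2000.
hours = adjust_paid_hours(128.0, 1600.0, 2000.0)
rate = hourly_earnings(2000.0, hours)
print(hours, rate)
```

The same function applies to overtime (variables 3.2.1 and 4.2.1), since the rule has the same form. These constraints matter for SDC because perturbing earnings without touching hours would break the 4.3 = 4.2 / 3.2 relationship.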

SES2006: deviations from EU Regulation (quality reports)
Inclusion of the employees in the sample
Due to the calculation of the average month, some consistency constraints between variables are not met
The definition of the variable gross annual earnings in the reference year
Classification of occupation and education
Enterprises, not local units

SES: main outputs
Data producers
  a) what is already published by MS and Eurostat
  b) to be coherent
  c) to avoid identification
  d) don’t look for ways to increase the info to be published
Users/researchers
  a) how SES data are used in scientific research (data, models, methods)
  b) obstacles (not administrative)
  c) without evaluating the scientific merit!

SES2006: indicators (Chronos)
1. Numbers of employees
  a) characteristics of the observation unit and employee
  b) specified bands of hours paid, annual holidays and of hourly/monthly/annual earnings
2. Gross earnings, paid hours and annual days of leave
  a) hourly, monthly and annual earnings, monthly paid hours and annual days of leave
  b) several measures of location and of dispersion

SES2006: breakdowns (Chronos)
Region: restricted to the national level
Economic activity: restricted to NACE
Size of the enterprise: 1-9, 10-49, 50-249, 250-499, 500-999, 1000 and more
Age: restricted to 5 size classes
Occupation: one-digit level of the ISCO 88 (COM) classification

Researchers/analysts: general
1. We do not evaluate the scientific merit!
2. More than 80 papers were consulted.
3. National and international comparisons.
4. With or without use of the hierarchical employer-employee structure.
5. Some papers report a lack of information from the enterprise side (e.g. financial).
6. Few longitudinal studies (for the moment).
7. “Home-made” harmonisation (when a MS does not survey/disseminate info on a Nace division, the entire (EU-level) info on the division is excluded from analyses).
8. With or without sample weights.

Studies
Wage differentials/wage dispersion
Labour market policy
Determinants/decomposition
“Classical” average gross earnings per enterprise or employee
Low(high)-pay dynamics
Bargaining regimes

Wage differentials
Gender pay gap
Gini coefficient, quintile share ratio
Breakdowns: region, education, age, gender, occupation
Employer (enterprise) impact: economic activity, size, productivity, policy, etc.

Models and methods
Linear models; mixed-effects, multi-level, ANOVA, quantile
Log(earnings) as response variable
Assumption of normally distributed errors
Method: ordinary least squares
Sometimes in two stages (enterprise and employee)
Role of local units?
No sampling weights; earnings at enterprise level (exception???)

Selected benchmarking statistics
European dissemination
Breakdowns: NUTS, gender, education, age, occupation
Weighted means
Linear models
Relationships of earnings

Prioritization
Questionnaire to the LAMAS WG national representatives
11 questions aiming at the collection of information on preferences regarding dissemination of an EU anonymised microdata file
19 MS answered (70%)

Prioritization
100%: existence of a legal dissemination framework
58%: national requirements
100%: standard classifications (NACE, NUTS, ISCO, ISCED)

Prioritization
32%: removal of all the optional variables, independently of their identification power
58%: removal of only the optional identifying variables

Prioritization
Enterprise variables:
1. Principal economic activity
2. Number of employees
3. Geographical location
Employee variables:
1. Gender
2. Occupation
3. Education
4. Age
5. Length of stay in service

SDC Methodologies
Data structure: hierarchical, relationships, etc.
Disclosure risk assessment
Disclosure risk limitation: individual ranking, constrained regression
Flexibility

SDC Methodologies
Part A: Risk assessment. Part B: Protection. Part C: Audit.
ENTERPRISE
  Risk assessment: population frequencies. Protection: recoding.
  Risk assessment: sample frequencies. Protection: preliminary recoding.
  Audit: quality indicators.
EMPLOYEE
  Risk assessment: only outliers. Protection: constrained regression.
  Risk assessment: all records. Protection: individual ranking.

SDC Methodologies - enterprises
Parameters (name: brief description; default value)
St: Threshold for the sample frequencies; default 2
Pt: Threshold for the population frequencies
ThresholdRiskStrata: Threshold for the percentage of the admissible strata at risk; default 0.02
ChooseDirectlyTheMostDetailedCombination: The most detailed combination satisfying the criteria is considered; default “y”
Slide notes: Increase: severity. Large values: only sample information. 0 means no risk. “n”: explorative analysis.
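A frequency-based check of this kind might look like the sketch below. It rests on assumptions that the slides do not state: a stratum (a combination of key variables) is flagged when its sample frequency does not exceed St, and recoding is considered when the share of flagged strata exceeds ThresholdRiskStrata. The population threshold Pt is not modelled, and the record layout and function name are hypothetical.

```python
from collections import Counter


def strata_at_risk(records, keys, st=2):
    """Flag strata whose sample frequency is <= st (assumed rule).

    Returns (flagged strata with counts, share of strata flagged,
    number of enterprises in flagged strata).
    """
    freq = Counter(tuple(r[k] for k in keys) for r in records)
    flagged = {stratum: n for stratum, n in freq.items() if n <= st}
    share = len(flagged) / len(freq)
    return flagged, share, sum(flagged.values())


# Hypothetical enterprise records described by the three key variables.
enterprises = [
    {"nace": "15", "nuts": "IT", "size": "10-49"},
    {"nace": "15", "nuts": "IT", "size": "10-49"},
    {"nace": "15", "nuts": "IT", "size": "10-49"},
    {"nace": "24", "nuts": "IT", "size": "250+"},
    {"nace": "24", "nuts": "IT", "size": "50-249"},
    {"nace": "24", "nuts": "IT", "size": "50-249"},
]

flagged, share, n_enterprises = strata_at_risk(enterprises, ("nace", "nuts", "size"))
# Assumed use of ThresholdRiskStrata: recode further while too many strata are flagged.
needs_recoding = share > 0.02
print(len(flagged), round(share, 3), n_enterprises, needs_recoding)
```

Under this reading, raising St flags more strata, which would make the check more severe.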

SDC Methodologies – enterprises Testing
Country: number of enterprises; number of strata
IT: 19899; 1029
NL: 36762; 867
AT: 14368; 105

SDC Methodologies – enterprises Testing
Key variables: Size (3 classes), Region (NUTS0), Nace (2 digit)
Country: number of strata at risk; number of enterprises at risk
IT: 139 (13.51%); 201 (1.01%)
NL: 69 (7.96%); 80 (0.22%)
AT: 49 (46.67%); 53 (0.37%)

Employees at risk
Frequency criteria for each combination of key categorical variables:
  Information on the enterprise (Nace, Nuts, Size)
  Demographic variables (Age, Gender)
High AnnualEarnings: greater than a threshold T = quantile
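One way to combine the two criteria is sketched below: an employee is flagged when annual earnings exceed a quantile-based threshold T and the employee's combination of key variables is rare in the file. The exact rule used in the project is not given on the slide, so the "both conditions" logic, the simple empirical quantile and all field names here are assumptions.

```python
import math
from collections import Counter


def empirical_quantile(values, q):
    """Smallest observed value with at least a share q of the data at or below it."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]


def employees_at_risk(records, keys, qq=0.99, threshold=1):
    """Flag records with earnings above the qq-quantile and key frequency <= threshold."""
    cutoff = empirical_quantile([r["annual_earnings"] for r in records], qq)
    freq = Counter(tuple(r[k] for k in keys) for r in records)
    return [
        r for r in records
        if r["annual_earnings"] > cutoff
        and freq[tuple(r[k] for k in keys)] <= threshold
    ]


KEYS = ("nace", "gender", "age")
# Hypothetical employee records; the quantile is lowered to 0.75 for this tiny file.
employees = [
    {"nace": "15", "gender": "F", "age": "30-39", "annual_earnings": 20000},
    {"nace": "15", "gender": "F", "age": "30-39", "annual_earnings": 22000},
    {"nace": "15", "gender": "M", "age": "30-39", "annual_earnings": 25000},
    {"nace": "15", "gender": "M", "age": "30-39", "annual_earnings": 95000},
    {"nace": "24", "gender": "F", "age": "40-49", "annual_earnings": 27000},
    {"nace": "24", "gender": "F", "age": "40-49", "annual_earnings": 30000},
    {"nace": "24", "gender": "M", "age": "50-59", "annual_earnings": 90000},
    {"nace": "24", "gender": "M", "age": "40-49", "annual_earnings": 31000},
]

at_risk = employees_at_risk(employees, KEYS, qq=0.75, threshold=1)
print([r["annual_earnings"] for r in at_risk])
```

In this toy file the employee earning 95000 is not flagged with threshold 1 because another record shares the same key combination; raising the threshold to 2 would flag both high earners.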

SDC Methodologies – employees Testing
Parameters (name: brief description; default value)
SynthesizeAll: SynthesizeAll="y", how to evaluate the risk; default “n”
MinNbEmployeesPerStrata: More than MinNbEmployees per strata; default 5
qq: Quantile value for the definition of the employees at risk; default 0.99
qqOverall: How to compute the quantile; default “y”
threshold: Number of employees that could be considered at risk; default 1

SDC Methodologies – employees Testing
SynthesizeAll = “y”: no risk evaluation, all units at risk.

SDC Methodologies – employees Testing
SynthesizeAll = “n”, threshold = 1: unique cases
SynthesizeAll = “n”, threshold = 2: unique and double cases

SDC Methodologies – employees Testing
SynthesizeAll = “n”, qqOverall: risk threshold, by strata
SynthesizeAll = “n”, qqOverall = “y”: risk threshold, no stratification

SDC Methodologies – employees Testing

SDC Methodologies – employees Testing
SynthesizeAll = “y”, qq: all units at risk.
SynthesizeAll = “y”, qq = 0.95, 0.99: if AE > qqV and if unique.
SynthesizeAll = “y”, qq = 100: no units at risk.

SDC Methodologies – employees Testing

SDC Methodologies – employees Testing

Protection: minimal requirements
Protect with respect to the assumed scenarios.
Protect if needed → dependency on the disclosure scenario
Probabilistic method.

Dissemination strategy
Microdata → risk assessment
Apply SDL to reduce risk while maintaining some utility
Evaluate utility
Diagram labels: SDL methods, Disclosure risk (R), Utility (U), Original microdata, Anonymised microdata

Controlled Selective Masking
Perturb, but generate (control) quality:
  coherence (already released statistics)
  utility (users’ needs)
Add more (linear) constraints: variation of weighted totals

SDL - implementation
Individual ranking
Model-based
Parameters: IR.param, stratification
Re-use
Ease of implementation
Flexibility

SDC Methodologies – employees Testing
SynthesizeAll = “y”: IR, 3 or 5, by strata, on all
SynthesizeAll = “n”, Method = IR: IR, 3 or 5, by strata, on risk
sdcMicro: IR, no strata, on all
IR: no control on
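For readers unfamiliar with individual ranking (IR), the sketch below shows the usual univariate form of the technique: sort one variable, form groups of k consecutive values (3 or 5 in the tests above) and replace each value by its group mean. This is a plain-Python illustration written for this transcript, not the project's code and not the sdcMicro implementation; folding a short tail into the last group is an assumption.

```python
def individual_ranking(values, k=3):
    """Univariate microaggregation: replace each value by the mean of its rank group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # fewer than k values left over: fold them into this group
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
        start = end
    return masked


# Hypothetical annual earnings (in thousands) within one stratum.
earnings = [30, 10, 20, 100, 50, 40, 60]
masked = individual_ranking(earnings, k=3)
print(masked)
```

Group means preserve the unweighted total of the variable, but nothing in this procedure keeps weighted totals fixed; running it "by strata" amounts to calling the function separately on each stratum's values.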

Model-based Disclosure Limitation
Assume a model.
Estimate the parameters.
Release the fitted values.
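The three steps can be illustrated with the simplest possible model. The sketch assumes a log-linear model of earnings on a single numeric covariate fitted by ordinary least squares; the covariate, the data and the function names are hypothetical, and the project's constrained regression (ConReg) adds controls that are not reproduced here.

```python
import math


def fit_simple_ols(x, y):
    """Closed-form OLS for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def release_fitted_values(x, earnings):
    """Assume log(earnings) = a + b * x, estimate a and b, release exp(fitted)."""
    log_y = [math.log(e) for e in earnings]
    a, b = fit_simple_ols(x, log_y)
    return [math.exp(a + b * xi) for xi in x]


# Hypothetical data: length of service in years and annual earnings.
years = [1, 3, 5, 10, 20]
earnings = [18000.0, 20000.0, 23000.0, 30000.0, 52000.0]
released = release_fitted_values(years, earnings)
print([round(r) for r in released])
```

The released values follow the model exactly, so individual deviations (the residuals) are removed, while the fitted coefficients of this particular model are unchanged by construction.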

Model-based Disclosure Limitation ConReg: control on

Utility
Weighted totals
Analytical validity: correlations, variance, linear models
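Two of these utility checks, weighted totals and correlations, can be sketched as simple before/after comparisons. The data, the weights and the choice of relative change as the summary are illustrative assumptions; the slides do not specify how the comparison was scored.

```python
import math


def weighted_total(values, weights):
    """Sum of values multiplied by their grossing-up weights."""
    return sum(v * w for v, w in zip(values, weights))


def relative_change(original, perturbed):
    """Relative change of a statistic after perturbation."""
    return (perturbed - original) / original


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical earnings before and after masking (group means of three), with weights.
ae_original = [10, 20, 30, 40, 50, 60]
ae_masked = [20, 20, 20, 50, 50, 50]
weights = [5, 1, 1, 1, 1, 2]
hours = [100, 120, 130, 150, 160, 170]

total_original = weighted_total(ae_original, weights)
total_masked = weighted_total(ae_masked, weights)
change = relative_change(total_original, total_masked)
corr_original = pearson(ae_original, hours)
corr_masked = pearson(ae_masked, hours)
print(total_original, total_masked, round(change, 4))
print(round(corr_original, 3), round(corr_masked, 3))
```

Here the unweighted total is preserved by the masking (210 in both cases) while the weighted total moves, which is the kind of effect that motivates adding constraints on weighted totals.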

Weighted Totals Constrained Regression

Weighted Totals IR

Correlation AE_AB

Correlation ME_OVER

Correlation ME_SHIFT

Variance

Linear Models
By combinations of Nace, Nuts and Size:
Log(AE) = f(B21, B22, B23, B25, B26, B27)
Log(AE.pert) = f(B21, B22, B23, B25, B26, B27)
Compare the coefficients.
The same for the sub-models: B21 + B22; B23 + B25 + B26 + B27
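The comparison of coefficients can be sketched as follows: fit the same log-linear model to the original and to the perturbed earnings and look at the differences. The code is a minimal ordinary least squares solver via the normal equations, written for illustration; the two numeric covariates are hypothetical stand-ins for the B2x variables, and the data are invented.

```python
import math


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y, by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Design matrix: intercept, a hypothetical 0/1 covariate, a hypothetical numeric covariate.
X = [[1, 0, 25], [1, 1, 30], [1, 0, 40], [1, 1, 45], [1, 0, 50], [1, 1, 55]]
ae = [18000.0, 24000.0, 26000.0, 33000.0, 30000.0, 41000.0]
# Hypothetical perturbed earnings (means of rank groups of three).
ae_pert = [68000 / 3, 68000 / 3, 68000 / 3, 104000 / 3, 104000 / 3, 104000 / 3]

coef_original = ols(X, [math.log(v) for v in ae])
coef_perturbed = ols(X, [math.log(v) for v in ae_pert])
diffs = [p - o for o, p in zip(coef_original, coef_perturbed)]
print([round(d, 4) for d in diffs])
```

Repeating the fit within each combination of Nace, Nuts and Size, as the slide describes, would just mean calling `ols` once per subset of records.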

Linear Models

Confidence intervals overlapping IT NL Model Method Strat qq OWithinP PWithinO B21 B22 IR strat 0.992 0.984 0.985 B21 B22 B23 B25 B26 B27 0.998 0.952 B23 B25 B26 B27 0.968 no strat 0.945 0.929 0.936 0.947 0.960 0.958 0.949 0.970 0.946 0.956 ConReg 0.99 0.997 0.953 0.988 0.996 0.986 0.950 0.989 0.983 0.965 0.963

Problems
Data structure
Data format: missing values, categories
(Open source) software knowledge
Data knowledge
Documentation is a must!

Final issues
Data structure: hierarchical, relationships, etc.
Disclosure risk assessment: national and subjective
Disclosure risk limitation: protect w.r.t. the scenario
Flexibility

Final issues
Collaboration is necessary.
Consultation is necessary.
Testing is necessary.
Comparability may be achieved: development of bounded-utility methods.
Governance structure should be defined.

THANK YOU!