Anonymisation of the EU-SILC database: result of the work of the EU-SILC TF Jean-Marc Museux The Statistical Office of the European Communities Unit F3:

Anonymisation of the EU-SILC database: result of the work of the EU-SILC TF
Jean-Marc Museux
The Statistical Office of the European Communities
Unit F3: Living conditions and social protection

Geneva - Nov Eurostat - UNECE worksession
Outline
- EU-SILC Task Force on Anonymisation
- EU-SILC instrument and database
- Methodological issues
- Implementation
- Conclusions

EU-SILC Task Force on Anonymisation
Objective
- To come up with best practices and recommendations for anonymisation of EU-SILC databases
Participants
- B. Benard (Eurostat), L. Coppola (Istat), P. Feuvrier (INSEE), Ph. Gublin/J. Longhurst (ONS), N. Jukic (Stat of Slovenia), H. Minkel (Destatis), JM Museux (Eurostat), E. Schulte Nordholt (CBS), H. Sauli (Stat Fin)

EU-SILC instrument
Instrument
- gathering ex post harmonised micro data
- on income and living conditions
- from 27 European States
Regulatory framework
- Harmonised definitions
- Minimum methodological requirements (probability sampling, fieldwork, …)
- Methodological recommendations
Main source for EU (income) poverty indicators

EU-SILC instrument
Variables
- Income (Canberra recommendations)
- Demographic
- Labour status
- Living conditions: housing, deprivation, health
Measurement units
- Households and individuals

EU-SILC instrument
Databases
- Annual cross sectional data from 2004 onwards (households and individuals)
- Longitudinal data (subset of individual variables), minimum 3 years spell (4 waves)
Data collection
- Implementation under the responsibility of EU+ National Statistical Institutes
Flexibility
- Rotational design, pure panel or independent components
- Survey data and/or register data

Release policy
Interest of the database
- Social and employment policy monitoring (EU Commission services and study centres)
- Social research (universities, research centres)
Legal issues
- EU legislation allows micro data to be released for scientific purposes
- Micro data have to be anonymised in order to minimise the risk of disclosure of individual information
- The EU-SILC regulation plans scientific release according to a strict timetable

Release policy
Eurostat main orientations
- Right to information collected with public money
- Maximise the utility of the data collected and the social return on the money invested (20 million EUR/year)
- Significant improvement of quality through user feedback
Implementation
- Encrypted CD-ROM with anonymised EU-SILC database released under licence to researchers
- Centralised (Luxembourg) Safe Centre with limited capacity
- Decentralised access under study
- Remote access not yet developed

Anonymisation: main issues
Heterogeneous environment in EU
- Different perceptions of disclosure risk
- No single European best practice
- Various implementations of essentially the same common principles
- Significant variations of disclosure risk (e.g. Norwegian income register available on the Web)
Harmonisation of procedures in order to ease international comparison

Anonymisation: main issues
Methodological issues
- Common disclosure/attacker scenarios for EU purpose
- Measures of risk
- Hierarchical files (household and individual levels)
- Longitudinal aspects
- Cross sectional and longitudinal files matching
- Sampling design information
- Register matching
- Methods of protection

Methodological issues
Common disclosure/attacker scenarios
Broad band approach considering combinations of 3 types of identifying/key variables
- Geographic information
- Sex
- Age | Activity | Education | Dwelling | Marital Status | Citizenship | Place of Birth
  Economic status | Employment | Sector of activity | Household size | Household type

Methodological issues
Common EU disclosure/attacker scenarios
3 additional and more complex attacker scenarios
- EU1 (Simple attack with HH information, individual and household level):
  REGION x SEX x YEAR OF BIRTH x MARITAL STATUS x HH SIZE x HH TYPE
- EU2 (Nosy neighbour individual attack):
  REGION x URBANISATION x SEX x DATE OF BIRTH x BASIC ACTIVITY STATUS x BATH OR SHOWER x DO YOU HAVE A CAR? x EDUCATION x OCCUPATION x SECTOR OF ACTIVITY x HH SIZE x HH TYPE
- EU3 (Occupational group address book individual attack):
  REGION x URBANISATION x SEX x DATE OF BIRTH x EMPLOYMENT STATUS x OCCUPATION x SECTOR OF ACTIVITY

Methodological issues
Measure of risk and threshold
For the broad band approach, thresholds are expressed in sample frequencies (heuristic developed by CBS-NL)

Sampling fraction f    Countries           Threshold = int(1 + 114 f)
1/50 - 1/2             LU (f = 2.5%)       5
1/100 - 1/50           MT, IS, CY          3
1/200 - 1/100          EE, SI              2
< 1/200                All other 21 MS     1
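As a rough illustration of the threshold rule above, the Python sketch below evaluates int(1 + 114 f) as printed on the slide and flags key-variable combinations that occur rarely in a sample. The function names, the toy records and the choice of "at or below the threshold" as the risk criterion are assumptions made for illustration only. The bracket values in the table are not re-derived here.

```python
from collections import Counter


def threshold(f):
    """Frequency threshold int(1 + 114 f) for sampling fraction f, as printed on the slide."""
    return int(1 + 114 * f)


def risky_combinations(records, keys, f):
    """Return key combinations whose sample frequency is at or below the threshold.

    Assumption: 'at or below' is used as the risk criterion; the slide does not spell it out.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    t = threshold(f)
    return {combo: n for combo, n in counts.items() if n <= t}


# Toy sample with hypothetical values
sample = [
    {"region": "A", "sex": "F", "age": 34},
    {"region": "A", "sex": "F", "age": 34},
    {"region": "A", "sex": "M", "age": 81},
    {"region": "B", "sex": "F", "age": 34},
]
flagged = risky_combinations(sample, ("region", "sex", "age"), f=0.004)
```

With a sampling fraction below 1/200 the threshold is 1, so only combinations that are unique in the sample are flagged in this toy case.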

Methodological issues
Measure of risk and threshold for more complex scenarios
- Probability of a correct match, based on the key variables, between the survey database and the attacker's database
- Measure developed by Benedetti and Franconi and available in Mu-Argus
- Takes into account the hierarchical structure of the files: individuals/households
- In practice, due to software limitations, only six variables are handled simultaneously, and various combinations using subsets of the key variables are tested.
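The six-variable limit means that a scenario with more key variables has to be checked in pieces. A minimal sketch of enumerating those pieces is given below, using the EU2 key variables listed earlier in the presentation. The Python identifiers are hypothetical names, the sketch does not reproduce the Mu-Argus risk measure itself, and in practice probably only a selection of these subsets would be tested.

```python
from itertools import combinations

# EU2 "nosy neighbour" key variables; identifiers are illustrative names only
EU2_KEYS = [
    "region", "urbanisation", "sex", "date_of_birth",
    "basic_activity_status", "bath_or_shower", "car",
    "education", "occupation", "sector_of_activity",
    "hh_size", "hh_type",
]

MAX_VARS = 6  # software limit mentioned on the slide


def key_subsets(keys, size=MAX_VARS):
    """List every subset of `keys` with at most `size` variables taken at once."""
    return list(combinations(keys, min(size, len(keys))))


subsets = key_subsets(EU2_KEYS)
```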

Methodological issues
Hierarchical structure of information
- Household and individual information are collected in EU-SILC
- Household and individual records share common identifiers (linkable)
- Possibility of linkage is required for many statistical studies
- Increased risk of disclosure: individual information can be disclosed through household information and vice versa
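One simple way to picture the increased risk is to treat a household as identified as soon as any one of its members is identified. The sketch below assumes independent individual risks r_i and combines them as 1 - prod(1 - r_i). Both the independence assumption and the function name are illustrative and are not taken from the slides.

```python
from math import prod


def household_risk(individual_risks):
    """Risk that at least one household member is re-identified.

    Assumption: individual risks are independent probabilities in [0, 1].
    """
    return 1.0 - prod(1.0 - r for r in individual_risks)


# Two members with hypothetical risks of 10% and 20%
example = household_risk([0.1, 0.2])
```

Under these assumptions the household risk is never lower than the highest individual risk, which is one way of reading "individual information can be disclosed through household information".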

Methodological issues
Measure of risk and threshold
- In addition, external information on population uniques (ONS) is used to cross check protection measures (for instance, households with 5 or more members, described by the age and sex of their members, are often population unique up to a high level of geographic aggregation)

Methodological issues
Longitudinal data
- The follow up of individuals through time generates rare transitions in some key variables.
- These transitions are potentially disclosive if the attacker database is updated with the same frequency
- Corresponding risk is not easily estimated
Matching of longitudinal and cross sectional data files
- For rotational panel and pure panel designs, the longitudinal and cross sectional files can be matched on the basis of common variables
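To make the idea of rare transitions concrete, the following sketch counts how often each change in a key variable occurs between two waves and reports the changes seen at most `max_count` times. The data layout (a dict from person identifier to record), the variable names and the cut-off are hypothetical. This is only a counting aid, not an estimate of the corresponding disclosure risk.

```python
from collections import Counter


def rare_transitions(wave_a, wave_b, variable, max_count=1):
    """Return transitions of `variable` between two waves seen at most `max_count` times.

    wave_a, wave_b: dicts mapping a person identifier to a record (dict).
    """
    transitions = Counter()
    for pid, rec_a in wave_a.items():
        rec_b = wave_b.get(pid)
        if rec_b is not None and rec_a[variable] != rec_b[variable]:
            transitions[(rec_a[variable], rec_b[variable])] += 1
    return {t: n for t, n in transitions.items() if n <= max_count}


# Toy waves with hypothetical values
wave_1 = {1: {"marital": "single"}, 2: {"marital": "single"},
          3: {"marital": "widowed"}, 4: {"marital": "married"}}
wave_2 = {1: {"marital": "married"}, 2: {"marital": "married"},
          3: {"marital": "married"}, 4: {"marital": "married"}}
rare = rare_transitions(wave_1, wave_2, "marital")
```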

Methodological issues
Sampling design information
- Design weights and strata identifiers are potentially disclosive because they are correlated with disaggregated geographical information
Register information
- A few variables (income components) in EU-SILC are obtained directly from registers
- The availability of registers to attackers is limited except in rare situations (Income Register in Norway and Tax Register in Finland)

Methodological issues
Methods of protection
Global/top recoding
- Usability of the database
- Requires trade-offs between variables
Local suppressions
- May make statistical analysis difficult
- Only if they allow a significant gain in the global recoding of secondary variables
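Local suppression can be pictured as blanking a single key variable only in the records that remain risky, rather than coarsening that variable for everyone. The sketch below assumes the risky key combinations have already been identified by some other means. The function name, the record layout and the use of None as the suppressed value are illustrative choices.

```python
def suppress_locally(records, keys, risky, variable):
    """Return copies of `records` with `variable` set to None where the key combination is risky.

    risky: a set of tuples of key values, in the order given by `keys`.
    """
    protected = []
    for record in records:
        copy = dict(record)  # leave the input records untouched
        if tuple(record[k] for k in keys) in risky:
            copy[variable] = None
        protected.append(copy)
    return protected


# Toy data with hypothetical values
data = [
    {"region": "A", "sex": "M", "occupation": "pilot"},
    {"region": "A", "sex": "F", "occupation": "clerk"},
]
result = suppress_locally(data, ("region", "sex"), {("A", "M")}, "occupation")
```

The missing values this produces are what makes later statistical analysis harder, which is why the slide restricts the method to cases with a clear gain.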

Experiments
Level of recoding significantly decreasing disclosure risk
- Geographic information needs to be coarsened depending on the size of the country (for large countries, NUTS1 and degree of urbanisation could be released)
- Country of birth and Citizenship should be coarsened into 4 broad categories
- Age can be delivered in years but must be top coded (80+). This avoids the difficulty of ensuring coherence of protection of longitudinal and cross sectional data
- Number of rooms must be top coded (5+)
- ISCED levels 5 and 6 must be regrouped
- NACE is regrouped at 19 levels
- ISCO 2 digit code can be released
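Three of the recodings listed above are simple enough to sketch directly: top coding of age at 80, top coding of the number of rooms at 5, and merging ISCED levels 5 and 6. The field names and the coded values used for the merged categories are assumptions for illustration. The geographic, citizenship and NACE recodings are omitted because the slides do not give their category definitions.

```python
def recode(record):
    """Apply top coding and regrouping to one record (returns a new dict)."""
    out = dict(record)
    out["age"] = min(record["age"], 80)      # 80 stands for "80+"
    out["rooms"] = min(record["rooms"], 5)   # 5 stands for "5+"
    if record["isced"] in (5, 6):
        out["isced"] = "5-6"                 # hypothetical label for the merged levels
    return out


# Toy records with hypothetical values
old = recode({"age": 93, "rooms": 7, "isced": 6})
young = recode({"age": 40, "rooms": 3, "isced": 3})
```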

Implementation
Remaining risks
- Identification of large households remains
- Rare transitions in longitudinal data
- Sampling design information
- Specific national circumstances
Researcher needs
- Household structure
- Longitudinal data for longitudinal analysis
- Design information for proper inference (not only variable but causal models)
Harmonisation and flexibility

Implementation
ECHP experience
- Large dissemination in research community under licence release
- Less protection
- No observed breach of confidentiality
For EU-SILC
- Developing a responsible management of risk through controlled release and possibly audit provision and follow up.

Implementation
Eurostat approach
- Common rules for anonymisation of national databases
- Residual flexibility is allowed to adapt to national situations following national assessment according to common standards (measure of risk and thresholds, …)

Conclusions
Anonymisation is a matter of trade-offs
- Among national perceptions of disclosure risk
- Between the right to privacy and researcher needs
- Between presence of risk and monitoring of risk
Value added of EU-SILC TF
- These trade-offs have been debated and made explicit