1
Anonymisation of the EU-SILC database: result of the work of the EU-SILC TF
Jean-Marc Museux
The Statistical Office of the European Communities
Unit F3: Living conditions and social protection
Jean-Marc.Museux@cec.eu.int
2
Geneva - Nov. 2005 Eurostat - UNECE worksession
Outline
- EU-SILC Task Force on Anonymisation
- EU-SILC instrument and database
- Methodological issues
- Implementation
- Conclusions
3
EU-SILC Task Force on Anonymisation
Objective
- To come up with best practices and recommendations for anonymisation of EU-SILC databases
Participants
- B. Benard (Eurostat), L. Coppola (Istat), P. Feuvrier (INSEE), Ph. Gublin/J. Longhurst (ONS), N. Jukic (Stat of Slovenia), H. Minkel (Destatis), JM Museux (Eurostat), E. Schulte Nordholt (CBS), H. Sauli (Stat Fin)
4
EU-SILC instrument
Instrument
- gathering ex post harmonised micro data
- on income and living conditions
- from 27 European States
Regulatory framework
- Harmonised definitions
- Minimum methodological requirements (probability sampling, fieldwork, …)
- Methodological recommendations
Main source for EU (income) poverty indicators
5
EU-SILC instrument
Variables
- Income (Canberra recommendations)
- Demographic
- Labour status
- Living conditions: housing, deprivation, health
Measurement units
- Households and individuals
6
EU-SILC instrument
Databases
- Annual cross-sectional data from 2004 onwards (households and individuals)
- Longitudinal data (subset of individual variables), minimum 3 years spell (4 waves)
Data collection
- Implementation under the responsibility of EU+ National Statistical Institutes
Flexibility
- Rotational design, pure panel or independent components
- Survey data and/or register data
7
Release policy
Interest of the database
- Social and employment policy monitoring (EU Commission services and study centres)
- Social research (universities, research centres)
Legal issues
- EU legislation allows micro data release for scientific purposes
- Micro data have to be anonymised in order to minimise the risk of disclosure of individual information
- EU-SILC regulation plans scientific release according to a strict timetable
8
Release policy
Eurostat main orientations
- Right to information collected with public money
- Maximise the utility of the data collected and the social return on the money invested (20 million € per year)
- Significant improvement of quality through user feedback
Implementation
- Encrypted CD-ROM with anonymised EU-SILC database released under licence to researchers
- Centralised (Luxembourg) Safe Centre with limited capacity
- Decentralised access under study
- Remote access not yet developed
9
Anonymisation – Main issues
Heterogeneous environment in EU
- Different perceptions of disclosure risk
- No single European best practice
- Various implementations of broadly the same common principles
- Significant variations of disclosure risk (e.g. Norwegian income register available on the Web)
Harmonisation of procedures in order to ease international comparison
10
Anonymisation – Main issues
Methodological issues
- Common disclosure/attacker scenarios for EU purpose
- Measures of risk
- Hierarchical files (household and individual levels)
- Longitudinal aspects
- Cross-sectional and longitudinal files matching
- Sampling design information
- Register matching
- Methods of protection
11
Methodological issues
Common disclosure/attacker scenarios
Broad band approach considering combinations of 3 types of identifying/key variables
- Geographic information
- Sex
- Age | Activity | Education | Dwelling | Marital Status | Citizenship | Place of Birth | Economic status | Employment | Sector of activity | Household size | Household type
12
Methodological issues
Common EU disclosure/attacker scenarios
3 additional and more complex attacker scenarios
- EU1 (Simple attack with HH information, individual and household level): REGION x SEX x YEAR OF BIRTH x MARITAL STATUS x HH SIZE x HH TYPE
- EU2 (Nosy neighbour individual attack): REGION x URBANISATION x SEX x DATE OF BIRTH x BASIC ACTIVITY STATUS x BATH OR SHOWER x DO YOU HAVE A CAR? x EDUCATION x OCCUPATION x SECTOR OF ACTIVITY x HH SIZE x HH TYPE
- EU3 (Occupational group address book individual attack): REGION x URBANISATION x SEX x DATE OF BIRTH x EMPLOYMENT STATUS x OCCUPATION x SECTOR OF ACTIVITY
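As an illustration of how a scenario key such as EU1 is applied, the sketch below cross-tabulates a sample on the EU1 key variables and lists the records whose combination occurs only once. The column names and the toy records are hypothetical; they are not the EU-SILC variable codes.

```python
from collections import Counter

# Hypothetical column names standing in for the EU1 key variables.
EU1_KEY = ("region", "sex", "year_of_birth", "marital_status", "hh_size", "hh_type")

def key_frequencies(records, key=EU1_KEY):
    """Count how many sample records share each combination of key values."""
    return Counter(tuple(rec[k] for k in key) for rec in records)

def sample_uniques(records, key=EU1_KEY):
    """Return the records whose key combination occurs exactly once in the sample."""
    freq = key_frequencies(records, key)
    return [rec for rec in records if freq[tuple(rec[k] for k in key)] == 1]

# Toy data, invented for the example.
records = [
    {"region": "R1", "sex": "F", "year_of_birth": 1950,
     "marital_status": "married", "hh_size": 2, "hh_type": "couple"},
    {"region": "R1", "sex": "F", "year_of_birth": 1950,
     "marital_status": "married", "hh_size": 2, "hh_type": "couple"},
    {"region": "R2", "sex": "M", "year_of_birth": 1921,
     "marital_status": "widowed", "hh_size": 7, "hh_type": "other"},
]

uniques = sample_uniques(records)
```

A sample unique is only a candidate for attention: whether it is also rare in the population is what the risk measures on the following slides try to assess.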
13
Methodological issues
Measure of risk and threshold
For broad band approach, thresholds are expressed in sample frequencies (heuristic developed by CBS-NL)
Sampling fraction f | Countries | Threshold = int(1 + 114 f)
1/50 – 1/2 | LU (f = 2.5%) | 5
1/100 – 1/50 | MT, IS, CY | 3
1/200 – 1/100 | EE, SI | 2
< 1/200 | All other 21 MS | 1
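The threshold formula can be evaluated directly. The sketch below computes int(1 + 114 f) for a given sampling fraction; the helper `is_below_threshold` is hypothetical and assumes that a key combination is flagged when its sample frequency is strictly below the threshold, which the slide does not spell out.

```python
def broad_band_threshold(f):
    """Threshold int(1 + 114 f) for a sampling fraction f, as given on the slide."""
    return int(1 + 114 * f)

def is_below_threshold(sample_count, f):
    """Assumption: a combination is flagged when it occurs fewer than threshold times."""
    return sample_count < broad_band_threshold(f)

# Formula evaluated at three sampling fractions.
t_50 = broad_band_threshold(1 / 50)    # 3
t_100 = broad_band_threshold(1 / 100)  # 2
t_200 = broad_band_threshold(1 / 200)  # 1
```

At f = 1/50, 1/100 and 1/200 the formula gives 3, 2 and 1.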
14
Methodological issues
Measure of risk and threshold for more complex scenarios
- Probability of a correct match, based on the key variables, between the survey database and the attacker’s database
- Measure developed by Benedetti and Franconi and available in Mu-Argus
- Takes into account the hierarchical structure of the files: individuals/households
- In practice, due to software limitations, only six variables are handled simultaneously, and various combinations using subsets of the key variables are tested
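To give a sense of what testing subsets of key variables involves, the sketch below enumerates every six-variable subset of the twelve EU2 key variables. The slides do not say which subsets were actually tested, so enumerating all of them is only an illustration; the variable names are hypothetical labels, and nothing here calls Mu-Argus.

```python
from itertools import combinations

# Hypothetical labels for the twelve EU2 key variables.
EU2_KEY = [
    "region", "urbanisation", "sex", "date_of_birth", "basic_activity_status",
    "bath_or_shower", "has_car", "education", "occupation",
    "sector_of_activity", "hh_size", "hh_type",
]

MAX_VARS = 6  # limit on simultaneously handled variables, as stated on the slide

# All six-variable subsets of the twelve key variables: C(12, 6) = 924.
subsets = list(combinations(EU2_KEY, MAX_VARS))
```

With 924 possible subsets for EU2 alone, some selection of combinations is needed in practice.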
15
Methodological issues
Hierarchical structure of information
- Household and individual information are collected in EU-SILC
- Household and individual records share common identifiers (linkable)
- Possibility of linkage is required for many statistical studies
- Increased risk of disclosure: individual information can be disclosed through household information and vice versa
16
Methodological issues
Measure of risk and threshold
In addition, external information on population uniques (ONS) is used to cross-check protection measures (for instance, households of 5+ members described by the age and sex of their members are often population unique up to a high level of geographic aggregation)
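The household example can be made concrete with a small sketch that builds a composition signature (the sorted ages and sexes of the members) for each household and lists large households whose signature is unique. It works on sample data only, whereas the slide refers to population uniques established from external information; the field names and records are hypothetical.

```python
from collections import Counter, defaultdict

def household_signatures(persons):
    """Map each household id to a sorted tuple of (age, sex) pairs of its members."""
    members = defaultdict(list)
    for p in persons:
        members[p["hh_id"]].append((p["age"], p["sex"]))
    return {hh: tuple(sorted(m)) for hh, m in members.items()}

def unique_large_households(persons, min_size=5):
    """Return ids of households with at least min_size members and a unique signature."""
    sigs = household_signatures(persons)
    counts = Counter(sigs.values())
    return sorted(hh for hh, s in sigs.items()
                  if len(s) >= min_size and counts[s] == 1)

# Toy data, invented for the example.
persons = (
    [{"hh_id": "A", "age": a, "sex": s}
     for a, s in [(45, "M"), (43, "F"), (17, "F"), (12, "M"), (9, "F")]]
    + [{"hh_id": "B", "age": a, "sex": s} for a, s in [(70, "M"), (68, "F")]]
    + [{"hh_id": "C", "age": a, "sex": s} for a, s in [(68, "F"), (70, "M")]]
)

flagged = unique_large_households(persons)
```

Households B and C share the same signature and are not flagged; the five-person household A is.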
17
Methodological issues
Longitudinal data
- The follow-up of individuals through time generates rare transitions in some key variables
- These transitions are potentially disclosive if the attacker’s database is updated with the same frequency
- The corresponding risk is not easily estimated
Matching of longitudinal and cross-sectional data files
- For rotational panel and pure panel designs, the longitudinal and cross-sectional files can be matched on the basis of common variables
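A minimal sketch of why common variables make the two files linkable: records are paired when their values on the shared variables agree and that combination is unique in both files. The variable names and records are hypothetical, and real matching would have to allow for changes between waves.

```python
from collections import defaultdict

def unique_matches(file_a, file_b, common):
    """Pair record positions whose common-variable values are unique in both files."""
    def index(records):
        idx = defaultdict(list)
        for pos, rec in enumerate(records):
            idx[tuple(rec[v] for v in common)].append(pos)
        return idx

    idx_a, idx_b = index(file_a), index(file_b)
    return [(a[0], idx_b[key][0]) for key, a in idx_a.items()
            if len(a) == 1 and len(idx_b.get(key, [])) == 1]

# Toy data, invented for the example.
cross_sectional = [
    {"sex": "F", "age": 34, "region": "R1"},
    {"sex": "M", "age": 50, "region": "R1"},
    {"sex": "M", "age": 50, "region": "R1"},
]
longitudinal = [
    {"sex": "M", "age": 50, "region": "R1", "wave": 1},
    {"sex": "F", "age": 34, "region": "R1", "wave": 1},
]

pairs = unique_matches(cross_sectional, longitudinal, ["sex", "age", "region"])
```

Only the first cross-sectional record is linked, to the second longitudinal record; the two identical male records cannot be told apart. This is why protection applied to one file has to be coherent with protection applied to the other.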
18
Methodological issues
Sampling design information
- Design weights and strata identifiers are potentially disclosive because they are correlated with disaggregated geographical information
Register information
- Few variables (income components) in EU-SILC are obtained directly from registers
- The availability of registers to attackers is limited except in rare situations (income register in Norway and tax register in Finland)
19
Methodological issues
Methods of protection
Global/top recoding
- Usability of the database
- Requires trade-offs between variables
Local suppressions
- May make statistical analysis difficult
- Only if they allow a significant gain in global recoding of secondary variables
20
Experiments
Level of recoding significantly decreasing disclosure risk
- Geographic information needs to be coarsened depending on the size of the country (for large countries, NUTS1 and degree of urbanisation could be released)
- Country of birth and citizenship should be coarsened into 4 broad categories
- Age can be delivered in years but must be top coded (80+). This avoids the difficulty of ensuring coherence of protection of longitudinal and cross-sectional data
- Number of rooms must be top coded (5+)
- ISCED levels 5 and 6 must be regrouped
- NACE is regrouped at 19 levels
- ISCO 2-digit code can be released
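Three of these recoding rules (age 80+, rooms 5+, ISCED 5 and 6 regrouped) are simple enough to sketch. The record layout, the field names and the output labels such as "80+" and "5-6" are hypothetical; for brevity the household variable (rooms) is carried on the same record as the individual ones.

```python
def top_code(value, cap):
    """Return the value unchanged below the cap, otherwise the label 'cap+'."""
    return value if value < cap else f"{cap}+"

def recode(record):
    """Apply the top coding and regrouping rules to a copy of the record."""
    out = dict(record)
    out["age"] = top_code(record["age"], 80)
    out["rooms"] = top_code(record["rooms"], 5)
    if record["isced"] in (5, 6):
        out["isced"] = "5-6"  # ISCED levels 5 and 6 regrouped into one category
    return out

protected = recode({"age": 93, "rooms": 7, "isced": 6})
untouched = recode({"age": 42, "rooms": 3, "isced": 3})
```

The first record comes out as age "80+", rooms "5+" and ISCED "5-6"; the second is returned with its values unchanged.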
21
Implementation
Remaining risks
- Identification of large households remains
- Rare transitions in longitudinal data
- Sampling design information
- Specific national circumstances
Researcher needs
- Household structure
- Longitudinal data for longitudinal analysis
- Design information for proper inference (not only variable but causal models)
Harmonisation and flexibility
22
Implementation
ECHP experience
- Large dissemination in research community under licence release
- Less protection
- No observed breach of confidentiality
For EU-SILC
- Developing a responsible management of risk through controlled release and possibly audit provision and follow-up
23
Implementation
Eurostat approach
- Common rules for anonymisation of national databases
- Residual flexibility is allowed to adapt to national situations following national assessment according to common standards (measure of risk and thresholds, …)
24
Conclusions
Anonymisation is a matter of trade-offs
- Among national perceptions of disclosure risk
- Between right to privacy and researcher needs
- Between presence of risk and monitoring of risk
Value added of EU-SILC TF
- These trade-offs have been debated and made explicit