Treatment of statistical confidentiality
Part 1: Principles
Introductory course
Trainer: Felix Ritchie
CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Principles: overview
- What is ‘confidentiality’?
- Useful terms
- The ‘Five Safes’
- Data-specific issues
- Confidentiality in European statistics

Breach of confidentiality: what is it?
What do you understand by a breach of confidentiality? (Exercise 1)
- What are the common features of a breach of confidentiality?
- Is this an objective or subjective decision?

A breach of confidentiality is not easy to detect (see Exercise 1). The common features of a breach of confidentiality are:
- the identification of an attribute
- being able to associate that attribute with one or more individuals with a high degree of certainty

Breach of confidentiality: what is it?
Not a breach | Probably not a breach | Breach

Breach of confidentiality: who cares?
- What are the consequences of a breach of confidentiality?
- What if it’s not a real breach but just perceived as one (i.e. seen as if it is real)?
- What would be the consequences if
  - Eurostat lawfully released data which the MSs could not release?
  - Eurostat restricted publication of data which MSs had already published?

Consequences:
- could affect individuals or organisations supplying information, and individuals or organisations sharing information
- could be personally or corporately distressing
- may be financial, legal, reputational, physical
- for stats agencies, could affect data collection and so validity of functions

In some cases, the perception of a breach may be as important as an actual breach, even if the perception is untrue or completely irrelevant.
Just because something is legal does not mean it is acceptable.
Often there is a breach of procedure, not of confidentiality; established practice can be hard to distinguish from law.
Within the EU, complex legal arrangements make it difficult to tell: something lawful for an MS may not be lawful for Eurostat to do, and vice versa (note that Eurostat has a particular legal position in respect of data access).

Breach of confidentiality: avoidance
- Never publish anything
- Never distribute data
- or try to manage protection

Question: which is the better position?
- assume everything can be published; check for reasons not to publish (Hint: this one!)
- assume nothing can be published; check for reasons to publish

Confidentiality breaches are easily avoided by never collecting data and never publishing. This is not very sensible from society’s perspective. We need to manage procedures to get as much information as possible published while retaining confidentiality.
But your methodological stance affects your decision on what is acceptable: for good psychological reasons, the two positions stated above will in general have different outcomes (for a detailed discussion, see Ritchie, 2014).
The trainer’s recommendation would be the first. This is likely to result in more outputs being published.

Some necessary terms
What do the following terms mean?
- person
- identification
- attribute
- disclosure
- statistical disclosure control (SDC)
- microdata
- output
- input SDC
- output SDC

Some necessary terms
We use terms as follows:
- person: a legal or natural individual
- identification: knowing that a record relates to a specific person
- attribute: a piece of useful information about a person
- disclosure: attribute attached to an identified person unlawfully
- statistical disclosure control (SDC): reducing the risk of disclosure
- microdata: record-level data for analysis
- output: published statistics, e.g. tables, analyses
- input SDC: SDC applied to data before research
- output SDC: SDC applied to statistical outputs

‘Legal’ people are corporate bodies, such as companies, government departments, or Eurostat. A ‘natural’ person is a human. We call both ‘person’ to keep the concepts simple.

How do (attempted) confidentiality breaches occur?
Which is the most common?
Breaches may be malicious or not malicious, and:
- Deliberate (as a result of someone deliberately misusing procedures)
- Accidental (as a result of a mistake in understanding or actions)

In deliberate misuse, an authorised user is making use of data in a way which has not been approved and/or outside the environment that the data should be used in.
If a breach is accidental, does it matter? That implies that no-one notices it. But the point is that someone who is not authorised to see information could find out something from a carelessly controlled output or a badly managed dataset.
Almost all breaches are not malicious, i.e. they are not done to harm the data; instead they result from data users not following procedures, either because they do not know them or because they do not like them.
It’s difficult to give numbers because there is an unwillingness to publicise mistakes. My personal perception would be about 80% accidental, about 20% deliberate but not malicious, and deliberate malicious much less than 1% of all breaches, successful or otherwise. Note that this is across all facilities; training substantially reduces the total and the proportion of deliberate acts of misuse.

Which is the hardest to detect?
Breaches may be malicious or not malicious, and:
- Deliberate (as a result of someone deliberately misusing procedures)
- Accidental (as a result of a mistake in understanding or actions)

Accidental breaches can be surprisingly hard to detect, as the fact that a breach is accidental implies it is not obvious, and the researcher has no reason to suspect an error.

Which is the most serious?
Breaches may be malicious or not malicious, and:
- Deliberate (as a result of someone deliberately misusing procedures)
- Accidental (as a result of a mistake in understanding or actions)

Again, this is hard to tell as there is little evidence. It is likely that ‘malicious, deliberate’ has the most potential to be damaging, but it could be argued that ‘accidental, non-malicious’ breaches are most likely to be missed and so be ‘successful’ breaches.
Surprisingly, ‘deliberate, but non-malicious’ breaches may be the least harmful. This is because, although the user has done the wrong thing, they are still likely to apply their own standards of good data management. By definition, these are not as secure as the formal procedures, but, unlike the other two cases, damaging data is not being released without a care for its security.

Summary:
- breaches of confidentiality in statistical data are most likely to occur (and be successful) as a result of accidents by people who have authorised access
- the second most likely cause of (attempted) breaches is authorised users avoiding rules to make life easier for themselves

Identification: a formal definition
To determine whether a statistical unit is identifiable, account shall be taken of all relevant means that might reasonably be used by a third party to identify the statistical unit.
Source: Article 3 ("Definitions") of Regulation (EC) No 223/2009 of the European Parliament and of the Council

Important:
- ‘reasonable’, not ‘all possible’
- judgment and experience matter

Note that it is the means that might ‘reasonably’ be used to identify, not ‘all possible’ means. Most European countries have similar phrasing.

The ‘Five Safes’ framework
- Framework for helping you to consider how you can apply alternative strategies to manage disclosure risk
- Each dimension asks a specific question about how much risk you want to accept or, alternatively, how much control you can apply to the way data is used
- All parts of the framework should be considered

This framework has been developed as a way of looking at everything from internet release to management of confidential data labs. It is implicitly part of the expert recommendations: Brandt M., Franconi L., Guerke C., Hundepool A., Lucarelli M., Mol J., Ritchie F., Seri G. and Welpton R. (2010), Guidelines for the checking of output based on microdata research (a revised version is available).
For a detailed description, see Ritchie, Desai and Welpton (2014) The Five Safes. Mimeo.

The ‘Five Safes’ framework
- Safe projects: Is this use of the data appropriate?
- Safe people: Can people be trusted to use it in an appropriate manner?
- Safe data: Is there a disclosure risk in the data itself?
- Safe settings: Does the access facility limit unauthorised use?
- Safe outputs: Are the statistical results non-disclosive?

Safe projects
- can you/do you check the purposes for which this data is being used?
- do you check for statistical validity, for legality, for compliance with procedures of MSs?

Safe people
- do people have the competence to use the data appropriately? As Commission staff, are you acting appropriately? How will the general public treat your publications?
- do you trust them not to breach confidentiality?
- have you done anything to make them liable to act properly (e.g. training sessions, handouts)?
- do they have incentives to do the right thing?
- is it easy to do the right thing?

Safe data
- What would be the risk if this data was given to everyone? How much effort would it take for someone to find out detailed confidential information?

Safe settings
- Is it hard or easy to remove the data from your control?

Safe outputs
- Even for data users working with the best intentions, mistakes happen (your trainer is aware of some of his own mistakes)! What is the risk in poorly-executed outputs?

The ‘Five Safes’ framework
Preventing deliberate breaches; preventing accidental breaches
Note that breaches can occur as a result of deliberate actions and accidental output.

Safe data: input SDC appropriate
In terms of making the data safe, this is where the microdata protection comes in. We will be looking at a program called mu-Argus (μArgus) which can help with protecting microdata.

Safe outputs: output SDC appropriate
We will concentrate on tabular outputs, as these are the most problematic. We will try manual adjustment, and use the program tau-Argus (τArgus).
We will spend a large amount of time on tabular outputs because, as well as forming the likely bulk of your work, they usefully illustrate many aspects of data management, especially the need for you to exercise judgement in any decisions you make.

Five safes and European statistics
What are the ‘controls’ for these activities?
Score each dimension on a scale of 0-5, where 0 means ‘no control’ and 5 means ‘controlled as strictly as possible’.

Safe...                                      | Projects | People | Data | Settings | Outputs
Tables published on the Eurostat website     |          |        |      |          |
Microdata from MSs analysed by EC staff      |          |        |      |          |
Scientific use files containing MS microdata |          |        |      |          |
Microdata at the Eurostat Safe Centre        |          |        |      |          |

Things to think about: what are the processes involved? What training is involved? How is data stored and managed electronically?
Hint: for anything published on the internet (data or tables) you have no control over who will use it or for what purposes.
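One way to record the scores from this exercise is a small structure that enforces the 0-5 scale. The sketch below is illustrative only: the class name `FiveSafesScore` and its methods are hypothetical, and the only values filled in are the zeros for projects and people on website tables, which follow from the hint (no control over who uses published material or why). The remaining scores are deliberately left unset.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FiveSafesScore:
    """Scores for one activity: 0 = no control, 5 = controlled as strictly as possible."""
    activity: str
    projects: Optional[int] = None
    people: Optional[int] = None
    data: Optional[int] = None
    settings: Optional[int] = None
    outputs: Optional[int] = None

    def __post_init__(self):
        # Reject any score outside the 0-5 scale used in the exercise.
        for f in fields(self):
            if f.name == "activity":
                continue
            value = getattr(self, f.name)
            if value is not None and not 0 <= value <= 5:
                raise ValueError(f"{f.name} must be between 0 and 5, got {value}")

    def unscored(self):
        """Return the dimensions that still need a score."""
        return [
            f.name
            for f in fields(self)
            if f.name != "activity" and getattr(self, f.name) is None
        ]


# Hypothetical partial answer: only the two dimensions covered by the hint are scored.
website_tables = FiveSafesScore(
    "Tables published on the Eurostat website", projects=0, people=0
)
print(website_tables.unscored())
```

Leaving dimensions unset rather than defaulting them to zero keeps "not yet considered" distinct from "no control", which matters because all parts of the framework should be considered.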

Planning the management of confidentiality
Use the ‘Five Safes’ (or your own preferred framework) to consider:
- what outcomes do I want to achieve?
- what risks to confidentiality exist?
- how large are they?
- what can I do to reduce them?
Remember to consider accidental and deliberate release!

Note that traditional confidentiality programmes often only focus on ‘intruders’ (deliberate, malicious actors). As we have seen already, accidental and non-malicious release is a much more serious problem. Consider these when planning your confidentiality strategy.
Note that we are not just concerned with your published output. Are you personally managing your data correctly?

Data issues
Exercise: contrast the characteristics of these types of data in respect of confidentiality
- business, personal (non-health) and health data
- hint: consider how the data are distributed, how sensitive they are, how easy it is to identify a person

Business data
- very highly skewed distribution
- large organisations very recognisable by virtue of size and the industry they operate in (in theory; in practice, this is hard)
- businesses often have to publish detailed data for regulatory reasons, so there are many options for matching to identify organisations
- data commercially sensitive, but probably not for very long

Personal non-health data
- individuals are alike rather than different, so much harder to identify without some very specific details (e.g. age, gender, street name)
- some data (e.g. sexuality) may be permanently sensitive, some (education, type of employment) not very sensitive, but still confidential!

Health data
- likely to be permanently sensitive
- unusual health episodes can be identifying, and access to health records can be surprisingly widespread

For both personal and health data, European legislation (and some MSs) formally defines some data as particularly ‘sensitive’, such as:
- race or ethnicity
- political, religious or sociological opinions (such as union membership)
- health and sexuality

This doesn’t really help us, as we don’t want any confidential data to be released, irrespective of the ‘harm’ that it causes. In general, assume that data release is a bad thing unless you know it to be non-disclosive.

Data issues
Exercise: contrast the characteristics of these types of data in respect of confidentiality
- survey and administrative data
- hint: consider quality of data, identification of individuals, and who has access to that data

Survey data
- sampling a fraction of the population reduces the risk of identification: ‘population uniques’ are very bad, but ‘sample uniques’ which are not also ‘population uniques’ are not
- data is only likely to be accessible to the data collection body and to researchers, and is usually anonymised as soon as practicable

Administrative data
- increases identification risk because you may have a lot of information about whether a person should be in the dataset or not
- a relatively large number of people are likely to have access to the source data, and that is probably fully identified (names, tax numbers etc.) as that is necessary for administration
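The distinction between sample uniques and population uniques can be sketched in a few lines of Python. This is an illustration under stated assumptions: the key variables (age band, sex, region), the toy population and the sample are all hypothetical, and in practice the full population is rarely available to check against.

```python
from collections import Counter

# Hypothetical key variables (age band, sex, region) for a toy population.
population = [
    ("30-39", "F", "North"),
    ("30-39", "F", "North"),
    ("30-39", "M", "North"),
    ("40-49", "F", "South"),
    ("40-49", "F", "South"),
    ("40-49", "M", "South"),
    ("40-49", "M", "South"),
    ("50-59", "F", "North"),
]

# Hypothetical survey: four of the eight population records.
sample = [population[0], population[2], population[3], population[7]]


def uniques(records):
    """Return the key combinations that occur exactly once in the records."""
    counts = Counter(records)
    return {key for key, n in counts.items() if n == 1}


population_uniques = uniques(population)
sample_uniques = uniques(sample)

# Unique in the sample AND in the population: the high-risk records.
risky = sample_uniques & population_uniques

# Unique only because of sampling: other people in the population share these keys.
only_sample_unique = sample_uniques - population_uniques

print(sorted(risky))
print(sorted(only_sample_unique))
```

Here every sample record is unique within the sample, but only two of them are also unique in the population; the other two could belong to either of two people.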

Data issues
Exercise: contrast the characteristics of these types of data in respect of confidentiality
- cross-sectional, time-series, longitudinal/cohort, and census data
- hint: consider whether you are more likely to be able to identify someone in the data

Cross-sectional data
- some personal information which might help with the identification of respondents

Time-series data
- aggregated data observed over time: should be no disclosure risk unless you could note (for example) a change in the series and be able to associate that with one contributor
- what happens if you aggregate differently over different periods?

Longitudinal/cohort data
- more identifiable than cross-sectional data, as events over time might be more noticeable (for example, year of leaving school, year of getting married etc.)
- in addition, because the selection criteria for inclusion in the cohort have to be fixed over the period, they might be identifying (for example, birth date, or everyone with a tax record number starting with “1”)

Census data
- you know everyone should be in the dataset, so someone hunting to identify a record can start from that assumption
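The time-series risk described above (a change in the series that can be tied to one contributor) can be illustrated with a toy calculation. Everything in this sketch is hypothetical: the published totals, the assumption that an outsider knows one large business entered in 2013, and the crude estimate that subtracts the typical change seen in other years.

```python
# Hypothetical published series: total turnover for one industry in one region.
published_totals = {2011: 500, 2012: 510, 2013: 760, 2014: 770}

# Hypothetical outside knowledge: one large business entered the market in 2013.
known_entry_year = 2013


def year_on_year_changes(series):
    """Return the change from each year to the next, keyed by the later year."""
    years = sorted(series)
    return {
        later: series[later] - series[earlier]
        for earlier, later in zip(years, years[1:])
    }


changes = year_on_year_changes(published_totals)

# Average change in the years where nothing unusual is known to have happened.
other_changes = [c for year, c in changes.items() if year != known_entry_year]
typical_change = sum(other_changes) / len(other_changes)

# Rough estimate of the new entrant's turnover from published aggregates alone.
estimated_contribution = changes[known_entry_year] - typical_change

print(changes)
print(estimated_contribution)
```

No single published figure is disclosive here; the estimate only becomes possible once the jump in the series is combined with outside knowledge about one contributor.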

The European Statistical System: what does it do?
What are the key functions of the ESS, and which are most relevant to our course today?

The European Statistical System: what does it do?
Methodological work
- research, e.g. ESSNet
- guidelines, e.g. ESA, CENEX
- development and co-ordination of the European Statistical Work Programme

Statistics production (our interest)
- harmonisation of aggregates
- production of harmonised data from MS tables
- production of harmonised data from MS microdata
- creation of cross-national research microdata

What does the ESS mean for confidentiality protection?
What features of the ESS could make confidentiality protection easier or harder?

What does the ESS mean for confidentiality protection?
- multiple rules for ‘confidentiality’ from MSs
- problems of separate MS and Commission publication
- country-specific sensitivities (local knowledge)
- not all information available

States vary in what they consider ‘confidential’ and ‘risky’: whose rules should we follow?
As we will see later today, separate tables containing the same information can cause problems of disclosure-by-differencing.
There may also be local variations which Commission employees might not be aware of; for example, laws on abortion vary significantly and so different MSs may have very different rules for these specific variables.
Finally, if the data comes from MSs already in tabular form, then we may not have all the information to make an appropriate judgement.
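Disclosure-by-differencing between separately published tables can be sketched as follows. This is a minimal illustration, not a rule from this course: the two tables, their coverage, and the minimum cell count `MIN_CELL_COUNT` are all hypothetical.

```python
# Hypothetical counts of enterprises by region, published by two different bodies.
# Table A (say, from an MS) covers all enterprises.
table_all_enterprises = {"North": 120, "South": 85, "East": 40}
# Table B (say, from the Commission) covers only enterprises with 10+ employees.
table_10_plus_employees = {"North": 95, "South": 83, "East": 33}

# Hypothetical threshold below which a non-zero cell is treated as risky.
MIN_CELL_COUNT = 3


def difference_tables(wider, narrower):
    """Subtract the narrower table from the wider one, cell by cell."""
    return {cell: wider[cell] - narrower[cell] for cell in wider if cell in narrower}


# Implied table: enterprises with fewer than 10 employees, never published directly.
implied = difference_tables(table_all_enterprises, table_10_plus_employees)

# Cells in the implied table that fall below the hypothetical threshold.
at_risk = {cell: n for cell, n in implied.items() if 0 < n < MIN_CELL_COUNT}

print(implied)
print(at_risk)
```

Neither published table contains a small cell, yet subtracting one from the other yields a count of 2 for one region. This is why tables from different publishers need to be checked against each other and not only on their own.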

Is all data equally confidential?
- MSs decide whether they are sending confidential or non-confidential data
- Eurostat cannot make its own decisions on the confidentiality of source data, only outputs
- There are formal exceptions
  - some laws allow disclosure
  - sometimes respondents can allow disclosure
- In general, ignore these cases: not helpful

Data sent by Member States cannot change its status, so
- confidential data must remain confidential
- non-confidential data may not be suppressed or hidden, unless it would in some way reveal confidential data

There are some exceptions:
- trade statistics (imports and exports) are assumed to be non-confidential unless flagged up
- respondents can choose to have their data published
- there may be specific regulations for particular cases

In general, these don’t help; be aware that they exist, but if you work on the assumption that they don’t apply, you’ll automatically end up following good practice, which you can then relax if you need to. This is better than finding an exception and trying to fit confidentiality rules to the exception.

Principles: a summary (1)
Good confidentiality protection is not about
- finding the bad guys
- applying rules
- restricting data or data access

Principles: a summary (2)
Good confidentiality protection is about
- knowing what the real risks in data access are
- knowing when and how to apply rules, and realising when the rules might need to be changed or re-interpreted
- seeing restrictions on data or data access as some of the options for lowering data risk
- getting data out there for the users!

Good confidentiality protection means the users don’t see it…

Useful references Expert guidelines
Expert guidelines (Eurostat best practice guidelines)
- Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Lenz R, Naylor J, Schulte Nordholt E, Seri G, de Wolf P-P (2010) Handbook on Statistical Disclosure Control v1.2.
- Brandt M, Franconi L, Guerke C, Hundepool A, Lucarelli M, Mol J, Ritchie F, Seri G and Welpton R (2010) Guidelines for the checking of output based on microdata research. Revised 2015 at document_output-checking-guidelines.pdf

Papers on confidentiality management
- Desai T. and Ritchie F. (2010) "Effective researcher management", in Work session on statistical data confidentiality 2009; Eurostat
- Desai T., Ritchie F. and Welpton R. (2014) "The Five Safes", working paper, University of the West of England
- Ritchie F. (2014) "Access to sensitive data: satisfying objectives, not constraints", Journal of Official Statistics

Eurostat resources
Cybernews/Statistics/Methodology/Statistical confidentiality
This is the source page for Eurostat documents on confidentiality. Note that it is only accessible on the Commission intranet.

Questions?
