1
Treatment of statistical confidentiality
Part 4: Microdata
Introductory course
Trainer: Felix Ritchie
CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
2
Microdata: overview

Recap: identification
The stages of microdata protection
SDC methods for microdata
Evaluating alternatives
Implementation: mu-Argus

We will consider five topics. First, we consider what we mean by identification, revisiting the lessons of the first session. Second, we go through the recommended five stages of the Eurostat SDC handbook, discussing the choices available, of which SDC in the data is one solution. Third, we look at methods of SDC for microdata. Fourth, we consider how we can evaluate alternative methods for both their security and their impact on data usefulness. Finally, we consider implementation, and introduce mu-Argus. Before that, we revisit the 'five safes' introduced in the first session, and use them to think about different types of microdata files and risk scenarios. We will cover this in less detail than the protection of tables and other outputs; partly this is because microdata protection is more complex and more technical, and the aim of the course is to give a useful overview rather than create immediate experts. The follow-up courses in Autumn provide more detail and hands-on practice.
3
Recap: the ‘Five Safes’
When faced with a data protection issue, consider:
projects: what is the data going to be used for?
people: who will use it?
settings: how will it be used?
data: what level of detail will be made available?
outputs: how will confidentiality of results be assured?

We will be focusing on the data dimension; bear in mind we do not have to make data completely confidential for all cases. The 'Five Safes' model reminds us that there are many different ways to protect data. We will focus on the data dimension, considering how we can reduce the disclosure risk in a dataset. We will consider it from an abstract perspective, thinking about what different techniques have to offer, not (at this stage) how much anonymisation should be applied. We will focus on identification, as without identification the data is necessarily anonymous.
4
Stages of microdata assessment
Identify the need for confidentiality protection
Analyse data characteristics and use: release mechanism; user needs (essential/irrelevant variables)
Define disclosure scenario and assess risk
Identify the most appropriate method(s)
Implement

These stages are taken from the Eurostat ESSNet Handbook (see references at the end). This is not the only way to break up the process, but it will do for us today. Two other very readable guides are also available. The Eurostat (2013) Guidelines for anonymisation of social survey microdata are very clearly written, and use a seven-stage process. Mateo-Sanz and Domingo-Ferrer (2008) Rationale for anonymisation of business microdata: lessons learned from CIS4 and SES has a very clear step-by-step exposition for the specific case of business microdata. Both of these are available from the Eurostat Methodology Unit. We have already covered the first of these stages, by assumption, and we covered non-data aspects of control, so here we focus on the risk inherent in the data.
5
Analysing data characteristics
6
Analysing data characteristics
We broadly split variables into:
identifying (key) variables
variables of interest (which could also be identifying)

Identifying variables are the ones we focus on, but you could also consider making the dataset less attractive by having no variables of interest, or by swapping them so you don't know who they apply to.

We'll talk later about how data is released; for now we just want to consider what someone with access to both this data and external data sources could find out. Key or identifying variables are those which allow a respondent to be identified, either singly or in combination (for example, age, gender, postcode). Variables of interest are those which someone misusing the data is assumed to be interested in (for example, income or investment expenditure). For business data, these are often also identifying. Avoiding re-identification will be our priority. We can also consider removing the disclosure risk by manipulating the variables of interest; see later.
7
Analysing data characteristics
Direct identifiers: rarely a problem; little research value
Indirect identifiers: more problematic; need to understand how variables can be combined

Exercise: which of the variables in Tables 10 and 11 are directly/indirectly identifying? Are there opportunities for spontaneous recognition? How might identification vary across countries?

We do not consider direct identifiers (such as names or addresses): these rarely have analytical value and so can be removed. In general, confidential data should NEVER be stored with direct identifiers; convert them into non-informative references (such as random index numbers) as soon as possible. Looking at the variable lists in Tables 10 and 11, which of these are identifying? And does it matter which country these data sets are from? (Hint: consider how some variables vary in detail across countries.)
8
Analysing data characteristics
What happens if data files are linked at different levels?
individuals and households
health data and hospital data
local units and enterprises

Need to consider each level separately. The upper level may include relatively public data; if so, treat it as all public and proceed with the lower levels only.

It is not uncommon for data to be structured hierarchically. For example, a household survey may hold individuals and households; if either one is identified, so is the other, so identification at all levels needs to be considered. mu-Argus can take account of this for personal/household data. However, some datasets have an upper hierarchy using relatively public data: for example, health events may be linked to hospital details, or personal data may be linked to geographical measures of deprivation. If this is the case, a conservative strategy would be to assume that all the semi-public data is fully public, and proceed with checking the lower level on the assumption that the records are linked to a known piece of information.
9
Disclosure scenarios
10
Disclosure scenarios

Exercise: propose disclosure scenarios for the datasets described in Tables 10 and 11.
consider both accidental and deliberate disclosure
think about disclosure from outputs as well as the researcher identifying respondents
at this stage, we are only interested in what could happen

One half of the group should consider the dataset described in Table 10, the other the dataset of Table 11.
11
Disclosure scenarios

Exercise: consider releasing the personal data described in Table 11.
What factors might affect the risk of the identified disclosure scenarios arising, apart from measures we might take to reduce the risk inherent in the data?
What factors might limit the risk of a disclosure arising before any SDC has been applied to the data? For example, could you restrict access or require users to sign an agreement?
Think in terms of "if we could limit release, then we could block 'nosy neighbours'…"
12
Disclosure scenarios - purpose
The aim of disclosure scenario modelling is:
to identify the ex ante risk from a particular release environment
to work out what level of protection needs to be applied to the data, to reduce risk to an acceptable level

If we have done scenario planning properly, then we should have a clear idea of the purpose of any further controls on the data.
13
Disclosure scenarios - example
2010 Community Innovation Survey Scientific Use Files
business data from a stratified sample
distributed to EU researchers on CD under licence

What disclosure scenarios do you think were identified, and why those?

In our recent anonymisation project on the CIS SUF, we identified the feasible disclosure scenarios as arising from:
the publication of tables with very small numbers
the researchers identifying and commenting on particular firms in the dataset (irrespective of whether the identification was accurate or not)

Why just those? (Hint: non-data controls, sampling, data accuracy, data structure.)
14
Risk assessment
15
Risk assessment

Scenarios can be helpful in identifying sources of risk. But we can also generate mathematical models:
probability of a sample unique being a population unique
probability of an arbitrary record being matched to an external database
probability of any record being matched

Scenario planning helps one understand the limits of disclosure risk, but it is a subjective measure. It is possible to create some mathematical models for re-identification risk, depending on the data and your willingness to make some assumptions. Your willingness to make assumptions is still subjective, but within that it is possible to get objective measures. We don't have time to cover specific measures in detail here, as we could spend a day just looking at the set-up, but the basic model is common; a sketch of the idea follows.
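To illustrate the common basic model, here is a minimal Python sketch of a frequency-based risk measure: count how often each combination of key variables occurs and flag sample uniques. The data and key variables (age_band, sex, region) are hypothetical; real tools such as mu-Argus go further and use sampling weights to estimate population frequencies.

```python
import pandas as pd

# Hypothetical microdata with three key (identifying) variables.
df = pd.DataFrame({
    "age_band": ["20-29", "20-29", "30-39", "30-39", "30-39", "70-79"],
    "sex":      ["F", "M", "F", "F", "M", "M"],
    "region":   ["North", "North", "South", "South", "South", "North"],
})
keys = ["age_band", "sex", "region"]

# Count how often each key combination occurs in the sample.
freq = df.groupby(keys).size().rename("sample_count")

# Sample uniques: combinations seen exactly once. These are the records
# most at risk of also being population uniques.
sample_uniques = freq[freq == 1]
print(f"{len(sample_uniques)} sample-unique key combinations")

# A crude per-record risk score: 1 / frequency of the record's key combination.
df["risk"] = 1.0 / df.groupby(keys)["age_band"].transform("size")
print(df)
```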
16
Formal risk assessment: factors affecting methods
Factors affecting our ability to calculate the risk of re-identification:
sampling weights are known
the population is known (or certain parameters of it)
our data is available in external datasets
matching data is accurate

This is a list of some of the things that affect your ability to make formal assessments of the risk.
17
Formal risk assessment: factors affecting methods
Factors affecting an intruder's ability to be certain about an identification:
data is sampled
the specific units are known
sampling weights are known
population uniques are known
matching data is believed to be accurate
intruder has specific personal knowledge

We don't know about the intruder's knowledge, so we take our own knowledge as the worst case.

These are factors affecting an intruder's confidence about whether a match is true or not (which is not the same thing as whether the match is likely to be found). Note that we have less information on the intruder's information set. It is likely to be less than ours, so we tend to take the risk calculation from our information, and assume that this represents a worst-case scenario. This set of assumptions is almost exactly the set used by mu-Argus, except that the population distribution is estimated from the sampling weights.
18
Formal risk assessment: advantages
Adds an element of objectivity
Can be calculated pre- and post-SDC:
allows the effect of changes to be quantified
also allows different methods to be compared
19
Formal risk assessment: disadvantages
Still subjective, but can provide spurious objectivity
Intruder's knowledge set is unknowable:
focus on worst-case scenarios is likely to over-protect
20
Formal risk assessment: summary
Number of techniques available
Good for providing comparisons between SDC methods (relative risk of methods)
More problematic when trying to define absolute risk
Example from mu-Argus

We don't have time to cover methods in detail here, as they are also specific to datasets/scenarios, but all broadly try to quantify re-identification probabilities by making some assumptions about sampling, external data sources, and knowledge. This is a particularly useful tool for comparing the impact of different SDC methods. However, as an objective method for the absolute level of risk it can be problematic, as it can over-emphasise worst cases. We will very briefly demonstrate formal risk assessment using mu-Argus and recoding a variable.
21
Methods of SDC for microdata
22
Methods of SDC for microdata
Perturbative masking: changing the data in some way
Non-perturbative masking: reducing the information content
Synthetic data: replacing genuine data with estimated equivalents; not considered here in detail

This classification comes from the Eurostat Handbook, and we will follow it. Perturbative masking involves changing the data in some way. For example, data may be microaggregated (several observations combined together and averaged), or records might have certain values swapped; the aim is to increase uncertainty about what is being looked at. Non-perturbative masking includes cell suppression and limiting the values of variables; the aim is to reduce the chance of a re-identification being feasible or useful. Synthetic data aims to remove any disclosure risk by replacing real values with imitation ones. This is a very specialist (and controversial) area, so we will just note that it exists as another possibility.
23
Non-perturbative masking
Exercise: how could you reduce the information content in qualitative and quantitative variables?
24
Non-perturbative masking (1) Global recoding
Refine variables such that there are fewer possibilities
see Table 9: how could the variables in this dataset be recoded?
Applicable to both categorical and continuous variables, but might mean an unacceptable loss of information
Top-coding and bottom-coding are subsets

Global recoding means taking a variable and converting it into a less disclosive set of values; for example, turning the cardinal value 'income' into €10,000 bands. Categories need to be designed so that the recoding itself does not create frequency problems. The main problem with recoding is that it implies a loss of information, potentially a significant one. Top-coding and bottom-coding are subsets where only high and low values are limited; for example, incomes over €1m. For categorical variables, this only makes sense where the categories are ordinal. A sketch of both operations follows.
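A minimal sketch of the two operations just described, assuming a pandas Series of incomes; the band width and top-code threshold are illustrative, not prescribed values:

```python
import pandas as pd

income = pd.Series([8_200, 23_500, 41_000, 57_300, 1_250_000])

# Global recoding: collapse the continuous value into €10,000 bands.
band_floor = (income // 10_000) * 10_000
income_banded = band_floor.astype(int).map(lambda b: f"{b}-{b + 9_999}")

# Top-coding: cap extreme values (here, anything over €1m) so that
# outliers cannot be singled out; only the cap is published.
income_topcoded = income.clip(upper=1_000_000)

print(income_banded.tolist())    # ['0-9999', '20000-29999', ...]
print(income_topcoded.tolist())  # 1,250,000 becomes 1,000,000
```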
25
Non-perturbative masking (2) Local suppression
Change single observations rather than all values of a variable:
less impact on the dataset overall
more suited to categorical variables
Problem: how do you find the values to change?

Local suppression is just that: replacing one value with a missing value, the minimum necessary to remove the identification risk. For example, a particularly unlikely combination (a judge aged 27) could have one or other value suppressed; both values do not need to be suppressed, as long as there are sufficient judges and sufficient people aged 27. This should have much less impact on the dataset overall. It is best suited to categorical variables, because if a continuous variable required suppression (because it was identifying), any other continuous variables would surely also need suppressing, and so top- or bottom-coding seems more appropriate and loses less information. The problem is finding the cells to change; a sketch of one simple rule follows.
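A minimal sketch of one simple rule, suppressing one key value for any record whose key combination is unique in the file; the data and column names are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "teacher", "teacher", "teacher",
                   "judge", "nurse", "nurse"],
    "age":        [34, 34, 41, 41, 27, 27, 27],
})
keys = ["occupation", "age"]

# Frequency of each record's key combination.
counts = df.groupby(keys)["age"].transform("size")

# (judge, 27) is the only unique combination; blanking either value removes
# the risk, provided enough judges and enough 27-year-olds remain overall.
df.loc[counts == 1, "occupation"] = None
print(df)
```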
26
Non-perturbative masking (3) Sampling
Only release a subset of the data
Adds uncertainty about population uniques
Not directly appropriate for the ESS

For owners of large population datasets such as a census, this may be a way to release data without clearly identifying population uniques. It is also effectively what happens when business data are sampled from national business registers. We note this; however, it has less relevance for the ESS.
27
Perturbative masking

We will look at five options:
Adding noise
Combining (microaggregation)
Rounding
Swapping
Randomisation of categorical variables

What does each of these aim to achieve? We can group perturbative methods under five main headings. Before we go into detail, what do you expect each of these does, and why? What are the likely advantages and disadvantages?
28
Perturbative masking

Note that these methods are normally assessed in terms of their impact on univariate statistics. All generate more or less biased marginal analyses, which might be important.

Before we continue, note that the theoretical discussion of these measures almost always assesses them in terms of their impact on univariate statistics: means, variances, etc. This is sensible, as these are the only context-free measures. However, the main purpose of microdata release is normally for researchers to carry out marginal analyses, in which case these measures lead to (theoretically) biased results; this compares with non-perturbative methods, where the impact is to increase estimated variances but (generally) leave estimated parameters unbiased. This may be important for users: in general, users prefer missing data to unreliable data.
29
Perturbative masking (1) Adding noise
Additive or multiplicative noise can be added: x → x + e or x → x·e, where e is a random variable
The purpose is to increase uncertainty over variables of interest, so that if an observation is identified, the value of the disclosed information is less
Not very useful for key variables
Not suitable for categorical variables

Adding noise is typically done to reduce the value of any information gained by an intruder. It does not help much with blocking re-identification, as this is primarily done through categorical variables, for which this method is not really appropriate (although techniques have been devised).
30
Perturbative masking (1) Adding noise
Advantages: simple, and preserves means
Disadvantages:
additive noise disproportionately affects small values
multiplicative noise increases variance in finite samples

A sketch of both variants follows.
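A minimal sketch of both variants, assuming turnover is a continuous variable of interest; the noise scales are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
turnover = np.array([120_000.0, 340_000.0, 2_750_000.0])

# Additive noise: x -> x + e, e centred on zero. The same absolute
# perturbation distorts small values proportionally more than large ones.
additive = turnover + rng.normal(loc=0.0, scale=25_000.0, size=turnover.shape)

# Multiplicative noise: x -> x * e, e centred on one. Perturbation is
# proportional to the value, but the sample variance is inflated.
multiplicative = turnover * rng.normal(loc=1.0, scale=0.05, size=turnover.shape)

print(additive.round())
print(multiplicative.round())
```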
31
Perturbative masking (2) Microaggregation
Add multiple values together; for example, (x1, x2, x3) → (x̄, x̄, x̄), where x̄ ≡ (x1 + x2 + x3)/3, is microaggregation with k = 3
Again, the purpose is to reduce the value of re-identification
not really used for categorical variables
but used for business data, where continuous variables are often identifiers

Microaggregation involves bringing together multiple values and replacing them with the mean of those values (in this case, the microaggregation parameter k is 3). Again, it is generally used to lower the consequences of disclosure, although it has been used in business data to reduce identification risk (as continuous variables such as turnover are often the identifying variables).
32
Perturbative masking (2) Microaggregation
Relatively easy to implement if the group size is fixed
Problem: how do you choose the records to combine?
choose records closest in value: less perturbation, but not achieving the security aims
choose records closest in characteristics other than value: more confidentiality, but likely to perturb the data more

As a statistical technique, this is relatively straightforward to implement, much like adding noise. Some implementations allow for variable group size, but this is much more tricky. The difficulty is in choosing which records to combine. Choosing similar values will perturb the data least, but aren't you trying to perturb the data to hide values? Choosing records to merge which have similar characteristics (for example, within the same key categories) has some statistical logic, but is likely to lead to much larger perturbations; perhaps too much? A sketch of the fixed-size variant is below.
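A minimal sketch of the fixed-group-size variant, grouping records closest in value with k = 3; the data and column name are illustrative, and the last group would be smaller if the number of records were not a multiple of k:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"turnover": [110.0, 95.0, 530.0, 120.0, 505.0, 490.0]})
k = 3

# Sort records by value and cut the sorted order into consecutive blocks of k.
order = df["turnover"].sort_values().index
block = pd.Series(np.arange(len(df)) // k, index=order)

# Replace each value with the mean of its block of k nearest-valued records.
df["turnover_ma"] = df["turnover"].groupby(block).transform("mean")
print(df)  # 95/110/120 -> 108.33...; 490/505/530 -> 508.33...
```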
33
Perturbative masking (3) Rounding
Round to some value using a fixed rule
Similar to global recoding, but here you treat the rounded variables as continuous but measured with error

Rounding is straightforward, and in terms of the dataset it looks just like global recoding. The difference is in the way they are used. In global recoding, you create categories and should include them in estimates as such. In rounding, you treat the rounded variables as measured-with-error; this produces biased estimates, but is treatable with standard instrumental-variables techniques.
34
Perturbative masking (4) Swapping
Swapping record values between paired observations: for example, records a and b exchange values, so a's 63 and b's 75 change places
Not all variables need be swapped
Advantage: relationships between swapped variables are maintained
Problem: which records are to be swapped?

Swapping involves replacing the values in one record with the values in another; in this example a and b have swapped values. Not all variables need be swapped, but the advantage of this method is that, if only the swapped variables are used in analysis, relationships between variables within records are maintained and there is no bias. However, the same problem arises: how do you choose which records to swap? A sketch follows.
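A minimal sketch with random pairing (rank swapping would instead pair records with nearby ranks on the swapped variable); the data and column names are illustrative, and an even number of records is assumed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"region": ["North", "South", "East", "West"],
                   "income": [63, 75, 41, 58]})

# Shuffle the record indices and swap 'income' within consecutive pairs;
# 'region' is left untouched.
idx = rng.permutation(df.index)
for a, b in idx.reshape(-1, 2):
    df.loc[a, "income"], df.loc[b, "income"] = (df.loc[b, "income"],
                                                df.loc[a, "income"])
print(df)
```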
35
Perturbative masking (5) Randomisation of keys (PRAM)
PRAM (Post RAndoMisation) randomises categorical values
aim: to reduce certainty of identification and/or the value of the data
Note: the literature discusses PRAMing key variables, but mu-Argus only allows it for non-keys
Problems:
allowing for linked variables
substantial potential changes for multivariate analysis

Changing the value of identification variables directly addresses the problem of re-identification; this makes a dataset safe, unlike the other methods, which mainly try to reduce the likelihood of useful information being gained. A particular problem with this technique arises when other variables are directly tied to the PRAMmed variables. For example, if randomisation is applied to gender but gender-specific variables exist, inconsistencies can highlight the true values. This does not destroy the PRAM approach, but means it must be considered carefully. A sketch of the basic mechanism is below.
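A minimal sketch of the basic mechanism: each category is kept or replaced according to a row-stochastic transition matrix. The variable and matrix here are illustrative assumptions; in practice the matrix is designed so that published totals remain approximately unbiased:

```python
import numpy as np

rng = np.random.default_rng(1)
categories = ["owner", "renter", "other"]

# P[i, j] = probability that true category i is published as category j.
P = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.10, 0.10, 0.80]])

tenure = ["owner", "owner", "renter", "other", "renter"]
pram = [categories[rng.choice(3, p=P[categories.index(v)])] for v in tenure]
print(list(zip(tenure, pram)))
```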
36
Synthetic data

Create data with the same statistical characteristics as the original data
but mostly invented
can include some real data
Is this statistically valid? Still much debate over its value

Synthetic data is computed from the statistics of the original data, but without reference to that data, although some implementations do incorporate original non-key data. Because synthetic data is created to fulfil certain statistical characteristics, it is debatable whether it should be used for general purposes. Its value is hotly debated, as is its supposed 'non-disclosiveness'; hence, we merely note that it exists.
37
Summary of data protection options
These are not fixed categories (e.g. PRAM would go in the bottom right if we just took mu-Argus' perspective); each method is simply placed in its most likely cell, with parentheses marking a secondary placement:

Reduce information, to reduce the chance of identification: global recoding; local suppression
Reduce information, to reduce the value of identified records: (global recoding); top/bottom coding; rounding
Change information, to reduce the chance of identification: (microaggregation); PRAM
Change information, to reduce the value of identified records: microaggregation; rank swapping; adding noise; synthetic data
38
Evaluating alternatives
How should we choose between protection options? The tools in mu-Argus etc. give you some indication of the protection against risk, but no guide to whether the changes are important in terms of the utility of the data. This is going to be a very subjective judgement.
39
Evaluating alternatives
No hard-and-fast rules
Most methods preserve univariate values (means and variances)
Some preserve multivariate relationships; most don't
Most important: what do your users want?

When evaluating alternatives, there are no unequivocal guidelines as to which method to prefer; if there were, we wouldn't have competing methods. Methods have different advantages and disadvantages. In general, all preserve basic univariate statistics, but most damage multivariate relationships to a greater or lesser degree.
40
Evaluating alternatives
All struggle to balance confidentiality protection against data damage:
confidentiality protection can be quantified
'damage' cannot: it depends upon likely use
But remember the first session on principles: where you start from affects where you end up
Judgment is important!

A common theme running through all the methods is the conflict between maintaining confidentiality and maintaining the value of the data. This is not a fair fight: you can quantify the likelihood of disclosure, at least to some degree (and you can definitely compare methods for relative effectiveness), whereas 'damage' is a function of how the data is used, and it is not possible to iterate (or even agree on) the ways that the data might be used. Therefore, you should be very wary of claims that one method has the 'best' combination, as this is essentially unprovable. However, recall that in the first session we talked about the starting point affecting the outcome. If the default position is that information must not be lost unless it is demonstrably necessary for confidentiality purposes, this can act as a counterbalance to the quantification of confidentiality arguments. Your judgment is ultimately the most important factor, along with an awareness that there is no unambiguously 'right' answer.
41
Evaluating alternatives in ESS
We have assumed one set of confidentiality principles
not a good assumption
should you go with the minimum agreed set, or tailor results for each country's needs?

The discussion so far has assumed there is common agreement on what is disclosive or not. This is a reasonable assumption for outputs, but not for microdata: countries' laws will often specify what can be treated as 'disclosive' and so be distributed. It is not clear that Eurostat has any mandate to define 'confidentiality' in microdata. One option, choosing the strictest definition and applying it everywhere, is simple but limits the use of data from countries with a more relaxed definition of disclosure. The alternative, applying rules to each country's data separately, is good for researchers but terrible for ESS administrators. What do we do?
42
Microdata anonymisation example: CIS2010 SUF
Previous method:
recoding of employment into size classes
reduction in NACE categories
microaggregation applied to all continuous variables, independently of each other
microaggregation contributors were taken as the nearest values in the entire dataset

What problems does this solve and not solve? We conclude with an example of a recent piece of anonymisation.
43
Microdata anonymisation example: CIS2010 SUF
Revised method:
recoding of employment into size classes
reduction in NACE categories
all domains assessed for dominance problems
microaggregation applied only to top turnover values in problematic domains
other variables adjusted to be consistent with perturbed turnover
contributors were taken from the domain as far as possible
method evaluated by running regressions as well as inspection of means and percentiles

What problems does this solve and not solve? In the revised method, less than 1% of observations were microaggregated. The evaluation included running simple linear and non-linear regressions on the original and microaggregated data (but not the original employment; this was kept recoded).
44
Implementation

The Eurostat tool is mu-Argus
We conclude with a very brief display of mu-Argus and its capabilities. The advanced courses in Autumn will provide more hands-on experience of mu-Argus.
45
References

Published:
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Lenz R, Naylor J, Schulte Nordholt E, Seri G, de Wolf P-P (2010) Handbook on Statistical Disclosure Control v1.2
Brand R, Capobianchi A, Domingo-Ferrer J, Franconi F, Giessing S, Hundepool A, Polettini S, Ramaswamy R, van de Wetering A, Torra V, de Wolf P-P (2009) mu-Argus User Manual

Available from Eurostat:
Eurostat (2013) Guidelines for anonymisation of social survey microdata (guidelines for social data; very clearly written, using a seven-stage process)
Mateo-Sanz and Domingo-Ferrer (2008) Rationale for anonymisation of business microdata: lessons learned from CIS4 and SES (guidelines for business data)
46
Practical example

Available from Eurostat:
Domingo-Ferrer J and Mateo-Sanz H (2009) CIS4+ Base Anonymisation Procedure (original method)
Hafner H-P, Lenz R and Ritchie F (2014) 2010 Community Innovation Survey Anonymisation Procedure (revised method)
47
Questions?

CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION