1
Dealing with confidential data
Introductory course, Part 2: Tables
Trainer: Felix Ritchie
CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
2
Tabular SDC: overview
Tables: primary and secondary disclosure; solutions
Tables as inputs
tau-Argus
Sources of information

This course covers three main topics:
tabular data: we deal with this because it allows us to draw out many of the key concepts in SDC; these are also likely to be the main statistics produced by Eurostat staff. In this initial section, we assume that we have all the source data for the tables available to us.
tables as inputs: much ESS data is likely to come in the form of tables from MSs, some of which will be marked as 'confidential' or missing – how do we deal with this when creating EU aggregates?
tau-Argus: we then look briefly at tau-Argus as a tool for identifying and addressing tabular disclosure problems.
3
Tables
4
Types of tables
Frequency – numbers of contributors only
Magnitude – sums/means etc. of contributor values
Linked – cells in one table also appear in another
Hierarchical – some tables are subsets of others

We'll look at each in turn – they pose more problems for SDC as we go along.
5
Frequency tables: primary disclosure
Look at Table 1 – are there any confidentiality problems?
hint: there are at least three problematic cells
This is an example of potential primary disclosure
why is this qualified as 'potential'? 'Disclosure' is specific to the context

Primary disclosure is where a cell in the table is disclosive on its own, without the need for any further information beyond that contained in the cell classification. Why is the cell with two observations problematic?

But this is initially only 'potential': to be disclosive, one needs to understand the context. For example, are any of these disclosive (the numbers are imaginary)?
number of companies in Finland called 'Nokia' = 1
number of businesses in country X with turnover over €10bn = 1
number of deaths from Creutzfeldt-Jakob disease in Manchester = 2
Conclusion: a cell is only disclosive (or not) in a specific context.
6
Frequency tables: primary disclosure
The problematic cells can be hidden: (primary) cell suppression
Does this solve the problem?
yes – in this specific case
not in general
A minimal sketch of this kind of threshold-based suppression follows.
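As an illustration of primary suppression, the sketch below blanks out frequency cells that fall below a minimum-contributor threshold. The threshold of 3, the pandas layout and the example figures are assumptions for illustration, not values taken from the course tables.

```python
import pandas as pd

def suppress_small_cells(table: pd.DataFrame, threshold: int = 3) -> pd.DataFrame:
    # Blank out (suppress) any cell with fewer than `threshold` contributors.
    return table.mask(table < threshold)

freq = pd.DataFrame(
    {"North": [12, 2, 7], "South": [1, 9, 15]},
    index=["Agriculture", "Mining", "Services"],
)
print(suppress_small_cells(freq))  # the cells containing 1 and 2 become NaN (suppressed)
```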
7
Frequency tables: secondary disclosure
Consider Table 2: what confidentiality problems are there?
does cell suppression help?
How can we make the table safe?
'secondary suppression' – see Tables 2a and 2b

Suppressing individual cells alone doesn't help, because the missing values can be recovered from the 'margins' (totals). We can make the table safe by removing margins, recalculating margins, or removing other cells – which is better? There are no clear rules, but it is probably better to remove cells rather than totals, as other tables may well publish the same totals – see next slide.
8
Frequency tables: secondary disclosure
How do we know if sufficient values have been suppressed?
look for single suppressed items
treat it as a set of equations:
s suppressed cells
r row totals, c column totals (for rows/columns with suppressed cells only)
r + c − 1 independent equations
if s ≤ r + c − 1 then the equations can be solved to find the suppressed values
Not sufficient – consider Table 3

If the original totals are being used, carry out two quick checks: are there any single suppressed entries in a row or column? does the table look like a solvable simultaneous-equation system? Absence of these does NOT guarantee safety – see Table 3. A small code sketch of these checks follows.
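The sketch below implements the two quick checks, assuming the suppressed cells are marked as NaN in a pandas DataFrame of interior cells (with margins published separately). The function names and data layout are assumptions; passing both checks does not guarantee the suppression pattern is safe.

```python
import numpy as np
import pandas as pd

def quick_suppression_checks(table: pd.DataFrame) -> dict:
    suppressed = table.isna()
    # Check 1: a row or column containing exactly one suppressed cell can be
    # recovered directly from its published total.
    lone_in_row = (suppressed.sum(axis=1) == 1).any()
    lone_in_col = (suppressed.sum(axis=0) == 1).any()
    # Check 2: count unknowns (s) against independent equations (r + c - 1),
    # using only rows/columns that actually contain a suppressed cell.
    s = int(suppressed.values.sum())
    r = int((suppressed.sum(axis=1) > 0).sum())
    c = int((suppressed.sum(axis=0) > 0).sum())
    solvable = s <= r + c - 1 if s > 0 else False
    return {"single_suppressed_cell": bool(lone_in_row or lone_in_col),
            "possibly_solvable_system": bool(solvable)}

tbl = pd.DataFrame([[np.nan, 5, 9], [4, np.nan, 7]])   # illustrative pattern
print(quick_suppression_checks(tbl))
```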
9
Frequency tables: secondary disclosure
Consider Table 4
is Table 2a/b still non-disclosive, even after suppression?
The problem is 'disclosure by differencing'
difficult to spot
impossible to prove it can't happen
judgment is necessary
Why might this be a particular problem for the ESS?

Clearly the data from Table 3 can be combined with Table 2b to fill in the missing values; we could also have devised a Table 3a that would allow us to unpick the primary and secondary suppressions in Table 2a. Disclosure by differencing cannot be proved to be absent – to do so would require comparing a table against all possible past and future tabulations of the same data – so a judgment must be made about whether it is a risk. For the ESS this is a particular problem, as Eurostat and MSs will both be publishing tables based on the same data.
10
Frequency tables: alternatives to suppression (1)
Table re-design
collapse categories
produce tables with fewer categories
Exercise: how could Table 2 be re-designed?

Suppression may not be sensible – you might lose so much detail that the table becomes useless, and every suppressed cell creates potential for disclosure by differencing. In addition, what do you do with empty cells? One alternative is to redesign the table with fewer categories or fewer dimensions, so that fewer cells have to be suppressed.
11
Frequency tables: alternatives to suppression (2)
Controlled rounding
round all values to a base x
'controlled' because you retain the original totals as far as possible
do not round following a strict rule – be flexible depending on totals
provides uncertainty about the true frequencies
can also include zero values
but be careful – rounding in one direction only gives an indication of the rounding method!

Controlled rounding rounds numbers up or down to a 'base', while maintaining totals. Rounding does not follow the conventional rule (with a base of 5, conventional rounding sends values up to 7.49 down to 5 and values from 7.50 to 9.99 up to 10); under controlled rounding, any number between 5 and 10 could be rounded either way. One can also round to a higher multiple, e.g. 7 could be rounded to 5, 10, 15 or 20. A rough sketch of the idea follows.
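The sketch below illustrates controlled rounding for a one-dimensional list of frequencies, assuming the published total should also be a multiple of the base. Real implementations (e.g. in tau-Argus) solve this as an optimisation over the whole table; this greedy version, with invented figures, only shows the idea.

```python
def controlled_round(values, base=5):
    # Start from conventional rounding of each cell.
    rounded = [base * round(v / base) for v in values]
    target_total = base * round(sum(values) / base)
    # Adjust individual cells by one base step until the totals agree,
    # preferring the adjustment that distorts its cell the least.
    while sum(rounded) != target_total:
        step = base if sum(rounded) < target_total else -base
        i = min(range(len(values)),
                key=lambda j: abs((rounded[j] + step) - values[j]))
        rounded[i] += step
    return rounded

print(controlled_round([1, 2, 7, 13, 22], base=5))   # cells and total are all multiples of 5
```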
12
Frequency tables: alternatives to suppression (2)
Controlled rounding
problem: knowledge of the base might help unpicking
a larger base stops this, but damages the table more
See Table 5
what is the difference between the rounding methods?
can the true values be derived?

An 'attacker' could work out the base value and try to recover the original values – a larger base prevents this, but also damages the table much more.
13
Frequency tables: alternatives to suppression (3)
Add noise to the tables
makes any specific value uncertain, but totals remain consistent
Problems: inconsistency with other tables
Not recommended for the ESS

An alternative technique is to add some 'noise' to the tables, done in such a way that totals remain correct but specific values might be higher or lower than their real value, adding uncertainty. As with controlled rounding, the problem is that the results may not be consistent with other tables – and if table cells can be compared, the added noise can be unpicked; but the main problem is the inconsistency across tables, and hence this is less likely to be appropriate for the ESS. A simple sketch of zero-sum noise follows.
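The sketch below shows one simple way to perturb the interior cells of each row with zero-sum noise, so the published row totals are unchanged. It is an illustration of the principle only: balancing column totals as well (e.g. via iterative proportional fitting) and constraining the noise magnitude are left out, and the figures are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_balanced_noise(table: np.ndarray, scale: float = 1.0) -> np.ndarray:
    noisy = table.astype(float).copy()
    for i in range(noisy.shape[0]):
        noise = rng.normal(0.0, scale, size=noisy.shape[1])
        noise -= noise.mean()          # force the noise in each row to sum to zero
        noisy[i] += noise
    return noisy

cells = np.array([[12.0, 30.0, 8.0], [5.0, 22.0, 17.0]])
perturbed = add_balanced_noise(cells)
print(perturbed.sum(axis=1), cells.sum(axis=1))   # row totals agree
```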
14
Frequency tables: class disclosure
Consider Tables 6 and 7
What disclosure issues do they present? Do they matter?

Tables 6 and 7 provide information about a class of respondents: the drug use of survey respondents, the wages of doctors and nurses. Do these disclosures matter?
15
Frequency tables: class disclosure
Tables provide information about a class of respondents
Importance depends on context – are the cells 'structural' or informative?
No general guidelines, but be aware of empty or full cells/columns

Class disclosure is a context-sensitive outcome. For example:
no-one in mid-Wales (NUTS2 region UKL2) earns over £100m => not very informative
no-one in mid-Wales earns over £100k => quite informative about the region
Where full or empty cells are known to be part of the structure of the data (all doctors have a university degree; the hourly wages of state nurses have a minimum of €10.50 and a maximum of €27.90), this is not disclosive; but where the data is not necessarily 0% or 100% (there is no reason why all respondents in a country should have tried illegal drugs), class disclosure becomes a problem. So check all empty or full cells! A small sketch of such a check follows.
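The sketch below flags potential class disclosure in a frequency table whose rows are groups and whose columns are categories of an attribute. It only flags 0% or 100% cells; whether a flagged cell is structural or genuinely informative still needs human judgment. The layout and figures are invented for illustration.

```python
import pandas as pd

def flag_class_disclosure(table: pd.DataFrame) -> pd.DataFrame:
    shares = table.div(table.sum(axis=1), axis=0)   # within-group shares
    flags = (shares == 0.0) | (shares == 1.0)       # empty or full cells
    return flags[flags.any(axis=1)]                 # groups with a 0% or 100% cell

drug_use = pd.DataFrame(
    {"ever_used": [0, 35, 12], "never_used": [120, 80, 0]},
    index=["Country A", "Country B", "Country C"],
)
print(flag_class_disclosure(drug_use))   # Country A and Country C are flagged
```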
16
Magnitude tables: dealing with dominance
Why does dominance matter?
what is a 'disclosure'?
Consider Table 8
What are the problems with this table?
As for frequencies: both one and two outliers are potentially problematic

Disclosure doesn't necessarily mean exact disclosure. If you released Philips' R&D expenditure to the nearest €1/€1k/€1m/€100m/€1bn, would any of those be disclosive? (trick question – NEVER reference a specific respondent). Dominance is a particular problem for business data, where large outliers account for much of the value. Table 8: the first problem is that we have no idea how many observations there are; the second is the potential for dominance.
17
Magnitude tables: concentration rules
The (n,k) rule ('dominance rule')
a cell is unsafe if the n largest contributors represent over k% of the cell total
The p%-rule
a cell is unsafe if the cell total minus the two largest contributors is less than p% of the largest contributor

Where do these rules come from? What is the logic behind them?
The (n,k) rule is a simple measure against the cell being near enough the sum of those n largest units. Why is this rule insufficient on its own?
The p% rule makes sure there is sufficient uncertainty that the second-largest competitor cannot determine the size of the largest competitor. What is the minimum amount of uncertainty that surrounds the value of the largest contributor, when viewed from the perspective of the second largest? A sketch of both rules in code follows.
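The sketch below implements both concentration rules for a single cell given as a list of non-negative contributor values. The function names, default thresholds and the treatment of very small cells are assumptions for illustration; the actual parameter values are a policy choice.

```python
def nk_rule_unsafe(contributions, n=3, k=90.0):
    # Unsafe if the n largest contributors account for more than k% of the cell total.
    vals = sorted(contributions, reverse=True)
    total = sum(vals)
    return total > 0 and 100.0 * sum(vals[:n]) / total > k

def p_percent_rule_unsafe(contributions, p=10.0):
    # Unsafe if the cell total minus the two largest contributors is less than
    # p% of the largest: the second-largest could then estimate the largest
    # contributor to within p%.
    vals = sorted(contributions, reverse=True)
    if len(vals) < 2:
        return True                      # worst case: treat tiny cells as unsafe
    total, largest, second = sum(vals), vals[0], vals[1]
    return (total - largest - second) < (p / 100.0) * largest
```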
18
Magnitude tables: concentration rules
Exercise: how should Table 8 be presented?
Use the spreadsheet with the source data
Let (n, k) be (3, 90%) and p% be 10%
Apply the rules separately and jointly

The (n, k) rule is a simple measure against the cell being near enough the sum of those units; the p% rule is to make sure there is sufficient uncertainty that the second-largest competitor cannot determine the size of the largest competitor. A hypothetical usage of the rule functions above is shown below.
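As a usage example, the snippet below applies the nk_rule_unsafe and p_percent_rule_unsafe sketches from the previous slide to a single hypothetical cell. The contributor values are invented purely to show that the two rules can disagree; they are not the Table 8 data.

```python
cell = [5200, 480, 310, 250, 190, 150]       # hypothetical contributor values
print(nk_rule_unsafe(cell, n=3, k=90.0))     # True: top 3 are about 91% of the total
print(p_percent_rule_unsafe(cell, p=10.0))   # False: remainder is about 17% of the largest
```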
19
Magnitude tables: problems with concentration rules
What if the cell contains negative values?
What if the cell is not a simple linear sum?
What if contributors waive the right to anonymity?

With negative values the dominance rules make no sense; they only apply to simple linear sums. For example, what if you had sums of log earnings? What about a Herfindahl index (the sum of squared market shares)? And if firms are not bothered about publication of their data, does this mean we have no concerns?
20
Magnitude tables: protection inherent in the table
What assumptions are being made in the concentration rules?
The concentration rules are worst-case scenarios: they assume that a person looking at the table
knows which is/are the largest contributor(s)
knows roughly the proportions accounted for by the large contributors
wants to use this information to get an estimate for a specific respondent
How likely are these assumptions to hold?
21
Magnitude tables: problems specific to business data
What do we do about holding companies?
For some statistics, companies which account for more than x% of a cell can refuse publication
How do we deal with this?

Business reporting units are often not the whole enterprise – several reporting units may belong to the same business, so should they be counted as one contributor? In some countries/statistics it gets more complicated: firms may have the right to request non-publication if they form over x% of a cell. This is a particular problem in trade statistics (why?) – see Regulation (EC) No 638/2004.
22
Magnitude tables: sampling and weights
Is sampling good or bad for SDC?
Does SDC need to be applied to weighted data?

Random sampling is good for SDC – it reduces risk considerably. However, for business data stratified sampling is often used, with a census of the largest businesses, and a census is very bad for SDC. When tables are weighted up to population totals, SDC problems are reduced: the value of the weights is uncertain, and the smaller (sampled) units are scaled up more, reducing the dominance of the largest contributors (who are likely to be in the census stratum and so have weights close to 1). The sketch below illustrates how weighting dilutes the largest contributor's share.
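The sketch below shows how grossing up with sampling weights dilutes the share of the largest contributor in a cell. The figures and weights are invented: the large unit is assumed to sit in a census stratum (weight 1) while the smaller units are sampled with weight 4.

```python
values  = [900.0, 60.0, 45.0, 30.0]   # reported values, largest first (invented)
weights = [1.0,   4.0,  4.0,  4.0]    # census unit vs sampled units (invented)

unweighted_share = values[0] / sum(values)
weighted_total   = sum(v * w for v, w in zip(values, weights))
weighted_share   = values[0] * weights[0] / weighted_total

print(f"largest contributor: {unweighted_share:.0%} unweighted, {weighted_share:.0%} weighted")
```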
23
Magnitude tables: finally…
Table re-design still an option!
24
Linked tables
Linked tables have a linear dependency between them
for example, grand totals must be equal
Where does this occur in the ESS? What problems does this create?

Data are often broken down several ways, and the data producer wants to keep the breakdowns consistent; within the ESS, country–EU breakdowns add an additional dimension. The problem is similar to disclosure by differencing, but slightly easier, as we know which tables we are producing. We won't go into this in detail: to some extent we've already covered it, and hierarchical tables (next) will fill in some of the gaps.
25
Hierarchical tables
Some data categories have a natural hierarchical structure
for example, industry, occupation, region
What problems can this cause?

Many categorical variables, particularly in business data, have a hierarchical structure. For example, you might want to present information by broad industrial groups and then by detailed NACE categories; alternatively, health data could be presented as national statistics, with detailed regional variation available within each country. The problem is that you might not be able to publish enough of the lower-level categories – so should you then not produce the higher-level ones? If you don't take account of the structure, you again create the chance of disclosure by differencing.
26
Hierarchical tables: top-down or bottom-up SDC?
Top-down:
fill all the categories at the top level
move to the next level
suppress empty categories
Advantages: higher-level categories are full
Problems: may need to backtrack

Top-down tries to ensure that the highest-level categories (which shouldn't have frequency or dominance problems) are completed first, so that the broad picture is accurate. This is most likely to generate tables which are comparable across different breakdowns. The difficulty is that an empty lower-level category can cause the hierarchy to unravel, as suppressed cells lead to differencing across hierarchy levels.
27
Hierarchical tables: top-down or bottom-up SDC?
Bottom-up:
produce the most detailed level that you need
add those up to get the next level totals, and so on
Advantages: internal consistency; no disclosure by differencing within the hierarchy
Problems: grand totals likely to be lower than they should be

Bottom-up involves taking the non-suppressed data and adding it up to the higher categories. Because suppressed cells do not contribute to the totals, there is no chance of disclosure by differencing within the hierarchy. However, because suppressed cells are excluded all the way up the 'tree', we would expect totals to be lower than they would be if we started from the top down. A minimal sketch of bottom-up aggregation follows.
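The sketch below shows bottom-up aggregation for a two-level hierarchy (detailed codes nested in broad groups), assuming that unsafe detailed cells have already been suppressed and are marked as None. Suppressed cells simply do not contribute to the higher-level totals, so the published totals understate the true ones. The codes and counts are invented.

```python
from collections import defaultdict

# hypothetical detailed counts; None marks a suppressed cell
detailed = {
    ("C10", "Manufacturing"): 42,
    ("C11", "Manufacturing"): None,    # suppressed at the detailed level
    ("C12", "Manufacturing"): 17,
    ("G45", "Trade"): 8,
    ("G46", "Trade"): 25,
}

def bottom_up_totals(cells):
    totals = defaultdict(int)
    for (code, group), value in cells.items():
        if value is not None:          # suppressed cells are left out entirely
            totals[group] += value
    return dict(totals)

print(bottom_up_totals(detailed))      # {'Manufacturing': 59, 'Trade': 33}
```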
28
Hierarchical tables: top-down or bottom-up SDC?
From a research perspective, always use bottom-up
…but for national statistical tabulations, top-down is best practice
Note that neither addresses disclosure by differencing across data sets – this is always a problem!

Researchers preparing descriptive statistics for their work are best advised to take the bottom-up approach, because researchers are less interested in comparability than in description. For formal statistical aggregates, however, where consistency across different presentations is key, it is better to take the top-down approach – harder, but it provides comparable statistics.
29
Tables as inputs
30
How might data be sent from MSs?
microdata
tables with gaps
tables with confidential cells identified, with or without reasons

So far, we have assumed that we are constructing tables from microdata. But what if we only have other tables to build our data from – and some of those cells are confidential or missing? MSs might send tables indicating that certain cells are problematic, and it is then for Eurostat to decide whether to publish or not. We will now look at how Eurostat can make those decisions.
31
Input tables with missing cells
Consider Table 9 – what do we do about Estonia?
Should we not publish at all?
publish data only for the MSs that sent data?
Can we only publish if we have a complete set of data?

Tables with missing cells are relatively straightforward to deal with – publish without the missing data, or don't publish. The preferred solution is to publish, with notes describing the contributor set.
32
Input tables with confidential cells
Why would an MS send a table to Eurostat with confidential data in it?
to publish EU aggregates using the confidential data safely
Problems with this?

Suppose an MS sends a table to Eurostat with some cells marked as 'confidential' – surely that's a waste of time, as the data can't be used? But perhaps the MS can hide its own data within the Eurostat aggregates, so the confidential information can be used safely. What is the problem with this?
33
Input tables with confidential cells
Problems:
primary disclosure – a confidential cell could still dominate an EU aggregate
secondary disclosure – differencing between published and unpublished MS totals

The first problem is that aggregating to EU level isn't by itself enough to deal with primary disclosure – a single firm might still dominate an EU aggregate. The more significant problem is that you have to assume the data not marked as confidential has been, or will be, published – see Table 9a: two-thirds of the total is already published.
34
Input tables with confidential cells: method
Take away all non-confidential data
assume it has been published and so can be subtracted from totals – see Table 9b
Consider the rest as a set
Within this set, apply the relevant rules
use all the information available
where no information exists, assume the worst case

The approach is straightforward: assume anything not marked as confidential has been or will be published, so it can be subtracted from any total – take it out of the equation completely. Table 9b has the information we require. Then treat the confidential cells as a single set of observations: go through each cell and consider whether the confidentiality rules relevant for that cell (i.e. as stated by the MS) would still be breached when the cell is viewed as part of the total for all the confidential cells. If all the cells pass, you can publish; if not, do not publish, or publish without the problematic cells. A code sketch of this procedure follows.
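The sketch below runs a single pass of this check under simplifying assumptions: each confidential cell carries at most a known largest-contributor value and a simple "largest contributor must not exceed a given share of the pooled confidential total" rule, and cells with no information are treated worst-case as a single contributor. The IE and NL figures echo the worked example on the following slides; the NO total, the dictionary structure and the rule shares are invented for illustration.

```python
def check_confidential_set(cells):
    """cells: dict of MS -> {"total": float, "largest": float or None,
                             "max_share": float (the MS's rule, as a fraction)}"""
    pool_total = sum(c["total"] for c in cells.values())
    failures = []
    for ms, c in cells.items():
        # worst case: if the composition is unknown, assume the whole cell is one company
        largest = c["largest"] if c["largest"] is not None else c["total"]
        if largest > c["max_share"] * pool_total:
            failures.append(ms)
    return failures

confidential = {
    "IE": {"total": 3620.0,  "largest": None,    "max_share": 0.90},
    "NL": {"total": 26437.0, "largest": 24034.0, "max_share": 0.80},
    "NO": {"total": 96000.0, "largest": None,    "max_share": 0.70},
}
print(check_confidential_set(confidential))   # e.g. ['NO'] under these figures
```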
35
Input tables with confidential cells: example of method using Table 9b
Ireland (IE):
no information about cell contents
worst case: assume it relates to one firm
does that breach confidentiality limits?
No guidance from IE, so use Eurostat thresholds, e.g. (n, k) = (2, 90%)
IE is 3% of the total, so probably not a problem

Consider the Irish data first. There is no information on what formed this total of €3620, so we take the worst case and assume it is one company. Does this breach confidentiality? It accounts for only 3% of the total for the confidential cells, so it would not seem to be problematic.
36
Input tables with confidential cells: example of method using Table 9b
Netherlands (NL):
told that one company has value €24034
assume one other company holds the rest of the cell
the rule: confidential if a contributor exceeds 80% of the total
neither company breaks the limit – publishable

For the Netherlands, we are told there is one company valued at €24034. Worst case: assume the rest of the cell is a single company valued at €26437 − €24034 = €2403. Neither company accounts for more than 80% of €135657, so the cell is publishable under the Dutch rules.
37
Input tables with confidential cells: example of method using Table 9b
Norway (NO):
we don't know how many companies contribute
worst case: assume it is one company
but this company would account for 71% of the total
cannot be published under the Norwegian rules

For Norway, we only know the cell total. Worst case: assume all of it relates to one company. This would account for more than 70% of the total, so it is NOT publishable.
38
Input tables with confidential cells: example of method using Table 9b
Can we just leave out Norway and publish?
No – because NL now breaks the thresholds in the remainder (see Table 9c)
Can we publish IE and NL (rest)?
No – potentially only two organisations
In this case, none of the confidential data can be included

If we leave out Norway, the large Dutch company dominates the remaining confidential cells. If we also left this company out, then potentially only two companies would account for 100% of the remainder – again a breach. So in this case it looks like we cannot publish any of the confidential cells.
39
Input tables: summary
Complex because:
several rules might apply to one table
not all information is available
we need to assume worst cases => potential for over-protection
Method:
identify the known cases – already published, or not publishable
consider the remaining cells on a case-by-case basis
every time you change the set of known results, check again

Receiving tables as inputs presents a particular problem because we may not have enough information to make efficient choices – we have to make worst-case ones – and different MSs might want to apply different rules. The approach is therefore iterative: work out what is assumed to be published already and set it aside; check each of the confidential cells; and if you decide that one of them can't be published, re-check the ones you've already checked, as the set of options has now shrunk. A sketch of this iterative loop follows.
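The sketch below shows the iterative re-checking loop only. The safety test is deliberately simplified to a single shared "largest contributor below 70% of the pooled total" rule, just to show the control flow: whenever a cell is dropped, all remaining cells are re-checked against the smaller pool. The figures loosely echo the worked example but are partly invented.

```python
def cell_is_safe(cell, pool_total, max_share=0.70):
    # worst case: unknown composition means the whole cell is one contributor
    largest = cell.get("largest") or cell["total"]
    return largest <= max_share * pool_total

def iterative_check(cells):
    publishable = dict(cells)
    while True:
        pool_total = sum(c["total"] for c in publishable.values())
        unsafe = [ms for ms, c in publishable.items()
                  if not cell_is_safe(c, pool_total)]
        if not unsafe:
            return publishable          # every remaining cell passes
        for ms in unsafe:               # drop failures and re-check the rest
            del publishable[ms]
        if not publishable:
            return {}

cells = {                               # partly invented figures
    "IE": {"total": 3620.0},
    "NL": {"total": 26437.0, "largest": 24034.0},
    "NO": {"total": 96000.0},
}
print(sorted(iterative_check(cells)))   # with these figures nothing survives the re-checks
```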
40
Tau-Argus: demonstration
This will involve the trainer bringing up tau-Argus and showing how it works. Users will then get a chance to explore on their own (using the introductory dataset) for the rest of the session. The advanced course in the autumn will cover extended use of tau-Argus with more hands-on experience.
41
Useful references
Hundepool et al. (2010)
Brandt et al. (2010)
Castro et al. (2009), tau-Argus User Manual v3.5

Castro J., Fischetti M., Giessing S., Hundepool A., Lowthian P., Ramaswamy R., Salazar J-J., van de Wetering A., de Wolf P-P. (2009) tau-Argus User Manual v3.5. See also Recommendations on the treatment of statistical confidentiality in tabulated business data in Eurostat and Recommendations on the treatment of statistical confidentiality in tabulated personal data in Eurostat, both available from Unit B-1 (Methodology).
42
Questions?
CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION