Published by Bruno Stanley. Modified over 6 years ago.
1
Treatment of statistical confidentiality
Table protection using Excel and tau-Argus
Practical course
Trainer: Felix Ritchie
CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
2
Tabular SDC: overview
Recap of table protection methods
  primary and secondary disclosure
  frequency and dominance rules
  linked and hierarchical tables
  creating safe tables from microdata
Use of appropriate tools
  Checking tables manually in Excel
  Using tau-Argus
This course has three topics:
  recap of tabular data problems and solutions: we will review topics covered in introductory courses. If you didn't attend the introductory course (and don't have similar knowledge), please ASK MANY QUESTIONS!
  creating a safe table manually: we will use the tools in Excel to carry out automatic checks
  using tau-Argus: we will learn the program basics and try to create safe tables from both microdata and input tables
3
Review: types of tables
What are these tables?
  frequency tables
  magnitude tables
  linked tables
  hierarchical tables
To start with, only consider frequency tables
There are four main types of tables to consider:
  Frequency: numbers of contributors only
  Magnitude: sums, means etc. of contributor values
  Linked: cells in one table also appear in another
  Hierarchical: some tables are subsets of others
What are examples of each? We'll look at each in turn; they pose more problems for SDC as we go along.
4
Review: types of disclosure
What problems does Table 1 below present?
There are several problem cells: certainly all the 1s and 2s.
Why are the ones with two observations problematic? Because each contributor could find out about the other.
Should we preserve details or totals?
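As a rough illustration of the frequency check described here, the sketch below flags cells with one or two contributors. The table contents, the region and industry names, and the threshold of 3 are all invented for the example; Table 1 itself is in the workbook and is not reproduced.

```python
# Hypothetical frequency table: rows are regions, columns are industries.
# All counts are invented for illustration.
table = {
    "North": {"Farming": 12, "Mining": 1, "Retail": 40},
    "South": {"Farming": 2, "Mining": 0, "Retail": 25},
}

THRESHOLD = 3  # assumed minimum count; cells with 1 or 2 contributors fail


def unsafe_cells(freq_table, threshold=THRESHOLD):
    """Return (row, column, count) for non-empty cells below the threshold."""
    flagged = []
    for row, cells in freq_table.items():
        for col, count in cells.items():
            # Empty cells are a separate (class disclosure) question,
            # so only counts strictly between 0 and the threshold are flagged.
            if 0 < count < threshold:
                flagged.append((row, col, count))
    return flagged


print(unsafe_cells(table))
```

With these made-up counts, the cell holding 1 and the cell holding 2 are flagged; the empty cell is left for a separate check.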
5
Review: types of disclosure
Two types of problem:
  Primary disclosure
  Secondary disclosure
Distinguish between actual disclosure and potential disclosure
Primary disclosure is where the cell in the table is disclosive on its own, without the need for any further information beyond that contained in the cell classification.
Secondary disclosure is where that cell, in combination with others, could lead to a disclosure.
Remember, problems show ‘potential’: to know whether a cell is disclosive, one needs to understand the context. For example, are any of these disclosive (numbers are imaginary)?
  number of companies in Finland called ‘Nokia’ = 1
  number of businesses in Malta with turnover over €10bn = 1
  number of deaths from Creutzfeldt-Jakob Disease in Manchester = 2
Conclusion: a cell is only disclosive or not in a specific context.
6
Review: cell suppression
Primary suppression
Secondary suppression
  what different methods exist?
  what are the advantages and disadvantages?
One option is just to suppress problematic values, but this might not be enough on its own: a suppressed value can often be recovered from the published totals and the remaining cells. If other cells (which were safe) need to be suppressed as well, this is called secondary suppression.
Note the implication is that you are hiding valuable and safe information. Is this a trade-off you want to make?
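A small sketch of why primary suppression alone may not be enough. The row values and the helper name `recover_by_differencing` are invented for illustration; the point is only that a single blank in a row with a published total can be filled in by subtraction.

```python
def recover_by_differencing(row, total):
    """Return the hidden value if exactly one cell in the row is suppressed.

    Suppressed cells are represented by None. If zero or several cells are
    suppressed, nothing can be pinned down from this row alone, so return None.
    """
    hidden = [i for i, value in enumerate(row) if value is None]
    if len(hidden) != 1:
        return None
    return total - sum(value for value in row if value is not None)


row_total = 53                          # published row total (invented)

primary_only = [12, None, 40]           # only the unsafe cell is blanked
print(recover_by_differencing(primary_only, row_total))

with_secondary = [None, None, 40]       # a safe cell is blanked as well
print(recover_by_differencing(with_secondary, row_total))
```

In the first row the blanked cell comes back as 1; in the second, only the sum of the two hidden cells is known. The same check would need repeating down each column, since a secondary suppression that protects a row can still be undone through the column totals.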
7
Review: other solutions
What are the advantages and disadvantages of these?
  Table re-design
  Controlled rounding
  Adding noise
Suppression may not be sensible: you might lose so many variables that the table becomes useless, and every suppressed cell creates potential for disclosure by differencing. In addition, what do you do about empty cells?
Some alternatives:
  redesign the table with fewer categories or fewer dimensions so that fewer cells have to be suppressed
  controlled rounding rounds numbers up or down to a multiple of a ‘base’, while maintaining totals
  add some ‘noise’ to the tables, done in such a way that totals remain correct but specific values might be higher or lower than their real value, adding uncertainty
All have advantages and disadvantages, but a key disadvantage for the ESS is that totals might be inconsistent between tables produced by Member States and by other parts of the ESS.
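To show why rounding has to be ‘controlled’, here is a minimal sketch using invented counts and an assumed base of 5. It applies conventional rounding to each cell independently, which is not a controlled rounding algorithm; a controlled method would choose up or down for each cell so that the table still adds up.

```python
BASE = 5  # assumed rounding base


def round_to_base(value, base=BASE):
    """Round a count to the nearest multiple of base (conventional rounding)."""
    return base * round(value / base)


cells = [12, 1, 40]      # invented interior cells
total = sum(cells)       # true total: 53

rounded_cells = [round_to_base(v) for v in cells]
rounded_total = round_to_base(total)

# The rounded cells no longer add up to the rounded total.
print(rounded_cells, sum(rounded_cells), rounded_total)
```

Here the rounded cells sum to 50 while the rounded total is 55, which is the kind of inconsistency controlled rounding is designed to avoid.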
8
Frequency tables: class disclosure
Tables provide information about a class of respondents
Importance depends on context: are they ‘structural’ or informative?
No general guidelines, but be aware of empty or full cells/columns
Class disclosure is a context-sensitive outcome; for example:
  no-one in mid-Wales (NUTS2 region UKL2) earns over £100m => not very informative
  no-one in mid-Wales earns over £100k => quite informative about the region
Where the full or empty cells are known to be part of the structure of the data (all doctors have a university degree; the hourly wages of state nurses have a minimum value of €10.50 and a maximum of €27.90), this is not disclosive. But where the proportion is not necessarily 0% or 100% (there is no reason why all respondents in a country should have tried illegal drugs), class disclosure becomes a problem.
So check all zeros or full cells!
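The "check all zeros or full cells" advice can be sketched as a simple scan. The table below is invented, and the function only lists candidates: whether a flagged cell is structural or genuinely informative remains a judgement about context that code cannot make.

```python
def class_disclosure_candidates(freq_table):
    """List empty cells and cells that hold the whole of their row total."""
    flagged = []
    for row, cells in freq_table.items():
        row_total = sum(cells.values())
        for col, count in cells.items():
            if count == 0:
                flagged.append((row, col, "empty"))
            elif count == row_total:
                flagged.append((row, col, "full"))
    return flagged


# Hypothetical survey counts by region (invented for illustration).
survey = {
    "Region A": {"Never tried": 0, "Tried": 48},
    "Region B": {"Never tried": 30, "Tried": 22},
}

print(class_disclosure_candidates(survey))
```

In this made-up table every respondent in Region A falls in one category, so both the empty cell and the full cell are listed for review.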
9
Review: magnitude tables
What new problems do magnitude tables bring?
Dominance is a particular problem for business data: large outliers can account for much of a cell's value. This only occurs in magnitude tables; by construction, every record in a frequency table counts as just one unit.
10
Review: concentration rules for magnitude tables
The (n,k) rule (‘dominance rule’)
  cell is unsafe if the n largest contributors represent over k% of the cell total
The p%-rule
  cell is potentially unsafe if the cell total minus the two largest contributors is less than p% of the largest
What problems occur with these rules?
The (n,k) rule is a simple guard against the cell total being close to the sum of those n units. This rule is insufficient on its own because it doesn't stop those at the top from learning about each other, so we have...
The p% rule is to make sure there is sufficient uncertainty that the second-largest contributor cannot determine the size of the largest contributor; i.e., it measures the minimum amount of uncertainty that surrounds the value of the largest contributor, when viewed from the perspective of the second largest. In the examples, we shall see exactly how this works.
The main problem is negative values: with them the dominance rules make no sense. However, you might also run into problems when dealing with non-linear sums, e.g. what if you had sums of log earnings? What about a Herfindahl Index (sum of squared shares)?
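The two rules as stated on this slide can be sketched in a few lines of Python. The parameter values (n=1, k=75, p=10) and the contributor values are assumptions chosen for illustration, not recommended settings. Because the rules make no sense with negative values, the sketch refuses such input.

```python
def _check(values):
    """Reject input for which the concentration rules are not meaningful."""
    if not values or any(v < 0 for v in values) or sum(values) <= 0:
        raise ValueError("rules need non-negative contributions with a positive total")


def nk_rule_unsafe(values, n=1, k=75.0):
    """(n,k) rule: unsafe if the n largest contributors exceed k% of the total."""
    _check(values)
    top_n = sorted(values, reverse=True)[:n]
    return sum(top_n) > (k / 100.0) * sum(values)


def p_rule_unsafe(values, p=10.0):
    """p% rule: unsafe if total minus the two largest is under p% of the largest."""
    _check(values)
    ordered = sorted(values, reverse=True)
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0
    remainder = sum(values) - largest - second
    return remainder < (p / 100.0) * largest


dominated = [800, 150, 30, 20]    # one firm holds 80% of the cell total
balanced = [400, 300, 200, 100]   # no single firm dominates

print(nk_rule_unsafe(dominated), p_rule_unsafe(dominated))
print(nk_rule_unsafe(balanced), p_rule_unsafe(balanced))
```

For the dominated cell, the largest contributor (800) exceeds 75% of the total of 1000, and the remainder after removing the top two (50) is less than 10% of the largest (80), so both rules flag it. The balanced cell passes both.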
11
Practice session 1: Creating and checking a simple table in Excel and tau-Argus
See workbook exercises 1-4
12
Review: linked tables
How might tables be linked?
Data are often broken down several ways, and the data producer wants to keep them consistent; within the ESS, country-EU breakdowns add an additional dimension. Problem is similar to disclosure by differencing – but slightly easier, as we know the tables we are producing. We won’t go into this in detail, as to some extent we’ve covered it and the hierarchical tables (next) will fill in some of the gaps.
14
Review: hierarchical tables
Some data categories have a natural hierarchical structure
  for example, industry, occupation, region
What problems can this cause?
Hierarchies can be addressed ‘bottom-up’ or ‘top-down’
  the second is better for ESS as it maintains totals
Many categorical variables, particularly in business data, have a hierarchical structure. For example, you might want to present information by broad industrial groups, and then by detailed NACE categories. Alternatively, health data could be presented as national statistics, with detailed regional variation available within each country.
The problem is that you might not have enough observations in the lower-level categories, so should you not produce the higher-level ones? If you don't take account of the structure, you create the chance of disclosure by differencing, again. And when building such tables, do you build them from the bottom up or the top down?
Top-down tries to ensure that the highest-level categories (which shouldn't have frequency or dominance problems) are completed first, so that the broad picture is accurate. This is most likely to generate tables which are comparable across different breakdowns. The difficulty is that there is the potential for an empty lower-level category to cause the hierarchy to unravel, as suppressed cells lead to differencing across hierarchy levels.
Bottom-up involves taking the non-suppressed data and adding up to higher categories. Because the suppressed cells do not contribute to the total, there is no chance of disclosure by differencing within the hierarchy. However, because suppressed cells are excluded all the way up the ‘tree’, we would expect totals to be lower than they would be if we started from the top down.
For formal statistical aggregates, where consistency across different presentations is the key, it is better to take the top-down approach: harder, but it provides comparable statistics.
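The difference between the two approaches can be seen in a small sketch. The industry names, counts and threshold of 3 are invented; the code only contrasts how the parent total is formed in each case.

```python
THRESHOLD = 3  # assumed minimum count for a publishable cell

# Hypothetical lower-level counts within one higher-level industry group.
children = {"Food": 120, "Textiles": 2, "Metals": 75}


def publishable(cells, threshold=THRESHOLD):
    """Keep only the lower-level cells that pass the frequency threshold."""
    return {name: count for name, count in cells.items() if count >= threshold}


def bottom_up_total(cells, threshold=THRESHOLD):
    """Parent total built only from the non-suppressed children."""
    return sum(publishable(cells, threshold).values())


def top_down_total(cells):
    """Parent total computed from all the data, fixed before looking below."""
    return sum(cells.values())


gap = top_down_total(children) - bottom_up_total(children)
print(bottom_up_total(children), top_down_total(children), gap)
```

The bottom-up total (195) is lower than the top-down total (197). The gap of 2 is exactly the suppressed child, so under the top-down approach publishing the parent and the two safe children would reveal it unless a further cell is suppressed.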
15
Practice session 2: Creating and checking a linked table in Excel and tau-Argus
See workbook exercises 5-6
16
Review: tables as inputs
We have so far assumed we have the complete data
As part of ESS, Eurostat might receive tables with
  missing cells
  confidential cells identified with or without reasons
So far, we have assumed that we are constructing tables from microdata. But what if we only have other tables to build our data from, and some of those cells are confidential or missing? Member States (MSs) might send tables indicating that certain cells are problematic; it is for the Commission to decide whether to publish or not. We will now look at formal guidance from Eurostat on how to make those decisions. Consider Table 3 in the exercise sheet: how do we handle this?
17
Review: dealing with input tables with confidential cells
Take away all non-confidential data
  assume it has been published and so can be subtracted from totals
Consider the rest as a set
Within this set, apply relevant rules
  use all the information available
  where no information exists, assume worst case
The approach is straightforward: assume anything not marked as confidential has been or will be published, and so can be subtracted from any total; take it out of the equation completely.
Then treat the confidential cells as a single set of observations: go through each cell, and consider whether the confidentiality rules relevant for that cell (i.e. as stated by the MS) still apply when considered in the context of a total for all the confidential cells.
  if all the cells pass, you can publish
  if not, do not publish, or publish without the problematic cells
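The steps above can be sketched as follows. This is one possible reading, not the formal Eurostat procedure: it applies only a frequency rule with an assumed threshold of 3, and it interprets "worst case" as a single contributor whenever a count is missing. The `InputCell` class, the labels and all figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

THRESHOLD = 3  # assumed minimum number of contributors


@dataclass
class InputCell:
    label: str
    value: float
    confidential: bool
    contributors: Optional[int] = None  # None: no count was supplied


def check_confidential_set(cells: List[InputCell], total: float,
                           threshold: int = THRESHOLD):
    """Subtract published cells from the total, then test the rest as one set."""
    published = [c for c in cells if not c.confidential]
    confidential = [c for c in cells if c.confidential]

    # What an outsider can compute once the published cells are known.
    residual = total - sum(c.value for c in published)

    # Worst case (an assumption of this sketch): a missing count means 1.
    contributors = sum(1 if c.contributors is None else c.contributors
                       for c in confidential)

    passes = (not confidential) or contributors >= threshold
    return residual, contributors, passes


cells = [
    InputCell("A", 500.0, False),
    InputCell("B", 120.0, True, contributors=2),
    InputCell("C", 80.0, True),           # flagged, no reason or count given
    InputCell("D", 300.0, False),
]

print(check_confidential_set(cells, total=1000.0))
```

Here the confidential cells together account for 200 with at least three contributors under the worst-case assumption, so the set passes this particular rule. Dominance rules stated by the MS would need checking in the same way where the information exists.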
18
Practice session 3: Checking aggregate input tables
Exercise 4: checking in tau-Argus See workbook
19
Practice session 3: Checking an aggregate table in tau-Argus
See workbook exercise 7
20
Useful references
Hundepool et al (2010)
Brandt et al (2010)
Castro J, Fischetti M, Giessing S, Hundepool A, Lowthian P, Ramaswamy R, Salazar J-J, van de Wetering A, de Wolf P-P (2009) tau-Argus User Manual v3.5.
Formal Commission guidelines: see Recommendations on the treatment of statistical confidentiality in tabulated business data in Eurostat and Recommendations on the treatment of statistical confidentiality in tabulated personal data in Eurostat, both available from Unit B-1 (Methodology).
21
Questions?