Treatment of statistical confidentiality Part 3: Generalised Output SDC Introductory course Trainer: Felix Ritchie CONTRACTOR IS ACTING UNDER A FRAMEWORK.


1 Treatment of statistical confidentiality Part 3: Generalised Output SDC Introductory course Trainer: Felix Ritchie CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

2 Generalised output SDC:
Relevant for you if you are: producing non-tabular outputs; using the Eurostat Safe Centre; responsible for microdata available to researchers. If you are not in one of these groups, note the principles: tables are a subset of the statistics discussed here.
In this section, we will show how dealing with tables fits within a general framework designed to deal with all types of output, and we will discuss how the simple ‘rules’ introduced in the first part should be seen as specific instances of a more general approach. Showing how tables fit into this approach should help to explain why we spend so much time on tables and so little on other outputs. We will also consider how we might define rules for new outputs we haven’t discussed yet.

3 Key concept: ‘safe statistics’
A generalised approach to deciding whether statistics can be released or not, based on recognising different types of output. Method: identify the type of output (counts, totals, correlation coefficients, odds ratios); if the output is a ‘safe’ type, then release; if not, release only if the specific context allows.
‘Safe statistics’ is a methodological framework for approaching SDC of any output, irrespective of whether any ‘rules’ have been defined for that output. The method is based on the fact that different types of output present different confidentiality risks. Rather than treating everything as problematic, we should sort outputs into classes so that we can concentrate on the most problematic. The method: identify the type of output (for example, a frequency table consists of a series of counts); check whether the type is a ‘safe statistic’; if so (for example, a regression coefficient), release with minimal checking; if not (for example, a mean), release only if the context allows.

4 ‘Safe statistics’: decision chart
Is the statistic of a ‘safe’ type? (E.g. is this regression or table a safe type?)
Yes (this regression is a safe type): release.
No (this table is not a safe type): is the specific output safe?
Yes (this particular table is safe): release.
No (this particular table is not safe): can protection measures be applied?
Yes: redo / re-evaluate.
No: reject.
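The decision chart can be sketched as a small function. This is only an illustrative sketch: the names `Decision` and `decide`, and the idea of passing the three questions in as booleans, are assumptions made for the example and are not part of the course material.

```python
from enum import Enum


class Decision(Enum):
    """Possible outcomes of the decision chart (labels taken from the slide)."""
    RELEASE = "release"
    REDO = "redo / re-evaluate"
    REJECT = "reject"


def decide(type_is_safe: bool, output_is_safe: bool, can_protect: bool) -> Decision:
    """Walk the three questions of the chart in order.

    type_is_safe   -- is the statistic of a 'safe' type (e.g. a regression)?
    output_is_safe -- is this specific output safe in its context?
    can_protect    -- can protection measures be applied?
    """
    if type_is_safe:
        return Decision.RELEASE      # safe type: release
    if output_is_safe:
        return Decision.RELEASE      # unsafe type, but this output is safe
    if can_protect:
        return Decision.REDO         # apply protection, then re-evaluate
    return Decision.REJECT           # nothing can be done: reject


# A table (unsafe type) that fails its checks but can be protected:
print(decide(type_is_safe=False, output_is_safe=False, can_protect=True).value)
```

Note that the later questions are only reached when the earlier ones are answered ‘no’, which is why a safe type is released without looking at the context.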

5 How are ‘safe statistics’ defined?
‘Safe’ is defined by functional form: if the mathematics cannot be undone to reveal a record, then the statistic is ‘safe’. Additional rules might be needed for exceptional cases. The maths might be undone by direct analysis, or by differencing.
A safe statistic is one where there is no significant likelihood of a disclosure occurring because of the nature of the statistic itself, not because of the data or the number of observations.

6 What are ‘safe statistics’?
Suppose you consider two functions f() and g(), and two sets of data, [x] and [x and a]. Let s1 = f(x), s2 = f(x, a), s3 = g(x), s4 = g(x, a). If, given s1 and any of the others, you can’t determine ‘a’, then f() is ‘safe’.
Suppose you have a function f(x) of a set of values x. Define y as the set x plus an additional value, a. Let g(x) be an alternative function, which can take any form other than one specifically defined to attack f(x). Finally, let four statistical outcomes be defined: s1 = f(x), s2 = f(y), s3 = g(x), s4 = g(y). If, with access to s1, s2, s3 and s4 only (not direct access to x), it is not possible to determine a, then f(x) is ‘safe’. Note that this definition is independent of the values of x and a. If the result depends upon specific values of x, the statistic is not safe.

7 Safe statistics: examples
Unsafe statistic: simple total. $f(x) \equiv \sum_i x_i$, $f(x, a) \equiv \sum_i x_i + a$ $\Rightarrow$ $f(x, a) - f(x) = a$.
Safe statistic: regression coefficient. $f(x) = \beta \equiv \left(\sum_i x_i^2\right)^{-1} \sum_i x_i y_i$.
You can see that a total is unsafe, as it can be unpicked by differencing. The regression coefficient cannot be differenced in the same way, as the additional value will be included in the inverted sum of squares.
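The differencing argument can be tried numerically. The sketch below uses made-up data and hypothetical helper names (`total`, `moments`, `slope`); it assumes the single-regressor, no-intercept form of the coefficient shown on the slide. It illustrates the idea and is not a proof of safety.

```python
def total(values):
    """Simple total: a linear combination of the records."""
    return sum(values)


def moments(xs, ys):
    """Return (sum of x^2, sum of x*y) for paired observations."""
    return sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys))


def slope(xs, ys):
    """Regression coefficient through the origin: sum(x*y) / sum(x^2)."""
    sxx, sxy = moments(xs, ys)
    return sxy / sxx


# Made-up data, plus one additional record (a_x, a_y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
a_x, a_y = 10.0, 7.0

# Total: differencing the two published totals returns the added value.
recovered = total(xs + [a_x]) - total(xs)

# Regression: the difference of the two coefficients is not the added value.
beta_before = slope(xs, ys)
beta_after = slope(xs + [a_x], ys + [a_y])

# A quite different added record reproduces exactly the same coefficient,
# so the two coefficients alone do not pin down the added record.
sxx, sxy = moments(xs, ys)
alt_x = 5.0
alt_y = (beta_after * (sxx + alt_x ** 2) - sxy) / alt_x
alt_beta = slope(xs + [alt_x], ys + [alt_y])

print("recovered from totals:", recovered)
print("coefficient before / after:", beta_before, beta_after)
print("same coefficient from a different record:", alt_beta)
```

Here the two totals give back the added value directly, while the pair of regression coefficients is consistent with more than one added record.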

8 Safe statistics: what about exceptions?
Some statistics are ‘safe, with qualifiers’. For example, regression coefficients are safe except in the case of repeated regression with one additional observation and all categorical variables; for all analytical outputs, there must be more degrees of freedom than results presented; for odds ratios, at least three observations are needed. Qualifiers must be few, rare, specific and relate to the form of the data, not any specific data type.
‘Safe’ statistics are generally not safe in every conceivable circumstance (nothing could be), so there must be some qualifiers. But for a statistic to count as ‘safe’, any qualifiers must be: few (if there are many exceptions, treat it as unsafe); rare (they must be unlikely outcomes); specific (they must be easily checkable); related to the form (if an exception only relates to Census and health data but not to business data, then the statistic is clearly sensitive to the context and so can’t be safe).

9 Safe statistics: some general rules
All linear combinations are ‘unsafe’: means, sums, counts. Ranking marks are ‘unsafe’: maxima, minima, percentiles, medians. Non-linear combinations are generally safe. Combinations of safe outputs are generally safe: an odds ratio is ‘safe’, so a table showing mean odds ratios is ‘safe’.
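These general rules amount to a lookup from output type to a default class. The sketch below is one hypothetical way to encode them; the set names and the `default_class` function are assumptions for illustration. Types not listed fall back to ‘unsafe’, in line with the default categorisation in the practice guidelines.

```python
# Output types named on the slides, grouped by the rule that covers them.
LINEAR_COMBINATIONS = {"mean", "sum", "count"}                      # 'unsafe'
RANKING_MARKS = {"maximum", "minimum", "percentile", "median"}      # 'unsafe'
SAFE_TYPES = {"odds ratio", "regression coefficient"}               # 'safe'


def default_class(statistic_type: str) -> tuple[str, str]:
    """Return (class, reason) for an output type.

    Anything that is not explicitly listed as safe is treated as
    'unsafe' until shown otherwise.
    """
    name = statistic_type.strip().lower()
    if name in SAFE_TYPES:
        return "safe", "listed as a safe type"
    if name in LINEAR_COMBINATIONS:
        return "unsafe", "linear combination"
    if name in RANKING_MARKS:
        return "unsafe", "ranking mark"
    return "unsafe", "not defined, so unsafe by default"


for stat in ["Mean", "median", "odds ratio", "Gini coefficient"]:
    print(stat, "->", default_class(stat))
```

A classification like this only gives the starting point; an ‘unsafe’ type can still be released when the specific context allows.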

10 Safe statistics relating to the SDC literature
Almost all SDC literature concentrates on tabular output. Why? Tables are usually linear combinations, and therefore unsafe. This means that a large amount of published output is problematic and sensitive to the context; hence, the SDC literature concentrates on tables.

11 Safe and unsafe statistics: relevance for ESS
Most outputs from government departments are tables, so surely they are all ‘unsafe’? Recall: safe/unsafe combinations are safe; not all tables are equally risky; not all tables demand the same scrutiny. You can be selective in your confidentiality checks: focus on the problematic tables.

12 Safe statistics: practice guidelines
Expert guidelines: Brandt et al. (2010, rev. 2015), Guidelines for the checking of output based on microdata research. Not all statistics are defined; the default categorisation is ‘unsafe’. There is a community of support in NSIs.
Eurostat published expert guidelines in 2010, as an addendum to the ESSNet Handbook of Best Practice. A revised version was published in 2015, currently available at […]. However, not all statistics are defined, and there have been more recent changes. If in doubt, treat things as unsafe until proved otherwise. There is a community of expertise in Member States, although nowadays it tends to reside in academia rather than NSIs.

13 Other material Background papers on ‘safe statistics’:
Ritchie F. (2008) “Disclosure detection in research environments in practice”, in Work session on statistical data confidentiality 2007; Eurostat; pp […]
Ritchie F. (2014) “Operationalising safe statistics: the case of linear regression”, Working papers in Economics no. 1410, University of the West of England, Bristol, September

14 Questions? CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

