Treatment of statistical confidentiality Part 3: Generalised Output SDC Introductory course Trainer: Felix Ritchie CONTRACTOR IS ACTING UNDER A FRAMEWORK.

Presentation transcript:

Treatment of statistical confidentiality Part 3: Generalised Output SDC Introductory course Trainer: Felix Ritchie CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Generalised output SDC
Relevant for you if you are:
- producing non-tabular outputs
- using the Eurostat Safe Centre
- responsible for microdata available to researchers
If not in one of these groups… note the principles! Tables are a subset of the statistics discussed here.
In this section, we will show how dealing with tables fits within a general framework designed to deal with all types of output, and we will discuss how the simple ‘rules’ introduced in the first part should be seen as specific instances of a more general approach. Showing how tables fit into this approach should help to explain why we spend so much time on tables and so little on other outputs. We will also consider how we might define rules for new outputs we haven’t discussed yet.

Key concept: ‘safe statistics’
A generalised approach to deciding whether statistics can be released or not, based on recognising different types of output.
Method:
- identify the type of output (counts, totals, correlation coefficients, odds ratios)
- if the output is a ‘safe’ type then release
- if not, release only if the specific context allows
‘Safe statistics’ is a methodological framework for approaching SDC of any outputs, irrespective of whether any ‘rules’ have been defined for that output or not. The method is based on the fact that different types of output present different confidentiality risks. Rather than treating everything as problematic, we should try to sort outputs into classes so that we can concentrate on the most problematic.
The method:
- identify the type of output (for example, a frequency table consists of a series of counts)
- check whether the type is a ‘safe statistic’
- if so (for example, a regression coefficient), release with minimal checking
- if not (for example, a mean), release only if the context allows

‘Safe statistics’: decision chart
1. Is the statistic of a ‘safe’ type? (E.g. is this regression or table a safe type?)
   Yes (this regression is a safe type): release
   No (this table is not a safe type): go to 2
2. Is the specific output safe?
   Yes (this particular table is safe): release
   No (this particular table is not safe): go to 3
3. Can protection measures be applied?
   Yes: redo / re-evaluate
   No: reject
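The decision chart can be sketched as a small function. This is a minimal illustration in Python, not part of the course material: the function name `release_decision` and its three boolean parameters are assumptions made for this example, and each flag stands in for a judgement that a human output checker would make.

```python
def release_decision(type_is_safe: bool, output_is_safe: bool, can_protect: bool) -> str:
    """Walk the three questions of the decision chart in order.

    All three arguments are hypothetical inputs: they represent the
    checker's answers to the chart's questions, not computed values.
    """
    # Question 1: is the statistic of a 'safe' type (e.g. a regression)?
    if type_is_safe:
        return "release"
    # Question 2: the type is unsafe (e.g. a table), but is this output safe?
    if output_is_safe:
        return "release"
    # Question 3: can protection measures be applied?
    if can_protect:
        return "redo / re-evaluate"
    return "reject"


if __name__ == "__main__":
    print(release_decision(True, False, False))   # a regression
    print(release_decision(False, True, False))   # a table that is safe in context
    print(release_decision(False, False, True))   # a table that needs protection
    print(release_decision(False, False, False))  # a table that cannot be protected
```

Note that later questions are only reached when the earlier answers are "no", which is why a safe type is released without looking at the other two flags.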

How are ‘safe statistics’ defined?
‘Safe’ is defined by functional form:
- if the mathematics cannot be undone to reveal a record, then it is ‘safe’
- additional rules might be needed for exceptional cases
- the maths might be undone by direct analysis, or by differencing
A safe statistic is one where there is no significant likelihood of a disclosure occurring because of the nature of the statistic itself, not because of the data or the number of observations.

What are ‘safe statistics’?
Suppose you consider two functions f() and g(), and two sets of data [x] and [x and a]:
- let s1 = f(x), s2 = f(x, a), s3 = g(x), s4 = g(x, a)
- if, given s1 and any of the others, you can’t determine ‘a’, then f() is ‘safe’
Suppose you have a function f(x) of a set of values x. Define y as the set x plus an additional value, a. Let g(x) be an alternative function which can take any form other than one specifically defined to attack f(x). Finally, let four statistical outcomes be defined: s1 = f(x), s2 = f(y), s3 = g(x), s4 = g(y). If, with access to s1, s2, s3 and s4 only (not direct access to x), it is not possible to determine a, then f(x) is ‘safe’. Note that this definition is independent of the values of x and a. If this result depends upon specific values of x, the statistic is not safe.

Safe statistics: examples
Unsafe statistic: simple total
$f(x) \equiv \sum_i x_i$, $f(x, a) \equiv \sum_i x_i + a$ $\Rightarrow$ $f(x, a) - f(x) = a$
Safe statistic: regression coefficient
$f(x) = \beta \equiv \left(\sum_i x_i^2\right)^{-1} \sum_i x_i y_i$
You can see that a total is unsafe as it can be unpicked by differencing. The regression coefficient cannot be differenced in the same way, as the additional value will be included in the inverted sum of squares.
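The contrast between the two statistics can be tried numerically. The sketch below is an illustration only: the data values, the additional record and the function names are all made up for this example, and the regression is the single-regressor, no-intercept form given on the slide. It shows that simple differencing recovers the added value from a total but not from a regression coefficient; it is not a proof that the coefficient is safe against every possible attack.

```python
def total(values):
    """Simple total: sum of x_i."""
    return sum(values)


def regression_coefficient(xs, ys):
    """No-intercept least-squares slope: (sum x_i^2)^-1 * sum x_i y_i."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical data: three existing records plus one additional record.
xs = [2.0, 3.0, 5.0]
ys = [4.0, 7.0, 9.0]
extra_x, extra_y = 4.0, 20.0

# Differencing two totals returns the additional value exactly.
recovered = total(xs + [extra_x]) - total(xs)

# Differencing two coefficients does not: the additional record enters
# both the sum of cross-products and the inverted sum of squares.
beta_before = regression_coefficient(xs, ys)
beta_after = regression_coefficient(xs + [extra_x], ys + [extra_y])
beta_difference = beta_after - beta_before

if __name__ == "__main__":
    print("difference of totals:", recovered)
    print("difference of coefficients:", beta_difference)
```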

Safe statistics: what about exceptions?
Some statistics are ‘safe, with qualifiers’. For example:
- regression coefficients are safe except in the case of repeated regression with one additional observation and all categorical variables
- for all analytical outputs, there must be more degrees of freedom than results presented
- for odds ratios, at least three observations are needed
Qualifiers must be few, rare, specific and relate to the form of the data, not any specific data type.
‘Safe’ statistics are generally not safe in every conceivable circumstance (nothing could be), so there must be some qualifiers. But to count as a ‘safe’ statistic, any qualifiers must be:
- few: if there are many exceptions, treat it as unsafe
- rare: they must be unlikely outcomes
- specific: they must be easily checkable
- related to the form: if it’s an exception that only relates to Census and health data but not to business data, then it is clearly sensitive to the context and so can’t be safe

Safe statistics: some general rules
- All linear combinations are ‘unsafe’: means, sums, counts
- Ranking marks are ‘unsafe’: maxima, minima, percentiles, medians
- Non-linear combinations are generally safe
- Combinations of safe outputs are generally safe: an odds ratio is ‘safe’, so a table showing mean odds ratios is ‘safe’
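These general rules can be written down as a simple lookup. The following is a minimal sketch, not an official classification: the dictionary name, the label strings and the `classify` function are assumptions made for this example. The entries are limited to statistics that this presentation itself names as safe or unsafe, and anything not listed falls back to ‘unsafe’, following the default categorisation mentioned under the practice guidelines.

```python
SAFE = "safe"
UNSAFE = "unsafe"

# Hypothetical lookup built only from the examples named in the slides.
STATISTIC_TYPES = {
    # linear combinations
    "mean": UNSAFE,
    "sum": UNSAFE,
    "count": UNSAFE,
    # ranking marks
    "maximum": UNSAFE,
    "minimum": UNSAFE,
    "percentile": UNSAFE,
    "median": UNSAFE,
    # statistics the slides describe as safe
    "regression coefficient": SAFE,
    "odds ratio": SAFE,
}


def classify(statistic_type: str) -> str:
    """Return 'safe' or 'unsafe'; unknown types default to 'unsafe'."""
    return STATISTIC_TYPES.get(statistic_type.strip().lower(), UNSAFE)


if __name__ == "__main__":
    for name in ["Mean", "odds ratio", "median", "some new statistic"]:
        print(name, "->", classify(name))
```

A ‘safe’ result here only says the type needs minimal checking; the qualifiers for exceptional cases still apply.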

Safe statistics relating to the SDC literature
Almost all SDC literature concentrates on tabular output. Why?
- tables are usually linear combinations, and so ‘unsafe’
- therefore, a large amount of the output being published is problematic and sensitive to the context
- hence, the SDC literature concentrates on tables

Safe and unsafe statistics: relevance for ESS
Most outputs from government departments are tables – surely all ‘unsafe’?
Recall:
- safe/unsafe combinations are safe
- not all tables are equally risky
- not all tables demand the same scrutiny
- you can be selective in your confidentiality checks: focus on the problematic tables

Safe statistics: practice guidelines
- Expert guidelines: Brandt et al. (2010, rev. 2015) Guidelines for the checking of output based on microdata research
- Not all statistics are defined: default categorisation is ‘unsafe’
- Community of support in NSIs
Eurostat published expert guidelines in 2010, as an addendum to the ESSNet Handbook of Best Practice. A revised version was published in 2015, currently available at http://www.dwbproject.org/export/sites/default/access/doc/dwb_standalone-document_output-checking-guidelines.pdf
However, not all statistics are defined, and there have been more recent changes. If in doubt, treat things as unsafe until proved otherwise. There is a community of expertise in MSs, although nowadays it tends to reside in academia rather than NSIs.

Other material
Background papers on ‘safe statistics’:
- Ritchie F. (2008) “Disclosure detection in research environments in practice”, in Work session on statistical data confidentiality 2007; Eurostat; pp399-406 http://epp.eurostat.ec.europa.eu/portal/page/portal/conferences/documents/unece_es_work_session_statistical_data_conf/TOPIC%203-WP.37%20SP%20RITCHIE.PDF
- Ritchie F. (2014) “Operationalising safe statistics: the case of linear regression”, Working papers in Economics no. 1410, University of the West of England, Bristol, September http://www1.uwe.ac.uk/bl/research/bristoleconomics/research/economicspapers2014.aspx

Questions?