The methodology used for the 2001 SARs Special Uniques Analysis. Mark Elliot, Anna Manning. Confidentiality And Privacy Group (www.capri.man.ac.uk), University of Manchester.

Presentation transcript:

The methodology used for the 2001 SARs Special Uniques Analysis
Mark Elliot, Anna Manning
Confidentiality And Privacy Group (www.capri.man.ac.uk), University of Manchester

Overview
- Description of DIS
- Description of SUDA
- Description of DIS-SUDA
- Numerical study

Data Intrusion Simulation (DIS)
- Uses the microdata set itself to estimate risk at the file level.
- Provides estimates of matching probabilities, in particular the probability of a correct match given a unique match: pr(cm|um).
- Special method: sub-sampling and re-sampling.
- General method: derivation from the partition structure of the microdata file.

The DIS Method
- Remove a small number of records from the microdata sample.

The DIS Method II
- Copy back a random number of the removed records (each with a probability equal to the original sampling fraction).

The DIS Method III
- Match the removed fragment against the truncated microdata file.
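The three steps above (the special, sub-sampling and re-sampling method) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `dis_estimate`, its parameters, and the representation of each record as a tuple of key-variable values are all assumptions made for the example.

```python
import random
from collections import Counter


def dis_estimate(records, sampling_fraction, n_remove, n_reps=200, seed=0):
    """Estimate pr(cm|um) by repeated sub-sampling (hypothetical helper).

    records: list of hashable keys (e.g. tuples of key-variable values).
    sampling_fraction: original sampling fraction of the microdata file.
    n_remove: number of records removed in each repetition.
    """
    rng = random.Random(seed)
    unique_matches = 0
    correct_matches = 0
    for _ in range(n_reps):
        # Step I: remove a small number of records.
        removed_idx = set(rng.sample(range(len(records)), n_remove))
        removed = [records[i] for i in removed_idx]
        counts = Counter(r for i, r in enumerate(records) if i not in removed_idx)
        # Step II: copy each removed record back with probability
        # equal to the sampling fraction.
        copied_back = [rng.random() < sampling_fraction for _ in removed]
        for rec, back in zip(removed, copied_back):
            if back:
                counts[rec] += 1
        # Step III: match the removed fragment against the file.
        for rec, back in zip(removed, copied_back):
            if counts[rec] == 1:      # exactly one candidate: a unique match
                unique_matches += 1
                if back:              # the single candidate is the record itself
                    correct_matches += 1
    return correct_matches / unique_matches if unique_matches else float("nan")
```

If every record in the file has a distinct key, a removed record can only match its own copy, so the estimate is 1. If every key occurs exactly twice and one record is removed at a time, every unique match is against the wrong record and the estimate is 0.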

DIS Validation
- Numerical studies using population data showed no bias and small error: Elliot (2000).
- Statistical validation: Skinner and Elliot (2002).

Levels of Risk Analysis
- DIS
  - Works at the file level
  - Very good for comparative analyses, e.g. small area microdata (SAM); Tranmer et al. (2003)
- But record-level risk is important
  - Variations in risk topography
  - Risky records

Special Uniques
- Original concept: Elliot, Skinner & Dale (1998)
  - A counterintuitive geographical effect indicated two types of sample uniques: random and special
  - Special: demographic peculiarity
  - Random: effect of sampling and variable definition

Special Uniques Definitions
Changing definition:
1. Sample uniques which remain unique despite geographical aggregation.
2. Sample uniques which remain unique through any variable aggregation.
3. Sample uniques on a small number of key variables.

Theoretical and empirical properties of special and random uniques

Special Uniques: Issues
- Problem: how to look at all the variables?
  - File may contain hundreds
  - Combinatorial explosion
  - Data storage issues
    (1) Storage requirements for locating minimal sample unique patterns (MSUs)
    (2) Storage of results for post-processing
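To give a feel for the combinatorial explosion mentioned above, the short sketch below counts how many variable subsets would have to be examined per record. The function name `n_variable_subsets` and the example figures (100 variables, subsets of up to five variables) are illustrative assumptions, not values taken from the presentation.

```python
from math import comb


def n_variable_subsets(n_vars, max_size):
    """Number of non-empty variable subsets of size 1..max_size."""
    return sum(comb(n_vars, k) for k in range(1, max_size + 1))


# All non-empty subsets of 12 key variables: 2**12 - 1.
print(n_variable_subsets(12, 12))
# Subsets of at most 5 variables drawn from a hypothetical 100-variable file.
print(n_variable_subsets(100, 5))
```

Twelve variables give 4,095 subsets, while a 100-variable file already gives about 79 million subsets of size five or less, which is why storage and high-performance computing become concerns.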

HIPERSTAD
- Use of high-performance computing
  - Enables comprehensive analysis of patterns of uniqueness within each record
  - Has allowed investigation of more complex grading systems

Risk Signatures
Example:
- Unique pairs: 3
- Unique triples: 2
- Unique fourfolds: 0
- Unique fivefolds: 1
- Unique sixfolds: 0
- Unique sevenfolds: 0
- ...

An example of MSUs at record level

Size 2    Size 3      Size 5
(1,2)     (1,6,9)     (2,5,6,8,11)
(1,5)     (5,8,12)
(1,8)
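The idea of minimal sample uniques and the resulting risk signature can be illustrated with a brute-force search. This is a sketch under stated assumptions: it enumerates variable subsets exhaustively rather than using the SUDA algorithm itself, and the names `minimal_sample_uniques` and `risk_signature` as well as the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def minimal_sample_uniques(data, row, max_size=None):
    """Return the variable subsets (as tuples of column indices) on which
    record `row` is unique in `data` and no proper subset is unique."""
    n_vars = len(data[row])
    max_size = max_size or n_vars
    msus = []
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_vars), size):
            # Skip supersets of an MSU already found: they are not minimal.
            if any(set(m) <= set(cols) for m in msus):
                continue
            target = tuple(data[row][c] for c in cols)
            matches = sum(1 for r in data if tuple(r[c] for c in cols) == target)
            if matches == 1:
                msus.append(cols)
    return msus


def risk_signature(msus):
    """Count MSUs by size, e.g. {2: 3, 3: 2, 5: 1}."""
    return Counter(len(m) for m in msus)


# Toy sample with three key variables (sex, age band, occupation).
sample = [
    ("M", "30-39", "nurse"),
    ("M", "30-39", "teacher"),
    ("F", "30-39", "nurse"),
    ("F", "40-49", "teacher"),
    ("M", "40-49", "teacher"),
]
print(minimal_sample_uniques(sample, 0))

# The MSUs tabulated in the slide give the signature 3 pairs, 2 triples, 1 fivefold.
slide_msus = [(1, 2), (1, 5), (1, 8), (1, 6, 9), (5, 8, 12), (2, 5, 6, 8, 11)]
print(risk_signature(slide_msus))
```

Exhaustive enumeration like this is only feasible for a handful of variables, which is the motivation for the dedicated algorithm and high-performance computing described in the presentation.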

Numerical Study
- Elliot et al. (2002) show a strong relationship between the SUDA output score (essentially a measure of the proportion of the lattice that is unique) and the population equivalence class.
- However, the SUDA output score is ad hoc: two SUDA output scores from different analyses do not mean the same thing.

DIS-SUDA
- DIS and SUDA outputs both relate to the underlying partition structure in the population.
- However, relating the two is tricky because the SUDA score is ad hoc.
- The method we have developed involves first running DIS to calibrate SUDA.

DIS-SUDA
- Exploits the fact that DIS accurately estimates the mean reciprocal equivalence class.
  - This can be used to derive the number of population units corresponding to the sample uniques,
  - which can then be distributed using the SUDA score.
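One possible reading of this calibration step is sketched below. It is an assumption, not the authors' confirmed formula: the DIS estimate is treated as the mean of 1/F over the sample uniques, so multiplying it by the number of sample uniques gives a total that is shared out across those records in proportion to their SUDA scores. The function name `dis_suda_scores` and its arguments are hypothetical.

```python
def dis_suda_scores(suda_scores, dis_estimate):
    """Calibrate ad hoc SUDA scores with a file-level DIS estimate.

    suda_scores: dict mapping record id -> SUDA score (sample uniques only).
    dis_estimate: file-level pr(cm|um), read here as the mean of 1/F
                  over the sample uniques (an assumption of this sketch).
    """
    total_suda = sum(suda_scores.values())
    if total_suda <= 0:
        raise ValueError("SUDA scores must have a positive total")
    # Total of 1/F implied by DIS across all sample uniques.
    budget = dis_estimate * len(suda_scores)
    # Share the total out in proportion to each record's SUDA score.
    return {rid: budget * s / total_suda for rid, s in suda_scores.items()}


calibrated = dis_suda_scores({"r1": 1.0, "r2": 3.0}, dis_estimate=0.2)
print(calibrated)
```

Under this reading the calibrated scores average to the DIS estimate, which makes scores from different analyses comparable. Very high SUDA scores could produce values above 1, which would need capping if the result is to be read as a probability.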

DIS-SUDA

DIS-SUDA Evaluation
- 1991 census data used.
- Geographical area with a population of approximately 0.5 million.
- 50 parallel, geographically stratified 2% samples drawn.
- 12 key variables, restricted to variables coded at 100% in 1991.
- DIS-SUDA run across all 50 samples.
- Results summed across the 50 samples.
- DIS-SUDA scores compared with population uniques and 1/Fj.
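The comparison in the last step can be sketched as follows, assuming the population keys are available (as they were in the 1991 census study). Fj is taken to be the size of the population equivalence class of sample record j; the function name `evaluate`, the band edges and the toy data are assumptions for illustration only.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def evaluate(population, sample_uniques, scores, band_edges):
    """Summarise population uniqueness and mean 1/F by score band.

    population: list of keys for every population unit.
    sample_uniques: keys of the sample-unique records (each must occur
                    in `population`).
    scores: DIS-SUDA score for each sample-unique record.
    band_edges: sorted cut points used to group the scores.
    """
    class_size = Counter(population)          # F for every key
    bands = defaultdict(list)
    for key, score in zip(sample_uniques, scores):
        band = bisect_right(band_edges, score)
        bands[band].append(1.0 / class_size[key])
    return {
        band: {
            "n": len(recips),
            "pct_pop_unique": 100.0 * sum(r == 1.0 for r in recips) / len(recips),
            "mean_recip_F": sum(recips) / len(recips),
        }
        for band, recips in sorted(bands.items())
    }


population = ["a", "b", "b", "c", "c", "c", "c"]
summary = evaluate(population, ["a", "b", "c"], [0.9, 0.4, 0.1], band_edges=[0.5])
print(summary)
```

In the toy data the single high-scoring record is population unique, while the two low-scoring records have a mean reciprocal class size of 0.375.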

Percentage of records population unique by DIS-SUDA score (rounded up to one decimal place).

Mean reciprocal population equivalence class by DIS-SUDA score (grouped)

Conclusions
- The combination of DIS and SUDA gives the desired record-level matching certainty metric.
- Records that DIS-SUDA predicts to be population unique are extremely likely to be so.