Perspective on User Needs for Government Data: Where Do We Go From Here?
Natalie Shlomo, Social Statistics, School of Social Sciences, University of Manchester
Topics Covered
Traditional forms of statistical outputs
Disclosure risk and data utility
Differential privacy / inferential disclosure
Future dissemination strategies: table generating servers, synthetic data, remote access, remote analysis
Challenges and limitations
Traditional Statistical Outputs
Survey microdata: social survey data are generally released via data archives for registered users. Business surveys have large sampling fractions (e.g. take-all strata) and highly skewed distributions, and are generally not released.
Tabular data:
Frequency tables – census (whole population) counts with careful design of output variables; weighted sample counts.
Magnitude tables – mainly for business statistics.
Types of Disclosure Risk
Identity disclosure: identification is widely referred to in confidentiality pledges and codes of practice, e.g. "…no statistics will be produced that are likely to identify an individual unless specifically agreed with them" (Principle 5 of the NS Code of Practice).
Examples:
Survey microdata – identify a respondent through rare categories (population uniques) and/or response knowledge.
Census tables – a small cell (a count of 1 or 2).
Types of Disclosure Risks
Individual attribute disclosure: confidential information about a data subject is revealed and can be attributed to that subject. Identity disclosure is a necessary condition for individual attribute disclosure.
Examples:
Survey microdata – an individual is identified and the survey target variables are learnt, e.g. health, income.
Census tables – a unique cell on the margin, i.e. structural zeros across the rest of the row/column.
Types of Disclosure Risks
Group attribute disclosure: confidential information is learnt about a group and may cause harm, e.g. all adults in a village collect unemployment benefits.
Examples:
Survey microdata – group attribute disclosure is difficult to find under survey conditions.
Census tables – caused by structural zeros, i.e. a row/column consists of all zeros except one cell.
Types of Disclosure Risks
Inferential disclosure: confidential information may be revealed exactly or to a close approximation.
Examples:
Survey microdata – a good prediction model with high predictive power.
Census tables – disclosure by differencing.
This type of disclosure has been largely ignored!
Standard SDC Methods
Survey microdata from social surveys: identity disclosure is the main concern since it can lead to attribute disclosure. Disclosure control methods are generally non-perturbative:
Deleting highly identifying variables (e.g. geography)
Recoding identifying variables (e.g. age, ethnicity)
Magnitude tables: attribute disclosure is the concern (since identities are likely known), in particular dominance in a cell. Disclosure control methods:
Table design
Cell suppression
Standard SDC Methods: Census Tables
Risks: identity disclosure, attribute disclosure and disclosure by differencing.
Disclosure control methods:
Careful design of tables and threshold criteria
Fixed variables spanning tables to avoid differencing
In some countries, the long form is a sub-sample
Pre-tabular methods, e.g. record swapping
Post-tabular methods, e.g. forms of rounding
Inferential Disclosure (Differential Privacy)
Differential privacy is based on disclosure about a target unit where the intruder has knowledge of the entire database except for the target unit itself.
No distinction is made between key variables and sensitive variables, between types of disclosure risk, or between data arising from a sample or a population.
Differential privacy is similar to the notion of disclosure by differencing, since under it even a sum of counts or an average is disclosive.
Inferential Disclosure (Differential Privacy)
Definition of differential privacy with respect to statistical databases (Dwork et al. 2006; Shlomo and Skinner 2012):
Assume a population database $X_U$ from which a sample is drawn.
Assume the agency releases a set of counts $f = (f_1, \ldots, f_K)$, where $f_k$ is the released count in cell $k$.
Assume the intruder knows the population database except for one target unit.
Let $P(f \mid X_U)$ denote the probability of $f$ with respect to an SDC mechanism, where $X_U$ is treated as fixed.
Inferential Disclosure (Differential Privacy)
Then $\varepsilon$-differential privacy holds iff $\max \log \left[ P(f \mid X_U) / P(f \mid X'_U) \right] \le \varepsilon$, where the maximum is taken over all possible pairs $(X_U, X'_U)$ which differ by only one unit and across all possible vectors $f$.
Differential privacy can be guaranteed by adding noise to all outputs.
The amount of noise depends on the number of units in the query but is independent of the data themselves.
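To make the last point concrete, here is a minimal sketch (not taken from the slides) of the standard Laplace mechanism in Python; the function name, default sensitivity and example counts are illustrative choices.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise to a vector of counts.

    A minimal sketch of the standard Laplace mechanism: the noise scale
    sensitivity/epsilon depends only on the query's sensitivity and the
    privacy budget, never on the data themselves.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    scale = sensitivity / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=counts.shape)

# Example: protect a small frequency table with epsilon = 0.5
noisy = laplace_mechanism([120, 3, 0, 45], epsilon=0.5)
```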
Inferential Disclosure (Differential Privacy)
Does sampling and the release of microdata guarantee differential privacy (Shlomo and Skinner, 2012)? No!
Let $f_k$ be a sample count and $F_k$ the corresponding population count. It is assumed that the intruder knows everything in the population table except for one unit.
If $F_k = f_k$ and we move one of the counts of $F_k$ to another cell, then we may obtain $F_k < f_k$, which is impossible, so the ratio in the definition is unbounded.
Sampling is therefore not differentially private.
How likely is it to obtain $F_k = f_k$ in a sample? Usually 2–3% of cells.
Agencies will generally decide to allow this 'slippage' and issue a controlled release of microdata.
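One rough way to gauge how often this 'slippage' would occur for a particular table design is to simulate it. The sketch below assumes a skewed cell-size distribution and a 5% simple random sample; all sizes are invented, so the printed share is illustrative only and is not the 2–3% figure quoted above.

```python
import numpy as np

# Rough simulation sketch: estimate how often a sample cell count equals the
# population count, i.e. the cells where the differential-privacy ratio
# becomes unbounded under sampling.
rng = np.random.default_rng(0)

K = 50_000                                   # cells in a detailed cross-classification
cell_probs = rng.pareto(1.5, size=K) + 1e-6  # skewed cell sizes -> many tiny cells
cell_probs /= cell_probs.sum()

N, n = 500_000, 25_000                       # assumed population and 5% SRS
pop_cells = rng.choice(K, size=N, p=cell_probs)
F = np.bincount(pop_cells, minlength=K)      # population counts F_k

sample = rng.choice(N, size=n, replace=False)
f = np.bincount(pop_cells[sample], minlength=K)  # sample counts f_k

risky = (f > 0) & (f == F)                   # sample count equals population count
print(f"non-empty sample cells with f_k = F_k: {risky.sum() / (f > 0).sum():.1%}")
```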
Inferential Disclosure (Differential Privacy)
Does perturbation guarantee differential privacy?
Assume a perturbation mechanism defined by transition probabilities $p_{ij} = P(\tilde{f}_k = j \mid f_k = i)$ of releasing value $j$ when the true value is $i$.
Then the ratio in the definition will contain elements of the form $p_{ij} / p_{i'j}$.
If the perturbation mechanism has no zero transition probabilities, these ratios are bounded and the perturbation scheme is differentially private.
Inferential Disclosure (Differential Privacy)
Examples of perturbation mechanisms: recoding, random data swapping and PRAM (the post-randomisation method), each of which can be expressed through a transition probability matrix.
In practice we control the perturbation and place zeros in the transition matrix to ensure edit constraints are respected (which means strict differential privacy no longer holds).
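As a concrete illustration of one such mechanism, here is a minimal PRAM sketch in Python; the transition matrix and category codes are invented for the example.

```python
import numpy as np

def pram(categories, transition, rng=None):
    """Apply PRAM to a vector of categorical codes.

    A minimal sketch, assuming transition[i, j] is the probability of
    releasing category j when the true category is i (rows sum to 1).
    Zeros in the matrix block transitions that would violate edit rules,
    but they also break the differential-privacy guarantee discussed above.
    """
    rng = np.random.default_rng() if rng is None else rng
    transition = np.asarray(transition, dtype=float)
    return np.array([rng.choice(len(transition), p=transition[c])
                     for c in categories])

# Example: 3 categories, 90% chance of keeping the true value
P = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
released = pram(np.array([0, 1, 2, 1, 0]), P)
```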
‘Safe Data’ vs ‘Safe Access’
In the last decade agencies have become increasingly concerned about breaches of confidentiality, particularly given the large number of open databases that can be used to attack statistical data.
Agencies are restricting access to data through more stringent licensing and the use of on-site data labs.
How can we make statistical data more available to users?
Why aren't agencies making more use of 'modern' dissemination strategies?
Future Dissemination Strategies
Census tables: on-line flexible table generation based on a web package.
Input data are frequency counts in a multi-dimensional hypercube with small geographical areas.
Disclosure risk measures and SDC methods are applied 'on the fly'.
A set of rules is embedded in the package, e.g. population thresholds, proportion of small cells, etc.
To avoid disclosure by differencing, noise must be added (see the sketch below).
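Purely to illustrate how threshold rules and 'on-the-fly' noise might be wired together in such a server (this is not the web package used by any agency), a toy Python sketch; the rule set, thresholds and Laplace noise are assumptions.

```python
import numpy as np
import pandas as pd

def serve_table(microdata, by, min_pop=50, max_small_share=0.2,
                noise_scale=2.0, rng=None):
    """Toy flexible table server: build a frequency table 'on the fly',
    refuse it if it fails simple rules, otherwise release a noisy version.

    The thresholds, noise distribution and rule set are invented for this
    sketch; real servers embed agency-specific rules and perturbation.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = microdata.groupby(list(by)).size()

    if counts.sum() < min_pop:
        raise ValueError("population below the release threshold")
    if (counts[counts > 0] <= 2).mean() > max_small_share:
        raise ValueError("too many small cells in the requested table")

    # add random noise so differencing overlapping tables is not exact
    noisy = counts + rng.laplace(scale=noise_scale, size=len(counts))
    return noisy.round().clip(lower=0).astype(int)

# usage sketch: serve_table(df, by=["age_group", "occupation"], min_pop=100)
```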
Example: Simulation Hypercube
Shlomo, Antal and Elliot (2015)
Population N = 1,500,000
NUTS2 region – 2 regions
Gender – 2 categories
Banded age groups – 21 categories
Current activity status – 5 categories
Occupation – 13 categories
Educational attainment – 9 categories
Country of citizenship – 5 categories
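A sketch of how such a simulation hypercube could be set up; the category counts follow the slide, but the uniform category probabilities are an assumption made only so the code runs.

```python
import numpy as np
import pandas as pd

# Simulated hypercube along the lines described above: 1.5 million units
# cross-classified by the seven variables listed on the slide.
rng = np.random.default_rng(2015)
N = 1_500_000
spec = {"nuts2": 2, "gender": 2, "age_group": 21, "activity": 5,
        "occupation": 13, "education": 9, "citizenship": 5}

pop = pd.DataFrame({v: rng.integers(0, k, size=N) for v, k in spec.items()})
hypercube = pop.groupby(list(spec)).size()   # frequency counts per combination
# (combinations never observed are absent here; reindex over the full
#  Cartesian product to include the zero cells)
print(hypercube.describe())
```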
Flexible Table Generating Servers
Based on the restrictions of the server, define a 3-dimensional table with one variable defining the population: banded age group, education group and occupation group, defined for NUTS2 = 1.
The table has 2,457 cells and 854,539 individuals, with an average cell size of 347.8.

Cell value | Number of cells | Percentage of cells
0 | 1,534 | 62.43%
1 | 44 | 1.79%
2 | 35 | 1.42%
3 | 27 | 1.10%
4 | 20 | 0.81%
5 and over | 797 | 32.44%
Total | 2,457 | 100.00%
Information Based Disclosure Risk and Data Utility Measures
To assess attribute disclosure in tables, mainly caused by structural zeros, use the entropy $H(p) = -\sum_k p_k \log p_k$, where $f = (f_1, \ldots, f_K)$ is the vector of frequency counts and $p_k = f_k / \sum_j f_j$.
The entropy is bounded below by 0 (all cells zero except one) and above by $\log K$ (all cell values equal, i.e. cell proportions of $1/K$).
Risk measure: based on the entropy normalised by $\log K$.
Combine with other measures (the proportion of zeros and the size of the population) and define a weighted average.
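A minimal sketch of an entropy-based risk indicator along these lines; the normalisation, component weights and reference population size are assumptions and do not reproduce the published measure.

```python
import numpy as np

def entropy_risk(counts, weights=(0.5, 0.3, 0.2), ref_pop=1_000_000):
    """Entropy-based attribute-disclosure risk indicator for a frequency table.

    Sketch only: combines (1) one minus the normalised entropy, (2) the
    proportion of zero cells and (3) a small-population penalty, with
    assumed weights.
    """
    counts = np.asarray(counts, dtype=float)
    K, N = counts.size, counts.sum()
    p = counts / N
    H = -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy, 0 <= H <= log K
    concentration = 1.0 - H / np.log(K)        # 1 when one cell holds everything
    prop_zeros = np.mean(counts == 0)
    small_pop = 1.0 - min(N / ref_pop, 1.0)
    w1, w2, w3 = weights
    return w1 * concentration + w2 * prop_zeros + w3 * small_pop

# Example: a sparse, concentrated table is riskier than a uniform one
print(entropy_risk([100, 0, 0, 0, 1]), entropy_risk([20, 21, 19, 20, 20]))
```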
Information Based Disclosure Risk and Data Utility Measures
The risk measure can be extended to account for perturbation and sampling, based on the conditional entropy.
Utility measure: Hellinger's distance $HD(f, \tilde{f}) = \sqrt{\tfrac{1}{2} \sum_k (\sqrt{f_k} - \sqrt{\tilde{f}_k})^2}$, where $f_k$ are the original counts and $\tilde{f}_k$ the perturbed counts.
Hellinger's distance is bounded below by 0 (and above by 1 when computed on cell proportions) and can be used to compare SDC methods.
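A minimal Python sketch of Hellinger's distance computed on cell proportions (normalising to proportions, and the example counts, are assumptions of the sketch):

```python
import numpy as np

def hellinger_distance(original, perturbed):
    """Hellinger's distance between original and perturbed frequency counts.

    Computed on cell proportions so the result lies in [0, 1]; a value of 0
    means the perturbed table carries the same distribution as the original.
    """
    f = np.asarray(original, dtype=float)
    g = np.asarray(perturbed, dtype=float)
    p, q = f / f.sum(), g / g.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Example: compare two perturbed versions of the same original table
orig = np.array([120, 3, 0, 45, 7])
print(hellinger_distance(orig, [118, 5, 1, 44, 6]),
      hellinger_distance(orig, [125, 0, 0, 50, 0]))
```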
Results

Method | Disclosure risk measure | Data utility measure
Table 1, original | 0.318 | –
Perturbed input: record swapping | 0.282 | 0.988
Perturbed input: semi-controlled random rounding | 0.137 | 0.991
Perturbed input: stochastic perturbation | 0.239 | 0.995
Perturbed output: semi-controlled random rounding | 0.135 | 0.993

Comparing the rounding before and after tabulation shows that applying the SDC 'on the fly' to the output gives lower disclosure risk and higher utility than rounding the input.
Future Dissemination Strategies
Synthetic Datasets
Partially synthetic microdata:
Preserves the record structure of the gold-standard microdata
Replaces data elements with synthetic values sampled from an appropriate probability model
Future work to assess disclosure risk
Fully synthetic microdata:
Preserves some of the gold-standard microdata
Generates synthetic entities and data elements from appropriate probability models
In practice it is very difficult to capture all conditional relationships between variables and within sub-populations
CTA (controlled tabular adjustment), where suppressed cells take imputed values
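A minimal sketch of the partially synthetic idea: keep the record structure and predictors, fit a simple model for one sensitive variable and replace its observed values with model draws. The linear model, normal residual noise and column names are assumptions, not any agency's production method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def partially_synthesize(df, target, predictors, rng=None):
    """Replace one sensitive column with model-based synthetic values.

    Sketch only: a linear model with normal residual noise stands in for the
    'appropriate probability model'; real syntheses use richer models and
    release multiple implicates.
    """
    rng = np.random.default_rng() if rng is None else rng
    model = LinearRegression().fit(df[predictors], df[target])
    fitted = model.predict(df[predictors])
    resid_sd = np.std(df[target] - fitted)
    synthetic = df.copy()                       # record structure is preserved
    synthetic[target] = fitted + rng.normal(0.0, resid_sd, size=len(df))
    return synthetic

# usage sketch: syn = partially_synthesize(survey_df, "income", ["age", "hours"])
```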
Future Dissemination Strategies
Data enclaves: a secure IT environment where researchers can access confidential data on-site, e.g. the Virtual Microdata Lab (VML) at the ONS.
Researchers apply to carry out a project and sign a contract and confidentiality agreement.
Minimising the risk of disclosure:
No removal of data, no printers, no internet connection
All outputs checked manually by staff
Training course on the security rules
Research is needed on what constitutes a disclosive output.
Future Dissemination Strategies
Remote access: access to data through a remote connection to a secure server, typically at universities and research institutes.
Researchers carry out the analysis as if on their personal PC and view results on screen.
Outputs are dropped into a mailbox to be checked manually and sent back to researchers.
Future Dissemination Strategies
Remote analysis: some agencies (e.g. the US Census Bureau, ABS) are developing platforms for remote analysis or allowing researchers to submit code to be run on-site.
The aim is to protect outputs without the need for manual intervention.
Example (O'Keefe and Shlomo, 2012): comparison of confidentialising the input versus confidentialising the output.
Data on 338 sugar cane farms from a 1982 survey of the sugar cane industry in Queensland, Australia: Region (4 categories) and 5 continuous variables: Area, Harvest, Receipts, Costs, Profits (= Receipts − Costs).
The input was confidentialised by additive noise and removing outliers.
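A sketch of the general 'confidentialised input' idea described above (additive noise plus damping of outliers); the noise fraction, quantile cap and normal noise are assumptions, not the settings used in the study.

```python
import numpy as np
import pandas as pd

def confidentialise_input(df, columns, noise_frac=0.1, outlier_q=0.99, rng=None):
    """Confidentialise continuous survey variables before remote analysis.

    Sketch only: cap values above an upper quantile to damp outliers, then
    add additive noise scaled to a fraction of each variable's standard
    deviation.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = df.copy()
    for col in columns:
        cap = out[col].quantile(outlier_q)          # damp extreme outliers
        out[col] = out[col].clip(upper=cap)
        sd = noise_frac * out[col].std()
        out[col] = out[col] + rng.normal(0.0, sd, size=len(out))  # additive noise
    return out

# usage sketch: safe = confidentialise_input(farms, ["Area", "Receipts", "Costs"])
```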
Future Dissemination Strategies
[Figures not reproduced: remote analysis example – Receipts and Residuals compared across the original data, the confidentialised input and the protected output.]
Challenges and Discussion
In recent years, managing disclosure risk has largely been about restricting access to data, yet there are more government initiatives for 'open data'.
Agencies need to use modern dissemination strategies to accommodate the increasing demand for 'open data'.
Stricter and tighter definitions of disclosure risk are needed, but users will have to work with perturbative SDC methods.
Agencies should release the methods and parameters of the perturbation so that researchers can account for the measurement error it introduces.
For 'on the fly' SDC methods, agencies should release utility measures based on the original file/tables.