G. Merola Winton Capital Management 1 UN/ECE Work Session On Statistical Data Confidentiality (Geneva, 9-11 November 2005) WP30: Safety rules in statistical.

Presentation transcript:

G. Merola Winton Capital Management 1 UN/ECE Work Session On Statistical Data Confidentiality (Geneva, 9-11 November 2005) WP30: Safety rules in statistical disclosure control for tabular data Giovanni Merola Winton Capital Management Ltd Partially written while at ISTAT and partially supported by EU project CASC.

Plan of the Talk: 1. SDC for magnitude tables; 2. Existing safety rules; 3. Generalised p-rule; 4. Rational estimates; 5. Prior distribution; 6. U-estimates; 7. Comparison on real SBS data; 8. MU-rules; 9. Concluding remarks.

1. SDC for Magnitude Tables. Tables showing the sums of non-negative contributions in each cell. Example: Income (£K) by Sex (Male, Female, All Sexes) and Age (Young, Old, All Ages) [cell values not recoverable]; the Old Males cell has total 600. For each cell the total $T$ is published, $n$ is the number of contributors, and the contributions are in non-increasing order: $z_1 \ge z_2 \ge z_3 \ge z_4 \ge \dots \ge z_n$.

1. SDC for Magnitude Tables cont.d SDC policy: 1. if the categories are confidential, (likely) identification of respondents is disclosure; 2. otherwise, only the contributions of (likely) identifiable respondents must not be disclosed (too precisely); 3. the same rule applies to all cells, otherwise microdata protection.

2. Existing Safety Rules. Rare respondents are identifiable: threshold rule, $n > m$. Respondents with large contributions are identifiable: dominance rule, $(z_1 + \dots + z_m)/T \le k$. The largest contributor is identifiable, hence the second largest must not estimate $z_1$ closely: p-rule, $[(T - z_2) - z_1]/z_1 > p$.
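The three rules on this slide can be sketched as simple checks on a cell's contributions. This is a minimal illustration, not the author's implementation: the function names and the sample cell are hypothetical, and the comparison sign in the dominance rule is assumed to be "at most k", since the symbol is damaged in the transcript.

```python
from typing import Sequence


def threshold_safe(z: Sequence[float], m: int) -> bool:
    """Threshold rule: the cell is safe when it has more than m contributors."""
    return len(z) > m


def dominance_safe(z: Sequence[float], m: int, k: float) -> bool:
    """Dominance rule: safe when the m largest contributions are at most a share k of T.

    The '<=' is an assumption; the sign did not survive in the transcript.
    """
    ordered = sorted(z, reverse=True)
    total = sum(ordered)
    return sum(ordered[:m]) / total <= k


def p_rule_safe(z: Sequence[float], p: float) -> bool:
    """p-rule: safe when the estimate T - z2 overshoots z1 by more than a fraction p."""
    ordered = sorted(z, reverse=True)
    total = sum(ordered)
    z1, z2 = ordered[0], ordered[1]
    return ((total - z2) - z1) / z1 > p


cell = [50.0, 30.0, 15.0, 5.0]  # hypothetical contributions, T = 100
print(threshold_safe(cell, m=3))          # four contributors, threshold 3
print(dominance_safe(cell, m=2, k=0.75))  # top two hold 80% of T
print(p_rule_safe(cell, p=0.25))          # (100 - 30 - 50) / 50 = 0.4
```

All three functions return True for a safe cell, matching the direction in which the slide states the threshold rule and the p-rule.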

3. Generalised p-rule. Includes the existence of groups of respondents. The group with the largest sum is identifiable; the group with the second largest sum must not estimate the largest sum too closely. [Diagram: contributions $z_1, z_2, z_3, z_4, \dots, z_n$ with total $T$; labelled groups $t_2$ and $R_{2,2}$.]

3. Generalised p-rule cont.d Gen. p-rule: $((T - R_{m,l}) - t_m)/t_m > p$. Same estimate as the p-rule, the maximum possible value: $\hat{t}_m = T - R_{m,l}$. With $t_1 = z_1$ and $R_{1,1} = z_2$ this is the p-rule.
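A short sketch of the generalised p-rule follows. It assumes that $t_m$ is the sum of the $m$ largest contributions and $R_{m,l}$ the sum of the next $l$ contributions; this reading is inferred from the special case $t_1 = z_1$, $R_{1,1} = z_2$ and is not spelled out in the transcript. The function name and sample cell are hypothetical.

```python
from typing import Sequence


def gen_p_rule_safe(z: Sequence[float], m: int, l: int, p: float) -> bool:
    """Generalised p-rule: safe when ((T - R_ml) - t_m) / t_m > p."""
    ordered = sorted(z, reverse=True)
    total = sum(ordered)
    t_m = sum(ordered[:m])        # assumed: sum of the m largest contributions
    r_ml = sum(ordered[m:m + l])  # assumed: sum of the next l contributions
    t_hat = total - r_ml          # the intruding group's estimate of t_m
    return (t_hat - t_m) / t_m > p


cell = [50.0, 30.0, 15.0, 5.0]  # hypothetical contributions, T = 100

# m = l = 1 reproduces the ordinary p-rule: (100 - 30 - 50) / 50 = 0.4
print(gen_p_rule_safe(cell, m=1, l=1, p=0.25))

# m = l = 2 with n = 4: the intruding group knows everything outside t_2,
# so its estimate is exact and the cell fails for any p >= 0
print(gen_p_rule_safe(cell, m=2, l=2, p=0.0))
```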

3. Generalised p-rule cont.d If zero contributions are known (external intruder): dominance rule with $k = 1/(1+p)$. If there are no groups: simple p-rule. If the intruding group is formed of $(m-1)$ respondents: the threshold rule $n > m$ protects against exact estimation ($p = 0$). Merola, G. M., 2003a. Generalized risk measures for tabular data. Proceedings of the 54th Session of the International Statistical Institute.
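The link to the dominance rule can be checked in one line, assuming that "zero contributions known" means $R_{m,l} = 0$ in the generalised p-rule. The strict inequality below follows from the p-rule as written; whether the dominance rule is meant with a strict or a weak sign is not recoverable from the transcript.

```latex
\[
\frac{(T - 0) - t_m}{t_m} > p
\;\Longleftrightarrow\;
\frac{T}{t_m} > 1 + p
\;\Longleftrightarrow\;
\frac{t_m}{T} < \frac{1}{1+p} = k .
\]
```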

4. Rational Estimates. An intruder can compute a lower and an upper bound for the value of $t_m$: $t_m^- \le t_m \le t_m^+$. For example, if $z_2 = 40$ and $T = 100$: $40 = z_2 \le z_1 \le T - z_2 = 60$. The bounds are different for different prior knowledge of the intruder.

4. Rational Estimates cont.d $t_m$ can be estimated by minimising the Mean Square Error under some distribution $F(t_m)$; by a well-known property, the MSE is minimised by the mean.
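The well-known property referred to here is the usual decomposition of the mean square error. It is written out below for a candidate estimate $c$; the symbol $c$ is introduced only for this illustration.

```latex
\[
\mathrm{E}_F\!\left[(t_m - c)^2\right]
= \mathrm{Var}_F(t_m) + \left(\mathrm{E}_F[t_m] - c\right)^2 ,
\qquad\text{so}\qquad
\hat{t}_m = \arg\min_c \mathrm{E}_F\!\left[(t_m - c)^2\right] = \mathrm{E}_F[t_m] .
\]
```

The variance term does not depend on $c$, so the minimum is attained when the squared term vanishes.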

5. Prior Distribution: Uniform. The ignorance about the distribution of $t_m$ can be modelled with a Uniform distribution, $t_m \sim U(t_m^-, t_m^+)$; in this case the mean is simply the midpoint, $\hat{t}_m = (t_m^- + t_m^+)/2$. Note: same estimate for any symmetric $F$.

5. Prior Distribution: maximising. The Generalised p-rule can be derived by assuming a prior concentrated on the maximum value. We refer to the Gen. p-rule as the M-rule, and to the one derived using the Uniform as the U-rule.

6. U-estimates. Different prior knowledge of the intruder: knows $T$ but not $n$ (Dominance); knows $T$ and $n$; knows $T$ and $L$ contributions (Gen. p-rule*); knows $T$, $L$ contributions and $n$: either as above or [formula not recoverable]. * For $m = L = 1$ the uniform p-rule is the same as uniform dominance. Merola, G., 2003b. Safety rules in statistical disclosure control for tabular data. Contributi Istat 1, Istituto Nazionale di Statistica, Roma.

6. U-estimates cont.d Example: $C = (970, 376, 274, 253, 203, 169, 161, 121, 86, 62, 21, 10)$, $T = 2706$. [Table with columns Rule, Estimated $z_1$, RelErr and rows Dom ($t_2/T = 0.5$), p-rule, U-Dom, U(1:n), U(1;1); values not recoverable.]
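As an illustration on the contributions of this example, the sketch below applies the bounds from the Rational Estimates slide ($z_2 \le z_1 \le T - z_2$) for an intruder who is the second largest contributor and knows $T$. It compares the maximum-value estimate used by the p-rule with the midpoint estimate under a Uniform prior. The variable and function names are hypothetical, and no claim is made that these figures match particular rows of the slide's table, whose values are missing from the transcript.

```python
contributions = [970, 376, 274, 253, 203, 169, 161, 121, 86, 62, 21, 10]
T = sum(contributions)
z1, z2 = contributions[0], contributions[1]

# Bounds available to the second largest contributor, who knows T and z2
lower, upper = z2, T - z2

max_estimate = upper                    # prior concentrated on the maximum (p-rule)
uniform_estimate = (lower + upper) / 2  # mean of U(lower, upper), i.e. the midpoint


def rel_err(estimate: float, true_value: float) -> float:
    """Relative error of an estimate with respect to the true value."""
    return abs(estimate - true_value) / true_value


print(T)
print(max_estimate, round(rel_err(max_estimate, z1), 3))
print(uniform_estimate, round(rel_err(uniform_estimate, z1), 3))
```

With these bounds the midpoint equals $T/2$, so the uniform estimate does not depend on $z_2$ in this particular case.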

7. Comparison on real SBS data. We applied the different rules to Italian SBS data: turnover by Region and SIC for the years '94 and '97. We considered the SIC with 2 and 3 digits.

7. Comparison on real SBS data cont.d Mean relative error for $z_1$.

7. Comparison on real SBS data cont.d Mean relative error for $t_2$.

U-rules. The values for [symbol not recoverable] are intervals. Knowing only $T$ (Dominance); knowing $T$ and $L$ contributions (gen. p-rule).

8. MU-rules. Assuming both estimating approaches, we obtain subadditive rules, analogous to the p-rule but with stricter bounds.

8. MU-rules cont.d Safety rule when only $T$ is known (Dominance); safety rule when $T$ and $L$ contributions are known (gen. p-rule).

9. Conclusions. The assumptions for the existing rules are unrealistic; using a simple noninformative distribution gives a much smaller relative error of estimation; the corresponding rules are not subadditive; joining assumptions leads to stricter rules; identifiability of all largest respondents requires these rules; different priors can be used.