Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata. Paul Massell and Jeremy Funk, Statistical Research Division, U.S. Census Bureau.


Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata Paul Massell and Jeremy Funk Statistical Research Division U.S. Census Bureau Washington, DC

2 Talk Outline
1. Overview of EZS Noise
2. Measuring Effectiveness of Perturbative Protection
3. Noise Applied to Weighted Data
4. Noise Applied to Unweighted Data: Random vs. Balanced Noise
5. Conclusions and Future Research

3 The EZS Noise Method (Evans, Zayatz, Slanta)
Developed by Tim Evans, Laura Zayatz, and John Slanta in the 1990s.
Multiplicative noise is added to the underlying microdata, before table creation.
A noise factor (multiplier) is randomly generated for each record.

4 The EZS Noise Method (Evans, Zayatz, Slanta)
The distribution of the multipliers should produce unbiased estimates and ensure that no multiplier is too close to 1.
Weights, both known and unknown to users, are combined with the noise factors to obtain noisy values for all records.
When tabulated, sensitive cells in general change substantially, while non-sensitive cells change only by a small amount.

5 Attractive Features of EZS
Tables with noisy data are created in the same way as the original tables: simply replace variable X with variable X-noisy.
Tables are automatically additive.
An approximate value could be released for every cell (depending on agency policy).
No complementary suppressions are needed.

6 Attractive Features of EZS
Linked tables and special tabulations are automatically protected consistently.
EZS allows for protection at the company level (a Census requirement).
Ease of implementation compared to methods such as cell suppression.

7 Measuring Effectiveness of the EZS Method
Step 1: Determine which cells in a table are sensitive, e.g., using the p% Sensitivity Rule.
Step 2: Measure the level of protection given to sensitive cells (using protection multipliers).
Step 3: Measure the amount of perturbation to non-sensitive cells (via a % change graph).

8 The p% Sensitivity Rule (Unweighted Data)
Let T = cell total and x1, x2 = the top two contributions.
Let rem denote the remainder: rem = T - (x1 + x2).
Let prot denote the suggested protection: prot = (p/100) * x1 - rem.
If prot > 0, then when Contributor 2 tries to estimate x1, rem does NOT provide enough uncertainty; additional protection is needed, and noise may provide this uncertainty.
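The unweighted p% rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the treatment of a single-contributor cell (taking x2 = 0) is an assumption the slide does not state.

```python
def p_percent_protection(contributions, p):
    """Suggested protection `prot` for one cell under the p% rule.

    `contributions` is a non-empty list of non-negative contributions.
    Assumption: a cell with one contributor is treated as having x2 = 0.
    """
    ordered = sorted(contributions, reverse=True)
    x1 = ordered[0]
    x2 = ordered[1] if len(ordered) > 1 else 0.0
    total = sum(ordered)            # T, the cell total
    rem = total - (x1 + x2)         # remainder after the top two
    return p * x1 / 100.0 - rem     # prot = (p/100) * x1 - rem


def is_sensitive(contributions, p):
    """A cell is sensitive when the suggested protection is positive."""
    return p_percent_protection(contributions, p) > 0


# A cell dominated by its top two contributors is sensitive at p = 10.
print(is_sensitive([100, 50, 3, 2], 10))
# A cell with a large remainder is not.
print(is_sensitive([100, 50, 20, 10], 10))
```

In the first cell the remainder is 5 while 10% of x1 is 10, so Contributor 2 could estimate x1 too closely; in the second the remainder of 30 already supplies enough uncertainty.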

9 p% Sensitivity Rule (Weighted Data)
TA = fully weighted cell estimate
X1 = largest cell respondent contribution
X2 = 2nd largest cell contribution
w_kn = known weights
w_un = unknown weights

10 Extended p% Rule with Weights and Rounding
rem = TA - (X1 * w_kn1 + X2 * w_kn2)
prot = ((p/100) * X1 * w_kn1) - rem
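A minimal sketch of the weighted formulas above, under two assumptions that the slides do not spell out: the fully weighted estimate TA is taken as the sum of X * w_kn * w_un over the cell's records, and records are ranked by their unweighted contribution X. The rounding step mentioned in the slide title is not modeled, and the function name is hypothetical.

```python
def weighted_p_percent_protection(records, p):
    """Suggested protection for a weighted cell.

    `records` is a non-empty list of (x, w_kn, w_un) tuples.
    Assumptions: TA = sum(x * w_kn * w_un); ranking is by unweighted x;
    a one-record cell is treated as having X2 = 0.
    """
    ta = sum(x * w_kn * w_un for x, w_kn, w_un in records)
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    x1, w_kn1, _ = ranked[0]
    if len(ranked) > 1:
        x2, w_kn2, _ = ranked[1]
    else:
        x2, w_kn2 = 0.0, 1.0
    rem = ta - (x1 * w_kn1 + x2 * w_kn2)
    return p * x1 * w_kn1 / 100.0 - rem


# Unknown weights above 1 enlarge the remainder and reduce the need for noise.
cell = [(100, 1, 1), (50, 1, 2), (10, 1, 5)]
print(weighted_p_percent_protection(cell, 10))
```

With all weights equal to 1 the function reduces to the unweighted rule; here the unknown weights contribute a remainder of 100, so the cell needs no additional protection.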

11 Measuring the Effectiveness of a Perturbative Protection Method
Protection of sensitive cells: define the Protection Multiplier (PM) as PM = abs(perturbation) / prot, then find how many (or what %) of the sensitive cells have PM < 1.
Data quality: the important measure is the % change for non-sensitive cells; less important is the % over-perturbation for sensitive cells.
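The two assessments can be computed per cell as sketched below. The helper names are hypothetical; `prot` is assumed to come from the p% rule and to be positive, which is the case for sensitive cells.

```python
def protection_multiplier(original_total, noisy_total, prot):
    """PM = abs(perturbation) / prot for one sensitive cell (prot > 0)."""
    if prot <= 0:
        raise ValueError("PM is defined only for sensitive cells (prot > 0)")
    return abs(noisy_total - original_total) / prot


def share_fully_protected(pms):
    """Fraction of sensitive cells with PM >= 1."""
    return sum(pm >= 1 for pm in pms) / len(pms)


def percent_change(original_total, noisy_total):
    """Data-quality measure for a cell with a nonzero original total."""
    return 100.0 * abs(noisy_total - original_total) / original_total


pms = [
    protection_multiplier(155, 165, 5),  # perturbation 10 against prot 5
    protection_multiplier(155, 158, 5),  # perturbation 3 against prot 5
]
print(pms, share_fully_protected(pms))
print(percent_change(200, 202))
```

A cell with PM below 1 received less perturbation than the p% rule suggests; a non-sensitive cell with a small percent change keeps its data quality.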

12 EZS Noise Factors for Unweighted Data
Let X = original microdata value.
Let Y = perturbed value.
Let M = noise multiplier, i.e., a draw from a specified noise distribution of EZS type.
Y = X * M

13 Noise Distribution used for all examples: (a=1.05, b=1.15) 5% to 15% noise
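The plot of the noise distribution is not preserved in this transcript, so the following Python sketch has to assume a shape: the multiplier is drawn uniformly from [a, b] = [1.05, 1.15] for upward noise, or from the mirror interval [2 - b, 2 - a] = [0.85, 0.95] for downward noise, with each direction equally likely. That keeps every multiplier 5% to 15% away from 1 and makes the distribution symmetric about 1; the actual EZS density may differ. The function names are hypothetical.

```python
import random


def draw_multiplier(rng, a=1.05, b=1.15):
    """Draw a noise multiplier M at distance 5% to 15% from 1.

    Assumption: uniform on [a, b], mirrored about 1 with probability 1/2.
    """
    magnitude = rng.uniform(a, b)
    if rng.random() < 0.5:
        return magnitude          # upward noise
    return 2.0 - magnitude        # downward noise, mirrored about 1


def perturb(x, m):
    """Y = X * M for an unweighted microdata value."""
    return x * m


rng = random.Random(2024)         # fixed seed for a repeatable illustration
receipts = [120.0, 45.0, 8.0]
noisy = [perturb(x, draw_multiplier(rng)) for x in receipts]
print(noisy)
```

Because no multiplier falls in (0.95, 1.05), a single-contributor cell always moves by at least 5%, while symmetry about 1 keeps the expected value of Y equal to X.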

14 Noise Applied to Weighted Data Key idea: weights (e.g., sample weights) provide protection to microdata since users typically know weights only roughly (except when close to 1) Not necessary to apply full M factor to X unless w = 1

15 EZS Noise Factor for Weighted Data Weighted Data: For a simple weight w with associated uncertainty interval at least as wide as 2*b*w the noise factor S can be combined with w to form the Joint Noise-Weight Factor

16 Noise Formula for Known and Unknown Weights
Calculation of perturbed values: w_kn is the known weight; w_un is the unknown weight.

17 Noise for Weighted Data: Commodity Flow Survey (CFS)
Measures the flow of goods via the transport system in the U.S.
Estimates the volume and value of each commodity shipped, by origin, destination, and mode of transport.
Used for transport modeling, planning, ...
Some users have objected to disclosure suppressions.

18 Effect of Noise on High-Level Aggregate Cells
CFS table: National 2-Digit Commodity
Data quality measure: 43 cells, 0 of which are sensitive; 41 cells change by [0 - 1]%; 2 cells change by [1 - 2]%.

19 CFS Test Table (Origin State by Destination State by 2 digit Commodity) 61,174 cells of which 230 are sensitive Data Quality and Protection Assessments (following slides)

20 CFS Noise Results Data Quality Assessment While some cells may receive large doses of noise, vast majority get less than 1% or 2%

21 CFS Random Noise Protection Assessment Most sensitive cells receive significant noise, i.e. 5% to 11% Only 2 out of 230 sensitive cells do not receive full protection from noise, as measured by Protection Multipliers (PM)

22 Noise for Unweighted Data: Non-Employer Statistics (NE)
Special features of the microdata: unweighted administrative data; only 1 variable to protect (receipts); many small integers (after rounding to $1000).
Special features of the key table: many cells have a small number of contributors, and these include many safe cells; many sensitive cells have only 1 or 2 contributors.

23 NE Noise Results Data Quality Assessment Lack of weights results in much more distortion to non-sensitive cells than occurs for CFS

24 NE Noise Results Protection Assessment Resembles noise factor distribution, due to prevalence of 1 respondent cells in NE test table and no weights

25 Noise Balancing
Is there a way to improve data quality in this situation? Yes, if one can focus on one key table T.
Idea: balance the noise at each cell in a balancing sub-table B of T (definition: every microdata value is in at most one cell of B).
Choose noise directions to maximize noise cancellation for each cell of B.
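The slides do not say how the noise directions are chosen. One simple way to illustrate the idea is a greedy pass within each cell of B: process records from largest to smallest and point each record's noise against the running net noise. This is a stand-in heuristic, not necessarily the authors' procedure, and the function name and the uniform noise magnitude on [a, b] are assumptions.

```python
import random


def balanced_noise(values, rng, a=1.05, b=1.15):
    """Noisy values for one cell of B, with greedily balanced directions.

    Assumptions: positive values, noise magnitude uniform on [a - 1, b - 1],
    and a greedy rule that opposes the running net noise in the cell.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    noisy = list(values)
    net = 0.0                                  # net noise added to the cell so far
    for i in order:
        size = rng.uniform(a, b) - 1.0         # 5% to 15% of the record value
        if net > 0:
            direction = -1
        elif net < 0:
            direction = 1
        else:
            direction = rng.choice((-1, 1))    # first record: random direction
        delta = direction * size * values[i]
        noisy[i] = values[i] + delta
        net += delta
    return noisy


rng = random.Random(1)
cell = [100.0, 90.0, 40.0, 30.0, 20.0]
noisy = balanced_noise(cell, rng)
print(sum(cell), sum(noisy))
```

Every record still moves by 5% to 15%, so a single-contributor cell keeps its full noise, while in a multi-contributor cell the net change under this greedy rule never exceeds the noise on the largest record.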

26 Noise Balancing: Supportive NE Characteristics
Balancing works especially well for NE because a high % of the microdata is single-unit.
After balancing interior cells, one needs to check the noise effect on aggregate cells in the same table.
One also needs to check the noise effect in higher and lower tables; we call these "trickle up" and "trickle down" effects.
For NE, there are few of these other tables; this makes the balancing decision easier.

27 NE – Balanced Noise Data Quality Assessment Vast improvement in data quality Resembles that of weighted data in CFS

28 NE – Balanced Noise Protection Assessment Very similar to Random Noise application 91.7% of sensitive cells fully protected

29 Random Noise vs. Balanced Noise: Non-Employer Test Data
Data quality is greatly improved.
Protection level is not significantly reduced.
Thus balanced noise is a good choice here.
Percent fully protected (PM >= 1):
Random 92.14%
Balanced 91.70%
PM density curves on [0,1] are nearly identical for the 2 methods.

30 Conclusions
1. EZS Noise is a useful method for protecting tables from a variety of economic programs.
2. There are now several variations of the basic EZS method; which is best for a survey depends on both microdata and table characteristics.

31 Future Research
1. Should some sensitive cells be suppressed, and high-noise cells flagged?
2. How to handle multiple variables?
3. What is the most that users can be told about the noise process without compromising data protection?
4. How to handle company dynamics (births, deaths, mergers, …)?
5. How to coordinate survey protection?
