“Software for Tabular Data Protection”

Slides:



Advertisements
Similar presentations
Statistical Disclosure Control (SDC) at SURS Andreja Smukavec General Methodology and Standards Sector.
Advertisements

BTS Confidentiality Seminar Series June 11, 2003 FCSM/CDAC Disclosure Limiting Auditing Software: DAS Mark A. Schipper Ruey-Pyng Lu Energy Information.
Dragan Jovicic Harvinder Singh
Linear Inequalities and Linear Programming Chapter 5 Dr.Hayk Melikyan/ Department of Mathematics and CS/ Linear Programming in two dimensions:
In a Virtual Data Centre Protecting Confidentiality COMPUTATIONAL INFORMATICS Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan, CSIRO.
1 Effects of Rounding on Data Quality Lawrence H. Cox, Jay J. Kim, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics.
QM B Linear Programming
Basic Mathematics for Portfolio Management. Statistics Variables x, y, z Constants a, b Observations {x n, y n |n=1,…N} Mean.
Today Wrap up of probability Vectors, Matrices. Calculus
Chapter : Software Process
1. “Software for Tabular Data Protection” Joe Fred Gonzalez, Jr. Lawrence H. Cox National Center for Health Statistics NCHS Data Users Conference July.
Overview of 2002 CIPSEA: Methods to Protect Confidential Tabular Data Amrut Champaneri, Ph.D. U.S. Department of Transportation Bureau of Transportation.
Lecture 12 Statistical Inference (Estimation) Point and Interval estimation By Aziza Munir.
Confidentiality Issues with “Small Cell” Data Michael C. Samuel, DrPH STD Control Branch California Department of Public Health 2008 National STD Prevention.
GRADE 7 CURRICULUM SUMMARY. NUMBER AND OPERATION SENSE use models to express repeated multiplication using exponents express numbers in expanded form.
Introduction to Database Systems
2 2.1 © 2016 Pearson Education, Inc. Matrix Algebra MATRIX OPERATIONS.
DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.
8 th Grade Math Common Core Standards. The Number System 8.NS Know that there are numbers that are not rational, and approximate them by rational numbers.
Luisa Franconi Integration, Quality, Research and Production Networks Development Department Unit on microdata access ISTAT Essnet on Common Tools and.
Discussion of “ Statistical Disclosure Limitation: Releasing Useful Data for Statistical Analysis” Nancy J. Kirkendall Energy Information Administration.
Copyright © 2013, 2009, 2005 Pearson Education, Inc. 1 5 Systems and Matrices Copyright © 2013, 2009, 2005 Pearson Education, Inc.
259 Lecture 13 Spring 2013 Advanced Excel Topics – Arrays.
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
1 New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System Laura Zayatz.
1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton.
1 Chapter 11 A number of important scheduling problems... require the study of an astronomical number of arrangements to determine which one is best....
Innovations in Data Dissemination Thomas L. Mesenbourg, Jr. Acting Director U.S. Census Bureau United Nations Seminar on Innovations in Official Statistics.
Information Processing Notes for beginning our Excel Unit.
9-1 Chapter 9 Project Scheduling Chapter 9 Project Scheduling McGraw-Hill/Irwin Copyright © 2005 by The McGraw-Hill Companies, Inc. All rights reserved.
Chapter 1. Formulations 1. Integer Programming  Mixed Integer Optimization Problem (or (Linear) Mixed Integer Program, MIP) min c’x + d’y Ax +
If A and B are both m × n matrices then the sum of A and B, denoted A + B, is a matrix obtained by adding corresponding elements of A and B. add these.
Mechanical Engineering Department 1 سورة النحل (78)
Meeting 18 Matrix Operations. Matrix If A is an m x n matrix - that is, a matrix with m rows and n columns – then the scalar entry in the i th row and.
Chapter 2 Linear Programming Models: Graphical and Computer Methods
Matrices Section 2.6. Section Summary Definition of a Matrix Matrix Arithmetic Transposes and Powers of Arithmetic Zero-One matrices.
Operations Research The OR Process. What is OR? It is a Process It assists Decision Makers It has a set of Tools It is applicable in many Situations.
Large-Scale Matrix Factorization with Missing Data under Additional Constraints Kaushik Mitra University of Maryland, College Park, MD Sameer Sheoreyy.
LINEAR PROGRAMMING.
Integer Bounds on Suppressed Cells in Multi-Way Tables Stephen F. Roehrig Carnegie Mellon University For.
1 WP.31 Effects of Rounding on Data Quality Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics.
DEPARTMENT/SEMESTER ME VII Sem COURSE NAME Operation Research Manav Rachna College of Engg.
Copyright © 2009 Pearson Education, Inc. Publishing as Prentice Hall Essentials of Systems Analysis and Design Fourth Edition Joseph S. Valacich Joey F.
IT for quantitative information Working with Spreadsheets 1.
Expanding the Role of Synthetic Data at the U.S. Census Bureau 59 th ISI World Statistics Congress August 28 th, 2013 By Ron S. Jarmin U.S. Census Bureau.
Advanced Excel Topics – Arrays
Workshop on Demographic Analysis Fertility: Reverse Survival of Children & Mothers With Introduction to Own Children Methods.
4.5 Matrices.
Matrices and Matrix Operations
Chapter 4: Linear Programming Lesson Plan
4.6 Matrices.
Linear programming Simplex method.
MBA 651 Quantitative Methods for Decision Making
Dissemination Workshop for African countries on the Implementation of International Recommendations for Distributive Trade Statistics May 2008,
Legal, political and methodological issues in confidentiality in the ESS Maria João Santos, Jean-Marc Museux Eurostat.
Census Data for Transportation Planning—Some Thoughts
1.3 Modeling with exponentially many constr.
2. Matrix Algebra 2.1 Matrix Operations.
SCIENCE AND ENGINEERING PRACTICES
Objectives Multiply two matrices.
Prisoner’s Dilemma (aka Reds & Blues)
Linear programming Simplex method.
1.3 Modeling with exponentially many constr.
DATABASES WHAT IS A DATABASE?
Chapter 6 Network Flow Models.
Applying Use Cases (Chapters 25,26)
Federal Statistical Office Germany Research Data Centre
Chapter 1. Formulations.
SAFE – a method for anonymising the German Census
BUS-221 Quantitative Methods
Presentation transcript:

“Software for Tabular Data Protection” Joe Fred Gonzalez, Jr. Lawrence H. Cox National Center for Health Statistics NCHS Data Users Conference July 17, 2002

NCHS Confidentiality Concerns A major responsibility of NCHS is the protection of identifiable data collected from survey respondents, persons or establishments. Prior to release of public use files, data that could be used to identify a respondent are perturbed or removed from microdata files. The other mechanism for statistical disclosure is the possible identification of individuals or establishments via tabular data.

Development of Software as a Demonstration Tool for Tabular Data Protection The National Center for Health Statistics has sponsored the development of disclosure limitation software for two-way tables by OptTek Systems, Inc.

Software Functions cell suppression controlled rounding unbiased controlled rounding controlled rounding subject to subtotal constraints

Cell Suppression A multiple-cell suppression technique by Cox (1995) is used as the cell suppression function in the STDP.

Cell Suppression (cont.) Hides from publication the values of all cells representing direct disclosure of confidential data on individual respondents (the disclosure cells), together with sufficiently many appropriately selected nondisclosure cells (the complementary cells) to ensure that a third party cannot reconstruct or narrowly estimate confidential respondent data by manipulating linear relationships between released and suppressed table values.

Cell Suppression (cont.) The challenge of this cell suppression problem is to select complementary suppressions that provide sufficient disclosure protection while minimizing the amount of information lost due to suppression.

Cell Suppression (cont.) The cell suppression approach used is based on mathematical networks which offer theoretical and practical advantages. A mathematical network is a specialized linear program defined over a mathematical graph.

Controlled Rounding The controlled rounding function that is used in the STDP is based on the methodology described by Cox and Ernst (1982) and by Causey, Cox, and Ernst (1985).

Controlled Rounding (cont.) The controlled rounding function is the problem of rounding all entries in a one or two-way tabular array A to integer multiples of a positive integer base B subject to the following requirements: (1) each entry in A is rounded to an adjacent integer multiple of B; that is, an entry a is rounded to either B[a/B] or B([a/B] + 1), where [ ] is the greatest integer function, and (2) the sum of the rounded values for any row (or column) of A equals the rounded value of the corresponding row (or column) total entry. Requirements (1) and (2) are referred to as controlled rounding of an array A.

Controlled Rounding (cont.) Additionally, optimal controlled roundings were achieved by presenting this problem as a capacitated transportation problem whose objective function is minimized with respect to the lp norm, 1 <= p < , where the objective function is the pth root of the sum of the pth powers of the absolute values of the differences between rounded and unrounded entries of A.

Objective Function to Minimize with respect to lp norm

Test Results for Controlled Rounding Function: Testing was done on a Pentium 4 processor with 261, 200 KB of Ram.

Test Results (cont.) : The total time to solve the problem is dependent on: The number of cells in the table that are not multiples of the base. The number of the rows and columns in the table.

Test Results (cont.) A table with 50 rows and 50 columns was rounded in less than a minute. A table with 100 rows and 100 columns was rounded in 24 minutes. A table with 1000 rows and 5 columns was rounded in 1 hour and 40 minutes.

Unbiased Controlled Rounding The unbiased controlled rounding function that is used in the STDP is based on the methodology described by Cox (1987).

Unbiased Controlled Rounding (cont.) First, we assume that we have a two-way table A that is additive, that is, entries sum along rows and columns to all corresponding totals entries.

Unbiased Controlled Rounding (cont.) The objective is to construct a second additive table R(A) whose internal and totals entries, denoted by R(a), are integer multiples of B that are adjacent to the corresponding entries of A, that is, R(a) = B[a/B] or B([a/B] + 1), where [a/B] denotes the integer part of a/B.

Unbiased Controlled Rounding (cont.) The conditions for unbiased controlled rounding are that that every entry a of A satisfies the following: 1. R(a) = B[a/B] or B([a/B] + 1) 2. R(a) is additive. 3. *R(a) - a* < B 4. E(R(a)) = a

Test Results for Unbiased Controlled Rounding A table with 50 rows and 50 columns was rounded in a second. A table with 100 rows and 100 columns was rounded in 4 seconds. A table with 400 rows and 25 columns was rounded in 5 seconds. A table with 2000 rows and 25 columns was rounded in 5 minutes and 45 seconds.

Controlled Rounding Subject to Subtotal Constraints The controlled rounding subject to subtotal constraints function that is used in the STDP is based on the methodology described by Cox and George (1987). The methodology used in this function is similar to that used for controlled rounding as discussed earlier. Recall that controlled rounding for a two-way table was presented as a capacitated transportation problem. This function extends that methodology to tables with subtotals along one, but not both, dimensions.

Future Research and Development As mentioned earlier, the software developed for this project is a tool which features some of the different mathematical functions for protecting potential disclosure cell values in two-way tables. The ultimate goal of this project is to develop production level software that can be an embedded into NCHS data systems, for example, the NCHS Research Data Center (RDC), where data analysts and researchers submit their statistical programs, such as SAS (1999) and/or SAS Callable SUDAAN (1996).

References 1. Cox, L.H. (1995). Network models for complementary cell suppression. Journal of the American Statistical Association 90, 1453-1462. Cox, L.H. (1996). Addendum. Journal of the American Statistical Association 91, 1757. 2. Cox, L.H. and L.R. Ernst (1982). Controlled rounding. INFOR 20, 423-432.

References (cont.) 3. Causey, B.D, L.H. Cox, and L.R. Ernst (1985). Applications of transportation theory to statistical problems. Journal of the American Statistical Association 80, 903-909. 4. Cox, L.H. (1987). A constructive procedure for unbiased controlled rounding. Journal of the American Statistical Association 82, 520-524. 5. Cox, L.H. and J.A. George (1989). Controlled rounding for tables with subtotals. Annals of Operations Research 20, 141-157.

References (cont.) 6. SAS Institute Inc., SAS/STAT User’s Guide, Version 8, Cary, NC: SAS Institute Inc (1999). 7. Shah, B., Barnwell, B., Bieler, G., SUDAAN Users’s Manual, Release 7.0, Research Triangle Park, NC: Research Triangle Institute (1996).