Techniques to apply cell suppression to large sparse linked tables and some results using those techniques on the 2012 (US) Economic Census Philip Steel,

Slides:



Advertisements
Similar presentations
Characterization and Management of Multiple Components of Cost and Risk in Disclosure Protection for Establishment Surveys Discussion of Advances in Disclosure.
Advertisements

Aggregate Data and Statistics
Adaptation of Evans, Zaytaz, and Slanta (EZS) Disclosure Method to Quarterly Census of Employment and Wages (QCEW) Shail Butani U.S Bureau of Labor Statistics.
Issues in Designing a Confidentiality Preserving Model Server by Philip M Steel & Arnold Reznek.
United Nations Statistics Division/DESA
BTS Confidentiality Seminar Series June 11, 2003 FCSM/CDAC Disclosure Limiting Auditing Software: DAS Mark A. Schipper Ruey-Pyng Lu Energy Information.
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
2012 Economic Census Update 2014 MD SDC Affiliates Meeting June 12, 2014 Andy Hait U.S. Census Bureau 1.
17 September SME Statistics OECD Workshop SME data and methodologies in the EU - item 5 Paul Feuvrier / Eurostat.
March 2013 ESSnet DWH - Workshop IV DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)
Geo-referenced data and DLI aggregate data sources Chuck Humphrey University of Alberta September 29, 2008.
Online Market Research Part 1. The ABCs of the Federal Statistical System Presented by Janet Harrah, Director Center for Economic Development & Business.
Quantitative Evidence for Marketing Data Library, Rutherford North 1 st Floor Chuck Humphrey Data Library October 26, 2009.
Linear Goal Programming
© John M. Abowd 2005, all rights reserved Sampling Frame Maintenance John M. Abowd February 2005.
Parameterizing Random Test Data According to Equivalence Classes Chris Murphy, Gail Kaiser, Marta Arias Columbia University.
© John M. Abowd 2005, all rights reserved Using the Economic Census and Business Register John M. Abowd February 2005.
Statistical Learning Theory: Classification Using Support Vector Machines John DiMona Some slides based on Prof Andrew Moore at CMU:
Statistics and Data for Marketing Data Library, Rutherford North 1 st Floor Chuck Humphrey Data Library October 27, 2008.
EAS 293 Data Library, Rutherford North 1 st Floor Chuck Humphrey Data Library October 14, 2008.
An Integrated Approach to Economic Statistics “ The Canadian Experience” UNSD – IBGE Workshop on Manufacturing Statistics Kevin Roberts Rio de Janeiro,
Re-development of the Cell Suppression Methodology at the US Census Bureau Philip Steel, James Fagan, Paul Massell, Richard Moore Jr., John Slanta, Bei.
Example 15.3 Supplying Power at Midwest Electric Logistics Model.
Interval estimation ASW, Chapter 8 Economics 224, Notes for October 8, 2008.
Metadata driven application for aggregation and tabular protection Andreja Smukavec SURS.
Household Surveys ACS – CPS - AHS INFO 7470 / ECON 8500 Warren A. Brown University of Georgia February 22,
Statistical Quality Control/Statistical Process Control
1//hw Cherniak Software Development Corporation ARM Features Presentation Alacrity Results Management (ARM) Major Feature Description.
MGT-491 QUANTITATIVE ANALYSIS AND RESEARCH FOR MANAGEMENT OSMAN BIN SAIF Session 13.
Based on: The Nature of Statistical Learning Theory by V. Vapnick 2009 Presentation by John DiMona and some slides based on lectures given by Professor.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved Section 10-1 Review and Preview.
G-Confid: Turning the tables on disclosure risk Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality Ottawa, Canada 30 October 2013 Peter.
Example 5.8 Non-logistics Network Models | 5.2 | 5.3 | 5.4 | 5.5 | 5.6 | 5.7 | 5.9 | 5.10 | 5.10a a Background Information.
Using IPUMS.org Katie Genadek Minnesota Population Center University of Minnesota The IPUMS projects are funded by the National Science.
Units of Analysis The Basics. Outline Definitions Elements of the unit of analysis Data structure.
Objects for Business Reporting MIS 497. Objective Learn about miscellaneous objects required for business reporting. Learn about miscellaneous objects.
Small Area Economic Data from the 2007 Economic Census and Economic Surveys Presented by: Andrew W Hait and Patrice C. Norman U.S. Census Bureau Economic.
1 USING A QUADRATIC PROGRAMMING APPROACH TO SOLVE SIMULTANEOUS RATIO AND BALANCE EDIT PROBLEMS Katherine J. Thompson James T. Fagan Brandy L. Yarbrough.
Data and Social Research Chuck Humphrey Data Library Rutherford North Library.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
© John M. Abowd 2007, all rights reserved Analyzing Frames and Samples with Missing Data John M. Abowd March 2007.
1 New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System Laura Zayatz.
1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton.
1 Chapter 11 A number of important scheduling problems... require the study of an astronomical number of arrangements to determine which one is best....
Software Design Design is the process of applying various techniques and principles for the purpose of defining a device, a process,, or a system in sufficient.
1 Using Fixed Intervals to Protect Sensitive Cells Instead of Cell Suppression By Steve Cohen and Bogong Li U.S. Bureau of Labor Statistics UNECE/Work.
© Statistisches Bundesamt, IIA - Mathematisch Statistische Methoden Topic ii New Methodologies for Protecting Data (Disclosure Limitation) Univ. Edinburgh:
Statistical Quality Control/Statistical Process Control
Household Surveys: American Community Survey & American Housing Survey Warren A. Brown February 8, 2007.
The Application for Statistical Processing at SURS Andreja Smukavec, SURS Rudi Seljak, SURS UNECE Statistical Data Confidentiality Work Session Helsinki,
Protection of frequency tables – current work at Statistics Sweden Karin Andersson Ingegerd Jansson Karin Kraft Joint UNECE/Eurostat.
Economic Census Update TN State Data Center Affiliate Workshop November 16, 2015 Presented by: Andy Hait U.S. Census Bureau 1.
The views expressed herein are those of the author and should not necessarily be attributed to the IMF, its Executive Board, or its management Data Confidentiality,
University of Colorado at Boulder Yicheng Wang, Phone: , Optimization Techniques for Civil and Environmental Engineering.
CENSUS GEOGRAPHY WORKSHOP Tim McMonagle Geography Los Angeles Regional Office 1.
Current Population Survey Joint BLS/Census Bureau Product Sampling design – About 60,000 occupied housing units monthly nationally – design National/Regional.
CTPP in TranStats The One-Stop Shop of Transportation Data
U.S. Census Data & TIGER/Line Files
Machine Vision Edge Detection Techniques ENT 273 Lecture 6 Hema C.R.
© John M. Abowd 2005, all rights reserved Using the Decennial Census of Population and Housing John M. Abowd February 2005.
Geo-referenced data and DLI aggregate data sources
Solver Feature Excel’s KY San Jose State University Engineering 10.
A Multiperiod Production Problem
Confidentiality in Published Statistical Tables
Dissemination Workshop for African countries on the Implementation of International Recommendations for Distributive Trade Statistics May 2008,
AP Statistics Chapter 3 Part 2
Martha Stinson. T. Kirk White. James Lawrence
Parallel Session: BR maintenance Quality in maintenance of a BR:
Modeling IDS using hybrid intelligent systems
Presentation transcript:

Techniques to apply cell suppression to large sparse linked tables and some results using those techniques on the 2012 (US) Economic Census Philip Steel, James Fagan, Vitoon Harusadangkul, Paul Massell, Richard Moore Jr., John Slanta, Bei Wang U.S. Census Bureau

Cell Suppression ca  Footnote: Alabama,1 establishment; Arkansas, 1; District of Columbia, 4 … New Jersey, 15; … Wisconsin, 5;  Headnote: [comparability warning] Statistics are given for all States for which separate figures can be published without disclosing, exactly or approximately, the data reported by individual establishments. One of the “Other States,” namely, New Jersey, is, however, more important in the industry than any of the States shown separately.

Outline Follow-up to P. Steel et al Re-development of the Cell Suppression Methodology at the US Census Bureau, UNECE Ottawa, Canada, October Some background 2.Production software capabilities 3.Large table results 4.Future research

Table Relations  A table is the set of cells defined by crossing 2 or more relations.  A relation specifies an additive relationship between sets of cells. Most frequently represented as a margin in a table.  Two tables are linked when a cell or set of cells appears in both tables, usually as a margin in one and in the interior of another.  A table group is a minimal set of tables such that no table is linked outside the group within the publication or release.  Our cell suppression software is driven by 2 kinds of input.  All of the additive relations, by convenience categorized separately into files for each table dimension.  Cell attribute file.

NAICS  North American Industry Classification System  6 digit classification that allows tabulation at 2 (industry) 3 (subindustry) and other levels  Example of an additive relation:  23611,236115,236116,236117,  The graph structure associated with a classification system of this type is a tree  Distinct codes used for the manufacturing segment: 518 (roughly 154 relations)

Geographic Publication Levels  Nation, State, County, Consolidated Statistical Metropolitan Area, Metropolitan Areas, Places, and balances.  Population varies from 100s to millions  Economic activity in a detailed NAICS code is often 0 at the Place and even County level  In excess of 16,000 codes (roughly 3000 relations)  Not a simple hierarchy

Definitions  Sensitivity rule (p-percent)  Primary suppression: a cell not shown in a publication because it failed the sensitivity rule  Secondary suppression: a cell not shown in a publication because its absence blocks recovery of a primary suppression  Target: a cell with a protection requirement  Supercell: a sensitive aggregate  An example: two single report cells in a row or column  LP: Linear Programing [problem]

m-LP model m pairs of (e)

Definitions continued  m-LP: an LP that protects m targets simultaneously  m-LP screener: criteria for determining whether a target cell is too close to other target cells in a group of m cells  Effective m: the actual as opposed to the requested number of targets  Cost: the weight given a particular cell in the LP’s objective function—roughly a measure of the desire to retain the cell  Capacity: how much of a cell’s value can be used to protect a target in secondary suppression  Frozen cell: capacity=0 if a cell has already been published  Infeasible: a target whose LP problem has no solution, a circumstance that can be caused by frozen cells or overlapping patterns in m-lp

Methods, Programming and Data  LP methodology is straightforward; most of the (very dry) complexity is in how to handle exceptions in calculating cost and capacity  Implementation in production software is a difficult problem and doesn’t fit well with most software development models  Differences in data can produce novel circumstances and performance varies dramatically

Program features  Data structure: graph vs array  Automatic global un-duplication  Additivity check  Decomposes input into table groups  Utilizes solver’s data structures  Scalar parameter  Individual cell cost adjustment  Accommodates linked data

Program features (cntd)  Freeze of publication status across linked tables  Tolerant of rounded data  Skip-P procedure  m for m-LP  Supercell (aggregate/company protection)  Negative values adjustment  Tolerant of non-additivity  Single “D” check

Processing the Manufacturing Geographic Area Series  NAICS ( 518 distinct)  Geography ( 16,000 distinct )  Content  standalone (hours, capital expenditures) 2D  In a relation (employment, payroll and value added) 3 dimensional (3D)  Processed 3D in 3 steps  2&3 digit NAICS by full geography and content  Freeze and process each full 3 digit tree  Cleanup of infeasibles

Performance on 2D tables Cartesian cells %0Nonzero%P m ave opts minutes per opt hours8.7 M , cextot8.3 M , Without skip P, ‘hours’ would have had to do 57,000 optimizations-- taking 109 days instead of 5!

Performance on 3D tables table Cartesian cells %0nonzero%P m average opts Minutes per opt emp2a1.9 M , emp326.7M89.72,760, The second step for processing employment variables does each subindustry, the first line represents the 1 st subindustry. The second line represents the 3 rd step where we unfreeze and redo the infeasibles that occurred in the (entire) 2 nd step.

Agenda for the 2017 Economic Census  Correct current defects  Test solver dependency  Adapt to a shift in emphasis from industry to product tables in the 2017 Census  Identify and review pivotal suppressions  Better employ m-LP, scaling  Investigate over-tabulation  Create a data product that synthesizes suppressed cells

Thanks for your attention