Re-development of the Cell Suppression Methodology at the US Census Bureau
Philip Steel, James Fagan, Paul Massell, Richard Moore Jr., John Slanta, Bei Wang
Background
  Jewett’s network flow program
  Need for new program
  2012 economic census
  LP (linear programming) methodology
  R&M cell suppression team
Processing Model
  Preprocessing
    – Create table description
    – Determine primaries
    – Unduplicate
  Sequential processing of primaries
  Queue reduction
  Test company protection (aggregate/supercell)
  Sequential processing of supercells
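The sequential processing and queue reduction steps listed above can be sketched as a simple loop. This is only an illustration under assumptions: `solve` and `covers` are hypothetical callables standing in for the LP solve and for the test that a solution already protects another primary, and reading "queue reduction" as the removal of already protected primaries is an interpretation, not something the slide states.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Tuple


def process_primaries(
    primaries: Iterable[Hashable],
    solve: Callable[[Hashable], dict],
    covers: Callable[[dict, Hashable], bool],
) -> List[Tuple[Hashable, dict]]:
    """Solve one problem per queued primary, dropping primaries that an
    earlier solution already protects (hypothetical queue reduction)."""
    queue = deque(primaries)
    solved = []
    while queue:
        target = queue.popleft()
        solution = solve(target)  # one optimization per target
        solved.append((target, solution))
        # Queue reduction: keep only primaries not yet protected.
        queue = deque(p for p in queue if not covers(solution, p))
    return solved


# Toy stand-ins: each "solution" lists the primaries it happens to protect.
toy_protects = {"P1": {"P1", "P2"}, "P3": {"P3"}}
result = process_primaries(
    ["P1", "P2", "P3"],
    solve=lambda p: {"protects": toy_protects[p]},
    covers=lambda sol, p: p in sol["protects"],
)
print([target for target, _ in result])  # ['P1', 'P3']
```

In the toy run, P2 is never solved because the solution for P1 already covers it.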
Table relations
  Marginals are the sum of interior cells
  Geographic relationships tend to generate our most complex sets of table relations
    – State is the sum of metropolitan areas within the state and the balance
    – State is also the sum of counties
  Of the form A = B + ... + Z, where A, B, ..., Z are (one of) rows, columns, or levels that define some Cartesian integer space (i,j,k)
  Duplicates are recorded as A = B (e.g., a county is also a place)
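A minimal sketch of how relations of the form A = B + ... + Z might be stored and checked. The `(parent, children)` representation, the function name `failed_relations`, and all cell names and values are assumptions made for illustration; they are not taken from the Census Bureau program.

```python
from typing import Dict, List, Tuple

# (parent, children) means parent = sum(children) along one axis.
Relation = Tuple[str, List[str]]


def failed_relations(
    values: Dict[str, float], relations: List[Relation], tol: float = 1e-9
) -> List[Relation]:
    """Return the relations whose parent differs from the sum of its children."""
    failures = []
    for parent, children in relations:
        total = sum(values[c] for c in children)
        if abs(values[parent] - total) > tol:
            failures.append((parent, children))
    return failures


# Hypothetical values: a state split two ways, plus one duplicate (A = B).
values = {
    "State": 100, "MetroA": 40, "MetroB": 35, "Balance": 25,
    "County1": 60, "County2": 40, "Place1": 60,
}
relations = [
    ("State", ["MetroA", "MetroB", "Balance"]),  # metro areas plus balance
    ("State", ["County1", "County2"]),           # counties
    ("County1", ["Place1"]),                     # duplicate: county is also a place
]
print(failed_relations(values, relations))  # []
```

A duplicate is simply a relation with a single child, so the same check covers it.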
Objective Function
Additivity constraint generator (based on row relations)
(b) for $ii = 1, \ldots, rr$, $j = 1, \ldots, cols$, $k = 1, \ldots, levs$ : $\mathrm{limr}(ii) \ge 1$, $\mathrm{ws}(ii,j,k) = 0$
Bounds
$h_{i,j,k} = \max(0, v_{i,j,k})$ for $i = 1, \ldots, rows$, $j = 1, \ldots, cols$, $k = 1, \ldots, levs$ : $(i,j,k) \in A$
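The bound formula can be illustrated in a few lines. The index set A is not defined in the surviving slide text, so the set used here is a placeholder, and the cell values are hypothetical; the only thing taken from the slide is h = max(0, v).

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int, int]  # (i, j, k) index into the table


def bounds(v: Dict[Cell, float], A: Set[Cell]) -> Dict[Cell, float]:
    """Compute h[i,j,k] = max(0, v[i,j,k]) for every cell in A."""
    return {cell: max(0.0, v[cell]) for cell in A}


# Hypothetical cell values; a negative value is clamped to a bound of 0.
v = {(1, 1, 1): 120.0, (1, 2, 1): -15.0, (2, 1, 1): 0.0}
A = {(1, 1, 1), (1, 2, 1)}  # placeholder index set
h = bounds(v, A)
print(h[(1, 1, 1)], h[(1, 2, 1)])  # 120.0 0.0
```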
For the primary
Skip P
  The model changes only on the target primary constraints.
  How can the minimal solution for one target be transformed into a solution for another target?
  By applying a scalar that converts the flow through the second P to the fixed value of the model!
  This can be done when the scalar does not violate the bounding conditions and the complementary flow in the target is 0, i.e., when the solution's flow through the secondary target exceeds its protection requirement.
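The Skip P test might look like the sketch below. It assumes each cell carries a pair (flow, complementary flow), that a single capacity per cell bounds both, and that the fixed value for the second primary equals its protection requirement. The function name `skip_scalar` and the example numbers are hypothetical.

```python
from typing import Dict, Optional, Tuple

Flow = Tuple[float, float]  # (flow, complementary flow) for one cell; assumed layout


def skip_scalar(
    flows: Dict[str, Flow],
    bounds: Dict[str, float],
    target: str,
    requirement: float,
) -> Optional[float]:
    """Return the scalar that rescales an existing solution so the flow
    through `target` equals `requirement`, or None if Skip P does not apply."""
    flow, complementary = flows.get(target, (0.0, 0.0))
    if flow <= 0 or complementary != 0:
        return None  # no usable flow through the second primary
    scalar = requirement / flow
    for cell, (f, c) in flows.items():
        if scalar * f > bounds[cell] or scalar * c > bounds[cell]:
            return None  # rescaling would violate a bounding condition
    return scalar


# Hypothetical solution found for P1 that also pushes flow through P2.
flows = {"P1": (10.0, 0.0), "P2": (20.0, 0.0), "C1": (20.0, 0.0)}
caps = {"P1": 25.0, "P2": 25.0, "C1": 25.0}
print(skip_scalar(flows, caps, "P2", 5.0))   # 0.25
print(skip_scalar(flows, caps, "P2", 40.0))  # None
```

When the existing flow through the second primary already exceeds its requirement, the scalar is at most 1, so scaled flows stay within any bounds the original solution satisfied.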
Empirical confirmation
  In our large sparse tables, we would see a lot of objective 0 results. That is, the solver finds a 0 cost pattern to protect the primary … it is already protected!
  Skip P eliminated most objective 0 results and left intact the sequence of positive objectives and their solutions.
Fat solution
  CPLEX is using a dual simplex method to find solutions.
  The solutions have a growing 0 cost component, with many more cells than are required to protect the target P.
  The flow in the 0 cost cells far exceeds what is required to protect the target P (except in very small or dense examples).
  The solution “lights up” the possible flows in the table’s current state, giving a “fat” solution.
Skip P and the fat solution
[Figure. Axis labels: Optimization number; Count of P with flow; Running total of skipped P]
dg10 sector 44
  Cartesian cells: 367,605 (2d)
  Non-zero cells: 159,849
  Relations: 283 (row and column)
    – 14,000 potential tables, linked
  P: 95,062
  LP problems: 10,604
  Typical LP size
    – Reduced LP has rows, columns, and nonzeros
  Time: 8 hr 37 min (includes everything)
Comparison between network and LP on one (of hundreds) dataset from 2007

                     Network flow      LP
  C                  14,551            11,283
  Cvalue             1,813,213,710     598,886,234
  PubValue           12,348,960,578    13,563,288,054
  undersuppressions  #                 0
  time               24 min            8 hrs 37 min

Statistics based on unduplicated data with an approximation of a published status flag
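A quick arithmetic check on the comparison figures above. It assumes Cvalue is the total value in complementary suppressions and PubValue is the total published value; that reading is not stated on the slide, but the two sum to the same total under both methods, which is consistent with it. The dictionary keys are chosen here for illustration.

```python
# Figures copied from the comparison slide; "minutes" is the run time converted.
network = {"C": 14_551, "Cvalue": 1_813_213_710,
           "PubValue": 12_348_960_578, "minutes": 24}
lp = {"C": 11_283, "Cvalue": 598_886_234,
      "PubValue": 13_563_288_054, "minutes": 8 * 60 + 37}

# Suppressed plus published value should be the same total for both methods.
total_network = network["Cvalue"] + network["PubValue"]
total_lp = lp["Cvalue"] + lp["PubValue"]
print(total_network == total_lp)  # True

# LP suppressed value and run time relative to the network flow result.
cvalue_ratio = lp["Cvalue"] / network["Cvalue"]
time_ratio = lp["minutes"] / network["minutes"]
print(f"{cvalue_ratio:.2f} {time_ratio:.1f}")  # 0.33 21.5
```

On this dataset the LP run suppresses roughly a third of the value that the network flow run suppresses, at roughly 21.5 times the run time.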
Thank you!