Published by Jonathan Holmes. Modified over 9 years ago.
1
Confidentiality protection of large frequency data cubes
UNECE Workshop on Statistical Confidentiality, Ottawa, 28-30 October 2013
Johan Heldal and Svetlana Badina, Statistics Norway
2
Eurostat Census Hypercubes
All 32 EU+EEA countries must submit 60 Census 2011 frequency count hypercubes in 2014.
Each cube has four to nine variables (breakdowns).
Each country is responsible for its own disclosure control method, in accordance with national legislation.
Norway is the only country that wishes to use rounding of small counts (1 and 2) as its preferred disclosure control method.
This presentation shows how. Hypercube 06 is used for illustration.
3
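The problem
A and B are combinations of variables. A takes the values 1, ..., K and B takes the values 1, ..., L.

            B
A     1  ...  b  ...  L
1
:
a     0   0   7   0   0
:
K

Row a has a single non-zero count (here 7), in column b, so the value a for A implies the value b for B.
Idea: create uncertainty about the surrounding zeroes.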
4
Idea
We want to create uncertainty about whether zeroes are real zeroes.
Create more zeroes from small counts (1 and 2) by rounding them, unbiasedly, to 0 or 3.
The rounding must be carried out so as to minimize the perturbation of given aggregate counts.
Counts of 1 and 2 are not necessarily considered problematic in themselves, but they will be removed by the rounding.
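As a minimal sketch of what "unbiased" means for a single cell, assume each small count is rounded independently: a count of 1 or 2 goes to 3 with probability count/3 and to 0 otherwise, so the expected rounded value equals the original count. The function names are illustrative and do not come from the presentation, which selects the cells jointly rather than independently.

```python
import random
from fractions import Fraction

def prob_round_up(count: int) -> Fraction:
    """Probability of rounding a small count (1 or 2) up to 3."""
    if count not in (1, 2):
        raise ValueError("only counts of 1 and 2 are rounded")
    return Fraction(count, 3)

def expected_value(count: int) -> Fraction:
    """Expected rounded value: 3 with probability p, 0 with probability 1 - p."""
    p = prob_round_up(count)
    return 3 * p + 0 * (1 - p)

def round_small_count(count: int, rng: random.Random) -> int:
    """Round 1 or 2 to 0 or 3 (independent rounding is an assumption of this sketch)."""
    return 3 if rng.random() < float(prob_round_up(count)) else 0

print(expected_value(1), expected_value(2))  # 1 2
```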
5
Hypercube 06
Spanning variables and groups in hypercube 06

Variable  Explanation                                 No. of groups
GEO.L     Region of residence according to NUTS2      7 (regions)
SEX       Sex                                         2
FST.H     Family status. High detail                  6
LMS       Marital status                              4
CAS.L     Activity status. Low detail                 3
POB.M     Country of birth. Medium detail             9
COC.M     Citizenship. Medium detail                  9
AGE.M     Age. Medium detail (5-year groups)          21

The hypercube spans 1 714 608 cells, of which 53 550 are populated.
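The cell count follows from multiplying the numbers of groups. A short check, using only the figures on this slide:

```python
from math import prod

# Number of groups per spanning variable of hypercube 06 (from the slide)
groups = {
    "GEO.L": 7, "SEX": 2, "FST.H": 6, "LMS": 4,
    "CAS.L": 3, "POB.M": 9, "COC.M": 9, "AGE.M": 21,
}

total_cells = prod(groups.values())
populated_cells = 53_550
share_populated = populated_cells / total_cells

print(total_cells)                                 # 1714608
print(f"{100 * share_populated:.1f}% populated")   # 3.1% populated
```

Only about 3 percent of the cells are populated, which is why the zeroes surrounding a small count carry so much information.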
6
Principal Marginal Distributions
Either the entire HC or 6 PMDs must be submitted from HC 06.

Principal Marginal Distributions of hypercube 06

      Breakdowns
6.    GEO.L  SEX  FST.H  LMS  CAS.L  POB.M  COC.M  AGE.M
6.1   GEO.L  SEX  FST.H  LMS  AGE.M
6.2   GEO.L  SEX  FST.H  LMS  CAS.L  POB.M
6.3   GEO.L  SEX  FST.H  LMS  CAS.L  COC.M
6.4   GEO.L  SEX  FST.H  CAS.L  AGE.M
6.5   GEO.L  SEX  FST.H  POB.L  AGE.M
6.6   GEO.L  SEX  FST.H  COC.L  AGE.M

There are 5-6 variables in each PMD. Three variables (GEO.L, SEX and FST.H) are common to all six PMDs.
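A PMD is obtained by summing the hypercube over the variables it leaves out. The sketch below assumes the cube is stored sparsely as a dictionary from cell tuples to counts; the toy cells and the helper name `marginal` are made up for illustration. It does not handle PMDs 6.5 and 6.6, which also regroup a variable to low detail (POB.L, COC.L).

```python
from collections import Counter

VARIABLES = ("GEO.L", "SEX", "FST.H", "LMS", "CAS.L", "POB.M", "COC.M", "AGE.M")

def marginal(cube, keep):
    """Sum a sparse cube {cell tuple: count} over all variables not in `keep`."""
    positions = [VARIABLES.index(name) for name in keep]
    totals = Counter()
    for cell, count in cube.items():
        totals[tuple(cell[i] for i in positions)] += count
    return totals

# Toy cube with made-up cells (group codes are arbitrary integers)
cube = {
    (1, 1, 2, 1, 1, 1, 1, 5): 4,
    (1, 1, 2, 1, 2, 3, 3, 5): 1,
    (2, 2, 1, 3, 1, 1, 1, 9): 2,
}

# PMD 6.1 keeps GEO.L, SEX, FST.H, LMS and AGE.M
pmd_6_1 = marginal(cube, ("GEO.L", "SEX", "FST.H", "LMS", "AGE.M"))
```

In the toy data the first two cells differ only in variables that PMD 6.1 drops, so they fall into the same PMD cell and the interior count of 1 no longer appears as a small count in that PMD.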
7
Reduce the hypercube
STEP 1: Identify small counts.
a. Reduce hypercube A by selecting a subset B consisting of either
   – all interior cells in A with counts 1 or 2, or
   – all interior cells in A contributing to counts of 1 or 2 in the PMDs of A.
b. Calculate C = A - B.
STEP 2: Round.
Let n_B be the total count in B. Round [n_B/3] interior counts in B to 3 and the rest to 0. This gives B*.
IF the solution B* is good enough, STOP. ELSE, continue the search for a better B*.
STEP 3: Calculate A* = C + B*, the rounded cube.
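The three steps can be sketched for a single draw as follows. This assumes a sparse cube stored as a dictionary, the first variant of Step 1a (interior counts of 1 or 2), and a placeholder uniform selection of the cells to round up; the function names and the toy counts are illustrative only, and the "good enough" test of Step 2 is left out.

```python
import random

def reduce_cube(A):
    """Step 1: split sparse cube A into B (counts 1 or 2) and C = A - B."""
    B = {cell: n for cell, n in A.items() if n in (1, 2)}
    C = {cell: n for cell, n in A.items() if cell not in B}
    return B, C

def uniform_choice(B, k, rng):
    """Placeholder selection (an assumption): k cells drawn uniformly at random."""
    return rng.sample(sorted(B), k)

def round_cube(A, rng):
    """Steps 2 and 3 for one draw: build B* and return A* = C + B*."""
    B, C = reduce_cube(A)
    n_B = sum(B.values())
    k = (n_B + 1) // 3                  # n_B / 3 rounded to the nearest integer
    B_star = {cell: 3 for cell in uniform_choice(B, k, rng)}  # all other cells become 0
    A_star = dict(C)
    A_star.update(B_star)               # zero cells are simply absent from the sparse cube
    return A_star

# Toy cube with made-up counts
A = {("r1", "m"): 1, ("r1", "f"): 2, ("r2", "m"): 10, ("r2", "f"): 1}
A_star = round_cube(A, random.Random(0))
```

Here n_B = 4, so one small cell is rounded to 3 and the other two to 0; the large cell is untouched because it lies in C.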
8
Simple properties
A* - A = B* - B, and equivalently A* - B* = A - B = C.
A* is additive.
|n_A - n_A*| = |n_B - 3[n_B/3]| ≤ 1, where [x] denotes x rounded to the nearest integer.
All Principal Marginal Distributions will be consistently rounded.
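These properties follow directly from the definitions C = A - B and A* = C + B*. A short derivation, assuming that [x] rounds to the nearest integer and that exactly [n_B/3] cells of B are set to 3:

```latex
\begin{aligned}
A^* - A &= (C + B^*) - (C + B) = B^* - B, \\
n_{B^*} &= 3\,[\,n_B/3\,], \\
n_A - n_{A^*} &= n_B - n_{B^*} = n_B - 3\,[\,n_B/3\,] \in \{-1,\,0,\,1\}, \\
\lvert n_A - n_{A^*} \rvert &\le 1 .
\end{aligned}
```

Because every marginal of A* is computed by summing the same rounded interior cells, the marginals stay additive and agree across all PMDs.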
9
The Norwegian HC 06
Number of small cells in the full HC and with PMDs only

                                                        Full hypercube   PMDs only
Principal small counts, m_1 + m_2                       25 823           3 048
Internal small counts, n_1 + n_2                        25 823           2 941
No. of internal 1s, n_1                                 18 728           2 683
No. of internal 2s, n_2                                 7 095            258
Population in small-count cells, n_B = n_1 + 2n_2       32 918           3 199
Percent of population in small-count cells, 100n_B/N    0.66             0.064
No. of cells to round to 3, [n_B/3]                     10 973           1 066
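The derived rows of the table can be recomputed from n_1 and n_2 alone. A small check, assuming that [n_B/3] means n_B/3 rounded to the nearest integer (the helper name is illustrative):

```python
def small_count_summary(n1, n2):
    """Derived quantities for n1 interior cells with count 1 and n2 with count 2."""
    n_B = n1 + 2 * n2                       # population in small-count cells
    return {
        "cells": n1 + n2,                   # number of interior small-count cells
        "n_B": n_B,
        "round_to_3": (n_B + 1) // 3,       # n_B / 3 rounded to the nearest integer
    }

full_hc = small_count_summary(18_728, 7_095)
pmds_only = small_count_summary(2_683, 258)

print(full_hc)    # {'cells': 25823, 'n_B': 32918, 'round_to_3': 10973}
print(pmds_only)  # {'cells': 2941, 'n_B': 3199, 'round_to_3': 1066}
```

Restricting attention to the PMDs cuts the number of cells that must be perturbed by roughly a factor of ten.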
10
Rounding method used
1. Let n_B be the total count of B, e.g. n_B = 3 199.
2. From the non-zero cells in B, select without replacement (WOR) [n_B/3] (= 1 066) cells to be rounded to 3. Probabilities: P(2 → 3) = 2·P(1 → 3). The selection may be stratified.
3. Calculate the distance m = max over c in M of |b*_c - b_c| across a control set M of marginal cells of B.
4. The solution with the smallest value of m is selected.
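A minimal, unstratified sketch of this random search follows. It makes several assumptions that are not in the presentation: the control set M is taken to be the one-way marginals only, the weighted draw without replacement uses random keys u^(1/w) with weight w equal to the cell count (which gives cells with count 2 twice the draw weight of cells with count 1, and only approximates the stated 2:1 ratio of inclusion probabilities), and the toy cube is made up.

```python
import random
from collections import Counter

def one_way_marginals(cube):
    """Control set M (assumed here): all one-way marginal counts of a sparse cube."""
    totals = Counter()
    for cell, n in cube.items():
        for axis, value in enumerate(cell):
            totals[(axis, value)] += n
    return totals

def draw_rounding(B, rng):
    """One candidate B*: [n_B/3] cells set to 3, drawn without replacement."""
    k = (sum(B.values()) + 1) // 3
    # Weighted sampling without replacement: keep the k largest keys u ** (1 / weight)
    ranked = sorted(B, key=lambda cell: rng.random() ** (1.0 / B[cell]), reverse=True)
    return {cell: 3 for cell in ranked[:k]}

def max_deviation(B, B_star):
    """Distance m: largest absolute change over the control marginals."""
    before, after = one_way_marginals(B), one_way_marginals(B_star)
    return max(abs(after[c] - before[c]) for c in before)

def random_search(B, iterations, seed=0):
    """Keep the candidate with the smallest m over a number of random draws."""
    rng = random.Random(seed)
    best, best_m = None, None
    for _ in range(iterations):
        candidate = draw_rounding(B, rng)
        m = max_deviation(B, candidate)
        if best_m is None or m < best_m:
            best, best_m = candidate, m
    return best, best_m

# Toy reduced cube B with made-up small counts (two variables)
B = {(0, 0): 1, (0, 1): 2, (1, 0): 1, (1, 1): 2, (2, 0): 2, (2, 1): 1}
best, best_m = random_search(B, iterations=200)
```

Since the best candidate so far is kept, m can only stay the same or decrease as more draws are made.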
11
Test experiment
Control set M: all one- and two-way marginal counts generated from the eight variables spanning HC 06 (1985 cells).
10 000 runs are done:
– for the full HC 06 and for the PMDs only,
– with stratified and unstratified sampling.
12
Improvements in maximum deviation m by iterations in random search

         Full hypercube, stratification by           PMDs only, stratification by
Iter     None   GEO.L SEX   GEO.L SEX AGE.M FST.H    None   GEO.L SEX   GEO.L SEX AGE.M FST.H
1        263    149         198                      68     62          57
10       186    145         154                      62     56          52
100      153    133         123                      50     -           40
1000     140    133         100                      45     -           37
10000    133    121         100                      43     41          35
13
Percent deviations for the largest absolute deviations, m, in the best solutions

             m      True cell value   Percent deviation
Full HC      133    208 553           0.064
             121    166 784           0.073
             100    64 481            0.155
PMDs only    43     275 346           0.016
             -41    7 464             -0.549
             -35    85 383            -0.041
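The percent deviation is 100·m divided by the true cell value. A quick recomputation from the m and true-value columns of the table:

```python
# (label, m, true cell value) taken from the table
rows = [
    ("Full HC", 133, 208_553),
    ("Full HC", 121, 166_784),
    ("Full HC", 100, 64_481),
    ("PMDs only", 43, 275_346),
    ("PMDs only", -41, 7_464),
    ("PMDs only", -35, 85_383),
]

percent = [round(100 * m / true_value, 3) for _, m, true_value in rows]
print(percent)  # [0.064, 0.073, 0.155, 0.016, -0.549, -0.041]
```

The largest relative deviation (about half a percent) occurs in the smallest of these marginal cells, even though its absolute deviation is not the largest.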
14
Discussion
The method is not yet fully approved for the Census HCs.
Is the method sufficient to prevent any kind of disclosure?
The reduction of the problem (A → B) is absolutely required to make the method work.
Advantage:
– Can produce consistent results with acceptable (?) aggregate deviations for a number of linked cubes of some size.
Problems:
– With random search, the result is subject to chance.
– Diminishing returns from increasing the number of iterations.
– We need to find better and more stable search engines.
– Generalization to rounding bases larger than 3 will increase the deviations in aggregates.
15
Further work
Try better sampling procedures (balanced sampling?).
Try Mixed Integer Linear Programming software.
Extend the experiment to round more hypercubes jointly.
An idea: merge the reduced rounded cells back into the microdata.
– A method for perturbing some variables in relation to others.
– How many variables must be perturbed this way to make all hypercubes safe?
– Creates a microdata set that produces the rounded tables directly.
16
Thank you very much for your attention.