Creating Synthetic Microdata from Official Statistics: Random Number Generation in Consideration of Anscombe's Quartet
Kiyomi Shirakawa (Hitotsubashi University / National Statistics Center)
Shinsuke Ito (Chuo University)

Outline
1. Synthetic Microdata in Japan
2. Problems with Existing Synthetic Microdata
3. Correcting Existing Synthetic Microdata
4. Creating New Synthetic Microdata
5. Comparison between Various Sets of Synthetic Microdata
6. Conclusions and Future Outlook

1. Synthetic Microdata in Japan
Synthetic microdata for educational use are available in Japan:
- Generated using multidimensional statistical tables.
- Based on the methodology of microaggregation (Ito (2008), Ito and Takano (2011), Makita et al. (2013)).
Created based on the original microdata from the 2004 'National Survey of Family Income and Expenditure'.
Synthetic microdata are not original microdata.

Legal Framework
New Statistics Act in Japan (April 2009)
Enables the provision of Anonymized microdata (Article 36) and tailor-made tabulations (Article 34).
- Allows a wider use of official microdata.
- Allows use of official statistics in higher education and academic research.
- However, a permission process is required.
To provide an alternative to Anonymized microdata, the NSTAC has developed Synthetic microdata that can be accessed without a permission process.

Figure: Image of Frequency of Original and Synthetic Microdata. Source: Makita et al. (2013).

2. Problems with Existing Synthetic Microdata (1)
All variables are subjected to exponential transformation in units of cells in the result table.

Number of earners /                    Living expenditure                  Food
  Structure of dwelling    Frequency   Mean        SD          C.V.        Mean       SD         C.V.
One person                 4,132       302,492.8   148,598.9   0.491       71,009.0   25,089.5   0.353
  Wooden                   1,436       300,390.3   170,211.4   0.567       71,018.5   24,187.6   0.341
  Wooden with fore roof      501       298,961.0   125,682.9   0.420       73,507.3   24,947.7   0.339
  Ferro-concrete           1,624       306,947.4   131,895.0   0.430       69,873.1   25,844.2   0.370
  Unknown                    571       298,209.7   153,651.1   0.515       72,024.1   25,125.1   0.349
Two persons                4,201       346,195.7   215,911.7   0.624       78,209.1   25,288.1   0.323
  Wooden                   1,962       346,980.3   172,673.2   0.498       78,961.7   24,233.5   0.307
  Wooden with fore roof      558       356,021.5   160,579.8   0.451       81,039.4   24,628.2   0.304
  Ferro-concrete           1,120       353,093.9   313,837.8   0.889       76,860.8   26,250.7   0.342
  Others                       3       260,759.8    37,924.3   0.145       72,733.1    5,358.9   0.074
  Unknown                    558       320,224.5   148,230.3   0.463       75,468.5   27,241.1   0.361

Annotation on the slide: "Too large".
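The C.V. columns in the table above appear to be the coefficient of variation, that is, the standard deviation divided by the mean. The following minimal sketch recomputes it for three living-expenditure rows copied from the table; the function name and the row labels are illustrative choices, not part of the original slides.

```python
def coefficient_of_variation(mean, sd):
    """Return the coefficient of variation: standard deviation divided by mean."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return sd / mean

# (label, mean, SD) for living expenditure, copied from three rows of the table
rows = [
    ("Two persons, Ferro-concrete", 353_093.9, 313_837.8),
    ("Two persons, Others", 260_759.8, 37_924.3),
    ("One person, Wooden", 300_390.3, 170_211.4),
]

# Round to three decimals, the precision used in the table
cvs = {label: round(coefficient_of_variation(m, s), 3) for label, m, s in rows}

for label, cv in cvs.items():
    print(f"{label}: C.V. = {cv}")
```

Comparing C.V. values across cells is one simple way to spot cells whose spread is unusually large relative to their mean.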

2. Problems with Existing Synthetic Microdata (2)
Correlation coefficients (numerical) between all variables are reproduced. In the table below, several correlation coefficients are too small. The reason is that correlation coefficients between uncorrelated variables are also reproduced.

                     Living expenditure   Food     Housing
Living expenditure   1.00                 0.50     0.28
Food                 0.43                 1.00     -0.03
Housing              0.28                 -0.06    1.00

Top half: original data; bottom half: synthetic microdata.
Annotation on the slide: "Too small".

2. Problems with Existing Synthetic Microdata (3)
Qualitative attributes of groups having a frequency (size) of 1 or 2 are transformed to "Unknown" (V) or deleted. The information loss when using this method is too large. Furthermore, the variations within the groups are too large to merge qualitative attributes between different groups.
Figure 1: Processing records with common values for qualitative attributes into groups with a minimum size of 3. Note: "V" stands for "unknown". Source: Makita et al. (2013).

3. Correcting Existing Synthetic Microdata
The following approaches can be used to correct the existing Synthetic microdata.
(1) Select the transformation method (logarithmic transformation, exponential transformation, square-root transformation, reciprocal transformation) based on the original distribution type (normal, bimodal, uniform, etc.).
(2) Detect non-correlations for each variable.
(3) Merge qualitative attributes in groups with a size of 1 or 2 into a group that has a minimum size of 3 in the upper hierarchical level.
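Approach (3) can be illustrated with a small sketch. It assumes that each record carries a tuple of qualitative attribute codes ordered from the highest hierarchical level to the lowest, and that "merging into the upper level" means dropping the lowest attribute of any record whose group has fewer than 3 members. The function name, the tuple representation, and the sample codes are hypothetical; the slides do not specify an implementation.

```python
from collections import Counter

MIN_SIZE = 3  # minimum group size stated on the slide


def merge_small_groups(records, min_size=MIN_SIZE):
    """Move records in groups smaller than min_size up one hierarchical level.

    records: iterable of tuples of attribute codes, highest level first
    (an assumption of this sketch). Starting at the deepest level, a record
    whose exact group has fewer than min_size members loses its last
    attribute, so it joins the group one level above. The check is then
    repeated at that level.
    """
    merged = [tuple(r) for r in records]
    depth = max((len(r) for r in merged), default=0)
    for level in range(depth, 1, -1):
        counts = Counter(merged)
        merged = [
            r[:-1] if len(r) == level and counts[r] < min_size else r
            for r in merged
        ]
    return merged


# Hypothetical attribute codes: one group of size 1, one of size 2, one of size 3
records = [(2, 1, 1, 3, 6)] + [(2, 1, 1, 3, 7)] * 2 + [(2, 1, 1, 2, 5)] * 3
result = Counter(merge_small_groups(records))
print(result)
```

In this example the groups of size 1 and 2 share the same upper level, so together they form one group of size 3 there, while the group that already had 3 members keeps all of its attributes.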

Box-Cox Transformation
λ = 0: logarithmic transformation
λ = 0.5: square-root transformation
λ = -1: reciprocal transformation
λ = 1: linear transformation
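For reference, the standard one-parameter Box-Cox transformation is y = (x^λ - 1) / λ for λ ≠ 0 and y = ln x for λ = 0, defined for x > 0. The sketch below implements this textbook form together with its inverse; the formula itself is not legible in the slide transcript, so treating it as the form the authors used is an assumption.

```python
import math


def box_cox(x, lam):
    """Standard one-parameter Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)  # limiting case as lam -> 0
    return (x ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


# The special cases listed on the slide, evaluated at x = 4
for lam in (0, 0.5, -1, 1):
    print(lam, box_cox(4.0, lam))
```

With this definition, λ = 0.5 gives 2(√x - 1), λ = -1 gives 1 - 1/x, and λ = 1 gives x - 1, so each case is the named transformation up to a shift and a scale factor.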

4. Creating New Synthetic Microdata
In order to address the problems with existing Synthetic microdata, new synthetic microdata were created based on the following approaches.
(1) Create microdata based on kurtosis and skewness
(2) Create microdata based on the two tabulation tables of the basic table and details table
(3) Create microdata based on multivariate normal random numbers and exponential transformation
This process allows creating synthetic microdata with characteristics similar to those of the original microdata.

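(1) Microdata created based on Kurtosis and Skewness
Original microdata and transformed indicators for each transformation

                     Original   Log2             Natural log      Square-root      Reciprocal
                     data       transformation   transformation   transformation   transformation
Mean                 861.370    9.139            6.335            26.451           2.651
Standard deviation   882.057    1.363            0.945            12.960           2.548
Kurtosis             4.004      -0.448           -0.448           0.974            4.185
Skewness             2.002      0.107            0.107            1.115            1.943

Frequency: 27
λ: -0.047 (λ = 0)

Note: the slide shows a single kurtosis value and a single skewness value spanning both logarithmic columns; the two logarithms differ only by a constant factor, which leaves both indicators unchanged.
Annotation on the slide: "Differences of kurtosis and skewness".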
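The comparison on this slide can be sketched as follows: apply each candidate transformation to the data and compute skewness and kurtosis of the result. The sample values are hypothetical (the 27 original records are not given), and the moment-based formulas with an n denominator are an assumption; the slides do not say which estimator was used.

```python
import math


def _central_moment(xs, k):
    """k-th central moment with an n denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs):
    return _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    # Excess kurtosis: 0 for a normal distribution
    return _central_moment(xs, 4) / _central_moment(xs, 2) ** 2 - 3.0


TRANSFORMS = {
    "original": lambda x: x,
    "log2": math.log2,
    "natural log": math.log,
    "square root": math.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}


def shape_table(xs):
    """Return {transformation name: (skewness, excess kurtosis)}."""
    table = {}
    for name, f in TRANSFORMS.items():
        ys = [f(x) for x in xs]
        table[name] = (skewness(ys), excess_kurtosis(ys))
    return table


# Hypothetical right-skewed sample that is symmetric on the log scale
sample = [math.exp(0.5 * k) for k in range(-5, 6)]
table = shape_table(sample)
for name, (skew, kurt) in table.items():
    print(f"{name:12s} skewness={skew: .3f} kurtosis={kurt: .3f}")
```

Because log2(x) is ln(x) divided by a constant, both logarithmic transformations give identical skewness and kurtosis, which is why only the mean and standard deviation differ between those two columns.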

(2) Microdata created based on two Tabulation Tables (Basic Table and Details Table)

Basic Table (matches with original mean and standard deviation, approximate correlation coefficients for each variable)

                     Living expenditure   Food        Housing
Mean                 195,624.8            54,647.8    1,648.8
Standard deviation   59,892.6             21,218.1    3,144.4
Kurtosis             -1.004164            1.628974    6.918601
Skewness             0.346305             0.992579    2.605260
Frequency            20                   20          8

Correlation coefficients
                     Living expenditure   Food        Housing
Living expenditure   1
Food                 0.643                1
Housing              -0.335               -0.489      1

Details Table (means and standard deviations for creating synthetic microdata for multidimensional cross fields)

         Living expenditure                          Food
Groups   Frequency   Mean        Standard deviation   Frequency   Mean       Standard deviation
1        3           185,499.9   65,680.5             3           31,193.5   6,406.9
2        3           150,424.8   28,599.3             3           51,457.2   20,795.2
3        3           269,749.0   43,611.7             3           80,520.1   28,447.0
4        4           209,347.8   50,580.8             4           45,359.0   12,618.4
5        3           236,587.8   40,679.9             3           75,606.2   3,049.8
6        4           137,080.2   15,119.7             4           48,797.2   1,071.9
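One way to generate values that reproduce a cell's frequency, mean, and standard deviation from the details table is to draw random numbers, standardize them, and rescale. The sketch below does this for group 1 of living expenditure. It is an illustration under stated assumptions, not the authors' procedure: the function name and seed are hypothetical, and the standard deviation is taken to be the sample standard deviation (n - 1 denominator).

```python
import random
import statistics


def synth_group(n, mean, sd, rng):
    """Draw n values whose sample mean and sample SD equal mean and sd.

    Normal draws are standardized to mean 0 and sample SD 1, then shifted
    and scaled. Requires n >= 2.
    """
    draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = statistics.fmean(draws)
    s = statistics.stdev(draws)  # n - 1 denominator (assumption)
    return [mean + sd * (d - m) / s for d in draws]


rng = random.Random(0)  # hypothetical seed, for reproducibility only
# Group 1, living expenditure: frequency 3, mean 185,499.9, SD 65,680.5
values = synth_group(3, 185_499.9, 65_680.5, rng)
print(values)
print(statistics.fmean(values), statistics.stdev(values))
```

Standardizing before rescaling fixes the first two moments exactly, so any remaining freedom in the draw can be used to select a set whose kurtosis and skewness are close to the targets.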

(3) Microdata created based on Multivariate Normal Random Numbers and Exponential Transformation
A random number that approximates the kurtosis and skewness of the original microdata was selected.
λ in the Box-Cox transformation is required in order to change the distribution type of the original data into a normal distribution.
In principle, the transformation is based on λ in the Box-Cox transformation; in practice, however, it is approximated using an exponential transformation.
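The combination of multivariate normal random numbers and an exponential transformation can be sketched for two variables as below. The sketch assumes a bivariate lognormal model and uses the standard lognormal moment relations to convert the target mean, standard deviation, and correlation on the original scale into parameters on the log scale. Function names and the seed are hypothetical, the targets are the basic-table values for living expenditure and food, and nothing here is claimed to be the authors' exact algorithm.

```python
import math
import random


def lognormal_params(mean, sd):
    """Log-scale (mu, sigma) of a lognormal with the given mean and SD."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def log_scale_corr(rho, mean1, sd1, mean2, sd2):
    """Correlation of the underlying normals that yields rho after exp()."""
    _, s1 = lognormal_params(mean1, sd1)
    _, s2 = lognormal_params(mean2, sd2)
    return math.log(1.0 + rho * (sd1 / mean1) * (sd2 / mean2)) / (s1 * s2)


def draw_bivariate_lognormal(n, mean1, sd1, mean2, sd2, rho, rng):
    """Draw n pairs: correlated normals (2x2 Cholesky), then exponentiate."""
    mu1, s1 = lognormal_params(mean1, sd1)
    mu2, s2 = lognormal_params(mean2, sd2)
    r = log_scale_corr(rho, mean1, sd1, mean2, sd2)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y1 = mu1 + s1 * z1
        y2 = mu2 + s2 * (r * z1 + math.sqrt(1.0 - r * r) * z2)
        rows.append((math.exp(y1), math.exp(y2)))  # exponential transformation
    return rows


rng = random.Random(2015)  # hypothetical seed
rows = draw_bivariate_lognormal(
    20, 195_624.8, 59_892.6, 54_647.8, 21_218.1, 0.643, rng
)
for living, food in rows[:3]:
    print(round(living, 1), round(food, 1))
```

The parameters match the targets only in expectation; a single draw of 20 records will deviate from them, which is why a selection or adjustment step is still needed if the sample mean and standard deviation must match exactly.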

5. Comparison of Results
Comparison of original microdata and each set of synthetic microdata

Columns: (1) Original microdata; (2) Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation; (3) Kurtosis and skewness; (4) Multivariate lognormal random numbers. LE = Living expenditure.

No.     (1) LE   (1) Food     (2) LE   (2) Food     (3) LE   (3) Food     (4) LE   (4) Food
  1  125,503.5   29,496.1  110,487.8   25,143.0  107,684.0   23,459.9  133,549.9   38,559.9
  2  255,675.9   25,806.2  232,691.8   37,905.5  281,880.8   56,520.4  123,716.6   42,930.1
  3  175,320.4   38,278.2  213,320.2   30,531.9  254,267.3   37,419.4  152,784.8   67,263.8
  4  181,085.6   74,122.1  183,430.4   75,469.1  294,589.9  112,843.9  195,764.8    8,286.1
  5  124,471.0   33,256.8  134,867.6   39,568.9  193,191.6   54,363.3  202,865.8   75,558.0
  6  145,717.7   46,992.8  132,976.4   39,333.7  189,242.7   53,980.3  193,003.4   70,994.2
  7  319,114.3  113,177.1  242,622.5   68,472.2  151,183.6   55,303.2  191,620.1   52,311.7
  8  253,685.2   67,253.6  320,055.9  113,008.5  271,338.1   79,991.4   72,773.7   13,621.6
  9  236,447.6   61,129.8  246,568.6   60,079.7  157,306.9   50,650.9  201,114.6   74,899.0
 10  137,315.3   27,050.1  144,192.6   32,572.9  167,431.0   36,116.3  217,530.7   60,736.0
 11  253,393.7   47,205.6  267,708.8   60,344.8  270,301.8   78,246.4  297,608.7   77,464.3
 12  232,141.8   52,259.6  212,050.7   37,656.3  223,946.8   43,827.9  175,993.6   71,416.6
 13  214,540.4   54,920.9  213,439.1   50,862.2  225,103.2   63,861.2  297,653.0   86,400.5
 14  234,151.4   74,993.0  205,595.0   73,919.1  165,972.3   49,350.6  123,197.1   31,645.5
 15  278,431.0   78,916.1  282,652.7   79,126.9  249,749.1   73,474.1  277,501.6   69,910.5
 16  197,180.8   72,909.6  221,515.6   73,772.7  183,281.1   48,672.3  235,221.1   58,700.6
 17  118,895.1   48,821.6  127,964.3   50,240.7  115,639.3   71,059.5  182,363.2   49,433.2
 18  130,482.8   47,798.5  159,328.0   48,533.5  170,231.1   38,723.5  158,939.4   45,131.8
 19  147,969.1   50,277.9  133,795.5   47,660.6  125,789.2   22,188.5  212,194.2   37,995.6
 20  150,973.7   48,291.0  127,232.9   48,754.2  114,366.4   42,903.1  267,100.1   59,697.3

5. Comparison of Results
Judging from the indicators in the table below, the most useful microdata are those in column 2. For reference, column 4 uses the same method as the trial synthetic microdata.

Columns: (1) Original microdata; (2) Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation; (3) Kurtosis and skewness; (4) Multivariate lognormal random numbers. LE = Living expenditure.

                        (1) LE   (1) Food     (2) LE   (2) Food     (3) LE   (3) Food     (4) LE   (4) Food
Mean                 195,624.8   54,647.8  195,624.8   54,647.8  195,624.8   54,647.8  195,624.8   54,647.8
Standard deviation    59,892.6   21,218.1   59,892.6   21,218.1   59,892.6   21,218.1   59,892.6   21,218.1
Kurtosis             -1.004164   1.628974  -0.810215   1.473853  -1.220185   1.721354  -0.212358  -0.052164
Skewness              0.346305   0.992579   0.310913   1.050568   0.160612   0.949106   0.035785  -0.709361
Maximum value        319,114.3  113,177.1  320,055.9  113,008.5  294,589.9  112,843.9  297,653.0   86,400.5
Minimum value        118,895.1   25,806.2  110,487.8   25,143.0  107,684.0   22,188.5   72,773.7    8,286.1

Correlation coefficients (three values survive in the transcript, in this order; their assignment to the four columns is not recoverable): 0.642511, 0.689447, 0.642511
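The ranking of the synthetic data sets can be checked with a simple distance on the shape indicators. The sketch below uses the kurtosis and skewness values from the comparison table on this slide and sums the absolute differences from the original microdata. The distance measure is an illustrative assumption; the slides do not state how "most useful" was quantified.

```python
# (kurtosis LE, kurtosis Food, skewness LE, skewness Food), copied from the slide
ORIGINAL = (-1.004164, 1.628974, 0.346305, 0.992579)
CANDIDATES = {
    2: (-0.810215, 1.473853, 0.310913, 1.050568),    # hierarchization + kurtosis, skewness, lambda
    3: (-1.220185, 1.721354, 0.160612, 0.949106),    # kurtosis and skewness
    4: (-0.212358, -0.052164, 0.035785, -0.709361),  # multivariate lognormal random numbers
}


def shape_distance(candidate, reference=ORIGINAL):
    """Sum of absolute differences in kurtosis and skewness (assumed metric)."""
    return sum(abs(c - r) for c, r in zip(candidate, reference))


distances = {col: shape_distance(vals) for col, vals in CANDIDATES.items()}
best = min(distances, key=distances.get)

for col, dist in sorted(distances.items()):
    print(f"column {col}: distance = {dist:.6f}")
print("closest to the original:", best)
```

Under this particular measure, column 2 is closest to the original microdata and column 4 is farthest, which agrees with the assessment stated on the slide.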

5. Comparison of Results
Figure: Scatter plots of living expenditure and food for each microdata. Panels: Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation; Kurtosis and skewness; Multivariate lognormal random numbers. Axes: food and living expenditure.

Example Result Table for New Synthetic Microdata

Items                      Living expenditure                Food
No.   A  B  C  D  E  F     Frequency   Mean        SD          Frequency   Mean       SD
1     2  1  1  2  5  1     3           185,499.9   65,680.5    3           31,193.5   6,406.9
2     2  1  1  3           6           210,086.9   73,208      6           65,988.7   27,387.3
      2  1  1  3  6  1     3           150,424.8   28,599.3    3           51,457.2   20,795.2
      2  1  1  3  7  1     3           269,749.0   43,611.7    3           80,520.1   28,447.0
3     3  1  1  1           7           221,022.1   45,197.7    7           58,322.1   18,550.2
      3  1  1  1  5  1     4           209,347.8   50,580.8    4           45,359.0   12,618.4
      3  1  1  1  6  1     3           236,587.8   40,679.9    3           75,606.2   3,049.8
4     3  1  1  2  5  1     4           137,080.2   15,119.7    4           48,797.2   1,071.9

                           Living expenditure   Food
Mean                       195,624.8            54,647.8
Standard deviation         59,892.6             21,218.1
Kurtosis                   -1.004               1.629
Skewness                   0.346                0.993
Correlation coefficients   0.643
λ                          0

A: 5-year age groups; B: employment/unemployed; C: company classification; D: company size; E: industry code; F: occupation code
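The upper-level rows of the result table (those with E and F left blank) can be related to their detail rows by pooling frequencies, means, and standard deviations. The sketch below pools the two detail rows under A=2, B=1, C=1, D=3 for living expenditure. It assumes the printed SDs are sample standard deviations (n - 1 denominator); the function name is hypothetical.

```python
import math


def pool(groups):
    """Combine (n, mean, sample SD) triples into one (n, mean, sample SD).

    The total sum of squares is the within-group part, (n_i - 1) * sd_i**2,
    plus the between-group part, n_i * (mean_i - pooled_mean)**2.
    """
    n = sum(k for k, _, _ in groups)
    mean = sum(k * m for k, m, _ in groups) / n
    ss = sum((k - 1) * s ** 2 + k * (m - mean) ** 2 for k, m, s in groups)
    return n, mean, math.sqrt(ss / (n - 1))


# Detail rows for living expenditure under A=2, B=1, C=1, D=3
details = [(3, 150_424.8, 28_599.3), (3, 269_749.0, 43_611.7)]
n, mean, sd = pool(details)
print(n, round(mean, 1), round(sd, 1))
```

Under that assumption, the pooled values can be compared directly with the upper-level row printed in the table, which gives a quick consistency check between the hierarchical levels.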

6. Conclusions and Future Outlook
Conclusions
1. We suggested improvements to synthetic microdata created by the National Statistics Center for statistics education and training.
2. We created new synthetic microdata using several methods that adhere to this disclosure limitation method.
3. The results show that kurtosis, skewness, and Box-Cox transformation λ are useful for creating synthetic microdata, in addition to frequency, mean, standard deviation, and correlation coefficient, which have previously been used as indicators.
Next Steps
1. Decide the number of cross fields (dimensionality) of the basic table and details table and the style (indicators to tabulate) of the result table according to the statistical fields in the public survey.
2. Expand this work to the creation and improvement of synthetic microdata from other surveys.

References
1. Anscombe, F. J. (1973) "Graphs in Statistical Analysis", American Statistician, pp.17-21.
2. Bethlehem, J. G., Keller, W. J. and Pannekoek, J. (1990) "Disclosure Control of Microdata", Journal of the American Statistical Association, Vol.85, No.409, pp.38-45.
3. Defays, D. and Anwar, M. N. (1998) "Masking Microdata Using Micro-Aggregation", Journal of Official Statistics, Vol.14, No.4, pp.449-461.
4. Domingo-Ferrer, J. and Mateo-Sanz, J. M. (2002) "Practical Data-oriented Microaggregation for Statistical Disclosure Control", IEEE Transactions on Knowledge and Data Engineering, Vol.14, No.1, pp.189-201.
5. Höhne (2003) "SAFE - A Method for Statistical Disclosure Limitation of Microdata", Paper presented at Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxembourg, pp.1-3.
6. Ito, S., Isobe, S., Akiyama, H. (2008) "A Study on Effectiveness of Microaggregation as Disclosure Avoidance Methods: Based on National Survey of Family Income and Expenditure", NSTAC Working Paper, No.10, pp.33-66 (in Japanese).
7. Ito, S. (2009) "On Microaggregation as Disclosure Avoidance Methods", Journal of Economics, Kumamoto Gakuen University, Vol.15, No.3・4, pp.197-232 (in Japanese).
8. Makita, N., Ito, S., Horikawa, A., Goto, T., Yamaguchi, K. (2013) "Development of Synthetic Microdata for Educational Use in Japan", Paper presented at 2013 Joint IASE / IAOS Satellite Conference, Macau Tower, Macau, China, pp.1-9.
