Creating Synthetic Microdata from Official Statistics: Random Number Generation in Consideration of Anscombe's Quartet Kiyomi Shirakawa Hitotsubashi University.

Slides:



Advertisements
Similar presentations
Katherine Jenny Thompson
Advertisements

Chapter 3 Properties of Random Variables
The Microdata Analysis System (MAS): A Tool for Data Dissemination Disclaimer: The views expressed are those of the authors and not necessarily those of.
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Lecture Slides Elementary Statistics Eleventh Edition and the Triola.
Prediction, Goodness-of-Fit, and Modeling Issues ECON 6002 Econometrics Memorial University of Newfoundland Adapted from Vera Tabakova’s notes.
Slide Slide 1 Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley. Lecture Slides Elementary Statistics Tenth Edition and the.
Modeling Process Quality
Chapter 1 Displaying the Order in a Group of Numbers
MondayTuesdayWednesdayThursdayFriday 3 Benchmark – Practice Questions from Unit 1 – 3 and a chance to earn Bonus (Skills Check category) 4 Review Unit.
EOCT Review Unit 4: Describing Data. Key Ideas  Summarize, represent, and interpret data on a single count or measurable variable ( dot plots, histograms.
11 ACS Public Use Microdata Samples of 2005 and 2006 – How to Use the Replicate Weights B. Dale Garrett and Michael Starsinic U.S. Census Bureau AAPOR.
15 de Abril de A Meta-Analysis is a review in which bias has been reduced by the systematic identification, appraisal, synthesis and statistical.
GRAPHICAL DESCRIPTIVE STATISTICS FOR QUALITATIVE, TIME SERIES AND RELATIONAL DATA.
4. SMMAR(1) IMPLEMENTATION It was developed a Pascal routine for generating synthetic total monthly precipitation series. The model was calibrated with.
1 Econ 240A Power Outline Review Projects 3 Review: Big Picture 1 #1 Descriptive Statistics –Numerical central tendency: mean, median, mode dispersion:
Analysis of Research Data
11 Comparison of Perturbation Approaches for Spatial Outliers in Microdata Natalie Shlomo* and Jordi Marés** * Social Statistics, University of Manchester,
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Steps in Using the and R Chart
Chapter 2 Summarizing and Graphing Data
Refined pea core collection based on qualitative and quantitative characteristics Clarice J. Coyne, USDA-ARS Western Regional Plant Introduction, Washington.
Statistics 3502/6304 Prof. Eric A. Suess Chapter 3.
Variable  An item of data  Examples: –gender –test scores –weight  Value varies from one observation to another.
BPS - 3rd Ed. Chapter 211 Inference for Regression.
WFM 5201: Data Management and Statistical Analysis © Dr. Akm Saiful IslamDr. Akm Saiful Islam WFM 5201: Data Management and Statistical Analysis Akm Saiful.
Ch2: Probability Theory Some basics Definition of Probability Characteristics of Probability Distributions Descriptive statistics.
The Normal Distribution Chapter 3 College of Education and Health Professions.
SW388R6 Data Analysis and Computers I Slide 1 Central Tendency and Variability Sample Homework Problem Solving the Problem with SPSS Logic for Central.
6-1 Numerical Summaries Definition: Sample Mean.
Chapter 21 Basic Statistics.
Brian Macpherson Ph.D, Professor of Statistics, University of Manitoba Tom Bingham Statistician, The Boeing Company.
The use of protected microdata in tabulation: case of SDC-methods microaggregation and PRAM Researcher Janika Konnu Manchester, United Kingdom December.
Chapter 13: Correlation An Introduction to Statistical Problem Solving in Geography As Reviewed by: Michelle Guzdek GEOG 3000 Prof. Sutton 2/27/2010.
Probabilistic and Statistical Techniques 1 Lecture 3 Eng. Ismail Zakaria El Daour 2010.
Oslo, 24–26 September 2012 Work Session on Statistical Data Editing APPLICATION OF THE DEVELOPED SAS MACRO FOR EDITING AND IMPUTATION AT.
PROCESSING OF DATA The collected data in research is processed and analyzed to come to some conclusions or to verify the hypothesis made. Processing of.
Disclosure Avoidance at Statistics Canada INFO747 Session on Confidentiality Protection April 19, 2007 Jean-Louis Tambay, Statistics Canada
Chapter 8 Making Sense of Data in Six Sigma and Lean
Prediction, Goodness-of-Fit, and Modeling Issues Prepared by Vera Tabakova, East Carolina University.
5-1 ANSYS, Inc. Proprietary © 2009 ANSYS, Inc. All rights reserved. May 28, 2009 Inventory # Chapter 5 Six Sigma.
MMSI – SATURDAY SESSION with Mr. Flynn. Describing patterns and departures from patterns (20%–30% of exam) Exploratory analysis of data makes use of graphical.
Anonymization of longitudinal surveys in the presence of outliers Hans-Peter Hafner HTW Saar – Saarland University of Applied Sciences
Protection of frequency tables – current work at Statistics Sweden Karin Andersson Ingegerd Jansson Karin Kraft Joint UNECE/Eurostat.
STOCHASTIC HYDROLOGY Stochastic Simulation of Bivariate Distributions Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National.
Pan-cancer analysis of prognostic genes Jordan Anaya Omnes Res, In this study I have used publicly available clinical and.
Outline of Today’s Discussion 1.Displaying the Order in a Group of Numbers: 2.The Mean, Variance, Standard Deviation, & Z-Scores 3.SPSS: Data Entry, Definition,
Chapter 20 Statistical Considerations Lecture Slides The McGraw-Hill Companies © 2012.
Introduction Exploring Categorical Variables Exploring Numerical Variables Exploring Categorical/Numerical Variables Selecting Interesting Subsets of Data.
BPS - 5th Ed. Chapter 231 Inference for Regression.
(Unit 6) Formulas and Definitions:. Association. A connection between data values.
CHAPTER 10 DATA EXPLORATION 10.1 Data Exploration Box 10.1 Data Visualization Descriptive Statistics Box 10.2 Descriptive Statistics Graphs.
Expanding the Role of Synthetic Data at the U.S. Census Bureau 59 th ISI World Statistics Congress August 28 th, 2013 By Ron S. Jarmin U.S. Census Bureau.
Prof. Eric A. Suess Chapter 3
Exploratory Data Analysis
EHS 655 Lecture 4: Descriptive statistics, censored data
Central Limit Theorem Section 5-5
BAE 6520 Applied Environmental Statistics
BAE 5333 Applied Water Resources Statistics
Measures for Information Loss in Protected Data
Evidence Based Practice 3
Structural Business Statistics Data validation
Undergraduated Econometrics
Analysing quantitative data
Statistics II: An Overview of Statistics
HYDROLOGY Lecture 12 Probability
Higher National Certificate in Engineering
Data analysis LO: Identify and apply different methods of measuring central tendencies and dispersion.
Essentials of Statistics 4th Edition
Presentation transcript:

Creating Synthetic Microdata from Official Statistics: Random Number Generation in Consideration of Anscombe's Quartet Kiyomi Shirakawa Hitotsubashi University / National Statistics Center Shinsuke Ito Chuo University

Outline 1. Synthetic Microdata in Japan 2. Problems with Existing Synthetic Microdata 3. Correcting Existing Synthetic Microdata 4. Creating New Synthetic Microdata 5. Comparison between Various Sets of Synthetic Microdata 6. Conclusions and Future Outlook 2

1. Synthetic Microdata in Japan Synthetic Microdata for educational use are available in Japan:  Generated using multidimensional statistical tables.  Based on the methodology of microaggregation (Ito (2008), Ito and Takano (2011), Makita et al (2013)) Created based on the original microdata from the 2004 ‘National Survey of Family Income and Expenditure’ Synthetic microdata are not original microdata. 3

Legal Framework New Statistics Act in Japan(April 2009) Enables the provision of Anonymized microdata (Article 36) and tailor-made tabulations (Article 34).  Allows a wider use of official microdata.  Allows use of official statistics in higher education and academic research.  However, permission process is required. 4 To provide an alternative to Anonymized microdata, the NSTAC has developed Synthetic microdata that can be accessed without a permission process.

Image of Frequency of Original and Synthetic Microdata 5 Source: Makita et al. (2013).

2. Problems with Existing Synthetic Microdata (1) All variables are subjected to exponential transformation in units of cells in the result table. 6 Number of Earners Structure of Dwelling Frequency Living ExpenditureFood MeanSDC.V.MeanSDC.V. One person 4,132302, , , , Wooden1,436300, , , , Wooden with fore roof , , , , Ferro-concrete1,624306, , , , Unknown571298, , , , Two persons 4,201346, , , , Wooden1,962346, , , , Wooden with fore roof , , , , Ferro-concrete1,120353, , , , Others3260, , ,733.15, Unknown558320, , , , Too large

2. Problems with Existing Synthetic Microdata (2) Correlation coefficients (numerical) between all variables are reproduced. In the below table, several correlation coefficients are too small. The reason is that correlation coefficients between uncorrelated variables are also reproduced. 7 Living expenditureFoodHousing Living expenditure Food Housing Top half: original data; bottom half: synthetic microdata Too small

2. Problems with Existing Synthetic Microdata (3) Qualitative attributes of groups having a frequency (size) of 1 or 2 are transformed to "Unknown" (V) or deleted. The information loss when using this method is too large. Furthermore, the variations within the groups are too large to merge qualitative attributes between different groups. 8 Note: "V" stands for "unknown". Source: Makita et al. (2013). Figure 1: Processing records with common values for qualitative attributes into groups with a minimum size of 3.

3. Correcting Existing Synthetic Microdata The following approaches can be used to correct the existing Synthetic microdata. (1)Select the transformation method (logarithmic transformation, exponential transformation, square-root transformation, reciprocal transformation) based on the original distribution type (normal, bimodal, uniform, etc.). (2)Detect non-correlations for each variable. (3)Merge qualitative attributes in groups with a size of 1 or 2 into a group that has a minimum size of 3 in the upper hierarchical level. 9

Box-Cox Transformation 10 λ = 0 logarithmic transformation λ = 0.5 square-root transformation λ = -1 reciprocal transformation λ = 1 linear transformation

4. Creating New Synthetic Microdata In order to improve problems with existing Synthetic microdata, new synthetic microdata were created based on the following approaches. (1) Create microdata based on kurtosis and skewness (2) Create microdata based on the two tabulation tables of the basic table and details table (3) Create microdata based on multivariate normal random numbers and exponential transformation 11 This process allows creating synthetic microdata with characteristics similar to those of the original microdata.

12 Original dataLog2 transformation Natural lognormal transformation Square-root transformation Reciprocal transformation Mean Standard deviation Kurtosis Skewness Frequency27 λ ( λ = 0 ) Differences of kurtosis and skewness (1) Microdata created based on Kurtosis and Skewness Original microdata and transformed indicators for each transformation

(2) Microdata created based on two Tabulation Tables (Basic Table and Details Table) 13 Living expenditureFood Housing Mean 195, ,647.81,648.8 Standard deviation 59, ,218.13,144.4 Kurtosis Skewness Frequency20 8 Correlation coefficients Living expenditureFoodHousing Living expenditure1 Food Housing Basic Table (matches with original mean and standard deviation, approximate correlation coefficients for each variable) Groups Living expenditureFood FrequencyMeanStandard deviationFrequencyMeanStandard deviation 13185, , ,193.56, , , , , , , , , , , , , , , ,606.23, , , ,797.21,071.9 Details Table (means and standard deviations for creating synthetic microdata for multidimensional cross fields)

(3) Microdata created based on Multivariate Normal Random Numbers and Exponential Transformation 14 a random number that approximates the kurtosis and skewness of the original microdata was selected. λ in the Box-Cox transformation is required in order to change the distribution type of the original data into a standard distribution. Based on λ in the Box- Cox transformation However, approximately using exponential transformation

15 No. 1 Original microdata 2 Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation 3 Kurtosis and skewness 4 Multivariate lognormal random numbers Living expenditure Food Living expenditure Food Living expenditure Food Living expenditure Food 1 125, , , , , , , ,559.9 2 255, , , , , , , ,930.1 3 175, , , , , , , ,263.8 4 181, , , , , , ,764.88,286.1 5 124, , , , , , , ,558.0 6 145, , , , , , , ,994.2 7 319, , , , , , , ,311.7 8 253, , , , , , , ,621.6 9 236, , , , , , , ,899.0 10 137, , , , , , , ,736.0 11 253, , , , , , , ,464.3 12 232, , , , , , , ,416.6 13 214, , , , , , , ,400.5 14 234, , , , , , , ,645.5 15 278, , , , , , , ,910.5 16 197, , , , , , , ,700.6 17 118, , , , , , , ,433.2 18 130, , , , , , , ,131.8 19 147, , , , , , , ,995.6 20 150, , , , , , , , Comparison of Results Comparison of original microdata and each set of synthetic microdata

16 No. 1 Original microdata 2 Hierarchization, and kurtosis, skewness and λ of Box-Cox transformation 3 Kurtosis and skewness 4 Multivariate lognormal random numbers Living expenditure Food Living expenditure Food Living expenditure Food Living expenditure Food Mean195, , , , , , , ,647.8 Standard deviation59, , , , , , , ,218.1 Kurtosis Skewness Correlation coefficients Maximum value319, , , , , , , ,400.5 Minimum value118, , , , , , ,773.78, Comparison of Results The most useful microdata from the indicators in the below table are in column number 2. Note that for reference, column number 4 is the same as the trial synthetic microdata method.

17 5. Comparison of Results Hierarchizatio n, and kurtosis, skewness and λ of Box-Co x transformation Kurtosis and skewness Multivariate lognormal random numbers Scatter plots of living expenditure and food for each microdata Food living expenditure

18 Example Result Table for New Synthetic Microdata ItemsLiving expenditureFood No.ABCDEFFrequencyMeanSD FrequencyMeanSD , , ,193.56, , ,208665, , , , , , , , , , , , , , , , , , , , ,606.23, , , ,797.21,071.9 Mean195, Standard deviation59, Kurtosis Skewness Correlation coefficients λ 0 A: 5-year age groups; B: employment/unemployed; C: company classification; D: company size; E: industry code; F: occupation code

6. Conclusions and Future Outlook Conclusions 1.We suggested improvements to synthetic microdata created by the National Statistics Center for statistics education and training. 2.We created new synthetic microdata using several methods that adhere to this disclosure limitation method. 3.The results show that kurtosis, skewness, and Box-Cox transformation λ are useful for creating synthetic microdata in addition to frequency, mean, standard deviation, and correlation coefficient which have previously been used as indicators. 19 Next Steps 1.Decide the number of cross fields (dimensionality) of the basic table and details table and the style (indicators to tabulate) of the result table according to the statistical fields in the public survey. 2.Expand this work to the creation and improvement of synthetic microdata from other surveys.

References 20 1.Anscombe, F.J.(1973), "Graphs in Statistical Analysis," American Statistician, Bethlehem, J. G., Keller, W. J. and Pannekoek, J.(1990) “Disclosure Control of Microdata”, Journal of the American Statistical Association, Vol. 85, No. 409 pp Defays, D. and Anwar, M.N.(1998) “Masking Microdata Using Micro-Aggregation”, Journal of Official Statistics, Vol.14, No.4, pp Domingo-Ferrer, J. and Mateo-Sanz, J. M.(2002) ”Practical Data-oriented Microaggregation for Statistical Disclosure Control”, IEEE Transactions on Knowledge and Data Engineering, vol.14, no.1, pp Höhne(2003) “SAFE- A Method for Statistical Disclosure Limitation of Microdata”, Paper presented at Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxembourg, pp Ito, S., Isobe, S., Akiyama, H.(2008) “A Study on Effectiveness of Microaggregation as Disclosure Avoidance Methods: Based on National Survey of Family Income and Expenditure”, NSTAC Working Paper, No.10, pp (in Japanese). 6.Ito, S.(2009) “On Microaggregation as Disclosure Avoidance Methods”, Journal of Economics, Kumamoto Gakuen University, Vol.15, No.3 ・ 4, pp (in Japanese) 7.Makita, N., Ito, S., Horikawa, A., Goto, T., Yamaguchi, K. (2013) “Development of Synthetic Microdata for Educational Use in Japan”, Paper Presented at 2013 Joint IASE / IAOS Satellite Conference, Macau Tower, Macau, China, pp.1-9.