24-26 September 2012, UNECE Conference of European Statisticians, Work Session on Statistical Data Editing
Use of Machine Learning Methods to Impute Categorical Data
Pilar Rey del Castillo, EUROSTAT, Unit B1: Quality, Research and Methodology

Problem
‒ Non-response in statistical surveys corresponds to missing information in machine learning
‒ The two fields differ in their approaches and in their evaluation criteria
Aim: to show that the commitment to the almost exclusive use of probabilistic data models prevents statisticians from using the most convenient technologies
Case of categorical variables: the practical recommendations from the statistical approach simply reuse procedures designed for numeric variables

Outline of the presentation
1. Review of non-response treatments → imputation procedures: evaluation criteria
2. Recommendations for categorical data imputation from the statistical community: why these are not appropriate
3. Results of comparisons with two machine learning methods
4. Final remarks

Non-response treatments
‒ Deletion procedures: use only the units with complete data for further analysis
‒ Tolerance procedures: handle missing values internally, neither removing incomplete records nor completing them
‒ Imputation procedures: replace each missing value by an estimate
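The difference between deletion and imputation can be sketched on a toy file. This is only an illustration: the records, the field names and the mode-based imputer are assumptions made for the example, not part of the presentation.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing answer.
records = [
    {"age": 34, "vote": "A"},
    {"age": 51, "vote": None},
    {"age": 28, "vote": "B"},
    {"age": 45, "vote": "A"},
]

def complete_cases(rows):
    """Deletion: keep only the units with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mode(rows, key):
    """Imputation: replace each missing value of `key` by the observed mode."""
    observed = [r[key] for r in rows if r[key] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{key: mode if r[key] is None else r[key]}) for r in rows]

deleted = complete_cases(records)      # one unit is lost
imputed = impute_mode(records, "vote") # all units kept, one value estimated
```

Deletion shrinks the file, while imputation keeps every unit at the cost of introducing estimated values.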

Imputation procedures
‒ Algorithmic methods: use an algorithm to produce the imputations (cold and hot-deck, nearest-neighbour, mean, machine learning classification & prediction techniques…)
‒ Model-based methods: the predictive distributions have a formal statistical model → state of the art: MI
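As a minimal sketch of one algorithmic method, a nearest-neighbour hot-deck copies the missing value from the most similar complete record. The donor records, the variable names and the squared Euclidean distance are assumptions chosen for illustration.

```python
def nn_hot_deck(donors, recipient, features, target):
    """Copy `target` from the donor closest to `recipient` on `features`."""
    def dist(donor):
        # Squared Euclidean distance on the numeric matching variables.
        return sum((donor[f] - recipient[f]) ** 2 for f in features)
    nearest = min(donors, key=dist)
    return nearest[target]

# Hypothetical complete records (donors) and one incomplete record.
donors = [
    {"ideology": 2, "likelihood": 9, "vote": "Party A"},
    {"ideology": 8, "likelihood": 7, "vote": "Party B"},
    {"ideology": 5, "likelihood": 1, "vote": "abstention"},
]
recipient = {"ideology": 7, "likelihood": 8, "vote": None}

imputed_vote = nn_hot_deck(donors, recipient, ["ideology", "likelihood"], "vote")
```

No distributional model is specified anywhere in this procedure, which is what separates it from the model-based family.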

Criteria for evaluating the imputation results
‒ Statistical surveys: the goal is valid and efficient inference, with the missing-data treatment regarded as part of the overall procedure
  "… Judging the quality of missing data procedures by their ability to recreate the individual missing values (according to hit-rate, mean square error, etc.) does not lead to choosing procedures that result in valid inference, which is our objective" (Rubin, 1996)
‒ Machine learning: general artificial intelligence framework, with empirical results obtained by simulating missing data and measuring the closeness between real and imputed values
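The two closeness measures named in the quotation can be written down directly. The sketch below assumes the true values are known because the missingness was simulated; the data are made up for the example.

```python
def hit_rate(true_values, imputed_values):
    """Share of imputed categorical values that equal the true ones."""
    matches = sum(t == i for t, i in zip(true_values, imputed_values))
    return matches / len(true_values)

def mean_square_error(true_values, imputed_values):
    """Average squared gap between true and imputed numeric values."""
    gaps = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return sum(gaps) / len(gaps)

# Hypothetical values that were deleted on purpose and then imputed.
true_votes = ["A", "B", "A", "C", "B"]
imputed_votes = ["A", "B", "C", "C", "A"]
rate = hit_rate(true_votes, imputed_votes)

true_scores = [4.0, 7.0, 5.0]
imputed_scores = [5.0, 7.0, 3.0]
mse = mean_square_error(true_scores, imputed_scores)
```

Both measures score individual values, which is exactly the kind of criterion the survey-inference view considers insufficient on its own.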

Categorical data imputation in statistical surveys
State of the art: MI or other model-based methods
‒ Log-linear model: not always possible
‒ Logistic regression models: sometimes problems at the estimation step
‒ Binary case: Rubin & Schenker (1986), Schafer (1997): approximate by using a Gaussian distribution
‒ Non-binary case: Yucel & Zaslavsky (2003), Van Ginkel et al. (2007): rounding a multivariate normal distribution
Criticisms from the practical perspective: Horton et al. (2003), Ake (2005), Allison (2006), Demirtas (2008)
Contradiction: the theoretical framework focuses on model adequacy, whereas the practical recommendations rely on models that are clearly not adequate
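Why rounding a Gaussian approximation draws criticism can be seen in a deliberately simplified calculation. The sketch assumes a single binary variable with true proportion p, imputed from a normal distribution with the same mean and variance and then rounded at 0.5; it ignores covariates and parameter uncertainty, so it is not the procedure of any of the cited papers.

```python
from statistics import NormalDist

def rounded_normal_proportion(p):
    """Share of 1s after imputing a Bernoulli(p) variable from a normal
    with matching mean and variance and rounding the draws at 0.5."""
    approx = NormalDist(mu=p, sigma=(p * (1 - p)) ** 0.5)
    return 1 - approx.cdf(0.5)

# Compare the true proportion with the proportion implied by rounding.
for p in (0.5, 0.2, 0.05):
    print(p, round(rounded_normal_proportion(p), 4))
```

Under these assumptions the rounded proportion matches p only at 0.5 and drifts away from it elsewhere, which is the kind of bias the practical criticisms point to.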

Problem of categorical data imputation to be solved
Survey microdata file: opinion poll (no. 2750 in CIS catalogue)
‒ Quantitative variables (8): ideological self-location; rating of three specific political figures; likelihood to vote; likelihood to vote for three specific political parties…
‒ Ordered categorical variables (2): government and opposition party ratings (converted to quantitative)
‒ Categorical variables with non-ordered categories (7): voting intention; voting memory; the autonomous community; the political party the respondent would prefer to see win…
Voting intention to be imputed: 11 categories (biggest political parties, "blank vote", "abstention", "others")
interviews with no missing values

Imputation methods to be compared
MI logistic regression
Classifiers (matching each class with one of the Voting intention categories)
‒ Fuzzy min-max neural network classifier, recently extended to deal with mixed numeric & categorical data as inputs (Rey del Castillo & Cardeñosa, 2012)
‒ Bayesian network classifier: not a Naïve Bayes classifier but a more complex architecture learnt with a score + search paradigm

Comparison criterion
‒ The classical survey-inference criterion cannot be applied because the classifiers have no underlying model
‒ EUREDIT project: Wald statistic for categorical variables, but none of the methods passes the proposed test
‒ The correctly imputed rate is used instead (ten-fold cross-validation)
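A correctly imputed rate under ten-fold cross-validation can be sketched as follows. The fold assignment, the toy data and the majority-class imputer are assumptions standing in for the real classifiers, which are not reproduced here.

```python
from collections import Counter

def k_fold_correct_rate(rows, target, fit, k=10):
    """Average share of correctly imputed `target` values over k folds."""
    rates = []
    for fold in range(k):
        # Every k-th record is held out; the rest is used for fitting.
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        predict = fit(train, target)
        hits = sum(predict(r) == r[target] for r in test)
        rates.append(hits / len(test))
    return sum(rates) / k

def fit_majority(train, target):
    """Hypothetical baseline: always impute the most frequent category."""
    mode = Counter(r[target] for r in train).most_common(1)[0][0]
    return lambda row: mode

# Toy file: 40 complete interviews, one in four answering "B".
rows = [{"vote": "A" if i % 4 else "B"} for i in range(40)]
rate = k_fold_correct_rate(rows, "vote", fit_majority)
```

Any imputer that exposes the same `fit(train, target)` interface could be plugged in, so the methods are scored on identical held-out records.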

Results of the comparison

Imputation method                          Correctly imputed rate (%)
MI logistic regression                     66.0
Fuzzy min-max neural network classifier    86.1
Bayesian network classifier                87.4

Conclusions & final remarks
1. Always similar differences between the machine learning methods and MI logistic regression
2. Simplest case, with missing data exclusively on one variable
3. Extensible to numeric variables?
4. Machine learning procedures are easier to automate
   ‒ Non-dependence on model assumptions
   ‒ Don't break down with a large number of variables?
   ‒ More robust to outliers?
5. Machine learning may be used for massive imputation tasks

Thank you!

References (1)
Ake, C. F. (2005), Rounding After Multiple Imputation with Non-Binary Categorical Covariates, SAS Conference Proceedings: SAS User Group International 30, Philadelphia, PA, April.
Allison, P. (2006), Multiple Imputation of Categorical Variables under the Multivariate Normal Model, paper presented at the Annual Meeting of the American Sociological Association, Montreal Convention Center, Montreal, Quebec, Canada, August.
Demirtas, H. (2008), On Imputing Continuous Data When the Eventual Interest Pertains to Ordinalized Outcomes Via Threshold Concept, Computational Statistics and Data Analysis, vol. 52.
Horton, N. J., Lipsitz, S. R. and Parzen, M. (2003), A Potential for Bias when Rounding in Multiple Imputation, The American Statistician, vol. 57, no. 4, November.
Rey-del-Castillo, P. and Cardeñosa, J. (2012), Fuzzy Min–Max Neural Networks for Categorical Data: Application to Missing Data Imputation, Neural Computing and Applications, vol. 21, no. 6, DOI /s00521-011-0574-x, Springer-Verlag London.
Rubin, D. B. (1996), Multiple Imputation After 18+ Years, Journal of the American Statistical Association, vol. 91, no. 434, Applications and Case Studies, June 1996.

References (2)
Rubin, D. B. and Schenker, N. (1986), Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse, Journal of the American Statistical Association, vol. 81, no. 394, Survey Research Methods, June.
Schafer, J. L. and Graham, J. W. (2002), Missing Data: Our View of the State of the Art, Psychological Methods, vol. 7, no. 2.
Van Ginkel, J. R., Van der Ark, L. A. and Sijtsma, K. (2007), Multiple Imputation of Item Scores when Test Data are Factorially Complex, British Journal of Mathematical and Statistical Psychology, vol. 60.
Yucel, R. M. and Zaslavsky, A. M. (2003), Practical Suggestions on Rounding in Multiple Imputation, Proceedings of the Joint American Statistical Association Meeting, Section on Survey Research Methods, Toronto, Canada, August 2003.