Eurostat On the use of data mining for imputation Pilar Rey del Castillo, EUROSTAT.

Presentation transcript:


Outline
- Imputations to solve non-response in surveys; new problems for mass imputations
- State of the art: model-based imputations => MI
- Introduce data mining methods (for continuous data)
- Compare results in a simulation exercise following different criteria
- Raise questions on mass imputation (should data mining methods be considered?)

Imputations to solve non-response
- Replace each missing value with an estimate
- Current problems in sample surveys:
  - Small area estimation -> provide values for non-sampled units
  - Statistical matching -> provide joint statistical information based on 2 or more sources
- => A complete data set providing a basis for consistent analysis? ...
- => Mass imputation as a possible solution
- Model-based procedures making inferences based on the posterior distribution
- => Multiple Imputation (MI) (suited for computing variances)

Multiple Imputation
- Stages: Imputation -> Analysis -> Combination
- Data flow: Incomplete data -> Imputed data -> Statistics -> Combined statistic
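The combination stage is commonly carried out with Rubin's rules, which the slide itself does not spell out. A minimal Python sketch, assuming each of the m imputed data sets has already produced a point estimate and its variance (the function name and the toy numbers are illustrative, not taken from the presentation):

```python
from statistics import mean, variance

def combine_rubin(estimates, variances):
    """Pool m point estimates and their within-imputation variances."""
    m = len(estimates)
    q_bar = mean(estimates)        # combined point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Three hypothetical imputed data sets, each giving a mean and its variance.
estimate, total_variance = combine_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The between-imputation term is what lets MI reflect the extra uncertainty due to the missing values, which is why the earlier slide describes MI as suited for computing variances.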

Simulation exercise
- EU-SILC 2009: microdata on income, poverty, social exclusion and living conditions (Spain, Austria)
- Numerical variable to be imputed: Wages
- Covariates (15): gender, age, country of birth, marital status, region, degree of urbanisation of residential area, economic activity, highest level of education, managerial position, occupation, temporary job, part-time job, hours usually worked per week, years of education, and years in main job
- Methods to be compared:
  - Least Median Squared Error Regressor (LMS)
  - M5P algorithm (M5P)
  - Multilayer Perceptron Regressor (MLP)
  - Radial Basis Function (RBF)
  - Regression (REG)
  - Predictive Mean Matching (PMM)

Least Median Squared Error Regressor (LMS)
- Outliers affect classical least-squares (LS) linear regression: the squared distance accentuates the influence of points far away from the regression line
- More robust: minimise the median of the squared differences from the regression line
- Standard linear regression; the solution with the smallest median squared error is chosen
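The idea can be sketched with a random search: fit many candidate lines and keep the one with the smallest median squared residual. The Python below is a simplified illustration with a single covariate, where each candidate is the line through two randomly chosen points; the function name, the number of trials and the toy data are assumptions, not the procedure used in the study.

```python
import random
from statistics import median

def lms_line(x, y, trials=500, seed=0):
    """Return (intercept, slope) of the candidate line with the smallest median squared residual."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        i, j = rng.sample(range(len(x)), 2)
        if x[i] == x[j]:
            continue  # a vertical line cannot be used as a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = median((yk - (intercept + slope * xk)) ** 2 for xk, yk in zip(x, y))
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best[1], best[2]

x = list(range(10))
y = [2.0 * v + 1.0 for v in x]
y[9] = 100.0  # one gross outlier
intercept, slope = lms_line(x, y)
```

Because the criterion is a median, the single outlier does not pull the fitted line away from the other nine points, whereas a least-squares fit would be tilted towards it.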

M5P algorithm (M5P)
- Decision tree: supervised classifier which uses a tree-like graph or model of decisions and their possible consequences (decision nodes, leaves...)
- Model tree: for continuous variables, with a linear regression model at each leaf
- Reconstruction of Quinlan's algorithms
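To make the notion of a model tree concrete, here is a deliberately tiny Python sketch: a tree with a single split on one covariate, chosen by standard deviation reduction, and a least-squares line in each of the two leaves. It assumes at least one valid split exists, and it leaves out the pruning and smoothing steps of the full algorithm; all names and data are illustrative.

```python
from statistics import pstdev

def fit_line(x, y):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return my - slope * mx, slope

def model_stump(x, y, min_leaf=2):
    """One-split model tree: pick the cut with the largest standard deviation reduction."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    best_cut, best_sdr = None, float("-inf")
    for i in range(min_leaf, n - min_leaf + 1):
        if xs[i - 1] == xs[i]:
            continue  # cannot split between equal covariate values
        left, right = ys[:i], ys[i:]
        sdr = pstdev(ys) - len(left) / n * pstdev(left) - len(right) / n * pstdev(right)
        if sdr > best_sdr:
            best_sdr, best_cut = sdr, (xs[i - 1] + xs[i]) / 2
    left_pairs = [(a, b) for a, b in pairs if a <= best_cut]
    right_pairs = [(a, b) for a, b in pairs if a > best_cut]
    models = [fit_line(*zip(*left_pairs)), fit_line(*zip(*right_pairs))]

    def predict(v):
        intercept, slope = models[0] if v <= best_cut else models[1]
        return intercept + slope * v

    return best_cut, predict

x = list(range(10))
y = [float(v) if v < 5 else 30.0 + 2.0 * v for v in x]  # two linear regimes
cut, predict = model_stump(x, y)
```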

Multilayer Perceptron Regressor (MLP)
- Neural networks: based on the structure of the brain; learning by adjusting connections
- MLP: feed-forward network with 1 hidden layer
- Delta rule as learning algorithm: $\Delta w_{ij} = -\eta \, \partial E(w_{ij}) / \partial w_{ij}$
- Logistic function as transfer function: $f(x) = 1 / (1 + e^{-x})$
- Output layer: 1 node with linear activation
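A minimal Python sketch of such a network follows: one hidden layer of logistic units, a single linear output node, and gradient-descent weight updates of the form "weight change = -eta times the gradient of the error". The single input variable, the squared-error loss, the per-observation updates and all hyperparameters are simplifying assumptions made for illustration.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_mlp(xs, ys, hidden=3, eta=0.1, epochs=2000, seed=0):
    """Train a 1-input, 1-hidden-layer network; return the mean loss per epoch."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden                                   # hidden biases
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0                                              # output bias
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(xs, ys):
            h = [logistic(w1[j] * x + b1[j]) for j in range(hidden)]
            out = sum(w2[j] * h[j] for j in range(hidden)) + b2  # linear output node
            err = out - y
            total += 0.5 * err * err
            for j in range(hidden):
                # Gradient through the logistic unit, using the weight before its update.
                grad_hidden = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= eta * err * h[j]
                w1[j] -= eta * grad_hidden * x
                b1[j] -= eta * grad_hidden
            b2 -= eta * err
        losses.append(total / len(xs))
    return losses

xs = [i / 10 for i in range(11)]
ys = [v * v for v in xs]  # toy target: y = x^2
losses = train_mlp(xs, ys)
```

The factor `h[j] * (1.0 - h[j])` is the derivative of the logistic transfer function, which is what makes the update for the hidden-layer weights easy to compute.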

Radial Basis Function (RBF)
- Neural network similar to MLP
- Differs in the way the hidden layer performs its computations
- Activation for an input depends on its distance to the hidden unit
- Parameters to be learnt: weights + centres
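The distance-based activation can be illustrated with a forward pass. The Python sketch below assumes Gaussian basis functions with given widths, which the slide does not specify; the centres, widths and weights are made-up values rather than learnt ones.

```python
import math

def rbf_forward(x, centres, widths, weights, bias=0.0):
    """Return the network output and the hidden-unit activations for one input vector."""
    activations = []
    for centre, width in zip(centres, widths):
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        activations.append(math.exp(-dist2 / (2.0 * width * width)))
    output = bias + sum(w * a for w, a in zip(weights, activations))
    return output, activations

# The input coincides with the first centre and lies far from the second one.
output, activations = rbf_forward(
    [0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]], [1.0, 1.0], [2.0, 5.0]
)
```

A hidden unit responds most strongly to inputs near its centre, so here the first unit is fully active and the second contributes almost nothing to the output.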

Regression (REG)
- Regression forecast for each input of covariates, from a regression estimated using the training set
- Categorical variables are treated by constructing appropriate dummy variables for each category
- Baseline for comparisons
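Dummy coding of a categorical covariate can be sketched as follows in Python. Dropping the first level as the reference category is an assumption made here to avoid collinearity with the intercept; the category labels are invented.

```python
def dummy_code(values, drop_first=True):
    """Return the retained levels and one row of 0/1 indicators per observation."""
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows

levels, rows = dummy_code(["rural", "urban", "rural", "suburban"])
```

Each retained level becomes one 0/1 column in the regression design matrix; observations in the reference category have zeros in every column.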

Predictive Mean Matching (PMM)
- Similar to regression
- For each missing value, imputes a value randomly chosen from the set of observed values whose predicted values are closest to the forecast obtained by the regression model
- Identified as providing the best imputations
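A simplified Python sketch of this matching step is given below, with one covariate and k candidate donors per missing value. The choice k = 3, the function names and the data are assumptions; implementations used for multiple imputation usually also draw the regression parameters at random, which is omitted here.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def pmm_impute(x_obs, y_obs, x_mis, k=3, seed=0):
    """Impute each missing y with an observed y drawn from the k closest predictions."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x_obs, y_obs)
    pred_obs = [intercept + slope * v for v in x_obs]
    imputed = []
    for v in x_mis:
        target = intercept + slope * v
        # Rank observed units by how close their predicted value is to the target.
        order = sorted(range(len(y_obs)), key=lambda i: abs(pred_obs[i] - target))
        imputed.append(y_obs[rng.choice(order[:k])])
    return imputed

x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_obs = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]
imputed = pmm_impute(x_obs, y_obs, x_mis=[2.5, 5.5])
```

Every imputed value is a value that was actually observed, which helps preserve the shape of the distribution but means an individual imputation may sit further from the true value than the regression forecast would.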

Data mining evaluation criteria
- Correlation coefficient
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- Relative Absolute Error (RAE)
- Root Relative Squared Error (RRSE)
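These five criteria can be computed as in the Python sketch below, using the usual definitions in which RAE and RRSE compare the imputation errors with those of always predicting the mean of the actual values. The slide gives no formulas, so these definitions are an assumption; RAE and RRSE are returned as ratios rather than percentages.

```python
import math

def evaluation_criteria(actual, predicted):
    """Return correlation, MAE, RMSE, RAE and RRSE for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    base_abs = sum(abs(a - mean_a) for a in actual)  # error of the mean-only predictor
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / base_abs,
        "RRSE": math.sqrt(sq_err / base_sq),
    }

scores = evaluation_criteria([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

All five measures compare imputed and true values unit by unit, so they reward closeness to individual values rather than closeness of the overall distribution.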

Table columns: COUNTRY, METHOD, Correlation, MAE, RMSE, RAE, RRSE. Rows: ES and AT, each with LMS, M5P, MLP, PMM, RBF, REG (values not included in the transcript).

Statistical inference evaluation criteria
- Estimates of the mean and other parameters
- Similarity between the original distribution and the distribution with imputed values
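The similarity measures reported later in the deck are the Hellinger and Kolmogorov-Smirnov distances. A minimal Python sketch of both for two samples follows; the histogram-based Hellinger distance with 10 equal-width bins is an assumption about how the distributions were discretised, and the function names are illustrative.

```python
import math
from bisect import bisect_right

def hellinger(sample_a, sample_b, bins=10):
    """Hellinger distance between two samples, using histograms on common bins."""
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def hist(sample):
        counts = [0] * bins
        for v in sample:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(sample) for c in counts]

    p, q = hist(sample_a), hist(sample_b)
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)

def ks_distance(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )

same = [1.0, 2.0, 3.0]
far = [10.0, 11.0, 12.0]
h_same, ks_same = hellinger(same, same), ks_distance(same, same)
h_far, ks_far = hellinger(same, far), ks_distance(same, far)
```

Both distances range from 0 (identical distributions) to 1 (no overlap) and ignore which unit received which value, in contrast to the unit-by-unit data mining criteria.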

Table columns: COUNTRY, METHOD, Mean, Mode, Median, STD. Rows: ES and AT, each with ORIGINAL, LMS, M5P, MLP, PMM, RBF, REG (values not included in the transcript).

Figure: imputation errors for the original Wages variable in one of the simulated files, using the M5P imputation method. Shrinkage to the mean!
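One common explanation of this shrinkage is that deterministic predictions vary less than the values they replace: for a least-squares fit with an intercept, the variance of the fitted values equals R squared times the variance of the observed values. The Python sketch below illustrates the effect with ordinary regression on made-up data; it is not the M5P fit shown in the figure.

```python
from statistics import pvariance

def fit_line(x, y):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

x = [float(v) for v in range(10)]
# Linear trend plus an alternating +1 / -1 disturbance.
y = [v + (1.0 if i % 2 == 0 else -1.0) for i, v in enumerate(x)]
intercept, slope = fit_line(x, y)
fitted = [intercept + slope * v for v in x]
var_observed = pvariance(y)
var_fitted = pvariance(fitted)
```

Here `var_fitted` is smaller than `var_observed`, so replacing values by their predictions pulls the data towards the centre of the distribution.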

Table columns: Country, Method, Hellinger distance, Kolmogorov-Smirnov distance. Rows: ES and AT, each with LMS, M5P, MLP, PMM, RBF, REG (values not included in the transcript).

Figure: histograms of the Log(wages) variable.

But...
- When the purpose is obtaining complete files free of missing data...
- What happens with the results at a more detailed level of disaggregation?
- Do the comparative advantages and disadvantages remain?

Example (region of Extremadura in Spain) (1)
Table columns: METHOD, Correlation, MAE, RMSE, RAE, RRSE. Rows: LMS, M5P, MLP, PMM, RBF, REG (values not included in the transcript).

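Example (region of Extremadura in Spain) (2)
Table columns: METHOD, Mean, Mode, Median, STD. Rows: LMS, M5P, MLP, ORI, PMM, RBF, REG (values not included in the transcript).
Table columns: METHOD, Hellinger distance, Kolmogorov-Smirnov distance. Rows: LMS, M5P, MLP, PMM, RBF, REG (values not included in the transcript).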

Thus...
- Results at a more detailed level of disaggregation can be reversed!

Final remarks (1)
- Data mining procedures provide imputations which reproduce the original individual values significantly better
- PMM produces significantly better estimates of means and other statistical parameters for the whole population
- Imputations by regression are slightly worse than those of data mining procedures

Final remarks (2)
- Paradoxical result: given an original distribution,
  - one imputed population has more similar individual values
  - another imputed population has more similar distribution parameters
- PMM produces random imputations (from regressions) designed to improve estimates, at the cost of closeness to individual values!
- Different possibilities to improve data mining imputations
- Might it be worth also considering individual one-to-one likeness when assessing similarities between distributions?
- Maybe valid inference in the era of data integration, data matching, small area estimation... should be another thing?

Thanks for your attention!

Donald B. Rubin, "Multiple Imputation After 18+ Years", JASA, vol. 91, no. 434, June 1996:
"...Judging the quality of missing data procedures by their ability to recreate the individual missing values (according to hit-rate, mean square error, etc.) does not lead to choosing procedures that result in valid inference, which is our objective"