Q2008 - ROME, 09-11 JULY 2008
Implementation and evaluation of imputation strategies to improve the data accuracy
The case of Italian students data from the Programme for International Student Assessment (PISA 2003)

Presentation transcript:

Q2008 - ROME, 09-11 JULY 2008
Implementation and evaluation of imputation strategies to improve the data accuracy
The case of Italian students data from the Programme for International Student Assessment (PISA 2003)
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
University of Naples “Parthenope”

OUTLINE
Improving the accuracy of the Italian data from the OECD's “Programme for International Student Assessment” (PISA 2003) by developing imputation strategies that reduce the non-sampling error caused by partial non-response.

PISA 2003
The OECD's “Programme for International Student Assessment” (PISA) survey is an internationally standardised assessment administered to 15-year-old students.
The survey involves:
students ([...] in Italy)
schools (406 in Italy)
41 countries (20 of them European Union members)

PISA 2003
The survey assesses students' competencies in three areas:
Reading literacy
Scientific literacy
Mathematical literacy

AVAILABLE DATA
The OECD collects data on:
the family environment of the student (STUDENT DATASET)
school characteristics (SCHOOL DATASET)

Multilevel (school and student) model with 4 covariates
ITALY: 8% of student units are excluded because one or more student or school variables are missing.

Multilevel (school and student) model with 29 covariates
ITALY: 81% of student units are excluded because one or more student or school variables are missing.
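The jump in excluded units when more covariates enter the model follows from complete-case analysis: a unit is dropped as soon as any one of the model's variables is missing. A minimal sketch of that count is given here; the function name, the variable names and the toy records are hypothetical and are not taken from the PISA data.

```python
def excluded_share(records, covariates):
    """Share of records with at least one missing (None) value among `covariates`."""
    dropped = sum(any(rec[c] is None for c in covariates) for rec in records)
    return dropped / len(records)


# Hypothetical toy records; None marks a missing answer.
records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": None, "c": 3},
    {"a": 1, "b": 2, "c": None},
    {"a": 1, "b": 2, "c": 3},
]

share_small = excluded_share(records, ["a"])            # model with one covariate
share_large = excluded_share(records, ["a", "b", "c"])  # model with three covariates
```

With these toy records no unit is lost when only `a` is used, while half of the units are lost once `b` and `c` are added.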

STEPS OF ANALYSIS
Missing data pattern
Imputation strategies
Evaluation of results

OECD'S PISA DATASET: TWO SUBSETS OF VARIABLES
COLLECTED VARIABLES: data collected through the student and school questionnaires.
DERIVED VARIABLES: computed from the collected variables (by linear combination or factor analysis). They increase the analytical potential of the survey.

EXAMPLES OF DERIVED VARIABLES
The PISA 2003 index of confidence in ICT internet tasks is derived from students' responses to five items. All items are inverted for IRT scaling, and positive values on this index indicate high self-confidence in ICT internet tasks.
The PISA 2003 index of school size (SCHLSIZE) is derived by summing school principals' responses on the number of girls and boys at the school.
The PISA 2003 index of availability of computers (RATCOMP) is derived from school principals' responses to the items measuring the availability of computers. It is calculated by dividing the number of computers at the school by the number of students at the school.
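The two school-level indices can be sketched in a few lines. This is only an illustration of the arithmetic described for SCHLSIZE and RATCOMP: the function names, the argument names and the rule that a missing input yields a missing index are assumptions, not the OECD's implementation.

```python
def school_size(n_girls, n_boys):
    """SCHLSIZE-style index: girls plus boys; missing if either count is missing (assumed rule)."""
    if n_girls is None or n_boys is None:
        return None
    return n_girls + n_boys


def computer_ratio(n_computers, size):
    """RATCOMP-style index: computers per student; missing if it cannot be computed (assumed rule)."""
    if n_computers is None or size is None or size == 0:
        return None
    return n_computers / size


size = school_size(310, 290)      # hypothetical school: 600 students
ratio = computer_ratio(60, size)  # 60 computers for 600 students
no_ratio = computer_ratio(60, school_size(None, 290))
```

The last call shows why derived variables inherit missing data: one unanswered item in the questionnaire is enough to leave the derived index undefined.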

Distribution of variables classified as “collected” and “derived” in the school and student datasets of PISA 2003
Table columns: SCHOOL DATASET, STUDENT DATASET
Table rows: “COLLECTED” VARIABLES, “DERIVED” VARIABLES, TOTAL OF “COLLECTED” AND “DERIVED” VARIABLES

STUDENTS' DATASET: VARIABLE TYPOLOGY
Categorical variables                            n. 197   91,7%
Continuous variables                             n. 15    6,9%
Discrete variables                               n. 3     1,4%
Total of collected variables at student level    n. 215   100%

Variables with > 5% of missing data              n. 39    18%
Variables without missing data                   n. 3     1,5%

FIVE IMPUTATION PROCEDURES
PROCEDURE A: iterative and sequential multiple regression applied to each section of the student questionnaire
PROCEDURE B: iterative and sequential multiple regression applied to imputation classes computed by a regression tree
PROCEDURE C: random selection of donors within imputation classes computed by a regression tree
PROCEDURE D: random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
PROCEDURE E: iterative and sequential multiple regression applied to the whole dataset

USUAL ASSOCIATIONS AND ANTINOMIES OF THE ADOPTED IMPUTATION PROCEDURES (A-E)
All procedures belong to well-known categories.
Two categories are involved: regression methods (A, B, E) and donor methods (C, D).
Dimension of the treated data matrix: the imputation procedure is (A, D) or is not (B, C, E) applied to each section of the questionnaire.
Two sides of the data matrix are involved: units (Classification And Regression Tree: B, C, D) and variables (A, D).
The missing data mechanism is (A, E) or is not (B, C, D) considered.

PROCEDURE A
Iterative and sequential multiple regression (Raghunathan et al. 2001) on each section of the student questionnaire.
The data matrix is partitioned into the seven sections of the student questionnaire.
Features of each section, as a partition of the data matrix:
strong logical links between the questions
homogeneous structure of association and relationship
homogeneous presence of missing data

PROCEDURE A
Table columns: subset, section of the questionnaire, categorical variables, continuous variables, discrete variables, variables, average of missing data for each section.
Sections of the questionnaire:
Family context (Sect. B)
Educational level (Sect. C)
School context (Sect. D)
Learning mathematics (Sect. E)
Mathematics classes (Sect. F)
ICT confidence (Sect. ICT)
Educational career (Sect. EC)

PROCEDURE B
Iterative and sequential multiple regression applied to imputation classes computed by a regression tree.
STEP I, UNITS CLASSIFICATION: computation of a regression tree (14 terminal nodes).
DEPENDENT VARIABLE: number of missing data for each student.
PREDICTORS: selected from five categories of derived indicators:
family background
scholastic context
approach to study
attitudes toward ICT instruments
performance scores
STEP II, IMPUTATION: each terminal node of the tree is treated as an imputation class, and its missing values are imputed by the iterative and sequential regression model (Raghunathan et al. 2001).

PROCEDURE C
Random selection of donors within imputation classes computed by a regression tree.
STEP I, UNITS CLASSIFICATION: computation of a regression tree (14 terminal nodes).
DEPENDENT VARIABLE: number of missing data for each student.
PREDICTORS: selected from five categories of derived indicators:
family background
scholastic context
approach to study
attitudes toward ICT instruments
performance scores
STEP II, IMPUTATION: a different donor is selected to impute each missing value of each student. The donor is selected randomly from the same node.
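The donor step of Procedure C can be illustrated with a small sketch. It assumes the imputation classes (terminal nodes) have already been assigned to the students. The function name, the data layout (a list of dictionaries with `None` for missing answers) and the choice to leave a value missing when its class holds no donor are all assumptions made for the example.

```python
import random


def donor_impute(records, class_of, rng):
    """Fill each missing value with the value of a donor drawn at random
    from the same imputation class; one draw per missing value."""
    imputed = [dict(rec) for rec in records]  # never modify the input
    for i, rec in enumerate(records):
        for var, value in rec.items():
            if value is not None:
                continue
            # Donors: other units of the same class with an observed value.
            donors = [
                other[var]
                for k, other in enumerate(records)
                if k != i and class_of[k] == class_of[i] and other[var] is not None
            ]
            if donors:  # assumed rule: stay missing when no donor exists
                imputed[i][var] = rng.choice(donors)
    return imputed


# Hypothetical toy data: five students in two imputation classes.
records = [
    {"x": 1, "y": "a"},
    {"x": None, "y": "a"},
    {"x": 3, "y": None},
    {"x": 10, "y": "b"},
    {"x": None, "y": "b"},
]
class_of = [0, 0, 0, 1, 1]
result = donor_impute(records, class_of, random.Random(0))
```

Because a donor is drawn separately for every missing value, two missing answers of the same student may come from two different donors, which is the behaviour the slide describes.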

PROCEDURE D
Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire.
STEP I, MATRIX PARTITION: the data matrix is partitioned into the seven sections of the student questionnaire.
STEP II, UNITS CLASSIFICATION: a regression tree is produced within each partition of the matrix (see the next slide).
STEP III, IMPUTATION: within every leaf, a different donor is selected to impute each missing value of each student. The donor is selected randomly from the same node.

PROCEDURE D: REGRESSION TREES FOR EACH MATRIX PARTITION
Dependent variable in every tree: number of missing data for each record (student).

Section B (12 terminal nodes). Predictors: ISCO code (mother), disciplinary climate in maths lessons, mathematics self-efficacy, computer facilities at home, plausible value in problem solving.
Section C (11 terminal nodes). Predictors: expected educational level of student (ISCED), mathematics anxiety, mathematics self-concept, ICT: confidence in routine tasks, plausible value in math.
Section E (7 terminal nodes). Predictors: home educational resources, mathematics anxiety, mathematics self-concept, ICT: confidence in routine tasks, plausible value in problem solving.
Section IC (16 terminal nodes). Predictors: index of Socio-Economic and Cultural Status, mathematics anxiety, mathematics self-efficacy, computer facilities at home, plausible value in problem solving.
Sections D+F+EC (12 terminal nodes). Predictors: expected educational level of student (ISCED), disciplinary climate in maths lessons, mathematics self-efficacy, computer facilities at home, plausible value in problem solving.

PROCEDURE E
Iterative and sequential multiple regression (Raghunathan et al. 2001) on the whole dataset (without any partition of units and variables).

METHODOLOGICAL DETAILS OF THE IMPUTATION PROCEDURES

CLASSIFICATION AND REGRESSION TREE: CREATING IMPUTATION CELLS
Classification And Regression Tree creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable (Y) based on values of independent (predictor) variables (X).
The classification is obtained through the recursive binary partition of the measurement space. The resulting subgroups (NODES), internally homogeneous with respect to the values of the target variable, correspond to the imputation cells.
Diagram labels: PARENT NODE, CHILD NODE, TERMINAL NODE.

STRUCTURE OF A REGRESSION TREE
Example: a tree T composed of five nodes t_i, i = 1, 2, 3, 4, 5 (diagram nodes: t_1, t_2, t_3, t_4, t_5).
Impurity of a node t: [...]
For any split s of t into t_L and t_R, the best split s* is such that: [...]
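The impurity and best-split formulas are not preserved in this transcript. As an illustration only, the sketch below uses the usual least-squares criterion for regression trees: the impurity of a node is the mean squared deviation of the target from the node mean, and the best split is the one with the largest impurity decrease, weighted by the share of cases sent to each child. Whether this matches the authors' formulas exactly is an assumption, and the function names and toy data are hypothetical.

```python
def impurity(ys):
    """Mean squared deviation of the target values from the node mean."""
    node_mean = sum(ys) / len(ys)
    return sum((y - node_mean) ** 2 for y in ys) / len(ys)


def best_split(xs, ys):
    """Best binary split `x <= threshold` of one numeric predictor.

    Returns (threshold, impurity_decrease); threshold is None when no
    split reduces the impurity.
    """
    n = len(ys)
    parent = impurity(ys)
    best = (None, 0.0)
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2  # midpoint between adjacent values
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        decrease = (parent
                    - len(left) / n * impurity(left)
                    - len(right) / n * impurity(right))
        if decrease > best[1]:
            best = (threshold, decrease)
    return best


# Hypothetical node: x is a derived indicator, y the number of missing
# answers of each student (the dependent variable of the trees above).
xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 1, 5, 6, 5]
threshold, decrease = best_split(xs, ys)
```

On this toy node the split separates students with few missing answers from those with many, which is how terminal nodes become homogeneous imputation cells.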

ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (1/2)
PARTITION OF THE VARIABLES: variables without missing data (X) and variables with missing data (Y).
STEP 1: the variable with the fewest missing values, Y_1, is regressed on the subset of variables without missing data, U = X.
STEP 2: U is updated by appending Y_1. The variable with the next fewest missing values, Y_2, is then regressed on U = (X, Y_1), where Y_1 has imputed values.
STEP 3 ... STEP N: each variable is imputed by using all available variables (complete or imputed).

ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (2/2)
Once all missing data have been imputed for each variable, the NEXT ROUND begins: the imputation process is repeated, modifying the predictor set to include all X and Y variables except the one used as the dependent variable.
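The sequence of steps can be sketched for continuous variables only. This is a deliberately simplified illustration: it fills missing values with deterministic linear-regression predictions, whereas the method of Raghunathan et al. draws imputations from a predictive distribution and uses different regression models for different variable types. The function names, the `rounds` parameter and the toy data are assumptions.

```python
def ols_fit(rows, targets):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Augmented system [X'X | X'y].
    a = [[sum(d[i] * d[j] for d in design) for j in range(k)]
         + [sum(d[i] * t for d, t in zip(design, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting; collinear
    # predictors make the system singular and the division fail.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def impute_variable(var, predictors, data, filled):
    """Regress `var` on `predictors` over its observed rows, predict the rest."""
    column = data[var]
    rows = [[filled[p][i] for p in predictors] for i in range(len(column))]
    observed = [i for i, v in enumerate(column) if v is not None]
    coef = ols_fit([rows[i] for i in observed], [column[i] for i in observed])
    filled[var] = [
        column[i] if column[i] is not None
        else coef[0] + sum(c * x for c, x in zip(coef[1:], rows[i]))
        for i in range(len(column))
    ]


def sequential_imputation(data, rounds=2):
    """data: dict of variable name -> list of numbers, None = missing."""
    n_missing = {v: sum(x is None for x in col) for v, col in data.items()}
    # U starts as X, the variables without missing data.
    filled = {v: list(col) for v, col in data.items() if n_missing[v] == 0}
    order = sorted((v for v in data if n_missing[v] > 0), key=n_missing.get)
    for var in order:  # first pass: fewest missing first, U grows each step
        impute_variable(var, list(filled), data, filled)
    for _ in range(rounds):  # next rounds: all other variables as predictors
        for var in order:
            impute_variable(var, [p for p in filled if p != var], data, filled)
    return filled


# Hypothetical toy data: x is complete, y1 has one gap, y2 has two.
data = {
    "x": [1, 2, 3, 4, 5, 6],
    "y1": [2, 1, None, 3, 1, 4],
    "y2": [5, None, 13, 10, None, 14],
}
completed = sequential_imputation(data)
```

Observed values are never overwritten. Only the cells that were missing in `data` are re-estimated in each round.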

EVALUATION OF IMPUTATION PROCEDURES
Impact on univariate distributions
Relationship between variables

IMPUTATION EFFECTS ON UNIVARIATE DISTRIBUTIONS
CATEGORICAL VARIABLES: absolute relative square dissimilarities index (Leti 1983). N denotes the number of categorical variables.
CONTINUOUS VARIABLES: absolute relative variation index (among means) and absolute relative variation index (among standard deviations).
The education survey data have been analysed with multilevel models.
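The formulas of these indices are not preserved in the transcript. For the two indices on continuous variables, a plausible reading of the names is the absolute difference between the post- and pre-imputation statistic, divided by the pre-imputation statistic. The sketch below implements that reading as an assumption. The function name, the use of the population standard deviation and the toy values are likewise illustrative.

```python
from statistics import mean, pstdev


def abs_rel_variation(pre, post, stat):
    """|stat(post) - stat(pre)| / |stat(pre)|, with `pre` computed on observed values only."""
    before = stat([v for v in pre if v is not None])
    after = stat(post)
    return abs(after - before) / abs(before)


# Hypothetical variable: two missing values, both imputed with 12.
pre = [10, None, 14, 12, None]
post = [10, 12, 14, 12, 12]

index_mean = abs_rel_variation(pre, post, mean)
index_sd = abs_rel_variation(pre, post, pstdev)
```

In this toy case the mean is untouched while the standard deviation shrinks by roughly a fifth, which is why the two parameters are evaluated separately.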

IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (1/2)
Variation Association Index (categorical variables): for each imputed variable Y_j, the mean difference between the pre- and post-imputation association of Y_j with the remaining n-1 categorical variables.

IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (2/2)
Variation Association Index (continuous variables): for each imputed variable Y_j, the mean difference between the pre- and post-imputation correlation of Y_j with the remaining n-1 continuous variables.
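For continuous variables the index can be sketched as follows. The exact formula is not preserved in the transcript, so several choices here are assumptions: Pearson correlation, pre-imputation correlations computed on pairwise-complete cases, and absolute differences averaged over the other variables. Function names and data are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation on the pairs where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5


def vai_continuous(pre, post, target):
    """Mean absolute change in the correlation of `target` with every other variable."""
    others = [name for name in pre if name != target]
    changes = [
        abs(pearson(pre[target], pre[o]) - pearson(post[target], post[o]))
        for o in others
    ]
    return sum(changes) / len(changes)


# Hypothetical data: y has one missing value, a and b are complete.
pre = {"y": [1, 2, None, 4, 5], "a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1]}
good = dict(pre, y=[1, 2, 3, 4, 5])   # imputation that preserves the pattern
bad = dict(pre, y=[1, 2, 10, 4, 5])   # imputation that distorts it

vai_good = vai_continuous(pre, good, "y")
vai_bad = vai_continuous(pre, bad, "y")
```

A value near zero means the imputed variable keeps the correlation structure it had before imputation, while larger values signal distortion.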

MATRIX V x P: “VARIABLES x PROCEDURES”
There are five V x P matrices, one for each evaluation index.
Example of the structure of a V x P matrix with a generic evaluation index G_j ≡ (Im_j or Iμ_j or Iσ_j or VAI_N_j or VAI_C_j) (*)

         Proc.A   Proc.B   Proc.C   Proc.D   Proc.E
Var.1    g_1a     g_1b     g_1c     g_1d     g_1e
Var.2    g_2a     g_2b     g_2c     g_2d     g_2e
...
Var.N    g_Na     g_Nb     g_Nc     g_Nd     g_Ne

(*) According to: a) the typology (categorical, etc.) of the variables (the number of variables of each typology is denoted by N); b) the type of impact, on univariate distributions or on the relationship between variables.

SCORES MATRICES
Each of the five matrices VP_G (N x 5), whose generic element is g_js, is transformed into a (0,1) score matrix S_G (N x 5) with generic element a_js:
a_js = 1 if g_js is the minimum value in row j
a_js = 0 otherwise

Matrix VP_G (N x 5)
         Proc.A   Proc.B   Proc.C   Proc.D   Proc.E
Var.1    0,6      0,4      0,5      0,1      0,7
Var.2    0,3      0,6      0,9      0,5      [...]
...
Var.N    0,4      0,2      0,8      0,6      0,5

Matrix S_G (N x 5)
         Proc.A   Proc.B   Proc.C   Proc.D   Proc.E
Var.1    0        0        0        1        0
Var.2    [...]
...
Var.N    0        1        0        0        0
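The transformation into a score matrix can be sketched directly from the rule above. The function name is hypothetical, and the input uses only the Var.1 and Var.N rows of the example matrix, with decimal points instead of decimal commas.

```python
def score_matrix(vp):
    """Turn a variables x procedures matrix of index values into 0/1 scores:
    1 where the value is the minimum of its row, 0 elsewhere."""
    scores = []
    for row in vp:
        smallest = min(row)
        scores.append([1 if g == smallest else 0 for g in row])
    return scores


# Rows Var.1 and Var.N of the example; columns are procedures A to E.
vp = [
    [0.6, 0.4, 0.5, 0.1, 0.7],
    [0.4, 0.2, 0.8, 0.6, 0.5],
]
scores = score_matrix(vp)
```

Under this rule, procedures that tie for the minimum of a row all receive a score of 1.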

BUILDING A RANKING INDICATOR (1/3)
The ranking indicators measure the relative performance of each procedure according to each evaluation index.

BUILDING A RANKING INDICATOR (2/3)
The vector of (0,1) scores extracted from the S matrix (for each procedure and for each evaluation indicator) is reduced to a scalar by summing its elements.
This sum is divided by the number of vector elements to obtain a ranking index R, whose range is [0, 1].
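In other words, the ranking index of a procedure is the share of variables on which that procedure attains the minimum value of the evaluation index. A minimal sketch, with a hypothetical function name and a hypothetical four-variable score matrix:

```python
def ranking_index(scores):
    """Column means of a 0/1 score matrix: one ranking index per procedure."""
    n_variables = len(scores)
    n_procedures = len(scores[0])
    return [
        sum(row[s] for row in scores) / n_variables
        for s in range(n_procedures)
    ]


# Hypothetical score matrix: 4 variables (rows) x 5 procedures A to E (columns).
scores = [
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
r = ranking_index(scores)
```

Here procedure D obtains the highest index because it is the best procedure on two of the four variables.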

BUILDING A RANKING INDICATOR (3/3)
The ranking indicators measure the relative performance of each procedure according to each evaluation index.
R = 0: lowest performance of the s-th procedure compared with the other procedures for the generic evaluation index G.
R = 1: highest performance of the s-th procedure compared with the other procedures for the generic evaluation index G.

FROM AN EVALUATION INDICATOR TO A RANKING INDICATOR
Table columns: evaluation index, aim, ranking index.
Aims:
evaluating the imputation impact on marginal distributions (categorical variables)
evaluating the imputation impact on marginal distributions (continuous variables)
evaluating the imputation impact on the association between continuous variables
evaluating the imputation impact on the association between categorical variables

EVALUATING THE IMPACT ON MARGINAL DISTRIBUTIONS AND ON SOME DISTRIBUTIVE PARAMETERS

Ranking based on dissimilarities index (categorical variables)
Rank   Procedure   Ranking index
I      C           0,695
II     D           0,626
III    A           0,474
IV     E           0,442
V      B           0,405

Absolute relative variation index (among means)
Rank   Procedure   Ranking index
I      C           0,538
II     E           0,308
III    D           0,077
IV     B           0,077
V      A           0,000

Absolute relative variation index (among standard deviations)
Rank   Procedure   Ranking index
I      C           0,385
II     D           0,385
III    B           0,231
IV     E           0,000
V      A           [...]

EVALUATING THE IMPUTATION IMPACT ON THE VARIABLES ASSOCIATION

Ranking based on Variation Association Index for categorical variables
Rank   Procedure   Ranking indicator
I      B           0,538
II     A           0,308
III    E           0,077
IV     D           0,077
V      C           0,000

Ranking based on Variation Association Index for continuous variables
Rank   Procedure   Ranking indicator
I      B           0,542
II     A           0,126
III    E           0,121
IV     D           0,121
V      C           0,089

CONCLUDING REMARKS
Missing data imputation is an extremely complex process.
Each method shows critical aspects.
It is important to develop a reconstruction strategy that considers some basic aspects:
the missing data pattern
the impact on the statistical distributions
the impact on the associations among variables