
Q2008 - ROME, 09-11 JULY 2008 Implementation and evaluation of imputation strategies to improve the data accuracy The case of Italian students data from.


1 Q2008 - ROME, 09-11 JULY 2008 Implementation and evaluation of imputation strategies to improve the data accuracy The case of Italian students data from the Programme for International Student Assessment (PISA 2003) Claudio Quintano, Rosalia Castellano, Sergio Longobardi University of Naples “Parthenope” claudio.quintano@uniparthenope.it; lia.castellano@uniparthenope.it sergio.longobardi@uniparthenope.it

2 OUTLINE
Improving the accuracy of Italian data from the OECD's “Programme for International Student Assessment” (PISA 2003) by developing imputation strategies to reduce the non-sampling error due to partial non-response.

3 PISA 2003
The OECD's PISA (“Programme for International Student Assessment”) survey is an internationally standardised assessment administered to 15-year-old students.
The survey involves:
- 276,165 students (11,639 in Italy)
- 10,274 schools (406 in Italy)
- 41 countries (20 European Union members)

4 PISA 2003
The survey assesses the students' competencies in three areas: reading literacy, scientific literacy and mathematical literacy.

5 AVAILABLE DATA
The OECD collects data on:
- the family environment of the student (student dataset)
- school characteristics (school dataset)

6 Multilevel (school and student) model with 4 covariates
Italy: 8% of student units are excluded because one or more student or school variables are missing.

7 Multilevel (school and student) model with 29 covariates
Italy: 81% of student units are excluded because one or more student or school variables are missing.
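The growth of the excluded share as covariates are added follows from listwise deletion: a record is dropped as soon as any covariate used by the model is missing. A minimal sketch of that count, assuming missing values are stored as `None`; the records and the covariate names (`cov_a`, `cov_b`, `cov_c`) are hypothetical and are not PISA values:

```python
def excluded_share(records, covariates):
    """Share of records dropped by listwise deletion.

    A record is excluded when at least one of the model covariates
    is missing (represented here as None).
    """
    excluded = sum(
        1 for rec in records
        if any(rec.get(name) is None for name in covariates)
    )
    return excluded / len(records)


# Toy data (hypothetical): four students, three covariates.
students = [
    {"cov_a": 0.2, "cov_b": None, "cov_c": 450},
    {"cov_a": -0.5, "cov_b": 0.1, "cov_c": 450},
    {"cov_a": None, "cov_b": 0.7, "cov_c": 800},
    {"cov_a": 1.1, "cov_b": -0.3, "cov_c": None},
]

small_model = excluded_share(students, ["cov_c"])
large_model = excluded_share(students, ["cov_a", "cov_b", "cov_c"])
```

In this toy example the one-covariate model loses a quarter of the records and the three-covariate model loses three quarters, even though each covariate is missing only once.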

8 STEPS OF ANALYSIS
1. Missing data pattern
2. Imputation strategies
3. Evaluation of results

9 TWO SUBSETS OF VARIABLES
The OECD's PISA dataset contains two subsets of variables:
- Collected variables: data collected by the student and school questionnaires.
- Derived variables: computed from the collected variables (by linear combination or factor analysis). This increases the potential of the survey.

10 EXAMPLE OF DERIVED VARIABLES
- The PISA 2003 index of confidence in ICT internet tasks is derived from students' responses to the five items. All items are inverted for IRT scaling, and positive values on this index indicate high self-confidence in ICT internet tasks.
- The PISA 2003 index of school size (SCHLSIZE) is derived by summing school principals' responses on the number of girls and boys at a school.
- The PISA 2003 index of availability of computers (RATCOMP) is derived from school principals' responses to the items measuring the availability of computers. It is calculated by dividing the number of computers at school by the number of students at school.
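The two school-level indices are simple arithmetic on collected items, which can be sketched as follows. The function names and the sample counts are hypothetical, and returning `None` when an input is missing is an assumption made for illustration, not a documented PISA rule:

```python
def school_size(n_girls, n_boys):
    """SCHLSIZE-style index: total enrolment, or None if a count is missing."""
    if n_girls is None or n_boys is None:
        return None
    return n_girls + n_boys


def computer_ratio(n_computers, n_students):
    """RATCOMP-style index: computers per student, or None if not computable."""
    if n_computers is None or not n_students:
        return None
    return n_computers / n_students


# Hypothetical school: 310 girls, 290 boys, 45 computers.
size = school_size(310, 290)
ratio = computer_ratio(45, size)
```

Under this assumption a single missing collected item makes every index derived from it missing as well.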

11 “COLLECTED” AND “DERIVED” VARIABLES
Distribution of variables classified as “collected” and “derived” in the school and student datasets of PISA 2003:

                          School dataset   Student dataset
“Collected” variables          154              215
“Derived” variables             30              109
Total                          184              324

12 VARIABLE TYPOLOGY (STUDENTS' DATASET)
Collected variables at student level:

Categorical variables    n = 197    91.7%
Continuous variables     n = 15      6.9%
Discrete variables       n = 3       1.4%
Total                    n = 215    100%

Variables with more than 5% of missing values: n = 39 (18%)
Variables without missing values: n = 3 (1.5%)

13 FIVE IMPUTATION PROCEDURES
- Procedure A: iterative and sequential multiple regression applied to each section of the student questionnaire
- Procedure B: iterative and sequential multiple regression applied to imputation classes computed by a regression tree
- Procedure C: random selection of donors within imputation classes computed by a regression tree
- Procedure D: random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
- Procedure E: iterative and sequential multiple regression applied to the whole dataset

14 USUAL ASSOCIATIONS AND ANTINOMIES OF THE ADOPTED IMPUTATION PROCEDURES (A-E)
All procedures belong to well-known categories.
- Two categories are involved: regression methods (A, B, E) and donor methods (C, D).
- Dimension of the treated data matrix: the imputation procedure is (A, D) or is not (B, C, E) applied to each section of the questionnaire.
- Two sides of the data matrix are involved: units (Classification And Regression Tree: B, C, D) and variables (A, D).
- The missing data mechanism is (A, E) or is not (B, C, D) considered.

15 PROCEDURE A
Iterative and sequential multiple regression (Raghunathan et al. 2001) on each section of the student questionnaire.
The data matrix is partitioned into the seven sections of the student questionnaire. Features of each section, as a partition of the data matrix:
- strong logical links between the questions
- homogeneous structure of association and relationship
- homogeneous presence of missing data

16 PROCEDURE A

Subset  Section of the questionnaire       Categorical  Continuous  Discrete  Variables  Average of missing data
                                            variables    variables   variables             for each section
1       Family context (Sect. B)                39           0           0                        107
2       Educational level (Sect. C)             14           6           1         21             508
3       School context (Sect. D)                18           0           0                        117
4       Learning mathematics (Sect. E)          40           6           0         46             237
5       Mathematics classes (Sect. F)           21           2           1         24             212
6       ICT confidence (Sect. ICT)              49           0           0                        713
7       Educational career (Sect. EC)            6           0           1          7             242

17 PROCEDURE B
Iterative and sequential multiple regression applied to imputation classes computed by a regression tree.
STEP I, UNITS CLASSIFICATION: computation of a regression tree (14 terminal nodes).
- Dependent variable: missing data for each student
- Predictors: selected from five categories of derived indicators (family background, scholastic context, approach to study, attitudes toward ICT tools, performance scores)
STEP II, IMPUTATION: each terminal node of the tree is considered an imputation class. Its missing values are imputed by the iterative and sequential regression model (Raghunathan et al. 2001).

18 PROCEDURE C
Random selection of donors within imputation classes computed by a regression tree.
STEP I, UNITS CLASSIFICATION: computation of a regression tree (14 terminal nodes).
- Dependent variable: missing data for each student
- Predictors: selected from five categories of derived indicators (family background, scholastic context, approach to study, attitudes toward ICT tools, performance scores)
STEP II, IMPUTATION: a different donor is selected to impute each missing value of each student. The donor is selected randomly from the same node.
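The donor step can be sketched as a random hot-deck within classes. This is an illustration under stated assumptions, not the authors' implementation: missing values are `None`, the class label (the hypothetical key `node`) is taken as already assigned by the tree, and only originally observed values serve as donors:

```python
import random


def impute_by_random_donor(records, class_key, variables, seed=0):
    """Fill each missing value (None) with the value of a randomly chosen
    donor from the same imputation class; a new donor is drawn per value."""
    rng = random.Random(seed)
    imputed = [dict(rec) for rec in records]  # leave the input untouched
    for var in variables:
        # Pool of observed values for this variable, by imputation class.
        donors = {}
        for rec in records:
            if rec[var] is not None:
                donors.setdefault(rec[class_key], []).append(rec[var])
        for rec in imputed:
            pool = donors.get(rec[class_key])
            if rec[var] is None and pool:
                rec[var] = rng.choice(pool)
    return imputed


# Toy data (hypothetical): two terminal nodes, two questionnaire items.
students = [
    {"node": 1, "q1": "yes", "q2": 3},
    {"node": 1, "q1": None, "q2": 3},
    {"node": 2, "q1": "no", "q2": None},
    {"node": 2, "q1": "no", "q2": 5},
]
completed = impute_by_random_donor(students, "node", ["q1", "q2"])
```

Because the donor is redrawn for every missing value, two items of the same student may be filled from two different donors, which matches the description on the slide. A value stays missing if its class contains no donor.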

19 PROCEDURE D
Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire.
STEP I, MATRIX PARTITION: the data matrix is partitioned into the seven sections of the student questionnaire.
STEP II, UNITS CLASSIFICATION: a regression tree is produced within each partition of the matrix (see the next slide).
STEP III, IMPUTATION: within all leaves, a different donor is selected to impute each missing value of each student. The donor is selected randomly from the same node.

20 PROCEDURE D: REGRESSION TREES FOR EACH MATRIX PARTITION
In every section the dependent variable is the number of missing data for each record (student).
- Section B (12 terminal nodes). Predictors: ISCO code of the mother, disciplinary climate in maths lessons, mathematics self-efficacy, computer facilities at home, plausible value in problem solving.
- Section C (11 terminal nodes). Predictors: expected educational level of the student (ISCED), mathematics anxiety, mathematics self-concept, ICT: confidence in routine tasks, plausible value in math.
- Section E (7 terminal nodes). Predictors: home educational resources, mathematics anxiety, mathematics self-concept, ICT: confidence in routine tasks, plausible value in problem solving.
- Section IC (16 terminal nodes). Predictors: index of Socio-Economic and Cultural Status, mathematics anxiety, mathematics self-efficacy, computer facilities at home, plausible value in problem solving.
- Sections D+F+EC (12 terminal nodes). Predictors: expected educational level of the student (ISCED), disciplinary climate in maths lessons, mathematics self-efficacy, computer facilities at home, plausible value in problem solving.

21 PROCEDURE E
Iterative and sequential multiple regression (Raghunathan et al. 2001) on the whole dataset (without any partition of units and variables).

22 METHODOLOGICAL DETAILS OF THE IMPUTATION PROCEDURES

23 CLASSIFICATION AND REGRESSION TREE: CREATING IMPUTATION CELLS
Classification and Regression Tree creates a tree-based classification model. It classifies cases into groups, or predicts values of a dependent (target) variable (Y), based on the values of independent (predictor) variables (X).
The classification is obtained through the recursive binary partition of the measurement space. The resulting subgroups (nodes), which are internally homogeneous with respect to the target variable values, correspond to imputation cells.
[Tree diagram: parent node, child node, terminal node]

24 STRUCTURE OF A REGRESSION TREE
Example: a tree T composed of five nodes t_i, i = 1, 2, 3, 4, 5.
[Tree diagram: nodes t_1, t_2, t_3, t_4, t_5]
Impurity of a node t
For any split s of t into t_L and t_R, the best split s* is such that
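The formulas of this slide did not survive in the transcript. For orientation only, a common textbook form of the regression-tree criterion is given here; it is an assumption that the slide used this form, and the symbols N(t), \bar{y}(t), p_L and p_R are introduced for the illustration. The impurity of a node is its within-node variance of the target, and the best split is the one with the largest impurity reduction:

```latex
% Impurity of node t: within-node variance of the target y
i(t) = \frac{1}{N(t)} \sum_{n \in t} \bigl( y_n - \bar{y}(t) \bigr)^2

% Impurity reduction of a split s of t into t_L and t_R,
% where p_L = N(t_L)/N(t) and p_R = N(t_R)/N(t)
\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)

% Best split
s^* = \arg\max_{s} \; \Delta i(s, t)
```

With the number of missing values per student as the target, such a criterion groups students with similar amounts of missing data into the same terminal node.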

25 ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (1/2)
PARTITION OF THE VARIABLES: variables without missing data (X) and variables with missing data (Y).
STEP 1: the variable with the fewest missing values, Y_1, is regressed on the subset of variables without missing data, U = X.
STEP 2: U is updated by appending Y_1. Then the variable with the next fewest missing values, Y_2, is regressed on U = (X, Y_1), where Y_1 has imputed values.
STEP 3 ... STEP N: each variable is imputed by using all available variables (complete or imputed).

26 ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (2/2)
At the end of the first round, all missing data are imputed for each variable.
NEXT ROUND: the imputation process is then repeated, modifying the predictor set to include all X and Y variables except the one used as the dependent variable.
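The first round of the sequential scheme can be sketched in plain Python. This is a simplified illustration, not the method of Raghunathan et al. (2001) in full: it uses ordinary least squares for every variable, fills each gap with the fitted value rather than a random draw, and stops after one round. The helper names and the toy data are hypothetical:

```python
def fit_ols(rows, targets):
    """Least-squares coefficients (intercept first) via the normal equations.
    Assumes the predictors are not collinear on the rows supplied."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * y for d, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def sequential_impute(complete_cols, incomplete_cols):
    """One round: impute columns in order of increasing missingness, each
    regressed on the complete columns plus the columns already imputed."""
    used = [list(c) for c in complete_cols]  # U = X at the start
    order = sorted(incomplete_cols,
                   key=lambda name: incomplete_cols[name].count(None))
    result = {}
    for name in order:
        y = incomplete_cols[name]
        rows = list(zip(*used))
        observed = [i for i, v in enumerate(y) if v is not None]
        beta = fit_ols([rows[i] for i in observed], [y[i] for i in observed])
        filled = [
            v if v is not None
            else beta[0] + sum(b * x for b, x in zip(beta[1:], rows[i]))
            for i, v in enumerate(y)
        ]
        result[name] = filled
        used.append(filled)  # U = (X, Y_1, ...)
    return result


# Toy data (hypothetical): one complete column, two incomplete columns.
x = [0, 1, 2, 3, 4]
imputed = sequential_impute(
    [x],
    {"y1": [1, 0, None, 2, 1], "y2": [1, None, 3, 3, None]},
)
```

Here `y1` has the fewest gaps and is imputed first from `x`; `y2` is then imputed from `x` and the completed `y1`. Later rounds, as described above, would re-impute each variable from all the others.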

27 EVALUATION OF IMPUTATION PROCEDURES
- Impact on univariate distributions
- Relationship between variables

28 IMPUTATION EFFECTS ON UNIVARIATE DISTRIBUTIONS
CATEGORICAL VARIABLES
- Absolute relative square dissimilarities index (Leti 1983); N denotes the number of categorical variables.
CONTINUOUS VARIABLES
- Absolute relative variation index (among means)
- Absolute relative variation index (among standard deviations)
The education survey data have been analysed with multilevel models.

29 IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (1/2)
Variation Association Index (categorical variables): for each imputed variable Y_j, the mean difference between the pre-imputation and post-imputation association of Y_j with the remaining n-1 categorical variables.

30 IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (2/2)
Variation Association Index (continuous variables): for each imputed variable Y_j, the mean difference between the pre-imputation and post-imputation correlation of Y_j with the remaining n-1 continuous variables.
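The exact formula of the index is not preserved in the transcript. One possible reading for continuous variables is sketched below, under three assumptions that are not confirmed by the source: Pearson correlation is the association measure, pre-imputation correlations use the pairs where both values are observed, and differences are taken in absolute value. All names and data are hypothetical:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def variation_association_index(pre, post, target):
    """Mean absolute change in the correlation between `target` and every
    other variable, before versus after imputation."""
    others = [name for name in pre if name != target]
    diffs = [
        abs(pearson(pre[target], pre[name]) - pearson(post[target], post[name]))
        for name in others
    ]
    return sum(diffs) / len(diffs)


# Toy data (hypothetical): y has one gap; two candidate imputations.
pre = {"y": [1, 2, None, 4], "x": [1, 2, 3, 4], "z": [4, 3, 2, 1]}
good = dict(pre, y=[1, 2, 3, 4])    # preserves the linear relationships
poor = dict(pre, y=[1, 2, 10, 4])   # distorts them

vai_good = variation_association_index(pre, good, "y")
vai_poor = variation_association_index(pre, poor, "y")
```

A value near zero indicates that the imputation left the correlations of the imputed variable essentially unchanged; larger values indicate distortion.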

31 MATRIX V x P - “VARIABLES x PROCEDURES”
There are five V x P matrices, one for each evaluation index.
Example of the structure of a V x P matrix with a generic evaluation index G_j ≡ (Im_j or Iμ_j or Iσ_j or VAI_N_j or VAI_C_j) (*):

          Proc. A   Proc. B   Proc. C   Proc. D   Proc. E
Var. 1    g_1a      g_1b      g_1c      g_1d      g_1e
Var. 2    g_2a      g_2b      g_2c      g_2d      g_2e
...       ...       ...       ...       ...       ...
Var. N    g_Na      g_Nb      g_Nc      g_Nd      g_Ne

(*) According to: a) the typology (categorical, etc.) of the variables (the number of variables of each typology is denoted by N); b) the type of impact: on univariate distributions or on the relationship between variables.

32 SCORES MATRICES
Each of the five matrices VP_G (N x 5), with generic element $g_{js}$, is transformed into a 0/1 score matrix S_G (N x 5) with generic element $a_{js}$:

$a_{js} = 1$ if $g_{js} = \min_{s}\{g_{js}\}$ (the minimum value in row j), $a_{js} = 0$ otherwise.

Matrix VP_G (N x 5)
          Proc. A   Proc. B   Proc. C   Proc. D   Proc. E
Var. 1    0.6       0.4       0.5       0.1       0.7
Var. 2    0.3       0.6       0.9       0.5       0.8
...
Var. N    0.4       0.2       0.8       0.6       0.5

Matrix S_G (N x 5)
          Proc. A   Proc. B   Proc. C   Proc. D   Proc. E
Var. 1    0         0         0         1         0
Var. 2    1         0         0         0         0
...
Var. N    0         1         0         0         0
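The transformation from index values to 0/1 scores can be sketched as a row-wise minimum test. The input reproduces the slide's example, reading the garbled last entry of Var. 2 as 0.8 (an assumption consistent with the score matrix shown); giving a 1 to every entry tied for the minimum is also an assumption, since the slide does not say how ties are handled:

```python
def score_matrix(index_values):
    """Turn a variables x procedures matrix of index values into 0/1 scores:
    1 where the entry equals its row minimum, 0 elsewhere."""
    scores = []
    for row in index_values:
        best = min(row)
        scores.append([1 if value == best else 0 for value in row])
    return scores


# Rows: Var. 1, Var. 2, Var. N; columns: procedures A to E.
vp = [
    [0.6, 0.4, 0.5, 0.1, 0.7],
    [0.3, 0.6, 0.9, 0.5, 0.8],
    [0.4, 0.2, 0.8, 0.6, 0.5],
]
s = score_matrix(vp)
```

The minimum is rewarded because every evaluation index measures the change induced by imputation, so smaller values mean less distortion.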

33 BUILDING A RANKING INDICATOR (1/3)
The ranking indicators measure the relative performance of each procedure according to each evaluation index.

34 BUILDING A RANKING INDICATOR (2/3)
The vector of 0/1 scores extracted from the S matrix (for each procedure and for each evaluation indicator) is reduced to a scalar by summing its elements.
This sum is divided by the number of vector elements to obtain a ranking index R, whose range is from 0 to 1.
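The reduction of a score vector to the ranking index R is a column mean of the 0/1 score matrix. A minimal sketch, with a hypothetical score matrix of four variables and five procedures:

```python
def ranking_index(scores, procedure):
    """Share of variables for which `procedure` (a column position)
    obtained a score of 1; the result lies between 0 and 1."""
    column = [row[procedure] for row in scores]
    return sum(column) / len(column)


# Hypothetical 0/1 scores: rows are variables, columns are procedures A to E.
scores = [
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
r = [ranking_index(scores, s) for s in range(5)]
```

In this toy case the fourth procedure has the highest index because it attains the row minimum for two of the four variables.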

35 BUILDING A RANKING INDICATOR (3/3)
The ranking indicators measure the relative performance of each procedure according to each evaluation index.
- Lower bound of the range: lowest performance of the s-th procedure compared to the other procedures for the generic evaluation index G.
- Upper bound of the range: highest performance of the s-th procedure compared to the other procedures for the generic evaluation index G.

36 FROM AN EVALUATION INDICATOR TO A RANKING INDICATOR
Table columns: evaluation index, aim, ranking index. The aims are:
- evaluating the imputation impact on marginal distributions (categorical variables)
- evaluating the imputation impact on marginal distributions (continuous variables)
- evaluating the imputation impact on the association between continuous variables
- evaluating the imputation impact on the association between categorical variables

37 EVALUATING THE IMPACT ON MARGINAL DISTRIBUTIONS AND ON SOME DISTRIBUTIVE PARAMETERS

        Dissimilarities index        Absolute relative variation   Absolute relative variation
        (categorical variables)      index (among means)           index (among standard deviations)
Rank    Procedure   Ranking index    Procedure   Ranking index     Procedure   Ranking index
I       C           0.695            C           0.538             C           0.385
II      D           0.626            E           0.308             D           0.385
III     A           0.474            D           0.077             B           0.231
IV      E           0.442            B           0.077             E           0.000
V       B           0.405            A           0.000             A

38 EVALUATING THE IMPUTATION IMPACT ON THE VARIABLES ASSOCIATION

        Variation Association Index        Variation Association Index
        (categorical variables)            (continuous variables)
Rank    Procedure   Ranking indicator      Procedure   Ranking indicator
I       B           0.538                  B           0.542
II      A           0.308                  A           0.126
III     E           0.077                  E           0.121
IV      D           0.077                  D           0.121
V       C           0.000                  C           0.089

39 CONCLUDING REMARKS
- Missing data imputation is an extremely complex process.
- Each method shows critical aspects.
- It is important to develop a reconstruction strategy that considers some basic aspects: the missing data pattern, the impact on the statistical distributions, and the impact on the associations among variables.

