Overview of Main Survey Data Analysis and Scaling
National Research Coordinators (NRC) Meeting, Madrid, February 2010
Content of presentation
  Scaling and analysis of test items
  Scaling and analysis of questionnaire items
  Data analysis for the reporting of ICCS data
Steps in analysis
  Preliminary analysis of first data sets received
    – Review at JMC data analysis meeting in Hamburg in July 2009
  Analysis of clean and uncleaned data sets from almost all participating countries
    – Review at PAC meeting in Tallinn (October 2009) and JMC data analysis meeting in Hamburg in early December 2009
  Final scaling and analysis with clean data from all 38 countries
Test item analysis
  Review of missing data
  Analysis of item dimensionality
  Review of item statistics (international)
  Analysis of differential item functioning by gender
  Analysis of item-by-country interaction
    – Measurement equivalence
  Item adjudication
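Scaling model
  Rasch one-parameter model:
  $$P_i(\theta_n) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}$$
  $P_i(\theta_n)$ is the probability for person $n$ to score 1 on item $i$
  $\theta_n$ is the estimated ability of person $n$ and $\delta_i$ is the estimated location (difficulty) of item $i$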
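As a rough illustration of the Rasch model on this slide, the sketch below computes the probability of scoring 1 from a person ability and an item difficulty, both in logits. The function name and the example values are assumptions made for illustration, not part of the ICCS procedures.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Probability of scoring 1 under the Rasch model.

    theta: person ability in logits (illustrative value, not ICCS data)
    delta: item difficulty (location) in logits
    """
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# Hypothetical abilities evaluated against an item of difficulty 0 logits.
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(rasch_probability(theta, delta=0.0), 3))
```

When ability equals difficulty the probability is exactly 0.5, which is why item locations and student abilities can be mapped onto the same scale.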
Probability curves
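Partial credit model
  For open-ended items (and questionnaire items) with more than two categories, the Partial Credit Model was used:
  $$P_{x_i}(\theta_n) = \frac{\exp \sum_{j=0}^{x_i} (\theta_n - \delta_i + \tau_{ij})}{\sum_{h=0}^{m_i} \exp \sum_{j=0}^{h} (\theta_n - \delta_i + \tau_{ij})}, \quad x_i = 0, 1, \ldots, m_i$$
  Here, $\tau_{ij}$ denotes an additional step parameter; $\theta_n$ is the person ability and $\delta_i$ the item location, as in the Rasch model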
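The category probabilities of the Partial Credit Model can be sketched as follows. This assumes the parameterisation in which each step adds (theta - delta + tau) to a cumulative sum and category 0 has a sum of zero; the sign convention for the step parameters, the function name and the example values are assumptions for illustration.

```python
import math


def pcm_probabilities(theta, delta, taus):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person ability in logits
    delta: item location in logits
    taus:  step parameters for categories 1..m (hypothetical values)
    Returns a list with one probability per category 0..m.
    """
    cumulative = [0.0]  # category 0: empty sum, defined as zero
    for tau in taus:
        cumulative.append(cumulative[-1] + (theta - delta + tau))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(cumulative)
    weights = [math.exp(c - largest) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-category item (scores 0, 1, 2).
print([round(p, 3) for p in pcm_probabilities(theta=0.5, delta=0.0, taus=[0.4, -0.4])])
```

With a single step parameter of zero the function reduces to the dichotomous Rasch model.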
Threshold curves
Response probabilities
Missing data issues
  Different categories of missing data
  Omitted responses
    – Somewhat higher percentages for open-response items
  Invalid responses
    – Generally very low percentages
  Not reached responses
    – Omitted items at end of test booklets
    – Generally low percentages, but more considerable in a few countries
Not reached % by region
Test characteristics
  Test items were generally a little easier than the average student ability (pooled across countries)
  Test reliability was 0.84 (similar to the CIVED assessment)
  Very high latent correlations between possible sub-dimensions
    – Decision not to pursue sub-scales
Mapping of test items to abilities
Review of item scaling properties
  Most items had excellent scaling properties
    – Weighted mean square item fit
    – Item-total correlation
    – Item characteristic curves
  Only one test item (CI2HRM2) was omitted from scaling
Item statistics
Item characteristic curves
Scoring reliabilities - 1
  Open-ended items were scored according to international scoring guidelines
  Double-scoring of sub-samples
  On average, percentages of scorer agreement ranged between 84 and 92 percent across participating countries
Scoring reliabilities - 2
  Items were accepted only where scorer agreement was 70% or more
  Data for items where this criterion was not met were not included in scaling
  In two countries open-ended items were consistently easier than other items
    – Omitted from scaling and database
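The 70% acceptance criterion could be checked along the following lines. This is a minimal sketch that assumes agreement is measured as the percentage of double-scored responses given identical codes by both scorers; the item names and score vectors are invented for illustration.

```python
def percent_agreement(first_scores, second_scores):
    """Percentage of double-scored responses with identical codes."""
    if not first_scores or len(first_scores) != len(second_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return 100.0 * matches / len(first_scores)


def items_meeting_criterion(double_scored, threshold=70.0):
    """Map each item to True if its scorer agreement reaches the threshold."""
    return {
        item: percent_agreement(first, second) >= threshold
        for item, (first, second) in double_scored.items()
    }


# Hypothetical double-scored sub-sample for two open-ended items.
double_scored = {
    "ITEM_A": ([0, 1, 2, 1, 0, 2, 1, 1, 0, 2], [0, 1, 2, 1, 0, 2, 1, 0, 0, 2]),
    "ITEM_B": ([0, 1, 2, 1, 0], [1, 1, 0, 2, 0]),
}
print(items_meeting_criterion(double_scored))
```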
Gender DIF
  DIF estimates reflect the differences between item difficulties for males and females of equal ability
    – This may cause bias in favour of one group
  Generally, only a few items with gender DIF were found
Cross-national measurement equivalence
  Occurrence of item-by-country interaction
    – Items relatively much harder in some countries but much easier in others
  In ICCS, national item calibrations were compared with those for the international calibration sample
  Standard errors were adjusted for sample design effects and multiple comparisons
Example for CI2HRM2
Item-by-country interaction
  Generally, items tended to behave in a similar way
  A number of items showed parameter variance
    – Sometimes due to translation errors
    – Often due to other factors (national context, curricula)
  Occurrence of some parameter variation across countries
    – Similar results to those in other cross-national studies
Item adjudication
  Based on results from scaling analysis (item statistics, item curves, item-by-country interaction etc.)
  International item adjudication
    – Omission of CI2HRM2 from scaling
  National item adjudication
    – Re-verification for items with larger discrepancies in item difficulty
    – Omission of items with translation or scoring issues from national scaling
Calibration of items
  Based on international calibration sample with 500 randomly selected students from each of the 36 participating countries that met sampling requirements
  ACER ConQuest was used for estimation
  Booklet effects adjusted by including booklet as a facet in the scaling model
Scaling methodology
  Plausible values were generated as student ability estimates
    – More information at workshop!
  Dummy indicators for classroom and all student-level variables (international and regional) were included in the conditioning model
  Scale scores set to international metric with mean of 500 and SD of 100 for equally weighted countries
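The final step, placing logit scores on a metric with a mean of 500 and an SD of 100 for equally weighted countries, might look like the sketch below. It assumes that "equally weighted" means rescaling student weights so that every country's weights sum to the same total; the record layout and function name are hypothetical, and the actual procedure operates on plausible values rather than on a single score per student.

```python
import math
from collections import defaultdict


def to_international_metric(records, target_mean=500.0, target_sd=100.0):
    """Linearly transform logit scores to the reporting metric.

    records: list of (country, student_weight, logit_score) tuples
             (hypothetical layout for illustration).
    Each country's weights are rescaled to sum to 1, so every country
    contributes equally to the mean and SD used for standardisation.
    """
    weight_sum = defaultdict(float)
    for country, weight, _ in records:
        weight_sum[country] += weight
    equal_weights = [weight / weight_sum[country] for country, weight, _ in records]
    scores = [score for _, _, score in records]

    total = sum(equal_weights)
    mean = sum(w * x for w, x in zip(equal_weights, scores)) / total
    variance = sum(w * (x - mean) ** 2 for w, x in zip(equal_weights, scores)) / total
    sd = math.sqrt(variance)
    return [target_mean + target_sd * (x - mean) / sd for x in scores]


# Two hypothetical countries with very different weight totals.
records = [("A", 1.0, -1.0), ("A", 1.0, 1.0), ("B", 10.0, 0.0), ("B", 10.0, 2.0)]
print([round(s, 1) for s in to_international_metric(records)])
```

Because the weights are equalised first, the larger weight total of country B does not pull the international mean towards its students.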
Estimation of changes in cognitive knowledge
  Test items from CIVED were included as an intact cluster
  17 countries with comparable data
    – Three countries with grade 9 in CIVED and additional grade 9 samples in ICCS
  A small number of items in some countries had to be discarded due to translation errors or differences between ICCS and CIVED
Estimation of changes in cognitive knowledge - 2
  Comparison of item parameters showed high similarity (correlation of 0.95)
  Slight positioning effect due to different test designs
    – CIVED: One booklet
    – ICCS: CIVED link cluster in each of the three positions
  CIVED items at the beginning slightly easier, at the end slightly harder than in ICCS
Estimation of changes in cognitive knowledge - 3
Estimation of changes in cognitive knowledge - 4
  Framework broadened since CIVED
    – Re-scaling CIVED data to equate with ICCS not appropriate
  Selection of CIVED items not representative of the overall CIVED test
    – Equating link items with CIVED scale (or sub-scale) also not appropriate
  Solution: Establish new comparison scale based only on 17 link items
Estimation of changes in cognitive knowledge - 5
  Concurrent calibration of item parameters based on calibration samples with 34 samples from 17 countries (CIVED and ICCS)
  Establishing a metric with a mean of 500 and SD of 100 for the 17 equally weighted CIVED countries
  For results in tables, weighted likelihood estimates were used
    – Usually unbiased for country averages
Questionnaire item analysis
  Missing data issues
  Item dimensionality and scaling review
  Item/scale adjudication
  Scaling procedures
Missing data - 1
  On average about 3 percent of students have missing scale scores
    – Only two countries had higher percentages (18 and 12 percent)
  For teacher survey data, relatively low missing percentages were found (about 2 percent)
  Very low percentages of missing data in school questionnaire
Missing data - 2
  Concerns about missing data for socio-economic indicators
    – Highest parental occupation: 5%
    – Highest parental education: 3%
    – Books at home: 1%
  However, in a few countries higher percentages of missing data were found (up to 15% for parental education)
Analysis of item dimensionality
  Exploratory and confirmatory factor analyses showed generally very similar results to those from the field trial
  These analyses will be described in detail in the ICCS technical report
Scaling analysis
  Scale reliabilities (Cronbach's alpha)
    – Values over 0.7 indicate satisfactory internal consistency
  Item-total correlations
    – Useful for reviewing translation errors
  Scaling with IRT Partial Credit Model
    – Item fit
    – Category characteristic curves
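For reference, Cronbach's alpha can be computed from item and total-score variances as in the sketch below. The response matrix is invented for illustration; it assumes complete data, so any handling of missing responses is left out.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondent rows of item scores.

    rows: list of lists, one inner list per respondent, one score per item.
    Assumes complete data with at least two items and two respondents.
    """
    n_items = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in rows])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of six students to a four-item Likert-type scale.
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 3, 4, 4],
    [2, 1, 2, 2],
    [3, 4, 3, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2), "satisfactory" if alpha > 0.7 else "below 0.7")
```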
Item and scale adjudication
  Only three scales with median scale reliabilities below 0.7
    – Democratic value beliefs, civic participation in the community, and civic participation at school
  Adjudication for student, teacher, school and each regional questionnaire
  Some items were removed from scales
  In some cases, single-item reporting
Scaling procedures - 1
  IRT scaling with Partial Credit Model
  Weighted likelihood estimates used as scale scores
  International metric with a mean of 50 and a standard deviation of 10
Scaling procedures - 2
  Item parameter calibration with ACER ConQuest
  Calibration samples:
    – 500 students per country
    – 250 teachers per country
    – All school data with equal weights for each country
  Only data from countries that met sampling requirements (categories 1 or 2) included in calibration
Questionnaire scales
  Advantages of IRT scales
    – Inclusion of students with at least two item responses per scale
    – Possibility to describe the scale
  With the IRT Partial Credit Model it is possible to map scale scores to expected item responses
  Item maps will be provided in an appendix to the international report
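The idea behind an item map can be sketched by computing the expected item score at a given scale location under the Partial Credit Model. The sketch assumes the location is expressed in logits (a score in the reporting metric would first have to be converted back), and it uses a parameterisation in which each step adds (theta - delta + tau); the function name and all parameter values are hypothetical.

```python
import math


def expected_item_score(theta, delta, taus):
    """Expected score on one item at scale location theta (in logits).

    delta: item location; taus: step parameters for categories 1..m.
    All parameter values passed in below are invented for illustration.
    """
    cumulative = [0.0]  # category 0 has a cumulative sum of zero
    for tau in taus:
        cumulative.append(cumulative[-1] + theta - delta + tau)
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    # Weight each category score (0, 1, ..., m) by its probability.
    return sum(score * w for score, w in enumerate(weights)) / total


# Expected response on a hypothetical three-category item along the scale.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(expected_item_score(theta, delta=0.0, taus=[0.5, -0.5]), 2))
```

Reading the expected score across a range of scale locations shows which response category students at each point of the scale are most likely to choose.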
Example of item map
Data analysis for reporting
  Estimation of sampling variance
  Estimation of measurement variance
  Reporting of differences
Estimation of sampling variance
  Data from cluster samples are not simple random samples
    – Standard formula for estimating sampling error not appropriate
  Jackknife repeated replication (JRR) technique used for ICCS
  IDB Analyzer, WesVar or SPSS/SAS macros may be used for applying this methodology
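The logic of jackknife repeated replication can be illustrated with a small sketch. It assumes the common paired design in which sampled schools are grouped into zones of two, and each replicate doubles the weight of one school in a zone while setting the other to zero; the variable names and data are hypothetical and the tools named above should be used for real analyses.

```python
def weighted_mean(values, weights):
    """Weighted mean of a list of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jrr_standard_error(values, weights, zones, doubled):
    """Weighted mean and its JRR standard error.

    zones:   jackknife zone of each case (pairs of sampled schools)
    doubled: 1 if the case's school is doubled in its zone's replicate,
             0 if it is dropped (hypothetical indicator for illustration)
    """
    full_estimate = weighted_mean(values, weights)
    sampling_variance = 0.0
    for zone in sorted(set(zones)):
        # Within the current zone: double one school, drop the other.
        replicate_weights = [
            w * (2.0 * d if z == zone else 1.0)
            for w, z, d in zip(weights, zones, doubled)
        ]
        replicate_estimate = weighted_mean(values, replicate_weights)
        sampling_variance += (replicate_estimate - full_estimate) ** 2
    return full_estimate, sampling_variance ** 0.5


# Four hypothetical cases from two zones with two schools each.
estimate, se = jrr_standard_error(
    values=[1.0, 3.0, 2.0, 4.0],
    weights=[1.0, 1.0, 1.0, 1.0],
    zones=[1, 1, 2, 2],
    doubled=[0, 1, 0, 1],
)
print(estimate, round(se, 3))
```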
Estimation of measurement variance
  Using plausible values allows estimating the measurement error
    – The variation between the five PVs can be used for estimation
  IDB Analyzer, WesVar or SPSS macros (ACER replicates module) include features to do this
  More information will be provided at the training workshop on Wednesday
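How the variation between plausible values feeds into a standard error can be sketched as follows. This assumes the usual multiple-imputation combination rule, in which the measurement variance is the between-PV variance of the statistic multiplied by (1 + 1/M); the per-PV estimates and the sampling variance below are invented numbers.

```python
from statistics import mean, variance


def combine_plausible_values(pv_estimates, sampling_variance):
    """Combine a statistic computed once per plausible value.

    pv_estimates:      the statistic (e.g. a country mean) for each PV
    sampling_variance: sampling variance of the statistic (e.g. from JRR)
    Returns (final estimate, measurement variance, total standard error).
    """
    n_pvs = len(pv_estimates)
    estimate = mean(pv_estimates)
    measurement_variance = (1 + 1 / n_pvs) * variance(pv_estimates)
    total_se = (sampling_variance + measurement_variance) ** 0.5
    return estimate, measurement_variance, total_se


# Hypothetical country means for five plausible values.
estimate, meas_var, total_se = combine_plausible_values(
    [500.0, 502.0, 498.0, 501.0, 499.0], sampling_variance=6.0
)
print(estimate, round(meas_var, 2), round(total_se, 2))
```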
Reporting of differences - 1
  The following types of significance tests will be reported:
    – For differences in population estimates between countries
    – For differences between a country and the international average
    – For differences in population estimates between subgroups within countries
    – For differences between population estimates in ICCS and in CIVED (trend estimation)
Reporting of differences - 2
  Adjustment for multiple comparisons with the Dunn-Bonferroni method
    – Increasing the critical value (p < .05) above 1.96
  SE for differences between samples:
  $$SE_{dif\_ab} = \sqrt{SE_a^2 + SE_b^2}$$
  Estimation of SE for sub-group differences with JRR
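A minimal sketch of the two calculations on this slide follows. It assumes a two-sided test in which the 5 percent significance level is divided by the number of comparisons, and independent samples for the standard error of a difference; the number of comparisons and the standard errors used in the example are hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical_value(n_comparisons, alpha=0.05):
    """Two-sided critical z value after Dunn-Bonferroni adjustment."""
    adjusted_alpha = alpha / n_comparisons
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)


def se_of_difference(se_a, se_b):
    """Standard error of a difference between two independent samples."""
    return (se_a ** 2 + se_b ** 2) ** 0.5


# One comparison gives the familiar 1.96; more comparisons raise the bar.
print(round(bonferroni_critical_value(1), 2))
print(round(bonferroni_critical_value(10), 2))  # 10 is a hypothetical count

# Hypothetical country means and standard errors.
difference = 520.0 - 505.0
se = se_of_difference(3.0, 4.0)
print(abs(difference) / se > bonferroni_critical_value(10))
```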
Reporting of differences - 3
  For the SE of trend differences it is important to take the equating error into account
  The SE for differences between CIVED and ICCS can be computed as
  $$SE_{dif} = \sqrt{SE_{ICCS}^2 + SE_{CIVED}^2 + EqErr^2}$$
  The equating error in the international metric is 3.31
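The effect of the equating error on a trend comparison can be sketched as below. The equating error of 3.31 is taken from the slide; the formula assumes the three error components are independent and add in quadrature, and the country means and standard errors are invented for illustration.

```python
EQUATING_ERROR = 3.31  # equating error in the international metric (from the slide)


def trend_standard_error(se_iccs, se_cived, equating_error=EQUATING_ERROR):
    """SE of a CIVED-to-ICCS difference, including the equating error."""
    return (se_iccs ** 2 + se_cived ** 2 + equating_error ** 2) ** 0.5


def is_significant(difference, standard_error, critical_value=1.96):
    """True if the difference exceeds the critical value times its SE."""
    return abs(difference) > critical_value * standard_error


# Hypothetical country results on the comparison scale.
difference = 512.0 - 500.0
se = trend_standard_error(se_iccs=3.0, se_cived=4.0)
print(round(se, 2), is_significant(difference, se))
```

Even with perfectly precise country estimates, the standard error of a trend difference cannot fall below the equating error itself.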
Multivariate analysis
  Multiple regression models were used for the tables in draft Chapter 7
    – Bivariate regression
    – Multiple regression
  Multi-level models were used for the analysis in draft Chapter 8
    – Students nested within classrooms
    – Classrooms mostly equivalent to schools
Questions or comments?