Applying the Rasch Model to Assess Cross-Cultural Comparability of Test Scores
Elena Kardanova, Higher School of Economics, Moscow
Overview of presentation: a little about me and my institute; cross-cultural research: challenges; a short description of the ISHEL project; steps in the development of cross-nationally comparable assessment instruments; analysis of the experts' behaviour; psychometric analysis; constructing a common scale across countries and between grades
My teachers
Russia: 146,544,710 people, 160 nationalities, 9 time zones, climate from −70°C to +40°C
Moscow
A few facts about Moscow: population of more than 12,000,000; 144 universities; 887,100 university students; 1,727 schools; 782,400 school students
National Research University Higher School of Economics
Higher School of Economics in numbers: founded in 1992; 2,500 teachers; collaboration with 45 countries; 27,000 students; 51,000 graduates
Institute of Education at HSE. Established in 2012 to carry out a research agenda at HSE University. 14 research centers, 70+ projects, 150+ staff. Graduate School of Education: 7 master's programs and a PhD program
Master's program "Measurement in Psychology and Education": the only master's program in psychometrics in Russia. Launched in 2010; 43 graduates, plus 12 more in 2016. Training in measurement, psychometrics, assessment and evaluation, research and data analysis, and statistics. Students are involved in real research work at the Institute's centers and projects (including large international projects such as TIMSS, PIRLS, PISA, and PIAAC).
Center for Monitoring the Quality in Education. Elena Kardanova, PhD; Ekaterina Orel, PhD; Tatjana Kanonire, PhD; Alina Ivanova, PhD student; Irina Brun, PhD student; Alena Ponomareva, PhD student; Inna Antipkhina, PhD student; Natalia Ilyushina, PhD student. Staff: 12, plus 9 master's students (trainees). Research and teaching. Main research interests: measurement and psychometrics, measuring the quality of education, test development, assessment of 21st-century skills (4K, problem solving, etc.), international comparative studies, issues in surveying teachers, and predictors of students' success in primary school.
We have introduced a new holiday in Russia: Psychometrician's Day, 4 December. We need an International Psychometrician's Day!
Now about serious things
Cross-cultural research: challenges. Comparability of results requires measurement equivalence (ME): the instrument measures the same attribute in the same way in different cultures. ME includes (a) equivalence of constructs, (b) equivalence of tests, and (c) equivalence of testing conditions.
Sources of incomparability. Estimates always contain some error; errors that cause systematic deviations across cultures threaten the comparability of the data. Major sources of incomparability: construct differences, instrument differences, and differences in test conditions.
Basic approaches to addressing comparability in cross-cultural studies: (1) ignore the possibility that the data may not be comparable; (2) test comparability and avoid comparisons if evidence of incomparability is found; (3) carry out the analyses under the assumption that any systematic error is too small to overturn the conclusions; (4) estimate the errors and correct for them (equating, DIF analysis, SEM, MG-LCA, etc.)
Aim of presentation To provide evidence concerning the reliability and cross-national comparability of assessment results for the ISHEL project (International Study for Higher Education Learning). The main goal of the ISHEL project is to study the quality of engineering education across Russia and China. Specifically: to assess and compare skill gains of students in engineering programs in the two countries.
Although interest among researchers and policymakers is high and increasing, few studies have assessed and compared skill gains among engineering students within and across countries. The very few international comparison studies (e.g., the OECD's AHELO) measure skill levels rather than skill gains. A couple of studies have measured gains in general skills (e.g., Arum and Roksa, 2011) or in specific subject domains such as economics (Bruckner et al., 2015). A major barrier to assessing and comparing the skill gains of engineering students is the lack of valid and cross-nationally comparable assessment instruments.
Given the lack of research in this area, the purpose of this study is to provide empirical support for the development and validation of instruments that assess and compare engineering students' skill gains in mathematics and physics (as one important measure of engineering education quality) within and across countries. One of the major challenges we faced was ensuring the cross-national equivalence of the measurements.
Steps for the development of valid and cross-nationally comparable assessment instruments: (1) select comparable EE and CS majors across China, Russia, and the United States; (2) select content and sub-content areas in math and physics (with experts); (3) collect and verify items (with experts); (4) conduct a small-scale pilot study; (5) conduct a large-scale pilot survey; (6) conduct a psychometric analysis
Two stages of providing evidence in support of the reliability and cross-national comparability of the assessment instruments. Stage 1: analysis of content and construct validity using cross-national expert evaluations of content areas and test items for each subject; feedback from 24 experts from a range of elite and non-elite engineering programs in China and Russia; analysis of the experts' behaviour using many-facet Rasch analysis (Linacre, 1989; Myford & Wolfe, 2003). Stage 2: Rasch analysis to ensure that (1) the assessment instruments meet basic standards for educational measurement, and (2) they can be equated both between the two grades and across the two countries and provide comparable measurement results; data collection from approximately 3,600 first- and third-year students from 21 undergraduate engineering programs in China and Russia.
Results for Stage 1: consistency among the experts (item difficulty criterion). Cronbach's alpha was over 0.8.
Experts' behaviour analysis (math test, grade 1)

Expert   Severity   Error   Unweighted MSQ   Weighted MSQ
RUS 1      .69       .12         .92             1.06
RUS 2     -.42       .11         .93              .94
RUS 3      .34        –          .66               –
RUS 4     -.57        –         1.03               –
RUS 5      .24        –          .98             1.14
RUS 6      .48        –         1.04              .97
CH 1       .41        –          .82              .88
CH 2      -.11       .13        1.26             1.28
CH 3      -.82       .40        1.10             1.17
CH 4      -.56        –          .84               –
CH 5       .14        –         1.05               –
CH 6       .19        –          .96             1.11

There were statistically significant differences in the levels of "severity" of the experts' evaluations (χ2 = 135.83, df = 11, p < 0.001; reliability 0.923). The "severity" of an expert was not strongly related to nationality. Despite these differences in severity, no expert demonstrated effects that might have distorted the final ratings or introduced bias into the evaluation procedure.
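For reference, the model behind these severity estimates is the many-facet Rasch model (Linacre, 1989). In a generic notation (chosen for this slide, not taken from the project report), the log-odds of expert j assigning item i to rating category k rather than k−1 can be written as

\[ \ln\frac{P_{ijk}}{P_{ij(k-1)}} = D_i - C_j - F_k, \]

where D_i is the rated difficulty of item i, C_j is the severity of expert j (the quantity tabulated above), and F_k is the threshold of rating category k.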
Based on the results of the expert evaluations and the small-scale pilot study, the instruments were prepared for a large-scale pilot (end of October 2014). Large-scale piloting: instruments. 45 items in each of the 4 tests (math and physics for grades 1 and 3); all items in multiple-choice format with one correct answer out of 4–5 options; paper-and-pencil administration; items scored dichotomously; the grade 1 and grade 3 tests for each subject shared approximately 20 common items.
Large-scale piloting: sample. 11 universities in China and 10 universities in Russia, both elite and non-elite, located in both large and small cities across each country. In each university, we sampled classes of grade 1 and grade 3 students from EE and CS departments until we had sampled at least 60 students from each major–grade combination, or until we had sampled all available students in that major–grade. Final sample: 2,726 students in China and 2,753 students in Russia.
Large-scale piloting: procedure. Two 55-minute sessions (one for math and one for physics, administered in random order). One third of the students in each class took a critical thinking test (provided by ETS); two thirds took both the math test and the physics test. Final sample for the math and physics tests: 1,797 students in China and 1,802 students in Russia. After testing was completed, students were also asked to fill out a short 10-minute questionnaire about their background and schooling experience.
Analytical approach. The dichotomous Rasch model (Wright and Stone, 1979) was used to conduct item analyses as well as tests of dimensionality and reliability, using Winsteps software (Linacre, 2011). Particular attention was paid to differential item functioning (DIF) to provide evidence concerning the cross-national comparability of the test results and to ascertain the possibility of creating a common scale between the two grades and across the two countries.
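For reference, the dichotomous Rasch model specifies the probability of a correct response from student n on item i as

\[ P(X_{ni}=1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}, \]

where θ_n is the student's ability and δ_i the item's difficulty; both are reported on the same logit scale, which is what makes the linking and DIF analyses below possible.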
Analytical approach: analyses.
Fit analysis: unweighted and weighted mean square statistics (Wright and Stone, 1979).
DIF analysis: t-statistic, MH method, LR method.
Dimensionality analysis: PCA of standardized residuals (Linacre, 1998; Smith, 2002; Ludlow, 1985).
Reliability study: person reliability index, separation index (Wright and Stone, 1979; Smith, 2001).
Linking of measures: simultaneous calibration and separate calibration (Wright & Bell, 1984; Wolfe, 2004).
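As a reminder of the two fit statistics used throughout (standard Rasch formulas, written here in a generic notation): with standardized residuals z_{ni} = (x_{ni} − E_{ni}) / \sqrt{W_{ni}}, where E_{ni} and W_{ni} are the model-implied expectation and variance of response x_{ni},

\[ \mathrm{MSQ}^{\text{unweighted}}_{i} = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad \mathrm{MSQ}^{\text{weighted}}_{i} = \frac{\sum_{n} W_{ni}\, z_{ni}^{2}}{\sum_{n} W_{ni}}. \]

Both have expectation 1 under the model; values well above 1 signal underfit (noise), values well below 1 signal overfit (redundancy).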
Methods of DIF analysis: t-test. Traditional approach within the framework of Rasch measurement; provided by Winsteps; gives the exact size and significance of DIF. General guidelines: an item is considered to exhibit DIF if (1) the DIF size is large enough (usually > 0.5 logits) and (2) the DIF is significant enough not to have occurred by chance (usually |t| > 2.0).
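The t statistic referred to here is the usual Rasch DIF contrast: if \hat{d}_{i1} and \hat{d}_{i2} are the difficulty estimates of item i obtained in the two groups, with standard errors s_{i1} and s_{i2} (notation mine),

\[ t_i = \frac{\hat{d}_{i1} - \hat{d}_{i2}}{\sqrt{s_{i1}^{2} + s_{i2}^{2}}}, \]

so an item is flagged when the contrast exceeds 0.5 logits in absolute value and |t_i| > 2.0.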
Methods of DIF analysis: MH approach. One of the most commonly used approaches for detecting DIF (Dorans, 1989). ETS classification for DIF (Zwick et al., 1999): A (negligible DIF), B (slight DIF), or C (large DIF) items. An item was classified as a C item if two conditions were satisfied: the difference in the item difficulty measures for the two groups of students was more than 0.64 logits, and the Mantel-Haenszel statistic had a significance level of p < .05 (Linacre, 2011).
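A minimal sketch of how such an MH screen can be computed, not the project's actual code; the pandas DataFrame and the column names ('item', 'total', 'country') are illustrative assumptions:

```python
# Mantel-Haenszel DIF screen for one dichotomously scored item (illustrative sketch).
# Assumes a DataFrame `df` with columns: 'item' (0/1 response), 'total' (raw score
# used for stratification), and 'country' (two group labels, e.g. 'CHI'/'RUS').
import numpy as np
import pandas as pd
from statsmodels.stats.contingency_tables import StratifiedTable

def mh_dif(df, item_col="item", group_col="country", strat_col="total"):
    tables = []
    for _, stratum in df.groupby(strat_col):
        # One 2x2 table per score stratum: rows = groups, columns = incorrect/correct.
        tab = pd.crosstab(stratum[group_col], stratum[item_col])
        if tab.shape == (2, 2) and (tab.values > 0).all():   # skip degenerate strata
            tables.append(tab.values)
    st = StratifiedTable(tables)
    alpha_mh = st.oddsratio_pooled          # pooled Mantel-Haenszel odds ratio
    p_value = st.test_null_odds().pvalue    # MH chi-square test of "no DIF"
    dif_logits = np.log(alpha_mh)           # DIF size on the logit scale
    return dif_logits, p_value              # sign depends on the row/column ordering

# An item would be classified as a "C" item (large DIF) if |dif_logits| > 0.64 and p < .05.
```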
Methods of DIF analysis: approach available in ConQuest. ConQuest software (Wu, Adams, & Wilson, 1998). Uses multi-facet modelling and can correctly classify bias when items are biased in only one direction.
Methods of DIF analysis: logistic regression (LR) method. Commonly used for detecting DIF. Models the probability of responding correctly to an item as a logistic function of predictor variables (the total score, a grouping variable, and the interaction between ability and group). An item is identified as a DIF item when the latter two variables show a significant improvement in data-model fit beyond a model that includes only ability (Zumbo, 1999). Zumbo & Thomas (1996) classification: negligible DIF (R-squared change below 0.13), moderate DIF (between 0.13 and 0.26), and large DIF (above 0.26). Both the moderate and large categories also require the item to be flagged as statistically significant by the two-degrees-of-freedom chi-square test.
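A minimal sketch of the LR procedure with the Zumbo & Thomas effect-size classification (illustrative variable names and a Nagelkerke-type pseudo R², not the project's actual code):

```python
# Logistic-regression DIF screen for one item (illustrative sketch).
# `item` is a 0/1 response vector, `total` the raw test score, `group` coded 0/1.
import numpy as np
import statsmodels.api as sm

def nagelkerke_r2(res, n):
    # Pseudo R^2 on the scale used for the 0.13 / 0.26 Zumbo-Thomas cut-offs.
    cox_snell = 1 - np.exp(2 * (res.llnull - res.llf) / n)
    return cox_snell / (1 - np.exp(2 * res.llnull / n))

def lr_dif(item, total, group):
    item, total, group = map(np.asarray, (item, total, group))
    n = len(item)
    fit = lambda X: sm.Logit(item, sm.add_constant(X)).fit(disp=0)
    m1 = fit(np.column_stack([total]))                        # step 1: ability only
    m2 = fit(np.column_stack([total, group]))                 # step 2: + group (uniform DIF)
    m3 = fit(np.column_stack([total, group, total * group]))  # step 3: + interaction (non-uniform DIF)
    chi2_31 = 2 * (m3.llf - m1.llf)     # 2-df likelihood-ratio test, step 3 vs step 1
    r2 = [nagelkerke_r2(m, n) for m in (m1, m2, m3)]
    return chi2_31, r2[2] - r2[0], r2[2] - r2[1], r2[1] - r2[0]

# DIF is "moderate" when the 2-df test is significant and the R^2 change (step 3 - step 1)
# lies between 0.13 and 0.26, and "large" when it exceeds 0.26 (Zumbo & Thomas, 1996).
```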
The reason for using these four DIF detection methods: the difference in mean ability between Chinese and Russian students was about 1.5 logits (larger than the SD of 1.1), and such differences in the ability distributions increase the Type I error rate of DIF detection methods (Narayanan & Swaminathan, 1994; Pei & Li, 2010; Monahan & Ankenmann, 2005).
Two stages of the psychometric analysis. First stage: the data for each grade were analyzed separately to discover whether it would be possible to construct a common scale across countries within each grade. Second stage: the data for the two grades were analyzed simultaneously, using the common items included in both grades as a link, to determine whether all the parameters for the two grades and the two countries could be placed on a common scale.
The grade 1 mathematics test: fit analysis. 8 of the 45 items were deleted (low discrimination and/or misfit to the model). The rest of the analysis uses the reduced set of 37 items for the grade 1 mathematics test.
Cross-country DIF analysis (math test, grade 1): t-test and ETS approach. t-test: 8 items were DIF-free; 15 items had t-values greater than +2.0 (easier for Chinese students); 14 items had t-values less than −2.0 (easier for Russian students). ETS approach: 24 items were DIF-free; 13 items showed DIF (7 easier for Chinese students, 6 easier for Russian students).
Cross-country DIF analysis: ConQuest approach

TERM 2: (-)country
                            UNWEIGHTED FIT               WEIGHTED FIT
country  ESTIMATE  ERROR^   MNSQ  CI            T        MNSQ  CI            T
1 CHI      0.766   0.009    1.02  (0.91, 1.09)  0.5      1.03  (0.91, 1.09)  0.7
2 RUS     -0.766*  0.009    0.95  (0.91, 1.09) -1.2      0.95  (0.91, 1.09) -1.1

An asterisk next to a parameter estimate indicates that it is constrained.
Separation reliability: not applicable. Chi-square test of parameter equality = 7836.49, df =

Proportions correct on all items for China and Russia: all items were easier for Chinese students, but the patterns were similar (correlation .75).
Cross-country DIF analysis: results for the ConQuest approach. 10 items demonstrated large DIF (the deviation in difficulty was close to, or even larger than, the student SD): 5 items were easier for Chinese students and 5 were easier for Russian students. DIF was also detected in all 10 of these items by the other methods.
Cross-country DIF analysis: LR method. No items exhibited moderate or large DIF according to the Zumbo & Thomas classification. LR analysis (fragment):

Item              R² step 1  R² step 2  R² step 3    χ²      p     ΔR²(3−1)  ΔR²(3−2)  ΔR²(2−1)
@1_1mc03101_A       .369       .370       .371       4.09   .129     0.002     0.001      –
@2_2mc03103_A       .245        –         .246       2.94   .230     0.000      –         –
@3_3mc03105_A       .165        –         .188      37.81   .000     0.023      –         –
@4_4mc03106_A       .391        –         .401      20.60    –       0.010      –         –
@5_5mr00119_A       .327       .331       .332       7.44   .024     0.005      –       0.004
@6_6mr00120_A       .258       .274       .275      28.68    –       0.017      –       0.016
@7_7mr00122_A       .277        –         .278       1.26   .532      –         –         –
@8_8mc01111_A       .427       .443       .447      39.8     –       0.020      –         –
@9_9mc01112_A       .160        –         .183      38.971   –        –         –         –
@10_10mc02101_A     .280        –         .289      16.59    –       0.009      –         –
Simulation study: with data simulated to include the observed difference in mean ability between the two countries but no DIF, all methods correctly showed no DIF.
DIF analysis: conclusions and solution. The test is not entirely fair across countries: several items most likely exhibit DIF. Based on the DIF analysis using the ETS approach, 13 of the 37 items demonstrated country-related DIF (7 in favour of Chinese students and 6 in favour of Russian students), while 24 items were DIF-free. The 24 DIF-free items functioned as "common items" for linking between the countries. The 13 items displaying DIF were split and treated as unique items for Russia and unique items for China. Additional analysis found no evidence of DIF in the 24 common items.
Psychometric analysis: math test, grade 1, Russia + China. The person reliability was 0.85 (classical reliability α = 0.83); the person separation index was 2.39. The test was unidimensional. The other tests (the math test for grade 3 and the physics tests for grades 1 and 3) showed similar results.
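As a check on these numbers: person reliability R and the person separation index G are related by R = G^2 / (1 + G^2) (equivalently G = \sqrt{R/(1-R)}), so G = 2.39 implies R = 2.39^2 / (1 + 2.39^2) ≈ 0.85, consistent with the reported value.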
Constructing the common scale between different grades. Of the 20 items common to the grade 1 and grade 3 math tests, 7 were selected as good candidates for linking between grades; the other common items either were deleted during the separate analyses for each grade or exhibited DIF on at least one test. These 7 items were used as anchors for the two grades. To evaluate the quality of the link between the grade 1 and grade 3 tests, we calculated the item-within-link statistic (Wright & Bell, 1984); its value of .95 indicated a reasonable fit within the link.
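A minimal sketch of the separate-calibration linking step through the anchor items (illustrative names, not the project's code); the standardized per-item check is a simplified stand-in in the spirit of the Wright & Bell diagnostics rather than their exact item-within-link formula:

```python
# Linking two separately calibrated tests through their anchor (common) items.
import numpy as np

def link(d_grade1, d_grade3, se_grade1, se_grade3):
    """d_*  : anchor-item difficulty estimates (logits) from each separate calibration
       se_* : their standard errors
       Returns the shift that places the grade 3 calibration on the grade 1 scale,
       plus a standardized residual of each anchor item from that common shift."""
    d1, d3 = np.asarray(d_grade1), np.asarray(d_grade3)
    s1, s3 = np.asarray(se_grade1), np.asarray(se_grade3)
    shift = np.mean(d1 - d3)                        # mean difficulty difference over anchors
    z = (d1 - d3 - shift) / np.sqrt(s1**2 + s3**2)  # per-anchor consistency check
    return shift, z

# Anchor items with |z| clearly above 2 would be candidates for dropping from the link.
```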
A common math ability scale for both countries and both grades. It allows us to compare test results directly and to estimate how much progress students make between grades in different countries, and therefore gives us a basis for international comparisons. [Figure: item–person map showing items and students on the common scale]
Conclusions: what has been done? For each country and each grade we constructed the tests; the evidence we gathered suggests that they are unidimensional, reliable, and fair. We constructed common scales across the two countries for grade 1 and for grade 3 separately, and then a common scale for both countries and both grades.

Country   Grade 1 mean (SD)   Grade 3 mean (SD)   Difference
Russia      −0.33 (0.75)         0.65 (1.04)         +0.97
China        1.17 (0.86)         0.80 (0.76)         −0.37
Comparisons of the mathematics achievement of Russian and Chinese students (grades 1 and 3)
Conclusion. We used Rasch measurement to investigate cross-national comparability and to explore the possibility of constructing a multi-grade, multi-national common scale for the ISHEL project. We used simultaneous and separate calibration procedures to create a common scale across grades and countries. We showed that the tests are of good quality and can be used for international comparative research.
Thank you for your attention!