Published by Philip Webster. Modified over 9 years ago.
1
Changes in Test Scores with Multiple Sittings of CanTEST
Philip Nagy
2
Official Languages and Bilingualism Institute

Rationale

Research Questions
  Do test scores change on repeating the test?
  Is change related to length of time between sittings?

Test Development Questions
  Can data from repeaters be used in test calibration for new form development?

Context: Receptive Skills
3
The Data

Listening Tests: Six forms with 15 short-passage and 25 long-passage items
Reading Tests: Seven forms with 15 skim-and-scan, 20 reading passage, and 25 cloze items
The Sample: Mean first score of 3.6, compared to 4.3 for those who write only once

Assumptions
  Difficulty of forms is balanced across sittings (true)
  Samples writing each form are equivalent (untested)
4
Listening Results: Sitting 2 minus Sitting 1 (N=179)

Change in Raw Score   Total Test (40)   Short Passages (15)   Long Passages (25)
Down >11                     3                  -                    1
Down 6 to 10                18                  2                   11
Down 3 to 5                 18                 24                   22
Same ± 2                    43                 91                   72
Up 3 to 5                   42                 42                   46
Up 6 to 10                  36                 20                   24
Up >11                      19                  -                    3
5
Listening Results, another look

Change in Raw Score   Total Test (40)    Short Passages (15)   Long Passages (25)
Down some                  22%                  15%                   19%
About the same             24%                  51%                   40%
Up some                    54%                  34%                   41%
Mean raw gain              2.6                  1.3                   1.3
Mean % gain           6.5% of 40 items     8.8% of 15 items      5.2% of 25 items
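The three-way summary above can be reproduced from the band counts on the previous slide. The following is a minimal Python sketch, assuming the total-test counts read 3, 18, 18, 43, 42, 36, 19 from "Down >11" to "Up >11"; the names `counts` and `summarize` are illustrative and do not come from the presentation.

```python
# Total-test band counts for Listening (N=179), as read from the
# results slide (assumed transcription, ordered Down >11 to Up >11).
counts = {
    "Down >11": 3,
    "Down 6 to 10": 18,
    "Down 3 to 5": 18,
    "Same ± 2": 43,
    "Up 3 to 5": 42,
    "Up 6 to 10": 36,
    "Up >11": 19,
}

def summarize(band_counts):
    """Collapse the seven bands into down / same / up percentages."""
    n = sum(band_counts.values())
    groups = {"Down some": 0, "About the same": 0, "Up some": 0}
    for band, count in band_counts.items():
        if band.startswith("Down"):
            groups["Down some"] += count
        elif band.startswith("Up"):
            groups["Up some"] += count
        else:
            groups["About the same"] += count
    # Round to whole percentages, as on the slide.
    return {name: round(100 * c / n) for name, c in groups.items()}

summary = summarize(counts)
print(summary)
```

With these counts the sketch gives 22%, 24% and 54%, matching the Total Test column.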
6
Listening Results Interpretation

How important is the improvement?
  On average, 3.6 points out of 40 are needed to improve one band
  So, 2.6 points is about 75% of a band improvement
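The band comparison is a simple ratio of the mean gain to the points needed per band. A small sketch of that arithmetic follows; the function name `band_fraction` is hypothetical, and the inputs are the two figures quoted on the slide.

```python
def band_fraction(mean_gain, points_per_band):
    """Express a mean raw-score gain as a fraction of one band."""
    return mean_gain / points_per_band

# Listening: mean gain of 2.6 points, 3.6 points per band on average.
listening = band_fraction(mean_gain=2.6, points_per_band=3.6)
print(f"{listening:.0%} of a band")
```

The ratio comes out a little above 0.72, which the slide rounds to roughly three quarters of a band.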
7
Listening Results Interpretation

Can the data be used for test calibration?
  The changes in average item difficulty are different for the subtests:
    .088 for short passages
    .052 for long passages
  The difference of .036 (.088 - .052) is about the same as the standard error of the difficulty indices
  Listening data from repeaters should not be used for item calibration
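The slides do not say how the standard error of the difficulty indices was obtained. As a rough plausibility check only, the sketch below assumes a difficulty index is a proportion correct and uses the binomial standard error at p = 0.5 with the 179 repeaters; both the formula choice and p = 0.5 are assumptions, not the author's stated method.

```python
import math

def binomial_se(p, n):
    """Standard error of a proportion-correct estimate from n examinees."""
    return math.sqrt(p * (1 - p) / n)

# Changes in average item difficulty, as given on the slide.
short_change, long_change = 0.088, 0.052
difference = short_change - long_change

# Assumption: p = 0.5 (largest possible SE) and n = 179 repeaters.
se = binomial_se(p=0.5, n=179)
print(f"difference = {difference:.3f}, SE = {se:.3f}")
```

Under these assumptions the difference and the standard error are of similar size, which is consistent with the slide's statement.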
8
Changes in Listening by Length of Time between Sittings

Time Between Tests    Total Test   Short Passages   Long Passages
> 6 months (N=63)       +2.13         +0.63¹           +1.49
< 6 months (N=116)      +2.87         +1.69¹           +1.18

¹ Difference significant, p=0.05

Those who repeat sooner do better than those who repeat later
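The two group means can be pooled, weighting by group size, to check them against the overall mean raw gain of 2.6 reported earlier. This is a minimal sketch using only the Ns and total-test gains from the slide; the variable names are illustrative.

```python
# (N, mean total-test gain) for each group, from the slide.
groups = {"> 6 months": (63, 2.13), "< 6 months": (116, 2.87)}

total_n = sum(n for n, _ in groups.values())

# Weighted mean: each group's gain counts in proportion to its size.
pooled_gain = sum(n * gain for n, gain in groups.values()) / total_n
print(total_n, round(pooled_gain, 2))
```

The pooled value is close to 2.6 over 179 examinees, so the split by time between sittings agrees with the overall Listening result.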
9
Reading Results: Sitting 2 minus Sitting 1 (N=284)

Note: Reading Score is doubled to give a total out of 80 rather than 60.

Change in Raw Score   Total (80)   Skim-&-Scan (15)   Passage (20)   Cloze (25)
Down 21 or more           17              -                -              -
Down 11 to 20             19              2                -             12
Down 6 to 10              21             12               18             32
Down 3 to 5               28             32               30             34
Same score ± 2            46            139              142            106
Up 3 to 5                 33             65               63             52
Up 6 to 10                47             31               23             36
Up 11 to 20               48              3                8             12
Up 21 or more             25              -                -              -
10
Reading Results, another look

Change in Raw Score   Total (80)   Skim-&-Scan (15)   Reading Passage (20)   Cloze Passage (25)
Down some                30%            16%                  17%                   27%
About the same           16%            49%                  50%                   37%
Up some                  54%            35%                  33%                   35%
11
Reading Results Interpretation

How important is the improvement?
  On average, 6.5 points (out of 80) are needed to improve one band
  So, 3.45 points is about 55% of a band improvement
12
Reading Results Interpretation

Can the data be used for test calibration?
  The changes in average item difficulty are different for the subtests:
    +0.072 for skim-and-scan
    +0.050 for reading passages
    +0.002 for cloze
  The largest difference of .070 (.072 - .002) is two to three times larger than the standard error of the difficulty indices
  Reading data from repeaters should not be used for item calibration
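The "two to three times" comparison can be illustrated with the same kind of rough check. The sketch below takes the three changes from the slide and compares their spread with a binomial standard error at p = 0.5 for the 284 repeaters; that standard error is an assumption made here for illustration, since the presentation does not give its own value.

```python
import math

# Changes in average item difficulty by subtest, from the slide.
changes = {"skim-and-scan": 0.072, "reading passages": 0.050, "cloze": 0.002}

# Largest gap between any two subtests.
largest_gap = max(changes.values()) - min(changes.values())

# Assumption: binomial SE of a proportion at p = 0.5 with n = 284.
se = math.sqrt(0.5 * 0.5 / 284)
ratio = largest_gap / se
print(f"gap = {largest_gap:.3f}, SE = {se:.3f}, ratio = {ratio:.1f}")
```

Under this assumption the ratio falls between 2 and 3, in line with the slide.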
13
Changes in Reading by Length of Time between Sittings

Time Between Tests    Total (80)   Skim-&-Scan   Reading Passage   Cloze Passage
> 6 months (N=105)     -0.119       -0.292¹         -0.017            -0.079
< 6 months (N=179)     +0.070       +0.171¹         +0.010            +0.046

¹ Difference significant, p=0.05

Those who repeat later actually do worse than those who repeat sooner
14
Conclusion

Listening
  30% of sample do more poorly on 2nd sitting
  Average gain is 75% of a band score
  Differences in gains across item types vary by an item standard error
Reading
  40% of sample do more poorly on 2nd sitting
  Average gain is 55% of a band score
  Differences in gains across item types vary by 2-3 times an item standard error
Both
  Those who rewrite within six months do better
  Data from repeaters should not be used for item calibration