Validity and reliability of rating speaking and writing performances
STANAG 6001 Testing Workshop, Brno, 6–8 September 2016
Ülle Türk, Estonia
Quality in assessment
Reliability is the degree to which an assessment tool produces stable and consistent results. Validity is the extent to which an assessment accurately measures what it is intended to measure.
Scoring validity
How far can we depend on the scores which result from the test?
Parameters for tests of productive skills:
- Criteria / rating scale
- Rating procedures: rater selection, rater training, standardisation, moderation
- Rating conditions
- Statistical analysis
- Raters
- Grading and awarding
Reliability in tests of productive skills
- Intra-rater reliability, or internal consistency
- Inter-rater reliability, or inter-rater agreement
- Parallel forms reliability
Rater effects that affect reliability
- Differences in rater severity
- Halo effect = failing to assign independent scores to the distinct categories of an analytic rubric
- Central tendency = the reluctance to assign scores at the extremes of a rating scale
Methods for assessing rater reliability
- Numerical: percentage of agreement between the two raters/ratings; correlation coefficients
- Visual: cross-tabulation matrix
Percent agreement
[Table: first rating, second rating, and agreement for 14 examinees (Ann, Paul, Mary, Jill, Tom, Steve, Linda, John, Harry, Kate, Bill, Joe, Tina, Jane)]
The basic model for calculating inter-rater reliability is percent agreement in the two-rater model:
1. Count the ratings that are in agreement.
2. Count the total number of ratings.
3. Convert the fraction to a percentage.
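The three steps above can be sketched in a few lines of Python. The rating data here is hypothetical, not the workshop's example:

```python
# Hypothetical first and second ratings for ten examinees (not the
# workshop's data); plus-levels are kept as strings.
first  = ["2+", "2", "1+", "3", "2", "1", "2+", "2", "1+", "3"]
second = ["2+", "2", "2",  "3", "2", "1", "2",  "2", "1+", "3"]

def percent_agreement(r1, r2):
    """Share of cases in which both raters assigned the same level."""
    matches = sum(a == b for a, b in zip(r1, r2))   # step 1: agreements
    total = len(r1)                                 # step 2: total ratings
    return 100 * matches / total                    # step 3: percentage

print(percent_agreement(first, second))  # 8 of 10 identical -> 80.0
```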
Rules-of-Thumb for Percent Agreement Interpretation
- 4 or fewer rating categories: high agreement = 90%, minimal agreement = 75%; qualification: no ratings more than one level apart
- 5-7 rating categories: approximately 90% of ratings identical or adjacent
Correlation
[Table: first and second ratings for the same 14 examinees, with plus-levels converted to numbers]
With plus-levels, translate levels into numbers: 1 = 1, 1+ = 2, 2 = 3, 2+ = 4, 3 = 5
Pearson r = 0.670
Mean: 1st = 3.07, 2nd = 3.07
Standard deviation: 1st = 1.385, 2nd = 1.072
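Once plus-levels are converted to numbers, the Pearson coefficient can be computed directly from the definition. A minimal sketch with hypothetical numeric rating pairs (not the workshop's data, so the coefficient differs from the 0.670 above):

```python
from math import sqrt

# Hypothetical ratings after converting plus-levels
# (1 = 1, 1+ = 2, 2 = 3, 2+ = 4, 3 = 5).
first  = [4, 3, 2, 5, 3, 1, 4, 3, 2, 5]
second = [3, 3, 2, 5, 4, 1, 4, 2, 2, 4]

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(first, second), 3))
```

In practice, `statistics.correlation` (Python 3.10+) or `scipy.stats.pearsonr` would be used instead of a hand-rolled function.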
Interpretation
Benchmarks for correlation coefficients:
- < 0.20 = poor
- 0.21 to 0.40 = fair
- 0.41 to 0.60 = moderate
- 0.61 to 0.80 = good
- 0.81 to 1.00 = very good
Cross-tabulation matrix
[Matrix: first rating (rows) against second rating (columns) on levels 1, 1+, 2, 2+, 3 for the same 14 examinees; cells on the diagonal are exact agreements]
References Luoma, Sari (2004) Assessing Speaking. Cambridge University Press. Weir, Cyril J. (2005) Language Testing and Validation: An Evidence-Based Approach. Palgrave Macmillan.