Validity and reliability of rating speaking and writing performances
STANAG 6001 Testing Workshop, Brno, 6–8 September 2016
Ülle Türk, Estonia
Quality in assessment
Reliability is the degree to which an assessment tool produces stable and consistent results.
Validity is the extent to which an assessment accurately measures what it is intended to measure.
Scoring validity
Scoring validity
How far can we depend on the scores which result from the test?
Parameters for tests of productive skills:
- Criteria / rating scale
- Rating procedures
- Rater selection
- Rater training
- Standardisation
- Moderation
- Rating conditions
- Statistical analysis
- Raters
- Grading and awarding
Reliability in tests of productive skills
Intra-rater reliability or internal consistency
Inter-rater reliability or inter-rater agreement
Parallel forms reliability
Rater effects that affect reliability
Differences in rater severity
Halo effect = failing to assign independent scores to the distinct categories of an analytic rubric
Central tendency = the reluctance to assign scores at the extremes of a rating scale
Methods for assessing rater reliability
Numerical:
- percentage of agreement between the two raters / ratings
- correlation coefficients
Visual:
- cross-tabulation matrix
Percent agreement
[Example table: first and second ratings (levels 1 to 3, with plus-levels) for 14 candidates: Ann, Paul, Mary, Jill, Tom, Steve, Linda, John, Harry, Kate, Bill, Joe, Tina and Jane.]
The basic model for calculating inter-rater reliability is percent agreement in the two-rater model:
1. Calculate the number of ratings that are in agreement
2. Calculate the total number of ratings
3. Convert the fraction to a percentage
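The three steps above can be sketched in a few lines of Python. The rating pairs below are hypothetical stand-ins, not the data from the slide's example:

```python
# Percent agreement between two raters in the two-rater model.
# The rating lists below are hypothetical illustration data.

def percent_agreement(first, second):
    """Percentage of candidates given the same rating by both raters."""
    if len(first) != len(second):
        raise ValueError("Both raters must rate the same candidates")
    agreements = sum(1 for a, b in zip(first, second) if a == b)  # step 1
    total = len(first)                                            # step 2
    return 100 * agreements / total                               # step 3

first_ratings  = ["2+", "2", "1+", "3", "2", "1", "2+", "2"]
second_ratings = ["2",  "2", "1+", "3", "2+", "1", "2+", "2"]
print(f"{percent_agreement(first_ratings, second_ratings):.1f}%")  # 6 of 8 agree: 75.0%
```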
Rules-of-thumb for percent agreement
- 4 or fewer categories: 90% = high agreement, 75% = minimal agreement; no ratings more than one level apart.
- 5–7 categories: approximately 90% of ratings identical or adjacent.
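The "identical or adjacent" criterion can be checked mechanically once the levels are numeric. A minimal sketch, with hypothetical numeric ratings (using the deck's later mapping 1 = 1, 1+ = 2, ..., 3 = 5):

```python
# Percentage of rating pairs that are identical or adjacent
# (at most one level apart). The ratings are hypothetical.

def adjacent_agreement(first, second, max_gap=1):
    """Share of pairs differing by no more than max_gap levels."""
    close = sum(1 for a, b in zip(first, second) if abs(a - b) <= max_gap)
    return 100 * close / len(first)

first  = [4, 3, 2, 5, 3, 1, 4, 3]
second = [3, 3, 2, 5, 5, 1, 4, 2]
print(f"{adjacent_agreement(first, second):.1f}% identical or adjacent")  # 7 of 8: 87.5%
```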
Correlation
With plus-levels, translate levels into numbers: 1 = 1, 1+ = 2, 2 = 3, 2+ = 4, 3 = 5.
[Example: the same 14 rating pairs converted to numbers.]
Pearson r = 0.670. Mean: 1st = 3.07, 2nd = 3.07. St dev: 1st = 1.385, 2nd = 1.072.
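Once the levels are numeric, the Pearson coefficient follows from the usual formula. A self-contained sketch; the ratings here are hypothetical, not the slide's data:

```python
# Pearson correlation between two raters' numeric ratings.
# The rating lists are hypothetical illustration data.
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

first  = [4, 3, 2, 5, 3, 1, 4, 3, 5, 2]   # levels mapped via 1=1, 1+=2, ..., 3=5
second = [3, 3, 2, 5, 4, 1, 4, 3, 4, 2]
print(f"Pearson r = {pearson(first, second):.3f}")
```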
Interpretation
Benchmarks for correlation coefficients:
< 0.20 = poor
0.21 to 0.40 = fair
0.41 to 0.60 = moderate
0.61 to 0.80 = good
0.81 to 1.00 = very good
Cross-tabulation matrix
[Example matrix: the 14 candidates' first ratings (rows) cross-tabulated against their second ratings (columns) at levels 1, 1+, 2, 2+ and 3; exact agreements fall on the diagonal.]
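A cross-tabulation matrix like the one above can be built directly from the rating pairs. A sketch with hypothetical data:

```python
# Cross-tabulate first vs second ratings; agreements lie on the diagonal.
# The rating pairs are hypothetical illustration data.
from collections import Counter

LEVELS = ["1", "1+", "2", "2+", "3"]
pairs = [("2+", "2"), ("2", "2"), ("1+", "1+"), ("3", "3"),
         ("2", "2+"), ("1", "1"), ("2+", "2+"), ("2", "2")]

table = Counter(pairs)  # counts of (first rating, second rating) pairs

# Print the matrix: rows = first rating, columns = second rating.
print("1st\\2nd " + " ".join(f"{lv:>3}" for lv in LEVELS))
for row in LEVELS:
    cells = " ".join(f"{table[(row, col)]:>3}" for col in LEVELS)
    print(f"{row:>7} {cells}")
```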
References Luoma, Sari (2004) Assessing Speaking. Cambridge University Press. Weir, Cyril J. (2005) Language Testing and Validation: An Evidence-Based Approach. Palgrave Macmillan.