Reliability in assessment. Cees van der Vleuten, Maastricht University. Certificate Course on Assessment, 6 May 2015
Overview What is reliability conceptually? Evidence of the literature? How to improve reliability?
What is reliability? Correlation ($r_{x,y}$)
What is reliability? High correlation ($r_{x,y} \to 1.0$); low correlation ($r_{x,y} \to 0.0$)
Measurement influence
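Reliability in achievement tests: a test is a set of items; split the items into two halves, and the correlation r between the two half-test scores is the split-half reliability coefficient.
Reliability in achievement tests: r across all possible splits (the colours in the figure) = Cronbach's alpha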
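The two coefficients on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's own material: the function names, the odd/even split, and the tiny score matrix are all assumptions made for the example. Alpha is computed here with the usual item-variance formula rather than by enumerating every split.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def split_half_r(scores):
    """Correlation between odd-item and even-item half scores.

    scores: one row per examinee, one column per item.
    The odd/even split is an arbitrary choice for this sketch.
    """
    half_a = [sum(row[0::2]) for row in scores]
    half_b = [sum(row[1::2]) for row in scores]
    return pearson_r(half_a, half_b)


def cronbach_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 examinees, 4 dichotomous items.
example_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(split_half_r(example_scores), 3))
print(round(cronbach_alpha(example_scores), 3))
```

Note that the split-half r describes a test of half the length; it is usually stepped up to full length with the Spearman-Brown formula.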
Reliability and test length: Spearman-Brown prophecy formula. Figure: reliability (vertical axis) against test length (horizontal axis), actual versus predicted values. See:
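The Spearman-Brown prophecy formula predicts the reliability of a test lengthened (or shortened) by a factor k from its current reliability r: predicted = k·r / (1 + (k − 1)·r). A minimal Python sketch follows; the function names are chosen for this example, and the formula assumes the added material is comparable to the existing material.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is length_factor times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


# Doubling a test that has reliability 0.50:
print(round(spearman_brown(0.50, 2), 3))

# Lengthening needed to move from 0.50 to 0.80:
print(round(length_factor_needed(0.50, 0.80), 3))
```

The second function is simply the same formula solved for k, which is handy when planning testing time.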
Three reliability theories
- Classical test theory
- Generalizability theory
- Item-response theory
Further reading:
De Champlain, A. F. (2010). A primer on classical test theory and item response theory for assessments in medical education. Medical Education, 44(1),
Bloch, R., & Norman, G. (2012). Generalizability theory for the perplexed: A practical introduction and guide: AMEE Guide No. 68. Medical Teacher, 34(11),
Overview What is reliability conceptually? Evidence of the literature? How to improve reliability?
Reliabilities across methods
Table (cell values not preserved): reliability as a function of testing time in hours, per assessment method.
Methods: MCQ; Case-Based Short Essay; PMP; Oral Exam; Long Case; OSCE; Mini CEX; Practice Video Assessment; Incognito SPs.
Sources cited in the table: Norcini et al.; Stalenhoef-Halling et al.; Swanson; Wass et al.; Van der Vleuten; Norcini et al., 1999; Ram et al.; Gorter, 2002.
This table has been published in: Van Der Vleuten, C. P., & Schuwirth, L. W. (2005). Assessing professional competence: from methods to programmes. Medical Education, 39(3), See:
Reliability oral examination (Swanson, 1987)
Table (cell values not preserved): reliability by testing time in hours and number of cases, for three examiner designs: same examiner for all cases; new examiner for each case; two new examiners for each case.
Here multiple sources of error (cases, examiners) are combined in a single reliability estimate. This is the strength of generalizability theory.
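How two error sources end up in one coefficient can be illustrated with the standard generalizability coefficient for relative decisions. The sketch below assumes a fully crossed design (every examinee p meets every case c and every examiner e); the notation and the design are assumptions for illustration, not taken from Swanson (1987), whose designs nest examiners within cases.

```latex
E\rho^2 =
\frac{\sigma^2_{p}}
     {\sigma^2_{p}
      + \dfrac{\sigma^2_{pc}}{n_c}
      + \dfrac{\sigma^2_{pe}}{n_e}
      + \dfrac{\sigma^2_{pce,\mathrm{res}}}{n_c\, n_e}}
```

Here $\sigma^2_{p}$ is the variance between examinees, $\sigma^2_{pc}$ and $\sigma^2_{pe}$ are the examinee-by-case and examinee-by-examiner interaction variances, $\sigma^2_{pce,\mathrm{res}}$ is the residual, and $n_c$, $n_e$ are the numbers of cases and examiners sampled. Each error term shrinks as more cases or examiners are sampled, which is why reliability rises with testing time.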
Checklist or rating scale reliability in OSCE (Van Luijk & van der Vleuten, 1990)
The literature clearly suggests
- Reliability is a matter of sampling
  - Across contexts
  - Across assessors or any other factor influencing the assessment
- Objectivity is NOT the same as reliability
- Many subjective judgments make a robust judgment
- There are no intrinsically more reliable methods of assessment
- Most of our assessments in actual practice are not very reliable!
Overview What is reliability conceptually? Evidence of the literature? How to improve reliability?
Consequently…
- One single measure is no measure
- Combine information
  - Across time
  - Across multiple measures
- Be aware of substantial false-positive and false-negative errors in a single measure.
Reliability    Expected % false decisions
1.00           0
0.95           10
0.80           20
0.70           25
0.60           30
0.50           33
0.00           50
Finally…
- Reliability and sampling are strongly related
- Objectification and standardization do not intrinsically lead to more reliability
- Do not objectify or standardize where it is not needed (e.g. when assessing complex skills in the real world).
This Powerpoint can be found at: