Evaluation Rating Forms
Craig McClure, MD
May 15, 2003
Educational Outcomes Service Group
Typical Use of Rating Scales
- End of rotation (global)
- After a single encounter (focused)
- To incorporate input from multiple evaluators
- Videotaped encounters
- Not as a yes/no checklist for single encounters
Alternate Forms
- Multiple episodes versus a focused (single) episode
- Measuring global (six domains) versus task-specific behavior
Global Rating of Learner
- Covers domains of competence, not specific skills, tasks, or behaviors
- Completed retrospectively, covering multiple days and activities
- May be from multiple sources
- Uses rating scales
Focused Rating Scale
- Single patient encounter
- Concerns a specific task, skill, or behavior
Advantages (Global)
- Easy to develop
- Easy to use (minimal training)
- Can be used to evaluate all domains
- Reasonable reliability when:
  - The evaluation is focused
  - The form is tailored to the competencies measured
Systematic Rater Errors (Global)
- Leniency/severity
- Range restriction
- Halo effect
- Inappropriate weighting
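The first three rater errors listed above can be screened for with simple descriptive statistics. The following is a minimal sketch, not a validated psychometric method: the function name `rater_summary`, the tuple layout, and the sample scores are all hypothetical and chosen only for illustration. It assumes each rating is recorded as (rater, learner, domain, score).

```python
from collections import defaultdict
from statistics import mean, pstdev


def rater_summary(ratings):
    """Summarize each rater's scoring pattern.

    ratings: iterable of (rater, learner, domain, score) tuples
    (a hypothetical layout used only for this sketch).
    """
    ratings = list(ratings)
    grand_mean = mean(score for _, _, _, score in ratings)

    by_rater = defaultdict(list)
    by_rater_learner = defaultdict(list)
    for rater, learner, _domain, score in ratings:
        by_rater[rater].append(score)
        by_rater_learner[(rater, learner)].append(score)

    summary = {}
    for rater, scores in by_rater.items():
        # Spread of one rater's scores across domains for the same learner
        per_learner = [
            pstdev(s) for (r, _), s in by_rater_learner.items() if r == rater
        ]
        summary[rater] = {
            # Positive: more lenient than the group; negative: more severe
            "leniency": mean(scores) - grand_mean,
            # A small value suggests range restriction
            "spread": pstdev(scores),
            # A small value suggests a possible halo effect
            "within_learner_spread": mean(per_learner),
        }
    return summary


# Illustrative (made-up) ratings on a 1-5 scale
sample = [
    ("A", "X", "Patient Care", 5), ("A", "X", "Medical Knowledge", 5),
    ("A", "X", "Professionalism", 5), ("A", "Y", "Patient Care", 5),
    ("A", "Y", "Medical Knowledge", 5), ("A", "Y", "Professionalism", 4),
    ("B", "X", "Patient Care", 3), ("B", "X", "Medical Knowledge", 4),
    ("B", "X", "Professionalism", 2), ("B", "Y", "Patient Care", 2),
    ("B", "Y", "Medical Knowledge", 3), ("B", "Y", "Professionalism", 1),
]
summary = rater_summary(sample)
```

In this made-up sample, rater A scores above the group mean with little spread, the pattern that leniency, range restriction, and halo would produce together. Such numbers only flag raters for a closer look; they cannot distinguish a lenient rater from one who happened to see stronger learners.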
Drawbacks (Global)
- Content validity uncertain
- Questionable validity of general assessments extrapolated to a whole domain
- Inefficient at directing learner improvement
- Accuracy variable
- Generosity factor
- Poor discrimination between learners
Mixed Research Results
- Discriminating between competence levels
  - Reliably rating more skilled physicians higher than less skilled
- Reliability of ratings
  - Reproducibility
  - Best: knowledge
  - Harder: patient care, interpersonal skills
Clarify Evaluative Objectives
- Global versus focused
- Define using the competency-based language emphasized by the ACGME
Group the Competencies
- Patient Care
- Medical Knowledge
- Practice-Based Learning and Improvement
- Interpersonal and Communication Skills
- Professionalism
- Systems-Based Practice
Composition of Form
- Short is better than long
- Big font is better than small
- Clean is better than cluttered
Each Behavior is Evaluated Independently
- Otherwise:
  - Evaluator is uncertain what to evaluate
  - Learner is uncertain what to address
Decide on Options in the Scale
- Best with a minimum of five options
- Best with a descriptor for each option
- Absence of middle labels skews ratings toward the positive side
Primacy Effect
- “The results showed that when the positive side of the scale was on the left, the ratings were more positive and had reduced variance than when the positive label was on the right.”
Lake Wobegon Effect
- Where all the children are above average
- Faculty tend to interpret anchors as more negative than their literal meaning
- Generosity effect
Consider Changing Anchors
- If the desire is to keep evaluative anchors:
  - Poor, fair, below average, average, above average, and excellent
  - Very poor, poor, fair, good, very good, excellent
Consider Using Frequency Anchors
- Rate the frequency of observable resident behaviors, from “never” to “always”
- Judgmental ratings need considerable education of the evaluators to minimize inter-rater variability
- Permits competency judgment by the program director (PD)
Example of Stem for Frequency Anchor
- Resident demonstrates respect in speaking to patient…
- Never, 25%, 50%, 75%, Always
Competency Judgment at Program Level
- Permits competency definitions to vary by year of training
- Diminishes the effect of inter-rater variability
- Focuses on observable behavior
- Requires less training of evaluators
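One way to picture a program-level judgment built on frequency anchors is sketched below. Evaluators record only how often they observed a behavior, and the program then applies a threshold that depends on the year of training. This is an illustrative sketch under stated assumptions: the numeric coding of the anchors, the `THRESHOLD_BY_YEAR` values, and the function name `meets_competency` are all hypothetical and do not come from the ACGME or from the cited references.

```python
from statistics import mean

# Frequency anchors from the example stem, coded as proportions (assumed coding)
ANCHORS = {"Never": 0.0, "25%": 0.25, "50%": 0.5, "75%": 0.75, "Always": 1.0}

# Hypothetical thresholds: a program would set its own, one per training year
THRESHOLD_BY_YEAR = {1: 0.5, 2: 0.75, 3: 0.9}


def meets_competency(responses, training_year):
    """Average several evaluators' frequency responses for one behavior
    and compare the result with the threshold for the training year.

    Returns (meets_threshold, average_frequency).
    """
    average = mean(ANCHORS[r] for r in responses)
    return average >= THRESHOLD_BY_YEAR[training_year], average


# Three evaluators rate "demonstrates respect in speaking to patient"
responses = ["75%", "Always", "75%"]
year1_ok, avg = meets_competency(responses, training_year=1)
year3_ok, _ = meets_competency(responses, training_year=3)
```

With these made-up thresholds, the same observations satisfy the first-year standard but not the third-year one. The evaluators report only frequencies, and the competency decision is made separately by the program.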
References
- S. Swing. Evaluations. Academic Emergency Medicine 2002;9:1278-88.
- Assessment of Communication and Interpersonal Skills Competencies. Academic Emergency Medicine 2002;9:1257-69.
- ACGME/ABMS Joint Initiative. Toolbox of Assessment Methods. September 2000.
References (2)
- M.A. Albanese. Challenges in using rater judgments in medical education. Journal of Evaluation in Clinical Practice 6(3):305-319.