1
Jenna Porter, David Jelinek
Sacramento State University
Statistical Analysis of Scorer Interrater Reliability
2
Background & Overview Piloted since 2003-2004 Began with our own version of training & calibration Overarching question: Scorer interrater reliability i.e., Are we confident that the TE scores our candidates receive are consistent from one scorer to the next? If not -- the reasons? What can we do out it? Jenna: Analysis of data to get these questions Then we’ll open it up to “Now what?” “Is there anything we can/should do about it?” “Like what?”
3
Presentation Overview
Is there interrater reliability among our PACT scorers? How do we know?
Two methods of analysis
Results
What do we do about it? Implications
4
Data Collection
8 credential cohorts – 4 Multiple Subject and 4 Single Subject
181 Teaching Events total
– 11 rubrics (excludes pilot Feedback rubric)
– 10% randomly selected for double scoring
– 10% of TEs were failing and double scored
38 Teaching Events were double scored (20%)
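The double-scoring sample described above is easy to reproduce in code. Below is a minimal Python sketch of one way to draw such a sample: a random 10% of all Teaching Events plus every failing TE. The function name, ID scheme, and seed are illustrative assumptions, not the authors' actual procedure.

```python
import random

def select_for_double_scoring(te_ids, failing_ids, fraction=0.10, seed=2004):
    """Pick a random `fraction` of TEs for double scoring, plus all failing TEs.

    Illustrative sketch only; not the study's actual sampling code.
    """
    rng = random.Random(seed)
    n_random = max(1, round(fraction * len(te_ids)))
    random_sample = set(rng.sample(list(te_ids), n_random))
    return sorted(random_sample | set(failing_ids))

# Hypothetical example: 181 TEs, of which 20 are (hypothetically) failing.
all_tes = list(range(1, 182))
failing = list(range(1, 21))
selected = select_for_double_scoring(all_tes, failing)
print(len(selected))  # up to 38, minus any overlap between the random sample and the failing TEs
```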
5
Scoring Procedures
Trained and calibrated scorers (university faculty); calibrated once per academic year
Followed the PACT calibration standard – scores must:
– Result in the same pass/fail decision (overall)
– Have exact agreement with the benchmark at least 6 times
– Be within 1 point of the benchmark
All TEs scored independently once
If failing, scored by a second scorer and evidence reviewed by the chief trainer
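To make the calibration standard above concrete, here is a minimal Python sketch. It assumes each scorer produces one integer score per rubric plus an overall pass/fail decision; the function name, threshold parameter, and example scores are hypothetical and not part of the PACT materials.

```python
def meets_calibration_standard(scorer_scores, benchmark_scores,
                               scorer_pass, benchmark_pass,
                               min_exact_matches=6):
    """Check the three conditions of the calibration standard listed above.

    Illustrative sketch only; assumes one integer score per rubric.
    """
    same_decision = scorer_pass == benchmark_pass
    exact_matches = sum(s == b for s, b in zip(scorer_scores, benchmark_scores))
    all_within_one = all(abs(s - b) <= 1 for s, b in zip(scorer_scores, benchmark_scores))
    return same_decision and exact_matches >= min_exact_matches and all_within_one

# Hypothetical scores on the 11 rubrics (1-4 scale):
scorer    = [2, 3, 2, 2, 3, 2, 3, 2, 2, 3, 2]
benchmark = [2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2]
print(meets_calibration_standard(scorer, benchmark,
                                 scorer_pass=True, benchmark_pass=True))  # True
```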
6
Methods of Analysis
Percent Agreement
– Exact agreement
– Agreement within 1 point
– Combined (exact and within 1 point)
Cohen's Kappa (Cohen, 1960)
– Indicates the proportion of agreement between raters above what is expected by chance
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
7
Percent Agreement
Benefits
– Easy to understand
Limitations
– Does not account for chance agreement
– Tends to overestimate true agreement (Berk, 1979; Grayson, 2001)
Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83, 460-472.
Grayson, K. (2001). Interrater reliability. Journal of Consumer Psychology, 10, 71-73.
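For reference, the sketch below computes the three percent-agreement figures used in this analysis (exact, within 1 point, and combined) for a pair of raters. The score vectors are invented for illustration, not the study's data, and "within 1 point" is read here as exactly one point apart, with "combined" covering both cases.

```python
def percent_agreement(rater1, rater2):
    """Return (exact, within-1-point, combined) agreement for two raters' scores."""
    pairs = list(zip(rater1, rater2))
    n = len(pairs)
    exact = sum(a == b for a, b in pairs) / n
    within_one = sum(abs(a - b) == 1 for a, b in pairs) / n   # adjacent but not exact
    return exact, within_one, exact + within_one

# Invented scores on a 1-4 rubric scale:
r1 = [2, 3, 2, 4, 3, 2, 1, 3]
r2 = [2, 3, 3, 2, 3, 2, 2, 4]
print(percent_agreement(r1, r2))  # (0.5, 0.375, 0.875)
```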
8
Cohen's Kappa
Benefits
– Accounts for chance agreement
– Can be used to compare across different conditions (Ciminero, Calhoun, & Adams, 1986)
Limitations
– Kappa may decrease when base rates are low, so at least 10 occurrences are needed (Nelson & Cicchetti, 1995)
Ciminero, A. R., Calhoun, K. S., & Adams, H. E. (Eds.). (1986). Handbook of behavioral assessment (2nd ed.). New York: Wiley.
Nelson, L. D., & Cicchetti, D. V. (1995). Assessment of emotional functioning in brain impaired individuals. Psychological Assessment, 7, 404-413.
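The low-base-rate caveat is easy to demonstrate. In the hedged example below (invented data, using scikit-learn's cohen_kappa_score), two rater pairs each agree on 18 of 20 scores, but when nearly every score falls in one category, chance agreement is so high that kappa collapses even though percent agreement is unchanged.

```python
from sklearn.metrics import cohen_kappa_score

# Balanced categories: 18/20 exact agreement.
balanced_a = [1, 2] * 10
balanced_b = [1, 2] * 9 + [2, 1]

# One dominant category (low base rate for the other): still 18/20 exact agreement.
skewed_a = [2] * 19 + [1]
skewed_b = [2] * 18 + [1, 2]

print(cohen_kappa_score(balanced_a, balanced_b))  # about 0.80
print(cohen_kappa_score(skewed_a, skewed_b))      # about -0.05, despite identical 90% agreement
```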
9
Kappa Coefficient
Kappa = (proportion of observed agreement – chance agreement) / (1 – chance agreement)
The coefficient ranges from -1.0 (complete disagreement) to 1.0 (perfect agreement); 0 indicates agreement no better than chance
Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
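The formula on this slide can be computed directly from two raters' score lists. The sketch below is a from-scratch Python version (invented scores, illustrative function name); the chance-agreement term is the sum, over score levels, of the product of the two raters' marginal proportions.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    freq1, freq2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters give the same score level by chance,
    # given each rater's own distribution of scores.
    chance = sum((freq1[c] / n) * (freq2[c] / n) for c in set(rater1) | set(rater2))
    return (observed - chance) / (1 - chance)

# Same invented scores as the percent-agreement example:
r1 = [2, 3, 2, 4, 3, 2, 1, 3]
r2 = [2, 3, 3, 2, 3, 2, 2, 4]
print(cohens_kappa(r1, r2))  # about 0.24, versus 0.5 exact percent agreement
```

On these inputs the result matches sklearn.metrics.cohen_kappa_score, which implements the same unweighted formula.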
10
Percent Agreement
12
Pass/Fail Disagreement
13
Cohen’s Kappa
14
Interrater Reliability Compared
15
Implications
Overall interrater reliability was poor to fair
Consider/reevaluate the protocol for calibration
– Calibration protocol may be interpreted differently
– Drifting scorers?
Use multiple methods to calculate interrater reliability
Other?
16
How can we increase interrater reliability? Your thoughts...
Training protocol
– Adding "Evidence Based" Training (Jeanne Stone, UCI)
More calibration
17
Contact Information
Jenna Porter – jmporter@csus.edu
David Jelinek – djelinek@csus.edu