Examining Rubric Design and Inter-rater Reliability: A Fun Grading Project. Presented at the Third Annual Association for the Assessment of Learning in Higher Education (AALHE) Conference, Lexington, Kentucky, June 3, 2013. Dr. Yan Zhang Cooksey, University of Maryland University College
Outline of Today’s Presentation: Background and purposes of the full-day grading project; Procedural methods of the project; Results and decisions informed by the assessment findings; Lessons learned through the process
Purposes of the Full-day Grading Project: To simplify the current assessment process; To validate the newly developed common rubric measuring four core student learning areas (written communication, critical thinking, technology fluency, and information literacy)
UMUC Graduate School Previous Assessment Model: Model
Previous Assessment Model: Model (Cont.)
Strengths: Tested rubrics; Reasonable collection points; Larger samples (more data for analysis). Weaknesses: Added faculty workload; Lack of consistency in assignments; Variability in applying scoring rubrics.
C2 Model: Common activity & Combined rubric
Compare Current Model to (New) C2 Model
Current Model | Combined Activity/Rubric (C2) Model
Multiple rubrics: one for each of 4 SLEs | Single rubric for all 4 SLEs
Multiple assignments across graduate school | Single assignment across graduate school
One to multiple courses/4 SLEs | Single course/4 SLEs
Multiple raters for the same assignment/course | Same raters/assignment/course
Untrained raters | Trained raters
Procedural Methods of the Grading Project: Data source; Rubric; Experimental design for data collection; Inter-rater reliability
Procedural Methods of the Grading Project (Cont.) Data Source (student papers/redacted)
Course name | # of Papers
BTMN |
BTMN |
BTMN9080 | 7
DETC630 | 9
MSAF670 | 20
MSAS670 | 13
TMAN680 | 16
Total | 121
Procedural Methods of the Grading Project (Cont.) Common Assignment; Rubric (rubric design and refinement); 18 Raters (faculty members)
Procedural Methods of the Grading Project (Cont.) Experimental design for data collection: randomized trial (Groups A and B); raters’ norming and training; grading instructions
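A randomized split of raters into Group A and Group B could be sketched as follows. This is only an illustration, not the project's actual procedure: the rater labels, the fixed seed, and the even 9/9 split are assumptions. The slides state only that 18 raters took part and that the design was a randomized trial with two groups.

```python
import random

def assign_groups(raters, seed=2013):
    """Randomly split raters into two equally sized groups (A and B)."""
    rng = random.Random(seed)   # fixed seed (assumed) so the split is reproducible
    shuffled = list(raters)     # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": sorted(shuffled[:half]), "B": sorted(shuffled[half:])}

# Hypothetical rater labels standing in for the 18 faculty raters.
raters = [f"rater_{i:02d}" for i in range(1, 19)]
groups = assign_groups(raters)
```

Shuffling once and cutting the list in half guarantees equal group sizes, which a per-rater coin flip would not.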
Procedural Methods of the Grading Project (Cont.) Inter-rater reliability (literature). Stemler (2004): in any situation that involves judges (raters), the degree of inter-rater reliability is worthwhile to investigate, as the value of inter-rater reliability has significant implications for the validity of the subsequent study results. Intraclass Correlation Coefficients (ICC) were used in this study.
Results and Findings: Two-sample t-test. Group Statistics table for Differ_Rater1and2 (values not shown). Columns: Group #, N, Mean, Std. Deviation, Std. Error Mean. Rows: Group A-Experiment Group; Group B-Control Group.
Results and Findings (Cont.) Independent Samples Test table for Differ_Rater1and2 (values not shown). Columns: Levene's Test for Equality of Variances (F, Sig.); t-test for Equality of Means (t, df, Sig. (2-tailed), Mean Difference, Std. Error Difference, 95% Confidence Interval of the Difference: Lower, Upper). Rows: Equal variances assumed; Equal variances not assumed.
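The two rows of the independent samples test ("equal variances assumed" and "equal variances not assumed") correspond to the pooled-variance and Welch forms of the t statistic. A minimal sketch of both is given here using only the Python standard library. The score differences are made-up numbers, not the project's data, and the function name `two_sample_t` is hypothetical. Levene's test and p-values are left out because the standard library has no t-distribution.

```python
import math
from statistics import mean, variance

def two_sample_t(a, b, equal_var=True):
    """Return (t, df) for an independent two-sample t-test."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances (n - 1 denominator)
    diff = mean(a) - mean(b)
    if equal_var:
        # Pooled-variance (Student) form: "equal variances assumed"
        sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        se = math.sqrt(sp2 * (1 / na + 1 / nb))
        df = na + nb - 2
    else:
        # Welch form with Satterthwaite degrees of freedom:
        # "equal variances not assumed"
        se = math.sqrt(va / na + vb / nb)
        df = (va / na + vb / nb) ** 2 / (
            (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
        )
    return diff / se, df

# Made-up differences between rater 1 and rater 2 scores per paper.
group_a = [1, 0, 2, 1, 1, 0]   # stands in for Group A (experiment)
group_b = [3, 2, 4, 2, 3, 4]   # stands in for Group B (control)
t_pooled, df_pooled = two_sample_t(group_a, group_b, equal_var=True)
t_welch, df_welch = two_sample_t(group_a, group_b, equal_var=False)
```

With equal group sizes the two t values coincide and only the degrees of freedom differ, so the choice between the rows matters most when the groups are unbalanced.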
Results and Findings (Cont.) Inter-rater Reliability: Intraclass Correlation Coefficients (ICC). Overall Intraclass Correlation Coefficient table (values not shown). Columns: Intraclass Correlation for Group A and Group B. Rows: Single Measures; Average Measures. Note: One-way random effects model where people effects are random. Group A-Experiment Group; Group B-Control Group.
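For reference, the single-measures and average-measures ICC under a one-way random effects model can be computed from the between-paper and within-paper mean squares of a one-way ANOVA. The sketch below assumes every paper is scored by the same number of raters. The scores are invented for illustration and `icc_oneway` is a hypothetical helper name, not part of the original analysis.

```python
from statistics import mean

def icc_oneway(ratings):
    """One-way random-effects ICC for an n-papers x k-raters table.

    Returns (single_measures, average_measures), i.e. ICC(1,1) and
    ICC(1,k) in the Shrout and Fleiss (1979) notation.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(score for row in ratings for score in row)
    row_means = [mean(row) for row in ratings]
    # Between-papers and within-papers mean squares
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (score - m) ** 2 for row, m in zip(ratings, row_means) for score in row
    ) / (n * (k - 1))
    single = (msb - msw) / (msb + (k - 1) * msw)
    average = (msb - msw) / msb
    return single, average

# Made-up scores: 5 papers, each scored by 2 raters.
scores = [[4, 4], [2, 3], [3, 3], [1, 2], [4, 3]]
single, average = icc_oneway(scores)
```

The average-measures value is the reliability of the mean of the k ratings, so it is always at least as large as the single-measures value when the latter is positive.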
Results and Findings (Cont.) Intraclass Correlation Coefficient by Criterion (values not shown except where given). Column: Average Measures Intraclass Correlation, Group A. Criteria: 1 Conceptualization/Content/Ideas [THIN]; 2 Analysis/Evaluation [THIN]; 3 Synthesis/Support [THIN]; 4 Conclusion/Implications [THIN]; 5 Selection/Retrieval [INFO]; 6 Organization [COMM]; 7 Writing Mechanics [COMM]; 8 APA Compliance [COMM]; 9 Technology Application [TECH]: .303
Results and Findings (Cont.) Inter-Item Correlation for Group A. Reliability Statistics (a) table (values not shown). Columns: Cronbach's Alpha; Cronbach's Alpha Based on Standardized Items; N of Items. a. Group# = Group A-Experiment
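Cronbach's alpha treats the rubric criteria as items and compares the sum of the item variances with the variance of the total score. A minimal sketch follows. The three-criterion score table is invented (the actual rubric has nine criteria) and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a papers x criteria score table."""
    k = len(rows[0])                                    # number of items (criteria)
    item_vars = [variance(col) for col in zip(*rows)]   # variance of each criterion
    total_var = variance([sum(row) for row in rows])    # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Made-up scores: 5 papers x 3 criteria.
scores = [[4, 4, 3], [2, 3, 2], [3, 3, 3], [1, 2, 1], [4, 3, 4]]
alpha = cronbach_alpha(scores)
```

This is the raw (unstandardized) alpha; the "based on standardized items" variant uses the mean inter-item correlation instead of the variances.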
Results and Findings (Cont.) Inter-Item Correlation Matrix (a), 9 x 9 (values not shown). Rows and columns: Criterion 1 [THIN]; Criterion 2 [THIN]; Criterion 3 [THIN]; Criterion 4 [THIN]; Criterion 5 [INFO]; Criterion 6 [COMM]; Criterion 7 [COMM]; Criterion 8 [COMM]; Criterion 9 [TECH]
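An inter-item correlation matrix of this kind holds the Pearson correlation between every pair of criteria across the scored papers. The sketch below builds one with the standard library only. The score table is made up and uses three criteria rather than nine, and the function names are hypothetical.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def inter_item_matrix(rows):
    """Correlation matrix between the columns (criteria) of a score table."""
    cols = list(zip(*rows))   # transpose: one tuple of scores per criterion
    return [[pearson(ci, cj) for cj in cols] for ci in cols]

# Made-up scores: 5 papers x 3 criteria.
scores = [[4, 4, 3], [2, 3, 2], [3, 3, 3], [1, 2, 1], [4, 3, 4]]
matrix = inter_item_matrix(scores)
```

A criterion with no variation across papers would make the denominator zero, so real data should be checked for constant columns first.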
Lessons Learned through the Process: Get faculty excited about assessment! Strategies to improve inter-rater agreement: more training; clear rubric criteria; map assignment instructions to rubric criteria. Make decisions based on the assessment results: further refined the rubric and common assessment activity.
Resources
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), (Correction, 1(1), 390).
Nunnally, J. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating. Practical Assessment, Research & Evaluation, 9(4). Retrieved from
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 2, Retrieved from
Stay Connected… Dr. Yan Zhang Cooksey, Director for Outcomes Assessment, The Graduate School, University of Maryland University College. Homepage: