1
Why Rasch analysis is not the answer in grading essays
Gavin T. L. Brown, The University of Auckland. Nordic Testing and Assessment of Writing Symposium, Trondheim, September 2015.
2
Progress in Writing: Norm-referenced vs Criterion-referenced
Ordered categories of quality:
• Excellent (more than expected or desired) …. Unsatisfactory (well below expected), OR
• Highly accomplished (appropriate to an expert) …. Novice (appropriate to a beginner)
A guide to teaching, learning, and evaluation.
3
Rating
Guided judgments by expert raters (teachers) as to the best fit of a performance to a quality stage; often supplemented by a rubric.
4
Unreliable/inconsistent
Humans are unreliable & inconsistent in their ratings:
• Some are far from the target but consistent (#2)
• Others are inconsistent & rarely on target (#4)
• Some are close to but not on the target (#3)
Training, monitoring, and moderation are needed to gain accuracy and precision.
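As a hedged illustration (invented numbers, not data from any marking study), the sketch below simulates these three rater types: the mean error captures bias (distance from the target) and the standard deviation captures inconsistency.

```python
# Illustrative simulation (invented numbers) of the rater types above:
# bias = systematic distance from the target score, noise = inconsistency.

import random
import statistics

random.seed(0)

TRUE_SCORE = 4.0  # the "target" mark for one essay, on a 0-6 scale

raters = {
    "#2 far but consistent":       dict(bias=1.5, noise=0.2),
    "#3 close but off target":     dict(bias=0.5, noise=0.2),
    "#4 inconsistent, off target": dict(bias=1.0, noise=1.2),
}

for name, r in raters.items():
    marks = [TRUE_SCORE + r["bias"] + random.gauss(0, r["noise"]) for _ in range(50)]
    err = statistics.mean(marks) - TRUE_SCORE   # accuracy: systematic bias
    sd = statistics.stdev(marks)                # precision: consistency
    print(f"{name}: mean error={err:+.2f}, SD={sd:.2f}")
```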
5
Lenience and Harshness
Some raters are consistent but either too lenient (never give a low mark) or too harsh (never give a high mark). These raters can be adjusted statistically once they are identified (a minimal sketch follows below).
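A minimal sketch of what such an adjustment could look like, assuming every rater marks an overlapping set of essays. The rater names and marks are hypothetical, and operational systems estimate severity with a measurement model (e.g., many-facet Rasch) rather than simple mean-centering.

```python
# Hypothetical sketch: estimate each rater's lenience/harshness as the
# deviation of their mean mark from the grand mean, then remove it.

import statistics

# scores[rater][essay_id] -> mark awarded (0-6 rubric scale); invented data
scores = {
    "rater_A": {"e1": 5, "e2": 4, "e3": 6},   # lenient
    "rater_B": {"e1": 3, "e2": 2, "e3": 4},   # harsh
}

all_marks = [m for marks in scores.values() for m in marks.values()]
grand_mean = statistics.mean(all_marks)

adjusted = {}
for rater, marks in scores.items():
    severity = statistics.mean(marks.values()) - grand_mean
    # Subtract the severity estimate: lenient raters are pulled down,
    # harsh raters are pulled up.
    adjusted[rater] = {e: m - severity for e, m in marks.items()}

print(grand_mean, adjusted)
```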
6
Assumptions: Rasch statistical modeling
All responses are a function of a single factor: the jointly measured difficulty of an item and the ability of a person. No other factor matters (e.g., item discrimination, guessing, choice of task, marker, etc.). But this is clearly not applicable to the realities of marking extended writing.
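For reference, the dichotomous Rasch model states this single-factor assumption formally: the probability of success depends only on person ability and item difficulty.

```latex
% Dichotomous Rasch model: the probability of a correct response depends
% only on person ability \theta_p and item difficulty b_i.
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}
```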
7
Statistical Analysis of Ratings
Multiple facets to take into account:
• The students' performances (p), which differ from each other in quality
• The task (t) each student completes (unless all do the same one)
• The raters (r), who differ from each other (unless there is only one)
• Components (c), or sub-scores, within each performance
• The interactions: p*t; p*r; t*r; p*c; r*c; t*c; p*t*r; p*t*c; p*r*c; t*r*c; p*t*r*c
Goal: variance in scores should be attributable to the student (p), NOT to the task they did, the marker they had, the component being used to judge, or any interaction of these construct-irrelevant factors.
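To make the goal concrete, here is an illustrative simulation (invented effect sizes, plain Python rather than G-theory or MFR software): observed scores are built from person, task, and rater effects, and in a well-behaved system the person component dominates the variance.

```python
# Illustrative sketch (not the asTTle analysis): simulate ratings with
# person, task, and rater effects, then see how much observed-score
# variance the person facet contributes. All numbers are invented.

import random

random.seed(1)

n_persons, n_tasks, n_raters = 200, 3, 4

# True facet effects: person variance should dominate.
person = [random.gauss(0, 1.0) for _ in range(n_persons)]
task = [random.gauss(0, 0.3) for _ in range(n_tasks)]
rater = [random.gauss(0, 0.3) for _ in range(n_raters)]

# Observed score = person + task + rater + residual noise.
obs = {(p, t, r): person[p] + task[t] + rater[r] + random.gauss(0, 0.4)
       for p in range(n_persons) for t in range(n_tasks) for r in range(n_raters)}

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# The variance of each person's mean score (averaged over tasks and
# raters) approximates the person component of the observed variance.
p_means = [sum(obs[p, t, r] for t in range(n_tasks) for r in range(n_raters))
           / (n_tasks * n_raters) for p in range(n_persons)]

print(f"total variance  : {variance(list(obs.values())):.2f}")
print(f"person component: {variance(p_means):.2f}  # should be the largest share")
```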
8
Good scoring of writing if….
• Student scores are spread across the range
• Markers are close to each other
• Tasks are close to each other
• Components or sub-scores are close to each other
9
Techniques for analysis of multiple facets
• Multi-facet Rasch analysis
• Generalisability theory
Both estimate the proportion of variance in observed scores attributable to each facet. Key difference: Rasch analysis transforms the raw score to a logit (log-odds) before analysis.
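For reference, the logit transform of a raw proportion score, and Linacre's many-facet Rasch (rating scale) model; the symbol names here (θ_n for person ability, δ_t for task difficulty, λ_r for rater severity, τ_k for category threshold) are one common labelling, not the asTTle notation.

```latex
% Logit (log-odds) transform of a raw proportion-correct score p:
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)

% Many-facet Rasch model (Linacre): the log-odds of a rating moving from
% category k-1 to k depend additively on person ability, task difficulty,
% rater severity, and the category threshold.
\ln\!\left(\frac{P_{ntrk}}{P_{ntr(k-1)}}\right)
  = \theta_n - \delta_t - \lambda_r - \tau_k
```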
10
Example: e-asTTle Writing MFR analysis
[Variable map: scale runs from "Highly able" to "Very easy"; shows the distribution of students, tasks, markers, & sub-scores from a norming sample.]
11
Example: Rasch analysis-asTTle writing
[Same variable map, annotated with a 1 SD band:]
• Almost all markers are within 1 SD of each other
• Almost all elements are within 1 SD of each other
• Almost all tasks are within 1 SD of each other
12
Hence
Yes, markers, tasks, and components are not identical in their tendency to award marks, and the differences are beyond chance. But most variation comes, correctly, from the students, not from construct-irrelevant factors. Having established this, do we need to make any further adjustments? And if so, which value is the correct one: the highest, the lowest, the average? Which one do we agree is the 'true' value?
13
Hence, the real problem
In classroom operation, what is the TRUE essay score? What is a valid reference point?
• Each teacher's truthiness?
• A test score on a related construct?
• The statistically adjusted score?
• The socially agreed value assigned by experts following a systematic process?
14
Systematic Classroom Processes
• Common rubric to guide judgments and discussion, linked to the curriculum
• Training in use of the rubric: exemplars, feedback
• Moderation by another marker: simple checks for the level of agreement; discussion of differences; non-use of scores until agreement is sufficient
15
Analytic Writing Rubric
Audience Awareness and Purpose scale (Proficient descriptor at each level):
Level 2 (Proficient): Evidence that the writer recognises that his/her opinion is needed. May state opinions from a personal perspective.
Level 3 (Proficient): Language generally appropriate to audience. Some attempt to influence the reader is evident.
Level 4 (Proficient): Writer aware the audience may hold a different point of view but tends to assume there is only one different generalised point of view. Opening presents a point of view to the audience. Writing attempts to persuade the reader. A clearly stated, consistent position is evident.
Level 5 (Proficient): Identifies & relates to a concrete/specific audience. Awareness of the intended reader evident at beginning & end, but may be inconsistent in middle sections. Language use is appropriate and has elements which begin to be persuasive to the audience.
Level 6 (Proficient): Implicit awareness that the audience may hold a range of points of view. Consistently persuasive throughout for the intended audience. Tone likely to impact on, effect change in, or manipulate the intended audience towards the author's purpose.
In NZ the asTTle system has 7 analytic scales: Audience Awareness & Purpose, Content or Ideas, Structure or Organisation, Language and Style, Grammar, Punctuation, Spelling. Available at
16
Moderation of scoring
Cross-checking by having 2 qualified judges mark and compare scores for a common group of essays (a worked sketch follows below).
• Identical scores: target is 70% the same
• Approximately equal (+/- 1 score point): target is 90% the same if using an A+ to F scale
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Available online:
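A minimal sketch of these consensus checks, with hypothetical marks for ten essays scored by two markers on a 0-6 scale:

```python
# Exact and adjacent (+/- 1 point) agreement between two markers on a
# common set of essays. The marks below are invented for illustration.

marker_1 = [4, 5, 3, 6, 2, 4, 5, 3, 4, 5]
marker_2 = [4, 4, 3, 6, 3, 4, 5, 2, 4, 6]

n = len(marker_1)
exact = sum(a == b for a, b in zip(marker_1, marker_2)) / n
adjacent = sum(abs(a - b) <= 1 for a, b in zip(marker_1, marker_2)) / n

print(f"exact agreement   : {exact:.0%}  (target: 70%)")
print(f"adjacent agreement: {adjacent:.0%}  (target: 90%)")
```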
17
Moderation of scoring
• Debate, discussion, and resolution are needed for any essay whose scores differ by more than 1 letter grade, or 3/20, or 10/100 (see the sketch below)
• Discussion must be linked to evidence in the essay and to the criteria in the scoring guide
• If agreement can't be reached, a 3rd judge is needed, who should be MORE experienced than both markers
• If you meet the expected targets, you can defensibly use the scores to make decisions about learning needs and priorities, and to report
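And a hypothetical sketch of the flagging rule above (the tolerance and essay IDs are invented), routing discrepant essays to discussion or a third judge:

```python
# Flag essays whose two marks differ by more than the tolerance
# (here 3 points on a 20-point scale) for discussion or a 3rd judge.

TOLERANCE = 3  # points out of 20; use 10 for a 100-point scale

pairs = {"essay_07": (14, 18), "essay_12": (11, 12), "essay_19": (6, 13)}

for essay, (m1, m2) in pairs.items():
    if abs(m1 - m2) > TOLERANCE:
        print(f"{essay}: {m1} vs {m2} -> discuss / refer to third judge")
```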
18
Conclusion
Multi-facet Rasch or Generalisability theory can determine whether construct-irrelevant factors undermine the validity of scores. This is necessary for norming purposes. BUT: classroom judgments by teachers cannot be treated the same way until the teachers are calibrated into the system with training equivalent to that of norm-marking panels. Simple inter-rater moderation statistics and discussion are sufficient to generate dependable scores.
19
Further New Zealand writing assessment resources at: