Is rater training worth it?

Presentation transcript:

Is rater training worth it?
Mag. Franz Holzknecht, Mag. Benjamin Kremmel
IATEFL TEASIG Conference, September 2011, Innsbruck

Overview
- Research literature on rater training
- CLAAS [CEFR-Linked Austrian Assessment Scale]
- Study: Participants, Procedure
- Results
- Discussion

Rater training
- The need for training is highlighted in the testing literature [Alderson, Clapham & Wall, 1995; McNamara, 1996; Bachman & Palmer, 1996; Shaw & Weir, 2007]
- Training helps clarify rating criteria, modifies rater expectations and provides a reference group for raters [Weigle, 1994]
- Training can increase intra-rater consistency [Lunz, Wright & Linacre, 1990; Stahl & Lunz, 1991; Weigle, 1998]
- Training can redirect the attention of different rater types and so decrease imbalances [Eckes, 2008]

Rater training
- Effects are not as positive as expected [Lumley & McNamara, 1995; Weigle, 1998]
- Eliminating rater differences is "unachievable and possibly undesirable" [McNamara, 1996: 232]
- "Rater training is more successful in helping raters give more predictable scores [...] than in getting them to give identical scores" [Weigle, 1998: 263]

CLAAS – CEFR-Linked Austrian Assessment Scale
- developed over 2 years
- tested against performances from 4 field trials
- item writers, international experts, standard-setting judges
- analytic scale with 4 criteria: Task Achievement; Organisation and Layout; Lexical and Structural Range; Lexical and Structural Accuracy
- 11 bands per criterion [6 described, 5 not described]

[Slide: extract from the CLAAS rating scale (Bifie, 2011)]

Participants
3 groups of raters:
- group 1: 5 days of training, N = 15, from 8 provinces of Austria
- group 2: 2 days of training, N = 12
- group 3: no training, N = 13, from 6 provinces of Austria

Procedure [1]
- groups were asked to rate a range of performances
- different task types: article, email, essay, report
- selected criteria: Task Achievement [TA], Organisation and Layout [OL], Lexical and Structural Range [LSR], Lexical and Structural Accuracy [LSA]

Procedure [2]
Performances rated by each group, against the criteria TA, OL, LSR and LSA:
- group 1 [5 days training]: Article 2743, 2722, 2540; Email 2288, 2630, 2449
- group 2 [2 days training]: Essay 1071, 1152; Report 1348; Article 2701; Email 2428
- group 3 [no training]: Essay 1152; Report 1348; Article 2743, 2540; Email 2288, 2630, 2438

Results [1] – Inter-rater reliability: group 2 [2 days training] vs. group 3 [no training] [charts on original slide]

Results [2] – Inter-rater reliability: group 1 [5 days training] vs. group 3 [no training] [charts on original slide]

Results [3] – Inter-rater reliability
- Separation index: are the rater (severity) measurements statistically distinguishable?
- Reliability (of rater separation, not inter-rater reliability): how reliable is the distinction between different levels of severity among raters?
- high separation = low inter-rater reliability; high separation reliability = low inter-rater reliability
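For readers unfamiliar with these statistics, the following is a minimal sketch of how a rater separation index and the corresponding separation reliability can be computed from rater severity measures and their standard errors. The numbers are invented for illustration, and the formulas follow the standard Rasch definitions rather than the study's actual FACETS output.

```python
import math

# Hypothetical rater severity measures (in logits) and their standard errors.
# Illustrative values only - not the CLAAS study data.
severities = [0.42, -0.15, 0.08, -0.35, 0.27]
std_errors = [0.20, 0.22, 0.19, 0.21, 0.20]

n = len(severities)
mean_sev = sum(severities) / n
observed_var = sum((s - mean_sev) ** 2 for s in severities) / (n - 1)
error_var = sum(se ** 2 for se in std_errors) / n        # mean square measurement error
true_var = max(observed_var - error_var, 0.0)            # error-adjusted ("true") variance

separation = math.sqrt(true_var) / math.sqrt(error_var)  # separation index G
reliability = true_var / observed_var                     # separation reliability, = G^2 / (1 + G^2)

# High values mean raters differ reliably in severity,
# i.e. low agreement (low inter-rater reliability).
print(f"Separation G = {separation:.2f}, separation reliability R = {reliability:.2f}")
```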

Results [4] – Inter-rater reliability

group                        Separation   Reliability
group 3 [no training]        1.48         0.69          fairly low inter-rater reliability
group 2 [2 days training]    0.00         0.00          high inter-rater reliability
group 1 [5 days training]    0.52         0.21          high inter-rater reliability
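The two columns are linked by the usual Rasch relation Reliability = Separation² / (1 + Separation²): for example, 1.48² / (1 + 1.48²) ≈ 0.69 and 0.52² / (1 + 0.52²) ≈ 0.21, consistent with the figures above.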

Results [5] – Intra-rater reliability
Infit Mean Square:
- values between 0.5 and 1.5 are acceptable [Lunz & Stahl, 1990]
- values above 2.0 are of greatest concern [Linacre, 2010]
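As a rough illustration of how an infit mean square is obtained for a single rater, the sketch below applies the standard information-weighted fit formula; the observed scores, model-expected scores and model variances are invented and are not the study's FACETS output.

```python
def infit_mean_square(observed, expected, variance):
    """Information-weighted fit statistic for one rater.

    observed: scores the rater actually gave
    expected: scores the Rasch model predicts for those ratings
    variance: model variance of each observation
    Infit MnSq = sum((obs - exp)^2) / sum(variance); values near 1.0 fit the
    model, values outside roughly 0.5-1.5 suggest misfit.
    """
    squared_resid = sum((o - e) ** 2 for o, e in zip(observed, expected))
    return squared_resid / sum(variance)

# Illustrative (invented) values for one rater across six ratings.
obs = [5, 2, 5, 2, 4, 2]
exp = [3.8, 2.9, 3.9, 2.8, 3.1, 3.0]
var = [0.8, 0.9, 0.7, 0.9, 0.8, 0.9]

msq = infit_mean_square(obs, exp, var)
flag = "acceptable" if 0.5 <= msq <= 1.5 else "misfitting"
print(f"Infit MnSq = {msq:.2f} ({flag})")
```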

Results [6] – Intra-rater reliability: 53%, 23%, 33% [per-group values; group assignment shown in a chart on the original slide]

Discussion
- Weigle's [1998] findings could not be confirmed:
  - trained raters showed higher levels of inter-rater reliability
  - intra-rater reliability decreased with more days of rater training
- results may be due to the form of rater training
- Is rater training worth it?

Further research
- monitoring of future ratings of group 1 [5 days training]
- larger number of data points per element [= ratings per rater / per examinee] [Linacre, personal communication]:
  - more data points for examinees for group 3 [no training]
  - more data points for raters for group 1 [5 days training]
- group 1 [5 days training] to rate the same scripts again after 10 days of training; compare inter- and intra-rater reliability of first and second ratings

Bibliography
Alderson, J.C., Clapham, C., & Wall, D. [1995]. Language test construction and evaluation. Cambridge: Cambridge University Press.
Bachman, L.F., & Palmer, A.S. [1996]. Language testing in practice. Oxford: Oxford University Press.
Bifie. [2011]. CEFR-linked Austrian assessment scale. <https://www.bifie.at/system/files/dl/srdp_scale_b2_2011-05-18.pdf>. Retrieved on September 19th 2011.
Eckes, T. [2008]. Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25 [2], 155-185.
Linacre, J.M. [2010]. Manual for Online FACETS course [unpublished].
Lumley, T., & McNamara, T.F. [1995]. Rater characteristics and rater bias: implications for training. Language Testing, 12 [1], 54-71.
Lunz, M.E., & Stahl, J.A. [1990]. Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425-444.
Lunz, M.E., Wright, B.D., & Linacre, J.M. [1990]. Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3 [4], 331-345.
McNamara, T.F. [1996]. Measuring second language performance. London: Longman.
Shaw, S.D., & Weir, C.J. [2007]. Examining writing: Research and practice in assessing second language writing. Cambridge: Cambridge University Press.
Stahl, J.A., & Lunz, M.E. [1991]. Judge performance reports: Media and message. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Weigle, S.C. [1994]. Effects of training on raters of ESL compositions. Language Testing, 11 [2], 197-223.
Weigle, S.C. [1998]. Using FACETS to model rater training effects. Language Testing, 15 [2], 263-287.