Inter-rater reliability in the Performance Test: Summer workshop 2014 By: Dustin Florence.

Definitions
ITA candidates: international graduate students seeking TA positions
Performance test: a short teaching demonstration, graded on a Likert scale
Discourse intonation: intonation contours (thought groups, prominence, tone)
Inter-rater reliability: a coefficient that measures how similarly members of a rater team rate the same performance

Why the workshop?
Tech needs ITAs to teach
ITAs need to be able to communicate with undergraduates
ITAs need to pass three tests

Who is the workshop for? Stakeholders:
1. ITA candidates
2. Performance test raters
3. ITA workshop directors
4. TTU administrators
5. Department heads of ITA candidates' departments
6. Undergraduates who might be taught by ITAs

Who is this presentation for?
1. Performance test raters
2. ITA workshop directors
3. Anyone involved with rater training
4. Anyone interested in issues of rater reliability
Why? Inter-rater reliability is necessary to address stakeholders' worries. This study is a first step toward validating the summer workshop program.

What type of research is inter-rater reliability research? Since this study observes what the ITA candidates do rather than experimentally manipulating them, its research method is "correlational or cross-sectional research" (Field, 2013, p. 13). In this case, we measure how closely two different raters rate the same criteria for the same ITA candidate on the same performance test. Correlation is an accepted measure of reliability. This study uses Kendall's tau for the correlation. Kendall's tau fits our needs because it does not require a normal distribution and it works well with numerous tied data points (like the strings of 4s and 5s in our data) (Field, 2013).
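To make the statistic concrete, here is a minimal sketch of Kendall's tau-b in plain Python. This is an illustration only: the study's coefficients were computed in SPSS, and the rating lists below are invented, not drawn from the data. The tau-b tie correction is what lets the statistic cope with strings of identical Likert scores, and the coefficient is undefined (NaN) when one rater gives every performance the same score:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of ratings.

    Counts concordant, discordant, and tied pairs; pairs tied on both
    lists drop out of the tie correction entirely.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue                      # tied for both raters
        if dx == 0:
            ties_x += 1                   # tied for rater x only
        elif dy == 0:
            ties_y += 1                   # tied for rater y only
        elif dx * dy > 0:
            concordant += 1               # raters agree on the ordering
        else:
            discordant += 1               # raters disagree on the ordering
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    if denom == 0:
        # e.g. one rater scored every candidate 4: tau is undefined
        return float("nan")
    return (concordant - discordant) / denom

# Invented example: two raters mostly agree on five performances.
print(round(kendall_tau_b([4, 5, 4, 3, 5], [4, 4, 4, 3, 5]), 3))  # 0.802
```

With SciPy installed, `scipy.stats.kendalltau` computes the same tau-b statistic along with its p-value.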

Research Questions
1. Do raters rate the same criteria in the same way? In other words, is there a moderate to high level of inter-rater reliability on the final performance test?
2. Do raters' ratings become more reliable as they gain experience with the ITA candidates and with the constructs of the ITA workshop course? Does inter-rater reliability increase from the midterm test to the final test?

Raters
Three teams of paired raters
4 male and 2 female raters
4 native and 2 non-native English speakers
5 held Master's degrees and 1 was completing a Master's degree
Experience with ITA candidates ranged from 0 to 20 years (but all had experience teaching EFL/ESL)

Rater training
Conducted by the ITA workshop directors
Both trainers were authors of the text used in the workshop and had many years of experience with ITA candidates
A two-day training session was held before the workshop
Raters listened to performances, rated them, and discussed their ratings and reasons

Participants
Gender: 53 male, 31 female
Native language: 33 Chinese speakers, 10 Bengali speakers, 8 Farsi speakers, 6 Arabic speakers, 6 Korean speakers, 6 Sinhalese speakers, 4 Tamil speakers, 3 Nepali speakers, 2 Spanish speakers, 2 French speakers, 2 Hindi speakers, and 1 speaker of each of the following: English, Indonesian, Japanese, Kamona, Urdu, Vietnamese, and Yoruba
Number in each group: Team one rated 30 students, Team two rated 29 students, Team three rated 25 students

Materials
ITA Performance Test version 9.0: four constructs and ten criteria
1. Grammatical competence: pronunciation, word stress, thought groups
2. Textual competence: grammatical structures, transitional phrases, definitions
3. Sociolinguistic competence: prominence, comprehension checks, tone
4. Functional competence: answering students' questions

Procedures
Entered scores in Excel
Used SPSS to calculate Kendall's tau coefficients for each team of raters' ratings of each candidate on the midterm and final tests
Evaluated the reliability coefficients of the final test ratings
Compared the coefficients of the midterm and final tests

Analysis
We expect Kendall's tau values of 0.2 to 0.4 to indicate moderate correlation and 0.4 or better to indicate good correlation, because the ratings are subjective and many factors can influence raters (fatigue, experience with particular groups of English learners).
We expect the difference between the final and midterm correlation coefficients for each criterion to be positive, indicating that the raters are rating more similarly.
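Those expectations can be sketched as a small helper function. This is a hypothetical illustration, not part of the study's SPSS workflow; the example values echo Team 1's definitions-and-examples coefficients:

```python
# Hypothetical helper mirroring the thresholds above:
# tau of 0.2 to 0.4 counts as moderate, 0.4 or better as good.
def interpret_tau(tau):
    if tau != tau:                 # NaN: constant ratings, tau undefined
        return "cannot compute"
    if tau >= 0.4:
        return "good"
    if tau >= 0.2:
        return "moderate"
    return "weak"

# A positive final-minus-midterm difference means the raters converged.
midterm, final = 0.389, 0.566      # Team 1, definitions and examples
print(interpret_tau(final), round(final - midterm, 3))  # good 0.177
```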

Team 1 (midterm correlation / final correlation / final-midterm)
1. pronunciation: **
2. word stress: .169 / Constant (4s), can't compute / n/a
3. thought groups: **
4. grammar: *
5. transitional phrases: *
6. definitions and examples: .389* / .566**
7. prominence: .391*
8. comprehension checks: **
9. intonation: **
10. answering questions: .651** / .458**

Team 2 (midterm correlation / final correlation / final-midterm)
1. pronunciation: *
2. word stress: .405*
3. thought groups:
4. grammar: Constant (4s), can't compute / .692** / n/a
5. transitional phrases:
6. definitions and examples: Constant (4s), can't compute / .520* / n/a
7. prominence:
8. comprehension checks: *
9. intonation: *
10. answering questions: .626**

Team 3 (midterm correlation / final correlation / final-midterm)
1. pronunciation: *
2. word stress: .341 / Constant (4s), can't compute / n/a
3. thought groups: *
4. grammar: **
5. transitional phrases:
6. definitions and examples: .548**
7. prominence: **
8. comprehension checks: .586** / .503**
9. intonation: .389* / .373*
10. answering questions: .486** / .618** / .132

The Good, the Bad and the Ugly News
What really matters are the final test scores; the midterm is practice for both the ITA candidates and the raters. On the final test, every criterion for every rater team showed moderate to strong correlation, with the notable exception of criterion 8, prominence.
Inter-rater reliability for most criteria on most teams rose from midterm to final, with gains far outweighing losses in all cases except criterion 8, prominence (and, oddly enough, criterion 10, answering students' questions).

RQs answered
Do raters rate the same criteria in the same way? In other words, is there a moderate to high level of inter-rater reliability on the final performance test? Yes, in every case except criterion 8, prominence.
Do raters' ratings become more reliable as they gain experience with the ITA candidates and the constructs of the ITA workshop course? Does inter-rater reliability increase from the midterm test to the final test? Yes, except for prominence (and, to a lesser extent, criterion 10, answering students' questions).

What's it all mean? Two of the three teams had reliability problems with prominence, and no other criterion showed reliability issues. Both of those teams also became less reliable on prominence from the midterm to the final test. This suggests that the raters had a vague understanding of prominence and/or difficulty perceiving it.

What's to be done? More time in rater training sessions devoted to understanding prominence and its role in the construct of sociolinguistic competence. More time in rater training sessions devoted to hearing prominence when it is used.

Limitations of this study
Studies only one summer workshop
Studies only three rater teams
Different rater teams are likely to have different reliability issues
Did not interview raters to learn the justifications for their ratings

Thank you for your attention. Have a great day.