Validity and reliability of rating speaking and writing performances
STANAG 6001 Testing Workshop, Brno, 6–8 September 2016
Ülle Türk, Estonia
Quality in assessment
Reliability is the degree to which an assessment tool produces stable and consistent results. Validity is the extent to which an assessment accurately measures what it is intended to measure.
Scoring validity
How far can we depend on the scores which result from the test?
Parameters for tests of productive skills:
- Criteria / rating scale
- Rating procedures: rater selection, rater training, standardisation, moderation
- Rating conditions
- Statistical analysis
- Raters
- Grading and awarding
Reliability in tests of productive skills
- Intra-rater reliability, or internal consistency
- Inter-rater reliability, or inter-rater agreement
- Parallel forms reliability
Rater effects that affect reliability
- Differences in rater severity
- Halo effect = failing to assign independent scores to the distinct categories of an analytic rubric
- Central tendency = the reluctance to assign scores at the extremes of a rating scale
Methods for assessing rater reliability
- Numerical: percentage of agreement between the two raters/ratings; correlation coefficients
- Visual: cross-tabulation matrix
Percent agreement
[Table: first rating, second rating, and agreement for 14 examinees (Ann, Paul, Mary, Jill, Tom, Steve, Linda, John, Harry, Kate, Bill, Joe, Tina, Jane)]
The basic model for calculating inter-rater reliability is percent agreement in the two-rater model:
1. Count the ratings that are in agreement.
2. Count the total number of ratings.
3. Convert the fraction to a percentage.
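The three steps above can be sketched in a few lines of Python. The rating data here is hypothetical, not the workshop's example:

```python
# Hypothetical first and second ratings for ten examinees (not the
# workshop's data); plus-levels are kept as strings.
first  = ["2+", "2", "1+", "3", "2", "1", "2+", "2", "1+", "3"]
second = ["2+", "2", "2",  "3", "2", "1", "2",  "2", "1+", "3"]

def percent_agreement(r1, r2):
    """Share of cases in which both raters assigned the same level."""
    matches = sum(a == b for a, b in zip(r1, r2))   # step 1: agreements
    total = len(r1)                                 # step 2: total ratings
    return 100 * matches / total                    # step 3: percentage

print(percent_agreement(first, second))  # 8 of 10 identical -> 80.0
```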
Rules-of-Thumb for Percent Agreement Interpretation
- 4 or fewer rating categories: high agreement = 90%, minimal agreement = 75%; qualification: no ratings more than one level apart
- 5-7 rating categories: approximately 90% of ratings identical or adjacent
Correlation
[Table: first and second ratings for the same 14 examinees, with plus-levels converted to numbers]
With plus-levels, translate levels into numbers: 1 = 1, 1+ = 2, 2 = 3, 2+ = 4, 3 = 5
Pearson r = 0.670
Mean: 1st = 3.07, 2nd = 3.07
Standard deviation: 1st = 1.385, 2nd = 1.072
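Once plus-levels are converted to numbers, the Pearson coefficient can be computed directly from the definition. A minimal sketch with hypothetical numeric rating pairs (not the workshop's data, so the coefficient differs from the 0.670 above):

```python
from math import sqrt

# Hypothetical ratings after converting plus-levels
# (1 = 1, 1+ = 2, 2 = 3, 2+ = 4, 3 = 5).
first  = [4, 3, 2, 5, 3, 1, 4, 3, 2, 5]
second = [3, 3, 2, 5, 4, 1, 4, 2, 2, 4]

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(first, second), 3))
```

In practice, `statistics.correlation` (Python 3.10+) or `scipy.stats.pearsonr` would be used instead of a hand-rolled function.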
Interpretation
Benchmarks for correlation coefficients:
- < 0.20 = poor
- 0.21 to 0.40 = fair
- 0.41 to 0.60 = moderate
- 0.61 to 0.80 = good
- 0.81 to 1.00 = very good
Cross-tabulation matrix
[Matrix: first rating (rows) against second rating (columns) on levels 1, 1+, 2, 2+, 3 for the same 14 examinees; cells on the diagonal are exact agreements]
References Luoma, Sari (2004) Assessing Speaking. Cambridge University Press. Weir, Cyril J. (2005) Language Testing and Validation: An Evidence-Based Approach. Palgrave Macmillan.