LG675 Session 4: Reliability I Sophia Skoufaki 8/2/2012.


What does 'reliability' mean in the context of applied linguistic research? Definition and examples. What are the two broad categories of reliability tests? How can we use SPSS to examine the reliability of a norm-referenced measure? Work with typical scenarios.

Reliability: broad definitions. The degree to which a data-collection instrument (e.g., a language test, questionnaire) yields consistent results. The degree to which a person categorises linguistic output consistently (compared with their own earlier coding or with someone else's).

Reliability in applied linguistic research: examples. a. A researcher who has created a vocabulary-knowledge test wants to see whether any questions in this test are inconsistent with the test as a whole (Gyllstad 2009). b. A researcher who has collected data through a questionnaire she designed wants to see whether any questionnaire items are inconsistent with the questionnaire as a whole (Sasaki 1996). c. A researcher wants to see whether she and another coder largely agree in their coding of idiom-meaning guesses given by EFL learners as right or wrong (Skoufaki 2008).

Two kinds of reliability: reliability of norm-referenced data-collection instruments, and reliability of criterion-referenced data-collection instruments, AKA 'dependability'.

Classification of data-collection instruments according to the basis of grading: a data-collection instrument is either norm-referenced or criterion-referenced.

Norm-referenced. "Each student's score on such a test is interpreted relative to the scores of all other students who took the test. Such comparisons are usually done with reference to the concept of the normal distribution …" (Brown 2005). In the case of language tests, norm-referenced tests assess knowledge and skills without reference to specific content taught.

Criterion-referenced "… the purpose of a criterion-referenced test is to make a decision about whether an individual test taker has achieved a pre-specified criterion…" (Fulcher 2010) "The interpretation of scores on a CRT is considered absolute in the sense that each student’s score is meaningful without reference to the other students’ scores." (Brown 2005) In the case of language tests, these tests assess knowledge and skills based on specific content taught.

How reliability is assessed in norm-referenced data-collection instruments: test-retest, equivalent forms, or internal consistency.

Which reliability test we will use also depends on the nature of the data: nominal, ordinal, interval, or ratio.

Scoring scales Nominal: Numbers are arbitrary; they distinguish between groups of individuals (e.g., gender, country of residence) Ordinal: Numbers show greater or lesser amount of something; they distinguish between groups of individuals and they rank them (e.g., students in a class can be ordered)

Scoring scales (cont.) Interval: Numbers show a greater or lesser amount of something and the difference between adjacent numbers remains stable throughout the scale; numbers distinguish between groups of individuals, rank them, and show how large the difference between two numbers is (e.g., in tests where people have to get a minimum score to pass) Ratio: This scale contains a true zero point, for cases which completely lack a characteristic; numbers do all the things that numbers in interval scales do and they include a zero point (e.g., length, time)

Assessing reliability through the test-retest or equivalent-forms approach. The procedure for this kind of reliability test is to ask the same people to do the same test again (test-retest) or an equivalent version of the test (equivalent forms). Correlational or correlation-like statistics are then used to see how similar the participants' scores are across the two administrations.
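
The correlation step above can be sketched in a few lines of code. This is an illustrative sketch with invented scores (not data from Phil's handout), computing the Pearson correlation between two administrations of the same test:

```python
# Test-retest reliability sketch: Pearson correlation between two
# administrations of the same test. Scores are invented example data.
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

test1 = [12, 15, 9, 20, 17, 11]   # first administration
test2 = [13, 14, 10, 19, 18, 12]  # same learners, at the retest
print(round(pearson_r(test1, test2), 3))
```

A coefficient close to 1 indicates that the instrument ranks the participants consistently across the two sittings.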

SPSS: Testing test-retest reliability Open SPSS and input the data from Set 1 on page 5 of Phil’s ‘Simple statistical approaches to reliability and item analysis’ handout. Do this activity. Then do the activity with Sets 2 and 3.

Rater reliability: The degree to which a) a rater rates test-takers' performance consistently (intra-rater reliability) and b) two or more raters who rate test-takers' performance give ratings which agree with one another (inter-rater agreement).

Ways of assessing internal-consistency reliability. Split-half reliability: We split the test items in half, then compute a correlation between the scores on the two halves. Because this finding indicates how reliable half of our test is (not all of it), and the longer a test is the higher its reliability, we need to adjust the finding; we use the Spearman-Brown prophecy formula for that. Alternatively: a statistic that compares the distribution of the scores that each item got with the distribution of the scores the whole test got, e.g. Cronbach's a or the Kuder-Richardson formula 20 or 21. In both cases, the higher the similarity found, the higher the internal-consistency reliability. Cronbach's a is the most frequently used internal-consistency reliability statistic.

SPSS: Assessing internal-consistency reliability with Cronbach's a. This is an activity from Brown (2005). He split the scores from a cloze test into odd- and even-numbered items, as shown in the table in your handout. Open the file 'Brown_2005.sav' in SPSS. Then click on Analyze...Scale...Reliability analysis.... In the Model box, choose Alpha. Click on Statistics and tick Scale and Correlations.
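
As a cross-check on the SPSS output, Cronbach's a can also be computed directly from its definition, a = k/(k-1) x (1 - sum of item variances / variance of total scores). A sketch with invented item scores (not Brown's cloze data):

```python
# Cronbach's alpha sketch, computed from its definition.
from statistics import pvariance  # population variance

def cronbach_alpha(items):
    """items: one row per test-taker, one column per item."""
    k = len(items[0])                     # number of items
    totals = [sum(row) for row in items]  # each person's total score
    item_vars = [pvariance([row[i] for row in items]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

# Invented data: 5 test-takers, 4 items
data = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
print(round(cronbach_alpha(data), 3))
```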

Assessing intra-judge agreement or inter-judge agreement between two judges. When the data are interval, correlations can be used (Pearson r if the data are normally distributed and Spearman rho if they are not). When there are more than two judges and the data are interval, Cronbach's a can be used. When the data are categorical, we can calculate an agreement percentage (e.g., the two raters agreed 30% of the time) or Cohen's Kappa. Kappa corrects for chance agreement between the judges. However, the agreement percentage is good enough in some studies, and Kappa has been criticised (see Phil's handout, pp ).
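
For categorical data, both statistics can be sketched in code. Kappa subtracts the agreement expected by chance (p_e) from the observed agreement (p_o): Kappa = (p_o - p_e) / (1 - p_e). The rating data below are invented for illustration:

```python
# Agreement percentage and Cohen's Kappa for two raters' categorical
# codes. Kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement
# expected by chance given each rater's marginal proportions.
from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented codings of eight idiom-meaning guesses as right/wrong
rater1 = ["right", "wrong", "right", "right", "wrong", "right", "wrong", "right"]
rater2 = ["right", "wrong", "wrong", "right", "wrong", "right", "wrong", "right"]
print(percent_agreement(rater1, rater2))      # raw agreement proportion
print(round(cohen_kappa(rater1, rater2), 3))  # chance-corrected agreement
```

Note that Kappa is always lower than the raw agreement whenever chance agreement is above zero, which is why the two statistics can suggest different conclusions.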

SPSS: Assessing inter-judge agreement with Cronbach's a. This is an activity from Larsen-Hall (2010). Go to the book's SPSS data sets page (pss-data-sets.asp) and download the file MunroDerwingMorton, then open it in SPSS. Click on Analyze...Scale...Reliability analysis.... In the Model box, choose Alpha. Click on Statistics and tick Scale, Scale if item deleted and Correlations.

SPSS: Assessing rater reliability through Cohen's Kappa. The file 'Kappa data.sav' contains the results of an error-tagging task that a former colleague and I performed on some paragraphs written by learners of English. Each number is an error category (e.g., 2 = spelling error). There are 7 categories. For SPSS to understand what each row means, you should weight the two judge variables by the 'Count' variable.

SPSS: Assessing rater reliability through Cohen's Kappa (cont.). Go to Analyse…Descriptive Statistics…Crosstabs. Move one of the judge variables into the 'Row(s)' box and the other into the 'Column(s)' box. You shouldn't do anything with the 'Count' variable. In 'Statistics', tick 'Kappa'.

Kappa test

Kappa online. Go to the website above and select the number of categories in the data. In the table, enter the raw numbers as they appear in the SPSS contingency table. Click Calculate. The output gives not only the same Kappa value as SPSS but also three more. Phil recommends using one of the Kappas which have a possible maximum value (see p. 18 of his handout).
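
One way a Kappa can be given a 'possible maximum value' is Cohen's kappa-max: with each judge's marginal totals fixed, the largest achievable observed agreement places min(row marginal, column marginal) cases on the diagonal for each category. Whether this is exactly the variant the website computes is an assumption, and the contingency table below is invented:

```python
# Sketch of Cohen's Kappa and kappa-max from a two-judge contingency
# table. kappa-max substitutes the largest observed agreement that the
# judges' marginal totals allow: p_o_max = sum over categories of
# min(row marginal, column marginal) proportions.

def kappa_and_max(table):
    """table[i][j] = number of items judge 1 coded i and judge 2 coded j."""
    n = sum(sum(row) for row in table)
    m = len(table)
    row_m = [sum(row) / n for row in table]
    col_m = [sum(table[i][j] for i in range(m)) / n for j in range(m)]
    p_o = sum(table[i][i] for i in range(m)) / n       # observed agreement
    p_e = sum(row_m[i] * col_m[i] for i in range(m))   # chance agreement
    p_o_max = sum(min(row_m[i], col_m[i]) for i in range(m))
    return (p_o - p_e) / (1 - p_e), (p_o_max - p_e) / (1 - p_e)

# Invented table for 3 error categories, two judges
table = [[20, 5, 0],
         [4, 15, 1],
         [1, 2, 12]]
kappa, kappa_max = kappa_and_max(table)
print(round(kappa, 3), round(kappa_max, 3))
```

Comparing the obtained Kappa with its maximum shows how much of the shortfall from perfect agreement is due to the judges' differing marginal distributions.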

Next week Item analysis for norm-referenced measures Reliability tests for criterion-referenced measures Validity tests

References
Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill.
Fulcher, G. 2010. Practical language testing. London: Hodder Education.
Gyllstad, H. 2009. Designing and evaluating tests of receptive collocation knowledge: COLLEX and COLLMATCH. In Barfield, A. and Gyllstad, H. (eds.) Researching Collocations in Another Language: Multiple Interpretations (pp ). London: Palgrave Macmillan.
Larsen-Hall, J. 2010. A guide to doing statistics in second language research using SPSS. London: Routledge.
Sasaki, C. L. 1996. Teacher preferences of student behavior in Japan. JALT Journal 18 (2).
Scholfield, P. Simple statistical approaches to reliability and item analysis. LG675 Handout. University of Essex.
Skoufaki, S. 2008. Investigating the source of idiom transparency intuitions. Metaphor and Symbol 24(1).

Suggested readings
On the meaning of 'reliability' (particularly in relation to language testing):
Bachman, L.F. and Palmer, A.S. Language Testing in Practice. Oxford: Oxford University Press. (pp. 19-21)
Bachman, L.F. Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press. (Chapter 5)
Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill.
Fulcher, G. 2010. Practical language testing. London: Hodder Education. (pp. 46-7)
Hughes, A. Testing for Language Teachers (2nd ed.). Cambridge: Cambridge University Press. (pp )
On the statistics used to assess language test reliability:
Bachman, L.F. Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press. (Chapter 5)
Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill. (Chapter 8)

Suggested readings (cont.)
Brown, J.D. Reliability of surveys. Shiken: JALT Testing & Evaluation SIG Newsletter 1 (2).
Field, A. Discovering statistics using SPSS (3rd ed.). London: Sage. (sections 17.9, 17.10)
Fulcher, G. 2010. Practical language testing. London: Hodder Education. (pp. 47-52)
Howell, D.C. Statistical methods for psychology. Calif.: Wadsworth. (pp )
Larsen-Hall, J. 2010. A guide to doing statistics in second language research using SPSS. London: Routledge. (sections 6.4, , 6.5.5)

Homework The file 'P-FP Sophia Sumei.xls' contains the number of pauses (unfilled, filled, and total) in some spoken samples of learners of English, according to the judgments of a former colleague and myself. Which of the aforementioned statistical tests of inter-judge agreement seem appropriate for this kind of data? What else would you need to find out about the data in order to decide which test is the most appropriate?