Testing 05 Reliability.


Errors & Reliability
Errors in the test cause unreliability: the fewer the errors, the more reliable the test.
Sources of error:
- Obvious: poor health, fatigue, lack of interest
- Less obvious: the facets discussed in Fig. 5.3

Reliability & Validity
Reliability is a necessary condition for validity. Reliability and validity are complementary aspects of measurement.
Reliability: how much of the performance is due to measurement error, or to factors other than the language ability we want to measure.
Validity: how much of the performance is due to the language ability we want to measure.

Reliability Measurement
Measuring reliability involves both logical analysis and empirical research: identifying the sources of error and estimating the magnitude of their effects on the scores.

Logical Analysis
Example of identifying a source of error. Topic in an oral interview: business negotiation. This is a source of error if we want to measure the test taker's ability to speak on general topics; it is an indicator of the ability if we want to measure the test taker's command of business English.

Empirical Research
Procedures are usually complex. Three kinds of theories:
- Classical true score theory (CTS)
- Generalizability theory (G-theory)
- Item response theory (IRT)

Factors on Test Scores
Characteristics of factors:
- general vs. specific
- lasting vs. temporary
- systematic vs. unsystematic

Factors that affect language test scores

Variance & Standard Deviation
$s$: standard deviation of the sample; $\sigma$: standard deviation of the population; $s^2$: variance of the sample; $\sigma^2$: variance of the population.
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
where $X$: individual score; $\bar{X}$: mean score; $n$: number of students.
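
As a quick illustration (not part of the original slides), a minimal Python sketch of the formula above; the score list is made up:

```python
# Sample variance and standard deviation, with n - 1 in the denominator
# as in the formula above. The scores are made-up example data.
scores = [72, 85, 90, 65, 78, 88, 70, 95]

n = len(scores)
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)   # s^2
sd = variance ** 0.5                                        # s
print(f"mean = {mean:.2f}, s^2 = {variance:.2f}, s = {sd:.2f}")
```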

Correlation Coefficient
Covariance (COV): how two variables, X and Y, vary together.
$\mathrm{COV}(X, Y) = \frac{1}{n-1} \sum (X_i - \bar{X})(Y_i - \bar{Y})$
Correlation coefficient (Pearson product-moment correlation coefficient):
$r_{xy} = \frac{\mathrm{COV}(X, Y)}{s_x s_y} = \frac{\frac{1}{n-1} \sum (X_i - \bar{X})(Y_i - \bar{Y})}{s_x s_y}$

Correlation Coefficient
where $n$: number of paired scores (test takers); $X_i$: individual score on the first half; $\bar{X}$: mean of the scores on the first half; $Y_i$: individual score on the second half; $\bar{Y}$: mean of the scores on the second half; $s_x$: standard deviation of the first half; $s_y$: standard deviation of the second half.

Calculation of Correlation Coefficient
Three ways to calculate:
1. entirely by hand
2. by hand with Excel's help
3. entirely in Excel
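
A minimal sketch of the covariance-based computation (not from the original slides; the helper names and score lists are invented for illustration):

```python
# Pearson product-moment correlation, built from the covariance and
# the two standard deviations exactly as defined on the slides above.

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / (len(v) - 1)) ** 0.5

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (sample_sd(x) * sample_sd(y))

# Made-up scores for six test takers on the two halves of a test.
half1 = [10, 12, 9, 15, 11, 14]
half2 = [11, 13, 10, 14, 10, 15]
print(f"r = {pearson_r(half1, half2):.3f}")
```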

Classical True Score Theory
Also referred to as classical reliability theory, because its major task is to estimate the reliability of the observed scores of a test; that is, it attempts to estimate the strength of the relationship between the observed score and the true score.
Sometimes also referred to as true score theory, because its theoretical derivations are based on a mathematical model known as the true score model.

Assumptions in CTS
Assumption 1: the observed score consists of a true score and an error score, i.e. $x = x_t + x_e$.
Assumption 2: error scores are unsystematic, random, and uncorrelated with the true score, so the variances add: $s^2 = s_t^2 + s_e^2$.

Parallel Tests
Two tests are parallel if
$\bar{x} = \bar{x}'$, $s_x^2 = s_{x'}^2$, and $r_{xy} = r_{x'y}$.

Correlation Between Parallel Tests
If the observed scores on two parallel tests are highly correlated, the effects of the error scores are minimal. Reliability is defined as the correlation between the observed scores on two parallel tests. This definition is the basis for all estimates of reliability within CTS theory. Condition: the observed scores on the two tests must be experimentally independent.

Error Score Estimation and Measurement
Relations between reliability, true score, and error score:
- The higher the proportion of true score variance, the higher the correlation between the two parallel tests (true scores are systematic).
- The higher the proportion of error score variance, the lower the correlation between the two parallel tests (error scores are random).

Error Score Estimation and Measurement
$r_{xx'} = s_t^2 / s_x^2$
$(s_t^2 + s_e^2) / s_x^2 = 1$
$s_e^2 / s_x^2 = 1 - s_t^2 / s_x^2$
Since $s_t^2 / s_x^2 = r_{xx'}$, we have $s_e^2 / s_x^2 = 1 - r_{xx'}$, i.e. $s_e^2 = s_x^2 (1 - r_{xx'})$.

Approaches to Estimating Reliability
Three approaches, based on different sources of error:
- Internal consistency: errors arising from within the test and the scoring procedure.
- Stability: how consistent test scores are over time.
- Equivalence: whether scores on alternative forms of a test are equivalent.

Internal Consistency
For dichotomous items:
- Split-half reliability estimates: the Spearman-Brown split-half estimate; the Guttman split-half estimate
- Kuder-Richardson reliability coefficients
For non-dichotomous items:
- Coefficient alpha
- Rater consistency

Split-half Reliability Estimates
Split the test into two halves which have equal means and variances (equivalence) and are independent of each other (independence). Three ways to split:
1. first half vs. second half
2. random halves
3. odd-even method

Spearman-Brown Reliability Estimate
$r_{xx'} = \frac{2 r_{hh'}}{1 + r_{hh'}}$
where $r_{hh'}$ is the correlation between the two halves of the test.
Procedure:
1. Divide the test into two equal halves.
2. Calculate the correlation coefficient between the two halves.
3. Calculate the Spearman-Brown reliability estimate.
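
A sketch of this procedure in Python (not from the original slides; it uses statistics.correlation, available from Python 3.10, and made-up half scores):

```python
from statistics import correlation  # Pearson r; Python 3.10+

# Spearman-Brown: correlate the two halves, then step the half-test
# correlation up to the reliability of the full-length test.
def spearman_brown(half1, half2):
    r_hh = correlation(half1, half2)
    return 2 * r_hh / (1 + r_hh)

half1 = [10, 12, 9, 15, 11, 14]   # made-up example data
half2 = [11, 13, 10, 14, 10, 15]
print(f"Spearman-Brown reliability = {spearman_brown(half1, half2):.3f}")
```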

Guttman Split-Half Estimate
$r_{xx'} = 2\left(1 - \frac{s_{h_1}^2 + s_{h_2}^2}{s_x^2}\right)$
where $s_{h_1}^2$: variance of the first half; $s_{h_2}^2$: variance of the second half; $s_x^2$: variance of the total scores.
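
The same made-up halves run through the Guttman formula (an illustrative sketch, not from the original slides):

```python
from statistics import variance  # sample variance, n - 1 denominator

# Guttman split-half: needs only the variances of the two halves and
# of the total score, so the halves need not have equal variances.
def guttman_split_half(half1, half2):
    total = [a + b for a, b in zip(half1, half2)]
    return 2 * (1 - (variance(half1) + variance(half2)) / variance(total))

half1 = [10, 12, 9, 15, 11, 14]   # made-up example data
half2 = [11, 13, 10, 14, 10, 15]
print(f"Guttman reliability = {guttman_split_half(half1, half2):.3f}")
```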

Kuder-Richardson Formula 20
$r_{xx'} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$
where $k$: number of items on the test; $p$: proportion of correct answers on an item, i.e. correct answers / total answers (the difficulty); $q$: proportion of incorrect answers, i.e. $1 - p$; $s_x^2$: total test score variance.
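
A sketch of KR-20 in Python (not from the original slides; the 0/1 response matrix is invented):

```python
from statistics import variance

# KR-20 for dichotomous items. Rows of `data` are test takers,
# columns are items scored 0 (wrong) or 1 (right).
def kr20(data):
    k = len(data[0])                          # number of items
    n = len(data)                             # number of test takers
    totals = [sum(row) for row in data]
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in data) / n   # item difficulty p
        pq += p * (1 - p)                     # p * q for this item
    return (k / (k - 1)) * (1 - pq / variance(totals))

data = [                                      # made-up example data
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]
print(f"KR-20 = {kr20(data):.3f}")
```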

Kuder-Richardson Formula 21
$r_{xx'} = \frac{k s_x^2 - \bar{x}(k - \bar{x})}{(k-1) s_x^2}$
where $k$: number of items on the test; $s_x^2$: total test score variance; $\bar{x}$: mean score.
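
KR-21 needs only three summary numbers, so the sketch is short (the figures are made up):

```python
# KR-21 from the number of items k, the mean score, and the
# total-score variance; it assumes items of similar difficulty.
def kr21(k, mean_score, total_var):
    return (k * total_var - mean_score * (k - mean_score)) / ((k - 1) * total_var)

print(f"KR-21 = {kr21(k=50, mean_score=34.0, total_var=64.0):.3f}")
```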

Coefficient Alpha
$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$
where $k$: number of parts of the test; $\sum s_i^2$: sum of the variances of the different parts of the test; $s_x^2$: variance of the total test scores.
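
A sketch of coefficient alpha over the parts of a test (not from the original slides; the part scores are invented):

```python
from statistics import variance

# Coefficient alpha. `parts` holds one score list per part of the
# test, each ordered by test taker.
def coefficient_alpha(parts):
    k = len(parts)
    totals = [sum(scores) for scores in zip(*parts)]
    part_var = sum(variance(p) for p in parts)
    return (k / (k - 1)) * (1 - part_var / variance(totals))

parts = [                   # made-up scores for six test takers
    [4, 3, 5, 2, 4, 5],     # part 1
    [3, 3, 4, 2, 4, 4],     # part 2
    [5, 2, 5, 1, 3, 5],     # part 3
]
print(f"alpha = {coefficient_alpha(parts):.3f}")
```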

Comparison of Estimates: Assumptions
All of these estimates assume that the parts of the test are experimentally independent. The Spearman-Brown estimate additionally assumes equivalence (equal means and variances of the halves); the Guttman, K-R, and coefficient α estimates do not require it. If the equivalence assumption is violated, an estimate underestimates reliability; if the independence assumption is violated, it overestimates reliability.

Summary: Estimate Procedure
Spearman-Brown:
1. split the test into halves
2. variance of each half
3. correlation coefficient between the two halves
4. reliability coefficient

Summary: Estimate Procedure
Guttman:
1. split the test into halves
2. variance of each half
3. variance of the whole test
4. reliability coefficient

Summary: Estimate Procedure
K-R 20:
1. number of questions
2. proportion of correct answers for each question
3. proportion of incorrect answers for each question
4. sum of the products of p and q
5. variance of the whole test
6. reliability coefficient

Summary: Estimate Procedure
K-R 21:
1. number of questions
2. mean of the test
3. variance of the test
4. reliability coefficient

Summary: Estimate Procedure
Coefficient α:
1. number of parts of the test
2. mean of each part
3. variance of each part
4. sum of the variances of all parts
5. mean of the test
6. variance of the test
7. reliability coefficient

Rater Consistency
- Intra-rater
- Inter-rater

Intra-rater Reliability
Rate each paper twice. Condition: the two ratings must be independent of each other. Two ways of estimating.
Spearman-Brown: take each rating as a split half and compute the reliability coefficient.

Intra-rater Reliability
Condition: the two ratings must have similar means and variances, to ensure the equivalence of the two ratings.
Coefficient alpha: take the two ratings as two parts of a test:
$\alpha = \frac{k}{k-1}\left(1 - \frac{s_{x_1}^2 + s_{x_2}^2}{s_{x_1+x_2}^2}\right)$

Intra-rater Reliability
where $k$: number of ratings; $s_{x_1}^2$: variance of the first rating; $s_{x_2}^2$: variance of the second rating; $s_{x_1+x_2}^2$: variance of the summed ratings. Since $k = 2$, the formula reduces to the Guttman reliability coefficient formula.

Inter-rater Reliability
If there are only two raters, use the split-half estimates to obtain the reliability coefficient, or the rank correlation coefficient (Spearman):
$r_{xx'} = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$
where $D$: difference between the two raters' ranks for a test taker

Inter-rater Reliability
$n$: number of test takers. See testing 05-2 sheet 5 for an example. Note: tied scores share the same (averaged) rank. If there are more than two raters, use the coefficient alpha estimate.
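
A sketch of the rank-based computation for two raters (not from the original slides; the tie handling and the ratings are illustrative assumptions):

```python
# Rank correlation: r = 1 - 6 * sum(D^2) / (n * (n^2 - 1)), where D is
# the difference between the two raters' ranks for each test taker.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i                      # find the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):  # tied scores share the average rank
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def rank_correlation(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

rater1 = [85, 78, 92, 70, 88]      # made-up ratings
rater2 = [80, 75, 95, 72, 85]
print(f"rank r = {rank_correlation(rater1, rater2):.3f}")
```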

Stability (Test-Retest Reliability)
Administer the test twice to a group of individuals and compute the correlation between the two sets of scores. The correlation can then be interpreted as an indicator of how stable the scores are over time. Learning effects and practice effects must be taken into account.

Equivalence (Parallel Forms Reliability)
Use alternative forms of a given test. Compute and compare the means and standard deviations of the two forms to determine their equivalence. The correlation between the two sets of scores can be interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.

GENERALIZABILITY THEORY

GENERALIZABILITY THEORY
Generalizability theory (G-theory) builds on factorial design and the analysis of variance. It constitutes a theory and a set of procedures for specifying and estimating the relative effects of different factors on observed test scores. It thus provides a means for relating test uses or interpretations to the way test users specify and interpret different factors, as either abilities or sources of error.

GENERALIZABILITY THEORY
G-theory treats a given measure or score as a sample from a hypothetical universe of possible measures: on the basis of an individual's performance on a test, we generalize to his or her performance in other contexts. Reliability = generalizability. The way we define a given universe of measures depends on the universe of generalization.

Application of G-theory
Two stages: the G-study and the D-study.

G-study
Consider the uses that will be made of the test scores, and investigate the sources of variance that are of concern or interest. On the basis of this generalizability study, the test developer obtains estimates of the relative sizes of the different sources of variance ('variance components').

D-study
When the results of the G-study are satisfactory, the test developer administers the test under operational conditions, and uses G-theory procedures to estimate the magnitude of the variance components. These estimates provide information that can inform the interpretation and use of the test scores.

Significance of G-theory
The application of G-theory thus enables test developers and test users to specify the different sources of variance that are of concern for a given test use, to estimate the relative importance of these different sources simultaneously, and to employ these estimates in the interpretation and use of test scores.

Universe of Generalization and Universe of Measures
Universe of generalization: a domain of uses or abilities (or both).
Universe of possible measures: the types of test scores we would be willing to accept as indicators of the ability to be measured, for the intended purpose.

Populations of Persons
In addition to defining the universe of possible measures, we must define the group, or population, of persons about whom we are going to make decisions or inferences.

Universe Score
A universe score $x_p$ is defined as the mean of a person's scores on all measures from the universe of possible measures. The universe score is thus the G-theory analog of the CTS-theory true score. The variance of a group of persons' scores on all measures would be equal to the universe score variance $s_p^2$, which is similar to CTS true score variance in the sense that it represents that proportion of observed score variance that remains constant across different individuals and different measurement facets and conditions.

Universe Score
The universe score is different from the CTS true score, however, in that an individual is likely to have different universe scores for different universes of measures.

Generalizability Coefficients
The G-theory analog of the CTS-theory reliability coefficient is the generalizability coefficient, defined as the proportion of observed score variance that is universe score variance:
$\rho_{xx'}^2 = s_p^2 / s_x^2$
where $s_p^2$ is the universe score variance and $s_x^2$ is the observed score variance, which includes both universe score variance and error variance.

Estimation
Variance components (sources of variance): persons (p), forms (f), raters (r).
$s_x^2 = s_p^2 + s_f^2 + s_r^2 + s_{pf}^2 + s_{pr}^2 + s_{fr}^2 + s_{pfr}^2$
Use ANOVA to estimate the magnitude of each variance component, and analyse those that are significantly large.
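
A minimal sketch of the estimation step, assuming a fully crossed persons × raters design with one score per cell (the data matrix is invented, and with one observation per cell the residual is confounded with the person-by-rater interaction):

```python
import numpy as np

# Variance components for a persons x raters G-study, estimated from
# ANOVA mean squares; rows = persons, columns = raters.
scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 4, 3],
    [1, 2, 2],
], dtype=float)

n_p, n_r = scores.shape
grand = scores.mean()
p_means = scores.mean(axis=1)
r_means = scores.mean(axis=0)

ms_p = n_r * ((p_means - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((r_means - grand) ** 2).sum() / (n_r - 1)
resid = scores - p_means[:, None] - r_means[None, :] + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

var_pr = ms_pr                            # interaction + error
var_p = max((ms_p - ms_pr) / n_r, 0.0)    # persons (universe score)
var_r = max((ms_r - ms_pr) / n_p, 0.0)    # raters

# Generalizability coefficient for relative decisions over n_r raters.
g = var_p / (var_p + var_pr / n_r)
print(f"var_p={var_p:.3f}, var_r={var_r:.3f}, var_pr={var_pr:.3f}, G={g:.3f}")
```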

Standard Error of Measurement (SEM)
We need to know the extent to which a test score may vary: the SEM.
$s_e = s_x \sqrt{1 - r_{xx'}}$
Derivation:
(1) $r_{xx'} = s_t^2 / s_x^2$
(2) $s_t^2 / s_x^2 + s_e^2 / s_x^2 = 1$
(3) $s_e^2 / s_x^2 = 1 - s_t^2 / s_x^2 = 1 - r_{xx'}$
so $s_e^2 = s_x^2 (1 - r_{xx'})$.
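
In code (illustrative numbers; the 1.96 multiplier assumes normally distributed errors):

```python
# SEM and an approximate 95% band around one observed score.
sx, rxx = 12.0, 0.91                 # score SD and reliability (made up)
sem = sx * (1 - rxx) ** 0.5
observed = 74
print(f"SEM = {sem:.2f}; 95% band = {observed - 1.96 * sem:.1f} "
      f"to {observed + 1.96 * sem:.1f}")
```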

Interpretation of Test Scores
- Difficulty
- Distinction (item discrimination)
- Z score

Difficulty for Dichotomous Scoring
$p = R / n$
where $p$: difficulty index; $R$: number of correct answers; $n$: number of students.

Difficulty for Dichotomous Scoring (Corrected)
$C_p = \frac{kp - 1}{k - 1}$
where $C_p$: corrected difficulty index; $p$: uncorrected difficulty index; $k$: number of choices.

Difficulty for Non-dichotomous Scoring
$p = \text{mean} / \text{full score}$
Acceptable difficulty values typically fall between 30% and 85%.

Distinction
Label the top 27% of the test takers (by total score) as the high group and the bottom 27% as the low group.
$D = P_H - P_L$
where $D$: distinction (discrimination) index; $P_H$: rate of correct answers in the high group; $P_L$: rate of correct answers in the low group.
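
A sketch combining the difficulty and distinction calculations (not from the original slides; the response matrix and the handling of the 27% cut are illustrative assumptions):

```python
# Item difficulty p = R / n and distinction index D = P_H - P_L,
# with high/low groups formed from the top and bottom 27% by total
# score. `data` is a made-up 0/1 response matrix.
def item_analysis(data, item):
    n = len(data)
    totals = [sum(row) for row in data]
    order = sorted(range(n), key=lambda i: totals[i], reverse=True)
    cut = max(1, round(0.27 * n))
    high, low = order[:cut], order[-cut:]
    p = sum(row[item] for row in data) / n
    p_high = sum(data[i][item] for i in high) / cut
    p_low = sum(data[i][item] for i in low) / cut
    return p, p_high - p_low

data = [
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1],
    [1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 1, 1],
]
p, d = item_analysis(data, item=2)
print(f"item 3: difficulty p = {p:.2f}, distinction D = {d:.2f}")
```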

Z Score
A way of placing an individual score in the whole distribution of scores on a test: it expresses how many standard deviation units the score lies above or below the mean. Scores above the mean are positive; those below the mean are negative. An advantage of z scores is that they allow scores from different tests to be compared, where the means and standard deviations differ and where score points may not be equal.
$Z = (X - \bar{X}) / s$

T-score
A transformation of the z score, equivalent to it but with the advantage of avoiding negative values, and hence often used for reporting purposes.
$T = 10Z + 50$

Standardized Score
A transformation of raw scores which provides a measure of relative standing in a group and allows comparison of raw scores from different distributions, e.g. from tests of different lengths. It does this by converting a raw score into a standard frame of reference, expressed in terms of its relative position in the distribution of scores. The z score is the most commonly used standardized score.
Standardized score $= 100Z + 500$
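
The three transformations in one short sketch (the numbers are made up):

```python
# z score, T score, and the 100Z + 500 standardized score for one raw
# score, given the group mean and standard deviation.
raw, group_mean, group_sd = 74.0, 68.0, 12.0
z = (raw - group_mean) / group_sd
print(f"z = {z:.2f}, T = {10 * z + 50:.1f}, standardized = {100 * z + 500:.0f}")
```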