Classical Test Theory Margaret Wu.



Some Statistical Terms
Mean (average): the location of a distribution.
Variance: the spread of a distribution.
Standard deviation = sqrt(Variance): also a measure of spread.
For a roughly normal distribution, about 95% of the observations fall within mean ± 2 standard deviations.

Normal distribution
For the standard normal distribution, about 95% of the observations fall between −2 and 2.

Correlation
The degree of (linear) association between two variables. (The slide's example scatterplot shows Corr = 0.8.)

CTT item difficulty, person ability and item discrimination
- CTT item difficulty: the percentage of persons obtaining the correct answer on an item.
- CTT person ability: the percentage of items a person answers correctly.
- CTT item discrimination: the correlation between students' total test scores and their scores on the item.

Item Discrimination
The correlation between two columns: each student's score on the whole test and the same student's score on one item (illustrated in the slide's table for students 1, 2, 3, …, 28, 29, 30).

Classical Test Theory vs. IRT - 1
Classical Test Theory (CTT): true-score theory.
Modern Test Theory: item response theory (IRT); latent trait models.
IRT focuses on estimating each student's 'ability' (θ) on a latent trait.
CTT focuses on estimating each student's 'true score' (T) on a test.

Classical Test Theory vs. IRT - 2
IRT: making inferences about a student's 'ability' on the latent trait (that is being tested).
CTT: making inferences about a student's likely score on a test.
IRT: the notion of a latent trait, ranging from −∞ to +∞.
CTT: test scores on a test, ranging from 0 to the maximum score on the test.

Classical Test Theory vs. IRT - 3
For example, a geometry test is given to students.
Under the IRT approach, we try to estimate each student's level on the latent trait "geometry". The level on this latent trait "influences" the item responses.
Under CTT, we try to estimate the likely score on THIS geometry test (and geometry tests like this test).
This is a philosophical difference between the two.

Classical Test Theory vs. IRT - 4
IRT provides more scope for linking different tests and for providing substantive interpretations of scores on a test.
CTT is more limited to scores on ONE (kind of) test; there is less scope for generalisation.
If you are only interested in ONE test, and you are only interested in ranking students, then IRT does not provide much more than CTT.

Assumptions of CTT
1. X = T + E (observed score = true score + error)
2. mean(X) = T
3. Corr(E, T) = 0
4. Corr(E1, E2) = 0
5. Corr(E1, T2) = 0
Parallel tests: X and X' are parallel if both satisfy 1-5, T = T', and Var(E) = Var(E').
Equation 1 says that a person's observed score is his/her true score plus error. Equation 2 says that, if you were able to administer a test over and over again, the average of the observed scores would be the true score. In fact, this is how the true score is defined: the average of the observed scores if a test were administered many times. Equation 3 says that a person's error is uncorrelated with his/her true score. Equation 4 says that the errors are uncorrelated across people; E1 denotes the error for person 1, E2 the error for person 2, and so on. Equation 5 says that the error for one person is uncorrelated with the true score of any other person.
Parallel tests: X is a person's score on one test, and X' is the person's score on a "parallel" test.
Tau-equivalent tests: if scores on two tests satisfy 1-5, but the true score on one test equals the true score on the other plus a constant, and the error variances are possibly different, then the two tests are said to be (essentially) tau-equivalent.
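Assumption 2, that the true score is the long-run average of observed scores, can be illustrated with a small simulation. This is a hypothetical sketch of X = T + E for a single person, with a true score and error spread invented for illustration:

```python
import random

random.seed(0)

# Hypothetical illustration of X = T + E for one person: the true score T is
# fixed, and each administration adds a fresh error term with mean zero.
true_score = 50.0
error_sd = 4.0
n_administrations = 100_000

observed = [true_score + random.gauss(0.0, error_sd)
            for _ in range(n_administrations)]

# Averaging the observed scores over many administrations recovers
# (approximately) the true score, as assumption 2 states.
mean_observed = sum(observed) / n_administrations
```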

These follow from CTT:
Mean(E) = 0
Var(X) = Var(T) + Var(E)
[Corr(X, T)]² = Var(T) / Var(X)
Var(X) = Var(X'), for parallel tests
Corr(X, X') = Var(T) / Var(X)
The last quantity is defined as the reliability of a test.
To show that [Corr(X, T)]² = Var(T) / Var(X):
Corr(X, T) = Cov(X, T) / sqrt(Var(X) Var(T)).
The numerator, Cov(X, T), can be expanded as Cov(T+E, T) = Cov(T, T) + Cov(E, T) = Var(T) + 0 = Var(T).
So Corr(X, T) = Var(T) / sqrt(Var(X) Var(T)) = sqrt(Var(T) / Var(X)).
As an exercise, show that Corr(X, X') = Var(T) / Var(X).
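The identity Corr(X, X') = Var(T)/Var(X) can also be checked numerically by simulating two parallel forms. A hypothetical sketch, with true-score and error variances chosen purely for illustration:

```python
import random

random.seed(1)

# Hypothetical simulation: true scores with Var(T) = 100, errors with
# Var(E) = 25, so theoretical reliability = Var(T)/Var(X) = 100/125 = 0.8.
n = 100_000
true_scores = [random.gauss(100.0, 10.0) for _ in range(n)]
x1 = [t + random.gauss(0.0, 5.0) for t in true_scores]  # test X
x2 = [t + random.gauss(0.0, 5.0) for t in true_scores]  # parallel test X'

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (var(xs) * var(ys)) ** 0.5

# Corr(X, X') should come out close to the theoretical reliability of 0.8
reliability = corr(x1, x2)
```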

Reliability
In words, reliability is the proportion of true-score variance in the observed-score variance.
If measurement error is small, observed scores will be close to true scores, so reliability will be close to 1. If measurement error is large, observed scores will have a much larger variance than the true scores, so reliability will be close to 0.

Reliability - 1: Test/retest
Administer the same test on two occasions and compare the agreement between candidates' scores on the two occasions.
For example, if we are testing the ability to shoot basketball goals, we can test each person on two occasions. Or if we are testing for stress levels, we can administer a questionnaire, or measure symptoms, on two occasions.
But, in general, it will be difficult to administer the same achievement test on two occasions.

Reliability - 2: Parallel forms
Administer two "similar" tests and assess the agreement between candidates' scores.
This overcomes the problem of "exposed items", as the two tests have different items. Yet the two tests test the same construct, so they are similar in content and difficulty.
But we need more resources to construct two separate tests, and we need time for two test administrations.

Reliability - 3: Single administration method (internal consistency reliability)
How about splitting the test into two halves, or into many sub-tests, and assessing the agreement of scores on the sub-sections?
This is a less expensive option, as there needs to be only one test and one test administration.

Computing Reliability - 1: Internal Consistency
Spearman-Brown; Cronbach's α (coefficient α).
The Spearman-Brown formula computes the reliability by splitting the test into two halves. If the correlation between the two half-tests Y and Y' is ρ(Y, Y'), the reliability of the full test is
ρ(X, X') = 2 ρ(Y, Y') / (1 + ρ(Y, Y')).
As there are many ways of splitting the test into two halves, the average of all possible Spearman-Brown-corrected half-test correlations is Cronbach's alpha, also known as coefficient alpha. The numerator in Cronbach's alpha is essentially the covariance between the scores on the two halves.
Symbols: ρ denotes a correlation; σ² denotes a variance.

Computing Reliability - 2
General form of Spearman-Brown; general form of Cronbach's α; Kuder-Richardson (KR-20).
The generalised Spearman-Brown formula predicts the reliability of a test N times the length of the tests for which the parallel-forms correlation ρ was computed:
ρ_N = N ρ / (1 + (N − 1) ρ).
(In the same way, on the previous slide, the Spearman-Brown formula predicts the reliability of a test twice the length of the tests Y and Y', as Y and Y' are the split-half tests.)
The general form of Cronbach's alpha assumes each item is a sub-test:
α = (k / (k − 1)) × (1 − (sum of item variances) / Var(X)),
where k is the number of items and Var(X) is the variance of the total score. The numerator in Cronbach's alpha is essentially the covariance between the item scores. In the case of dichotomous items, Cronbach's alpha is known as KR-20.
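Both formulas are short enough to implement directly. A minimal sketch (the function names and the small score matrix are invented for illustration):

```python
def spearman_brown(rho, n):
    """Predicted reliability of a test n times as long (general Spearman-Brown)."""
    return n * rho / (1 + (n - 1) * rho)

def cronbach_alpha(responses):
    """Coefficient alpha from a persons x items score matrix."""
    k = len(responses[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(row) for row in responses]
    item_vars = sum(var([row[i] for row in responses]) for i in range(k))
    return k / (k - 1) * (1 - item_vars / var(totals))
```

For example, `spearman_brown(0.6, 2)` gives 0.75: the classic correction of a half-test correlation of 0.6 up to the full-length test.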

Sources of variation & reliability
- Variation in an individual from day to day: test/retest.
- Variation in items: parallel forms; single administration.
- Variation in measurement procedures: all types of reliability.
Test/retest reliability captures errors due to variation in an individual from day to day, as the two tests usually take place at two different time points. Test/retest reliability also captures measurement error. But test/retest reliability does not capture errors due to the sampling of items. Parallel-forms reliability captures measurement errors, as well as errors due to the sampling of items, as the two parallel tests contain different items. However, parallel-forms reliability does not capture variation due to changes in an individual from day to day.

Use of reliability - 1: Standard error of measurement
Standard error of measurement = SD × sqrt(1 − reliability).
Example: reliability = 0.9, standard deviation of test scores = 15; then the standard error of measurement = 15 × sqrt(1 − 0.9) = 15 × 0.316 ≈ 4.7.
If a person's score on the test is 65, we are about 95% confident that the true score lies between 65 − 2 × 4.7 ≈ 55.5 and 65 + 2 × 4.7 ≈ 74.5.
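The slide's arithmetic can be reproduced in a few lines:

```python
# Reproducing the slide's example: reliability = 0.9, SD of test scores = 15.
reliability = 0.9
sd = 15.0

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sem = sd * (1 - reliability) ** 0.5

# Approximate 95% confidence band around an observed score of 65
observed = 65.0
lower, upper = observed - 2 * sem, observed + 2 * sem
```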

Use of reliability - 2: Correcting for "attenuation"
Var(T) = reliability × Var(X).
Example: if the variance of the observed scores on a maths test is 5.2 and the reliability of the test is 0.8, then the corrected variance for the population is 5.2 × 0.8 = 4.16.
When a group of students take a test, we have their test scores. These are "observed" test scores, not "true" test scores. Consequently, if we compute the variance of the observed test scores, it is generally a little larger than the variance of the true test scores, because there is error in each observed test score. The reliability can be used to correct the variance of the observed scores to give an estimate of the variance of the true scores.
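The correction itself is a one-line computation, using the slide's figures:

```python
# Reproducing the slide's attenuation correction: Var(T) = reliability * Var(X)
observed_variance = 5.2
reliability = 0.8

# Estimated true-score variance, and the corresponding standard deviation
true_score_variance = reliability * observed_variance
true_score_sd = true_score_variance ** 0.5
```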

Factors affecting reliability
- Ability range of the group: a wider range gives higher reliability.
- Level of ability in the group: reliability is higher if item difficulties match abilities.
- Length of the test: longer tests have higher reliability.