Classical Test Theory Margaret Wu.


1 Classical Test Theory Margaret Wu

2 Some Statistical Terms
Mean (average): the location of the distribution. Variance: the spread of the distribution. Standard deviation = sqrt(variance): also a measure of the spread of the distribution. About 95% of the observations fall within mean ± 2 standard deviations.

3 Normal distribution About 95% of the observations fall between −2 and 2 for the standard normal distribution (mean 0, standard deviation 1).

4 Correlation Degree of association between two variables. (The slide shows a scatter plot of two variables with Corr = 0.8.)
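The statistical terms above can be sketched in Python with NumPy; the data values here are invented purely for illustration.

```python
import numpy as np

scores = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 16.0, 13.0, 14.0])

mean = scores.mean()            # location of the distribution
variance = scores.var(ddof=1)   # spread of the distribution (sample variance)
sd = np.sqrt(variance)          # standard deviation = sqrt(variance)

# Roughly 95% of normally distributed observations fall in mean ± 2 SD.
lower, upper = mean - 2 * sd, mean + 2 * sd

# Correlation: degree of (linear) association between two variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
r = np.corrcoef(x, y)[0, 1]

print(mean, sd, (lower, upper), r)
```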

5 CTT item difficulty, person ability and item discrimination
CTT item difficulty: the percentage of persons obtaining the correct answer on an item. CTT person ability: the percentage of items a person answers correctly. CTT item discrimination: the correlation between students' total test scores and their scores on the item.

6 Item Discrimination Item discrimination is the correlation between two columns: each student's total score on the test and his/her score on one item. (The slide shows a table with these two columns, with rows for students 1, 2, 3, …, 28, 29, 30.)
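A minimal sketch of the three CTT statistics above, computed from a small invented matrix of scored responses (rows are students, columns are items; 1 = correct, 0 = incorrect):

```python
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

# Item difficulty: proportion of persons answering each item correctly.
item_difficulty = responses.mean(axis=0)

# Person ability: proportion of items each person answers correctly.
person_ability = responses.mean(axis=1)

# Item discrimination: correlation between total test score and item score.
total_score = responses.sum(axis=1)
item_discrimination = np.array([
    np.corrcoef(total_score, responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

print(item_difficulty, person_ability, item_discrimination)
```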

7 Classical Test Theory vs. IRT - 1
Classical Test Theory: true-score theory (CTT). Modern Test Theory: item response theory; latent trait models (IRT). IRT focuses on estimating each student's 'ability' (θ) on a latent trait; CTT focuses on estimating each student's 'true score' (T) on a test.

8 Classical Test Theory vs. IRT - 2
IRT: making inferences about a student's 'ability' on the latent trait (that is being tested). CTT: making inferences about a student's likely score on a test. IRT: notion of a latent trait, ranging from −∞ to +∞. CTT: test scores on a test, ranging from 0 to the maximum score on the test.

9 Classical Test Theory vs. IRT - 3
For example, a geometry test is given to students. Under the IRT approach, we try to estimate each student's level on the latent trait "geometry"; the level on this latent trait "influences" the item responses. Under CTT, we try to estimate the likely score on THIS geometry test (and geometry tests like this one). This is a philosophical difference between the two approaches.

10 Classical Test Theory vs. IRT - 4
IRT provides more scope for linking different tests, and providing substantive interpretations to scores on a test. CTT is more limited to scores on ONE (kind of) test. There is less scope for generalisation. If you are only interested in ONE test, and you are only interested in ranking students, then IRT does not provide much more than CTT.

11 Assumptions of CTT
1. X = T + E (observed score = true score + error)
2. Mean(X) = T
3. Corr(E, T) = 0
4. Corr(E1, E2) = 0
5. Corr(E1, T2) = 0
Parallel tests: X and X′ satisfy 1–5, with T = T′ and Var(E) = Var(E′).
Equation 1 says that the observed score for a person is his/her true score plus error. Equation 2 says that, if you were able to administer the test over and over again, the average of the observed scores would be the true score; in fact, this is how the true score is defined: the average of the observed scores if a test were administered many times. Equation 3 says that a person's error is uncorrelated with his/her true score. Equation 4 says that errors are uncorrelated across people; E1 denotes the error for person 1, E2 the error for person 2, and so on. Equation 5 says that the error for one person is uncorrelated with the true score of any other person. Parallel tests: X is a person's score on one test, and X′ is his/her score on a "parallel" test. Tau equivalent: if scores on two tests satisfy 1–5, but the true score on one test equals the true score on the other plus a constant, and the error variances are possibly different, then the two tests are said to be (essentially) tau-equivalent.

12 These follow from CTT:
Mean(E) = 0
Var(X) = Var(T) + Var(E)
[Corr(X, T)]² = Var(T)/Var(X)
Var(X) = Var(X′), for parallel tests
Corr(X, X′) = Var(T)/Var(X)
The last quantity is defined as the reliability of a test.
To show that [Corr(X, T)]² = Var(T)/Var(X): Corr(X, T) = Cov(X, T)/√(Var(X)Var(T)). The numerator, Cov(X, T), can be expanded as Cov(T + E, T) = Cov(T, T) + Cov(E, T) = Var(T) + 0 = Var(T). So Corr(X, T) = Var(T)/√(Var(X)Var(T)) = √(Var(T)/Var(X)).
As an exercise, show that Corr(X, X′) = Var(T)/Var(X).
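These identities can be checked with a small simulation: generate true scores T and independent errors, form X = T + E and a parallel form X′ = T + E′ (same true scores, equal error variance), and compare the empirical quantities. The distributions and sample size below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
T = rng.normal(50.0, 10.0, n)    # true scores, Var(T) = 100
E1 = rng.normal(0.0, 5.0, n)     # error on form X, Var(E) = 25
E2 = rng.normal(0.0, 5.0, n)     # independent error on parallel form X'
X1 = T + E1
X2 = T + E2

# Var(X) = Var(T) + Var(E), since Corr(E, T) = 0.
var_gap = abs(X1.var() - (T.var() + E1.var()))

reliability = T.var() / X1.var()            # Var(T)/Var(X), ~ 100/125 = 0.8
parallel_corr = np.corrcoef(X1, X2)[0, 1]   # Corr(X, X') ~ reliability
corr_XT_sq = np.corrcoef(X1, T)[0, 1] ** 2  # [Corr(X, T)]^2 ~ reliability

print(var_gap, reliability, parallel_corr, corr_XT_sq)
```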

13 Reliability In words, reliability is the proportion of true-score variance over the observed-score variance. If measurement error is small, the observed scores will be close to the true scores, so the reliability will be close to 1. If measurement error is large, the observed scores will have a much larger variance than the true scores, so the reliability will be close to zero.

14 Reliability - 1 Test/Retest
Administer the same test on two occasions and compare the agreement between candidates' scores on the two occasions. For example, if we are testing the ability to shoot basketball goals, we can test each person on two occasions; or, if we are testing for stress levels, we can administer a questionnaire, or measure symptoms, on two occasions. But, in general, it will be difficult to administer the same achievement test on two occasions.

15 Reliability - 2 Parallel forms
Administer two "similar" tests and assess the agreement between candidates' scores. This overcomes the problem of 'exposed items', as the two tests have different items; yet the two tests test the same construct, so they are similar in content and difficulty. But we need more resources to construct two separate tests, and we need time for two test administrations.

16 Reliability – 3 Single administration method
Internal consistency reliability: we can split the test into two halves, or into many sub-tests, and assess the agreement of scores on the sub-sections. This is a less expensive option, as only one test, and one test administration, is needed.

17 Computing Reliability - 1
Internal consistency: Spearman-Brown; Cronbach's α (coefficient α). The Spearman-Brown formula computes the reliability by splitting the test into two halves. If the correlation between the two half-tests is ρ(Y, Y′), then the reliability of the full test is given by ρ(X, X′) = 2ρ(Y, Y′) / (1 + ρ(Y, Y′)). As there are many ways of splitting the test into two halves, the average of all possible Spearman-Brown-corrected half-test correlations is Cronbach's α, also known as coefficient α. The numerator in Cronbach's α is essentially the covariance between the scores on the two halves. Symbols: ρ denotes correlation; σ² denotes variance.

18 Computing Reliability - 2
General form of Spearman-Brown; general form of Cronbach's α; Kuder-Richardson (KR-20). The generalised form of Spearman-Brown predicts the reliability of a test N times the length: ρ_N = Nρ / (1 + (N − 1)ρ), where ρ is the correlation between parallel forms. (In the same way, on the previous slide, the Spearman-Brown formula predicts the reliability of a test twice the length of the tests Y and Y′, as Y and Y′ are the split-half tests.) The general form of Cronbach's α treats each item as a sub-test: α = (k/(k − 1))(1 − Σσᵢ² / σ_X²), where k is the number of items, σᵢ² is the variance of item i, and σ_X² is the variance of the total scores. The numerator in Cronbach's α is essentially the covariance between the item scores. In the case of dichotomous items, Cronbach's α is known as KR-20.
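The two formulas can be sketched as follows; the response matrix is invented for illustration, and because its items are scored 0/1, the α computed here is also the KR-20 value.

```python
import numpy as np

def spearman_brown(r, n_factor):
    """Predicted reliability of a test n_factor times as long,
    given correlation r between parallel (or half) forms."""
    return n_factor * r / (1 + (n_factor - 1) * r)

def cronbach_alpha(items):
    """Coefficient alpha from an (n_persons, k_items) score matrix:
    alpha = (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
])

alpha = cronbach_alpha(responses)      # equals KR-20 here (0/1 items)

# A split-half correlation of 0.6, stepped up to full-length reliability:
full_length = spearman_brown(0.6, 2)   # 2 * 0.6 / (1 + 0.6) = 0.75

print(alpha, full_length)
```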

19 Sources of variation & reliability
Variation in the individual from day to day: test/retest. Variation in items: parallel forms, single administration. Variation in measurement procedures: all types of reliability.
Test/retest reliability captures errors due to variation in an individual from day to day, as the two tests usually take place at two different time points; it also captures measurement error, but it does not capture errors due to the sampling of items. Parallel-form reliability captures measurement errors, as well as errors due to the sampling of items, as the two parallel tests contain different items; however, it does not capture variation due to changes in an individual from day to day.

20 Use of reliability - 1 Standard error of measurement: SEM = SD × √(1 − reliability). Example: reliability = 0.9, standard deviation of test scores = 15; then the standard error of measurement = 15 × √(1 − 0.9) = 15 × 0.316 ≈ 4.7. If a person's score on the test is 65, we are 95% confident that the true score lies approximately between 56 and 74 (65 ± 2 × 4.7).
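The slide's arithmetic can be written as a small sketch:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.9)                             # 15 * sqrt(0.1), about 4.7
score = 65
lower, upper = score - 2 * s, score + 2 * s  # approximate 95% band for T

print(round(s, 2), round(lower, 1), round(upper, 1))
```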

21 Use of reliability - 2 To correct for "attenuation"
Var(T) = reliability × Var(X). Example: if the variance of the observed scores on a maths test is 5.2 and the reliability of the test is 0.8, then the estimated true-score variance is 5.2 × 0.8 = 4.16. When a group of students take a test, we have their test scores. These are "observed" test scores, not "true" test scores. Consequently, the variance of the observed test scores is generally a little larger than the variance of the true test scores, because there is error in each observed test score. The reliability can be used to correct the variance of the observed scores to give an estimate of the variance of the true scores.
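The correction above is a one-line computation:

```python
# Estimated true-score variance from observed-score variance and reliability:
# Var(T) = reliability * Var(X), using the numbers from the slide's example.
observed_var = 5.2
reliability = 0.8
true_var = reliability * observed_var

print(true_var)
```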

22 Factors affecting reliability
Ability range of the group: a wider range gives higher reliability. Level of ability in the group: reliability is higher if item difficulties match abilities. Length of the test: longer tests have higher reliability.

