Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory
Quinn N. Lathrop and Ying Cheng
Assistant Professor Ph.D., University of Illinois at Urbana-Champaign
Undergraduate Institution: University of California, Santa Cruz
Introduction Simulation Results Discussion
Introduction
- Classification consistency is the degree to which examinees would be classified into the same performance categories over parallel replications of the same assessment (Lee, 2010).
- Classification accuracy (CA) refers to the extent to which actual classifications using observed cut scores agree with "true" classifications based on known true cut scores (Lee, 2010).
- Distribution method: estimates the true score distribution and the observed score distribution.
- Individual/person method: uses each individual's classification status.
- True CA increases with test length and decreases when the cut score is located near the mean of the examinee distribution.
Classification accuracy (CA) rate
1. Total sum scores: the Lee approach
- Bergeson (2007): language proficiency tests; CA increased with grade
2. Latent trait estimates: the Rudner approach
- Sireci et al. (2008): math tests
- Examinees > 20: good estimate
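The Lee Approach
- The marginal probability of the total summed score X is given by $\Pr(X = x) = \int \Pr(X = x \mid \theta)\, g(\theta)\, d\theta$, where $\Pr(X = x \mid \theta)$ is the conditional summed-score distribution and $g(\theta)$ is the density of θ.
- Let $x_1, x_2, \ldots, x_{K-1}$ denote a set of observed cut scores that are used to classify examinees into K categories.
- Given the conditional summed-score distribution and the cut scores, the conditional category probability can be computed by summing conditional summed-score probabilities for all x values that belong to category h: $p_\theta(h) = \sum_{x \in \text{category } h} \Pr(X = x \mid \theta)$.
- Expected summed scores can be obtained from the θ cut scores as $\tau = \sum_{x} x \Pr(X = x \mid \theta)$, evaluated at each θ cut score.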
- Suppose a set of true cut scores on the summed-score metric, $\tau_1, \tau_2, \ldots, \tau_{K-1}$, determines the true categorical status of each examinee with θ or τ (i.e., expected summed score).
- If the true categorical status, η (= 1, 2, ..., K), of an examinee is known, the conditional probability of accurate classification is simply $\gamma_\theta = p_\theta(\eta)$, the conditional category probability of category η.
- The true category η can be determined by comparing the expected summed score for θ with the true cut scores.
- The marginal classification accuracy index, γ, is given by $\gamma = \int \gamma_\theta\, g(\theta)\, d\theta$.
The Lee Approach
- Response pattern $V' = (V_1, V_2, V_3, \ldots, V_J)$, where $V_j$ is the response to item j, j = 1, 2, ..., J, and J is the test length. $V_j$ can take values m = 0, 1, ..., M.
- Assumes that classifications are made on the basis of the total score x.
- The goal is to find the probability of each possible total score x by summing the probabilities of all possible response patterns that would lead to that total score given θ, and then to aggregate the probabilities according to the cut scores.
- The probability of scoring in category k is $\sum_{x \in \text{category } k} \Pr(X = x \mid \theta)$.
CA
- Using the sample estimate θ̂, the conditional CA estimate under the Lee approach is the probability that an examinee's total score and his or her estimated expected true score based on θ̂ fall into the same category.
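The Lee-approach steps above can be sketched for dichotomous items as follows. This is a minimal illustration, not the authors' implementation: it assumes a 2PL model, uses the Lord-Wingersky recursion in place of explicitly enumerating response patterns (the two give the same summed-score distribution), and assumes category k holds the scores from one cut score up to, but not including, the next. All function names and the toy item parameters are hypothetical.

```python
import numpy as np

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response to each item at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def summed_score_dist(p):
    """Pr(X = x | theta) for x = 0..J, built one item at a time."""
    dist = np.array([1.0])
    for pj in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - pj)  # incorrect response: score unchanged
        new[1:] += dist * pj           # correct response: score increases by 1
        dist = new
    return dist

def category_probs(dist, cuts):
    """Sum the score probabilities inside each category (assumed convention:
    a score x is in the category whose lower cut satisfies cut <= x)."""
    edges = [0, *cuts, len(dist)]
    return np.array([dist[lo:hi].sum() for lo, hi in zip(edges[:-1], edges[1:])])

def lee_conditional_ca(theta_hat, a, b, cuts):
    """Probability that the total score lands in the same category as the
    expected summed score implied by theta_hat."""
    p = p_correct_2pl(theta_hat, a, b)
    dist = summed_score_dist(p)
    expected_score = p.sum()
    true_cat = int(np.searchsorted(cuts, expected_score, side="right"))
    return category_probs(dist, cuts)[true_cat]

# Toy case: three identical items, one cut score at x = 2.
a = np.ones(3)
b = np.zeros(3)
print(lee_conditional_ca(0.0, a, b, cuts=[2]))  # 0.5 for this symmetric case
```

At θ̂ = 0 every item has probability 0.5, the expected score is 1.5 (below the cut), and half of the score distribution lies below the cut, which gives the 0.5 shown.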
Rudner-Based Indices
- Inputs: C + 1 cut scores, estimated examinee scores, and standard error estimates.
- The expected probability of scoring in each performance-level category C based on these assumptions can be written as
- The index can be written as
- Define an N × C matrix of weights
Rudner approach
- The cut scores are aligned on the latent trait scale.
- Assuming normally distributed measurement error, the probability of scoring in category k with the Rudner approach is calculated as the area of the error distribution that lies between the two cut scores bounding category k.
- Conditional CA is the probability of being placed in the category that the examinee truly belongs to, given θ.
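A minimal sketch of the Rudner calculation, assuming the estimate is normally distributed around θ with a known standard error. The function names, the argument `se`, and the convention that an examinee sitting exactly on a cut score belongs to the upper category are assumptions made for illustration. In practice θ̂ and its estimated standard error would be substituted for the unknown θ.

```python
import bisect
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rudner_category_probs(theta, se, cuts):
    """Probability that the estimate falls in each category, assuming
    theta_hat ~ N(theta, se^2) and cut scores on the latent trait scale."""
    edges = [-math.inf, *cuts, math.inf]
    cdf = [normal_cdf((e - theta) / se) for e in edges]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]

def rudner_conditional_ca(theta, se, cuts):
    """Probability of being placed in the category theta truly belongs to."""
    true_cat = bisect.bisect_right(cuts, theta)  # assumed: on-the-cut goes up
    return rudner_category_probs(theta, se, cuts)[true_cat]

# An examinee one unit above a single cut score, with SE = 0.5.
print(rudner_conditional_ca(1.0, 0.5, cuts=[0.0]))
```

The printed value is the normal-curve area above the cut, so it approaches 1 as θ moves away from the cut or as the standard error shrinks, and it equals 0.5 when θ sits exactly on the cut.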
Marginal Indices
- D-method (distribution-based): the marginal CA is found by integrating the conditional CA over the domain of θ. In practice, estimated quadrature points and weights are used and the integrals are replaced by summations.
- P-method (person-based): simply averages the conditional indices computed for each examinee in the sample (uses the individual θ estimates).
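The difference between the two marginalization schemes can be sketched as below. The conditional index used here is a toy two-category, normal-error CA with a constant standard error, chosen only so the example is self-contained. The equally spaced quadrature grid, the seed, and the simulated θ̂ values are likewise assumptions, not the study's settings.

```python
import numpy as np
from math import erf, sqrt

def conditional_ca(theta, se, cut):
    """Toy two-category conditional CA under normal error (illustrative only)."""
    p_above = 0.5 * (1.0 + erf((theta - cut) / (se * sqrt(2.0))))
    return p_above if theta >= cut else 1.0 - p_above

def marginal_ca_d(points, weights, se, cut):
    """D-method: weighted sum of conditional CA over quadrature points."""
    ca = np.array([conditional_ca(t, se, cut) for t in points])
    return float(np.sum(weights * ca))

def marginal_ca_p(theta_hats, se, cut):
    """P-method: plain average of conditional CA over examinees."""
    return float(np.mean([conditional_ca(t, se, cut) for t in theta_hats]))

# D-method input: 40 points with weights proportional to the N(0, 1) density.
points = np.linspace(-4.0, 4.0, 40)
weights = np.exp(-0.5 * points**2)
weights /= weights.sum()

# P-method input: stand-in theta estimates for 1000 examinees.
rng = np.random.default_rng(0)
theta_hats = rng.normal(0.0, 1.0, size=1000)

d_value = marginal_ca_d(points, weights, se=0.3, cut=0.0)
p_value = marginal_ca_p(theta_hats, se=0.3, cut=0.0)
print(d_value, p_value)
```

The D-method needs a distribution for θ (here built into the weights), while the P-method needs only the sample's θ estimates, which is the trade-off the slide describes.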
Simulation
- Dichotomous models: 1PL, 2PL, 3PL; test length 10 to 80 items in steps of 10
- Difficulty parameters: N(0, 1)
- Discrimination parameters: narrow N(0, 0.3) / wider N(0, 0.5)
- Guessing parameters: U(0, 0.25)
- Polytomous model: GRM with five ordered response categories; test length 10, 20, 40, and 80 items
- Threshold 1: N(1, .5); thresholds 2-4: N(1, .2)
- D-method: 40 quadrature points and weights from N(0, 1)
- P-method: sample sizes N = 250, 500, 1000, with θ ~ N(0, 1)
Two empirical accuracies were calculated: one from classifications based on θ̂, and the other from classifications based on the observed total score X.
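As a rough illustration of the second empirical accuracy (classification by X), the sketch below simulates responses and counts how often the observed classification matches the true one. It assumes a 1PL model, a single cut at θ = 0, and a summed-score cut equal to the expected summed score at that θ. The seed, sample size, and test length are arbitrary choices. The θ̂-based accuracy would follow the same comparison after estimating θ for each examinee, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_items = 1000, 20

# True abilities and item difficulties (assumed standard normal).
theta = rng.normal(0.0, 1.0, n_examinees)
b = rng.normal(0.0, 1.0, n_items)

# 1PL response probabilities and simulated 0/1 responses.
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
responses = rng.random((n_examinees, n_items)) < p
x = responses.sum(axis=1)  # observed total scores

# Map the theta cut to the summed-score metric via the expected score.
theta_cut = 0.0
tau_cut = (1.0 / (1.0 + np.exp(-(theta_cut - b)))).sum()

true_class = theta >= theta_cut
observed_class = x >= tau_cut

# Proportion of examinees whose observed and true classifications agree.
empirical_accuracy = float(np.mean(true_class == observed_class))
print(empirical_accuracy)
```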
RESULTS
1PL: sample size; SE; bias; D-method > Lee; shorter test
2PL: Classifications made on the basis of θ̂ were more accurate than classifications made on the basis of x. The more the discrimination parameters vary between items, the more pronounced the superiority of using θ̂ over x.
DISCUSSION
- Results indicate that if the classification is made with x, Lee's approach estimated the accuracy well.
- Lee's approach, when coupled with the P-method, was slightly positively biased for short tests.
- While the D-method performed as well as or better than the P-method, the D-method required an assumption about the distribution of the latent trait.
- Rudner's approach estimated the true accuracy of using θ̂ well, but the pattern of bias changed with the IRT model.
- Model fit will affect both the Rudner and Lee approaches
- Item parameters and ability distribution are unknown in practice
- Multiple cut scores
- Cognitive diagnostic models
- Multiple dimensions or multiple tests
- Parameter accuracy?
- The wrong model: robustness of the Lee and Rudner approaches
- Signal detection theory
- Conditional false positive/negative error rates
Thank you