Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory. Quinn N. Lathrop and Ying Cheng.


Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory. Quinn N. Lathrop and Ying Cheng, Assistant Professor. Ph.D., University of Illinois at Urbana-Champaign; Undergraduate Institution: University of California, Santa Cruz.

Introduction Simulation Results Discussion

Introduction

Classification consistency is the degree to which examinees would be classified into the same performance categories over parallel replications of the same assessment (Lee, 2010). Classification accuracy refers to the extent to which actual classifications using observed cut scores agree with "true" classifications based on known true cut scores (Lee, 2010).

Distribution method: estimate the true score distribution together with the observed score distribution.
Individual/person method: each individual's true classification status.
– CA increases with test length
– CA decreases when the cut score is located near the mean of the examinee distribution

Two ways to compute the classification accuracy (CA) rate:
1. Total summed scores — the Lee approach. Bergeson (2007): language proficiency tests; CA increased with grade.
2. Latent trait estimates — the Rudner approach. Sireci et al. (2008): math tests; examinees > 20: good estimate.

The Lee Approach

The marginal probability of the total summed score X is given by

\Pr(X = x) = \int \Pr(X = x \mid \theta)\, g(\theta)\, d\theta,

where \Pr(X = x \mid \theta) is the conditional summed-score distribution and g(\theta) is the density of \theta.

Let x_1, x_2, \ldots, x_{K-1} denote a set of observed cut scores that are used to classify examinees into K categories. Given the conditional summed-score distribution and the cut scores, the conditional category probability can be computed by summing the conditional summed-score probabilities over all x values that belong to category h:

\Pr(h \mid \theta) = \sum_{x \in \text{category } h} \Pr(X = x \mid \theta).

Expected summed scores can be obtained from the \theta cut scores as \tau_k = \sum_{j=1}^{J} E(V_j \mid \theta_k).

Suppose a set of true cut scores on the summed-score metric, \tau_1, \tau_2, \ldots, \tau_{K-1}, determines the true categorical status of each examinee with \theta or \tau (i.e., the expected summed score). If the true categorical status \eta (= 1, 2, \ldots, K) of an examinee is known, the conditional probability of accurate classification is simply

\Pr(\text{accurate} \mid \theta) = \Pr(\eta \mid \theta).

The true category \eta can be determined by comparing the examinee's expected summed score with the true cut scores. The marginal classification accuracy index \gamma is given by

\gamma = \int \Pr(\text{accurate} \mid \theta)\, g(\theta)\, d\theta.

The Lee Approach

A response pattern is V' = (V_1, V_2, \ldots, V_J), where V_j is the response to item j, j = 1, 2, \ldots, J, and J is the test length. V_j can take values m = 0, 1, \ldots, M.

The approach assumes that classifications are made on the basis of the total score x. Its goal is to find the probability of each possible total score, \Pr(X = x \mid \theta), by summing the probabilities of all possible response patterns that lead to that total score given \theta, and then to aggregate the probabilities according to the cut scores: the probability of scoring in category k is

\Pr(k \mid \theta) = \sum_{x = x_{k-1}}^{x_k - 1} \Pr(X = x \mid \theta).
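For dichotomous items, the conditional summed-score distribution described above can be computed without enumerating response patterns, via the Lord-Wingersky recursion. A minimal Python sketch (function names and example parameters are my own, not the presentation's code):

```python
# Sketch of the Lee approach for dichotomous items: compute Pr(X = x | theta)
# with the Lord-Wingersky recursion, then sum those probabilities within the
# score bands defined by the observed cut scores.
import math

def p_correct(theta, a, b, c=0.0):
    """3PL item response function; a=discrimination, b=difficulty, c=guessing."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def summed_score_dist(theta, items):
    """Lord-Wingersky recursion: distribution of the total score X given theta."""
    dist = [1.0]  # Pr(X = 0) over zero items
    for (a, b, c) in items:
        p = p_correct(theta, a, b, c)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)   # item answered incorrectly
            new[x + 1] += pr * p       # item answered correctly
        dist = new
    return dist

def category_probs(theta, items, cuts):
    """Pr(classified into category h | theta), summing over score bands."""
    dist = summed_score_dist(theta, items)
    bounds = [0] + list(cuts) + [len(dist)]
    return [sum(dist[bounds[h]:bounds[h + 1]]) for h in range(len(bounds) - 1)]

# Toy 3-item test with a pass/fail cut at X >= 2 (illustrative values only):
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]
probs = category_probs(0.0, items, cuts=[2])
```

The recursion runs in O(J^2) time rather than the 2^J cost of enumerating all response patterns, which is why it is the standard way to obtain the conditional summed-score distribution.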

CA: using the sample estimate \hat\theta, the conditional CA estimate under the Lee approach is the probability that an examinee's total score and his or her estimated expected true score based on \hat\theta fall into the same category.

Rudner-Based Indices

Inputs: C + 1 cut scores, estimated examinee scores, and standard error estimates. Assuming normally distributed measurement error, the expected probability of scoring in performance-level category c is

\Pr(c \mid \hat\theta) = \Phi\!\left(\frac{c_{\text{upper}} - \hat\theta}{SE(\hat\theta)}\right) - \Phi\!\left(\frac{c_{\text{lower}} - \hat\theta}{SE(\hat\theta)}\right),

where c_{lower} and c_{upper} are the cut scores bounding category c and \Phi is the standard normal CDF. The index is obtained by aggregating these probabilities; the slide defines an N × 3C matrix of weights for this purpose.

Rudner approach

The cut scores are aligned on the latent trait scale. Based on normally distributed measurement error, the probability of scoring in category k with the Rudner approach is calculated as

\Pr(k \mid \theta) = \Phi\!\left(\frac{\theta_k - \theta}{SE(\theta)}\right) - \Phi\!\left(\frac{\theta_{k-1} - \theta}{SE(\theta)}\right),

where \theta_{k-1} and \theta_k are the cut scores bounding category k. Conditional CA is the probability of being placed in the category that the examinee truly belongs to, given \theta.
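A minimal sketch of this Rudner computation using only the Python standard library (helper names and the example cut scores are my own assumptions for illustration):

```python
# Rudner approach: with normally distributed measurement error, the probability
# of landing in a latent-trait score band is a difference of normal CDFs.
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rudner_category_probs(theta_hat, se, cuts):
    """Pr(classified into each category | theta_hat, se) on the latent scale."""
    bounds = [-math.inf] + list(cuts) + [math.inf]
    return [norm_cdf((bounds[k + 1] - theta_hat) / se) -
            norm_cdf((bounds[k] - theta_hat) / se)
            for k in range(len(bounds) - 1)]

def true_category(theta, cuts):
    """Index of the category an examinee with this theta truly belongs to."""
    return sum(theta >= c for c in cuts)

def conditional_ca(theta, se, cuts):
    """Probability of being placed in the examinee's true category."""
    return rudner_category_probs(theta, se, cuts)[true_category(theta, cuts)]

# Three categories with cuts at theta = 0.0 and 1.5 (illustrative values only):
ca = conditional_ca(theta=0.8, se=0.3, cuts=[0.0, 1.5])
```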

Marginal Indices

With the D-method, or distribution-based method, the marginal CA is found by integrating the conditional CA over the \theta domain,

\gamma_D = \int \Pr(\text{accurate} \mid \theta)\, g(\theta)\, d\theta,

using estimated quadrature points and weights to replace the integral by a summation. The P-method is person based and simply averages the conditional indices computed for each examinee in the sample (using the individual \theta estimates):

\gamma_P = \frac{1}{N} \sum_{i=1}^{N} \Pr(\text{accurate} \mid \hat\theta_i).
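The two marginal indices can be contrasted in a short sketch (function names, the grid-based quadrature, and the toy conditional-CA function are assumptions for illustration, not the study's code):

```python
# D-method: integrate conditional CA over a quadrature grid on theta.
# P-method: average conditional CA over the sampled theta estimates.
import math

def norm_pdf(x):
    """Standard normal density, used to weight the quadrature grid."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def d_method(cond_ca, points=40, lo=-4.0, hi=4.0):
    """Marginal CA over theta ~ N(0, 1) with normalized grid weights."""
    thetas = [lo + (hi - lo) * q / (points - 1) for q in range(points)]
    weights = [norm_pdf(t) for t in thetas]
    total = sum(weights)
    return sum(ca * w for ca, w in
               ((cond_ca(t), w) for t, w in zip(thetas, weights))) / total

def p_method(cond_ca, theta_hats):
    """Marginal CA as the average conditional CA over sampled examinees."""
    return sum(cond_ca(t) for t in theta_hats) / len(theta_hats)

# Any conditional-CA function can be plugged in; a toy one for illustration
# (CA rises as theta moves away from a cut at 0):
cond_ca = lambda t: 1.0 / (1.0 + math.exp(-abs(t)))
gamma_d = d_method(cond_ca)
gamma_p = p_method(cond_ca, theta_hats=[-1.2, -0.3, 0.4, 1.1])
```

The design trade-off mirrors the slide: the D-method needs a distributional assumption on \theta (here N(0, 1)), while the P-method needs only the sample of estimates.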

Simulation

Dichotomous items:
– Test lengths: 10 to 80 in steps of 10
– Models: 1PL, 2PL, 3PL
– Difficulty parameters: N(0, 1)
– Discrimination parameters: narrow N(0, 0.3) / wider N(0, 0.5)
– Guessing parameters: U(0, 0.25)

Polytomous items (GRM with five ordered response categories):
– Test lengths: 10, 20, 40, and 80
– Threshold 1: N(1, 0.5); thresholds 2–4: N(1, 0.2)

D-method: 40 quadrature points and weights from \theta \sim N(0, 1)
P-method: sample sizes N = 250, 500, 1000, with \theta \sim N(0, 1)
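One cell of the dichotomous 2PL condition could be generated roughly as follows. This is a hedged sketch, not the study's code: since raw discriminations must be positive, I exponentiate the narrow normal draw, which is my own choice rather than the slide's exact recipe.

```python
# Simulate one 2PL condition: J = 20 items, N = 500 examinees.
import math
import random

random.seed(1)
J, N = 20, 500
b = [random.gauss(0.0, 1.0) for _ in range(J)]            # difficulty ~ N(0, 1)
a = [math.exp(random.gauss(0.0, 0.3)) for _ in range(J)]  # exp of narrow N(0, 0.3)
theta = [random.gauss(0.0, 1.0) for _ in range(N)]        # ability ~ N(0, 1)

def p_2pl(t, aj, bj):
    """2PL item response function (no guessing)."""
    return 1.0 / (1.0 + math.exp(-aj * (t - bj)))

# Bernoulli draw per person-item cell, then total summed scores per person:
responses = [[int(random.random() < p_2pl(t, a[j], b[j])) for j in range(J)]
             for t in theta]
total_scores = [sum(row) for row in responses]
```

From here, the empirical accuracies on the next slide follow by classifying each examinee twice (once from \hat\theta, once from the total score) and comparing against the true category of the generating \theta.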

Two empirical accuracies were calculated: one making the classification based on \hat\theta, and the other based on the observed total score X.

RESULTS

1PL: standard errors and bias decreased as sample size increased; the D-method outperformed the Lee P-method for shorter tests.

2PL: classifications made on the basis of \hat\theta were more accurate than classifications made on the basis of x. When the discrimination parameters vary more between items, the superiority of using \hat\theta over x is more pronounced.

DISCUSSION

Results indicate that when the classification is made with x, Lee's approach estimates the accuracy well. Lee's approach, when coupled with the P-method, was slightly positively biased for short tests. While the D-method performed as well as or better than the P-method, the D-method requires an assumption about the distribution of the latent trait. Rudner's approach estimated the true accuracy of using \hat\theta well, but the pattern of bias changed with the IRT model.

– Model fit will affect both the Rudner and the Lee approaches
– Item parameters and the ability distribution are unknown in practice
– Multiple cut scores
– Cognitive diagnostic models
– Multiple dimensions or multiple tests
– Accuracy of parameter estimates?

– Robustness of the Lee and Rudner approaches under the wrong model
– Signal detection theory: conditional false positive/negative error rates

Thank you.