
Introduction to IRT for non-psychometricians




1 Introduction to IRT for non-psychometricians
Paul K. Crane, MD MPH Internal Medicine University of Washington

2 Outline Definitions of IRT IRT vs. CTT Hays et al. paper
Error and information Rational test construction DIF

3 What is IRT IRT is a scoring algorithm
Every score (including standard "sum" scores) is a formula score: Score = Σ w_i x_i
Standard sum scores assume w_i = 1
IRT empirically determines from the data the weights applied to each item
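The formula-score idea can be sketched in a few lines. This is a toy illustration; the responses and weights below are made up, not from Hays et al.:

```python
# Formula score: Score = sum of w_i * x_i over items.
# With every weight set to 1, this is the ordinary sum score.

def formula_score(responses, weights=None):
    """Weighted sum score; defaults to the standard sum score (all w_i = 1)."""
    if weights is None:
        weights = [1.0] * len(responses)  # standard sum score: w_i = 1
    return sum(w * x for w, x in zip(weights, responses))

responses = [3, 2, 1, 3]  # hypothetical circled category codes for four items
print(formula_score(responses))                         # unweighted sum score
print(formula_score(responses, [1.5, 1.0, 0.5, 2.0]))   # IRT-style weighted score
```

The point is only that the sum score is a special case of the weighted score; IRT's contribution is estimating the weights from the data rather than assuming them.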

4 What is IRT IRT has emerged as the dominant test development tool in educational psychology IRT has an increasingly dominant role in test development in psychology in general IRT is beginning to make inroads in medical testing; recent NIH and FDA interest; publication record increasing

5 IRT vs. Classical Test Theory (CTT)
CTT: Reliability = alpha, constant across the test; IRT: Reliability = information, varies across the test
CTT: Scores only interpretable within a population; IRT: Scores interpretable directly
CTT: Missing data a huge problem; IRT: Missing data much less important
CTT: Score co-calibration difficult; IRT: Score co-calibration easy
CTT: Representativeness of the norming sample is crucial; IRT: As long as a broad array of trait levels / scores is present in the norming sample, representativeness is irrelevant
CTT: Standard errors relate to population characteristics, assumed equal for all; IRT: Can obtain individual-level estimates of measurement precision / measurement error
CTT: Ordinal scaling; IRT: More interval-level scaling properties

6 Hays et al. paper
Response options for each item: Limited a lot = 1, Limited a little = 2, Not limited at all = 3
Items: Vigorous activities (running, lifting heavy objects, strenuous sports); Climbing one flight of stairs; Walking more than 1 mile; Walking one block; Bathing / dressing self; Preparing meals / doing laundry; Shopping; Getting around inside the home; Feeding self

7 Intuitive approach (sum score)
Simple approach: total up the circled numbers, and there we have a score
But: should "limited a lot" for walking a mile receive the same weight as "limited a lot" for getting around inside the home?
Should "limited a lot" for walking one block be twice as bad as "limited a little" for walking one block?
These questions concern relationships within a single item and between items

8 IRT’s role IRT provides us with a data-driven means of rational scoring for such measures In practice, the simple sum score is often very good; improvement is at the margins IRT has other uses in test development and test assessment (second half of the talk) If there is not a big problem, IRT analyses should validate expert opinion; give mathematical basis to gut feelings

9 Means and ceiling for items
Item / Mean (SD) / Not limited (%):
Vigorous activities, (0.86), 45
Walking > 1 mile, (0.84), 49
Climbing 1 flight of stairs, (0.76), 55
Shopping, (0.68), 72
Walking 1 block, (0.64), 72
Preparing meals, doing laundry, 2.67 (0.63), 75
Bathing or dressing, (0.49), 84
Getting around inside home, 2.81 (0.47), 84
Feeding self, (0.36), 91

10 Comments on Table 1 Range of those with no limitations (45-91%)
Already from a measurement perspective this is a problem; ~45% will have a perfect score; ceiling / floor effect Significant skew often found in medical settings; implications often not discussed Especially difficult in longitudinal studies. What can we say about someone who had a perfect score before and a less than perfect score now? (NOT solved by IRT)

11 Dichotomous IRT models
Arbitrary choice to dichotomize into any limitation = 0 and no limitation = 1 Rarely a good idea to collect more detailed data and throw it away for analysis; lose power (Samejima 1967; van Belle 2002) Begin with 1PL model

12 1PL (~Rasch) model

13 Implications of 1PL model
All items have same weight (ICCs are parallel) Only difference is “difficulty” (amount of trait required to endorse) Nice math (sum score is sufficient) Nested within 2PL model so can check to see whether it is acceptable Math: p(y=1|θ,b)=1/[1+exp(-D(θ-b))] This puts item difficulty and person trait level on the same scale
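The 1PL probability on this slide can be computed directly. A minimal sketch, using the conventional scaling constant D = 1.7 (the slide writes D without giving a value):

```python
import math

D = 1.7  # conventional scaling constant (logistic approximates the normal ogive)

def p_1pl(theta, b):
    """1PL probability of endorsing an item:
    P(y=1 | theta, b) = 1 / (1 + exp(-D(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-D * (theta - b)))

# When trait level equals item difficulty, endorsement probability is exactly 0.5,
# which is the sense in which person and item are on the same scale:
print(p_1pl(theta=1.0, b=1.0))
# A person well above the item's difficulty endorses it with high probability:
print(p_1pl(theta=2.0, b=0.0))
```

Because the only item parameter is b, shifting b slides the same S-shaped curve left or right, which is why the ICCs are parallel.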

14 1PL results from Hays et al.
Item (Limited at all %* / Difficulty):
Vigorous activities; Walking > 1 mile; Climbing 1 flight of stairs; Shopping; Walking 1 block; Preparing meals, doing laundry; Bathing or dressing; Getting around inside home; Feeding self
* "Limited at all" is just 100 minus "not limited." Slopes fixed at 3.49.

15 Comments on 1PL analysis
Note similarities to Table 1
Same order of items
Relationships between items are similar
Recall: the same data are being used to derive these estimates
Poor measurement properties highlighted: only one "hard" item, with a difficulty of 0.5
How good is this model for these items?

16 2PL model

17 Implications of 2PL model
Items do not need to have the same weights (ICCs are not parallel) Difficulty differs as before Now slope differs as well “Discrimination” parameter Harder math (sum score no longer sufficient) P(y=1|θ,a,b) = 1/[1+exp(-Da(θ-b))] If a parameters are identical, reduces to 1PL model
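Extending the sketch above to the 2PL formula on this slide, with hypothetical a and b values chosen to show how the slope changes the curve:

```python
import math

D = 1.7  # conventional scaling constant

def p_2pl(theta, a, b):
    """2PL probability: P(y=1 | theta, a, b) = 1 / (1 + exp(-D a (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# At theta = b the probability is 0.5 regardless of the slope;
# away from b, a larger discrimination a makes the ICC steeper:
theta, b = 1.0, 0.0
for a in (0.5, 1.0, 2.0):
    print(a, p_2pl(theta, a, b))
```

Setting a = 1 for every item recovers a 1PL-style model, which is what makes the nested-model check on the next slides possible.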

18 Table 4: 2PL results Item Difficulty (b) Discrim (a)
Vigorous activities Walking > 1 mile Climbing 1 flight stairs Shopping Walking 1 block Making meals /laundry Bathing or dressing Mobility inside home Feeding self * 3.2 * Note that b parameters are identical to 1PL model!

19 Assumptions of 1PL model
Because the 1PL model is nested within the 2PL model, we can test whether it is okay to treat the a’s as if they were constant

20 Graph of Hays 2PL results

21 Graph of Hays 2PL results – I(θ)

22 Comments on 2PL results Not that different from 1PL results
Some variability in slopes, however Easier to look at assumption of identical slopes on the information plane Ability / trait level estimates tend to be very similar in simulation studies and in real data studies with 1PL and 2PL models Difficulty much more important than slope in determining score

23 What about that dichotomization?
Don’t want to throw away the data we collected; “yes, a little” vs. “yes, a lot” are currently lumped together
Polytomous IRT models let us use the full response scale
GRM vs. PCM vs. RSM
GRM is most flexible; a 2PL extension (Samejima 1967)
PCM and RSM are both Rasch extensions

24 Table 5: GRM results

25 Comments on GRM results
Empiric validation of expert opinion on how items relate to underlying construct Scores have meaning in terms of items (“Trait estimates anchored to item content,” p. II-35) Didn’t do PCM or RSM Can see impact of dichotomizing items

26 Software for IRT Currently old and clunky
PARSCALE will be illustrated tomorrow
Another option is MULTILOG
NIH has issued an SBIR for new, more user-friendly software

27 Outline revisited Definitions of IRT IRT vs. CTT Hays et al. paper
Error and information Rational test construction DIF

28 Error and information One of the real strengths of IRT is that measurement error is not assumed to be constant across the whole test Instead, measurement error is modeled directly in the form of item (test) information SEM = 1/SQRT(I(θ))

29 Information formulas For the 2PL model: I(θ) = D²a²P(θ)Q(θ)
D² is a constant a is the slope parameter P(θ) is the probability of getting the item correct P(y=1|θ,a,b) = 1/[1+exp(-Da(θ-b))] Q(θ) = 1-P(θ) P(θ)Q(θ) generates a hill centered at θ=b P(θ)→0 as θ gets small; Q(θ)→0 as θ gets large Test information is the sum of the item information curves Polytomous information is hard, but solved

30 Information implications
Item information is each item’s individual contribution to measurement precision at each possible score Test information gives a picture of the test’s measurement precision across all scores Can use test information to compare tests Lord (1980) advocated a ratio of test information functions (relative efficiency) for such comparisons

31 Relative information, CSI ‘D’ and CASI

32 Individual level measurement error
We estimate θ for each individual, and we can compute I(θ) as well We know not only the score, but also the precision with which we know the score Need to train providers to request the measurement precision Can model error – one of the purposes of this workshop is to integrate error terms into statistical models

33 Rational test construction
So far we have described existing tests using new IRT tools Can also build new tests from item banks Construct a particular information profile and choose items to fill it out Large literature in educational psychology van der Linden and colleagues from Twente
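A toy greedy heuristic conveys the idea of choosing bank items to fill out a target information profile. This is only an illustration, not the optimal test-design machinery of van der Linden and colleagues; the item bank and target trait levels are hypothetical:

```python
import math

D = 1.7  # conventional scaling constant

def item_info(theta, a, b):
    """2PL item information, I(theta) = D^2 a^2 P Q."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D * D * a * a * p * (1.0 - p)

def greedy_select(bank, target_thetas, n_items):
    """Repeatedly pick the bank item contributing the most information
    at the trait levels we care about (a toy greedy heuristic)."""
    chosen = []
    remaining = list(bank)
    for _ in range(n_items):
        best = max(remaining,
                   key=lambda ab: sum(item_info(t, *ab) for t in target_thetas))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical item bank of (a, b) pairs; we want precision around theta = 1:
bank = [(1.0, -2.0), (1.2, 0.0), (1.8, 1.0), (0.8, 2.5)]
print(greedy_select(bank, target_thetas=[1.0], n_items=2))
```

The discriminating item with difficulty near the target trait level is chosen first, because information peaks at θ = b and scales with a².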

34 Differential item functioning (DIF)
Population is increasingly heterogeneous Concern about cultural, gender, education, etc. fairness of tests Can use standard statistical tools to assess differential test impact, that is, different performance according to culture etc. Have to control for underlying trait level to really get at test bias DIF is how this is accomplished in educational psychology

35 Different approaches to DIF
IRT approaches – model items separately in different groups and compare SIBTEST approach – based on dimensionality assessment Logistic regression approach – treat as an epidemiological problem Mantel-Haenszel approach – simple 2x2 table approach MIMIC approach using MPLUS
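Of the approaches listed, the Mantel-Haenszel one is simple enough to sketch directly. The counts below are invented for illustration; in practice each stratum is a score level of the matching variable (e.g., the total test score):

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across score strata.
    Each stratum is a 2x2 table (a, b, c, d):
      a = reference group, item endorsed;  b = reference group, not endorsed
      c = focal group, item endorsed;      d = focal group, not endorsed
    An odds ratio near 1 suggests no uniform DIF once trait level is matched."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts for one item, stratified by total score:
strata = [(20, 30, 15, 35), (35, 15, 30, 20), (45, 5, 44, 6)]
print(mantel_haenszel_or(strata))
```

Stratifying on the total score is what "controlling for underlying trait level" amounts to here; without that matching, the comparison measures differential impact rather than DIF.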

36 Comments on DIF Each technique is simple to explain but each is tricky to apply Not all that clear based on the literature what to do with existing epidemiological data from studies with DIF It is probably impossible to measure cognition without bias, especially education bias Years of education doesn’t get to the heart of the problem either (Manly paper) We re-visit DIF in detail on Thursday

37 The end. Definitions of IRT IRT vs. CTT Hays et al. paper
Error and information Rational test construction DIF Comments and questions?

