Introduction to IRT for non-psychometricians


Introduction to IRT for non-psychometricians
Paul K. Crane, MD MPH
Internal Medicine, University of Washington

Outline
Definitions of IRT
IRT vs. CTT
Hays et al. paper
Error and information
Rational test construction
DIF

What is IRT?
IRT is a scoring algorithm.
Every score (including the standard "sum" score) is a formula score: Score = Σ w_i x_i.
Standard sum scores assume w_i = 1 for every item.
IRT empirically determines from the data the weights that are applied to each item (see the sketch below).
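
To make the weighting idea concrete, here is a minimal sketch in Python; the item responses and weights below are invented for illustration and are not estimates from any fitted IRT model.

```python
import numpy as np

# Hypothetical responses to five dichotomous items (1 = limitation endorsed)
responses = np.array([1, 0, 1, 1, 0])

# Standard sum score: every item gets weight 1
sum_score = responses.sum()                      # 3

# IRT-style formula score: empirically derived weights (values invented here)
weights = np.array([0.8, 1.4, 1.0, 0.6, 1.2])
formula_score = np.dot(weights, responses)       # 2.4

print(sum_score, formula_score)
```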

What is IRT?
IRT has emerged as the dominant test development tool in educational psychology.
It plays an increasingly dominant role in test development in psychology in general.
It is beginning to make inroads in medical testing; there is recent NIH and FDA interest, and the publication record is increasing.

IRT vs. Classical Test Theory (CTT)
CTT: Reliability is alpha, constant across the test. IRT: Reliability is information, which varies across the test.
CTT: Scores are interpretable only within a population. IRT: Scores are interpretable directly.
CTT: Missing data are a huge problem. IRT: Missing data are much less important.
CTT: Score co-calibration is difficult. IRT: Score co-calibration is easy.
CTT: Representativeness of the norming sample is crucial. IRT: As long as a broad array of trait levels / scores is present in the norming sample, representativeness is irrelevant.
CTT: Standard errors relate to population characteristics and are assumed equal for everyone. IRT: Individual-level estimates of measurement precision / measurement error can be obtained.
CTT: Ordinal scaling. IRT: More interval-level scaling properties.

Hays et al. paper: items and response options
Response options: Limited a lot (1), Limited a little (2), Not limited at all (3)
Vigorous activities (running, lifting heavy objects, strenuous sports)
Climbing one flight of stairs
Walking more than 1 mile
Walking one block
Bathing / dressing self
Preparing meals / doing laundry
Shopping
Getting around inside the home
Feeding self

Intuitive approach (sum score)
Simple approach: there are numbers that will be circled; total these up, and there we have a score.
But: should "limited a lot" for walking a mile receive the same weight as "limited a lot" for getting around inside the home?
Should "limited a lot" for walking one block be twice as bad as "limited a little" for walking one block?
These are questions about relationships within a single item and between items.

IRT's role
IRT provides us with a data-driven means of rational scoring for such measures.
In practice, the simple sum score is often very good; improvement is at the margins.
IRT has other uses in test development and test assessment (second half of the talk).
If there is not a big problem, IRT analyses should validate expert opinion and give a mathematical basis to gut feelings.

Means and ceilings for items
Item | Mean (SD) | Not limited (%)
Vigorous activities | 1.97 (0.86) | 45
Walking > 1 mile | 2.22 (0.84) | 49
Climbing 1 flight of stairs | 2.37 (0.76) | 55
Shopping | 2.61 (0.68) | 72
Walking 1 block | 2.63 (0.64) | 72
Preparing meals, doing laundry | 2.67 (0.63) | 75
Bathing or dressing | 2.80 (0.49) | 84
Getting around inside home | 2.81 (0.47) | 84
Feeding self | 2.90 (0.36) | 91

Comments on Table 1
There is a wide range in the percentage with no limitations (45-91%).
Already from a measurement perspective this is a problem: roughly 45% of respondents will have a perfect score (a ceiling / floor effect).
Significant skew is often found in medical settings; the implications are often not discussed.
This is especially difficult in longitudinal studies: what can we say about someone who had a perfect score before and a less-than-perfect score now? (This is NOT solved by IRT.)

Dichotomous IRT models
Arbitrary choice to dichotomize into any limitation = 0 and no limitation = 1.
It is rarely a good idea to collect more detailed data and then throw it away for analysis; you lose power (Samejima 1967; van Belle 2002).
Begin with the 1PL model.

1PL (~Rasch) model

Implications of the 1PL model
All items have the same weight (the ICCs are parallel).
The only difference between items is "difficulty" (the amount of trait required to endorse).
Nice math (the sum score is sufficient).
Nested within the 2PL model, so we can check whether it is acceptable.
Math: P(y = 1 | θ, b) = 1 / [1 + exp(-D(θ - b))].
This puts item difficulty and person trait level on the same scale (see the sketch below).
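
A minimal sketch of this formula, assuming the conventional scaling constant D = 1.7 and borrowing the difficulty of the "Vigorous activities" item from the 1PL results below; it also shows that the probability of endorsement is exactly 0.5 when θ = b, which is what putting persons and items on the same scale means.

```python
import numpy as np

D = 1.7  # conventional scaling constant

def icc_1pl(theta, b):
    """1PL item characteristic curve: P(y = 1 | theta, b)."""
    return 1.0 / (1.0 + np.exp(-D * (theta - b)))

b = 0.5  # difficulty of "Vigorous activities" from the 1PL results
for theta in (-1.0, 0.5, 2.0):
    print(theta, round(icc_1pl(theta, b), 3))
# At theta = b the probability is exactly 0.5
```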

1PL results from Hays et al.
Item | Limited at all (%)* | Difficulty (b)
Vigorous activities | 55 | 0.5
Walking > 1 mile | 51 | 0.1
Climbing 1 flight of stairs | 45 | -0.1
Shopping | 28 | -0.6
Walking 1 block | 28 | -0.7
Preparing meals, doing laundry | 25 | -0.8
Bathing or dressing | 16 | -1.2
Getting around inside home | 16 | -1.2
Feeding self | 9 | -1.6
* This is just 100 minus the "not limited" percentage. Slopes were fixed at 3.49.

Comments on the 1PL analysis
Note the similarities to Table 1: the items are in the same order, and the relationships between items are similar.
Recall that the same data are being used to derive these estimates.
The poor measurement properties are highlighted: there is only one "hard" item, with a difficulty of just 0.5.
How good is this model for these items?

2PL model

Implications of the 2PL model
Items do not need to have the same weights (the ICCs are not parallel).
Difficulty differs as before, but now the slope differs as well: the "discrimination" parameter.
Harder math (the sum score is no longer sufficient).
P(y = 1 | θ, a, b) = 1 / [1 + exp(-D·a(θ - b))].
If the a parameters are identical, this reduces to the 1PL model (see the sketch below).
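
A sketch extending the 1PL function above with an item-specific slope; the parameter values are the rounded Table 4 estimates shown below. With different a values the curves are no longer parallel, and fixing every a at the common 1PL value of 3.49 recovers the 1PL special case.

```python
import numpy as np

D = 1.7

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(y = 1 | theta, a, b)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Two items from Table 4: different slopes, so the ICCs are not parallel
vigorous  = icc_2pl(theta, a=2.5, b=0.5)
walk_mile = icc_2pl(theta, a=4.1, b=0.1)

# Setting every a to the same constant reproduces the 1PL model
one_pl_like = icc_2pl(theta, a=3.49, b=0.5)
```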

Table 4: 2PL results
Item | Difficulty (b) | Discrimination (a)
Vigorous activities | 0.5 | 2.5
Walking > 1 mile | 0.1 | 4.1
Climbing 1 flight of stairs | -0.1 | 3.5
Shopping | -0.6 | 3.7
Walking 1 block | -0.7 | 3.7
Making meals / laundry | -0.8 | 3.8
Bathing or dressing | -1.2 | 3.5
Mobility inside home | -1.2 | 3.6
Feeding self | -1.6* | 3.2
* Note that the b parameters are identical to the 1PL model!

Assumptions of the 1PL model
Because the 1PL model is nested within the 2PL model, we can test whether it is okay to treat the a parameters as if they were constant.
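
One standard way to use this nesting is a likelihood-ratio test: twice the difference in log-likelihoods is compared to a chi-square distribution whose degrees of freedom equal the number of extra slope parameters. This sketch uses placeholder log-likelihood values, not results from the Hays et al. data.

```python
from scipy.stats import chi2

# Placeholder log-likelihoods from fitting each model to the same data
loglik_1pl = -2405.7   # constrained model: one common slope
loglik_2pl = -2389.4   # free slopes, one per item

n_items = 9
df = n_items - 1                         # extra slope parameters in the 2PL
lr_stat = 2 * (loglik_2pl - loglik_1pl)  # 32.6
p_value = chi2.sf(lr_stat, df)

# A small p-value would cast doubt on treating the slopes as constant
print(lr_stat, p_value)
```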

Graph of Hays 2PL results

Graph of Hays 2PL results – I(θ)

Comments on the 2PL results
Not that different from the 1PL results, though there is some variability in the slopes.
It is easier to look at the assumption of identical slopes on the information plane.
Ability / trait level estimates tend to be very similar with 1PL and 2PL models, both in simulation studies and in real data studies.
Difficulty is much more important than slope in determining the score.

What about that dichotomization?
We don't want to throw away all the data we collected; "yes, a little" vs. "yes, a lot" are currently lumped together.
Polytomous IRT models allow us to keep that distinction: GRM vs. PCM vs. RSM.
The graded response model (GRM) is the most flexible; it is a 2PL extension (Samejima, 1967).
The partial credit model (PCM) and rating scale model (RSM) are both Rasch extensions.

Table 5: GRM results

Comments on the GRM results
Empirical validation of expert opinion on how the items relate to the underlying construct.
Scores have meaning in terms of items ("trait estimates anchored to item content," p. II-35).
We didn't fit the PCM or RSM.
We can see the impact of dichotomizing the items.

Software for IRT
Currently old and clunky.
PARSCALE will be illustrated tomorrow; another option is MULTILOG.
NIH has issued an SBIR for new and improved software that is more user friendly.

Outline revisited
Definitions of IRT
IRT vs. CTT
Hays et al. paper
Error and information
Rational test construction
DIF

Error and information
One of the real strengths of IRT is that measurement error is not assumed to be constant across the whole test.
Instead, measurement error is modeled directly in the form of item (and test) information: SEM(θ) = 1 / sqrt(I(θ)).

Information formulas
For the 2PL model: I(θ) = D²·a²·P(θ)·Q(θ), where
D² is a constant,
a is the slope parameter,
P(θ) is the probability of getting the item correct, P(y = 1 | θ, a, b) = 1 / [1 + exp(-D·a(θ - b))], and
Q(θ) = 1 - P(θ).
P(θ)·Q(θ) generates a hill centered at θ = b: P(θ) → 0 as θ gets small and Q(θ) → 0 as θ gets large.
Test information is the sum of the item information curves (see the sketch below).
Polytomous information is harder, but solved.
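
A sketch of these formulas under the same assumptions as the earlier ICC code (D = 1.7, dichotomous 2PL items), using the rounded a and b estimates from Table 4; test information is the sum of the item information curves, and the standard error of measurement follows as 1 / sqrt(I(θ)).

```python
import numpy as np

D = 1.7

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def item_information(theta, a, b):
    """I(theta) = D^2 * a^2 * P(theta) * Q(theta) for one dichotomous 2PL item."""
    p = icc_2pl(theta, a, b)
    return D**2 * a**2 * p * (1.0 - p)

# (a, b) pairs from Table 4
items = [(2.5, 0.5), (4.1, 0.1), (3.5, -0.1), (3.7, -0.6), (3.7, -0.7),
         (3.8, -0.8), (3.5, -1.2), (3.6, -1.2), (3.2, -1.6)]

theta = np.linspace(-3, 3, 121)
test_info = sum(item_information(theta, a, b) for a, b in items)
sem = 1.0 / np.sqrt(test_info)            # SEM(theta) = 1 / sqrt(I(theta))

print(theta[np.argmax(test_info)])        # trait level where the test is most precise
```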

Information implications
Item information is each item's individual contribution to measurement precision at each possible score.
Test information gives a picture of the test's measurement precision across all scores.
We can use test information to compare tests; Lord (1980) advocated a ratio for such comparisons.

Relative information, CSI ‘D’ and CASI

Individual level measurement error
We estimate θ for each individual, and we can compute I(θ) as well.
So we know not only the score, but also the precision with which we know the score.
We need to train providers to request the measurement precision.
We can model the error; one of the purposes of this workshop is to integrate error terms into statistical models.

Rational test construction
So far we have described existing tests using new IRT tools.
We can also build new tests from item banks: construct a particular target information profile and choose items to fill it out (a simple version is sketched below).
There is a large literature on this in educational psychology, notably van der Linden and colleagues from Twente.
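
As a toy illustration of building to a target information profile (not the formal test-assembly methods in that literature), here is a greedy sketch: from an invented item bank, repeatedly add the item that best fills the remaining gap between the current test information curve and a flat target.

```python
import numpy as np

D = 1.7

def item_information(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D**2 * a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 61)
target = np.full_like(theta, 8.0)              # flat target information profile

rng = np.random.default_rng(0)                 # invented bank of 50 items
bank = [(rng.uniform(0.8, 2.5), rng.uniform(-2.5, 2.5)) for _ in range(50)]

selected = []
current = np.zeros_like(theta)
for _ in range(10):                            # assemble a 10-item test
    def remaining_gap(item):
        info = current + item_information(theta, *item)
        return np.sum(np.clip(target - info, 0.0, None))
    best = min((it for it in bank if it not in selected), key=remaining_gap)
    selected.append(best)
    current += item_information(theta, *best)

print(selected)
```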

Differential item functioning (DIF)
The population is increasingly heterogeneous, and there is concern about the cultural, gender, education, etc. fairness of tests.
We can use standard statistical tools to assess differential test impact, that is, different performance according to culture and so on.
But we have to control for underlying trait level to really get at test bias; DIF is how this is accomplished in educational psychology.

Different approaches to DIF
IRT approaches: model items separately in different groups and compare.
SIBTEST approach: based on dimensionality assessment.
Logistic regression approach: treat DIF as an epidemiological problem (see the sketch below).
Mantel-Haenszel approach: a simple 2x2 table approach.
MIMIC approach using MPLUS.
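
A minimal sketch of the logistic regression approach, using simulated data with invented variable names (item = dichotomous response, theta = trait estimate or total score used as the matching variable, group = binary grouping). Uniform DIF shows up as a group main effect after conditioning on the trait; nonuniform DIF shows up as a trait-by-group interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
theta = rng.normal(size=n)                   # matching variable (trait / total score)
group = rng.integers(0, 2, size=n)           # 0 = reference, 1 = focal group
# Simulate an item with built-in uniform DIF: group shifts the intercept
logit_p = 1.5 * theta + 0.8 * group
item = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))
df = pd.DataFrame({"item": item, "theta": theta, "group": group})

base    = smf.logit("item ~ theta", data=df).fit(disp=0)
uniform = smf.logit("item ~ theta + group", data=df).fit(disp=0)
nonunif = smf.logit("item ~ theta + group + theta:group", data=df).fit(disp=0)

# Nested-model comparisons: adding group tests uniform DIF;
# adding the interaction tests nonuniform DIF
print(2 * (uniform.llf - base.llf), 2 * (nonunif.llf - uniform.llf))
```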

Comments on DIF
Each technique is simple to explain, but each is tricky to apply.
It is not at all clear from the literature what to do with existing epidemiological data from studies with DIF.
It is probably impossible to measure cognition without bias, especially with respect to education.
Years of education doesn't get to the heart of the problem either (Manley paper).
We re-visit DIF in detail on Thursday.

The end.
Definitions of IRT
IRT vs. CTT
Hays et al. paper
Error and information
Rational test construction
DIF
Comments and questions?