Item Response Theory in a Multi-level Framework. Saralyn Miller, Meg Oliphint. EDU 7309.




Similar presentations
Introduction to IRT/Rasch Measurement with Winsteps Ken Conrad, University of Illinois at Chicago Barth Riley and Michael Dennis, Chestnut Health Systems.

FACULTY DEVELOPMENT PROFESSIONAL SERIES OFFICE OF MEDICAL EDUCATION TULANE UNIVERSITY SCHOOL OF MEDICINE Using Statistics to Evaluate Multiple Choice.
The effect of differential item functioning in anchor items on population invariance of equating Anne Corinne Huggins University of Florida.
DIF Analysis. Galina Larina, March 2012, University of Ostrava.
LOGO One of the easiest to use Software: Winsteps
How Should We Assess the Fit of Rasch-Type Models? Approximating the Power of Goodness-of-fit Statistics in Categorical Data Analysis Alberto Maydeu-Olivares.
Consistency in testing
Chapter 4 – Reliability Observed Scores and True Scores Error
Item Response Theory in Health Measurement
Introduction to Item Response Theory
IRT Equating Kolen & Brennan, IRT If data used fit the assumptions of the IRT model and good parameter estimates are obtained, we can estimate person.
AN OVERVIEW OF THE FAMILY OF RASCH MODELS Elena Kardanova
Overview of field trial analysis procedures National Research Coordinators Meeting Windsor, June 2008.
Latent Change in Discrete Data: Rasch Models
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Item Analysis Prof. Trevor Gibbs. Item Analysis After you have set your assessment: How can you be sure that the test items are appropriate?—Not too easy.
© UCLES 2013 Assessing the Fit of IRT Models in Language Testing Muhammad Naveed Khalid Ardeshir Geranpayeh.
Classical Test Theory By ____________________. What is CTT?
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.
Measurement Problems within Assessment: Can Rasch Analysis help us? Mike Horton Bipin Bhakta Alan Tennant.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item Response Theory Psych 818 DeShon. IRT ● Typically used for 0,1 data (yes, no; correct, incorrect) – Set of probabilistic models that… – Describes.
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
Lecture note 3: Hypothesis Testing. MAKE HYPOTHESIS.
Introduction to plausible values National Research Coordinators Meeting Madrid, February 2010.
You got WHAT on that test? Using SAS PROC LOGISTIC and ODS to identify ethnic group Differential Item Functioning (DIF) in professional certification exam.
Unanswered Questions in Typical Literature Review 1. Thoroughness – How thorough was the literature search? – Did it include a computer search and a hand.
The ABC’s of Pattern Scoring Dr. Cornelia Orr. Slide 2 Vocabulary Measurement – Psychometrics is a type of measurement Classical test theory Item Response.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Intelligent Systems Laboratory (iLab), Southern Taiwan Information Engineering. Evaluation for the Test Quality of Dynamic Question Generation by Particle Swarm Optimization for Adaptive Testing. Department of.
1 Chapter 4 – Reliability 1. Observed Scores and True Scores 2. Error 3. How We Deal with Sources of Error: A. Domain sampling – test items B. Time sampling.
The ABC’s of Pattern Scoring
University of Ostrava, Czech Republic, 26-31 March 2012.
Multitrait Scaling and IRT: Part I Ron D. Hays, Ph.D. Questionnaire Design and Testing.
Estimation. The Model Probability The Model for N Items — 1 The vector probability takes this form if we assume independence.
Item Factor Analysis Item Response Theory Beaujean Chapter 6.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
Latent regression models. Where does the probability come from? Why isn’t the model deterministic. Each item tests something unique – We are interested.
Item Response Theory in Health Measurement
Item Parameter Estimation: Does WinBUGS Do Better Than BILOG-MG?
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot Watt University 12th February 2003.
Rating Scale Examples. A helpful resource
2. Main Test Theories: The Classical Test Theory (CTT) Psychometrics. 2011/12. Group A (English)
Item Response Theory Dan Mungas, Ph.D. Department of Neurology University of California, Davis.
Multitrait Scaling and IRT: Part I Ron D. Hays, Ph.D. Questionnaire.
Overview of Item Response Theory Ron D. Hays November 14, 2012 (8:10-8:30am) Geriatrics Society of America (GSA) Pre-Conference Workshop on Patient- Reported.
Utilizing Item Analysis to Improve the Evaluation of Student Performance Mihaiela Ristei Gugiu Central Michigan University Mihaiela Ristei Gugiu Central.
Item Response Theory and Computerized Adaptive Testing Hands-on Workshop, day 2 John Rust, Iva Cek,
Lesson 2 Main Test Theories: The Classical Test Theory (CTT)
[R] –irtoys –. For binary response data Provides common interface to some functions of –ICL (external to R) –BILOG (external to R) –ltm (R function) Syntax.
Chapter 2 Norms and Reliability. The essential objective of test standardization is to determine the distribution of raw scores in the norm group so that.
Nonequivalent Groups: Linear Methods Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2 nd ed.). New.
Vertical Scaling in Value-Added Models for Student Learning
UCLA Department of Medicine
Evaluating Multi-Item Scales
Using Item Response Theory to Track Longitudinal Course Changes
Introduction to the Validation Phase
assessing scale reliability
Classical Test Theory Margaret Wu.
Item Analysis: Classical and Beyond
Criterion-referenced tests: construction and analysis (a comparative study).
By ____________________
Rating Scale Examples.
Investigating item difficulty change by item positions under the Rasch model Luc Le & Van Nguyen 17th International meeting of the Psychometric Society,
Item Analysis: Classical and Beyond
Evaluating Multi-item Scales
Multitrait Scaling and IRT: Part I
Evaluating Multi-item Scales
Item Analysis: Classical and Beyond
Presentation transcript:

Item Response Theory in a Multi-level Framework. Saralyn Miller, Meg Oliphint. EDU 7309

Agenda
Item Response Theory
– Conceptual Overview
– Models (Rasch, 2PL, 3PL)
– Example
– In-Class Practice
Differential Item Functioning
– Conceptual Overview
– Example
– In-Class Practice

Defining Terms
IRT – Item Response Theory – provides a framework for evaluating how well assessments work, and how well individual items on assessments work.
DIF – Differential Item Functioning – people from different groups with the same ability perform differently on certain items.
CTT – Classical Test Theory – Observed Score = True Score + Error.

IRT vs. CTT – Situating IRT
– IRT allows for greater reliability.
– IRT can be used in computerized adaptive testing (CAT).
– IRT places item difficulty and person ability on the same scale.
– CTT is simple to compute.
– IRT can be analyzed using multi-level modeling.

How does IRT work?

Defining Item Parameters
a_i – discrimination parameter – proportional to the slope of the item characteristic curve (ICC) at its steepest point; indicates how sharply the item separates low- and high-ability examinees.
b_i – difficulty parameter – the point on the ability scale (θ) where the ICC has its maximum slope; without guessing, the probability of a correct response there is 0.5.
c_i – guessing parameter – the lower asymptote of the ICC, the probability that a very low-ability examinee answers correctly (often related to the number of response choices).

IRT Formulas
3PL Model (which includes the 1PL and 2PL as special cases):
P_i(θ) = c_i + (1 − c_i) · exp[a_i(θ − b_i)] / (1 + exp[a_i(θ − b_i)])
Rasch Model (discrimination fixed at 1, no guessing):
P_i(θ) = exp(θ − b_i) / (1 + exp(θ − b_i))
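As an illustration of how the three parameters shape the curve, here is a minimal R sketch (not from the original slides) that plots a 3PL item characteristic curve for hypothetical parameter values a = 1.5, b = 0, and c = 0.2; setting a = 1 and c = 0 in the same function gives the Rasch curve.

# Minimal sketch: 3PL item characteristic curve for hypothetical parameters.
icc_3pl <- function(theta, a, b, c) {
  c + (1 - c) * plogis(a * (theta - b))   # plogis(x) = exp(x) / (1 + exp(x))
}
theta <- seq(-4, 4, length.out = 200)
plot(theta, icc_3pl(theta, a = 1.5, b = 0, c = 0.2), type = "l",
     xlab = "theta (ability)", ylab = "P(correct)",
     main = "3PL item characteristic curve")
abline(h = 0.2, lty = 2)   # lower asymptote = guessing parameter c
abline(v = 0, lty = 2)     # difficulty b marks the curve's steepest point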

Example: LSAT data
5 test items; 1,000 examinees.
library(ltm)
head(LSAT)
The first rows of the data show the five 0/1 response columns, Item 1 through Item 5.

Example: LSAT data
descript(LSAT)
Descriptive statistics for the 'LSAT' data-set
Sample: 5 items and 1000 sample units; 0 missing values
The output then lists, for each of the five items, the proportion of 0 and 1 responses (with the corresponding logit), followed by the frequencies of the total scores 0 through 5.

Example: LSAT data (descript output, continued)
The output also reports the point-biserial correlation of each item with the total score (with the item included and excluded), Cronbach's alpha for all items and with each item excluded in turn, and the pairwise associations between items (chi-squared p-values for each item pair).

##Fitting the Rasch model##
fitRasch1 <- rasch(LSAT, constraint = cbind(length(LSAT) + 1, 1))
summary(fitRasch1)
Call: rasch(data = LSAT, constraint = cbind(length(LSAT) + 1, 1))
The model summary reports the log-likelihood, AIC, and BIC, then the coefficient table: a difficulty estimate (Dffclt.Item 1 through Dffclt.Item 5) with standard error and z-value for each item, and the common discrimination (Dscrmn), whose std.err and z-value are NA because it is constrained to 1.
Integration: Gauss-Hermite quadrature, 21 points. Optimization: quasi-Newton (BFGS); convergence code 0, max(|grad|) 6.3e-05.

coef(fitRasch1, prob = TRUE, order = TRUE)
This returns, for each item, the difficulty (Dffclt), the discrimination (Dscrmn), and P(x=1|z=0), the probability of a correct response for an examinee of average ability, with the items ordered by difficulty.
patterns <- rbind("all.zeros" = rep(0, 5),
                  "mix1" = rep(0:1, length = 5),
                  "mix2" = rep(1:0, length = 5),
                  "all.ones" = rep(1, 5))
residuals(fitRasch1, resp.patterns = patterns, order = FALSE)
For each of the four response patterns (all zeros, the two mixed patterns, and all ones), the output lists the observed frequency (Obs), the expected frequency under the model (Exp), and the residual (Resid) for Items 1 through 5.

Item Characteristic Curve
plot(fitRasch1, legend = TRUE, pch = rep(1:20, each = 5), xlab = "LSAT",
     col = rep(1:5, 2), lwd = 2, cex = 1.2,
     sub = paste("Call:", deparse(fitRasch1$call)))

Item Information Curve
plot(fitRasch1, type = "IIC", legend = TRUE, pch = rep(1:2, each = 5),
     xlab = "LSAT", col = rep(1:5, 2), lwd = 2, cex = 1.2,
     sub = paste("Call:", deparse(fitRasch1$call)))

Test Information Curve
info1 <- plot(fitRasch1, type = "IIC", items = 0, lwd = 2, xlab = "LSAT")

Multi-level analysis of IRT
Hierarchical generalized linear models (HGLM) provide a framework for the nested structure of item responses.
– We focus on the intercept model with dichotomous items; item responses are nested within examinees.
– Item responses (level 1)
– Examinees (level 2)
The HGLM is a fully crossed design, since every examinee answers every test item. We will use a type of Rasch modeling. A sketch of the long (stacked) data layout this implies follows below.
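To make the nesting concrete, here is a minimal R sketch (hypothetical data, not from the original slides) of the long, or stacked, layout the multi-level model expects: one row per person-item combination, with the 0/1 response and the person and item identifiers.

# Minimal sketch: long (stacked) layout, one row per person-item pair.
set.seed(1)
long <- data.frame(
  person = factor(rep(1:3, each = 2)),     # level-2 units (examinees)
  item   = factor(rep(c("i1", "i2"), 3)),  # level-1 units (items)
  resp   = rbinom(6, 1, 0.5)               # dichotomous item response
)
str(long)   # every person appears once for every item (fully crossed)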

Fully Nested Design vs. Fully Crossed Design
(diagram contrasting the two designs using Person 1, Person 2, and Items 1 through 6; in the fully crossed design every person responds to every item)

HGLM Rasch Model
At level 1, all items are entered into the model, and usually the last item is used as the reference item (the intercept).
At level 2, we have fixed and random effects: examinee ability is random, but item difficulty is fixed.

Multi-level Formulas
At level 1 we obtain the log-odds of the probability that person j answers item i correctly:
log[p_ij / (1 − p_ij)] = β_0j + β_1j X_1ij + β_2j X_2ij + ... + β_(k−1)j X_(k−1)ij
where X_qij is a dummy indicator for item q and the reference item is carried by the intercept.
At level 2 under this model, the intercepts are random. This means we allow an examinee's ability to vary:
β_0j = γ_00 + u_0j
Slopes are not random. This means the item difficulties are fixed:
β_qj = γ_q0
Substituting the level-2 equations back into the level-1 equation gives the probability that person j answers item i correctly:
P(Y_ij = 1) = exp(γ_00 + u_0j + Σ_q γ_q0 X_qij) / [1 + exp(γ_00 + u_0j + Σ_q γ_q0 X_qij)]

Kyle's data, multi-level
kyle <- read.table("mlm2.txt", header = T)
##All items must be factors to use nlme###
kyle$person <- as.factor(kyle$person)
kyle$resp <- as.factor(kyle$resp)
kyle$i1 <- as.factor(kyle$i1)
kyle$i2 <- as.factor(kyle$i2)
kyle$i3 <- as.factor(kyle$i3)
kyle$i4 <- as.factor(kyle$i4)
kyle$i5 <- as.factor(kyle$i5)
kyle$i6 <- as.factor(kyle$i6)
kyle$i7 <- as.factor(kyle$i7)
kyle$i8 <- as.factor(kyle$i8)
kyle$i9 <- as.factor(kyle$i9)
kyle$i10 <- as.factor(kyle$i10)

Kyle's data, multi-level
library(MASS)   ##glmmPQL() is in MASS; it fits the model via nlme's lme()##
library(nlme)
glmm.fit.kyle <- glmmPQL(resp ~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9,
                         random = ~1 | person, family = binomial, data = kyle)
summary(glmm.fit.kyle)
The summary describes a linear mixed-effects model fit by maximum likelihood on the kyle data (AIC, BIC, and logLik are reported as NA for PQL fits). The random-effects section gives the standard deviations of the person intercept and the residual, along with the PQL variance function (fixed weights, ~invwt). The fixed-effects table for resp ~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 lists the intercept and items i1 through i9 with their values, standard errors, DF, t-values, and p-values.

Kyle's data, multi-level (rest of output)
The remainder of summary(glmm.fit.kyle) shows the correlations among the fixed effects (the intercept and items i1 through i9), the standardized within-group residuals (Min, Q1, Med, Q3, Max), the number of observations (200), and the number of groups (20 examinees).

Calculate Difficulties
From the fixed effects of resp ~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9, item difficulty is calculated by subtracting the intercept (which represents the reference item, i10) from each item's coefficient; the reference item's own coefficient is 0.
For example, I1: −5.84 − (−2.54) = −3.3.
Applying the same rule gives the remaining difficulties (e.g., I7 = 0.73, I8 = 0.23, I9 = 2.03, I10 = 2.54).
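Here is a minimal R sketch (not from the original slides) of how these difficulties could be computed directly from the fitted object, following the rule above (item coefficient minus the intercept, with the reference item's coefficient taken as 0):

# Sketch: item difficulties from the fixed effects of glmm.fit.kyle,
# using the slide's rule difficulty_i = coef_i - intercept.
fe <- nlme::fixef(glmm.fit.kyle)     # named vector: (Intercept), then the item terms
intercept <- fe["(Intercept)"]
item_coefs <- c(fe[-1], i10 = 0)     # reference item i10 enters with coefficient 0
difficulties <- item_coefs - intercept
round(difficulties, 2)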

Kyle's Data (single level)
kyle <- read.table("mlm2.txt", header = T)
library(psych)
library(ltm)
head(kyle)
The data contain the columns resp, person, id, i1 through i10, cons, bcons, and denom.

Kyle's Data (single level)
##data is already stacked for multi-level analysis, so it needs to be unstacked (reshaped to wide)###
kyle$item <- rep(1:10, 20)
kyle.new <- kyle[, c(1, 2, 17)]
kyle1 <- reshape(kyle.new, timevar = "item", idvar = "person", direction = "wide")
head(kyle1)
The wide data contain one row per person, with the columns person and resp.1 through resp.10.
##create new subset without the "person" variable##
kyle2 <- subset(kyle1, select = c(resp.1, resp.2, resp.3, resp.4, resp.5,
                                  resp.6, resp.7, resp.8, resp.9, resp.10))

Kyle's Data (single level)
##constraint: discrimination fixed at 1 (Rasch)###
fitRasch1 <- rasch(kyle2, constraint = cbind(length(kyle2) + 1, 1))
summary(fitRasch1)
Call: rasch(data = kyle2, constraint = cbind(length(kyle2) + 1, 1))
The model summary gives the log-likelihood, AIC, and BIC, then a difficulty estimate (Dffclt.resp.1 through Dffclt.resp.10) with standard error and z-value for each item, and the common discrimination (Dscrmn) with NA std.err because it is fixed at 1.
Integration: Gauss-Hermite quadrature, 21 points. Optimization: quasi-Newton (BFGS); convergence code 0.

Kyle's Data (single level)
#items ordered by difficulty, with the probability of a positive response by the average individual#
coef(fitRasch1, prob = TRUE, order = TRUE)
The output lists, for resp.1 through resp.10, the difficulty (Dffclt), the discrimination (Dscrmn, equal to 1), and P(x=1|z=0), with the items ordered by difficulty.
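To line these estimates up against the multi-level difficulties from the earlier slides, here is a small sketch (not from the original slides) that pulls the single-level difficulties out of the fitted object:

# Sketch: extract the single-level Rasch difficulties for comparison
# with the multi-level (glmmPQL) difficulties computed earlier.
single_level <- coef(fitRasch1)[, "Dffclt"]
round(single_level, 2)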

Compare Difficulties (kyle data)
Items | Difficulty – IRT | Difficulty – Multi-level IRT
The table lists Items 1 through 10 with the single-level (ltm) and multi-level (glmmPQL) difficulty estimates side by side.

Example in Class – Multi-level of LSAT data in ltm package
##need to reshape the data##
LSAT1 <- reshape(LSAT, varying = list(1:5), direction = "long")
LSAT1 <- LSAT1[order(LSAT1$id), ]
colnames(LSAT1) <- c("item", "score", "id")
LSAT1$item1 <- ifelse(LSAT1$item == 1, 1, 0)
LSAT1$item2 <- ifelse(LSAT1$item == 2, 1, 0)
LSAT1$item3 <- ifelse(LSAT1$item == 3, 1, 0)
LSAT1$item4 <- ifelse(LSAT1$item == 4, 1, 0)
LSAT1$item5 <- ifelse(LSAT1$item == 5, 1, 0)
LSAT1[1:15, ]
###MAKE VARIABLES FACTORS###
###RUN ANALYSIS###
###COMPUTE DIFFICULTIES###
(one possible sketch of these three steps follows below)
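For reference, one possible sketch of the three in-class steps (not taken from the original slides; the presenters' own model call appears on a later slide), assuming the LSAT1 data frame built above:

###1. MAKE VARIABLES FACTORS###
LSAT1$id <- as.factor(LSAT1$id)
LSAT1$item1 <- as.factor(LSAT1$item1)
LSAT1$item2 <- as.factor(LSAT1$item2)
LSAT1$item3 <- as.factor(LSAT1$item3)
LSAT1$item4 <- as.factor(LSAT1$item4)
LSAT1$item5 <- as.factor(LSAT1$item5)

###2. RUN ANALYSIS (item5 serves as the reference item / intercept)###
library(MASS)   # glmmPQL()
library(nlme)
glmm.fit.LSAT <- glmmPQL(score ~ item1 + item2 + item3 + item4,
                         random = ~1 | id, family = binomial, data = LSAT1)
summary(glmm.fit.LSAT)

###3. COMPUTE DIFFICULTIES (coefficient minus intercept; reference item coefficient = 0)###
fe <- nlme::fixef(glmm.fit.LSAT)
difficulties <- c(fe[-1], item5 = 0) - fe["(Intercept)"]
round(difficulties, 2)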

1. MAKE VARIABLES FACTORS
2. RUN ANALYSIS
3. COMPUTE DIFFICULTIES
Item | Difficulty – IRT | Difficulty – Multi-level IRT
Item 1 | |
Item 2 | |
Item 3 | |
Item 4 | |
Item 5 | |
(blank comparison table to fill in during class)

Compare Difficulties
Item | Difficulty – IRT | Difficulty – Multi-level IRT
The table lists Items 1 through 5 with the single-level (ltm) and multi-level (glmmPQL) difficulty estimates side by side.

Example in Class – Multi-level of LSAT data in ltm package
glmm.fit.LSAT <- glmmPQL(score ~ item1 + item2 + item3 + item4,
                         random = ~1 | id, family = binomial, data = LSAT1)
summary(glmm.fit.LSAT)
The summary describes a linear mixed-effects model fit by maximum likelihood on LSAT1 (AIC, BIC, and logLik are NA for PQL fits): the standard deviations of the random id intercept and the residual, the PQL variance function (fixed weights, ~invwt), the fixed effects for the intercept and item1 through item4 (values, standard errors, DF, t-values, p-values), the correlations among the fixed effects, and the standardized within-group residuals. Number of observations: 5000; number of groups: 1000.

Differential Item Functioning
DIF is a way to detect biased (unfair) questions in a given test.
An item is said to have DIF if:
– people in different groups
– with the same ability
– answer the question differently
Classic example: a math question that requires a heavy reading load, or a question about calculating ERA (earned run average).

Differential Item Functioning
Can be detected using logistic regression:
– testing whether the group term for an item reaches statistical significance (SS)
Can be detected in a multi-level modeling framework:
– looking at the interaction effect between the grouping variable and that item
– if the DIF estimate is larger in absolute value than twice its standard error, the item is flagged as biased
Keep in mind: DIF = bad! (A sketch of the logistic-regression approach follows below.)
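Here is a minimal sketch (not from the original slides) of the logistic-regression approach, using a hypothetical data frame dat with an item response, a total score, and a group variable: a significant group term suggests uniform DIF, and a significant ability-by-group interaction suggests non-uniform DIF.

# Sketch: logistic-regression DIF screen for one item (hypothetical data
# frame `dat` with columns item1 (0/1 response), total (total score), group).
dif.fit <- glm(item1 ~ total + group + total:group,
               family = binomial, data = dat)
summary(dif.fit)
# A significant `group` coefficient indicates uniform DIF;
# a significant `total:group` interaction indicates non-uniform DIF.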

DIF Example with LSAT Data
> LSAT1$gender <- as.factor(rep(0:1, each = 500))
> head(LSAT1)
The data now show the columns item, score, id, item1 through item5, and the new gender factor. (The length-1,000 vector is recycled across the 5,000 stacked rows; because the data are sorted by id, each examinee still receives a single gender value.)

Item 1
glmm.dif.LSAT <- glmmPQL(score ~ 0 + item1*gender + item2*gender + item3*gender + item4*gender,
                         random = ~1 | id, family = binomial, data = LSAT1)
summary(glmm.dif.LSAT)
Fixed effects: score ~ 0 + item1 * gender + item2 * gender + item3 * gender + item4 * gender
The fixed-effects table lists the item main effects (item1 through item4), both gender levels (gender0, gender1), and the item-by-gender interaction terms (item1:gender1, gender1:item2, gender1:item3, gender1:item4), each with its value, standard error, DF, t-value, and p-value, followed by the correlations among the fixed effects and the standardized within-group residuals. Number of observations: 5000; number of groups: 1000.
LOOK! This item does not have DIF: the item1:gender1 estimate is not larger than twice its standard error.

Item 2
The same fixed-effects table is examined, this time focusing on the gender1:item2 interaction, the DIF estimate for item 2.
Does this item have DIF? YES =( (the estimate exceeds twice its standard error).

Item 3
Again the same fixed-effects table, now focusing on the gender1:item3 interaction, the DIF estimate for item 3.
Does this item have DIF? YES =( (the estimate exceeds twice its standard error).

Item 4
Finally, the gender1:item4 interaction gives the DIF estimate for item 4.
Does this item have DIF? NO =)
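As a wrap-up, here is a minimal sketch (not from the original slides) of how the twice-the-standard-error rule from the earlier slide could be applied to all of the interaction terms at once, using the fitted glmm.dif.LSAT object:

# Sketch: flag DIF items from glmm.dif.LSAT with the rule
# |estimate| > 2 * std.error for each item-by-gender interaction.
tab <- summary(glmm.dif.LSAT)$tTable         # fixed-effects table of the lme/glmmPQL fit
interactions <- grep(":", rownames(tab))     # rows belonging to interaction terms
dif.flag <- abs(tab[interactions, "Value"]) > 2 * tab[interactions, "Std.Error"]
dif.flag                                     # TRUE = item flagged for DIF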

Thank you for a great two years!