Item Response Theory in a Multi-level Framework Saralyn Miller Meg Oliphint EDU 7309
Agenda Item Response Theory – Conceptual Overview – Show Models (Rasch, 2PL, 3PL) – Example – In Class Practice Differential Item Functioning – Conceptual Overview – Example – In Class Practice
Defining Terms IRT – Item Response Theory – provides a framework for evaluating how well assessments work, and how well individual items on assessments work DIF – Differential Item Functioning – people from different groups with the same ability have different probabilities of answering certain items correctly CTT – Classical Test Theory – Observed Score = True Score + Error
IRT vs. CTT – Situating IRT – IRT allows for greater reliability – IRT can be used in CAT (computerized adaptive testing) – IRT allows for difficulty and ability to be on the same scale – CTT is simple to compute – IRT can be analyzed using multi-level modeling
How does IRT work?
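Defining Item Parameters a_i – discrimination parameter – proportional to the slope of the ICC at its steepest point; higher values mean the item separates examinees just below and just above b_i more sharply b_i – difficulty parameter – point on the ability scale (θ) where the ICC has its maximum slope; when there is no guessing, this is where P(θ) = 0.5 c_i – guessing parameter – lower asymptote of the ICC (often near 1 / number of response choices), accounting for correct answers obtained by guessing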
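The roles of the three item parameters can be sketched with a small base-R function. This is a minimal illustration, assuming the standard 3PL form; the function name icc_3pl and the parameter values are hypothetical and do not come from the slides or from the ltm package.

```r
# Hypothetical helper (not part of ltm): 3PL item characteristic curve.
# theta = ability, a = discrimination, b = difficulty, c = guessing (lower asymptote).
icc_3pl <- function(theta, a = 1, b = 0, c = 0) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}

theta <- seq(-4, 4, by = 0.5)

# Rasch-type item (a = 1, c = 0): the probability is 0.5 where theta equals b.
p_at_b <- icc_3pl(0, a = 1, b = 0, c = 0)

# Guessing raises the lower asymptote: a very low-ability examinee still
# answers correctly with probability close to c.
p_low <- icc_3pl(-10, a = 1, b = 0, c = 0.25)

# A larger a gives a steeper curve around theta = b.
p_flat  <- icc_3pl(theta, a = 0.5, b = 0)
p_steep <- icc_3pl(theta, a = 2,   b = 0)
```

Plotting p_flat and p_steep against theta (for example with plot() and lines()) shows the two curves crossing at θ = b, with the steeper one belonging to the more discriminating item.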
IRT formula 3 PL Model (which includes 1PL and 2PL) Rasch Model
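For reference, the standard textbook forms of these models are sketched below, written with the parameters a_i, b_i, c_i and the ability θ defined earlier. The notation is an assumption; the slide's own layout may have differed.

```latex
\begin{align*}
\text{3PL:}\quad & P_i(\theta) = c_i + (1 - c_i)\,
  \frac{\exp\left[a_i(\theta - b_i)\right]}{1 + \exp\left[a_i(\theta - b_i)\right]} \\[4pt]
\text{2PL } (c_i = 0):\quad & P_i(\theta) =
  \frac{\exp\left[a_i(\theta - b_i)\right]}{1 + \exp\left[a_i(\theta - b_i)\right]} \\[4pt]
\text{Rasch } (a_i = 1,\ c_i = 0):\quad & P_i(\theta) =
  \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}
\end{align*}
```

Setting c_i = 0 in the 3PL gives the 2PL, and additionally fixing every a_i to 1 gives the Rasch model, which is why the 3PL is said to include the others.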
Example: LSAT data 5 Test items; 1000 examinees library(ltm) head(LSAT) Item 1 Item 2 Item 3 Item 4 Item
Example: LSAT data descript(LSAT) Descriptive statistics for the 'LSAT' data-set Sample: 5 items and 1000 sample units; 0 missing values Proportions for each level of response: 0 1 logit Item Item Item Item Item Frequencies of total scores: Freq
Example: LSAT data Point Biserial correlation with Total Score: Included Excluded Item Item Item Item Item Cronbach's alpha: value All Items Excluding Item Excluding Item Excluding Item Excluding Item Excluding Item Pairwise Associations: Item i Item j p.value e e-04
##Fitting the Rasch model## fitRasch1<-rasch(LSAT,constraint=cbind(length(LSAT)+1,1)) summary(fitRasch1) Call: rasch(data = LSAT, constraint = cbind(length(LSAT) + 1, 1)) Model Summary: log.Lik AIC BIC Coefficients: value std.err z.vals Dffclt.Item Dffclt.Item Dffclt.Item Dffclt.Item Dffclt.Item Dscrmn NA NA Integration: method: Gauss-Hermite quadrature points: 21 Optimization: Convergence: 0 max(|grad|): 6.3e-05 quasi-Newton: BFGS
coef(fitRasch1,prob=TRUE,order=TRUE) Dffclt Dscrmn P(x=1|z=0) Item Item Item Item Item patterns<- rbind("all.zeros"=rep(0,5),"mix1"=rep(0:1,length=5),"mix2"=rep(1:0, length=5),"all.ones"=rep(1,5)) residuals(fitRasch1,resp.patterns=patterns,order=FALSE) Item 1 Item 2 Item 3 Item 4 Item 5 Obs Exp Resid all.zeros mix mix all.ones
Item Characteristic Curve plot(fitRasch1,legend=TRUE,pch=rep(1:20,each=5),xlab="LSAT",col=rep(1:5,2),lwd=2,cex=1.2,sub=paste("Call:",deparse(fitRasch1$call)))
Item Information Curve plot(fitRasch1, type = "IIC", legend = TRUE, pch = rep(1:2, each = 5), xlab = "LSAT", col = rep(1:5, 2), lwd = 2, cex = 1.2, sub = paste("Call: ", deparse(fitRasch1$call)))
Test Information Curve info1<-plot(fitRasch1,type="IIC",items=0,lwd=2,xlab="LSAT")
Multi-level analysis of IRT Hierarchical generalized linear models (HGLM) – Framework used for the nesting structure of item responses. – We are going to focus on the random-intercept model where items are dichotomous. Item responses are nested in examinees. – Item Responses (1st level) – Examinees (2nd level) The design is fully crossed since all examinees answer all test items. We will use a type of Rasch modeling.
[Figure: Fully Nested Design vs. Fully Crossed Design. Two diagrams relating Person 1 and Person 2 to Items 1 through 6.]
HGLM Rasch Model At level 1, all items are inserted into the model and usually the last item is used as the reference item (intercept). At level 2, we have fixed and random effects where examinee ability is random, but item difficulty is fixed.
Multi-level Formulas At level 1 we are obtaining the log-odds of the probability that person j obtains a correct score (one) on item i: At level 2 under this model, intercepts are random. This means we are allowing an examinee’s ability to be random. Slopes are not random. This means item difficulties are fixed. Now we can substitute the formulas above back into the equation for the probability that person j answers item i correctly.
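The three steps described above can be written out as follows. This is a sketch of the usual HGLM Rasch parameterization for k items with the last item as reference; the symbols β, γ, u, τ and the indicator X_{qij} are assumed notation, not taken from the slide.

```latex
\begin{align*}
\text{Level 1:}\quad & \log\!\left(\frac{p_{ij}}{1 - p_{ij}}\right)
   = \beta_{0j} + \sum_{q=1}^{k-1} \beta_{qj} X_{qij},
   \qquad X_{qij} = \begin{cases} 1 & \text{if } q = i \\ 0 & \text{otherwise} \end{cases} \\[4pt]
\text{Level 2:}\quad & \beta_{0j} = \gamma_{00} + u_{0j}, \quad u_{0j} \sim N(0, \tau),
   \qquad \beta_{qj} = \gamma_{q0} \\[4pt]
\text{Combined:}\quad & p_{ij} = \frac{1}{1 + \exp\left\{-\left[u_{0j} - (-\gamma_{i0} - \gamma_{00})\right]\right\}}
\end{align*}
```

Read against the Rasch model, u_{0j} plays the role of the ability of person j and -γ_{i0} - γ_{00} plays the role of the difficulty of item i. For the reference item there is no indicator term, so its difficulty is -γ_{00}.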
Kyle’s data multi level
kyle<-read.table("mlm2.txt",header=TRUE)
##All items must be factors to use nlme###
kyle$person<-as.factor(kyle$person)
kyle$resp<-as.factor(kyle$resp)
kyle$i1<-as.factor(kyle$i1)
kyle$i2<-as.factor(kyle$i2)
kyle$i3<-as.factor(kyle$i3)
kyle$i4<-as.factor(kyle$i4)
kyle$i5<-as.factor(kyle$i5)
kyle$i6<-as.factor(kyle$i6)
kyle$i7<-as.factor(kyle$i7)
kyle$i8<-as.factor(kyle$i8)
kyle$i9<-as.factor(kyle$i9)
kyle$i10<-as.factor(kyle$i10)
Kyle’s data multi level
library(nlme)
library(MASS)  # glmmPQL() is provided by MASS and uses nlme internally
## the object name glmm.kyle is assumed; the name on the original slide was lost ##
glmm.kyle<-glmmPQL(resp~i1+i2+i3+i4+i5+i6+i7+i8+i9,random=~1|person,family=binomial,data=kyle)
summary(glmm.kyle)
Linear mixed-effects model fit by maximum likelihood Data: kyle AIC BIC logLik NA NA NA Random effects: Formula: ~1 | person (Intercept) Residual StdDev: Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: resp ~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 Value Std.Error DF t-value p-value (Intercept) i i i i i i i i i
Kyle’s data multi level ###Rest of output### Correlation: (Intr) i11 i21 i31 i41 i51 i61 i71 i81 i i i i i i i i i Standardized Within-Group Residuals: Min Q1 Med Q3 Max Number of Observations: 200 Number of Groups: 20
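Calculate Difficulties Fixed effects: resp ~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 Value Std.Error (Intercept) i i i i i i i i i To calculate item difficulty, negate the item's fixed-effect estimate and subtract the intercept; for the reference item (item 10) the difficulty is the negated intercept: I1 = -5.84 - (-2.54) = -3.30; I2 through I6 follow the same pattern; I7 = 0.73; I8 = 0.23; I9 = 2.03; I10 = 2.54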
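The difficulty arithmetic can be checked with a few lines of base R. This is a minimal sketch: the two coefficient values are the ones implied by the slide's arithmetic for item 1 and the intercept, and the helper name hglm_difficulty is hypothetical.

```r
# Coefficients as implied by the slide's arithmetic (assumed, not re-estimated here):
gamma_00 <- -2.54   # intercept, which corresponds to the reference item (item 10)
gamma_10 <-  5.84   # fixed effect for the item 1 indicator

# Hypothetical helper: difficulty = -(item effect) - (intercept).
hglm_difficulty <- function(gamma_item, gamma_intercept) {
  -gamma_item - gamma_intercept
}

d_item1  <- hglm_difficulty(gamma_10, gamma_00)  # -5.84 - (-2.54)
d_item10 <- hglm_difficulty(0, gamma_00)         # reference item: no item effect

round(c(item1 = d_item1, item10 = d_item10), 2)
```

With a fitted model, the same rule would presumably be applied to the whole vector of fixed effects, for example by taking fe <- fixef(fit) and computing -fe[-1] - fe[1] for the non-reference items, where fit stands for the fitted glmmPQL object.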
Kyle’s Data (single level) kyle<-read.table("mlm2.txt",header=T) library(psych) library(ltm) head(kyle) resp person id i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 cons bcons denom
Kyle’s Data (single level)
##data are already stacked for multi-level analysis, so they need to be unstacked###
kyle$item<-rep(1:10, 20)
## the object name kyle.long is assumed; the name on the original slide was lost ##
kyle.long<-kyle[,c(1,2,17)]  # columns resp, person, item
kyle1<-reshape(kyle.long, timevar="item", idvar="person", direction="wide")
head(kyle1)
person resp.1 resp.2 resp.3 resp.4 resp.5 resp.6 resp.7 resp.8 resp.9 resp.10
##create new subset without "person" variable##
kyle2<- subset(kyle1,select=c(resp.1,resp.2,resp.3,resp.4,resp.5,resp.6,resp.7,resp.8,resp.9,resp.10))
Kyle’s Data (single level) ##constraints where disc=1### fitRasch1<-rasch(kyle2,constraint=cbind(length(kyle2)+1,1)) summary(fitRasch1) Call: rasch(data = kyle2, constraint = cbind(length(kyle2) + 1, 1)) Model Summary: log.Lik AIC BIC Coefficients: value std.err z.vals Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dffclt.resp Dscrmn NA NA Integration: method: Gauss-Hermite quadrature points: 21 Optimization: Convergence: 0 max(|grad|): quasi-Newton: BFGS
Kyle’s Data (single level) #items ordered by difficulty and probability of positive response by the average individual# coef(fitRasch1,prob=TRUE,order=TRUE) Dffclt Dscrmn P(x=1|z=0) resp resp resp resp resp resp resp resp resp resp
Compare Difficulties (kyle data) Item | Difficulty – IRT | Difficulty – Multi-level IRT Item Item Item Item Item Item Item Item Item Item
Example in Class – Multi-level of LSAT data in ltm package
##need to reshape the data##
LSAT1<-reshape(LSAT,varying=list(1:5),direction="long")
LSAT1<-LSAT1[order(LSAT1$id),]
colnames(LSAT1)<-c("item","score","id")
LSAT1$item1<-ifelse(LSAT1$item==1,1,0)
LSAT1$item2<-ifelse(LSAT1$item==2,1,0)
LSAT1$item3<-ifelse(LSAT1$item==3,1,0)
LSAT1$item4<-ifelse(LSAT1$item==4,1,0)
LSAT1$item5<-ifelse(LSAT1$item==5,1,0)
LSAT1[1:15,]
###MAKE VARIABLES FACTORS###
###RUN ANALYSIS###
###COMPUTE DIFFICULTIES###
1. MAKE VARIABLES FACTORS 2. RUN ANALYSIS 3. COMPUTE DIFFICULTIES Item | Difficulty – IRT | Difficulty – Multi-level IRT Item 1 Item 2 Item 3 Item 4 Item 5
Compare Difficulties Item | Difficulty – IRT | Difficulty – Multi-level IRT Item Item Item Item Item
Example in Class – Multi-level of LSAT data in ltm package
## the object name glmm.LSAT is assumed; the name on the original slide was lost ##
## glmmPQL() is provided by the MASS package ##
glmm.LSAT<-glmmPQL(score~item1+item2+item3+item4,random=~1|id,family=binomial,data=LSAT1)
summary(glmm.LSAT)
Linear mixed-effects model fit by maximum likelihood Data: LSAT1 AIC BIC logLik NA NA NA Random effects: Formula: ~1 | id (Intercept) Residual StdDev: Variance function: Structure: fixed weights Formula: ~invwt Fixed effects: score ~ item1 + item2 + item3 + item4 Value Std.Error DF t-value p-value (Intercept) item item item item Correlation: (Intr) item11 item21 item31 item item item item Standardized Within-Group Residuals: Min Q1 Med Q3 Max Number of Observations: 5000 Number of Groups: 1000
Differential Item Functioning DIF analysis is a way to detect biased (unfair) questions in a given test An item is said to have DIF if: – People in different groups – With the same ability – Answer the question differently Classic Example: a math question that carries a heavy reading load, or a question about calculating ERA (earned run average, a baseball statistic)
Differential Item Functioning Can be detected using logistic regression: – Looking for a statistically significant group effect on an item, after controlling for ability, as evidence of DIF Can be detected in a multi-level modeling framework: – Looking at the interaction effect between the grouping variable and that certain item – If the absolute DIF estimate is larger than twice its standard error, the item is flagged as showing DIF Keep in mind: DIF = bad!
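The logistic-regression approach and the "twice the standard error" rule can be sketched in base R on simulated data. Everything here is an assumption made for illustration: the data are simulated, the helper name dif_check is hypothetical, and the simulated ability is used directly as the matching variable, whereas in practice a total or rest score would usually stand in for it.

```r
set.seed(7309)                     # arbitrary seed so the simulation is reproducible
n     <- 2000
group <- rep(0:1, each = n / 2)    # hypothetical grouping variable (e.g. gender)
theta <- rnorm(n)                  # simulated ability, same distribution in both groups

# Item A has no DIF. Item B has uniform DIF: at equal ability it is
# 1 logit harder for group 1 than for group 0.
item_a <- rbinom(n, 1, plogis(theta))
item_b <- rbinom(n, 1, plogis(theta - 1.0 * group))

# Hypothetical helper: logistic-regression DIF check for a single item.
# The group coefficient is the DIF estimate after conditioning on ability.
dif_check <- function(resp, ability, group) {
  fit <- glm(resp ~ ability + group, family = binomial)
  est <- coef(summary(fit))["group", c("Estimate", "Std. Error")]
  c(estimate = unname(est[1]),
    se       = unname(est[2]),
    flagged  = unname(abs(est[1]) > 2 * est[2]))  # 1 = flagged, 0 = not flagged
}

res_a <- dif_check(item_a, theta, group)
res_b <- dif_check(item_b, theta, group)
rbind(item_a = res_a, item_b = res_b)
```

Item B should be flagged because its group effect is large relative to its standard error. Item A has no built-in DIF, although any single simulated item can still be flagged by chance at roughly the rate implied by the two-standard-error cutoff.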
DIF Example with LSAT Data > LSAT1$gender<-as.factor(rep(0:1, each=500)) > head(LSAT1) item score id item1 item2 item3 item4 item5 gender
Item 1 glmm.dif.LSAT<- glmmPQL(score~0+item1*gender+item2*gender+item3*gender+item4*gender,random=~1|id,family=binomial,data=LSAT1) summary(glmm.dif.LSAT) Fixed effects: score ~ 0 + item1 * gender + item2 * gender + item3 * gender + item4 * gender Value Std.Error DF t-value p-value item gender gender item item item item1:gender gender1:item gender1:item gender1:item Correlation: item1 gendr0 gendr1 item2 item3 item4 itm1:1 gnd1:2 gender gender item item item item1:gender gender1:item gender1:item gender1:item gnd1:3 gender0 gender1 item2 item3 item4 item1:gender1 gender1:item2 gender1:item3 gender1:item Standardized Within-Group Residuals: Min Q1 Med Q3 Max e e e e e+00 Number of Observations: 5000 Number of Groups: 1000 LOOK! This item does not have DIF!
Item 2 Fixed effects: score ~ 0 + item1 * gender + item2 * gender + item3 * gender + item4 * gender (same summary(glmm.dif.LSAT) output as shown for Item 1; read the gender by item2 interaction row) Does this item have DIF? YES =(
Item 3 Fixed effects: score ~ 0 + item1 * gender + item2 * gender + item3 * gender + item4 * gender (same summary(glmm.dif.LSAT) output as shown for Item 1; read the gender by item3 interaction row) Does this item have DIF? YES =(
Item 4 Fixed effects: score ~ 0 + item1 * gender + item2 * gender + item3 * gender + item4 * gender (same summary(glmm.dif.LSAT) output as shown for Item 1; read the gender by item4 interaction row) Does this item have DIF? NO =)
Thank you for a great two years!