Presentation on theme: "Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time"— Presentation transcript:

1 Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time
Yeow Meng Thum and Hye Sook Shin
UCLA Graduate School of Education & Information Studies
National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
CRESST Conference 2004, Los Angeles

2 Rationale Research shows that cut-scores vary as a function of many factors: raters, procedures, and time. How, then, does one defend a particular cut-score? Current options include averaging several values and drawing on collateral information. High-stakes accountability hinges on the comparability of performance standards over time, so some method is required to monitor cut-scores for consistency across groups and over time (Green et al.).

3 Purpose of Study To develop an approach for estimating the impact of procedural factors, rater characteristics, and time on cut-scores, and for monitoring the consistency of cut-scores across several groups.

4 Transforming Judgments into Scale Scores Figure 1: Working with the Grade 3 SAT-9 mathematics scale

5 Performance Distribution for Four Urban Schools Figure 2: Grade 3 SAT-9 mathematics scale score distribution for four schools

6 Potential Impact of Revising a Cut-score
Table 1: Potential impact on school performance when the cut-score changes

                 Revised cut-score (as fraction of SEM)
School      -1     -0.5      0     +0.5      +1
A          41%     37%     32%     29%     26%
B          78%     75%     70%     67%     63%
C          25%     23%     19%     15%     13%
D          40%     36%     32%     28%     25%
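The arithmetic behind Table 1 is just the share of students clearing each shifted cut. Below is a minimal sketch; the simulated score distribution, the cut-score of 619, and the SEM of 12 are illustrative stand-ins, not values from the study.

```python
import numpy as np

def impact_table(scores, cut, sem, shifts=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Percent of students at or above each revised cut-score,
    where the revision is expressed as a fraction of the SEM."""
    scores = np.asarray(scores)
    return {k: 100.0 * np.mean(scores >= cut + k * sem) for k in shifts}

# Hypothetical school: 500 students, scale scores roughly N(625, 40).
rng = np.random.default_rng(0)
scores = rng.normal(625, 40, size=500)
print(impact_table(scores, cut=619.0, sem=12.0))
```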

7 Data & Model Simulated data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995). Factors of the standard-setting study (see the sketch after this list):
a. Rater Dimensions (Teacher, Non-Teacher, etc.)
b. Procedural Factors/Treatments
1. Type of Feedback (outcome or impact feedback, "yes" or "no", etc.)
2. Item Sampling in Booklet (number of items, etc.)
3. Type of Task (a modified Angoff, a contrasting-groups approach, or the Bookmark method, etc.)
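As a rough illustration of the design such a study implies, the sketch below crosses the session factors named above and assigns raters to sessions. The factor levels, the teacher/non-teacher split, and the simple random assignment are stand-ins; the actual study would use the confounded blocking of Kirk (1995).

```python
import itertools
import random

# Illustrative two-level versions of the three session factors.
FEEDBACK = ["no feedback", "impact feedback"]
BOOKLET = ["short booklet", "long booklet"]
TASK = ["modified Angoff", "Bookmark"]

# Full crossing of the session factors: 2 x 2 x 2 = 8 session types.
sessions = list(itertools.product(FEEDBACK, BOOKLET, TASK))

# Randomly assign 150 raters (teachers and non-teachers) to sessions;
# a simplistic stand-in for the confounded factorial blocking.
random.seed(1)
raters = [{"rater": j,
           "teacher": j < 100,        # assumed split: 100 teachers, 50 non-teachers
           "session": random.choice(sessions)}
          for j in range(150)]
print(raters[0])
```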

8 Treating Binary Outcomes
Binary outcome (1): $y_{ijt} = 1$ if rater $j$ judges in session $t$ that a just-passing candidate has a good chance of getting item $i$ right, and $y_{ijt} = 0$ otherwise.
Logit link function (2): $\operatorname{logit}(p_{ijt}) = \log\left(\frac{p_{ijt}}{1 - p_{ijt}}\right)$, where $p_{ijt} = \Pr(y_{ijt} = 1)$.

9 IRT Model for Cut-score - I
Procedural Factors Impacting a Rater's Cut-scores (3): a rater's cut-score is modeled as a fixed effect due to session characteristics s plus a random effect that evolves over time (ROUND_jt) and is a function of rater characteristics X_pj.
Item Response Model (IRT) (4): the logit of the pass probability links the rater's cut-score to item difficulty.
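The slide's equations (3) and (4) were rendered as images and are lost. The block below is one plausible rendering consistent with the verbal description; the symbols ($\theta_{jt}$ for rater $j$'s cut-score at round $t$, $\gamma_s$ for session fixed effects, $\delta_{0j}, \delta_{1j}$ for rater random effects, $\lambda_p$ for rater-characteristic effects, $\beta_i$ for item difficulty) are assumed here for illustration.

```latex
% A plausible rendering of equations (3)-(4); symbol names are illustrative.
\begin{align}
\theta_{jt} &= \sum_{s} \gamma_s S_{st}
             + \delta_{0j} + \delta_{1j}\,\mathrm{ROUND}_{jt}
             + \sum_{p} \lambda_p X_{pj} \tag{3} \\
\operatorname{logit}(p_{ijt}) &= \theta_{jt} - \beta_i \tag{4}
\end{align}
```

Read this way, equation (5) on the next slide gives the joint distribution of the rater random effects $(\delta_{0j}, \delta_{1j})$.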

10 IRT Model for Cut-score - II
Estimating Factors Impacting a Rater's Cut-scores (5): the rater random effects are distributed bivariate normal with means $(0, 0)$ and variance-covariance matrix
$$\Sigma = \begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}.$$

11 Likelihood
Conditional on the rater random effects $\delta_j$, the data $y$ have probability $L(y \mid \delta_j)$ (6).
Prior distribution of $\delta_j$: $p(\delta_j)$ (7).
Joint marginal likelihood (8): $L(y) = \prod_j \int L(y \mid \delta_j)\, p(\delta_j)\, d\delta_j$.
The conditional posterior of the rater random effects $\delta_j$ is proportional to $L(y \mid \delta_j)\, p(\delta_j)$.
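To make (6)-(8) concrete, here is a minimal numerical sketch of the marginal likelihood for one rater with a single normal random effect, integrated out by Gauss-Hermite quadrature (the same general strategy SAS PROC NLMIXED uses). All parameter values and the single-effect simplification are assumptions for illustration.

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def marginal_loglik(y, beta, theta_fixed, tau, n_quad=31):
    """Marginal log-likelihood for one rater: integrate the Bernoulli
    likelihood (6) over a N(0, tau^2) random effect (7), giving (8)."""
    # Probabilists' Gauss-Hermite nodes/weights (weight function exp(-x^2/2)).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    total = 0.0
    for z, w in zip(nodes, weights):
        delta = tau * z                               # value of the rater random effect
        p = inv_logit(theta_fixed + delta - beta)     # equation (4): per-item pass probability
        lik = np.prod(p**y * (1 - p)**(1 - y))        # equation (6): Bernoulli likelihood
        total += w * lik
    return np.log(total / np.sqrt(2 * np.pi))         # N(0,1) normalizing constant

# Illustrative data: one rater's yes/no judgments on 10 items.
rng = np.random.default_rng(2)
beta = rng.normal(0, 1, size=10)                      # item difficulties (made up)
y = (rng.random(10) < inv_logit(0.5 - beta)).astype(float)
print(marginal_loglik(y, beta, theta_fixed=0.5, tau=0.8))
```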

12 Multiple Studies: Consistency & Stability
Procedural Factors Impacting a Rater's Cut-scores for separate study g (g = 1, 2, …, G) (9): as in (3), a fixed effect due to session characteristics s plus a random effect that evolves over time (SESSION_jt) and is a function of rater characteristics X_pj.
Group Factors Impacting a Rater's Severity (10).
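The original equations (9) and (10) were also images and did not survive extraction. The block below is one plausible rendering under the verbal description, extending (3) with a study subscript $g$ and adding a group-level model for rater severity; all symbol names ($\theta$, $\gamma$, $\delta$, $\mu_g$, $u_{jg}$) are assumptions.

```latex
% Illustrative rendering of equations (9)-(10); symbols are assumptions.
\begin{align}
\theta_{jtg} &= \sum_{s} \gamma_{sg} S_{stg}
              + \delta_{0jg} + \delta_{1jg}\,\mathrm{SESSION}_{jtg}
              + \sum_{p} \lambda_{pg} X_{pjg} \tag{9} \\
\delta_{0jg} &= \mu_{g} + u_{jg}, \qquad u_{jg} \sim N(0, \tau_{g}^{2}) \tag{10}
\end{align}
```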

13 Simulation SAS PROC NLMIXED. 150 raters, each randomly exposed over 4 rounds to a standard-setting exercise varying on 3 session factors:
 Session Factor 1: Feedback type
 Session Factor 2: Item targeting in booklet
 Session Factor 3: Type of standard-setting task
 Rater characteristics: Teacher, Non-Teacher
 Change over Round (time)
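The deck does not include the simulation code itself. The sketch below generates data of the kind described (150 raters, 4 rounds, binary item judgments driven by the logit model above) in Python rather than SAS; the item count, effect sizes, and random-effect variance are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

N_RATERS, N_ROUNDS, N_ITEMS = 150, 4, 20
beta = rng.normal(0, 1, size=N_ITEMS)          # item difficulties (illustrative)
teacher = rng.random(N_RATERS) < 2 / 3         # rater characteristic X_pj (assumed split)

# Assumed fixed effects for the three session factors, round trend, and teacher status.
g_feedback, g_booklet, g_task, g_round, g_teacher = 0.3, -0.2, 0.4, -0.1, 0.2
delta = rng.normal(0, 0.8, size=N_RATERS)      # rater random effects (intercept-only case of (5))

rows = []
for j in range(N_RATERS):
    for t in range(N_ROUNDS):
        s = rng.integers(0, 2, size=3)          # random exposure to the 3 session factors
        theta = (g_feedback * s[0] + g_booklet * s[1] + g_task * s[2]
                 + g_round * t + g_teacher * teacher[j] + delta[j])
        p = 1.0 / (1.0 + np.exp(-(theta - beta)))    # equations (3)-(4)
        y = (rng.random(N_ITEMS) < p).astype(int)    # equations (1)-(2)
        rows.append((j, t, *s, y))
print(len(rows), "rater-round records; first record's item judgments:", rows[0][-1])
```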

14 Selected Results The model recovers parameters reasonably well, within sampling uncertainty, across the 3 studies. The average cut-score (all teachers) for each rater group at the last round is not significantly different from 619, whereas the first-round results were significantly different. Results from the model for multiple studies are similarly encouraging.

15 Suggestions Large-scale testing programs should monitor their cut-score estimates for consistency and stability. For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time. The model in this paper can be adapted to actual data so that variation due to the relevant factors of a study can be verified and balanced out.

