1
Adventures in Equating Land: Facing the Intra-Individual Consistency Index Monster*
*Louis Roussos retains all rights to the title
2
Overview of Equating Designs and Methods
Designs
–Single Group
–Random Groups
–Common Item Nonequivalent Groups (CING)
Methods
–Mean
–Linear
–Equipercentile
–IRT True Score or Observed Score
3
Guidelines for Selecting Common Items for Multiple-Choice (MC) Only Exams
–Representative of the total test (Kolen & Brennan, 2004)
–20% of the total test
–Same item positions
–Similar average/spread of item difficulties (Durans, Kubiak, & Melican, 1997)
–Content representative (Klein & Jarjoura, 1985)
4
Challenges in Equating Mixed-Format Tests (Kolen & Brennan, 2004; Muraki, Hombo, & Lee, 2000)
Constructed Response (CR) scored by raters
Small number of CR tasks
–Inadequate sampling of the construct
–Changes in the construct across forms
Common items
–Content/difficulty balance of common items
–MC-only common items may result in inadequate representation of groups/construct
IRT
–Small number of tasks may result in unstable parameter estimates
–Typically assumes a single dimension underlies both item types
Format effects
5
Current Research
Number of CR items
–Smaller RMSD with larger numbers of items and/or score points (Li & Yin, 2008; Fitzpatrick & Yen, 2001)
–Misclassification (Fitzpatrick & Yen, 2001)
  With fewer than 12 items, more score points resulted in smaller error rates
  With more than 12 items, error rates were less than 10% regardless of the number of score points
Trend scoring (Tate, 1999, 2000; Kim, Walker, & McHale, 2008)
–Rescoring samples of CR items
–Smaller bias and equating error
6
Current Research (cont.)
Format effects (FE)
–MC and CR measure similar constructs (Ercikan et al., 1993; Traub, 1993)
–Males scored higher on MC; females scored higher on CR (DeMars, 1998; Garner & Engelhard, 1999)
–Kim & Kolen (2006): narrow-range tests (e.g., credentialing) vs. wide-range tests (e.g., achievement)
Individual Consistency Index (Tatsuoka & Tatsuoka, 1982)
–Detecting aberrant response patterns
–Not specifically in the context of mixed-format tests
7
Purpose and Research Questions
Purpose: Examine the impact of equating mixed-format tests when student subscores differ across item types. Specifically:
–To what extent does the intra-individual consistency of examinee responses across item formats impact equating results?
–How does the selection of common items differentially impact equating results with varying levels of intra-individual consistency?
8
Data
“Old Form” (OL) treated as “truth”
–Large-scale 6th grade testing program
–Mathematics
–54 point test
  34 multiple choice (MC)
  5 short answer (SA)
  5 constructed response (CR) worth 4 points each
–Approx. 70,000 examinees
“New Form” (NE)
–Exactly the same items as OL
–Samples of examinees from OL
9
2006-07 Scoring Test (39 items)
–OL (old form): all examinees
–NE (new form): samples of 3,000 examinees
–Both OL and NE contain exactly the same items; the only difference between the forms is the examinees
10
Intra-Individual Consistency
–Consistency of student responses across formats
–Regression of dichotomous-item subscores (MC and SA) onto polytomous-item subscores (CR)
–Standardized residuals
  Range from approximately -4.00 to +8.00
  Example: an index of +2.00 means the student's CR subscore is under-predicted by two standard deviations based on the MC subscore
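A minimal sketch of one way to compute such an index, reading the regression in the direction implied by the +2.00 example (CR subscore predicted from the MC/SA subscore). The function name, the ordinary-least-squares fit, and the standardization are illustrative assumptions, not the study's actual procedure.

```python
import numpy as np

def consistency_index(mc_sa_subscore, cr_subscore):
    """Standardized residuals from a simple linear regression of the CR
    subscore on the MC/SA subscore (one value per examinee)."""
    x = np.asarray(mc_sa_subscore, dtype=float)
    y = np.asarray(cr_subscore, dtype=float)
    # Fit y = b0 + b1 * x by ordinary least squares.
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    # Standardize so that, e.g., +2.00 means the CR subscore is roughly two
    # standard deviations above what the MC/SA subscore predicts.
    return residuals / residuals.std(ddof=2)

# Hypothetical grouping into the NEG / MID / POS samples described next:
# idx = consistency_index(mc_sa, cr)
# neg, mid, pos = idx < -1.5, (idx >= -1.5) & (idx <= 1.5), idx > 1.5
```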
11
Samples
–Three groups of examinees based on the intra-individual consistency index
  Below -1.50 (NEG)
  -1.50 to +1.50 (MID)
  Above +1.50 (POS)
–3,000 examinees per sample
–Sampled from each group based on percentages
–Samples selected to have the same quartiles and median as the whole group of examinees
12
Sampling Conditions
60/20/20
–60% sampled from one of the groups (i.e., NEG, MID, POS)
–20% sampled from each of the remaining groups
–Repeated for each of the three groups
40/30/30
–40% sampled from one of the groups, 30% from each of the remaining groups, again repeated for each group
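A minimal sketch of how one of these stratified draws could be implemented, assuming the three index groups are held as arrays of examinee IDs (neg_ids, mid_ids, pos_ids are hypothetical names). The study also matched the quartiles and median of the whole examinee group, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(2007)

def draw_sample(major, minor_a, minor_b, n=3000, weights=(0.6, 0.2, 0.2)):
    """Draw n examinees: weights[0] from the 'major' group and
    weights[1]/weights[2] from the two remaining groups."""
    counts = [int(round(w * n)) for w in weights]
    parts = [rng.choice(group, size=c, replace=False)
             for group, c in zip((major, minor_a, minor_b), counts)]
    return np.concatenate(parts)

# e.g., the "60% NEG" condition:
# sample = draw_sample(neg_ids, mid_ids, pos_ids)
# the 40/30/30 condition would use weights=(0.4, 0.3, 0.3).
```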
13
Common Items
Six sets of common items
–MC only (12 points)
–CR only (12 points)
–MC (4) and CR (8)
–MC (8) and CR (4)
–MC (4), CR (4), and SA (4)
–MC (7), CR (4), and SA (1)
Each set representative of the total test in terms of content, difficulty, and length
14
Equating
–Common-item nonequivalent groups design
–Item parameters calibrated using Parscale 4.1
  3-parameter logistic model (3PL) for MC items
  2PL model for SA items
  Graded Response Model for CR items
–IRT scale transformation: mean/mean, mean/sigma, Stocking-Lord, and Haebara
–IRT true score equating
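For one of the four listed transformation methods, a minimal sketch of the mean/sigma approach using common-item difficulty (b) estimates from the two calibrations: the slope A and intercept B place the new-form parameters on the old-form scale (a* = a / A, b* = A·b + B). Variable names are illustrative; this is not Parscale's or the study's actual code.

```python
import numpy as np

def mean_sigma(b_old, b_new):
    """Return (A, B) mapping new-form parameters onto the old-form scale,
    from the common items' difficulty estimates in each calibration."""
    b_old, b_new = np.asarray(b_old, float), np.asarray(b_new, float)
    A = b_old.std(ddof=1) / b_new.std(ddof=1)
    B = b_old.mean() - A * b_new.mean()
    return A, B

def rescale_items(a_new, b_new, A, B):
    """Apply the transformation to new-form discriminations and difficulties."""
    return np.asarray(a_new, float) / A, A * np.asarray(b_new, float) + B
```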
15
Equating OL and NE
–All items shared in common between OL and NE
–“Truth” established by equating NE to OL using all items as common items
–Study equatings conducted using only a selection of items treated as common
16
Evaluation
–Bias and RMSE
  At each score point
  Averaged over score points
–Classification consistency
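A minimal sketch of these evaluation statistics, assuming each condition yields repeated equatings (one per sample) that are compared with the "truth" equating at each raw-score point. Array names and the replication structure are illustrative assumptions.

```python
import numpy as np

def bias_rmse(equated, truth):
    """equated: (replications x score points) equated scores from the
    repeated samples; truth: criterion equated score at each score point.
    Returns per-score-point bias and RMSE, plus their averages over the
    score scale."""
    diff = np.asarray(equated, float) - np.asarray(truth, float)
    bias = diff.mean(axis=0)                  # signed error at each score point
    rmse = np.sqrt((diff ** 2).mean(axis=0))  # error magnitude at each score point
    return bias, rmse, bias.mean(), rmse.mean()
```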
17
Results: 60% Mid
18
Results: 40% Mid
19
In the extreme…
20
Across the Score Scale: Average Bias
21
Across the Score Scale: Average RMSE
22
Across the Score Scale: Misclassification Rates
23
Classification Consistency: Proficient
24
Discussion
–Different equating results depending on the sampling conditions
–Differences more exaggerated when using common-item sets composed mostly of CR items
–The 60% MID condition was most similar to the full data, with small differences across common-item selections
25
Limitations and Implications
Limitations
–Sampling conditions
–Common-item selections
–Only one equating method
Implications for future research
–Sampling conditions, common-item selections, additional equating methods
–Other content areas and grade levels
–Other testing programs
–Simulation studies
26
Thanks!
Rob Keller
Mike, Louis, Won, Candy, and Jessalyn