1 Ming Lei, American Institutes for Research; Okan Bulut, Center for Research in Applied Measurement and Evaluation, University of Alberta. Item Parameter and Scale Score Stability in Alternate Assessments. NCSA – June 24, 2015

2 Item Parameter and Scale Score Stability in Alternate Assessments Sources of Drift  Distortion in test scores can be caused by shifts in item performance over time (Goldstein, 1983; Hambleton & Rogers, 1989), owing to changes in  examinees' cognitive or noncognitive characteristics (Bulut et al., 2015),  examinees' opportunities to learn (Albano & Rodriguez, 2013), and  curriculum or teaching methods (DeMars, 2004; Miller & Linn, 1988).

3 Item Parameter and Scale Score Stability in Alternate Assessments Item Parameter Drift  In item response theory (IRT), item parameter drift (IPD) occurs when item performance changes over time.  Drifted item parameters can introduce systematic errors into equating, scaling, and, consequently, scoring (Kolen & Brennan, 2004).  It is therefore important to check item performance  across subgroups of examinees (e.g., gender, ethnic groups) and  across test administrations over time.

4 Item Parameter and Scale Score Stability in Alternate Assessments Purpose of This Study  In alternate assessments, monitoring IPD and score stability is even more crucial because  the student population is more heterogeneous,  fluctuations in the population are more common across years, and  the test design changes from field test to operational.  This study focuses on 1. assessing IPD of items and 2. examining score stability in alternate assessments.

5 Item Parameter and Scale Score Stability in Alternate Assessments Data Source (grades/grade bands tested)
State 1: Math & Reading in grades 3–5, 6–8, 9–10; Science in grades 5, 8, 10
State 2: Math & Reading in grades 3–5, 6–8, 10–12; Science in grades 5, 8, 10–12

6 Item Parameter and Scale Score Stability in Alternate Assessments Data Source (test type by year)
Test Year: State 1 (N=400/130) / State 2 (N=7,500/2,500)
2011: Field Test / –
2012: Operational / –
2013: Operational / Field Test
2014: Operational / Operational

7 Item Parameter and Scale Score Stability in Alternate Assessments Test Design  Field Test (FT) Form  Multiple fixed forms linked by common items  9 or 15 tasks in each form  6 to 8 items in each task  Students responded to all tasks.  Operational (OP) Form  Single form with varying test length  One form of 12 OP tasks and 3 FT tasks in State 1  Three forms of 12 OP tasks and 1 FT task in State 2  Students responded to a subset of tasks based on their abilities.

8 Item Parameter and Scale Score Stability in Alternate Assessments Test Administration [diagram of the test administration design; details not captured in the transcript]

9 Item Parameter and Scale Score Stability in Alternate Assessments Item Calibration 9

10 Item Parameter and Scale Score Stability in Alternate Assessments Parameter Drift Analysis (1)  Item calibration in the operational setting:  First year: free calibration of field-test items.  Items were calibrated by subject for mathematics and reading to create vertical scales.  Items were calibrated by grade/grade band for science.  Later years: concurrent calibration of field-test items,  using operational items with good fit statistics as anchor items.

11 Item Parameter and Scale Score Stability in Alternate Assessments Parameter Drift Analysis (2)  In this study, free calibrations were conducted using 2014 data.  The new parameters were equated to the existing scale using  Mean/Mean (Loyd & Hoover, 1980),  Haebara (Haebara, 1980), and  Stocking-Lord (Stocking & Lord, 1983).  Anchor items whose difficulties differed by more than 0.3 between calibrations were iteratively deleted from the anchor set (a minimal sketch of this procedure follows below).  Different equating methods may lead to different anchor sets.
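
The following is a minimal sketch of the anchor-purging step described on slide 11, using only the Mean/Mean method for brevity; Haebara and Stocking-Lord linking are not reproduced here. The 2PL-style (a, b) parameterization, the per-item reading of the 0.3 rule, and all variable names and values are illustrative assumptions, not the study's actual code or data.

```python
# Minimal Mean/Mean linking sketch with an iterative 0.3 anchor purge.
# The (a, b) parameterization and the example values are assumptions.
import numpy as np

def mean_mean_link(a_base, b_base, a_new, b_new):
    """Return (A, B) that put the new calibration onto the base scale:
    b_linked = A * b_new + B, a_linked = a_new / A (Loyd & Hoover, 1980)."""
    A = np.mean(a_new) / np.mean(a_base)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

def purge_anchors(a_base, b_base, a_new, b_new, threshold=0.3):
    """Iteratively drop the anchor item whose linked difficulty differs most
    from its base-scale difficulty, until all differences are <= threshold."""
    keep = np.arange(len(b_base))
    while True:
        A, B = mean_mean_link(a_base[keep], b_base[keep],
                              a_new[keep], b_new[keep])
        diff = np.abs(A * b_new[keep] + B - b_base[keep])
        if diff.max() <= threshold or len(keep) <= 2:
            return keep, A, B                      # surviving anchors and constants
        keep = np.delete(keep, diff.argmax())      # flag the worst item as drifted

# Hypothetical anchor parameters (base scale vs. 2014 free calibration)
a_base = np.array([1.0, 0.8, 1.2, 0.9, 1.1])
b_base = np.array([-1.0, -0.2, 0.4, 0.9, 1.5])
a_new  = np.array([1.1, 0.7, 1.3, 0.9, 1.0])
b_new  = np.array([-1.1, -0.3, 0.3, 1.6, 1.4])    # fourth item has drifted
keep, A, B = purge_anchors(a_base, b_base, a_new, b_new)
print("retained anchors:", keep, " A =", round(A, 3), " B =", round(B, 3))
```

Because each linking method weights the anchor parameters differently, repeating this loop with Haebara or Stocking-Lord constants can retain a different set of anchors, which is the point made in the last bullet above.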

12 Item Parameter and Scale Score Stability in Alternate Assessments Evaluation Criteria 1) Root-mean-square deviation (RMSD) of item parameters 2) RMSD of ability estimates 3) Mean absolute percentile difference (MAPD)  MAPD accounts for the number of examinees affected by drifted parameters.  Because the ability distributions for alternate assessments are negatively skewed, a kernel-smoothed empirical cumulative distribution was used to compute MAPD at quadrature points from -4 to 4 in intervals of 0.5 (see the sketch below).
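
As a rough illustration of the criteria on slide 12, the sketch below computes RMSD and a kernel-smoothed-ECDF version of MAPD on simulated ability estimates. The Gaussian kernel, Silverman bandwidth, and the simulated skewed distribution are assumptions; the transcript does not specify the kernel, bandwidth, or exact percentile definition used in the study.

```python
# Sketch of the evaluation criteria on slide 12 (assumed implementation details).
import numpy as np
from scipy.stats import norm

def rmsd(x, y):
    """Root-mean-square deviation between two sets of estimates
    (item parameters or theta estimates on the same scale)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.mean((x - y) ** 2))

def smoothed_ecdf(theta, grid, bandwidth=None):
    """Kernel-smoothed empirical CDF of ability estimates at grid points."""
    theta = np.asarray(theta)
    if bandwidth is None:                       # Silverman's rule of thumb (assumed)
        bandwidth = 1.06 * theta.std() * len(theta) ** (-1 / 5)
    # Average of Gaussian CDFs centered at each observed theta
    return norm.cdf((grid[:, None] - theta[None, :]) / bandwidth).mean(axis=1)

def mapd(theta_original, theta_reequated):
    """Mean absolute percentile difference over quadrature points -4 to 4
    in steps of 0.5, as described on the slide."""
    grid = np.arange(-4, 4.01, 0.5)
    F_orig = smoothed_ecdf(theta_original, grid)
    F_new = smoothed_ecdf(theta_reequated, grid)
    return np.mean(np.abs(F_orig - F_new))

# Illustration with a negatively skewed, simulated ability distribution
rng = np.random.default_rng(0)
theta = -np.abs(rng.normal(0, 1, 2500)) + 0.5               # skewed toward low ability
theta_shifted = theta + rng.normal(0.02, 0.05, theta.size)  # small re-equating effect
print("RMSD(theta):", round(rmsd(theta, theta_shifted), 4))
print("MAPD:", round(mapd(theta, theta_shifted), 4))
```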

13 Item Parameter and Scale Score Stability in Alternate Assessments Results (Drift Rate) MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

14 Item Parameter and Scale Score Stability in Alternate Assessments Results (RMSD) MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

15 Item Parameter and Scale Score Stability in Alternate Assessments Results (RMSD) MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

16 Item Parameter and Scale Score Stability in Alternate Assessments Results (MAPD) MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

17 Item Parameter and Scale Score Stability in Alternate Assessments Results (MAPD) MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

18 Item Parameter and Scale Score Stability in Alternate Assessments Summary (1)  Drift rate  Drift rates are generally higher in State 1 than in State 2.  Item parameter drift:  Larger IPD in State 1 than in State 2.  In State 1, the drift values range from 0.2 to 0.31.  In State 2, the drift values range from 0.1 to 0.17.  This might be due to the greater time between administrations and the smaller sample size in State 1.

19 Item Parameter and Scale Score Stability in Alternate Assessments Summary (2)  Impact of IPD on scores (θ):  Larger RMSD in State 1 than in State 2.  In State 1, the RMSD values range from 0.02 to 0.09.  In State 2, the RMSD values range from 0.01 to 0.05.  Larger IPD may not lead to larger changes in scores:  IPD can occur in two directions, so  the effects of IPD may cancel out (see the illustrative example below).  Mean absolute percentile difference (MAPD)  The MAPD values for both states are below 0.012.
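
To make the cancellation point concrete, here is a purely illustrative calculation under a dichotomous Rasch model; the model and item values are invented, not taken from the study. When one item drifts 0.3 logits easier and another 0.3 logits harder, the expected test score changes very little.

```python
# Illustrative only: symmetric drift on a hypothetical 3-item Rasch test.
import numpy as np

def expected_score(theta, b):
    """Expected number-correct score for dichotomous Rasch items."""
    return np.sum(1.0 / (1.0 + np.exp(-(theta - np.asarray(b)))))

b_original = [-1.0, 0.0, 1.0]
b_drifted = [-1.3, 0.0, 1.3]   # one item drifts easier, one harder

for theta in (-1.0, 0.0, 1.0):
    print(theta, round(expected_score(theta, b_original), 3),
          round(expected_score(theta, b_drifted), 3))
```

At theta = 0 the two expected scores are identical (1.5), and at theta = ±1 they differ by less than 0.05 points, which is how offsetting drift can leave scores nearly unchanged even when individual items show sizable IPD.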

20 Item Parameter and Scale Score Stability in Alternate Assessments Summary (3)  Results are aligned with previous studies:  Huynh and Meyer (2009, 2010)  Wei (2013)  Wells, Hambleton, and Meng (2011)  Wells, Hambleton, Kirkpatrick, and Meng (2014)  Wells, Subkoviak, and Serlin (2002)

21 Item Parameter and Scale Score Stability in Alternate Assessments Limitations of the Study  Only two states are included in this study.  Because of the small sample sizes in alternate assessments, the impact of sampling error is inevitable.  The duration between test administrations (3 years in State 1 and 1 year in State 2) may not be long enough to observe IPD that has a significant impact on scores.

22 Item Parameter and Scale Score Stability in Alternate Assessments Selected References
 Albano, A. D., & Rodriguez, M. C. (2013). Examining differential math performance by gender and opportunity to learn. Educational and Psychological Measurement, 73, 836-856.
 Bulut, O., Palma, J., Rodriguez, M. C., & Stanke, L. (2015). Evaluating measurement invariance in the measurement of developmental assets in Latino English language groups across developmental stages. SAGE Open, 2, 1-18.
 Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144-149.
 Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.
 Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179-193.
 Miller, A. D., & Linn, R. L. (1988). Invariance of item characteristic functions with variations in instructional coverage. Journal of Educational Measurement, 25, 205-219.
 Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
 Wei, X. E. (2013). Impacts of item parameter drift on person ability estimation in multistage testing. Technical report.
 Wells, C. S., Subkoviak, M. J., & Serlin, R. (2002). The effect of item parameter drift on examinee ability estimates. Applied Psychological Measurement, 26.

23 Thank you! For further information please contact: Ming Lei mlei@air.org

