
1 A Critical Evaluation of Diagnostic Score Reporting: Some Theory and Applications. Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman. Copyright 2009 by Educational Testing Service. All rights reserved. No reproduction in whole or in part is permitted without the express written permission of the copyright owner. Paper presented at the Statistical and Applied Mathematical Sciences Institute, 9 July 2009.

2 Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the scores. Paul W. Holland (2001)

3 Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the diagnostic scores.

4 Outline Examples of diagnostic score reports. Approaches to reporting diagnostic scores. Problems with existing diagnostic scores in education. A method to evaluate whether diagnostic scores have added value. Applications of the method to operational test data. Conclusions and recommendations.

5 Confidential and Proprietary. Copyright © 2007 by Educational Testing Service. What Are Diagnostic Scores? Diagnostic scores are scores on any meaningful cluster of items (subtests); typically, they are scores on content areas. For example, on a test for prospective teachers of children, the diagnostic scores are the scores on the content areas Reading, Science, Social Studies, and Mathematics.

6

7

8

9 Subscores, Augmented Subscores, and Objective Performance Index. Subscores: raw/percent scores on the subtests. Augmented subscore (Wainer et al., 2001): a weighted average of the subscore of interest (e.g., reading) and the other subscores (e.g., science, social studies, and mathematics). Objective Performance Index (Yen, 1987): a weighted average of (i) the observed subscore and (ii) an estimate of the subscore based on the examinee's overall test performance.
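The augmented-subscore idea can be sketched in a few lines. The weights and scores below are hypothetical placeholders; Wainer et al. (2001) estimate the weights from the observed covariance structure of the subscores rather than fixing them in advance.

```python
# Toy sketch of an augmented subscore: a weighted average of the subscore
# of interest and information borrowed from the remaining subscores.
# The weights are hypothetical; Wainer et al. (2001) derive them from the
# covariance matrix of the subscores (an empirical-Bayes regression).
def augmented_subscore(target, others, w_target=0.6, w_others=0.4):
    borrowed = sum(others) / len(others)  # average of the other subscores
    return w_target * target + w_others * borrowed

# A hypothetical examinee's percent-correct subscores
reading, science, social, math = 0.70, 0.80, 0.75, 0.65
print(round(augmented_subscore(reading, [science, social, math]), 3))  # 0.713
```

The borrowing pulls a weak reading subscore toward the examinee's overall performance, which is what gives the augmented score its higher reliability.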

10 Cognitive Diagnostic Models (CDMs). Assumptions: solving each test item requires one or more skills (specified in a Q-matrix); each examinee has a latent ability parameter corresponding to each skill; the probability of a score on an item depends on the skills the item requires and the examinee's ability parameters. The ability estimates are the diagnostic scores.
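As one concrete instance of these assumptions, here is a minimal sketch of the DINA model (Junker & Sijtsma, 2001): each row of the Q-matrix lists the skills an item requires, and an examinee answers correctly with probability 1 − slip when holding all required skills, and with the guessing probability otherwise. The Q-matrix, skill vector, and slip/guess values here are hypothetical.

```python
# Toy sketch of the DINA model (Junker & Sijtsma, 2001), one instance of the
# CDM assumptions above. q_row lists the skills an item requires; alpha is
# the examinee's binary skill-mastery vector. All parameter values are
# hypothetical.
def p_correct(q_row, alpha, slip=0.1, guess=0.2):
    has_all = all(alpha[k] == 1 for k in q_row)  # masters every required skill?
    return 1 - slip if has_all else guess

Q = [[0], [1, 2], [0, 2]]   # Q-matrix: skills required by each of 3 items
alpha = [1, 1, 0]           # examinee masters skills 0 and 1, but not 2
print([p_correct(q, alpha) for q in Q])   # [0.9, 0.2, 0.2]
```

Fitting a CDM amounts to estimating the slip/guess parameters and each examinee's skill vector from the observed responses; those estimated skill profiles are the diagnostic scores.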

11 Examples of CDMs. Rule Space Method (RSM; Tatsuoka, 1983, 2009): an early attempt at diagnostic scoring. Attribute Hierarchy Method (Leighton, Gierl, & Hunka, 2004): an extension of the RSM. The DINA and NIDA models (Junker & Sijtsma, 2001). Multiple classification latent class model (Maris, 1999). General diagnostic model (GDM; von Davier, 2008). Reparameterized unified model (RUM; Hartz, 2002; Roussos et al., 2007).

12 Examples of CDMs…Continued. Bayesian networks (Almond et al., 2007). Multidimensional item response theory (de la Torre & Patz, 2005; Yao & Boughton, 2007). Multicomponent latent trait model (e.g., Embretson, 1997). The higher-order latent-trait model (de la Torre, 2005). The DINO and NIDO models. Many excellent reviews of CDMs exist (e.g., Rupp & Templin, 2008; von Davier et al., 2008; DiBello, Roussos, & Stout, 2007).

13 Is It Possible to Report High-quality Diagnostic Scores for Existing Educational Tests? Standards 1.12, 2.1, 5.12, etc., of the Standards for Educational and Psychological Testing (1999) demand evidence of adequate reliability, validity, and distinctness of diagnostic scores.

14 Classical Test Theory. Let x = test score. Partition x as x = x_t + x_e, with E(x_e) = 0, Cov(x_t, x_e) = 0, V(x_e) = σ²_e, and V(x_t) = σ²_t. Reliability = the correlation between scores on a test and a parallel form of the test = ρ²(x, x_t). Validity measures the extent to which a test is doing the job it is supposed to do (e.g., the correlation between x and a criterion score y).

15 Is It Possible to Report High-quality Diagnostic Scores?...Cont'd. Diagnostic scores on educational tests most often are based on few items yet cover broad domains (hence low reliability), are highly correlated with one another, and are outcomes of retrofitting.

16 Is It Possible to Report High-quality Diagnostic Scores?...Cont'd. Luecht, Gierl, Tan & Huff (2006): "Inherently unidimensional item and test information cannot be decomposed to produce useful multidimensional score profiles—no matter how well intentioned or which psychometric model is used to extract the information. Our obvious recommendation is not to try to extract something that is not there."

17 An Empirical Check of Reliability of Diagnostic Scores

18 An Empirical Check of Reliability of Diagnostic Scores…continued. Of the 6,035 examinees who scored 4 (the 1st quartile) or lower on Science on Form A, 49 percent scored higher than the 1st quartile on Science on Form B. Of the 383 examinees who scored 8 (the 3rd quartile) on Math and 4 on Science on Form A, 32 percent had a Science score higher than or equal to their Math score on Form B. r(Science A, Science B) = 0.48; r(Science A, Total B) = 0.63.

19 A Method Based on Classical Test Theory (Haberman, 2008). Compute the PRMSE (proportional reduction in mean squared error) achieved in predicting the true subscore by (i) the subscore (this PRMSE equals the subscore reliability) and (ii) the total score. A subscore has added value over the total score only if the PRMSE of the subscore is larger than the PRMSE of the total score. Equivalently, a subscore has added value if the subscore can be predicted better by the corresponding subscore on a parallel form than by the total score on the parallel form.

20 A Method Based on Classical Test Theory…Continued. Subscore s = s_t + s_e; total score x = x_t + x_e. PRMSE for the subscore = ρ²(s, s_t) = subscore reliability. PRMSE for the total score = ρ²(x, s_t) = ρ²(x, x_t) ρ²(x_t, s_t).
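The two PRMSEs can be computed from simple summary statistics. The sketch below uses hypothetical reliability and correlation values; the PRMSE of the total score is obtained as the product of the total-score reliability and the squared disattenuated (true-score) correlation, following the product formula on this slide.

```python
from math import sqrt

def prmse_analysis(rel_sub, rel_total, corr_sub_total):
    """Haberman's added-value check from summary statistics.

    rel_sub        : reliability of the subscore (= PRMSE of the subscore)
    rel_total      : reliability of the total score
    corr_sub_total : observed correlation between subscore and total score
    """
    prmse_sub = rel_sub                       # rho^2(s, s_t) = reliability
    # Disattenuated (true-score) correlation rho(x_t, s_t)
    rho_true = corr_sub_total / sqrt(rel_sub * rel_total)
    # PRMSE(total) = rho^2(x, x_t) * rho^2(x_t, s_t)
    prmse_total = rel_total * rho_true ** 2
    return prmse_sub, prmse_total, prmse_sub > prmse_total

# Hypothetical values: a fairly reliable, fairly distinct subscore
sub, tot, added_value = prmse_analysis(0.85, 0.92, 0.80)
print(round(sub, 3), round(tot, 3), added_value)  # 0.85 0.753 True
```

With these hypothetical numbers the subscore predicts its own true score better than the total score does (0.85 vs. about 0.75), so it would be judged to have added value; raising the correlation toward 1 reverses the verdict.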

21 A Method Based on Classical Test Theory…Continued. One can report a weighted average of the subscore and the total score (e.g., 0.4 × Reading + 0.2 × Total) if its PRMSE is large enough. This is a special case of the augmented subscore (Wainer et al., 2001). The computations need only simple summary statistics.

22 What About Validity? Reliability is an aspect of construct validity. Recent work of Haberman (2008) shows that a subscore that is not distinct or not reliable has limited value with respect to validity. Thus, the method also examines whether the subscores have adequate validity (though additional validity studies are recommended).

23 An Example: GRE Subject Biology Subscores. PRMSEs for the subscore (PRMSE_sub = reliability), the total score (PRMSE_total), and the weighted average (PRMSE_wtd):
Cellular & Molecular: PRMSE_sub = .89, PRMSE_total = .78, PRMSE_wtd = .91.
Organismal: PRMSE_sub = .85, PRMSE_total = .89, PRMSE_wtd = .91.
Ecology & Evolution: PRMSE_sub = .87, PRMSE_total = .79, PRMSE_wtd = .89.

24 Results from a Survey of Operational Data (Sinharay, 2009). For each test: number of subscores, average subscore length, average subscore reliability, average disattenuated correlation, number of subscores with added value, and number of weighted averages with added value:
Old SAT-V: 3 subscores, average length 26, average reliability 0.79, average disattenuated correlation 0.95; 0 subscores and 1 weighted average with added value.
Sch. Std. Prg: Eng: 4 subscores, average length 15, average reliability 0.70, average disattenuated correlation 0.98; 0 subscores and 0 weighted averages with added value.
DSTP (8th gr. Math): 4 subscores, average length 19, average reliability 0.77, average disattenuated correlation 1.00; 0 subscores and 0 weighted averages with added value.
Teachers of math.: 3 subscores, average length 16, average reliability 0.62, average disattenuated correlation 0.95; 0 subscores and 0 weighted averages with added value.
Old SAT: 2 subscores, average length 69, average reliability 0.92, average disattenuated correlation 0.76; 2 subscores and 2 weighted averages with added value.
Praxis Series™: 4 subscores, average length 25, average reliability 0.72, average disattenuated correlation 0.78; 2 subscores and 4 weighted averages with added value.
SweSAT: 5 subscores, average length 24, average reliability 0.78, average disattenuated correlation 0.69; 4 subscores and 5 weighted averages with added value.

25 Percent of subscores with added value for different subscore length and average disattenuated correlation

26 Percent of subscores with added value for different average subscore reliability and average disattenuated correlation

27 Percent of weighted averages with added value

28 Main Findings from the Survey of Operational Data. More than 50% of the tests had no subscore with added value. Weighted averages had added value more often than subscores. The subscores that did have added value were based on a sufficient number of items (20 or more) and were sufficiently distinct from each other (disattenuated correlation less than 0.9).

29 Reporting of Aggregate-level Subscores. To determine whether aggregate-level subscores have added value, use an approach based on PRMSEs similar to that used for individual-level subscores. The computation of the PRMSEs is slightly different and is based on between-aggregation and within-aggregation sums of squares.

30 A Method Based on Classical Test Theory…Continued. s = s_A + s_e; x = x_A + x_e. PRMSE for the aggregate-level subscore = ρ²(s_av, s_A) = σ²(s_A) / [σ²(s_A) + σ²(s_e)/n]. PRMSE for the aggregate-level total score = ρ²(x_av, s_A) = ρ²(x_av, x_A) ρ²(x_A, s_A).
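In the aggregate-level subscore PRMSE, averaging over the n examinees in an aggregate (e.g., a school) shrinks the error variance by a factor of n. A minimal sketch with hypothetical variance components:

```python
def prmse_aggregate_subscore(var_between, var_within, n):
    """PRMSE of an aggregate-level (e.g., school-level) mean subscore.

    var_between : sigma^2(s_A), variance of the true aggregate subscores
    var_within  : sigma^2(s_e), within-aggregate error variance
    n           : number of examinees per aggregate
    (All values below are hypothetical.)
    """
    return var_between / (var_between + var_within / n)

print(prmse_aggregate_subscore(4.0, 36.0, 1))    # 0.1: one examinee per school
print(prmse_aggregate_subscore(4.0, 36.0, 100))  # ~0.917: error averages out
```

Even so, the aggregate-level total score benefits from the same averaging, which is why aggregate-level subscores rarely gain added value (as the results below show).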

31 Reporting of Subscores Based on MIRT Models. Fit, using a stabilized Newton–Raphson method (Haberman, von Davier, & Lee, 2008), a MIRT model with an item response function in which θ_i corresponds to subscore i.

32 Reporting of Subscores Based on MIRT…Continued. The diagnostic scores are the posterior means of the ability parameters. Calculate the proportional reduction in mean squared error (PRMSE_MIRT). Compare PRMSE_MIRT to PRMSE_wtd to examine whether MIRT does better than CTT.

33 Results for Aggregate-level Subscores and MIRT-based Subscores. Aggregate-level subscores, just like individual-level subscores, rarely have added value. PRMSE_MIRT was very close to PRMSE_wtd for the several tests we examined.

34 Conclusions and Recommendations. Most of the existing diagnostic scores on educational tests lack quality. Evidence of adequate reliability, validity, and distinctness of the diagnostic scores should be provided. If a CDM is used, it should be demonstrated that the model parameters can be estimated reliably and in a timely manner and that the model fits the data better than a simpler model.

35 Conclusions and Recommendations. To report meaningful diagnostic scores for some tests, it may be necessary to change the test structure by using assessment engineering practices (Luecht et al., 2006). Alternatives: scale anchoring (Beaton & Allen, 1992) and item mapping (Zwick et al., 2001).

36 References for the Haberman Method. Haberman (2008). Journal of Educational and Behavioral Statistics. Sinharay, Haberman, & Puhan (2007). Educational Measurement: Issues and Practice. Sinharay & Haberman (2008). Measurement. Haberman, Sinharay, & Puhan (2009). British Journal of Mathematical and Statistical Psychology.

37 References for the Haberman Method…Continued. Puhan, Sinharay, Haberman, & Larkin (in press). Applied Measurement in Education. Sinharay (2009). ETS Research Report. Haberman & Sinharay (2009). ETS Research Report. Sinharay, Puhan, & Haberman (2009). Invited presentation at the annual meeting of NCME.

