
A Critical Evaluation of Diagnostic Score Reporting: Some Theory and Applications
Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman
Copyright 2009 by Educational Testing Service. All rights reserved. No reproduction in whole or in part is permitted without the express written permission of the copyright owner.
Paper presented at the Statistical and Applied Mathematical Sciences Institute, July 9, 2009.

"Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the scores." (Paul W. Holland, 2001)

Few wish to assess others, fewer still wish to be assessed, but everyone wants to see the diagnostic scores.

Outline
- Examples of diagnostic score reports
- Approaches to reporting diagnostic scores
- Problems with existing diagnostic scores in education
- A method to evaluate whether diagnostic scores have added value
- Applications of the method to operational test data
- Conclusions and recommendations

What Are Diagnostic Scores?
Diagnostic scores are scores on any meaningful cluster of items (subtests). Typically, they are scores on content areas. For example, on a test for prospective teachers of children, the diagnostic scores are the scores on the content areas Reading, Science, Social Studies, and Mathematics.


Subscores, Augmented Subscores, and the Objective Performance Index
- Subscores: raw or percent-correct scores on the subtests.
- Augmented subscore (Wainer et al., 2001): a weighted average of the subscore of interest (e.g., Reading) and the other subscores (e.g., Science, Social Studies, and Mathematics).
- Objective Performance Index (Yen, 1987): a weighted average of (i) the observed subscore and (ii) an estimate of the subscore based on the examinee's overall test performance.
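To make the weighted-average form concrete, here is a minimal sketch of an augmented subscore; the scores and weights are hypothetical, since in Wainer et al.'s (2001) procedure the weights are estimated from the subscore covariance structure rather than fixed in advance.

```python
# A minimal sketch of the form of an augmented subscore: a weighted average
# of the subscore of interest and the other subscores. The weights and scores
# below are hypothetical; operationally the weights are estimated from the
# subscore covariance structure (Wainer et al., 2001).

scores = {"reading": 12.0, "science": 9.0, "social_studies": 10.0, "math": 8.0}

# Hypothetical weights for augmenting Reading: the target subscore dominates,
# and the correlated subscores borrow strength through smaller weights.
weights = {"reading": 0.55, "science": 0.15, "social_studies": 0.15, "math": 0.15}

augmented_reading = sum(weights[k] * scores[k] for k in scores)
print(f"Augmented Reading subscore: {augmented_reading:.2f}")  # 10.65
```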

Cognitive Diagnostic Models (CDMs)
Assumptions:
- Solving each test item requires one or more skills, specified in a Q-matrix.
- Each examinee has a latent ability parameter corresponding to each of the skills.
- The probability of a correct response depends on the skills the item requires and the examinee's ability parameters.
The ability estimates are the diagnostic scores.
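As a concrete illustration of these assumptions, below is a minimal sketch of the item response probability of the DINA model (Junker & Sijtsma, 2001), one of the CDMs listed on the next slide; the Q-matrix and the slip and guess parameters are hypothetical.

```python
# A minimal sketch of the DINA model's item response probability
# (Junker & Sijtsma, 2001). Q-matrix and slip/guess values are hypothetical.
import numpy as np

Q = np.array([[1, 0],   # item 1 requires skill 1
              [0, 1],   # item 2 requires skill 2
              [1, 1]])  # item 3 requires both skills
slip = np.array([0.10, 0.15, 0.20])   # P(incorrect | all required skills mastered)
guess = np.array([0.20, 0.25, 0.10])  # P(correct | some required skill missing)

def p_correct(alpha):
    """P(correct response) on each item for a binary skill profile alpha."""
    has_required_skills = np.all(Q <= alpha, axis=1)
    return np.where(has_required_skills, 1 - slip, guess)

print(p_correct(np.array([1, 0])))  # masters skill 1 only -> [0.90, 0.25, 0.10]
```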

Examples of CDMs
- Rule Space Method (RSM; Tatsuoka, 1983, 2009): an early attempt at diagnostic scoring.
- Attribute Hierarchy Method (Leighton, Gierl, & Hunka, 2004): an extension of the RSM.
- The DINA and NIDA models (Junker & Sijtsma, 2001).
- Multiple classification latent class model (Maris, 1999).
- General diagnostic model (GDM; von Davier, 2008).
- Reparameterized unified model (RUM; Hartz, 2002; Roussos et al., 2007).

Examples of CDMs (continued)
- Bayesian networks (Almond et al., 2007).
- Multidimensional item response theory (de la Torre & Patz, 2005; Yao & Boughton, 2007).
- Multicomponent latent trait model (e.g., Embretson, 1997).
- The higher-order latent trait model (de la Torre, 2005).
- The DINO and NIDO models.
Many excellent reviews of CDMs exist (e.g., Rupp & Templin, 2008; von Davier et al., 2008; DiBello, Roussos, & Stout, 2007).

Is It Possible to Report High-Quality Diagnostic Scores for Existing Educational Tests?
Standards 1.12, 2.1, 5.12, etc., of the Standards for Educational and Psychological Testing (1999) demand evidence of adequate reliability, validity, and distinctness of diagnostic scores.

Classical Test Theory
Let x be the test score. Partition x as x = x_t + x_e, where E(x_e) = 0, Cov(x_t, x_e) = 0, V(x_e) = σ_e², and V(x_t) = σ_t².
Reliability = the correlation between scores on a test and a parallel form of the test = ρ²(x, x_t).
Validity measures the extent to which a test is doing the job it is supposed to do (e.g., the correlation between x and a criterion score y).
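A minimal simulation, with hypothetical variances, illustrates why the parallel-forms correlation equals ρ²(x, x_t) = σ_t² / (σ_t² + σ_e²):

```python
# Simulate the CTT decomposition x = x_t + x_e for two parallel forms and check
# that their correlation estimates the reliability
# rho^2(x, x_t) = sigma_t^2 / (sigma_t^2 + sigma_e^2). All values hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x_t = rng.normal(0.0, 2.0, n)           # true scores: sigma_t^2 = 4
form_a = x_t + rng.normal(0.0, 1.0, n)  # parallel form A: sigma_e^2 = 1
form_b = x_t + rng.normal(0.0, 1.0, n)  # parallel form B: same error variance

print(np.corrcoef(form_a, form_b)[0, 1])  # approximately 4 / (4 + 1) = 0.80
```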

Is It Possible to Report High-Quality Diagnostic Scores? (continued)
Diagnostic scores in educational tests most often:
- are based on few items yet cover broad domains, so they have low reliability;
- are highly correlated with one another;
- are outcomes of retrofitting (a diagnostic structure imposed on a test built to measure a single construct).

Is It Possible to Report High-Quality Diagnostic Scores? (continued)
Luecht, Gierl, Tan, & Huff (2006): "Inherently unidimensional item and test information cannot be decomposed to produce useful multidimensional score profiles—no matter how well intentioned or which psychometric model is used to extract the information. Our obvious recommendation is not to try to extract something that is not there."

An Empirical Check of the Reliability of Diagnostic Scores

An Empirical Check of the Reliability of Diagnostic Scores (continued)
- Of the 6,035 examinees who scored 4 (the first quartile) or lower on Science on Form A, 49 percent scored higher than the first quartile on Science on Form B.
- Of the 383 examinees who scored 8 (the third quartile) on Mathematics and 4 on Science on Form A, 32 percent had a Science score higher than or equal to their Mathematics score on Form B.
- r(Science A, Science B) = 0.48; r(Science A, Total B) = 0.63.

A Method Based on Classical Test Theory (Haberman, 2008)
- Compute the PRMSE (proportional reduction in mean squared error) of (i) the subscore, which equals the subscore reliability, and (ii) the total score.
- A subscore has added value over the total score only if the PRMSE of the subscore is larger than the PRMSE of the total score.
- Equivalently, a subscore has added value if it can be predicted better by the corresponding subscore on a parallel form than by the total score on that parallel form.

A Method Based on Classical Test Theory (continued)
Subscore: s = s_t + s_e; total score: x = x_t + x_e.
PRMSE for the subscore = ρ²(s, s_t) = the subscore reliability.
PRMSE for the total score = ρ²(x, s_t) = ρ²(x, x_t) ρ²(x_t, s_t).
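These quantities need only summary statistics: the subscore reliability, the total-score reliability, and the disattenuated (true-score) correlation ρ(x_t, s_t) between subscore and total score. A minimal sketch with hypothetical inputs:

```python
# Haberman's (2008) added-value check from summary statistics:
#   PRMSE_sub   = rho^2(s, s_t)  (the subscore reliability)
#   PRMSE_total = rho^2(x, x_t) * rho^2(x_t, s_t)
# All input numbers below are hypothetical.

def has_added_value(rel_s: float, rel_x: float, disatt_corr: float) -> bool:
    """True if PRMSE_sub exceeds PRMSE_total, i.e., the subscore is worth reporting."""
    prmse_sub = rel_s                     # rho^2(s, s_t)
    prmse_total = rel_x * disatt_corr**2  # rho^2(x, x_t) * rho^2(x_t, s_t)
    return prmse_sub > prmse_total

# A reliable and distinct subscore adds value; an indistinct one does not.
print(has_added_value(rel_s=0.70, rel_x=0.92, disatt_corr=0.80))  # True  (0.70 > 0.59)
print(has_added_value(rel_s=0.70, rel_x=0.92, disatt_corr=0.93))  # False (0.70 < 0.80)
```

The two calls preview the survey finding reported below: added value requires a sufficiently reliable subscore that is sufficiently distinct from the rest of the test.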

A Method Based on Classical Test Theory (continued)
- A weighted average of the subscore and the total score (e.g., 0.4 × Reading + 0.2 × Total) can be reported if its PRMSE is large enough.
- Such a weighted average is a special case of the augmented subscore (Wainer et al., 2001).
- The computations require only simple summary statistics.

What About Validity?
- Reliability is an aspect of construct validity.
- Recent work by Haberman (2008) shows that a subscore that is not distinct or not reliable has limited value with respect to validity.
- Thus, the method also examines whether the subscores have adequate validity (though additional validity studies are recommended).

An Example: GRE Subject Test in Biology
[Table: PRMSE_sub (= reliability), PRMSE_total, and PRMSE_wtd for the three subscores: Cellular & Molecular, Organismal, and Ecology & Evolution.]

Results from a Survey of Operational Data (Sinharay, 2009)
[Table: for each test, the number of subscores, average subscore length, average subscore reliability, average disattenuated correlation, the number of subscores with added value, and the number of weighted averages with added value. Tests surveyed: Old SAT-V, Sch. Std. Prg: Eng, DSTP (8th gr. Math), Teachers of Math, Old SAT, Praxis Series™, SweSAT.]

[Figure: Percent of subscores with added value, by subscore length and average disattenuated correlation.]

[Figure: Percent of subscores with added value, by average subscore reliability and average disattenuated correlation.]

[Figure: Percent of weighted averages with added value.]

Main Findings from the Survey of Operational Data
- More than 50% of the tests had no subscore with added value.
- Weighted averages had added value more often than subscores.
- The subscores that had added value were based on a sufficient number of items (20 or more) and were sufficiently distinct from the other subscores (disattenuated correlation less than 0.9).

Reporting of Aggregate-Level Subscores
To determine whether aggregate-level subscores (e.g., subscores reported for schools or districts) have added value, use an approach based on PRMSEs similar to the one used for individual-level subscores. The computation of the PRMSEs is slightly different: it is based on between-aggregate and within-aggregate sums of squares.

A Method Based on Classical Test Theory (continued)
s = s_A + s_e; x = x_A + x_e, where s_A and x_A are the aggregate-level true scores, s_av and x_av are the aggregate means, and n is the number of examinees in the aggregate.
PRMSE for the aggregate-level subscore = ρ²(s_av, s_A) = σ²(s_A) / [σ²(s_A) + σ²(s_e)/n].
PRMSE for the aggregate-level total score = ρ²(x_av, s_A) = ρ²(x_av, x_A) ρ²(x_A, s_A).
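A minimal sketch of the aggregate-level subscore PRMSE, with hypothetical variance components, shows how the error variance of the aggregate mean shrinks with aggregate size n:

```python
# Aggregate-level PRMSE for the mean subscore of an aggregate of n examinees:
#   rho^2(s_av, s_A) = sigma^2(s_A) / (sigma^2(s_A) + sigma^2(s_e) / n)
# Variance components below are hypothetical.

def prmse_aggregate_subscore(var_s_A: float, var_s_e: float, n: int) -> float:
    return var_s_A / (var_s_A + var_s_e / n)

for n in (1, 10, 50):
    print(n, round(prmse_aggregate_subscore(var_s_A=2.0, var_s_e=9.0, n=n), 3))
# 1 -> 0.182   10 -> 0.69   50 -> 0.917
```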

Reporting of Subscores Based on MIRT Models
Fit, using a stabilized Newton-Raphson method (Haberman, von Davier, & Lee, 2008), a MIRT model whose item response function involves a latent ability θ_i corresponding to each subscore i.
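The slide's item response function is not reproduced in this transcript. As a stand-in only, the sketch below uses a generic simple-structure two-parameter logistic MIRT form, in which each item loads on the single dimension θ_i for its subscore; the exact function used by Haberman, von Davier, and Lee (2008) may differ, and the parameters are hypothetical.

```python
# A generic simple-structure 2PL MIRT item response function, shown as a
# plausible stand-in (the model actually fit on this slide may differ).
import math

def irf(theta_i: float, a_j: float, b_j: float) -> float:
    """P(correct on item j | theta_i), for item j assigned to subscore dimension i."""
    return 1.0 / (1.0 + math.exp(-(a_j * theta_i - b_j)))

print(irf(theta_i=0.5, a_j=1.2, b_j=0.3))  # ~0.574
```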

Reporting of Subscores Based on MIRT (continued)
- The diagnostic scores are the posterior means of the ability parameters.
- Calculate the proportional reduction in mean squared error for these scores (PRMSE_MIRT).
- Compare PRMSE_MIRT with PRMSE_wtd to examine whether MIRT does better than CTT.

Results for Aggregate-Level Subscores and MIRT-Based Subscores
- Aggregate-level subscores, just like individual-level subscores, rarely have added value.
- PRMSE_MIRT was very close to PRMSE_wtd for the several tests we examined.

Conclusions and Recommendations
- Most of the existing diagnostic scores on educational tests lack quality.
- Evidence of adequate reliability, validity, and distinctness of the diagnostic scores should be provided.
- If a CDM is used, it should be demonstrated that the model parameters can be estimated reliably and in a timely manner, and that the model fits the data better than a simpler model.

Conclusions and Recommendations (continued)
- To report meaningful diagnostic scores for some tests, it may be necessary to change the test structure using assessment engineering practices (Luecht et al., 2006).
- Alternatives: scale anchoring (Beaton & Allen, 1992) and item mapping (Zwick et al., 2001).

References for the Haberman Method
- Haberman (2008). Journal of Educational and Behavioral Statistics.
- Sinharay, Haberman, & Puhan (2007). Educational Measurement: Issues and Practice.
- Sinharay & Haberman (2008). Measurement.
- Haberman, Sinharay, & Puhan (2009). British Journal of Mathematical and Statistical Psychology.

References for the Haberman Method (continued)
- Puhan, Sinharay, Haberman, & Larkin (in press). Applied Measurement in Education.
- Sinharay (2009). ETS Research Report.
- Haberman & Sinharay (2009). ETS Research Report.
- Sinharay, Puhan, & Haberman (2009). Invited presentation at the annual meeting of NCME.