Comparability Challenges and Solution Approaches for Emerging Assessment and Accountability Options
Brian Gong, Center for Assessment
Presentation in the session "Establishing Comparability of Assessment Results in an Era of Flexibility"
CCSSO National Conference on Student Assessment, June 28, 2017, Austin, TX
Overview
- Description of the context of new calls for comparability
- Three issues, with some possible solution approaches
- Summary
Comparability - Gong
Comparability & interpretation
- The key to comparability is interpretation and use: we want enough comparability to support our intended interpretations and uses.
- The measurement field has deep knowledge about what affects comparability, what types of interpretations can be supported, and what methods may be used to promote and evaluate comparability of scores and interpretations.
- However, new desired uses/contexts and new types of assessments challenge us to consider what we mean by "comparable" and how to support interpretations of comparability with new methods.
"Comparability" – We assume it
- In almost every test interpretation and use today, we assume that test scores are comparable:
  - We aggregate scores.
  - We interpret trends in performance over time.
  - We compare individuals and groups to each other.
  - We produce derivative scores that assume we can operate mathematically on multiple scores (e.g., indices, growth, value-added).
  - We make policy decisions and take practical actions on the basis of these test-score interpretations (e.g., school accountability, teacher evaluation, instructional intervention).
BUT… we are uneasy
- Because we also want many conditions that are not strictly the same (standardized):
  - Different test forms for security
  - Different test forms for efficiency (e.g., CAT)
  - Different test forms for validity (sampling of the domain)
  - Different items, translations/formats, cognitive demand, and administration conditions for validity (accommodations, special populations)
  - Different tests for policy and other reasons (each state; ACT/SAT; NAEP; TIMSS/PISA; AP/IB; Common Core?)
  - Different tests across time
  - Different tests across populations
  - Different tests across time and populations
In addition, we want
- Different content/skills as grades progress
- Individual choice for production, application, and specialization
- Individualized information for diagnosis and program evaluation for individuals, subgroups, and programs
Our dilemma
- We want to act as though test scores were strictly comparable, but
- We also want many conditions that prohibit making the tests and/or testing conditions the same, and in some cases we know the same items are invalid for different individuals.
- So…
  - How can we conceptually understand the dimensions that inform our interpretations and uses?
  - What technical tools and approaches are available to support us in making interpretations that involve "comparability of test scores"?
New options, new flexibility
- Multiple tests that serve roughly the same purpose but share no items, relying on special studies to establish comparability (e.g., a state high school test and college entrance exams)
- Multiple tests that are quite different in purpose and share no items (e.g., a state test and a commercial interim assessment, or another commercial assessment such as the OECD district-level PISA alongside the state test)
- Tests that may allow references from one testing program to another by sharing items (e.g., drawing on openly available item banks with sufficient information to link to scales and/or performance levels)
Why might a state want this type of flexibility?
- Researchers have mapped state proficiency cuts to NAEP and will likely continue to do so, enabling state-to-NAEP and, indirectly, state-to-state comparisons of proficiency.
- A state might want item-level linking because it wants:
  - Comparisons to a scale other than NAEP
  - Comparisons at the scale-score level
  - Control over, and detailed knowledge of, the technical aspects
  - Control over the timing, interpretation, and publicity
- Such a state needs trusted resources to do the linking to an external test, because it cannot develop them on its own.
Comparability Continua
- Content comparability (content basis of test variations), from less to more: same content area → same content standards → same test specs → same test items
- Score comparability (score level), from less to more: pass/fail score or decision → achievement level score → scale score → raw score
Comparability Continua – 2
- Population comparability (population characteristics), from less to more: adjusted in interpretations → adjusted characteristics → similar characteristics → same students
- Reporting-level comparability (level of reporting unit), from less to more: state → district → school → student
Context of interpretation and use
- We can solve some of our problems by better specifying what we mean.
- We don't always need to create comparability at the "more" end of the continuum for content or scores.
  - Example: accountability is social valuing; it may not need comparable test scores from the assessment (e.g., 1%, 2%, ELP, very-low on-grade assessments)
  - Example: a claim about comparable achievement-level performance at the state level
Item-bank linking: a researchable task
- "Extreme linking" (Dorans) is commonly done, with appropriate safeguards and checks.
- Other challenges to traditional linking, notably CAT, have been researched, and acceptable solutions have led to wide use (e.g., parameter invariance over item order, test length, time of administration, etc.).
- Similarly, other item-bank solutions will need to specify under which conditions what types of comparability can be maintained, and show that empirically, but this is an exciting option.
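To make the linking idea concrete: one standard technique from the linking/equating literature is equipercentile linking, which maps a score on test X to the test Y score with the same percentile rank in a common (or equivalent) group. The sketch below is illustrative only, not a method from this presentation; the function name and the simple empirical-quantile approach are my assumptions, and an operational linking study would add smoothing, standard errors, and the safeguards noted above.

```python
import numpy as np

def equipercentile_link(scores_x, scores_y, x_points):
    """Illustrative equipercentile linking (equivalent-groups design).

    scores_x, scores_y: observed score samples on tests X and Y.
    x_points: X-scale scores to convert to the Y scale.
    Returns the Y scores whose percentile ranks match those of x_points.
    """
    scores_x = np.sort(np.asarray(scores_x, dtype=float))
    scores_y = np.asarray(scores_y, dtype=float)
    # Percentile rank of each x_point within the X distribution
    pr = np.searchsorted(scores_x, x_points, side="right") / len(scores_x)
    # Y score at the same percentile rank (empirical quantile of Y)
    return np.quantile(scores_y, np.clip(pr, 0.0, 1.0))
```

A quick sanity check: if test Y is simply test X shifted up by 10 points, the linking function should recover (approximately) that 10-point shift, and converted scores should stay in rank order.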
Summary
- Use the flexibility available to achieve your policy goals and intended uses.
- Specify, specify, specify, so you know what is comparable and what is not, by intention or by constraint. (Where are you on the continua of content comparability, score comparability, population comparability, and reporting-unit comparability?)
- Validity and strict comparability may not go together.
- Use the tools available to support appropriate comparability:
  - Focus on valid interpretations as well as technical demands.
  - Empirically check your results.
Questions? Comments? Thank you!
Brian Gong
bgong@nciea.org