Comparability of Assessment Results in the Era of Flexibility


1 Comparability of Assessment Results in the Era of Flexibility
Jessica Baghian, Louisiana Department of Education
Brian Gong, Center for Assessment
Jeffrey Nellhaus, Parcc Inc.
CCSSO National Conference on Student Assessment
Austin, Texas
June 28, 2017

2 Topics for this Session
Jeff Nellhaus, Parcc Inc.
How comparable are state and NAEP standards for proficiency?
What purposes are served by comparability?
Jessica Baghian, Louisiana Department of Education
How has Louisiana achieved comparability to other states?
Why is comparability to other states important to Louisiana?
Brian Gong, Center for Assessment
What are the challenges for achieving comparability?
What approaches can be taken to address the challenges?

3 How Comparable are State and NAEP Standards for Proficient Performance?
In 2005, NCES began to report states' cut scores for proficiency in terms of their NAEP scale score equivalents.
For example, if a state reported 60% of its students performing Proficient or above on its grade 8 math assessment, NCES determined what the cut score on the NAEP scale would have to be for 60% of the state's students to perform Proficient or above on the NAEP grade 8 math test.
The NCES report helped answer the question: "Is State X's standard for proficiency comparable to NAEP's standard for proficiency?"
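The sketch below is a minimal illustration of the equipercentile idea behind this mapping, assuming we have NAEP scale scores for a representative sample of a state's students; the function name and the simulated score distribution are illustrative only, and NCES's actual procedure additionally accounts for NAEP's sampling design and measurement error.

```python
import numpy as np

def naep_equivalent_cut(naep_scores, pct_proficient_on_state_test):
    # If, say, 60% of the state's students are Proficient or above on the
    # state test, the NAEP-equivalent cut is the score that 60% of the
    # state's students meet or exceed on NAEP, i.e. the 40th percentile
    # of the state's NAEP score distribution.
    return np.percentile(naep_scores, 100 - pct_proficient_on_state_test)

# Illustrative use with simulated scores (not real NAEP data).
rng = np.random.default_rng(0)
simulated_naep_scores = rng.normal(loc=275, scale=36, size=5000)
print(round(naep_equivalent_cut(simulated_naep_scores, 60)))
```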

4 NAEP Scale Score Equivalent of State Cut Scores for Proficient Performance: 2009, Grade 8 Mathematics
[Figure: NAEP-equivalent cut scores by state; chart annotations: 299, 300, 262, 229] *See slide 12 for notes and sources.

5 The Good News…
Between 2009 and 2015, states' cut scores for proficiency, in terms of their NAEP scale score equivalents, became:
Closer to the NAEP cut score for proficiency
Higher on average
Closer to each other
While most states' cut scores for proficiency in 2015 remained in NAEP's Basic level, between 2009 and 2015 the number of states with cut scores in NAEP's:
Below Basic level decreased
Proficient level increased

6 Change in States’ NAEP Equivalent Cut Scores for Proficiency 2009–2015
*See slide 12 for notes and sources.

7 NAEP Scale Score Equivalent of State Cut Scores for Proficient Performance: 2009, Grade 8 Mathematics
[Figure: NAEP-equivalent cut scores by state; chart annotations: 299, 300, 262, 229] *See slide 12 for notes and sources.

8 NAEP Scale Score Equivalent of State Cut Scores for Proficient Performance: 2013, Grade 8 Mathematics *See slide 12 for notes and sources.

9 Some Reasons why States’ Performance Standards for Proficiency Have Become More NAEP-Like
Waivers provided by USED for setting improvement targets for school and district accountability
Public reporting of states' NAEP-equivalent cut scores for proficiency, which exposed the variability in states' standards
New-generation tests based on college- and career-ready content and performance standards
The use of NAEP and other external benchmarks of college- and career-readiness in standard setting for new-generation tests

10 Use of NAEP Results in Standard-Setting for PARCC
The PARCC assessment RFP in 2013 required that its standard-setting process be informed by established benchmarks for proficiency and college- and career-readiness:
"The offeror shall describe a set of benchmarks to inform standard setting, including
the percentage of students at or above proficient on the most recent NAEP assessments
the college-readiness benchmarks on ACT and SAT
relevant benchmarks on international assessments
the college- and career-ready benchmark on SBAC assessments"

11 What Purposes are Served by Comparability?
Credibility: comparability to build and maintain public support
Comparability to other states, NAEP, and other tests (SAT, ACT, TIMSS)
Accountability: comparability for making high-stakes decisions
Comparability across forms, within and across years, and across paper- and computer-based test forms
Trend: comparability to report change over time
Comparability across forms, across years, and to the former testing program
Research/Best Practice: comparability for research and to identify best practices
NAEP plays this role, but data are needed at the school and district level and more frequently
A reason for consortia: a common measure

12 Notes and Sources
Notes
Figure on slide 4: In Nebraska, each district develops local assessments to report on standards; therefore, the state was not included in the analyses. California was not included because the state does not test general mathematics.
Figure on slide 6: Nebraska was not included in the 2009 analysis because it did not offer a statewide assessment to report on standards.
Figure on slide 8: California and Virginia were not included because the states do not assess general mathematics in grade 8.
Sources
Phillips, G. (2016). National Benchmarks for State Achievement Standards. Washington, DC: American Institutes for Research.
U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2009, 2011, and 2013 Mathematics Assessments.
U.S. Department of Education, Office of Planning, Evaluation and Policy Development, EDFacts SY 2008–09, Washington, DC.
The National Longitudinal School-Level State Assessment Score Database (NLSLSASD), 2010.

13 Contact Information
Jeffrey Nellhaus

14 Louisiana’s Comprehensive Assessment System

15 Ensuring Comparability with Other States
Louisiana students are just as smart and capable as any in the country. However, academic results have not always afforded Louisiana's students the same opportunities as their peers in other states.

16 Goals of Louisiana's System
Purpose: High-quality, fully aligned content in the shortest form possible
Set an instructional vision
Hold our system accountable for rigorous reading, writing, and math learning
Fully assess the complete scope of standards
Approach:
Reduce assessment times where possible
Find and use the best items to build forms
Modified, shortened form
Purpose: Comparability
Equity that ensures Louisiana students are held to the same expectations as students anywhere
Credibility so that parents know results mean the same in Louisiana as in other states
Approach:
Maintain stable scale and cut scores
Use the same administration, scoring, and processing rules
Run an external audit
Purpose: Cohesion grade to grade and within a grade
Grade to grade, students experience the same quality of items and results to monitor growth
Within a grade, provide information on progress to mastery in order to adjust instruction
Approach:
Provide K–high school summative assessments
Provide aligned diagnostics and interims for each grade

17 Ensuring Comparability with Other States: Areas Studied
To ensure Louisiana's claims of comparability were defensible, the Department commissioned a third-party validation by the Center for Assessment. The Center studied four core areas:

18 Ensuring Comparability with Other States: Claims Studied
The Center also studied three claims:

19 Contact Information
Jessica Baghian
Assistant Superintendent, Louisiana Department of Education

20 Comparability Challenges and Solution Approaches for Emerging Assessment and Accountability Options
Brian Gong, Center for Assessment
Presentation in the session on "Establishing Comparability of Assessment Results in an Era of Flexibility"
CCSSO National Conference on Student Assessment
June 28, 2017, Austin, TX

21 Overview
Description of the context of new calls for comparability
Three issues, with some possible solution approaches
Summary

22 Comparability & interpretation
The key to comparability is interpretation and use: we want enough comparability to support our intended interpretations and uses.
There is deep knowledge in the measurement field about what affects comparability, what types of interpretations can be supported, and what methods may be used to promote and evaluate comparability of scores and interpretations.
However, new desired uses and contexts and new types of assessments challenge us to consider what we mean by "comparable" and how to support interpretations of comparability with new methods.

23 “Comparability” – We assume it
In almost every test interpretation and use today, we assume that test scores are comparable:
We aggregate scores
We interpret trends in performance over time
We compare individuals and groups to each other
We produce derivative scores that assume we can mathematically operate on multiple scores (e.g., index, growth, value-added)
We make policy decisions and take practical actions on the basis of these test score interpretations (e.g., school accountability, teacher evaluation, instructional intervention)

24 BUT… we are uneasy
Because we also want many conditions that are not strictly the same (standardized):
Different test forms for security
Different test forms for efficiency (e.g., CAT)
Different test forms for validity (sampling of the domain)
Different items, translations/formats, cognitive demand, and administration conditions for validity (accommodations, special populations)
Different tests for policy and other reasons (each state; ACT/SAT; NAEP; TIMSS/PISA; AP/IB; Common Core?)
Different tests across time
Different tests across populations
Different tests across time and populations

25 In addition, we want
Different content/skills as grades progress
Individual choice for production, application, and specialization
Individualized information for diagnosis and program evaluation for individuals, subgroups, and programs

26 Our dilemma
We want to act as though test scores were strictly comparable, but we also want a lot of conditions that prohibit making the tests and/or testing conditions the same, and in some cases we know the same items are invalid for different individuals.
So…
How can we conceptually understand the dimensions that inform our interpretations and uses?
What technical tools and approaches are available to support us in making interpretations that involve "comparability of test scores"?

27 New options, new flexibility
Multiple tests that serve roughly the same purpose but share no items, relying on special studies to make them comparable (e.g., a state high school test and college entrance exams)
Multiple tests that are quite different in purpose and share no items (e.g., a state test and a commercial interim assessment, or another commercial assessment such as the OECD district-level PISA alongside the state test)
Tests that may allow references from one testing program to another by sharing items (e.g., drawing on openly available item banks with sufficient information to link to scales and/or performance levels)

28 Why might a state want this type of flexibility?
Researchers have mapped state proficiency cuts to NAEP and will likely continue to do so, enabling state-to-NAEP and, indirectly, state-to-state comparisons of proficiency.
A state might want item-level linking because it wants:
Comparisons to a scale other than NAEP
Comparisons at the scale-score level
Control over, and detailed knowledge of, the technical aspects
Control over the timing, interpretation, and publicity
It needs trusted resources to do the linking to an external test because it cannot develop them on its own.

29 Comparability Continua
Content Comparability (content basis of test variations), from less to more: same content area → same content standards → same test specs → same test items
Score Comparability (score level), from less to more: pass/fail score or decision → achievement level score → scale score → raw score

30 Comparability Continua – 2
Population Comparability (population characteristics), from less to more: adjusted in interpretation → adjusted characteristics → similar characteristics → same students
Reporting-level Comparability (level of reporting unit), from less to more: state → district → school → student

31 Context of interpretation and use
We can solve some of our problems by better specifying what we mean.
We don't always need to create comparability at the "more" end of the continuum for content or scores.
Example: accountability is social valuing; it may not need comparable test scores from the assessment (e.g., 1%, 2%, ELP, very-low on-grade assessments)
Example: a claim about comparable achievement-level performance at the state level

32 Item-bank linking: researchable task
"Extreme linking" (Dorans) is commonly done, with appropriate safeguards and checks.
Other challenges to traditional linking, notably CAT, have been researched, and acceptable solutions have led to wide use (e.g., parameter invariance over item order, test length, time of administration, etc.).
Similarly, other item-bank solutions will need to specify under which conditions what types of comparability can be maintained, and show that empirically. But this is an exciting option.
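As one concrete illustration of a traditional common-item linking step, the sketch below applies the mean-sigma transformation to item difficulties calibrated on two forms; the arrays, values, and function name are hypothetical and are meant only to show the mechanics, not any procedure used by a specific program.

```python
import numpy as np

def mean_sigma_link(b_new, b_base):
    # Slope A and intercept B that place difficulties estimated on the
    # new form's scale onto the base scale, using items common to both.
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

# Hypothetical common-item difficulties from two separate calibrations.
b_base = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_new  = np.array([-1.0, -0.3, 0.2, 0.7, 1.3])
A, B = mean_sigma_link(b_new, b_base)
print(A, B)  # any difficulty or ability x on the new scale maps via A * x + B
```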

33 Summary
Use the flexibility available to achieve your policy goals and intended uses.
Specify, specify, specify, so you know what is comparable and what is not, by intention or by constraint (where are you on the continua of content comparability, score comparability, population comparability, and reporting-unit comparability?).
Validity and strict comparability may not go together.
Use the tools available to support appropriate comparability.
Focus on valid interpretations, as well as technical demands.
Empirically check your results.

34 Questions? Comments? Thank you!

35 Brian Gong bgong@nciea.org


