Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments
An Empirical Assessment Based on Four Recent Evaluations
IES Research Conference, June 28th, 2010
Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong
MDRC

Two key concerns with using state tests in an evaluation
- They may not be suitable for the evaluation
  - Validity concerns: they may not be aligned with the outcomes of interest (do not provide a valid inference about program impacts)
  - Reliability concerns: they may be too difficult for low-performing students (unreliable)
- Variation in the scale and content of state tests also complicates the task of combining impact findings across states and grades

About This Study
- Funded by the Institute of Education Sciences (IES)
- Purpose is to “bring data to bear” on several topics covered in the May et al. discussion paper:
  - Are state tests suitable for evaluation purposes?
    - As a measure of the outcome(s) of interest?
    - As a measure of student achievement at baseline?
  - How should impacts on state tests be pooled?
    - Are impact findings sensitive to methods of rescaling and aggregating test scores across states and/or grades?

Overview of Analytical Approach
- We identified 4 large-scale randomized experiments where achievement was measured using both (i) state tests AND (ii) a study test
  - The study test provides a benchmark for gauging the suitability of state tests
- Two types of analyses:
  - Impact analyses: we compared estimated impacts on state tests and on the “benchmark” study test
  - Descriptive analyses: we also examined published information on the characteristics/content of the tests

Data and Samples
- Studies represent diversity with respect to grade levels and outcomes
- Analysis sample includes students with both a state test score and a study test score

                       Study A                       Study B                     Study C                     Study D
  Targeted Outcome     General Reading Achievement   General Math Achievement    Specific Reading Outcome    Specific Math Outcome
  Level                Elementary                    Elementary                  High School                 Middle School
  Sample for Analysis  1,032 (9 states)              944 (7 states)              1,065 (4 states)            4,387 (9 states)

Approach for Estimating Impacts
- Impact on state tests:
  - Rescaling: scores are z-scored by state and grade using the sample mean and standard deviation
  - Pooling approach: impacts by state and grade are aggregated using precision weighting (see the sketch below)
- Impact on the study test: rescaled/pooled using the same approach, for comparability
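The slide describes the rescaling and pooling steps only in words. Below is a minimal Python sketch of that logic using simulated data; the column names (state, grade, treatment, score), the simple difference-in-means estimator, and the inverse-variance weights are assumptions for illustration, not the authors' actual estimation code.

```python
# Minimal sketch of z-scoring by state/grade and precision-weighted pooling.
# All column names and the simulated data are hypothetical.
import numpy as np
import pandas as pd

def zscore_by_state_grade(df, score_col="score"):
    """Z-score test scores within each state-by-grade cell,
    using the analysis sample's mean and standard deviation."""
    grouped = df.groupby(["state", "grade"])[score_col]
    return (df[score_col] - grouped.transform("mean")) / grouped.transform("std")

def impact_by_cell(df, z_col="z"):
    """Treatment-control difference in mean z-scores and its standard error,
    computed separately for each state-by-grade cell."""
    rows = []
    for (state, grade), cell in df.groupby(["state", "grade"]):
        t = cell.loc[cell["treatment"] == 1, z_col]
        c = cell.loc[cell["treatment"] == 0, z_col]
        rows.append({
            "state": state,
            "grade": grade,
            "impact": t.mean() - c.mean(),
            "se": np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c)),
        })
    return pd.DataFrame(rows)

def pool_precision_weighted(impacts):
    """Aggregate cell-level impacts using inverse-variance (precision) weights."""
    w = 1.0 / impacts["se"] ** 2
    pooled = np.sum(w * impacts["impact"]) / np.sum(w)
    return pooled, np.sqrt(1.0 / np.sum(w))

# Usage with simulated data (roughly a 0.10 standard deviation impact)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "state": rng.choice(["A", "B", "C"], size=600),
    "grade": rng.choice([4, 5], size=600),
    "treatment": rng.integers(0, 2, size=600),
})
df["score"] = 300 + 5 * df["treatment"] + rng.normal(0, 50, size=600)
df["z"] = zscore_by_state_grade(df)
pooled_impact, pooled_se = pool_precision_weighted(impact_by_cell(df))
print(f"Pooled impact = {pooled_impact:.3f} effect-size units (SE = {pooled_se:.3f})")
```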

Criteria for Assessing “Suitability”
- Two dimensions of suitability
  - Validity: whether the content of state tests is aligned with the outcomes of interest in the evaluation
  - Reliability: whether state tests provide a reliable measure of achievement for the target population (in this case, low-performing students)
- A key concern is that state tests may have low reliability and may not yield valid inferences about program effectiveness

Implications for the Impact Findings
- Poor validity:
  - Could fail to detect impacts on the outcome of interest (invalid inference about program effectiveness)
  - Affects the magnitude of the estimated impact on state tests
- Low reliability:
  - Student achievement is estimated with greater error
  - Affects the standard error of the estimated impact on state tests (see the measurement-error sketch below)
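To make the reliability mechanism concrete, here is a brief classical test theory sketch; it is an illustrative framing not taken from the slides, and it abstracts from the z-scoring step, covariates, and clustering.

```latex
% Classical test theory sketch (illustrative assumptions, not from the slides).
% Observed score X = tau + e, with reliability rho = Var(tau)/Var(X),
% so Var(X) = Var(tau)/rho. For a simple treatment-control difference in means:
\[
  SE(\hat{\Delta})
    = \sqrt{\operatorname{Var}(X)\left(\frac{1}{n_T}+\frac{1}{n_C}\right)}
    = \sqrt{\frac{\operatorname{Var}(\tau)}{\rho}\left(\frac{1}{n_T}+\frac{1}{n_C}\right)}.
\]
% A less reliable test (smaller rho) inflates the standard error of the estimated
% impact by a factor of 1/sqrt(rho) relative to a perfectly reliable measure.
```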

Criteria for Assessing “Suitability”
- Reliability: compare the standard error of the estimated impact on state tests vs. the study test
  - A smaller standard error is better (more precision)
- Validity: compare the magnitude of the impact estimates, in light of estimation error
  - Compare the statistical significance of the impact findings (i.e., conclusions about program effectiveness based on the p-value)
  - If both estimates are statistically significant, then also compare their magnitudes (a small decision sketch follows)
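As an illustration of how these two criteria could be applied to pooled estimates, here is a hedged Python helper; the two-sided normal test, the 0.05 threshold, and the input values in the example are assumptions for demonstration, not the authors' procedure or results.

```python
# Illustrative comparison of state-test vs. study-test impact estimates.
# Inputs are pooled impact estimates and standard errors in effect-size units.
import math

def assess_suitability(state_impact, state_se, study_impact, study_se, alpha=0.05):
    # Reliability criterion: ratio of standard errors (a value near 1 means the
    # state tests are roughly as precise as the study test).
    se_ratio = state_se / study_se

    # Validity criterion: do the two tests lead to the same conclusion about
    # program effectiveness (two-sided z-test), and how do the magnitudes compare?
    p_state = math.erfc(abs(state_impact / state_se) / math.sqrt(2))
    p_study = math.erfc(abs(study_impact / study_se) / math.sqrt(2))
    same_conclusion = (p_state < alpha) == (p_study < alpha)

    return {
        "se_ratio": se_ratio,
        "p_state": p_state,
        "p_study": p_study,
        "same_conclusion": same_conclusion,
        "impact_difference": state_impact - study_impact,
    }

# Example with made-up numbers (not taken from the presentation)
print(assess_suitability(state_impact=0.08, state_se=0.04,
                         study_impact=0.12, study_se=0.05))
```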

Criteria for Assessing Validity
- The extent to which the magnitudes of the impact estimates are expected to differ depends on the outcome that state tests are intended to measure
- Two types of intervention:
  - Targeted outcome is general achievement (Studies A and B)
    - The outcome of interest is “general achievement” in math or reading
    - Both state tests and the study test measure the targeted outcome (general achievement)
    - If state tests are valid, then the impacts on the study test and on state tests should be similar

Criteria for Assessing Validity (ctd.)
- Two types of intervention (ctd.):
  - Targeted outcome is a specific skill (Studies C and D)
    - There are two outcomes of interest: the targeted skill (short term) and general achievement (longer term)
    - The study test is used to measure the short-term outcome (specific skill), while state tests are used to measure the longer-term outcome (general achievement)
    - If state tests are valid, then the impact on state tests should be smaller than the impact on the study test

Benchmark: Impact on the Study Test
[Chart: estimated impacts on the study test for each of the four studies]

P-Value & Magnitude (Validity) Targeted Outcome is General Achievement p = p = 0.119

P-Value & Magnitude (Validity) Targeted Outcome is General Achievement p = p = p = p = 0.189

P-Value & Magnitude (Validity) Targeted Outcome is a Specific Skill p = p = 0.002

P-Value & Magnitude (Validity) Targeted Outcome is a Specific Skill p = p = 0.002p = 0.007

P-Value & Magnitude (Validity) Targeted Outcome is a Specific Skill p = p = 0.002p = p = 0.219

Standard Errors (Reliability)
[Chart: standard errors of the estimated impacts on state tests vs. the study test, with the state-to-study ratio of standard errors]

Conclusion
- Findings suggest that state tests can be used as a complement to a study-administered test
  - State tests are suitable (valid and reliable) in 3 of the 4 studies
- Whether state tests can be used as a substitute for a study test is an open question
  - Limited availability in some grades and subjects (available for all states/grades in only 1 of the 4 studies)
  - May not be usable to measure a specific targeted skill
  - Possibly less reliable
- Findings from the descriptive analysis lead to the same conclusions as the impact analysis

Questions?
Marie-Andrée Somers, Pei Zhu, Edmond Wong (MDRC)