Measuring College Value-Added: A Delicate Instrument


Measuring College Value-Added: A Delicate Instrument
Richard J. Shavelson, SK Partners & Stanford University
Ben Domingue, University of Colorado Boulder
AERA 2014

Motivation To Measure Value Added
- Increasing costs, stop-outs/dropouts, student and institutional diversity, and the internationalization of higher education raise questions about quality.
- Nationally (U.S.): best reflected in the Spellings Commission report and the Voluntary System of Accountability's response to increase transparency and measure value added to learning.
- Internationally (OECD): the Assessment of Higher Education Learning Outcomes (AHELO) and its aim, at some point if continued, to measure value added internationally.

Reluctance To Measure Value Added
- "We don't really know how to measure outcomes" (Stanford President Emeritus Gerhard Casper, 2014).
- Multiple conceptual and statistical issues are involved in measuring value added in higher education.
- The problems of measuring learning outcomes and value added are exacerbated in international comparisons (language, institutional variation, outcomes sought, etc.).

Increasing Global Focus On Higher Education
- How does education quality vary across colleges and their academic programs?
- How do learning outcomes vary across student sub-populations?
- Is education quality related to cost? To student attrition?
(AHELO-VAM Working Group, 2013)

Purpose Of Talk
- Identify conceptual issues associated with measuring value added in higher education.
- Identify statistical modeling decisions involved in measuring value added.
- Provide empirical evidence on these issues using data from Colombia's mandatory college-leaving exams and the AHELO generic skills assessment.

Value Added Defined
Value added refers to a statistical estimate (a "measure") of the contribution colleges "add" to students' learning once pre-existing differences among students at different institutions have been accounted for.
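In schematic form (an illustrative formulation, not the specific estimator used in the talk; Y_ij, X_ij, and f are placeholder symbols), a college's value added is the gap between its students' average observed outcome and the average outcome predicted from their prior characteristics:

```latex
% Illustrative value-added residual for institution j (not the authors' estimator):
% Y_{ij} = exit outcome of student i at institution j
% X_{ij} = prior measures used for adjustment (e.g., entry test scores, SES)
% f      = the adjustment model chosen by the analyst
\[
\widehat{\mathrm{VA}}_j \;=\; \frac{1}{n_j}\sum_{i=1}^{n_j} \bigl( Y_{ij} - \hat{f}(X_{ij}) \bigr)
\]
```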

Some Key Assumptions Underlying Value-Added Measurement
Value-added measures attempt to provide causal estimates of the effect of colleges on student learning; they fall short. The assumptions required for drawing causal inferences from observational data are well known (e.g., Holland, 1986; Reardon & Raudenbush, 2009); two of them are restated in formal notation after this list:
- Manipulability: students could, in principle, be exposed to any treatment (i.e., attend any college).
- No interference between units: a student's outcome depends only on his or her own treatment assignment (e.g., no peer effects).
- The metric assumption: test-score outcomes are on an interval scale.
- Homogeneity: the causal effect does not vary as a function of student characteristics.
- Strongly ignorable treatment assignment: assignment to treatment is essentially random after conditioning on the control variables.
- Functional form: the functional form (typically linear) used to control for student characteristics is the correct one.
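A textbook potential-outcomes formulation in the spirit of Holland (1986), not text from the slides: here Y_i(c) is student i's potential outcome if enrolled at college c, C_i is the college actually attended, and X_i are the adjustment covariates.

```latex
% Strongly ignorable treatment assignment: given covariates X_i, college choice is
% as-if random, and every college is a possible assignment for every student.
\[
\{\, Y_i(c) : c \in \mathcal{C} \,\} \;\perp\; C_i \mid X_i,
\qquad
\Pr(C_i = c \mid X_i) > 0 \quad \text{for all } c \in \mathcal{C}.
\]
% No interference between units: i's potential outcomes do not depend on where other
% students enroll, so the observed outcome is simply the potential outcome at C_i.
\[
Y_i = Y_i(C_i).
\]
```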

Some Key Decisions Underlying Value-Added Measurement
- What is the treatment, and compared to what? If college A is the treatment, what is the control or comparison? What is the duration of treatment (e.g., 3, 4, 5, 6, or more years)?
- What treatment are we interested in? Teaching and learning without adjusting for context effects? Teaching and learning with peer context?
- What is the unit of comparison? Institution, college, or major (assuming the same treatment for all)? There is a practical tradeoff between treatment-definition precision and adequate sample size for estimation. Students change majors and colleges: to what treatment are effects being attributed?

Some Key Decisions Underlying Value-Added Measurement (cont'd.)
- What should be measured as outcomes? Generic skills (e.g., critical thinking, problem solving), in general or within a major? Subject-specific knowledge and problem solving?
- How should it be measured? Selected response (multiple choice), constructed response (argumentative essay with justification), etc. How valid are measures when translated for cross-national assessment?
- What covariates should be used to adjust for selection bias? A single covariate (a pretest parallel to the outcome)? Multiple covariates: cognitive, affective, biographical (e.g., SES)? Institutional context effects: average pretest score, average SES?
- How should student sorting (on ability and other characteristics) be handled? The choice of which college to attend is not random!

Does All This Worrying Matter? Colombia Data!
Yes! Data (>64,000 students, 168 IHEs, and 19 reference groups such as engineering, law, and education) from Colombia's unique college assessment system:
- All high school seniors take the college entrance exam, SABER 11 (language, math, chemistry, and social sciences).
- All college graduates take the exit exam, SABER PRO: quantitative reasoning (QR), critical reading (CR), writing, and English, plus subject-specific exams.
- Focus here is on the generic skills of QR and CR.

Value-Added Models Estimated
Two-level hierarchical mixed-effects model:
1. Students within reference group
2. Reference group
Covariates:
- Individual level: the vector of 4 SABER 11 scores (kept as a vector due to reliability issues); SES (INSE)
- Reference-group level: mean SABER 11 or mean INSE
Models compared (see the sketch below):
- Model 1: no context effect, i.e., no mean SABER 11 or mean INSE
- Model 2: context with mean INSE
- Model 3: context with mean SABER 11
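A minimal sketch of how the three covariate-adjustment variants might be fit with a single random intercept using statsmodels' MixedLM. This is not the authors' code: the file name, all column names (saber_pro_qr, s11_*, inse, institution), and the grouping column are hypothetical placeholders, and the nesting is simplified relative to the two-level specification above.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("saber_linked.csv")  # hypothetical student-level linked file

# Group-level (context) covariates: means within each grouping unit
df["s11_mean"] = df[["s11_lang", "s11_math", "s11_chem", "s11_soc"]].mean(axis=1)
df["grp_mean_s11"] = df.groupby("institution")["s11_mean"].transform("mean")
df["grp_mean_inse"] = df.groupby("institution")["inse"].transform("mean")

student_covs = "s11_lang + s11_math + s11_chem + s11_soc + inse"
formulas = {
    "model_1_no_context": f"saber_pro_qr ~ {student_covs}",
    "model_2_mean_inse":  f"saber_pro_qr ~ {student_covs} + grp_mean_inse",
    "model_3_mean_saber": f"saber_pro_qr ~ {student_covs} + grp_mean_s11",
}

for name, formula in formulas.items():
    fit = smf.mixedlm(formula, data=df, groups=df["institution"]).fit()
    # "Value added" here is read off the empirical Bayes random-intercept estimates
    va = pd.Series({grp: eff["Group"] for grp, eff in fit.random_effects.items()})
    # Intraclass correlation: between-group variance / total variance
    icc = fit.cov_re.iloc[0, 0] / (fit.cov_re.iloc[0, 0] + fit.scale)
    print(f"{name}: ICC = {icc:.3f}")
    print(va.sort_values().tail())  # groups with the largest estimated value added
```

Comparing the group-level estimates across the three fits mirrors the sensitivity question raised in the slides: if the rank ordering shifts when a context mean is added, the value-added measure is being driven in part by peer composition rather than by teaching and learning alone.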

Results Bearing On Assumptions & Decisions
- Sorting / manipulability assumption: ICCs shown for models that include only a random intercept at the grouping level (see the sketch below)
- Context effects (Fig. A: the 32 reference groups with adequate Ns)
- Strongly ignorable treatment assignment assumption (Fig. B: SABER 11; Fig. C: SABER PRO)
- Effects vary by model (ICCs in Fig. D)
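The intercept-only ICCs referenced in the first bullet can be computed as in the following sketch (same hypothetical column names and caveats as the block above; not the authors' code). The check fits an unconditional model and reports the share of raw variance lying between grouping units before any adjustment; a sizable ICC on the entry measure signals non-random sorting into colleges.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("saber_linked.csv")  # hypothetical linked file, as above

for outcome in ["s11_math", "saber_pro_qr"]:  # an entry score and an exit score
    null_fit = smf.mixedlm(f"{outcome} ~ 1", data=df, groups=df["institution"]).fit()
    between = null_fit.cov_re.iloc[0, 0]   # between-group variance
    within = null_fit.scale                # residual (within-group) variance
    print(f"{outcome}: unconditional ICC = {between / (between + within):.3f}")
```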

VA Measures: Delicate Instruments! Impact on Engineering Schools
(Figure: value-added estimates for engineering schools. Black dot: "high quality intake" school; gray dot: "average quality intake" school.)

Generalizations Of Findings
SABER PRO subject exams in law and education:
- VA estimates are not sensitive to whether a generic or subject-specific outcome is measured
- Greater college differences (ICCs) with subject-specific outcomes than with generic outcomes
AHELO generic skills assessment:
- VA estimates with AHELO are equivalent to those found with the SABER PRO tests
- Smaller college differences (ICCs) on AHELO generic-skills outcomes than on SABER PRO outcomes

Thank You!