Empirical Methods to Evaluate the Instructional Sensitivity of Accountability Tests
Stephen C. Court

Similar presentations
A District-initiated Appraisal of a State Assessments Instructional Sensitivity HOLDING ACCOUNTABILITY TESTS ACCOUNTABLE Stephen C. Court Presented in.
Assessing Student Performance
Performance Assessment
National Accessible Reading Assessment Projects Defining Reading Proficiency for Accessible Large Scale Assessments Principles and Issues Paper American.
The Framework for Teaching Charlotte Danielson
Testing for Tomorrow Growth Model Testing Measuring student progress over time.
Test Development.
Pre and Post Assessments A quick and easy way to assess your Student Learning Outcomes.
REGRESSION, IV, MATCHING Treatment effect Boualem RABTA Center for World Food Studies (SOW-VU) Vrije Universiteit - Amsterdam.
Module 1: Teaching functional skills – from building to applying skills 0 0.
Summative Assessment Kansas State Department of Education ASSESSMENT LITERACY PROJECT1.
Gwinnett Teacher Effectiveness System Training
1 Maine’s Impact Study of Technology in Mathematics (MISTM) David L. Silvernail, Director Maine Education Policy Research Institute University of Southern.
Mark D. Reckase Michigan State University The Evaluation of Teachers and Schools Using the Educator Response Function (ERF)
Chapter 8 and 9: Teacher- Centered and Learner-Centered Instruction EDG 4410 Ergle.
1 COMM 301: Empirical Research in Communication Lecture 15 – Hypothesis Testing Kwan M Lee.
Briefing: NYU Education Policy Breakfast on Teacher Quality November 4, 2011 Dennis M. Walcott Chancellor NYC Department of Education.
The Research Consumer Evaluates Measurement Reliability and Validity
Language Arts Connecticut Mastery Test By Grace Romano.
Comparing Growth in Student Performance David Stern, UC Berkeley Career Academy Support Network Presentation to Educating for Careers/ California Partnership.
Stephen C. Court Educational Research and Evaluation, LLC A Presentation at the First International Conference on Instructional Sensitivity Achievement.
CAN INSTRUCTIONALLY INSENSITIVE ACCOUNTABILITY TESTS EVER EVALUATE EDUCATORS FAIRLY? W. James Popham University of California, Los Angeles Winter Conference.
A Terse Self-Test about Testing
1 Alignment of Alternate Assessments to Grade-level Content Standards Brian Gong National Center for the Improvement of Educational Assessment Claudia.
Potential Biases in Student Ratings as a Measure of Teaching Effectiveness Kam-Por Kwan EDU Tel: etkpkwan.
Using Growth Models for Accountability Pete Goldschmidt, Ph.D. Assistant Professor California State University Northridge Senior Researcher National Center.
Principles of High Quality Assessment
Personality, 9e Jerry M. Burger
New Hampshire Enhanced Assessment Initiative: Technical Documentation for Alternate Assessments Alignment Inclusive Assessment Seminar Brian Gong Claudia.
Types of Evaluation.
Using MCA growth data to identify classrooms making unexpected positive growth Beating the Odds in Middle School Math – Classroom profiles found in some.
EVAL 6970: Experimental and Quasi- Experimental Designs Dr. Chris L. S. Coryn Dr. Anne Cullen Spring 2012.
What does the Research Say About... POP QUIZ!!!. The Rules You will be asked to put different educational practices in order from most effective to least.
Hypothesis Testing II The Two-Sample Case.
Inferences about School Quality using opportunity to learn data: The effect of ignoring classrooms. Felipe Martinez CRESST/UCLA CCSSO Large Scale Assessment.
Building Effective Assessments. Agenda  Brief overview of Assess2Know content development  Assessment building pre-planning  Cognitive factors  Building.
Developing teachers’ mathematics knowledge for teaching Challenges in the implementation and sustainability of a new MSP Dr. Tara Stevens Department of.
Foundations of Recruitment and Selection I: Reliability and Validity
Understanding Statistics
Evaluating the Vermont Mathematics Initiative (VMI) in a Value Added Context H. ‘Bud’ Meyers, Ph.D. College of Education and Social Services University.
Data analysis was conducted on the conceptions and misconceptions regarding hybrid learning for those faculty who taught in traditional classroom settings.
Assisting GPRA Report for MSP Xiaodong Zhang, Westat MSP Regional Conference Miami, January 7-9, 2008.
Chapter 10: Analyzing Experimental Data Inferential statistics are used to determine whether the independent variable had an effect on the dependent variance.
Mathematics and Science Partnerships: Summary of the FY2006 Annual Reports U.S. Department of Education.
1 Math 413 Mathematics Tasks for Cognitive Instruction October 2008.
Assessment and Testing
Assessment. Levels of Learning Bloom Argue Anderson and Krathwohl (2001)
McGraw-Hill/Irwin © 2012 The McGraw-Hill Companies, Inc. All rights reserved. Obtaining Valid and Reliable Classroom Evidence Chapter 4:
Evaluating Impacts of MSP Grants Ellen Bobronnikov January 6, 2009 Common Issues and Potential Solutions.
Chapter 10 Copyright © Allyn & Bacon 2008 This multimedia product and its contents are protected under copyright law. The following are prohibited by law:
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Writing Learning Outcomes Best Practices. Do Now What is your process for writing learning objectives? How do you come up with the information?
2 Objectives To develop understanding of Functional Skills To explore resources and strategies for building towards functionality in the context of probability.
Dr. Marciano B. Melchor University of Ha’il, KINGDOM OF SAUDI ARABIA May 2013.
What does the Research Say About . . .
Nuts and Bolts of Assessment
Issues in Evaluating Educational Research
Assessment.
EXPERIMENTAL RESEARCH
Assessment.
What does the Research Say About . . .
Evaluation of An Urban Natural Science Initiative
Analytics in Higher Education: Methods Overview
Analysis based on normal distributions
Elementary Statistics
Exploring Assessment Options NC Teaching Standard 4
Validity and Reliability II: The Basics
Understanding Statistical Inferences
Reminder for next week CUELT Conference.
Presentation transcript:

Empirical Methods to Evaluate the Instructional Sensitivity of Accountability Tests Stephen C. Court Presented at the Association for Educational Assessment - Europe 10th Annual Conference, “Innovation in Assessment to Meet Changing Needs,” 5-7 November 2009, Valletta, Malta

Basic Assumption of Accountability Systems Student test scores accurately reflect instructional quality: higher scores = greater learning due to higher-quality teaching; lower scores = less learning due to lower-quality teaching. In short, it is assumed that accountability tests are instructionally sensitive.

Reality The assumption rarely holds. Most accountability tests are not sensitive to instruction because they simply were not constructed to be. The tests are built to the same general “Army Alpha” specifications – originally designed during the First World War to differentiate between officer candidates and enlisted personnel.

Consequences of Instructional Insensitivity In principle: lack of fairness; lack of trustworthy evidence to support validity arguments. In practice: bad policy; bad evaluation; bad things happen in the classroom.

The situation in Kansas - SES SES disparities between districts

The situation in Kansas - Test Scores Disparities in state assessment scores and proficiency rates

The Situation in Kansas Can the instruction in high-poverty districts really be so much worse than the instruction in low-poverty districts? Or are construct-irrelevant factors (such as SES) masking the effects of instruction?

The basic question: What methods can be employed to evaluate the instructional sensitivity of accountability tests?

Definition Instructional Sensitivity “the degree to which students’ performances on a test… accurately reflect the quality of instruction provided specifically to promote students’ mastery of the knowledge and skills being assessed.” (Popham, 2008)

Two-pronged Approach At last year’s AEA conference in Hissar, Popham (2008) advocated a two-pronged approach to evaluating instructional sensitivity: judgmental strategies and empirical studies.

Empirical Study Following the guidance of Popham (2007)… three Kansas school districts conducted an empirical study of the Kansas assessments.

Description of the Kansas Study Teachers were invited to complete a brief online rating form. Participation was voluntary. Each teacher identified the 3-4 indicators (curricular aims) he or she had taught best during the 2008-2009 school year. Student results were matched to responding teachers.

Study Participants 575 teachers responded, representing about 14,000 students: 320 teachers of grades 3-5 (reading and math), 129 reading teachers (grades 6-8), and 126 math teachers (grades 6-8).

A Gold Standard Typically, test scores are used to confirm teacher perceptions…as if the test scores are infallible and the teachers are always suspect. In fact, for the first 40 years of inquiry into instructional sensitivity, teacher perceptions were never even part of the mix. Instructional sensitivity studies always contrasted two sets of scores – e.g. pre-test/post-test, not-taught/taught, etc. Asking teachers to identify their best-taught indicators has changed the instructional sensitivity issue both conceptually and operationally.

Old and New Model Instructional Sensitivity Old (pre/post) model: A = Non-Learning, B = Learning, C = Slip, D = Maintain. New (best-taught) model: A = True Fail, B = False Pass, C = False Fail, D = True Pass.

Kansas Study Propensity Score Matching Propensity scores were generated from logistic regression: overall proficiency status was regressed on several demographic and prior-performance characteristics. The resulting probabilities were used to match “Not-Best-Taught” with “Best-Taught” students using the “nearest neighbor” method. Purpose: to form quasi-“random equivalent groups” of similar size for each content area, grade level, and indicator configuration.
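
The slides describe the matching only briefly, so the following is a minimal sketch of the standard recipe in Python: estimate each student’s probability of belonging to the best-taught group from the covariates, then pair each best-taught student with the nearest not-best-taught student on that probability. The data frame and column names (students, best_taught, prior_score, frl, ell, sped) are hypothetical placeholders, not the study’s actual variables.

```python
# Sketch of propensity-score matching: logistic regression for the propensity,
# then 1:1 nearest-neighbor pairing on the estimated probability.
# All column names here are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_groups(df: pd.DataFrame, covariates: list[str]) -> pd.DataFrame:
    """Return a 1:1 matched sample of best-taught and not-best-taught students."""
    X = df[covariates].to_numpy()
    y = df["best_taught"].to_numpy()  # 1 = student of a teacher naming this indicator

    # Propensity = estimated probability of belonging to the best-taught group.
    propensity = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    df = df.assign(propensity=propensity)

    treated = df[df["best_taught"] == 1]
    control = df[df["best_taught"] == 0]

    # Nearest neighbor on the propensity score (with replacement, for brevity).
    nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
    _, idx = nn.kneighbors(treated[["propensity"]])
    matched_control = control.iloc[idx.ravel()]

    return pd.concat([treated, matched_control], ignore_index=True)

# Example use (hypothetical data frame and covariate names):
# matched = match_groups(students, ["prior_score", "frl", "ell", "sped"])
```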

Basic Contrast The basic contrast involved “best-taught” versus “not-best-taught” students. For example, Grade 3 Reading, Indicator 1: 160 teachers responded, and 30 of them identified Indicator 1 as one of their best-taught indicators; given average class size, that is roughly 750 “best-taught” students. From the pool of other teachers and their students, propensity score matching was used to form an equivalent group of 750 students from 30 teachers.

Initial Analysis Scheme Conduct independent t-tests with mean indicator score as the dependent variable and best-taught versus other students as the independent variable.

Initial Analysis Scheme Initial logic: If best-taught students outperform other students, indicator is sensitive to instruction. If mean differences are small or in the wrong direction, indicator is insensitive to instruction.
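
As a concrete illustration, this initial scheme reduces to a two-sample comparison such as the one below: a sketch that reuses the matched data frame from the earlier matching sketch and a hypothetical indicator_score column.

```python
# Welch's independent-samples t-test on mean indicator score,
# best-taught students versus matched other students (hypothetical columns).
from scipy import stats

best = matched.loc[matched["best_taught"] == 1, "indicator_score"]
other = matched.loc[matched["best_taught"] == 0, "indicator_score"]

t_stat, p_value = stats.ttest_ind(best, other, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, "
      f"mean difference = {best.mean() - other.mean():.3f}")
```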

Problem But significant performance differences between best-taught and other students do not necessarily represent significant differences in instructional sensitivity. Instead, instructional sensitivity is about whether the indicator accurately distinguishes effective from ineffective instruction – without confounding from any form of construct-irrelevant easiness or difficulty.

Basic Concept In its simplest form, Popham’s definition of instructional sensitivity can be depicted as a 2x2 contingency table.

In Context

Basic Concepts Mean Least effective = B/(A+B) Mean Most effective = D/(C+D) But Mean Least effective = False Pass/(True Fail + False Pass) makes no sense at all. In fact, it returns to treating the outcome as infallible and the teacher perceptions as suspect: if the pass rates for the two groups are statistically similar, then the degree of difference between least and most effective must be questioned.

Conceptually Correct Rather than comparing means, we instead need to look at the combined proportions of true fail and true pass. That is, (A + D) / (A + B + C + D) Which can be shortened to (A + D) / N

(A + D) / N Index 1 Ranges from 0 to 1 (Completely Insensitive to Totally Sensitive) In practice: Values < .50 are worse than random guessing

Totally Sensitive (A + D)/N = (50 + 50)/100 = 1.0 A totally sensitive test would cluster students into A or D.

Totally Insensitive (A+D)/N = (0+0)/100 = 0.0 A totally insensitive test clusters students into B and C

Useless (A+D)/N = (25+25)/100 = 0.50 0.50 = mere chance Values < 0.50 are worse than chance.
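
A few lines of code make the index and the three illustrations above concrete (a sketch; the cell labels follow the document’s 2x2 table).

```python
# Index 1 = (A + D) / N, where A = true fail, B = false pass,
# C = false fail, D = true pass.
def index1(a: int, b: int, c: int, d: int) -> float:
    return (a + d) / (a + b + c + d)

print(index1(50, 0, 0, 50))    # 1.0 -> totally sensitive
print(index1(0, 50, 50, 0))    # 0.0 -> totally insensitive
print(index1(25, 25, 25, 25))  # 0.5 -> mere chance
```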

Index 1 Equivalents Index 1 is conceptually equivalent to: Mann-Whitney U Wilcoxon statistic Transposing Cell A and Cell B, then running a t-test Area Under the Curve (AUC) in Receiver Operating Characteristic (ROC) curve analysis
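
These equivalences can be checked numerically. With the equal-sized groups produced by the matching step, (A + D)/N equals (sensitivity + specificity)/2, which is exactly the trapezoidal AUC for a dichotomous score, and that AUC in turn equals the Mann-Whitney U of the best-taught group divided by the product of the two group sizes. The sketch below uses hypothetical cell counts.

```python
# Numerical check: Index 1 = AUC = U / (n1 * n2) when the two groups are equal in size.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

a, b, c, d = 300, 75, 150, 225  # hypothetical cell counts with equal row totals (375 each)
group = np.r_[np.zeros(a + b), np.ones(c + d)]                    # 0 = not-best-taught, 1 = best-taught
passed = np.r_[np.zeros(a), np.ones(b), np.zeros(c), np.ones(d)]  # 0 = fail, 1 = pass

index1 = (a + d) / (a + b + c + d)
auc = roc_auc_score(group, passed)  # pass/fail treated as the "score"
u, _ = mannwhitneyu(passed[group == 1], passed[group == 0],
                    alternative="two-sided")  # recent SciPy returns U for the first sample
print(index1, auc, u / ((a + b) * (c + d)))   # all three agree: 0.70
```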

ROC Curve Analysis Has rarely been used in the domain of educational research. More commonly used in medicine and radiology, data mining (information retrieval), and artificial intelligence (machine learning). The use of ROC curves was first introduced during WWII in response to the challenge of how to accurately identify enemy planes on radar screens.

AUC Context ROC Curve Analysis – especially the AUC – is more useful for several reasons: Easily computed Easily interpreted Decomposable into sensitivity and specificity Sensitivity = D / (C+D) Specificity = A / (A+B) Easily graphed as (Sensitivity) versus (1 – Specificity) Readily expandable to polytomous situations Multiple test items in a subscale Multiple subscales in a test Multiple groups being tested
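
For the graphing described on this slide, a plotting sketch (continuing the arrays from the numerical check above; matplotlib assumed):

```python
# ROC curve: sensitivity against 1 - specificity, with the chance diagonal for reference.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(group, passed)  # arrays from the numerical check above
plt.plot(fpr, tpr, marker="o", label="indicator")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance (AUC = .50)")
plt.xlabel("1 - specificity")
plt.ylabel("sensitivity")
plt.legend()
plt.show()
```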

Basic Interpretation (Descriptive) Easy to compute: (A+D)/N Easy to interpret… .90-1.0 = excellent (A) .80-.90 = good (B) .70-.80 = fair (C) .60-.70 = poor (D) .50-.60 = fail (F) Less than .50 is worse than guessing!

Basic Interpretation Most statistical software packages – e.g., SAS, SPSS - include a ROC procedure. The area under the curve table displays estimates of the area, standard error of the area, confidence limits for the area, and the p-value of a hypothesis test.

ROC Hypothesis Test The null hypothesis: true AUC = .50. So, use of ROC Curve Analysis in this context would support rigorous psychometric inquiry into instructional sensitivity. Yet, the A, B, C, D, F system could be reported in ways that even the least experienced reporters or policy-makers can readily understand.
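
Where no packaged ROC procedure is at hand, the standard error and the test against a true AUC of .50 can be approximated by hand with the Hanley and McNeil (1982) formula. This is a Wald-style sketch, using the AUC and group sizes from the earlier hypothetical example rather than the study’s data.

```python
# Hanley-McNeil (1982) standard error for an AUC and a z-test of H0: true AUC = 0.50.
from math import sqrt
from scipy.stats import norm

def auc_z_test(auc: float, n_pos: int, n_neg: int) -> tuple[float, float]:
    """n_pos, n_neg = sizes of the best-taught and not-best-taught groups."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    se = sqrt((auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2)
               + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg))
    z = (auc - 0.50) / se
    return z, 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

z, p = auc_z_test(0.70, 375, 375)
print(f"z = {z:.2f}, p = {p:.4f}")
```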

Area Under Curve (AUC) - Graphed Curve 1 = .50: pure chance, no better than a random guess. Curve 4 = 1.0: totally sensitive, completely accurate discrimination between effective and less-effective instruction. Curve 3 is better than Curve 2.

ROC Curve Interpretation Greater AUC values indicate greater separation between distributions (e.g., most effective versus less effective, best-taught versus not-best-taught). 1.0 = complete separation – that is, total sensitivity.

ROC Curve Interpretation AUC values close to .50 indicate no separation between distributions. AUC = .50 indicates complete overlap: no difference; one might as well guess.

Procedural Review Step 1: Cross-tabulate fail/pass status with teacher identification of best-taught indicators. Step 2 (optional): Use logistic regression and propensity score matching to create randomly equivalent groups – or as close as you can get. Step 3: Use (A+D)/N or formal ROC Curve Analysis to evaluate instructional sensitivity at the smallest grain size possible – preferably at the wrong/right level of individual items.
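
Put together, Steps 1 and 3 reduce to a short routine like the one below (a sketch with hypothetical column names; Step 2 would reuse the matching sketch shown earlier).

```python
# Step 1: cross-tabulate pass/fail status with best-taught status.
# Step 3: compute (A + D) / N from the resulting 2x2 table.
import pandas as pd

def instructional_sensitivity(df: pd.DataFrame) -> float:
    """df needs hypothetical 0/1 columns: best_taught and passed."""
    table = pd.crosstab(df["best_taught"], df["passed"])
    a = table.loc[0, 0]  # true fail:  not-best-taught and failed
    b = table.loc[0, 1]  # false pass: not-best-taught but passed
    c = table.loc[1, 0]  # false fail: best-taught but failed
    d = table.loc[1, 1]  # true pass:  best-taught and passed
    return (a + d) / (a + b + c + d)
```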

In Closing The assumption that accountability tests are sensitive to instruction rarely holds. Inferences drawn from test scores about school quality and teaching effectiveness must be validated before action is taken. The empirical approaches presented here should prove helpful in determining if the inference that a test is instructionally sensitive is indeed warranted.

Presenter’s email address: scourt@usd259.net Questions, comments, or suggestions are welcome