Stephen C. Court, Educational Research and Evaluation, LLC. A Presentation at the First International Conference on Instructional Sensitivity, Achievement and Assessment Institute, University of Kansas, Lawrence, Kansas, November 13-15, 2013.

Similar presentations

A District-initiated Appraisal of a State Assessment's Instructional Sensitivity: HOLDING ACCOUNTABILITY TESTS ACCOUNTABLE. Stephen C. Court. Presented in.
How to Make a Test & Judge its Quality. Aim of the Talk Acquaint teachers with the characteristics of a good and objective test See Item Analysis techniques.
Summative Assessment Kansas State Department of Education ASSESSMENT LITERACY PROJECT1.
The Research Consumer Evaluates Measurement Reliability and Validity
Using Multiple Choice Tests for Assessment Purposes: Designing Multiple Choice Tests to Reflect and Foster Learning Outcomes Terri Flateby, Ph.D.
Item Analysis: A Crash Course Lou Ann Cooper, PhD Master Educator Fellowship Program January 10, 2008.
Advanced Topics in Standard Setting. Methodology Implementation Validity of standard setting.
ASSESSMENT LITERACY PROJECT4 Student Growth Measures - SLOs.
Stephen C. Court Educational Research and Evaluation, LLC A Presentation at the First International Conference on Instructional Sensitivity Achievement.
Issues of Technical Adequacy in Measuring Student Growth for Educator Effectiveness Stanley Rabinowitz, Ph.D. Director, Assessment & Standards Development.
Student Growth Percentile Model Question Answered
Classroom Assessment (1)
Chapter 4 Validity.
EPIDEMIOLOGY AND BIOSTATISTICS DEPT Estimating Population Value with Hypothesis Testing.
New Hampshire Enhanced Assessment Initiative: Technical Documentation for Alternate Assessments Alignment Inclusive Assessment Seminar Brian Gong Claudia.
Chapter 9 Flashcards. measurement method that uses uniform procedures to collect, score, interpret, and report numerical results; usually has norms and.
Classroom Assessment A Practical Guide for Educators by Craig A
Reliability of Selection Measures. Reliability Defined The degree of dependability, consistency, or stability of scores on measures used in selection.
Understanding Validity for Teachers
Classroom Assessment Reliability. Classroom Assessment Reliability Reliability = Assessment Consistency. –Consistency within teachers across students.
Measurement Concepts & Interpretation. Scores on tests can be interpreted: By comparing a client to a peer in the norm group to determine how different.
Curriculum-Based Measurements The What. Curriculum-Based Measurements  Curriculum-Based Measurements (CBM) Assessment tools derived from the curriculum,
Planning for Assessment Presented by Donna Woods.
Challenges in Developing a University Admissions Test & a National Assessment A Presentation at the Conference On University & Test Development in Central.
Copyright © 2001 by The Psychological Corporation 1 The Academic Competence Evaluation Scales (ACES) Rating scale technology for identifying students with.
Creating Assessments with English Language Learners in Mind In this module we will examine: Who are English Language Learners (ELL) and how are they identified?
John Cronin, Ph.D. Director The Kingsbury NWEA Measuring and Modeling Growth in a High Stakes Environment.
Interpreting Assessment Results using Benchmarks Program Information & Improvement Service Mohawk Regional Information Center Madison-Oneida BOCES.
Test item analysis: When are statistics a good thing? Andrew Martin Purdue Pesticide Programs.
October 15. In Chapter 9: 9.1 Null and Alternative Hypotheses 9.2 Test Statistic 9.3 P-Value 9.4 Significance Level 9.5 One-Sample z Test 9.6 Power and.
CRT Dependability Consistency for criterion- referenced decisions.
Slide 1 Estimating Performance Below the National Level Applying Simulation Methods to TIMSS Fourth Annual IES Research Conference Dan Sherman, Ph.D. American.
Classroom Assessment A Practical Guide for Educators by Craig A. Mertler Chapter 7 Portfolio Assessments.
Assessment in Education Patricia O’Sullivan Office of Educational Development UAMS.
Teaching Today: An Introduction to Education 8th edition
Reliability & Validity
OHIO’S ALTERNATE ASSESSMENT FOR STUDENTS WITH SIGNIFICANT COGNITIVE DISABILITIES (AASCD): AT A GLANCE.
Illustration of a Validity Argument for Two Alternate Assessment Approaches Presentation at the OSEP Project Directors’ Conference Steve Ferrara American.
Creating Assessments The three properties of good assessments.
Assessing Learning for Students with Disabilities Tom Haladyna Arizona State University.
SGP Logic Model Provides a method that enables SGP data to contribute to performance evaluation when data are missing. The distribution of SGP data combined.
The Do’s and Don’ts of High-Stakes Student Achievement Testing Andrew Porter Vanderbilt University August 2006.
Sub-regional Workshop on Census Data Evaluation, Phnom Penh, Cambodia, November 2011 Evaluation of Age and Sex Distribution United Nations Statistics.
Assessment & Evaluation. Assessment answers questions related to individuals, “What did the student learn?” Uses tests and other activities to determine.
McGraw-Hill/Irwin © 2012 The McGraw-Hill Companies, Inc. All rights reserved. Obtaining Valid and Reliable Classroom Evidence Chapter 4:
ARG symposium discussion Dylan Wiliam Annual conference of the British Educational Research Association; London, UK:
Georgia will lead the nation in improving student achievement. 1 Georgia Performance Standards Day 3: Assessment FOR Learning.
Evaluating Impacts of MSP Grants Ellen Bobronnikov January 6, 2009 Common Issues and Potential Solutions.
Introduction to Item Analysis Objectives: To begin to understand how to identify items that should be improved or eliminated.
Evaluation Requirements for MSP and Characteristics of Designs to Estimate Impacts with Confidence Ellen Bobronnikov February 16, 2011.
Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments IES Research Conference June 28 th, 2010 Marie-Andrée Somers (Presenter)
Evaluating Classification Performance
Reliability EDUC 307. Reliability  How consistent is our measurement?  the reliability of assessments tells the consistency of observations.  Two or.
Assessment Assessment is the collection, recording and analysis of data about students as they work over a period of time. This should include, teacher,
Regression Analysis: A statistical procedure used to find relations among a set of variables B. Klinkenberg G
Project VIABLE - Direct Behavior Rating: Evaluating Behaviors with Positive and Negative Definitions Rose Jaffery 1, Albee T. Ongusco 3, Amy M. Briesch.
Assessment in Education ~ What teachers need to know.
Copyright © Springer Publishing Company, LLC. All Rights Reserved. DEVELOPING AND USING TESTS – Chapter 11 –
Measurement and Scaling Concepts
Exam Analysis Camp Teach & Learn May 2015 Stacy Lutter, D. Ed., RN Nursing Graduate Students: Mary Jane Iosue, RN Courtney Nissley, RN Jennifer Wierworka,
Copyright © 2009 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 47 Critiquing Assessments.
Evaluation Requirements for MSP and Characteristics of Designs to Estimate Impacts with Confidence Ellen Bobronnikov March 23, 2011.
Classroom Assessment Validity And Bias in Assessment.
Week 3 Class Discussion.
Scoring: Measures of Central Tendency
Assessing the Common Core Standards
Validity and Reliability II: The Basics
AACC Mini Conference June 8-9, 2011
REVIEW I Reliability scraps Index of Reliability
Presentation transcript:

Stephen C. Court, Educational Research and Evaluation, LLC. A Presentation at the First International Conference on Instructional Sensitivity, Achievement and Assessment Institute, University of Kansas, Lawrence, Kansas, November 13-15, 2013

Design of the appraisal:
- Utilizes existing data
- Applies statistical procedures that are practical for most settings
- Items: 2013 Grade 5 and Grade 8 Reading and Math
- Teacher Grouping: Median SGP (50-50 split) based on 2011 and 2012 test results
- Layer: 2013 Proficient/Not Proficient
- Analysis: Item x Grouping w/ Layer
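Below is a minimal sketch, not the presenter's actual code, of how the Item x Grouping w/ Layer setup could be run as a Mantel-Haenszel analysis in Python: each item's correct/incorrect responses are crossed with the SGP-based teacher grouping within each Proficient/Not Proficient stratum. The DataFrame and its column names (proficient_2013, high_sgp_group, and the item column) are hypothetical placeholders.

import pandas as pd
from statsmodels.stats.contingency_tables import StratifiedTable

def mh_item_sensitivity(df, item_col):
    """Pooled MH odds ratio and p-value for one item, stratified by the 2013 proficiency layer."""
    tables = []
    for _, layer in df.groupby("proficient_2013"):       # Proficient / Not Proficient layer
        t = pd.crosstab(layer[item_col] == 1,             # item correct vs. incorrect
                        layer["high_sgp_group"] == 1)     # above- vs. below-median-SGP teachers
        if t.shape == (2, 2):
            tables.append(t.to_numpy())
    strat = StratifiedTable(tables)
    return strat.oddsratio_pooled, strat.test_null_odds().pvalue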

Purpose is not: to use the item to validate the accuracy of the grouping method.
Purpose: to appraise the instructional sensitivity of the item.
Therefore, accept the grouping method as accurate…for the sake of argument.

Content          | N students | N items | Insensitive | Marginally Sensitive | Sensitive
Grade 5 Math     |            |         | 66%         | 25%                  | 9%
Grade 5 Reading  |            |         | 82%         | 18%                  | 0%
Grade 8 Math     |            |         | 47%         | 38%                  | 15%
Grade 8 Reading  |            |         | 88%         | 12%                  | 0%
Average          |            |         | 71%         | 23%                  | 6%

(The Insensitive percentages follow as the complements of the two reported columns; the N-student and N-item counts were not recoverable from the slide.)

Classification based on significance level:
- Sensitive: p < .001
- Marginally Sensitive: p = .001 to .05
- Insensitive: p > .05
Typical MH Estimate range for Insensitive:
Typical MH Estimate range for Marginally Sensitive:
Typical MH Estimate threshold for Sensitive: > 1.500
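A small sketch of the significance-level rule above, applied to a single item's MH p-value (thresholds taken directly from the slide):

def classify_by_pvalue(p):
    """Classify one item from the p-value of its MH test."""
    if p < 0.001:
        return "Sensitive"
    if p <= 0.05:
        return "Marginally Sensitive"
    return "Insensitive"

print(classify_by_pvalue(0.0004))   # Sensitive
print(classify_by_pvalue(0.03))     # Marginally Sensitive
print(classify_by_pvalue(0.20))     # Insensitive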

Corresponding ROC analysis associates significance levels of .000 with AUC values of just .560. Yet AUC values are typically interpreted on a conventional scale such as:
- .90 to 1.00 = excellent (A)
- .80 to .90 = good (B)
- .70 to .80 = fair (C)
- .60 to .70 = poor (D)
- .50 to .60 = fail (F)
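The following illustrative sketch uses simulated data, not the Kansas results, to show how a highly "significant" group difference can still grade out as a failing AUC; the correct-response rates (0.66 vs. 0.54) are chosen only to reproduce an AUC near .56.

import numpy as np
from sklearn.metrics import roc_auc_score

def grade_auc(auc):
    # Conventional academic grading of AUC, as quoted above
    if auc >= 0.90: return "excellent (A)"
    if auc >= 0.80: return "good (B)"
    if auc >= 0.70: return "fair (C)"
    if auc >= 0.60: return "poor (D)"
    return "fail (F)"

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=20000)                        # 1 = more-effective-teacher group (simulated)
item_correct = rng.binomial(1, np.where(group == 1, 0.66, 0.54))
auc = roc_auc_score(group, item_correct)                      # roughly .56 despite a tiny p-value at this N
print(round(auc, 3), grade_auc(auc))                          # fail (F)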

Significance levels derived from N-based standard errors may not serve well as the basis of interpretation. Alternative schemes for interpreting MH Estimates:

MH Range                | Interpretation
< 2 (incl. values < 1)  | Fail
2 to 5                  | Poor
5 to 10                 | Good
> 10                    | Excellent

MH Range  | Interpretation
< 5       | Insensitive
5 to 10   | Marginally Sensitive
> 10      | Sensitive
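A sketch of the two alternative schemes as simple classification functions; how values falling exactly on the 2, 5, and 10 boundaries are handled is an assumption, since the slide does not specify it.

def grade_mh(mh):
    """Letter-grade style scheme for the MH estimate."""
    if mh < 2:            # includes values below 1
        return "Fail"
    if mh < 5:
        return "Poor"
    if mh < 10:
        return "Good"
    return "Excellent"

def classify_mh(mh):
    """Sensitivity-label scheme for the MH estimate."""
    if mh < 5:
        return "Insensitive"
    if mh < 10:
        return "Marginally Sensitive"
    return "Sensitive"

print(grade_mh(1.5), classify_mh(1.5))      # Fail Insensitive
print(grade_mh(7.0), classify_mh(7.0))      # Good Marginally Sensitive
print(grade_mh(12.0), classify_mh(12.0))    # Excellent Sensitive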

Teacher Grouping baseline: Random Equivalent Halves
- 0% Sensitive; 100% Insensitive
- No discrimination; complete overlap
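One way to build that baseline, sketched here with an assumed column name (teacher_id), is to assign teachers to two halves at random and rerun the same item analysis; with no real signal in the grouping, essentially no items should be flagged.

import numpy as np

def add_random_halves(df, teacher_col="teacher_id", seed=13):
    """Return a copy of df with a random 50-50 teacher grouping column."""
    rng = np.random.default_rng(seed)
    teachers = df[teacher_col].unique()
    half = set(rng.choice(teachers, size=len(teachers) // 2, replace=False))
    out = df.copy()
    out["random_half_group"] = out[teacher_col].isin(half).astype(int)
    return out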

Two consecutive years of class proficiency above 80%. Extreme discrimination between groups: the mean difference between the two Proficiency groups was more than 50 times greater than the mean difference between the two SGP-based teacher groups.

Grouping Method | Between-group Mean Difference: Min. | Max.
SGP             |                                     |
Proficiency     |                                     |
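A sketch of the contrast being compared, with hypothetical column names (scale_score, proficiency_group, sgp_group):

def between_group_gap(df, group_col, score_col="scale_score"):
    """Absolute difference in mean score between the two groups defined by group_col."""
    means = df.groupby(group_col)[score_col].mean()
    return abs(means.iloc[0] - means.iloc[1])

# Usage (column names assumed):
# ratio = between_group_gap(df, "proficiency_group") / between_group_gap(df, "sgp_group")
# The slide reports this ratio exceeding 50 for the proficiency-based grouping.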

The proficiency-based grouping method tended to detect instructional sensitivity more readily. For grade 5 reading:

In practice, we do not design tests and create items for extremely separated groups. We design tests and create items for everyone – all schools, all teachers, and all students. The fact that the method works for extreme groups provides evidence that the SGP-DIF method proposed by Popham and Ryan yields a basis for valid appraisals of item sensitivity:

Except under the most extreme conditions, the items are not sufficiently sensitive to instruction – i.e., to contrasts of more effective vs. less effective teachers. In turn, they should not contribute to test scores that serve as the basis for high-stakes decisions regarding:
- school effectiveness
- instructional quality
- teacher competence

We would not render summative judgment of a teacher's competence on the basis of just one set of test scores. Nor, therefore, should we render summative judgment of an item's sensitivity to instruction on the basis of one appraisal approach. Instead, use five methods – e.g., MH, ROC, LR, MI, and BSEM:
1. For each method, classify items as 0 = Insensitive, 1 = Marginally Sensitive, or 2 = Sensitive.
2. Sum the classifications to form a sensitivity index from 0 to 10.
3. Review items with sum < X, where X is set judgmentally on the basis of test purpose and the stakes attached. For high stakes, such as teacher evaluation, X might be set at 7 or 8.
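A small sketch of the proposed index, with illustrative (not actual) classifications for one item:

LABELS = {"Insensitive": 0, "Marginally Sensitive": 1, "Sensitive": 2}

def sensitivity_index(classifications):
    """Sum 0/1/2 classifications across methods into a 0-10 index (five methods)."""
    return sum(LABELS[label] for label in classifications.values())

# Hypothetical classifications for one item under the five methods
item = {"MH": "Sensitive", "ROC": "Marginally Sensitive", "LR": "Sensitive",
        "MI": "Insensitive", "BSEM": "Marginally Sensitive"}
X = 7   # judgmental cut for a high-stakes use such as teacher evaluation
idx = sensitivity_index(item)
if idx < X:
    print(f"index {idx} < {X}: flag item for review")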

Examine actual items for contextual characteristics that explain why the items function as they do under:
- Extreme conditions
- Real-world conditions
Examine not just p-values, biserials, reliability coefficients, sensitivity statistics, etc., but also actual item and stimulus format and content – for example:
- Stems: question vs. completion
- Options: short vs. long
- Alignment: Webb, Porter, etc.
- Cognitive load: DOK, Bloom, etc.