CRESST / UCLA
UCLA Graduate School of Education & Information Studies
Center for the Study of Evaluation
National Center for Research on Evaluation, Standards, and Student Testing

Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias

Jose-Felipe Martinez-Fernandez
Ann M. Mastergeorge

American Educational Research Association, New Orleans, April 1-5, 2001

Introduction
Performance assessments are increasingly popular methods for evaluating academic performance.
A number of studies have shown that well-trained raters can score performance assessments reliably for the general population of students.
This study examines whether trained raters introduce any bias when scoring performance assessments of students with disabilities.

Purpose
Compare the sources of score variability for students with and without disabilities in Language Arts and Mathematics performance assessments.
Determine whether important differences exist across student groups in terms of variance components and, if so, whether rater (teacher) bias plays a role.
Complement the results with raters' perceptions of bias (their own and others').

Method
The student and rater samples come from a larger district-wide validation study involving thousands of performance assessments.
Teachers from each grade and content area were trained as raters.
A total of six studies (each with different raters and students) were conducted for the 3rd-, 7th-, and 9th-grade assessments in Language Arts and Mathematics.

Method (continued)
For each study, 60 assessments (30 from regular education students and 30 from students who received some kind of accommodation) were rated by 4 raters on two occasions.
Raters were aware of each student's disability status only on the second rating occasion.
Bias is defined as systematic differences in scores across occasions; no practice or memory effects are expected.
The score scale ranges from 1 to 4.
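Put in terms of the design (this formalization is mine, not the authors'): because raters learned disability status only before the second occasion, bias of this kind would have to surface as occasion-related effects in the G studies, for example

```latex
\widehat{\text{bias}} \;=\; \bar{X}_{O=2} - \bar{X}_{O=1},
\qquad
\text{no systematic shift corresponds to } \sigma^2_o,\ \sigma^2_{ro},\ \sigma^2_{po} \approx 0 .
```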

Method (continued)
Two kinds of generalizability designs were used:
First, a "nested-within-disability" design with all 60 students [P(D) x R x O].
Second, separate fully crossed [P x R x O] designs for each group of 30 students.
The Mathematics assessments consisted of two tasks, so both a random [P x R x O x T] design and a fixed [P x R x O] design averaging over tasks were used (see the sketch below for the crossed case).
A survey asked about raters' perceptions of bias in rating students with disabilities (their own and other raters').
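As a concrete illustration, here is a minimal Python sketch (not the authors' analysis code; the array shape and the simulated 1-4 ratings are assumptions) of how variance components for one fully crossed [P x R x O] design can be estimated from ANOVA mean squares and their expected values:

```python
# Minimal sketch: ANOVA-based variance component estimation for a fully
# crossed Person x Rater x Occasion G-study. Assumes a complete score
# array of shape (n_persons, n_raters, n_occasions); data are simulated.
import numpy as np

def g_study_pxrxo(scores):
    """Estimate variance components for a random p x r x o design."""
    n_p, n_r, n_o = scores.shape
    grand = scores.mean()

    # Marginal means over the other facets
    m_p = scores.mean(axis=(1, 2))
    m_r = scores.mean(axis=(0, 2))
    m_o = scores.mean(axis=(0, 1))
    m_pr = scores.mean(axis=2)
    m_po = scores.mean(axis=1)
    m_ro = scores.mean(axis=0)

    # Sums of squares for each effect; residual obtained by subtraction
    ss_p = n_r * n_o * np.sum((m_p - grand) ** 2)
    ss_r = n_p * n_o * np.sum((m_r - grand) ** 2)
    ss_o = n_p * n_r * np.sum((m_o - grand) ** 2)
    ss_pr = n_o * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2)
    ss_po = n_r * np.sum((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2)
    ss_ro = n_p * np.sum((m_ro - m_r[:, None] - m_o[None, :] + grand) ** 2)
    ss_pro = (np.sum((scores - grand) ** 2)
              - ss_p - ss_r - ss_o - ss_pr - ss_po - ss_ro)

    # Mean squares
    ms = {
        "p": ss_p / (n_p - 1),
        "r": ss_r / (n_r - 1),
        "o": ss_o / (n_o - 1),
        "pr": ss_pr / ((n_p - 1) * (n_r - 1)),
        "po": ss_po / ((n_p - 1) * (n_o - 1)),
        "ro": ss_ro / ((n_r - 1) * (n_o - 1)),
        "pro,e": ss_pro / ((n_p - 1) * (n_r - 1) * (n_o - 1)),
    }

    # Solve the expected-mean-square equations, truncating negatives at 0
    var = {"pro,e": ms["pro,e"]}
    var["pr"] = max((ms["pr"] - ms["pro,e"]) / n_o, 0.0)
    var["po"] = max((ms["po"] - ms["pro,e"]) / n_r, 0.0)
    var["ro"] = max((ms["ro"] - ms["pro,e"]) / n_p, 0.0)
    var["p"] = max((ms["p"] - ms["pr"] - ms["po"] + ms["pro,e"]) / (n_r * n_o), 0.0)
    var["r"] = max((ms["r"] - ms["pr"] - ms["ro"] + ms["pro,e"]) / (n_p * n_o), 0.0)
    var["o"] = max((ms["o"] - ms["po"] - ms["ro"] + ms["pro,e"]) / (n_p * n_r), 0.0)
    return var

# Illustrative run: 30 students x 4 raters x 2 occasions, scores on a 1-4 scale
rng = np.random.default_rng(0)
fake = np.clip(np.round(rng.normal(2.5, 0.8, size=(30, 4, 2))), 1, 4)
print(g_study_pxrxo(fake))
```

The nested [P(D) x R x O] and task designs follow the same logic with additional facets; in practice, dedicated G-theory software such as GENOVA is typically used for these estimates.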

Score Distributions

Generalizability Results
Nested Design: Language Arts [Score = Rater x Occasion x Person (Disability)]

Generalizability Results (continued)
Nested Design: Mathematics [Score = Task x Rater x Occasion x Person (Disability)]

Generalizability Results (continued)
Crossed Design by Disability: Language Arts [Score = Rater x Occasion x Person]

Generalizability Results (continued)
Crossed Design by Disability: Mathematics [Score = Task x Rater x Occasion x Person]

Generalizability Results (continued)
Crossed Design by Disability: Mathematics with Task facet fixed [Score = Person x Rater x Occasion, averaging over the two tasks]

Rater Survey
Rater Perceptions (** p < .01, N = 40)

Rater Survey (continued)
Mean Score of Raters on Self and Others Regarding Fairness and Bias in Scoring

Discussion
Variance components: the Person (P) component is always the largest (50% to 70% of the variance across designs).
However, a substantial amount of measurement error remains (the triple interaction and ignored facets).
Some differences exist between the regular education and disability groups in terms of variance components.
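For reference (standard G-theory notation; the decomposition is implied by the crossed design rather than spelled out on the slides), the observed-score variance in a fully crossed P x R x O design splits into seven components:

```latex
\sigma^2(X_{pro}) \;=\; \sigma^2_p + \sigma^2_r + \sigma^2_o
  + \sigma^2_{pr} + \sigma^2_{po} + \sigma^2_{ro} + \sigma^2_{pro,e}
```

The 50% to 70% figure refers to the share $\sigma^2_p / \sigma^2(X_{pro})$; the triple interaction $\sigma^2_{pro,e}$ is the residual that also absorbs any ignored facets.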

Discussion (continued)
Differences between groups: the total amount of variance is always smaller in the disability groups (more skewed score distributions).
Variance due to persons (P), and therefore the dependability coefficients, are lower for the disability group in Language Arts.
The same holds in Mathematics when the task facet is fixed and averaged, but not with two random tasks.
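The dependability coefficients referred to here are the usual G-theory quantities; as a sketch in conventional notation (not taken from the slides), for the crossed P x R x O design with $n'_r$ raters and $n'_o$ occasions in the decision study:

```latex
\sigma^2_\delta = \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{po}}{n'_o} + \frac{\sigma^2_{pro,e}}{n'_r n'_o},
\qquad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}
```

```latex
\sigma^2_\Delta = \frac{\sigma^2_r + \sigma^2_{pr}}{n'_r} + \frac{\sigma^2_o + \sigma^2_{po}}{n'_o} + \frac{\sigma^2_{ro} + \sigma^2_{pro,e}}{n'_r n'_o},
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}
```

A smaller person component $\sigma^2_p$ for the disability group therefore lowers both the generalizability ($E\rho^2$) and dependability ($\Phi$) coefficients, which is the pattern described above.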

Discussion (continued)
Rater bias: there are no Rater (R) main effects, so no leniency differences across raters.
There is no "rating occasion" (O) effect, so overall no bias is introduced by rater knowledge of disability status.
There are no rater interactions with tasks or occasions.

Discussion (continued)
However, there is a non-negligible Person by Rater (P x R) interaction, which is considerably larger for students with disabilities.
This does not necessarily constitute bias, but it can still compromise the validity of scores for accommodated students.
Are features of papers from students with disabilities differentially salient to different raters?

Discussion (continued)
There is a large Person by Task (P x T) interaction in Mathematics, but it is considerably smaller for students with disabilities:
Students with disabilities may be less aware of the different nature of the tasks, so this otherwise typical interaction (Miller & Linn, 2000, among others) does not fully emerge.
Accommodations may not be having the intended leveling effects.
With a random task facet, the smaller P x T interaction "increases reliability" for students with disabilities (see the note below).
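To make the last point concrete (standard G-theory algebra, stated here for clarity rather than quoted from the slides): when the task facet is random, the person-by-task component counts as relative error, whereas when tasks are fixed and scores are averaged over them it folds into the universe score instead:

```latex
\text{Random } T:\quad \sigma^2_\delta \;=\; \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{po}}{n'_o}
  + \frac{\sigma^2_{pt}}{n'_t} + \text{(higher-order interactions with } p\text{)}
```

```latex
\text{Fixed } T \text{ (averaged over } n_t = 2 \text{ tasks)}:\quad
\sigma^2_\tau \;=\; \sigma^2_p + \frac{\sigma^2_{pt}}{n_t},
\qquad \sigma^2_{pt} \text{ no longer contributes to } \sigma^2_\delta .
```

Hence a smaller P x T component for students with disabilities shrinks the error term under the random-task design and pushes the coefficient up, without implying better measurement.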

Discussion (continued)
From the rater survey: teachers believe that raters show some bias and unfairness when scoring performance assessments from students with disabilities.
Raters see themselves as fairer and less biased than the general population of raters.
Whether this reflects training or initially high self-perceptions is not clear; a not-uncommon "I am fair, but others are less so" effect could be the sole explanation.

Future Directions and Questions
Are there different patterns for different kinds of disabilities and accommodations?
Are accommodations being used appropriately and having the intended effects?
Do the patterns hold for raters at local school sites, who generally receive less training?
Does rater background influence the size and nature of these effects and interactions?
How does the testing occasion facet influence the variance components and other interactions?