CRESST / UCLA
UCLA Graduate School of Education & Information Studies
Center for the Study of Evaluation
National Center for Research on Evaluation, Standards, and Student Testing

Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias

Jose-Felipe Martinez-Fernandez
Ann M. Mastergeorge

American Educational Research Association, New Orleans, April 1-5, 2001
Introduction

Performance assessments are an increasingly popular method for evaluating academic performance. A number of studies have shown that well-trained raters can score performance assessments reliably for the general population of students. This study addresses whether trained raters introduce bias when scoring performance assessments of students with disabilities.
Purpose

Compare the sources of score variability for students with and without disabilities on Language Arts and Mathematics performance assessments.
Determine whether important differences exist across student groups in terms of variance components and, if so, whether rater (teacher) bias plays a role.
Complement these results with raters' perceptions of bias (their own and others').
Method

The student and rater samples come from a larger district-wide validation study involving thousands of performance assessments. Teachers from each grade and content area were trained as raters. A total of six studies (each with different raters and students) were conducted for 3rd-, 7th-, and 9th-grade assessments in Language Arts and Mathematics.
Method (continued)

For each study, 60 assessments (30 from regular-education students and 30 from students who received some kind of accommodation) were rated by 4 raters on two occasions. Raters were aware of each student's disability status only on the second rating occasion, so bias is defined as a systematic difference in scores across occasions. No practice or memory effects were expected. Scores range from 1 to 4.
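As a rough illustration of this definition of bias (not the authors' analysis), the sketch below generates hypothetical scores with the study's dimensions (30 students per group, 4 raters, 2 occasions, scale 1-4) and compares mean scores across the two rating occasions for each group; the data and names are placeholders.

```python
import numpy as np

# Hypothetical data with the study's dimensions: for each group,
# scores[person, rater, occasion] on the 1-4 scale. A real analysis would
# read the actual ratings; random integers stand in for them here.
rng = np.random.default_rng(0)
scores = {
    "regular education": rng.integers(1, 5, size=(30, 4, 2)).astype(float),
    "accommodated":      rng.integers(1, 5, size=(30, 4, 2)).astype(float),
}

# Under the definition above, rater bias would appear as a systematic shift
# in mean scores from occasion 1 (disability status unknown) to occasion 2
# (status known), particularly for the accommodated group.
for group, x in scores.items():
    shift = x[:, :, 1].mean() - x[:, :, 0].mean()
    print(f"{group:>18}: occasion 2 minus occasion 1 mean = {shift:+.3f}")
```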
Method (continued)

Two kinds of generalizability designs were used. The first is a "nested-within-disability" design with all 60 students [P(D) x R x O]. The second is a separate, fully crossed [P x R x O] design for each disability group of 30 students. The Mathematics assessments consisted of two tasks, so both a random [P x R x O x T] design and a fixed [P x R x O] design averaging over tasks were used. A survey asked about raters' perceptions of bias in rating students with disabilities (their own and other raters').
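As a sketch of how variance components for the fully crossed [P x R x O] design can be estimated from the ANOVA mean squares and their expected values (the standard random-model approach in generalizability theory), the function below assumes a complete score array of shape (persons, raters, occasions); it is illustrative code, not the software used in the study.

```python
import numpy as np

def g_study_pxrxo(scores):
    """Variance components for a fully crossed p x r x o G-study.

    scores: array of shape (n_p, n_r, n_o), one score per person/rater/occasion.
    Returns a dict of variance-component estimates (negatives truncated to 0).
    """
    n_p, n_r, n_o = scores.shape
    grand = scores.mean()

    # Marginal means over the omitted facets
    m_p = scores.mean(axis=(1, 2))   # person means
    m_r = scores.mean(axis=(0, 2))   # rater means
    m_o = scores.mean(axis=(0, 1))   # occasion means
    m_pr = scores.mean(axis=2)       # person x rater means
    m_po = scores.mean(axis=1)       # person x occasion means
    m_ro = scores.mean(axis=0)       # rater x occasion means

    # Sums of squares for each effect
    ss_p = n_r * n_o * np.sum((m_p - grand) ** 2)
    ss_r = n_p * n_o * np.sum((m_r - grand) ** 2)
    ss_o = n_p * n_r * np.sum((m_o - grand) ** 2)
    ss_pr = n_o * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2)
    ss_po = n_r * np.sum((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2)
    ss_ro = n_p * np.sum((m_ro - m_r[:, None] - m_o[None, :] + grand) ** 2)
    ss_pro = np.sum((scores - grand) ** 2) - (ss_p + ss_r + ss_o + ss_pr + ss_po + ss_ro)

    # Mean squares
    ms = {
        "p": ss_p / (n_p - 1),
        "r": ss_r / (n_r - 1),
        "o": ss_o / (n_o - 1),
        "pr": ss_pr / ((n_p - 1) * (n_r - 1)),
        "po": ss_po / ((n_p - 1) * (n_o - 1)),
        "ro": ss_ro / ((n_r - 1) * (n_o - 1)),
        "pro": ss_pro / ((n_p - 1) * (n_r - 1) * (n_o - 1)),
    }

    # Solve the expected-mean-square equations for the random model
    var = {
        "pro,e": ms["pro"],
        "pr": (ms["pr"] - ms["pro"]) / n_o,
        "po": (ms["po"] - ms["pro"]) / n_r,
        "ro": (ms["ro"] - ms["pro"]) / n_p,
        "p": (ms["p"] - ms["pr"] - ms["po"] + ms["pro"]) / (n_r * n_o),
        "r": (ms["r"] - ms["pr"] - ms["ro"] + ms["pro"]) / (n_p * n_o),
        "o": (ms["o"] - ms["po"] - ms["ro"] + ms["pro"]) / (n_p * n_r),
    }
    return {k: max(v, 0.0) for k, v in var.items()}
```

For one of the 30-student groups in this study, `scores` would have shape (30, 4, 2); the nested [P(D) x R x O] and four-facet [P x R x O x T] designs follow the same logic with additional sum-of-squares terms.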
Score Distributions
Generalizability Results
Nested Design: Language Arts [Score = Rater x Occasion x Person (Disability)]
Generalizability Results (continued)
Nested Design: Mathematics [Score = Task x Rater x Occasion x Person (Disability)]
Generalizability Results (continued)
Crossed Design by Disability: Language Arts [Score = Rater x Occasion x Person]
Generalizability Results (continued)
Crossed Design by Disability: Mathematics [Score = Task x Rater x Occasion x Person]
Generalizability Results (continued)
Crossed Design by Disability: Mathematics with the Task facet fixed [Score = Person x Rater x Occasion, averaging over the two tasks]
Rater Survey
Rater Perceptions (** p < .01, N = 40)
Rater Survey (continued)
Mean Rater Scores on Self and Others Regarding Fairness and Bias in Scoring
Discussion

Variance components: The person (P) component is always the largest (50% to 70% of the variance across designs). However, a substantial amount of measurement error remains (the triple interaction and ignored facets). Some differences exist between the regular-education and disability groups in terms of variance components.
Discussion (continued)

Differences between groups: The total variance is always smaller in the disability groups (a more skewed score distribution). Variance due to persons (P), and therefore the dependability coefficients, are lower for the disability group in Language Arts. The same holds in Mathematics with a fixed, averaged task facet, but not with two random tasks.
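For reference, these are the standard generalizability (relative) and dependability (absolute) coefficients for the crossed p x R x O design, written for a D-study with n'_r raters and n'_o occasions; the slides do not give the formulas, so they are supplied here only as the usual definitions.

```latex
E\rho^{2} = \frac{\sigma^{2}_{p}}
{\sigma^{2}_{p} + \frac{\sigma^{2}_{pr}}{n'_{r}} + \frac{\sigma^{2}_{po}}{n'_{o}}
 + \frac{\sigma^{2}_{pro,e}}{n'_{r}\,n'_{o}}}
\qquad
\Phi = \frac{\sigma^{2}_{p}}
{\sigma^{2}_{p} + \frac{\sigma^{2}_{r}}{n'_{r}} + \frac{\sigma^{2}_{o}}{n'_{o}}
 + \frac{\sigma^{2}_{pr}}{n'_{r}} + \frac{\sigma^{2}_{po}}{n'_{o}}
 + \frac{\sigma^{2}_{ro}}{n'_{r}\,n'_{o}} + \frac{\sigma^{2}_{pro,e}}{n'_{r}\,n'_{o}}}
```

A smaller person variance in the disability group lowers both coefficients, which is consistent with the pattern described above.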
Discussion (continued)

Rater bias: There are no rater (R) main effects and no leniency differences across raters. There is no "rating occasion" (O) effect, so overall no bias is introduced by rater knowledge of disability status. There are no rater interactions with tasks or occasions.
Discussion (continued)

However, there is a non-negligible Person x Rater (PxR) interaction that is considerably larger for students with disabilities. This does not necessarily constitute bias, but it can still compromise the validity of scores for accommodated students. Are features in papers from students with disabilities differentially salient to different raters?
Discussion (continued)

There is a large Person x Task (PxT) interaction in Mathematics, but it is considerably smaller for students with disabilities. Students with disabilities may not be aware enough of the different nature of the tasks for this otherwise natural interaction (Miller & Linn, 2000, and others) to show. Accommodations may not be having the intended leveling effects. With a random task facet, the lower PxT interaction "increases reliability" for students with disabilities, as the expressions below show.
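To see why a smaller PxT component can inflate reliability when Task is random, here is the standard relative-error variance for the fully crossed p x R x O x T random design (again supplied as the usual G-theory expression rather than taken from the slides):

```latex
\sigma^{2}_{\delta} =
\frac{\sigma^{2}_{pr}}{n'_{r}} + \frac{\sigma^{2}_{po}}{n'_{o}} + \frac{\sigma^{2}_{pt}}{n'_{t}}
+ \frac{\sigma^{2}_{pro}}{n'_{r} n'_{o}} + \frac{\sigma^{2}_{prt}}{n'_{r} n'_{t}}
+ \frac{\sigma^{2}_{pot}}{n'_{o} n'_{t}}
+ \frac{\sigma^{2}_{prot,e}}{n'_{r} n'_{o} n'_{t}},
\qquad
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{\delta}}.
```

Because the PxT component enters the error term only when Task is treated as random, a smaller PxT interaction shrinks the denominator and raises the coefficient for the disability group; when the task facet is fixed and scores are averaged over the two tasks, that component is instead absorbed into the universe-score variance.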
Discussion (continued)

From the rater survey: Teachers believe there is some bias and unfairness from raters when scoring performance assessments of students with disabilities. Raters see themselves as fairer and less biased than raters in general. Whether this is due to training or to initially high self-perceptions is not clear; a not-uncommon "I'm great, but others aren't" effect could be the sole reason.
Future Directions and Questions

Are there different patterns for different kinds of disabilities and accommodations?
Are accommodations being used appropriately and having the intended effects?
Do the patterns hold for raters at local school sites, who generally receive less training?
Does rater background influence the size and nature of these effects and interactions?
How does the testing-occasion facet influence variance components and other interactions?