Six degrees of integration: an agenda for joined-up assessment Dylan Wiliam Annual Conference of the Chartered Institute of Educational Assessors, London: 23 April 2008

Overview: six degrees of integration
Function: formative versus summative
Quality: validity versus reliability
Format: multiple-choice versus constructed response
Scope: continuous versus one-off
Authority: teacher-produced versus expert-produced
Locus: school-based versus externally marked

Function Quality Format Scope Authority Locus

A statement of the blindingly obvious
You can’t work out how good something is until you know what it’s intended to do… Function, then quality.

Formative and summative
Descriptions of: instruments, purposes, functions.
An assessment functions formatively when evidence about student achievement elicited by the assessment is interpreted and used to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions that would have been taken in the absence of that evidence.

Gresham’s law and assessment
Usually (incorrectly) stated as “bad money drives out good.”
“The essential condition for Gresham’s Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998).
The parallel for assessment: summative drives out formative.
The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way.

Function Quality Format Scope Authority Locus

Reliability
Reliability is a measure of the stability of assessment outcomes under changes in things that (we think) shouldn’t make a difference, such as:
the marker/rater
the occasion
the item selection

Test length and reliability
Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).
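The lengthening claim can be quantified with the Spearman-Brown prophecy formula (not named on the slide, but the standard result behind it); a minimal sketch:

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability when test length changes by a factor k
    (k > 1 lengthens the test, k < 1 shortens it)."""
    return k * reliability / (1 + (k - 1) * reliability)

# Doubling a test whose reliability is 0.80:
print(round(spearman_brown(0.80, 2), 3))  # 0.889
```

The returns diminish quickly: doubling a 0.80-reliability test only reaches about 0.89, and pushing it to 0.95 by lengthening alone would take roughly 4.75 times the original testing time.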

Reliability is not what we really want
Take a test which is known to have a reliability of around 0.90 for a particular group of students.
Administer the test to the group of students and score it.
Give each student a random script rather than their own.
Record the scores assigned to each student.
What is the reliability of the scores assigned in this way?
A. 0.10
B. 0.30
C. 0.50
D. 0.70
E. 0.90
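The puzzle turns on the fact that classical reliability coefficients are computed from group-level statistics, which do not change when whole scripts are shuffled among students. A simulation sketch (assuming Cronbach's alpha as the reliability estimate; the data are synthetic):

```python
import random
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha for a (students x items) score matrix."""
    n_items = len(scores[0])
    item_vars = [statistics.variance(item) for item in zip(*scores)]
    total_var = statistics.variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

random.seed(1)
# 200 students x 20 items: each response = true ability + item-level noise.
abilities = [random.gauss(0, 1) for _ in range(200)]
scripts = [[random.gauss(a, 1) for _ in range(20)] for a in abilities]

before = cronbach_alpha(scripts)
random.shuffle(scripts)   # hand every student somebody's script at random
after = cronbach_alpha(scripts)
print(before == after)    # True: the coefficient is blind to who got which script
```

The coefficient is a property of the pile of scripts, not of the mapping from scripts to students, which is why handing out scripts at random leaves it untouched.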

Reliability v consistency
Classical measures of reliability:
are meaningful only for groups
are designed for continuous measures
Marks versus grades:
scores suffer from spurious accuracy
grades suffer from spurious precision
Classification consistency:
a more technically appropriate measure of the reliability of assessment
closer to the intuitive meaning of reliability
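Classification consistency can be estimated directly by simulation: draw true scores, add measurement error consistent with a given reliability, grade two independent administrations against the same cut-scores, and count agreements. A sketch (the reliability value and cut-scores here are illustrative assumptions, not figures from the talk):

```python
import bisect
import random

RELIABILITY = 0.90
CUTS = [-1.0, 0.0, 1.0]   # illustrative cut-scores dividing scores into 4 grades

def observed(true_score):
    # Error variance chosen so that var(true) / var(observed) = RELIABILITY
    # when true scores have unit variance.
    error_sd = ((1 - RELIABILITY) / RELIABILITY) ** 0.5
    return true_score + random.gauss(0, error_sd)

def grade(score):
    return bisect.bisect(CUTS, score)   # grade index 0..3

random.seed(0)
trials = 100_000
agree = sum(grade(observed(t)) == grade(observed(t))   # two independent sittings
            for t in (random.gauss(0, 1) for _ in range(trials)))
print(f"classification consistency: {agree / trials:.2f}")
```

Even with a reliability of 0.90, a four-grade classification puts an appreciable fraction of students into different grades on two sittings.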

Reliability & classification consistency
[Chart: classification consistency of National Curriculum Assessment in England]

Validity
Traditional definition: a property of assessments.
A test is valid to the extent that it assesses what it purports to assess.
Key properties (content validity): relevance, representativeness.
Fallacies:
that two tests with the same name assess the same thing
that two tests with different names assess different things
that a test valid for one group is valid for all groups

Trinitarian doctrines of validity
Content validity
Criterion-related validity (concurrent validity, predictive validity)
Construct validity

Validity
Validity is a property of inferences, not of assessments.
“One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original).
The phrase “a valid test” is therefore a category error (like “a happy rock”):
there is no such thing as a valid (or indeed invalid) assessment
there is no such thing as a biased assessment
Reliability is a prerequisite for validity:
talking about “reliability and validity” is like talking about “swallows and birds”
validity includes reliability

Modern conceptions of validity
Validity subsumes all aspects of assessment quality:
reliability
representativeness (content coverage)
relevance
predictiveness
But not impact (Popham: right concern, wrong concept).
“Validity is an integrative evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989, p. 13).

Consequential validity? No such thing!
“As has been stressed several times already, it is not that adverse social consequences of test use render the use invalid, but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance. If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced—or if the validation process can discount sources of test invalidity as the likely determinants, or at least render them less plausible—then the validity of the test use is not overturned. Adverse social consequences associated with valid test interpretation and use may implicate the attributes validly assessed, to be sure, as they function under the existing social conditions of the applied setting, but they are not in themselves indicative of invalidity.” (Messick, 1989)

Threats to validity
Inadequate reliability.
Construct-irrelevant variance: differences in scores are caused, in part, by differences not relevant to the construct of interest; the assessment assesses things it shouldn’t; the assessment is “too big.”
Construct under-representation: differences in the construct are not reflected in scores; the assessment doesn’t assess things it should; the assessment is “too small.”
With clear construct definition, all of these are technical issues, not value issues; but they interact strongly…

School effectiveness
Do differences in exam results support inferences about school quality? Key issues:
value-added
sensitivity to instruction: learning is slower than generally assumed, and the sensitivity to instruction of tests is exacerbated by test-construction procedures
Result: invalid attributions about the effects of schooling.

Learning is hard and slow…
[Chart: Leverhulme Numeracy Research Programme]

Why does this matter?
In England, school-level effects account for only 7% of the variability in GCSE scores.
In terms of value-added, there is no statistically significant difference between the middle 80 percent of English secondary schools.
The correlation between teacher quality and student progress is low:
average cohort progress: 0.3 sd per year
good teachers (+1 sd) produce 0.4 sd per year
poor teachers (−1 sd) produce 0.2 sd per year

So… Although teacher quality is the single most important determinant of student progress… …the effect is small compared to the accumulated achievement over the course of a learner’s education… …so that inferences that school outcomes are indications of the contributions made by the school are unlikely to be valid.
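The scale of the effect can be made concrete with the slide’s own figures (the 11 years of compulsory schooling is an illustrative assumption):

```python
AVG_PROGRESS = 0.3     # sd per year (from the slide)
TEACHER_EFFECT = 0.1   # a +1 sd teacher adds 0.1 sd/year (0.4 vs 0.3)
YEARS = 11             # illustrative length of compulsory schooling

accumulated = AVG_PROGRESS * YEARS
share = TEACHER_EFFECT / accumulated
print(f"accumulated achievement: {accumulated:.1f} sd")
print(f"one year with a +1 sd teacher adds {share:.0%} of that total")
```

One year with an unusually good teacher shifts cumulative achievement by only around 3%, which is why exam outcomes are such a weak signal of school quality.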

Function Quality Format Scope Authority Locus

Item formats
“No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32).
Myths about multiple-choice items:
they are biased against females
they assess only candidates’ ability to spot or guess
they test only lower-order skills

Comparing like with like…
Constructed-response items:
can be improved through guidance to markers
are cheap to develop, but expensive to score
for a one-hour year-cohort assessment in England: development £5,000; scoring £
Multiple-choice items:
cannot be improved through guidance to markers
are expensive to develop, but cheap to score
for a one-hour year-cohort assessment in England: development £?; scoring £5,000

Mathematics 1
What is the median for the following data set?
A. 22
B. 38 and 44
C. 41
D. 46
E. 77
F. This data set has no median

Mathematics 2
What can you say about the means of the following two data sets?
Set 1:
Set 2:
A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.
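The two data sets did not survive the transcription, but distractor C turns on whether a zero counts as a data point. With hypothetical sets differing only by a zero:

```python
from statistics import mean

set1 = [2, 4, 6]      # hypothetical data (the slide's actual sets are lost)
set2 = [2, 4, 6, 0]   # the same values plus a zero

# The zero is a data point: it adds nothing to the total
# but increases n, so it lowers the mean.
print(mean(set1), mean(set2))
```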

Mathematics 3 Which of the shapes below contains a dotted line that is also a diagonal?

Science 1 (Wilson & Draney, 2004)
The ball sitting on the table is not moving. It is not moving because:
A. no forces are pushing or pulling on the ball
B. gravity is pulling down, but the table is in the way
C. the table pushes up with the same force that gravity pulls down
D. gravity is holding it onto the table
E. there is a force inside the ball keeping it from rolling off the table

Science 2
You look outside and notice a very gentle rain. Suddenly, it starts raining harder. What happened?
A. A cloud bumped into the cloud that was only making a little rain.
B. A bigger hole opened in the cloud, releasing more rain.
C. A different cloud, with more rain, moved into the area.
D. The wind started to push more water out of the clouds.

Science 3
Jenna put a glass of cold water outside on a warm day. After a while, she could see small droplets on the outside of the glass. Why was this?
A. The air molecules around the glass condensed to form droplets of liquid.
B. The water vapor in the air near the cold glass condensed to form droplets of liquid water.
C. Water soaked through invisible holes in the glass to form droplets of water on the outside of the glass.
D. The cold glass causes oxygen in the air to become water.

Science 4
How could you increase the temperature of boiling water?
A. Add more heat.
B. Stir it constantly.
C. Add more water.
D. You can’t increase the temperature of boiling water.

Science 5
What can we do to preserve the ozone layer?
A. Reduce the amount of carbon dioxide produced by cars and factories.
B. Reduce the greenhouse effect.
C. Stop cutting down the rainforests.
D. Limit the numbers of cars that can be used when the level of ozone is high.
E. Properly dispose of air-conditioners and fridges.

English 1
Where would be the best place to begin a new paragraph?
No rules are carved in stone dictating how long a paragraph should be. However, for argumentative essays, a good rule of thumb is that, if your paragraph is shorter than five or six good, substantial sentences, then you should reexamine it to make sure that you've developed the ideas fully. [A] Do not look at that rule of thumb, however, as hard and fast. It is simply a general guideline that may not fit some paragraphs. [B] A paragraph should be long enough to do justice to the main idea of the paragraph. Sometimes a paragraph may be short; sometimes it will be long. [C] On the other hand, if your paragraph runs on to a page or longer, you should probably reexamine its coherence to make sure that you are sticking to only one main topic. Perhaps you can find subtopics that merit their own paragraphs. [D] Think more about the unity, coherence, and development of a paragraph than the basic length. [E] If you are worried that a paragraph is too short, then it probably lacks sufficient development. If you are worried that a paragraph is too long, then you may have rambled on to topics other than the one stated in your topic sentence.

English 2
In a piece of persuasive writing, which of these would be the best thesis statement?
A. The typical TV show has 9 violent incidents.
B. There is a lot of violence on TV.
C. The amount of violence on TV should be reduced.
D. Some programs are more violent than others.
E. Violence is included in programs to boost ratings.
F. Violence on TV is interesting.
G. I don’t like the violence on TV.
H. The essay I am going to write is about violence on TV.

History
Why are historians concerned with bias when analyzing sources?
A. People can never be trusted to tell the truth.
B. People deliberately leave out important details.
C. People are only able to provide meaningful information if they experienced an event firsthand.
D. People interpret the same event in different ways, according to their experience.
E. People are unaware of the motivations for their actions.
F. People get confused about sequences of events.

Function Quality Format Scope Authority Locus

The Lake Wobegon effect revisited
“All the women are strong, all the men are good-looking, and all the children are above average.” (Garrison Keillor)

Effects of narrow assessment
Incentives to teach to the test:
focus on some subjects at the expense of others
focus on some aspects of a subject at the expense of others
focus on some students at the expense of others (“bubble” students)
Consequences: learning that is narrow, shallow, and transient.

Function Quality Format Scope Authority Locus

Authority
Reliability requires random sampling from the domain of interest; increasing reliability requires increasing the size of the sample.
Using teacher assessment in certification is attractive:
it increases reliability (increased test time)
it increases validity (it addresses aspects of construct under-representation)
But it is problematic:
lack of trust (“the fox guarding the hen house”)
problems of biased inferences (construct-irrelevant variance)
it can introduce new kinds of construct under-representation

Function Quality Format Scope Authority Locus

Locus
Using external markers to mark student assessments involves spending more money in order to deny teachers professional learning opportunities.
Getting teachers involved in “common assessment”:
is not assessment for learning, nor formative assessment
but it is valuable, perhaps even essential, professional development

Final reflections

The challenge
To design an assessment system that is:
distributed, so that evidence collection is not undertaken entirely at the end
synoptic, so that learning has to accumulate
extensive, so that all important aspects are covered (breadth and depth)
manageable, so that costs are proportionate to benefits
trusted, so that stakeholders have faith in the outcomes

Constraints and affordances
Beliefs about what constitutes learning;
beliefs in the reliability and validity of the results of various tools;
a preference for, and trust in, numerical data, with a bias towards a single number;
trust in the judgments and integrity of the teaching profession;
belief in the value of competition between students;
belief in the value of competition between schools;
belief that test results measure school effectiveness;
fear of national economic decline and education’s role in this;
belief that the key to schools’ effectiveness is strong top-down management.

The minimal take-aways…
There is no such thing as a summative assessment.
There is no such thing as a reliable test.
There is no such thing as a valid test.
There is no such thing as a biased test.
“Validity including reliability.”