Six degrees of integration: an agenda for joined-up assessment Dylan Wiliam www.dylanwiliam.net Annual Conference of the Chartered Institute of Educational Assessors, London: 23 April 2008

Overview Six degrees of integration  Function  Formative versus summative  Quality  Validity versus reliability  Format  Multiple-choice versus constructed response  Scope  Continuous versus one-off  Authority  Teacher-produced versus expert-produced  Locus  School-based versus externally marked

Function Quality Format Scope Authority Locus

A statement of the blindingly obvious You can’t work out how good something is until you know what it’s intended to do… Function, then quality

Formative and summative Descriptions of  Instruments  Purposes  Functions An assessment functions formatively when evidence about student achievement elicited by the assessment is interpreted and used to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions that would have been taken in the absence of that evidence.

Gresham’s law and assessment Usually (incorrectly) stated as “Bad money drives out good” “The essential condition for Gresham's Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998) The parallel for assessment: Summative drives out formative The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way

Reliability Reliability is a measure of the stability of assessment outcomes under changes in things that (we think) shouldn’t make a difference, such as  marker/rater  occasion  item selection

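Test length and reliability
Factor by which a test must be lengthened to move from one reliability (columns, "From") to another (rows, "To"):

To \ From   0.70  0.75  0.80  0.85  0.90  0.95
0.70         1.0
0.75         1.3   1.0
0.80         1.7   1.3   1.0
0.85         2.4   1.9   1.4   1.0
0.90         3.9   3.0   2.3   1.6   1.0
0.95         8.1   6.3   4.8   3.4   2.1   1.0

Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).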
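The figures in the table above agree, to rounding, with the Spearman-Brown prophecy formula. The slide does not name the formula, so treat the following as a sketch under that assumption; the function name `lengthening_factor` is illustrative only.

```python
def lengthening_factor(r_from: float, r_to: float) -> float:
    """Spearman-Brown: factor by which a test must be lengthened
    to move from reliability r_from to reliability r_to."""
    if not (0 < r_from < 1 and 0 < r_to < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (r_to * (1 - r_from)) / (r_from * (1 - r_to))


levels = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

# Print the lower triangle: rows are the target ("To") reliability,
# columns are the starting ("From") reliability.
for r_to in levels:
    row = [lengthening_factor(r_from, r_to) for r_from in levels if r_from <= r_to]
    print(f"{r_to:.2f}  " + "  ".join(f"{k:4.1f}" for k in row))
```

Moving from 0.70 to 0.95, for example, requires a test roughly eight times as long, which is why lengthening quickly becomes impractical.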

Reliability is not what we really want Take a test which is known to have a reliability of around 0.90 for a particular group of students. Administer the test to the group of students and score it. Give each student a random script rather than their own. Record the scores assigned to each student. What is the reliability of the scores assigned in this way? A. 0.10 B. 0.30 C. 0.50 D. 0.70 E. 0.90
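One way to explore this question is to assume that reliability is estimated with an internal-consistency coefficient such as Cronbach's alpha (the slide does not say which estimator is meant). Under that assumption the coefficient depends only on the set of scripts, not on whose name is attached to each one, so handing scripts out at random leaves it where it was. The sketch below uses simulated right/wrong data; the data-generating model and all names are hypothetical.

```python
import random
from statistics import pvariance


def cronbach_alpha(scripts):
    """Cronbach's alpha for a list of scripts, each a list of item scores."""
    k = len(scripts[0])
    item_vars = [pvariance([s[i] for s in scripts]) for i in range(k)]
    total_var = pvariance([sum(s) for s in scripts])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


rng = random.Random(0)

# Hypothetical data: 200 students, 40 right/wrong items driven by one ability.
scripts = []
for _ in range(200):
    ability = rng.gauss(0, 1)
    scripts.append([1 if rng.gauss(ability, 1) > 0 else 0 for _ in range(40)])

# Give each student a random script: student i now receives shuffled[i].
shuffled = scripts[:]
rng.shuffle(shuffled)

alpha_own = cronbach_alpha(scripts)
alpha_random = cronbach_alpha(shuffled)
print(f"alpha with own scripts:    {alpha_own:.3f}")
print(f"alpha with random scripts: {alpha_random:.3f}")
```

The two coefficients coincide even though the randomly assigned scores say nothing about any individual student, which is the sense in which reliability alone is not what we really want.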

Reliability v consistency Classical measures of reliability  are meaningful only for groups  are designed for continuous measures Marks versus grades  Scores suffer from spurious accuracy  Grades suffer from spurious precision Classification consistency  A more technically appropriate measure of the reliability of assessment  Closer to the intuitive meaning of reliability
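Classification consistency can be illustrated with a small simulation: the proportion of students who would be placed at the same level on two parallel forms of a test. This is a minimal sketch, assuming normally distributed true scores and errors and hypothetical cut scores; none of these choices come from the slides.

```python
import random
from bisect import bisect_right


def classification_consistency(reliability, cuts, n=20000, seed=0):
    """Share of simulated students given the same level on two parallel forms.

    Observed scores have unit variance; `reliability` is the share of that
    variance due to true score, so the two forms correlate at `reliability`.
    `cuts` is a sorted list of cut scores separating the levels.
    """
    rng = random.Random(seed)
    sd_true = reliability ** 0.5
    sd_error = (1 - reliability) ** 0.5
    agree = 0
    for _ in range(n):
        true_score = rng.gauss(0, sd_true)
        form_a = true_score + rng.gauss(0, sd_error)
        form_b = true_score + rng.gauss(0, sd_error)
        # bisect_right returns the index of the level each score falls into.
        if bisect_right(cuts, form_a) == bisect_right(cuts, form_b):
            agree += 1
    return agree / n


cuts = [-1.0, 0.0, 1.0]  # hypothetical boundaries giving four levels
for r in (0.80, 0.90, 0.95):
    print(f"reliability {r:.2f}: consistency {classification_consistency(r, cuts):.2f}")
```

Adding more cut scores (finer grades) lowers consistency at any given reliability, which is one reason the two measures can tell different stories.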

Reliability & classification consistency Classification consistency of National Curriculum Assessment in England

Validity Traditional definition: a property of assessments  A test is valid to the extent that it assesses what it purports to assess  Key properties (content validity)  Relevance  Representativeness  Fallacies  Two tests with the same name assess the same thing  Two tests with different names assess different things  A test valid for one group is valid for all groups

Trinitarian doctrines of validity Content validity Criterion-related validity  Concurrent validity  Predictive validity Construct validity

Validity Validity is a property of inferences, not of assessments “One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original) The phrase “A valid test” is therefore a category error (like “A happy rock”)  No such thing as a valid (or indeed invalid) assessment  No such thing as a biased assessment Reliability is a pre-requisite for validity  Talking about “reliability and validity” is like talking about “swallows and birds”  Validity includes reliability

Modern conceptions of validity Validity subsumes all aspects of assessment quality  Reliability  Representativeness (content coverage)  Relevance  Predictiveness But not impact (Popham: right concern, wrong concept) “Validity is an integrative evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989 p. 13)

Consequential validity? No such thing! As has been stressed several times already, it is not that adverse social consequences of test use render the use invalid, but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance. If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced—or if the validation process can discount sources of test invalidity as the likely determinants, or at least render them less plausible—then the validity of the test use is not overturned. Adverse social consequences associated with valid test interpretation and use may implicate the attributes validly assessed, to be sure, as they function under the existing social conditions of the applied setting, but they are not in themselves indicative of invalidity. (Messick, 1989, p. 88-89)

Threats to validity Inadequate reliability Construct-irrelevant variance  Differences in scores are caused, in part, by differences not relevant to the construct of interest  The assessment assesses things it shouldn’t  The assessment is “too big” Construct under-representation  Differences in the construct are not reflected in scores  The assessment doesn’t assess things it should  The assessment is “too small” With clear construct definition all of these are technical—not value—issues But they interact strongly…

School effectiveness Do differences in exam results support inferences about school quality? Key issues:  Value-added  Sensitivity to instruction  Learning is slower than generally assumed  Sensitivity to instruction of tests is exacerbated by test-construction procedures Result: invalid attributions about the effects of schooling

Learning is hard and slow… Source: Leverhulme Numeracy Research Programme 860+570=?

Why does this matter? In England, school-level effects account for only 7% of the variability in GCSE scores. In terms of value-added, there is no statistically significant difference between the middle 80 percent of English secondary schools Correlation between teacher quality and student progress is low:  Average cohort progress: 0.3 sd per year  Good teachers (+1 sd) produce 0.4 sd per year  Poor teachers (-1 sd) produce 0.2 sd per year

So… Although teacher quality is the single most important determinant of student progress… …the effect is small compared to the accumulated achievement over the course of a learner’s education… …so that inferences that school outcomes are indications of the contributions made by the school are unlikely to be valid.
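A back-of-envelope version of this argument uses the progress figures quoted earlier (0.3 sd per year on average, 0.4 sd with a teacher one sd above average). The number of years of schooling, and the idea that progress simply accumulates at the average rate, are assumptions made purely for illustration.

```python
# Progress figures from the slides, in standard deviations per year.
AVERAGE_PROGRESS = 0.3
GOOD_TEACHER_PROGRESS = 0.4  # teacher one sd above average

# Assumption for illustration only: eleven years of schooling.
YEARS_OF_SCHOOLING = 11

accumulated = AVERAGE_PROGRESS * YEARS_OF_SCHOOLING
extra_from_one_good_year = GOOD_TEACHER_PROGRESS - AVERAGE_PROGRESS
share = extra_from_one_good_year / accumulated

print(f"accumulated achievement: {accumulated:.1f} sd")
print(f"extra from one year with a good teacher: {extra_from_one_good_year:.1f} sd")
print(f"share of accumulated achievement: {share:.1%}")
```

On these assumptions a single year with a good teacher shifts a student's accumulated achievement by only a few percent, so outcome differences between schools mostly reflect what students brought with them.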

Item formats “No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32) Myths about multiple-choice items  They are biased against females  They assess only candidates’ ability to spot or guess  They test only lower-order skills

Comparing like with like… Constructed-response items  Can be improved through guidance to markers  Can be developed cheaply, but are expensive to score  For a one-hour year-cohort assessment in England  Development: £5 000  Scoring: £1 000 000 Multiple-choice items  Cannot be improved through guidance to markers  Are expensive to develop, but cheap to score  For a one-hour year-cohort assessment in England  Development: £1 000 000?  Scoring: £5 000

Mathematics 1 What is the median for the following data set? 38 74 22 44 96 22 19 53 A. 22 B. 38 and 44 C. 41 D. 46 E. 77 F. This data set has no median
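Several of the options in this item coincide with other summary statistics of the same data, which is what lets a wrong answer point to a specific misconception. The sketch below computes them; the reading of each distractor is an inference from the numbers, not something stated on the slide.

```python
from statistics import mean, median, mode

data = [38, 74, 22, 44, 96, 22, 19, 53]
ordered = sorted(data)

summary = {
    "median": median(data),                  # mean of the two middle values
    "middle two": (ordered[3], ordered[4]),  # the values either side of the middle
    "mean": mean(data),
    "mode": mode(data),
    "range": max(data) - min(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A student choosing 22, 46 or 77 may be confusing the median with the mode, the mean or the range respectively, while "38 and 44" suggests stopping at the two middle values.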

Mathematics 2 What can you say about the means of the following two data sets? Set 1: 10 12 13 15 Set 2: 10 12 13 15 0 A. The two sets have the same mean. B. The two sets have different means. C. It depends on whether you choose to count the zero.

Mathematics 3 Which of the shapes below contains a dotted line that is also a diagonal?

Science The ball sitting on the table is not moving. It is not moving because: A. no forces are pushing or pulling on the ball. B. gravity is pulling down, but the table is in the way. C. the table pushes up with the same force that gravity pulls down. D. gravity is holding it onto the table. E. there is a force inside the ball keeping it from rolling off the table. (Wilson & Draney, 2004)

Science 2 You look outside and notice a very gentle rain. Suddenly, it starts raining harder. What happened? A. A cloud bumped into the cloud that was only making a little rain. B. A bigger hole opened in the cloud, releasing more rain. C. A different cloud, with more rain, moved into the area. D. The wind started to push more water out of the clouds.

Science 3 Jenna put a glass of cold water outside on a warm day. After a while, she could see small droplets on the outside of the glass. Why was this? A. The air molecules around the glass condensed to form droplets of liquid. B. The water vapor in the air near the cold glass condensed to form droplets of liquid water. C. Water soaked through invisible holes in the glass to form droplets of water on the outside of the glass. D. The cold glass causes oxygen in the air to become water.

Science 4 How could you increase the temperature of boiling water? A. Add more heat. B. Stir it constantly. C. Add more water. D. You can’t increase the temperature of boiling water.

Science 5 What can we do to preserve the ozone layer? A. Reduce the amount of carbon dioxide produced by cars and factories. B. Reduce the greenhouse effect. C. Stop cutting down the rainforests. D. Limit the numbers of cars that can be used when the level of ozone is high. E. Properly dispose of air-conditioners and fridges.

English Where would be the best place to begin a new paragraph? No rules are carved in stone dictating how long a paragraph should be. However, for argumentative essays, a good rule of thumb is that, if your paragraph is shorter than five or six good, substantial sentences, then you should reexamine it to make sure that you've developed the ideas fully. A Do not look at that rule of thumb, however, as hard and fast. It is simply a general guideline that may not fit some paragraphs. B A paragraph should be long enough to do justice to the main idea of the paragraph. Sometimes a paragraph may be short; sometimes it will be long. C On the other hand, if your paragraph runs on to a page or longer, you should probably reexamine its coherence to make sure that you are sticking to only one main topic. Perhaps you can find subtopics that merit their own paragraphs. D Think more about the unity, coherence, and development of a paragraph than the basic length. E If you are worried that a paragraph is too short, then it probably lacks sufficient development. If you are worried that a paragraph is too long, then you may have rambled on to topics other than the one stated in your topic sentence.

English 2 In a piece of persuasive writing, which of these would be the best thesis statement? A. The typical TV show has 9 violent incidents. B. There is a lot of violence on TV. C. The amount of violence on TV should be reduced. D. Some programs are more violent than others. E. Violence is included in programs to boost ratings. F. Violence on TV is interesting. G. I don’t like the violence on TV. H. The essay I am going to write is about violence on TV.

History Why are historians concerned with bias when analyzing sources? A. People can never be trusted to tell the truth. B. People deliberately leave out important details. C. People are only able to provide meaningful information if they experienced an event firsthand. D. People interpret the same event in different ways, according to their experience. E. People are unaware of the motivations for their actions. F. People get confused about sequences of events.

The Lake Wobegon effect revisited “All the women are strong, all the men are good-looking, and all the children are above average.” (Garrison Keillor)

Effects of narrow assessment Incentives to teach to the test  Focus on some subjects at the expense of others  Focus on some aspects of a subject at the expense of others  Focus on some students at the expense of others (“bubble” students) Consequences  Learning that is  Narrow  Shallow  Transient

Authority Reliability requires random sampling from the domain of interest Increasing reliability requires increasing the size of the sample Using teacher assessment in certification is attractive:  Increases reliability (increased test time)  Increases validity (addresses aspects of construct under-representation) But problematic  Lack of trust (“Fox guarding the hen house”)  Problems of biased inferences (construct-irrelevant variance)  Can introduce new kinds of construct under-representation

Locus Using external markers to mark student assessments involves spending more money in order to deny teachers professional learning opportunities Getting teachers involved in “common assessment”  Is not assessment for learning, nor formative assessment  But it is valuable, perhaps even essential, professional development

Final reflections

The challenge To design an assessment system that is:  Distributed  So that evidence collection is not undertaken entirely at the end  Synoptic  So that learning has to accumulate  Extensive  So that all important aspects are covered (breadth and depth)  Manageable  So that costs are proportionate to benefits  Trusted  So that stakeholders have faith in the outcomes

Constraints and affordances Beliefs about what constitutes learning; Beliefs in the reliability and validity of the results of various tools; A preference for and trust in numerical data, with bias towards a single number; Trust in the judgments and integrity of the teaching profession; Belief in the value of competition between students; Belief in the value of competition between schools; Belief that test results measure school effectiveness; Fear of national economic decline and education’s role in this; Belief that the key to schools’ effectiveness is strong top-down management;

The minimal take-aways… No such thing as a summative assessment No such thing as a reliable test No such thing as a valid test No such thing as a biased test “Validity including reliability”
