Not invented here: the baffling insularity of assessment practices in higher education. Dylan Wiliam (www.dylanwiliam.net). Keynote presentation at the University of London External System's 150th anniversary Assessment Symposium.

Presentation transcript:

Not invented here: the baffling insularity of assessment practices in higher education Dylan Wiliam Keynote presentation at the University of London External System’s 150th anniversary Assessment Symposium

Overview: some assessment tensions
- Function: formative versus summative
- Quality: validity versus reliability
- Format: multiple-choice versus constructed response
- Scope: continuous versus one-off

Function | Quality | Format | Scope

A statement of the blindingly obvious
You can't work out how good something is until you know what it's intended to do…
Function, then quality

Formative and summative: descriptions of
- Instruments
- Purposes
- Functions
An assessment functions formatively when evidence about student achievement elicited by the assessment is interpreted and used to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions that would have been taken in the absence of that evidence.

Gresham’s law and assessment
- Usually (incorrectly) stated as “Bad money drives out good”
- “The essential condition for Gresham's Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998)
- The parallel for assessment: summative drives out formative
- The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way

Function | Quality | Format | Scope

Validity
Traditional definition: a property of assessments
- A test is valid to the extent that it assesses what it purports to assess
- Key properties (content validity): relevance and representativeness
“Trinitarian” doctrines of validity:
- Content validity
- Criterion-related validity
  - Concurrent validity
  - Predictive validity
- Construct validity

Validity
- Validity is a property of inferences, not of assessments: “One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original)
- The phrase “a valid test” is therefore a category error (like “a happy rock”)
  - No such thing as a valid (or indeed invalid) assessment
  - No such thing as a biased assessment
- Reliability is a pre-requisite for validity
  - Talking about “reliability and validity” is like talking about “swallows and birds”: validity includes reliability
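
The claim that validity includes reliability can be made concrete with a small simulation. Under classical test theory, the correlation between observed scores and any criterion is capped by the square root of the test's reliability, so an unreliable test cannot support valid inferences however well-chosen its content. The sketch below is illustrative only and is not from the presentation; all numbers are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Classical test theory: observed score X = true score T + error E
true_score = rng.normal(0, 1, n)
# A criterion that correlates 0.9 with the true score
criterion = 0.9 * true_score + np.sqrt(1 - 0.9**2) * rng.normal(0, 1, n)

for error_sd in (0.0, 0.5, 1.0, 2.0):
    observed = true_score + rng.normal(0, error_sd, n)
    reliability = 1 / (1 + error_sd**2)   # var(T) / var(X)
    r = np.corrcoef(observed, criterion)[0, 1]
    print(f"reliability={reliability:.2f}  "
          f"observed-criterion correlation={r:.2f}  "
          f"attenuation bound={0.9 * np.sqrt(reliability):.2f}")
```

As reliability falls, the observed correlation tracks the attenuation bound downwards: no amount of content relevance rescues inferences drawn from a test that is mostly noise.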

Modern conceptions of validity
Validity subsumes all aspects of assessment quality:
- Reliability
- Representativeness (content coverage)
- Relevance
- Predictiveness
“Validity is an integrative evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989, p. 13)

Meanings and consequences (Messick’s facets of validity)

                       Result interpretation   Result use
Evidential basis       Construct validity      Construct validity + utility
Consequential basis    Value implications      Social consequences

“Adverse social consequences … are not in themselves indicative of invalidity” (Messick, 1989, p. 89)
Right concern, wrong concept (Popham, 1997)

Threats to validity
- Inadequate reliability
- Construct-irrelevant variance
  - Differences in scores are caused, in part, by differences not relevant to the construct of interest
  - The assessment assesses things it shouldn’t: it is “too big”
- Construct under-representation
  - Differences in the construct are not reflected in scores
  - The assessment doesn’t assess things it should: it is “too small”
With clear construct definition, all of these are technical issues rather than value issues, but they interact strongly…
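
The “too big”/“too small” metaphor can be illustrated with a toy simulation (a hypothetical sketch, not from the presentation; the trait names and effect sizes are invented). A “too big” assessment picks up an irrelevant trait alongside the construct; a “too small” one samples only part of the construct. Both weaken the link between scores and the construct of interest, but for opposite reasons.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# A construct with two facets (e.g. two content domains), unit variance overall
facet_a = rng.normal(0, 1, n)
facet_b = rng.normal(0, 1, n)
construct = (facet_a + facet_b) / np.sqrt(2)

# A trait irrelevant to the construct (e.g. reading speed on a maths test)
irrelevant = rng.normal(0, 1, n)

too_big = construct + 0.8 * irrelevant   # construct-irrelevant variance
too_small = facet_a                      # construct under-representation

for name, scores in [("too big", too_big), ("too small", too_small)]:
    r = np.corrcoef(scores, construct)[0, 1]
    print(f"{name:9s} correlation with construct: {r:.2f}")
```

The two defects show up in much the same way in the headline correlation (here roughly 0.78 and 0.71), which is why clear construct definition, rather than statistics alone, is needed to tell them apart.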

Function | Quality | Format | Scope

Item formats
“No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32)
Myths about multiple-choice items:
- They are biased against females
- They assess only candidates’ ability to spot or guess
- They test only lower-order skills

Mathematics 2
What can you say about the means of the following two data sets?
Set 1: [values shown in the original slide]
Set 2: [values shown in the original slide]
A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.
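
The data sets themselves are not reproduced in the transcript, but the diagnostic point behind option C can be shown with invented numbers (a hypothetical sketch, not from the presentation): a recorded zero is a data value, so whether it is “counted” is not a matter of choice, and including it changes the mean.

```python
# Hypothetical readings; the zero is a genuine measurement, not a gap
readings = [10, 12, 13, 15, 0]

mean_with_zero = sum(readings) / len(readings)          # 50 / 5 = 10.0
mean_without_zero = sum(r for r in readings if r != 0) / (len(readings) - 1)  # 50 / 4 = 12.5

print(mean_with_zero, mean_without_zero)
```

A student who picks C is treating the zero as optional; the two calculations give different answers, which is precisely what makes the distractor diagnostic.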

Mathematics 3
Which of the shapes below contains a dotted line that is also a diagonal?
[shapes shown in the original slide]

Science (Wilson & Draney, 2004)
The ball sitting on the table is not moving. It is not moving because:
A. no forces are pushing or pulling on the ball.
B. gravity is pulling down, but the table is in the way.
C. the table pushes up with the same force that gravity pulls down.
D. gravity is holding it onto the table.
E. there is a force inside the ball keeping it from rolling off the table.

OU S354: Understanding space & time
Below are five statements about the cosmic background radiation of our Universe. Select the two options that are correct, according to the standard model of the Universe.
A. The microwave radiation collected on Earth is dominated by signals of cosmic origin.
B. The total energy of the cosmic background radiation is currently much greater than that of matter.
C. In a closed universe, the cosmic background radiation would eventually appear as visible light.
D. The temperature of the cosmic background radiation was equal to that of the matter in the Universe until the appearance of galaxies.
E. The number of photons in the cosmic background radiation has remained approximately constant since the era of decoupling.

English
Where would be the best place to begin a new paragraph?

No rules are carved in stone dictating how long a paragraph should be. However, for argumentative essays, a good rule of thumb is that, if your paragraph is shorter than five or six good, substantial sentences, then you should reexamine it to make sure that you've developed the ideas fully. [A] Do not look at that rule of thumb, however, as hard and fast. It is simply a general guideline that may not fit some paragraphs. [B] A paragraph should be long enough to do justice to the main idea of the paragraph. Sometimes a paragraph may be short; sometimes it will be long. [C] On the other hand, if your paragraph runs on to a page or longer, you should probably reexamine its coherence to make sure that you are sticking to only one main topic. Perhaps you can find subtopics that merit their own paragraphs. [D] Think more about the unity, coherence, and development of a paragraph than the basic length. [E] If you are worried that a paragraph is too short, then it probably lacks sufficient development. If you are worried that a paragraph is too long, then you may have rambled on to topics other than the one stated in your topic sentence.

English 2
In a piece of persuasive writing, which of these would be the best thesis statement?
A. The typical TV show has 9 violent incidents.
B. There is a lot of violence on TV.
C. The amount of violence on TV should be reduced.
D. Some programs are more violent than others.
E. Violence is included in programs to boost ratings.
F. Violence on TV is interesting.
G. I don’t like the violence on TV.
H. The essay I am going to write is about violence on TV.

History
Why are historians concerned with bias when analyzing sources?
A. People can never be trusted to tell the truth.
B. People deliberately leave out important details.
C. People are only able to provide meaningful information if they experienced an event firsthand.
D. People interpret the same event in different ways, according to their experience.
E. People are unaware of the motivations for their actions.
F. People get confused about sequences of events.

Automated scoring technologies
[diagram: scoring technologies plotted on two axes — evidence structure (structured to unstructured) and skill level assessed (low-order to high-order) — running from multiple-choice items through c-rater, m-rater, and e-rater to simulations]

Function | Quality | Format | Scope

Continuous vs. one-off assessment
Continuous assessment
- Pros: high validity (including reliability); reduced stress (for some students)
- Cons: comparability of work done at different times; questions about the accumulation of learning over the programme
One-off assessment
- Pros: synoptic; comparability issues minimized
- Cons: limited validity (especially reliability); stressful for some students (construct-irrelevant variance)

Reflections

The challenge
To design an assessment system that is:
- Distributed, so that evidence collection is not undertaken entirely at the end
- Synoptic, so that learning has to accumulate
- Extensive, so that all important aspects are covered (breadth and depth)
- Manageable, so that costs are proportionate to benefits
- Trusted, so that stakeholders have faith in the outcomes

The minimal take-aways…
- No such thing as a summative assessment
- No such thing as a reliable test
- No such thing as a valid test
- No such thing as a biased test
- “Validity including reliability”