Issues of Reliability, Validity and Item Analysis in Classroom Assessment, by Professor Stafford A. Griffith. Jamaica Teachers Association Education Conference.

Similar presentations
Test Development.
Agenda Levels of measurement Measurement reliability Measurement validity Some examples Need for Cognition Horn-honking.
Measurement Concepts Operational Definition: is the definition of a variable in terms of the actual procedures used by the researcher to measure and/or.
Reliability IOP 301-T Mr. Rajesh Gunesh Reliability  Reliability means repeatability or consistency  A measure is considered reliable if it would give.
Conceptualization and Measurement
The Research Consumer Evaluates Measurement Reliability and Validity
© McGraw-Hill Higher Education. All rights reserved. Chapter 3 Reliability and Objectivity.
Psychometrics William P. Wattles, Ph.D. Francis Marion University.
VALIDITY AND RELIABILITY
Item Analysis: A Crash Course Lou Ann Cooper, PhD Master Educator Fellowship Program January 10, 2008.
Chapter 4A Validity and Test Development. Basic Concepts of Validity Validity must be built into the test from the outset rather than being limited to.
CH. 9 MEASUREMENT: SCALING, RELIABILITY, VALIDITY
RESEARCH METHODS Lecture 18
Chapter 4 Validity.
VALIDITY.
Concept of Measurement
Beginning the Research Design
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Classical Test Theory By ____________________. What is CTT?
Understanding Validity for Teachers
Measurement and Data Quality
Reliability and Validity what is measured and how well.
Instrumentation.
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 14 Measurement and Data Quality.
Technical Adequacy Session One Part Three.
Counseling Research: Quantitative, Qualitative, and Mixed Methods, 1e © 2010 Pearson Education, Inc. All rights reserved. Basic Statistical Concepts Sang.
6. Evaluation of measuring tools: validity Psychometrics. 2012/13. Group A (English)
Measurement Validity.
Validity Validity: A generic term used to define the degree to which the test measures what it claims to measure.
Presented By Dr / Said Said Elshama  Distinguish between validity and reliability.  Describe different evidences of validity.  Describe methods of.
Research Methodology and Methods of Social Inquiry Nov 8, 2011 Assessing Measurement Reliability & Validity.
Validity Validity is an overall evaluation that supports the intended interpretations, uses, and consequences of the obtained scores. (McMillan 17)
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
JS Mrunalini Lecturer RAKMHSU Data Collection Considerations: Validity, Reliability, Generalizability, and Ethics.
SOCW 671: #5 Measurement Levels, Reliability, Validity, & Classic Measurement Theory.
Psychometrics. Goals of statistics Describe what is happening now –DESCRIPTIVE STATISTICS Determine what is probably happening or what might happen in.
Measurement MANA 4328 Dr. Jeanne Michalski
Experimental Research Methods in Language Learning Chapter 12 Reliability and Reliability Analysis.
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
Sampling Design & Measurement Scaling
Chapter 6 - Standardized Measurement and Assessment
VALIDITY, RELIABILITY & PRACTICALITY Prof. Rosynella Cardozo Prof. Jonathan Magdalena.
Reliability a measure is reliable if it gives the same information every time it is used. reliability is assessed by a number – typically a correlation.
Reliability EDUC 307. Reliability  How consistent is our measurement?  the reliability of assessments tells the consistency of observations.  Two or.
TEST SCORES INTERPRETATION - is a process of assigning meaning and usefulness to the scores obtained from classroom test. - This is necessary because.
Measuring Research Variables
Copyright © Springer Publishing Company, LLC. All Rights Reserved. DEVELOPING AND USING TESTS – Chapter 11 –
Measurement and Scaling Concepts
Copyright © 2009 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 47 Critiquing Assessments.
Professor Jim Tognolini
Reliability Analysis.
Ch. 5 Measurement Concepts.
VALIDITY by Barli Tambunan
Lecture 5 Validity and Reliability
ARDHIAN SUSENO CHOIRUL RISA PRADANA P.
Concept of Test Validity
Test Validity.
Workshop For External Examiners Joint Board of Teacher Education
Human Resource Management By Dr. Debashish Sengupta
Week 3 Class Discussion.
Workshop Questionnaire.
Reliability and Validity of Measurement
PSY 614 Instructor: Emily Bullock, Ph.D.
RESEARCH METHODS Lecture 18
By ____________________
Reliability Analysis.
How can one measure intelligence?
Measurement Concepts and scale evaluation
Chapter 8 VALIDITY AND RELIABILITY
Presentation transcript:

Issues of Reliability, Validity and Item Analysis in Classroom Assessment, by Professor Stafford A. Griffith. Jamaica Teachers Association Education Conference: Assessment in Education. Ritz Carlton Resort & Spa, Montego Bay, April 2-4, 2013

Concept of a Test Some of the earliest forms of assessment or testing may be noted in biblical references. Adam and Eve, for example, were subjected to a simple test in the Garden of Eden, based on a test item presented in a negative form. Another account is taken from Judges 12:4-6: an oral examination (the "shibboleth" test) devised by the Gileadite army to identify members of the defeated Ephraimite army who were attempting to escape under cover of a false identity.

Outside of the biblical accounts, historians generally agree that the Chinese were the first to use large-scale testing. It was introduced as early as 2000 B.C. to measure the proficiency of candidates for public office and to reduce patronage. Today, we think of a test as an item/question, problem or task, or a mix of these, administered under prescribed conditions.

It is designed to elicit responses that provide information to make judgements about a candidate. It is a systematic procedure for measuring a sample of a candidate's behaviour, one that can give an accurate and truthful account of the candidate's skills, knowledge, ability or other characteristics at the time the test was administered.

Reliability of Test Scores Two essential requirements for a technically sound test are reliability and validity. Reliability is the extent to which test scores are consistent or dependable. Only to the extent that scores are reliable can they be useful in conveying information about a student's performance.

From a more technical standpoint, reliability is the extent to which scores are free from errors of measurement. Classical Test Theory (CTT) defines reliability as a property that is based on three considerations: observed scores, true scores and measurement errors.

In Classical Test Theory, a person's observed score is a function of that person's true score plus error. This may be represented simply as: Xo = Xt + Xe, where Xo represents the observed score, Xt represents the true score, and Xe represents the error.
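
A minimal sketch of this model in code, assuming normally distributed true scores and errors with illustrative parameters (the student count, means and spreads below are assumptions, not values from the presentation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate the CTT model Xo = Xt + Xe for 100 hypothetical students.
true_scores = rng.normal(loc=60, scale=10, size=100)   # Xt
errors = rng.normal(loc=0, scale=5, size=100)          # Xe
observed_scores = true_scores + errors                 # Xo = Xt + Xe

# Under CTT, reliability is the proportion of observed-score variance
# that is true-score variance; with less measurement error, this ratio
# moves toward 1.
reliability = true_scores.var() / observed_scores.var()
print(f"Estimated reliability: {reliability:.2f}")
```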

The level of confidence we can have in test scores hinges on how much error there is in the observed scores of students. Reliability, or the level of confidence we can have in test scores, is expressed as an index ranging from 0 to 1. It may therefore be .99 (high) or .10 (low).

The reliability coefficients commonly used to determine and report on the consistency with which a test measures are derived from various approaches: test-retest, alternative-form, internal-consistency, split-half and inter-rater (a special form of reliability).
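
To make one of these approaches concrete, here is a minimal split-half sketch; the 0/1 response matrix (students by items) is hypothetical, and the odd/even split is one common way of halving a test:

```python
import numpy as np

# Hypothetical scored responses: rows are students, columns are items
# (1 = correct, 0 = incorrect).
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

# Total each student's score on the odd-numbered and even-numbered items.
odd_half = item_scores[:, 0::2].sum(axis=1)
even_half = item_scores[:, 1::2].sum(axis=1)

# Correlate the two half-test scores, then step the correlation up to
# full test length with the Spearman-Brown formula.
half_r = np.corrcoef(odd_half, even_half)[0, 1]
split_half_reliability = (2 * half_r) / (1 + half_r)
print(f"Split-half reliability: {split_half_reliability:.2f}")
```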

Validity of Test Scores Validity is the extent to which a test does the job for which it is intended. Essentially, validity is about what inferences can be made from the scores obtained on an instrument.

The most widely encountered discussions refer to three lines of validity evidence: content validity (representativeness of the domain); criterion-related validity (correlation with/prediction of scores from another instrument); construct validity (association with some theoretical construct).

Validity is the most important technical quality of a test. An important way of assuring, or assessing, validity is to use a subject-matter-by-behaviour grid called a specifications table or a table of specifications. It helps to define the weighting to be given to various subject matter and behaviours (or objectives or skills), and it helps to avoid the testing of extraneous material.

Example of a Table of Specifications (Kn = Knowledge, Co = Comprehension, Ap = Application, An = Analysis)

Content / Objective          Kn   Co   Ap   An   Total
Classification of animals     2    4    -          10
Plants of the earth
Population and Evolution                            3
Variation and Selection       1                     5
Origin of the Solar System
Changes in Land Features                            6
Total                        16   17   11          60
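
A blueprint like this can also be represented and checked in code. The sketch below uses two of the content areas from the table with illustrative item counts (assumptions, not the original cell values) and tallies the weighting by topic and by cognitive level:

```python
# Hypothetical table of specifications: content areas by cognitive levels
# (Kn = Knowledge, Co = Comprehension, Ap = Application, An = Analysis).
blueprint = {
    "Classification of animals": {"Kn": 2, "Co": 4, "Ap": 0, "An": 4},
    "Variation and Selection":   {"Kn": 1, "Co": 2, "Ap": 1, "An": 1},
}

# Row totals show the weight given to each content area.
for topic, cells in blueprint.items():
    print(f"{topic}: {sum(cells.values())} items")

# Column totals show the weight given to each cognitive level.
for level in ["Kn", "Co", "Ap", "An"]:
    print(f"{level}: {sum(cells[level] for cells in blueprint.values())} items")
```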

It is important to work out the types of items/questions, their psychometric characteristics, the number of items or questions, and how these will be scored. The specifications for test construction should be so clear that two test constructors would produce tests that are comparable and interchangeable.

Item Analysis In writing and analysing test tasks, two critical indicators of the quality of the tasks should be considered: the facility (or difficulty) and the discrimination. The facility level for a task is the percentage of candidates responding correctly or satisfactorily to it.

It is expressed as an index: an f-value or a p-value (which is really the probability of a person in a particular group responding correctly or satisfactorily). The formula for calculating p is very simple: p = R/T, that is, the number of students responding correctly to an item divided by the number of students responding to the item. Its value ranges from 0 to 1.00.
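
As a quick sketch of this formula (the 24-of-40 figures below are hypothetical):

```python
def facility_index(correct_responses: int, total_responses: int) -> float:
    """Facility (p) index: p = R / T, ranging from 0 to 1.00."""
    return correct_responses / total_responses

# Hypothetical item: 24 of the 40 students who attempted it were correct.
print(facility_index(24, 40))  # 0.6
```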

The discrimination level for a task is the extent to which performance on the task separates the better candidates from the poorer ones. The calculation of this d-index is generally more complex than the calculation of the facility index; it is often represented by a biserial or point-biserial correlation index (r) and ranges from -1.00 to +1.00.

For a task scored dichotomously, however, an easier and relatively accurate estimate of discrimination is obtained by comparing the way the top-performing students perform on the task with the way the bottom-performing students perform on that task.

The discrimination index for an item is calculated by: ranking students according to performance on the test; separating the top-performing students from the bottom-performing students; finding the p value of the item for the top-performing students and the p value for the bottom-performing students; and subtracting the p value for the low-performing students from the p value for the high-performing students, as sketched in the code below.
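
A minimal sketch of these steps, assuming a simple 50/50 split into upper and lower groups (the scores and item responses are hypothetical):

```python
def discrimination_index(total_scores, item_correct):
    """d = p(upper group) - p(lower group) for one item."""
    # 1. Rank students by total test performance.
    ranked = sorted(zip(total_scores, item_correct),
                    key=lambda pair: pair[0], reverse=True)
    half = len(ranked) // 2

    # 2. Separate the top and bottom performers (here, top and bottom halves).
    upper = [correct for _, correct in ranked[:half]]
    lower = [correct for _, correct in ranked[-half:]]

    # 3. Find the item's p value in each group, and
    # 4. subtract the lower-group p value from the upper-group p value.
    return sum(upper) / len(upper) - sum(lower) / len(lower)

# Hypothetical data: ten students' total test scores and their 0/1 result
# on the item being analysed.
total_scores = [95, 88, 84, 77, 70, 65, 58, 52, 45, 30]
item_correct = [1,  1,  1,  0,  1,  0,  1,  0,  0,  0]
print(discrimination_index(total_scores, item_correct))  # 0.8 - 0.2 = 0.6
```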

The table below indicates how students performed on an item with four possible responses (A, B, C and D). The correct response is C.

Response      A   B   C   D
Upper Group   -   2   8   -
Lower Group   4   3   2   1

The facility index of the item is (a) 1.00 (b) .10 (c) .05 (d) .50
The discrimination index of the item is (a) 6 (b) .60 (c) .06 (d) .66
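
One way to check the two quiz items against the formulas above (each group in the table has ten students):

```python
# Response counts read off the table above; C is the keyed answer
# and "-" is read as 0.
upper = {"A": 0, "B": 2, "C": 8, "D": 0}
lower = {"A": 4, "B": 3, "C": 2, "D": 1}

# Facility: all correct responses over all responses.
p = (upper["C"] + lower["C"]) / (sum(upper.values()) + sum(lower.values()))

# Discrimination: upper-group p value minus lower-group p value.
d = upper["C"] / sum(upper.values()) - lower["C"] / sum(lower.values())

print(f"facility = {p:.2f}")        # 0.50 -> option (d)
print(f"discrimination = {d:.2f}")  # 0.60 -> option (b)
```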

Summary Based on our discussions, I trust that in developing and using tests for assessment in the classroom, you will consider the need to:
provide scores that are reliable;
provide scores that are valid;
develop and use items/tasks that are at the right difficulty level; and
develop and use items/tasks that can discriminate between those who have the desired competences and those who do not.

Thank you. Professor Stafford A. Griffith, Director of the School of Education, UWI, Mona