Validity and Reliability


Validity and Reliability Margaret Wu

Validity
According to the APA Standards for Educational and Psychological Testing, validity is "the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests."
To establish validity, we need to accumulate evidence for appropriate interpretations of scores. That is, we evaluate the use of the test scores, not the test itself. A test may be valid for one kind of use but not valid for another.

Sources of Validity Evidence
- Evidence based on test content
- Evidence based on response processes
- Evidence based on internal structure
- Evidence based on relations to other variables
- Evidence based on consequences of testing

Evidence based on Test Content
- Content coverage: representation of the domain.
- Check the assessment framework and test blueprint: the assessment framework defines the test domain; the test blueprint specifies the coverage of subdomains.
- Check against the intended use of the test: Are subdomain scores reported? Are inferences made about individual student performance? About group performance? About trends over time?
- Check that the test blueprint matches the curriculum.
- Check that the test blueprint matches the test items.

Evidence based on Response Processes
Consider the response processes of test takers. For example:
- Questionnaires: do answers reflect social desirability or students' "real" responses?
- Items testing reasoning: is reasoning actually needed, or might a memorised algorithm suffice?
Evidence can be gathered through:
- Theoretical evidence: test blueprints specify the cognitive demand required of students.
- Empirical evidence: cognitive labs (think-aloud procedures), pre-tests, response times.
Also consider:
- Students with disabilities.
- Judges'/raters' judging processes.
- Test administration procedures: testing time, test security, testing environment.

Evidence based on Internal Structure
- Good/poor discriminating items (see the sketch below).
- Expected item difficulty order matches the empirical order.
- Item inter-relationships; unidimensionality.
- Test reliability.
- Presence of differential item functioning.
- Size of measurement error and sampling error, in relation to the validity of using the test scores.
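As a minimal sketch of two of the internal-structure checks listed above, the Python snippet below computes item discrimination (as corrected item-total correlations) and test reliability (as Cronbach's alpha) from a scored response matrix. The response data here are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# responses: rows = students, columns = items, entries are 0/1 scores
# (simulated here; in practice this would be real scored test data)
responses = (rng.random((200, 10)) < rng.uniform(0.3, 0.9, 10)).astype(int)

n_students, n_items = responses.shape
total = responses.sum(axis=1)

# Corrected item-total correlation: correlate each item with the total
# score excluding that item, so the item is not correlated with itself.
for i in range(n_items):
    rest = total - responses[:, i]
    r = np.corrcoef(responses[:, i], rest)[0, 1]
    print(f"item {i}: discrimination = {r:.2f}")

# Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of total)
item_vars = responses.var(axis=0, ddof=1)
alpha = n_items / (n_items - 1) * (1 - item_vars.sum() / total.var(ddof=1))
print(f"Cronbach's alpha = {alpha:.2f}")
```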

Evidence based on Relations to Other Variables
- Predictive validity.
- Concurrent validity.
- Relationship to other tests.
- Relationship to group variables, e.g., gender, demographics.
- Are the relationships consistent with what is expected given the construct of the test?

Evidence based on Consequences of Testing
Consider intended and unintended consequences of test use, for example:
- The effect of group differences in test scores on employment selection.
- The narrowing of the school curriculum to exclude learning objectives that are not assessed.
Questions to ask:
- What are the benefits of using the test scores?
- Are there any detrimental effects from administering the test?

Reliability
Reliability refers to how consistent test scores would be if "similar" tests were administered.

An example: a possible Grade 5 mathematics item pool, from which many questions can be asked:
- (a) The temperature was 7°. It fell by 4°. The temperature was then ___
- (a) 16 × 10 = ___
- (c) (-11) + (+3) = ___
- (j) 139.2 ÷ 1000 = ___
- A class starts at 10:30. The class is 40 minutes long. What time does the class finish?
- Each apple weighs around 160 grams. How many apples together will weigh close to half a kilogram?
Suppose David can answer 60% of the items in the pool (if we had the opportunity to administer all of them). Each NAPLAN test samples 40 questions from this large item pool:
- NAPLAN 2008 test: David's score 25/40
- NAPLAN 2009 test: David's score 28/40
- NAPLAN 2010 test: David's score 20/40
David's scores on similar NAPLAN tests will have a range of about 10 score points (e.g., between 20/40 and 30/40).
The correlation between students' scores on two "parallel" tests is called reliability. The difference between David's test score and his "TRUE" score is called measurement error.
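The parallel-test idea above can be simulated directly. In this minimal sketch, each student has a true probability of answering a typical item correctly (David's 60% would be true_p = 0.6), two "parallel" 40-item tests are drawn from the same pool, and reliability is estimated as the correlation between the two sets of scores. All numbers are illustrative assumptions, not NAPLAN data.

```python
import numpy as np

rng = np.random.default_rng(42)
n_students, n_items = 1000, 40

# True ability: each student's probability of success on a typical item.
true_p = rng.uniform(0.3, 0.9, n_students)

# Two parallel tests: each is a binomial draw of 40 items from the same pool.
test_form_a = rng.binomial(n_items, true_p)
test_form_b = rng.binomial(n_items, true_p)

# Reliability = correlation between scores on the two parallel forms.
reliability = np.corrcoef(test_form_a, test_form_b)[0, 1]
print(f"estimated reliability = {reliability:.2f}")
```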

Margin of error in measuring student performance
One test collects only a small sample of performance. The possible variation in scores is called measurement error. Note that measurement error does not refer to mistakes made in the assessment (e.g., wrong scoring or an incorrect question). Is the measurement error too large in NAPLAN?

How big an error is acceptable? The answer is: it depends.
An example: measuring the effectiveness of a weight-loss program.
- We expect a loss of 0.5 kg after one week.
- The measurement scale is accurate to 1 kg.
- That is not good enough for measuring individual change.
- It is OK for measuring group change, if the group size is "large" (see the sketch below).
The key is to assess whether the measurement error is too large in comparison to the magnitude that we want to measure.
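A rough numerical illustration of this example: a scale with about 1 kg of random error cannot detect a 0.5 kg individual change, but the error of a group mean shrinks as 1/sqrt(n). The numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
true_loss = 0.5          # true average loss after one week, in kg
scale_error_sd = 1.0     # measurement error of the scale, in kg

for n in (1, 25, 400):
    # observed change for each person = true loss + random scale error
    observed = true_loss + rng.normal(0, scale_error_sd, n)
    se_mean = scale_error_sd / np.sqrt(n)
    print(f"n={n:4d}: group mean = {observed.mean():5.2f} kg, "
          f"standard error of the mean = {se_mean:.2f} kg")
```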

Magnitudes of measurement error
As a rough guide, a lower bound for the measurement error is given by √(4/N) = 2/√N, where N is the number of items in the test.
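One plausible reading of this rule of thumb (not spelled out in the slide) is that it refers to the logit metric of a Rasch-type model: a single item contributes at most 0.25 units of information (when the success probability is 50%), so the error variance of an ability estimate is at least 1/(0.25 × N) = 4/N. The small sketch below simply tabulates the bound for some common test lengths.

```python
import math

# Lower bound on measurement error: sqrt(4/N) = 2/sqrt(N)
for n_items in (10, 20, 30, 40, 100):
    sem_lower_bound = math.sqrt(4 / n_items)
    print(f"{n_items:3d} items: measurement error >= {sem_lower_bound:.2f}")
```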

On the NAPLAN scale… [figure not reproduced in the transcript]

Summary of measuring individuals
The main message is that a one-off test does not provide very accurate information at the individual student level, other than an indicative sense of whether a student is below average, at average, or above average. If a single test of 30 items is ever used for high-stakes purposes, such as selection into colleges or awarding certificates, we should be very wary of the results.

Reliability and Measurement Error
Reliability = Var(T) / Var(X): the variance of the true scores divided by the variance of the observed scores.
Equivalently, Reliability = (Var(X) − Var(E)) / Var(X): the variance of the observed scores minus the variance of the error scores, divided by the variance of the observed scores.
The measurement error (SEM) is √Var(E).
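As a numerical check of these identities, the sketch below simulates true scores T and independent errors E (the values are arbitrary), forms observed scores X = T + E, and confirms that both expressions for reliability agree.

```python
import numpy as np

rng = np.random.default_rng(7)
T = rng.normal(50, 10, 100_000)   # true scores
E = rng.normal(0, 5, 100_000)     # error scores, independent of T
X = T + E                         # observed scores

rel_from_true = T.var() / X.var()
rel_from_error = (X.var() - E.var()) / X.var()
sem = np.sqrt(E.var())            # measurement error = sqrt(Var(E))

print(f"Var(T)/Var(X)           = {rel_from_true:.3f}")
print(f"(Var(X)-Var(E))/Var(X)  = {rel_from_error:.3f}")
print(f"measurement error (SEM) = {sem:.2f}")
```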