Presentation 5: Analysing Tests and Test Items using Classical Test Theory (CTT)
Professor Jim Tognolini

Analysing Tests and Test Items using Classical Test Theory (CTT) During this session we will: define some basic test-level statistics using Classical Test Theory analyses, namely the test mean, test discrimination and test reliability (Cronbach's Alpha); and define some basic item-level statistics from Classical Test Theory, namely item difficulty and item discrimination (the Findlay Index and the Point Biserial Correlation).

Test characteristics to evaluate: Difficulty, Discrimination, Reliability, Validity.

Test difficulty  
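The formula on this slide did not survive the transcript. A standard CTT summary, offered here as an assumption about what the slide showed: test difficulty is usually reported as the test mean, often expressed as a proportion of the maximum available marks.

```latex
\text{Test difficulty} = \frac{\bar{X}}{X_{\max}}, \qquad \bar{X} = \frac{1}{N}\sum_{j=1}^{N} X_j
```

Here $X_j$ is the total score of candidate $j$, $N$ the number of candidates and $X_{\max}$ the maximum possible score. In the worked example later in this deck, a mean of 17.6 out of 28 marks gives a test difficulty of about 0.63.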

Test discrimination The ability of a test to discriminate between high- and low-achieving individuals is a function of the items that comprise the test.

Methods of estimating reliability

Method | Type of Reliability | Procedure
Test-Retest | Stability | Give the same test to the same group on different occasions, with some time between tests.
Equivalent Forms | Equivalence | Give two forms (parallel forms) of the test to the same group in close succession.
Split-half | Internal Consistency | Give the test once; split it into halves (odd/even); compute the correlation between the scores on the two halves; correct that correlation using the Spearman-Brown formula.
Coefficient Alpha | Internal Consistency | Give the test once to a group and apply the formula.
Interrater | Consistency of Ratings | Have two or more raters score the responses and calculate the correlation coefficient.

Split-halves method Reliability can also be estimated from a single administration of a test, either by correlating the two halves or by using the Kuder-Richardson method. The split-halves method requires the test to be split into the two most nearly equivalent halves. To estimate the reliability of the full test, the Spearman-Brown adjustment is usually applied.
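The Spearman-Brown adjustment itself is standard and can be stated for reference:

```latex
r_{\text{full}} = \frac{2\, r_{\text{half}}}{1 + r_{\text{half}}}
```

where $r_{\text{half}}$ is the correlation between scores on the two half-tests. For example, halves correlating at 0.6 give an estimated full-test reliability of $2(0.6)/1.6 = 0.75$.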

Kuder-Richardson (KR-20 and KR-21) Method  
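The formulas did not come through in the transcript; the standard forms, given here for reference, are:

```latex
\text{KR-20}: \quad r_{XX'} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{s_X^2}\right)
\qquad
\text{KR-21}: \quad r_{XX'} = \frac{k}{k-1}\left(1 - \frac{\bar{X}\,(k - \bar{X})}{k\, s_X^2}\right)
```

where $k$ is the number of items, $p_i$ the proportion answering item $i$ correctly, $q_i = 1 - p_i$, $\bar{X}$ the mean total score and $s_X^2$ the variance of total scores. KR-21 is a simplification of KR-20 that assumes all items are of equal difficulty.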

Cronbach’s alpha method  
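This slide's formula is also missing from the transcript; Cronbach's alpha generalises KR-20 to items that are not scored 0/1:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^2}{s_X^2}\right)
```

where $s_i^2$ is the variance of item $i$ and $s_X^2$ the variance of total scores. A minimal Python sketch of the computation (function and variable names are illustrative, not from the presentation):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons-by-items matrix of item scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```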

Ways to improve reliability
Test length: In general, the longer the test, the higher the reliability (more adequate sampling), provided that the added material is identical in statistical and substantive properties.
Homogeneity of group: The more heterogeneous the group, the higher the reliability. Reliability can also vary across score levels, gender, location, etc.
Difficulty of items: Tests that are too easy or too hard yield results of low reliability. Generally set item difficulty close to 0.5; for tests that are required to discriminate, spread item difficulties over the range in which the discrimination is required.

Ways to improve reliability (cont.)
Objectivity: The more objective the test (and marking scheme), the more reliable the resulting test scores.
Retain discriminating items: In general, replace items with low discrimination with items that discriminate highly. There comes a point, however, where raising reliability this way lowers validity (the attenuation paradox).
Increase the speededness of the test: Highly speeded tests usually show higher reliability, but internal consistency estimates should not be used for them.

Types of validity There are many different types of validity. Traditionally there are three main types: Content Validity (sometimes referred to as curricular or instructional validity), Criterion-Related Validity (types include predictive and concurrent validity) and Construct Validity; Face Validity is also often discussed. Loevinger (1957) argued that "since predictive, concurrent and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view".

Define some basic item-level statistics from Classical Test Theory

Item difficulty  
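The slide's formula is missing from the transcript. The usual CTT definitions, stated as an assumption about what was shown:

```latex
p_i = \frac{R_i}{N} \ \text{(dichotomous items)}, \qquad
p_i = \frac{\bar{X}_i}{X_{i,\max}} \ \text{(polytomous items)}
```

where $R_i$ is the number of candidates answering item $i$ correctly, $N$ the number of candidates and, for polytomous items, $\bar{X}_i$ the item mean and $X_{i,\max}$ the item's maximum marks. This matches the Difficulty/Mean and P-Value rows in the worked example that follows.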

Item discrimination Methods for checking item discrimination include: the Findlay Index (FI), the Point Biserial Correlation and the Biserial Correlation.

The Findlay Index (FI)  
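The formula is restated two slides later; in symbols:

```latex
FI = P_U - P_L
```

where $P_U$ and $P_L$ are the proportions answering the item correctly in the upper and lower score groups. FI ranges from -1 to +1, with higher positive values indicating stronger discrimination.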

The Findlay Index (FI) – An example

The Findlay Index (FI) FI = PRU - PRL, where PRU = proportion of persons right in the upper group and PRL = proportion of persons right in the lower group. If the number of students in the top group is not equal to the number in the bottom group, proportions must be used.
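A minimal Python sketch of the calculation, assuming dichotomous (0/1) item scores and an upper/lower half split of the group; the function and variable names are illustrative only:

```python
import numpy as np

def findlay_index(item: np.ndarray, totals: np.ndarray) -> float:
    """FI = proportion correct in the upper half of scorers
    minus proportion correct in the lower half."""
    order = np.argsort(totals)               # candidates from lowest to highest
    half = len(totals) // 2
    lower, upper = order[:half], order[-half:]
    return item[upper].mean() - item[lower].mean()
```

Because proportions (means) are used rather than raw counts, unequal group sizes are handled as the slide notes.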

Graphical display of the Findlay Index (FI) Divide the candidates into score groups; for each group, calculate the proportion getting the item correct and plot it against the group's mean total score.
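A sketch of how such a plot might be produced, assuming the group is divided into several score bands; all names are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_item_discrimination(item: np.ndarray, totals: np.ndarray, n_groups: int = 4):
    """Plot proportion correct on one item against each group's mean total score."""
    order = np.argsort(totals)                   # lowest to highest scorers
    groups = np.array_split(order, n_groups)     # roughly equal score bands
    xs = [totals[g].mean() for g in groups]      # group mean total score
    ys = [item[g].mean() for g in groups]        # proportion correct in group
    plt.plot(xs, ys, marker="o")
    plt.xlabel("Group mean total score")
    plt.ylabel("Proportion correct on item")
    plt.show()
```

A discriminating item produces an upward-sloping line: higher-scoring groups are more likely to answer it correctly.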

Graphical display of the Findlay Index (FI)

The Findlay Index (FI) – An example
[Score table: 12 items (short-answer and essay, 28 marks in total) attempted by ten students: Astha, Bosco, Chetan, Devika, Emily, Farhan, Gogi, Harshita, Indu and Jagat. Recorded totals include Astha 18, Chetan 21, Devika 16, Farhan 24 and Jagat 22; the scores sum to 176. The individual cell values did not survive the transcript.]

The Findlay Index (FI) – An example
[The same score table with a Difficulty/Mean row added, giving each item's mean score; surviving values are 0.9, 1.0, 0.8, 0.5, 2.3, 0.4, 1.4, 2.0, 1.8, 2.4 and 3.1.]

The Findlay Index (FI) – An example
[The table again, now showing the test mean (17.6, i.e. 176/10) and a P-Value row that expresses each item mean as a proportion of the item's maximum marks; surviving values include 0.2 and 0.7.]

The Findlay Index (FI) – An example
[The students reordered by total score, from Farhan (24) down to Gogi, with proportion-correct values computed separately for the upper and lower halves of the group.]

The Findlay Index (FI) – An example
[The same sorted table with a Findlay Index row added; the surviving FI values are 0.0, 0.6, -0.1 and 0.4.]

Guttman scale
[The sorted score table with the items also reordered from easiest to hardest (items 2, 5, 1, 3, 6, 11, 8, 9, 10, 4, 12, 7), so that correct responses approximately fill the top-left of the table in a Guttman pattern.]

Point-biserial correlation  
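The formula did not survive the transcript; the standard point-biserial correlation between a dichotomous item and the total score is:

```latex
r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X}\,\sqrt{p\,q}
```

where $\bar{X}_1$ and $\bar{X}_0$ are the mean total scores of those answering the item correctly and incorrectly, $s_X$ is the standard deviation of total scores, $p$ is the item's proportion correct and $q = 1 - p$.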

The Guttman structure If person A scores better than person B on the test, then person A should have answered correctly all the items that person B answered correctly and, in addition, some other items that are more difficult. (Louis Guttman)
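A small sketch of what a strict Guttman pattern implies for 0/1 data, assuming items are ordered from easiest to hardest; purely illustrative:

```python
import numpy as np

def is_strict_guttman(scores: np.ndarray) -> bool:
    """True if every person's 0/1 response vector, with items ordered from
    easiest to hardest, is a run of 1s followed by a run of 0s."""
    easiest_first = scores[:, np.argsort(-scores.mean(axis=0))]
    return all(np.all(np.diff(row) <= 0) for row in easiest_first)
```

Real data almost never satisfy this strictly, as the following slides discuss.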

The Guttman structure (cont.)

Reasons for not obtaining a strict Guttman pattern
(1) The items do not go together as expected, and the scores on the items should not be added.
(2) The items are very close in difficulty and the persons are all close in ability.

Guttman scale
[The reordered score table from the earlier Guttman scale slide, shown again.]

Individual reporting
[Two slides showing a graphical individual report, with the test items ordered by difficulty (items 3, 11, 2, 15, 14, 9, 8, 1, 7, 4, 13, 12, 5, 10, 6); the graphics did not survive the transcript.]