FACULTY DEVELOPMENT PROFESSIONAL SERIES OFFICE OF MEDICAL EDUCATION TULANE UNIVERSITY SCHOOL OF MEDICINE Using Statistics to Evaluate Multiple Choice Test Items

Similar presentations
SYSTEMIC ASSESSMENT Workshop Faculty of Medicine Zagazig University PART-I Prof.A.F.M.FAHMY January 2010.

Questionnaire Development
Item Analysis.
How to Make a Test & Judge its Quality. Aim of the Talk Acquaint teachers with the characteristics of a good and objective test See Item Analysis techniques.
Item Analysis: Improving Multiple Choice Tests Crystal Ramsay September 27, 2011 Schreyer Institute for Teaching.
Chapter 8 Flashcards.
Rebecca Sleeper, July. Statistical: analysis of test-taker performance on specific exam items. Qualitative: evaluation of adherence to optimal.
Reliability Definition: The stability or consistency of a test. Assumption: True score = obtained score +/- error Domain Sampling Model Item Domain Test.
© McGraw-Hill Higher Education. All rights reserved. Chapter 3 Reliability and Objectivity.
Chapter 4 – Reliability Observed Scores and True Scores Error
Assessment Procedures for Counselors and Helping Professionals, 7e © 2010 Pearson Education, Inc. All rights reserved. Chapter 5 Reliability.
1Reliability Introduction to Communication Research School of Communication Studies James Madison University Dr. Michael Smilowitz.
Reliability for Teachers Kansas State Department of Education ASSESSMENT LITERACY PROJECT1 Reliability = Consistency.
MEQ Analysis. Outline Validity Validity Reliability Reliability Difficulty Index Difficulty Index Power of Discrimination Power of Discrimination.
Using Test Item Analysis to Improve Students’ Assessment
Using Multiple Choice Tests for Assessment Purposes: Designing Multiple Choice Tests to Reflect and Foster Learning Outcomes Terri Flateby, Ph.D.
Item Analysis: A Crash Course Lou Ann Cooper, PhD Master Educator Fellowship Program January 10, 2008.
Constructing Exam Questions Dan Thompson & Brandy Close OSU-CHS Educational Development-Clinical Education.
-Biomedical Statistics Final Report- Reliability. Student: 劉佩昀. Student ID: . Instructor: 蔡章仁.
Test Construction Processes 1- Determining the function and the form 2- Planning( Content: table of specification) 3- Preparing( Knowledge and experience)
Item Analysis What makes a question good??? Answer options?
Chapter 8 Developing Written Tests and Surveys Physical Fitness Knowledge.
Item Analysis Prof. Trevor Gibbs. Item Analysis After you have set your assessment: How can you be sure that the test items are appropriate?—Not too easy.
Multiple Choice Test Item Analysis Facilitator: Sophia Scott.
ANALYZING AND USING TEST ITEM DATA
Chapter 7 Correlational Research Gay, Mills, and Airasian
Quantitative Research
Technical Issues Two concerns Validity Reliability
Measurement and Data Quality
TEXAS TECH UNIVERSITY HEALTH SCIENCES CENTER SCHOOL OF PHARMACY KRYSTAL K. HAASE, PHARM.D., FCCP, BCPS ASSOCIATE PROFESSOR BEYOND MULTIPLE CHOICE QUESTIONS.
Chapter 8 Measuring Cognitive Knowledge. Cognitive Domain Intellectual abilities ranging from rote memory tasks to the synthesis and evaluation of complex.
Part #3 © 2014 Rollant Concepts, Inc. Assembling a Test.
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 14 Measurement and Data Quality.
Test item analysis: When are statistics a good thing? Andrew Martin Purdue Pesticide Programs.
Chapter 7 Item Analysis In constructing a new test (or shortening or lengthening an existing one), the final set of items is usually identified through.
Techniques to improve test items and instruction
Assessment in Education Patricia O’Sullivan Office of Educational Development UAMS.
1 Chapter 4 – Reliability 1. Observed Scores and True Scores 2. Error 3. How We Deal with Sources of Error: A. Domain sampling – test items B. Time sampling.
Tests and Measurements Intersession 2006.
Lab 5: Item Analyses. Quick Notes Load the files for Lab 5 from course website –
RELIABILITY AND VALIDITY OF ASSESSMENT
Assessment and Testing
Copyright © 2008 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 17 Assessing Measurement Quality in Quantitative Studies.
Introduction to Item Analysis Objectives: To begin to understand how to identify items that should be improved or eliminated.
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Designing a Classroom Test Anthony Paolo, PhD Director of Assessment & Evaluation Office of Medical Education & Psychometrician for CTC Teaching & Learning.
Reliability a measure is reliable if it gives the same information every time it is used. reliability is assessed by a number – typically a correlation.
Reliability When a Measurement Procedure yields consistent scores when the phenomenon being measured is not changing. Degree to which scores are free of.
TEST SCORES INTERPRETATION - is a process of assigning meaning and usefulness to the scores obtained from classroom test. - This is necessary because.
Dan Thompson Oklahoma State University Center for Health Science Evaluating Assessments: Utilizing ExamSoft’s item-analysis to better understand student.
Slides to accompany Weathington, Cunningham & Pittenger (2010), Chapter 10: Correlational Research 1.
Psychometrics: Exam Analysis David Hope
Multiple-Choice Item Design February, 2014 Dr. Mike Atkinson, Teaching Support Centre, Assessment Series.
Objective Examination: Multiple Choice Questions Dr. Madhulika Mistry.
Norm Referenced: Your score can be compared with others. 75th Percentile. Normed.
Copyright © Springer Publishing Company, LLC. All Rights Reserved. DEVELOPING AND USING TESTS – Chapter 11 –
Items analysis Introduction Items can adopt different formats and assess cognitive variables (skills, performance, etc.) where there are right and.
Exam Analysis Camp Teach & Learn May 2015 Stacy Lutter, D. Ed., RN Nursing Graduate Students: Mary Jane Iosue, RN Courtney Nissley, RN Jennifer Wierworka,
Professor Jim Tognolini
Using Data to Drive Decision Making:
DUMMIES RELIABILTY AND VALIDITY FOR By: Jeremy Starkey Lijia Zhang
ARDHIAN SUSENO CHOIRUL RISA PRADANA P.
Assessment Instruments and Rubrics Workshop Series
Data Analysis and Standard Setting
Classroom Analytics.
Classical Test Theory Margaret Wu.
Reliability & Validity
Test Development Test conceptualization Test construction Test tryout
EDUC 2130 Quiz #10 W. Huitt.
Tests are given for 4 primary reasons.
Presentation transcript:

FACULTY DEVELOPMENT PROFESSIONAL SERIES OFFICE OF MEDICAL EDUCATION TULANE UNIVERSITY SCHOOL OF MEDICINE Using Statistics to Evaluate Multiple Choice Test Items

Objectives
Define test reliability
Interpret the KR-20 statistic
Evaluate item difficulty (p value)
Define and interpret the point biserial correlation
Evaluate distractor quality
Differentiate among poor, fair, and good test items

What is Reliability?
Test reliability is a measure of the accuracy, consistency, or precision of test scores.
Statistics:
Coefficient (Cronbach) Alpha – generally used for surveys or tests whose items are not scored simply correct/incorrect (e.g., rating scales or partial credit)
Kuder-Richardson Formula 20 (KR-20) – measures inter-item consistency, that is, how well the exam measures a single construct; used for knowledge tests where items are scored correct/incorrect (dichotomous)
KR-21 – similar to KR-20 but underestimates the reliability of an exam whose questions vary in difficulty
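The slides do not include a worked computation, but KR-20 can be computed directly from a matrix of dichotomous item scores. The Python sketch below is illustrative only; the function name and the toy data are assumptions, not part of the original presentation.

```python
# Minimal illustrative sketch (not from the slides): KR-20 for a 0/1 score matrix.
# `scores` has one row per student and one column per item.

def kr20(scores):
    """Kuder-Richardson Formula 20 for dichotomously scored items."""
    n_students = len(scores)
    n_items = len(scores[0])

    totals = [sum(row) for row in scores]                  # total score per student
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students  # population variance

    # Sum of p*q over items, where p = proportion correct and q = 1 - p
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_students
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

# Toy data: 4 students x 3 items -> KR-20 of 0.75 for this matrix
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```

Some exam-scoring packages use the sample (n - 1) variance of the total scores instead of the population variance; either convention is common and the difference is small for large classes.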

Interpreting KR-20
The KR-20 statistic is influenced by:
the number of test items on the exam
student performance on each item
the variance of the set of student test scores
Range: 0.00 – 1.00
Values near 0.00 – weak relationship among items on the test
Values near 1.00 – strong relationship among items on the test
Medical school exams should have a KR-20 of .70 or higher.
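For reference, the standard algebraic form of KR-20 (not shown on the slide) makes these influences explicit: the number of items k appears in the leading factor, performance on each item enters through the p_i q_i terms, and the score variance sits in the denominator.

```latex
\mathrm{KR\text{-}20} \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i\, q_i}{\sigma_X^{2}}\right),
\qquad q_i = 1 - p_i
```

Here k is the number of items, p_i is the proportion of students answering item i correctly, and sigma_X^2 is the variance of the total test scores.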

Improving Reliability
Reliability can be improved by:
writing clear and simple directions
ensuring test items are clearly written and follow NBME guidelines for construction
assuring that test items match course objectives and content
adding test items; longer tests produce more reliable scores
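The last point, that longer tests produce more reliable scores, is usually quantified with the classical Spearman-Brown prophecy formula. It is not part of the original slides, but it shows the expected effect of lengthening a test with comparable items:

```latex
r_{\text{new}} \;=\; \frac{k\, r_{\text{old}}}{1 + (k-1)\, r_{\text{old}}}
```

For example, doubling (k = 2) an exam with KR-20 = .60 would be expected to raise the reliability to about 2(.60) / (1 + .60) = .75, assuming the added items are of similar quality to the originals.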

Item Difficulty
Item difficulty (p value) – the proportion of students who answered a test item correctly
Range: 0.00 – 1.00
Example: a p value of .56 means that 56% of students answered the question correctly; a p value of 1.00 means that 100% of students answered the question correctly.
For medical school tests, where there is an emphasis on mastery, MOST items should have a p value of .70 or higher.
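As an illustration (not taken from the slides), the per-item p values fall directly out of the same 0/1 score matrix used in the KR-20 sketch above; the function name is an assumption.

```python
# Illustrative sketch: item difficulty (p value) for each item in a 0/1 score matrix.

def item_difficulty(scores):
    """Return the proportion of students answering each item correctly."""
    n_students = len(scores)
    n_items = len(scores[0])
    return [sum(row[j] for row in scores) / n_students for j in range(n_items)]

scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(item_difficulty(scores))
# [0.75, 0.5, 0.25] -> only the first item meets the .70 mastery guideline
```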

What is a point biserial correlation?
The point biserial correlation:
measures test item discrimination
ranges from -1.00 to 1.00
A positive point biserial indicates that students scoring high on the total exam answered the item correctly more often than low-scoring students.
A negative point biserial indicates that low-scoring students did better on the item than high-scoring students.
As a general rule, a point biserial of .20 or higher is desirable.
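A minimal computational sketch follows, again assuming a 0/1 score matrix. It uses the "corrected" form, correlating each item with the total of the remaining items so an item is not correlated with itself; this is one common convention, offered as an illustration rather than as the presentation's own procedure.

```python
# Illustrative sketch: point-biserial discrimination for one item.
# Correlates the 0/1 item score with the total score on the remaining items.

def point_biserial(scores, item):
    n = len(scores)
    x = [row[item] for row in scores]               # 0/1 scores on this item
    y = [sum(row) - row[item] for row in scores]    # rest-of-test total per student

    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = (sum((xi - mx) ** 2 for xi in x) / n) ** 0.5
    sy = (sum((yi - my) ** 2 for yi in y) / n) ** 0.5

    if sx == 0 or sy == 0:      # e.g. every student answered the item the same way
        return float("nan")
    return cov / (sx * sy)
```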

Distractor Analysis
Addresses the performance of the incorrect response options.
Distractors should be plausible but clearly incorrect.
If no one chooses a particular option, that option is not contributing to the performance of the test item.
The presence of one or more implausible distractors can make the item artificially easier than it should be.
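A basic distractor check can be as simple as tabulating how often each option was chosen. The sketch below is illustrative only; the function name, the options string, and the toy responses are assumptions.

```python
# Illustrative sketch: count how many students chose each response option for one item.

from collections import Counter

def distractor_counts(responses, options="ABCD"):
    """responses: list of the option letters chosen by each student."""
    counts = Counter(responses)
    return {opt: counts.get(opt, 0) for opt in options}

print(distractor_counts(list("AACABDAAAB")))
# {'A': 6, 'B': 2, 'C': 1, 'D': 1} -> every distractor attracted at least one student
```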

Point Biserial Analysis
Items that are very easy or very difficult will have low ability to discriminate. Such items are often needed, however, to adequately sample course content and objectives.
A negative point biserial suggests one of the following:
The item was keyed incorrectly.
The item was poorly constructed or misleading.
The content of the item was inadequately taught.
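These review rules can be automated. The sketch below reuses the illustrative item_difficulty and point_biserial helpers from the earlier sketches and applies the .70 and .20 guidelines quoted in these slides; the thresholds and messages are the only assumptions beyond that.

```python
# Illustrative sketch: flag items for review using the helpers sketched earlier.

def flag_items(scores):
    flags = {}
    for j, p in enumerate(item_difficulty(scores)):
        r = point_biserial(scores, j)
        notes = []
        if r < 0:
            notes.append("negative point biserial: check the key, wording, or instruction")
        elif r < 0.20:
            notes.append("low discrimination (may be acceptable for very easy or very hard items)")
        if p < 0.70:
            notes.append("below the .70 mastery guideline")
        flags[j] = notes or ["no flags"]
    return flags
```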

Example 1 Test Item Analysis
Interpretation:
Correct answer: A
p value = .84 – the question was answered correctly by 83.51% of the class
Positive point biserial – high-scoring students were more likely to choose the correct answer
All distractors were chosen
GOOD QUESTION

Example 2 Test Item Analysis
Interpretation:
Correct answer: C
p value = .26 – the question was answered correctly by only 25.57% of the class
Point biserial below .20 – low-scoring students were more likely to choose the correct answer
POOR QUESTION

Example 3 Test Item Analysis
Interpretation:
Correct answer: D
p value = .98 – the question was answered correctly by 97.73% of the class
Point biserial = 0.08, below the .20 guideline – BUT almost all students answered correctly, so the item is unable to discriminate. This is acceptable if the item tests a concept all students are expected to know.
FAIR QUESTION

Self-Assessment
Review the following test item statistics. What can you conclude about this test item? Click your response below.
The distractors were implausible and should be replaced.
Low-scoring students got this item correct more often than high-scoring students.
More than 10% of the class answered this question incorrectly.
This test item showed high discriminative ability and should be retained.

FEEDBACK: THE DISTRACTORS WERE ALL CHOSEN BY A SMALL NUMBER OF STUDENTS AND WERE PLAUSIBLE CHOICES. Incorrect! TRY AGAIN

FEEDBACK: THE POINT BISERIAL WAS .57. THIS INDICATES THAT HIGH-SCORING STUDENTS WERE MORE LIKELY TO ANSWER THIS QUESTION CORRECTLY. Incorrect! TRY AGAIN

FEEDBACK: 90.29% OF THE CLASS ANSWERED THIS ITEM CORRECTLY, WHICH MEANS LESS THAN 10% ANSWERED INCORRECTLY. Incorrect! TRY AGAIN

FEEDBACK: 90.29% OF THE CLASS ANSWERED THIS ITEM CORRECTLY. THE POINT BISERIAL WAS 0.57, WHICH INDICATES HIGH DISCRIMINATIVE ABILITY (ABOVE THE .20 CUTOFF CRITERION). HIGH-SCORING STUDENTS WERE MORE LIKELY TO ANSWER THIS ITEM CORRECTLY. GREAT JOB!

Summary: Review Steps
Evaluate the KR-20
Evaluate the p value of each test item
Evaluate the point biserial of each correct answer
Evaluate the value of the distractors
Revise, rewrite, or discard test items as needed
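As a wrap-up, the earlier illustrative helpers can be combined into a single pass over an exam. This is only a sketch of the workflow the slides describe, not a production tool, and a full review would also need the raw option choices (not just 0/1 scores) to run the distractor counts.

```python
# Illustrative sketch: one pass over a 0/1 score matrix covering the review steps above.

def review_exam(scores):
    print(f"KR-20: {kr20(scores):.2f}")
    p_values = item_difficulty(scores)
    for j, notes in flag_items(scores).items():
        print(f"item {j}: p = {p_values[j]:.2f}, "
              f"r_pb = {point_biserial(scores, j):.2f}, flags: {'; '.join(notes)}")

review_exam([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```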