Item Analysis: A Crash Course Lou Ann Cooper, PhD Master Educator Fellowship Program January 10, 2008.

Similar presentations
Item Analysis.

How to Make a Test & Judge its Quality. Aim of the Talk Acquaint teachers with the characteristics of a good and objective test See Item Analysis techniques.
FACULTY DEVELOPMENT PROFESSIONAL SERIES OFFICE OF MEDICAL EDUCATION TULANE UNIVERSITY SCHOOL OF MEDICINE Using Statistics to Evaluate Multiple Choice.
Consistency in testing
Some (Simplified) Steps for Creating a Personality Questionnaire Generate an item pool Administer the items to a sample of people Assess the uni-dimensionality.
© McGraw-Hill Higher Education. All rights reserved. Chapter 3 Reliability and Objectivity.
Psychometrics William P. Wattles, Ph.D. Francis Marion University.
 A description of the ways a research will observe and measure a variable, so called because it specifies the operations that will be taken into account.
Psychology: A Modular Approach to Mind and Behavior, Tenth Edition, Dennis Coon Appendix Appendix: Behavioral Statistics.
Table of Contents Exit Appendix Behavioral Statistics.
Part II Knowing How to Assess Chapter 5 Minimizing Error p115 Review of Appl 644 – Measurement Theory – Reliability – Validity Assessment is broader term.
Test Construction Processes 1- Determining the function and the form 2- Planning( Content: table of specification) 3- Preparing( Knowledge and experience)
Chapter 4 Validity.
QUANTITATIVE DATA ANALYSIS
Reliability and Validity
A quick introduction to the analysis of questionnaire data John Richardson.
Item Analysis Prof. Trevor Gibbs. Item Analysis After you have set your assessment: How can you be sure that the test items are appropriate?—Not too easy.
Lesson Nine Item Analysis.
Multiple Choice Test Item Analysis Facilitator: Sophia Scott.
ANALYZING AND USING TEST ITEM DATA
Evaluating a Norm-Referenced Test Dr. Julie Esparza Brown SPED 510: Assessment Portland State University.
Social Science Research Design and Statistics, 2/e Alfred P. Rovai, Jason D. Baker, and Michael K. Ponton Internal Consistency Reliability Analysis PowerPoint.
Part #3 © 2014 Rollant Concepts, Inc.2 Assembling a Test #
Foundations of Educational Measurement
MEASUREMENT CHARACTERISTICS Error & Confidence Reliability, Validity, & Usability.
McMillan Educational Research: Fundamentals for the Consumer, 6e © 2012 Pearson Education, Inc. All rights reserved. Educational Research: Fundamentals.
Analyzing Reliability and Validity in Outcomes Assessment (Part 1) Robert W. Lingard and Deborah K. van Alphen California State University, Northridge.
Induction to assessing student learning Mr. Howard Sou Session 2 August 2014 Federation for Self-financing Tertiary Education 1.
Test item analysis: When are statistics a good thing? Andrew Martin Purdue Pesticide Programs.
Chapter 7 Item Analysis In constructing a new test (or shortening or lengthening an existing one), the final set of items is usually identified through.
Instrumentation (cont.) February 28 Note: Measurement Plan Due Next Week.
Counseling Research: Quantitative, Qualitative, and Mixed Methods, 1e © 2010 Pearson Education, Inc. All rights reserved. Basic Statistical Concepts Sang.
Lab 5: Item Analyses. Quick Notes Load the files for Lab 5 from course website –
EDU 8603 Day 6. What do the following numbers mean?
6. Evaluation of measuring tools: validity Psychometrics. 2012/13. Group A (English)
Appraisal and Its Application to Counseling COUN 550 Saint Joseph College For Class # 3 Copyright © 2005 by R. Halstead. All rights reserved.
Validity and Reliability Neither Valid nor Reliable Reliable but not Valid Valid & Reliable Fairly Valid but not very Reliable Think in terms of ‘the purpose.
RELIABILITY AND VALIDITY OF ASSESSMENT
Chapter 2: Behavioral Variability and Research Variability and Research 1. Behavioral science involves the study of variability in behavior how and why.
Presented By Dr / Said Said Elshama  Distinguish between validity and reliability.  Describe different evidences of validity.  Describe methods of.
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
Measurement Issues General steps –Determine concept –Decide best way to measure –What indicators are available –Select intermediate, alternate or indirect.
Introduction to Item Analysis Objectives: To begin to understand how to identify items that should be improved or eliminated.
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
STATISTICS STATISTICS Numerical data. How Do We Make Sense of the Data? descriptively Researchers use statistics for two major purposes: (1) descriptively.
Chapter 6 - Standardized Measurement and Assessment
2. Main Test Theories: The Classical Test Theory (CTT) Psychometrics. 2011/12. Group A (English)
Dan Thompson Oklahoma State University Center for Health Science Evaluating Assessments: Utilizing ExamSoft’s item-analysis to better understand student.
1 Collecting and Interpreting Quantitative Data Deborah K. van Alphen and Robert W. Lingard California State University, Northridge.
Psychometrics: Exam Analysis David Hope
NURS 306, Nursing Research Lisa Broughton, MSN, RN, CCRN RESEARCH STATISTICS.
5. Evaluation of measuring tools: reliability Psychometrics. 2011/12. Group A (English)
Copyright © Springer Publishing Company, LLC. All Rights Reserved. DEVELOPING AND USING TESTS – Chapter 11 –
Items analysis Introduction Items can adopt different formats and assess cognitive variables (skills, performance, etc.) where there are right and.
ARDHIAN SUSENO CHOIRUL RISA PRADANA P.
Concept of Test Validity
Evaluation of measuring tools: validity
Classroom Analytics.
UMDNJ-New Jersey Medical School
Week 3 Class Discussion.
Analyzing Reliability and Validity in Outcomes Assessment Part 1
Test Development Test conceptualization Test construction Test tryout
Evaluation of measuring tools: reliability
Using statistics to evaluate your test Gerard Seinhorst
Analyzing test data using Excel Gerard Seinhorst
15.1 The Role of Statistics in the Research Process
Analyzing Reliability and Validity in Outcomes Assessment
Collecting and Interpreting Quantitative Data
Tests are given for 4 primary reasons.
Presentation transcript:

Item Analysis: A Crash Course Lou Ann Cooper, PhD Master Educator Fellowship Program January 10, 2008

Validity  Validity refers to “the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores.”  “Validity is an integrative summary.” (Messick, 1995)  Validation is the process of building an argument supporting interpretation of test scores. (Kane, 1992)

Reliability Consistency, reproducibility, generalizability. A norm-referenced concept – it describes relative standing within a group. Only scores can be described as reliable, not tests. Reliability depends on  Test length – number of items  Sample of test takers – group homogeneity  Score range  Dimensionality – content and skills tested

Planning the Test  Test blueprint / table of specifications  Content, skills, domains  Level of cognition  Relative importance of each element  Linked to learning objectives.  Provides evidence for content validity.

Test Blueprint: Third Year Surgery Clerkship Content

Test Statistics  A basic assumption: items measure a single subject area or underlying ability.  A general indicator of test quality is a reliability estimate.  The measure most commonly used to estimate reliability in a single administration of a test is Cronbach's alpha, a measure of internal consistency.

Cronbach’s alpha Coefficient alpha reflects three characteristics of the test:  The interitem correlations – the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test's reliability coefficient are related in this regard.  The length of the test – a test with more items will have a higher reliability, all other things being equal.  The content of the test – generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability. Coefficient alpha = [k / (k − 1)] × [1 − (sum of item variances / total test variance)], where k is the number of items and total test variance = the sum of the item variances + twice the unique covariances.
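As a minimal sketch of that formula in code (the function and variable names are illustrative, and a students × items score matrix is assumed):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a students x items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical data: 5 students x 4 dichotomously scored items
responses = np.array([[1, 1, 1, 0],
                      [1, 0, 1, 1],
                      [0, 0, 1, 0],
                      [1, 1, 1, 1],
                      [0, 0, 0, 0]])
print(round(cronbach_alpha(responses), 3))
```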

Descriptive Statistics  Total test score distribution  Central tendency  Score range  Variability  Frequency distributions for individual items – allow us to analyze the distractors.
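As a sketch, the distractor frequencies for a single item and the total-score distribution can be summarized with the standard library (the responses and scores below are invented for illustration):

```python
from collections import Counter
import statistics

# Hypothetical answer choices from 15 students on one 5-option item (key = 'C')
item_responses = list("CCBCACCDCCECBCC")
print(Counter(item_responses))        # Counter({'C': 10, 'B': 2, 'A': 1, 'D': 1, 'E': 1})

# Hypothetical total scores for the exam
totals = [62, 68, 71, 72, 72, 75, 77, 79, 81, 84, 90]
print(statistics.mean(totals), statistics.median(totals),
      statistics.mode(totals), round(statistics.stdev(totals), 2))
```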

Human Behavior Exam  Mean = (6.78)  Median = 77  Mode = 72

Item Statistics  Response frequencies/distribution  Mean  Item variance/standard deviation  Item difficulty  Item discrimination

Item Analysis Examines responses to individual test items from a single administration to assess the quality of the items and the test as a whole. Did the item function as intended? Were the test items of appropriate difficulty? Were the test items free from technical defects – testwiseness cues or irrelevant difficulty? Was each of the distractors effective?

Item Difficulty  For items with one correct answer worth a single point, difficulty is the percentage of students who answer an item correctly, i.e. item mean.  When an alternative is worth other than a single point, or when there is more than one correct alternative per question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative.  Ranges from 0 to the higher the value, the easier the question.

Item Difficulty  Item difficulty is relevant for determining whether students have learned the concept being tested.  Plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not.  To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item.

Ideal difficulty levels for MCQs Lord, F.M. "The Relationship of the Reliability of Multiple-Choice Test to the Distribution of Item Difficulties," Psychometrika, 1952, 18.

Item Difficulty Assuming a 5-option MCQ, rough guidelines for judging difficulty:
≥ .85  Easy
> .50 and < .85  Moderate
< .50  Hard

Item Discrimination Ability of an item to differentiate among students on the basis of how well they know the material being tested. Describes how effectively the test item differentiates between high ability and low ability students. All things being equal, highly discriminating items increase reliability.

Discrimination Index D = p_u − p_l, where p_u = proportion of students in the upper group who answered the item correctly and p_l = proportion of students in the lower group who answered the item correctly.
D ≥ .40  satisfactory item functioning
.30 ≤ D ≤ .39  little or no revision required
.20 ≤ D ≤ .29  marginal – needs revision
D < .20  eliminate or completely revise
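A minimal sketch of computing D; the 27% upper/lower group size is a common convention rather than something stated on the slide, and all names are illustrative:

```python
import numpy as np

def discrimination_index(item_correct, total_scores, group_frac=0.27):
    """D = p_upper - p_lower, using the top and bottom scoring groups."""
    item_correct = np.asarray(item_correct, dtype=float)   # 1 = correct, 0 = incorrect
    total_scores = np.asarray(total_scores, dtype=float)
    n = max(1, int(round(group_frac * len(total_scores))))
    order = np.argsort(total_scores)                       # ascending by total score
    lower, upper = order[:n], order[-n:]
    return item_correct[upper].mean() - item_correct[lower].mean()
```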

Point biserial correlation Correlation between performance on a single item and performance on the total test. - High and positive: best students get the answer correct; poorest students get it wrong. - Low or zero: no relationship between performance on the item and the total test. - High and negative: Poorest students get the item correct; best get it wrong.

Point biserial correlation  r pbis tends to be lower for tests measuring a wide range of content areas than for more homogeneous tests.  Items with low discrimination indices are often ambiguously worded.  A negative value may indicate that the item was miskeyed.  Tests with high internal consistency consist of items with mostly positive relationships with total test score.
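A minimal sketch of the calculation (here the item is left in the total score; some programs report a corrected value that removes the item's own points first):

```python
import numpy as np

def point_biserial(item_correct, total_scores):
    """Correlation between a dichotomous item (0/1) and the total test score.

    Equivalent to the Pearson correlation when one variable is binary; an item
    answered the same way by everyone has zero variance and no defined value.
    """
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    return float(np.corrcoef(item_correct, total_scores)[0, 1])
```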

Item Discrimination Rough guidelines for r pbis:
> .30  Good
> .10 and < .30  Fair
< .10  Poor

Item Analysis Matrix
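A minimal sketch of how such a per-item matrix might be assembled, combining the statistics above; the function names are illustrative, and the label cutoffs simply reuse the rough guidelines quoted on the earlier slides:

```python
import numpy as np

def classify_difficulty(p):
    return "Easy" if p >= 0.85 else "Moderate" if p > 0.50 else "Hard"

def classify_discrimination(r):
    return "Good" if r > 0.30 else "Fair" if r > 0.10 else "Poor"

def item_analysis_matrix(responses, key):
    """responses: students x items array of chosen options; key: correct option per item."""
    responses = np.asarray(responses)
    correct = (responses == np.asarray(key)).astype(float)    # 1 = correct, 0 = incorrect
    totals = correct.sum(axis=1)                              # total score per student
    rows = []
    for j in range(correct.shape[1]):
        p = correct[:, j].mean()                              # item difficulty
        r = float(np.corrcoef(correct[:, j], totals)[0, 1])   # point-biserial (undefined if p is 0 or 1)
        rows.append((j + 1, round(p, 2), round(r, 2),
                     classify_difficulty(p), classify_discrimination(r)))
    return rows   # (item, difficulty, r_pbis, difficulty label, discrimination label)
```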

[Figure: example item statistics for Item 1 and Item 2]

[Figure: example item statistics for Item 3 and Item 4]

A Sample of MS1 Exams

Cautions  Item analyses reflect internal consistency of items rather than validity.  The discrimination index is not always a measure of item quality:  Extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives.  An item may show low discrimination if the test measures many different content areas and cognitive skills.

Cautions  Item analysis data are tentative. They are influenced by:  type and number of students being tested  instructional procedures employed  both systematic and random measurement error  If repeated use of items is possible, statistics should be recorded for each administration of each item.

Recommendations  Item analysis is a valuable tool for improving items to be used in future tests – item banking.  Modify or eliminate ambiguous, misleading, or flawed items.  Helps improve instructors' skills in test construction.  Identifies specific areas of course content which need greater emphasis or clarity.

Research
Downing SM. The effects of violating standard item writing principles on tests and students: The consequences of using flawed items on achievement examinations in medical education. Adv Health Sci Educ 10.
Jozefowicz RF, et al. The quality of in-house medical school examinations. Acad Med 77(2).
Muntinga JH, Schull HA. Effects of automatic item eliminations based on item test analysis. Adv Physiol Educ 31, 2007.