
Effects of Item Pool Characteristics on Ability Estimation in Computerized Adaptive Tests in K-12
Mohamed Dirir, Norma Sinclair, and Erin Strauts
Paper presented at the symposium "Operationalizing Multi-State, K-12 Computer Adaptive Testing Programs: Technical Challenges and Solutions," NCSA, June 24, 2015, San Diego, CA

Purpose
- Examine how the difficulty distribution of an item pool affects CAT administration
- Examine the effects of item pool difficulty on the accuracy and precision of ability estimation
- Inform the development of adequate item pools for CAT
- Results from the 2014 SBAC field test were used as a guide in choosing the study design and pool characteristics

Importance of the Study
- CAT has arrived in K-12 large-scale assessment
- Over 18 states have administered CAT to millions of students, most of whom were taking a CAT for the first time
- These are high-stakes assessments that require stringent quality control in administration, validity, and reliability
- Research on CAT in K-12 large-scale assessment has been sparse

Some Questions Addressed by the Study
- How does an item bank with a large number of difficult items and few easy items affect the accuracy and precision of ability estimation?
- How does an item bank with a large number of difficult items and few easy items affect the exposure of items in a computer adaptive test?
- How does an item bank containing items calibrated with varying sample sizes (small, medium, large) affect the accuracy and precision of ability estimation?

Design
All item pools contain n = 700 items, with discrimination parameters drawn from a ~ U(0.6, 1.8).

Distribution of item difficulty by condition:
- Uniformly distributed: U(-3, 3)
- Moderately difficult: 5% easy U(-3, -1), 47% moderate U(-1, 1), 48% hard U(1, 3)
- Mostly difficult: 5% easy U(-3, -1), 35% moderate U(-1, 1), 60% hard U(1, 3)
- Extremely difficult: 5% easy U(-3, -1), 15% moderate U(-1, 1), 80% hard U(1, 3)

Distribution of calibration responses per item by condition:
- Uniform: every item receives 1,500 responses
- Unbalanced counts: 56% of items receive 1,800; 20% receive 1,500; 17% receive 1,000; 7% receive 500
- Equal number of items: 25% of items each receive 1,800, 1,500, 1,000, and 500
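As a concrete illustration of these pool specifications, the sketch below generates item parameters for each difficulty condition under a 2PL-style setup, with discriminations drawn from U(0.6, 1.8) as noted above. The function and dictionary names are illustrative, and the 5% easy share assigned to the mostly and extremely difficult pools is inferred from the remaining percentage; this is not the authors' actual generation code.

```python
import numpy as np

rng = np.random.default_rng(2015)

def generate_pool(mix, n_items=700):
    """Draw a 2PL-style item pool: difficulty b from a mixture of uniform
    ranges, discrimination a from U(0.6, 1.8)."""
    counts = rng.multinomial(n_items, [p for p, _, _ in mix])
    b = np.concatenate([rng.uniform(lo, hi, size=k)
                        for (_, lo, hi), k in zip(mix, counts)])
    a = rng.uniform(0.6, 1.8, size=n_items)
    return a, b

# Difficulty mixtures from the design slide; the easy share of the "mostly"
# and "extremely" difficult pools is taken to be the remaining 5% (assumption).
pool_designs = {
    "uniform":  [(1.00, -3, 3)],
    "moderate": [(0.05, -3, -1), (0.47, -1, 1), (0.48, 1, 3)],
    "mostly":   [(0.05, -3, -1), (0.35, -1, 1), (0.60, 1, 3)],
    "extreme":  [(0.05, -3, -1), (0.15, -1, 1), (0.80, 1, 3)],
}
pools = {name: generate_pool(mix) for name, mix in pool_designs.items()}
```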

Simulation Process
- From each set of generated items (700 per pool), tests were built using a linear-on-the-fly testing (LOFT) process
- Test lengths of 25, 35, and 50 items were constructed; results for the 35-item test are presented in this paper
- Examinee abilities were generated from N(0, 1)
- Items were drawn randomly from the pools
- Each pool was calibrated with the selected sample, and theta and item estimates were saved
- For each pool-by-sample combination, this process was replicated 100 times
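A simplified sketch of one replication of this process, assuming a 2PL response model and purely random item selection from the pool (the LOFT step as described). The calibration of the resulting sparse response matrix is left out because the slides do not name the estimation software; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_loft(a, b, n_examinees=1500, test_length=35):
    """One LOFT-style administration: abilities from N(0, 1), a random draw of
    `test_length` items per examinee, and 2PL responses. Items an examinee
    did not see are left as NaN in the response matrix."""
    theta = rng.standard_normal(n_examinees)
    n_items = len(b)
    responses = np.full((n_examinees, n_items), np.nan)
    for i, th in enumerate(theta):
        items = rng.choice(n_items, size=test_length, replace=False)
        p = 1.0 / (1.0 + np.exp(-a[items] * (th - b[items])))
        responses[i, items] = (rng.random(test_length) < p).astype(float)
    return theta, responses

# Example: one replication for a 700-item pool; the study repeats this 100
# times per condition and calibrates each sparse matrix before estimating theta.
a = rng.uniform(0.6, 1.8, 700)
b = rng.uniform(-3.0, 3.0, 700)
true_theta, resp = simulate_loft(a, b)
```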

Study Conditions: Pool Difficulty by Calibration Sample Distribution

Condition   Simulated pool difficulty distribution   Calibration sample
1           Extremely Difficult                      Unbalanced
2           Extremely Difficult                      Equal # of Items
3           Extremely Difficult                      Uniform
4           Mostly Difficult                         Unbalanced
5           Mostly Difficult                         Equal # of Items
6           Mostly Difficult                         Uniform
7           Moderately Difficult                     Unbalanced
8           Moderately Difficult                     Equal # of Items
9           Moderately Difficult                     Uniform
10          Uniformly Difficult                      Unbalanced
11          Uniformly Difficult                      Equal # of Items
12          Uniformly Difficult                      Uniform

Number of Items by Condition and Difficulty Range (range of true difficulty; – = value not recovered)

Condition   -3 to -2   -2 to -1   -1 to 0   0 to 1   1 to 2   2 to 3   Total
1           14         16         50        55       279      236      650
2           17         –          –         –        275      225      635
3           13         15         56        48       281      207      621
4           18         –          121       124      204      183      663
5           –          –          126       119      208      166      648
6           11         21         –         –        200      161      638
7           –          –          169       160      –        143      672
8           12         –          –         163      –        130      654
9           –          –          –         165      –        128      652
10          97         118        120       113      114      99       661
11          94         –          –         116      –        91       649
12          86         110        –         122      –        88       641

Results - LOFT
- Estimated item difficulties were pushed outward at the extremes: difficult items became more difficult and easy items became easier
- Mid-range difficulties were more stable across pools and calibration samples

Difference between True Difficulty and Estimated Difficulty (by range of true difficulty; – = value not recovered)

Condition   -3 to -2      -2 to -1      -1 to 0       0 to 1        1 to 2        2 to 3        Overall
1           0.045 (.21)   -0.001 (.09)  0.001 (.05)   0.002 (.04)   0.004 (.07)   -0.006 (.17)  0.007 (.10)
2           0.047 (.22)   0.006 (.12)   0.003 (.05)   -0.003 (.05)  -0.013 (.09)  -0.053 (.28)  -0.002 (.13)
3           0.034 (.22)   -0.008 (.08)  0.008 (.05)   0.003 (.07)   0.001 (.11)   -0.019 (.23)  0.003 (.13)
4           0.040 (.16)   0.029 (.08)   0.005 (.04)   -0.001 (.04)  -0.006 (.07)  -0.028 (.17)  0.006 (.09)
5           -0.012 (.17)  –             0.001 (.09)   -0.004 (.05)  -0.012 (.09)  -0.026 (.23)  -0.009 (.12)
6           0.046 (.22)   -0.002 (.12)  0.001 (.07)   -0.005 (.06)  -0.005 (.11)  -0.039 (.27)  -0.001 (.14)
7           0.065 (.18)   0.004 (.09)   -0.003 (.04)  –             –             -0.037 (.18)  0.004 (.10)
8           0.041 (.24)   -0.003 (.09)  0.004 (.05)   –             -0.003 (.10)  -0.007 (.21)  –
9           0.016 (.19)   0.030 (.13)   -0.004 (.06)  -0.007 (.06)  -0.008 (.11)  -0.038 (.27)  -0.002 (.14)
10          0.005 (.18)   0.003 (.08)   0.001 (.04)   –             -0.003 (.08)  -0.012 (.19)  -0.002 (.10)
11          0.032 (.25)   0.003 (.09)   -0.001 (.05)  –             -0.009 (.09)  -0.058 (.24)  -0.006 (.13)
12          0.035 (.26)   0.011 (.12)   0.005 (.06)   0.002 (.06)   0.003 (.10)   -0.028 (.23)  0.005 (.14)
Overall     0.033 (.21)   0.006 (.10)   –             –             -0.005 (.09)  -0.029 (.22)  0.001 (.12)

Note: Values listed are (true - estimated) difficulty averaged across items and banks. The standard deviation of bias averaged over banks is in parentheses.
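A minimal sketch of how entries like those above could be computed from a pool's true and estimated difficulties, using the same bin edges as the column headers; the function is illustrative rather than the authors' code.

```python
import numpy as np

def difficulty_bias_by_range(b_true, b_est, edges=(-3, -2, -1, 0, 1, 2, 3)):
    """Mean and SD of (true - estimated) difficulty within bins of true
    difficulty, plus an overall row, mirroring the table layout above."""
    b_true, b_est = np.asarray(b_true), np.asarray(b_est)
    diff = b_true - b_est
    bins = np.digitize(b_true, edges[1:-1])   # 0..5 for the six ranges
    rows = []
    for k in range(len(edges) - 1):
        d = diff[bins == k]
        rows.append((f"{edges[k]} to {edges[k + 1]}", d.mean(), d.std()))
    rows.append(("overall", diff.mean(), diff.std()))
    return rows
```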

Results – CAT: Bias and SEM
- The average bias in theta was greater at the extremes of the ability scale
- Low theta values were underestimated while high theta values were overestimated
- All item pools showed similar bias at the high end of theta, while the uniform pool produced less bias at the low end
- SEM analyses showed a pattern similar to the bias

Bias and SEM of Theta by Pool Difficulty and Ability Range (– = value not recovered)

Bias in theta
Pool difficulty      -1.8 & lower   -1.7 to -0.6   -0.5 to 0.5   0.6 to 1.6   1.7 & higher
Extreme              0.021          0.006          -0.001        -0.006       –
Mostly               0.030          0.004          0.000         -0.005       –
Moderate             0.019          0.009          -0.002        -0.000       –
Uniform (-3, 3)      0.007          0.002          –             –            –

Standard error of measurement
Extreme              0.256          0.194          0.175         0.170        0.181
Mostly               0.258          0.180          0.169         –            –
Moderate             0.251          0.184          0.168         0.167        –
Uniform (-3, 3)      0.179          0.191          –             –            –
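One way such a summary could be computed is sketched below. The cut points approximate the ability bands in the table, bias is taken as true minus estimated theta (assumed to match the sign convention of the difficulty-bias table), and the SEM is taken as the standard deviation of each examinee's estimates across replications; the slides do not spell out the exact formula, so treat these choices as assumptions.

```python
import numpy as np

def theta_bias_and_sem(theta_true, theta_hat_reps,
                       cuts=(-1.75, -0.55, 0.55, 1.65)):
    """theta_hat_reps has shape (n_replications, n_examinees). Returns the
    mean bias and mean SEM within each of the five ability bands."""
    theta_true = np.asarray(theta_true)
    theta_hat_reps = np.asarray(theta_hat_reps)
    bias = theta_true - theta_hat_reps.mean(axis=0)   # true minus estimated
    sem = theta_hat_reps.std(axis=0)                   # spread over replications
    bands = np.digitize(theta_true, cuts)              # 0..4 for the five bands
    return [(bias[bands == k].mean(), sem[bands == k].mean())
            for k in range(len(cuts) + 1)]
```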

Bias in Theta Estimation

Standard Error of Theta Estimation Over Replications

Examinee Ability and the Last Item
- The effectiveness of the last item in the CAT was measured as the difference between the second-to-last ability estimate and the last item's difficulty parameter
- At the lower end of the ability scale, the difficult item pools performed poorly
- The pool with uniformly distributed item difficulty produced small differences at low abilities
- At high ability, all pools performed reasonably well
- Across the ability range, the range of differences between theta and difficulty was 0.192 for the uniform pool and 1.3 to 1.45 for the other pools

Effectiveness of the Last Item (second-to-last theta estimate minus last item difficulty; – = value not recovered)

Pool type              -1.8 & lower   -1.7 to -0.6   -0.5 to 0.5   0.6 to 1.6   1.7 & higher
Extremely Difficult    -1.411         -0.452         -0.028        -0.053       0.026
Mostly Difficult       -1.279         -0.358         0.004         0.003        0.030
Moderately Difficult   -1.423         -0.413         -0.010        0.011        0.028
Uniform (-3, 3)        -0.101         -0.014         0.001         –            0.091
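A small sketch of the last-item metric described above, assuming the simulation records each examinee's provisional theta just before the final item and that item's difficulty; the variable names are illustrative.

```python
import numpy as np

def last_item_effectiveness(theta_before_last, b_last, theta_true,
                            cuts=(-1.75, -0.55, 0.55, 1.65)):
    """Average gap between the ability estimate entering the final item and
    that item's difficulty, summarized within true-ability bands."""
    gap = np.asarray(theta_before_last) - np.asarray(b_last)
    bands = np.digitize(np.asarray(theta_true), cuts)
    return [gap[bands == k].mean() for k in range(len(cuts) + 1)]
```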

Progression through the CAT
- For low theta, the non-uniform difficulty conditions were less likely to administer a less difficult item following an incorrect response
- The uniform difficulty condition was more likely to administer an easier item after an incorrect answer
- Uniform difficulty pools were more consistent across ability groups than the non-uniform difficulty pools
- The average range in the chance of an easier item after an incorrect answer was 0.023 for the uniform pool and 0.243 for the extremely difficult pool

Likelihood of a Harder or Easier Next Item by Pool Type and Ability Range (– = value not recovered)

Likelihood of a harder item after a correct answer
Pool type              -1.8 & lower   -1.7 to -0.6   -0.5 to 0.5   0.6 to 1.6   1.7 & higher
Extremely Difficult    0.784          0.734          0.715         0.821        0.867
Mostly Difficult       0.771          0.738          0.770         0.836        –
Moderately Difficult   0.775          0.752          0.795         0.822        0.813
Uniform (-3, 3)        0.747          0.766          0.788         –            –

Likelihood of an easier item after an incorrect answer
Extremely Difficult    0.612          0.673          0.721         0.812        0.855
Mostly Difficult       0.607          0.694          0.780         0.819        –
Moderately Difficult   0.614          0.696          0.804         –            –
Uniform (-3, 3)        0.758          0.776          0.781         0.759        –
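These progression rates could be tabulated as sketched below, assuming the administration log records, for every adjacent pair of items, whether the response to the first was correct and whether the second item was harder; the field names are hypothetical.

```python
import numpy as np

def progression_rates(correct, next_harder):
    """correct / next_harder: boolean arrays over all adjacent item pairs in
    the CAT administration log (each examinee's final item excluded).
    Returns P(harder next item | correct) and P(easier next item | incorrect)."""
    correct = np.asarray(correct, dtype=bool)
    next_harder = np.asarray(next_harder, dtype=bool)
    p_harder_after_correct = next_harder[correct].mean()
    p_easier_after_incorrect = (~next_harder)[~correct].mean()
    return p_harder_after_correct, p_easier_after_incorrect
```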

Item Exposure
- Conditions with few easy items showed high exposure of those easy items (10 to 50 percent of students saw each one), while only about 1 percent saw each of the difficult items
- For the non-uniform difficulty conditions, the number of times an item was administered was correlated with difficulty, such that less difficult items were administered more often (r ~ .6)
- Conditions with uniformly distributed difficulty banks exposed items in the middle of the difficulty distribution the most (5 to 20 percent of students saw each one)
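A brief sketch of the exposure calculation described above, assuming a response matrix in which items an examinee never saw are coded as NaN (as in the earlier simulation sketch); the names are illustrative.

```python
import numpy as np

def item_exposure(responses, b):
    """responses: (n_examinees, n_items) matrix with NaN for items an
    examinee never saw. Returns each item's exposure rate (share of
    examinees who saw it) and its correlation with item difficulty."""
    seen = ~np.isnan(responses)
    rates = seen.mean(axis=0)
    r = np.corrcoef(rates, b)[0, 1]
    return rates, r
```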

Conclusion
- Practitioners strive to build CAT item pools whose difficulties are uniformly distributed
- The goal is to measure all ability levels with good, equitable precision
- This paper highlighted the consequences of lacking such an ideal item pool, and hence the loss of precision that can follow
- The results show that pools with negatively skewed difficulty distributions may not provide good results for all students in a CAT