Booklet Design and Equating

Similar presentations
Some (Simplified) Steps for Creating a Personality Questionnaire Generate an item pool Administer the items to a sample of people Assess the uni-dimensionality.

Advanced Topics in Standard Setting. Methodology Implementation Validity of standard setting.
IRT Equating Kolen & Brennan, IRT If data used fit the assumptions of the IRT model and good parameter estimates are obtained, we can estimate person.
Choosing appropriate summative tests.
Overview of field trial analysis procedures National Research Coordinators Meeting Windsor, June 2008.
VERTICAL SCALING H. Jane Rogers Neag School of Education University of Connecticut Presentation to the TNE Assessment Committee, October 30, 2006.
Estimating Growth when Content Specifications Change: A Multidimensional IRT Approach Mark D. Reckase Tianli Li Michigan State University.
Why Scale -- 1 Summarising data –Allows description of developing competence Construct validation –Dealing with many items rotated test forms –check how.
Chemometrics Method comparison
Technical Considerations in Alignment for Computerized Adaptive Testing Liru Zhang, Delaware DOE Shudong Wang, NWEA 2014 CCSSO NCSA New Orleans, LA June.
Becoming a Teacher Ninth Edition
Validation of the Assessment and Comparability to the PISA Framework Hao Ren and Joanna Tomkowicz McGraw-Hill Education CTB.
Measuring Learning and Improving Education Quality: International Experiences in Assessment John Ainley South Asia Regional Conference on Quality Education.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
International Forum “EURASIAN EDUCATIONAL DIALOGUE” Yaroslavl (Russia), April 17-19, 2013 Henk A. Moelands Director Cito International Cito, The Netherlands.
Diagnostics Mathematics Assessments: Main Ideas  Now typically assess the knowledge and skill on the subsets of the 10 standards specified by the National.
English Language Acquisition Professional Learning Community WIDA Standards Organizational Meeting March 8, 2011.
Counseling Research: Quantitative, Qualitative, and Mixed Methods, 1e © 2010 Pearson Education, Inc. All rights reserved. Basic Statistical Concepts Sang.
Deep Dive into the Math Shifts Understanding Focus and Coherence in the Common Core State Standards for Mathematics.
Learning Objective Chapter 9 The Concept of Measurement and Attitude Scales Copyright © 2000 South-Western College Publishing Co. CHAPTER nine The Concept.
The hybrid success model: Theory and practice G. Gage Kingsbury Martha S. McCall Northwest Evaluation Association A paper presented to the Seminar on longitudinal.
A COMPARISON METHOD OF EQUATING CLASSIC AND ITEM RESPONSE THEORY (IRT): A CASE OF IRANIAN STUDY IN THE UNIVERSITY ENTRANCE EXAM Ali Moghadamzadeh, Keyvan.
Chapter 4: Variability. Variability Provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
Pearson Copyright 2010 Some Perspectives on CAT for K-12 Assessments Denny Way, Ph.D. Presented at the 2010 National Conference on Student Assessment June.
Scaling and Equating Joe Willhoft Assistant Superintendent of Assessment and Student Information Yoonsun Lee Director of Assessment and Psychometrics Office.
University of Ostrava, Czech Republic, 26-31 March 2012.
Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time Yeow Meng Thum Hye Sook Shin UCLA Graduate.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
Latent regression models. Where does the probability come from? Why isn’t the model deterministic. Each item tests something unique – We are interested.
Ming Lei American Institutes for Research Okan Bulut Center for Research in Applied Measurement and Evaluation University of Alberta Item Parameter and.
LISA A. KELLER UNIVERSITY OF MASSACHUSETTS AMHERST Statistical Issues in Growth Modeling.
Chapter 4 Variability PowerPoint Lecture Slides Essentials of Statistics for the Behavioral Sciences Seventh Edition by Frederick J Gravetter and Larry.
Perspectives on the Achievements of Irish 15-Year-Olds in the OECD PISA Assessment
IRT Equating Kolen & Brennan, 2004 & 2014 EPSY
Sampling and Sampling Distribution
Examining Achievement Gaps
Daniel Muijs Saad Chahine
Next Generation Iowa Assessments
Chapter 15 Panel Data Models.
Vertical Scaling in Value-Added Models for Student Learning
Growth: Changing the Conversation
Multivariate Analysis - Introduction
Assessment Framework and Test Blueprint
Assessment Research Centre Online Testing System (ARCOTS)
Student Growth Measurements and Accountability
Classical Test Theory Margaret Wu.
Validity and Reliability
Item Analysis: Classical and Beyond
Reliability & Validity
Language Arts Assessment Update
Workshop Questionnaire.
Measuring Social Life: How Many? How Much? What Type?
The New OSU Math Placement Exam Jeremy Penn, Ph. D
Survey What? It's a way of asking group or community members what they see as the most important needs of that group or community. The results of the.
Analyzing Reliability and Validity in Outcomes Assessment Part 1
Performance Task Overview
Sampling Distribution
Sampling Distribution
LDZ System Charges – Structure Methodology 26 July 2010
Mohamed Dirir, Norma Sinclair, and Erin Strauts
Understanding and Using Standardized Tests
Formative Assessments Director, Assessment and Accountability
Psych 231: Research Methods in Psychology
Margaret Wu University of Melbourne
Item Analysis: Classical and Beyond
2009 AERA Annual Meeting, San Diego
Chapter 4. Trajectory planning and Inverse kinematics
Item Analysis: Classical and Beyond
MGS 3100 Business Analysis Regression Feb 18, 2016
What is this PAT information telling me about my own practice and my students? Leah Saunders.
Presentation transcript:

Booklet Design and Equating

Items do not fit (unidimensional) IRT models
The construct is artificially put together. For example, mathematics consists of number, data, measurement, algebra and space, and country means differ across these content areas:

                     Number   Algebra   Measurement   Geometry   Data
Australia               498       499           511        491    531
Korea                   586       597           577        598    569
Netherlands             539       514           549        513    560
New Zealand             481       490           500        488    526
Norway                  456       428           461
Russian Federation      505       516           507        515    484

Considerations in Test Design for Equating
Item position effect

Use rotated booklets to reduce position effects, and to cover content. Seven clusters (C1 to C7) are rotated through three block positions; each row is a booklet:

Booklet   Block 1   Block 2   Block 3
   1        C1        C2        C4
   2        C2        C3        C5
   3        C3        C4        C6
   4        C4        C5        C7
   5        C5        C6        C1
   6        C6        C7        C2
   7        C7        C1        C3

Balanced Incomplete Block Design
Thirteen clusters (M1 to M7, S1, S2, R1, R2, PS1, PS2) are rotated through four block positions across thirteen booklets:

Booklet   Block 1   Block 2   Block 3   Block 4
   1        M1        M2        M4        R1
   2        M2        M3        M5        R2
   3        M3        M4        M6        PS1
   4        M4        M5        M7        PS2
   5        M5        M6        S1        M1
   6        M6        M7        S2        M2
   7        M7        S1        R1        M3
   8        S1        S2        R2        M4
   9        S2        R1        PS1       M5
  10        R1        R2        PS2       M6
  11        R2        PS1       M1        M7
  12        PS1       PS2       M2        S1
  13        PS2       M1        M3        S2

Requirements for BIB
Number of pairs: with k clusters there are C(k,2) = k(k-1)/2 pairs.
Example: if k = 13, we need to cover 78 pairs (13 x 12 / 2).
Each booklet has four blocks, giving 6 pairs per booklet (4 x 3 / 2).
Since there are 13 booklets, we can cover 13 x 6 = 78 pairs, so every pair of clusters can appear together in exactly one booklet (verified in the sketch below).
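As a quick check, here is a minimal Python sketch that rebuilds the cyclic design from the table above (booklet i carries clusters i, i+1, i+3, i+9 mod 13, a difference-set construction) and confirms that each of the 78 cluster pairs shares exactly one booklet:

```python
from itertools import combinations

# Clusters in cyclic order: seven mathematics clusters, two science,
# two reading and two problem-solving clusters (13 in total).
clusters = ["M1", "M2", "M3", "M4", "M5", "M6", "M7",
            "S1", "S2", "R1", "R2", "PS1", "PS2"]

# Cyclic BIB design from the table above: booklet i carries clusters
# i, i+1, i+3, i+9 (mod 13), i.e. the difference set {0, 1, 3, 9} mod 13.
booklets = [[clusters[(i + d) % 13] for d in (0, 1, 3, 9)] for i in range(13)]

# Count how often each unordered pair of clusters shares a booklet.
pair_counts = {pair: 0 for pair in combinations(sorted(clusters), 2)}
for booklet in booklets:
    for pair in combinations(sorted(booklet), 2):
        pair_counts[pair] += 1

print(len(pair_counts))                           # 78 pairs in total
print(all(n == 1 for n in pair_counts.values()))  # True: each pair exactly once
```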

Equating

Why is Equating needed? When different tests are administered, the results from the tests are not directly comparable. When a person performs well on a test, we do not know whether it is because the person is very able or because the test is easy. All we can say is that the person found a particular test easy. If Group 1 took Test 1 and Group 2 took Test 2, we cannot compare Group 1 and Group 2 results directly.

Common Item Equating
Two tests containing common items are administered to different groups of respondents. There are a number of ways to perform equating using common items. We discuss the shift method, anchor method and joint calibration method.

                      Test 1 items   Common items   Test 2 items
Test 1 respondents         X              X
Test 2 respondents                        X              X

Before carrying out equating … Check for Item Invariance

Number of common items needed for equating
We recommend a minimum of 30 link items for equating purposes, and more if possible. If the purpose of equating is to put different grade levels of students on the same scale, be aware that the yearly increase in proficiency (i.e., between two adjacent grades of students) is of the order of 0.5 logit (about half a standard deviation of the student ability distribution for a year level). If the purpose of equating is to monitor the trend from one calendar year to another for students in the same grade, be aware that the average cohort change across time is typically very small (perhaps less than 0.05 logit, or less than one month of growth).

Factors influencing item difficulty Curriculum change Exposure to an item Opportunity to learn Item position in a test – fatigue effect

Shift Method
Tests are calibrated separately, and the item and ability parameters for one test are placed on the scale of the other test by a constant shift. The magnitude of the shift is the amount needed to make the means of the common item parameters the same in the two tests.
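In symbols (the notation here is assumed, not from the slides): if the mean difficulties of the common items as calibrated with Test 1 and Test 2 are denoted by the two bars below, the same constant is added to every Test 2 item and ability parameter:

```latex
% Shift method: place Test 2 parameters on the Test 1 scale.
\[
  \delta_i^{(2 \to 1)} = \delta_i^{(2)}
    + \bigl(\bar{\delta}_C^{(1)} - \bar{\delta}_C^{(2)}\bigr),
  \qquad
  \theta_p^{(2 \to 1)} = \theta_p^{(2)}
    + \bigl(\bar{\delta}_C^{(1)} - \bar{\delta}_C^{(2)}\bigr).
\]
```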

Shift and scale method – common items
A variation of the shift method is to transform both the scale and the location of the item parameters. When the standard deviations of the common item difficulties differ between the two tests and are taken into account in matching Test 2 to Test 1, the equating transformation (the mean-sigma method) is delta* = (s1/s2)(delta - m2) + m1, where m1, s1 and m2, s2 are the mean and standard deviation of the common item difficulties as calibrated on Test 1 and Test 2 respectively.
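A minimal Python sketch of this mean-sigma transformation; the arrays d1 and d2 are hypothetical difficulty estimates of the same common items from the two separate calibrations:

```python
import numpy as np

def mean_sigma_transform(common_d1, common_d2):
    """Return (A, B) such that delta_new = A * delta + B maps Test 2
    difficulties onto the Test 1 scale (mean-sigma transformation)."""
    common_d1 = np.asarray(common_d1, dtype=float)
    common_d2 = np.asarray(common_d2, dtype=float)
    A = common_d1.std(ddof=1) / common_d2.std(ddof=1)
    B = common_d1.mean() - A * common_d2.mean()
    return A, B

# Hypothetical difficulty estimates for five common items,
# one set from each separate calibration.
d1 = [-0.8, -0.2, 0.1, 0.6, 1.1]   # calibrated with Test 1
d2 = [-1.1, -0.4, 0.0, 0.4, 0.9]   # calibrated with Test 2

A, B = mean_sigma_transform(d1, d2)
d2_on_test1_scale = A * np.asarray(d2) + B
```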

Shift and scale method – match ability distributions
1. IRT scaling is carried out for each test separately, producing two sets of item parameters.
2. The first test is re-scaled with the common items fixed (anchored) at their calibrated values from the second test.
3. A transformation is worked out from the means and standard deviations of the ability distributions under the two calibrations of the first test (sketched below).
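A minimal sketch of step 3, assuming theta_free and theta_anchored are the Test 1 ability estimates from the free and the anchored calibrations (both arrays are hypothetical stand-ins for real estimates):

```python
import numpy as np

def match_ability_distributions(theta_free, theta_anchored):
    """Return (A, B) such that theta_new = A * theta + B maps results
    from the free Test 1 calibration onto the scale defined by the
    anchored calibration, by matching the means and standard deviations
    of the two ability distributions."""
    theta_free = np.asarray(theta_free, dtype=float)
    theta_anchored = np.asarray(theta_anchored, dtype=float)
    A = theta_anchored.std(ddof=1) / theta_free.std(ddof=1)
    B = theta_anchored.mean() - A * theta_free.mean()
    return A, B
```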

Anchoring method
Item parameters for the second test are anchored at the values obtained for the first test. The anchoring method differs from the shift method in that every common item in the second test is fixed to its parameter value from the first test; in the shift method, only the means of the common item sets are made equal, so individual items in the common item set may not have equal parameter values. The anchoring method may be preferred when one test has a small sample size, so that its item parameter estimates are unreliable.

The joint calibration method (concurrent calibration)

                      Test 1 items   Common items   Test 2 items
Test 1 respondents         X              X
Test 2 respondents                        X              X
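A minimal sketch of how the joint response matrix for a single (concurrent) calibration run can be laid out; the group sizes and item counts are made up for illustration:

```python
import numpy as np

# Hypothetical layout: 20 unique items per test plus 10 common items.
n1, n2 = 500, 600          # respondents who took Test 1 / Test 2
u1, u2, c = 20, 20, 10     # unique items per test, common items

# Placeholder 0/1 scores standing in for the real response data.
rng = np.random.default_rng(0)
resp1 = rng.integers(0, 2, size=(n1, u1 + c))   # Test 1 unique + common
resp2 = rng.integers(0, 2, size=(n2, c + u2))   # common + Test 2 unique

# Columns: [Test 1 unique | common | Test 2 unique]. Items a respondent
# never saw are treated as missing (NaN), not wrong.
joint = np.full((n1 + n2, u1 + c + u2), np.nan)
joint[:n1, :u1 + c] = resp1        # Test 1 group fills left + middle
joint[n1:, u1:] = resp2            # Test 2 group fills middle + right

# A single IRT calibration of `joint` then places all items and all
# respondents on one common scale via the overlapping common columns.
```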

Common person equating method

                                 Test 1 items   Test 2 items
Test 1 respondents                    X
Respondents taking both tests         X              X
Test 2 respondents                                   X

Horizontal and Vertical equating The terms horizontal and vertical equating have been used to refer to equating tests aimed at the same target level of students (horizontal) and at different target levels of students (vertical). Vertical equating is more challenging because of item placement effects in the tests, and also because of differences in students' opportunity to learn.

Equating errors The idea is to capture the variability of the estimated item parameters across two different tests. The magnitudes of equating errors are generally large in comparison with trend or growth estimates. See the examples in the PISA technical reports (e.g., the 2003 technical report).
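One common way to quantify this (the PISA technical reports use a formula of this kind for the link error): treat the item-by-item shifts between the two calibrations as a sample, and take the standard error of their mean. With L link items and shifts d_i:

```latex
% Equating (link) error: standard error of the mean shift over L link
% items, where d_i is the difference between the two calibrated
% difficulties of common item i.
\[
  \mathrm{error}_{\text{equating}}
  = \sqrt{\frac{\sum_{i=1}^{L} \bigl(d_i - \bar{d}\bigr)^{2}}{L\,(L-1)}},
  \qquad
  d_i = \hat{\delta}_i^{(1)} - \hat{\delta}_i^{(2)} .
\]
```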

Challenges in test equating There are real-life challenges in keeping test items invariant: Curriculum changes Item position effect Fatigue Sequencing of items A balanced test design is essential. Check for test delivery mode effects, e.g., online versus paper.