Item Parameter and Scale Score Stability in Alternate Assessments
Ming Lei, American Institutes for Research
Okan Bulut, Center for Research in Applied Measurement and Evaluation, University of Alberta
NCSA – June 24, 2015

Sources of Drift
- Distortion in test scores can be caused by shifts in item performance over time (Goldstein, 1983; Hambleton & Rogers, 1989), owing to changes in
  - cognitive or noncognitive examinee characteristics (Bulut et al., 2015)
  - examinees' opportunities to learn (Albano & Rodriguez, 2013)
  - curriculum or teaching methods (DeMars, 2004; Miller & Linn, 1988)

Item Parameter Drift
- In item response theory (IRT), item parameter drift (IPD) occurs when item performance changes over time.
- Drifted item parameters can result in systematic errors in equating, scaling, and consequently scoring (Kolen & Brennan, 2004).
- Therefore, it is important to check item performance
  - across various subgroups of examinees (gender, ethnic groups, etc.)
  - across test administrations over time
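To make the mechanism concrete, here is a minimal sketch (hypothetical numbers, assuming a Rasch model purely for illustration) of how a 0.3-logit upward drift in an item's difficulty changes the probability of a correct response when the original, stale parameter is still used for scoring.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = np.linspace(-4, 4, 9)          # ability grid
p_base = rasch_prob(theta, b=0.0)      # difficulty at field-test calibration
p_drift = rasch_prob(theta, b=0.3)     # the same item after a 0.3-logit drift

# Expected success rate implied by the stale parameter minus the actual rate
# after drift: the gap is largest near theta = b, and this is what feeds
# systematic error into equating and scoring.
print(np.round(p_base - p_drift, 3))
```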

Purpose of This Study
- In alternate assessments, it is even more crucial to monitor IPD and score stability because
  - the student population is more heterogeneous,
  - fluctuations in the population are more common across years, and
  - the test design changes from field test to operational forms.
- This study focuses on
  1. assessing IPD of items, and
  2. examining score stability in alternate assessments.

Data Source

            Math & Reading      Science
  State 1   3–5, 6–8, …         …, 8, 10
  State 2   3–5, 6–8, …         …, 8, 10–12

Data Source

  Test Year   State 1 (N = 400/130)   State 2 (N = 7,500/2,500)
  2011        Field Test
  2012        Operational
  2013        Operational             Field Test
  2014        Operational             Operational

Test Design
- Field Test (FT) Form
  - Multiple fixed forms linked by common items
  - 9 or 15 tasks in each form
  - 6 to 8 items in each task
  - Students responded to all tasks.
- Operational (OP) Form
  - Single form with various test lengths
  - One form of 12 OP tasks and 3 FT tasks in State 1
  - Three forms of 12 OP tasks and 1 FT task in State 2
  - Students respond to a subset of tasks based on their abilities.

Test Administration

Item Calibration

Parameter Drift Analysis (1)
- Item calibration in the operational setting:
  - First year: free calibration of field-test items
    - Items were calibrated by subject for mathematics and reading to create vertical scales.
    - Items were calibrated by grade/grade band for science.
  - Later years: concurrent calibration of field-test items,
    - using operational items with good fit statistics as anchor items.

Parameter Drift Analysis (2)
- In this study, free calibrations were conducted using 2014 data.
- The new parameters were equated to the existing scale using:
  - Mean/Mean (Loyd & Hoover, 1980)
  - Haebara (Haebara, 1980)
  - Stocking-Lord (Stocking & Lord, 1983)
- Anchor items whose equated difficulties differed from their existing values by more than 0.3 were iteratively deleted from the anchor set (sketched below).
- Different equating methods may lead to different anchor sets.
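A minimal sketch of the anchor-purification rule described above (illustrative difficulties; it assumes a Rasch-type scale, on which the Mean/Mean transformation reduces to a mean shift in difficulties, and the function names are invented for this example):

```python
import numpy as np

def mean_mean_shift(b_new, b_old):
    """Mean/Mean linking constant under a Rasch-type scale (slope fixed at 1):
    the shift that places new free-calibration difficulties on the old scale."""
    return np.mean(b_old) - np.mean(b_new)

def purify_anchors(b_new, b_old, threshold=0.3):
    """Iteratively drop the anchor item whose equated difficulty differs most from
    its old value, re-estimating the shift, until all differences are <= threshold."""
    keep = np.ones(len(b_new), dtype=bool)
    while True:
        shift = mean_mean_shift(b_new[keep], b_old[keep])
        diff = np.abs(b_new + shift - b_old)
        worst = int(np.argmax(np.where(keep, diff, -np.inf)))
        if diff[worst] <= threshold:
            return keep, shift
        keep[worst] = False   # flag the most drifted anchor and re-equate

# Illustrative anchor difficulties: old operational scale vs. new free calibration
b_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_new = np.array([-1.0, -0.2, 0.9, 1.0, 1.7])   # the third anchor has drifted
keep, shift = purify_anchors(b_new, b_old)
print(keep, round(shift, 3))                     # the third anchor is dropped
```

Haebara and Stocking-Lord would replace the mean shift with characteristic-curve criteria, which is why the three methods can end up retaining different anchor sets.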

Evaluation Criteria
1) Root-mean-square deviation (RMSD) of item parameters
2) RMSD of ability estimates
3) Mean absolute percentile difference (MAPD)
   - MAPD takes into account the number of examinees who are affected by drifted parameters.
   - Because the ability distributions in alternate assessments are negatively skewed, a kernel-smoothed empirical cumulative distribution was used to compute MAPD at evenly spaced quadrature points from -4 to 4 (see the sketch below).
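One way the three criteria could be computed is sketched below. The 0.1 quadrature step, the Gaussian kernel bandwidth, and the exact form of MAPD (taken here as the average absolute difference in kernel-smoothed percentile ranks between original and drift-affected ability estimates) are assumptions for illustration rather than details confirmed by the slides.

```python
import numpy as np
from scipy.stats import norm

def smoothed_ecdf(ability_sample, bandwidth=0.3):
    """Gaussian-kernel-smoothed empirical CDF of observed ability estimates."""
    sample = np.asarray(ability_sample, dtype=float)
    return lambda x: norm.cdf((np.atleast_1d(x)[:, None] - sample) / bandwidth).mean(axis=1)

def rmsd(est_a, est_b):
    """Root-mean-square deviation between two sets of item or ability estimates."""
    return float(np.sqrt(np.mean((np.asarray(est_a) - np.asarray(est_b)) ** 2)))

def mapd(theta_base, theta_drift, ability_sample):
    """Mean absolute percentile difference.

    theta_base and theta_drift are ability estimates at the same quadrature points
    under the original and the drift-affected item parameters; the smoothed ECDF of
    observed abilities converts both to percentile ranks, so the difference reflects
    how many examinees sit in the affected score range."""
    cdf = smoothed_ecdf(ability_sample)
    return float(np.mean(np.abs(cdf(theta_base) - cdf(theta_drift))))

# Illustrative usage with a negatively skewed ability sample
rng = np.random.default_rng(0)
abilities = 1.0 - rng.gamma(shape=2.0, scale=0.8, size=500)   # skewed to the left
quad = np.arange(-4.0, 4.0 + 0.1, 0.1)                        # assumed 0.1 step
theta_drift = quad + 0.05                                     # small systematic shift
print(round(rmsd(quad, theta_drift), 3), round(mapd(quad, theta_drift, abilities), 4))
```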

Results (Drift Rate)
MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

Results (RMSD)
MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

Results (RMSD)
MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

Results (MAPD)
MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

Results (MAPD)
MM: Mean/Mean; HB: Haebara; SL: Stocking-Lord

Summary (1)
- Drift rate
  - The drift rate is generally higher in State 1 than in State 2.
- Item parameter drift
  - Larger IPD in State 1 than in State 2
  - In State 1, the drift values range from 0.2 to …
  - In State 2, the drift values range from 0.1 to …
  - This might be due to the greater time difference and smaller sample size in State 1.

Summary (2)
- Impact of IPD on scores (θ)
  - Larger RMSD in State 1 than in State 2
  - In State 1, the RMSD values range from 0.02 to …
  - In State 2, the RMSD values range from 0.01 to …
  - Larger IPD may not lead to larger changes in scores:
    - IPD can occur in two directions, and
    - the effects of IPD may cancel out (illustrated in the sketch below).
- Mean absolute percentile difference (MAPD)
  - The MAPD values for the two states are both below …
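A toy illustration of this cancellation (hypothetical numbers, assuming a Rasch model for simplicity): when drifted items move in opposite directions by similar amounts, the test characteristic curve, and therefore the expected score, barely changes.

```python
import numpy as np

def rasch_probs(theta, b):
    """Item-by-ability matrix of correct-response probabilities (Rasch model)."""
    return 1.0 / (1.0 + np.exp(-(theta - b[:, None])))

theta = np.linspace(-4, 4, 17)
b_base = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
b_drift = b_base + np.array([0.3, -0.3, 0.0, 0.3, -0.3])   # drift in both directions

# Expected raw score (test characteristic curve) under each parameter set
tcc_base = rasch_probs(theta, b_base).sum(axis=0)
tcc_drift = rasch_probs(theta, b_drift).sum(axis=0)

# Maximum change in expected raw score across the ability grid: a tiny
# fraction of the 0-5 score range, because the opposite drifts largely cancel.
print(np.round(np.abs(tcc_base - tcc_drift).max(), 3))
```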

Summary (3)
- Results are aligned with previous studies:
  - Huynh and Meyer (2009, 2010)
  - Wei (2013)
  - Wells, Hambleton, and Meng (2011)
  - Wells, Hambleton, Kirkpatrick, and Meng (2014)
  - Wells, Subkoviak, and Serlin (2002)

Limitations of the Study
- Only two states are included in this study.
- Because of the small sample sizes in alternate assessments, the impact of sampling is inevitable.
- The duration between test administrations (3 years in State 1 and 1 year in State 2) may not be long enough to observe IPD that has a significant impact on scores.

Selected References
- Albano, A. D., & Rodriguez, M. C. (2013). Examining differential math performance by gender and opportunity to learn. Educational and Psychological Measurement, 73.
- Bulut, O., Palma, J., Rodriguez, M. C., & Stanke, L. (2015). Evaluating measurement invariance in the measurement of developmental assets in Latino English language groups across developmental stages. Sage Open, 2.
- Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144–149.
- Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.
- Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179–193.
- Miller, A. D., & Linn, R. L. (1988). Invariance of item characteristic functions with variations in instructional coverage. Journal of Educational Measurement, 25, 205–219.
- Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7.
- Wei, X. E. (2013). Impacts of item parameter drift on person ability estimation in multistage testing. Technical report.
- Wells, C. S., Subkoviak, M. J., & Serlin, R. (2002). The effect of item parameter drift on examinee ability estimates. Applied Psychological Measurement, 26.

Thank you!
For further information, please contact: Ming Lei