An IRT-based approach to detection of aberrant response patterns for tests with multiple components
National Conference on Student Assessment
June 21, 2016 – Philadelphia, PA
Li Cai, Kilchan Choi, & Mark Hansen
Overview
- testing for differences in projected and “observed” score estimates
- illustrations
  - a 4-dimensional English language proficiency assessment
  - a 1-dimensional English language arts/literacy assessment
- other potential applications
- caveats/limitations
Testing for differences in projected and “observed” score estimates
- In item response theory (IRT) scoring, the posterior distribution describes a plausible distribution for an individual’s true ability, given the evidence collected in the test (the item responses).
- The posterior distribution is the source of the scale score point estimate, as well as information about the precision of that estimate (the standard error of measurement).
- In this study, we examine the possibility of using posterior distributions estimated from separate parts of a test to detect unlikely/inconsistent response patterns.
Calibrated projection
For multidimensional tests, calibrated projection (Thissen et al., 2010) is used to obtain, from one part of a test, plausible score distributions that can be compared to the posterior distributions obtained directly through the scoring of another part.
Steps: 1. calibration → 2. projection → 3. scoring → 4. evaluation
[Figure: path diagram illustrating the four steps across the two test parts]
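The intuition behind the projection step can be sketched under a simplifying Gaussian assumption (this is an illustration only, not the full Thissen et al. calibrated projection, which works with the entire posterior): if the two dimensions follow a standard bivariate normal prior with correlation rho, and the part-1 posterior is approximately N(eap1, sd1²), the projected score distribution for the second dimension follows by normal conditioning.

```python
import math

def project_score(eap1, sd1, rho):
    """Gaussian sketch of score projection (illustrative assumption,
    not the full calibrated projection method).

    Assumes (theta1, theta2) follow a standard bivariate normal prior
    with correlation rho, and the part-1 posterior for theta1 is
    approximately N(eap1, sd1**2). Conditioning gives theta2 a projected
    mean of rho * eap1 and variance (1 - rho**2) + rho**2 * sd1**2.
    """
    mean2 = rho * eap1
    sd2 = math.sqrt((1.0 - rho**2) + rho**2 * sd1**2)
    return mean2, sd2

# Perfectly correlated dimensions reproduce the part-1 posterior exactly;
# uncorrelated dimensions fall back to the prior, N(0, 1).
print(project_score(1.0, 0.3, 1.0))  # (1.0, 0.3)
print(project_score(1.0, 0.3, 0.0))  # (0.0, 1.0)
```

Note how the projected standard deviation grows as the correlation weakens, which foreshadows the power limitations discussed later.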
A test statistic for comparing two posterior distributions

$$\chi^2_i = \frac{\left(\hat{\theta}_{1i} - \hat{\theta}_{2i}\right)^2}{\sigma^2_{\hat{\theta}_{1i}} + \sigma^2_{\hat{\theta}_{2i}}}, \qquad df = 1$$

- numerator: squared difference between the two score estimates
- denominator: sum of the squared standard errors of measurement (which is the error variance of the numerator, since the estimates are independent)
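The statistic is straightforward to compute; a minimal sketch (the function name is ours, not from the source):

```python
from scipy.stats import chi2

def wald_chi2(theta1, se1, theta2, se2):
    """Wald chi-square statistic (df = 1) comparing two independent
    IRT score estimates, as defined above.

    theta1, theta2: score estimates from the two test parts
    se1, se2: their standard errors of measurement
    """
    stat = (theta1 - theta2) ** 2 / (se1 ** 2 + se2 ** 2)
    p_value = chi2.sf(stat, df=1)  # upper-tail probability
    return stat, p_value

# Example: a one-unit score difference with SEMs of 0.3 and 0.4
stat, p = wald_chi2(1.0, 0.3, 0.0, 0.4)  # stat = 4.0, p ≈ 0.046
```

With the conventional 0.05 threshold, this example would be flagged as an inconsistent pair of score estimates.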
Illustrations a 4-dimensional English language proficiency assessment a 1-dimensional English language arts/literacy assessment
Illustration #1
- Using 3 domains of an English Language Proficiency Assessment (listening, reading, speaking) to predict performance on the 4th domain (writing)
- Compare the projected writing score with the “actual” writing score
[Figure: path diagram — the L, R, and S domains project to W; the “actual” W score is estimated from the writing items]
about 97% of the cases examined do not display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p≥0.05 shown at right
About 3% of the cases examined display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p<0.05 shown at right
[Figure: distribution of score differences; mean difference = −0.01]
Illustration #2
- Test consists of two segments/components
- Scaling model is unidimensional, although this can be viewed as a special case of projection in which the dimensions have the same distribution, N(μ, σ²), and are perfectly correlated (ρ = 1)
[Figure: a unidimensional model for items in segments 1 and 2 is equivalent to a two-dimensional model with ρ = 1]
Illustration #2 (cont.)
- Compare score estimates from 2 components of an English language arts/literacy (ELA/L) assessment
[Figure: each segment’s “actual” score is compared with the score projected from the other segment, with ρ = 1]
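The two-segment comparison can be sketched end to end with a minimal EAP scoring routine under a 2PL model. This is an illustration only: the item parameters below are hypothetical, not those of the operational ELA/L assessment, and operational scoring uses richer models.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 81)  # quadrature grid for theta

def eap_score(responses, a, b):
    """EAP estimate and posterior SD under a 2PL model with a standard
    normal prior (a sketch with hypothetical item parameters)."""
    responses, a, b = map(np.asarray, (responses, a, b))
    # item response probabilities at each grid point: (grid, items)
    p = 1.0 / (1.0 + np.exp(-a * (GRID[:, None] - b)))
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = like * np.exp(-0.5 * GRID**2)   # multiply by N(0,1) prior
    post /= post.sum()                     # normalize the posterior
    eap = float((GRID * post).sum())
    sd = float(np.sqrt(((GRID - eap) ** 2 * post).sum()))
    return eap, sd

# Hypothetical item parameters for the two segments
a1, b1 = [1.2, 0.9, 1.5, 1.0], [-0.5, 0.0, 0.5, 1.0]
a2, b2 = [1.1, 1.3, 0.8, 1.4], [-1.0, -0.2, 0.3, 0.8]

# An inconsistent pattern: strong on segment 1, weak on segment 2
t1, s1 = eap_score([1, 1, 1, 1], a1, b1)
t2, s2 = eap_score([0, 0, 0, 0], a2, b2)
chi2_stat = (t1 - t2) ** 2 / (s1**2 + s2**2)  # the Wald statistic
```

With only four items per segment the posterior SDs are large, so even this extreme all-right/all-wrong pattern yields only a moderately large statistic — a concrete view of the precision-driven power limits noted below.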
about 93% of the cases examined do not display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p≥0.05 shown at right
about 7% of the cases examined display significant differences between the scores estimated from the 2 test segments 12 randomly selected cases with Wald 𝜒 2 p<0.05 shown at right
mean difference = -0.08
Caveats and limitations
- Model must be approximately correct
  - Reasonably good overall fit
  - Stable item and structural parameter estimates
  - Aberrant patterns in the calibration may reduce sensitivity
- Ability to detect unlikely responses is dependent on (projected) score precision
  - Low power when projecting from a small number of items
  - Low power when projecting across weakly correlated domains
- Interpretation of findings is not particularly clear. Rudner, Bracey, & Skaggs (1996, p. 107): “In general, we need more clinical oriented studies that find aberrant patterns of responses and then follow up with respondents. We know of no studies that empirically investigate what these respondents are like. Can anything meaningful be said about them beyond the fact that they do not look like typical respondents?”

Rudner, L. M., Bracey, G., & Skaggs, G. (1996). The use of a person-fit statistic with one high-quality achievement test. Applied Measurement in Education, 9(1), 91–109.
Other potential applications
- Test-retest consistency
- Comparisons of machine- and hand-scored responses

Next steps
- Examine calibration of the test statistic (error rates, power)
- Investigate causes/meaning of discrepancy
  - Clustering of inconsistent patterns
  - Student characteristics
THANK YOU! Li Cai, lcai@ucla.edu Kilchan Choi, kcchoi@ucla.edu Mark Hansen, markhansen@ucla.edu