Detection of Item Degradation
Yongwei Yang, Abdullah Ferdous, Tzu-Yun Chin (University of Nebraska-Lincoln)

Presentation transcript:

1 Detection of Item Degradation
Yongwei Yang, Abdullah Ferdous, Tzu-Yun Chin
University of Nebraska-Lincoln
In T. L. Hayes (chair), Item degradation: Impact, detection, and mitigation. An academic-practitioner collaborative forum conducted at the 22nd annual conference of the Society for Industrial and Organizational Psychology, New York City, NY, April 2007.

2 Item Degradation
- An item's favorable psychometric characteristics deteriorate over time
Psychometric characteristics
- Content relevance and representativeness
- Technical characteristics (e.g., "difficulty"/"location", lack of bias)
- Utility (e.g., item-criterion relationship)
Item degradation vs. exposure/compromise
- Item degradation: an observed phenomenon
- Item exposure/compromise: items have become known to test takers prior to administration
Possible reasons for degradation

3 Detection of Item Degradation
Essentially, detection is about investigating the comparability of an item's psychometric properties over time
- "temporal stability of the psychometric characteristics" (Chan, Drasgow, & Sawin, 1999)
Can be evaluated under the framework of:
- Measurement invariance (MI; Meredith, 1993)
- Predictive invariance (PI; Millsap, 1995)

4 Item Degradation as MI or PI
Measurement invariance (MI)
- Same relationship across populations between observed indicators and the latent variables
- Degradation implies noninvariance in such relationships over time
  - Loading, location
Predictive invariance (PI)
- Same relationship across populations between predictors and criterion
- Degradation implies noninvariance in such relationships over time
  - Indicator-criterion relationship
Let x be an observed indicator that measures the latent variable w and predicts y, and let v be some population indicator.
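
The formal definitions that accompanied this notation did not survive in the transcript. A common way to state the two conditions with the slide's symbols is sketched below, in the spirit of the definitions usually attributed to Mellenbergh (1989), Meredith (1993), and Millsap (1995). The conditional-distribution notation f(. | .) is an assumption introduced here, not the authors' own formula.

```latex
\begin{align*}
\text{MI:}\quad & f(x \mid w, v) = f(x \mid w) \\
\text{PI:}\quad & f(y \mid x, v) = f(y \mid x)
\end{align*}
```

Read this way, item degradation corresponds to taking v to be the administration period and finding that one of these equalities fails: the item's relationship to w (MI) or to y (PI) depends on when the item was administered.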

5 Item Degradation Detection Methods
Differential item functioning, item parameter drift
Mean & covariance modeling
- Assessing invariance in various aspects pertaining to measurement or predictive properties
Statistical process control
Models of change

6 Item Degradation Detection
Differential item functioning, item parameter drift
Mean & covariance modeling
- Assessing invariance in various aspects pertaining to measurement or predictive properties
Statistical process control
- Cumulative sum (CUSUM) procedure
Models of change

7 CUSUM for Item Degradation Detection
Our approach: conditional CUSUM
- Tests whether item parameters have deviated from target
- Makes use of observed scores
- Controls for shifts in trait level over time, which is important
"Conditional": test takers at different time points were matched based on their total test score
Procedures
- Initial item calibration: compute the target item parameter (e.g., difficulty) using the first n job applicants from the operational sample
- Define "time group": every m applicants, from applicant n+1 to the last person under investigation
- Define "trait group" (conditioning variable): divide job applicants into groups of reasonable size based on total test scores
- Compute and plot CUSUM statistics for each trait group separately
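
The grouping steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(total_score, item_score)` record layout, the use of fixed score cut points to form trait groups, and the choice to compute a separate target mean per trait group are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def trait_group(total_score, cut_points):
    """Index of the total-score band (trait group) a test taker falls into."""
    return bisect_right(cut_points, total_score)


def group_item_means(records, cut_points):
    """Mean item score per trait group for one block of applicants.

    records: iterable of (total_score, item_score) pairs.
    """
    by_group = defaultdict(list)
    for total_score, item_score in records:
        by_group[trait_group(total_score, cut_points)].append(item_score)
    return {g: mean(scores) for g, scores in by_group.items()}


def conditional_series(records, n, m, cut_points):
    """Split applicants (in order of administration) for conditional CUSUM.

    The first n applicants form the calibration sample; every following
    block of m applicants is one time group. An incomplete final block is
    dropped (an assumption). Returns (targets, series), where targets maps
    trait group -> target item mean and series maps trait group -> list of
    time-group item means.
    """
    targets = group_item_means(records[:n], cut_points)
    series = defaultdict(list)
    rest = records[n:]
    for start in range(0, len(rest) - m + 1, m):
        block_means = group_item_means(rest[start:start + m], cut_points)
        for g, value in block_means.items():
            series[g].append(value)
    return targets, dict(series)


# Toy data: two trait groups split at a total score of 15.
demo = [(10, 1), (20, 3), (10, 2), (20, 4),   # calibration sample (n = 4)
        (10, 3), (20, 5), (10, 3), (20, 5),   # time group 1 (m = 4)
        (10, 1)]                              # incomplete block, dropped
demo_targets, demo_series = conditional_series(demo, n=4, m=4, cut_points=[15])
```

Each list in `demo_series` is the sequence a CUSUM chart for that trait group would be computed from, compared against the matching entry in `demo_targets`.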

8 Conditional CUSUM: Calculation
Two-sided standardized CUSUM
[Equations not recovered from the slide. Labeled quantities: initial-status item variance, time group i item variance, time group i item mean, target item mean.]
Reference value (k) and control limit (h)
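
Because the slide's formula is not available, the sketch below shows the textbook two-sided standardized (tabular) CUSUM of the kind described in Montgomery (2005), applied to time-group item means. It is an assumption that this resembles the authors' statistic: in particular, it standardizes with the target (initial-status) variance only, whereas the slide's labels suggest the time-group variance also entered their formula. The function name and the default values k = 0.5 and h = 5 are conventional illustrative choices, not values reported in the presentation.

```python
from math import sqrt


def two_sided_cusum(group_means, group_sizes, target_mean, target_var,
                    k=0.5, h=5.0):
    """Textbook two-sided standardized CUSUM for a sequence of group means.

    For time group i with mean xbar_i and size n_i:
        z_i     = (xbar_i - target_mean) / sqrt(target_var / n_i)
        upper_i = max(0, upper_{i-1} + z_i - k)   # accumulates upward shifts
        lower_i = min(0, lower_{i-1} + z_i + k)   # accumulates downward shifts
    A group is flagged when upper_i > h or lower_i < -h.
    Returns a list of (upper_i, lower_i, flagged) tuples.
    """
    upper, lower = 0.0, 0.0
    results = []
    for xbar, n_i in zip(group_means, group_sizes):
        z = (xbar - target_mean) / sqrt(target_var / n_i)
        upper = max(0.0, upper + z - k)
        lower = min(0.0, lower + z + k)
        results.append((upper, lower, upper > h or lower < -h))
    return results


# Toy example: a sustained upward shift of one standard error per time group.
up = two_sided_cusum([0.5, 0.5, 0.5], [4, 4, 4],
                     target_mean=0.0, target_var=1.0, k=0.5, h=1.2)
# Mirror image: a sustained downward shift.
down = two_sided_cusum([-0.5, -0.5, -0.5], [4, 4, 4],
                       target_mean=0.0, target_var=1.0, k=0.5, h=1.2)
```

The reference value k sets how large a standardized deviation must be before it starts to accumulate, and the control limit h sets how much accumulated deviation triggers a flag. In the conditional version, this calculation would be run once per trait group.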

9 Conditional CUSUM: Data Source
A web-based personnel selection assessment for selecting managers
- 103 items measuring job-related non-cognitive attributes
- CTT-based test construction and scoring
- Fixed-length, linear test
- Unproctored
Sample:
- Job applicants from Oct to Sept
- Re-takers excluded
- Total N = 7,000

10 Conditional CUSUM: Results
Among the 103 items
- 36 flagged for upward shift in item means for at least one trait group
- 20 flagged for downward shift in item means for at least one trait group
- 9 flagged for having both upward and downward shifts for different trait groups
- 38 not flagged for any trait group
A couple of examples: it035, it174
Follow-up analysis:
- Were there differences across item types with respect to the likelihood of being flagged by conditional CUSUM?

11 Conditional CUSUM: Follow-up
Multinomial logistic regression
- DV: conditional CUSUM flag; 3 categories; "Not Flagged" as the reference category
- IV: ability (6 levels), item type (3 levels, with multiple choice (MC) as the reference group)
Results
- GOF statistic indicates appropriate fit of the main-effect model (χ² = 16.83, df = 20, p = .664)
- The impact of ability level on the CUSUM flags was not statistically significant (χ² = 13.48, df = 10, p = .198)
- The impact of item type on the CUSUM flags was statistically significant (χ² = 17.83, df = 4, p = .001)
  - MC items were more likely to be flagged by conditional CUSUM for negative shifts
  - Forward items were more likely to be flagged by conditional CUSUM for positive shifts
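
As a quick arithmetic check on the reported tests, the p-values can be recomputed from the chi-square statistics and degrees of freedom. The sketch below uses the closed-form upper-tail probability that holds when the degrees of freedom are even, which is the case for all three tests here; the function name is hypothetical. The degrees of freedom are also what the design implies: 2 non-reference outcome categories times 5 ability contrasts gives 10, and 2 times 2 item-type contrasts gives 4.

```python
from math import exp


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m:  P(X > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0      # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j    # update to (x/2)^j / j!
        total += term
    return exp(-half) * total


# (statistic, df, p-value reported on the slide)
reported = [(16.83, 20, 0.664), (13.48, 10, 0.198), (17.83, 4, 0.001)]
recomputed = [chi2_sf_even_df(stat, df) for stat, df, _ in reported]
for (stat, df, p_slide), p in zip(reported, recomputed):
    print(f"chi2 = {stat}, df = {df}: recomputed p = {p:.3f}, slide p = {p_slide}")
```

The recomputed values agree with the slide to three decimal places, so the reported statistics, degrees of freedom, and p-values are mutually consistent.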

12 Model of Change
Perspective 1:
- Understanding patterns of change using examinee characteristics
- Do the trajectories of item parameter change vary across different types of examinees?
  - Applicant location, SES, demographics, etc.
Perspective 2:
- Understanding patterns of change using item characteristics
- Do the trajectories of item parameter change vary across different types of items?
  - Item format, complexity, content area, etc.
Formulating these questions in a longitudinal analysis framework

13 Perspective 1 Example
Using a 2-level longitudinal model to explore:
- RQ1: On average, was there a shift in item difficulty?
- RQ2: Were there variations in the slope of the shift?
- (If yes to RQ2) RQ3: Could the variations be explained by job applicant characteristics (e.g., trait level, region, etc.)?
The model:
[Level I and Level II equations not recovered from the slide.]
Analysis with item 174:
- RQ1: significant positive slope
- RQ2: non-significant variations
- RQ3: not pursued

14 Perspective 2 Example
Using a 2-level longitudinal model to explore:
- RQ1: Across items, on average, was there a change in item difficulty over time?
- RQ2: Were there variations in the slope of the change across items?
- (If yes to RQ2) RQ3: Could the variations be explained by item characteristics?

15 Perspective 2 Example
Model A:
[Level I and Level II equations not recovered from the slide.]
Analysis with this data set:
- RQ1: average slope across items was not different from zero
- RQ2: significant variations in slopes across items
Model B:
Analysis with this data set:
- RQ3: item type did not explain a significant portion of the variations in slopes
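
The model equations are missing from the transcript. For orientation only, a generic two-level linear growth model that matches the three research questions is sketched below. Every symbol here is an assumption introduced for illustration: b_{ti} stands for the difficulty estimate of item i at time point t, and ItemType_i stands in for the dummy-coded item-type predictor (a three-level item type would need two such dummies). The authors' actual specification may differ.

```latex
\begin{align*}
\text{Level I:}\quad
  & b_{ti} = \pi_{0i} + \pi_{1i}\,\mathrm{Time}_{t} + e_{ti} \\
\text{Level II (Model A):}\quad
  & \pi_{0i} = \beta_{00} + r_{0i}, \qquad
    \pi_{1i} = \beta_{10} + r_{1i} \\
\text{Level II (Model B):}\quad
  & \pi_{0i} = \beta_{00} + r_{0i}, \qquad
    \pi_{1i} = \beta_{10} + \beta_{11}\,\mathrm{ItemType}_{i} + r_{1i}
\end{align*}
```

Under this sketch, RQ1 asks whether the average slope \beta_{10} differs from zero, RQ2 asks whether the slope variance \mathrm{Var}(r_{1i}) differs from zero, and RQ3 asks whether \beta_{11} accounts for a meaningful share of that slope variance.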

16 Summary and Discussions
Two types of methods that serve different purposes:
- Statistical process control (e.g., CUSUM):
  - Real-time monitoring of degradation
  - We illustrated the conditional CUSUM procedure, but other methods exist (e.g., an IRT-based moving residual approach by Han & Hambleton [2004])
- Explicit modeling of patterns of degradation:
  - Understanding the nature of degradation, exploring potential factors that impact degradation, assisting the development of prevention and mitigation procedures
  - We illustrated longitudinal modeling methods, but various methods for studying MI/PI may be applied
These methods can also be used in monitoring and understanding degradation in other parameters (e.g., item variance, discrimination, response time)
- It might be helpful to monitor/model multiple parameters simultaneously to (1) "flag" items more accurately and (2) understand factors behind degradation

17 Summary and Discussions
Understanding temporal stability of measurement properties is essential to:
- Valid decisions based on test scores
- Valid inferences in substantive research based on assessment outcomes
  - Research on the Flynn effect (e.g., Wicherts et al., 2004)
Further research is needed, such as:
- What monitoring approaches would better fit personnel selection assessment programs?
- What would lead to or impact degradation?
- How would item-level degradation impact test-level decisions and inferences?
- Etc.

18 Some Useful References
MI & PI concepts
- Mellenbergh (1989)
- Meredith (1993)
- Millsap (1995)
Various IPD and item exposure detection methods
- Bock, Muraki, & Pfeiffenberger (1988)
- Chan, Drasgow, & Sawin (1999)
- DeMars (2004)
- Donahue & Isham (1998)
- Han & Hambleton (2004)
- Kim, Cohen, & Park (1995)
CUSUM and psychometric applications
- Hawkins & Olwell (1998)
- Meijer & van Krimpen-Stoop (2003)
- Montgomery (2005)
- van Krimpen-Stoop & Meijer (2002)
- Veerkamp & Glas (2000)

19 Contacts
Yongwei Yang:
Abdullah Ferdous:
Tzu-Yun Chin:
THANK YOU

20 Item 35 Conditional CUSUM Charts

21 Item 174 Conditional CUSUM Charts