CAT Item Selection and Person Fit: Predictive Efficiency and Detection of Atypical Symptom Profiles Barth B. Riley, Ph.D., Michael L. Dennis, Ph.D., Kendon.

Presentation transcript:

CAT Item Selection and Person Fit: Predictive Efficiency and Detection of Atypical Symptom Profiles Barth B. Riley, Ph.D., Michael L. Dennis, Ph.D., Kendon J. Conrad, Ph.D. Funded by NIDA grant 1R21DA025731

Introduction
Do our measures accurately reflect a person's performance or status?
- Example: persons with few endorsed symptoms, but symptoms of high severity
Person fit statistics offer a means of detecting these patterns.
But detecting person misfit in CAT is problematic:
- Reduced number of items administered
- Selected items cover a limited range of the measurement continuum

Item Selection in CAT
Item selection is typically optimized for efficiency and precision of measurement estimation.
- e.g., maximizing Fisher's information function
Alternative procedures could be devised to balance efficiency/precision against obtaining responses over a wider range of the measurement continuum.
- e.g., Linacre's (1995) Bayesian falsification procedure
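As an illustration of maximum-information selection, here is a minimal sketch for a dichotomous Rasch model, where the information an item provides at measure theta is p(1 - p). The function names and the five-item bank are hypothetical and are not taken from the study.

```python
import math

def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b at measure theta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a dichotomous Rasch item: p * (1 - p)."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

def select_mfi(theta, difficulties, administered):
    """Index of the unadministered item with maximum information at theta."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

# Hypothetical item bank (difficulties in logits); item 2 was already given.
difficulties = [-2.0, -0.5, 0.3, 1.5, 3.0]
next_item = select_mfi(0.0, difficulties, administered={2})
print(next_item)
```

Because Rasch information peaks where the item difficulty equals the current estimate, this rule keeps choosing items close to the estimate, which is why the responses cover only a narrow part of the continuum.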

Purpose of Study
Examine the predictive efficiency and sensitivity of various person fit indices in detecting misfit in CAT.
- Predictive efficiency: how well can we predict the overall pattern of misfit based on item responses collected via CAT?
What effect do different item selection methods have on our ability to detect person misfit in a CAT context?

Hypotheses
1. Predictive efficiency of CAT-derived person fit statistics will be enhanced by selecting items from a wider range of the measurement continuum.
2. Greater predictive efficiency will improve detection of atypical responding.

Data Source and Simulation Procedure
Data were from 4,360 individuals presenting to substance abuse treatment upon intake.
Post-hoc CAT simulations were performed:
- One-parameter IRT (Rasch) dichotomous response model
- Maximum-likelihood estimation
- Item selection procedures:
  - Modified "Bayesian" falsification procedure (MBF)
  - Maximum Fisher's Information (MFI)
- Stop rule: all items were administered to examine the effects of successive item administration on person fit indices.
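The slides do not say how the maximum-likelihood estimate was computed. The sketch below shows one common approach, Newton-Raphson on the Rasch log-likelihood; the function names, the step clamp, and the choice to raise an error for all-0 or all-1 patterns (where no finite ML estimate exists) are assumptions made for illustration.

```python
import math

def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b at measure theta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ml_estimate(responses, difficulties, theta=0.0, max_iter=50, tol=1e-6):
    """Newton-Raphson ML estimate of theta; returns (theta, standard_error)."""
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        # Assumption: extreme patterns are rejected rather than adjusted.
        raise ValueError("no finite ML estimate for all-0 or all-1 patterns")
    for _ in range(max_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = raw - sum(probs)                # first derivative of log-likelihood
        info = sum(p * (1.0 - p) for p in probs)   # test information at theta
        step = max(-1.0, min(1.0, gradient / info))  # clamp to keep steps stable
        theta += step
        if abs(step) < tol:
            break
    info = sum(p * (1.0 - p) for p in (rasch_prob(theta, b) for b in difficulties))
    return theta, 1.0 / math.sqrt(info)

# Hypothetical three-item example: two easier items endorsed, hardest not.
theta_hat, se = ml_estimate([1, 1, 0], [-1.0, 0.0, 1.0])
print(round(theta_hat, 2), round(se, 2))
```

In a post-hoc simulation this estimate would be recomputed after each administered item, using only the responses collected so far.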

Internal Mental Distress Scale
The IMDS is a 42-item instrument that is part of the Global Appraisal of Individual Needs (Dennis et al., 2003).
Measures:
- Internal mental distress (second-order factor)
- Depression
- Anxiety
- Trauma
- Homicidality/Suicidality
- Somatic complaints
Validated using a 1-parameter IRT (Rasch) measurement model.

Modified Bayesian Falsification Item Selection (MBF)
1. Set the start value for the measure ($\theta_0$) at 0 logits.
2. Calculate a "target" measure:
   i. If the previous item was endorsed, or for the first item: $\theta_T = \theta_{i-1} + \max(2, SE^2)$
   ii. Otherwise: $\theta_T = \theta_{i-1} - \max(2, SE^2)$
3. For each unadministered item, compute the information function $I_{ni}(\theta_T)$.
4. Select the item with the largest information function.
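The four steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it assumes that "SE 2" in the transcript means the squared standard error of the current estimate, and the function names and item bank are hypothetical.

```python
import math

def item_information(theta, b):
    """Fisher information of a dichotomous Rasch item at theta: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def mbf_target(theta_prev, se_prev, previous_endorsed):
    """Step 2: shift the target up after an endorsement (or on the first item,
    signalled by previous_endorsed=None), down otherwise."""
    shift = max(2.0, se_prev ** 2)  # assumption: 'SE 2' read as SE squared
    if previous_endorsed is None or previous_endorsed:
        return theta_prev + shift
    return theta_prev - shift

def select_mbf(theta_prev, se_prev, previous_endorsed, difficulties, administered):
    """Steps 3 and 4: pick the unadministered item most informative at the target."""
    target = mbf_target(theta_prev, se_prev, previous_endorsed)
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return max(candidates, key=lambda i: item_information(target, difficulties[i]))

# Hypothetical item bank; step 1 sets the starting measure to 0 logits.
difficulties = [-2.0, -0.5, 0.3, 1.5, 3.0]
first = select_mbf(0.0, 1.0, None, difficulties, administered=set())
print(first)
```

Compared with selecting at the current estimate, targeting a point at least 2 logits away deliberately pulls in items the person is expected to clearly endorse or clearly not endorse, which is where unexpected responses are most visible.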

Person Fit Statistics
Residual-based:
- Infit, outfit (Wright & Stone, 1979; Wright, 1980)
- Log infit and outfit (Wright & Stone, 1979)
Non-parametric:
- Modified Caution Index (MCI; Harnisch & Linn, 1981)
- $H^T$ (Sijtsma, 1986; Sijtsma & Meijer, 1992)
Likelihood-based:
- $l_z$ (Drasgow, Levine & Williams, 1985)
CAT-specific:
- CUSUM (van Krimpen-Stoop & Meijer, 2000)
- Used three different methods for estimating response residuals (T1, T3, and T6).
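For orientation, three of the statistics listed above have simple closed forms under the dichotomous Rasch model. The sketch below uses the standard textbook definitions of outfit (mean squared standardized residual), infit (information-weighted mean square) and $l_z$ (standardized log-likelihood); it is not the study's code, and the function names and example data are hypothetical.

```python
import math

def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b at measure theta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def outfit(responses, theta, difficulties):
    """Unweighted mean of squared standardized residuals."""
    probs = [rasch_prob(theta, b) for b in difficulties]
    z2 = [(x - p) ** 2 / (p * (1.0 - p)) for x, p in zip(responses, probs)]
    return sum(z2) / len(z2)

def infit(responses, theta, difficulties):
    """Information-weighted mean square: squared residuals over summed variances."""
    probs = [rasch_prob(theta, b) for b in difficulties]
    num = sum((x - p) ** 2 for x, p in zip(responses, probs))
    den = sum(p * (1.0 - p) for p in probs)
    return num / den

def lz(responses, theta, difficulties):
    """Standardized log-likelihood; large negative values suggest misfit."""
    probs = [rasch_prob(theta, b) for b in difficulties]
    l0 = sum(x * math.log(p) + (1 - x) * math.log(1.0 - p)
             for x, p in zip(responses, probs))
    expected = sum(p * math.log(p) + (1.0 - p) * math.log(1.0 - p) for p in probs)
    variance = sum(p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical data: theta = 0 with items at -1, 0 and 1 logits.
bank = [-1.0, 0.0, 1.0]
expected_pattern = [1, 1, 0]   # easier items endorsed, hardest not
atypical_pattern = [0, 1, 1]   # hardest item endorsed, easiest not
print(outfit(expected_pattern, 0.0, bank), outfit(atypical_pattern, 0.0, bank))
```

The atypical pattern, which endorses the most severe item while skipping the mildest, produces mean squares well above 1 and a negative $l_z$, which is the kind of profile these indices are meant to flag.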

Predictive Efficiency of Person Fit Statistics

Predictive Efficiency, MFI Item Selection

Predictive Efficiency, MBF Item Selection

Min. Number of Items to Achieve $R^2$ = .80
Fit statistic: MFI, MBF
MCI: 13, 11
$H^T$: 18, 17
Infit: 20, 19
Log Infit: 15, 16
Outfit: 39, 36
Log Outfit: 19
$l_z$: 38, 34
CUSUM (T1): 26
CUSUM (T3): 30, 32
CUSUM (T6): 39, 35
Average:
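A sketch of how a "minimum number of items" figure like those above could be derived. It assumes that $R^2$ is the squared Pearson correlation between a fit statistic computed after k CAT items and the same statistic computed on the full instrument; the slides do not state the exact model, and the function names and toy data are hypothetical.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance, so nothing can be predicted
    return sxy ** 2 / (sxx * syy)

def min_items_for_r2(stats_by_length, full_stats, threshold=0.80):
    """stats_by_length[k - 1] holds each person's fit statistic after k items.
    Returns the smallest k whose R^2 with the full-instrument statistic reaches
    the threshold, or None if it is never reached."""
    for k, partial in enumerate(stats_by_length, start=1):
        if r_squared(partial, full_stats) >= threshold:
            return k
    return None

# Toy data for four simulated respondents (not from the study).
full = [1.0, 2.0, 3.0, 4.0]
partial = [
    [2.0, 1.0, 2.0, 1.0],   # after 1 item: R^2 = 0.20
    [1.0, 2.0, 2.0, 4.0],   # after 2 items: R^2 is about 0.85
    [1.0, 2.0, 3.0, 4.0],   # after 3 items: R^2 = 1.00
]
print(min_items_for_r2(partial, full))
```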

Identification of Persons with Atypical Suicide

Atypical Suicide
Conrad and colleagues (2010) identified a subgroup with suicidal ideation but lower levels of depression, anxiety, and trauma.
In this study, however, we defined atypical suicide as persons with:
- 2+ suicidal symptoms
- A level of internal mental distress that is not predictive of suicidality
Under typical CAT operation, these individuals would be unlikely to receive suicide items during a CAT session.

Suicide Groups Based on 2+ Symptoms N=7,348

Predicting Atypical Suicide: All Items
Columns: Variable, AUC, Sensitivity, Specificity
Rows: IMDS, MCI, $H^T$, Infit/Log Infit, Outfit, Log Outfit, $l_z$, CUSUM (T1), CUSUM (T3), CUSUM (T6), Multivariate
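The three column measures can be computed from any score and a known group membership. The sketch below uses the rank-based (Mann-Whitney) form of the AUC and a fixed cutoff for sensitivity and specificity. It assumes higher scores indicate more atypical responding (statistics such as $l_z$, where low values signal misfit, would need their sign flipped); the function names, scores and cutoff are hypothetical and are not results from the study.

```python
def auc(pos_scores, neg_scores):
    """Probability that a randomly chosen positive case scores higher than a
    randomly chosen negative case (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def sensitivity_specificity(pos_scores, neg_scores, cutoff):
    """Classify score >= cutoff as atypical; return (sensitivity, specificity)."""
    sensitivity = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
    specificity = sum(s < cutoff for s in neg_scores) / len(neg_scores)
    return sensitivity, specificity

# Made-up scores for illustration only.
atypical = [0.9, 0.8, 0.4]
typical = [0.7, 0.3, 0.2]
print(auc(atypical, typical))
print(sensitivity_specificity(atypical, typical, cutoff=0.5))
```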

Sensitivity to Predict Atypical Suicide

Comparison of Item Selection Procedures

First 5 Items Administered by CAT

CAT to Full Instrument Correlation

Measurement Precision (RMSE)

Test Information

A Case Example

MFI Item Selection and Measure Estimation First suicide item administered

MBF Item Selection and Measure Estimation First suicide item administered

Comparison
Columns: MFI, MBF, Full
Measure:
Std. Error:
Outfit:
Infit:
$l_z$:
# Suicide: 0, 3, 5
# Administered: 19, 22, 42

Conclusions
Hypothesis 1: Item selection method had only a modest effect on predictive efficiency, though in the hypothesized direction.
- MBF had the strongest effect on outfit, $l_z$ and CUSUM (T6)
Partial support for Hypothesis 2:
- MBF provided efficient detection of the atypical suicide pattern
- This reflects the type of items selected early in the CAT rather than predictive efficiency
MBF was found to be somewhat less efficient than MFI.

Strengths and Limitations
Strengths:
- Large sample
- Clinical sample
- Several fit statistics examined
Limitations:
- Multidimensionality
- Small item bank
- Further work needed on defining "atypicalness" in a clinical context
- Further validation of the approach across instruments and measurement models

References
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106(1).
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11(1).
Harnisch, D. L., & Linn, R. L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18(2).
Linacre, J. M. (1995). Computer-adaptive testing CAT: A Bayesian approach. Rasch Measurement Transactions, 9(1), 412.
Sijtsma, K. (1986). A coefficient of deviance of response patterns. Kwantitatieve Methoden, 7, 131–145.
Sijtsma, K., & Meijer, R. R. (1992). A method for investigating the intersection of item response functions in Mokken's non-parametric IRT model. Applied Psychological Measurement, 16(2).
van Krimpen-Stoop, E. M., & Meijer, R. R. (2000). Detecting person misfit in adaptive testing using statistical process control techniques. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice. Boston: Kluwer Academic.
Wright, B. D. (1980). Afterword. In G. Rasch (Ed.), Probabilistic models for some intelligence and attainment tests: With foreword and afterword by Benjamin D. Wright. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: University of Chicago, MESA Press.

Thank you! For more information, contact: Barth Riley, Ph.D. For more information about the psychometrics of the Global Appraisal of Individual Needs (GAIN), including the Internal Mental Distress Scale, go to: