Computerized Adaptive Testing in Clinical Substance Abuse Practice: Issues and Strategies Barth Riley Lighthouse Institute, Chestnut Health Systems.

Overview CAT Basics CAT in Clinical Assessment Triage of individuals to support clinical decision making Measuring Multiple Dimensions Identifying Persons with Atypical Presentation of Symptoms

Evidence-Based Practice Requires accurate diagnosis, treatment placement, and outcomes monitoring, as well as assessment over a wide range of domains. The costs of evidence-based assessment include: time, respondent burden, and increased staff resources (including training).

Improving Efficiency The use of screeners and short-form instruments has significantly improved the efficiency of the assessment process. They can help determine whether a full assessment is warranted, but they are not a substitute for a full assessment, owing to lack of precision, floor and ceiling effects, and limited content validity.

CAT Basics

CAT Process (diagram): The first item is of middle difficulty. After each response, the score is calculated and the next best item is selected based on item difficulty (±1 standard error): a correct response leads to an item of increased difficulty, an incorrect response to one of decreased difficulty.

Logical Components of CAT Start Rule Item Selection Measure Estimation Stop Rule(s)

The Start Rule Used to select the first item. What measure is assigned to the respondent prior to selecting the first item? It can be an arbitrary value (e.g., 0 on the logit scale) or can be based on previously gathered information.
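The start rule can be sketched as follows; this is a minimal illustration (not the study's implementation), assuming a dichotomous item bank on the logit scale and selection of the item closest in difficulty to the starting estimate:

```python
# Sketch of a CAT start rule (illustrative, not the author's exact code).
# The initial estimate is either an arbitrary default (0 logits) or
# derived from previously gathered information such as a screener score.

def initial_estimate(screener_logit=None, default=0.0):
    """Return the provisional measure used to pick the first item."""
    return screener_logit if screener_logit is not None else default

def first_item(item_difficulties, theta0):
    """Select the item whose difficulty is closest to the starting estimate."""
    return min(range(len(item_difficulties)),
               key=lambda i: abs(item_difficulties[i] - theta0))

bank = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical item difficulties (logits)
print(first_item(bank, initial_estimate()))      # default start at 0 logits
print(first_item(bank, initial_estimate(1.2)))   # screener-informed start
```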

Item Information (figure): Information curve for an item with difficulty = 0.5. Information is at its maximum when the trait level equals the item difficulty (0.5), and declines as the item becomes too difficult or too easy for the respondent.
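The information curve follows directly from the Rasch model: for a dichotomous item, information at a given trait level is P(1 - P), which peaks at 0.25 when the trait level equals the item difficulty. A small sketch (illustrative, not the presentation's code):

```python
import math

# Item information for a dichotomous Rasch item: I(theta) = P * (1 - P),
# where P is the probability of endorsing the item. Information peaks
# when the trait level equals the item difficulty (here b = 0.5).

def prob_endorse(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    p = prob_endorse(theta, b)
    return p * (1.0 - p)

b = 0.5
print(item_information(0.5, b))    # maximum information: 0.25
print(item_information(-2.0, b))   # item too difficult for this trait level
print(item_information(3.0, b))    # item too easy
```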

CAT in Clinical Assessment

Clinical Decision Making How severe are the symptoms? What type of treatment is most appropriate? Can CAT be used to answer these questions more efficiently?

Strategy Starting Rules Using screener measures to set the initial measure and select the first item Variable Stop Rules Tight precision around cut points Less precision away from cut points

Riley, Conrad, Dennis & Bezruczko, 2007 Used CAT to place persons into low, moderate, and high levels of substance abuse and dependence. The Substance Problem Scale (SPS) is a 16-item instrument measuring recency of substance use, e.g., "When was the last time you drank alcohol?"

Defining Cut Points Cut points can be established by examining where persons with different levels of severity fall on the measurement continuum.

The Start Rules Random: Randomly select an item between -0.5 and 0.5 logits of severity. Screener: Select most informative item relative to measure on a previously administered screener (SDScr).

The Variable Stop Rule Stop rules set for low, mid and high range of severity. Mid range stop rule was set to SE=0.35 for all simulations. Low and High range stop rule: SE=0.5 to 0.75
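A minimal sketch of such a variable stop rule, with hypothetical cut points and SE targets drawn from the slide (mid-range target of 0.35; a single illustrative outer value from the 0.5-0.75 range):

```python
# Sketch of the variable stop rule: the required precision depends on where
# the provisional measure falls relative to the clinical cut points. The
# cut points themselves are illustrative, not taken from the study.

LOW_CUT, HIGH_CUT = -1.0, 1.0   # hypothetical cut points on the logit scale

def target_se(theta, mid_se=0.35, outer_se=0.60):
    """Standard error that must be reached before the CAT may stop."""
    if LOW_CUT <= theta <= HIGH_CUT:
        return mid_se            # near the cut points: tight precision
    return outer_se              # far from the cut points: looser precision

def should_stop(theta, se):
    return se <= target_se(theta)

print(should_stop(0.2, 0.40))   # mid range, SE still too large
print(should_stop(2.5, 0.40))   # high range, SE already acceptable
```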

CAT Standard Error Middle range, where decisions are made and precision is controlled. High and low ranges, where there is little impact on clinical decisions and precision is allowed to vary more.

CAT Algorithm (flowchart): Start rule using screener → select item → administer item → estimate measure and SE → stop? If the stop criterion is met, end the test; otherwise determine whether the current estimate is in the low, mid, or high range, apply the corresponding stop rule, and select the next item.
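The flowchart can be sketched as a runnable loop against a simulated Rasch respondent; every number here (item bank, true trait level, stop-rule targets) is illustrative rather than taken from the study:

```python
import math
import random

# Runnable sketch of the CAT algorithm: start rule, maximum-information
# item selection, measure estimation, and a range-dependent stop rule.

random.seed(1)
bank = {i: -3.0 + 0.2 * i for i in range(31)}   # item difficulties, -3..3 logits
true_theta = 0.8                                 # simulated respondent

def prob(theta, b):
    """Rasch probability of endorsing an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate(responses):
    """Crude maximum-likelihood trait estimate by grid search."""
    grid = [g / 100.0 for g in range(-400, 401)]
    def loglik(t):
        return sum(math.log(prob(t, b)) if x else math.log(1.0 - prob(t, b))
                   for b, x in responses)
    return max(grid, key=loglik)

def std_error(theta, responses):
    info = sum(prob(theta, b) * (1.0 - prob(theta, b)) for b, _ in responses)
    return 1.0 / math.sqrt(info) if info > 0 else float("inf")

theta, responses, available = 0.0, [], set(bank)   # start rule: 0 logits
while available:
    item = min(available, key=lambda i: abs(bank[i] - theta))  # most informative
    available.remove(item)
    responses.append((bank[item], random.random() < prob(true_theta, bank[item])))
    theta = estimate(responses)
    target = 0.60 if -1.0 <= theta <= 1.0 else 0.80            # variable stop rule
    if std_error(theta, responses) <= target:
        break

print(f"items administered: {len(responses)}, estimate: {theta:.2f}")
```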

Results The screener starting rule improved CAT efficiency by 7%. CAT reduced the number of required items by 13% to 66%. CAT to full-measure correlations ranged from .87 to .99. Agreement in classifying persons into treatment groups based on CAT versus the full measure (kappa coefficients) ranged from .66 to .71.

Results Variable stop rules improved efficiency by 15-38% Efficiency depended on definition of the mid range of severity Screener start rule and variable stop rules resulted in accurate and efficient estimation of substance abuse severity.

Measuring Multiple Dimensions

Assessment on Multiple Dimensions Instruments often measure multiple constructs In CAT, treating a multidimensional item bank as unidimensional is problematic: Some subdimensions may not be adequately measured Particularly if subdimensions are not highly correlated with each other

Strategy: Content Balancing Set an item "quota" for each subscale: the maximum number of subscale items to administer during the CAT. An item is selected if its subscale quota has not been met and it provides maximum information.
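A minimal sketch of quota-based content balancing, with a hypothetical five-item bank and the subscale quotas listed later in the presentation; `select_item` is an illustrative helper, not the study's code:

```python
import math

# Content-balanced item selection: among subscales whose quota has not
# been met, pick the item with maximum information at the current estimate.

def info(theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

# (subscale, difficulty) pairs -- hypothetical item bank
bank = [("Depression", -0.5), ("Depression", 0.3), ("Anxiety", 0.1),
        ("Trauma", 0.8), ("Homicidal/Suicidal", 1.5)]
quota = {"Depression": 5, "Anxiety": 5, "Trauma": 5, "Homicidal/Suicidal": 3}

def select_item(theta, administered_counts, used):
    """Return the index of the next item, or None when all quotas are met."""
    candidates = [i for i, (scale, _) in enumerate(bank)
                  if i not in used
                  and administered_counts.get(scale, 0) < quota[scale]]
    if not candidates:
        return None
    return max(candidates, key=lambda i: info(theta, bank[i][1]))

print(select_item(0.0, {}, set()))               # most informative eligible item
print(select_item(0.0, {"Anxiety": 5}, set()))   # Anxiety quota met: skip it
```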

Internal Mental Distress Scale The IMDS consists of the following subscales: Depression Symptom Scale Anxiety/Fear Symptom Scale Traumatic Distress Scale Homicidal/Suicidal Scale

Variations of Content Balancing Screener: Administers screener items first; no further content balancing. Mixed: Administers screener items, then uses content balancing for remaining items. Full: Uses content balancing throughout CAT session.

Variations of Content Balancing In mixed and full content balancing, the following target number of items is administered from the IMDS subscales: Depression: 5 Anxiety: 5 Trauma: 5 Homicidal/Suicidal: 3

Content Balancing Results (percentage of CATs administering at least N items from each scale; — indicates a value not preserved in the transcript):

Scale                N items   None    Screener   Mixed   Full
Depression           ≥ 1       99.1%   100%       100%    100%
Depression           ≥ 5       —       76.7%      100%    100%
Homicidal/Suicidal   ≥ 1       —       —          100%    100%
Homicidal/Suicidal   ≥ 3       8.2%    7.8%       100%    100%
Anxiety              ≥ 1       100%    100%       100%    100%
Anxiety              ≥ 3       100%    100%       100%    100%
Trauma               ≥ 1       100%    100%       100%    100%
Trauma               ≥ 3       99.7%   100%       100%    100%

CAT to Full-Scale Correlations (table): Correlations between CAT and full-scale measures for the IMDS and its Depression, Homicidal/Suicidal, Anxiety, and Trauma subscales under the None, Screener, Mixed, and Full conditions, together with the average r. The numeric values were not preserved in the transcript.

Placement into Triage Groups (table): Agreement (average kappa) between CAT-based and full-measure triage placement for the IMDS and its subscales under the None, Screener, Mixed, and Full conditions. The numeric values were not preserved in the transcript.

Results Content balancing had the greatest impact on the homicidal/suicidal scale. Mixed content balancing provided the best overall results.

Identifying Persons with Atypical Presentation of Symptoms

Implications: Clients sometimes endorse severe clinical symptoms that are not reflected by overall scores on standard assessments. Misfit in clinical assessment can reflect: difficulty understanding the assessment, cross-cultural effects, differential effects of treatment on some symptoms but not others, and unusual symptom profiles.

Clinical Implications Results reveal subgroups who endorse severe symptoms without endorsing milder symptoms: an atypical suicide profile; substance dependence symptoms without abuse symptoms; persons who commit serious crimes (murder, rape) without having committed less serious criminal offenses.

Person Fit Statistics Person fit statistics are the most common means of detecting atypical responders. In a typical (IRT-predicted) response pattern, a person endorses the milder symptoms and not the more severe ones (e.g., 1 1 1 0 0 across items ordered from least to most severe); in an atypical pattern, severe symptoms are endorsed without the milder ones (e.g., 0 0 0 1 1).

Fit Statistics in CAT Fit statistics become less sensitive as the number of administered items decreases. In CAT, items are usually selected such that each possible response to the item is equally likely, so items that would elicit unusual responses may never be administered by the CAT.
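The outfit statistic used here is the mean squared standardized residual of a response pattern; a sketch for dichotomous Rasch responses, with illustrative item difficulties, shows how a reversed pattern inflates the value:

```python
import math

# Outfit mean-square person-fit statistic for dichotomous Rasch responses:
# the average squared standardized residual. Values near 1 indicate a
# typical pattern; values > 1.33 were classified as atypical on the slides.

def prob(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def outfit(theta, responses):
    """responses: list of (item difficulty, score in {0, 1})."""
    z2 = [(x - prob(theta, b)) ** 2 / (prob(theta, b) * (1.0 - prob(theta, b)))
          for b, x in responses]
    return sum(z2) / len(z2)

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]       # least to most severe
typical = list(zip(difficulties, [1, 1, 1, 0, 0]))    # endorses milder items
atypical = list(zip(difficulties, [0, 0, 1, 1, 1]))   # reversed pattern
print(round(outfit(0.0, typical), 2))    # well below 1
print(round(outfit(0.0, atypical), 2))   # well above the 1.33 threshold
```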

Outfit by Number of Items Administered (percentage of persons in each outfit category; — indicates a value not preserved in the transcript):

Items   < 0.75 (Proto-typical)   0.75-1.33 (Typical)   > 1.33 (Atypical)
—       30.2%                    48.1%                 21.7%
—       34.3%                    51.1%                 14.6%
8       38.4%                    53.2%                 8.4%
4       58.2%                    40.0%                 1.8%

Strategies Item selection strategies. Unidimensional approach: examine response patterns for items representing a second-order construct, such as internal mental distress; fit statistics detect all atypical symptom patterns. Multidimensional approach: compare subdimension measures to detect a specific response pattern (e.g., is the person's level of suicide ideation greater than their level of depression? How big is the difference in measures?). Or a combination of the above.

Does Item Selection Matter? (percentage of persons in each atypicalness category by item-selection condition; — indicates a value not preserved in the transcript):

Category        None    Screener   Mixed   Full    IMDS
Proto-typical   26.7%   34.6%      48.3%   50.5%   49.2%
Typical         69.0%   58.7%      40.8%   38.9%   38.4%
Atypical        4.3%    6.5%       10.9%   10.6%   12.4%
Kappa           —       —          —       —       —

CAT to Full-Measure Person Fit (CAT statistics computed using full content balancing):

CAT statistic                      Relation to full-measure outfit
Outfit                             r = .73
E_i                                r = .31
Homicidal/Suicidal - Depression    r = .08
Logistic regression                91.6% correctly classified

Suicide-Depression Profile (CAT statistics computed using full content balancing; H/S = Homicidal/Suicidal Scale measure):

CAT statistic          Relation to full-measure H/S - Depression difference
Outfit                 r = .11
E_i                    r = -.54
H/S - Depression       r = .92
Multiple regression    R² = .86

Conclusions Fit statistics and examination of subscale scores appear to capture different response patterns. Using effective item selection methods in conjunction with multiple measures of person fit improves our ability to detect atypical symptom patterns.

Potential of CAT in Clinical Practice Reduce respondent burden Reduce staff resources Reduce data fragmentation Streamline complex assessment procedures Assist in clinical decision making Identify persons with atypical profiles

Contact Information A copy of this presentation will be at: For information on this method and a paper on it, please contact Barth Riley at