Operational Data or Experimental Design? A Variety of Approaches to Examining the Validity of Test Accommodations
Cara Cahalan-Laitusis

Review types of validity evidence
Review current research designs
Pros and cons of each approach

Types of Validity Evidence
Psychometric research
Experimental research
Survey research
Argument-based approach

Psychometric Indicators (National Academy of Sciences, 1982)
Reliability
Factor Structure
Item Functioning
Predicted Performance
Admission Decisions

Psychometric Evidence
Is the test as reliable when taken with and without accommodations? (Reliability)
Does the test (or test items) appear to measure the same construct for each group? (Validity)
Are test items of relatively equal difficulty for students with and without a disability who are matched on total test score? (Fairness/Validity)

Psychometric Evidence
Are completion rates relatively equal between students with and without a disability who are matched on total test score? (Fairness)
Is equal access provided to testing accommodations across different disability, racial/ethnic, language, gender, and socio-economic groups? (Fairness)
Do test scores under- or over-predict an alternate measure of performance (e.g., grades, teacher ratings, other test scores, postgraduate success) for students with disabilities? (Validity)

Advantages of Operational Data
Cost-effective
Quick results
Easy to replicate
Provides evidence of validity
Large sample size
Motivated test takers

Limitations of Operational Data
Disability and accommodation are confounded
Order effects cannot be controlled for
Sample size can be insufficient
Difficult to show reasons why data are not comparable between subgroups
Disability and accommodation codes are not always accurate
– Approved accommodations may not be used
– Disability category may be too broad

Types of Analyses
Correlations
Factor Analysis
Differential Item Functioning
Descriptive Analyses

Relationship Among Content Areas
Correlation between content areas (e.g., reading and writing) can also assess a test's reliability.
– Compare correlations among content areas by population (e.g., LD with read aloud vs. LD without an accommodation)
– Does the accommodation alter the construct being measured? (e.g., correlations between reading and writing may be lower if read aloud is used for writing but not reading)
– Is the correlation significantly lower for one population? (difference of .10 or greater)
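A minimal sketch of the comparison described above, assuming operational scores sit in a pandas DataFrame with hypothetical columns reading, writing, and group; the Fisher r-to-z test is one common way to compare two independent correlations.

```python
import numpy as np
import pandas as pd
from scipy import stats

def group_corr(df, group_col, group_value, x="reading", y="writing"):
    """Correlation between two content areas within one population."""
    sub = df.loc[df[group_col] == group_value, [x, y]].dropna()
    return sub[x].corr(sub[y]), len(sub)

def fisher_z_diff(r1, n1, r2, n2):
    """Two-sided test of whether two independent correlations differ."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))

# r_acc, n_acc = group_corr(scores, "group", "LD_read_aloud")
# r_std, n_std = group_corr(scores, "group", "LD_no_accommodation")
# z, p = fisher_z_diff(r_acc, n_acc, r_std, n_std)
```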

Reliability
Examine internal consistency measures
– with and without specific accommodations
– with and without a disability
Examine test-retest reliability between different populations
– with and without specific accommodations
– with and without a disability
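A hedged sketch of the internal-consistency comparison, assuming an examinee-by-item score matrix is available separately for each group (the DataFrame and flag names below are hypothetical).

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha from an examinee-by-item score matrix."""
    items = items.dropna()
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# alpha_accommodated = cronbach_alpha(item_scores[condition == "read_aloud"])
# alpha_standard     = cronbach_alpha(item_scores[condition == "standard"])
```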

Factor Structure
Types of questions
– Is the number of factors invariant?
– Are the factor loadings invariant for each of the groups?
– Are the intercorrelations of the factors invariant for each of the groups?
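A formal invariance test is usually run as a multi-group confirmatory factor analysis; as a rough sketch under assumed tooling (the factor_analyzer package and the data names are not from the slides), one can fit an exploratory model per group and compare loading patterns with Tucker's congruence coefficient.

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # assumed third-party dependency

def group_loadings(item_scores, n_factors=2):
    """Exploratory factor loadings for one group's item-score matrix."""
    fa = FactorAnalyzer(n_factors=n_factors, rotation="promax")
    fa.fit(item_scores)
    return fa.loadings_

def tucker_congruence(a, b):
    """Tucker's congruence between two loading vectors (1.0 = identical pattern)."""
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))

# ref = group_loadings(items_standard)
# foc = group_loadings(items_read_aloud)
# per_factor = [tucker_congruence(ref[:, j], foc[:, j]) for j in range(ref.shape[1])]
```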

Differential Item Functioning
DIF refers to a difference in item performance between two comparable groups of test takers
DIF exists if test takers who have the same underlying ability level are not equally likely to get an item correct
Some recent DIF studies on accommodations/disability:
– Bielinski, Thurlow, Ysseldyke, Freidebach, & Freidebach, 2001
– Bolt, 2004
– Barton & Finch, 2004
– Cahalan-Laitusis, Cook, & Aicher, 2004

Issues Related to the Use of DIF Procedures for Students with Disabilities
Group characteristics
– Definition of group membership
– Differences between ability levels of reference and focal groups
The characteristics of the criterion
– Unidimensional
– Reliable
– Same meaning across groups

Procedures/Sample
DIF procedures (e.g., Mantel-Haenszel, logistic regression, DIF analysis paradigm, SIBTEST)
Reference/focal groups
– Minimum of 100 per group; ETS uses a minimum of 300 for most operational tests
– Select groups that are specific (e.g., LD with read aloud) rather than broad (e.g., all students with an IEP or 504 plan)
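A sketch of the logistic-regression procedure named above for a single item, assuming a long-format DataFrame with hypothetical columns correct (0/1), total_score (matching criterion), and focal (0 = reference, 1 = focal group).

```python
import statsmodels.formula.api as smf
from scipy import stats

def logistic_dif(df):
    """Nested logistic models: ability only, + group (uniform DIF),
    + group-by-ability interaction (nonuniform DIF)."""
    base    = smf.logit("correct ~ total_score", data=df).fit(disp=False)
    uniform = smf.logit("correct ~ total_score + focal", data=df).fit(disp=False)
    nonunif = smf.logit("correct ~ total_score * focal", data=df).fit(disp=False)
    lr_uniform = 2 * (uniform.llf - base.llf)      # likelihood-ratio statistics
    lr_nonunif = 2 * (nonunif.llf - uniform.llf)
    return {
        "uniform_dif_p":    stats.chi2.sf(lr_uniform, df=1),
        "nonuniform_dif_p": stats.chi2.sf(lr_nonunif, df=1),
    }
```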

DIF with Hypotheses
Generate hypotheses on why items may function differently
Code items based on hypotheses
Compare DIF results with item coding
Examine DIF results to generate new hypotheses

Other Psychometric Research
DIF to examine fatigue under extended time
Item completion rates between groups matched on ability
Loglinear analysis to examine whether specific demographic subgroups (SES, race/ethnicity, geographic region, gender) are using a specific accommodation less than other groups
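A minimal sketch of the loglinear check on equal access, assuming a cell-count table with hypothetical columns subgroup, used, and count; the deviance difference between the independence and saturated models indicates whether accommodation use is associated with subgroup membership.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def loglinear_access_test(counts):
    """Poisson loglinear models on the subgroup-by-accommodation-use table;
    returns the G-squared statistic for the independence hypothesis."""
    indep = smf.glm("count ~ C(subgroup) + C(used)", data=counts,
                    family=sm.families.Poisson()).fit()
    satur = smf.glm("count ~ C(subgroup) * C(used)", data=counts,
                    family=sm.families.Poisson()).fit()
    return indep.deviance - satur.deviance
```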

Other Research Studies
Experimental Research
– Differential Boost
Survey/Field Test Research
Argument-Based Evidence

Advantages of Collecting Data
Disability and accommodation can be examined separately
Form and order effects can be controlled
Sample can be specific (e.g., reading-based LD rather than all LD or LD with or without ADHD)
Opportunity to collect additional information
Reasons for differences can be tested
Data can be reused for psychometric analyses

Disadvantages
Cost of large data collection
Test takers may not be as motivated
More time-consuming than psychometric research
Overtesting of students

Differential Boost (Fuchs & Fuchs, 1999)
Would students without disabilities benefit as much from the accommodation as students with disabilities?
– If yes, then the accommodation is not valid.
– If no, then the accommodation may be valid.
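A minimal sketch of the corresponding analysis, assuming long-format data with hypothetical columns score, group (disability vs. not), condition (accommodated vs. standard), and student_id; the group-by-condition interaction is the differential boost of interest.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def differential_boost_anova(df):
    """OLS with a group x condition interaction; the interaction row of the
    ANOVA table tests for differential boost."""
    model = smf.ols("score ~ C(group) * C(condition)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

def differential_boost_mixed(df):
    """If each student is tested under both conditions, a mixed model with a
    random intercept per student respects the repeated measures."""
    return smf.mixedlm("score ~ C(group) * C(condition)",
                       data=df, groups=df["student_id"]).fit()
```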

Differential Boost Design

Ways to reduce cost:
Decrease sample size
Randomly assign students to one of two conditions
Use operational test data for one of the two sessions

Additional data to collect:
Alternate measure of performance on the construct being assessed
Teacher survey (ratings of student performance, history of accommodation use)
Student survey
Observational data (how the student used the accommodation)
Timing data

Additional Analyses
Differential boost
– by subgroups
– controlling for ability level
Psychometric properties (e.g., DIF)
Predictive validity (alternate performance measure required)

Field Testing Survey
How well does this item type measure the intended construct (e.g., reading comprehension, problem solving)?
Did you have enough time to complete this item type?
How clear were the directions (for this type of test question)?

Field Testing Survey
How would you improve this item type?
– To make the directions clearer
– To measure the intended construct
What specific accommodations would improve this item type?
Which presentation approach did the test takers prefer?

Additional Types of Surveys
How accommodation decisions are made
Expert opinion on how (or whether) an accommodation interferes with the construct being measured
Information on how test scores with and without accommodations are interpreted
Correlation between use of accommodations in class and on standardized tests

Additional Research Designs
Think-Aloud Studies or Cognitive Labs
Item Timing Studies
Scaffolded Accommodations

Argument-Based Validity
Clearly define the construct assessed
– Evidence-centered design
Decision tree