Examining Differential Item Functioning of "Insensitive" Test Items. Juliya Golubovich.

Similar presentations
Project VIABLE: Behavioral Specificity and Wording Impact on DBR Accuracy Teresa J. LeBel 1, Amy M. Briesch 1, Stephen P. Kilgus 1, T. Chris Riley-Tillman.

Test Development.
DIF Analysis Galina Larina of March, 2012 University of Ostrava.
Issues in factorial design
General Information --- What is the purpose of the test? For what population is the designed? Is this population relevant to the people who will take your.
Music Preference Among Students at LLC LLC Conformity Project.
Assessment: Reliability, Validity, and Absence of bias
BEYOND SKIN DEEP: INVESTIGATING THE “WHO” OF THE SENSITIVITY REVIEW James Grand Juliya Golubovich Ann Marie Ryan Neal Schmitt Society for Industrial &
BACKGROUND RESEARCH QUESTIONS  Does the time parents spend with children differ according to parents’ occupation?  Do occupational differences remain.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Genetic Factors Predisposing to Homosexuality May Increase Mating Success in Heterosexuals Written by Zietsch et. al By Michael Berman and Lindsay Tooley.
Evolutionary Psychology, Workshop 11: Controllability of Mate Value.
Abstract Rankin and Reason (2005; Reason & Rankin 2006) have suggested than women and students of color experience more harassment on college campuses.
TAYLOR HOWARD The Employment Interview: A Review of Current Studies and Directions for Future Research.
AN EVALUATION OF THE EIGHTH GRADE ALGEBRA PROGRAM IN GRAND BLANC COMMUNITY SCHOOLS 8 th Grade Algebra 1A.
CHAPTER 5: CONSTRUCTING OPEN- AND CLOSED-ENDED QUESTIONS Damon Burton University of Idaho University of Idaho.
Chapter 4 Principles of Quantitative Research. Answering Questions  Quantitative Research attempts to answer questions by ascribing importance (significance)
An Examination of Learning Processes During Critical Incident Training: Implications for the Development of Adaptable Trainees Andrew Neal, Stuart T. Godley,
Inferences about School Quality using opportunity to learn data: The effect of ignoring classrooms. Felipe Martinez CRESST/UCLA CCSSO Large Scale Assessment.
American Pride and Social Demographics J. Milburn, L. Swartz, M. Tottil, J. Palacio, A. Qiran, V. Sriqui, J. Dorsey, J. Kim University of Maryland, College.
1 Research Method Lecture 6 (Ch7) Multiple regression with qualitative variables ©
Is the Force Concept Inventory Biased? Investigating Differential Item Functioning on a Test of Conceptual Learning in Physics Sharon E. Osborn Popp, David.
POSTER TEMPLATE BY: om Sex Differences in Associations between Fear of Negative Evaluation (FNE) and Substance Use Lesley A.
Determining Wages: The Changing Role of Education Professor David L. Schaffer and Jacob P. Raleigh, Economics Department We gratefully acknowledge generous.
+ Equity Audit & Root Cause Analysis University of Mount Union.
Asian International Students Attitudes on Women in College Keyana Silverberg and Margo Hanson Advised by: Susan Wolfgram, Ph.D. University of Wisconsin-Stout.
1 Chapter 11: Survey Research Summary page 343 Asking Questions Obtaining Answers Multi-item Scales Response Biases Questionnaire Design Questionnaire.
Students’ Perceptions of the Physiques of Self and Physical Educators
Introduction Neuropsychological Symptoms Scale The Neuropsychological Symptoms Scale (NSS; Dean, 2010) was designed for use in the clinical interview to.
A statistical method for testing whether two or more dependent variable means are equal (i.e., the probability that any differences in means across several.
Cara Cahalan-Laitusis Operational Data or Experimental Design? A Variety of Approaches to Examining the Validity of Test Accommodations.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Emotional Intelligence: The Relationship Between Emotional Intelligence, Emotion Control, Affective Communication and Gender in University Students.
A MULTIDIMENSIONAL APPROACH TO THE IDENTIFICATION OF TEST FAIRNESS EXPLORATION OF THREE MULTIPLE-CHOICE SSC PAPERS IN PAKISTAN Syed Muhammad Fahad Latifi.
1 Bias and Sensitivity Review of Items for the MSP/HSPE/EOC August, 2012 ETS Olympia 1.
Mearns (1996, 1997) - an extension of Rogers’ (1957) facilitative conditions of therapeutic change. Mearns (2003) - serves as a distinctive hallmark of.
Attractive Equals Smart? Perceived Intelligence as a Function of Attractiveness and Gender Abstract Method Procedure Discussion Participants were 38 men.
Shane Lloyd, MPH 2011, 1,2 Annie Gjelsvik, PhD, 1,2 Deborah N. Pearlman, PhD, 1,2 Carrie Bridges, MPH, 2 1 Brown University Alpert Medical School, 2 Rhode.
Figure 1 Stress by parent gender and country of origin at times 1 and 2 ABSTRACT Newly immigrant parents (N = 253) were interviewed to assess their levels.
1 Item Analysis - Outline 1. Types of test items A. Selected response items B. Constructed response items 2. Parts of test items 3. Guidelines for writing.
Introduction Disordered eating continues to be a significant health concern for college women. Recent research shows it is on the rise among men. Media.
Item Response Theory (IRT) Models for Questionnaire Evaluation: Response to Reeve Ron D. Hays October 22, 2009, ~3:45-4:05pm
Chapter Seven: The Basics of Experimentation II: Final Considerations, Unanticipated Influences, and Cross-Cultural Issues.
Personally Important Posttraumatic Growth as a Predictor of Self-Esteem in Adolescents Leah McDiarmid, Kanako Taku Ph.D., & Aundreah Walenski Presented.
Social Anxiety and College Drinking: An Examination of Coping and Conformity Drinking Motives Lindsay S. Ham, Ph.D. and Tracey A. Garcia, B.A. Florida.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
The Role of Mixed Emotional States in Predicting Men’s and Women’s Subjective and Physiological Sexual Responses to Erotic Stimuli Peterson, Z. D. 1 and.
Researching Technology in South Dakota Classrooms Dr. Debra Schwietert TIE Presentation April 2010 Research Findings.
Chapter 6 - Standardized Measurement and Assessment
Lecture PowerPoint Slides Basic Practice of Statistics 7 th Edition.
Applied Opinion Research Training Workshop Day 3.
 Assumptions are an essential part of statistics and the process of building and testing models.  There are many different assumptions across the range.
Abstract Research with youth faces particular challenges, including potential confusion about researchers’ intentions and vulnerabilities related to power.
Measurements and Their Analysis. Introduction Note that in this chapter, we are talking about multiple measurements of the same quantity Numerical analysis.
LORAS.EDU Abstract Introduction We were interested in exploring general stereotypes undergraduates might have about students who choose to major in artistic.
The measurement and comparison of health system responsiveness Nigel Rice, Silvana Robone, Peter C. Smith Centre for Health Economics, University of York.
United Nations Regional Workshop on the 2010 World Programme on Population and Housing Censuses: Census Evaluation and Post Enumeration Surveys, Bangkok,
The Reliability of Crowdsourcing: Latent Trait Modeling with Mechanical Turk Matt Baucum, Steven V. Rouse, Cindy Miller-Perrin, Elizabeth Mancuso Pepperdine.
Crystal Reinhart, PhD & Beth Welbes, MSPH Center for Prevention Research and Development, University of Illinois at Urbana-Champaign Social Norms Theory.
Questionnaire-Part 2. Translating a questionnaire Quality of the obtained data increases if the questionnaire is presented in the respondents’ own mother.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
Cheryl Ng Ling Hui Hee Jee Mei, Ph.D Universiti Teknologi Malaysia
Sexual Imagery & Thinking About Sex
Nicole R. Buttermore, Frances M. Barlas, & Randall K. Thomas
Concept of Test Validity
Journalism 614: Reliability and Validity
Classroom Assessment: Bias
David Kellen, Henrik Singmann, Sharon Chen, and Samuel Winiger
Variations on Aschs Research
The Effect of Lineup Structure on Individual Identification
Presentation transcript:

Examining Differential Item Functioning of "Insensitive" Test Items
Juliya Golubovich, James A. Grand, M.A., Neal Schmitt, Ph.D., and Ann Marie Ryan, Ph.D. Michigan State University

INTRODUCTION
Fairness in testing is a prominent concern for selection specialists. To enhance test fairness, test developers commonly use a sensitivity review process (or fairness review; Ramsey, 1993). During a typical sensitivity review, reviewers go through test questions to identify and remove content that certain groups of test takers (e.g., gender, racial/ethnic, age, socioeconomic groups) could perceive as insensitive (e.g., upsetting, offensive). Consider an offensive fill-in-the-blank item: "The fact that even traditional country music singers have ________ words such as 'bling' and 'ho' seems to indicate that urban, hip-hop culture has ________ all music genres."

An unaddressed question is whether the items flagged by sensitivity reviewers as problematic are ones that would negatively affect certain test takers' performance (i.e., show psychometric bias). To this end, we examined whether lack of fairness in the form of items judged to be insensitive (determined via a judgmental process before test administration) is associated with psychometric bias (determined via a statistical process after test administration). When test takers of equal ability from different groups have unequal probabilities of responding to an item correctly, there is evidence of differential item functioning (DIF). We examined how item characteristics considered insensitive according to standard sensitivity review guidelines relate to the presence of DIF across male and female test takers. Gender was a reasonable choice given research suggesting women may be more reactive than men to problematic item content (Mael et al., 1996).
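The DIF definition above can be sketched numerically with the two-parameter logistic (2PL) model used later in the analyses. The parameter values below are hypothetical, chosen only to illustrate how examinees of equal ability from two groups can have unequal probabilities of success when an item's difficulty differs between groups:

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response:
    theta = examinee ability, a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters: the item is harder for group 2 (b shifted up),
# so two examinees with identical ability have unequal success
# probabilities -- the signature of DIF.
theta = 0.0                              # same ability in both groups
p_group1 = p_correct(theta, a=1.2, b=-0.5)   # ~0.65
p_group2 = p_correct(theta, a=1.2, b=0.3)    # ~0.41
```

If the two groups' parameters were equal, the probabilities would match at every ability level and the item would show no DIF.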
METHOD

Sample
336 students, primarily young (M = 19.45, SD = 1.65) and White (73.6%), about equally split on gender (n = 170 males).

Measures
Demographics. Age, gender, ethnicity, and race.
Test items. Nine insensitive items were couched within a 30-item verbal ability test. The insensitive items came from a 54-item pool developed by the authors based on insensitive item exemplars from sensitivity reviewer training materials. Items belonged to various categories of insensitivity (e.g., offensive, emotionally provocative, portrays a gender stereotype) derived from fairness guidelines (e.g., ACT, 2006). Testing professionals (n = 49) with experience serving as sensitivity reviewers (10.22 years on average) rated the original pool of 54 items on insensitivity using a four-point scale (1 = highly insensitive; 4 = not problematic). Any one reviewer rated 18 of the 54 items. A sample of students (N = 301; 26.4% male; 14.6% non-White; mean age = 19.62) also evaluated these items using the same scale; they were asked to assume the role of sensitivity reviewers after a brief tutorial on sensitivity reviews. The 9 items chosen for the current study were those rated most insensitive on average by professional and student reviewers and those that showed significant gender differences in student reviewer ratings. The 21 non-problematic items in the current study were selected from the larger set of 108 items rated by student reviewers as those receiving the most favorable ratings (M = 3.85, SD = .04).

RESULTS
Overall, no significant differences in overall test performance were observed between women (M = 20.87, SD = 3.59) and men (M = 20.01, SD = 4.43). None of the insensitive items analyzed exhibited a large amount of DIF; the three items that exhibited the greatest evidence of DIF (females had an advantage on these) were not among those rated as insensitive by judges. Two of these items had significantly different difficulty estimates for men and women.
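As a rough check, the overall gender comparison reported above can be reconstructed from the summary statistics with a Welch's t statistic. This is only a sketch, not the authors' actual analysis: the female n (166) is assumed from the reported totals (336 students, 170 males).

```python
import math

# Summary statistics reported in the poster; n_f = 336 - 170 is assumed
m_f, sd_f, n_f = 20.87, 3.59, 166   # women
m_m, sd_m, n_m = 20.01, 4.43, 170   # men

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic from summary statistics (unequal variances)."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# ~1.96, just under the two-tailed .05 cutoff for df of this size,
# consistent with "no significant differences" in overall performance
t = welch_t(m_f, sd_f, n_f, m_m, sd_m, n_m)
```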
Substantive analyses suggested that females' advantage on these items may have been due to response-style differences between men and women.

Figure 1. Male and female ICCs for item 19.

DISCUSSION
We examined whether insensitive item content would produce item-level male-female performance differences on a verbal ability test. Only 3 items were flagged as problematic by DIF analyses, and these items were not ones sensitivity reviewers saw as problematic. Our findings are consistent with previous research showing that judgmental and statistical processes diverge in identifying differentially functioning items. Our study moves beyond earlier work, however, in that we examined types of insensitive items that in other studies would not make it onto a test to be examined post-administration. Our failure to find DIF on items that sensitivity reviewers make certain to remove from tests could suggest that removing these types of insensitive items may not help prevent DIF and may lead to discarding items useful for assessing ability. Given the costs and challenges of test development, removing useful items is undesirable. However, this does not imply that sensitivity reviews are inefficacious. Even if the types of items we examined do not differentially impact groups' performance, their presence may still influence test reactions. Beyond ensuring that certain groups' performance is not influenced by factors other than the construct of interest, sensitivity reviews are also conducted with the goal of minimizing negative test-taker reactions (McPhail, 2010).

Limitations include that data were not collected in a high-stakes context (where certain groups may react more negatively to insensitivity), that the 9 insensitive items were relatively easy (if insensitivity disrupts concentration, this may be more detrimental on difficult items), and that with our relatively small sample, power may have been insufficient to identify items that were actually problematic.
Procedure
Students responded to demographic questions in an online survey upon signing up for the study. They then appeared at a scheduled time to take the 30-item verbal ability test in a supervised group setting. The test was not timed.

Statistical Analyses
Post administration, we examined test items for differential item functioning based on gender. DIF exists on a particular item when individuals of equal ability but from different groups (in this case, gender groups) have unequal probabilities of answering the item correctly. Differences in item characteristic curves (ICCs) across groups provide evidence of DIF. We used the two-parameter logistic (2PL) model, in which the probability of a correct response to an item is modeled from item difficulty, item discrimination, and examinee ability. BILOG-MG showed good overall model fit to the data, χ²(201) = 198.6, p > .05, but the model did not fit two of the 30 items, and parameters for a third item could not be estimated for males. Thus, a total of 27 items (7 of them judged insensitive) were used for the final DIF analyses. Item parameters were estimated separately for males and females using BILOG-MG, and parameter estimates for the two groups were placed on a common scale so that ICCs could be compared. ICCs for each group were plotted on common graphs and visually inspected for large differences. As examination suggested differences in difficulty rather than discrimination in each case, the DIF function in BILOG-MG was used to test the significance of difficulty differences between ICCs.

Table 1
Responses to Item 19: "One of the best features of the journalist's lifestyle is you never know what's next."

Option | % endorsing (M) | % endorsing (F)
a. you never know what's next | 6.5% | 7.2%
b. it's so unpredictable | 9.4% | 9.6%
c. that you never know what's next | 40.6% | 22.9%
d. one can never predict what's next | 8.2% | 6.0%
e. its unpredictability* | 34.7% | 54.2%
*Correct response
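The visual inspection of ICCs described above can be approximated numerically by measuring the largest vertical gap between two groups' curves over an ability grid. The parameter values and the 0.10 flagging threshold below are hypothetical, chosen only to illustrate the idea; BILOG-MG's own DIF test is what the study actually used.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def max_icc_gap(a_m, b_m, a_f, b_f, lo=-4.0, hi=4.0, steps=81):
    """Largest vertical gap between two groups' ICCs over an ability grid,
    a rough numerical analogue of visual inspection for DIF."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(abs(icc(t, a_m, b_m) - icc(t, a_f, b_f)) for t in grid)

# Hypothetical estimates for one item: equal discrimination, lower
# difficulty for females -> uniform DIF favoring females
gap = max_icc_gap(a_m=1.0, b_m=0.4, a_f=1.0, b_f=-0.2)
flagged = gap > 0.10   # arbitrary illustrative threshold
```

When the gap is driven by difficulty alone (as the study found), the curves are shifted copies of each other, which is why a follow-up test of the difficulty difference is the natural next step.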