A controversy in PISA and other large-scale assessments: the trade-off between model fit, invariance and validity. David Andrich. CEM: 30 years of Evidence in Education.

Similar presentations
2/8/2014 Measuring Disability and Monitoring the UN Convention on the Rights of Persons with Disabilities… … the work of the Washington Group on Disability.

Designing Accessible Reading Assessments National Accessible Reading Assessment Projects General Advisory Committee December 7, 2007 Overview of DARA Project.
Test Development.
PISA FOR DEVELOPMENT Technical Workshops Components and input for ToR of the International Contractor(s) 9 th April 2014 OECD Secretariat 1.
The Research Consumer Evaluates Measurement Reliability and Validity
MGT-491 QUANTITATIVE ANALYSIS AND RESEARCH FOR MANAGEMENT
From Concepts to Variables Sociology 690 – Measurement.
General Information --- What is the purpose of the test? For what population is the designed? Is this population relevant to the people who will take your.
Using Multiple Choice Tests for Assessment Purposes: Designing Multiple Choice Tests to Reflect and Foster Learning Outcomes Terri Flateby, Ph.D.
Chapter Fifteen Understanding and Using Standardized Tests.
By: Michele Leslie B. David MAE-IM WIDE USAGE To identify students who may be eligible to receive special services To monitor student performance from.
Reliability, Validity, Trustworthiness If a research says it must be right, then it must be right,… right??
Overview of field trial analysis procedures National Research Coordinators Meeting Windsor, June 2008.
MSc Applied Psychology PYM403 Research Methods Validity and Reliability in Research.
How to evaluate the cross-cultural equivalence of single items Melanie Revilla, Willem Saris RECSM, UPF Zurich – 15/16 July.
Designing Accessible Reading Assessments Research on Making Large Scale Assessments More Accessible for Students with Disabilities Institute of Education.
1 The New York State Education Department New York State’s Student Reporting and Accountability System.
Chapter 4. Validity: Does the test cover what we are told (or believe)
Social Science Research Design and Statistics, 2/e Alfred P. Rovai, Jason D. Baker, and Michael K. Ponton Internal Consistency Reliability Analysis PowerPoint.
Measurement Problems within Assessment: Can Rasch Analysis help us? Mike Horton Bipin Bhakta Alan Tennant.
A New Look at the Evaluation of Sociological Theories in International Large Scale Educational Assessments Daniel Caro and Andrés Sandoval-Hernandez CIES.
Measurement in Exercise and Sport Psychology Research EPHE 348.
Prototypical Level 4 Performances Students use a compensation strategy, recognizing the fact that 87 is two less than 89, which means that the addend coupled.
Validation of the Assessment and Comparability to the PISA Framework Hao Ren and Joanna Tomkowicz McGraw-Hill Education CTB.
CHAPTER 6, INDEXES, SCALES, AND TYPOLOGIES
WELNS 670: Wellness Research Design Chapter 5: Planning Your Research Design.
Student assessment AH Mehrparvar,MD Occupational Medicine department Yazd University of Medical Sciences.
Constructing/transforming Variables Still preliminary to data analysis (statistics) Would fit comfortably under Measurement A bit more advanced is all.
EDU 8603 Day 6. What do the following numbers mean?
MELS 601 Ch. 7. If curriculum can be defined most simply as what is taught in the school, then instruction is the how —the methods and techniques that.
Construct-Centered Design (CCD) What is CCD? Adaptation of aspects of learning-goals-driven design (Krajcik, McNeill, & Reiser, 2007) and evidence- centered.
Confirmatory Factor Analysis Psych 818 DeShon. Construct Validity: MTMM ● Assessed via convergent and divergent evidence ● Convergent – Measures of the.
Full Structural Models Kline Chapter 10 Brown Chapter 5 ( )
A COMPARISON METHOD OF EQUATING CLASSIC AND ITEM RESPONSE THEORY (IRT): A CASE OF IRANIAN STUDY IN THE UNIVERSITY ENTRANCE EXAM Ali Moghadamzadeh, Keyvan.
Validity Validity: A generic term used to define the degree to which the test measures what it claims to measure.
What Use Are International Assessments for States? 30 May 2008 Jack Buckley Deputy Commissioner National Center for Education Statistics Institute of Education.
ASSESSMENT LITERACY PROJECT Kansas State Department of Education Introduction and Overview Welcome !
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 15 Developing and Testing Self-Report Scales.
The Practice of Social Research Chapter 6 – Indexes, Scales, and Typologies.
Item Response Theory in Health Measurement
The Polytomous Unidimensional Rasch Model: Understanding its Response Structure and Process ACSPRI Social Science Methodology Conference, Sydney, December.
Chapter 6 - Standardized Measurement and Assessment
Rating Scale Examples. A helpful resource
Obtaining International Benchmarks for States Through Statistical Linking: Presentation at the Institute of Education Sciences (IES) National Center for.
LISA A. KELLER UNIVERSITY OF MASSACHUSETTS AMHERST Statistical Issues in Growth Modeling.
Essentials for Measurement. Basic requirements for measuring 1) The reduction of experience to a one dimensional abstraction. 2) More or less comparisons.
The Invariance of the easyCBM® Mathematics Measures Across Educational Setting, Language, and Ethnic Groups Joseph F. Nese, Daniel Anderson, and Gerald.
Northwest Evaluation Association – Measure of Academic Progress.
Measurement Why do we need measurement? We need to measure the concepts in our hypotheses. When we measure those concepts, they become variables. Measurement.
1 Perspectives on the Achievements of Irish 15-Year-Olds in the OECD PISA Assessment
1 Main achievement outcomes continued.... Performance on mathematics and reading (minor domains) in PISA 2006, including performance by gender Performance.
1 Teaching Supplement.  What is Intersectionality?  Intersectionality and Components of the Research Process  Implications for Practice 2.
Reliability and Validity
Daniel Muijs Saad Chahine
Assessment Framework and Test Blueprint
Test Design & Construction
Validity and Reliability
Reliability & Validity
SESRI Workshop on Survey-based Experiments
Booklet Design and Equating
Week 3 Class Discussion.
Workshop questionnaire.
RESEARCH METHODS Lecture 18
Rating Scale Examples.
Understanding and Using Standardized Tests
Julian Williams & Maria Pampaka The University of Manchester
Measurement Concepts and scale evaluation
University of Warwick, Department of Sociology, 2014/15 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Index Construction (Week 13)
MEASUREMENT AND QUESTIONNAIRE CONSTRUCTION:
Investigations into Comparability for the PARCC Assessments
Presentation transcript:

A controversy in PISA and other large-scale assessments: the trade-off between model fit, invariance and validity. David Andrich. CEM: 30 years of Evidence in Education, London, 23 September 2014

Programme for International Student Assessment (PISA)
Many uses and misuses:
- e.g. some reject the program
- e.g. some reject the methodology
One methodological attack is considered here.

General assessment plan in PISA
- To cover the curriculum, multiple linked booklets (16) are used in each country
- Students do different booklets
- All countries receive the same booklets
- The different booklets are placed on the same scale
- A probabilistic model is used for this purpose
- The model's estimates are then used to compare countries
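The slide does not write out the probabilistic model. At the time of this talk PISA's scaling was Rasch-based, and Andrich works in the Rasch tradition, so the dichotomous Rasch model is the natural reading; a minimal statement follows, with notation assumed for this transcript rather than taken from the slides:

```latex
% Dichotomous Rasch model: the probability that person n, of proficiency
% \beta_n, answers item i, of difficulty \delta_i, correctly.
\Pr\{X_{ni} = 1\} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}
```

Because \delta_i carries no country subscript, the model asserts invariance: the same item parameters place every booklet and every country on one scale, which is exactly the property the next slide questions.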

A methodological attack: DIF
- For valid comparisons, items should work invariantly among countries (same relative difficulty in all countries)
- Items that are not invariant are said to show differential item functioning (DIF)
- If there is DIF, what can be done about it?
- If there is DIF, can comparisons be made valid?
- It depends!
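To make the invariance check concrete, here is a deliberately crude sketch in Python. Every name and threshold in it is an assumption for illustration; a real analysis would calibrate the data with a proper Rasch program and apply formal tests of fit rather than this raw-proportion shortcut.

```python
import numpy as np

def relative_difficulty(responses):
    """Centred logit difficulty per item from a 0/1 response matrix
    (persons x items). Only relative difficulty is comparable, because
    the origin of a Rasch scale is arbitrary."""
    p = responses.mean(axis=0).clip(0.01, 0.99)  # proportion correct per item
    d = np.log((1.0 - p) / p)                    # higher = harder
    return d - d.mean()

def flag_dif(resp_a, resp_b, threshold=0.5):
    """Flag items whose relative difficulty differs between two countries
    by more than an arbitrary threshold (0.5 logits here)."""
    gap = relative_difficulty(resp_a) - relative_difficulty(resp_b)
    return np.where(np.abs(gap) > threshold)[0]
```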

The presentation
1. Distinguish between causal and index variables
2. Imagine the assessment of physics in multiple domains
3. Set up an idealised assessment design in three countries
4. Illustrate the model used and the concepts of (a) fit to the model and (b) DIF
5. Show the tension between model fit and validity

Causal and Index Variables
Stenner, A. J., et al. (2008). Formative and reflective models: Can a Rasch analysis tell the difference? Rasch Measurement Transactions, 22, 1059–1060.

Causal and index variables
Causal example: heat, as indicated by thermometers. A change in heat causes a change on the thermometer.
(i) The same changes appear on all thermometers
(ii) Thermometers are exchangeable
Index example: indicators of SES, such as education, occupational prestige, income, and neighbourhood.
(i) A change in one indicator does not change the other indicators
(ii) Indicators are not exchangeable

Science proficiency in light
- Assessing understanding of light (a relatively thin variable)
- Test items related to the curriculum on light
- Causal variable: understanding of light governs performance on all items of the test
- Items are in principle exchangeable (avoiding the effect of teaching to the test)

Assess a broad physics construct

Students from three countries
- A simulation of an idealisation of the PISA controversy
- All countries of equal proficiency
- Item difficulties similar across the 5 domains, 8 items each (40 items in all)
- All items administered to all countries
- Some DIF introduced by domain
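The slides do not give the simulation's parameters, so the sketch below only reproduces the stated design: three equally proficient countries, five domains of eight items, and DIF injected into two domains (matching the Sound and Electricity and Magnetism slides that follow). The sample size and DIF magnitude are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

n_per_country = 500                      # assumed, not from the talk
n_countries, n_domains, items_per_domain = 3, 5, 8
n_items = n_domains * items_per_domain   # 40 items

# Common item difficulties, similar across domains.
delta = rng.normal(0.0, 1.0, n_items)

# Inject DIF: Sound items (17-24) easier for country 1,
# Electricity and Magnetism items (25-32) easier for country 3.
dif = np.zeros((n_countries, n_items))
dif[0, 16:24] = -0.5                     # magnitude assumed
dif[2, 24:32] = -0.5

# Equal proficiency distributions in all countries.
responses_by_country = []
for c in range(n_countries):
    theta = rng.normal(0.0, 1.0, (n_per_country, 1))
    p = 1.0 / (1.0 + np.exp(-(theta - (delta + dif[c]))))  # Rasch probabilities
    responses_by_country.append((rng.uniform(size=p.shape) < p).astype(int))
```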

Model and Fit of Item 21 (Sound)

Model and DIF, items 17–24 (Sound): C1 > C2, C3
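In the notation of the Rasch sketch above, the DIF shown on this slide means the difficulty of the Sound items is not invariant but depends on the country c:

```latex
% DIF: item difficulty acquires a country subscript, violating invariance.
\Pr\{X_{nic} = 1\} = \frac{\exp(\beta_n - \delta_{ic})}{1 + \exp(\beta_n - \delta_{ic})},
\qquad \delta_{ic} \neq \delta_{ic'} \ \text{for some countries } c \neq c'
```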

Model and Fit of Item 29 (Electricity and Magnetism)

Model and DIF, items 25–32 (Electricity and Magnetism): C3 > C1, C2

Resolve items by country: Sound

Split items by country: Electricity and Magnetism
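"Resolving" or splitting an item by country replaces it with country-specific copies, so each country receives its own difficulty estimate for that item. The function below is a hypothetical sketch of the data-side manipulation only (the names and the NaN coding are assumptions, not from the talk); the subsequent calibration would still be done by a Rasch program.

```python
import numpy as np

def split_items_by_country(responses_by_country, items_to_split):
    """Resolve DIF items: each country keeps its own copy of the split
    items in a dedicated block of columns, while the other countries'
    copies are treated as missing data (NaN)."""
    n_countries = len(responses_by_country)
    k = len(items_to_split)
    out = []
    for c, resp in enumerate(responses_by_country):
        n_persons, n_items = resp.shape
        wide = np.full((n_persons, n_items + n_countries * k), np.nan)
        wide[:, :n_items] = resp
        wide[:, items_to_split] = np.nan  # empty the shared columns
        start = n_items + c * k           # this country's own block
        wide[:, start:start + k] = resp[:, items_to_split]
        out.append(wide)
    return out
```

Once items are split, they no longer carry information about between-country differences; only the unsplit items do. This is why the summary slide below can say that splitting on a whole domain is equivalent to deleting it from the comparison.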

Summary of Means

Summary: DIF and interpretation
- Splitting on a whole domain is equivalent to deleting it from the comparison
- Which interpretation is most valid? It depends on the source of the DIF: artefact or substantive
- That question cannot be answered statistically alone
- Understanding the DIF has implications for the test and the curriculum

Thank You