1 Conceptual Issues in Observed-Score Equating Wim J. van der Linden CTB/McGraw-Hill.

Presentation transcript:

1 Conceptual Issues in Observed-Score Equating Wim J. van der Linden CTB/McGraw-Hill

2 Outline Review of Lord (1980) Local equating Few examples Discussion

3 Review of Lord (1980) Notation –X: old test form with observed score X –Y: new test form Y with observed score Y –θ: common ability measured by X and Y –x=φ(y): equating transformation

4 Review of Lord (1980) Cont’d Case 1: Infallible measures –X and Y order any population identically –Equivalence of ranks establishes equating transformation

5 Review of Lord (1980) Cont’d Case 1: Infallible measures Cont’d –Q-Q curve –Issues related to discreteness, strict monotonicity, and sampling error will be ignored –Equating is population invariant –Equating error always equal to zero

6 Review of Lord (1980) Cont’d Case 2: Fallible measures –For each test taker, observed scores are random variables –Realizations of X and Y do not order populations of test takers identically –Criterion of equity of equating for all θ

7 Review of Lord (1980) Cont’d Case 2: Fallible measures Cont’d –Lord’s theorem: Under realistic conditions, scores X and Y on two tests cannot be equated unless either (1) both scores are perfectly reliable or (2) the two tests are strictly parallel [in which case φ(y)=y]

8 Review of Lord (1980) Cont’d Case 2: Fallible measures Cont’d –Equating no longer population invariant

9 Review of Lord (1980) Cont’d Two approximate methods –IRT true-score equating –Use ξ=ξ(η) to equate Y to X
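A minimal sketch of IRT true-score equating, assuming a 2PL response model with known item parameters (the slides do not name the model). Here `eta` is a true score on form Y and the returned value plays the role of ξ on form X; the function names, the bisection search, and the θ bounds are illustrative choices, not part of the presentation.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response per item (assumed model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def true_score(theta, a, b):
    """True number-correct score at theta: the sum of item probabilities."""
    return float(np.sum(p_correct(theta, a, b)))

def true_score_equate(eta, a_y, b_y, a_x, b_x, lo=-8.0, hi=8.0, tol=1e-10):
    """Find theta whose true score on Y is eta, then return the true score on X.

    Uses bisection, which relies on the true score increasing in theta
    (all discrimination parameters positive).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, a_y, b_y) < eta:
            lo = mid
        else:
            hi = mid
    return true_score(0.5 * (lo + hi), a_x, b_x)
```

With identical item parameters on both forms the function returns `eta` itself; a harder form X gives a lower equated true score.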

10 Review of Lord (1980) Cont’d Two approximate methods Cont’d –IRT observed-score equating, for a sample of test takers a=1,…,N

11 Review of Lord (1980) Cont’d Lord’s forgotten question: What is really needed is a criterion for evaluating such approximate procedures, so as to be able to choose from among them. If you can’t be fair (provide equity) to everyone, what is the next best thing? (p.207)

12 Local Equating New definition of equating error Equity=no equating error! Setting e 2 (y) equal to zero and solving for φ(y) gives

13 Local Equating Cont’d Because of monotonicity of x=φ(y), the result is the family of error-free (or true) equating transformations Lord’s theorem is based on implicit assumption of a single transformation
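For orientation, the family of transformations is commonly written in the local-equating literature in the form below. This is a sketch under assumed notation, not a reproduction of the slide: F_{X|θ} and G_{Y|θ} denote the conditional distribution functions of X and Y given θ.

```latex
% Equity: the equated score has the same conditional distribution as X
F_{\varphi(Y)\mid\theta}(x) = F_{X\mid\theta}(x) \quad \text{for all } \theta
% Solving for the transformation gives one member per ability level
\varphi^{*}(y;\theta) = F_{X\mid\theta}^{-1}\bigl(G_{Y\mid\theta}(y)\bigr)
```

Each θ has its own member of the family, which is why no single transformation can satisfy equity for all test takers.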

14 Local Equating Cont’d Theorem: For a population of test takers P for which X and Y measure the same θ, equating with the family of transformations φ*(y;θ) has the following properties: (i) equity for each p ∈ P (ii) symmetry in X and Y for each p ∈ P (iii) population invariance within P

15 Local Equating Cont’d Theorem defines population P –No sampling of test takers required –Includes future test takers Alternative definition of equating error:

16 Local Equating Cont’d Definition of bias, MSE, etc., in equating now straightforward Lord’s criterion for finding the “next best thing”

17 Local Equating Cont’d Alternative motivations of local equating –Thought experiment –History of standard error of measurement –Comparison with true-score equating IRT observed-score equating –Same score but different equated scores?

18 Local Equating Cont’d Alternative motivations Cont’d –One measurement instrument but different transformations?

19 Few Examples It may seem as if local equating replaces Lord’s set of impossible conditions for equating (perfect reliability; parallel tests) by another impossible condition (known ability). However, post hoc improvement of reliability or parallelness is impossible, but we can always approximate an unknown ability.

20 Few Examples Cont’d Possible approximations –Estimating ability –Anchor scores as a proxy of ability –Y=y as a proxy of ability –Proxies based on collateral information

21 Discussion Criterion of equity involves a different equating transformation for each ability level. Traditional equating uses a “one-size-fits-all” transformation, which compromises between the transformations for the different ability levels. As a result, the equating is always (i) biased and (ii) population dependent.

22 Discussion Cont’d Lord’s theorem on the impossibility or unnecessity of observed-score equating was too pessimistic because it assumed the use of a single equating transformation for a population of test takers

23 Equipercentile Method [Figure. Curves: F(x) for Test X and G(y) for Test Y. Axes: Test Score, Cumulative Probability. Label: p]
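A minimal sketch of the equipercentile idea, x = F⁻¹(G(y)), using empirical distributions from two score samples. The function name and the use of interpolated sample quantiles are assumptions made for illustration; discreteness and sampling error are ignored here, and operational methods add smoothing and continuization.

```python
import numpy as np

def equipercentile(y, scores_y, scores_x):
    """Map score y on form Y to the form-X scale via x = F^{-1}(G(y))."""
    scores_x = np.sort(np.asarray(scores_x, dtype=float))
    scores_y = np.asarray(scores_y, dtype=float)
    # G(y): proportion of form-Y scores at or below y
    p = np.mean(scores_y <= y)
    # F^{-1}(p): form-X quantile at the same cumulative probability
    return float(np.quantile(scores_x, p))
```

If form X scores are simply form Y scores shifted by a constant, the function approximately recovers that shift.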

24 Thought Experiment [Figure. Panel: Test Y. Labels: y, p]

25 Thought Experiment Cont’d [Figure. Panels: Test Y, Test X. Labels: y, x, p]

26 Thought Experiment Cont’d [Figure. Panels: Test Y, Test X, Transformation Y → X. Labels: y, x, x=φ(y), p]

27 Thought Experiment Cont’d [Figure. Panels: Test Y, Test X, Transformation Y → X. Labels: y, x, x=φ(y), p, q]

28 Thought Experiment Cont’d [Figure. Panels: Test Y, Test X, Transformation Y → X. Labels: y, x, x=φ(y), p, q]

29 Thought Experiment Cont’d [Figure. Panels: Test Y, Test X, Transformations Y → X. Labels: y, x, x=φ(y), p, q]

30 Thought Experiment Cont’d [Figure. Panels: Test Y (Population 1), Test X (Population 2), Transformation Y → X. Labels: y, x, x=φ(y)]

31 Thought Experiment Cont’d [Figure. Panels: Transformation Y → X, Transformations Y → X. Labels: y, x=φ(y), p, q]

32 Standard Error of Measurement Classical test theory involves one SEM for an entire population of test takers Stronger models condition on ability measured; e.g., IRT
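The contrast can be sketched numerically. The classical SEM below uses the standard formula σ_X·√(1 − ρ); the conditional SEM assumes a 2PL model with local independence, so the number-correct score at a given θ has variance Σ P_i(1 − P_i). The function names and the choice of the 2PL are illustrative assumptions.

```python
import numpy as np

def classical_sem(sd_x, reliability):
    """One SEM for the whole population: sd_x * sqrt(1 - reliability)."""
    return sd_x * np.sqrt(1.0 - reliability)

def conditional_sem(theta, a, b):
    """SEM of the number-correct score at a given theta under a 2PL model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Under local independence the item variances p(1 - p) add up
    return float(np.sqrt(np.sum(p * (1.0 - p))))
```

The conditional SEM varies with θ, while the classical SEM is a single number for everyone.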

33 True-Score Equating True-score equating is a degenerate case of local equating

34 Different Equated Scores? Why should two test takers, p and q, with the same score of 23 out of 30 items correct on a new test form need different equated scores on the same old form? –Would this not even be unfair? –Fallible scores

35 Different Equated Scores? Cont’d [Figure. Panels: Observed-score distribution of p, Observed-score distribution of q]

36 Different Transformations? Example of measuring tape. Number-correct scores are counts of responses, not fundamental measures. Responses have person and item effects –Test equating requires “some type of control for differential examinee ability” (von Davier, Holland & Thayer, 2004, p. 2)

37 Different Transformations? Cont’d An effective way to disentangle item and person effects is through IRT modeling Observed-score equating is an attempt to do the same through a transformation of total scores –Only possible way is (i) to first condition on the abilities and (ii) then transform the score to adjust for the item effects

38 Estimating Ability Assumption: fitting response model Calculate family of true equating transformations (Lord-Wingersky’s recursive procedure) Use member of family at point estimate of θ Bias study for 40-item subtests of LSAT Application in adaptive testing
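The steps on this slide can be sketched as follows: compute the conditional number-correct distributions of both forms at a given θ with the Lord-Wingersky recursion, then equate within that θ. The 2PL model, the function names, and the crude handling of discreteness (taking the smallest x whose conditional cdf reaches that of y) are assumptions for illustration only.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response per item (assumed model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def lord_wingersky(p):
    """Distribution of the number-correct score given item probabilities p."""
    dist = np.array([1.0])  # before any item: score 0 with probability 1
    for p_i in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - p_i)  # item incorrect: score unchanged
        new[1:] += dist * p_i           # item correct: score goes up by 1
        dist = new
    return dist

def local_equate(y, theta, a_y, b_y, a_x, b_x):
    """Equate integer score y to form X using the distributions at theta."""
    g = np.cumsum(lord_wingersky(p_correct(theta, a_y, b_y)))  # cdf of Y | theta
    f = np.cumsum(lord_wingersky(p_correct(theta, a_x, b_x)))  # cdf of X | theta
    # Smallest x whose conditional cdf reaches that of y (tolerance for rounding)
    return int(np.searchsorted(f, g[y] - 1e-12))
```

In practice θ would be replaced by a point estimate for the test taker, which is the approximation the slide refers to.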

39 Bias Study [Figure. Axis: Bias. Panels: Traditional Equating, Local Equating at]

40 Family of True Transformations for LSAT Subtest [Figure. Curves from θ = −2.0 to θ = 2.0. Axes: x, y]

41 Anchor Score as Proxy Current methods –Chain equating –Poststratification equating –Linear equating methods: Tucker, Levine, Braun-Holland, linear chain equating Use conditional distributions of X and Y given anchor score A=a
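A minimal sketch of using the anchor score as a proxy: restrict both samples to test takers with anchor score A = a and apply equipercentile equating within that stratum. The function and argument names are hypothetical, and empirical conditional distributions without smoothing are an assumption; sparse strata would need smoothing or pooling in practice.

```python
import numpy as np

def local_equate_on_anchor(y, anchor, y_scores, anchor_y, x_scores, anchor_x):
    """Equipercentile equating of y to X among test takers with A = anchor."""
    y_a = np.asarray(y_scores, dtype=float)[np.asarray(anchor_y) == anchor]
    x_a = np.asarray(x_scores, dtype=float)[np.asarray(anchor_x) == anchor]
    if len(y_a) == 0 or len(x_a) == 0:
        raise ValueError("no test takers with this anchor score")
    p = np.mean(y_a <= y)              # conditional cdf of Y given A = anchor
    return float(np.quantile(x_a, p))  # matching conditional quantile of X
```

Each anchor score gets its own transformation, so two test takers with the same y but different anchor scores can receive different equated scores.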

42 Anchor Score as Proxy Cont’d Empirical bias study for same LSAT subtests

43 Bias Study: Anchor-Test Design [Figure. Panels: Chain Equating, Poststratification Equating, Local Equating]

44 Y=y as Proxy of Ability Single-group design –Estimate distributions of X given Y=y directly from bivariate distribution of X and Y –Model-based estimate of Y given y

45 Y=y as Proxy of Ability Linear local equating Because μ_{Y|y} = y (classical test theory),
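A sketch of the simplification this slide points to, assuming the usual linear equating form applied to the conditional moments given Y = y. The general form in the first line is an assumption and is not taken from the slide.

```latex
% Linear equating on the conditional moments given Y = y (assumed general form)
\varphi(y) = \mu_{X\mid y} + \frac{\sigma_{X\mid y}}{\sigma_{Y\mid y}}\,\bigl(y - \mu_{Y\mid y}\bigr)
% Substituting \mu_{Y|y} = y removes the second term
\varphi(y) = \mu_{X\mid y}
```

Under this assumption the equated score is the conditional mean of X given Y = y.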

46 Collateral Information Any variables correlating substantially with θ –Earlier tests –Battery of subtests –Response times Alternative sources give different equatings; just find the “next best thing”