The Accuracy of Small Sample Equating: An Investigative/ Comparative study of small sample Equating Methods. Kinge Mbella Liz Burton Rob Keller Nambury.

Presentation transcript:

The Accuracy of Small Sample Equating: An Investigative/Comparative Study of Small Sample Equating Methods. Kinge Mbella, Liz Burton, Rob Keller. Nambury Raju Psychometric Internship, Measured Progress, July 24, 2009

Presentation Outline. Introduction: Small Sample Equating Methodology; Background to Study; Research Hypothesis. Small Sample Equating: Identity Equating; Chained Linear; Synthetic Linking Function; Chained Log-linear Pre-smoothing; Circle-arc. Methodology: Research Design; Procedure. Results. Discussion and Conclusion.

Introduction: The primary motivation comes from the 2007 paper by Livingston and Kim, “Small Sample Equating by the Circle-arc Method.” Empirical research findings confirm that this method produces smaller random and systematic errors when equating with samples smaller than 50 per form (Darby & Mbella, NCME 2009). Technological innovation is increasing the flexibility of test administration and reporting. Most tests have multiple forms taken by smaller samples of students at different test dates. The need to provide accurate equated scores in a timely manner is pressing. Practical circumstances in most certification programs dictate the use of small samples.

Research Objectives: This research used empirical data to compare the random and systematic errors associated with small sample equating methods. The ultimate goal is to provide practitioners with objective and valid results to effectively examine the small sample equating dilemma. It is my intention that these results will provide scientific and logical facts showing that “Yes, we may be able to equate accurately with smaller samples.” But before I go into all that, let's take a brief look at what has been happening until now.

Background Into Equating: Mislevy (1992): “Test construction and equating are inseparable, when they are applied in concert, equated scores from parallel test forms provide virtually exchangeable evidence about students’ behavior on some domain…” Kolen and Brennan (2004, p. 269). Angoff and Kolen: anchor items should be about 20% of test length 45. Equating is not a post hoc analysis.

Research and Equating Jargon. Form X: the test form administered to the 2007/08 examinees (New Form). Form Y: the test form administered to the 2006/07 examinees (Old Form). Population: scored responses for all students on a test form for a particular year. Small samples: the sample sizes selected for this research are 22, 35, 44, and 70. Example: SE_22_22. Experimental test forms Y and X: test forms assembled from an operational test form and response matrix. CING: the common item non-equivalent group design. Criterion estimate: the equipercentile equating results of the Form X observed scores equated onto the Form Y observed scale scores for that particular grade level and subject area.

Equating Methods. Linear methods: Identity Equating, Chained Linear, Synthetic Linking Function. Non-linear methods: Chained Log-linear, Circle-arc, Equipercentile.

Chained Linear Function. µ: sample mean. σ: sample standard deviation. yv = anchor on the Old Form (Y). xv = anchor on the New Form (X).
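
The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below shows the standard textbook form of chained linear equating using the symbols defined above: a Form X score is first placed on the anchor scale using the new-form group's statistics, and the anchor score is then placed on the Form Y scale using the old-form group's statistics. The function and parameter names are hypothetical, and this is not necessarily the exact expression the presenters displayed.

```python
def chained_linear(x, mu_x, sd_x, mu_xv, sd_xv, mu_yv, sd_yv, mu_y, sd_y):
    """Chained linear equating of a Form X score onto the Form Y scale.

    mu_x, sd_x   : mean and SD of total scores, new-form group (Form X)
    mu_xv, sd_xv : mean and SD of anchor scores, new-form group (xv)
    mu_yv, sd_yv : mean and SD of anchor scores, old-form group (yv)
    mu_y, sd_y   : mean and SD of total scores, old-form group (Form Y)
    """
    # Step 1: link X to the anchor scale within the new-form group.
    v = mu_xv + (sd_xv / sd_x) * (x - mu_x)
    # Step 2: link the anchor score to Y within the old-form group.
    return mu_y + (sd_y / sd_yv) * (v - mu_yv)


# Illustrative numbers only (not taken from the study).
equated = chained_linear(25, mu_x=20, sd_x=5, mu_xv=8, sd_xv=2,
                         mu_yv=9, sd_yv=2, mu_y=22, sd_y=5)
```

When both groups have identical statistics on the total and anchor scores, the function reduces to the identity, which is a quick sanity check on any implementation.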

Identity Function: The identity equating function is the technical term for doing no equating: the equated score equals the observed score.

Synthetic Linking Function: The synthetic linking function is a weighted average of an equating function (in this case Chained Linear) and the identity function. Here the weight is W = 0.5.
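
A minimal sketch of the weighted average described above. It assumes that the weight w is applied to the equating function and (1 - w) to the identity function; with w = 0.5, as on the slide, the direction of the weighting makes no difference. The names are hypothetical.

```python
def synthetic_link(x, equate, w=0.5):
    """Weighted average of an equating function and the identity function.

    equate : any callable mapping a Form X score to the Form Y scale
             (chained linear in the study)
    w      : weight given to the equating function (assumed convention)
    """
    return w * equate(x) + (1.0 - w) * x


# Illustration with a made-up equating function that adds 4 points.
halfway = synthetic_link(10, lambda score: score + 4)  # 0.5 * 14 + 0.5 * 10
```

Setting w = 0 reproduces identity equating and w = 1 reproduces the equating function unchanged, so the synthetic function sits between the two.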

Chained Log-linear: Using an adaptation of the log-linear function developed by Rosenbaum and Thayer (1987), the first two univariate moments of the observed score distribution are pre-smoothed before equating. In chained equipercentile equating, the linking is done through the common items. The percentile rank of a score on the common items for Form X is linked to the equivalent percentile on the Form Y common-item scale. The corresponding Form Y score at that percentile is then the chained equipercentile equivalent of that particular Form X observed score.
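
The chaining through the common items can be sketched as follows. This is a simplified illustration, not the study's implementation: it uses mid-percentile ranks at integer scores with linear interpolation in between, it omits the log-linear pre-smoothing step entirely, and every function name is hypothetical.

```python
from bisect import bisect_left


def percentile_ranks(scores, max_score):
    """Mid-percentile rank (0 to 100) at each integer score 0..max_score."""
    n = len(scores)
    counts = [0] * (max_score + 1)
    for s in scores:
        counts[s] += 1
    ranks, below = [], 0
    for c in counts:
        ranks.append(100.0 * (below + 0.5 * c) / n)
        below += c
    return ranks


def interp(x, xs, ys):
    """Piecewise-linear interpolation over nondecreasing xs, clamped at the ends."""
    if x <= xs[0]:
        return float(ys[0])
    if x >= xs[-1]:
        return float(ys[-1])
    i = bisect_left(xs, x)  # guarantees xs[i - 1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    return ys[i - 1] + (ys[i] - ys[i - 1]) * (x - x0) / (x1 - x0)


def equipercentile(x, from_scores, from_max, to_scores, to_max):
    """Map x to the score with the same percentile rank in the target distribution."""
    pr_from = percentile_ranks(from_scores, from_max)
    pr_to = percentile_ranks(to_scores, to_max)
    p = interp(x, list(range(from_max + 1)), pr_from)
    return interp(p, pr_to, list(range(to_max + 1)))


def chained_equipercentile(x, x_total, x_anchor, y_anchor, y_total,
                           max_total, max_anchor):
    """Form X score -> anchor scale (new-form group) -> Form Y scale (old-form group)."""
    v = equipercentile(x, x_total, max_total, x_anchor, max_anchor)
    return equipercentile(v, y_anchor, max_anchor, y_total, max_total)
```

With samples as small as those in this study, many score points have no examinees, which is one reason pre-smoothing is applied before the equipercentile step.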

Circle-arc: In 2007, Livingston and Kim proposed an innovative method with the potential to considerably reduce the sampling error of equating in small samples while introducing very little systematic error. Their rationale is that the relationship between test forms is always curvilinear when the forms differ in difficulty. Empirical research has shown that the circle-arc method is the most accurate method for modeling the equipercentile relationship in small samples.

Circle-arc: Circle-arc is a very simple model that relies entirely on the characteristics of the observed scores. Its main properties are: (1) The minimum and maximum possible observed scores are fixed for both test forms. (2) A middle point is empirically determined by carrying out any of the linear equating transformations appropriate to the data collection method. (3) A combination of mathematical formulae that forces an arc of a circle to pass through these three points produces the circle-arc equating function.
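
The three-point construction above can be sketched as follows. This is a minimal illustration under stated assumptions: it fits the circle directly through the three points using the standard circumcenter formulas, falls back to a straight line when the points are collinear, and uses hypothetical names and made-up numbers. Livingston and Kim describe more than one variant of the method, and the slides do not say which one the study used.

```python
import math


def circle_arc(low, mid, high):
    """Return f(x), the arc of the circle through three (x, y) points.

    low, high : fixed end points (minimum and maximum possible scores)
    mid       : empirically determined middle point
    """
    (x1, y1), (x2, y2), (x3, y3) = low, mid, high
    slope = (y3 - y1) / (x3 - x1)
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        # Collinear points: the "arc" degenerates to the line through them.
        return lambda x: y1 + slope * (x - x1)

    # Circumcenter (xc, yc) and squared radius of the circle through the points.
    s1, s2, s3 = x1**2 + y1**2, x2**2 + y2**2, x3**2 + y3**2
    xc = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    yc = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    r2 = (x1 - xc) ** 2 + (y1 - yc) ** 2

    # Use the upper half of the circle if the middle point lies above the
    # line joining the end points, otherwise the lower half.
    sign = 1.0 if y2 > y1 + slope * (x2 - x1) else -1.0

    def equate(x):
        return yc + sign * math.sqrt(max(r2 - (x - xc) ** 2, 0.0))

    return equate


# Made-up example: a 40-point test whose middle point (20, 22) came from a
# linear equating at the mean of Form X.
f = circle_arc((0, 0), (20, 22), (40, 40))
```

Only the middle point is estimated from data, which is why the method has so little sampling variability in small samples.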

The Circle-Arc Method

Empirical Equating Curves

Research Questions: (1) How similar are the various small sample equating methods in terms of equating errors? (2) How do differences in test form difficulty affect the accuracy and consistency of the various equating methods? (3) What is the minimum sample size at which the standard error of equating becomes unacceptable?

Research Methodology: Using real examinees’ responses on standardized Math and Reading tests, two experimental test forms were created for each subject area and grade level. The Common Item Non-Equivalent Group (CING) design with an internal anchor was used as the basis for collecting data for equating purposes.

Data Specification

Descriptive Statistics for Reading Grade 7

Procedure. Large Sample Equating: An equipercentile equating was done on the full population of Form Y and X for each subject and grade level. The unsmoothed equipercentile conversion was used as the base equating for comparison. Small Sample Equating: Using a bootstrap sampling method without replacement, small samples were drawn from each population and concurrently equated using all 5 equating methods. The sampling and equating were repeated 250 times, and the average equated score at each score point by method was used as the estimated equated score of Form X on the Form Y observed scale.
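
The small sample procedure can be sketched as a resampling loop. This is an illustration under stated assumptions, not the study's code: each examinee record is assumed to be a (total score, anchor score) pair, the `equate` argument is a hypothetical callable that takes the two samples and returns an equating function, and the `mean_equate` placeholder is a toy stand-in, not one of the five methods studied.

```python
import random
from statistics import fmean


def resample_equating(pop_x, pop_y, n_x, n_y, equate, score_points,
                      reps=250, seed=0):
    """Draw small samples without replacement, equate, and average over reps."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        sample_x = rng.sample(pop_x, n_x)  # sampling without replacement
        sample_y = rng.sample(pop_y, n_y)
        f = equate(sample_x, sample_y)
        results.append([f(x) for x in score_points])
    # Average equated score at each score point across replications.
    mean_equated = [fmean(col) for col in zip(*results)]
    return results, mean_equated


def mean_equate(sample_x, sample_y):
    """Toy placeholder: shift Form X scores by the difference in total-score means."""
    shift = fmean(r[0] for r in sample_y) - fmean(r[0] for r in sample_x)
    return lambda x: x + shift
```

A call such as `resample_equating(pop_x, pop_y, 22, 22, mean_equate, range(41))` would correspond to one method under the SE_22_22 condition, assuming that label denotes 22 examinees per form.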

Procedure: Result Analysis. Standard Error (SE): error due to sampling variability. Conditional bias: error due to method effect. Conditional RMSE.
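
The formulas for these indices are not preserved in the transcript. The sketch below uses the definitions commonly applied in resampling studies of equating, which is an assumption about what the slide showed: at each score point, SE is the standard deviation of the equated scores across replications, bias is the mean equated score minus the criterion, and RMSE combines the two. The names are hypothetical.

```python
import math
from statistics import fmean


def error_indices(replications, criterion):
    """Conditional SE, bias, and RMSE at each score point.

    replications : list of replications, each a list of equated scores
                   (one per score point)
    criterion    : criterion equated score at each score point
    Returns a list of (se, bias, rmse) tuples, one per score point.
    """
    out = []
    for col, crit in zip(zip(*replications), criterion):
        mean = fmean(col)
        # Random error: spread of the replications around their own mean.
        se = math.sqrt(fmean([(e - mean) ** 2 for e in col]))
        # Systematic error: distance of the mean from the criterion.
        bias = mean - crit
        rmse = math.sqrt(se ** 2 + bias ** 2)
        out.append((se, bias, rmse))
    return out
```

With SE computed using the number of replications as the divisor, RMSE squared equals SE squared plus bias squared exactly, so the RMSE is the same as the root mean squared deviation of the replications from the criterion.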

Research Design Matrix

Bootstrap Mean Distribution The immediate explanation for this is the descriptive of bootstrap. It is not about superior

Standard Error Results (SE)

Preliminary Results: How similar are the various equating methods in terms of equating error? The following conclusions have been reached based on these preliminary analyses: On average, the Circle-arc method appears to have the smallest random error across the entire scale. The Synthetic linking function has the smallest random error variance for scores between -1 and 1 standard deviation around the mean. For all methods, the general trend is that the overall random error variance tends to decrease as sample size increases.

Bias Summary for selected conditions

RMSE Summary (Reading Grade 7)

RMSE Summary (Math Grade 7)

Exploratory MANOVA

Graphical MANOVA Summary. Reading Grade 7: Preliminary results suggest that the within error variance due to sample variability is not significant. There appears to be a significant mean difference between the various equating methods in terms of the RMSE index. The mean RMSE for Circle-arc appears to be significantly different from the other methods. Math Grade 7: The exploratory MANOVA results from Math Grade 7 lead to a slightly different conclusion. Both the within and between error variances appear not to be significantly different for all methods and sample conditions.

Results Summary: Reading Grade 7

Conclusion: From this first phase of analyses, the Circle-arc method appears to produce, on average, the smallest amount of systematic and random error. However, the interpretation of which method produces the least amount of error depends on where the cut scores are set on the scale. An important recommendation from this study is that if the cut score is set around the mean, then any of these methods will produce similar equating errors proportional to the difference in form difficulty.

Future Directions I would like to look at the effects of differences in test form difficulty on the various methods. I also intend to explore even smaller samples to estimate the minimum sample sizes for each method where equating becomes unrealistic. My ultimate goal is to explore new ways to build test forms to meet predefined statistical and content characteristics in small sample situations.

Questions and Comments: I would like to thank everyone in the Psychometrics Department and Measured Progress for making the whole experience very enjoyable and the actual research as painless as possible. Thank you. Kinge Mbella, Doctoral Student, UNC Greensboro