Examining Rounding Rules in Angoff-Type Standard Setting Methods. Adam E. Wyse, Mark D. Reckase.

Presentation transcript:

Examining Rounding Rules in Angoff-Type Standard Setting Methods. Adam E. Wyse, Mark D. Reckase

Current Projects
Multidimensional Item Response Theory: development of methodology for fine-grained analysis of item response data in high-dimensional spaces, and application of that methodology to understand the constructs assessed by tests.
Test Design and Construction: design of content and statistical specifications for tests using the philosophy of item response theory, and use of computerized test assembly procedures to match test specifications.
Portfolio Assessment: design of portfolio assessment systems, including formal objective scoring of portfolios.
Procedures for Setting Standards: development and evaluation of procedures for setting standards on educational and psychological tests, including extensive work on setting standards for the National Assessment of Educational Progress.
Computerized Adaptive Testing: development of procedures for selecting and administering test items to individuals using computer technology, in particular designing systems to match item selection to the specific requirements of test use.

Angoff Method: each panelist estimates the probability that a minimally competent examinee (MCE) would respond correctly to each item.
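As a minimal illustration of the mechanics (the ratings below are hypothetical, not taken from the study), a single panelist's Angoff probability ratings are summed to obtain that panelist's raw cut score on the total-score scale:

```python
# Minimal sketch: one panelist's Angoff ratings for five dichotomous items.
# Each rating is the judged probability that the MCE answers correctly.
ratings = [0.65, 0.80, 0.45, 0.70, 0.55]   # hypothetical values

# The panelist's raw cut score is the MCE's expected number-correct score.
raw_cut_score = sum(ratings)
print(raw_cut_score)   # about 3.15 expected correct answers out of 5 items
```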

Modified Angoff Method (1): round each rating to a whole number of score points (the Yes/No method), for both dichotomous and polytomous items.

Modified Angoff Method (2): rate the expected MCE score for each cluster of items, rounding either to one decimal place or to an integer.

Modified Angoff Method (3): how to aggregate the raters' judgments, using either the mean or the median (the median limits the influence of outliers).
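A small sketch of this aggregation step with hypothetical panelist cut scores, showing why the median is less affected by a single outlying rater:

```python
import statistics

# Hypothetical raw cut scores from five panelists; the last one is an outlier.
panelist_cuts = [3.1, 3.3, 3.2, 3.0, 5.6]

print(statistics.mean(panelist_cuts))    # 3.64, pulled upward by the outlier
print(statistics.median(panelist_cuts))  # 3.2, unaffected by the outlier
```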

Theoretical Framework (Reckase, 2006): an idealized panelist who perfectly understands the relationship between item difficulty and the cut score theta; ratings rounded to the nearest integer vs. to the nearest 0.05.

Theoretical Framework (Reckase, 2006): the same idealized ratings rounded to one decimal place vs. to two decimal places.

Theoretical Framework: bias can be evaluated for individual panelists' cut scores and for group-level cut scores (mean or median). Other evidence for evaluating standard setting, such as the correlation between item ratings and the p-values implied by panelists' judgments, cannot detect how severe or lenient the panelists are. Such errors can be incorporated into the Reckase evaluation approach.

Theoretical Framework assumptions: only a single round is considered (no training effect) and no rater error is included (an idealized setting); the goal is to investigate the impact of the Angoff modifications and rounding rules in this ideal situation.

Data and Method: NAEP data, 20 raters, last round; each panelist's θ cut score from NAEP was treated as that panelist's intended cut score. Items were modeled with the 2PL, 3PL, and GPCM, with the expected polytomous item score E(X|θ) = 1·P1(θ) + 2·P2(θ) + 3·P3(θ) + 4·P4(θ).
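As a sketch of how the expected item scores behind these models can be computed (the item parameters below and the specific GPCM parameterization are illustrative assumptions, not values from the NAEP data):

```python
import math

def p_3pl(theta, a, b, c=0.0):
    """3PL probability of a correct response; c = 0 gives the 2PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def gpcm_category_probs(theta, a, steps):
    """Category probabilities for a GPCM item with step parameters `steps`.
    Category k uses the cumulative sum of a*(theta - step) through step k."""
    cum, running = [0.0], 0.0
    for step in steps:
        running += a * (theta - step)
        cum.append(running)
    numerators = [math.exp(c) for c in cum]
    denom = sum(numerators)
    return [n / denom for n in numerators]

def gpcm_expected_score(theta, a, steps, scores):
    """E(X | theta) = sum over categories of score_k * P_k(theta), as on the slide."""
    return sum(s * p for s, p in zip(scores, gpcm_category_probs(theta, a, steps)))

# Hypothetical four-category item scored 1..4, matching the slide's formula.
print(gpcm_expected_score(0.0, a=1.0, steps=[-1.0, 0.0, 1.0], scores=[1, 2, 3, 4]))
print(p_3pl(0.0, a=1.2, b=0.5, c=0.2))   # a hypothetical 3PL item at theta = 0
```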

Simulated conditions: rounding rule (nearest integer → 1, nearest 0.05 → 1.25, nearest two decimal places → 1.23); item pool size (180, 107, 109, or 53 items).
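The three rules can be implemented as rounding to a grid; the unrounded rating below (1.2345) is an assumed example, since the slide shows only the rounded results:

```python
def round_to_grid(value, grid):
    """Round `value` to the nearest multiple of `grid` (e.g. grid = 1 or 0.05)."""
    return round(value / grid) * grid

rating = 1.2345                        # hypothetical unrounded Angoff rating
print(round_to_grid(rating, 1))        # 1    -> nearest integer
print(round_to_grid(rating, 0.05))     # 1.25 -> nearest 0.05
print(round(rating, 2))                # 1.23 -> nearest two decimal places
```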

Simulated conditions: individual items vs. clusters of items; cut scores at Basic, Proficient, and Advanced; aggregation by mean vs. median.

Evaluation Criteria: average absolute bias for individual panelists, and bias for the group's intended cut score computed with the mean and with the median.
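The slide's formulas are not reproduced in the transcript, so the sketch below shows one plausible reading of these criteria, with hypothetical theta values: absolute bias averaged over panelists, and group-level bias as the difference between the aggregated estimated and intended cut scores.

```python
import statistics

def average_absolute_bias(estimated, intended):
    """Mean of |estimated - intended| cut score across panelists."""
    return statistics.mean(abs(e - i) for e, i in zip(estimated, intended))

def group_bias(estimated, intended, center=statistics.mean):
    """Aggregated estimated cut minus aggregated intended cut (mean or median)."""
    return center(estimated) - center(intended)

# Hypothetical theta cut scores for five panelists.
intended = [-0.20, 0.10, 0.00, 0.35, -0.05]
estimated = [-0.15, 0.20, 0.05, 0.30, 0.00]

print(average_absolute_bias(estimated, intended))
print(group_bias(estimated, intended, statistics.mean))
print(group_bias(estimated, intended, statistics.median))
```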

Results – individual panelists: bias was greatest when rounding to the integer, smaller when rounding to the nearest 0.05, and smallest when rounding to two decimal places.

Results – individual panelists: by cut-score location, bias was greatest for Advanced, then Basic, then Proficient.

Results – individual panelists: rating individual items produced more bias than rating at the cluster level, where fewer rounding errors accumulate.

Results – individual panelists: the 53-item pool showed greater bias than the other pools.

Results – individual panelists: however, for the Proficient cut score with integer rounding, the 53-item pool showed less bias than the 180-item pool, underscoring the importance of where the cut score falls relative to the item difficulty distribution.

Results – group level: in some cases the mean performed better, in other cases the median.

Results – group level: the Basic cut score showed negative bias while Proficient and Advanced showed positive bias; at the cluster level, Proficient showed negative bias.

Results – group level: the Advanced cut score produced the greatest bias of the three levels, and the bias did not cancel out across a group of panelists.

Results – group level: both the mean and median bias were below 0.01 when rounding to the nearest 0.05 or to two decimal places; again, more test items did not necessarily produce less bias.

Results – group level: rating at the cluster level produced less bias than rating individual items.

Impact on Percent Above Cut-score (PAC): the PAC was found for the closest value on the NAEP scale in the pilot study, and the impact was computed as the PAC for the estimated θ minus the PAC for the intended θ. Rounding to the nearest 0.05 or the nearest 0.01 did not change the PAC; the overall impact was minimal.
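A minimal sketch of this impact measure with hypothetical examinee thetas and cut scores (the NAEP theta distribution is not reproduced here): the impact of rounding is the PAC at the rounding-induced cut minus the PAC at the intended cut.

```python
def percent_above_cut(thetas, cut):
    """Percent of examinees at or above the cut score."""
    return 100.0 * sum(t >= cut for t in thetas) / len(thetas)

# Hypothetical examinee thetas and cut scores.
examinee_thetas = [-1.2, -0.9, -0.4, 0.0, 0.1, 0.3, 0.6, 0.8, 1.5, 2.1]
intended_cut = 0.12
rounded_cut = 0.0      # e.g. the cut implied by integer-rounded ratings

impact = (percent_above_cut(examinee_thetas, rounded_cut)
          - percent_above_cut(examinee_thetas, intended_cut))
print(impact)          # positive here: the rounded cut passes more examinees
```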

Impact on Percent Above Cut-score (PAC): Basic: 5.610 ~ …; Proficient: … ~ …; Advanced: … ~ -1.262.

Impact on Percent Above Cut-score (PAC): Basic: 4.490 ~ …; Proficient: … ~ …; Advanced: … ~ -1.343.

Impact on Percent Above Cut-score (PAC): bias was greater for Advanced than for Basic and Proficient, yet the PAC impact was smaller for Advanced, because more students are located near the Basic and Proficient cut scores.

Impact on Percent Above Cut-score (PAC): rounding to the integer does not present a viable alternative for the Angoff method.

Discussion: rounding to the integer can affect the cut scores; rating at the cluster level mitigates the bias, but some bias still remains. Using more test items will not necessarily produce less bias; what matters is the location of the items relative to the intended cut score.

Discussion: 10 items with difficulties in [-2, +2] and a cut score of θ = 0; 5 items round to a score of 1 and 5 round to 0, so the cut total score is 5 → θ = 0 and bias = 0.

Discussion: 20 items with difficulties in [-1, +3] and a cut score of θ = 0; 5 items round to 1 and 15 round to 0, so the cut total score is again 5, but this total no longer maps back to θ = 0, so the bias is not zero (θ = …, bias = …).
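A sketch that reproduces the logic of these two examples under an assumed Rasch model with hypothetical, evenly spread item difficulties; the slides do not give the exact parameters or the resulting bias values, so the code only illustrates the pattern (a symmetric pool yields bias near zero, a shifted pool does not):

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response for an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def theta_for_total(target, difficulties, lo=-6.0, hi=6.0):
    """Bisection for the theta whose expected total score equals `target`."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(p_correct(mid, b) for b in difficulties) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rounding_bias(difficulties, cut_theta=0.0):
    """Bias when each MCE probability is rounded to 0 or 1 before summing."""
    rounded_total = sum(round(p_correct(cut_theta, b)) for b in difficulties)
    return theta_for_total(rounded_total, difficulties) - cut_theta

# Example 1: 10 items spread symmetrically on [-2, +2]; rounding errors cancel.
symmetric = [-2.0, -1.5, -1.0, -0.5, -0.25, 0.25, 0.5, 1.0, 1.5, 2.0]
# Example 2: 20 items spread on [-1, +3]; only 5 items round to 1, so the
# rounded total of 5 maps back to a theta below the intended cut of 0.
shifted = [-1.0 + 4.0 * k / 19 for k in range(20)]

print(rounding_bias(symmetric))   # approximately 0
print(rounding_bias(shifted))     # negative, i.e. nonzero bias
```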

Discussion: an ordered item booklet (OIB) from the Bookmark method could be used to design the test so that roughly half of the items fall above the cut score, but it is impossible to know the location of the cut score in advance, and different panelists have different intended cut scores, so some panelists must have bias. With multiple cut scores, at least one of them would produce bias. Rounding to the integer therefore presents many potential problems.

Discussion: a challenge is that in real situations panelists are not completely consistent in their judgments; feedback has been helpful for reducing rater inconsistency in NAEP. Further development: examine the bias at the group level.

Thank you for your attention.