Standard Setting for Professional Certification Brian D. Bontempo Mountain Measurement, Inc. (503) 284-1288 ext 129.


Overview Definition of Standard Setting Management Issues relating to Standard Setting Standard Setting Process Methods of Standard Setting Using multiple methods of Standard Setting

Definition of Standard Setting Standard setting is a process whereby decision makers render judgments about the performance level required of minimally competent examinees

Types of Standards Relative Standard (Normative Standards) –Top 70% of scores pass –20 points above average Criterion-Referenced Standard (Absolute Standards) –70% of the items correct –600 out of 800 scaled score –.05 logits –20 items correct
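The distinction above can be made concrete with a small calculation. This is a minimal sketch with hypothetical scores; the cohort, the 50-item test length, and the cut values are illustrative assumptions, not figures from the talk.

```python
# Hypothetical raw scores for one testing cohort
scores = [22, 25, 28, 30, 31, 33, 35, 36, 38, 40]

# Relative (normative) standard: "top 70% of scores pass" means the cut
# sits at the 30th percentile, so it moves with each cohort.
ranked = sorted(scores)
relative_cut = ranked[len(ranked) * 30 // 100]

# Criterion-referenced (absolute) standard: "70% of the items correct"
# on a hypothetical 50-item test is fixed regardless of who tests.
criterion_cut = 0.70 * 50

print(relative_cut, criterion_cut)
```

Note how the relative cut would shift if weaker examinees joined the cohort, while the criterion-referenced cut would not.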

Why do we conduct Standard Setting? To objectively involve stakeholders in the test decision making process To connect the expectations of employers to the test decision making process To connect the reality of training to the test decision making process To ensure psychometric soundness & legal defensibility

When to (re)set a passing standard For a new exam, after Beta Test data have been analyzed, typically after “Live” Test Forms have been constructed For exam revisions, when the expectations of a job role have changed –Practice has changed –Content domain has changed –It is not appropriate to change the passing standard whenever a test or training has been revised. –It is not appropriate to change the passing standard because of supply and demand issues (too many/few certified professionals)

Who should lead a standard setting panel? An experienced Psychometrician –Insider perspective, familiar with your certification and exam development –Outsider perspective, not familiar with your certification and exam development

How rigid should you be in your direction to the Psychometrician? I recommend a conversation between the Psychometrician and the Test Sponsor to figure out what works best. Typically a test sponsor will specify a framework (e.g., Angoff) and let the Psychometrician dictate the specifics.

Outcomes of Standard Setting A conceptual (qualitative) definition of minimal competency A proposed numeric (quantitative) passing standard A set of alternate passing standards based on errors in the process Expected passing rate(s) from each standard A report documenting the process and the psychometric quality of the process

Standard Setting Process

Gather test data Assemble a group of judges –Define minimal competency –Train judges on the method –Render judgments on the performance of borderline examinees Calculate the passing standard by aggregating the judgments Evaluate the outcome by calculating the expected passing rate

Selecting your judges Representative Sample –Hiring Managers –Trainers –Entry-Level Practitioners How many judges is enough? –For a low stakes exam at least 8 judges –For a medium stakes exam at least 12 judges –For a high stakes exam at least 16 judges

Developing a Definition of Minimal Competency Identify 3 common tasks within each domain of the test blueprint (an easy, a hard, and a “Borderline” task) Characterize the performance of minimally competent examinees on each of the major tasks Write text that summarizes these discussions

Training Judges Instruct them on their task Practice rating items –Two sets of practice items Practice discussing items Explain the stats that you will be providing them Set the tone and boundaries for good ‘group psychology’

Standard Setting Methods

Types of Standard Setting Methods Examinee-Centered Methods –Judges use external criteria, such as on-the-job performance, to evaluate the competency of real examinees Test-Centered Methods –Judges evaluate the performance of imaginary examinees on real test items Adjustments –In order to account for inaccuracy in the standard setting process, Psychometricians use real test data to provide a range of probable values for the passing standard

Examinee-Centered Methods Borderline group –Using external criteria (such as performance on the job), judges identify a group of examinees that they think are borderline examinees. The average score of this group is the passing standard Contrasting groups –Using external criteria, judges classify examinees as passers or failers. The passing standard is established by finding the point that best discriminates between the scores of the two groups
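Both examinee-centered methods can be sketched in a few lines. This is a hedged illustration with invented scores; a real implementation would use much larger samples and typically smoothed score distributions.

```python
# Hypothetical scores: a judge-identified borderline group, plus examinees
# classified as passers or failers on external criteria
borderline_scores = [61, 63, 64, 66, 68]
passer_scores = [72, 75, 78, 80, 81, 85, 88]
failer_scores = [55, 60, 62, 65, 68, 70, 74]

# Borderline group: the passing standard is the group's average score
borderline_cut = sum(borderline_scores) / len(borderline_scores)

# Contrasting groups: pick the score that minimizes misclassification
# (judged passers who would fail plus judged failers who would pass)
def contrasting_cut(passers, failers):
    def errors(cut):
        false_fail = sum(1 for s in passers if s < cut)
        false_pass = sum(1 for s in failers if s >= cut)
        return false_fail + false_pass
    return min(sorted(set(passers + failers)), key=errors)

print(borderline_cut, contrasting_cut(passer_scores, failer_scores))
```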

Test-Centered Methods Modified-Angoff –Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education. Bookmark –Mitzel, H.C., Lewis, D.M., Patz, R.J., & Green, D.R. (2001). The Bookmark procedure: Psychological perspectives. In G.J. Cizek (Ed.), Setting Performance Standards. Mahwah, NJ: Lawrence Erlbaum Associates.

Basic Angoff Process Judges evaluate each item –What percentage of MC examinees would get the item correct? Feedback/Discussion Judges make adjustments to their ratings The average across all items is a judge's passing standard The average of all judges' standards is the passing standard
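The two-step averaging above can be sketched directly. The ratings are invented for three judges and four items; each rating is a judge's estimate of the chance a minimally competent examinee answers the item correctly.

```python
# Hypothetical Angoff ratings: one row per judge, one column per item
ratings = [
    [0.80, 0.65, 0.90, 0.55],  # judge 1
    [0.75, 0.70, 0.85, 0.60],  # judge 2
    [0.85, 0.60, 0.95, 0.50],  # judge 3
]

# Each judge's standard is the average of their item ratings...
judge_standards = [sum(r) / len(r) for r in ratings]

# ...and the panel's passing standard is the average across judges
passing_standard = sum(judge_standards) / len(judge_standards)

print(round(passing_standard, 4))  # proportion correct; multiply by test length for a raw cut
```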

Common Angoff Issues How should the rating question be phrased? “What percentage of ___ ___ get the item correct?” –MCs vs. all candidates: “MCs” is correct –“would” vs. “should”: “would” is correct

Common Angoff Issues What type of ratings should judges make? –1/0 (Yes/No) –Percentage of Borderline examinees Round to 1 decimal (.9) Round to 2 decimals (.92) –NEVER use percentage of all examinees

Common Angoff Issues Types of Feedback to provide –Group Discussion Relate to conceptual definition of minimal competency –Typical or atypical content –Relevancy Relate to item nuances –Item Stem –Item Distractors “I expect a lot of the MC because this is core content and the item is straightforward.” “I would like to cut the MC some slack because this is not covered well in training and the scenario is a little abstract.”

Common Angoff Issues Types of Feedback to provide –Empirical Data Answer Key – Yes! Percentage of Borderline examinees answering the item correctly – If possible yes P-Value (Percentage of examinees answering the item correctly) – Only if the percentage of Borderline examinees is not available

Common Angoff Issues When to provide feedback? –Initial Rating –Discuss items –Secondary Rating –Provide Empirical Data –Tertiary Rating

Bookmark Test is divided up into subtests –By domain OR –Equal variance of difficulty across subtests Items are sorted from easiest to hardest –By judges OR –By actual value Judges bookmark each subtest at the point where the MC examinee would stop getting items correct and start getting them incorrect –The lowest possible standard –The expected standard –The highest possible standard Judges discuss ratings & make adjustments Passing standard is the average # of items answered correctly
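The final aggregation step can be sketched as follows, assuming each judge's bookmark placement is expressed as the number of ordered items the MC examinee would answer correctly (placements invented):

```python
# Hypothetical bookmark placements, one per judge, each expressed as the
# number of OIB items the MC examinee would answer correctly
bookmarks = [24, 27, 25, 26, 23]

# The panel's passing standard is the average placement
passing_standard = sum(bookmarks) / len(bookmarks)
print(passing_standard)
```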

Common Bookmark Issues How many Ordered Item Booklets (OIBs)? –One for each content domain OR –An equivalent number that meets the test plan

Common Bookmark Issues How should I select Items for the OIB? –Minimize the distance in difficulty between any two adjacent items. Ensure that there are enough items at all difficulty levels for each OIB Ensure that the variance in item difficulty is the same for each OIB

Common Bookmark Issues How should I sort the item booklets? –Easiest to Hardest –Hardest to Easiest

Common Bookmark Issues How do I know when the MC would stop getting items correct and start getting them incorrect? (What is the appropriate RP value?) –.5 –.67 (most common) –.75

Common Bookmark Issues How do I convert the bookmark to a passing standard? –Previous Item (PI) – Take the difficulty of the easier of the two items on either side of the bookmark –Between Item (BI) – Take the average of the difficulties of those two items
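The two conversions sit side by side in code. The difficulties (in logits, sorted easiest first) and the bookmark position are invented for illustration.

```python
# Hypothetical item difficulties (logits) in an ordered item booklet
difficulties = [-1.2, -0.8, -0.3, 0.0, 0.4, 0.9, 1.3]
bookmark = 5  # index of the first item the MC examinee would NOT get correct

# Previous Item (PI): the difficulty of the easier neighbor of the bookmark
previous_item_cut = difficulties[bookmark - 1]

# Between Item (BI): the average difficulty of the two neighboring items
between_item_cut = (difficulties[bookmark - 1] + difficulties[bookmark]) / 2

print(previous_item_cut, between_item_cut)
```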

Comparing Angoff and Bookmark Angoff requires less preparation –Select a real test form as opposed to building the OIBs Judges understand Bookmark better –Estimating the percentage of MC examinees who would answer an item correctly is a difficult task Bookmark requires more test items –I’d recommend an item pool of at least 40 solid test items per content domain

Other Test Centered Methods Ebel Nedelsky Jaeger Rasch Item Mapping

Ebel Judges sort each item into cells –How difficult is this item for the MC examinee? Easy, moderate, or hard –How relevant is this content for practice? Critical, moderately important, not relevant Judges then estimate the percentage of items in each cell that MC examinees would get correct The passing standard is determined by multiplying the number of items in each cell by that cell’s percentage and summing across cells
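The multiply-and-sum step reads like this; the grid of cell counts and expected percentages is an illustrative assumption.

```python
# Hypothetical Ebel grid: (difficulty, relevance) -> (item count, expected
# percent correct for minimally competent examinees)
cells = {
    ("easy", "critical"):     (10, 0.90),
    ("moderate", "critical"): (15, 0.70),
    ("hard", "critical"):     (5,  0.50),
    ("easy", "moderate"):     (8,  0.85),
    ("moderate", "moderate"): (10, 0.60),
    ("hard", "moderate"):     (2,  0.40),
}

# Multiply each cell's item count by its percentage and sum the products
cut_score = sum(n * p for n, p in cells.values())
total_items = sum(n for n, _ in cells.values())

print(cut_score, total_items)  # raw-score standard and test length
```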

Nedelsky Judges determine which response options are unrealistic for each item The probability of a correct guess among the remaining options (one divided by the number of options left) is calculated The sum of these probabilities across items is the passing standard
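A minimal sketch of the calculation, assuming invented counts of response options a borderline examinee could not rule out (the key is always among them):

```python
# Hypothetical per-item counts of options left after judges eliminate
# the unrealistic distractors (key included)
remaining_options = [2, 3, 4, 2, 5, 3]

# Chance of guessing correctly among the remaining options, per item
guess_probs = [1 / k for k in remaining_options]

# Summing the probabilities gives the raw-score passing standard
passing_standard = sum(guess_probs)
print(round(passing_standard, 4))
```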

Jaeger Judges evaluate each item –Yes/No - “Should every entry-level practitioner answer this item correctly?” Judges discuss ratings & make adjustments Judges are provided the passing rate implied by the standard & make adjustments Passing standard is calculated by summing the number of “Yes” responses
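The summing step can be sketched as below. The judgments are invented, and averaging the per-judge sums is one common aggregation choice, assumed here since the slide does not specify how judges are combined.

```python
# Hypothetical Jaeger judgments: 1 = "yes, every entry-level practitioner
# should answer this correctly", 0 = "no"; one row per judge
judgments = [
    [1, 1, 0, 1, 1, 0, 1],  # judge 1
    [1, 0, 0, 1, 1, 1, 1],  # judge 2
    [1, 1, 1, 1, 0, 1, 1],  # judge 3
    [1, 0, 0, 1, 1, 0, 1],  # judge 4
]

# Each judge's standard is their count of "yes" responses...
judge_standards = [sum(row) for row in judgments]

# ...aggregated across judges for the raw-score passing standard
passing_standard = sum(judge_standards) / len(judge_standards)
print(judge_standards, passing_standard)
```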

Test-Centered Options What the ratings are based on –Should or would the MC get this right How ratings are made –Yes/No, Percentage Relevance adjustments Guessing adjustments What kind of feedback is provided –Passing rate –Other judges’ ratings –Actual item difficulty

Using Multiple Methods of Standard Setting

Why use Multiple Methods? There is error in every standard setting Allows policymakers to “decide” on the standard rather than science simply documenting the outcomes of a panel Allows for the recovery of standard setting sessions that go awry Involves more stakeholders

Adjustments Simple Stats – Calculate the confidence interval around the estimate Beuk – Judges provide an expected passing score and an expected passing rate. Calculations are made that are based on the variability in these two estimates De Gruijter – Similar to Beuk, judges also provide an estimate of the uncertainty of their judgments. Hofstee – Judges indicate the highest and lowest passing score and passing rate. These values are plotted along with the cumulative frequency distribution and the point of intersection is the passing standard
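The Hofstee compromise above can be sketched numerically rather than graphically: find where the line through (lowest cut, highest fail rate) and (highest cut, lowest fail rate) meets the observed cumulative fail-rate curve. All scores and bounds here are invented for illustration.

```python
# Hypothetical observed scores and judge-supplied Hofstee bounds
scores = [48, 52, 55, 58, 60, 61, 63, 65, 67, 70,
          72, 75, 78, 80, 83, 85, 88, 90, 92, 95]
k_min, k_max = 55.0, 75.0   # lowest / highest defensible cut score
f_min, f_max = 0.05, 0.40   # lowest / highest tolerable fail rate

def fail_rate(cut):
    # cumulative distribution: proportion of examinees scoring below the cut
    return sum(1 for s in scores if s < cut) / len(scores)

def hofstee_line(k):
    # straight line through (k_min, f_max) and (k_max, f_min)
    return f_max + (k - k_min) * (f_min - f_max) / (k_max - k_min)

# Scan candidate cuts and take the one where the curve and line are closest
candidates = (k_min + i * 0.1 for i in range(int((k_max - k_min) / 0.1) + 1))
cut = min(candidates, key=lambda k: abs(fail_rate(k) - hofstee_line(k)))
print(cut)
```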

Survey of Hiring Managers Ask hiring managers about the workforce –What percentage of certified persons do you believe to be minimally competent? –Are your certified persons more competent than your uncertified persons? Expands the reach of your exam

Triangulating results Psychometrician should present the outcome of each method and the passing rate associated with each outcome –A range of possible values Policymakers can use this information and “their professional experience” to set the actual passing standard

Wrap-Up

3 Vital Recommendations Have more judges at standard setting Spend more time training your judges With each standard setting, take the time to define minimal competency conceptually, and document this definition

Concluding Remarks Many people like to think of test makers as big, bad people, which is obviously not true. Standard setting is one example of how inclusive the scientific process of test development can be. I encourage folks to make this process light and fun.

Thank you for paying attention! Questions & Comments: