F. Kaftandjieva
Terminology
Milestones in Comparability
1904 – Spearman, “The proof and measurement of association between two things”. Key term: association.
Milestones in Comparability
“Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.” Key terms: comparable, population.
Milestones in Comparability
‘Scales, norms, and equivalent scores’ (Angoff, 1971). Key terms: equating, calibration, comparability.
Alignment
Alignment refers to the degree of match between test content and the standards.
Dimensions of alignment: Content, Depth, Emphasis, Performance, Accessibility.
Alignment and content validity
Alignment is related to content validity.
Specification (Manual – Ch. 4): “Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2)
24 pages of forms. Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7)
Crocker, L., et al. (1989). Quantitative methods for assessing the fit between test and curriculum. Applied Measurement in Education, 2(2).
Alignment (Porter, 2004)
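Porter (2004) is usually associated with a quantitative, cell-by-cell alignment index: 1 minus half the sum of absolute differences between the proportions of content in each cell of the test matrix and of the standards matrix. The sketch below only illustrates that commonly cited formula; the function name and example data are hypothetical and not taken from the slides.

```python
def porter_alignment_index(test_props, standards_props):
    """Porter-style alignment index: 1 - (sum of |test - standards| cell
    differences) / 2. Inputs map a (topic, cognitive demand) cell to the
    proportion of content in it; each document's proportions sum to 1."""
    cells = set(test_props) | set(standards_props)
    total_abs_diff = sum(
        abs(test_props.get(c, 0.0) - standards_props.get(c, 0.0)) for c in cells
    )
    return 1.0 - total_abs_diff / 2.0


# Hypothetical data: identical distributions give 1.0, disjoint ones give 0.0.
test = {("reading", "recall"): 0.5, ("reading", "inference"): 0.5}
standards = {("reading", "recall"): 0.25, ("reading", "inference"): 0.25,
             ("writing", "production"): 0.5}
print(porter_alignment_index(test, standards))  # 0.5
```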
Mislevy & Linn: Linking Assessments
Linking method          Construct   Instrument   Examinees   Moderator
Equating                =           =            =           no
Calibration             =           =            –           no
Projection              =           –            –           no
Statistical moderation  –           –            –           Other test
Social moderation       –           –            –           Judges
Key terms: equating, linking.
The Good & The Bad in Calibration
Model – Data Fit (figure slides)
Sample-Free Estimation
The ruler (θ scale)
Reference points on temperature scales: boiling water, absolute zero.
F° = 1.8 * C° + 32    C° = (F° – 32) / 1.8
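The temperature conversion is the slides' analogy for how two scales measuring the same property are related by a linear transformation. A minimal sketch of that idea; the θ-linking constants A and B at the end are hypothetical illustration values, not from the slides.

```python
def c_to_f(c):
    """Fahrenheit from Celsius, as on the slide: F = 1.8 * C + 32."""
    return 1.8 * c + 32


def f_to_c(f):
    """Inverse transformation: C = (F - 32) / 1.8."""
    return (f - 32) / 1.8


# The same physical reference points anchor both scales.
print(c_to_f(100))  # 212.0 (boiling water)
print(f_to_c(32))   # 0.0   (freezing point)

# By analogy, theta estimates from two calibrations of the same construct
# differ only by a linear transformation; A and B here are made-up values.
A, B = 1.1, -0.3


def link_theta(theta_on_scale_a):
    """Map a theta value from scale A onto scale B (hypothetical constants)."""
    return A * theta_on_scale_a + B
```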
Mislevy & Linn: Linking Assessments (table repeated): social moderation relies on judges.
Standard Setting
The Ugly
Fact 1: Human judgment is the epicenter of every standard-setting method. (Berk, 1995)
When Ugliness turns to Beauty
Fact 2: The cut-off points on the latent continuum do not possess any objective reality outside of, and independently of, our minds. They are mental constructs, which can differ from person to person.
Consequently: “Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them.” (Messick, 1994)
Defensibility
Defensibility: Claims vs. Evidence
National standards: Understands manuals for devices used in their everyday life.
CEF – A2: Can understand simple instructions on equipment encountered in everyday life – such as a public telephone. (p. 70)
Defensibility: Claims vs. Evidence
Cambridge ESOL, DIALANG, Finnish Matriculation, CIEP (TCF), CELI (Università per Stranieri di Perugia), Goethe-Institut, TestDaF Institut, WBT (Zertifikat Deutsch)
Defensibility: Claims vs. Evidence – Common Practice (Buckendahl et al., 2000)
External evaluation of the alignment of 12 tests by 2 publishers.
Publisher reports: no description of the exact procedure followed; reports include only the match between items and standards.
Evaluation study: at least 10 judges per test.
Comparison results: percentage of agreement 26%–55%; test publishers overestimate the match.
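The 26%–55% figures are rates of agreement between the publishers' claimed item–standard matches and the panel's judgments. A hypothetical sketch of how such a rate could be computed; the data layout and function are illustrative, not Buckendahl et al.'s actual procedure.

```python
def percent_agreement(publisher_match, judges_match):
    """Share of items where the publisher's claimed match to a standard
    agrees with the judge panel's majority verdict. Both arguments are
    lists of booleans, one entry per item (hypothetical layout)."""
    agreements = sum(p == j for p, j in zip(publisher_match, judges_match))
    return agreements / len(publisher_match)


# Hypothetical example: the publisher claims all 8 items match the standard,
# but the judge panel confirms only 3 of those claims.
publisher = [True] * 8
judges = [True, False, True, False, False, True, False, False]
print(percent_agreement(publisher, judges))  # 0.375
```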
Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth. (Standards for Educational and Psychological Testing, 1999)
Evaluation Criteria
Hambleton, R. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In: Cizek, G. (Ed.), Setting Performance Standards: Concepts, Methods and Perspectives. Lawrence Erlbaum Associates.
A list of 20 questions serves as evaluation criteria:
Planning & Documentation – 4 questions (20%)
Judgments – 11 questions (55%)
Standard Setting Method – 5 questions (25%)
Judges
“Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards.” (Messick, 1994)
Selection of Judges
The judges should have the right qualifications, but other criteria such as occupation, working experience, age, and sex may also be taken into account, because “… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable” (Maurer & Alexander, 1992).
Number of Judges
Livingston & Zieky (1982) suggest that the number of judges should be no fewer than 5. Based on court cases in the USA, Biddle (1993) recommends using 7 to 10 Subject Matter Experts in the judgment session. As a general rule, Hurtz & Hertz (1999) recommend sampling 10 to 15 raters. According to the Manual (p. 94), 10 judges is the minimum number.
Training Session
The weakest point.
How much training? Until it hurts (Berk, 1995).
Main focus: intra-judge consistency.
Feedback: evaluation forms (Hambleton, 2001).
Training Session: Feedback Form (sample form slides)
Standard Setting Method – Good Practice: the most appropriate method; due diligence; field tested; reality check; validity evidence; more than one method.
Standard Setting Method
“Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions.” (Berk, 1995)
Test-centered methods vs. examinee-centered methods
He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
Instead of Conclusion
“In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences.” (Messick, 1994)
Instead of Conclusion
“The chief determiner of performance standards is not truth; it is consequences.” (Popham, 1997)
Instead of Conclusion
“Perhaps by the year 2000, the collaborative efforts of measurement researchers and practitioners will have raised the standard on standard-setting practices for this emerging testing technology.” (Berk, 1996)