F. Kaftandjieva
Terminology
Milestones in Comparability
1904 – Spearman, “The proof and measurement of association between two things”. Key term: association.
Milestones in Comparability
“Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.” Key terms: comparable, population.
Milestones in Comparability
‘Scales, norms, and equivalent scores’ (Angoff, 1971). Key terms: equating, calibration, comparability.
Alignment
Alignment refers to the degree of match between test content and the standards.
Dimensions of alignment: Content, Depth, Emphasis, Performance, Accessibility.
Alignment and content validity
Alignment is related to content validity.
Specification (Manual – Ch. 4): “Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2)
24 pages of forms. Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7)
Crocker, L., et al. (1989). Quantitative methods for assessing the fit between test and curriculum. Applied Measurement in Education, 2(2).
Alignment (Porter, 2004)
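Porter (2004) is usually associated with a quantitative, cell-by-cell alignment index: 1 minus half the sum of absolute differences between the proportions of content in each cell of the test matrix and of the standards matrix. The sketch below only illustrates that commonly cited formula; the function name and example data are hypothetical and not taken from the slides.

```python
def porter_alignment_index(test_props, standards_props):
    """Porter-style alignment index: 1 - (sum of |test - standards| cell
    differences) / 2. Inputs map a (topic, cognitive demand) cell to the
    proportion of content in it; each document's proportions sum to 1."""
    cells = set(test_props) | set(standards_props)
    total_abs_diff = sum(
        abs(test_props.get(c, 0.0) - standards_props.get(c, 0.0)) for c in cells
    )
    return 1.0 - total_abs_diff / 2.0


# Hypothetical data: identical distributions give 1.0, disjoint ones give 0.0.
test = {("reading", "recall"): 0.5, ("reading", "inference"): 0.5}
standards = {("reading", "recall"): 0.25, ("reading", "inference"): 0.25,
             ("writing", "production"): 0.5}
print(porter_alignment_index(test, standards))  # 0.5
```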
Mislevy & Linn: Linking Assessments
Linking method          Construct   Instrument   Examinees   Moderator
Equating                =           =            =           no
Calibration             =           =            –           no
Projection              =           –            –           no
Statistical moderation  –           –            –           Other test
Social moderation       –           –            –           Judges
Key terms: equating, linking.
The Good & The Bad in Calibration
Model – Data Fit (figure slides)
Sample-Free Estimation
The ruler (θ scale)
Reference points on temperature scales: boiling water, absolute zero.
F° = 1.8 * C° + 32    C° = (F° – 32) / 1.8
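The temperature conversion is the slides' analogy for how two scales measuring the same property are related by a linear transformation. A minimal sketch of that idea; the θ-linking constants A and B at the end are hypothetical illustration values, not from the slides.

```python
def c_to_f(c):
    """Fahrenheit from Celsius, as on the slide: F = 1.8 * C + 32."""
    return 1.8 * c + 32


def f_to_c(f):
    """Inverse transformation: C = (F - 32) / 1.8."""
    return (f - 32) / 1.8


# The same physical reference points anchor both scales.
print(c_to_f(100))  # 212.0 (boiling water)
print(f_to_c(32))   # 0.0   (freezing point)

# By analogy, theta estimates from two calibrations of the same construct
# differ only by a linear transformation; A and B here are made-up values.
A, B = 1.1, -0.3


def link_theta(theta_on_scale_a):
    """Map a theta value from scale A onto scale B (hypothetical constants)."""
    return A * theta_on_scale_a + B
```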
Mislevy & Linn: Linking Assessments (table repeated): social moderation relies on judges.
Standard Setting
The Ugly
Fact 1: Human judgment is the epicenter of every standard-setting method. (Berk, 1995)
When Ugliness turns to Beauty
Fact 2: The cut-off points on the latent continuum do not possess any objective reality outside of, and independently of, our minds. They are mental constructs, which can differ from person to person.
Consequently: “Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them.” (Messick, 1994)
Defensibility
Defensibility: Claims vs. Evidence
National standards: Understands manuals for devices used in their everyday life.
CEF – A2: Can understand simple instructions on equipment encountered in everyday life – such as a public telephone. (p. 70)
Defensibility: Claims vs. Evidence
Cambridge ESOL, DIALANG, Finnish Matriculation, CIEP (TCF), CELI (Università per Stranieri di Perugia), Goethe-Institut, TestDaF Institut, WBT (Zertifikat Deutsch)
Defensibility: Claims vs. Evidence – Common Practice (Buckendahl et al., 2000)
External evaluation of the alignment of 12 tests by 2 publishers.
Publisher reports: no description of the exact procedure followed; reports include only the match between items and standards.
Evaluation study: at least 10 judges per test.
Comparison results: percentage of agreement 26%–55%; test publishers overestimate the match.
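The 26%–55% figures are rates of agreement between the publishers' claimed item–standard matches and the panel's judgments. A hypothetical sketch of how such a rate could be computed; the data layout and function are illustrative, not Buckendahl et al.'s actual procedure.

```python
def percent_agreement(publisher_match, judges_match):
    """Share of items where the publisher's claimed match to a standard
    agrees with the judge panel's majority verdict. Both arguments are
    lists of booleans, one entry per item (hypothetical layout)."""
    agreements = sum(p == j for p, j in zip(publisher_match, judges_match))
    return agreements / len(publisher_match)


# Hypothetical example: the publisher claims all 8 items match the standard,
# but the judge panel confirms only 3 of those claims.
publisher = [True] * 8
judges = [True, False, True, False, False, True, False, False]
print(percent_agreement(publisher, judges))  # 0.375
```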
Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth. (Standards for Educational and Psychological Testing, 1999)
Evaluation Criteria
Hambleton, R. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In: Cizek, G. (Ed.), Setting Performance Standards: Concepts, Methods and Perspectives. Lawrence Erlbaum Associates.
A list of 20 questions serves as evaluation criteria:
Planning & Documentation – 4 questions (20%)
Judgments – 11 questions (55%)
Standard Setting Method – 5 questions (25%)
Judges
“Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards.” (Messick, 1994)
Selection of Judges
The judges should have the right qualifications, but other criteria such as occupation, working experience, age, and sex may also be taken into account, because “… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable” (Maurer & Alexander, 1992).
Number of Judges
Livingston & Zieky (1982) suggest that the number of judges should be no fewer than 5. Based on court cases in the USA, Biddle (1993) recommends using 7 to 10 Subject Matter Experts in the judgment session. As a general rule, Hurtz & Hertz (1999) recommend sampling 10 to 15 raters. According to the Manual (p. 94), 10 judges is the minimum number.
Training Session
The weakest point.
How much training? Until it hurts (Berk, 1995).
Main focus: intra-judge consistency.
Feedback: evaluation forms (Hambleton, 2001).
Training Session: Feedback Form (sample form slides)
Standard Setting Method – Good Practice: the most appropriate method; due diligence; field tested; reality check; validity evidence; more than one method.
Standard Setting Method
“Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions.” (Berk, 1995)
Test-centered methods vs. examinee-centered methods
He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)
Instead of Conclusion
“In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences.” (Messick, 1994)
Instead of Conclusion
“The chief determiner of performance standards is not truth; it is consequences.” (Popham, 1997)
Instead of Conclusion
“Perhaps by the year 2000, the collaborative efforts of measurement researchers and practitioners will have raised the standard on standard-setting practices for this emerging testing technology.” (Berk, 1996)