F. Kaftandjieva. Terminology F. Kaftandjieva Milestones in Comparability 1904 “The proof and measurement of association between two things“ association.


1 F. Kaftandjieva

2

3

4

5 Terminology

6 F. Kaftandjieva Milestones in Comparability 1904 “The proof and measurement of association between two things” association

7 F. Kaftandjieva Milestones in Comparability 1904 1951 “Scores on two or more tests may be said to be comparable for a certain population if they show identical distributions for that population.” comparable population
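The 1951 definition quoted on this slide ties comparability to identical score distributions in a given population. A minimal sketch of such a check follows, assuming discrete scores and simple relative frequencies; the function names and the choice of "largest gap" as the summary are illustrative assumptions, not part of the presentation.

```python
from collections import Counter


def relative_frequencies(scores):
    """Return the share of examinees at each score point."""
    n = len(scores)
    counts = Counter(scores)
    return {score: count / n for score, count in counts.items()}


def max_distribution_gap(scores_a, scores_b):
    """Largest absolute difference in relative frequency at any score point.

    A value of 0.0 means the two empirical distributions are identical,
    which is the condition the quoted definition asks for.
    """
    freq_a = relative_frequencies(scores_a)
    freq_b = relative_frequencies(scores_b)
    all_points = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(p, 0.0) - freq_b.get(p, 0.0)) for p in all_points)
```

In practice the gap would be compared against a tolerance chosen by the analyst, since two samples from the same population rarely match exactly.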

8 F. Kaftandjieva Milestones in Comparability 1904 1951 1971 ‘Scales, norms, and equivalent scores’: Equating, Calibration, Comparability

9 F. Kaftandjieva Milestones in Comparability 1904 1951 1971 1992 1993

10 F. Kaftandjieva Milestones in Comparability 1904 1951 1971 1992 1993 1997 2001

11 F. Kaftandjieva Alignment: Alignment refers to the degree of match between test content and the standards. Dimensions of alignment: Content, Depth, Emphasis, Performance, Accessibility

12 F. Kaftandjieva Alignment is related to content validity. Specification (Manual – Ch. 4): “Specification … can be seen as a qualitative method. … There are also quantitative methods for content validation but this manual does not require their use.” (p. 2) 24 pages of forms. Outcome: “A chart profiling coverage graphically in terms of levels and categories of CEF.” (p. 7) Crocker, L. et al. (1989). Quantitative Methods for Assessing the Fit Between Test and Curriculum. In: Applied Measurement in Education, 2 (2), 179-194.

13 F. Kaftandjieva 0.235 Alignment (Porter, 2004) www.ncrel.org
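The slide reports an alignment value of 0.235 without showing how it is computed. A hedged sketch of the index as it is commonly stated in Porter's work follows: one minus half the summed absolute differences between the cell proportions of the test and of the standards. The function name and the example proportions are assumptions for illustration; the data behind the 0.235 figure are not given in the transcript.

```python
def porter_alignment_index(test_props, standard_props):
    """Alignment index: 1 - (sum of |x - y|) / 2.

    Both arguments are sequences of cell proportions (each summing to 1)
    over the same content-by-cognitive-demand cells, in the same order.
    The result ranges from 0.0 (no overlap) to 1.0 (perfect alignment).
    """
    if len(test_props) != len(standard_props):
        raise ValueError("both matrices must have the same number of cells")
    total_abs_diff = sum(abs(x - y) for x, y in zip(test_props, standard_props))
    return 1.0 - total_abs_diff / 2.0


# Hypothetical example: the test covers only two of four cells,
# while the standards spread emphasis evenly over all four.
example_index = porter_alignment_index([0.5, 0.5, 0.0, 0.0],
                                        [0.25, 0.25, 0.25, 0.25])
```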

14 F. Kaftandjieva Milestones in Comparability 1904 1951 1971 1992 1993 1997 2001

15 F. Kaftandjieva Mislevy & Linn: Linking Assessments. Columns: Construct, Instrument, Examinees, Moderator. Equating: = = = no. Calibration: =  = no. Projection:  = no. Statistical moderation:  Other test. Social moderation:  Judges. Equating  Linking

16 The Good & The Bad in Calibration

17 F. Kaftandjieva Model – Data Fit

18 F. Kaftandjieva Model – Data Fit

19 F. Kaftandjieva Model – Data Fit

20 F. Kaftandjieva Sample-Free Estimation

21 F. Kaftandjieva The ruler (θ scale)

22 F. Kaftandjieva The ruler (θ scale)

23 F. Kaftandjieva The ruler (θ scale)

24 F. Kaftandjieva The ruler (θ scale): boiling water, absolute zero

25 F. Kaftandjieva The ruler (θ scale) F° = 1.8 * C° + 32 C° = (F° – 32) / 1.8
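The temperature analogy on this slide shows that two scales measuring the same thing can be related by a linear transformation. A small sketch of the two formulas follows, together with a helper that recovers the slope and intercept from two anchor points; `linear_link` is a hypothetical name introduced here, and the anchor values are ordinary physical constants rather than data from the presentation.

```python
def celsius_to_fahrenheit(c):
    """F = 1.8 * C + 32, as on the slide."""
    return 1.8 * c + 32


def fahrenheit_to_celsius(f):
    """C = (F - 32) / 1.8, the inverse transformation."""
    return (f - 32) / 1.8


def linear_link(x1, y1, x2, y2):
    """Slope and intercept of the line through two anchor points.

    (x1, y1) and (x2, y2) are the same two reference points expressed
    on the source scale (x) and on the target scale (y).
    """
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Anchors: boiling water (100 C = 212 F) and absolute zero
# (-273.15 C = -459.67 F).
slope, intercept = linear_link(100.0, 212.0, -273.15, -459.67)
```

Two anchor points are enough here because the relation is assumed to be exactly linear; with real test data the anchors carry measurement error, so more of them would be needed.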

26 F. Kaftandjieva Mislevy & Linn: Linking Assessments. Columns: Construct, Instrument, Examinees, Moderator. Equating: = = = no. Calibration: =  = no. Projection:  = no. Statistical moderation:  Other test. Social moderation:  Judges.

27 Standard Setting

28 F. Kaftandjieva The Ugly

29 F. Kaftandjieva Fact 1: Human judgment is the epicenter of every standard-setting method. Berk, 1995

30 F. Kaftandjieva When Ugliness turns to Beauty

31 F. Kaftandjieva When Ugliness turns to Beauty

32 F. Kaftandjieva Fact 2: The cut-off points on the latent continuum do not possess any objective reality outside and independently of our minds. They are mental constructs, which can differ within different persons.

33 F. Kaftandjieva Consequently: Whether the levels themselves are set at the proper points is a most contentious issue and depends on the defensibility of the procedures used for determining them. Messick, 1994

34 F. Kaftandjieva Defensibility

35 F. Kaftandjieva Defensibility: Claims vs. Evidence. National Standards: Understands manuals for devices used in their everyday life. CEF – A2: Can understand simple instructions on equipment encountered in everyday life – such as a public telephone (p. 70)

36 F. Kaftandjieva Defensibility: Claims vs. Evidence. Cambridge ESOL; DIALANG; Finnish Matriculation; CIEP (TCF); CELI, Università per Stranieri di Perugia; Goethe-Institut; TestDaF Institut; WBT (Zertifikat Deutsch)

37 F. Kaftandjieva Defensibility: Claims vs. Evidence. Common Practice (Buckendahl et al., 2000): External evaluation of the alignment of 12 tests by 2 publishers. Publisher reports: no description of the exact procedure followed; reports include only the match between items and standards. Evaluation study: at least 10 judges per test. Comparison results: % of agreement 26% - 55%; overestimation of the match by test publishers.
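The comparison on this slide is reported as a percentage of agreement between the publishers' item-to-standard matches and those of the judges. A minimal sketch of that statistic follows, assuming one standard label per item from each source; the function name and the labels are hypothetical, and the transcript does not describe how Buckendahl et al. computed their figures.

```python
def percent_agreement(publisher_matches, judge_matches):
    """Percentage of items assigned to the same standard by both sources."""
    if len(publisher_matches) != len(judge_matches):
        raise ValueError("both lists must cover the same items")
    if not publisher_matches:
        raise ValueError("at least one item is required")
    same = sum(1 for p, j in zip(publisher_matches, judge_matches) if p == j)
    return 100.0 * same / len(publisher_matches)


# Hypothetical labels for four items: the two sources agree on two of them.
publisher = ["S1", "S2", "S3", "S1"]
judges = ["S1", "S3", "S3", "S2"]
agreement = percent_agreement(publisher, judges)
```

Simple percent agreement does not correct for agreement expected by chance, so it is best read alongside a description of how the matches were elicited.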

38 F. Kaftandjieva Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The description of procedures should include any training and instruction provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth. Standards for educational and psychological testing, 1999

39 F. Kaftandjieva Evaluation Criteria. Hambleton, R. (2001). Setting Performance Standards on Educational Assessments and Criteria for Evaluating the Process. In: Setting Performance Standards: Concepts, Methods and Perspectives, ed. by Cizek, G., Lawrence Erlbaum Ass., 89-116. A list of 20 questions as evaluation criteria: Planning & Documentation 4 (20%); Judgments 11 (55%); Standard Setting Method 5 (25%)

40 F. Kaftandjieva Judges Because standard-setting inevitably involves human judgment, a central issue is who is to make these judgments, that is, whose values are to be embodied in the standards. Messick, 1994

41 F. Kaftandjieva Selection of Judges The judges should have the right qualifications, but some other criteria such as occupation, working experience, age, sex may be taken into account, because ‘… although ensuring expertise is critical, sampling from relevant different constituencies may be an important consideration if the testing procedures and passing scores are to be politically acceptable’ (Maurer & Alexander, 1992).

42 F. Kaftandjieva Number of Judges Livingston & Zieky (1982) suggest the number of judges to be not less than 5. Based on the court cases in the USA, Biddle (1993) recommends 7 to 10 Subject Matter Experts to be used in the Judgement Session. As a general rule Hurtz & Hertz (1999) recommend 10 to 15 raters to be sampled. 10 judges is a minimum number, according to the Manual (p. 94).

43 F. Kaftandjieva Training Session: The weakest point. How much? Until it hurts (Berk, 1995). Main focus: intra-judge consistency. Evaluation forms (Hambleton, 2001). Feedback

44 F. Kaftandjieva Training Session: Feedback Form

45 F. Kaftandjieva Training Session: Feedback Form

46 F. Kaftandjieva Standard Setting Method: Good Practice. The most appropriate; Due diligence; Field tested; Reality check; Validity evidence; More than one

47 F. Kaftandjieva Standard Setting Method: Probably the only point of agreement among standard-setting gurus is that there is hardly any agreement between results of any two standard-setting methods, even when applied to the same test under seemingly identical conditions. Berk, 1995

48 F. Kaftandjieva Test-centered methods Examinee-centered methods He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)

49 F. Kaftandjieva He that increaseth knowledge increaseth sorrow. (Ecclesiastes 1:18)

50 F. Kaftandjieva Instead of Conclusion: In sum, it may seem that providing valid grounds for valid inferences in standards-based educational assessment is a costly and complicated enterprise. But when the consequences of the assessment affect accountability decisions and educational policy, this needs to be weighed against the costs of uninformed or invalid inferences. Messick, 1994

51 F. Kaftandjieva Instead of Conclusion: The chief determiner of performance standards is not truth; it is consequences. Popham, 1997

52 F. Kaftandjieva Instead of Conclusion: Perhaps by the year 2000, the collaborative efforts of measurement researchers and practitioners will have raised the standard on standard-setting practices for this emerging testing technology. Berk, 1996

53 F. Kaftandjieva

54

