Test Development
Stages
1. Test conceptualization: defining the test
2. Test construction: selecting a measurement scale, developing items
3. Test tryout
4. Item analysis
5. Revising the test
1. Test conceptualization
Defining the scope, purpose, and limits of the test.
Initial questions in test construction
Should the item content be similar or varied?
Should the range of difficulty be narrow or broad? (ceiling effect vs. floor effect)
How many items should be created?
Which domains should be tapped?
The test developer may specify content domains and cognitive skills that must be included on the test.
What kind of test item should be used?
2. Test construction
Selecting a scaling method
Levels of measurement (NOIR): nominal, ordinal, interval, ratio
Scaling methods
Most are rating scales that are summative.
May be unidimensional or multidimensional.
Method of paired comparisons
Also known as forced choice. The test taker must pick one of two items paired together.
Comparative scaling Test takers sort cards or rank items from “least” to “most”
Categorical scaling Test takers sort cards into one of two or more categories. Stimuli are thought to differ quantitatively, not qualitatively.
Likert-type scales Response choices are ordered on a continuum from one extreme to the other (e.g., strongly agree to strongly disagree). Likert scaling assumes an interval scale, although this assumption may not hold in practice.
Guttman scales Response choices for each item are various statements that lie on a continuum. Endorsing the most extreme statement reflects endorsement of milder statements as well.
Method of equal-appearing intervals
Presumed to yield an interval scale.
For a knowledge scale: obtain true/false statements; experts rate each item.
For an attitude scale: judges rate each item on a Likert-type scale, assuming equal intervals.
For both: the test taker's total score is based on "weighted" items, with the weights determined by averaging the experts' ratings.
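A minimal sketch of the weighting step described above. The judge ratings, item names, and endorsement pattern are made up for illustration; they are not from the slides.

```python
# Sketch of scoring by the method of equal-appearing intervals: each item's
# weight is the average of the judges' ratings, and a respondent's total score
# sums the weights of the items they endorsed. Data are illustrative.
import statistics

judge_ratings = {
    "item_1": [9, 10, 8, 9],   # e.g., ratings on an 11-point favorability scale
    "item_2": [3, 4, 2, 3],
    "item_3": [6, 7, 6, 5],
}

item_weights = {item: statistics.mean(r) for item, r in judge_ratings.items()}

endorsed = {"item_1": 1, "item_2": 0, "item_3": 1}   # 1 = endorsed / answered true
total_score = sum(item_weights[i] for i, e in endorsed.items() if e)

print(item_weights)
print(total_score)   # 9.0 + 6.0 = 15.0
```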
Method of absolute scaling
Way to determine the difficulty level of items. Give items to several age groups, with one age group acting as the anchor. Item difficulty is assessed by noting the performance of each age group on each item as compared to the anchor group.
Method of empirical keying
Based entirely on empirical findings. The test developer writes a pool of items and administers them to a group known to possess the construct and a group known not to possess it. Items are selected based on how well they distinguish one group from the other.
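A rough sketch of how empirical keying might be carried out. The response data, item names, and the 0.30 cutoff are illustrative assumptions, not part of the original method description.

```python
# Sketch of empirical keying: retain items whose endorsement rates differ most
# between a group known to have the construct and a comparison group.
# The response data and the 0.30 cutoff are illustrative.

criterion_group = {"item_1": [1, 1, 1, 0, 1], "item_2": [0, 1, 0, 0, 1]}
comparison_group = {"item_1": [0, 0, 1, 0, 0], "item_2": [0, 1, 0, 1, 1]}

def endorsement_rate(responses):
    return sum(responses) / len(responses)

keyed_items = []
for item in criterion_group:
    diff = endorsement_rate(criterion_group[item]) - endorsement_rate(comparison_group[item])
    if abs(diff) >= 0.30:           # arbitrary cutoff chosen for illustration
        keyed_items.append(item)

print(keyed_items)   # ['item_1'] -- it separates the two groups; item_2 does not
```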
Writing the items
Item format: selected response vs. constructed response
Multiple choice Pros---- Cons----
Matching Pros---- Cons----
True/False Pros---- Cons---- Forced-choice methodology.
Fill in Pros---- Cons----
Short answer objective item
Pros--- Cons---
Essay Pros---- Cons----
Scoring items
Cumulative model
Class/category model
Ipsative model
Correction for guessing (see the sketch below)
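The slide does not give a formula, but one widely used correction for guessing subtracts a penalty for wrong answers; a small sketch with illustrative numbers follows.

```python
# Sketch of the standard correction-for-guessing formula:
#   corrected score = right - wrong / (k - 1), where k = options per item.
# The example scores are illustrative.

def corrected_score(num_right, num_wrong, num_options):
    return num_right - num_wrong / (num_options - 1)

# e.g., 30 right and 12 wrong on a 4-option multiple-choice test (omits ignored)
print(corrected_score(30, 12, 4))   # 26.0
```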
3. Test tryout
The tryout should use a group that represents the ultimate population of test takers (those for whom the test is intended).
Good items are reliable, valid, and discriminate well.
Before item analysis, look at the variability of scores within the test
Floor effect? Ceiling effect?
4. Item analysis helps determine which items should be kept, revised, or deleted.
Item-difficulty index
The proportion of examinees who answer the item correctly. Averaging across items gives a mean item difficulty for the test.
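A minimal sketch of this computation, assuming a small illustrative 0/1 response matrix (rows = examinees, columns = items):

```python
# Sketch: item difficulty p = proportion of examinees answering the item
# correctly; the mean across items summarizes the test. Data are illustrative.

responses = [        # rows = examinees, columns = items, 1 = correct
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]

n_examinees = len(responses)
n_items = len(responses[0])

item_difficulty = [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]
mean_difficulty = sum(item_difficulty) / n_items

print(item_difficulty)   # [0.75, 0.75, 0.25, 0.75]
print(mean_difficulty)   # 0.625
```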
Ideal item difficulty
When using multiple-choice items, account for the probability of answering correctly by chance (g).
Optimal item difficulty = (1 + g) / 2.
The exception to choosing item difficulty around the mid-range involves tests designed for extreme groups.
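A quick worked example of the formula; the 4-option multiple-choice item is an assumption for illustration.

```python
# Worked example of optimal item difficulty = (1 + g) / 2, where g is the
# probability of a correct guess. A 4-option item is assumed for illustration.

g = 1 / 4                        # chance of guessing correctly with 4 options
optimal_difficulty = (1 + g) / 2
print(optimal_difficulty)        # 0.625
```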
Item endorsement: the proportion of examinees who endorse the item.
Item reliability index
An indication of internal consistency.
The product of the item's standard deviation and the correlation between the item and the total scale score.
Items with low reliability indices can be eliminated.
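A minimal sketch of the computation, with made-up item and total scores (statistics.correlation requires Python 3.10+):

```python
# Sketch of the item reliability index: the item's standard deviation times
# the item-total correlation. Scores are made up; statistics.correlation
# requires Python 3.10+.
import statistics

item_scores = [1, 0, 1, 1, 0, 1]          # one item, scored 0/1 per examinee
total_scores = [14, 8, 12, 15, 9, 11]     # each examinee's total test score

item_sd = statistics.pstdev(item_scores)
reliability_index = item_sd * statistics.correlation(item_scores, total_scores)
print(reliability_index)
```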
Item validity index
Correlate the item with the criterion (this helps identify predictively useful test items).
The index is the product of the item's standard deviation and the correlation between the item score and the criterion score.
The usefulness of an item also depends on its dispersion and its ability to discriminate.
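The computation parallels the reliability-index sketch above, swapping the total test score for an external criterion; the data are again illustrative.

```python
# Sketch of the item validity index: the item's standard deviation times the
# correlation between the item score and an external criterion score.
# Data are illustrative; statistics.correlation requires Python 3.10+.
import statistics

item_scores = [1, 0, 1, 1, 0, 1]
criterion_scores = [3.4, 2.1, 3.0, 3.8, 2.5, 2.9]   # e.g., later job-performance ratings

item_sd = statistics.pstdev(item_scores)
validity_index = item_sd * statistics.correlation(item_scores, criterion_scores)
print(validity_index)
```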
Item discrimination index
How well the item discriminates between high scorers and low scorers on the test.
For each item, compare the performance of examinees in the upper vs. lower score ranges.
Formula: d = (U − L) / N
U = number of people in the upper range who got the item right
L = number of people in the lower range who got the item right
N = total number of people in the upper OR lower range
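A small sketch of the formula, assuming illustrative upper- and lower-group response vectors:

```python
# Sketch of the item discrimination index d = (U - L) / N, comparing the
# upper- and lower-scoring groups on one item. Data are illustrative.

upper_group = [1, 1, 1, 0, 1]    # 1 = answered correctly; U = 4
lower_group = [0, 1, 0, 0, 0]    # L = 1
n_per_group = len(upper_group)   # N = size of the upper (or lower) group

d = (sum(upper_group) - sum(lower_group)) / n_per_group
print(d)   # 0.6 -> the item separates high and low scorers reasonably well
```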
Interpreting the IDI: d can range from −1 to +1.
A negative value means more examinees in the lower group than in the upper group answered the item correctly (the item is probably flawed).
A 0 indicates the item does not discriminate between high and low scorers.
The closer the IDI is to +1, the better the item discriminates.
The same approach can also be used to examine the pattern of incorrect responses.
Item characteristic curves
"Graphic representation of item difficulty and discrimination."
Horizontal axis = ability; vertical axis = probability of a correct response.
The curve plots the probability of a correct response against the test taker's standing on the test as a whole.
If the curve rises steadily, or has an S shape, the item is doing a good job of separating low and high scorers.
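The slides describe the curve graphically; one common way to model such an S-shaped curve (an assumption here, not stated on the slides) is a logistic function. The discrimination and difficulty parameters below are made up.

```python
# Illustrative S-shaped item characteristic curve using a two-parameter
# logistic model: P(theta) = 1 / (1 + exp(-a * (theta - b))).
# The discrimination (a) and difficulty (b) values are made up.
import math

def icc(theta, a=1.5, b=0.0):
    """Probability of a correct response at ability level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for theta in [-2, -1, 0, 1, 2]:
    print(theta, round(icc(theta), 2))   # probability climbs as ability increases
```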
Item fairness Items should measure the same thing across groups
Items should have similar ICC across groups Items should have similar predictive validity across groups
Speed tests
Items are easy and similar, so nearly everyone answers them correctly.
What is measured is response time, so traditional item analyses do not apply.
Qualitative item analysis
Test takers' descriptions of the test
"Think aloud" administrations
Expert panels
5. Revising the test: based on the information obtained from the item analysis. New items, and additional testing of those items, may be required.
Cross-validation Once you have your revised test, you need to seek new, independent confirmation of the test's validity. The researcher uses a new sample to determine whether the test predicts the criterion as well as it did in the original sample.
Validity shrinkage Typically, with cross-validation, you will find that the test predicts the criterion less accurately in the new sample.
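A toy sketch of cross-validation and the resulting shrinkage, with made-up derivation and cross-validation samples (statistics.correlation requires Python 3.10+):

```python
# Toy sketch of cross-validation / validity shrinkage: the test-criterion
# correlation is computed in the derivation sample and again in a new,
# independent sample. Data are made up.
import statistics

test_old = [10, 12, 15, 9, 14, 11, 13]          # derivation sample: test scores
crit_old = [3.1, 3.6, 4.2, 2.8, 4.0, 3.2, 3.9]  # derivation sample: criterion

test_new = [10, 13, 15, 8, 12, 14, 9]           # new, independent sample
crit_new = [3.4, 3.3, 3.9, 3.1, 3.2, 3.8, 2.7]

r_original = statistics.correlation(test_old, crit_old)
r_crossval = statistics.correlation(test_new, crit_new)

print(round(r_original, 2), round(r_crossval, 2))
# The cross-validation correlation is typically the lower one: validity shrinkage.
```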
Co-validation
Validating two or more tests at the same time.
Co-norming (norming the tests on the same standardization sample).
Saves money.
Beneficial for tests that are used together.
6. Publishing the test: the final step, which involves development of a test manual.
Production of testing materials
Testing materials that are user-friendly will be more readily accepted. The layout of the materials should allow for smooth administration.
Technical manual Summarizes the technical data and references. Item analyses, scale reliabilities, validation evidence, etc., can be found here.
User's manual Provides instructions for administration, scoring, and interpretation. The Standards for Educational and Psychological Testing recommend that manuals meet several goals (p. 135). Two of the most important: (1) describe the rationale and recommended uses of the test, and (2) provide data on reliability and validity.
Testing is big business