Test Development
Stages: test conceptualization (defining the test); test construction (selecting a measurement scale, developing items); test tryout (item analysis, revising the test).
1. Test conceptualization: defining the scope, purpose, and limits of the test.
Initial questions in test construction: Should the item content be similar or varied? Should the range of difficulty be narrow or broad (ceiling effect vs. floor effect)? How many items should be created?
Which domains should be tapped? The test developer may specify content domains and cognitive skills that must be included on the test. What kind of test item should be used?
2. Test construction
Selecting a scaling method
Levels of measurement: nominal, ordinal, interval, ratio (NOIR)
Scaling methods: most are summative rating scales; they may be unidimensional or multidimensional.
Method of paired comparisons (aka forced choice): the test taker is forced to pick one of two items paired together.
Comparative scaling: test takers sort cards or rank items from "least" to "most."
Categorical scaling: test takers sort cards into one of two or more categories; stimuli are thought to differ quantitatively, not qualitatively.
Likert-type scales: response choices are ordered on a continuum from one extreme to the other (e.g., strongly agree to strongly disagree). Likert scoring assumes an interval scale, although this assumption may not be realistically accurate.
Guttman scales: response choices for each item are statements that lie on a continuum; endorsing the most extreme statement reflects endorsement of the milder statements as well.
Method of equal-appearing intervals (presumed to be interval). For a knowledge scale: obtain true/false statements and have experts rate each item. For an attitude scale: judges rate each item on a Likert-type scale, assuming equal intervals. In both cases, the test taker's total score is based on "weighted" items, with the weights determined by averaging the experts' ratings (see the sketch below).
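A minimal computational sketch of the weighting step described above, assuming a hypothetical judges-by-items matrix of ratings and a made-up set of endorsements (all names and numbers are illustrative):

```python
import numpy as np

# Hypothetical judges-by-items matrix of ratings (e.g., 1-11 favorability judgments).
judge_ratings = np.array([
    [2, 9, 5, 10],   # judge 1's ratings of items 1-4
    [3, 8, 6, 11],   # judge 2
    [1, 9, 4, 10],   # judge 3
])

# Each item's scale value ("weight") is the average of the judges' ratings.
item_scale_values = judge_ratings.mean(axis=0)

# A respondent's score is based only on the items they endorsed,
# weighted by those scale values (here, the mean of the endorsed items' values).
endorsed = np.array([True, False, True, True])
respondent_score = item_scale_values[endorsed].mean()

print(item_scale_values, respondent_score)
```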
Method of absolute scaling: a way to determine the difficulty level of items. Give the items to several age groups, with one age group acting as the anchor. Item difficulty is assessed by noting the performance of each age group on each item compared with the anchor group.
Method of empirical keying: based entirely on empirical findings. The test developer writes a pool of items and administers them to a group of people known to possess the construct and a group known not to possess it. Items are selected based on how well they distinguish one group from the other (see the sketch below).
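A hedged sketch of the item-selection logic, assuming hypothetical 0/1 endorsement data and an arbitrary selection cutoff (none of this reflects a specific published keying procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 0/1 endorsement data: rows = respondents, columns = candidate items.
criterion_group = rng.integers(0, 2, size=(50, 10))   # known to possess the construct
comparison_group = rng.integers(0, 2, size=(50, 10))  # known not to possess it

# Proportion endorsing each item in each group.
p_criterion = criterion_group.mean(axis=0)
p_comparison = comparison_group.mean(axis=0)

# Keep items that separate the groups by some margin (cutoff chosen arbitrarily here).
separation = p_criterion - p_comparison
keyed_items = np.where(np.abs(separation) >= 0.20)[0]

print(separation.round(2))
print("items retained for the empirical key:", keyed_items)
```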
Writing the items
Item format: selected response vs. constructed response
Multiple choice Pros---- Cons----
Matching Pros---- Cons----
True/False Pros---- Cons---- Forced-choice methodology.
Fill in Pros---- Cons----
Short answer objective item Pros--- Cons---
Essay Pros---- Cons----
Scoring items: cumulative model, class/category scoring, ipsative scoring, correction for guessing (a sketch of the standard correction follows).
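A minimal sketch of the conventional correction-for-guessing formula, corrected score = R - W / (k - 1), with hypothetical numbers:

```python
def corrected_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items are not counted as wrong, so they neither add to nor
    subtract from the corrected score.
    """
    return num_right - num_wrong / (num_options - 1)

# Example: 40 right, 12 wrong, 8 omitted on a 4-option multiple-choice test.
print(corrected_score(num_right=40, num_wrong=12, num_options=4))  # 36.0
```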
3. Test tryout Should be conducted with a group that represents the ultimate group of test takers (the people for whom the test is intended). Good items are reliable, valid, and discriminate well.
Before item analysis, look at the variability of scores within the test: is there a floor effect or a ceiling effect?
4. Item analysis helps determine which items should be kept, revised, or deleted.
Item-difficulty index: the proportion of examinees who get the item correct. Averaging across items gives a mean item difficulty for the test.
Ideal item difficulty: when using multiple-choice items, try to account for the probability of chance success. Optimal item difficulty = (1 + g) / 2, where g is the chance success rate (1 divided by the number of response options). An exception to choosing item difficulty around the mid-range involves tests designed for extreme groups.
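A small sketch tying the two ideas together: p values computed from a hypothetical 0/1 response matrix, and the chance-adjusted optimal difficulty for a k-option item (data are illustrative):

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items (1 = correct).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
])

# Item-difficulty index p: proportion of examinees answering each item correctly.
p = responses.mean(axis=0)
print("item difficulties:", p)
print("mean item difficulty:", p.mean())

# Optimal difficulty when chance success is g = 1 / number of options.
def optimal_difficulty(num_options: int) -> float:
    g = 1 / num_options
    return (1 + g) / 2

print(optimal_difficulty(4))  # 0.625 for a four-option multiple-choice item
```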
Item endorsement: the proportion of examinees who endorsed the item.
Item reliability index: an indication of internal consistency; the product of the item standard deviation and the correlation between the item score and the total scale score. Items with low reliability can be eliminated.
Item validity index: correlate the item with the criterion (helps identify predictively useful test items); the index is the item standard deviation multiplied by the correlation between the item score and the criterion score (see the sketch below). The usefulness of an item also depends on its dispersion and its ability to discriminate.
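A sketch of both indices as defined above, using hypothetical item, total, and criterion scores (variable names and data are made up; population SDs and np.corrcoef are one reasonable way to compute the pieces):

```python
import numpy as np

# Hypothetical data for one item: item scores, total test scores, and an external criterion.
item = np.array([1, 0, 1, 1, 0, 1, 1, 0])
total = np.array([38, 22, 41, 35, 25, 44, 30, 20])
criterion = np.array([3.1, 2.0, 3.8, 3.0, 2.4, 4.0, 2.7, 1.9])

s_item = item.std()  # item standard deviation

# Item reliability index: item SD times the item-total correlation.
item_reliability_index = s_item * np.corrcoef(item, total)[0, 1]

# Item validity index: item SD times the item-criterion correlation.
item_validity_index = s_item * np.corrcoef(item, criterion)[0, 1]

print(round(item_reliability_index, 3), round(item_validity_index, 3))
```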
Item discrimination index: how well the item discriminates between high scorers and low scorers on the test. For each item, compare the performance of examinees in the upper vs. lower performance ranges. Formula: d = (U - L) / N, where U = number of people in the upper range who got the item right, L = number of people in the lower range who got it right, and N = total number of people in the upper OR lower range.
Interpreting the IDI: d can vary from -1 to +1. A negative value means more low scorers than high scorers got the item right, suggesting a flawed item. A value of 0 indicates the item does not discriminate at all. The closer the IDI is to +1, the better the item separates high from low scorers. The IDI approach can also be used to examine the pattern of incorrect responses (see the sketch below).
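A small sketch of d = (U - L) / N, comparing upper and lower scoring groups on the total score (the data and the 27% grouping rule are illustrative assumptions, and equal-sized groups are assumed so that N is the size of one group):

```python
import numpy as np

# Hypothetical scored responses (1 = correct) and total test scores for 10 examinees.
item = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
total = np.array([48, 45, 44, 40, 38, 35, 30, 28, 25, 20])

n_group = max(1, int(round(0.27 * len(total))))  # size of each extreme group
order = np.argsort(total)                        # indices from lowest to highest total

lower = item[order[:n_group]]    # item responses of the lowest scorers
upper = item[order[-n_group:]]   # item responses of the highest scorers

# d = (U - L) / N, where U and L are counts correct in each group, N is the group size.
d = (upper.sum() - lower.sum()) / n_group
print(d)
```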
Item characteristic curves: a graphic representation of item difficulty and discrimination. The horizontal axis represents ability; the vertical axis represents the probability of a correct response.
The curve plots the probability of a correct response relative to the test taker's position on the entire test. If the curve rises steadily (an S-shaped, positively sloped curve), the item is doing a good job of separating low and high scorers.
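The notes do not name a particular model, but a common way to draw an ICC is the two-parameter logistic curve P(theta) = 1 / (1 + exp(-a * (theta - b))), where b is the item's difficulty and a its discrimination; a hypothetical sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a=1.5, b=0.0):
    """Two-parameter logistic item characteristic curve.

    theta: ability (horizontal axis)
    a:     discrimination (steepness of the S-shaped curve)
    b:     difficulty (ability level where P(correct) = .50)
    """
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 200)            # ability on the horizontal axis
plt.plot(theta, icc(theta, a=1.5, b=0.0))  # steeper curve = better discrimination
plt.plot(theta, icc(theta, a=0.4, b=0.0))  # flatter curve = weaker discrimination
plt.xlabel("Ability")
plt.ylabel("Probability of a correct response")
plt.show()
```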
Item fairness: items should measure the same thing across groups, should have similar ICCs across groups, and should have similar predictive validity across groups.
Speed tests: items are easy and similar, so nearly everyone gets them correct; what is being measured is response time. Traditional item analyses do not apply.
Qualitative item analysis: test takers' descriptions of the test, "think aloud" administrations, expert panels.
5. Revising the test based on the information obtained from the item analysis. New items and additional testing of these items may be required.
Cross-validation: once you have your revised test, you need to seek new, independent confirmation of the test's validity. The researcher uses a new sample to determine whether the test predicts the criterion as well as it did in the original sample.
Validity shrinkage: typically, with cross-validation, you will find that the test is less accurate in predicting the criterion in the new sample (see the sketch below).
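A minimal simulation of the idea, with all data made up: regression weights derived in the original sample are applied to a new, independent sample, and the drop in the validity coefficient is the shrinkage:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_sample(n, n_items=10):
    """Simulated item scores and a criterion driven only by the first two items."""
    items = rng.normal(0, 1, size=(n, n_items))
    criterion = 0.6 * items[:, 0] + 0.4 * items[:, 1] + rng.normal(0, 1, n)
    return items, criterion

# Derivation sample: estimate least-squares weights for predicting the criterion.
X_a, y_a = simulate_sample(60)
design_a = np.column_stack([np.ones(len(X_a)), X_a])
weights, *_ = np.linalg.lstsq(design_a, y_a, rcond=None)
validity_original = np.corrcoef(design_a @ weights, y_a)[0, 1]

# New, independent sample: apply the same weights and recompute the validity coefficient.
X_b, y_b = simulate_sample(60)
design_b = np.column_stack([np.ones(len(X_b)), X_b])
validity_cross = np.corrcoef(design_b @ weights, y_b)[0, 1]

print(f"validity in the original sample: {validity_original:.2f}")
print(f"validity in the new sample:      {validity_cross:.2f}")  # usually lower (shrinkage)
```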
Co-validation: validating two or more tests at the same time, on the same sample (co-norming when norms are also developed together). Saves money and is beneficial for tests that are used together.
6. Publishing the test: the final step, which involves development of a test manual.
Production of testing materials: testing materials that are user-friendly will be more readily accepted. The layout of the materials should allow for smooth administration.
Technical manual: summarizes the technical data and references. Item analyses, scale reliabilities, validation evidence, etc., can be found here.
User's manual: provides instructions for administration, scoring, and interpretation. The Standards for Educational and Psychological Testing recommend that manuals meet several goals (p 135); two of the most important are to (1) describe the rationale and recommended uses of the test and (2) provide data on reliability and validity.
Testing is big business