Published by Dominic Moore. Modified over 9 years ago.
1
Reliability
Performance on language tests is also affected by factors other than communicative language ability:
(1) Test method facets. These are systematic to the extent that they are uniform from one test administration to the next.
(2) Attributes of the test taker that are not considered part of the language abilities we want to measure, such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background.
(3) Random factors that are largely unpredictable and temporary. These include conditions of the test taker, such as mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idiosyncratic differences in the way different test administrators carry out their responsibilities.
2
Classical true score measurement theory
When we investigate reliability, it is essential to keep in mind the distinction between unobservable abilities, on the one hand, and observed test scores, on the other. Classical true score (CTS) measurement theory consists of a set of assumptions about the relationships between actual, or observed, test scores and the factors that affect these scores. The first assumption of this model states that an observed score on a test comprises two components: a true score, which is due to an individual’s level of ability, and an error score, which is due to factors other than the ability being tested. A second set of assumptions has to do with the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.
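The two assumptions can be written compactly as follows. The symbols are an assumption of this note rather than the slide's own notation: X is the observed score, T the true score, E the error score, and ρ a correlation. Because true and error scores are uncorrelated, the observed score variance splits into a true score part and an error part.

```latex
X = T + E, \qquad \rho_{TE} = 0
\quad\Longrightarrow\quad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
```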
3
Parallel tests
In order for two tests to be considered parallel, we assume that they are measures of the same ability, that is, that an individual’s true score on one test will be the same as his true score on the other. Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal. Put in terms of observed scores, parallel tests are two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability.
4
In summary, reliability is defined in the CTS theory in terms of true score variance. Since we can never know the true scores of individuals, we can never know what the reliability is, but can only estimate it from the observed scores.
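A small simulation can make this concrete. The sketch below is illustrative only: the function names, sample size, and standard deviations are hypothetical choices, and scores are assumed to be normally distributed. It compares the definitional reliability (true score variance divided by observed score variance, available here only because the simulation knows the true scores) with an estimate from observed scores alone, namely the correlation between two simulated parallel forms, which is the standard CTS estimate.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_parallel_forms(n_persons=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Return (true scores, form A scores, form B scores) under the CTS model."""
    rng = random.Random(seed)
    true = [rng.gauss(50.0, true_sd) for _ in range(n_persons)]
    # Same true score on both forms, independent errors with equal variance.
    form_a = [t + rng.gauss(0.0, error_sd) for t in true]
    form_b = [t + rng.gauss(0.0, error_sd) for t in true]
    return true, form_a, form_b


true, form_a, form_b = simulate_parallel_forms()

# Definitional reliability: needs the (normally unknowable) true scores.
definitional = statistics.pvariance(true) / statistics.pvariance(form_a)

# Estimate from observed scores only: correlation between parallel forms.
estimated = pearson(form_a, form_b)

# Value implied by the simulation settings: 100 / (100 + 25).
theoretical = 10.0**2 / (10.0**2 + 5.0**2)

print(definitional, estimated, theoretical)
```

With real test data only `estimated` could be computed, which is the point of the summary above.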
5
Approaches to estimating reliability
Internal consistency
Internal consistency is concerned with how consistent test takers’ performances on the different parts of the test are with each other. Performance on the parts of a reading comprehension test, for example, might be inconsistent if passages are of differing lengths and vary in terms of their syntactic, lexical, and organizational complexity, or involve different topics. One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other. In doing so, we must assume that (1) the two halves both measure the same trait, and (2) individuals’ performance on one half does not depend on how they perform on the other. A convenient way of splitting a test into halves might be to simply divide it into the first and second halves. An alternative is the odd-even method, in which the odd-numbered items form one half and the even-numbered items the other.
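The odd-even split can be sketched in a few lines. The response matrix below is hypothetical, and the final step applies the Spearman-Brown correction, a standard adjustment that is not mentioned on the slide: because each half is only half as long as the full test, the half-test correlation understates the reliability of the whole test.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def odd_even_split_half(item_scores):
    """Return (half-test correlation, Spearman-Brown corrected estimate).

    item_scores holds one list of item scores (for example 0/1) per person.
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown correction for doubling the test length.
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Hypothetical data: six test takers, four dichotomously scored items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
r_half, r_full = odd_even_split_half(responses)
print(r_half, r_full)
```

A first-half versus second-half split would only change the two slicing expressions; the odd-even split is often preferred when item difficulty or fatigue changes over the course of the test.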
6
Stability (test-retest reliability)
There are also testing situations in which it may be necessary to administer a test more than once, for example, if a researcher were interested in measuring subjects’ language ability at several different points in time, as part of a time-series design. In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. The primary concern in this approach is assuring that the individuals who take the test do not themselves change differentially in any systematic way between test administrations. That is, we must assume that both practice and learning (or unlearning) effects are either uniform across individuals or random.
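The reason the uniformity assumption matters can be shown with a toy example. All scores and gains below are hypothetical. A gain that is the same for everyone leaves the test-retest correlation unchanged, whereas gains that differ systematically from person to person reorder the test takers and pull the correlation down, even though the test itself has not become less consistent.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores for eight test takers on two administrations.
first = [40, 45, 50, 55, 60, 65, 70, 75]
retest = [42, 44, 51, 53, 62, 64, 71, 74]

# Uniform practice effect: everyone gains the same five points.
retest_uniform = [score + 5 for score in retest]

# Differential change: some individuals gain a lot, others lose ground.
gains = [25, 0, 20, 0, 15, 0, -5, -10]
retest_differential = [score + gain for score, gain in zip(first, gains)]

r_baseline = pearson(first, retest)
r_uniform = pearson(first, retest_uniform)
r_differential = pearson(first, retest_differential)
print(r_baseline, r_uniform, r_differential)
```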
7
Equivalence (parallel form reliability)
Equivalence is of particular interest in testing situations where alternate forms of the test may actually be used, either for security reasons or to minimize the practice effect. In some situations it is not possible to administer the test to all examinees at the same time, and the test user does not wish to take the chance that individuals who take the test first will pass on information about the test to later test takers. In other situations, the test user may wish to measure individuals’ language abilities frequently over a period of time, wants to be sure that any changes in performance are not due to practice effect, and therefore uses alternate forms.
8
Problems with the classical true score model
In many testing situations, these apparently straightforward procedures for estimating the effects of different sources of error are complicated by the fact that the different sources of error may interact with each other, even when we carefully design our reliability study. A second, related problem is that the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error.
9
Generalizability theory
Generalizability theory (G-theory) provides a framework for investigating the relative effects of different sources of variance in test scores. On the basis of an individual’s performance on a test, we generalize to her performance in other contexts. The more reliable the sample of performance, or test score, is, the more generalizable it is.
The application of G-theory to test development and use takes place in two stages. First, the test developer designs and conducts a study to investigate the sources of variance that are of concern or interest. This involves identifying the relevant sources of variance (including traits, method facets, personal attributes, and random factors), designing procedures for collecting data that will permit the test developer to clearly distinguish the different sources of variance, administering the test according to this design, and then conducting the appropriate analyses. On the basis of this generalizability study (‘G-study’), the test developer obtains estimates of the relative sizes of the different sources of variance (‘variance components’).
10
Depending on the outcome of this G-study, the test developer may revise the test or the procedures for administering it, and then conduct another G-study. Or, if the results of the G-study are satisfactory (if sources of error variance are minimized), the test developer proceeds to the second stage, a decision study (‘D-study’). Second, in a D-study, the test developer administers the test under operational conditions, that is, under the conditions in which the test will be used to make the decisions for which it is designed, and uses G-theory procedures to estimate the magnitude of the variance components. The application of G-theory thus enables test developers and test users to specify the different sources of variance that are of concern for a given test use, to estimate the relative importance of these different sources simultaneously, and to employ these estimates in the interpretation and use of test scores.
11
In general: G-theory takes into account all possible sources of error (due to individual factors, situational characteristics of the evaluator, and instrumental variables) and tries to differentiate them by applying the classical procedures of analysis of variance (ANOVA).
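As a rough illustration of how ANOVA separates sources of variance, the sketch below estimates variance components for the simplest G-study design: every person is rated once by every rater (a crossed persons by raters design). The ratings, the function names, and the choice of raters as the only facet are hypothetical. The second function projects a generalizability coefficient for a given number of raters, which is a conventional way of using the components and is offered here as an assumption rather than as the procedure described on the slides.

```python
def variance_components(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores[p][r] is the single rating given to person p by rater r.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares for persons, raters, and the residual
    # (person-by-rater interaction confounded with random error).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_res = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected mean squares solved for the components; negative
    # estimates are set to zero, a common convention.
    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


def generalizability_coefficient(components, n_raters):
    """Relative coefficient when scores are averaged over n_raters raters."""
    error = components["residual"] / n_raters
    return components["person"] / (components["person"] + error)


# Hypothetical ratings: five persons (rows) scored by three raters (columns).
ratings = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 2],
    [3, 4, 3],
]
components = variance_components(ratings)
g_one_rater = generalizability_coefficient(components, 1)
g_three_raters = generalizability_coefficient(components, 3)
print(components, g_one_rater, g_three_raters)
```

Unlike a single CTS reliability estimate, the output keeps the rater effect and the residual apart, so a test developer can see which source of error is worth reducing.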
12
Item response theory
A major limitation to CTS theory is that it does not provide a very satisfactory basis for predicting how a given individual will perform on a given item. There are two reasons for this. First, CTS theory makes no assumptions about how an individual’s level of ability affects the way he performs on a test. Second, the only information that is available for predicting an individual’s performance on a given item is the index of difficulty, which is simply the proportion of individuals in a group that responded correctly to the item. Thus, the only information available in predicting how an individual will answer an item is the average performance of a group on this item. Because of this and other limitations in CTS theory (and G-theory, as well), psychometricians have developed a number of mathematical models for relating an individual’s test performance to that individual’s level of ability.
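The index of difficulty mentioned above is easy to compute, and doing so shows the limitation directly. In the sketch below the response matrix and the function name are hypothetical. Each item gets one proportion for the whole group, so the prediction is the same for the strongest and the weakest test taker.

```python
def item_difficulty(responses):
    """Proportion of the group answering each item correctly.

    responses holds one list of 0/1 item scores per person.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[item] for row in responses) / n_persons
        for item in range(n_items)
    ]


# Hypothetical data: four test takers, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
p_values = item_difficulty(responses)
print(p_values)  # one group-level proportion per item
```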
13
Item response theory makes stronger predictions about individuals’ performance on individual items, their levels of ability, and about the characteristics of individual items. Item characteristic curves describe the relationship between the test taker’s ability and his performance on a given item. The types of information about item characteristics may include: (1) the degree to which the item discriminates among individuals of differing levels of ability (the ‘discrimination’ parameter a), (2) the level of difficulty of the item (the ‘difficulty’ parameter b), and (3) the probability that an individual of low ability can answer the item correctly (the ‘pseudo-chance’ or ‘guessing’ parameter c).
14
An individual’s expected performance on a particular test question, or item, is a function of both the level of difficulty of the item and the individual’s level of ability.
15
Item Characteristic Curves
Specific assumptions about the relationship between the test taker's ability and his performance on a given item are explicitly stated in the mathematical formula, or item characteristic curve (ICC).
16
Item Characteristic Curves
The form of the ICC is determined by the particular mathematical model on which it is based. The types of information about item characteristics may include: (1) the degree to which the item discriminates among individuals of differing levels of ability (the 'discrimination' parameter a);
17
Item Characteristic Curves
(2) the level of difficulty of the item (the 'difficulty' parameter b), and (3) the probability that an individual of low ability can answer the item correctly (the 'pseudo-chance' or 'guessing' parameter c). One of the major considerations in the application of IRT models, therefore, is the estimation of these item parameters.
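The slides do not give a formula for the ICC. One widely used model that has exactly the three parameters a, b, and c is the three-parameter logistic model, sketched below as an assumption. The function name, the two example items, and the omission of the scaling constant 1.7 that some presentations include are all choices made for this illustration.

```python
import math


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic model.

    theta: ability of the test taker
    a: discrimination, b: difficulty, c: pseudo-chance (guessing) parameter
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Two hypothetical items with the same guessing level but different
# difficulty and discrimination.
easy_flat_item = {"a": 0.8, "b": -1.0, "c": 0.20}
hard_steep_item = {"a": 2.0, "b": 1.0, "c": 0.20}

abilities = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
curve_easy = [icc_3pl(theta, **easy_flat_item) for theta in abilities]
curve_hard = [icc_3pl(theta, **hard_steep_item) for theta in abilities]

for theta, p_easy, p_hard in zip(abilities, curve_easy, curve_hard):
    print(f"{theta:+.1f}  {p_easy:.3f}  {p_hard:.3f}")
```

Unlike the group-level index of difficulty, the predicted probability here changes with the individual's ability, which is what allows IRT to make predictions about a given person on a given item.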
18
ICC
[Figure: item characteristic curves; horizontal axis: ability scale, vertical axis: probability]
Pseudo-chance parameter c: p = 0.20 for two items.
Difficulty parameter b: the ability level at which the probability of a correct response is halfway between the pseudo-chance parameter and one.
Discrimination parameter a: proportional to the slope of the ICC at the point of the difficulty parameter. The steeper the slope, the greater the discrimination parameter.
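Both statements about b and a can be checked against a specific model. The derivation below assumes the three-parameter logistic form without a scaling constant, which is an assumption of this note and not something the slide specifies; θ denotes ability and P(θ) the probability of a correct response.

```latex
P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}
\qquad\Longrightarrow\qquad
P(b) = c + \frac{1 - c}{2} = \frac{1 + c}{2}

\frac{dP}{d\theta}
  = \frac{a\,(1 - c)\,e^{-a(\theta - b)}}{\bigl(1 + e^{-a(\theta - b)}\bigr)^{2}}
\qquad\Longrightarrow\qquad
\left.\frac{dP}{d\theta}\right|_{\theta = b} = \frac{a\,(1 - c)}{4}
```

Under this assumption, an item with c = 0.20 has a probability of 0.60 at θ = b, halfway between 0.20 and 1, and for a fixed c the slope at that point grows in direct proportion to a.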