Reliability Performance on language tests is also affected by factors other than communicative language ability. These other factors are of three kinds:
(1) test method facets, which are systematic to the extent that they are uniform from one test administration to the next;
(2) attributes of the test taker that are not considered part of the language abilities we want to measure, such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background;
(3) random factors that are largely unpredictable and temporary. These include temporary conditions of the test taker, such as his mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idiosyncratic differences in the way different test administrators carry out their responsibilities.

Classical true score measurement theory When we investigate reliability, it is essential to keep in mind the distinction between unobservable abilities, on the one hand, and observed test scores, on the other. Classical true score (CTS) measurement theory consists of a set of assumptions about the relationships between actual, or observed, test scores and the factors that affect these scores. The first assumption of this model states that an observed score on a test comprises two factors or components: a true score that is due to an individual's level of ability, and an error score that is due to factors other than the ability being tested. A second set of assumptions has to do with the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.
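
These assumptions can be written compactly in standard CTS notation (the symbols below are a conventional choice, not given on the slides):

```latex
% Observed score = true score + error score
x = x_t + x_e
% Error is random: it has zero mean and is uncorrelated with the true score
\mathbb{E}(x_e) = 0, \qquad \operatorname{Cov}(x_t, x_e) = 0
```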

Parallel tests In order for two tests to be considered parallel, we assume that they are measures of the same ability; that is, an individual's true score on one test will be the same as his true score on the other. Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal. It follows that parallel tests are two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability.
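
In the same notation, the two defining conditions can be stated as follows (again a conventional formalization rather than a quotation from the slides):

```latex
% (1) Equal true scores for every test taker
x_t = x'_t
% (2) Equal error variances
\sigma^2(x_e) = \sigma^2(x'_e)
```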

In summary, reliability is defined in CTS theory in terms of true score variance. Since we can never know the true scores of individuals, we can never know what the reliability is; we can only estimate it from the observed scores.
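
This definition, standard in CTS treatments, is the ratio of true-score variance to observed-score variance; the correlation between two parallel forms provides an estimate of it (notation as above):

```latex
% Reliability: proportion of observed-score variance due to true scores
\rho_{xx'} = \frac{\sigma_t^2}{\sigma_x^2} = 1 - \frac{\sigma_e^2}{\sigma_x^2}
```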

Approaches to estimating reliability: internal consistency Internal consistency is concerned with how consistent test takers' performances on the different parts of the test are with each other. Performance on the parts of a reading comprehension test, for example, might be inconsistent if passages are of differing lengths and vary in terms of their syntactic, lexical, and organizational complexity, or involve different topics. One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other. This approach assumes that (1) the two halves measure the same trait, and (2) individuals' performance on one half does not depend on how they perform on the other. A convenient way of splitting a test into halves might be simply to divide it into the first and second halves; a common alternative is the odd-even method, in which the odd-numbered items form one half and the even-numbered items the other.
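
A minimal computational sketch of the odd-even method in Python (the data are hypothetical; the Spearman-Brown step, a standard companion to split-half estimation, projects the half-test correlation up to the full test length):

```python
import numpy as np

# scores: one row per test taker, one column per item (hypothetical data)
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

# Odd-even split: odd-numbered items vs. even-numbered items
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

# Correlation between the two half-test scores
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimated reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.3f}, split-half reliability = {r_full:.3f}")
```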

Stability (test-retest reliability) There are also testing situations in which it may be necessary to administer a test more than once, for example, if a researcher were interested in measuring subjects' language ability at several different points in time as part of a time-series design. In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. The primary concern in this approach is assuring that the individuals who take the test do not themselves change differentially in any systematic way between test administrations. That is, we must assume that both practice and learning (or unlearning) effects are either uniform across individuals or random.
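
Computationally, the test-retest estimate is just the correlation between the two administrations; a minimal sketch (scores are hypothetical):

```python
import numpy as np

# Total scores from two administrations of the same test (hypothetical data)
time1 = np.array([52, 47, 60, 38, 55, 41, 49])
time2 = np.array([50, 49, 58, 40, 53, 44, 47])

# Stability estimate: Pearson correlation between the two administrations
r_stability = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability estimate = {r_stability:.3f}")
```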

Equivalence (parallel forms reliability) This approach is of particular interest in testing situations where alternate forms of the test may actually be used, either for security reasons or to minimize practice effects. In some situations it is not possible to administer the test to all examinees at the same time, and the test user does not wish to take the chance that individuals who take the test first will pass on information about it to later test takers. In other situations, the test user may wish to measure individuals' language abilities frequently over a period of time, and wants to be sure that any changes in performance are not due to practice effects, and therefore uses alternate forms.

Problems with the classical true score model In many testing situations, these apparently straightforward procedures for estimating the effects of different sources of error are complicated by the fact that the different sources may interact with each other, even when we carefully design our reliability study. A second, related problem is that the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error.

Generalizability theory G-theory provides a framework for investigating the relative effects of different sources of variance in test scores. On the basis of an individual's performance on a test, we generalize to her performance in other contexts; the more reliable the sample of performance, or test score, is, the more generalizable it is. The application of G-theory to test development and use takes place in two stages. First, the test developer designs and conducts a study to investigate the sources of variance that are of concern or interest. This involves identifying the relevant sources of variance (including traits, method facets, personal attributes, and random factors), designing procedures for collecting data that will permit the test developer to clearly distinguish the different sources of variance, administering the test according to this design, and then conducting the appropriate analyses. On the basis of this generalizability study ('G-study'), the test developer obtains estimates of the relative sizes of the different sources of variance ('variance components').
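
For a concrete illustration (the design here is assumed, not specified on the slides): in a fully crossed design in which every person p is scored by every rater r, the observed-score variance decomposes into person, rater, and residual components, and a generalizability coefficient for relative decisions follows from them:

```latex
% Variance decomposition for a crossed persons x raters design
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}
% Generalizability coefficient for relative decisions with n_r raters
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n_r}
```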

Depending on the outcome of this G-study, the test developer may revise the test or the procedures for administering it, and then conduct another G-study. Or, if the results of the G-study are satisfactory (if sources of error variance are minimized), the test developer proceeds to the second stage, a decision study ('D-study'). In a D-study, the test developer administers the test under operational conditions, that is, under the conditions in which the test will be used to make the decisions for which it is designed, and uses G-theory procedures to estimate the magnitude of the variance components. The application of G-theory thus enables test developers and test users to specify the different sources of variance that are of concern for a given test use, to estimate the relative importance of these different sources simultaneously, and to employ these estimates in the interpretation and use of test scores.

In general: G-theory takes into account all possible sources of error (due to individual factors, situational characteristics of the evaluator, and instrumental variables) and tries to differentiate among them by applying the classical procedures of analysis of variance (ANOVA).
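
A minimal sketch of that ANOVA-based estimation for the persons x raters design above, ending with a D-study style projection (the ratings are hypothetical; the formulas are the standard expected-mean-square solutions for a fully crossed design with one observation per cell):

```python
import numpy as np

# ratings: rows = persons, columns = raters (hypothetical data)
ratings = np.array([
    [4.0, 5.0, 4.0],
    [2.0, 3.0, 2.0],
    [5.0, 5.0, 4.0],
    [3.0, 2.0, 3.0],
    [4.0, 4.0, 5.0],
])
n_p, n_r = ratings.shape
grand = ratings.mean()

# Mean squares for persons, raters, and residual (one observation per cell)
ms_p = n_r * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((ratings.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = (ratings
         - ratings.mean(axis=1, keepdims=True)
         - ratings.mean(axis=0, keepdims=True)
         + grand)
ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# G-study: variance components from the expected mean squares
var_p = max((ms_p - ms_res) / n_r, 0.0)   # person (ability) variance
var_r = max((ms_r - ms_res) / n_p, 0.0)   # rater variance (systematic error)
var_res = ms_res                          # interaction + random error

# D-study projection: generalizability coefficient for different rater counts
for n_raters in (1, 2, 3, 5):
    g_coef = var_p / (var_p + var_res / n_raters)
    print(f"raters = {n_raters}: E(rho^2) = {g_coef:.3f}")
```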

Item response theory A major limitation of CTS theory is that it does not provide a very satisfactory basis for predicting how a given individual will perform on a given item. There are two reasons for this. First, CTS theory makes no assumptions about how an individual's level of ability affects the way he performs on a test. Second, the only information that is available for predicting an individual's performance on a given item is the index of difficulty, which is simply the proportion of individuals in a group that responded correctly to the item. Thus, the only information available for predicting how an individual will answer an item is the average performance of a group on that item. Because of this and other limitations in CTS theory (and G-theory as well), psychometricians have developed a number of mathematical models for relating an individual's test performance to that individual's level of ability.

Item response theory (IRT) makes stronger predictions about individuals' performance on individual items, about their levels of ability, and about the characteristics of individual items. An item characteristic curve describes the relationship between the test taker's ability and his performance on a given item. The types of information about item characteristics may include: (1) the degree to which the item discriminates among individuals of differing levels of ability (the 'discrimination' parameter a); (2) the level of difficulty of the item (the 'difficulty' parameter b); and (3) the probability that an individual of low ability can answer the item correctly (the 'pseudo-chance' or 'guessing' parameter c).

An individual’s expected performance on a particular test question, or item, is a function of both the level of difficulty of the item and the individual’s level of ability.

Item Characteristic Curves Specific assumptions about the relationship between the test taker's ability and his performance on a given item are explicitly stated in a mathematical formula, the item characteristic curve (ICC). The form of the ICC is determined by the particular mathematical model on which it is based, and it involves the three item parameters just described: discrimination (a), difficulty (b), and pseudo-chance (c). One of the major considerations in the application of IRT models, therefore, is the estimation of these item parameters.
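
In the widely used three-parameter logistic (3PL) model, for example, these parameters combine into a single ICC formula (a standard form, not spelled out on the slides):

```latex
% Probability that a person of ability theta answers item i correctly
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```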

[Figure: item characteristic curves, plotting the probability of a correct response against the ability scale. The pseudo-chance parameter c is the lower asymptote of the curve (p = 0.20 for two of the items shown); the difficulty parameter b is the point on the ability scale at which the probability is halfway between the pseudo-chance parameter and one; the discrimination parameter a is proportional to the slope of the ICC at the point of the difficulty parameter: the steeper the slope, the greater the discrimination parameter.]
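
A minimal sketch of such a curve in Python, using the 3PL formula above with illustrative parameter values chosen to mirror the figure (c = 0.20; b is the ability where the probability is halfway between c and 1):

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: P(correct | ability theta)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)               # points on the ability scale
p = icc_3pl(theta, a=1.2, b=0.0, c=0.20)    # illustrative parameters

# At theta = b the probability is halfway between c and 1 (here 0.60),
# and the slope of the curve there is proportional to a.
print(np.round(p, 3))
```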