Technical Issues – Two Concerns: Validity and Reliability. Let's turn our attention to the technical issues related to measurement. Two very important concerns are the validity and reliability of the instruments being used.
Data Collection – Quiz 1: Answer the five questions on Quiz 1. Before we begin our discussion, I'd like to take a few moments to work on an exercise. Use the link on this slide to access Quiz 1. Take this quiz assuming I'm going to use your score as a grade in the class. You can pause the slide show if needed. PAUSE
Data Collection – Quiz 1 Answers: Score your paper using the following key. The answers to Questions 1-5 are A, B, A, B, and B. Score your paper and remember the number of items you answered correctly.
Data Collection – Quiz 1: How well did you do? Should I use this score as a part of your grade? Does this score indicate your level as a graduate student? "What we have here is a serious lack of communication!" If you did well, would you mind if I used your score as a part of your grade for EDF 800? If you didn't do well, and most students do not, would you mind? If you did well, can I conclude you are exceptionally bright? If you didn't do well, can I conclude you are quite challenged academically? Most everyone objects to using their score from this quiz, good or bad, for any purpose because it isn't fair. The test simply doesn't cover material appropriate to an introductory educational research course. We've studied absolutely none of the content on this quiz; expecting you to know it just isn't right. Welcome to the technical world of instrumentation.
Technical Issues – Validity: the extent to which interpretations made from a test score are appropriate. Validity is the most important technical characteristic of a test; it is situation specific; it refers not to the instrument itself but to the interpretations of scores on the instrument; and it is best thought of in terms of degree. The formal definition of validity is written on this slide. If you think for a moment, the definition makes a lot of sense. When you give a test to the students in your class, you use the scores to make some decisions about each student. If one student had a very high score, you usually "infer" this is a "good" student. If another student had a very low score, you could "infer" this student was having serious difficulties mastering the material. The question ultimately comes down to whether the inferences or decisions you make are appropriate, meaningful, or useful. The answer depends on two technical characteristics of the test.
Technical Issues – Validity (continued): There are four types of validity evidence. Content validity: to what extent does the test measure what it is supposed to measure? It involves item validity and sampling validity, and it is determined by expert judgment. If your test covered appropriate content for the instruction provided to students, then the extent to which your inferences are appropriate, meaningful, or useful is high. If, like the quiz I gave you, the content is not relevant to the instruction, your inferences are not appropriate, meaningful, or useful to anyone. This is known as content validity and is a fundamental characteristic of any test. Please note that whether a test has evidence of content validity or not, nothing stops someone from using the scores to make decisions. Has anyone ever taken an exam where the professor wrote items that had nothing to do with what was being taught? Did he or she still use your scores in your grades? Was that fair? Appropriate? Meaningful? Useful? I need to caution you about the situation-specific nature of validity evidence. The quiz you took earlier was not content valid for this course, but it was taken from a History of Education exam where every question was appropriate to the instruction. In our case the test was not content valid; in the case of another course it is 100% content valid.
Technical Issues – Validity (continued): Construct validity: the extent to which a test measures the construct it represents. The underlying difficulty is defining constructs, and it can be estimated in many ways. Criterion-related validity comes in two forms. Predictive: to what extent does the test predict a future performance? Concurrent: to what extent does the test predict a performance measured at the same time? Both are estimated by correlations between two measures. Sometimes the purpose of a test is not to measure specific, concrete content like that we are studying. Often what is being measured is very nebulous or abstract in nature. How would you measure my intelligence? Probably with an intelligence test, but would the test be developed around Binet's conception of intelligence as verbal and mathematical reasoning or Gardner's eight or nine (I forget the number) multiple intelligences? Obviously the "tests" would look very different depending on how the researcher interprets the "construct" of intelligence. While closely related to content validity in that we worry about whether the test "measures what it is supposed to measure," construct validity is difficult to estimate. If a test has sufficient evidence to suggest it measures intelligence, my score on that test and your use of it is reasonable. If not, any decision you make on the basis of that score is not appropriate, meaningful, or useful. Many times we find ourselves using test scores to predict a student's performance on some later task. The ACT, SAT, GRE, and MCAT are good examples of such tests. Scores on the ACT or SAT are supposed to predict a student's performance in the freshman year of college. Do they do so well? If so, we can make some decisions about whether or not to admit a student to a university; if not, such a decision is not appropriate, meaningful, or useful.
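To make the correlation idea concrete, here is a minimal sketch in Python of how a predictive validity coefficient would be computed. Every number below is hypothetical, invented purely for illustration; these are not real admission-test scores or GPAs.

```python
# A minimal sketch of a criterion-related (predictive) validity estimate.
# All data are made up: hypothetical admission-test scores and the
# freshman GPAs those same students later earned.
import numpy as np

test_scores  = np.array([21, 28, 24, 30, 18, 26, 33, 22])          # predictor
freshman_gpa = np.array([2.4, 3.1, 2.8, 3.5, 2.0, 3.0, 3.8, 2.6])  # criterion

# The predictive validity coefficient is simply the correlation between
# the test and the later criterion; the closer to 1, the better the test
# predicts the future performance.
r = np.corrcoef(test_scores, freshman_gpa)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")

# For concurrent validity the arithmetic is identical; the criterion is
# just measured at the same time as the test instead of later.
```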
Technical Issues – Validity (continued): Consequential validity: to what extent are the consequences that occur from the test harmful? It is estimated by empirical evidence and expert judgment. Factors affecting validity include unclear test directions, confusing and ambiguous test items, and vocabulary that is too difficult for test takers. Consequential validity is a relatively new way to think about validity evidence. As the definition implies, we are interested in the consequences of testing that might prove to be particularly disconcerting for some students. For example, the Louisiana Department of Education mandates that all special needs students take the LEAP test that corresponds to the grade in which they are enrolled. Often this means a student is taking an exam that is well beyond their ability to read, much less understand. Is this fair to that student? What about non-English-speaking students? Is it fair to give them grade-level tests that are completely dependent on the ability to read the English language? Welcome to the concerns related to consequential validity. There are many factors that can affect the validity of a test. Can you see how each of the three factors on this slide will have a negative effect on validity?
Technical Issues – Factors affecting validity (continued): overly difficult and complex sentence structure; inconsistent and subjective scoring; untaught items; failure to follow standardized administration procedures; and cheating by the participants or by someone teaching to the test items. How about these factors?
Technical Issues – Reliability: the degree to which a test consistently measures whatever it is measuring. It is expressed as a coefficient ranging from 0 to 1, and it is a necessary but not sufficient characteristic of a valid test. Reliability is the second technical characteristic important to measurement. Reliability is basically the consistency with which we measure. If you took Exam 1 a first time and made a 40, a second time and made a 45, and a third time and made a 43, what score should I use to provide a reliable estimate of your knowledge of the material? There are three perspectives from which reliability is viewed: test reliability, score reliability, and agreement.
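Just to put numbers on that question, here is a tiny Python sketch using the three scores from the example; the spread of the repeated scores is exactly the inconsistency that reliability is concerned with.

```python
# A quick look at the three repeated scores from the narration (40, 45, 43):
# the mean summarizes the attempts, and the standard deviation shows how
# inconsistently the test measured the same student.
import statistics

scores = [40, 45, 43]
print(f"Mean of the three attempts: {statistics.mean(scores):.1f}")
print(f"Spread (standard deviation): {statistics.stdev(scores):.1f}")
```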
Technical Issues – Test reliability: Stability is consistency over time with the same instrument (test-retest), estimated by a correlation between the two administrations of the same test. Equivalence is consistency between two parallel tests administered at the same time (parallel forms), estimated by a correlation between the parallel tests. When speaking of test reliability, we estimate the extent to which the results of a test are likely to be the same. An estimate could be calculated using two administrations of the same test. This is known as stability or test-retest reliability. Coefficients close to 1 suggest a test that produces very consistent scores; those close to 0 suggest a lack of consistency for the test. Sometimes we don't want to give one test twice (what a pain for the students!). Besides, there is often a high chance that students will remember or correct something between the first and second administrations of the test. When we develop two tests that examine the same material with different items, we create an opportunity to estimate reliability through equivalence or parallel forms. Comparing the scores from Form 1 of a test to those of Form 2 results in a coefficient that ranges from 0 to 1. Again, the closer to 1, the more consistent the test.
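Here is a minimal sketch of the stability calculation in Python, assuming made-up scores for ten students on two administrations of the same test; the same code covers equivalence if the second array holds parallel-form scores instead.

```python
# A minimal sketch of a stability (test-retest) reliability estimate.
# The scores are hypothetical: ten students taking one test twice.
import numpy as np

first  = np.array([40, 45, 38, 50, 42, 47, 35, 44, 41, 48])
second = np.array([42, 44, 40, 49, 41, 48, 37, 43, 40, 50])

# The reliability coefficient is the Pearson correlation between the two
# administrations; values near 1 mean the test orders students consistently.
r = np.corrcoef(first, second)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")

# The same calculation estimates equivalence (parallel forms) when `second`
# instead holds scores from a parallel form given at roughly the same time.
```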
Technical Issues – Test reliability (continued): Internal consistency involves artificially splitting the test into halves. There are several coefficients (split halves, KR-20, KR-21, and Cronbach's alpha), and all of them provide estimates ranging from 0 to 1. If one test is hard to develop, think about two! Think also about giving a second form of the test to your students! I'm sure they'd be delighted to help you out! Because of this limitation, researchers have developed an estimate of test reliability called internal consistency. In essence, we think of one test of, say, 100 items as two tests of 50 items each. We "split" the test into halves. The two most common estimates of internal consistency are the KR-20 and Cronbach's alpha. The former is used when the items for a test are scored as right or wrong; the latter when the answers can fall on a continuous scale. An example of this is a Likert scale, where a student responds on a five-point scale ranging from strongly disagree to strongly agree. Regardless of which estimate is used, the coefficients always range from 0 to 1, with 1 representing greater reliability.
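Here is a minimal sketch of the Cronbach's alpha calculation in Python, assuming a small made-up response matrix where rows are students and columns are items scored 1 for right and 0 for wrong.

```python
# A minimal sketch of Cronbach's alpha on hypothetical data: six students
# answering five right/wrong (1/0) items. Rows = students, columns = items.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Alpha = (k / (k-1)) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

responses = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])

print(f"Cronbach's alpha: {cronbach_alpha(responses):.3f}")

# With 0/1 items like these, KR-20 is essentially the same formula: each
# item variance becomes p(1 - p), where p is the proportion answering
# the item correctly.
```

The same function works unchanged for Likert-type items; only the right/wrong restriction that distinguishes KR-20 goes away.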