Reliability Lesson Six What is reliability? When we talk about reliability, we are talking about the results or scores: the test results are consistent. If I give a quiz to this class today and then give the same quiz to the same group again, I should get a similar result: if you rank no. 1 this time, you will rank no. 1 the other time as well. Reliability concerns the test itself; we are not concerned here with the test content.
Case Imagine that a hundred students take a 100-item test at three o’clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if in fact they had not taken the test on the Thursday but had taken it at three o’clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday? The answer to this question must be no. Even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgment on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval, human beings are not like that; they simply do not behave in exactly the same way on every occasion, even when the circumstances seem identical. What we have to do is construct, administer and score tests in such a way that the scores actually obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students with the same ability, but at a different time. The more similar the scores would have been, the more reliable the test is said to be.
Contents Definition of reliability Factors contributing to unreliability Types of reliability Indication of reliability: Reliability coefficient Ways of obtaining reliability coefficient: Alternate/Parallel forms Test-retest Split-half & KR-21/KR-20 Two ways of testing reliability How to make test more reliable Online video http://www.le.ac.uk/education/testing/ilta/faqs/main.html
Definition of Reliability (1) “The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24). If you give the same test to the same testees on two different occasions, the test should yield similar results; in other words, at different times the results should be the same. Test forms can be equivalent (parallel) forms, and either form should give you similar results. If the test is scored by different raters, the scores should likewise be similar. Reliability also concerns the raters and how the test is given: the administration will affect the reliability.
Definition of Reliability (2) A reliable test is consistent and dependable: scores are consistent and reproducible. Reliability is the accuracy or precision with which a test measures something; that is, the consistency, dependability, or stability of test results. We are talking about the test results; again, we are not concerned with the content.
Factors Contributing to Unreliability X = T + E (observed score = true score + error score). Reliability is concerned with freedom from nonsystematic fluctuation: fluctuations in the students, the scoring, the test administration, and the test itself. What kinds of factors influence test reliability? The test takers (students); the scoring (the raters); how the test is given (test administration); and problems with the test itself, such as ambiguous items. In reality, we have to acknowledge that testing involves error. X is the test result or score, and it consists of two parts: the test taker's true ability (T) and the error score (E). For example, on a multiple-choice item with four options, you have a 25% chance of getting it right even with your eyes closed. If you guess right, that is not your own ability; it is error score. So we have to recognize that in any testing situation there is nonsystematic fluctuation. Some error really comes from the test takers themselves: you expect to get the right answers, but you do not perform at your normal level (carelessness, an off day, staying up too late).
Types of Reliability Student- (or person-) related reliability; rater- (or scorer-) related reliability, which divides into intra-rater reliability and inter-rater reliability; test administration reliability; and test (or instrument-related) reliability. The first has to do with the test takers themselves. The second some people call scorer reliability: intra-rater means one rater, the same person, while inter-rater means two or more raters. The third concerns the way the test is given: the classroom conditions, the timing (e.g., starting too early, before the bell). The last has to do with the test itself.
Student-Related Reliability (1) Here the source of the error score is the test takers: temporary illness, fatigue, anxiety, other physical or psychological factors, and test-wiseness (i.e., strategies for efficient test taking). We will talk about each of them one by one. "Students" refers to test takers. Fatigue: too tired, stayed up late. Anxiety: too nervous; in an oral interview, you cannot get the words out. Test-wiseness means test-taking strategies; for example: How do you distribute your time? If you have to finish four parts in one hour, you might spread your time evenly across them. In multiple choice, how do you improve your chance of a correct answer? With four choices, eliminate the unlikely ones. Strategies like these are called test-wiseness. Stick to the passage and don't over-think; otherwise it will lead you in the wrong direction.
Student-Related Reliability (2) Principles: assess on several occasions; assess when the person is prepared and best able to perform well; ensure that the person understands what is expected (e.g., instructions are clear). These are principles to follow, since the teacher cannot control students' behavior: give more tests and average the results, giving students more chances, rather than letting the final exam supply the whole grade. And make sure that students understand what is expected, for example by giving oral instructions as well.
Rater (or Scorer) Reliability (1) Fluctuations include human error, subjectivity, and bias. Principles: use experienced, trained raters; use more than one rater; raters should carry out their assessments independently. Human error means, for example, a rater carelessly giving the wrong grade. Compositions and oral tests are hard to score objectively. Raters need to be independent and not be influenced by other raters.
Rater Reliability (2) Two kinds of rater reliability: intra-rater reliability and inter-rater reliability. A single rater has to use the same scoring criteria consistently.
Intra-Rater Reliability Fluctuations include: unclear scoring criteria, fatigue, bias toward particular good or bad students, and simple carelessness. A single rater needs clear scoring criteria. No one is totally objective; we are human beings, and we make mistakes.
Inter-Rater Reliability (1) Fluctuations include: lack of attention to scoring criteria, inexperience, inattention, and preconceived biases. A rater may fail to pay attention to the scoring criteria. When two or more raters are involved, we need to calculate inter-rater reliability.
Inter-Rater Reliability (2) Used with subjective tests when two or more independent raters are involved in scoring. Train the raters before scoring (e.g., TWE, dept. oral and composition tests for recommended students).
Inter-Rater Reliability (3) Compare the scores given to the same testee by different raters. If r is high, there is inter-rater reliability. This r serves as the evidence: a statistical formula helps you obtain it. If you want to know whether there is inter-rater reliability, you need to do the calculation and get r, and use it to convince people. If the raters were not trained, the test may not have inter-rater reliability.
Test Administration Reliability Sources of fluctuation: street noise (especially in a listening comprehension test), photocopying variations, lighting, variations in temperature, the condition of desks and chairs, and the monitors. When I give a test, such factors might affect your performance: a noisy street outside; a lousy copying machine, so you cannot read the paper clearly and need more time; a broken light; a room so hot you are sweating.
Test Reliability Measurement errors can come from the test itself: the test is too long; the test has a time limit; the test format allows for guessing; the items are ambiguous; an item has more than one correct answer. A longer test tends to be more reliable than a shorter one because it includes more samples of the test taker's ability: compare one question with 100 questions. If you miss one item out of 100, you still get 99% right; miss the single question and you get zero. But if the test is too long, say 200 items, you will be tired, and there is more chance you will make mistakes. A time limit also causes problems. Guessing inflates scores: suppose a test taker truly knows 70 of 100 four-option items and guesses on the remaining 30; by chance alone about a quarter of those guesses will come out right, so the observed score overstates the true ability. A high score may also just reflect good test-wiseness. Multiple choice is good for inter-rater reliability but increases the guessing factor.
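The guessing effect described above can be sketched with a short simulation. The scenario is invented for illustration: a hypothetical test taker who truly knows 70 of 100 four-option items and guesses blindly on the rest.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical test taker: truly knows 70 of 100 four-option items
# and guesses blindly on the remaining 30 (25% chance per guess).
known = 70
guessed_right = sum(1 for _ in range(30) if random.random() < 0.25)

# X = T + E: the lucky guesses are error score, not true ability.
observed = known + guessed_right
print("true:", known, "error:", guessed_right, "observed:", observed)
```

On average the guessing adds about 30 × 0.25 ≈ 7.5 points of error score, so the observed score systematically overstates what the test taker actually knows.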
Ways of Enhancing Reliability General strategies: consider the possible sources of unreliability, and reduce or average out nonsystematic fluctuations in raters, persons, test administration, and instruments.
How to Make Tests More Reliable? (1) Take enough samples of behavior; try to avoid ambiguous items; provide clear and explicit instructions; ensure tests are well laid out and perfectly legible; provide uniform and non-distracting conditions of administration; try to use objective tests.
How to Make Tests More Reliable? (2) Try to use direct tests; have independent, trained raters; provide a detailed scoring key; try to identify test takers by number, not by name; try to have multiple independent scorings in subjective tests (Hughes, 1989, pp. 36-42).
Reliability Coefficient (r) Quantifying the reliability of a test allows us to compare the reliability of different tests. 0 ≤ r ≤ 1 (the ideal r = 1, which means the test gives precisely the same results for particular testees regardless of when it happens to be administered). If r = 1, the test is 100% reliable. A good achievement test has r ≥ .90; if r < .70, you shouldn't use the test. Let's talk about r. When r = 0, the results are not reliable at all: you should not use the test results in any way, because something is terribly wrong with the test. In other words, the bigger r is, the more reliable the test; the perfect r is 1, and in practice we look for roughly .7 to 1. When r < .7, you should not use the test. This r refers to the test items themselves; it does not take the test takers and raters into account.
How to Get Reliability Coefficient
Type of Reliability — How to Measure:
Stability or Test-Retest — Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2.
Alternate Form — Create two forms of the same test (varying the items slightly). Reliability is stated as the correlation between scores on Test 1 and Test 2.
Internal Consistency (Alpha, α) — Compare one half of the test to the other half, or use methods such as Kuder-Richardson Formula 20 (KR-20) or Cronbach's Alpha.
Here we are talking about the theory: if we give the same test to the same people several times, the results should be consistent. With two parallel forms given to the same test takers on different occasions, a reliable test yields similar results each time. With one form and two administrations, you use the exact same test twice, as in a pre-test and post-test. With one form and one administration, the test is given once and divided into two parts, giving each test taker two sets of scores that we can correlate; this is called internal consistency, and it comes in three types.
How to Get Reliability Coefficient Two forms, two administrations: alternate/parallel forms. One form, two administrations: test-retest. One form, one administration (internal consistency): split-half (Spearman-Brown procedure), KR-21, KR-20.
Alternate/Parallel Forms Two forms, two administrations: equivalent forms (i.e., different items testing the same topic) taken by the same test takers on different days. If r is high, the test is said to have good reliability. This is the most stringent approach. The two forms, also called parallel forms, measure the same ability; if you give both to the same test takers and the results correlate highly, the test is reliable. How would you like this method as a test taker? It is boring to take the test twice, and if you are not highly motivated, you will perform differently the second time. It is ideal in theory, but it is not easy for a teacher to prepare two parallel forms: it takes time the teacher may not have. Teachers tend to prefer test-retest.
Test-Retest One form, two administrations. The same test is administered to the same testees with a short time lag, and then r is calculated. This method is appropriate for highly speeded tests. Watch out for the learner effect: items you answered wrong the first time you may answer correctly four months later, after more learning. However, the lag cannot be too short either, or test takers might remember the items; one or two weeks is a better time lag. The remaining problem is the test takers: they still have to take the test twice. Therefore we have a third method, called split-half.
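Test-retest reliability is simply the Pearson correlation between the two administrations. A minimal sketch, with invented scores for five hypothetical test takers:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for five test takers on two administrations
time1 = [85, 78, 92, 60, 71]
time2 = [83, 80, 95, 58, 69]
print(pearson_r(time1, time2))
```

Because the two sets of invented scores rank the test takers almost identically, r comes out close to 1, which is what a reliable test should show.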
Split-half (Spearman-Brown Procedure) One test, one administration. Split the test into halves (e.g., odd-numbered questions Q1, Q3, Q5 vs. even-numbered questions Q2, Q4, Q6, or a first half vs. a second half) to form two sets of scores. This is also called internal consistency. The adjustment used with it was designed by Spearman and Brown.
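Forming the two halves can be sketched as follows. The 0/1 item-response matrix is invented for illustration; in practice you would then correlate the two half-scores and apply the Spearman-Brown adjustment described next.

```python
# Hypothetical 0/1 item responses for four test takers on a six-item test
responses = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
]

# Odd-numbered items (Q1, Q3, Q5) vs. even-numbered items (Q2, Q4, Q6):
# each test taker gets a score on each half.
half_a = [sum(row[0::2]) for row in responses]
half_b = [sum(row[1::2]) for row in responses]
print(half_a, half_b)
```

The odd/even split is usually preferred over first-half/second-half because fatigue and item ordering then affect both halves equally.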
Split-half (2) Note that this r isn't yet the reliability of the whole test. There is a mathematical relationship between test length and reliability: the longer the test, the more reliable it is. The Spearman-Brown prophecy formula is rel.total = nr / [1 + (n − 1)r]. E.g., if the correlation between the two halves of a test is r = .6, the reliability of the full test is .75; lengthening the test to three times the half-length gives r = .82. The half-test r is not the real reliability of the test, because the test was cut in half: 100 items became 50, and a 50-item test is less reliable than the full 100-item test, so we need an adjustment upward. Lengthening the test further (say to 150 items) increases its reliability again; the principle is that the longer the test, the more reliable it is. I do not want you to memorize the formula.
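The prophecy formula and the slide's two example numbers can be checked directly:

```python
def spearman_brown(r, n):
    """Predicted reliability when the test length is multiplied by n."""
    return n * r / (1 + (n - 1) * r)

# Half-test correlation r = .6; the full test is twice the half (n = 2)
full = spearman_brown(0.6, 2)     # 0.75, as on the slide
# Lengthening to three times the half-length (n = 3)
tripled = spearman_brown(0.6, 3)  # ≈ 0.82, as on the slide
print(round(full, 2), round(tripled, 2))
```

Note that n is the length multiplier relative to the test you measured: a split half correlated at .6 must be doubled (n = 2) just to recover the reliability of the original full-length test.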
Kuder-Richardson Formula 21 KR-21 = [k/(k − 1)]{1 − [x̄(1 − x̄/k)]/s²}, where k = number of items, x̄ = mean, and s = standard deviation (for the formula, see Bailey, p. 100). The standard deviation is a description of the spread-outness in a set of scores (i.e., of the score deviations from the mean): 0 ≤ s, and the larger s is, the more spread out the scores are. E.g., given two sets of scores, (5, 4, 3) and (7, 4, 1), which group in general behaves more similarly? The important concept here is the standard deviation, which we represent with s. For example, if 20 people in a group take a test, we get 20 scores; we add them up, take the mean as the center, and each score deviates from (lies some distance from) the mean. How do we get the SD? Take three test takers whose mean is 4, and another group who take the same test and whose mean is also 4. You cannot conclude that they behave similarly: the SD of the second group, (7, 4, 1), is bigger than that of the first, (5, 4, 3), so the two groups do not behave the same. Likewise with 50 scores each, one group may be crowded around the middle while the other is spread out. On p. 102, the first SD is 1 and the second is 3. With small numbers it is easy to tell at a glance; with large numbers you need to calculate.
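The standard-deviation comparison and the KR-21 formula can be sketched as below. The KR-21 inputs at the end (20 items, mean 14, SD 4) are invented; note the sketch uses the population SD (dividing by n), while some texts divide by n − 1.

```python
from math import sqrt

def std_dev(scores):
    """Population standard deviation: spread of scores around the mean."""
    m = sum(scores) / len(scores)
    return sqrt(sum((x - m) ** 2 for x in scores) / len(scores))

def kr21(k, mean, s):
    """Kuder-Richardson formula 21 from item count k, test mean, and SD s."""
    return (k / (k - 1)) * (1 - (mean * (1 - mean / k)) / s ** 2)

# The slide's two score sets: same mean (4), different spread
print(std_dev([5, 4, 3]))  # ≈ 0.82
print(std_dev([7, 4, 1]))  # ≈ 2.45
# Hypothetical 20-item test with mean 14 and SD 4
print(kr21(20, 14, 4))
```

KR-21 is convenient because it needs only three summary numbers (k, x̄, s) rather than item-by-item data.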
Kuder-Richardson Formula 20 KR-20 = [k/(k − 1)][1 − (Σpq)/s²], where p = item difficulty (the proportion of people who got an item right) and q = 1 − p (the proportion who got it wrong). It is another way to calculate reliability. KR-21 is the more conservative formula; with KR-20 you might get a higher estimate. These are the ways we can work out reliability.
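Unlike KR-21, KR-20 needs the full item-by-item data, since Σpq is summed over items. A minimal sketch with an invented 0/1 response matrix:

```python
def kr20(responses):
    """KR-20 from a 0/1 item-response matrix (rows = test takers).

    p = proportion answering an item right, q = 1 - p,
    s2 = variance of the total scores (population form, dividing by n).
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    s2 = sum((t - mean) ** 2 for t in totals) / n
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / s2)

# Hypothetical responses: four test takers, five items
data = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(kr20(data))
```

With real classroom data the matrix would have one row per student and one column per dichotomously scored item.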
Ways of Testing Reliability Examine the amount of variation with the Standard Error of Measurement (SEM): the smaller, the better. Or calculate the reliability coefficient r: the bigger, the better. What r do you expect? The bigger is the better; that is one way. The other way is to examine the amount of variation by calculating the SEM, where smaller is better. The SEM has to do with interpreting an individual test taker's score.
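The SEM can be sketched from the score SD and the reliability coefficient using the standard formula SEM = s√(1 − r); the numbers below are invented for illustration.

```python
from math import sqrt

def sem(s, r):
    """Standard error of measurement: s * sqrt(1 - r)."""
    return s * sqrt(1 - r)

# Hypothetical test: score SD of 10 points, reliability r = .91
print(sem(10, 0.91))  # about 3 points
```

Roughly speaking, an individual's true score lies within ±1 SEM of the observed score about 68% of the time, which is why a smaller SEM means more trustworthy individual scores.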