RELIABILITY BY DESIGN
Prepared by Marina Gvozdeva, Elena Onoprienko, Yulia Polshina, Nadezhda Shablikova
Outline
Factors affecting reliability; ways to increase reliability; classroom test reliability; tips for classroom tests.
Two Components of Test Reliability
Candidates' performance; reliability of the scoring. These are the two directions in which we can work to increase test reliability.
Factors affecting reliability:
test factors; administrative factors; affective factors. (Coombe et al., 2007: xxiii)

Results on tests may vary for a number of reasons. Coombe, Folse and Hubley (A Practical Guide to Assessing English Language Learners, p. xxiii) define three groups of factors that affect reliability: test factors, administrative factors and affective factors.

Test factors: tests vary in format and in the content of their questions, which may affect reliability. The time given to test takers to work on the test, or on different parts of it, is also a factor. The number of items affects reliability as well: generally, the more items a test has, the more reliable it is, because we have more samples of the test takers' performance and language ability; conversely, shorter quizzes are usually less reliable.

Administrative factors: these refer to the manner in which a test is administered. The room where the test takes place may be too dark or too cold, test takers may sit too close to each other, the acoustics may be poor, or the teacher who administers the test may behave differently from one occasion to the next. All these factors affect reliability and should be taken into consideration.

Affective factors: these refer to the way individual test takers respond to a test. Test takers may be tired or rested, anxious or calm, introverted or extroverted (which is particularly important for speaking tests), and may have different learning styles.
Factors affecting reliability:
fluctuations in the learner; fluctuations in scoring; fluctuations in test administration. (Coombe et al., 2010)

Henning (1987, cited in Coombe et al., 2010: xxiii-xxiv) describes threats to reliability as fluctuations.

Fluctuations in the learner: these are changes that may take place within the learner and thus change his/her true score every time the learner takes the test. They include additional learning or forgetting between test administrations, as well as fatigue, sickness and emotional problems. If the learner takes the test several times, we may also expect a practice effect, which means that the learner improves his/her score merely because the content is familiar. All of these may cause the learner's observed score to be lower or higher than the score that reflects his/her actual ability.

Fluctuations in scoring: these may be caused by subjectivity in scoring due to low intra-rater or inter-rater reliability, or by mechanical errors (such as misprints).

Fluctuations in test administration: these refer to administrative procedures and testing conditions that are not consistent. For example, different groups of test takers may take the same test in different locations on different days, and conditions such as the timing of the test or the heating of the room may be different each time.
Reliability vs validity
E.g. 100 MCQ translation items: may be reliable, but will not be valid as a writing test.

Increasing reliability contributes to validity: an unreliable test cannot be valid. However, a reliable test is not necessarily a valid test. For example, a test in which test takers answer 100 multiple-choice translation items may be reliable, but it would not be a valid test of writing, as it does not test the construct of writing.
Reliability vs validity
Target 1: reliable, not valid. Target 2: not reliable, not valid. Target 3: reliable and valid.

This picture shows the relationship between reliability and validity. If an archer consistently hits a certain area of the target, their aim is reliable; however, it is not valid to say they are a good archer (target 1). Target 2 shows shooting that is neither reliable nor accurate: the arrows land in various places without any consistency. Target 3 shows a reliable and accurate performance, because the archer consistently hits close to the bullseye. A good test must, like a good archer, give results that are accurate (valid) as well as consistent (reliable).
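To make the analogy concrete, here is a minimal Python sketch (the shot coordinates are invented for illustration): consistency corresponds to how tightly each archer's shots cluster around their own centre, and accuracy to how far that centre lies from the bullseye.

```python
from math import dist

BULLSEYE = (0.0, 0.0)

# Invented shot coordinates for the three archers on the slide.
archers = {
    "reliable, not valid":     [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0), (3.1, 3.2)],
    "not reliable, not valid": [(2.5, -3.0), (-3.5, 1.0), (0.5, 4.0), (-2.0, -2.5)],
    "reliable and valid":      [(0.1, -0.2), (-0.1, 0.1), (0.2, 0.0), (0.0, 0.2)],
}

for name, shots in archers.items():
    centre = (sum(x for x, _ in shots) / len(shots), sum(y for _, y in shots) / len(shots))
    spread = sum(dist(shot, centre) for shot in shots) / len(shots)  # low spread = consistent (reliable)
    bias = dist(centre, BULLSEYE)                                    # small bias = accurate (valid)
    print(f"{name}: spread = {spread:.2f}, distance from bullseye = {bias:.2f}")
```

In test terms, a small spread corresponds to reliability and a small distance from the bullseye to validity; a good test needs both.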
Ways to increase reliability
Get enough samples of behaviour: the more items there are in a test, the more reliable it is; additional items should be independent of each other and of existing items; a test should be neither too long nor too short.

If you want more reliable scores, you need more evidence of candidates' behaviour. One way to get it is to add more items to the test. If this is not done carefully, however, it can create another problem. Hughes gives the example of a reading test that asks 'Where did the thief hide the jewels?' and then adds a follow-up item, 'What was unusual about the hiding place?'. The second question does not contribute to the reliability of the test, because answering it correctly depends on getting the first question right; for test takers who failed to answer the first question there is, in reality, no second question (Hughes, 1989: 37). To contribute to reliability, each item should represent a fresh start for the test taker.

Longer tests are generally more reliable. On the other hand, if a test is too long, test takers will lose concentration or become tired and will not perform as well as they should.
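How much reliability is gained or lost by changing the number of items is often estimated with the Spearman-Brown prophecy formula. Here is a minimal Python sketch; the reliability value and length factors are hypothetical, and the formula assumes that added items are independent of, and similar in quality to, the existing ones.

```python
def spearman_brown(current_reliability: float, length_factor: float) -> float:
    """Predicted reliability after lengthening (factor > 1) or shortening (factor < 1) a test."""
    r = current_reliability
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Hypothetical example: a test with reliability 0.70.
print(round(spearman_brown(0.70, 2.0), 2))  # doubling its length -> about 0.82
print(round(spearman_brown(0.70, 0.5), 2))  # halving its length  -> about 0.54
```

The numbers mirror the point above: adding comparable, independent items raises reliability, while shortening the test lowers it.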
Ways to increase reliability:
do not allow test takers too much choice; write clear items with simple instructions; include items with a range of difficulty; ensure that test materials are easy to read and have a clear layout; make sure test takers know what to expect in the test. (Hughes, 1989: 39)

The ways listed on the slide concern the tests themselves. The first point mostly concerns performance tests, such as writing and speaking tests. In writing tests, for example, test takers are often given a choice of questions and then have too much freedom in answering the question they choose. This can have a negative impact on reliability: the more freedom we give, the more likely it is that we get different performances from a candidate on different days (Hughes, 1989: 38). Test takers should therefore not be given too much choice, and the range of possible answers should be restricted; compare the tasks on the next slide.

Instructions, both written and spoken, should be clear. Quite often test takers misinterpret instructions because they are vague or not clearly worded. To avoid such problems, give your tests to colleagues for criticism of the instructions (including those that will be spoken). For spoken instructions, prepare a text to read aloud in order to avoid confusion (Hughes, 1989: 39).

Sometimes tests are badly printed, reproduced or handwritten, cram too much text into too small a space, or split a gapped text so that the word box and the second half of the text end up on the reverse side of the sheet. As a result, candidates have additional difficulties to cope with and their performance can be affected negatively, which lowers test reliability.

If any aspect of a test is unfamiliar to candidates, they are likely to do less well than they otherwise would. Teachers should do their best to provide candidates with sample or past tests to familiarize them with the test formats and techniques. One of the affective factors that can affect reliability is anxiety; it 'can be allayed by coaching test takers in good test taking strategies' (Coombe et al., 2010: xxiii).
Compare four tasks

1. Write about tourism.
2. Write a composition on tourism in this country.
3. Write a composition on how we might develop the tourist industry in this country.
4. Discuss the following measures intended to increase the number of foreign tourists coming to this country: more/better advertising and/or information (where? what form should it take?); improve facilities (hotels, transportation, communication etc.); training of personnel (guides, hotel managers etc.). (Hughes, 1989: 38)

Ask the students: 'Which task is more reliable?' The tasks provide increasing amounts of support from 1 to 4. The fourth task is likely to be considerably more reliable than the first, because the language it elicits will be more consistent across test takers; in the first task it is unclear which aspects of tourism test takers should focus on. As more guidance is given, answers become more comparable.
Ambiguous items

I. Complete the sentences using the words and word combinations from Essential Vocabulary:
1. He has been promoted three times this year. No doubt he is …
2. Yes, that party was a disaster, but it was not me who organized it, so you …
3. I am thinking of buying this laptop, but I am not sure if I can afford it. Could you …
4. I think politics is really boring: it …
5. The movie was very touching, but I managed to …

II. Name: three types of waste; three ways to dispose of garbage.

It is important not to present test takers with items that are unclear or that have a number of acceptable answers the test writer has not anticipated. If a test taker might give different answers to an item on different occasions, the item is not contributing to the reliability of the test. One very good way to produce unambiguous items is to give them to colleagues for critical examination. Another way to avoid ambiguity is to give a pretest to a group of people similar in level to the target group; most problems can be identified before the actual test.

The tasks on the slide are authentic test tasks with ambiguous items. In the first task, test takers failed to complete some sentences not because they did not know the words, but because they failed to guess which word they were expected to use. In question 1, for example, the correct answer might be 'ambitious', 'successful', 'delighted' or one of many other possible words, so it is not surprising that some test takers produced answers the test writer did not anticipate. In the second task, test takers found it difficult to understand what was required of them because the instructions are very unclear: 'three types of waste' could be 'waste of time', 'waste of space' and 'waste of money'!
Scoring reliably: use items that permit scoring which is as objective as possible; make comparisons between candidates as direct as possible; provide a detailed scoring key; train scorers.

1. Multiple-choice questions seem to yield objective results. However, MCQ are not always appropriate for testing different abilities, and they are very difficult to construct well. An alternative to the MCQ is the open-ended item with a unique, possibly one-word, correct response which the candidates produce themselves. Ask students why it might be important that only a single word can be the correct answer: the wider the range of possible answers, the greater the risk of unreliability. Having only one possible answer should also ensure objective scoring, although there can still be problems with matters such as spelling, which can make a candidate's meaning unclear and scoring difficult.

2. The second point reinforces the suggestion that candidates should not be given too much choice. If the range of possible responses is limited, scoring will be more reliable: scoring compositions on one clearly defined topic is more reliable than scoring compositions when candidates are allowed to choose from six topics.

3. The scoring key should specify acceptable answers and assign points for partially correct responses. For higher scorer reliability the key should be as detailed as possible; to anticipate all possible responses, it is advisable to subject the key to group criticism.

4. Training scorers is especially important for speaking and writing, where scoring is most subjective. After each administration, patterns of scoring should be analysed, and scorers whose scoring deviates markedly and inconsistently from the norm should not be used again without further training. In writing, benchmarked samples at different levels of ability should be selected, and scorers or raters should be trained to agree on the scores to be awarded. In subjective testing, at least two independent raters should rate each script, and neither rater should know the scores given by the other. Scripts on which the two raters disagree should be given to a third person for an independent score.
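A minimal Python sketch of the double-marking routine described in point 4; the script names, band scores, disagreement margin and the choice to average agreeing scores are all assumptions for illustration, not prescribed by the sources above.

```python
# Two independent raters score the same scripts; neither sees the other's marks.
rater_a = {"script01": 5, "script02": 3, "script03": 4, "script04": 2}
rater_b = {"script01": 5, "script02": 4, "script03": 1, "script04": 2}

DISAGREEMENT_MARGIN = 1  # assumed tolerance on a 0-6 band scale

for script in sorted(rater_a):
    a, b = rater_a[script], rater_b[script]
    if abs(a - b) > DISAGREEMENT_MARGIN:
        print(f"{script}: scores {a} and {b} disagree -> send to a third, independent rater")
    else:
        print(f"{script}: report {(a + b) / 2}")  # averaging agreeing scores is one common option
```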
Reliability for speaking assessment:
designing the rating process – rating scale; reporting scores and giving feedback – setting cut scores; assuring intrarater / interrater reliability.

Designing the rating process: the rating process involves developing the criteria that will be applied to the performances. To gather detailed rating information, test developers need to design a rating form for their test. The form is usually applied to one examinee at a time, which allows the rater to compare a test taker's performance against set criteria rather than against the performances of other test takers. Rating forms should be viewed as a concrete result of the design of the rating process and should make rating easier and faster (Luoma, 2011).

Reporting scores and giving feedback: the results of speaking tests are usually reported as overall grades in the form of numbers and/or letters. In a learning-oriented setting it is necessary to decide on a pass/fail score, so we need a point on the scale below which a performance is considered a fail. This is called setting cut scores; in criterion-referenced testing it is known as standard setting.
Reliability for classroom speaking assessment
Intrarater / interrater reliability – training raters: understanding criteria for assessment; agreement with other raters; consistency of performance.

A major worry about classroom assessment is subjectivity. One issue is intra-rater reliability, or internal consistency, which means that raters 'agree within themselves, over a period of a few days, about the ratings that they give'. A second issue is inter-rater reliability, which 'means that different raters rate performances similarly' (Luoma, 2004: 179). Threats to intra- and inter-rater reliability can be reduced through well-defined criteria and effective rater training. If disagreements occur frequently, this may mean either that some raters cannot apply the criteria consistently or that the criteria need to be better defined.
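One way to put a number on inter-rater reliability is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is an illustration, not a method prescribed by Luoma; the band scores are invented.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters who rated the same performances."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Invented band scores (1-5) from two raters for the same ten speaking performances.
rater_1 = [3, 4, 2, 5, 3, 3, 4, 2, 5, 4]
rater_2 = [3, 4, 3, 5, 3, 2, 4, 2, 4, 4]
print(round(cohens_kappa(rater_1, rater_2), 2))  # about 0.59 for this data
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, a signal that the criteria or the rater training need attention.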
Reliability of teacher-made tests:
increasing the number of items increases reliability; a moderate difficulty level increases reliability; having items that measure similar content increases reliability.

Increasing the number of items increases reliability: the longer the test, the more reliable the score is likely to be. A moderate difficulty level increases reliability: we see this in item analysis, where, if we want to discriminate between test takers, item facility should ideally be 0.5, though anywhere between 0.3 and 0.7 is acceptable. Having items that measure similar content also increases reliability (but we must not forget that we should also use a variety of test tasks and test a variety of constructs).
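A minimal item-analysis sketch of the facility check mentioned above; the scored response matrix is invented. Item facility is simply the proportion of test takers who answered an item correctly, and items outside the 0.3-0.7 band are flagged for review.

```python
# Invented scored responses: rows are test takers, columns are items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
]

num_takers = len(responses)
for item in range(len(responses[0])):
    facility = sum(row[item] for row in responses) / num_takers
    note = "" if 0.3 <= facility <= 0.7 else "  <- outside 0.3-0.7, review this item"
    print(f"Item {item + 1}: facility = {facility:.2f}{note}")
```

Flagged items can be revised or replaced before the test is used again, which over time also improves how well the test discriminates between test takers.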
Tips for writing short answer/completion items
There should be only one short, concise answer. Allow for partial credit. (Coombe et al., 2010: 30-45)

These tips are mostly for teachers developing classroom tests and can help them ensure reliability. The source of the tips is Coombe C., Folse K. & Hubley N. (2010), A Practical Guide to Assessing English Language Learners (online).
Tips for cloze/gap-fill items:
ensure that answers are concise; provide enough context; develop and allow for a list of acceptable responses; don't put a gap in the first sentence of a paragraph or text. (Coombe C. et al, 2010:34)
Tips for true/false questions:
write items that test meaning rather than trivial detail; questions should be written at a lower level of language difficulty than the text; consider the effects of background knowledge; questions should appear in the same order as the answers appear in the text. (Coombe C. et al., 2010:30)
Tips for true/false questions:
make sure you paraphrase questions in simple clear language; avoid 'absoluteness' clues; add another option where plausible to decrease the guessing factor. (Coombe C. et al., 2010:30)
Tips for writing matching items:
give more options than premises; number the premises and letter the options; make options shorter than premises; avoid widows; make it clear to test takers whether they can use options more than once. (Coombe et al., 2010: 33)

'Widows' occur when half of a test section or question overlaps onto another page; make sure the matching section of your test is all on one page.
Tips for writing good MCQ:
write MCQ that test only one concept; provide as much context as possible; keep sensitivity and fairness issues in mind; standardize the number of response options; one response option should be an unambiguous correct or best answer. (Coombe C. et al, 2010:25-7)
Tips for writing good MCQ:
all response options should be similar in length and level of difficulty; avoid writing absurd or giveaway distracters; avoid extraneous clues; make the stem positive. (Coombe C. et al, 2010:25-7)