Data Analysis and Standard Setting




1 Data Analysis and Standard Setting
Stephen Hills, Examinations Administrator, Undergraduate Medical Office

2 Educational Theories: Classical Test Theory and Item Response Theory
Classical Test Theory: true score, observed score, error of measurement. Item Response Theory: latent trait (or ability); higher, higher, higher.

CTT: This theory is based on the premise that a candidate has a true score, which cannot be measured directly. An examination can be set to measure it, but what it yields is only an estimate of the true score: the observed score. The difference between the true score and the observed score is the error of measurement, caused either by candidate factors (eg tiredness, hunger, attitude) or by flaws in the examination (eg relevance, reliability, examiner bias).

IRT: This theory is based on the premise that items can be located on a scale of ability, or latent trait. This allows individuals to be measured against that trait or ability as defined by the set of items used. In brief, each item can be identified as measuring the trait at a particular point on its scale, ie at a specific level of difficulty. So if you have 10 items ranging from difficulty level 1 to level 10, a candidate's level of ability is, in the idealised case, the point at which all items up to that point have been answered correctly and all items beyond it have been answered incorrectly.

The main practical difference between the two theories is that CTT defines item characteristics in terms of the sample of candidates being tested (ie an item is easy or difficult for that sample, but may behave differently with another sample), whereas IRT places an item on a scale that does not change depending on who answers it.
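The idealised IRT scale in the notes can be sketched in a few lines. This is a minimal illustration, assuming items are listed from easiest (level 1) to hardest; the function name `ability_level` and the response patterns are hypothetical. It implements only the deterministic, idealised pattern described above (a perfect cumulative or Guttman-style pattern). Practical IRT models such as the Rasch model treat a correct answer as a probability rather than a certainty, so this is not an IRT estimator.

```python
def ability_level(responses):
    """Return the ability level implied by an idealised response pattern.

    `responses` lists whether each item was answered correctly, ordered from
    the easiest item (difficulty level 1) to the hardest. In the idealised
    scale, ability is the number of items answered correctly before the
    first incorrect answer, and every later item must also be incorrect.
    Returns None if the pattern does not fit that idealised scale.
    """
    # Count the run of correct answers from the easiest item upwards.
    level = 0
    while level < len(responses) and responses[level]:
        level += 1
    # Any correct answer beyond that point breaks the idealised pattern.
    if any(responses[level:]):
        return None
    return level


# Hypothetical candidate: correct on items 1-6 of 10, wrong on items 7-10.
print(ability_level([True] * 6 + [False] * 4))  # 6
# Hypothetical inconsistent pattern: misses item 3 but answers item 8 correctly.
print(ability_level([True, True, False, True, True, True, True, True, False, False]))  # None
```

Real candidates rarely produce perfect patterns, which is why returning `None` for an inconsistent pattern is more honest here than forcing a level.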

3 Measures of Difficulty
With which organ is the word coronary associated? A Brain. B Cardiac. C Heart. D Kidney. E Liver.
Is this a good question? You cannot answer without reference to a set of criteria: it may be a good question for testing knowledge of the word coronary but a bad one for testing the function of the brain. In the medical school we know what we want the exams to achieve, which is to distinguish the candidates who have the knowledge and skills detailed in the syllabus from those who do not. Before an exam takes place it is usual (and it is practice in this medical school) for a number of checks to be made, and since this presentation deals with post-examination analysis, we can assume those checks have been made. We can assume, therefore, that experts have reviewed this question and were satisfied that it is at the right level, covers the required testing point, tests the right skill or knowledge, etc. So, now that we know these checks have been completed, I'll ask again: is this a good question? Of course you still cannot answer. Before the examination takes place, experts can only make a best guess that the question will perform as they intend; it is not until the examination has been sat that this can be confirmed.

4 Measures of Difficulty
Facility values for MCQs. With which organ is the word coronary associated? A Brain B Cardiac C Heart D Kidney E Liver
Facility values: For dichotomously scored questions such as MCQs, the facility value of an item records how many candidates score positively on it. For such questions the only possible outcomes are a score of 1 if answered correctly or 0 if answered incorrectly. Facility values are therefore expressed as a proportion of the total number of candidates (or multiplied by 100 to give a percentage). In this example 60% of candidates answered correctly (C, Heart). This is usually expressed by saying the item has a facility value (often called its difficulty) of 0.6; a higher value means an easier item.
It is also useful to collect facility values for each of the options in a question. Here facility value means the proportion of candidates who select that option as their response to the problem posed. In this example 4% of candidates chose Brain, 3% chose Kidney, 33% chose Cardiac and no one chose Liver. This shows how well the wrong options (or distractors) are operating. It is useful to set a minimum and maximum expectation for an incorrect response as a way of indicating whether a distractor is working properly. Too low may indicate that it is not plausible to the candidates, and too high may mean that it is partially true or indeed a second correct response. I would suggest that if fewer than 3% of candidates choose a distractor, it is probably not plausible, and if more than 30% choose it, that option should be scrutinised to see whether it is also correct or unfair in any way. A distractor that attracts too few candidates, and is therefore not working, can affect the standard: in this example the chance of guessing correctly rises from 20% to 25%, because only four of the five options are working. If the question has two correct options (or a correct option wrongly marked as incorrect), candidates may be unfairly penalised.
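The option analysis and the 3%/30% rule of thumb can be sketched as follows. This is a minimal sketch, assuming a hypothetical cohort of 100 candidates so that raw counts equal the percentages quoted in the notes; the function names `option_facilities` and `review_distractors` are illustrative, not part of any existing tool. The limits are applied as strict comparisons ("fewer than 3%", "more than 30%"), so Kidney at exactly 3% is not flagged.

```python
# Option counts for a hypothetical cohort of 100 candidates; the cohort size is
# an assumption, chosen so the counts match the percentages quoted above.
counts = {"Brain": 4, "Cardiac": 33, "Heart": 60, "Kidney": 3, "Liver": 0}
KEY = "Heart"

def option_facilities(counts):
    """Proportion of candidates choosing each option."""
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}

def review_distractors(facilities, key, low=0.03, high=0.30):
    """Split distractors into rarely chosen (< low) and often chosen (> high)."""
    distractors = {o: p for o, p in facilities.items() if o != key}
    implausible = [o for o, p in distractors.items() if p < low]
    suspect = [o for o, p in distractors.items() if p > high]
    return implausible, suspect

facilities = option_facilities(counts)
print(f"Item facility: {facilities[KEY]:.2f}")  # Item facility: 0.60

implausible, suspect = review_distractors(facilities, KEY)
print(implausible, suspect)  # ['Liver'] ['Cardiac']

# Chance of guessing correctly among the options that are actually working.
working = len(counts) - len(implausible)
print(f"Chance of a correct guess: {1 / working:.0%}")  # Chance of a correct guess: 25%
```

Cardiac is flagged for scrutiny because 33% exceeds the upper limit, which fits the notes' point that a very popular distractor may be partially true.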

5 Facility Values EMQs
Options: A Brain. B Cardiac. C Heart. D Kidney. E Liver. F Lung. G Pancreas. H Spleen. I Stomach.
1. With which organ is the word coronary associated?
2. Which organ is affected by Bright’s disease?
3. Through which organ does oxygen enter the bloodstream?
4. Which organ acts as a filter against foreign organisms that enter the bloodstream?
5. Which organ is divided into the fundus and antrum?
EMQs are really multiple choice questions that share the same options and are displayed in a non-traditional way. Each can therefore be looked at separately and analysed as an individual question. It is also possible to analyse the questions together as a group if that would provide useful information. In this example it probably would not, as the questions are quite diverse. If the questions were all concerned with the function of the kidney, for example, there might be some benefit in considering the group as a whole instead of its individual components.
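Analysing EMQs item by item and then as a group might look like the sketch below. It assumes a hypothetical set of four candidates' responses, and the answer key is our own reading of the five questions (C Heart, D Kidney, F Lung, H Spleen, I Stomach), not a key given in the presentation. Each item is scored exactly like a single MCQ; the group figure is simply the mean of the item facility values.

```python
# Intended key for the five EMQ items (our reading of the questions).
key = {1: "C", 2: "D", 3: "F", 4: "H", 5: "I"}

# Hypothetical responses: one dict per candidate, item number -> chosen option.
responses = [
    {1: "C", 2: "D", 3: "F", 4: "H", 5: "I"},
    {1: "B", 2: "D", 3: "F", 4: "E", 5: "I"},
    {1: "C", 2: "E", 3: "F", 4: "H", 5: "G"},
    {1: "C", 2: "D", 3: "F", 4: "A", 5: "I"},
]

def item_facility(item):
    """Proportion of candidates answering one EMQ item correctly."""
    return sum(r[item] == key[item] for r in responses) / len(responses)

# Analyse each item as an individual MCQ...
per_item = {item: item_facility(item) for item in key}
print(per_item)  # {1: 0.75, 2: 0.75, 3: 1.0, 4: 0.5, 5: 0.75}

# ...or summarise the whole group, which is most useful when the items
# share a theme (for example, all about kidney function).
group_facility = sum(per_item.values()) / len(per_item)
print(group_facility)  # 0.75
```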

6 Facility Values MCQs
Which of the listed symptoms are commonly presented by patients suffering from influenza?
1. Aching joints. T or F
2. Blindness. T or F
3. Fever. T or F
4. Lethargy. T or F
5. Skin rash. T or F
It is usual to perform the facility analysis according to the scoring criteria. If one mark is scored for each correct True or False response, the examination evaluator is likely to require analysis at that level. It may also be useful to have the analysis over the whole group for other purposes, such as teaching assurance. Where one mark is awarded for the whole group (ie one mark for indicating TFTTF), evaluators will very likely want to see the facility value for each option as well as for the correct response pattern, to identify any problems within the item. For example, if the facility values of options 1 to 4 are all between 0.6 and 0.7 but the facility value for option 5 is 0.3, the evaluator may wish to investigate that option and possibly decide to exclude option 5 from the marking process.
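The two scoring schemes can be compared with a short sketch. It assumes hypothetical answer strings for six candidates and a hypothetical flagging cut-off of 0.5; only the key pattern TFTTF comes from the slide. It computes a facility value per statement (one mark per correct T/F), a facility value for the whole pattern (one mark for all five), and the pattern facility when a problem statement is excluded from marking.

```python
# Correct response pattern for the five influenza statements (from the slide).
KEY = "TFTTF"

# Hypothetical answer strings, one per candidate (invented for illustration).
answers = ["TFTTF", "TFTTT", "TFTFT", "FFTTT", "TFTTF", "TTTTT"]

def stem_facilities(answers, key=KEY):
    """Facility value of each True/False statement, scored one mark per stem."""
    n = len(answers)
    return [sum(a[i] == key[i] for a in answers) / n for i in range(len(key))]

def pattern_facility(answers, key=KEY, exclude=()):
    """Facility value when one mark is awarded only for the full pattern.

    `exclude` holds 1-based statement numbers to leave out of the marking.
    """
    keep = [i for i in range(len(key)) if i + 1 not in exclude]
    correct = sum(all(a[i] == key[i] for i in keep) for a in answers)
    return correct / len(answers)

stems = stem_facilities(answers)
print([round(p, 2) for p in stems])  # [0.83, 0.83, 1.0, 0.83, 0.33]

# Hypothetical cut-off: flag statements answered correctly by fewer than half.
flagged = [i + 1 for i, p in enumerate(stems) if p < 0.5]
print(flagged)  # [5]

print(round(pattern_facility(answers), 2))               # 0.33
print(round(pattern_facility(answers, exclude={5}), 2))  # 0.5
```

Excluding the weak statement raises the whole-pattern facility, which is the kind of evidence an evaluator would weigh before changing the marking.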

7 Measures of Difficulty
Mean scores for written papers. Relative mean scores. Interpretation.
Where candidates can achieve a range of scores (for example, a question with a maximum of 20 marks has 21 possible results, from 0 to 20), the mean score is a measure of how difficult the candidates found the question. Where a paper contains more than one question, the mean scores for each question can be compared, so that the questions' mean scores can be analysed relative to one another. Sometimes it is desirable to do the same for the parts of a single question. Comparing the mean score of each question with the others, and with the mean score for the paper as a whole, helps to identify anomalies that the examiners need to investigate. For example, a question may have three parts worth 10 marks each, with mean scores of 80%, 80% and 20%, while the mean score for the paper overall is 70%. Examiners concerned with the quality of the question paper will want to investigate the part with the low mean score to find out whether a problem exists.
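The comparison in the example can be sketched as follows. The marks are hypothetical, chosen only to reproduce the 80%, 80% and 20% part means; the 70% paper mean is taken from the example, and the 20-percentage-point flagging rule is an assumption, since the presentation leaves the judgement to the examiners.

```python
MAX_MARK = 10  # each part is worth 10 marks, as in the example

# Hypothetical marks for four candidates, chosen to reproduce the example means.
marks = {
    "Part 1": [8, 9, 7, 8],
    "Part 2": [7, 9, 8, 8],
    "Part 3": [2, 1, 3, 2],
}
PAPER_MEAN_PCT = 70.0  # overall paper mean quoted in the example

# Hypothetical rule: flag parts more than 20 percentage points below the paper mean.
THRESHOLD = 20.0

def mean_pct(scores, max_mark=MAX_MARK):
    """Mean score as a percentage of the maximum available mark."""
    return 100 * sum(scores) / (len(scores) * max_mark)

part_means = {part: mean_pct(s) for part, s in marks.items()}
print(part_means)  # {'Part 1': 80.0, 'Part 2': 80.0, 'Part 3': 20.0}

flagged = [p for p, m in part_means.items() if PAPER_MEAN_PCT - m > THRESHOLD]
print(flagged)  # ['Part 3']
```

Expressing each mean as a percentage of its maximum mark is what makes parts or questions with different mark allocations comparable.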

8 Candidate Separation
Objective of an examination. Measuring discrimination.
An examination is designed to measure whether or not candidates have adequately met the requirements of the criteria being examined. In the case of this medical school, learning objectives are set and an examination is used as the tool to measure whether candidates have the knowledge and skills required to move on to the next part of the course, or to practise as doctors.

