1 Using statistics to evaluate your test
Gerard Seinhorst, STANAG 6001 Testing Workshop 2018, Workshop C2, Kranjska Gora, Slovenia

2 WORKSHOP OBJECTIVES
- Understand how to describe and analyze test data, and draw conclusions from it
- Learn how to calculate and interpret the B-Index
- Learn how to create a test summary report from raw test data
- Understand how quantitative data analysis can support your claims about the test

3 B-INDEX
The B-Index is an item statistic that indicates the degree to which the Masters (those who passed, e.g., a Level 3 test) outperformed the Non-Masters (test takers who failed the Level 3 test) on each item.
Calculation of the B-Index (see the sketch below):
- Determine what the cut score for passing the test is, e.g., 70%
- Split the scores into a group of Masters (at least 70% correct on the test) and Non-Masters
- For each item, subtract the FV for the Non-Masters from the FV for the Masters
Interpretation of the B-Index is similar to that for the DI.
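A minimal sketch of this calculation in Python, assuming a hypothetical `scores` matrix of 0/1 item scores (one row per test taker) and a 70% cut score:

```python
def b_index(scores, cut=0.70):
    """B-Index per item: FV(Masters) - FV(Non-Masters)."""
    n_items = len(scores[0])
    masters = [row for row in scores if sum(row) / n_items >= cut]
    non_masters = [row for row in scores if sum(row) / n_items < cut]

    def fv(group, item):
        # Facility value: proportion of the group answering this item correctly
        return sum(row[item] for row in group) / len(group)

    return [fv(masters, i) - fv(non_masters, i) for i in range(n_items)]

# Hypothetical data: 4 test takers, 3 items
print(b_index([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```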

4 SMALL-GROUP WORK
Work in small groups (2-4 persons).
Each group should have:
- a handout
- a flash drive with the data file
- a laptop with MS Excel (preferably in English!)
- at least one group member who is familiar with doing calculations in MS Excel
The data file is named Workshop C2_Dataset Activity.xlsx and can be found on the flash drive.
The handout gives some guidance, but ask for help whenever needed.
Work on the activities until 11.45 hrs; at 11.45 hrs the findings will be discussed in plenary.

5 Remember… Numbers are like people:
torture them enough and they’ll tell you anything. ANONYMOUS

6 [image-only slide]

7 ANALYSIS OF QUANTITATIVE TEST DATA
Descriptive Statistics:
- TEST ANALYSIS - describing/analyzing test results and the test population: Measures of Central Tendency, Measures of Dispersion, Reliability estimates
- ITEM ANALYSIS - describing/analyzing individual item characteristics: Item Difficulty, Item Discrimination, Distractor Efficiency

8 TEST ANALYSIS: Describing/analyzing test results
Measures of Central Tendency give us an indication of the typical score on a test.
They answer questions such as:
- In general, how did the test takers do on the test?
- Was the test easy or difficult for this group?
- How many test takers passed the test?
Statistics:
- Mean (average score)
- Mode (most frequent score)
- Median (the middle point in a rank-ordered set of scores)

9 TEST ANALYSIS: Describing/analyzing test results
Measures of Central Tendency
- When the mean, mode and median are all very similar, the scores approximate a "normal distribution" (bell-shaped curve)
- When they are not similar, the results are 'skewed'

10 MEAN, MEDIAN, MODE: Which measure of central tendency should you use?
It depends on your data:
- If there are no extreme scores, use the MEAN: {8, 9, 10, 10, 11, 11, 12, 13, 14}
- If there are extreme scores, use the MEDIAN: {2, 9, 10, 10, 11, 12, 12, 12, 13}
- If your data cannot be rank-ordered (nominal variables, e.g., gender or occupation), or if one score occurs substantially more often than any other score, use the MODE: {8, 10, 11, 12, 13, 13, 13, 13, 13}
Use the measure that best indicates the 'typical' score in your data set (see the sketch below).
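A quick check of the three example sets with Python's standard statistics module:

```python
import statistics

no_extremes  = [8, 9, 10, 10, 11, 11, 12, 13, 14]
with_outlier = [2, 9, 10, 10, 11, 12, 12, 12, 13]
one_frequent = [8, 10, 11, 12, 13, 13, 13, 13, 13]

print(round(statistics.mean(no_extremes), 2))  # 10.89 - no outliers, so the mean is representative
print(statistics.median(with_outlier))         # 11 - the outlier (2) drags the mean down to ~10.1
print(statistics.mode(one_frequent))           # 13 - by far the most frequent score
```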

11 TEST ANALYSIS: Describing/analyzing test results
Measures of Dispersion give us an indication of how similar or spread out the scores are.
They answer questions such as:
- How much difference is there between the highest and lowest score?
- How similar were the test takers' results?
- Are there any extreme scores ('outliers')?
Statistics:
- Range (difference between highest and lowest score)
- Standard Deviation (average distance of the scores from the mean)

12 TEST ANALYSIS: Describing/analyzing test results
Standard Deviation (SD, s.d. or σ)
- Small SD: scores are mostly close to the mean
- Large SD: scores are spread out
Example (see the sketch below):
- scores of test 1: 48, 49, 50, 51, 52
- scores of test 2: 10, 20, 40, 80, 100
- MEAN of both tests (250 : 5) = 50
- RANGE: test 1 (52 minus 48) = 4; test 2 (100 minus 10) = 90
- STANDARD DEVIATION: test 1 = 1.58; test 2 = 38.73
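These figures match the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes; a small verification:

```python
import statistics

test1 = [48, 49, 50, 51, 52]
test2 = [10, 20, 40, 80, 100]

for scores in (test1, test2):
    print(statistics.mean(scores),               # mean: 50 for both tests
          max(scores) - min(scores),             # range: 4 vs 90
          round(statistics.stdev(scores), 2))    # sample SD: 1.58 vs 38.73
```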

13 VISUALIZING DATA – Bar Chart

14 VISUALIZING DATA – Box Plot
[Box plot of n = 32 scores, annotated with the minimum and maximum score, the median, the mean, an outlier, and the four quartiles, each containing 25% of the scores]

15 ITEM DISCRIMINATION (DI)
The degree to which test takers with high overall test scores also got a particular item correct. Indicates how well an item distinguishes between high achievers and low achievers.
Calculation (see the sketch below): DI = FV(upper) - FV(lower), i.e., the FV of the top group (1/3 of test takers with the highest test scores) minus the FV of the bottom group (1/3 of test takers with the lowest test scores).
Ranges from -1.00 to +1.00. Interpretation:
- .40 and above: very good item
- .30 to .39: reasonably good item, possibly room for improvement
- .20 to .29: acceptable, but needing improvement
- below .20: poor item, to be rejected or revised
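A minimal sketch of the DI calculation in Python, assuming hypothetical `totals` (overall test scores) and `item_scores` (0/1 scores on one item) lists:

```python
def discrimination_index(totals, item_scores):
    """DI: FV of the top third minus FV of the bottom third of test takers."""
    ranked = sorted(zip(totals, item_scores), key=lambda pair: pair[0], reverse=True)
    third = len(ranked) // 3
    upper = [item for _, item in ranked[:third]]   # 1/3 with the highest totals
    lower = [item for _, item in ranked[-third:]]  # 1/3 with the lowest totals
    return sum(upper) / len(upper) - sum(lower) / len(lower)

# Hypothetical data: 6 test takers
print(discrimination_index([28, 25, 22, 19, 15, 12], [1, 1, 1, 0, 1, 0]))  # 1.0 - 0.5 = 0.5
```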

16 DISTRACTOR ANALYSIS
Distractor Efficiency is the degree to which a distractor worked as intended, i.e., attracting the low achievers, but not the high achievers. The Distractor Efficiency is the number of test takers that selected that particular distractor, divided by the total number of test takers.
A distractor that is chosen by less than 7% of the test takers (less than 0.07) is normally not functioning well and should be revised. However, bear in mind that the easier the item, the lower the distractor efficiency will be.
Example (200 test takers; * marks the key):

Item #14   | A*  | B  | C  | D   | Omitted
# selected | 140 | 2  | 12 | 46  | 0
% selected | 70% | 1% | 6% | 23% | 0%

Here distractors B (1%) and C (6%) fall below the 7% threshold, though note that this is a fairly easy item (FV = .70). A sketch of the calculation follows below.
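A sketch of the distractor-efficiency calculation for this item, using the counts above (the 0 for Omitted is implied by the 0% row):

```python
# Counts taken from the example item above; A is the key.
counts = {"A": 140, "B": 2, "C": 12, "D": 46, "Omitted": 0}
total = sum(counts.values())  # 200 test takers

for option, n in counts.items():
    efficiency = n / total
    flag = "  <- below 0.07, candidate for revision" \
        if option not in ("A", "Omitted") and efficiency < 0.07 else ""
    print(f"{option}: {efficiency:.2f}{flag}")
```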

17 OPTIMAL VALUES

STATISTIC | OPTIMAL VALUE | LIMITATION
TEST ANALYSIS: | |
Mean, mode, median | N/A* | Affected by test-taker ability; should be interpreted in relation to the maximum possible score
Range | N/A* |
SD | N/A* |
ITEM ANALYSIS: | |
FV | Depends on test population, test type/purpose |
DI | > 0.40 | Affected by the range of the test takers' ability
Distractor Efficiency | ≥ 0.07 | Indicates only how often a distractor was chosen, not whether it was chosen by a high achiever or a low achiever

* Note: Descriptive statistics do not have an optimal value; they merely describe and summarize test or population characteristics without one value a priori being 'better' than another.

18 TEST RELIABILITY (Alpha)
Test score reliability is an estimate of the likelihood that scores would remain consistent over time if the same test were administered repeatedly to the same learners.
A reliability coefficient of .85 indicates that 85% of the variation in observed scores is due to variation in the "true" scores; the remaining 15% cannot be accounted for and is called 'error' (owing to chance).
Reliability coefficients range from .00 to 1.00; ideal score reliabilities are above .80. Higher reliability = less measurement error. A sketch of the alpha calculation follows below.
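The slide names alpha but not its formula; this is a minimal sketch of Cronbach's alpha, assuming a hypothetical matrix of item scores with one row per test taker:

```python
import statistics

def cronbach_alpha(scores):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)"""
    k = len(scores[0])                          # number of items
    items = list(zip(*scores))                  # transpose to per-item columns
    item_vars = sum(statistics.variance(col) for col in items)
    total_var = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical data: 5 test takers, 4 dichotomous items
data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(round(cronbach_alpha(data), 2))  # 0.8
```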

19 STANDARD ERROR of MEASUREMENT (SEM)
An obtained test score is an estimate of a person's "true" test score. The "true" score is the score that a test taker would get if s/he took the test an infinite number of times.
The SEM indicates how accurate a test taker's obtained score is; an obtained score is more accurate if it is closer to the test taker's "true" score.
- The smaller the SEM, the less error and the greater the precision of the test score
- As the reliability of a test increases, the SEM decreases
- A test with a reliability coefficient of 1.00 has a SEM of zero: there is no error
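The slide does not give the formula; in the classical definition the SEM is computed from the score SD and the reliability coefficient, as in this sketch (the SD and alpha values are made up, chosen to reproduce the SEM of 4 used in the example two slides below):

```python
import math

def sem(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(10, 0.84), 2))  # 4.0
```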

20 STANDARD ERROR of MEASUREMENT (SEM)
In a normal distribution it can be expected that:
- there is a 68% chance that the true score is within 1 SEM below or above the obtained score
- there is a 95% chance that the true score is within 2 SEMs below or above the obtained score

21 STANDARD ERROR of MEASUREMENT (SEM)
Example: obtained score = 70, SEM = 4 (SEMs are expressed in the same units as test scores)
- there is a 68% chance that the test taker's true score is between 66 and 74 points (70 minus or plus 4 [-/+ 1 SEM])
- we can be 95% certain that his true score is between 62 and 78 points (70 minus or plus 8 [-/+ 2 SEMs])
If SEM = 2:
- there is a 68% chance that his true score is between 68 and 72 points (70 minus or plus 2 [-/+ 1 SEM])
[Number line from 62 to 78 centred on the obtained score of 70, marking the -/+ 1 SEM (68%) and -/+ 2 SEMs (95%) bands]
A sketch of this band calculation follows below.
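The same bands, computed with a small helper (names are hypothetical):

```python
def score_bands(obtained, sem):
    """68% and 95% bands: obtained score -/+ 1 and 2 SEMs."""
    return ((obtained - sem, obtained + sem),
            (obtained - 2 * sem, obtained + 2 * sem))

print(score_bands(70, 4))  # ((66, 74), (62, 78)) - the bands from the example
print(score_bands(70, 2))  # ((68, 72), (66, 74))
```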

22 STANDARD ERROR of MEASUREMENT (SEM)
The SEM not only indicates how accurate the test is, but can also be used to adjust your pass point (cut score) based on that accuracy.
Another example:
- 100-item test (max. obtainable score: 100)
- Pass point: 70 (70%)
- Reliability (alpha):
- SEM: 3
Due to the comparatively low reliability, you can be less confident that the pass score truly represents the pass/fail point. There is a fair chance that a test taker with an obtained score of 69 might have a "true" score of 70 or 71. Potentially this leads to a higher number of false negatives (Masters who fail).
Dropping the pass point by 1 SEM would change the passing score to 67 (67%). This will diminish the number of false negatives, but increase the number of false positives (see the sketch below).
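The adjustment itself is simple arithmetic; a sketch with a hypothetical helper name:

```python
def adjusted_pass_point(pass_point, sem, n_sems=1):
    """Lower the pass point by n SEMs to reduce false negatives
    (at the cost of more false positives)."""
    return pass_point - n_sems * sem

print(adjusted_pass_point(70, 3))  # 67, the adjusted passing score from the example
```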

