Testing the Test – Serbian STANAG 6001 English Language Test STANAG 6001 Testing Team PELT Directorate, Serbian MOD STANAG 6001 Testing Workshop Brno, Czech Republic, 6 – 8 September 2016
General and Specific Concerns
Any kind of testing/examination raises both general and specific points of concern. On general points, relevant to any kind of language examination, we are guided by the Principles of Good Practice for ALTE Examinations (Association of Language Testers in Europe). Specific points of concern arise from the following:
- STANAG 6001 is a high-stakes examination;
- It is a language proficiency test of general English in a military setting;
- It is a criterion-referenced test, based on the STANAG 6001 table of level descriptors, and incommensurate with other criterion-referenced tests (e.g. Cambridge ESOL exams, IELTS) and language proficiency scales (CEFR, ALTE levels, etc.)
Limiting Factors
Bearing this in mind, there are many serious constraints when designing the test (including things beyond your control):
- What are the actual needs of the particular nation? (NATO member? PfP member? MD member? Test all levels? Test L4?)
- What kind of test? (Multi-level 1-2-3? Bi-level L1/2, L2/3? Single level?)
- STANAG 6001 language descriptors are uniform, not open to individual/national interpretation
- Number of test takers per cycle
- Number of testing cycles per year
- Testing facilities at your disposal: premises (small/large testing rooms?), amenities (multimedia equipment? PCs/laptops? Headphones/loudspeakers?), staff (number of invigilators? Trained OPI-ers?), etc.
Your Responsibilities
Things you are in control of and can make individual decisions on:
- Test format (based on the test specifications you designed)
- Number of questions, type of questions, elicitation techniques, etc.
- Rating criteria (analytic/holistic/mixed?), cut-off scores, etc.
But even these decisions are heavily influenced by the aforesaid constraints. Whatever your test eventually comes to be, it has to meet the following examination qualities: validity, reliability, impact and practicality.
Quick Overview of the Serbian STANAG 6001 Test
Particulars:
- Levels: multilevel (1-2-3)
- Receptive skills: 40-question pen-and-paper test; question types: MCQ, T/F, CR, matching; scoring: objective; method: modified REDS, establishing cut-off scores for each level
- Productive skills: adaptable test with multilevel prompts, ranging from simple questions/tasks to descriptive preludes; scoring: subjective/rater's judgment (based on an analytic scale); method: mixed (analytic-holistic), establishing MAC for each level
- No. of candidates per testing cycle: 80-140
- No. of testing cycles per year: 3-4
- Test results validity: 3 years
- Partial testing/retesting of individual skills: not possible
Testing the Test
Test analyses are done in different modes and at different stages of test development and test administration.
1. Qualitative analysis: questionnaires, feedback forms, and comments from both test takers and invigilators/interlocutors, after each pre-testing and official test administration. Quantitative analysis: various statistical operations (MS Excel, SPSS), after each pre-testing and official test administration.
2. Analysis of individual items: facility value (FV), discrimination index (DI), calibration against anchor items for each level, variance, distractor efficiency analysis. Analysis of the entire reading/listening test: total score analysis and discrete-level analysis; central tendency (mean, median, mode); dispersion (standard deviation, range, variance); distribution: normal/skewed (skewness, kurtosis); histograms; reliability coefficients (Cronbach's alpha).
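The item- and test-level statistics named above (FV, DI, Cronbach's alpha) can be sketched in plain Python. This is an illustrative computation on an invented 0/1 response matrix, not the team's actual Excel/SPSS workflow; the one-third split for the discrimination index is an assumed convention.

```python
def facility_value(item_scores):
    """FV: proportion of candidates who answered the item correctly."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, fraction=1 / 3):
    """DI: FV in the top-scoring group minus FV in the bottom-scoring group
    (group size = an assumed one third of the candidates)."""
    n = max(1, round(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    bottom, top = ranked[:n], ranked[-n:]
    group_fv = lambda grp: sum(item_scores[i] for i in grp) / len(grp)
    return group_fv(top) - group_fv(bottom)

def cronbach_alpha(matrix):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals),
    where matrix rows are candidates and columns are items."""
    k = len(matrix[0])                       # number of items
    def var(xs):                             # sample variance (n - 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[j] for row in matrix]) for j in range(k)]
    totals = [sum(row) for row in matrix]
    return k / (k - 1) * (1 - sum(item_vars) / var(totals))
```

On a real administration the matrix would hold 40 columns (items) and one row per candidate, mirroring the SPSS reliability run reported later in the deck.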
Testing the Test
3. Analysis of receptive skills: qualitative (the usual instruments plus verbal protocol); quantitative (statistical analysis). Analysis of productive skills: qualitative (feedback from interlocutors/candidates, comments both on and off the record); quantitative (correlations, inter-/intra-rater reliability).
4. Analysis of the test as a whole: qualitative, after test administration, in the form of a report; quantitative, analysis of the achieved SLPs after test administration.
Testing the Test
5. Relating final test results (*when and if possible) to:
- ECL/ALCPT scores (reading, listening)
- Previously achieved SLPs
- STANAG SLPs acquired abroad (Hungary, Germany...)
- Pro-achievement test results from MA, intensive courses and similar tests
- CEFR and other certificates acquired in the civilian sector (foreign language schools, the British Council, Cambridge ESOL and IELTS certificates, etc.)
- BAT (at some point, hopefully) for external benchmarking purposes and criterion-related validity
Scoring Criteria for STANAG 6001 Speaking & Writing Tests
The interlocutor frame (scripted interview) enhances standardization of the speaking test and reduces variability among different raters. Analytic rating scales enhance reliability in speaking and writing tests through more consistent scoring, and also reduce "rater-candidate interaction" and bias. Recorded speaking responses and writing responses are cross-rated for a higher degree of consistency/reliability.
Rating Scale for STANAG 6001 Speaking Test
Columns: Candidate no. | Speaking task no. | Discourse adequacy, coherence and length | Fluency, pronunciation and general intonation | Lexical competence and accuracy | Grammatical competence and accuracy | Awarded level
Inter-Rater Reliability
*Calculated on 12 randomly selected, independently rated speaking samples, rated by raters B, S and N (e.g. Cand1: 2+ / 3 / 2).
Correlations (SPSS output, N = 12):
- Rater B vs Rater S: Pearson correlation ,872**, sig. (2-tailed) ,000
- Rater B vs Rater N: Pearson correlation ,502, sig. (2-tailed) ,096
- Rater S vs Rater N: Pearson correlation ,538, sig. (2-tailed) ,071
**. Correlation is significant at the 0.01 level (2-tailed).
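The rater correlations above come from SPSS; the same computation can be sketched in plain Python. Coding a "plus" level as half a step (e.g. 2+ → 2.5) is an assumption made here for illustration, not the team's official numeric coding.

```python
def level_to_number(slp_level):
    """Assumed coding: '2+' -> 2.5, '1' -> 1.0, '0+' -> 0.5."""
    base = float(slp_level.rstrip('+'))
    return base + (0.5 if slp_level.endswith('+') else 0.0)

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical usage: two raters' levels for the same speaking samples.
rater_b = [level_to_number(v) for v in ['2+', '1+', '3', '2']]
rater_s = [level_to_number(v) for v in ['3', '1+', '3', '2+']]
r = pearson_r(rater_b, rater_s)
```

An r of this kind, computed over all 12 cross-rated samples, is what the ,872** figure in the SPSS table represents.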
Scoring Criteria for STANAG 6001 Reading & Listening Tests
*Adapted REDS method (originally: Sustained = 70-100%, Developing = 55-65%, Emerging = 40-50%, Random = 0-35%)
Total: 40 questions. Maximum: 40 points / 100%.
Sustained cut-offs:
- Level 1: 8 points out of 10 / 80%
- Level 2: 11 points out of 15 / 73.3%
- Level 3: 11 points out of 15 / 73.3%
Scoring Criteria for STANAG 6001 Reading & Listening Tests
- Level 1 (10 items): S 8-10, D 6-7, E 4-5, R 0-3
- Levels 2 and 3 (15 items each): S 11-15, R 0-5 (the D and E bands lie in between)
Scoring Criteria for STANAG 6001 Reading & Listening Tests
SIMPLIFIED TABLE FOR AWARDING LEVELS:
Level 1 | Level 2 | Level 3 | Awarded level
Sustained | Sustained | Sustained | 3
Sustained | Sustained | Developing | 2+
Sustained | Sustained | Emerging/Random | 2
Sustained | Developing | - | 1+
Sustained | Emerging/Random | - | 1
Developing/Emerging/Random | - | - | 0+
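The level-awarding logic can be sketched as follows. The Level-1 bands and the Sustained/Random bands for Levels 2-3 are stated in the slides; the Level-2/3 Developing (9-10) and Emerging (6-8) bands are inferred from the original REDS percentages, and the "highest Sustained level, plus a '+' when the next level is Developing" rule is my reading of the simplified awarding table, so both are assumptions.

```python
# Score bands per level (S/D/E/R). L2/L3 Developing and Emerging bands
# are assumed, derived from the original REDS percentage bands.
L1_BANDS = {'S': range(8, 11), 'D': range(6, 8), 'E': range(4, 6), 'R': range(0, 4)}
L23_BANDS = {'S': range(11, 16), 'D': range(9, 11), 'E': range(6, 9), 'R': range(0, 6)}

def classify(points, bands):
    """Map a raw point count to S, D, E or R for one level block."""
    return next(label for label, band in bands.items() if points in band)

def award_level(l1_pts, l2_pts, l3_pts):
    """Assumed rule: highest consecutive level with Sustained performance;
    add '+' if performance at the next level is Developing."""
    ratings = [classify(l1_pts, L1_BANDS),
               classify(l2_pts, L23_BANDS),
               classify(l3_pts, L23_BANDS)]
    awarded = 0
    for rating in ratings:
        if rating == 'S':
            awarded += 1
        else:
            return f"{awarded}+" if rating == 'D' else str(awarded)
    return str(awarded)
```

For example, a candidate with 10/10 on the Level-1 block, 12/15 on Level 2 and 9/15 on Level 3 would be awarded 2+ under this reading.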
Statistical Operations in Reading Test Analysis
June 2015 administration (118 candidates), average ratings (base levels): Listening 2.06, Speaking 1.59, Reading 2.01, Writing 1.57; mode rating: 2.
Reading Test, June 2016 (N = 118 valid, 0 missing); reliability: Cronbach's alpha = ,769 (,759 based on standardized items), 40 items.
Per-level statistics:
Statistic | Level 1 (10 items) | Level 2 (15 items) | Level 3 (15 items)
Mean | 9,88 | 11,62 | 6,42
Median | 10,00 | 12,00 | 6,00
Mode | 10 | 12 | 4
Std. deviation | ,417 | 1,912 | 3,393
Variance | ,174 | 3,657 | 11,511
Skewness | -4,386 | -,678 | ,226
Std. error of skewness | ,223 (all levels)
Kurtosis | 22,795 | ,155 | -,966
Std. error of kurtosis | ,442 (all levels)
Range | 3 | 8 | 15
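The descriptive statistics in the table above can be reproduced in plain Python (the team uses MS Excel and SPSS). Note that skewness and kurtosis here are the simple moment-based estimators; SPSS applies small-sample corrections, so its figures differ slightly for small N.

```python
import statistics

def describe(scores):
    """Descriptive statistics for a list of per-level raw scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    m = lambda k: sum((x - mean) ** k for x in scores) / n  # central moment
    return {
        'mean': mean,
        'median': statistics.median(scores),
        'mode': statistics.mode(scores),
        'sd': statistics.stdev(scores),          # sample SD (n - 1), as in SPSS
        'variance': statistics.variance(scores),
        'range': max(scores) - min(scores),
        'skewness': m(3) / m(2) ** 1.5,          # moment-based (uncorrected)
        'kurtosis': m(4) / m(2) ** 2 - 3,        # excess kurtosis
    }
```

Run once per level block (Level 1, 2 and 3 scores for all 118 candidates), this yields the mean/median/mode, dispersion and distribution-shape figures reported above.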
Statistical Operations in Reading Test Analysis
Distribution of candidates' scores (0-15) per level shows some overlap among outliers, but the majority of scores do not overlap. (*L1 is excluded due to its smaller number of items (10))
Statistical Operations in Reading Test Analysis
Distribution of items' facility values also shows some overlap between levels.
Statistical Operations in Reading Test Analysis
Average facility value per level: L1 = 98.8%, L2 = 77.5%, L3 = 42.8%
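The per-level average facility values above can be sketched as a simple aggregation over item blocks. The response matrix and item indices below are invented for illustration; on the real test the blocks would be items 1-10 (L1), 11-25 (L2) and 26-40 (L3).

```python
def average_fv(response_matrix, item_indices):
    """Mean facility value over a subset of items (one level's block).
    Rows of response_matrix are candidates; columns are 0/1 item scores."""
    n_cand = len(response_matrix)
    fvs = [sum(row[j] for row in response_matrix) / n_cand
           for j in item_indices]
    return sum(fvs) / len(fvs)

# Hypothetical usage with a 40-item test split into level blocks:
# l1 = average_fv(matrix, range(0, 10))
# l2 = average_fv(matrix, range(10, 25))
# l3 = average_fv(matrix, range(25, 40))
```

The steeply falling averages (98.8% → 77.5% → 42.8%) are what one would expect if the three blocks target progressively harder level descriptors.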
Comparison Chart
*Approximations for comparison purposes, not equations
STANAG 6001 Level | ECL/ALCPT Score | CEFR Scale | ALTE Level | Cambridge ESOL Certificate | IELTS Band Score
1 | 50-65 | A1-A2 | Basic user | KET | 3.0-4.0
2 | 66-85 | B1 (Threshold)-B2 | Independent user | PET, FCE | 4.5-5.0, 5.5-6.5 (Plus)
3 | 86-100 | C1 | Competent user | CAE | 7.0-9.0
4-5 | - | C2 | - | CPE | -
Test Results Correlation (*different test constructs)
Per-candidate table of ALCPT scores against STANAG listening/reading SLPs for 38 candidates (e.g. ALCPT 85 against SLP 2232, ALCPT 32 against SLP 1121).
Correlation (SPSS output): ALCPT value vs. STANAG L, R: Pearson correlation ,675**, sig. (2-tailed) ,000, N = 38.
**. Correlation is significant at the 0.01 level (2-tailed).
SLPs Re-Testing Results
Testing cycle | October 2015 | February 2016 | July 2016
No. of retested candidates | 70 | 76 | 56
Confirmed SLP | 16 (22.9%) | 10 (13.2%) | 24 (42.9%)
Slightly weaker SLP, at least by a (+) in one of four skills | 19 (27.1%) | 15 (19.7%) | 7 (12.5%)
Slightly improved SLP, at least by a (+) in one of four skills | 35 (50%) | 51 (67.1%) | 25 (44.6%)
*The results of retested candidates are as expected. There is typically a 3-5 year gap between testing and retesting, during which the majority of candidates have had some language training improving their skills. However, these shifts are not too dramatic:
2222 ↔ 222+2 ↔ 2+222 ↔ 2+22+2
21+21+ ↔ 2221+ ↔ 1+1+21+ ↔ 21+21
3232 ↔ 32+32 ↔ 2+2+32 ↔ 32+32+
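The retest comparison described above can be sketched in code: an SLP profile is four skill levels (e.g. listening, speaking, reading, writing) written as a digit string with optional pluses, such as "2+22+2". Both the parsing and the "confirmed / slightly weaker / slightly improved" classification below are my reading of the slide, under the assumption that a "+" counts as half a level; they are not an official algorithm.

```python
def parse_slp(slp):
    """'2+22+2' -> [2.5, 2.0, 2.5, 2.0], treating '+' as half a level."""
    levels, i = [], 0
    while i < len(slp):
        value = float(slp[i])
        if i + 1 < len(slp) and slp[i + 1] == '+':
            value += 0.5
            i += 1
        levels.append(value)
        i += 1
    return levels

def compare_slp(old, new):
    """Classify a retest result relative to the previously achieved SLP."""
    a, b = parse_slp(old), parse_slp(new)
    if a == b:
        return 'confirmed'
    if all(y >= x for x, y in zip(a, b)):
        return 'improved'
    if all(y <= x for x, y in zip(a, b)):
        return 'weaker'
    return 'mixed'   # some skills up, some down
```

Under this scheme the example shifts on the slide (2222 ↔ 222+2, etc.) all classify as movements of at most a "+" in one or two skills, matching the claim that the shifts are not too dramatic.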
Pretesting Locally at the Military Academy
- Selected 50-80 senior-year Military Academy cadets with 4 years of continuous English language training
- The upside: cheap, easy to organize, a good sample, testing demographics similar enough
- The downside: certain limitations due to cadets' lack of real-life and job experience
- Pretesting abroad is currently unavailable due to budget cuts and organizational complexity
- Pretesting materials remain secure because cadets are normally tested in a separate testing session and are not eligible for retesting for another 3 years
Cooperation with Other Language Professionals Cooperation with English language professionals, experts and teachers within the system of defence exists on all levels and in all forms (Military Academy Department of Foreign Languages, GS J-7 Training and Doctrine Department – Group for English language training, PELT part-time English language experts and lecturers, etc.) English teachers act as invigilators, interlocutors and expert judges when determining content and face validity, cut-off scores, feedback, etc.
Cooperation with Other Functional Units in the HR Sector
- Reporting the test results
- Interpreting STANAG 6001 language proficiency levels to language non-professionals
- Consulting with personnel departments in the MoD and GS, the Centre for Peacekeeping Operations, the Military Academy, the National Defence School, etc. about language-related career matters: language requirements for appointments, attending courses abroad, participation in PK missions, etc.
Thank you for your attention Questions?