Testing the Test – Serbian STANAG 6001 English Language Test


Testing the Test – Serbian STANAG 6001 English Language Test
STANAG 6001 Testing Team, PELT Directorate, Serbian MoD
STANAG 6001 Testing Workshop, Brno, Czech Republic, 6-8 September 2016

General and Specific Concerns

Any kind of testing or examination raises both general and specific points of concern. On the general points, relevant to any kind of language examination, we are guided by the principles set out in the Principles of Good Practice for ALTE Examinations (Association of Language Testers in Europe). The specific points of concern arise from the following:
- STANAG 6001 is a high-stakes examination;
- it is a language proficiency test of general English in a military setting;
- it is a criterion-referenced test, based on the STANAG 6001 table of level descriptors, incommensurate with other criterion-referenced tests (e.g. Cambridge ESOL exams, IELTS, etc.) and language proficiency scales (CEFR, ALTE levels, etc.).

Limiting Factors

Bearing this in mind, there are many serious constraints when designing the test, including things beyond your control:
- What are the actual needs of the particular nation? (NATO member? PfP member? MD member? Test all levels? Test Level 4?)
- What kind of test? (Multi-level 1-2-3? Bi-level L1/2 or L2/3? Single-level?)
- STANAG 6001 language descriptors are uniform, not open to individual/national interpretation.
- Number of test takers per cycle.
- Number of testing cycles per year.
- Testing facilities at your disposal: premises (small/large testing rooms?), amenities (multimedia equipment? PCs/laptops? headphones/loudspeakers?), staff (number of invigilators? trained OPI-ers?), etc.

Your Responsibilities

Things you are in control of and can make individual decisions on are the following:
- test format (based on the test specifications you designed);
- number of questions, types of questions, elicitation techniques, etc.;
- rating criteria (analytic? holistic? mixed?), cut-off scores, etc.

But even these decisions are heavily influenced by the aforementioned constraints. Whatever your test eventually comes to be, it has to meet the following examination qualities: validity, reliability, impact and practicality.

Quick Overview of the Serbian STANAG 6001 Test

Particulars:
- Levels: multilevel (1-2-3)

Receptive skills:
- Test format: 40-question pen-and-paper test
- Type of questions: MCQ, T/F, CR, matching
- Scoring: objective
- Method: modified REDS, establishing cut-off scores for each level

Productive skills:
- Test format: adaptable test with multilevel prompts, ranging from simple questions/tasks to descriptive preludes
- Scoring: subjective/rater's judgment (based on an analytic scale)
- Method: mixed (analytic-holistic), establishing MAC for each level

General:
- No. of candidates per testing cycle: 80-140
- No. of testing cycles per year: 3-4
- Test results validity: 3 years
- Partial testing/retesting of individual skills: not possible

Testing the Test

Test analyses are done in different modes and at different stages of test development and test administration:

1. Qualitative analysis: questionnaires, feedback forms, and comments from both test takers and invigilators/interlocutors, after each pre-testing and each official test administration. Quantitative analysis: various statistical operations (MS Excel, SPSS), after each pre-testing and each official test administration.
2. Analysis of individual items: FV, DI, calibration against anchor items for each level, variance, distractor efficiency analysis (see the sketch below). Analysis of the entire reading/listening test: total score analysis and discrete level analysis; central tendency (mean, median, mode); dispersion (standard deviation, range, variance); distribution: normal/skewed (skewness, kurtosis), histograms; reliability coefficients (Cronbach's alpha).
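
The item statistics above (FV, DI) are classical test theory measures. Below is a minimal illustrative sketch of how they can be computed from a scored response matrix; it is not the team's actual Excel/SPSS routine, and all names (item_analysis, responses) are hypothetical.

    # Minimal sketch of classical item analysis: facility value (FV) and an
    # upper/lower-group discrimination index (DI). Illustrative only; the
    # team's real analyses are done in MS Excel and SPSS.
    def item_analysis(responses):
        """responses: list of candidates, each a list of 0/1 item scores."""
        n = len(responses)
        k = len(responses[0])
        # rank candidates by total score, take the bottom and top thirds
        ranked = sorted(responses, key=sum)
        third = max(1, n // 3)
        lower, upper = ranked[:third], ranked[-third:]
        results = []
        for i in range(k):
            fv = sum(r[i] for r in responses) / n            # facility value
            di = (sum(r[i] for r in upper) / len(upper)
                  - sum(r[i] for r in lower) / len(lower))   # discrimination
            results.append({"item": i + 1, "FV": fv, "DI": di})
        return results

A high FV marks an easy item; a low or negative DI flags an item that does not separate stronger from weaker candidates and is a candidate for revision.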

Testing the Test

3. Analysis of receptive skills: qualitative (the usual instruments plus verbal protocol); quantitative (statistical analysis). Analysis of productive skills: qualitative (feedback from interlocutors/candidates, comments both on and off the record); quantitative (correlations, inter-/intra-rater reliability).
4. Analysis of the test as a whole: qualitative, after test administration, in the form of a report. Analysis of the achieved SLPs: quantitative, after test administration.

Testing the Test

5. Relating final test results* to:
- ECL/ALCPT scores (reading, listening);
- previously achieved SLPs;
- STANAG SLPs acquired abroad (Hungary, Germany...);
- pro-achievement test results from the Military Academy, intensive courses and similar tests;
- CEFR and other certificates acquired in the civilian sector (foreign language schools, the British Council, Cambridge ESOL and IELTS certificates, etc.);
- BAT (at some point, hopefully) for external benchmarking purposes and criterion-related validity.
*when and if possible

Scoring Criteria for STANAG 6001 Speaking & Writing Tests

An interlocutor frame (scripted interview) enhances standardization of the speaking test and reduces variability among different raters. Analytic rating scales enhance reliability in the speaking and writing tests by making scores more consistent, and also reduce "rater-candidate interaction" and bias. Recorded speaking responses and writing responses are cross-rated for a higher degree of consistency/reliability.

Rating Scale for STANAG 6001 Speaking Test

For each candidate (candidate no.) and each speaking task (task no.), raters score four analytic criteria and record an awarded level:
- discourse adequacy, coherence and length;
- fluency, pronunciation and general intonation;
- lexical competence and accuracy;
- grammatical competence and accuracy.

Inter-Rater Reliability

*Calculated on 12 randomly selected, independently rated speaking samples (Raters B, S and N; awarded levels such as 2+, 3, 2, 1+, 1, 0+).

Pearson correlations (N = 12, 2-tailed significance):
Rater B vs. Rater S: r = .872**, sig. = .000
Rater B vs. Rater N: r = .502, sig. = .096
Rater S vs. Rater N: r = .538, sig. = .071
**Correlation is significant at the 0.01 level (2-tailed).
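
Correlations like these can be reproduced with any statistics package. As a sketch, here is how the Pearson coefficient and its 2-tailed significance could be computed in Python with SciPy, assuming plus levels are mapped to numeric values (e.g. 2+ → 2.5; that coding is an assumption, not stated on the slide), and using illustrative data rather than the real ratings.

    # Sketch: inter-rater reliability as a Pearson correlation between two
    # raters' speaking levels. The plus-level-to-number mapping is assumed.
    from scipy.stats import pearsonr

    def to_number(level):
        """'2+' -> 2.5, '3' -> 3.0 (assumed numeric coding)."""
        return float(level.rstrip('+')) + (0.5 if level.endswith('+') else 0.0)

    rater_b = ['2+', '3', '2', '1+', '2', '1']     # illustrative data only
    rater_s = ['2+', '3', '2+', '1+', '2', '1+']
    r, p = pearsonr([to_number(v) for v in rater_b],
                    [to_number(v) for v in rater_s])
    print(f"Pearson r = {r:.3f}, sig. (2-tailed) = {p:.3f}")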

Scoring Criteria for STANAG 6001 Reading & Listening Tests

*Adapted REDS method (originally: Sustained = 70-100%, Developing = 55-65%, Emerging = 40-50%, Random = 0-35%)

Total: 40 questions. Maximum: 40 points / 100%. "Sustained" thresholds:
Level 1: 8 points out of 10 (80%)
Level 2: 11 points out of 15 (73.3%)
Level 3: 11 points out of 15 (73.3%)

Scoring Criteria for STANAG 6001 Reading & Listening Tests

Score bands (points) per level:
Level | S | D | E | R
Level 1 (out of 10) | 8-10 | 6-7 | 4-5 | 0-3
Level 2 (out of 15) | 11-15 | | | 0-5
Level 3 (out of 15) | 11-15 | | | 0-5

Scoring Criteria for STANAG 6001 Reading & Listening Tests

Simplified table for awarding levels (bands achieved at Level 1 / Level 2 / Level 3 → awarded level):
Sustained / Sustained / Sustained → 3
Sustained / Sustained / Developing → 2+
Sustained / Sustained / Emerging/Random → 2
Sustained / Developing / (any) → 1+
Sustained / Emerging/Random / (any) → 1
Developing / (any) / (any) → 0+
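
Putting the cut-offs and the awarding table together, level assignment reduces to a short rule. The sketch below encodes the simplified table as reconstructed above; it assumes the S/D/E/R band at each level has already been determined from the score-band table, and the fallback for weaker Level 1 bands is an assumption.

    # Sketch: awarding a reading/listening level from the per-level bands
    # of the simplified table above. Bands: 'S', 'D', 'E', 'R'.
    def award_level(l1, l2, l3):
        """l1, l2, l3: band achieved at Levels 1, 2 and 3 respectively."""
        if l1 == 'S' and l2 == 'S':
            if l3 == 'S':
                return '3'
            return '2+' if l3 == 'D' else '2'
        if l1 == 'S':
            return '1+' if l2 == 'D' else '1'
        # Level 1 not sustained; '0' for E/R at Level 1 is an assumption
        return '0+' if l1 == 'D' else '0'

    print(award_level('S', 'S', 'D'))   # -> 2+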

Statistical Operations in Reading Test Analysis

June 2015 (118 candidates), average rating* (base levels) per skill:
Listening 2.06 | Speaking 1.59 | Reading 2.01 | Writing 1.57 (mode rating: 2)

Reading test, June 2016, reliability statistics:
Cronbach's alpha = .769 | Cronbach's alpha based on standardized items = .759 | N of items = 40

*Descriptive statistics per level section (N valid = 118, none missing):
Statistic | Level 1 (10 items) | Level 2 (15 items) | Level 3 (15 items)
Mean | 9.88 | 11.62 | 6.42
Median | 10.00 | 12.00 | 6.00
Mode | 10 | 12 | 4
Std. deviation | .417 | 1.912 | 3.393
Variance | .174 | 3.657 | 11.511
Skewness (std. error .223) | -4.386 | -.678 | .226
Kurtosis (std. error .442) | 22.795 | .155 | -.966
Range | 3 | 8 | 15
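
The Cronbach's alpha figure above comes from SPSS. As a cross-check sketch, the coefficient can be computed directly from its definition, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores); the function name and data layout are illustrative, not the team's actual workflow.

    # Sketch: Cronbach's alpha from a scored response matrix, using the
    # standard formula alpha = k/(k-1) * (1 - sum(item var) / total var).
    from statistics import pvariance

    def cronbach_alpha(responses):
        """responses: list of candidates, each a list of item scores."""
        k = len(responses[0])
        item_vars = sum(pvariance([r[i] for r in responses]) for i in range(k))
        total_var = pvariance([sum(r) for r in responses])
        return k / (k - 1) * (1 - item_vars / total_var)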

Statistical Operations in Reading Test Analysis

The distribution of candidates' scores (0-15) per level shows some overlap among outliers, but the majority of scores do not overlap. (*Level 1 is excluded due to its smaller number of items (10).)

Statistical Operations in Reading Test Analysis

The distribution of item facility values also shows some overlap.

Statistical Operations in Reading Test Analysis

Average facility value per level: L1 = 98.8%, L2 = 77.5%, L3 = 42.8%.
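
A per-level average facility value like the one above is simply the mean proportion of correct answers over that level's items. A sketch follows, assuming the 40 items are ordered by level as in the test overview (10 Level 1 items, then 15 Level 2, then 15 Level 3); that ordering and the function name are assumptions.

    # Sketch: mean facility value for one level section of the 40-item test.
    # Item ordering by level is an assumption based on the test format.
    def mean_facility(responses, start, stop):
        """Average proportion correct over items [start, stop)."""
        n = len(responses)
        per_item = [sum(r[i] for r in responses) / n for i in range(start, stop)]
        return sum(per_item) / len(per_item)

    # level sections: L1 = items 0-9, L2 = items 10-24, L3 = items 25-39
    # e.g. mean_facility(responses, 0, 10) -> average FV of the L1 items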

Comparison Chart

*Approximations for comparison purposes, not exact equivalences

STANAG 6001 level | ECL/ALCPT score | CEFR scale | ALTE level | Cambridge ESOL certificate | IELTS band score
1 | 50-65 | A1-A2 (Basic user) | — | KET | 3.0-4.0
2 (Threshold) | 66-85 | B1 (Independent user) | 2 | PET | 4.5-5.0
2+ (Plus) | — | B2 | 3 | FCE | 5.5-6.5
3 | 86-100 | C1 (Competent user) | 4 | CAE | 7.0-9.0
4 | — | C2 | 5 | CPE | —
5 | — | — | — | — | —

Test Results Correlation (*different test construct)

For 38 candidates, ALCPT scores (e.g. 85, 58, 61, 73, 78...) were set against the listening and reading components of their STANAG SLPs (e.g. 2232, 1221, 2121...).

Correlation, ALCPT value vs. STANAG L, R (N = 38): Pearson correlation = .675**, sig. (2-tailed) = .000
**Correlation is significant at the 0.01 level (2-tailed).

SLPs Re-Testing Results

Testing cycle | October 2015 | February 2016 | July 2016
No. of retested candidates | 70 | 76 | 56
Confirmed SLP | 16 (22.9%) | 10 (13.2%) | 24 (42.9%)
Slightly weaker SLP, at least by a (+) in one of four skills | 19 (27.1%) | 15 (19.7%) | 7 (12.5%)
Slightly improved SLP, at least by a (+) in one of four skills | 35 (50%) | 51 (67.1%) | 25 (44.6%)

*The results of the retested candidates are as expected. There is typically a 3-5 year gap between testing and retesting, during which the majority of candidates have had some language training that improved their skills. However, the shifts are not too dramatic: 2222 ↔ 222+2 ↔ 2+222 ↔ 2+22+2; 21+21+ ↔ 2221+ ↔ 1+1+21+ ↔ 21+21; 3232 ↔ 32+32 ↔ 2+2+32 ↔ 32+32+.

Pretesting Locally at the Military Academy

- Selected 50-80 senior-year Military Academy cadets with four years of continuous English language training.
- The upside: cheap, easy to organize, a good sample, testing demographics similar enough.
- The downside: certain limitations due to cadets' lack of real-life and job experience.
- Pretesting abroad is currently unavailable due to budget cuts and organizational complexity.
- Pretesting materials remain secure because cadets are normally tested in a separate testing session and are not eligible for retesting for another 3 years.

Cooperation with Other Language Professionals

Cooperation with English language professionals, experts and teachers within the defence system exists at all levels and in all forms (Military Academy Department of Foreign Languages; GS J-7 Training and Doctrine Department, Group for English Language Training; PELT part-time English language experts and lecturers, etc.). English teachers act as invigilators, interlocutors and expert judges when determining content and face validity, cut-off scores, feedback, etc.

Cooperation with Other Functional Units in the HR Sector

- Reporting the test results.
- Interpreting STANAG 6001 language proficiency levels to language non-professionals.
- Consulting with personnel departments in the MoD and GS, the Centre for Peacekeeping Operations, the Military Academy, the National Defence School, etc. on language-related career matters: language requirements for appointments, attending courses abroad, participation in PK missions, etc.

Thank you for your attention. Questions?