STANAG 6001 OPI Testing – Julie J. Dubeau, Bucharest BILC 2008

STANAG 6001 OPI Testing Julie J. Dubeau Bucharest BILC 2008

Are We All On the Same Page? An Exploratory Study of OPI Ratings Across NATO Countries Using the NATO STANAG 6001 Scale* *This research was completed in 2006 as part of an M.A. thesis in Applied Linguistics.

Presentation Outline
– Context
– Research Questions
– Literature Review
– Methodology
– Results: Ratings, Raters, Scale
– Conclusion

NATO Language Testing Context
– Standardized Language Profile (SLP) based on the NATO Standardization Agreement (STANAG) 6001 Language Proficiency Levels (Ed. 1? Ed. 2?)
– 26 NATO countries, 20 Partnership for Peace (PfP) countries & others …

Interoperability Problem?
Language training is central within armed forces due to the increasing number of peace-support operations, and is considered to have an important role in achieving interoperability among the various players. "The single most important problem identified by almost all partners as an impediment to developing interoperability with the Alliance has been shortcomings in communications" (EAPC (PARP) D, 1997, 1, p. 10).

Overarching Research Question
Since no known study had investigated inter-rater reliability in this context, the main research question was: How comparable or consistent are ratings across NATO raters and countries?

Research Questions
– RQ1: Research questions pertaining to the ratings
– RQ2: Research questions pertaining to raters' training and background
– RQ3: Research questions pertaining to the rating process and to the scale

Research Questions
RQ1 – Ratings:
– How do ratings of the same oral proficiency interviews (OPIs) compare from rater to rater?
– Would the use of plus levels increase rater agreement?
– How do the ratings of the OPIs compare from country to country?
– Are there differences in scores within the same country?

Research Questions
RQ2 – Raters' training and background:
– Are there differences in ratings between raters who have received varying degrees of tester/rater training and STANAG training?
– Did very experienced raters score more reliably than less experienced ones?
– Are experienced raters scoring as reliably as trained raters?
– Are there differences in ratings between participants who test part-time versus full-time, are native or non-native speakers of English, and are from 'Older' and 'Newer' NATO countries?

Research Questions
RQ3 – Rating process and scale use:
– Do differing rating practices affect ratings?
– Do raters appear to use the scale in similar ways?
– What are the raters' comments regarding the use and application of the scale?

Literature Review
– Testing constructs: What are we testing? General proficiency & why; rating scales
– Rater variance: How do raters vary? Rater/scale interaction; rater training & background

Methodology
Design of study: exploratory survey
– 2 Oral Proficiency Interviews (OPIs A & B)
– Rater data questionnaire
– Questionnaire accompanying each sample OPI
Participants: countries recruited at the BILC Seminar in Sofia; raters from 18 countries and 2 NATO units

Analysis:
– Rating comparisons: original ratings; 'plus' ratings
– Rater comparisons: training; background
– Country-to-country comparisons: within-country dispersion
– Rating process: rating factors
– Rater/scale interaction: scale user-friendliness
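
The analysis steps above are named only at a high level. As a purely illustrative sketch (invented ratings and encoding, not the study's actual scripts), the Python below computes three of the listed quantities for one OPI sample: the overall mean, an exact-agreement rate, and within-country dispersion.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical ratings of one OPI sample as (country, STANAG 6001 base level).
# Real data would come from the survey questionnaires.
ratings = [("A", 2), ("A", 2), ("A", 3), ("B", 2), ("B", 2), ("C", 3), ("C", 2)]

values = [level for _, level in ratings]
overall_mean = mean(values)

# Exact agreement: share of raters awarding the modal (most frequent) level.
modal = max(set(values), key=values.count)
agreement = values.count(modal) / len(values)

# Within-country dispersion: standard deviation of the ratings per country.
by_country = defaultdict(list)
for country, level in ratings:
    by_country[country].append(level)
dispersion = {c: pstdev(v) for c, v in by_country.items()}

print(f"overall mean = {overall_mean:.2f}, exact agreement = {agreement:.0%}")
print("within-country SD:", {c: round(sd, 2) for c, sd in dispersion.items()})
```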

Results RQ1 – Summary
Ratings: to compare OPI ratings and to explore the efficacy of 'plus ratings'.
– Some rater-to-rater differences
– 'Plus' levels brought ratings closer to the mean
– Some country-to-country differences
– Greater 'within-country' dispersion in some countries
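
The 'plus levels' finding lends itself to a small worked example. The sketch below uses invented scores and assumes a plus level is encoded as a half-step (e.g. 2+ as 2.5), which is one common convention, to show how letting raters award a plus can pull a set of ratings closer to the mean.

```python
from statistics import mean, pstdev

# A rater torn between levels 2 and 3 must round to a whole level on the
# base scale, but can award 2+ (encoded here as 2.5) when plus levels exist.
base_only = [2, 3, 2, 3, 2]        # forced whole-level ratings
with_plus = [2.5, 2.5, 2, 2.5, 2]  # the same judgments with plus levels

for label, scores in (("base levels", base_only), ("with plus", with_plus)):
    print(f"{label}: mean = {mean(scores):.2f}, SD = {pstdev(scores):.2f}")
```

With these invented numbers the dispersion drops (from an SD of 0.49 to 0.24), i.e. the ratings sit closer to their mean.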

View of OPI ratings, Sample A [chart]

Results: Sample A (Level 1), all ratings (with +)
[Table: numbers and % of ratings falling within the Level 1, Level 2, and Level 3 ranges]

All Countries' Means for Sample A
[Chart: country means plotted against the overall country mean]

All Ratings for Sample B (Level 2)
[Table: levels, numbers, %; total row: 96, 93.2]

View of OPI ratings, Sample B [chart]

All Countries' Means for Sample B [chart]

Results RQ2 – Summary
Raters: to investigate rater training and scale training and see how (or if) they impacted the ratings, and to explore how various background characteristics impacted the ratings.
– Trained raters scored within the mean, especially for Sample B
– Experienced raters did not do as well as scale-trained raters
– Full-time raters scored closer to the mean
– 'New' NATO raters scored slightly closer to the mean
– NNS raters scored slightly closer to the mean

Tester (Rater) Training
[Chart: substantial to lots 63.27%, none to little 36.73%]

Years of Experience
[Chart: 5 years + 49.5%, 4 to 5 years 15.84%, 2 to 3 years 19.8%, 0 to 1 year 14.85%]

STANAG Scale Training
[Chart: substantial to lots 40.0%, none to little 60.0%]

'Old' vs. 'New' NATO Countries
[Table: summary of tester training (little/lots) by newer NATO membership (yes/no)]

'Old' vs. 'New' NATO Countries
[Table: whether OPI B was rated correctly (yes/no) by newer NATO membership (yes/no/other or missing)]

Results: Raters' Background
Conducts testing full-time? Yes: 34 (33.0%); No: 67 (65.0%)
– Full-time testers more reliable (accurate)
– NNS (60%) raters better trained?
– 'New' raters better trained?

Results RQ3 – Summary
Scale: to explore the ways in which raters used the various STANAG statements and rating factors to arrive at their ratings.
– Rating process did not affect ratings significantly
– 3 main 'types' of raters emerged: evidence-based, intuitive, extra-contextual

Results
An 'evidence-based' rating for Sample B (level 2): "I compared the candidate's performance with the STANAG criteria (levels 2 and 3) and decided that he did not meet the requirements for level 3 with regard to flexibility and the use of structural devices. Errors were frequent not only in low-frequency structures, but in some high-frequency areas as well." (Rater 90 – rated 2)

Results
An 'intuitive' rating for Sample A (level 1): "I would say that just about every single sentence in the interpretation of the level 2 speaking could be applied to this man. And because of that I would say that he is literally at the top of level 2. He is on the verge of level 3 literally. So I would automatically up him to a low 3." (Rater 1 – rated 3)

Results
An 'extra-contextual' rating for Sample A (level 1): "Level 3 is the basic level needed for officers in (my country). I think the candidate could perform the tasks required of him. He could easily be bulldozed by native speakers in a meeting, but would hold his own with non-native speakers. He makes mistakes that very rarely distort meaning and are rarely disturbing." (Rater 95 – rated 2)

Implications
– Training not equal in all countries
– Scale interpretation
– Plus levels useful
– Different grids, speaking tests
– Institutional perspectives

Limitations & Future Research
– Participants may not have rated this way in their own countries
– OPIs new to some participants
Future research could:
– Get participants to test
– Investigate rating grids
– Look at other skills

Conclusion of Research
So, are we all on the same page? YES! BUT…
– Plus levels were instrumental in bridging the gap
– Training was found to be key to reliability
– More in-country training should be the first step toward international benchmarking.

Thank You!
Are We All On the Same Page? An Exploratory Study of OPI Ratings Across NATO Countries Using the NATO STANAG 6001 Scale
The full thesis is available on the CDA website, or Google "Dubeau thesis".