The BILC BAT: A Research and Development Success Story Ray T. Clifford BILC Professional Seminar Vienna, Austria 11 October

Language is the most complex of human behaviors. Language proficiency is clearly not a simple, one-dimensional trait. Therefore, language development cannot be expected to be linear. However, language proficiency can be assessed against a hierarchy of identifiable, common stages of language skill development.

Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing for proficiency.
– Rasch one-factor IRT analysis assumes:
   a. A one-dimensional trait.
   b. Linear skill development.
   c. That all test items discriminate equally well.
– Norm-referenced statistics are meant to distinguish all students from one another, not to separate passing students from failing students.
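
To make the equal-discrimination assumption concrete, here is a minimal sketch (Python, with illustrative numbers not taken from the presentation) of the one-parameter Rasch model: the response function has a person-ability and an item-difficulty parameter, but no per-item discrimination parameter, which is why the model assumes all items discriminate equally well.

```python
import math

def rasch_p_correct(theta: float, b: float) -> float:
    """Probability of a correct answer under the Rasch (1PL) model.

    theta: person ability; b: item difficulty. There is no per-item
    discrimination parameter, so every item's response curve has the
    same slope -- the assumption the slide calls out.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Two items of different difficulty share the same discrimination:
for b in (-1.0, 1.0):
    print([round(rasch_p_correct(theta, b), 2) for theta in (-2.0, 0.0, 2.0)])
```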

Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing for proficiency.
– They require too many subjects for use in LCTLs.
   a. About 100 to 300 test subjects of varying abilities must answer each item.
   b. For less commonly taught languages, that many test takers may simply not exist.
– The results do not have a direct relationship to proficiency levels or other external criteria.

Testing Language Proficiency in the Receptive Skills
Norm-referenced statistical analyses are problematic when testing for proficiency.
– There has been no adequate way of ensuring that the range of skills tested and the difficulty of any given test match the targeted range of the language proficiency scale.
– Setting passing scores using norm-referenced statistics is an imprecise process.
– Setting multiple cut scores from a single total test score violates the criterion-referenced principle of non-compensatory scoring.

Test Development Procedures: Norm-Referenced Tests
1. Create a table of test specifications.
2. Train item writers in item-writing techniques.
3. Develop items.
4. Test the items for difficulty and reliability by administering them to several hundred learners.
5. Use statistics to eliminate “bad” items (as sketched below).
6. Administer the resulting test.
7. Report results compared to other students, or attempt to relate these norm-referenced results to a polytomous set of criteria (such as the STANAG scale).
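
Step 5 is typically carried out with classical item statistics. The following sketch (Python; the thresholds are common illustrative conventions, not values from the presentation) flags “bad” items by difficulty (proportion correct) and point-biserial discrimination against the rest-of-test score.

```python
import numpy as np

def screen_items(responses: np.ndarray,
                 p_range=(0.2, 0.9), min_disc=0.2):
    """responses: examinees-by-items matrix of 0/1 scores.

    Flags items whose difficulty (proportion correct) falls outside
    p_range, or whose point-biserial correlation with the rest-of-test
    score is below min_disc. Thresholds are illustrative only.
    """
    totals = responses.sum(axis=1)
    flags = []
    for i in range(responses.shape[1]):
        item = responses[:, i]
        rest = totals - item                 # avoid item-total contamination
        p = item.mean()                      # classical difficulty
        r = np.corrcoef(item, rest)[0, 1]    # point-biserial discrimination
        flags.append(not (p_range[0] <= p <= p_range[1]) or r < min_disc)
    return flags
```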

Traditional Method of Setting Cut Scores
[Figure: a test to be calibrated is administered to groups of “known” ability: Level 1, Level 2, and Level 3 groups.]

The Results You Hope For:
[Figure: the three groups’ score distributions on the test to be calibrated, cleanly separated by level.]

The Results You Always Get:
[Figure: test scores received, on a 0–100 scale, by the Level 1, Level 2, and Level 3 groups, with the groups’ score ranges overlapping.]

Why is there always an overlap?
Total scores are by definition “compensatory” scores.
– Every answer guessed correctly adds to the individual’s score.
– There is no way to check for ability at a given proficiency level.
Students with different abilities may attain the same score by:
– Answering only the Level 1 questions right.
– Answering 25% of all the questions right.
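
A tiny worked example of the compensatory-score problem, with hypothetical numbers: on a 12-item test with four items at each of Levels 1, 2, and 3, a solid Level 1 reader and a learner who gets roughly a quarter to a third of the items right at every level can earn identical totals.

```python
# Hypothetical 12-item test: four items at each of Levels 1, 2, and 3.
solid_level_1 = {"Level 1": 4, "Level 2": 0, "Level 3": 0}  # all L1 items right
lucky_guesser = {"Level 1": 2, "Level 2": 1, "Level 3": 1}  # scattered correct answers

# The compensatory total hides the difference in ability profiles:
print(sum(solid_level_1.values()), sum(lucky_guesser.values()))  # 4 4
```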

No matter where the cut scores are set, they are wrong for someone.
[Figure: the same overlapping 0–100 score distributions for the Level 1, Level 2, and Level 3 groups; any cut score placed on the total-score scale misclassifies some examinees.]

A Better Way
We can test language proficiency using criterion-referenced instead of norm-referenced testing procedures.

Criterion-Referenced Proficiency Testing in the Receptive Skills
Items must strictly adhere to the proficiency “Table of Specifications”. Every component of the test item must be aligned with and match the specifications of a single level of the proficiency scale:
– The text difficulty
– The author’s purpose
– The task asked of the reader/listener

Criterion-Referenced Proficiency Testing in the Receptive Skills
Testing reading and listening proficiency requires “independent, non-compensatory scoring” for each proficiency level, not calculating a single score for the entire test. This makes the test development process more complex:
– It requires trained item writers and reviewers.
– It begins with “modified Angoff” ratings instead of IRT procedures to validate items.

The BILC Benchmark Advisory Test (Reading)
Is a Criterion-Referenced Proficiency Test.

Steps in the Process
1. We updated the STANAG 6001 Proficiency Scale.
   a. Each level describes a measurable point on the scale.
   b. These assessment points are not arbitrary, but represent useful levels of ability, e.g. Survival, Functional, Professional, etc.
   c. Thus, each level represents a defined “construct” of language ability.

Steps in the Process
2. We validated the scale.
   a. The hierarchical nature of these constructs had been operationally – but not statistically – validated.
   b. A statistical validation process was run in Sofia, Bulgaria.
   c. The results substantiated the validity of the scale’s operational use.

STANAG 6001 Scale Validation Exercise
Conducted in Sofia, Bulgaria, 13 October 2005

Instructions
On the top of a blank piece of paper, write the following information:
1. Your current work assignment: Teacher, Tester, Administrator, Other ______
2. Your first (or dominant) language: _________
3. You do not need to write your name!

Instructions
Next, write the numbers 0 through 5 down the left side of the paper.

Instructions
You will now be shown 6 descriptions of language speaking proficiency. Each description will be labeled with a color.

Instructions
Rank the descriptions according to their level of difficulty by writing each description’s color designation next to the appropriate number:
0 (easiest) = Color ?
1 (next easiest) = Color ?
2 (next easiest) = Color ?
3 (next easiest) = Color ?
4 (next easiest) = Color ?
5 (most difficult) = Color ?

Ready?
The descriptions will now be presented…
– One at a time,
– In a random sequence,
– For 15 seconds each.
You will see each of the descriptors 4 times.
Thank you for participating in this experiment.

STANAG 6001 Scale Validation: A Timed Exercise Without Training
74 people turned in their rankings. They marked their current work assignments as:
– Administrator: 49
– Teacher: 26
– Tester: 19
– Other: 1

Results of the STANAG Scale Validation (n = 74)

Steps in the Process
3. We used the STANAG 6001 base proficiency levels as the definitive specifications for item development.
   a. The author’s task and purpose in producing the text have to be aligned with the question or task asked of the reader.
   b. The written (or audio) text type and the linguistic characteristics of each item must also be characteristic of the proficiency level targeted by the item.

Steps in the Process
4. The items developed then had to pass a strict review of whether each item matched the design specifications.
   a. Multiple expert judges made independent judgments of whether each item matched the targeted level.
   b. Only the items which passed this review with the unanimous consensus of the trained judges were taken to the next step.

Steps in the Process
5. The next step was a “bracketing” process to check the adequacy of each question’s multiple-choice options.
   a. Experts were asked to make independent judgments about how likely a learner at the next lower level would be to answer the question correctly. Responses significantly above chance (25% for a four-option item) made the item unacceptable. In such cases the item, the item question, or the item choices had to be discarded or revised.

Steps in the Process
5. (Cont.)
   b. Experts made independent judgments about how likely a learner at the next higher level would be to answer each question correctly. If the item would not be answered correctly by this more competent group, it was rejected. (Because of human limitations – inattention, fatigue, carelessness, etc. – it was recognized that the correct-response probability for this more competent group would be less than 100%.) A sketch of both bracketing checks follows.
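
A hedged sketch of how the two bracketing judgments might be aggregated (Python); the 10-point margin above chance and the 80% floor for the higher-level group are illustrative assumptions, not figures from the BAT project.

```python
from statistics import mean

CHANCE = 0.25  # expected guessing rate on a four-option item

def bracket_check(lower_level_probs, higher_level_probs,
                  guess_margin=0.10, higher_floor=0.80):
    """lower_level_probs / higher_level_probs: independent judges' estimates
    of how likely a learner one level BELOW / ABOVE the target is to answer
    the item correctly. Margin and floor values are illustrative only."""
    if mean(lower_level_probs) > CHANCE + guess_margin:
        return "reject: answerable by the next lower level"
    if mean(higher_level_probs) < higher_floor:
        return "reject: missed too often by the next higher level"
    return "pass"

print(bracket_check([0.25, 0.30, 0.20], [0.90, 0.85, 0.95]))  # pass
```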

Steps in the Process
6. Items that passed the technical specifications review and the bracketing process then underwent a “Modified Angoff” rating procedure.
   a. Expert judges rated the probability that each item would be correctly answered by a person who was fully competent at the targeted proficiency level.
   b. If the independent probability ratings produced an outlier rating or a standard deviation of more than 5 points, the item was rejected and/or revised.
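
A minimal sketch of the rating screen in step 6b (Python). The presentation does not define “outlier” precisely, so the two-standard-deviation rule below is one plausible, assumed reading.

```python
from statistics import mean, stdev

def angoff_screen(ratings, max_sd=5.0):
    """ratings: each judge's estimate, in percent, that a person fully
    competent at the targeted level answers the item correctly.

    Returns the mean rating if the judges agree closely enough,
    or None if the item should be rejected or revised."""
    sd = stdev(ratings)
    if sd > max_sd:
        return None                      # judges disagree too much
    if any(abs(r - mean(ratings)) > 2 * sd for r in ratings):
        return None                      # an outlier rating (illustrative rule)
    return mean(ratings)

print(angoff_screen([80, 84, 82, 78]))   # 81.0: ratings agree
print(angoff_screen([60, 85, 90, 88]))   # None: SD over 5 points
```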

Steps in the Process
7. Items found acceptable in the “Modified Angoff” rating procedure were assembled into an online test.
   a. The test had three subtests of 20 items each.
   b. There was a separate subtest for each of the Reading proficiency Levels 1, 2, and 3.
   c. Each subtest was to be graded separately.
   d. “Sustained performance” (passing) on each subtest was defined as the mean Angoff rating minus one standard deviation, or 70%.
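
The cut-score arithmetic in 7d, sketched with hypothetical ratings (Python). Whether 70% acts as a floor, a fallback, or the typical outcome is not specified on the slide, so this sketch simply reports both values.

```python
from statistics import mean, stdev

def sustained_performance_cut(item_angoff_means, stated_cut=70.0):
    """item_angoff_means: the mean Angoff rating (percent) for each item
    in a 20-item subtest. The slide defines the passing score as the
    mean Angoff rating minus one standard deviation, or 70%."""
    ratings_cut = mean(item_angoff_means) - stdev(item_angoff_means)
    return ratings_cut, stated_cut

# Hypothetical subtest whose items average around 78%:
print(sustained_performance_cut([75, 80, 78, 82, 76]))  # (~75.3, 70.0)
```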

More About Scoring
Scoring had to follow criterion-referenced, non-compensatory proficiency assessment procedures.
– “Sustained” ability would be required to qualify as proficient at each level.
– Summary ratings would consider both “floor” and “ceiling” abilities.
– Each learner’s performance profile would determine “between-level” ratings (if any).
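
The scoring rules above might look like the following sketch (Python). Scoring each subtest against its own cut is taken from the slides; the floor/ceiling profile logic is an assumption about how between-level ratings could be derived, not the BAT’s documented algorithm.

```python
def assign_reading_level(subtest_pcts, cut=70.0):
    """subtest_pcts: percent correct on the Level 1, 2, and 3 subtests,
    e.g. {1: 95.0, 2: 80.0, 3: 40.0}. Each subtest is scored separately
    (non-compensatory): a strong Level 3 score cannot offset a failed
    Level 2 subtest.

    Returns the floor (highest level sustained with all lower levels
    also sustained) plus a flag for ceiling evidence above the floor;
    the between-level ('plus') logic is an illustrative assumption."""
    floor = 0
    for level in (1, 2, 3):
        if subtest_pcts.get(level, 0.0) >= cut:
            floor = level
        else:
            break
    ceiling_evidence = any(
        subtest_pcts.get(level, 0.0) >= cut for level in range(floor + 2, 4)
    )
    return floor, ceiling_evidence  # e.g. (1, True) might suggest a 1+ profile

print(assign_reading_level({1: 95.0, 2: 80.0, 3: 40.0}))  # (2, False)
```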

And the results…
More pilot testing will be done, but here are the results of the first 36 pilot tests:

Congratulations!
Working together, we have solved a major testing problem – a problem which has plagued language testers for decades. We have developed a criterion-referenced proficiency test of Reading which:
– Accurately assigns proficiency levels.
– Has both face validity and statistical validity.

Questions?

Some additional thoughts…
The assessment points or levels in the STANAG scale may be thought of as “chords” – each of which describes a short segment along an extended, multi-dimensional proficiency development scale. These “chords” represent cross-dimensional constellations of factors that correspond to different levels of language ability. Like chords approximating a curve in calculus, these defined progress levels allow us to accurately measure whether the particular set of factors described at each level has been mastered. Each proficiency level or factor constellation can also be seen as a separate construct, and these constructs can be shown to form an ascending hierarchy of increasing language proficiency which meets Guttman scaling criteria. Therefore, these “points” in the scale can also indicate overall proficiency development.
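
Since the slide invokes Guttman scaling criteria, here is a minimal sketch (Python) of the coefficient of reproducibility for pass/fail profiles over the ordered levels. The error-counting rule used (compare each profile to the ideal “staircase” with the same number of passes) is one common convention, assumed here for illustration; by the usual rule of thumb, a coefficient of about 0.90 or higher is taken as evidence of a scalable hierarchy.

```python
def guttman_reproducibility(profiles):
    """profiles: pass/fail tuples over ordered levels, lowest level first,
    e.g. (1, 1, 0) = passed Levels 1 and 2, failed Level 3.

    On a perfect Guttman scale, passing a level implies passing every
    lower level. Errors are deviations from the ideal 'staircase'
    pattern with the same number of passes."""
    errors = responses = 0
    for profile in profiles:
        responses += len(profile)
        passes = sum(profile)
        ideal = (1,) * passes + (0,) * (len(profile) - passes)
        errors += sum(a != b for a, b in zip(profile, ideal))
    return 1 - errors / responses

# Two scalable profiles and one 'gap' profile (passed L3 but failed L2):
print(guttman_reproducibility([(1, 1, 0), (1, 0, 0), (1, 0, 1)]))  # ~0.78
```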