1
Criterion-Referenced Proficiency Testing
BILC, 26 May 2016
Ray Clifford
2
Norm-Referenced (NR) or Criterion-Referenced (CR): Does it make a difference?
The classic distinction between CR and NR tests has focused on why the test is given.
–Norm-referenced tests compare people with each other.
–Criterion-referenced tests compare each person against a criterion or standard.
But there are other differences as well!
3
Test Characteristics of Typical NR and CR Tests
NR Tests
–Example: Chapter test in a textbook.
–Testing Purpose: Compare a student's ability with the ability scores of other students.
–Domains Covered: Curriculum-based content.
–Type of Learning Tested: Direct application of rote learning and rehearsed material.
–Scoring Procedures: Generate a single (total or average) test score.
CR Tests
–Example: STANAG 6001 OPI.
–Testing Purpose: Determine whether the test taker has specified real-world abilities.
–Domains Covered: Curriculum-independent, real-world content.
–Type of Learning Tested: Far-transfer learning that enables spontaneous and creative responses.
–Scoring Procedures: Generate a “floor and ceiling” proficiency rating.
4
Are all language tests proficiency tests? No.
–Achievement tests assess rote, direct-application learning.
–Performance tests assess rehearsed, near-transfer learning.
–Proficiency tests assess far-transfer learning.
And not all tests that are called “proficiency” tests are criterion-referenced tests.
5
Creating a C-R Test
The construct must have Task, Context, and Accuracy (TCA) expectations.
The following elements must be aligned:
–The construct to be tested.
–The test design and development.
–The scoring model.
If all of these elements are aligned, the test is legally defensible.
Shrock, S. A., & Coscarelli, W. C. (2007). Criterion-referenced test development: Technical and legal guidelines for corporate training and certification (3rd ed.). San Francisco, CA: John Wiley and Sons.
6
Creating a CR Language Test
1. Define incremental TCA stages of the trait (where each stage is more complex in dimensionality than the preceding stage).
2. Maintain strict alignment of the:
–Theoretical construct model.
–Test development model.
–Psychometric scoring model.
Luecht, R. (2003). Multistage complexity in language proficiency assessment: A framework for aligning theoretical perspectives, test development, and psychometrics. Foreign Language Annals, 36, 527-535.
7
Testing Experts Stress the Need for Alignment
[Diagram: three aligned elements]
–Construct Definition
–Test Design & Development
–Test Scoring
8
Further Research
With the approval of ATC, “permissive” BAT research continued using English language learners interested in applying for admission to a U.S. university. A diversity of first languages was represented among the test takers. The number who have taken the BAT Reading test now exceeds 600. With 600+ test takers, we have done the IRT analyses needed for adaptive testing.
9
Validating the BAT
1. WinSteps IRT analyses confirmed that:
–The BAT test items did “cluster” by difficulty level, and the clusters didn’t overlap.
–The average logit values for each level were separated by more than 1 logit.
2. The clustered items were then assembled into testlets, and the 5-item testlets were assigned to the appropriate stage.
–For every level, the testlets were of comparable difficulty – within 0.02 logits.
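To make that clustering check concrete, here is a minimal Python sketch (an illustration added here, not the study's actual analysis script). The `item_logits` values and the mapping of levels to WinSteps difficulty estimates are assumptions for demonstration; the function simply tests that the level clusters do not overlap and that adjacent level means are more than 1 logit apart.

```python
# Hypothetical check of item-difficulty clustering by level.
# `item_logits` is an assumed structure: intended STANAG level -> list of
# WinSteps difficulty estimates (in logits) for items written to that level.
item_logits = {
    1: [-1.7, -1.6, -1.5, -1.5, -1.6],   # illustrative values only
    2: [0.2, 0.3, 0.2, 0.3, 0.25],
    3: [1.8, 1.8, 1.75, 1.85, 1.8],
}

def clusters_are_hierarchical(item_logits, min_gap=1.0):
    """True if level clusters don't overlap and adjacent level means
    are separated by more than `min_gap` logits."""
    levels = sorted(item_logits)
    for lo, hi in zip(levels, levels[1:]):
        if max(item_logits[lo]) >= min(item_logits[hi]):
            return False                      # clusters overlap
        mean_lo = sum(item_logits[lo]) / len(item_logits[lo])
        mean_hi = sum(item_logits[hi]) / len(item_logits[hi])
        if mean_hi - mean_lo <= min_gap:
            return False                      # level means too close
    return True

print(clusters_are_hierarchical(item_logits))  # True for these values
```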
10
The BAT Reading Test
Is a criterion-referenced proficiency test:
–Defines the construct with a hierarchy of level-by-level TCA requirements.
–Follows those level-by-level criteria in the test design and development process.
–Applies the same level-by-level criteria in its scoring process.
Has been empirically validated:
–Items clustered in difficulty at each level.
–The clusters did not overlap.
11
The BAT: A CR Reading Test
1. WinSteps IRT analyses confirmed that:
–The BAT test items did “cluster” by difficulty.
–The clusters didn’t overlap.
2. When the clustered items were assembled into 5-item, level-specific testlets:
–For every level, the testlets were of comparable difficulty – within 0.02 logits.
–The standard error ranged from .04 to .06.
–The average item logit values for each level were separated by more than 1 logit.
12
WinSteps Results (with 4 testlets of 5 items each per level; n = 680)
STANAG 6001 Level | Testlet A | Testlet B | Testlet C | Testlet D | Model SE (logits)
3                 | +1.8      |           |           |           | .04
2                 | +0.3      | +0.2      |           |           | .04
1                 | -1.5      | -1.6      | -1.7      |           | .06
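As a quick arithmetic check of the separation claim, this short sketch (added for illustration) averages the testlet values recoverable from the table above and prints the gaps between adjacent levels:

```python
# Mean testlet difficulty per level, using only the values recoverable
# from the table above.
testlets = {1: [-1.5, -1.6, -1.7], 2: [0.3, 0.2], 3: [1.8]}

means = {lvl: sum(v) / len(v) for lvl, v in testlets.items()}
print(means)                # ≈ {1: -1.6, 2: 0.25, 3: 1.8}
print(means[2] - means[1])  # ≈ 1.85 logits between Levels 1 and 2
print(means[3] - means[2])  # ≈ 1.55 logits between Levels 2 and 3
# Both separations exceed 1 logit, as the validation slides report.
```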
13
Testlet Difficulty by Level
[Chart: testlet difficulty plotted by level – testlet difficulty within +/- .02 logits at each level; standard error bars shown; adjacent levels separated by 1 logit]
14
And how did the national test results compare to the results of the validated BAT test?
15
BAT and National Listening Scores
59% exact agreement of BAT and National scores. But there was also disagreement:
–34% disagreed by 1 level.
–7% disagreed by 2 levels.
–Of the disagreements, the National score was HIGHER in 81% of the cases.
16
BAT and National Speaking Scores
50% exact agreement of BAT and National scores. But there was also disagreement:
–45% disagreed by 1 level.
–5% disagreed by 2 levels.
–Of the disagreements, the National score was HIGHER in 97% of the cases.
17
BAT and National Reading Scores
65% exact agreement of BAT and National scores. But there was also disagreement:
–33% disagreed by 1 level.
–2% disagreed by 2 levels.
–Of the disagreements, the National score was HIGHER in 81% of the cases.
18
BAT and National Writing Scores
47% exact agreement of BAT and National scores. But there was also disagreement:
–43% disagreed by 1 level.
–10% disagreed by 2 levels. (1/3 and 2/4)
–Of the disagreements, the National score was HIGHER in 98% of the cases.
19
Overall Observations
The national test results were within 1 level of the BAT results 90% to 98% of the time. But they matched the BAT results exactly only 47% to 65% of the time. The general trend was that the non-matching national ratings were higher. How can we improve these numbers?
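For readers who want to tally such comparisons themselves, here is a minimal sketch of the computation; the rating pairs below are placeholder values for illustration, not the study's data:

```python
# Each pair is (national_rating, bat_rating) for the same test taker.
# Placeholder values only -- the study's raw pairings were not published here.
pairs = [(2, 2), (3, 2), (2, 2), (2, 1), (1, 1), (3, 1), (2, 3)]

exact = sum(n == b for n, b in pairs) / len(pairs)
off_by_1 = sum(abs(n - b) == 1 for n, b in pairs) / len(pairs)
off_by_2 = sum(abs(n - b) == 2 for n, b in pairs) / len(pairs)
disagreements = [(n, b) for n, b in pairs if n != b]
national_higher = sum(n > b for n, b in disagreements) / len(disagreements)

print(f"exact: {exact:.0%}, off by 1: {off_by_1:.0%}, off by 2: {off_by_2:.0%}")
print(f"national higher in {national_higher:.0%} of disagreements")
```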
20
Was there a slip at the construct stage?
[Diagram: Construct Definition → Test Design & Development → Test Scoring]
21
STANAG 6001 Construct Definition
Every base STANAG 6001 level description has 3 components:
–Tasks (Communication Functions)
–Context (Content/Topics)
–Accuracy (Precision expectations)
At every base level, each of these components is different from the descriptions at the other levels. The levels are not a single “scale”, but a hierarchy of criterion-referenced abilities.
22
High-Level Language Learning Requires Far-Transfer
A Proficiency Summary with General Text Characteristics and Learning Types
(Green = Far Transfer, Blue = Near Transfer, Red = Direct Application)
Level 5 – Function/Tasks: All expected of an educated NS. Context/Topics: [Books] All subjects. Accuracy: Accepted as an educated NS.
Level 4 – Function/Tasks: Tailor language, counsel, motivate, persuade, negotiate. Context/Topics: [Chapters] Wide range of professional needs. Accuracy: Extensive, precise, and appropriate.
Level 3 – Function/Tasks: Support opinions, hypothesize, explain, deal with unfamiliar topics. Context/Topics: [Multiple pages] Practical, abstract, special interests. Accuracy: Errors never interfere with communication & rarely disturb.
Level 2 – Function/Tasks: Narrate, describe, give directions. Context/Topics: [Multiple paragraphs] Concrete, real-world, factual. Accuracy: Intelligible even if not used to dealing with non-NS.
Level 1 – Function/Tasks: Q & A, create with the language. Context/Topics: [Multiple sentences] Everyday survival. Accuracy: Intelligible with effort or practice.
Level 0 – Function/Tasks: Memorized. Context/Topics: [Words and Phrases] Random. Accuracy: Unintelligible.
23
The Construct to be Tested
Proficient reading: The active, automatic, far-transfer process of using one’s internalized language and culture expectancy system to efficiently comprehend an authentic text for the purpose for which it was written.
Author purpose → Reading purpose:
–Orient → Get necessary information
–Inform → Learn
–Evaluate → Evaluate and synthesize
24
Definition of Proficient Reading
Proficient reading is the…
–active,
–automatic,
–far-transfer process
–of using one’s internalized language and culture expectancy system
–to efficiently comprehend
–an authentic text
–for the purpose for which it was written.
25
STANAG 6001 Reading Grid
Level 3
–Reader Task: Understand literal and figurative meanings by reading both "the lines" and "between the lines". Recognize the author's tone and unstated positions. Evaluate the adequacy of arguments made. Evaluate situations, concepts, and conflicting ideas.
–Conditions, Author Purpose: Present and support arguments and/or hypotheses with both factual and abstract reasoning.
–Conditions, Text Type: Multiple-paragraph prose on a variety of professional or abstract subjects such as found in editorials, formal papers, and professional writing.
–Conditions, Content: Multiple, well-organized abstract concepts interlaced with attitudes and feelings. Social/cultural/political issues, with abstract aspects and supporting facts presented as well. Most allusions and references are explained by their context.
–Accuracy Expectations: Understands the facts, nuanced details, and the author's opinion, tone, and attitude.
Level 2
–Reader Task: Understand the facts and supporting details, including any causal, temporal, and spatial relationships.
–Conditions, Author Purpose: Convey structured, factual information, supporting details, and factual relationships in extended narratives and descriptions.
–Conditions, Text Type: News reports, magazine articles, short stories, human interest features, and instructional and descriptive materials.
–Conditions, Content: Concrete information about real-world phenomena with supporting details, as well as interrelated facts about world, local, and personal events.
–Accuracy Expectations: Grasps both the main ideas and the supporting details.
Level 1
–Reader Task: Understand the main idea; orient oneself by identifying the topic or main idea.
–Conditions, Author Purpose: Orient by communicating one or more general ideas.
–Conditions, Text Type: Very simple announcements; ads, personal notes.
–Conditions, Content: Information about places, times, people, etc. that are associated with everyday events, personal invitations, or general information.
–Accuracy Expectations: Recognizes the main idea and some broad, categorical distinctions.
Level 0
–Reader Task: Recognize some random items in a list or short text.
–Conditions, Author Purpose: List, enumerate.
–Conditions, Text Type: Lists, simple tables.
–Conditions, Content: Sparse or random; format or external context may reveal internal relationships.
–Accuracy Expectations: Correctly recognizes some words.
26
Did the national test construct adhere to the “TCA” alignment expected in the updated STANAG 6001 blueprint?
27
Was there a slip at the Design and Development stage?
[Diagram: Construct Definition → Test Design & Development → Test Scoring]
28
Did the test design and development process follow the level-by-level approach of the STANAG 6001 blueprint?
[Diagram: a single item bank running from Easy Items to Difficult Items is not equivalent (≠) to separate banks of Level 1, Level 2, and Level 3 items]
29
If the national tests followed a traditional design of a single bank of test items, that design would reduce or limit scoring accuracy, even if the difficulty range of the items is identical to the difficulty range of the 3-stage test.
30
In fact, if a multi-level test is designed to report a single score, that test is no longer a CR test! A single score mixes criteria and includes test results across all of the levels tested. Then, the test takers’ total score includes:
–Their score at their sustained ability level.
–Their score resulting from their partial control of the next-higher level.
–Their score resulting from conceptual control at the second-higher level.
31
Was there a slip in the scoring process?
[Diagram: Construct Definition → Test Design & Development → Test Scoring]
32
Test Scoring Procedures: The Single Score Approach
Note:
–Relating a single, overall score to a multi-level set of criteria (such as the hierarchical set of STANAG 6001 criteria) presents formidable theoretical and practical challenges.
–That is why multiple statistical and quasi-statistical procedures have been developed to accomplish this cut-score-setting task.
33
Test Scoring Procedures: The Single Score Approach
[Chart: a 0–50–100 scale of test results to be calibrated against Level 1, Level 2, and Level 3 groups of “known” ability]
34
The Results One Hopes For:
[Chart: each group’s scores fall into a distinct band of the 0–100 scale, so cut scores cleanly separate the Level 1, Level 2, and Level 3 groups]
35
The Results One Always Gets
(Some test takers score below and some score above their “known” ability.)
[Chart: the groups’ score distributions overlap, so the cut scores between levels – marked “???” on the 0–100 scale – are uncertain]
36
No matter where the cut scores are set, they are wrong for some test takers.
[Chart: overlapping score distributions for the Level 1, Level 2, and Level 3 groups, with the uncertain cut scores marked “???” on the 0–100 scale]
37
Why is a single score so difficult to interpret? Language learners do not completely master the communication tasks and topical domains of one proficiency level before they begin learning the skills described at the next higher level. Usually, learners will develop conceptual control or even partial control over the next higher proficiency level by the time they have attained sustained, consistent control over the lower level.
38
If STANAG 6001 Levels Were Buckets…
[Diagram: one bucket per level, filling with water (ability)]
Notes:
–The buckets may begin filling at the same time.
–Some Level 2 ability will develop before Level 1 is mastered. That is OK, because the buckets will still reach their full (mastery) state sequentially.
–The blue arrows indicate the water (ability) observed at each level.
39
Is there a better way than indirect extrapolation from a single score to assign proficiency levels? Would close adherence to the proficiency scale levels and C-R scoring improve testing accuracy?
40
A mini experiment: The BAT was scored two ways. The “floor and ceiling” criterion-referenced ratings were more defensible than the total-score results. The criterion-referenced rating process ranked 37% of the test takers differently than their total scores ranked them. Let’s see why that happens…
41
Results for Two Students
On a multi-level reading test, there were 60 total points possible:
–20 points for Level 1.
–20 points for Level 2.
–20 points for Level 3.
Let’s compare their total scores and their level-by-level, “floor and ceiling”, criterion-based scores.
42
Example A: Alice’s total score = 35; C-R Proficiency Level = 2 (Level 2, with random abilities at Level 3)
[Bar chart; vertical axis: None – Some – Most – “Almost all”]
–Level 1: 17 points (85%)
–Level 2: 14 points (70%)
–Level 3: 4 points (20%)
43
Example B: Bob’s total score = 37; C-R Proficiency Level = 1+ (Level 1, with developing abilities at Level 2)
[Bar chart; vertical axis: None – Some – Most – “Almost all”]
–Level 1: 17 points (85%)
–Level 2: 11 points (55%)
–Level 3: 9 points (45%)
44
A Comparison of Results
Single Score Results: Alice, 35 total points; Bob, 37 total points.
C-R Results: Alice, Level 2; Bob, Level 1+.
Based on their total scores, where would you set the cut-score between 1+ and 2? Would both be given a 2? Or both a 1+?
45
A Comparison of Results
CR scoring solves that problem!
46
To Apply C-R Scoring to STANAG 6001 Tests
Calculate a separate score for each level. With a separate score for each level, that level-specific score is not influenced by developing abilities and guessing at other levels. The test taker’s proficiency level is then determined by comparing his or her level-specific scores, not by the total score.
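As one way to picture this rule, here is a minimal Python sketch of floor-and-ceiling scoring, using Alice's and Bob's level scores from the preceding slides. The 70% "sustained control" and 50% "developing" thresholds are illustrative assumptions chosen to reproduce the slides' ratings, not official BAT or STANAG 6001 cut scores:

```python
# Floor-and-ceiling CR rating: one score per level, compared level by level.
# The thresholds below are illustrative assumptions, not official cut scores.
MASTERY = 0.70      # assumed: sustained control of a level
DEVELOPING = 0.50   # assumed: enough partial control to earn a "+"

def cr_rating(percent_by_level):
    """percent_by_level: {level: proportion correct on that level's items}.
    Returns a rating like '2' or '1+' based on level-specific scores only."""
    floor = 0
    for level in sorted(percent_by_level):
        if percent_by_level[level] >= MASTERY:
            floor = level     # highest level with sustained control so far
        else:
            break             # the floor must be sustained from the bottom up
    next_up = percent_by_level.get(floor + 1, 0.0)
    return f"{floor}+" if next_up >= DEVELOPING else str(floor)

alice = {1: 0.85, 2: 0.70, 3: 0.20}   # total 35/60
bob   = {1: 0.85, 2: 0.55, 3: 0.45}   # total 37/60
print(cr_rating(alice))  # '2'
print(cr_rating(bob))    # '1+'
```

Under these assumed thresholds, the function reproduces the ratings from the slides: Alice is a 2 and Bob a 1+, even though Bob's total score is higher.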
47
Advantages of C-R Scoring
–C-R scoring is non-compensatory scoring (whereas a single overall score on a multi-level test is always a compensatory score).
–C-R test designs combined with C-R scoring yield a score for each level tested.
–“Floor and Ceiling” scores can explain ability distinctions that would be regarded as error variance in multi-level tests that report only a single total test score.
48
Are you ready to take your test scoring to the next level?
1. If the difficulty indices of your test items group those items into level-specific clusters, and
2. If those level-specific clusters array themselves in an ascending hierarchy of difficulty groupings,
…you are ready!
49
If they align – everything’s fine!
Construct Definition ☺
Test Design & Development ☺
Test Scoring ☺
50
Questions?
51
For more information on the empirical validation of the CR approach to the testing of reading proficiency, see:
Clifford, R., & Cox, T. (2013). Empirical validation of reading proficiency guidelines. Foreign Language Annals, 46(1), 45–61.