Role of Statistics in Developing Standardized Examinations in the US


2 Role of Statistics in Developing Standardized Examinations in the US
by Mohammad Hafidz Omar, Ph.D. April 19, 2005

3 Road Map of Talk
What is a standardized test? Why standardize tests?
Who builds standardized tests in the United States?
Steps to building a standardized test
Test questions & some statistics used to describe them
Statistics used for describing exam scores
Research studies in educational testing that use advanced statistical procedures

4 What is a “standardized examination”?
A standardized test: a test in which the conditions of administration and the scoring procedures are designed to be the same in all uses of the test
Conditions of administration: 1) physical test setting, 2) directions for examinees, 3) test materials, 4) administration time
Scoring procedures: 1) derivation of scores, 2) transformation of raw scores

5 Why standardize tests?
Statistical reason: reduction of unwanted variation in administration conditions and scoring practices
Practical reason: appeal to many test users; same treatment and conditions for all students taking the tests (fairness)

6 Who builds standardized tests in the United States?
Testing organizations: Educational Testing Service (ETS), American College Testing (ACT), National Board of Medical Examiners (NBME), Iowa Testing Programs (ITP), Center for Educational Testing and Evaluation (CETE)
State departments of education (e.g. New Mexico State Department of Education): build tests themselves or contract the job out to testing organizations
Large school districts (e.g. Wichita Public School District)

7 a) Administration conditions
Design-of-experiments concept of controlling for unnecessary factors: apply the same treatment conditions to all test takers
1) physical test setting (group vs. individual testing, etc.), 2) directions for examinees, 3) test materials, 4) administration time

8 b) Scoring Procedures
Same scoring process: scoring rubric for open-ended items
Same score units and same measurements for everybody: raw test scores (X) and scale scores
Same transformation of raw scores: Raw (X) → Equating process → Scale scores h(X)
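
As an illustration of applying the same transformation to everyone, here is a minimal sketch of linear equating in Python, assuming two forms X and Y with known means and standard deviations; the function name and the numbers are hypothetical, and operational programs typically use more elaborate (e.g. equipercentile or IRT-based) equating.

```python
import numpy as np

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a raw score on form X onto the scale of form Y by matching
    the means and standard deviations of the two forms (linear equating)."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Hypothetical form statistics: form X is assumed slightly harder than form Y.
raw_x = np.array([28, 35, 42])
print(linear_equate(raw_x, mean_x=37.0, sd_x=8.0, mean_y=40.0, sd_y=7.5))
```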

9 Overview of the Typical Standardized Examination Building Process
Costly process; important quality-control procedures at each phase; the process takes time (months to years)
Phases: 1) Creating test specifications, 2) Fresh item development, 3) Field-test development, 4) Operational (live) test development

10 1) Creating Test Specifications
Purpose: to operationalize the intended purpose of testing
A team of content experts and stakeholders discusses the specifications against the intended purpose
Serves as a guideline for building examinations: How many items should be written in each content/skill category? Which content/skill areas are more important than others?
A two-way table of specifications typically crosses content areas (domains) with learning objectives, with a % of importance in each cell (a sketch of how such a table drives item counts follows below)
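
A minimal sketch of how such a two-way table of specifications can drive item counts, assuming items are allocated in proportion to each cell's weight; the content areas, objectives, and percentages below are hypothetical.

```python
# Hypothetical blueprint: (content area, learning objective) -> % of importance.
blueprint = {
    ("Algebra", "Recall"): 0.10, ("Algebra", "Apply"): 0.20,
    ("Geometry", "Recall"): 0.15, ("Geometry", "Apply"): 0.25,
    ("Data Analysis", "Recall"): 0.10, ("Data Analysis", "Apply"): 0.20,
}
total_items = 60  # intended test length

# Number of items to write per cell, proportional to the cell's weight.
for cell, weight in blueprint.items():
    print(cell, round(weight * total_items))
```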

11 2) Fresh Item Development
Purpose: building quality items to meet the test specifications
Writing items to meet the test specifications: What is the minimum number of items to write? Which cells will need more items?
Item review (content & bias review)
Design-of-experiments stage: design of the test (easy items first, then a mixture, to increase motivation); design of the testing event (time of year, sample, etc.)
Data-collection stage: pilot-testing of items; scoring of items & pilot-test exams
Analysis stage: analyzing the test items
Data interpretation & decision-making stage: item review with the aid of item statistics (content review, bias review)
Quality-control step: (1) keep good-quality items, (2) revise items with minor problems and re-pilot them, or (3) scrap bad items

12 3) Field-Test Development
Purpose: building quality exam scales that measure the construct (structure) of the test as intended by the test specifications
Design-of-experiments stage: designing field-test booklets to meet the specifications, using only good items from the previous stage (items with known descriptive statistics); design of the testing event
Data collection: field-testing of the test booklets; scoring of items and field-test exams
Analyses: analyzing the examination booklets (for scale reliability and validity)
Interpreting results: item & test review. Do the tests meet the minimum statistical requirements (e.g. rxx' > 0.90)? If not, what can be done differently?

13 4) Operational (Live) Test Development
Purpose: to measure student abilities as intended by the purpose of the test
Design-of-experiments stage:
Design of the operational test: use only good field-test items and item sets; assemble the operational exam booklets
Design of pilot tests (e.g. some state-mandated programs): new items & some of the revised items
Design of field tests (e.g. the GRE experimental section): good items that have been piloted before; how many sections, and how many students per section?
Design of additional research studies, e.g. different forms of the test (paper-and-pencil vs. computer version)
Design of the testing events
Data collection: first operational testing of students with the final version of the examinations; scoring of items and exams
Analyses of the operational examinations; research studies to establish reporting scales

14 Different Types of Exam Item Formats
Machine-scorable formats: multiple-choice questions, true-false, multiple true-false, multiple-mark questions (Pomplun & Omar, 1997) – a.k.a. multiple-answer multiple-choice questions, and Likert-type items (agree/disagree continuum)
Manual (human) scoring formats: short answers and open-ended test items; these require a scoring rubric to score papers

15 Statistical considerations in Examination construction
Overall design of tests to achieve reliable (consistent) and valid results
Designing testing events to collect reliable and valid data (correct pilot sample, correct time of the year, etc.), e.g. SAT: spring vs. summer student-population differences
Appropriate & correct statistical analyses of examination data
Quality control of test items and exams

16 Analyses & Interpretation: Descriptive statistics for distractors (Distractor Analysis)
Applies to multiple-choice, true-false, and multiple true-false formats only
Statistics: proportion of examinees endorsing each distractor
Informs the exam authors which distractor(s) are not functioning, or are counter-intuitively more attractive than the intended right answer (high-ability examinees choosing a wrong answer)
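
A minimal sketch of a distractor analysis for a single multiple-choice item; the responses and the key below are made up.

```python
from collections import Counter

responses = ["B", "C", "B", "A", "B", "D", "B", "C", "B", "B"]  # hypothetical answers
key = "B"

counts = Counter(responses)
n = len(responses)
for option in "ABCD":
    marker = "(key)" if option == key else ""
    print(option, counts.get(option, 0) / n, marker)
# Distractors chosen by almost no one, or chosen heavily by high-ability
# examinees, are flagged for the item writers to review.
```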

17 Analyses and Interpretation: Item-Level Statistics
Difficulty of items
Statistics: proportion correct {p-value} – for MC, T/F, multiple T/F, multiple-mark, and short-answer items; item mean – for multiple-mark and open-ended items
Describes how difficult an item is
Discrimination
Discrimination index: the difference in p-value between high- and low-scoring examinees; an index describing sensitivity to instruction
Item-total correlations: correlation of an item (dichotomously or polychotomously scored) with the total score
Point-biserials: correlation between the total score and the dichotomous (right/wrong) item being examined
Biserials: same as point-biserials except that the dichotomous item is assumed to arise from a normal distribution of student ability in responding to the item
Polyserials: same as biserials except that the item is polychotomously scored
Describes how an item relates (and thus contributes) to the total score
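
A minimal sketch of the p-value and point-biserial computations for dichotomously scored items, assuming a 0/1 examinee-by-item score matrix; the data are simulated, and operational programs often use the corrected item-total correlation, which removes the item from the total.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = (rng.random((200, 20)) < 0.7).astype(int)  # 200 examinees x 20 items (simulated)

p_values = scores.mean(axis=0)   # item difficulty: proportion answering correctly
total = scores.sum(axis=1)       # total score per examinee

# Point-biserial: Pearson correlation of each 0/1 item with the total score.
pt_biserial = [np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])]
print(np.round(p_values, 2))
print(np.round(pt_biserial, 2))
```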

18 Examination-Level Statistics
Overall difficulty of the exam/scale
Statistics: test mean, average item difficulty
Overall dispersion of exam/scale scores
Statistics: test variability – standard deviation, variance, range, etc.
Test speededness
Statistics: 1) percent of students attempting the last few questions, 2) percentage of examinees finishing the test within the allotted time period (a test is considered not speeded if this percentage is more than 95%)
Consistency of the scale/exam scores
Statistics: scale reliability indices – KR-20 for dichotomously scored items, coefficient alpha for dichotomously and polychotomously scored items; standard error of measurement indices
Validity measures of scale/exam scores
Intercorrelation matrix: high correlation with similar measures, low correlation with dissimilar measures; structural analyses (factor analyses, etc.)
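
A minimal sketch of coefficient alpha (which reduces to KR-20 when every item is scored 0/1) and the standard error of measurement, assuming an examinees-by-items score matrix; the data below are simulated.

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha; equals KR-20 when every item is scored 0/1."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def sem(scores):
    """Standard error of measurement: SD(total) * sqrt(1 - reliability)."""
    sd_total = scores.sum(axis=1).std(ddof=1)
    return sd_total * np.sqrt(1.0 - coefficient_alpha(scores))

# Example with a simulated 0/1 matrix (200 examinees x 20 items):
rng = np.random.default_rng(1)
ability = rng.normal(size=(200, 1))
scores = (rng.random((200, 20)) < 1 / (1 + np.exp(-ability))).astype(int)
print(round(coefficient_alpha(scores), 2), round(sem(scores), 2))
```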

19 Statistical procedures describing the validity of examination scores for their intended use
Is the reality of the exam for the students the same as the authors' exam specifications?
Construct validity: analyses of exam structure (intercorrelation matrix, factor analyses, etc.). Can the exam measure the intended learning factors (constructs)? Answered with factor analysis (a data-reduction method)
Predictive validity: the predictive power of exam scores for explaining important variables, e.g. can exam scores explain (or predict) success in college? Regression analyses
Differential item functioning: statistical bias in test items. Are test items fair for all subgroups (female, Hispanic, Black, etc.) of examinees taking the test? Mantel-Haenszel chi-square statistic (a sketch follows below)
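
A minimal sketch of the Mantel-Haenszel chi-square for DIF on one item, assuming examinees have been stratified by total score and each stratum summarized as a 2x2 table of group (reference/focal) by item response (correct/incorrect); the counts below are hypothetical.

```python
import numpy as np

def mantel_haenszel_chi2(tables):
    """tables: one 2x2 array per score stratum,
    rows = (reference, focal) group, columns = (correct, incorrect)."""
    A = E = V = 0.0
    for t in tables:
        t = np.asarray(t, dtype=float)
        n_ref, n_foc = t[0].sum(), t[1].sum()
        m_right, m_wrong = t[:, 0].sum(), t[:, 1].sum()
        T = t.sum()
        A += t[0, 0]                      # observed reference-group correct count
        E += n_ref * m_right / T          # expected count if the item has no DIF
        V += n_ref * n_foc * m_right * m_wrong / (T**2 * (T - 1.0))
    return (abs(A - E) - 0.5) ** 2 / V    # compare with chi-square, 1 df

# Hypothetical low/middle/high total-score strata:
strata = [[[20, 30], [15, 35]], [[40, 20], [35, 25]], [[50, 10], [48, 12]]]
print(round(mantel_haenszel_chi2(strata), 2))
```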

20 Some research areas in Educational Testing that involve further statistical analyses
Reliability theory: How consistent is a set of examination scores? The signal-to-(signal + noise) ratio in educational measurement, σ²_true / (σ²_true + σ²_error) (a sketch follows below)
Generalizability theory: describing & controlling for more than one source of error variance
Differential item functioning: pairwise differences (female vs. male, Black vs. White) in student performance on items; the issue of Type I error rate control (many items & comparisons inflate false-detection rates)
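
A minimal sketch of the signal-to-(signal + noise) idea, simulating true scores with variance near 81 and measurement error with variance near 9, so the theoretical reliability is 81/(81 + 9) = 0.90; the correlation between two simulated parallel forms should land close to that value.

```python
import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(50, 9, size=5000)           # true scores (signal), variance ~ 81
form_a = true + rng.normal(0, 3, size=5000)   # parallel form A: true + error, error variance ~ 9
form_b = true + rng.normal(0, 3, size=5000)   # parallel form B with independent error

theoretical = 81 / (81 + 9)                    # sigma^2_true / (sigma^2_true + sigma^2_error)
empirical = np.corrcoef(form_a, form_b)[0, 1]  # parallel-forms reliability estimate
print(theoretical, round(empirical, 2))
```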

21 Some research areas in Educational Testing that involve further statistical analyses (continued)
Test equating: With two or more forms of the exam, are they interchangeable? If scores on form X are regressed on scores from form Y, will the scores from either test edition be interchangeable? Different regression functions
Item response theory: a theory relating students’ unobserved ability to their responses to items; the probability of responding correctly to a test item at each level of ability (item characteristic curves, sketched below); can put items (not tests) on the same common scale
Vertical scaling: How does student performance from different school grade groups compare? Are the means increasing rapidly, slowly, etc.? Are the variances constant, increasing, or decreasing?
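
A minimal sketch of an item characteristic curve under the two-parameter logistic (2PL) model; the discrimination a and difficulty b below are hypothetical, and operational programs often add a scaling constant and a guessing parameter (3PL).

```python
import numpy as np

def icc_2pl(theta, a, b):
    """P(correct response | ability theta) under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                  # a grid of ability levels
print(np.round(icc_2pl(theta, a=1.2, b=0.5), 2))
```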

22 Some research areas in Educational Testing that involve further statistical analyses (continued)
Item banking: Are the same items from different administrations significantly different in their statistical properties? Item response theory is needed to calibrate all items onto one common scale. Advantage: test forms with similar difficulty can be built easily.
Computerized testing: Are score results taken on computers interchangeable with those from paper-and-pencil editions? (e.g. Is the measure of student performance free from, or tainted by, their level of computer anxiety?)
Computer adaptive testing: increases measurement precision (test information function) by allowing students to take only items that are at their own ability level (a sketch of maximum-information item selection follows below)
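
A minimal sketch of maximum-information item selection for a computerized adaptive test under the same 2PL model sketched above; the item pool and the current ability estimate are made up.

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a**2 * p * (1 - p)

pool = [(0.8, -1.0), (1.5, 0.0), (1.2, 0.8), (0.6, 1.5)]  # hypothetical (a, b) pairs
theta_hat = 0.3                                           # current ability estimate

info = [item_information(theta_hat, a, b) for a, b in pool]
print("administer item", int(np.argmax(info)))  # pick the most informative item
```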

