1 Effective Use of Benchmark Test and Item Statistics and Considerations When Setting Performance Levels California Educational Research Association Anaheim, California December 1, 2011

2 Review of Benchmark Test and Item Statistics Objective: Extend the knowledge of the assessment team to: 1. Better understand test reliability and the influences of test composition and test length. 2. Better understand item statistics and use them to identify items in need of revision.

Reliability is a measure of the consistency of the assessment. Types of reliability coefficients (always range from 0 to 1): test-retest, alternate forms, split-half, internal consistency (Cronbach’s Alpha/KR-20). 3
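To make the internal-consistency coefficient named above concrete, here is a minimal Python sketch that computes Cronbach's alpha from a students-by-items score matrix; for dichotomously scored (0/1) items it is identical to KR-20. The function name and the response data are hypothetical illustrations, not taken from the presentation.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from a students-by-items score matrix.
    For 0/1 (right/wrong) items this equals KR-20."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 6-student, 4-item benchmark (1 = correct, 0 = incorrect)
responses = [[1, 1, 1, 0],
             [1, 0, 1, 1],
             [0, 0, 1, 0],
             [1, 1, 1, 1],
             [0, 0, 0, 0],
             [1, 1, 0, 1]]
print(round(cronbach_alpha(responses), 2))
```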

Reliability Influenced by Test Length The Spearman-Brown formula estimates the reliabilities of shorter tests – Remember: The reliability of a score indicates how closely an observed score can be expected to be reproduced if the measurement is repeated. NOTE: See the handout from the STAR Technical Manual for exact cluster reliabilities. 4

Reliability Influenced by Test Length Example: given a 75-item test with r = .95 – 40-item test has r = .91 – 35-item test has r = .90 – 30-item test has r = .88 – 25-item test has r = .86 – 20-item test has r = .84 – 10-item test has r = .72 – 5-item test has r = .56 NOTE: See the handout from the STAR Technical Manual for exact cluster reliabilities. 5
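The values on slide 5 follow directly from the Spearman-Brown prediction. A minimal sketch, assuming only the 75-item baseline with r = .95 given above; the function name is hypothetical, and running it reproduces the listed reliabilities within rounding.

```python
def spearman_brown(reliability, old_length, new_length):
    """Predicted reliability when a test is lengthened or shortened."""
    k = new_length / old_length            # ratio of new to old test length
    return (k * reliability) / (1 + (k - 1) * reliability)

# Shortening a 75-item test with r = .95 (values from slide 5)
for n in (40, 35, 30, 25, 20, 10, 5):
    print(n, round(spearman_brown(0.95, 75, n), 2))
```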

Reliability Statistics for CSTs (see handout)  Note that CST reliabilities range from .90 to .95  Note that cluster reliabilities are consistent with those predicted by the Spearman-Brown formula

Validity is the degree to which the test measures what was intended. Types of test validity: A. Predictive or Criterion (How does it correlate with other measures?) B. Content 1. How well does the test sample from the content domain? 2. How well aligned are the items with regard to format and rigor? 7

Validity Is Influenced by Reliability  Impact of Lower Reliability on Validity  Remember: Validity is the agreement between a test score and the quality it is believed to measure  Upper limit on the validity coefficient is the square root of the reliability coefficient  75-item test = square root of .95 = .97 8

Validity Is Influenced by Reliability  Upper limit on the validity coefficient is the square root of the reliability coefficient  75-item test = square root of .95 = .97  30-item test = square root of .88 = .94  20-item test = square root of .86 = .93  10-item test = square root of .72 = .85  5-item test = square root of .56 = .75 9

Coefficient of Determination (R squared)  The square of the validity coefficient gives the “proportion of variance in the achievement construct accounted for by the test”  75-item test = .97 squared = .94  30-item test = .94 squared = .88  20-item test = .93 squared = .86  10-item test = .85 squared = .72  5-item test = .75 squared = .56 10
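The two relationships on slides 8-10 can be expressed in a few lines: the ceiling on a validity coefficient is the square root of the reliability, and squaring that ceiling gives the proportion of construct variance accounted for (which, up to rounding, is the reliability again). A sketch, assuming the reliabilities used on slides 9 and 10; the dictionary and loop are illustrative only.

```python
import math

# Reliabilities by test length, as used on slides 9 and 10
reliabilities = {75: 0.95, 30: 0.88, 20: 0.86, 10: 0.72, 5: 0.56}

for n_items, r in reliabilities.items():
    validity_ceiling = math.sqrt(r)        # upper limit on the validity coefficient
    r_squared = validity_ceiling ** 2      # proportion of variance accounted for
    print(f"{n_items:>2} items: max validity = {validity_ceiling:.2f}, "
          f"R^2 = {r_squared:.2f}")
```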

Using Item Statistics (p-values & point-biserials)  Apply item analysis statistics from the assessment reporting system (e.g., DataDirector, Edusoft, OARS, EADMS, etc.)  P-values (proportion of the group getting the item correct)  Most should be between .30 and .80  Very high indicates the item may be too easy; very low may indicate a problem item  Point-biserials (correlation of the item with the total score)  Most should be .30 or higher  Very low or negative generally indicates a problem with the item
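A minimal sketch of the two item statistics named above, computed from a students-by-items matrix of 0/1 scores. The point-biserial here is the correlation of each item with the total score (a corrected version would exclude the item from the total); the data and the .30/.80 flagging thresholds mirror the slide, and everything else is hypothetical.

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = students, columns = items
scores = np.array([[1, 1, 0, 1],
                   [1, 0, 0, 1],
                   [1, 1, 1, 1],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 0, 1]])

p_values = scores.mean(axis=0)   # proportion answering each item correctly
totals = scores.sum(axis=1)      # each student's total score

# Point-biserial: correlation of each 0/1 item with the total score
point_biserials = np.array([np.corrcoef(scores[:, j], totals)[0, 1]
                            for j in range(scores.shape[1])])

for j, (p, pb) in enumerate(zip(p_values, point_biserials), start=1):
    flag = " <- review" if p > 0.80 or p < 0.30 or pb < 0.30 else ""
    print(f"Item {j}: p = {p:.2f}, point-biserial = {pb:.2f}{flag}")
```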

Item statistics for CSTs (see handout)  Note that the range of p-values is consistent with most being between .30 and .80  Note that median point-biserials are generally in the .40s

Algebra 1

Algebra 2

Geometry

18 Maximizing Predictive Accuracy of District Benchmarks Objective: Extend the knowledge of the assessment team to: 1. Better understand how performance level setting is key to predictive validity. 2. Better understand how to create performance level bands based on equipercentile equating.

19 Comparing District Benchmarks to CST Results Common Methods for Setting Cutoffs on District Benchmarks:  Use default settings on assessment platform (e.g. 20%, 40%, 60%, 80%)  Ask curriculum experts for their opinion of where cutoffs should be set  Determine percent correct corresponding to performance levels on CSTs and apply to benchmarks

20 Comparing District Benchmarks to CST Results There is a better way!

21 Comparing District Benchmarks to CST Results “Two scores, one on form X and the other on form Y, may be considered equivalent if their corresponding percentile ranks in any given group are equal.” (Educational Measurement, Second Edition, p. 563)

22 Comparing District Benchmarks to CST Results  Equipercentile Method of Equating at the Performance Level Cut-points  Establishes benchmark cutoffs at the same local percentile ranks as the CST cutoffs  By applying the same local percentile cutoffs to each trimester benchmark, comparisons across trimesters within a grade level are more defensible

23 Equipercentile Equating Method Step 1 – Identify CST Scaled Score (SS) Cut-points

24 Equipercentile Equating Method Step 2 - Establish Local Percentiles at CST Performance Level Cutoffs (from scaled score frequency distribution)

25 Equipercentile Equating Method Step 3 – Locate Benchmark Raw Scores Corresponding to the CST Cutoff Percentiles (from benchmark raw score frequency distribution)
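A sketch of Steps 1-3, assuming matched CST scaled scores and benchmark raw scores for the same students; the cut scores, data, and function names are hypothetical. It finds the local percentile rank of each CST performance-level cutoff, then takes the benchmark raw score at that same percentile.

```python
import numpy as np

def percentile_rank(values, cutoff):
    """Percent of scores falling below the cutoff (local percentile rank)."""
    return 100.0 * np.mean(np.asarray(values) < cutoff)

def equipercentile_cutoffs(cst_scores, benchmark_scores, cst_cutpoints):
    """Map CST scaled-score cut-points to benchmark raw-score cutoffs."""
    cutoffs = {}
    for level, cut in cst_cutpoints.items():
        pr = percentile_rank(cst_scores, cut)                  # Step 2
        cutoffs[level] = np.percentile(benchmark_scores, pr)   # Step 3
    return cutoffs

# Hypothetical matched data and CST scaled-score cut-points (Step 1)
cst = np.random.default_rng(0).normal(350, 60, size=500)
benchmark = np.random.default_rng(1).normal(28, 8, size=500)
cst_cuts = {"Basic": 300, "Proficient": 350, "Advanced": 417}
print(equipercentile_cutoffs(cst, benchmark, cst_cuts))
```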

26 Equipercentile Equating Method Step 4 – Validate Classification Accuracy – Old Cutoffs

27 Equipercentile Equating Method Step 4 – Validate Classification Accuracy – Old Cutoffs

28 Equipercentile Equating Method Step 4 – Validate Classification Accuracy – Old Cutoffs

29 Equipercentile Equating Method Step 4 – Validate Classification Accuracy – New Cutoffs

30 Equipercentile Equating Method Step 4 – Validate Classification Accuracy – New Cutoffs

31 Equipercentile Equating Method Step 4 – Validate Classification Accuracy – New Cutoffs
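A sketch of the Step 4 check, under the assumption that "classification accuracy" here means the percent of students placed in the same category by the benchmark (using old or new cutoffs) and by the CST; both the exact-level match and the proficient-or-above match reported on the next slides can be computed this way. All names, cutoffs, and data below are hypothetical.

```python
import numpy as np

def to_levels(scores, cutoffs):
    """Convert scores to performance-level indices 0..n given ascending cutoffs."""
    return np.digitize(scores, sorted(cutoffs))

def classification_accuracy(bench_scores, bench_cutoffs, cst_levels,
                            proficient_level=3):
    """Percent agreement between benchmark-based and CST performance levels."""
    bench_levels = to_levels(bench_scores, bench_cutoffs)
    each_level = 100.0 * np.mean(bench_levels == cst_levels)
    prof_or_above = 100.0 * np.mean((bench_levels >= proficient_level) ==
                                    (cst_levels >= proficient_level))
    return each_level, prof_or_above

# Hypothetical benchmark raw scores, candidate cutoffs, and CST levels 0-4
bench = np.array([12, 25, 31, 40, 18, 36, 22, 45])
old_cuts, new_cuts = [10, 20, 30, 40], [14, 23, 33, 42]
cst = np.array([0, 2, 2, 3, 1, 3, 1, 4])
print("old cutoffs:", classification_accuracy(bench, old_cuts, cst))
print("new cutoffs:", classification_accuracy(bench, new_cuts, cst))
```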

32 Example: Classification Accuracy – Biology
                                        Old    New
2nd Semester: Proficient or Advanced    42%    77%
2nd Semester: Each Level                38%    55%
1st Semester: Proficient or Advanced    30%    77%
1st Semester: Each Level                31%    50%

33 Example: Classification Accuracy – Biology
                                        Old    New
1st Quarter: Proficient or Advanced     53%    71%
1st Quarter: Each Level                 41%    46%

34 Example: Classification Accuracy – Chemistry
                                Old    New
2nd Semester: Prof. & Adv.      63%    79%
2nd Semester: Each Level        47%    52%
1st Semester: Prof. & Adv.      74%
1st Semester: Each Level        49%    50%
1st Quarter: Prof. & Adv.       83%    76%
1st Quarter: Each Level         48%    47%

35 Example: Classification Accuracy – Earth Science
                                Old    New
2nd Semester: Prof. & Adv.      48%    68%
2nd Semester: Each Level        43%    52%
1st Semester: Prof. & Adv.      33%    66%
1st Semester: Each Level        38%    47%
1st Quarter: Prof. & Adv.       42%    56%
1st Quarter: Each Level         34%    41%

36 Example: Classification Accuracy – Physics
                                Old    New
2nd Semester: Prof. & Adv.      57%    87%
2nd Semester: Each Level        37%    57%
1st Semester: Prof. & Adv.      60%    88%
1st Semester: Each Level        42%    50%
1st Quarter: Prof. & Adv.       65%    87%
1st Quarter: Each Level         47%    45%

37 Things to Consider Prior to Establishing the Benchmark Cutoffs  Will there be changes to the benchmarks after the CST percentile cutoffs are established?  If NO, then raw score benchmark cutoffs can be established by linking the CST to the same-year benchmark administration (e.g., spring 2011 CST matched to benchmark raw scores)  If YES, then wait until the new benchmark is administered and then establish raw score cutoffs on that benchmark  How many cases are available for establishing the CST percentiles? (Too few cases could lead to unstable percentile distributions.)

38 Things to Consider Prior to Establishing the Benchmark Cutoffs (Continued)  How many items comprise the benchmarks to be equated? (As a test gets shorter, it becomes more difficult to match the percentile cut-points established on the CSTs.)

39 Summary Equipercentile Equating Method  The method generally establishes a closer correspondence between the CSTs and benchmarks  When benchmarks are already tightly aligned with the CSTs, the approach may be less advantageous (e.g., elementary math)  Comparisons between benchmark and CST performance can be made more confidently  Comparisons between benchmarks within the school year can be made more confidently

40 Coming Soon from Illuminate Education, Inc.! Reports using the equipercentile methodology are being programmed to: (1) establish benchmark cutoffs for performance bands and (2) create validation tables showing improved classification accuracy based on the method.

Contact: Tom Barrett, Ph.D. President, Barrett Enterprises, LLC Director, Owl Corps, School Wise Press 2173 Hackamore Place Riverside, CA (office) (cell) 41