Let’s Examine Our Exams: On Test Reliability, Validity and Results and the New York State Testing Program
~~~~~~
Fred Smith, Administrative Analyst (Ret.), New York City Public Schools
Change the Stakes
December 14, 2015

The Tale of the Scale


Two Ways to Measure the Same Thing: Which Do You Think Is More Reliable?

The following slides show that results can differ when there are two ways to measure the same thing—whether it is how many books children borrow from the library or how well they perform in reading. When that happens, we need to ask which measure provides more accurate data, which observations are truer to reality and which tests are more reliable. As each comparison comes up, ask yourself which approach is more credible and gives a better account. In which results do you have greater confidence?

Two Ways to Measure the Same Thing: Which Do You Think Is More Reliable?

What if we wanted to know the average number of books 4th graders took out of the Guggenheim school library from one year to the next? The chart gives us two ways to arrive at that number:

1. Ask students to remember the number of books they borrowed.
2. Check the librarian’s computer, which records each book as it is checked out.

Which line is based on the data in the computer? Which gives a better picture of library activity from year to year—the smooth line or the up-and-down line? Which would you have more faith in if you wanted to find out how much the library was used?

[Slide 6: chart of average books borrowed per year, as recalled by students vs. recorded by the library computer.]
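To make the library contrast concrete, here is a minimal simulation of the two measurement methods. The yearly averages and the size of the memory error are invented for illustration; the point is only that a recorded measure stays stable while a noisy one swings.

```python
# Illustrative sketch (invented numbers): five years of library borrowing
# measured two ways -- exact checkout logs vs. noisy student recall.
import random

random.seed(1)

true_avg = [12.0, 12.5, 12.2, 12.8, 12.4]      # assumed "real" yearly averages

logged = true_avg                               # the computer records every checkout
recalled = [round(x + random.gauss(0, 3), 1)    # memory adds large random error
            for x in true_avg]

def avg_swing(series):
    """Average absolute change between consecutive years: a crude stability index."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)

print("logged  :", logged, "  avg swing =", round(avg_swing(logged), 2))
print("recalled:", recalled, "  avg swing =", round(avg_swing(recalled), 2))
```

The logged series barely moves, while the recalled series jumps around even though the underlying behavior is nearly constant. That jumpiness is the signature of an unreliable measure, and it is what the next slides look for in test results.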

Two Ways to Measure Reading Proficiency: Which Do You Think Is More Reliable?

Now we want to know about reading achievement and growth. The New York State ELA Test and the National Assessment of Educational Progress (NAEP) are two measures of proficiency in that area. The next chart shows how New York State’s 4th graders performed on the ELA (N=186,000) compared with NAEP, which is given to samples of children across the state every two years (N=4,200). Clearly, the NAEP results hold steady over the years charted, while the ELA results fluctuate sharply from one administration to the next. Physical attributes, learning abilities and averages do not usually shift so sharply from one year to another. Which set of results do you trust more? Why?

The chart supports three points:

1. The NYS tests have been unreliable. Little confidence can be placed in the results.
2. The lack of reliability extends to the most recent years, when testing aligned to the Common Core was introduced. Yet the State Education Department insisted that the 2013 results were a baseline against which to judge whether students met the Common Core Standards.
3. Critics of the State Testing Program find this deeply disturbing because, without reliability, test scores have no validity. High-stakes decisions about the status and growth of students, teachers and schools that rest on unreliable exams will be unsound.

[Slides 8–9: charts comparing New York State ELA results with NAEP results for 4th-grade reading.]
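A simple way to put a number on "fluctuation" is the standard deviation of year-over-year changes. The sketch below applies it to two invented series, one jumpy and one flat; the percentages are placeholders, not the actual ELA or NAEP results.

```python
# Hypothetical sketch: quantify how much a proficiency rate bounces between
# administrations. The series below are invented stand-ins, NOT real scores.
ela_like  = [62, 69, 77, 56, 31, 36]    # jumps around from year to year
naep_like = [35, 36, 35, 36, 35, 36]    # essentially flat

def volatility(series):
    """Standard deviation of year-over-year changes, in percentage points."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    mean = sum(diffs) / len(diffs)
    return (sum((d - mean) ** 2 for d in diffs) / len(diffs)) ** 0.5

print("jumpy series volatility:", round(volatility(ela_like), 1), "points")
print("flat series volatility :", round(volatility(naep_like), 1), "points")
```

When a trait that cannot plausibly change that fast shows double-digit volatility, the suspicion falls on the measuring instrument, not on the students.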

The Normal Distribution, aka the Bell-Shaped Curve

Scientists noticed that measures of height and weight follow a predictable pattern in the general population. Most people are in the middle, near the average—or the norm. Some fall a little above or below the middle; a small percentage are at the upper or lower ends of the curve. This pattern has been described as the normal distribution. It was reasoned that if physical attributes take on this shape, then things like IQ and educational achievement should follow the same model. So tests were designed and constructed to yield results that reflect this pattern.
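If scores really follow a normal distribution, fixed shares of the population fall within one, two and three standard deviations of the mean (roughly 68%, 95% and 99.7%). Here is a minimal empirical check, using a conventional mean-100/SD-15 scaling purely as an assumed example:

```python
# Sketch of the bell-curve pattern: draw simulated "scores" from a normal
# distribution (mean 100, SD 15 is an assumed, IQ-style scaling) and count
# how many land within 1, 2 and 3 standard deviations of the mean.
import random

random.seed(2)
scores = [random.gauss(100, 15) for _ in range(100_000)]

def pct_within(k):
    """Percent of simulated scores within k standard deviations of the mean."""
    lo, hi = 100 - k * 15, 100 + k * 15
    return 100 * sum(lo <= s <= hi for s in scores) / len(scores)

for k in (1, 2, 3):
    print(f"within {k} SD: {pct_within(k):.1f}%")   # ~68%, ~95%, ~99.7%
```

Tests built to the normal model are constructed so that real score distributions come out looking much like this simulated one.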

Item Development

The New York State Education Department has taken two approaches to creating test questions for the ELA and math tests given to students in grades 3–8. Both have involved working with commercial test publishers (CTB/McGraw-Hill and, from 2012 to date, Pearson, Inc.) in compliance with the testing requirements set forth by the No Child Left Behind Act. The preferred approach is known as embedded field testing; the other method is called stand-alone field testing. Both allow material to be tried out on students, and item analysis is performed to see how well the questions are functioning:

- Are the items confusing?
- Is there more than one acceptable answer to a given question?
- Are the reading passages and math problems appropriate to the age and grade level of the children for whom they are intended?
- Does the test give students enough time to finish?
- Do the items show bias against a particular group of students?

Embedded field testing is the preferred way to try out material and has been used in major testing programs such as the SAT. It depends on students not knowing which material is being tried out, because the trial items are interspersed among the items that actually count. The embedded items do not count in scoring the test; they make up a small part of the “operational test” that is considered the real test.
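Two standard statistics drive the item analysis just described: item difficulty (the proportion of students answering correctly) and discrimination (the point-biserial correlation between success on an item and the total score). The sketch below is a generic illustration with invented data, not NYSED’s or Pearson’s actual procedure.

```python
# Generic item-analysis sketch: compute difficulty (p) and point-biserial
# discrimination (rpb) for each item from a 0/1 response matrix.
def item_stats(responses):
    """responses: list of per-student lists of 0/1 item scores."""
    n = len(responses)
    totals = [sum(r) for r in responses]               # each student's total score
    mean_t = sum(totals) / n
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n) ** 0.5
    stats = []
    for i in range(len(responses[0])):
        right = [r[i] for r in responses]
        p = sum(right) / n                              # difficulty: share correct
        if 0 < p < 1 and sd_t > 0:
            mean_right = sum(t for t, c in zip(totals, right) if c) / sum(right)
            rpb = (mean_right - mean_t) / sd_t * (p / (1 - p)) ** 0.5
        else:
            rpb = 0.0                                   # degenerate item
        stats.append((round(p, 2), round(rpb, 2)))
    return stats

# Toy data: 6 students x 4 items (invented).
data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1],
        [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
for i, (p, rpb) in enumerate(item_stats(data), 1):
    print(f"item {i}: difficulty={p}, discrimination={rpb}")
```

Items that nearly everyone gets right or wrong, or that high scorers miss as often as low scorers, are exactly the confusing, ambiguous or biased items the questions above are meant to flag.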

Advantage of Embedding: Students are motivated to do well on the entire test. Since they do not know which items are embedded, they try equally hard to answer the trial items—under real examination conditions. The information gained from this approach is likely to reflect how students will do on the same material should it be selected for future use and included on operational exams.

Disadvantage: Embedding items lengthens the testing period. Students taking items that don’t count may lose energy or focus by the time they reach the operational items that come later in the test, which would hurt their results on the “real” test. In addition, embedded field testing requires many tryout forms of a test to be prepared in order to generate a large pool of items from which good items can be chosen for subsequent exams. Each form contains a set of field-test questions embedded within the operational material, which stays the same from form to form. Done correctly, there should be between 15 and 25 forms, so that enough problems, passages and items can be tried out to build the next test without making any testing session too long. In New York State, embedded ELA and math field testing began with Pearson in 2012, but only four forms per grade were given.
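The form-count argument is simple arithmetic: with F spiraled forms each carrying k embedded field-test items, the tryout pool is F × k items. The items-per-form figure below is an assumption for illustration, not a NYSED specification.

```python
# Back-of-the-envelope: size of the field-test item pool as forms increase.
# k (embedded items per form) is an assumed placeholder value.
def fieldtest_pool(forms, embedded_per_form):
    return forms * embedded_per_form

k = 10  # assumed embedded items per form (illustrative only)
for forms in (4, 15, 25):
    print(f"{forms:2d} forms x {k} items -> pool of {fieldtest_pool(forms, k)} tryout items")
```

At four forms per grade, the pretested pool is a fraction of what 15 to 25 forms would yield, which is why separate stand-alone sessions were added.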

Stand-Alone Field Testing has been done in New York to make up for the small number of items that could be tried out on only four embedded forms. This approach involves administering separate forms—containing nothing but trial material—on a day apart from the scheduled operational testing date.

Advantage: The publisher can prepare many forms and items to be taken by students without hurting their performance on an operational exam.

Disadvantage: Students are not motivated to do their best on a test they know doesn’t count. The effort they put forth is unlikely to match what they would give on an exam that mattered. This generates misleading, unreliable statistics that make it problematic to choose items for future tests, where students will be motivated to do their best. Pearson has done stand-alone field testing every June since 2012, arguably the worst time of year to test children and expect them to be motivated to do well, especially on tests that don’t count.
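A small simulation shows how low motivation corrupts field-test statistics: assume some share of students simply guess at random on the no-stakes form. Every parameter here is invented for illustration.

```python
# Hypothetical simulation: disengaged students guessing on a no-stakes field
# test make items look harder than they will be on the operational exam.
import random

random.seed(3)
TRUE_P = 0.70     # assumed chance an engaged student answers correctly
GUESS_P = 0.25    # chance of a correct blind guess on a 4-choice item
N = 10_000        # simulated students

def observed_pct_correct(disengaged_rate):
    correct = 0
    for _ in range(N):
        p = GUESS_P if random.random() < disengaged_rate else TRUE_P
        correct += random.random() < p
    return correct / N

for rate in (0.0, 0.2, 0.5):
    print(f"{rate:.0%} disengaged -> {observed_pct_correct(rate):.0%} answer correctly")
```

The more students tune out, the harder every item appears, so difficulty estimates and any item selections or cut scores built on them inherit the bias.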

TESTING IS AN ART, NOT A SCIENCE.