CREATE – National Evaluation Institute Annual Conference – October 8-10, 2009
The Brown Hotel, Louisville, Kentucky
Research and Evaluation that inform Leadership for Results

Masking Variations in Achievement Gains
By Eliot R. Long
A*Star Audits, LLC - Brooklyn, NY
eliotlong.astaraudits.com

Teacher encouraged guessing: Unstructured influence on student test item responses

An accepted practice
- Recommended by educational assessment writers
- Supported by extensive research – since the early 1920s
- A common practice in schools across the U.S. – assessment mantra: "If it's blank, it's wrong"
- Informal, entrepreneurial teacher activity – no written policy or instructions on how to – or not to – do it

Yet, no evaluation of impact on program evaluation or accountability
- No study of effects on low performing students
- No study of impact on comparison of test scores over time
- No study of recommendations put into general practice

A Norms Review

The following exhibits are based on four separate research projects, each including the development of group response pattern norms:
- Classroom groups, grades 3-7 in a northeast urban school district: 15,825 classrooms, 391,078 students
- School groups, grade 3 statewide in a Midwest state: 2,317 schools, 140,203 students
- Nationwide sample, grade 4, a test section of the 2002 NAEP Reading: 36,314 students
- Job applicant groups across the U.S.: 87 employers, 447 employer groups, 32,458 job applicants

Percent Correct & Test Completion

Teacher Administered Tests – "If it's blank, it's wrong."
  Northeast Urban School District – Reading Tests, Grades 3-7:
    Pct. attempting all questions: 93.1% to 97.4%
  Midwest Statewide (2001) – Math Test, Grade 3:
    Pct. attempting all questions: 97.4%

Non-teacher Administered Tests – no encouraged guessing
  Independent Proctor Administered – NAEP Reading 2002, Grade 4:
    Pct. attempting all questions: 60.9%
  Employer Administered – Verbal Skills, Job Applicants:
    Pct. correct: 82.0%; Pct. attempting all questions: 44.0%
  Employer Administered – Quantitative Skills, Job Applicants:
    Pct. correct: 75.2%; Pct. attempting all questions: 28.2%

Test Completion: A Teacher/Proctor Effect

Answers left blank are concentrated by classroom:
- 15.6% of all classrooms account for 77.6% of all answers left blank.
- 5.6% of all classrooms account for 48.0% of all answers left blank.

Grade 5 Reading: 45 items, 4-alternative multiple-choice. Classrooms are grouped by class standing (quartiles) and split into 'Low Blanks' classes (fewer than 26 answers left blank) and 'High Blanks' classes (26 or more answers left blank).

Within each class-standing quartile, 'High Blanks' classes account for a growing share of answers left blank: 12.0% in the 4th quartile, 32.5% in the 3rd, 43.5% in the 2nd, and 63.8% in the 1st; across all classes they account for 48.0% of all blanks.

                          All Classes   'Low Blanks' Classes   'High Blanks' Classes
Pct. Correct              65.5%         65.9%                  59.3%
Pct. Attp. All Questions  94.0%         95.1%                  74.0%

Tale of Two Classes: Number Attempted by Number Correct

Two classrooms with the same class average score, with and without encouraged guessing.

The Norm of Classroom Test Administration:
  n = 21, Blanks = 3 (Pct. Blank = 0.3%), RS Avg. = 19.4, SD = 4.3, KR-20 = .53

The Exception:
  n = 21, Blanks = 199 (Pct. Blank = 21.1%), RS Avg. = 19.4, SD = 7.9, KR-20 = .89

NAEP & Job Applicants: Number Attempted by Number Correct

Independent Test Administrators – NAEP 2002 Grade 4 Reading:
  Students leave many answers blank.
  Pct. correct of attempts = 67.6%; Pct. attempting all questions = 60.9%

Employer Administered – Test of Basic Verbal Skills:
  Job applicants leave many answers blank.
  Pct. correct of attempts = 75.1%; Pct. attempting all questions = 1.8%

Correlation Analysis: Number Attempted – Number Correct

Teacher Administered              All Students             Students with 5+ Blanks
Grade 5 Reading                   r = .153 (n = 66,320)    r = .527 (n = 1,094)
Grade 5 Math                      r = .110 (n = 69,413)    r = .549 (n = 238)
Grade 6 Reading                   r = .162 (n = 62,524)    r = .583 (n = 658)
Grade 7 Reading                   r = .202 (n = 58,915)    r = .597 (n = 1,416)

Independent Test Administrator
NAEP Grade 4 Reading              r = .608 (n = 36,314)

Employer Administered – Job Applicants
Test of Verbal Skills             r = .717 (n = 32,458)
Test of Quantitative Skills       r = .581 (n = 31,629)

Hovland and Wonderlic (1939): adult workers & students, Otis Test of Mental Ability, 4 test forms & 2 time limits: r = .608 to .723, n = 125 to 2,274 (8 variations)

Location of Answers Left Blank

Recommendations to encourage guessing presume that most answers left blank are imbedded; that is, they represent questions that were addressed and, for some reason, skipped. Our norms reveal that most blanks are trailing; that is, they represent questions that were not reached within the time limit.

Position of Blanks              Imbedded    Trailing
Grade 5 Reading                 22.3%       77.7%
NAEP Grade 4 Reading            15.8%       84.2%
Job Applicant Verbal Skills     5.2%        94.8%

Teachers must significantly change students' test work behavior to obtain answers to 'not reached' questions. How?
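The imbedded/trailing distinction above can be made concrete in code. A minimal sketch, assuming a response vector where None marks a blank item; the classification rule (blanks after the last answered item count as trailing) is an assumption based on the slide's definitions, not the author's published procedure:

```python
def classify_blanks(responses):
    """Split a student's blank answers into 'imbedded' and 'trailing'.

    responses: list of item responses in test order; None means left blank.
    Trailing blanks are the unbroken run of blanks at the end of the test
    (items presumed not reached); all other blanks are imbedded (skipped).
    """
    last_answered = -1
    for i, r in enumerate(responses):
        if r is not None:
            last_answered = i
    imbedded = sum(1 for r in responses[:last_answered + 1] if r is None)
    trailing = len(responses) - (last_answered + 1)
    return imbedded, trailing

# Example: 45-item test, two skipped items and a run of unreached items at the end.
resp = ["B", "C", None, "A", None] + ["D"] * 30 + [None] * 10
print(classify_blanks(resp))  # -> (2, 10)
```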

Test Score Reliability (KR-20) by Classroom

Teacher involvement in their students' test work behavior to encourage guessing is entrepreneurial, often undermining test score reliability.

50+ Answers Left Blank: 42 classrooms at and below average – likely to have little encouragement to guess.
No Answers Left Blank: 330 classrooms at and below average – likely to have extensive encouragement to guess.
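For reference, KR-20 can be computed directly from a 0/1 item-score matrix. A minimal sketch using the standard KR-20 formula; the example scores are made up, not data from the study:

```python
import numpy as np

def kr20(item_scores):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) item scores.

    item_scores: 2-D array, rows = students, columns = items.
    """
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                           # number of items
    p = x.mean(axis=0)                       # proportion correct per item
    q = 1.0 - p
    total_var = x.sum(axis=1).var(ddof=1)    # variance of students' total scores
    return (k / (k - 1.0)) * (1.0 - (p * q).sum() / total_var)

# Example: 6 students x 5 items (made-up data).
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0],
])
print(round(kr20(scores), 2))
```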

The volume of teacher encouraged guessing

Parsing Grade 5 number correct scores with the traditional correction-for-guessing:
  S = R – W/(n-1)
where S = true score, R = number right, W = number wrong, n = number of answer choices.

Grade 5 Reading: 45 items, 4 answer alternatives. A raw score of 18 (RS 18) corresponds to the minimum scale score for 'Basic' – just passing.

For the number correct score at the minimum for Basic (R = 18, W = 27):
  S = 18 – 27/(4-1) = 18 – 9 = 9

Result: Half of the number correct score is due to random guessing.
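A quick sketch of the correction-for-guessing arithmetic above; the function and variable names are my own, not from the presentation:

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Traditional correction for guessing: S = R - W/(n - 1)."""
    return num_right - num_wrong / (num_choices - 1)

# Grade 5 Reading: 45 items, 4 answer choices.
# At the minimum score for 'Basic' (R = 18), the remaining items are wrong (W = 27):
print(corrected_score(18, 27, 4))  # -> 9.0, i.e. half the number-correct score
```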

Success rate: A norms approach

The traditional correction-for-guessing formula assumes that 100% of skills-based answers are correct. A regression of median percent correct on number attempted, for test-takers who leave 5+ answers blank, finds a variable rate of success:

Regression of Median Pct. Correct on Number Attempted
Test-Takers          Number     Data Points
Grade 5 Reading      1,449      7*
Grade 6 Reading      1,486      7*
Grade 7 Reading      1,269      7*
Job Applicants       15,650     25**

Each regression reports an R squared, constant, and slope of the form
  Percent Correct = Constant + Slope × A_s
where A_s represents the number of questions answered based on the test-taker's skills.

* Number attempted ranges: up to 15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45
** Number attempted: 21 through 45
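A minimal sketch of the regression step, assuming grouped data of the kind described above; the data points here are invented for illustration, and numpy.polyfit stands in for whatever fitting procedure the study actually used:

```python
import numpy as np

# Hypothetical (invented) group medians: midpoint of each number-attempted range
# and the median percent correct for test-takers in that range.
num_attempted = np.array([13, 18, 23, 28, 33, 38, 43])
median_pct_correct = np.array([0.58, 0.62, 0.67, 0.72, 0.76, 0.82, 0.87])

# Fit Percent Correct = constant + slope * A_s (degree-1 polynomial).
slope, constant = np.polyfit(num_attempted, median_pct_correct, 1)
print(f"Percent Correct = {constant:.3f} + {slope:.4f} * A_s")
```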

Add norms to the traditional formula = Empirical approach

Traditional formula: S = R – W/(n-1), or R = S + W/(n-1)   [skills + guessing]

Empirical formula: R = Pct. Correct × A_s + (A_t – A_s)/n
or, substituting the regression for Pct. Correct:
  R = (0.465 + 0.0094 × A_s) × A_s + (A_t – A_s)/n   [skills + guessing]
where A_t = total attempts = 45 and A_s = skill-based attempts.
Note: W = (A_t – A_s) × ((n-1)/n)

Solution: substitute 45 for A_t and 18 for R; solve to find A_s = 17.7.

For a score of 18:
  18 = (0.0094 × 17.7²) + (0.465 × 17.7) + ((45 – 17.7)/4) = 11.2 + 6.8 = skills + guessing

Results:
- 39% (17.7/45) of answers are attempted based on skills
- 61% of answers are guessed due to teacher encouragement
- 38% of the observed score is based on encouraged random guessing
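To reproduce the parsing step, the empirical formula can be solved for A_s numerically. A sketch using the Grade 5 values read off the worked example (constant 0.465, slope 0.0094); the bisection solver is my own choice, not necessarily the author's method:

```python
def observed_score(a_s, a_t=45, n=4, constant=0.465, slope=0.0094):
    """Empirical formula: R = (constant + slope*A_s)*A_s + (A_t - A_s)/n."""
    return (constant + slope * a_s) * a_s + (a_t - a_s) / n

def solve_skill_attempts(r, a_t=45, n=4, tol=1e-6):
    """Find A_s such that observed_score(A_s) == r, by bisection on [0, A_t]."""
    lo, hi = 0.0, float(a_t)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if observed_score(mid, a_t, n) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

a_s = solve_skill_attempts(18)           # skill-based attempts for a raw score of 18
skills = (0.465 + 0.0094 * a_s) * a_s    # answers correct from skill
guessing = (45 - a_s) / 4                # expected correct from random guessing
print(round(a_s, 1), round(skills, 1), round(guessing, 1))  # ~17.7, ~11.2, ~6.8
```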

Observed and Estimated True Scores

Grade 5 Reading Test: Distribution of Observed and Estimated True Skills
Application of the 'empirical' parsing formula to the full distribution of Grade 5 scores*.

Student Distribution (Observed vs. Est. True) – Change: Mean +10.2%, SD -16.5%
Classroom Distribution (Observed vs. Est. True) – Change: Avg. Mean +11.6%, Avg. SD -19.9%

* Random guessing outcomes are forecast by the binomial distribution and moderated by the variation in the volume of guessing with student skill level. The actual percent guessed correct is lower than expected among lower observed scores and higher than expected among higher observed scores.
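The footnote's binomial forecast can be illustrated directly: for a student who guesses k items on a 4-choice test, the number guessed correct follows Binomial(k, 0.25). A small sketch with an illustrative k; scipy is assumed available:

```python
from scipy.stats import binom

# A student who guesses on 27 items of a 4-choice test.
dist = binom(27, 0.25)
print(dist.mean(), dist.std())                        # ~6.75 correct by chance, spread ~2.25
print([round(dist.pmf(k), 3) for k in range(0, 15)])  # probability of k lucky guesses
```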

Volume of encouraged guessing by Performance Level

Contribution of Encouraged Guessing to Student Scores
Student Averages by Performance Level – Grade 5 Reading

Estimated random guessing as a percent of all answers:
  Level 4 Advanced (6.2% of students): 0.0%
  Level 3 Proficient (48.4% of students): 15.4%
  Level 2 Basic (37.1% of students): 42.8%
  Level 1 Below Basic (8.3% of students): 69.8%
  Levels 1-2 Basic & Below (45.4% of students): 47.1%
  All Students (100.0%): 26.7%

Encouraged guessing creates a test score modulator

Changes in skill and guessing move in opposite directions, offsetting each other in the total score.

Comparison of First Test and Second Test Scores

                        Observed Total   Based on Skills   Based on Guessing
1st Test Pct. Correct   40.0%            63.3%             25.0%
2nd Test Pct. Correct   44.4%            68.7%             25.0%
Pct. Gain               11.1%            23.2%

52% of true gain is masked by guessing.
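The offsetting arithmetic can be made concrete. A sketch under assumed conditions (a 45-item, 4-choice test; the skill-attempt counts and skill success rates are illustrative values chosen to approximate the slide's percentages, not the slide's underlying data):

```python
def parse_score(skill_attempts, pct_correct_of_skill, total_items=45, n_choices=4):
    """Split an observed number-correct score into skill and guessing components,
    assuming every item not attempted from skill is guessed at chance."""
    skills = pct_correct_of_skill * skill_attempts
    guessing = (total_items - skill_attempts) / n_choices
    return skills + guessing, skills, guessing

# Hypothetical first and second administrations for the same students.
obs1, skill1, guess1 = parse_score(17.6, 0.633)
obs2, skill2, guess2 = parse_score(20.0, 0.687)

observed_gain = (obs2 - obs1) / obs1
skill_gain = (skill2 - skill1) / skill1
print(f"observed gain {observed_gain:.1%}, skill gain {skill_gain:.1%}")
print(f"share of true gain masked: {1 - observed_gain / skill_gain:.0%}")
```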

Estimated Gain Masked by Guessing

The 'empirical' formula may be applied to first and second test scores at each score level.

Hypothetical Gains Parsed for Guessing Effects
Percentile Standing    Pct. Est. True Gain    Pct. of True Gain Masked
90%                    10.9%                  8.3%
80%                    13.3%                  24.9%
70%                    13.2%                  24.2%
60%                    13.8%                  27.3%
50%                    13.4%                  25.6%
40%                    15.2%                  34.1%
30%                    15.8%                  36.7%
20%                    16.5%                  39.2%
10%                    20.6%                  51.5%

Findings of a Norms Review

The informal practice of teacher encouraged guessing to complete all test answers has the following effects:

1. High volume of non-skills based test answers
The volume of test answers that result from teacher encouragement is very high: 26% of all answers for students at the school district average and 50% or more among students most at risk of failing.

2. Teacher involvement lowers test score reliability
Teacher involvement is unstructured, varying from classroom to classroom and from student to student, creating widely varying and generally lower test score reliability.

3. Guessed correct answers reduce the range of measurement
Added guessing increases among lower performing students, raising their scores more than those of higher performing students and thereby narrowing the range of measurement by roughly 20%.

Findings of a Norms Review (continued)

4. Guessing creates a test score modulator
Changes in student achievement cause changes in the volume of guessing in the opposite, offsetting direction, modulating observed scores. This modulating effect masks variations in gain by as much as 50% or more among low performing students.

Teacher encouraged guessing narrows the window onto student achievement gains while reducing both the range and the reliability of the measurement that can be observed. As a consequence, non-skills related variation may predominate, misdirecting test score interpretation and education policy.