NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.

Similar presentations
Implications and Extensions of Rasch Measurement.

Standardized Scales.
Structural Equation Modeling. What is SEM Swiss Army Knife of Statistics Can replicate virtually any model from “canned” stats packages (some limitations.
Item Response Theory in a Multi-level Framework Saralyn Miller Meg Oliphint EDU 7309.
Conceptualization and Measurement
Reliability Definition: The stability or consistency of a test. Assumption: True score = obtained score +/- error Domain Sampling Model Item Domain Test.
© 2006 The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Validity and Reliability Chapter Eight.
Assessment Procedures for Counselors and Helping Professionals, 7e © 2010 Pearson Education, Inc. All rights reserved. Chapter 5 Reliability.
VALIDITY AND RELIABILITY
MGT-491 QUANTITATIVE ANALYSIS AND RESEARCH FOR MANAGEMENT
General Information --- What is the purpose of the test? For what population is the designed? Is this population relevant to the people who will take your.
Item Response Theory in Health Measurement
Issues of Technical Adequacy in Measuring Student Growth for Educator Effectiveness Stanley Rabinowitz, Ph.D. Director, Assessment & Standards Development.
Models for Measuring. What do the models have in common? They are all cases of a general model. How are people responding? What are your intentions in.
Part II Knowing How to Assess Chapter 5 Minimizing Error p115 Review of Appl 644 – Measurement Theory – Reliability – Validity Assessment is broader term.
Concept of Reliability and Validity. Learning Objectives  Discuss the fundamentals of measurement  Understand the relationship between Reliability and.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Education 795 Class Notes Factor Analysis II Note set 7.
Robert delMas (Univ. of Minnesota, USA) Ann Ooms (Kingston College, UK) Joan Garfield (Univ. of Minnesota, USA) Beth Chance (Cal Poly State Univ., USA)
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item Response Theory Psych 818 DeShon. IRT ● Typically used for 0,1 data (yes, no; correct, incorrect) – Set of probabilistic models that… – Describes.
Introduction to plausible values National Research Coordinators Meeting Madrid, February 2010.
Modern Test Theory Item Response Theory (IRT). Limitations of classical test theory An examinee’s ability is defined in terms of a particular test The.
Copyright © 2012 Wolters Kluwer Health | Lippincott Williams & Wilkins Chapter 14 Measurement and Data Quality.
Unanswered Questions in Typical Literature Review 1. Thoroughness – How thorough was the literature search? – Did it include a computer search and a hand.
Technical Adequacy Session One Part Three.
Test item analysis: When are statistics a good thing? Andrew Martin Purdue Pesticide Programs.
Chapter Nine Copyright © 2006 McGraw-Hill/Irwin Sampling: Theory, Designs and Issues in Marketing Research.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Chapter 7 Item Analysis In constructing a new test (or shortening or lengthening an existing one), the final set of items is usually identified through.
Effect Size Estimation in Fixed Factors Between- Groups Anova.
Estimating a Population Proportion
Reliability & Validity
Measurement Models: Exploratory and Confirmatory Factor Analysis James G. Anderson, Ph.D. Purdue University.
Appraisal and Its Application to Counseling COUN 550 Saint Joseph College For Class # 3 Copyright © 2005 by R. Halstead. All rights reserved.
Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities Alan Nicewander Pacific Metrics 1.
Validity Validity: A generic term used to define the degree to which the test measures what it claims to measure.
Research Methodology and Methods of Social Inquiry Nov 8, 2011 Assessing Measurement Reliability & Validity.
Validity and Item Analysis Chapter 4. Validity Concerns what the instrument measures and how well it does that task Not something an instrument has or.
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
SOCW 671: #5 Measurement Levels, Reliability, Validity, & Classic Measurement Theory.
Practical Issues in Computerized Testing: A State Perspective Patricia Reiss, Ph.D Hawaii Department of Education.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot Watt University 12th February 2003.
Chapter 6 - Standardized Measurement and Assessment
Chapter 13 Understanding research results: statistical inference.
Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory Quinn N. Lathrop and Ying Cheng Assistant Professor Ph.D., University.
Computacion Inteligente Least-Square Methods for System Identification.
5. Evaluation of measuring tools: reliability Psychometrics. 2011/12. Group A (English)
Measurement and Scaling Concepts
ESTABLISHING RELIABILITY AND VALIDITY OF RESEARCH TOOLS Prof. HCL Rawat Principal UCON,BFUHS Faridkot.
CHAPTER 8 Estimating with Confidence
Classical Test Theory Margaret Wu.
Validity and Reliability
Item Analysis: Classical and Beyond
Reliability & Validity
Item pool optimization for adaptive testing
Workshop Questionnaire.
PSY 614 Instructor: Emily Bullock, Ph.D.
Evaluation of measuring tools: reliability
Inferences and Conclusions from Data
National Conference on Student Assessment
Aligned to Common Core State Standards
Mohamed Dirir, Norma Sinclair, and Erin Strauts
15.1 The Role of Statistics in the Research Process
Psy 425 Tests & Measurements
Understanding Statistical Inferences
Item Analysis: Classical and Beyond
Item Analysis: Classical and Beyond
Presentation transcript:

NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL

2 What is different about the adaptive context? How do you conceptualize adaptive assessments? How do you make the transition from fixed form thinking? How can you evaluate the quality of these tests?

In the fixed-form world… Test Blueprint + Items = Test Form = Student Test Event. Percent correct is an indicator of difficulty. There are commonly accepted criteria for item acceptance. 3

In the adaptive context… The test blueprint is a design for the student test event. Item pool + test structure + algorithm determine each test event. Variable linking block (all items). P-values close to .5. Metrics not as well established. 4

Everything supports the test event: the Test Blueprint, the Content & Report Structure, the Pool, and the Algorithm all feed into the Test Event. 5

What’s going on here? You are moving from the concept of a population responding to a form into the realm of a person responding to an individual item. Indicators based on sets of people responding to sets of items may be uninformative. The scale representing the latent trait assumes greater importance. 6

Move from population-based thinking to responses to items. Forms are not linked to one another. The pool consists of items linked to the scale. Scores from non-parallel tests are expressed and interpreted on the scale. Percent correct is not important in assessing ability. The test event establishes the difficulty of the items a student is getting right about half the time. The goal of the test session is to solve for theta (use the IRT equation with your favorite number of parameters; the form is sketched below). 7
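For reference, this is the response function that "solve for theta" points to, written in the three-parameter logistic (3PL) form; set c_i = 0 for the 2PL case, and additionally a_i = 1 for the Rasch/1PL case:

P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}

Here a_i, b_i, and c_i are the item's discrimination, difficulty, and lower-asymptote (guessing) parameters, and theta is the examinee's ability. Scoring a test event amounts to finding the theta that best accounts for the observed pattern of right and wrong answers under this model.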

Start with the Test Blueprint. What do you want every student to get? Content (categories and proportions), cognitive characteristics, item types. How many items in each test event? What are you going to report, for individuals and for groups? Overall scores, sub-scores, achievement category. 8

How do you evaluate pool adequacy? Reckase's p-optimal pool evaluation: an analysis of "bins" that asks whether the pool satisfies some proportion of a fully informative pool. It is unrealistic to expect that every value of theta will have a maximally informative item, so the method specifies a degree of optimality. The p-optimal method can be used to evaluate existing pools or to specify pool design. 9
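A minimal Python sketch of the bin-counting idea, not Reckase's published procedure: the difficulty range is cut into bins, each bin has a target number of items that a fully informative pool would supply, and the operational pool is credited bin by bin. The bin edges, target counts, and simulated Rasch-style pool below are all illustrative assumptions.

import numpy as np

def bin_coverage(item_difficulties, target_counts, edges):
    """Compare an operational pool against a target ('optimal') pool, bin by bin."""
    actual, _ = np.histogram(item_difficulties, bins=edges)
    covered = np.minimum(actual, target_counts)      # surplus items in a bin add nothing
    return covered.sum() / np.sum(target_counts)     # proportion of the target pool satisfied

# Illustrative use: difficulty bins from -3 to +3 in steps of 0.5
edges = np.arange(-3.0, 3.5, 0.5)
rng = np.random.default_rng(1)
pool_b = rng.normal(0.0, 1.0, size=400)              # hypothetical calibrated difficulties
target = np.full(len(edges) - 1, 40)                 # hypothetical target count per bin
print(f"Pool satisfies {bin_coverage(pool_b, target, edges):.0%} of the target pool")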

How do you evaluate pool adequacy? Veldkamp & van der Linden's shadow test method: 1. At every point in the test, assemble a shadow test that meets all constraints and has maximum information at the current ability estimate. 2. Administer the item in the shadow test with maximum information. 3. Update the ability estimate. 4. Return all unused items to the pool. 5. Adjust the constraints to allow for the attributes of the item administered. 6. Repeat Steps 1-5 until the end of the test. 10
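The loop below is a schematic Python simulation of those steps, with two loudly flagged simplifications: the optimal shadow-test assembly (normally solved as a mixed-integer program) is replaced by a greedy stand-in, and the ability update is a crude grid-search maximum likelihood. The 2PL item parameters, content areas, and quotas are invented for illustration; this is not van der Linden's solver.

import numpy as np

rng = np.random.default_rng(0)
THETA_GRID = np.linspace(-4, 4, 161)

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def update_theta(a_used, b_used, responses):
    """Crude grid-search maximum-likelihood ability update."""
    loglik = np.zeros_like(THETA_GRID)
    for a, b, u in zip(a_used, b_used, responses):
        p = p_2pl(THETA_GRID, a, b)
        loglik += u * np.log(p) + (1 - u) * np.log(1.0 - p)
    return THETA_GRID[np.argmax(loglik)]

def assemble_shadow_test(theta, free, quotas, a, b, content):
    """Greedy stand-in for optimal shadow-test assembly: for each content area
    still owed items, keep the most informative free items in that area."""
    shadow = []
    for area, needed in quotas.items():
        in_area = sorted((i for i in free if content[i] == area),
                         key=lambda i: info_2pl(theta, a[i], b[i]), reverse=True)
        shadow.extend(in_area[:needed])
    return shadow

def shadow_cat(true_theta, a, b, content, quotas):
    free, used, resp = set(range(len(a))), [], []
    remaining = dict(quotas)
    theta = 0.0                                        # neutral starting estimate
    while sum(remaining.values()) > 0:
        shadow = assemble_shadow_test(theta, free, remaining, a, b, content)   # step 1
        item = max(shadow, key=lambda i: info_2pl(theta, a[i], b[i]))          # step 2
        resp.append(int(rng.random() < p_2pl(true_theta, a[item], b[item])))   # simulated answer
        used.append(item)
        free.discard(item)                             # step 4: unused items stay in the pool
        remaining[content[item]] -= 1                  # step 5: adjust constraints
        theta = update_theta(a[used], b[used], resp)   # step 3: update the ability estimate
    return theta

# Tiny illustrative pool: 200 items, two content areas, 10 + 10 items per test event
n = 200
a = rng.uniform(0.7, 2.0, n)
b = rng.normal(0.0, 1.0, n)
content = np.array(["algebra", "geometry"])[rng.integers(0, 2, n)]
print(shadow_cat(true_theta=1.0, a=a, b=b, content=content,
                 quotas={"algebra": 10, "geometry": 10}))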

Adaptive Test Design - Algorithm. How will you guarantee that each student gets the material in your test design? Item selection, scoring, domain sampling. How will you guarantee reliable scores and categories? Overall scores, sub-scores, achievement category. How do you control for item exposure? 11
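As one example of exposure control, here is a Python sketch (with invented 2PL parameters) of "randomesque" selection, a simple and widely used approach: instead of always administering the single most informative item, pick at random among the top k. Other methods, such as Sympson-Hetter exposure probabilities, are more elaborate.

import numpy as np

rng = np.random.default_rng(2)

def item_information(theta, a, b):
    """Fisher information of 2PL items at a given theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_randomesque(theta, a, b, eligible, k=5):
    """Pick at random among the k most informative eligible items,
    rather than always exposing the single best one."""
    eligible = np.asarray(sorted(eligible))
    info = item_information(theta, a[eligible], b[eligible])
    top_k = eligible[np.argsort(info)[::-1][:k]]
    return int(rng.choice(top_k))

# Illustrative call against a fake 300-item pool
a = rng.uniform(0.7, 2.0, 300)
b = rng.normal(0.0, 1.0, 300)
print(select_randomesque(theta=0.3, a=a, b=b, eligible=range(300)))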

12 Adaptive test event - Start. Assumption: you have a calibrated item pool that supports your test purpose. What do you need to know about the examinee? How will you choose the initial item? Jumping into the item pool.

13 Adaptive test event – Finding Theta. Assumption: you have a response to the initial item. How do you estimate ability? How do you estimate error? How do you choose the next item? How do you satisfy your test event design? Progressing through the item pool.
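One common recipe for these steps, sketched in Python under assumed 2PL parameters (not the only defensible choice): an expected a posteriori (EAP) ability estimate computed on a theta grid with a standard normal prior, the posterior standard deviation as the error estimate, and maximum information at the current estimate for choosing the next item.

import numpy as np

GRID = np.linspace(-4, 4, 161)
PRIOR = np.exp(-0.5 * GRID**2)                     # standard normal prior (unnormalized)

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(a_used, b_used, responses):
    """EAP ability estimate and posterior SD, by brute force on a grid."""
    like = np.ones_like(GRID)
    for a, b, u in zip(a_used, b_used, responses):
        p = p_2pl(GRID, a, b)
        like *= p**u * (1.0 - p)**(1 - u)
    post = like * PRIOR
    post /= post.sum()
    theta_hat = np.sum(GRID * post)                        # posterior mean
    se = np.sqrt(np.sum((GRID - theta_hat)**2 * post))     # posterior SD as the error estimate
    return theta_hat, se

def next_item(theta_hat, a, b, unused):
    """Choose the unused item with maximum Fisher information at the current estimate."""
    unused = np.asarray(sorted(unused))
    p = p_2pl(theta_hat, a[unused], b[unused])
    info = a[unused]**2 * p * (1.0 - p)
    return int(unused[np.argmax(info)])

# After one hypothetical correct response to an item with a = 1.2, b = 0.0:
theta_hat, se = eap_estimate(np.array([1.2]), np.array([0.0]), [1])
print(f"theta ~ {theta_hat:.2f}, SE ~ {se:.2f}")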

14 Adaptive test event – Termination. What triggers the end of the test? Number of items, error threshold, proctor termination. What is reported to the student at the end? High achiever getting out of the pool.

15 How do I know it’s a good test? Classical reliability estimates depend on correlation among items. In CAT, inter-item correlation is low; this is an illustration of local independence. In general, CATs use the Marginal Reliability Coefficient (Samejima, 1977, 1994), which is based on analysis of the test information function over all values of theta. In evaluating tests, it can be interpreted like coefficient alpha.
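A Python sketch of one common way to compute a marginal reliability coefficient from the test information function; exact formulas vary across programs, so treat this as an approximation rather than Samejima's definition verbatim. The idea: average the error variance 1/I(theta) over an assumed ability distribution and compare it with the ability variance. The 2PL parameters below are invented.

import numpy as np

def marginal_reliability(a, b, mu=0.0, sigma=1.0, grid=np.linspace(-4, 4, 161)):
    """One common empirical form: average error variance 1/I(theta) over an
    assumed N(mu, sigma^2) ability distribution versus the ability variance."""
    w = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    w /= w.sum()                                   # discretized ability-distribution weights
    theta = grid[:, None]
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))     # 2PL response probabilities
    tif = (a**2 * p * (1.0 - p)).sum(axis=1)       # test information at each theta
    mean_error_var = np.sum(w / tif)               # expected SE^2 over the ability distribution
    return sigma**2 / (sigma**2 + mean_error_var)

# Illustrative 40-item set drawn from a hypothetical calibrated pool
rng = np.random.default_rng(3)
a = rng.uniform(0.7, 2.0, 40)
b = rng.normal(0.0, 1.0, 40)
print(f"marginal reliability ~ {marginal_reliability(a, b):.2f}")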

16 Simulation is your friend. Using the actual pool, test structure, and algorithm, simulate student responses at interesting levels of theta. Compare the test’s estimated thetas with the true thetas. Bias: average difference. Fit: root mean squared error. How do I know it’s a good test before giving it to zillions of students?
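The bookkeeping is simple once the simulation has run. In the Python sketch below the "estimated" thetas are just true values plus noise, standing in for the estimates a real run of the pool, structure, and algorithm would produce.

import numpy as np

def bias_and_rmse(true_theta, est_theta):
    """Bias (average signed difference) and RMSE of estimated vs. true theta."""
    diff = np.asarray(est_theta) - np.asarray(true_theta)
    return diff.mean(), np.sqrt((diff**2).mean())

# e.g., 1,000 simulated examinees at each theta of interest (values invented here)
rng = np.random.default_rng(4)
true = np.repeat([-2.0, -1.0, 0.0, 1.0, 2.0], 1000)
est = true + rng.normal(0.0, 0.3, true.size)       # stand-in for simulated CAT estimates
bias, rmse = bias_and_rmse(true, est)
print(f"bias = {bias:+.3f}, RMSE = {rmse:.3f}")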

17 CAT depends on a calibrated bank. When items are used operationally, responses are gathered from the examinees for whom they are most informative (i.e., ability and difficulty are close). Variance is therefore low, so correlational indicators are not appropriate, and p-values are around .5.

18 Evaluating item technical quality. Calibration depends on a common-person link to the scale. Expose items to a representative sample. The trick is to get informative responses.

19 Evaluating item technical quality. In calibration, the process is to find difficulty from the responses of examinees with known abilities. Look at a vector of p-values across the range of theta. Evaluate the relationship between observed and expected p-values for your IRT model; you may use chi-square or the correlation of observed to expected p-values. What value of difficulty maximizes this relationship?
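A rough Python sketch of that idea for a single Rasch item, using a grid search over candidate difficulties and a chi-square-style comparison of observed and expected p-values across ability groups (the correlation criterion mentioned above would work similarly). The field-test data are simulated, and the grouping into ten ability bins is an arbitrary choice.

import numpy as np

def calibrate_difficulty(known_theta, responses, b_grid=np.linspace(-3, 3, 121)):
    """Grid-search the Rasch difficulty that best matches observed p-values
    across ability groups, using a chi-square-style fit criterion."""
    theta = np.asarray(known_theta)
    u = np.asarray(responses)
    edges = np.linspace(theta.min(), theta.max(), 11)      # 10 ability groups
    mids = 0.5 * (edges[:-1] + edges[1:])
    group = np.clip(np.digitize(theta, edges) - 1, 0, 9)
    n_g = np.bincount(group, minlength=10)
    p_obs = np.bincount(group, weights=u, minlength=10) / np.maximum(n_g, 1)

    best_b, best_fit = None, np.inf
    for b in b_grid:
        p_exp = 1.0 / (1.0 + np.exp(-(mids - b)))          # Rasch expected p per group
        chi2 = np.sum(n_g * (p_obs - p_exp) ** 2 / (p_exp * (1.0 - p_exp)))
        if chi2 < best_fit:
            best_b, best_fit = b, chi2
    return best_b

# Fake field-test data for one item with true difficulty 0.8
rng = np.random.default_rng(5)
theta = rng.normal(0.0, 1.0, 2000)                         # examinees with known abilities
resp = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(theta - 0.8)))).astype(int)
print(f"recovered difficulty ~ {calibrate_difficulty(theta, resp):.2f}")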

Ask lots of questions. Keep pestering until understanding dawns. Thank you for your attention! Questions, comments? Contact: 20