NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.

What is different about the adaptive context? How do you conceptualize adaptive assessments? How do you make the transition from fixed form thinking? How can you evaluate the quality of these tests?

In the fixed form world… Test Blueprint + items = Test Form = Student Test Event. Percent correct is an indicator of difficulty. There are commonly accepted criteria for acceptance.

In the adaptive context… The Test Blueprint is a design for the student test event. Item pool + test structure + algorithm determine each test event. Variable linking block (all items). P-values are close to .5. Metrics are not as well established.

Everything supports the test event. (Diagram: the Test Event is supported by the Test Blueprint, the Content & Report Structure, the Pool, and the Algorithm.)

What’s going on here? You are moving from the concept of a population responding to a form into the realm of a person responding to an individual item. Indicators based on sets of people responding to sets of items may be uninformative. The scale representing the latent trait assumes greater importance.

Move from population-based thinking to responses to items. Forms are not linked to one another. The pool consists of items linked to the scale. Scores from non-parallel tests are expressed and interpreted on the scale. Percent correct is not important in assessing ability. The test event establishes the difficulty of the items a student is getting right about half the time. The goal of the test session is to solve for theta. (Use the IRT equation with your favorite number of parameters.)
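
As a point of reference for "the IRT equation", the sketch below writes out the three-parameter logistic (3PL) response function and its item information. The function and parameter names are illustrative assumptions, not taken from the presentation; setting a = 1 and c = 0 reduces it to the one-parameter (Rasch) form.

```python
import math

def prob_correct(theta, b, a=1.0, c=0.0):
    """3PL probability of a correct response.

    theta: examinee ability, b: item difficulty,
    a: discrimination, c: lower asymptote (guessing).
    With a=1 and c=0 this is the Rasch (1PL) model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, b, a=1.0, c=0.0):
    """Fisher information contributed by one item at theta (3PL)."""
    p = prob_correct(theta, b, a, c)
    q = 1.0 - p
    return a ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

# When ability equals difficulty (and c = 0), the student is expected
# to answer correctly half the time, and information peaks there.
p_at_match = prob_correct(0.0, 0.0)
info_at_match = item_information(0.0, 0.0)
```

This is why p-values near .5 show up in adaptive testing: the algorithm keeps choosing items whose difficulty is close to the current theta estimate.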

Start with the Test Blueprint. What do you want every student to get? Content (categories and proportions), cognitive characteristics, item types. How many items in each test event? What are you going to report? For individuals? For groups? Overall scores, sub-scores, achievement category.

How do you evaluate pool adequacy? Reckase: p-optimal pool evaluation. Analysis of “bins”. Satisfy some proportion of a fully informative pool. It’s unrealistic to expect that every value of theta will have a maximally informative item. This method specifies a degree of optimality. The p-optimal method can be used to evaluate existing pools or to specify pool design.
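
To make the idea of "bins" concrete, here is a deliberately simplified sketch. It is not Reckase's published procedure: it only counts how many items fall in each difficulty bin and reports what proportion of a hypothetical per-bin target is met. The bin width, scale range, and target count are all assumptions chosen for illustration.

```python
def bin_coverage(difficulties, lo=-3.0, hi=3.0, width=0.5, target_per_bin=2):
    """Count items per difficulty bin and the share of the target that is filled.

    difficulties: item difficulty values on the theta scale.
    target_per_bin: hypothetical number of items wanted in every bin.
    Returns (counts, coverage) where coverage is between 0 and 1.
    """
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    for b in difficulties:
        if lo <= b < hi:
            index = min(int((b - lo) // width), n_bins - 1)
            counts[index] += 1
    # Items beyond the target in one bin do not make up for an empty bin.
    filled = sum(min(count, target_per_bin) for count in counts)
    return counts, filled / (n_bins * target_per_bin)

# Hypothetical pool: well stocked near the middle, thin at the extremes.
pool = [-2.9, -2.8, 0.1, 0.2, 0.3, 2.6]
counts, coverage = bin_coverage(pool)
```

A low coverage value flags regions of the scale where students would not see well-targeted items.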

How do you evaluate pool adequacy? Veldkamp & van der Linden: shadow test method.
1. At every point in the test, assemble a test (the shadow test) that meets the constraints and has maximum information at the current ability estimate.
2. Administer the item in the shadow test with maximum information.
3. Update the ability estimate.
4. Return all unused items to the pool.
5. Adjust the constraints to allow for the attributes of the item administered.
6. Repeat Steps 1-5 until the end of the test.

Adaptive Test Design: Algorithm. How will you guarantee that each student gets the material in your test design? Item selection, scoring, domain sampling. How will you guarantee reliable scores and categories? Overall scores, sub-scores, achievement category. How do you control for item exposure?

Adaptive test event: Start. Assumption: you have a calibrated item pool that supports your test purpose. What do you need to know about the examinee? How will you choose the initial item? Jumping into the item pool.

Adaptive test event: Finding Theta. Assumption: you have a response to the initial item. How do you estimate ability? How do you estimate error? How do you choose the next item? How do you satisfy your test event design? Progressing through the item pool.
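
One possible set of answers to these questions is sketched below. It assumes a Rasch model, an expected a posteriori (EAP) ability estimate with a standard normal prior, the posterior standard deviation as the error estimate, and maximum-information item selection. None of these choices comes from the presentation; they are simply common options, and the sketch ignores blueprint constraints and exposure control.

```python
import math

def p_rasch(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Evaluation grid for theta (an assumption: -4 to 4 in steps of 0.1).
GRID = [-4.0 + 0.1 * i for i in range(81)]

def eap_estimate(responses):
    """Estimate ability and error from (difficulty, score) pairs.

    score is 1 for correct and 0 for incorrect.
    Returns (posterior mean, posterior standard deviation).
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized standard normal prior
        for b, score in responses:
            p = p_rasch(t, b)
            w *= p if score == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def next_item(theta_hat, pool, used):
    """Index of the unused item with maximum information at theta_hat."""
    def info(i):
        p = p_rasch(theta_hat, pool[i])
        return p * (1.0 - p)
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=info)

# Hypothetical three-item pool of difficulties.
pool = [-1.0, 0.5, 2.0]
theta_hat, error = eap_estimate([(0.0, 1)])
chosen = next_item(theta_hat, pool, used=set())
```

EAP is used here because it stays defined when every response so far is correct or every response is incorrect, which is common early in a test event.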

Adaptive test event: Termination. What triggers the end of the test? Number of items, error threshold, proctor termination. What is reported to the student at the end? High achiever getting out of the pool.
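
The three triggers listed above can be combined into one stopping check, sketched here. The function name, the default item limit, and the default error threshold are hypothetical values for illustration, not recommendations from the presentation.

```python
def should_stop(items_given, standard_error, max_items=40,
                se_threshold=0.30, proctor_stop=False):
    """Return the reason the test event should end, or None to continue.

    max_items and se_threshold are hypothetical defaults.
    """
    if proctor_stop:
        return "proctor termination"
    if items_given >= max_items:
        return "number of items"
    if standard_error <= se_threshold:
        return "error threshold"
    return None

# After 12 items with a standard error of 0.45, the test continues.
reason = should_stop(items_given=12, standard_error=0.45)
```

Returning the reason rather than a bare flag makes it easy to record why each test event ended, which is useful when reviewing simulation results.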

How do I know it’s a good test? Classical reliability estimates depend on correlation among items. In CAT, inter-item correlation is low. This is an illustration of local independence. In general, CATs use the Marginal Reliability Coefficient (Samejima, 1977, 1994). This is based on analysis of the test information function over all values of theta. In evaluating tests, it can be interpreted like coefficient alpha.
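
Several formulations of marginal reliability are in use. The sketch below uses one common version, an assumption rather than necessarily the exact form in the cited work: the variance of the theta estimates minus the average squared standard error, divided by the variance of the theta estimates.

```python
from statistics import mean, pvariance

def marginal_reliability(theta_hats, standard_errors):
    """One common formulation of marginal reliability.

    theta_hats: final ability estimates, one per examinee.
    standard_errors: the standard error attached to each estimate.
    """
    var_theta = pvariance(theta_hats)
    mean_error_var = mean([se ** 2 for se in standard_errors])
    return (var_theta - mean_error_var) / var_theta

# Hypothetical results for three examinees.
rho = marginal_reliability([-1.0, 0.0, 1.0], [0.3, 0.3, 0.3])
```

Because each examinee's standard error comes from the test information at that examinee's theta, the coefficient summarizes precision across the whole scale rather than relying on inter-item correlations.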

Simulation is your friend. How do I know it’s a good test before giving it to zillions of students? Using the actual pool, test structure and algorithm, simulate student responses at interesting levels of theta. Compare the test’s estimated thetas with true thetas. Bias: average difference. Fit: root mean squared error.
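
The two summary statistics named above can be computed as in this minimal sketch. The function name and the sample values are made up for illustration; in practice the estimated thetas would come from running the real pool, test structure and algorithm on simulated responses.

```python
import math

def bias_and_rmse(true_thetas, estimated_thetas):
    """Bias (average difference) and root mean squared error.

    Differences are taken as estimate minus true value, so a
    positive bias means the test overestimates ability.
    """
    diffs = [est - true for true, est in zip(true_thetas, estimated_thetas)]
    bias = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rmse

# Hypothetical simulated examinees at three true ability levels.
bias, rmse = bias_and_rmse([0.0, 1.0, 2.0], [0.1, 0.9, 2.3])
```

Computing both statistics separately at each "interesting" level of theta shows whether the test is weaker in particular regions of the scale, such as the extremes where the pool is thin.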

CAT depends on a calibrated bank. When items are used operationally, responses are gathered from the examinees for whom the item has the highest information (i.e., ability and difficulty are close). Variance is low, so correlational indicators are not appropriate. P-values are around .5.

Evaluating item technical quality. Calibration depends on a common-person link to the scale. Expose the item to a representative sample. The trick is to get informative responses.

Evaluating item technical quality. In calibration, the process is to find difficulty from responses of examinees with known abilities. Look at a vector of p-values across the range of theta. Evaluate the relationship between observed and expected p-values for your IRT model; you may use chi-square or the correlation of p to expected p. What value of difficulty maximizes this relationship?
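
The search described above can be sketched as a grid search, under several assumptions: a Rasch model, examinees grouped by known theta, and a Pearson-style chi-square as the fit statistic, so the best difficulty is the one that minimizes chi-square. The function names, grid limits, and data are hypothetical.

```python
import math

def p_rasch(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def chi_square(b, groups):
    """Misfit between observed and expected p-values for difficulty b.

    groups: list of (theta, n_examinees, n_correct) tuples.
    """
    total = 0.0
    for theta, n, n_correct in groups:
        expected = p_rasch(theta, b)
        observed = n_correct / n
        total += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return total

def best_difficulty(groups, lo=-4.0, hi=4.0, step=0.01):
    """Candidate difficulty with the smallest chi-square."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + step * i for i in range(n_steps + 1)]
    return min(candidates, key=lambda b: chi_square(b, groups))

# Hypothetical field-test data: 1000 examinees at each of four ability
# levels, with counts generated from an item of difficulty 0.5.
groups = [(t, 1000, round(1000 * p_rasch(t, 0.5))) for t in (-1.0, 0.0, 1.0, 2.0)]
estimated_b = best_difficulty(groups)
```

The vector of observed p-values across theta groups is what carries the information here; a single overall p-value near .5 would say little about where the item sits on the scale.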

Ask lots of questions. Keep pestering until understanding dawns. Thank you for your attention! Questions, comments? Contact: marty.mccall@nwea.org
