Computerized Adaptive Testing: What is it and How Does it Work?

Computerized Adaptive Testing: What is it and How Does it Work? Presented by Matthew Finkelman, Ph.D.


Goals of this session Learn about Computerized Adaptive Testing (CAT) Review Item Response Theory (IRT) Combining CAT with IRT Pros and cons of CAT Answer questions

Not to be confused with… Computerized Adaptive Testing: Not as cute, but far fewer hairballs.

PART I Introduction to CAT

Motivation for Understanding CAT There are already operational assessments that use CAT Some believe it will revolutionize classroom testing in the future Interesting idea that speaks to potential of computers to have new uses in education Item Response Theory is all over testing now

OK, so what is CAT? A type of assessment where a question is displayed on a monitor Students use mouse to select answer Computer chooses next question based on previous responses Next question is displayed on monitor, or else test ends

A graphical representation Questions chosen depend on prior responses

Analogy: A Game of 20 Questions I am thinking of an object. You have 20 “yes-or-no” questions to figure it out. Would you write out all your questions ahead of time? 1) Is it an animal? 2) Is it a vegetable? 3) Is it blue? 4) Is it red? 5) Is it bigger than a car? 6) Etc.

20 Questions, Continued Isn’t it more effective to base your next question on previous answers? 1) Is it an animal? NO. 2) Is it a vegetable? YES. 3) Is it commonly found in a salad? YES. 4) Is it green? NO. 5) Would Bugs Bunny eat it? YES.

Same principle used in CAT Computer keeps track of each student’s pattern of responses so far As test progresses, learn more about individual student Choose next question (item) to get maximal info about that particular student’s level of ability Purpose of assessment: Get best possible information about students

Some items are more informative than others? Sure! Some items are easier than others: [easy item] vs. [hard item] (examples not recovered). Some items are more relevant than others: [on-topic item, not recovered] vs. Academy Awards question. Some items are better at discerning proficient students from those who need improvement.

Which is most informative? Suppose we have only 2 types of students: “Advanced” and “Beginning”. Use the test to classify each student. Which item below is the best for this purpose?

Item   P(Correct|Advanced)   P(Correct|Beginning)
1      52%                   52%
2      75%                   34%
3      100%                  0%

Item 3 is the best. Item 1 is completely useless. Item 2 gives some information. Item 3 is all you need!

Item   P(Correct|Advanced)   P(Correct|Beginning)
1      52%                   52%
2      75%                   34%
3      100%                  0%
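One way to make "informative" concrete is Bayes' rule: start from a prior belief that a student is Advanced and see how far a single response moves it. The sketch below is only an illustration. It assumes a 50/50 prior, takes Item 1 as 52% for both groups (the probabilities must be equal for the item to be useless), and the function name `posterior_advanced` is made up for this example.

```python
def posterior_advanced(p_correct_adv, p_correct_beg, correct, prior_adv=0.5):
    """Return P(Advanced | one observed response) using Bayes' rule."""
    like_adv = p_correct_adv if correct else 1.0 - p_correct_adv
    like_beg = p_correct_beg if correct else 1.0 - p_correct_beg
    numer = like_adv * prior_adv
    denom = numer + like_beg * (1.0 - prior_adv)
    return numer / denom

# Item number -> (P(Correct|Advanced), P(Correct|Beginning)), from the table.
items = {1: (0.52, 0.52), 2: (0.75, 0.34), 3: (1.00, 0.00)}

for item, (p_adv, p_beg) in items.items():
    if_correct = posterior_advanced(p_adv, p_beg, correct=True)
    if_incorrect = posterior_advanced(p_adv, p_beg, correct=False)
    print(f"Item {item}: P(Advanced) = {if_correct:.2f} if correct, "
          f"{if_incorrect:.2f} if incorrect")
```

Item 1 leaves the belief at 0.50 whatever the answer, Item 2 moves it part of the way, and Item 3 settles the question in one response.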

But wait… Wouldn’t we choose Item 3 for ALL students? If so, why customize a test for an individual student? Answer: For some students, Item A is more informative. For others, Item B is more informative.

When is one item more informative than another? Item A: [an easy item; example not recovered]. Item B: [a hard item; example not recovered]. If you’ve answered many difficult items correctly, Item A is a waste of time. If you’ve answered many easy items incorrectly, Item B is too hard. Thus, give Item B to high-performing students, Item A to low-performing students.

Isn’t that unfair? It seems like CAT penalizes students for performing well at start If we give different items to different students, how can we compare their performances? The above question arises whether we use CAT or not Item Response Theory to the rescue!

Summary of Part I CAT customizes assessment based on previous responses, as in 20 Questions. Certain items are more informative than others. For some students, Item A is more informative; for others, Item B is. When we give different items to different students, we need a way to relate student performances (Item Response Theory).

PART II Review of Item Response Theory

Item Response Theory (IRT) Quantifies the relation between examinees and test items For each item, gives probability of correct response by ability level Provides a means for describing characteristics of items, estimating ability of examinees Places examinees on common scale when they have taken different items

The IRT Model: One item

Different items have different curves

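Where did those curves come from? In IRT, ability is denoted by θ. Probability of a correct response is $P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}$ (the three-parameter logistic form; some presentations include a scaling constant of 1.7 in the exponent). Each item has its own values of a, b, and c. We know them from field testing. a is the “discrimination”: related to the slope. b is the “difficulty”: harder item, higher b. c is the “guessing parameter”: chance of a lucky guess.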
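The item curve can be sketched in a few lines of code. This assumes the standard three-parameter logistic (3PL) form, P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))), without the 1.7 scaling constant that some texts add. The parameter values below are invented for illustration and are not from any field test.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response at ability theta.

    a: discrimination (slope), b: difficulty, c: guessing parameter.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderately discriminating, slightly hard, 4-option guessing.
a, b, c = 1.2, 0.5, 0.25

for theta in (-3.0, -1.0, 0.5, 2.0, 4.0):
    print(f"theta = {theta:+.1f}  P(correct) = {p_correct(theta, a, b, c):.3f}")
```

At θ = b the probability sits halfway between c and 1, far below b it approaches c, and far above b it approaches 1.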

Effect of the a parameter All curves shown have equal b and c parameters Larger a increases the slope in the middle

Effect of the b parameter All curves shown have equal a and c parameters Larger b means harder item

Effect of the c parameter All curves shown have equal a and b parameters c is the left asymptote

Wait a minute What do you mean by a student with an ability of 1.0? Does an ability of 0.0 mean that a student has NO ability? What if my student has a reading ability of -1.2? What in the world does that mean???

The ability scale Ability is on an arbitrary scale that just happens to be centered around 0.0 We use arbitrary scales all the time: –Fahrenheit –Celsius –Decibels Nevertheless, need more “user-friendly” reporting: “scaled” scores on conventional scale like

Giving a score for each student First assign an ability (θ) value to each student (say, -4 to 4) Student is given the value of θ that is most consistent with his/her responses The better he/she does on the test, the higher the value of θ that he/she receives Computer converts the θ score to a scaled score Report final score!

Assigning scores Set of answers: (C,C,I,C,C,I,I,C,C,C,I,C,C) We know which items were taken by each student: a, b, c parameters If Student 1’s items were harder than Student 2’s, take into account through item parameters Student 1: θ = 1.25, scaled score = 290 Student 2: θ = 0.65, scaled score = 268 Can compare students who took different items!!!
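To illustrate how the same answer pattern can lead to different θ values, here is a rough sketch of maximum-likelihood scoring by grid search. It assumes the 3PL logistic form without a scaling constant. The item parameters are invented, and the grid search is a simple stand-in for whatever estimation routine an operational program actually uses. The conversion from θ to a scaled score is not shown because the transcript does not give it.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response (assumed logistic form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a response pattern given each item's (a, b, c)."""
    total = 0.0
    for correct, (a, b, c) in zip(responses, items):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total

def estimate_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Return the grid value of theta most consistent with the responses."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# The answer pattern from the slide: C = correct, I = incorrect.
responses = [ch == "C" for ch in "CCICCIICCCICC"]

# Hypothetical difficulties; a = 1.0 and c = 0.2 for every item.
easy_bs = [-1.5, -1.0, 1.0, -0.5, 0.0, 1.5, 2.0, 0.0, 0.5, -0.5, 2.5, 0.5, 1.0]
easy_items = [(1.0, b, 0.2) for b in easy_bs]
hard_items = [(1.0, b + 0.6, 0.2) for b in easy_bs]  # each item 0.6 harder

theta_easy = estimate_theta(responses, easy_items)
theta_hard = estimate_theta(responses, hard_items)
print(f"easier items: theta = {theta_easy:.2f}")
print(f"harder items: theta = {theta_hard:.2f}")
```

Because the second set of items is uniformly 0.6 harder, the identical answer pattern earns a θ estimate about 0.6 higher. This is how the item parameters "take into account" which items each student saw.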

Summary of Part II If you didn’t get all that, don’t worry Just remember: –In IRT, different items have different curves (depending on a, b, c parameters) –IRT allows us to give scores on the same scale, even when students take different items These features critical in CAT So how do we choose which items to give?

PART III Combining CAT with IRT

CAT Reminder CAT customizes assessment based on previous responses For some students, Item A is more informative; for others, Item B is With IRT, it’s OK to give different items to different students

Which item would you choose next? PREVIOUS RESPONSES: three items (not recovered); the first was answered correctly, the second and third incorrectly. POSSIBLE ITEMS TO GIVE NEXT: three candidate items (not recovered).

Item selection to match ability/difficulty Want to give items appropriate to ability: an easy item is not informative for high-performing students; a hard item is not informative for low-performing students. Student has taken 10 items, awaits 11th. Classic approach: Give item whose difficulty (b) is closest to current ability estimate (θ).
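The classic matching rule is short enough to sketch directly. The function name, the difficulty values, and the bookkeeping of already-administered items are all assumptions made for illustration.

```python
def closest_difficulty_item(theta_hat, difficulties, administered):
    """Return the index of the unused item whose b is nearest to theta_hat."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta_hat))

# Hypothetical pool of b parameters; item 2 has already been given.
difficulties = [-2.5, -1.0, 0.0, 1.5]
administered = {2}

next_item = closest_difficulty_item(-1.2, difficulties, administered)
print(f"Next item index: {next_item} (b = {difficulties[next_item]})")
```

For a current estimate of θ = -1.2, the rule picks the item with b = -1.0, since no other unused item is closer in difficulty.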

Which item is better for θ = -1.2? Easier item Harder item

More complex item selection Previous method: Match difficulty to ability This criterion only uses b parameter and θ Recall that a parameter is related to slope, c is guessing parameter Shouldn’t we consider those when choosing next item?

Another item selection method Ideal item: High value of a; value of b close to θ; low value of c “Fisher Information” combines these factors into a single number Choose item with highest Fisher Info
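For reference, the "single number" can be written out. Assuming the three-parameter logistic form for the item curve (without a 1.7 scaling constant), the Fisher information of item $i$ at ability $\theta$ is:

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
\qquad
I_i(\theta) = \frac{\left[P_i'(\theta)\right]^2}{P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)}
            = a_i^2 \,\frac{1 - P_i(\theta)}{P_i(\theta)}
              \left(\frac{P_i(\theta) - c_i}{1 - c_i}\right)^{2}
```

The formula reflects all three properties of the ideal item. Information grows with $a_i^2$, it shrinks as $c_i$ rises, and it peaks when $b_i$ is near $\theta$ (exactly at $\theta = b_i$ when $c_i = 0$, slightly above $b_i$ otherwise).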

Game: Which item would you choose? Suppose our current estimate of θ is 0.6. [Table of item parameters (columns: Item, a, b, c); values not recovered]

Results If matching ability estimate (0.6) with difficulty, we would give Item 2. If using Fisher Info, we would give Item 2. [Table of item parameters (columns: Item, a, b, c); values not recovered]

Round 2 Suppose our current estimate of θ is 0.7. [Table of item parameters (columns: Item, a, b, c); values not recovered]

Round 2 Results If matching ability estimate (0.7) with difficulty, we would give Item 2. If using Fisher Info, we would give Item 1. [Table of item parameters (columns: Item, a, b, c); values not recovered]
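The two selection rules can disagree, and a small sketch shows how. The item parameters below are invented for illustration (the values from the slide's table are not available), and the code assumes the 3PL logistic form without a scaling constant.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response (assumed logistic form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Item information for the 3PL model at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical pool: item number -> (a, b, c). Not the slide's values.
pool = {1: (2.0, 0.4, 0.1), 2: (0.8, 0.7, 0.2), 3: (1.0, -1.0, 0.2)}
theta_hat = 0.7

# Rule 1: difficulty closest to the current ability estimate.
by_difficulty = min(pool, key=lambda i: abs(pool[i][1] - theta_hat))
# Rule 2: largest Fisher information at the current ability estimate.
by_info = max(pool, key=lambda i: fisher_info(theta_hat, *pool[i]))

for i, params in pool.items():
    print(f"Item {i}: info at theta = 0.7 is {fisher_info(theta_hat, *params):.3f}")
print(f"Difficulty matching picks Item {by_difficulty}")
print(f"Fisher information picks Item {by_info}")
```

With these made-up parameters, difficulty matching picks Item 2 because its b equals the estimate. Fisher information picks Item 1, whose higher discrimination and lower guessing parameter outweigh its slightly mismatched difficulty.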

Summary of Part III Tailor items to be most informative about individual student’s ability Do this by combining CAT with IRT One method: Match difficulty with current estimate of θ Another method: Take all parameters into account via Fisher Info

PART IV Practical Considerations

Problem: Content Balance In operational testing, must balance content (e.g., math test of algebra, geometry, number sense) What if all your most informative items come from the same content strand? In practice, dozens of constraints for each CAT: Content, topics, enemies list, etc. CAT solution: Pick most informative item among those “in play”
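A minimal sketch of the "in play" idea follows, assuming each candidate item carries a content strand and an information value already computed at the current θ. The item ids, strand quotas, and the blocked set (standing in for an enemies list) are hypothetical.

```python
def pick_in_play(candidates, strand_counts, strand_quota, blocked_ids=frozenset()):
    """Return the id of the most informative candidate still in play, or None.

    candidates: iterable of (item_id, strand, info) tuples.
    strand_counts: items already given per strand.
    strand_quota: maximum items allowed per strand.
    blocked_ids: items ruled out, e.g. enemies of items already given.
    """
    in_play = [
        (info, item_id)
        for item_id, strand, info in candidates
        if strand_counts.get(strand, 0) < strand_quota.get(strand, 0)
        and item_id not in blocked_ids
    ]
    if not in_play:
        return None
    return max(in_play)[1]

candidates = [
    ("alg_07", "algebra", 0.81),
    ("alg_12", "algebra", 0.77),
    ("geo_03", "geometry", 0.52),
    ("num_09", "number sense", 0.40),
]
strand_counts = {"algebra": 4, "geometry": 2}
strand_quota = {"algebra": 4, "geometry": 4, "number sense": 4}

print(pick_in_play(candidates, strand_counts, strand_quota))
print(pick_in_play(candidates, strand_counts, strand_quota, blocked_ids={"geo_03"}))
```

The algebra items are the most informative but the algebra quota is already full, so the first call returns the geometry item. Blocking that item as an enemy leaves the number sense item as the best remaining choice.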

Problem: Test security CAT administered on multiple occasions Person A takes exam, memorizes items, tells Person B. Person B takes exam, benefits from Person A’s information Different students, different items; however, some items more popular than others CAT solution: Limit the amount each item can be administered
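One simple way to picture an exposure limit is a hard cap: an item drops out of consideration once it has appeared on too large a share of past tests. The sketch below is an assumption-laden illustration and not a description of any operational method. The 25% cap, the counts, and the function name are all invented.

```python
def select_with_exposure_cap(info_by_item, times_given, tests_given, max_rate=0.25):
    """Pick the most informative item whose exposure rate is under the cap."""
    eligible = [
        item
        for item in info_by_item
        if tests_given == 0 or times_given.get(item, 0) / tests_given < max_rate
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda item: info_by_item[item])

# Hypothetical information values at the current theta, and past usage counts.
info_by_item = {"A": 0.9, "B": 0.6, "C": 0.3}
times_given = {"A": 30, "B": 10, "C": 5}

print(select_with_exposure_cap(info_by_item, times_given, tests_given=100))
```

Item A is the most informative, but it has already appeared on 30% of tests, so the capped selection falls back to Item B.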

CAT “Pros” Convenient administration Immediate scoring Items maximally informative: Exams just as accurate, with shorter tests Items at correct level: High-performing students not bored, low-performing students not overwhelmed

CAT “Cons” Limited by technology Potential bias versus students with less computer experience Content balance less exact than paper-and-pencil testing Test security Expensive

Final summary Introduction to CAT: Benefits of giving different items to different students Review of IRT Using IRT to select items in a CAT Pros and cons of CAT