Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 January 23, 2012.

Similar presentations
Implications and Extensions of Rasch Measurement.

Continued Psy 524 Ainsworth
The effect of differential item functioning in anchor items on population invariance of equating Anne Corinne Huggins University of Florida.
Statistical Analysis SC504/HS927 Spring Term 2008
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 January 30, 2012.
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 March 12, 2012.
Item Response Theory in a Multi-level Framework Saralyn Miller Meg Oliphint EDU 7309.
Brief introduction on Logistic Regression
LOGO One of the easiest to use Software: Winsteps
How Should We Assess the Fit of Rasch-Type Models? Approximating the Power of Goodness-of-fit Statistics in Categorical Data Analysis Alberto Maydeu-Olivares.
Educational Data Mining Overview Ryan S.J.d. Baker PSLC Summer School 2012.
Item Response Theory in Health Measurement
AN OVERVIEW OF THE FAMILY OF RASCH MODELS Elena Kardanova
FTP Biostatistics II Model parameter estimations: Confronting models with measurements.
Knowledge Inference: Advanced BKT Week 4 Video 5.
Hidden Markov Models Theory By Johan Walters (SR 2003)
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Marco Del Negro, Frank Schorfheide, Frank Smets, and Raf Wouters (DSSW) On the Fit of New-Keynesian Models Discussion by: Lawrence Christiano.
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
© UCLES 2013 Assessing the Fit of IRT Models in Language Testing Muhammad Naveed Khalid Ardeshir Geranpayeh.
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
Introduction to plausible values National Research Coordinators Meeting Madrid, February 2010.
12 Stats Statistics is a body of methods for making wise decisions in the face of uncertainty. W. Allen Wallis.
EM and expected complete log-likelihood Mixture of Experts
Estimation of Statistical Parameters
The ABC’s of Pattern Scoring Dr. Cornelia Orr. Slide 2 Vocabulary Measurement – Psychometrics is a type of measurement Classical test theory Item Response.
Measuring Mathematical Knowledge for Teaching: Measurement and Modeling Issues in Constructing and Using Teacher Assessments DeAnn Huinker, Daniel A. Sass,
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 April 2, 2012.
Intelligent Systems Laboratory (iLab), Southern Taiwan, Information Engineering: Evaluation for the Test Quality of Dynamic Question Generation by Particle Swarm Optimization for Adaptive Testing.
Class 4 Simple Linear Regression. Regression Analysis Reality is thought to behave in a manner which may be simulated (predicted) to an acceptable degree.
Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 January 28, 2013.
Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 February 4, 2013.
Grading and Analysis Report For Clinical Portfolio 1.
1 Item Analysis - Outline 1. Types of test items A. Selected response items B. Constructed response items 2. Parts of test items 3. Guidelines for writing.
What is the HSPA???. HSPA - Overview The HSPA is the High School Proficiency Assessment that is given to juniors in New Jersey’s public schools. States.
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
The ABC’s of Pattern Scoring
Item Factor Analysis Item Response Theory Beaujean Chapter 6.
Dynamic Programming. A Simple Example Capital as a State Variable.
Statistics What is the probability that 7 heads will be observed in 10 tosses of a fair coin? This is a ________ problem. Have probabilities on a fundamental.
Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 13, 2013.
Item Parameter Estimation: Does WinBUGS Do Better Than BILOG-MG?
FIT ANALYSIS IN RASCH MODEL University of Ostrava Czech republic 26-31, March, 2012.
Core Methods in Educational Data Mining HUDK4050 Fall 2015.
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot Watt University 12th February 2003.
Item Response Theory Dan Mungas, Ph.D. Department of Neurology University of California, Davis.
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 January 25, 2012.
Core Methods in Educational Data Mining HUDK4050 Fall 2014.
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 February 6, 2012.
Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Włodzisław Duch Dept. of Informatics, UMK.
IRT Equating Kolen & Brennan, 2004 & 2014 EPSY
Core Methods in Educational Data Mining
Michael V. Yudelson Carnegie Mellon University
ARDHIAN SUSENO CHOIRUL RISA PRADANA P.
Item Analysis: Classical and Beyond
Special Topics in Educational Data Mining
Addressing the Assessing Challenge with the ASSISTment System
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Presentation transcript:

Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 January 23, 2012

Today’s Class Item Response Theory

What is the key goal of IRT?

Measuring how much of some latent trait a person has. How intelligent is Bob? How much does Bob know about snorkeling? – SnorkelTutor

What is the typical use of IRT?

Assess a student's knowledge of topic X, based on a sequence of items that are dichotomously scored – e.g., the student can get a score of 0 or 1 on each item.

Scoring: Not a simple average of the 0s and 1s – that's an approach used for simple tests, but it's not IRT. Instead, a function is computed based on the difficulty and discriminability of the individual items.

Key assumptions: There is only one latent trait or skill being measured per set of items – there are other models that allow for multiple skills per item; we'll talk about them later in the semester. Each learner has ability θ. Each item has difficulty b and discriminability a. From these parameters, we can compute the probability P(θ) that the learner will get the item correct.

Note: The assumption that all items tap the same latent construct, but have different difficulties, is a very different assumption than is seen in other approaches such as BKT (which we'll talk about later). Why might this be a good assumption? Why might this be a bad assumption?

Item Characteristic Curve Can anyone walk the class through what this graph means?

Item Characteristic Curve If Iphigenia is an Idiot, but Joelma is a Jenius, where would they fall on this curve?

Which parameter do these three graphs differ in terms of?

Which of these three graphs represents a difficult item? Which represents an easy item?

For a genius, what is the probability of success on the hard item? For an idiot, what is the probability of success on the easy item? What are the implications of this?

Which parameter do these three graphs differ in terms of?

Which of these three items has low discriminability? Which has high discriminability? Which of these items would be useful on a test?

What would a graph with extremely low discriminability look like? Can anyone draw it on the board? Would this be useful on a test?

What would a graph with extremely high discriminability look like? Can anyone draw it on the board? Would this be useful on a test?

Mathematical formulation: The logistic function.
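The logistic function maps any real number into (0, 1); in IRT, ability θ (shifted and scaled by the item parameters) plays the role of x:

$$f(x) = \frac{1}{1 + e^{-x}}$$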

The Rasch (1PL) model: The simplest IRT model, and very popular. There is an entire special interest group of AERA devoted solely to the Rasch model (RaschSIG).

The Rasch (1PL) model: No discriminability parameter; only parameters for student ability and item difficulty.

The Rasch (1PL) model: Each learner has ability θ. Each item has difficulty b.
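In standard notation, this gives the Rasch item characteristic curve:

$$P(\theta) = \frac{1}{1 + e^{-(\theta - b)}}$$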

The Rasch (1PL) model Let’s enter this into Excel, and create the item characteristic curve

The Rasch (1PL) model: Let's try the following values: θ = 0, b = 0? θ = 3, b = 0? θ = -3, b = 0? θ = 0, b = 3? θ = 0, b = -3? θ = 3, b = 3? θ = -3, b = -3? What do each of these param sets mean? What is P(θ)?
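A minimal Python sketch of the same computation (standing in for the Excel worksheet used in class; the parameter sets are the ones listed above):

```python
import math

def rasch_p(theta, b):
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Parameter sets from the slide: (theta, b)
for theta, b in [(0, 0), (3, 0), (-3, 0), (0, 3), (0, -3), (3, 3), (-3, -3)]:
    print(f"theta={theta:+d}, b={b:+d} -> P(theta)={rasch_p(theta, b):.3f}")
```

Notice that only the difference θ – b matters: (3, 3) and (-3, -3) come out the same as (0, 0).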

The 2PL model: Another simple IRT model, very popular. Discriminability parameter a added.

Rasch vs. 2PL
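The side-by-side comparison, in standard notation; the 2PL simply multiplies the exponent by the discriminability a:

$$\text{Rasch: } P(\theta) = \frac{1}{1 + e^{-(\theta - b)}} \qquad \text{2PL: } P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$$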

The 2PL model: Another simple IRT model, very popular. Discriminability parameter a added. Let's enter it into Excel, and create the item characteristic curve.

The 2PL model: What do these param sets mean? What is P(θ)? θ = 0, b = 0, a = 0; θ = 3, b = 0, a = 0; θ = 0, b = 3, a = 0.

The 2PL model: What do these param sets mean? What is P(θ)? θ = 0, b = 0, a = 1; θ = 0, b = 0, a = -1; θ = 3, b = 0, a = 1; θ = 3, b = 0, a = -1; θ = 0, b = 3, a = 1; θ = 0, b = -3, a = -1.

The 2PL model: What do these param sets mean? What is P(θ)? θ = 3, b = 0, a = 1; θ = 3, b = 0, a = 2; θ = 3, b = 0, a = 10; θ = 3, b = 0, a = 0.5; θ = 3, b = 0, a = 0.25; θ = 3, b = 0, a = 0.01.
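A small sketch, mirroring the Excel exercise, that evaluates this last set of values:

```python
import math

def pl2_p(theta, b, a):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# theta = 3, b = 0, varying discriminability a (the slide's values)
for a in [1, 2, 10, 0.5, 0.25, 0.01]:
    print(f"a = {a:>5} -> P(theta) = {pl2_p(3, 0, a):.3f}")
```

As a shrinks toward 0, P(θ) is squeezed toward 0.5 regardless of ability; as a grows, the curve approaches a step function at θ = b.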

Model Degeneracy: Where a model works perfectly well computationally, but makes no sense/does not match the intuitive understanding of parameter meanings. What parts of the 2PL parameter space are degenerate? What does the ICC look like?

The 3PL model: A more complex model. Adds a guessing parameter c.

The 3PL model
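In standard notation, adding the guessing floor c (this is the function the next question asks about):

$$P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}$$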

What is the meaning of the c and (1-c) parts of the function?

The 3PL model: A more complex model. Adds a guessing parameter c. Let's enter it into Excel, and create the item characteristic curve.

The 3PL model: What do these param sets mean? What is P(θ)? θ = 0, b = 0, a = 1, c = 0; θ = 0, b = 0, a = 1, c = 1; θ = 0, b = 0, a = 1, c = 0.35.

The 3PL model: What do these param sets mean? What is P(θ)? θ = 0, b = 0, a = 1, c = 1; θ = -5, b = 0, a = 1, c = 1; θ = 5, b = 0, a = 1, c = 1.

The 3PL model: What do these param sets mean? What is P(θ)? θ = 1, b = 0, a = 0, c = 0.5; θ = 1, b = 0, a = 0.5, c = 0.5; θ = 1, b = 0, a = 1, c = 0.5.

The 3PL model: What do these param sets mean? What is P(θ)? θ = 1, b = 0, a = 1, c = 0.5; θ = 1, b = 0.5, a = 1, c = 0.5; θ = 1, b = 1, a = 1, c = 0.5.

The 3PL model: What do these param sets mean? What is P(θ)? θ = 0, b = 0, a = 1, c = 2; θ = 0, b = 0, a = 1, c = -1.

Model Degeneracy: Where a model works perfectly well computationally, but makes no sense/does not match the intuitive understanding of parameter meanings. What parts of the 3PL parameter space are degenerate? What does the ICC look like?

Fitting an IRT model: Typically done with Maximum Likelihood Estimation (MLE) – which parameters make the data most likely. We'll do it here with Maximum A Posteriori estimation (MAP) – which parameters are most likely given the data.
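A compact way to state the two definitions above (λ here stands for the set of model parameters; the notation is ours, not the slides'):

$$\hat{\lambda}_{\mathrm{MLE}} = \arg\max_{\lambda} \, P(\mathrm{data} \mid \lambda) \qquad \hat{\lambda}_{\mathrm{MAP}} = \arg\max_{\lambda} \, P(\lambda \mid \mathrm{data})$$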

The difference Mostly a matter of religious preference – In many models (though not IRT) they are the same thing – MAP is usually easier to calculate – Statisticians frequently prefer MLE – Data Miners sometimes prefer MAP – In this case, we use MAP solely because it’s easier to do in real-time

Let's fit IRT parameters to this data: irt-modelfit-set1-v1.xlsx. Let's start with a Rasch model.

Let's fit IRT parameters to this data. We'll use SSR (sum of squared residuals) as our goodness criterion. – Lower SSR = less disagreement between data and model = better model. – This is a standard goodness criterion within statistical modeling. – Why SSR rather than just the sum of residuals? – What are some other options?
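Written out, with y_si the 0/1 response of student s on item i and P_i(θ_s) the model's predicted probability:

$$\mathrm{SSR} = \sum_{s}\sum_{i}\bigl(y_{si} - P_i(\theta_s)\bigr)^2$$

Squaring keeps positive and negative residuals from canceling out, which is one standard answer to the question above.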

Let's fit IRT parameters to this data. Fit by hand. Fit using the Excel Equation Solver. Other options: – Iterative Gradient Descent – Grid Search – Expectation Maximization. (A small sketch of the grid-search option follows below.)
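A minimal, hypothetical sketch of the grid-search option for a Rasch model; the response matrix here is made up (the class dataset irt-modelfit-set1-v1.xlsx is not reproduced), and the alternating scheme is one simple way to keep the search tractable:

```python
import math

def rasch_p(theta, b):
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical dichotomous responses: rows = students, columns = items.
data = [[1, 1, 0],
        [1, 0, 0],
        [0, 1, 1],
        [1, 1, 1]]

grid = [x / 4 for x in range(-16, 17)]  # candidate values -4.00 .. 4.00
thetas = [0.0] * len(data)              # one ability estimate per student
bs = [0.0] * len(data[0])               # one difficulty estimate per item

def ssr():
    """Sum of squared residuals over all student-item observations."""
    return sum((y - rasch_p(t, b)) ** 2
               for row, t in zip(data, thetas)
               for y, b in zip(row, bs))

# Alternating grid search: re-optimize each parameter in turn, then repeat.
for _ in range(20):
    for s, row in enumerate(data):
        thetas[s] = min(grid, key=lambda t: sum(
            (y - rasch_p(t, b)) ** 2 for y, b in zip(row, bs)))
    for i in range(len(bs)):
        bs[i] = min(grid, key=lambda b: sum(
            (row[i] - rasch_p(t, b)) ** 2 for row, t in zip(data, thetas)))

print("thetas:", thetas)
print("bs:", bs)
print("SSR:", round(ssr(), 3))
```

Gradient descent or the Excel Solver would optimize the same SSR objective; grid search is just the easiest to inspect by hand.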

Items and students Who are the best and worst students? Which items are the easiest and hardest?

2PL: Now let's fit a 2PL model. Are the parameters similar? How much difference do the items have in terms of discriminability?

2PL Now let’s fit a 2PL model Is the model better? (how much?)

2PL: Now let's fit a 2PL model. Is the model better? (how much?) – It's worth noting that I generated this simulated data using a Rasch-like model. – What are the implications of this result?

Reminder: IRT models are typically fit using the (more complex) Expectation Maximization algorithm rather than in the fashion used here. We'll talk more about fit algorithms in a future class.

Standard Error in Estimation of Student Knowledge: The uncertainty of the ability estimate depends on the quantities P(θ) and (1 – P(θ)) for each item; in the standard IRT formulation,

$$SE(\hat{\theta}) = \frac{1}{\sqrt{\sum_{i} a_i^2 \, P_i(\hat{\theta}) \bigl(1 - P_i(\hat{\theta})\bigr)}}$$

(for the Rasch model, every a_i = 1).

Standard Error in Estimation of Student Knowledge: 1.96 standard errors in each direction = a 95% confidence interval; for instance, an estimate of θ = 1.0 with SE = 0.4 gives a 95% confidence interval of roughly [0.22, 1.78]. Standard error bars are typically 1 standard error. – If you compare two different values, each of which has 1-standard-error bars – then if the bars do not overlap, the values are significantly different. This glosses over some details, but is basically correct.

Standard Error in Estimation of Student Knowledge: Let's estimate the standard error of some of our student estimates in the data set. Are there any students for whom the estimates are not trustworthy?

Final Thoughts: IRT is the classic approach to assessing knowledge through tests. Extensions are used heavily in Computer-Adaptive Tests. Not frequently used in Intelligent Tutoring Systems – where models that treat learning as dynamic are preferred; more on this next class.

IRT Questions? Comments?

Next Class: Wednesday, January 25, 3pm-5pm, AK232. Performance Factors Analysis. Readings: Pavlik, P.I., Cen, H., Koedinger, K.R. (2009) Performance Factors Analysis – A New Alternative to Knowledge Tracing. Proceedings of AIED2009. Pavlik, P.I., Cen, H., Koedinger, K.R. (2009) Learning Factors Transfer Analysis: Using Learning Curve Analysis to Automatically Generate Domain Models. Proceedings of the 2nd International Conference on Educational Data Mining.

The End