1
Latent Factor Models
Geoff Gordon
Joint work with Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy
2
Motivation
A key component of a cognitive tutor: the student cognitive model
- Tracks what skills the student currently knows (latent factors)
[Figure: graphical model with skill nodes circle-area, rectangle-area, decompose-area and an observed right-answer node]
3
Motivation
Student models are a key bottleneck in cognitive tutor authoring and performance:
- Rough estimate: 20-80 hours to hand-code a model for 1 hour of content
- The result may be too simple, and is not rigorously verified
But there are demonstrated improvements in learning from better models:
- E.g., Cen et al. [2007]: 12% less time to learn 6 geometry units (with the same retention) using a tutor with a more accurate model
This talk: automatic discovery of new models, and data-driven revision of existing models, via (latent) factor analysis
4
DataShop
Subject area        Transactions
Math (total)        16.3 M
  Algebra           11.2 M
  Geometry           5.1 M
Language (total)     2.5 M
  French             0.5 M
  English            0.2 M
  Chinese            1.8 M
Science (total)      3.3 M
  Chemistry          1.2 M
  Physics            2.1 M
Other (total)        3.2 M
Total               25.3 M
Representing ~112K total hours across ~15K students
5
Score: student i, item j
Simple case: a single snapshot, no side information

           ITEMS
           1  2  3  4  5  6  ...
STUDENTS
   A       1  1  0  0  1  0  ...
   B       0  1  1  0  0  0  ...
   C       1  1  0  1  1  0  ...
   D       1  0  0  1  1  0  ...
   ...
6
Missing data

           ITEMS
           1  2  3  4  5  6  ...
STUDENTS
   A       1  ?  ?  ?  1  0  ...
   B       0  ?  1  0  ?  ?  ...
   C       1  1  ?  ?  ?  0  ...
   D       1  0  0  1  ?  ?  ...
   ...
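For concreteness, a tiny NumPy sketch of how a partially observed score matrix like the one above might be stored, with NaN marking missing responses and a boolean mask recording what was observed. The values simply mirror the toy matrix; nothing here comes from DataShop.

```python
import numpy as np

# Hypothetical student-by-item score matrix; np.nan marks unobserved cells.
X = np.array([
    [1, np.nan, np.nan, np.nan, 1, 0],   # student A
    [0, np.nan, 1, 0, np.nan, np.nan],   # student B
    [1, 1, np.nan, np.nan, np.nan, 0],   # student C
    [1, 0, 0, 1, np.nan, np.nan],        # student D
])

observed = ~np.isnan(X)          # True where a score was actually recorded
print(observed.sum(), "observed cells out of", X.size)
```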
7
Data matrix X
[Figure: matrix X with one column x_1, x_2, ..., x_n per student; rows indexed by items]
8
Simple case: model
[Graphical model: U -> X <- V]
- U: student latent factors (unobserved)
- V: item latent factors (unobserved)
- X: observed performance
- n students, m items, k latent factors
9
Linear-Gaussian version
[Same graphical model: student factor U -> X <- item factor V]
- U: Gaussian (0 mean, fixed variance)
- V: Gaussian (0 mean, fixed variance)
- X: Gaussian (fixed variance, mean given by the student and item factors, U_i . V_j)
- n students, m items, k latent factors
10
Matrix form: Principal Components Analysis
DATA MATRIX X  ~=  COMPRESSED MATRIX U  x  BASIS MATRIX V^T
(columns x_1 ... x_n of X, one per student; weights u_1 ... u_n in U; basis vectors v_1 ... v_k in V)
11
PCA: the picture
12
PCA: matrix form
DATA MATRIX X  ~=  COMPRESSED MATRIX U  x  BASIS MATRIX V^T
The columns of V span the low-rank space
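If the Gaussian model above is taken at face value, such a rank-k factorization can be computed with a truncated SVD. A minimal NumPy sketch, using a random matrix as a stand-in for the real student-by-item data; the shapes and k are arbitrary illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items, k = 50, 30, 3
X = rng.random((n_students, n_items))        # stand-in for the real data matrix

# Center and take a rank-k truncated SVD: X ~= U @ V.T
Xc = X - X.mean(axis=0)
left, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
U = left[:, :k] * sing[:k]                   # student weights (n_students x k)
V = Vt[:k].T                                 # item basis vectors (n_items x k)

print("reconstruction error:", np.linalg.norm(Xc - U @ V.T))
```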
13
Interpretation of factors
- Basis weights u_1 ... u_n: one per student
- Basis vectors v_1 ... v_k: over items
- The basis vectors are candidate "skills" or "knowledge components"
- The weights are students' knowledge levels
14
PCA is a widely successful model
[Face images from Groundhog Day, extracted by the Cambridge face DB project]
15
Data matrix: face images
[Figure: matrix with one column x_1, x_2, ..., x_n per image; rows indexed by pixels]
16
Result of factoring
- Basis weights u_1 ... u_n: one per image
- Basis vectors v_1 ... v_k: over pixels
- The basis vectors are often called "eigenfaces"
17
Eigenfaces
[Image credit: AT&T Labs Cambridge]
18
PCA: the good
- Unsupervised: need no human labels of latent state!
  - No worry about "expert blind spot"
  - Of course, labels are helpful if available
- Post-hoc human interpretation of latents is nice too, e.g., for intervention design
19
PCA: the bad
Linear, Gaussian:
- PCA assumes E(X) is linear in U and V (E(X_ij) = U_i . V_j)
- PCA assumes (X - E(X)) is i.i.d. Gaussian
20
Nonlinearity: conjunctive skills
[Surface plot: P(correct) as a function of skill 1 and skill 2]
21
Nonlinearity: disjunctive skills
[Surface plot: P(correct) as a function of skill 1 and skill 2]
22
Nonlinearity: "other"
[Surface plot: P(correct) as a function of skill 1 and skill 2]
23
Non-Gaussianity
Typical hand-developed skill-by-item matrix:

           ITEMS
           1  2  3  4  5  6  ...
SKILL 1    1  1  0  0  1  1  ...
SKILL 2    0  0  1  1  0  1  ...
24
Result of Gaussian assumption
[Plot: rows of the true and recovered V matrices]
25
Result of Gaussian assumption
[Plot: rows of the true and recovered V matrices]
26
The ugly: MLE only
- PCA yields the maximum-likelihood estimate
- Good, right? Sadly, the usual reasons to want the MLE don't apply here
  - E.g., consistency: the variance and bias of the estimates of U and V do not approach 0 (unless the number of items per student and students per item both grow without bound)
- Result: the MLE is typically far too confident of itself
27
Too certain: example
[Plots: learned coefficients (e.g., a row of U) and the resulting predictions]
28
Result: the "fold-in problem"
- Nonsensical results when trying to apply the learned model to a new student or item
- Similar to the overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples
- Unlike overfitting, the fold-in problem doesn't necessarily go away with more data
29
Summary: 3 problems with PCA
1. Can't handle nonlinearity
2. Can't handle non-Gaussian distributions
3. Uses the MLE only (=> fold-in problem)
Let's look at each problem in turn
30
Nonlinearity
In PCA, we had X_ij ~= U_i . V_j
What if
- X_ij ~= exp(U_i . V_j)
- X_ij ~= sigmoid(U_i . V_j)
- ...
31
Non-Gaussianity
In PCA, we had X_ij ~ Normal(mu), with mu = U_i . V_j
What if
- X_ij ~ Poisson(mu)
- X_ij ~ Binomial(p)
- ...
32
Exponential family review
Exponential family of distributions:
    P(X | theta) = P_0(X) exp(X . theta - G(theta))
- G(theta) is always strictly convex, differentiable on the interior of its domain
- This means G' is strictly monotone (strictly generalized monotone in 2 or more dimensions)
33
Exponential family review
Exponential family PDF:
    P(X | theta) = P_0(X) exp(X . theta - G(theta))
Surprising result: G'(theta) = g(theta) = E(X | theta)
- g and g^-1: the "link function" pair
- theta: the "natural parameter"
- E(X | theta): the "expectation parameter"
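As a concrete instance of the identity G'(theta) = E(X | theta), here is the Bernoulli (coin-flip) case worked out; this is standard textbook algebra added for reference, not a formula from the slides.

```latex
% Bernoulli in exponential-family form, with theta = log-odds
P(X = x \mid \theta) = \exp\!\big(x\theta - G(\theta)\big), \quad x \in \{0,1\}, \quad
G(\theta) = \log\!\big(1 + e^{\theta}\big)

% Differentiating the log-partition function recovers the mean
G'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \sigma(\theta) = P(X = 1 \mid \theta) = E(X \mid \theta)
```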
34
Examples
- Normal (mean):       g = identity
- Poisson (log rate):  g = exp
- Binomial (log odds): g = sigmoid
35
Nonlinear & non-Gaussian
- Let P(X | theta) be an exponential family with natural parameter theta
- Predict X_ij ~ P(X | theta_ij), where theta_ij = U_i . V_j
- E.g., Poisson: E(X_ij) = exp(theta_ij)
- E.g., Binomial: E(X_ij) = sigmoid(theta_ij)
36
Optimization problem
    max over U, V of  sum_ij log P(X_ij | theta_ij) + log P(U) + log P(V)
    subject to  theta_ij = U_i . V_j
"Generalized linear" or "exponential family" PCA:
- All P(...) terms are exponential families
- Analogy to GLMs
[Collins et al., 2001] [Gordon, 2002] [Roy & Gordon, 2005]
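A minimal NumPy sketch of this objective for the Bernoulli/logistic case. The spherical Gaussian priors standing in for log P(U) and log P(V), and the variable names, are my assumptions rather than details from the talk.

```python
import numpy as np

def epca_objective(U, V, X, observed, prior_var=1.0):
    """Penalized log-likelihood: sum_ij log P(X_ij | theta_ij) + log P(U) + log P(V),
    with theta_ij = U_i . V_j and a Bernoulli observation model."""
    theta = U @ V.T
    # Bernoulli log-likelihood x*theta - log(1 + exp(theta)); unobserved cells contribute 0
    loglik = np.where(observed, X * theta - np.logaddexp(0.0, theta), 0.0).sum()
    # Spherical Gaussian priors on the latent factors (up to additive constants)
    logprior = -0.5 * (np.sum(U**2) + np.sum(V**2)) / prior_var
    return loglik + logprior
```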
37
Special cases
- PCA, probabilistic PCA
- Poisson PCA
- k-means clustering
- Max-margin matrix factorization (MMMF)
- Almost: pLSI, pHITS, NMF
38
Comparison to AFM (Additive Factors Model)
    logit(p_ij) = theta_i + sum_k Q_jk beta_k + sum_k Q_jk gamma_k T_ik
- p = probability correct
- theta = student overall performance
- beta = skill difficulty
- Q = item x skill matrix
- gamma = skill practice slope
- T = number of practice opportunities
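Under the AFM form reconstructed above (which should be treated as an assumption, since the slide's original equation did not survive extraction), a prediction looks like this small sketch; the numbers are made up.

```python
import numpy as np

def afm_prob_correct(theta_i, beta, gamma, Q_j, T_i):
    """AFM-style success probability for student i on item j:
    logit(p) = theta_i + sum_k Q_jk * (beta_k + gamma_k * T_ik)."""
    logit_p = theta_i + np.sum(Q_j * (beta + gamma * T_i))
    return 1.0 / (1.0 + np.exp(-logit_p))

# Toy example: one item tapping skills 0 and 2; the student has practiced skill 2 three times.
p = afm_prob_correct(theta_i=0.5,
                     beta=np.array([-1.0, 0.3, -0.2]),
                     gamma=np.array([0.1, 0.2, 0.15]),
                     Q_j=np.array([1, 0, 1]),
                     T_i=np.array([0, 0, 3]))
print(round(p, 3))
```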
39
Theorem
- In GL PCA, finding the U which maximizes likelihood (holding V fixed) is a convex optimization problem
- And, finding the best V (holding U fixed) is a convex problem
- Further, the Hessian is block diagonal
So, an efficient and effective optimization algorithm: alternately improve U and V
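A sketch of that alternating scheme for the Bernoulli case: gradient ascent steps on U with V held fixed, then on V with U held fixed (each subproblem is convex, as the theorem states). The step size, iteration count, and prior variance are illustrative choices, not the talk's algorithm.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def alternating_epca(X, observed, k=3, iters=200, lr=0.05, prior_var=1.0, seed=0):
    """Alternately improve U (students x k) and V (items x k) for Bernoulli EPCA."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    Xz = np.where(observed, X, 0.0)                  # zero out missing cells
    for _ in range(iters):
        resid = observed * (Xz - sigmoid(U @ V.T))   # d loglik / d theta on observed cells
        U += lr * (resid @ V - U / prior_var)        # ascent step on U, V held fixed
        resid = observed * (Xz - sigmoid(U @ V.T))
        V += lr * (resid.T @ U - V / prior_var)      # ascent step on V, U held fixed
    return U, V
```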
40
Example: compressing histograms with Poisson PCA
- Points: observed frequencies in R^3
- Hidden manifold: a 1-parameter family of multinomials
[Figure: probability simplex with vertices A, B, C]
41
Example ITERATION 1
42
Example ITERATION 2
43
Example ITERATION 3
44
Example ITERATION 4
45
Example ITERATION 5
46
Example ITERATION 9
47
Remaining problem: MLE
- Well-known rule of thumb: if the MLE gets you in trouble due to overfitting, move to fully Bayesian inference
- Typical problem: computation
- In our case, the computation is just fine if we're a little clever
- Additional wrinkle: switch to a hierarchical model
48
Bayesian hierarchical exponential-family PCA
[Graphical model: R -> U (student factor) -> X <- V (item factor) <- S]
- U: student latent factors (unobserved)
- V: item latent factors (unobserved)
- X: observed performance
- R: shared prior for student latents
- S: shared prior for item latents
- n students, m items, k latent factors
49
A little clever: MCMC
[Figure: Z, P(X)]
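One way to be "a little clever" is a plain random-walk Metropolis sampler over a single student's latent row u, given V and that student's observed responses, then averaging predictions over posterior samples rather than plugging in a point estimate. This is a generic sketch under my own Bernoulli-likelihood and Gaussian-prior assumptions, not the sampler actually used in the talk.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_post(u, V, x, observed, prior_var=1.0):
    """log P(u | x, V) up to a constant: Bernoulli likelihood + Gaussian prior."""
    theta = V @ u
    loglik = np.where(observed, x * theta - np.logaddexp(0.0, theta), 0.0).sum()
    return loglik - 0.5 * np.dot(u, u) / prior_var

def metropolis_student(V, x, observed, n_samples=2000, step=0.2, seed=0):
    rng = np.random.default_rng(seed)
    u = np.zeros(V.shape[1])
    lp = log_post(u, V, x, observed)
    samples = []
    for _ in range(n_samples):
        u_prop = u + step * rng.standard_normal(u.shape)   # random-walk proposal
        lp_prop = log_post(u_prop, V, x, observed)
        if np.log(rng.random()) < lp_prop - lp:            # accept/reject
            u, lp = u_prop, lp_prop
        samples.append(u.copy())
    samples = np.array(samples)
    # Posterior-mean prediction for every item, averaging over samples
    return sigmoid(samples @ V.T).mean(axis=0)
```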
50
Experimental comparison
- Geometry Area 1996-1997 data
- Geometry tutor: 139 items presented to 59 students
- On average, each student tested on 60 items
51
Results: hold-out error
[Chart of hold-out error; embedding dimension for *EPCA is K = 15]
Credit: Ajit Singh
52
Extensions
- Relational models
- Temporal models
53
Relational models

Students x items:
        1 2 3 4 5 6
john    1 1 0 0 1 0
sue     0 1 1 0 0 0
tom     1 1 0 1 1 0

Tags x items:
        1 2 3 4 5 6
trig    1 1 0 0 1 0
story   0 1 1 0 0 0
hard    1 1 0 1 1 0
54
Relational hierarchical Bayesian exponential-family PCA
[Graphical model: R -> U, S -> V, T -> Z; X depends on U and V; Y depends on V and Z]
- X, Y: observed data
- U: student latent factors
- V: item latent factors
- Z: tag latent factors
- R, S, T: shared priors
- n students, m items, p tags, k latent factors
- X ~= f(U V^T), Y ~= g(V Z^T)
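The relational idea can be written as one joint objective in which the item factors V are shared between the student-item factorization and the item-tag factorization. A rough sketch, with Bernoulli links and fully observed matrices chosen purely for illustration, not the exact model from the talk:

```python
import numpy as np

def relational_objective(U, V, Z, X, Y, prior_var=1.0):
    """Joint log-likelihood for X ~ f(U V^T) and Y ~ g(V Z^T) with shared item factors V."""
    theta_x = U @ V.T                                            # student-item block
    theta_y = V @ Z.T                                            # item-tag block
    ll_x = (X * theta_x - np.logaddexp(0.0, theta_x)).sum()      # Bernoulli log-likelihood
    ll_y = (Y * theta_y - np.logaddexp(0.0, theta_y)).sum()
    logprior = -0.5 * (np.sum(U**2) + np.sum(V**2) + np.sum(Z**2)) / prior_var
    return ll_x + ll_y + logprior
```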
55
Example: brain imaging
- 2000 dictionary words, 60 stimulus words, 500 brain voxels
- X = co-occurrence of (dictionary word, stimulus word) on the web
- Y = activation of a voxel when presented with a stimulus
- Task: predict X
[Bar chart: mean squared error for EPCA, H-EPCA, HB-EPCA, and their relational versions]
Credit: Ajit Singh
56
fMRI data
- Subject reads a word on screen, thinks about it for 15-30 s
- fMRI measures blood flow in ~16,000 voxels of a few mm^3 each, at ~1 Hz
57
fMRI data
- Blood flow is a proxy for energy consumption, which is a proxy for amount of activity
- But it is delayed and time-averaged (4-10 s)
- And we further average over time to reduce noise
58
Example data
- Slice 2 at bottom, slice 15 at top
- Front of brain at bottom of each slice
- Subject's left at left of each slice
59
Image factorization
- Express each image in terms of a basis of "eigenimages"
- Basis images capture spatial patterns of activity over many voxels
[Figure: an image expressed as a weighted combination of basis images]
60
Temporal models
So far: latent factors of students and content
- E.g., knowledge components: for a student, skill at the KC; for a problem, need for the KC
- E.g., student affect
But only a limited idea of evolution through time
- E.g., fixed-structure models: proficiency = a + b*x, where x = # practice opportunities, a = initial skill level, b = skill learning rate
61
Temporal models
For evolving factors, we expect far better results if we learn about time explicitly
- Learning curves, gaming state, affective state, motivational state, self-efficacy, ...
[Figure: dynamic model over transactions 1, 2, 3 with latent state X_t, observed properties of each transaction Y_t, and instructional decisions U_t]
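For the linear-Gaussian instance of such a temporal model, tracking the latent state amounts to Kalman filtering, with the instructional decision entering as a control input. A generic predict/update sketch follows; the matrices A, B, C and noise covariances Q, R are placeholders, not parameters learned in the talk.

```python
import numpy as np

def kalman_step(mu, Sigma, u_t, y_t, A, B, C, Q, R):
    """One filtering step for x_t = A x_{t-1} + B u_t + noise,  y_t = C x_t + noise."""
    # Predict the latent state forward through one transaction
    mu_pred = A @ mu + B @ u_t
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update with the observed transaction properties y_t
    S = C @ Sigma_pred @ C.T + R
    K = Sigma_pred @ C.T @ np.linalg.inv(S)          # Kalman gain
    mu_new = mu_pred + K @ (y_t - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new
```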
62
Example: Bayesian Evaluation & Assessment [Beck et al., 2008]
[Figure: dynamic Bayes net relating latent state, properties of transactions, and instructional decisions]
63
The hope
- Fit a temporal model
- Examine learned parameters and latent states
- Discover important evolving factors which affect performance (learning curve, affective state, gaming state, ...)
- Discover how they evolve
64
The hope
- Reduce assumptions about what the factors are
- Explore a wider variety of models
- Model search guided by data: discover factors we might otherwise have missed
65
Walking: original data (thanks: Byron Boots, Sajid Siddiqi)
66
Walking: original data (thanks: Byron Boots, Sajid Siddiqi)
[Figure: dynamic model over transactions 1, 2, 3 with latent state X_t, observed joint angles Y_t, and desired direction U_t]
67
Walking: learned model
68
Steam: original data
69
[Figure: dynamic model over transactions 1, 2, 3 with latent state X_t, observed pixels Y_t, and empty inputs U_t]
70
Steam: learned model