1
Latent Factor Models
Geoff Gordon
Joint work with Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy
2
Motivation
A key component of a cognitive tutor: the student cognitive model
- Tracks what skills the student currently knows (latent factors)
[Figure: graphical model with skill nodes circle-area, rectangle-area, decompose-area and an observed right-answer node]
3
Motivation
Student models are a key bottleneck in cognitive tutor authoring and performance:
- Rough estimate: 20-80 hours to hand-code a model for 1 hour of content
- The result may be too simple, and is not rigorously verified
But there are demonstrated improvements in learning from better models:
- E.g., Cen et al. [2007]: 12% less time to learn 6 geometry units (with the same retention) using a tutor with a more accurate model
This talk: automatic discovery of new models, and data-driven revision of existing models, via (latent) factor analysis
4
DataShop
Subject area        Transactions
Math (total)        16.3 M
  Algebra           11.2 M
  Geometry           5.1 M
Language (total)     2.5 M
  French             0.5 M
  English            0.2 M
  Chinese            1.8 M
Science (total)      3.3 M
  Chemistry          1.2 M
  Physics            2.1 M
Other (total)        3.2 M
Total               25.3 M
Representing ~112K total hours across ~15K students
5
Score: student i, item j
Simple case: a single snapshot, no side information

           ITEMS
           1  2  3  4  5  6  ...
STUDENTS
   A       1  1  0  0  1  0  ...
   B       0  1  1  0  0  0  ...
   C       1  1  0  1  1  0  ...
   D       1  0  0  1  1  0  ...
   ...
6
Missing data

           ITEMS
           1  2  3  4  5  6  ...
STUDENTS
   A       1  ?  ?  ?  1  0  ...
   B       0  ?  1  0  ?  ?  ...
   C       1  1  ?  ?  ?  0  ...
   D       1  0  0  1  ?  ?  ...
   ...
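For concreteness, a tiny NumPy sketch of how a partially observed score matrix like the one above might be stored, with NaN marking missing responses and a boolean mask recording what was observed. The values simply mirror the toy matrix; nothing here comes from DataShop.

```python
import numpy as np

# Hypothetical student-by-item score matrix; np.nan marks unobserved cells.
X = np.array([
    [1, np.nan, np.nan, np.nan, 1, 0],   # student A
    [0, np.nan, 1, 0, np.nan, np.nan],   # student B
    [1, 1, np.nan, np.nan, np.nan, 0],   # student C
    [1, 0, 0, 1, np.nan, np.nan],        # student D
])

observed = ~np.isnan(X)          # True where a score was actually recorded
print(observed.sum(), "observed cells out of", X.size)
```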
7
Data matrix X
[Figure: matrix X with one column x_1, x_2, ..., x_n per student; rows indexed by items]
8
Simple case: model
[Graphical model: U -> X <- V]
- U: student latent factors (unobserved)
- V: item latent factors (unobserved)
- X: observed performance
- n students, m items, k latent factors
9
Linear-Gaussian version
[Same graphical model: student factor U -> X <- item factor V]
- U: Gaussian (0 mean, fixed variance)
- V: Gaussian (0 mean, fixed variance)
- X: Gaussian (fixed variance, mean given by the student and item factors, U_i . V_j)
- n students, m items, k latent factors
10
Matrix form: Principal Components Analysis
DATA MATRIX X  ~=  COMPRESSED MATRIX U  x  BASIS MATRIX V^T
(columns x_1 ... x_n of X, one per student; weights u_1 ... u_n in U; basis vectors v_1 ... v_k in V)
11
PCA: the picture
12
PCA: matrix form
DATA MATRIX X  ~=  COMPRESSED MATRIX U  x  BASIS MATRIX V^T
The columns of V span the low-rank space
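If the Gaussian model above is taken at face value, such a rank-k factorization can be computed with a truncated SVD. A minimal NumPy sketch, using a random matrix as a stand-in for the real student-by-item data; the shapes and k are arbitrary illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items, k = 50, 30, 3
X = rng.random((n_students, n_items))        # stand-in for the real data matrix

# Center and take a rank-k truncated SVD: X ~= U @ V.T
Xc = X - X.mean(axis=0)
left, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
U = left[:, :k] * sing[:k]                   # student weights (n_students x k)
V = Vt[:k].T                                 # item basis vectors (n_items x k)

print("reconstruction error:", np.linalg.norm(Xc - U @ V.T))
```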
13
Interpretation of factors
- Basis weights u_1 ... u_n: one per student
- Basis vectors v_1 ... v_k: over items
- The basis vectors are candidate "skills" or "knowledge components"
- The weights are students' knowledge levels
14
PCA is a widely successful model
[Face images from Groundhog Day, extracted by the Cambridge face DB project]
15
Data matrix: face images
[Figure: matrix with one column x_1, x_2, ..., x_n per image; rows indexed by pixels]
16
Result of factoring
- Basis weights u_1 ... u_n: one per image
- Basis vectors v_1 ... v_k: over pixels
- The basis vectors are often called "eigenfaces"
17
Eigenfaces
[Image credit: AT&T Labs Cambridge]
18
PCA: the good
- Unsupervised: need no human labels of latent state!
  - No worry about "expert blind spot"
  - Of course, labels are helpful if available
- Post-hoc human interpretation of latents is nice too, e.g., for intervention design
19
PCA: the bad
Linear, Gaussian:
- PCA assumes E(X) is linear in U and V (E(X_ij) = U_i . V_j)
- PCA assumes (X - E(X)) is i.i.d. Gaussian
20
Nonlinearity: conjunctive skills
[Surface plot: P(correct) as a function of skill 1 and skill 2]
21
Nonlinearity: disjunctive skills
[Surface plot: P(correct) as a function of skill 1 and skill 2]
22
Nonlinearity: "other"
[Surface plot: P(correct) as a function of skill 1 and skill 2]
23
Non-Gaussianity
Typical hand-developed skill-by-item matrix:

           ITEMS
           1  2  3  4  5  6  ...
SKILL 1    1  1  0  0  1  1  ...
SKILL 2    0  0  1  1  0  1  ...
24
Result of Gaussian assumption
[Plot: rows of the true and recovered V matrices]
25
Result of Gaussian assumption
[Plot: rows of the true and recovered V matrices]
26
The ugly: MLE only
- PCA yields the maximum-likelihood estimate
- Good, right? Sadly, the usual reasons to want the MLE don't apply here
  - E.g., consistency: the variance and bias of the estimates of U and V do not approach 0 (unless the number of items per student and students per item both grow without bound)
- Result: the MLE is typically far too confident of itself
27
Too certain: example
[Plots: learned coefficients (e.g., a row of U) and the resulting predictions]
28
Result: the "fold-in problem"
- Nonsensical results when trying to apply the learned model to a new student or item
- Similar to the overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples
- Unlike overfitting, the fold-in problem doesn't necessarily go away with more data
29
Summary: 3 problems with PCA
1. Can't handle nonlinearity
2. Can't handle non-Gaussian distributions
3. Uses the MLE only (=> fold-in problem)
Let's look at each problem in turn
30
Nonlinearity
In PCA, we had X_ij ~= U_i . V_j
What if
- X_ij ~= exp(U_i . V_j)
- X_ij ~= sigmoid(U_i . V_j)
- ...
31
Non-Gaussianity
In PCA, we had X_ij ~ Normal(mu), with mu = U_i . V_j
What if
- X_ij ~ Poisson(mu)
- X_ij ~ Binomial(p)
- ...
32
Exponential family review
Exponential family of distributions:
    P(X | theta) = P_0(X) exp(X . theta - G(theta))
- G(theta) is always strictly convex, differentiable on the interior of its domain
- This means G' is strictly monotone (strictly generalized monotone in 2 or more dimensions)
33
Exponential family review
Exponential family PDF:
    P(X | theta) = P_0(X) exp(X . theta - G(theta))
Surprising result: G'(theta) = g(theta) = E(X | theta)
- g and g^-1: the "link function" pair
- theta: the "natural parameter"
- E(X | theta): the "expectation parameter"
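As a concrete instance of the identity G'(theta) = E(X | theta), here is the Bernoulli (coin-flip) case worked out; this is standard textbook algebra added for reference, not a formula from the slides.

```latex
% Bernoulli in exponential-family form, with theta = log-odds
P(X = x \mid \theta) = \exp\!\big(x\theta - G(\theta)\big), \quad x \in \{0,1\}, \quad
G(\theta) = \log\!\big(1 + e^{\theta}\big)

% Differentiating the log-partition function recovers the mean
G'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \sigma(\theta) = P(X = 1 \mid \theta) = E(X \mid \theta)
```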
34
Examples
- Normal (mean):       g = identity
- Poisson (log rate):  g = exp
- Binomial (log odds): g = sigmoid
35
Nonlinear & non-Gaussian
- Let P(X | theta) be an exponential family with natural parameter theta
- Predict X_ij ~ P(X | theta_ij), where theta_ij = U_i . V_j
- E.g., Poisson: E(X_ij) = exp(theta_ij)
- E.g., Binomial: E(X_ij) = sigmoid(theta_ij)
36
Optimization problem
    max over U, V of  sum_ij log P(X_ij | theta_ij) + log P(U) + log P(V)
    subject to  theta_ij = U_i . V_j
"Generalized linear" or "exponential family" PCA:
- All P(...) terms are exponential families
- Analogy to GLMs
[Collins et al., 2001] [Gordon, 2002] [Roy & Gordon, 2005]
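A minimal NumPy sketch of this objective for the Bernoulli/logistic case. The spherical Gaussian priors standing in for log P(U) and log P(V), and the variable names, are my assumptions rather than details from the talk.

```python
import numpy as np

def epca_objective(U, V, X, observed, prior_var=1.0):
    """Penalized log-likelihood: sum_ij log P(X_ij | theta_ij) + log P(U) + log P(V),
    with theta_ij = U_i . V_j and a Bernoulli observation model."""
    theta = U @ V.T
    # Bernoulli log-likelihood x*theta - log(1 + exp(theta)); unobserved cells contribute 0
    loglik = np.where(observed, X * theta - np.logaddexp(0.0, theta), 0.0).sum()
    # Spherical Gaussian priors on the latent factors (up to additive constants)
    logprior = -0.5 * (np.sum(U**2) + np.sum(V**2)) / prior_var
    return loglik + logprior
```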
37
Special cases
- PCA, probabilistic PCA
- Poisson PCA
- k-means clustering
- Max-margin matrix factorization (MMMF)
- Almost: pLSI, pHITS, NMF
38
Comparison to AFM (Additive Factors Model)
    logit(p_ij) = theta_i + sum_k Q_jk beta_k + sum_k Q_jk gamma_k T_ik
- p = probability correct
- theta = student overall performance
- beta = skill difficulty
- Q = item x skill matrix
- gamma = skill practice slope
- T = number of practice opportunities
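Under the AFM form reconstructed above (which should be treated as an assumption, since the slide's original equation did not survive extraction), a prediction looks like this small sketch; the numbers are made up.

```python
import numpy as np

def afm_prob_correct(theta_i, beta, gamma, Q_j, T_i):
    """AFM-style success probability for student i on item j:
    logit(p) = theta_i + sum_k Q_jk * (beta_k + gamma_k * T_ik)."""
    logit_p = theta_i + np.sum(Q_j * (beta + gamma * T_i))
    return 1.0 / (1.0 + np.exp(-logit_p))

# Toy example: one item tapping skills 0 and 2; the student has practiced skill 2 three times.
p = afm_prob_correct(theta_i=0.5,
                     beta=np.array([-1.0, 0.3, -0.2]),
                     gamma=np.array([0.1, 0.2, 0.15]),
                     Q_j=np.array([1, 0, 1]),
                     T_i=np.array([0, 0, 3]))
print(round(p, 3))
```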
39
Theorem
- In GL PCA, finding the U which maximizes likelihood (holding V fixed) is a convex optimization problem
- And, finding the best V (holding U fixed) is a convex problem
- Further, the Hessian is block diagonal
So, an efficient and effective optimization algorithm: alternately improve U and V
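A sketch of that alternating scheme for the Bernoulli case: gradient ascent steps on U with V held fixed, then on V with U held fixed (each subproblem is convex, as the theorem states). The step size, iteration count, and prior variance are illustrative choices, not the talk's algorithm.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def alternating_epca(X, observed, k=3, iters=200, lr=0.05, prior_var=1.0, seed=0):
    """Alternately improve U (students x k) and V (items x k) for Bernoulli EPCA."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    Xz = np.where(observed, X, 0.0)                  # zero out missing cells
    for _ in range(iters):
        resid = observed * (Xz - sigmoid(U @ V.T))   # d loglik / d theta on observed cells
        U += lr * (resid @ V - U / prior_var)        # ascent step on U, V held fixed
        resid = observed * (Xz - sigmoid(U @ V.T))
        V += lr * (resid.T @ U - V / prior_var)      # ascent step on V, U held fixed
    return U, V
```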
40
Example: compressing histograms with Poisson PCA
- Points: observed frequencies in R^3
- Hidden manifold: a 1-parameter family of multinomials
[Figure: probability simplex with vertices A, B, C]
41
Example ITERATION 1
42
Example ITERATION 2
43
Example ITERATION 3
44
Example ITERATION 4
45
Example ITERATION 5
46
Example ITERATION 9
47
Remaining problem: MLE
- Well-known rule of thumb: if the MLE gets you in trouble due to overfitting, move to fully Bayesian inference
- Typical problem: computation
- In our case, the computation is just fine if we're a little clever
- Additional wrinkle: switch to a hierarchical model
48
Bayesian hierarchical exponential-family PCA
[Graphical model: R -> U (student factor) -> X <- V (item factor) <- S]
- U: student latent factors (unobserved)
- V: item latent factors (unobserved)
- X: observed performance
- R: shared prior for student latents
- S: shared prior for item latents
- n students, m items, k latent factors
49
A little clever: MCMC
[Figure: Z, P(X)]
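One way to be "a little clever" is a plain random-walk Metropolis sampler over a single student's latent row u, given V and that student's observed responses, then averaging predictions over posterior samples rather than plugging in a point estimate. This is a generic sketch under my own Bernoulli-likelihood and Gaussian-prior assumptions, not the sampler actually used in the talk.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_post(u, V, x, observed, prior_var=1.0):
    """log P(u | x, V) up to a constant: Bernoulli likelihood + Gaussian prior."""
    theta = V @ u
    loglik = np.where(observed, x * theta - np.logaddexp(0.0, theta), 0.0).sum()
    return loglik - 0.5 * np.dot(u, u) / prior_var

def metropolis_student(V, x, observed, n_samples=2000, step=0.2, seed=0):
    rng = np.random.default_rng(seed)
    u = np.zeros(V.shape[1])
    lp = log_post(u, V, x, observed)
    samples = []
    for _ in range(n_samples):
        u_prop = u + step * rng.standard_normal(u.shape)   # random-walk proposal
        lp_prop = log_post(u_prop, V, x, observed)
        if np.log(rng.random()) < lp_prop - lp:            # accept/reject
            u, lp = u_prop, lp_prop
        samples.append(u.copy())
    samples = np.array(samples)
    # Posterior-mean prediction for every item, averaging over samples
    return sigmoid(samples @ V.T).mean(axis=0)
```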
50
Experimental comparison
- Geometry Area 1996-1997 data
- Geometry tutor: 139 items presented to 59 students
- On average, each student tested on 60 items
51
Results: hold-out error
[Chart of hold-out error; embedding dimension for *EPCA is K = 15]
Credit: Ajit Singh
52
Extensions
- Relational models
- Temporal models
53
Relational models

Students x items:
        1 2 3 4 5 6
john    1 1 0 0 1 0
sue     0 1 1 0 0 0
tom     1 1 0 1 1 0

Tags x items:
        1 2 3 4 5 6
trig    1 1 0 0 1 0
story   0 1 1 0 0 0
hard    1 1 0 1 1 0
54
Relational hierarchical Bayesian exponential-family PCA
[Graphical model: R -> U, S -> V, T -> Z; X depends on U and V; Y depends on V and Z]
- X, Y: observed data
- U: student latent factors
- V: item latent factors
- Z: tag latent factors
- R, S, T: shared priors
- n students, m items, p tags, k latent factors
- X ~= f(U V^T), Y ~= g(V Z^T)
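The relational idea can be written as one joint objective in which the item factors V are shared between the student-item factorization and the item-tag factorization. A rough sketch, with Bernoulli links and fully observed matrices chosen purely for illustration, not the exact model from the talk:

```python
import numpy as np

def relational_objective(U, V, Z, X, Y, prior_var=1.0):
    """Joint log-likelihood for X ~ f(U V^T) and Y ~ g(V Z^T) with shared item factors V."""
    theta_x = U @ V.T                                            # student-item block
    theta_y = V @ Z.T                                            # item-tag block
    ll_x = (X * theta_x - np.logaddexp(0.0, theta_x)).sum()      # Bernoulli log-likelihood
    ll_y = (Y * theta_y - np.logaddexp(0.0, theta_y)).sum()
    logprior = -0.5 * (np.sum(U**2) + np.sum(V**2) + np.sum(Z**2)) / prior_var
    return ll_x + ll_y + logprior
```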
55
Example: brain imaging
- 2000 dictionary words, 60 stimulus words, 500 brain voxels
- X = co-occurrence of (dictionary word, stimulus word) on the web
- Y = activation of a voxel when presented with a stimulus
- Task: predict X
[Bar chart: mean squared error for EPCA, H-EPCA, HB-EPCA, and their relational versions]
Credit: Ajit Singh
56
fMRI data
- Subject reads a word on screen, thinks about it for 15-30 s
- fMRI measures blood flow in ~16,000 voxels of a few mm^3 each, at ~1 Hz
57
fMRI data
- Blood flow is a proxy for energy consumption, which is a proxy for amount of activity
- But it is delayed and time-averaged (4-10 s)
- And we further average over time to reduce noise
58
Example data
- Slice 2 at bottom, slice 15 at top
- Front of brain at bottom of each slice
- Subject's left at left of each slice
59
Image factorization
- Express each image in terms of a basis of "eigenimages"
- Basis images capture spatial patterns of activity over many voxels
[Figure: an image expressed as a weighted combination of basis images]
60
Temporal models
So far: latent factors of students and content
- E.g., knowledge components: for a student, skill at the KC; for a problem, need for the KC
- E.g., student affect
But only a limited idea of evolution through time
- E.g., fixed-structure models: proficiency = a + b*x, where x = # practice opportunities, a = initial skill level, b = skill learning rate
61
Temporal models
For evolving factors, we expect far better results if we learn about time explicitly
- Learning curves, gaming state, affective state, motivational state, self-efficacy, ...
[Figure: dynamic model over transactions 1, 2, 3 with latent state X_t, observed properties of each transaction Y_t, and instructional decisions U_t]
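For the linear-Gaussian instance of such a temporal model, tracking the latent state amounts to Kalman filtering, with the instructional decision entering as a control input. A generic predict/update sketch follows; the matrices A, B, C and noise covariances Q, R are placeholders, not parameters learned in the talk.

```python
import numpy as np

def kalman_step(mu, Sigma, u_t, y_t, A, B, C, Q, R):
    """One filtering step for x_t = A x_{t-1} + B u_t + noise,  y_t = C x_t + noise."""
    # Predict the latent state forward through one transaction
    mu_pred = A @ mu + B @ u_t
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update with the observed transaction properties y_t
    S = C @ Sigma_pred @ C.T + R
    K = Sigma_pred @ C.T @ np.linalg.inv(S)          # Kalman gain
    mu_new = mu_pred + K @ (y_t - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new
```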
62
Example: Bayesian Evaluation & Assessment [Beck et al., 2008]
[Figure: dynamic Bayes net relating latent state, properties of transactions, and instructional decisions]
63
The hope
- Fit a temporal model
- Examine learned parameters and latent states
- Discover important evolving factors which affect performance (learning curve, affective state, gaming state, ...)
- Discover how they evolve
64
The hope
- Reduce assumptions about what the factors are
- Explore a wider variety of models
- Model search guided by data: discover factors we might otherwise have missed
65
Walking: original data (thanks: Byron Boots, Sajid Siddiqi)
66
Walking: original data (thanks: Byron Boots, Sajid Siddiqi)
[Figure: dynamic model over transactions 1, 2, 3 with latent state X_t, observed joint angles Y_t, and desired direction U_t]
67
Walking: learned model
68
Steam: original data
69
[Figure: dynamic model over transactions 1, 2, 3 with latent state X_t, observed pixels Y_t, and empty inputs U_t]
70
Steam: learned model