Bayes rule, priors and maximum a posteriori
Presentation transcript:

Next Semester: CSCI 5622 – Machine Learning (Matt Wilder); great text by Hastie, Tibshirani, & Friedman. ECEN 5018 – Game Theory. ECEN 5322 – Analysis of High-Dimensional Datasets (Fall 2014).

Project Assignments 8 and 9: your own project or my 'student modeling' project; individual or team.

Battleship Game (link to game)

Data set: 51 students, 179 unique problems, 4223 total problems, ~15 hr of student usage.

Data set

Test set embedded in spreadsheet

Bayesian Knowledge Tracing: Students are learning a new skill (knowledge component) with a computerized tutoring system, e.g., manipulation of algebra equations. Students are given a series of problems to solve; each solution is either correct or incorrect. Goal: infer when learning has taken place. (The larger goal is to use this prediction to make inferences about other aspects of student performance, such as retention over time and generalization to other skills.)

All-Or-Nothing Learning Model (Atkinson, 1960s): a two-state finite-state machine with states Don't Know and Know, 'just learned' and 'just forgotten' transitions between them, and per-state correct-response probabilities c0 (Don't Know) and c1 (Know).

Bayesian Knowledge Tracing Assumes No Forgetting: very sensible, given that the sequence of problems is all within a single session. [State diagram: Don't Know → Know via a single 'just learned' transition, with correct-response probabilities ρ0 (Don't Know) and ρ1 (Know).]

Inference Problem: Given a sequence of trials, infer the probability that the concept was just learned. T: trial on which the concept was learned (0…∞). [Figure: example response sequences illustrating T < 1, T = 2, T = 6, and T > 8.]

T: trial on which the concept was learned (0…∞)
Xi: response i is correct (Xi = 1) or incorrect (Xi = 0)
Goal: compute P(T | X1, …, Xn)
S: latent state (0 = don't know, 1 = know)
ρs: probability of a correct response when S = s
L: probability of transitioning from the don't-know to the know state
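To make this notation concrete, here is a minimal simulation sketch of the no-forgetting model in Python/NumPy. The function name and the convention that learning can occur just before any given trial are my assumptions, not something specified on the slides; learn_prob plays the role of L.

```python
import numpy as np

def simulate_student(n_trials, learn_prob, rho0, rho1, rng=None):
    """Simulate one student's responses under the no-forgetting two-state model:
    start in the don't-know state, switch to the know state with probability
    `learn_prob` before each trial, and answer correctly with probability
    rho0 (don't know) or rho1 (know)."""
    rng = np.random.default_rng() if rng is None else rng
    knows = False
    T = None                       # trial on which the concept was learned
    responses = np.zeros(n_trials, dtype=int)
    for i in range(n_trials):
        if not knows and rng.random() < learn_prob:
            knows = True
            T = i + 1              # 1-indexed learning trial
        p_correct = rho1 if knows else rho0
        responses[i] = int(rng.random() < p_correct)
    return responses, T            # T is None if learning never occurred

# Example: one simulated student
x, T = simulate_student(n_trials=10, learn_prob=0.2, rho0=0.3, rho1=0.9)
print(x, T)
```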

What I Did

Observation: If you know the point in time at which learning occurred (T), then the order of trials before T doesn't matter, and neither does the order of trials after. What matters is the total count of correct responses in each segment → we can ignore sequences and work with counts, as in the sketch below.
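As a sketch of the count-based likelihood this observation licenses, assuming the convention that trials 1..T-1 are generated in the don't-know state and trials T..n in the know state, with T = n + 1 standing for 'not learned this session':

```python
import numpy as np

def log_likelihood_given_T(x, T, rho0, rho1):
    """log P(x_1..n | T, rho0, rho1), computed from counts of correct responses.
    Convention (an assumption): trials before T use rho0, trials T..n use rho1;
    T = n + 1 means learning did not occur within the session."""
    x = np.asarray(x)
    pre, post = x[:T - 1], x[T - 1:]          # split at the learning trial
    c_pre, c_post = pre.sum(), post.sum()     # counts of correct responses
    return (c_pre * np.log(rho0) + (len(pre) - c_pre) * np.log(1 - rho0)
            + c_post * np.log(rho1) + (len(post) - c_post) * np.log(1 - rho1))
```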

Notation: Simple Model

What We Should Be Able To Do: treat ρ0, ρ1, and T as random variables and do Bayesian inference on these variables; put hyperpriors on ρ0, ρ1, and T, and use the data (over multiple subjects) to inform the posteriors; loosen the restriction on the transition distribution (e.g., Uniform, Geometric, Poisson, or Negative Binomial); handle the 'didn't learn' situation in a principled way.

What CSCI 7222 Did In 2012: [graphical model over X (observed, per student per trial), T, λ, ρ0, ρ1, α0, α1, γ, β, with hyperparameters (k0, θ0), (k1, θ1), (k2, θ2).]

Most General Analog To BKT: [graphical model as on the previous slide, but with separate parameter pairs (α0,0, α0,1) for ρ0 and (α1,0, α1,1) for ρ1, plus X, T, λ, γ, β and hyperparameters (k0, θ0), (k1, θ1), (k2, θ2).]

Sampling: Although you might sample {ρ0,s} and {ρ1,s}, it would be preferable (more efficient) to integrate them out, so they are never represented explicitly (as in a topic model); see the sketch below and the next slide. It is also feasible (and likely more efficient) to integrate out Ts because it is discrete; if you wanted to do Gibbs sampling on Ts instead, see the next slide. How to deal with the remaining variables (λ, γ, α0, α1)? See two slides ahead.
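Here is a sketch of the integrate-out step for the ρ's, assuming Beta priors; the (a, b) parameterization is illustrative and need not match the slides' α's exactly. With c correct responses out of m trials governed by a single ρ ~ Beta(a, b), the marginal likelihood is B(a + c, b + m - c) / B(a, b):

```python
import numpy as np
from scipy.special import betaln

def log_marginal_counts(c, m, a, b):
    """Log of the Beta marginal: log of the integral of
    rho^c (1-rho)^(m-c) Beta(rho; a, b) d rho
    = log B(a + c, b + m - c) - log B(a, b)."""
    return betaln(a + c, b + m - c) - betaln(a, b)

def log_marginal_given_T(x, T, a0, b0, a1, b1):
    """log P(x | T) with rho0 ~ Beta(a0, b0) and rho1 ~ Beta(a1, b1) integrated out,
    using the same split-at-T convention as the earlier sketch (an assumption)."""
    x = np.asarray(x)
    pre, post = x[:T - 1], x[T - 1:]
    return (log_marginal_counts(pre.sum(), len(pre), a0, b0)
            + log_marginal_counts(post.sum(), len(post), a1, b1))
```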

Key Inference Problem: If we are going to sample T (either to compute posteriors on the hyperparameters, or to make a final guess about the moment-of-learning distribution), we must compute P(Ts | {Xs,i}, λ, γ, α0, α1). Note that Ts is discrete and takes values in {0, 1, …, N}, so normalization is feasible by summing over those values.
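Because Ts is discrete, this posterior can be computed by brute-force enumeration followed by explicit normalization. In the sketch below, the geometric prior on T is an assumed stand-in for whatever transition distribution the model actually uses, and the per-T likelihood is passed in as a callable (e.g., log_marginal_given_T from the previous sketch with fixed Beta parameters):

```python
import numpy as np

def log_prior_T(T, lam, n):
    """Assumed geometric prior on the learning trial: P(T = t) = lam (1 - lam)^(t-1)
    for t = 1..n, with the remaining mass (1 - lam)^n on 'not learned' (T = n + 1)."""
    if T <= n:
        return np.log(lam) + (T - 1) * np.log(1 - lam)
    return n * np.log(1 - lam)

def posterior_over_T(x, lam, log_lik_given_T):
    """P(T | x, hyperparameters) by enumerating T = 1..n+1 and normalizing.
    `log_lik_given_T(x, T)` is any per-T log likelihood."""
    n = len(x)
    log_post = np.array([log_prior_T(T, lam, n) + log_lik_given_T(x, T)
                         for T in range(1, n + 2)])
    log_post -= log_post.max()                # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```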

Remaining Variables (λ, γ, α0, α1). Rowan: maximum likelihood estimation; find the values that maximize P(x | λ, γ, α0, α1). Overfitting is possible, but not that serious an issue considering the amount of data and only four parameters. Mohammad, Homa: Metropolis-Hastings; requires analytic evaluation of P(λ | x) etc., but does not require the normalization constant. (Note: x is all of the data; the product is over students, marginalizing over Ts.)
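For the Metropolis-Hastings option, a generic random-walk sketch over the hyperparameter vector is below. The function log_joint is a hypothetical stand-in that should return the log prior plus the summed per-student log marginal likelihood (and -inf for out-of-range values); this is an illustration, not the implementation used in class:

```python
import numpy as np

def metropolis_hastings(log_joint, data, init, n_steps=5000, step_size=0.05, rng=None):
    """Random-walk Metropolis-Hastings over hyperparameters such as
    (lam, gamma, alpha0, alpha1). Works with the unnormalized log posterior,
    so no normalization constant is needed."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(init, dtype=float)
    lp = log_joint(theta, data)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(len(theta))
        lp_prop = log_joint(proposal, data)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
```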

Remaining Variables (λ, γ, α0, α1). Mike: likelihood weighting; sample λ, γ, α0, α1 from their respective priors; for each student, compute the data likelihood given that sample, marginalizing over Ts, ρs,0, and ρs,1; weight the sample by the data likelihood. Rob Lindsey: slice sampling.
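A likelihood-weighting sketch under the same caveats: sample_hyperparams and student_log_marginal are hypothetical stand-ins for drawing (λ, γ, α0, α1) from their priors and for the per-student marginal over Ts, ρs,0, and ρs,1:

```python
import numpy as np

def likelihood_weighting(sample_hyperparams, student_log_marginal, students, n_samples=1000):
    """Draw hyperparameter samples from the prior and weight each sample by the
    data likelihood: the product over students of the per-student marginal."""
    samples, log_weights = [], []
    for _ in range(n_samples):
        theta = sample_hyperparams()                      # draw from the prior
        lw = sum(student_log_marginal(x, theta) for x in students)
        samples.append(theta)
        log_weights.append(lw)
    log_weights = np.array(log_weights)
    w = np.exp(log_weights - log_weights.max())           # normalize in log space
    return samples, w / w.sum()
```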

Latent Factor Models: Item response theory (a.k.a. the Rasch model) is the traditional approach to modeling student and item effects in test taking (e.g., SATs). The probability of a correct response is a logistic function of αs - δi, where αs is the ability of student s and δi is the difficulty of item i.
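A minimal sketch of the Rasch prediction; the function name is mine, but the formula is the standard one-parameter logistic (1PL) model:

```python
import numpy as np

def rasch_p_correct(alpha_s, delta_i):
    """Rasch / 1PL item response model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + np.exp(-(alpha_s - delta_i)))

# Example: a strong student (alpha = 1.0) on an easy item vs. a hard item
print(rasch_p_correct(1.0, -0.5), rasch_p_correct(1.0, 2.0))
```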

Extending Latent Factor Models Need to consider problem and performance history

Bayesian Latent Factor Model. ML approach: search for the α and δ values that maximize the training-set likelihood. Bayesian approach: define priors on α and δ, e.g., Gaussian. Hierarchical Bayesian approach: treat σα² and σδ² as random variables, e.g., Gamma distributed with hyperpriors.
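As a sketch of the middle option, Gaussian priors on α and δ turn MAP estimation into an L2-penalized Bernoulli log-likelihood; the variances are fixed here, whereas the hierarchical version would place hyperpriors on them. Variable names and the zero-mean priors are illustrative assumptions:

```python
import numpy as np

def neg_log_posterior(alpha, delta, correct, student_idx, item_idx,
                      sigma_alpha=1.0, sigma_delta=1.0):
    """Negative log posterior for the Rasch model with Gaussian priors:
    Bernoulli log-likelihood over observed (student, item, correct) triples
    plus L2 penalties from alpha ~ N(0, sigma_alpha^2), delta ~ N(0, sigma_delta^2)."""
    logits = alpha[student_idx] - delta[item_idx]
    # Bernoulli log-likelihood with a logistic link, written in a numerically stable form
    ll = np.sum(correct * logits - np.logaddexp(0.0, logits))
    penalty = (np.sum(alpha ** 2) / (2 * sigma_alpha ** 2)
               + np.sum(delta ** 2) / (2 * sigma_delta ** 2))
    return -ll + penalty
```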

Khajah, Wing, Lindsey, & Mozer model (paper)