
1 Information criterion
Greg Francis
PSY 626: Bayesian Statistics for Psychological Science
Fall 2016, Purdue University

2 Likelihood
Given a model (e.g., N(0, 1)), one can ask how likely a set of observed data outcomes is. This is similar to a probability, but not quite the same: in a continuous distribution, each specific point has probability zero. Instead, we multiply the probability density (the height of the model's probability density function) at each observed data outcome. Smaller likelihoods indicate that the data are less likely (probable), given the model.

3 Likelihood
The calculation is just multiplication of probability densities:

> x <- c(-1.1, -0.3, 0.2, 1.0)
> dnorm(x, mean=0, sd=1)
[1] 0.2178522 0.3813878 0.3910427 0.2419707
> prod(dnorm(x, mean=0, sd=1))
[1] 0.007861687

4 Likelihood
Adding data to a set makes the set less likely. It is always more probable that I draw a red king from a deck of cards than that I draw a red king and then a black queen from a deck of cards.

> x <- c(-1.1, -0.3, 0.2, 1.0, 0)
> prod(dnorm(x, mean=0, sd=1))
[1] 0.003136359

In general, larger data sets have a smaller likelihood for a given model, although this partly depends on the properties of the data sets and the model.

5 Log Likelihood
Likelihood values can become so small that they cause problems (smaller than the smallest possible number in a computer), so we often compute the (natural) log likelihood instead.

> x <- c(-1.1, -0.3, 0.2, 1.0, 0)
> sum(dnorm(x, mean=0, sd=1, log=TRUE))
[1] -5.764693

Smaller (more negative) values indicate a smaller likelihood.

6 Maximum (log) Likelihood
Given a model form (e.g., N(mu, 1)), what value of mu maximizes the likelihood? Compute the log likelihood of the data for candidate values such as mu = 0, 0.5, -0.5, and 0.05: it is the sample mean that maximizes the (log) likelihood. A sketch of the comparison follows.
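
A minimal sketch of that comparison, reusing the five data values from the earlier slides (the helper name logLik_mu is just illustrative):

x <- c(-1.1, -0.3, 0.2, 1.0, 0)
# log likelihood of the data under N(mu, 1) for a candidate mean mu
logLik_mu <- function(mu) sum(dnorm(x, mean=mu, sd=1, log=TRUE))
# compare several candidate means, including the sample mean
sapply(c(0, 0.5, -0.5, 0.05, mean(x)), logLik_mu)
# the largest (least negative) value occurs at mean(x), the sample mean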

7 Maximum (log) Likelihood
Given a model form (e.g., N(mu, sigma)), what pair of parameters (mu, sigma) maximizes the likelihood? Compare candidate pairs: the sample mean with the sample standard deviation, the sample mean with the sd computed by the "population formula" (dividing by n rather than n-1), and pairs such as (-0.05, 0.5), (-0.05, 1), and (0, 1). The maximum is at the sample mean paired with the "population formula" sd (a sketch follows). Note: the true population values do not maximize the likelihood for a given sample [over fitting].
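
A similar sketch, again reusing the five-value sample (logLik_norm is an illustrative helper), checks which sigma maximizes the likelihood at the sample mean:

x <- c(-1.1, -0.3, 0.2, 1.0, 0)
logLik_norm <- function(mu, sigma) sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
sd_sample <- sd(x)                        # usual sample sd (divides by n-1)
sd_pop    <- sqrt(mean((x - mean(x))^2))  # "population formula" sd (divides by n)
logLik_norm(mean(x), sd_sample)
logLik_norm(mean(x), sd_pop)              # larger: this is the maximum likelihood pair
logLik_norm(0, 1)                         # the true population values give a lower log likelihood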

8 Predicting (log) Likelihood
Suppose you identify the parameters (e.g., of N(mu, sigma)) that maximize the likelihood for a data set X (n=50). Now you gather a new data set Y (n=50) and use those (mu, sigma) values to compute the likelihood of the new data set.
[Table: log likelihood of X, predicted log likelihood of Y, and their difference, for several simulated samples]
Average difference = 1.54

9 Try again
Suppose you identify the parameters (e.g., of N(mu, sigma)) that maximize the likelihood for a data set X (n=50). Now you gather a new data set Y (n=50) and use those (mu, sigma) values to compute the likelihood of the new data set.
[Table: log likelihood of X, predicted log likelihood of Y, and their difference, for several simulated samples]
Average difference = 2.67

10 Really large sample
Suppose you identify the parameters (e.g., of N(mu, sigma)) that maximize the likelihood for a data set X (n=50). Now you gather a new data set Y (n=50) and use those (mu, sigma) values to compute the likelihood of the new data set. Do this for many (100,000) simulated experiments and, on average, the difference of the two log likelihoods comes out close to the number of independent parameters in the model (here 2; see the table on slide 15). A sketch of this simulation follows.
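
This is a minimal sketch, assuming the data are drawn from a standard normal and the model is a single N(mu, sigma); the variable names are illustrative:

set.seed(1)
n <- 50
diffs <- replicate(100000, {
  X <- rnorm(n)                              # original data set
  Y <- rnorm(n)                              # replication data set
  mu_hat    <- mean(X)                       # maximum likelihood estimates from X
  sigma_hat <- sqrt(mean((X - mu_hat)^2))
  llX <- sum(dnorm(X, mu_hat, sigma_hat, log=TRUE))   # fit to original data
  llY <- sum(dnorm(Y, mu_hat, sigma_hat, log=TRUE))   # predicted fit to new data
  llX - llY
})
mean(diffs)   # on average, close to 2 (the number of fitted parameters)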

11 Different case
Suppose you identify the parameters (e.g., of N(mu1, sigma) and N(mu2, sigma)) that maximize the likelihood for data sets X1 and X2 (n1 = n2 = 50). Now you gather new data sets Y1 and Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma) values to compute the likelihood of the new data.
[Table: log likelihood of the original data, predicted log likelihood of the new data, and their difference, for several simulated samples]
Average difference = 2.68

12 Try again
Suppose you identify the parameters (e.g., of N(mu1, sigma) and N(mu2, sigma)) that maximize the likelihood for data sets X1 and X2 (n1 = n2 = 50). Now you gather new data sets Y1 and Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma) values to compute the likelihood of the new data.
[Table: log likelihood of the original data, predicted log likelihood of the new data, and their difference, for several simulated samples]
Average difference = 3.24

13 Really large sample
Suppose you identify the parameters (e.g., of N(mu1, sigma) and N(mu2, sigma)) that maximize the likelihood for data sets X1 and X2 (n1 = n2 = 50). Now you gather new data sets Y1 and Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma) values to compute the likelihood of the new data. Do this for many (100,000) simulated experiments and, on average, the difference again comes out close to the number of independent parameters (here 3; see the table on slide 15).

14 Still different case
Suppose you identify the parameters (e.g., of N(mu1, sigma1) and N(mu2, sigma2)) that maximize the likelihood for data sets X1 and X2 (n1 = n2 = 50). Now you gather new data sets Y1 and Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma1, sigma2) values to compute the likelihood of the new data. Do this for many (100,000) simulated experiments and, on average, the difference comes out close to the number of independent parameters (here 4; see the table on slide 15).

15 Number of parameters
The difference of the log likelihoods is approximately equal to the number of independent model parameters!

Number of model parameters        Average difference of log likelihoods (original - replication)
2 (1 common mean, 1 common sd)    2.14
3 (2 means, 1 common sd)          3.20
3 (1 common mean, 2 sd)           3.24
4 (2 means, 2 sd)                 4.25

16 Over fitting
This means that the log likelihood of the original data set, calculated with these (maximum likelihood) parameter values, is bigger than reality: it is a biased estimate of the log likelihood. The log likelihood of the replication data set, calculated with the same parameter values, is (on average) accurate: it is an unbiased estimate of the log likelihood.
Thus we know that, on average, using the maximum likelihood estimates for the parameters will "over fit" the data set. We can make a better estimate of the predicted likelihood of the original data set by adjusting for the (average) bias, as written out below. Note that we only need to know how many (K) independent parameters are in the model; we do not actually need the replication data set!
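
In symbols, the adjustment amounts to subtracting K from the maximized log likelihood (a sketch of the standard bias correction, with $\hat{\theta}$ the maximum likelihood estimates from data X and K the number of independent parameters):

$$
\widehat{\log L}_{\text{pred}} \;\approx\; \log L(\hat{\theta} \mid X) - K
$$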

17 AIC This is one way of deriving the Akaike Information Criterion
Multiplying everything by -2 gives the AIC. More generally, AIC is an estimate of how much relative information is lost by using a model rather than (an unknown) reality. You can compare models by calculating AIC for each model (relative to a fixed data set) and choosing the model with the smaller AIC value.
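
Written out (the standard definition, with log L the maximized log likelihood and K the number of independent parameters):

$$
\mathrm{AIC} = -2\left[\log L(\hat{\theta} \mid X) - K\right] = -2\log L(\hat{\theta} \mid X) + 2K
$$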

18 Model comparison
Sample X and Y from N(0, 1) and N(0.5, 1) with n1 = n2 = 50, and compute the AIC for each of four models on each of Samples 1-4:
2 parameters (1 common mean, 1 common sd)
3 parameters (2 means, 1 common sd)
3 parameters (1 common mean, 2 sd)
4 parameters (2 means, 2 sd)

19 Model comparison
If you have to pick one model: pick the one with the smallest AIC.
Candidate models (AIC computed for each of Samples 1-4):
2 parameters (1 common mean, 1 common sd)
3 parameters (2 means, 1 common sd)
3 parameters (1 common mean, 2 sd)
4 parameters (2 means, 2 sd)

20 Model comparison
How much confidence should you have in your choice? Two steps, applied to the candidate models (AIC computed for each of Samples 1-4):
2 parameters (1 common mean, 1 common sd)
3 parameters (2 means, 1 common sd)
3 parameters (1 common mean, 2 sd)
4 parameters (2 means, 2 sd)

21 Akaike weights
How much confidence should you have in your choice? Two steps.
1) Compute the difference in AIC (ΔAIC) between each model and the best model for that sample:
2 parameters (1 common mean, 1 common sd): 15.966  5.2087  5.0791  1.7132
3 parameters (2 means, 1 common sd): 1.7209
3 parameters (1 common mean, 2 sd): 4.6525  4.6258
4 parameters (2 means, 2 sd): 1.9951  1.3493  1.4715  1.9353

22 Akaike weights
How much confidence should you have in your choice? Two steps.
1) Compute the difference in AIC (ΔAIC) relative to the best model.
2) Rescale the differences to be probabilities, the Akaike weights wi (formula below), computed for each model and each of Samples 1-4:
2 parameters (1 common mean, 1 common sd)
3 parameters (2 means, 1 common sd)
3 parameters (1 common mean, 2 sd)
4 parameters (2 means, 2 sd)
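
The standard rescaling for model i among R candidate models is:

$$
w_i = \frac{\exp\!\left(-\tfrac{1}{2}\,\Delta\mathrm{AIC}_i\right)}{\sum_{r=1}^{R}\exp\!\left(-\tfrac{1}{2}\,\Delta\mathrm{AIC}_r\right)}
$$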

23 Akaike weights
The Akaike weights are estimated probabilities that the given model will make the best predictions (of likelihood) on new data, conditional on the set of models being considered.
Candidate models (weights wi computed for each of Samples 1-4):
2 parameters (1 common mean, 1 common sd)
3 parameters (2 means, 1 common sd)
3 parameters (1 common mean, 2 sd)
4 parameters (2 means, 2 sd)

24 Generalization
AIC counts the number of independent parameters in the model; this is an estimate of the "flexibility" of the model to fit data. AIC works with the single "best" (maximum likelihood) set of parameters.
In contrast, our Bayesian analyses consider a distribution of parameter values rather than a single "best" set. We can compute the log likelihood for each set of parameter values in the posterior distribution and average across them all (weighting by probability density). This gives the WAIC (Widely Applicable Information Criterion).
It is easy to compute with the rethinking R library: WAIC(VSmodel)
It is also easy to compare models: compare(VSmodel, VSmodel2)
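
Concretely, for S posterior samples θ_s and observations y_i, the usual definition (a sketch of the standard quantities that WAIC() reports) is:

$$
\mathrm{lppd}=\sum_{i}\log\!\left(\frac{1}{S}\sum_{s=1}^{S}p(y_i\mid\theta_s)\right),\qquad
p_{\mathrm{WAIC}}=\sum_{i}\operatorname{var}_{s}\!\left[\log p(y_i\mid\theta_s)\right],\qquad
\mathrm{WAIC}=-2\left(\mathrm{lppd}-p_{\mathrm{WAIC}}\right)
$$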

25 Example: Visual Search
Typical results: For conjunctive distractors, response time increases with the number of distractors

26 Visual Search
Previously we fit a model to the Target absent condition and then extended it to include the Target present condition.

VSdata2 <- subset(VSdata, VSdata$Participant=="Francis200S16-2" & VSdata$DistractorType=="Conjunction")

Define a dummy variable with value 0 if the target is absent and 1 if the target is present:

VSdata2$TargetIsPresent <- ifelse(VSdata2$Target=="Present", 1, 0)

27 Visual Search
Define the model:

VSmodel <- map(
  alist(
    RT_ms ~ dnorm(mu, sigma),
    mu <- a + (b*TargetIsPresent + (1-TargetIsPresent)*b2)*NumberDistractors,
    a ~ dnorm(1000, 500),
    b ~ dnorm(0, 100),
    b2 ~ dnorm(0, 100),
    sigma ~ dunif(0, 2000)
  ),
  data=VSdata2
)

Note: parameter b is the slope when the target is present and b2 is the slope when the target is absent. Both conditions have the same model standard deviation and intercept.

28 Model results
Maximum a posteriori (MAP) model fit. Formula:

RT_ms ~ dnorm(mu, sigma)
mu <- a + (b * TargetIsPresent + (1 - TargetIsPresent) * b2) * NumberDistractors
a ~ dnorm(1000, 500)
b ~ dnorm(0, 100)
b2 ~ dnorm(0, 100)
sigma ~ dunif(0, 2000)

MAP values are reported for a, b, b2, and sigma, along with the log-likelihood. Compare with the model for the Target absent condition only, which reports MAP values for a, b, and sigma.

29 HPDI (Target present)

# Plot HPDI for TargetPresent
points(RT_ms ~ NumberDistractors, data=subset(VSdata2, VSdata2$Target=="Present"), pch=15)

# use link to compute mu for each sample from the posterior and for each value in NumberDistractors.seq
mu_present <- link(VSmodel, data=data.frame(NumberDistractors=NumberDistractors.seq, TargetIsPresent=1))
mu_present.mean <- apply(mu_present, 2, mean)
mu_present.HPDI <- apply(mu_present, 2, HPDI, prob=0.89)

# Plot the MAP line (same as the abline done previously from the linear regression coefficients)
lines(NumberDistractors.seq, mu_present.mean)
shade(mu_present.HPDI, NumberDistractors.seq, col=col.alpha("red", 0.3))

30 Competing model
A popular theory about reaction times in visual search is that they result from a serial process:
Examine an item and judge whether it is the target (green circle).
If it is the target, stop.
If not, pick a new item and repeat.
The final RT is then determined (roughly) by how many items you have to examine before finding the target. When the target is present, on average you should find it after examining half of the searched items. When the target is absent, you always have to search all the items. Thus slope(target absent) = 2 x slope(target present).
Is this a better model than just estimating each slope separately? This twice-slope model has less flexibility than the separate-slopes model.

31 Twice slope model set up
VSmodel2 <- map(
  alist(
    RT_ms ~ dnorm(mu, sigma),
    mu <- a + (b + (1-TargetIsPresent)*b)*NumberDistractors,
    a ~ dnorm(1000, 500),
    b ~ dnorm(0, 100),
    sigma ~ dunif(0, 2000)
  ),
  data=VSdata2
)

With this parameterization the slope is b when the target is present (TargetIsPresent = 1) and b + b = 2b when it is absent, which implements the twice-slope constraint. Fit both models using the same set of data.

32 Twice slope model results
Maximum a posteriori (MAP) model fit. Formula:

RT_ms ~ dnorm(mu, sigma)
mu <- a + (b + (1 - TargetIsPresent) * b) * NumberDistractors
a ~ dnorm(1000, 500)
b ~ dnorm(0, 100)
sigma ~ dunif(0, 2000)

MAP values are reported for a, b, and sigma, along with the log-likelihood.

33 Separate slopes model
Maximum a posteriori (MAP) model fit. Formula:

RT_ms ~ dnorm(mu, sigma)
mu <- a + (b * TargetIsPresent + (1 - TargetIsPresent) * b2) * NumberDistractors
a ~ dnorm(1000, 500)
b ~ dnorm(0, 100)
b2 ~ dnorm(0, 100)
sigma ~ dunif(0, 2000)

MAP values are reported for a, b, b2, and sigma, along with the log-likelihood. The slightly higher log likelihood indicates a better fit to the sample.

34 WAIC comparison

> compare(VSmodel, VSmodel2)
(output columns: WAIC, pWAIC, dWAIC, weight, SE, dSE, with one row for each model)

The twice-slope model has the smaller WAIC, so it is expected to do a better job of predicting future data than the more general separate-slopes model: the theoretical constraint improves the model. Put another way, the extra parameter of the separate-slopes model causes that model to over fit the sampled data, which hurts prediction of future data.
How confident are we in this conclusion? The Akaike weight is 0.89 for the twice-slope model. Pretty convincing, but maybe we hold out some hope for the more general model. Remember, the WAIC values are estimates, and the SE is pretty big.

35 Generalize competing model
If the final RT is determined (roughly) by how many items you have to examine before finding the target, then:
When the target is present, on average you should find the target after examining half of the searched items.
When the target is absent, you always have to search all the items.
Thus slope(target absent) = 2 x slope(target present).
If this model were correct, then there should also be differences in standard deviations between Target present and Target absent trials:
Target absent trials always have to search all the items (small variability).
Target present trials sometimes search all items, sometimes only one item, and cases in between (high variability).

36 Different SDs
Define the model:

VSmodel3 <- map(
  alist(
    RT_ms ~ dnorm(mu, sigma),
    mu <- a + (b + (1-TargetIsPresent)*b)*NumberDistractors,
    a ~ dnorm(1000, 500),
    b ~ dnorm(0, 100),
    sigma <- c1*TargetIsPresent + c2*(1-TargetIsPresent),
    c1 ~ dunif(0, 2000),
    c2 ~ dunif(0, 2000)
  ),
  data=VSdata2
)

Note: parameter c1 is the sd when the target is present and c2 is the sd when the target is absent.

37 Model results
Maximum a posteriori (MAP) model fit. Formula:

RT_ms ~ dnorm(mu, sigma)
mu <- a + (b + (1 - TargetIsPresent) * b) * NumberDistractors
a ~ dnorm(1000, 500)
b ~ dnorm(0, 100)
sigma <- c1 * TargetIsPresent + c2 * (1 - TargetIsPresent)
c1 ~ dunif(0, 2000)
c2 ~ dunif(0, 2000)

MAP values are reported for a, b, c1, and c2, along with the log-likelihood.

38 Model comparison

Model type                        Log likelihood
Twice slopes, common sd           -308.98
Twice slopes, different sd        -303.88
Different slopes, different sd    -308.64

39 Model comparison WAIC

> compare(VSmodel, VSmodel2, VSmodel3)
(output columns: WAIC, pWAIC, dWAIC, weight, SE, dSE, with one row for each of the three models)

The twice-slope model with separate sd's has the smallest WAIC: the increased flexibility improves the model fit. The constraint on the twice-slope model to have a common sd causes that model to under fit the data.
How confident are we in this conclusion? The Akaike weight is 0.93 for the twice-slope model with separate sd's. Pretty convincing, but maybe we hold out some hope for the more general model. Remember, the WAIC values are estimates, and the SE is pretty big.

40 Generating models
It might be tempting to consider lots of other models:
A model with separate slopes and separate sd's
A model with twice slopes, separate sd's, and separate intercepts
All possibilities?
You should resist this temptation. There is always noise in your data; the (W)AIC analysis hedges against over fitting that noise, but you can undermine it by searching for a model that happens to line up nicely with the particular noise in your sample.
Models to test should be justified by some kind of argument:
Theoretical processes involved in behavior
Previous findings in the literature
Practical implications

41 Model selection
Remember: if you have to pick one model, pick the one with the smallest (W)AIC. But consider whether you really have to pick just one model. If you want to predict performance on a visual search task, you will do somewhat better by merging the predictions of the different models than by choosing only the best model.
It is sometimes wise to just admit that the data do not distinguish between competing models, even if they slightly favor one; this can motivate future work. Do not make a choice unless you have to (such as when you are going to build a robot to search for targets).
Even a clearly best model is only best relative to the set of models you have compared. There may be better models that you have not considered because you did not gather the relevant information.

42 Conclusions Information Criterion
AIC relates the number of parameters to the difference between original and replication log likelihoods.
WAIC generalizes the idea to more complex models (with posterior distributions of parameters).
Information criteria can identify the best model: one that fits the data well (maximizes log likelihood) with few parameters.
Models should be justified.

