Information criterion
Greg Francis. PSY 626: Bayesian Statistics for Psychological Science, Fall 2018, Purdue University.
Likelihood. Given a model (e.g., N(0, 1)), one can ask how likely a set of observed data outcomes is. This is similar to probability, but not quite the same, because in a continuous distribution each specific point has probability zero. We multiply the probability density (the height of the model probability density function) at each observed data outcome. Smaller likelihoods indicate that the data are less likely (less probable), given the model. [Figure: observations X1 to X4 plotted under the N(0, 1) density.]
Likelihood. The calculation is just multiplication of probability densities:
> x <- c(-1.1, -0.3, 0.2, 1.0)
> dnorm(x, mean=0, sd=1)
[1] 0.2178522 0.3813878 0.3910427 0.2419707
> prod(dnorm(x, mean=0, sd=1))
[1] 0.007861686
Likelihood. It should be obvious that adding data to a set makes the set less likely. It is always more probable that I draw a red king from a deck of cards than that I draw a red king and then a black queen from a deck of cards.
> x <- c(-1.1, -0.3, 0.2, 1.0, 0)
> prod(dnorm(x, mean=0, sd=1))
[1] 0.003136359
In general, larger data sets have smaller likelihood for a given model, but this partly depends on the properties of the data sets and of the model.
Log Likelihood. The values of likelihood can become so small that they cause problems: smaller than the smallest possible number in a computer. We therefore often compute the (natural) log likelihood:
> x <- c(-1.1, -0.3, 0.2, 1.0, 0)
> sum(dnorm(x, mean=0, sd=1, log=TRUE))
[1] -5.764693
Smaller (more negative) values indicate smaller likelihood.
Maximum (log) Likelihood
Given a model of the form N(mu, 1), what value of mu maximizes the likelihood of the data above? Comparing candidate values such as 0, 0.5, -0.5, and 0.05, it is the sample mean that maximizes the (log) likelihood.
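A minimal R sketch of this comparison, assuming the same five observations used above (the candidate grid is illustrative):
x <- c(-1.1, -0.3, 0.2, 1.0, 0)
candidates <- c(-0.5, 0, 0.05, 0.5, mean(x))
ll <- sapply(candidates, function(m) sum(dnorm(x, mean = m, sd = 1, log = TRUE)))
round(rbind(mu = candidates, logLik = ll), 3)
# the log likelihood is largest when mu equals the sample mean, mean(x) = -0.04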
Maximum (log) Likelihood
Given a model of the form N(mu, sigma), what pair of parameters (mu, sigma) maximizes the likelihood? Candidates pair mu = -0.05 with the sample standard deviation, with the sample sd computed using the "population formula" (n rather than n - 1 in the denominator), and with fixed sigma values of 0.5 and 1; the pair (0, 1) is also included. The pair using the population-formula sd gives the largest (log) likelihood. Note: the true population values do not maximize the likelihood for a given sample [over fitting].
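A hedged sketch, again assuming the five observations above, showing that the maximum likelihood estimate of sigma uses the "population formula" (divide by n) rather than the usual sample formula (divide by n - 1):
x <- c(-1.1, -0.3, 0.2, 1.0, 0)
sd_ml <- sqrt(mean((x - mean(x))^2))                   # population formula: divide by n
c(sample_sd = sd(x), ml_sd = sd_ml)
sum(dnorm(x, mean = mean(x), sd = sd(x), log = TRUE))  # log likelihood using the sample sd
sum(dnorm(x, mean = mean(x), sd = sd_ml, log = TRUE))  # larger: the maximum likelihood fit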
Predicting (log) Likelihood
Suppose you identify the parameters of a model (e.g., N(mu, sigma)) that maximize the likelihood for a data set X (n = 50). Now you gather a new data set Y (n = 50) and use the same (mu, sigma) values to estimate the likelihood of the new data set. Repeat the process. [Table: log likelihood of X, predicted log likelihood of Y, and their difference for each repetition.] Average difference = 1.54.
Try again. Suppose you identify the parameters (e.g., N(mu, sigma)) that maximize the likelihood for a data set X (n = 50). Now you gather a new data set Y (n = 50) and use the (mu, sigma) values to estimate the likelihood of the new data set. [Table: log likelihood of X, predicted log likelihood of Y, and their difference for each repetition.] Average difference = 2.67.
Really large sample. Suppose you identify the parameters (e.g., N(mu, sigma)) that maximize the likelihood for a data set X (n = 50). Now you gather a new data set Y (n = 50) and use the (mu, sigma) values to estimate the likelihood of the new data set. Do this for many (100,000) simulated experiments and one finds, on average, a difference close to 2, the number of estimated parameters (2.14 in the summary table below).
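A sketch of this kind of simulation; the exact settings used for the slides are not shown, so the seed, sampling distribution, and number of replications here are illustrative:
set.seed(1)                                   # illustrative
diffs <- replicate(10000, {
  X <- rnorm(50)                              # original data set
  Y <- rnorm(50)                              # replication data set
  mu <- mean(X)                               # maximum likelihood estimates from X
  sig <- sqrt(mean((X - mean(X))^2))
  llX <- sum(dnorm(X, mu, sig, log = TRUE))   # log likelihood of X
  llY <- sum(dnorm(Y, mu, sig, log = TRUE))   # predicted log likelihood of Y
  llX - llY
})
mean(diffs)                                   # close to 2, the number of fitted parameters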
Different case. Suppose you identify the parameters (e.g., N(mu1, sigma), N(mu2, sigma)) that maximize the likelihood for a data set X1, X2 (n1 = n2 = 50). Now you gather a new data set Y1, Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma) values to estimate the likelihood of the new data set. [Table: log likelihood, predicted log likelihood, and difference for each repetition.] Average difference = 2.68.
Try again. Suppose you identify the parameters (e.g., N(mu1, sigma), N(mu2, sigma)) that maximize the likelihood for a data set X1, X2 (n1 = n2 = 50). Now you gather a new data set Y1, Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma) values to estimate the likelihood of the new data set. [Table: log likelihood, predicted log likelihood, and difference for each repetition.] Average difference = 3.24.
Really large sample. Suppose you identify the parameters (e.g., N(mu1, sigma), N(mu2, sigma)) that maximize the likelihood for a data set X1, X2 (n1 = n2 = 50). Now you gather a new data set Y1, Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma) values to estimate the likelihood of the new data set. Do this for many (100,000) simulated experiments and one finds, on average, a difference close to 3, the number of estimated parameters (3.20 in the summary table below).
Still different case. Suppose you identify the parameters (e.g., N(mu1, sigma1), N(mu2, sigma2)) that maximize the likelihood for a data set X1, X2 (n1 = n2 = 50). Now you gather a new data set Y1, Y2 (n1 = n2 = 50) and use the (mu1, mu2, sigma1, sigma2) values to estimate the likelihood of the new data set. Do this for many (100,000) simulated experiments and one finds, on average, a difference close to 4, the number of estimated parameters (4.25 in the summary table below).
Number of parameters. The difference of the log likelihoods is approximately equal to the number of independent model parameters!
Average difference of log likelihoods (original - replication), by number of model parameters:
2 (1 common mean, 1 common sd): 2.14
3 (2 means, 1 common sd): 3.20
3 (1 common mean, 2 sd): 3.24
4 (2 means, 2 sd): 4.25
Over fitting. This means that, on average, the log likelihood of the original data set, calculated at the maximum likelihood parameter values, is bigger than it should be: it is a biased estimate of the log likelihood. The log likelihood of the replication data set, calculated at those same parameter values, is (on average) accurate: it is an unbiased estimate of the log likelihood. Thus we know that, on average, using the maximum likelihood estimates for the parameters will "over fit" the data set. We can make a better estimate of the predicted likelihood of the original data set by adjusting for the (average) bias. Note that we only need to know how many (K) independent parameters are in the model; we do not actually need the replication data set!
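In symbols: if log L(theta-hat) is the maximized log likelihood on the original data and K is the number of independent parameters, then log L(theta-hat) - K is an (approximately) unbiased estimate of the predicted log likelihood for a replication data set.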
AIC. This is one way of deriving the Akaike Information Criterion: multiply everything by -2. More generally, AIC is an estimate of how much relative information is lost by using a model rather than (unknown) reality. You can compare models by calculating AIC for each model (relative to a fixed data set) and choosing the model with the smaller AIC value. (And learn how to pronounce Akaike's name.)
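Written out: AIC = -2 [log L(theta-hat) - K] = 2K - 2 log L(theta-hat), so a smaller AIC corresponds to a larger bias-corrected estimate of the predicted log likelihood.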
Model comparison. Sample X and Y from N(0, 1) and N(0.5, 1) with n1 = n2 = 50, and compute the AIC for each model. [Table: AIC values on Samples 1 to 4 for each of four models: 2 parameters (1 common mean, 1 common sd), 3 parameters (2 means, 1 common sd), 3 parameters (1 common mean, 2 sd), 4 parameters (2 means, 2 sd).]
Model comparison. If you have to pick one model, pick the one with the smallest AIC. [Table: the same AIC values for Samples 1 to 4 and the four models.]
Model comparison. How much confidence should you have in your choice? Two steps (next slides). [Table: the same AIC values for Samples 1 to 4 and the four models.]
Akaike weights. How much confidence should you have in your choice? Two steps. 1) Compute the difference of AIC (ΔAIC) relative to the value of the best model.
ΔAIC values (some entries missing):
2 (1 common mean, 1 common sd): 15.966, 5.2087, 5.0791, 1.7132 (Samples 1 to 4)
3 (2 means, 1 common sd): 1.7209
3 (1 common mean, 2 sd): 4.6525, 4.6258
4 (2 means, 2 sd): 1.9951, 1.3493, 1.4715, 1.9353 (Samples 1 to 4)
Akaike weights. How much confidence should you have in your choice? Two steps: 1) compute the difference of AIC relative to the value of the best model; 2) rescale the differences to be probabilities, the Akaike weights wi (a small R sketch follows below). [Table: Akaike weights wi for each model in Samples 1 to 4.]
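A minimal sketch of the two steps in R; the AIC values here are illustrative, not the ones from the table:
aic <- c(m1 = 104.2, m2 = 100.1, m3 = 101.8, m4 = 102.0)   # illustrative AIC values
delta <- aic - min(aic)                       # step 1: differences from the best model
w <- exp(-delta / 2) / sum(exp(-delta / 2))   # step 2: rescale to probabilities (Akaike weights)
round(w, 3)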
Akaike weights. These are estimated probabilities that the given model will make the best predictions (of likelihood) on new data, conditional on the set of models being considered. [Table: Akaike weights wi for each model in Samples 1 to 4.]
Generalization. AIC counts the number of independent parameters in the model; this is an estimate of the "flexibility" of the model to fit data. AIC works with the single "best" (maximum likelihood) set of parameters. In contrast, our Bayesian analyses consider a distribution of parameter values rather than a single "best" set. We can compute the log likelihood for each set of parameter values in the posterior distribution and average across them all (weighting by probability density). This gives the WAIC (Widely Applicable Information Criterion). It is easy with the brms library:
WAIC(model1)
and it is easy to compare models:
> WAIC(model1, model2, model3)
The output lists the WAIC and its standard error (SE) for model1, model2, and model3, followed by the pairwise differences (model1 - model2, model1 - model3, model2 - model3) and their standard errors.
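For reference, the usual definition computes WAIC from the S posterior draws theta_s as WAIC = -2 (lppd - pWAIC), where lppd = sum_i log[ (1/S) sum_s p(y_i | theta_s) ] and the effective number of parameters pWAIC = sum_i Var_s[ log p(y_i | theta_s) ] plays the role that K plays in AIC.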
Akaike weights. Easy with the brms library:
> model_weights(model1, model2, model3, weights="waic")
The weights are printed in scientific notation (values on the order of e-02). If you don't like the scientific notation, try:
> options(scipen=999)
> model_weights(model1, model2, model3, weights="waic")
The weights strongly favor model2, compared to the other choices here.
loo vs. AIC. They are estimates of the same kind of thing: model complexity minus log likelihood. Asymptotically, AIC is an approximation of loo (leave-one-out cross-validation). AIC (and WAIC) is often easier and faster to compute than loo, but in recent years new methods for computing loo have been developed, so some people recommend using loo whenever possible. You can compute the same kind of probability weights with loo:
> model_weights(model1, model2, model3, weights="loo")
This returns one weight for each of model1, model2, and model3, just as with WAIC.
Example: Visual Search
Typical results: For conjunctive distractors, response time increases with the number of distractors
Visual Search. Let's consider the effect of the number of distractors and of the target being present or absent. There are many different models we could fit. First restrict the data to one participant and the conjunction condition:
VSdata2 <- subset(VSdata, VSdata$Participant=="Francis200S16-2" & VSdata$DistractorType=="Conjunction")
# Null model: no difference in slopes or intercepts
model1 = brm(RT_ms ~ NumberDistractors, data = VSdata2, iter = 2000, warmup = 200, chains = 3, thin = 2 )
# print out summary of model
print(summary(model1))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: RT_ms ~ NumberDistractors
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; total post-warmup samples = 2700
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  Intercept
  NumberDistractors
Family Specific Parameters:
  sigma
Visual Search. Let's consider the effect of the number of distractors and of the target being present or absent. A model with a common intercept but different slopes for each condition:
# Model with a common intercept but different slopes for each condition
model2 <- brm(RT_ms ~ Target:NumberDistractors, data = VSdata2, iter = 2000, warmup = 200, chains = 3)
# print out summary of model
print(summary(model2))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: RT_ms ~ Target:NumberDistractors
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 1; total post-warmup samples = 5400
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  Intercept
  TargetAbsent:NumberDistractors
  TargetPresent:NumberDistractors
Family Specific Parameters:
  sigma
Visual Search. Let's consider the effect of the number of distractors and of the target being present or absent. A model with different slopes and different intercepts:
# Model with different slopes and intercepts for each condition
model3 <- brm(RT_ms ~ Target*NumberDistractors, data = VSdata2, iter = 2000, warmup = 200, chains = 3)
# print out summary of model
print(summary(model3))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: RT_ms ~ Target * NumberDistractors
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 1; total post-warmup samples = 5400
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  Intercept
  TargetPresent
  NumberDistractors
  TargetPresent:NumberDistractors
Family Specific Parameters:
  sigma
Visual Search. Let's consider the effect of the number of distractors and of the target being present or absent. A model with different slopes and different intercepts, using an exgaussian rather than a gaussian response distribution:
# Build a model using an exgaussian rather than a gaussian (better for response times)
model4 <- brm(RT_ms ~ Target*NumberDistractors, family = exgaussian(link = "identity"), data = VSdata2, iter = 8000, warmup = 2000, chains = 3)
# print out summary of model
print(summary(model4))
Family: exgaussian
Links: mu = identity; sigma = identity; beta = identity
Formula: RT_ms ~ Target * NumberDistractors
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  Intercept
  TargetPresent
  NumberDistractors
  TargetPresent:NumberDistractors
Family Specific Parameters:
  sigma
  beta
Visual Search Compare models
Compare the models:
> model_weights(model1, model2, model3, model4, weights="waic")
> model_weights(model1, model2, model3, model4, weights="loo")
The weights favor model2: common intercept, different slopes. But the data are not overwhelmingly in support of model2; model3 and model4 might be viable too.
Serial search model. A popular theory about reaction times in visual search is that they are the result of a serial process: examine an item and judge whether it is the target (green circle); if it is the target, then stop; if not, pick a new item and repeat. The final RT is determined (roughly) by how many items you have to examine before finding the target. When the target is present, on average you should find it after examining half of the searched items. When the target is absent, you always have to search all the items. Thus slope(target absent) = 2 x slope(target present). Is this a better model than just estimating each slope separately? This twice-slope model has less flexibility than the separate-slopes model.
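One way to see the 2:1 prediction, assuming each examined item adds roughly a constant time t to the response: on target-absent trials all N items are examined, so RT is about a + t N (slope t); on target-present trials the target is equally likely to be found at any of the N positions, so on average (N + 1)/2 items are examined and RT is about a + t (N + 1)/2 (slope t/2). Hence slope(target absent) = 2 x slope(target present).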
Visual Search We have to tweak brms a bit to define this kind of model
There might be better ways to do what I am about to do. Define a dummy variable with value 0 if the target is absent and 1 if the target is present:
VSdata2$TargetIsPresent <- ifelse(VSdata2$Target=="Present", 1, 0)
We use the brm non-linear formula syntax (even though we actually define a linear model):
# A model where the slope for target absent is twice that for target present
model5 <- brm(bf( RT_ms ~ a1 + b1*TargetIsPresent*NumberDistractors + 2*b1*(1-TargetIsPresent)*NumberDistractors, a1 ~ 1, b1 ~ 1, nl=TRUE), family = gaussian(), data = VSdata2, prior= c( prior(normal(0, 2000), nlpar="a1"), prior(normal(0, 300), nlpar="b1") ), iter = 8000, warmup = 2000, chains = 3)
print(summary(model5))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: RT_ms ~ a1 + b1 * TargetIsPresent * NumberDistractors + 2 * b1 * (1 - TargetIsPresent) * NumberDistractors
  a1 ~ 1
  b1 ~ 1
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  a1_Intercept
  b1_Intercept
Family Specific Parameters:
  sigma
Visual Search. Compare model2 (common intercept, different slopes) and model5 (common intercept, twice-slope):
> WAIC(model2, model5)
> model_weights(model2, model5, weights="waic")
> loo(model2, model5)
> model_weights(model2, model5, weights="loo")
The output reports the WAIC (or LOOIC) and SE for each model and for the model2 - model5 difference, followed by the weights for model2 and model5.
Visual Search Remember, the WAIC / loo values are estimates
> WAIC(model2, model5)
> model_weights(model2, model5, weights="waic")
The twice-slope model has a smaller WAIC / loo, so it is expected to do a better job predicting future data than the more general separate-slopes model. The theoretical constraint improves the model. Put another way, the extra parameter of the separate-slopes model causes that model to over fit the sampled data, which hurts prediction of future data. How confident are we in this conclusion? The Akaike weight is 0.83 for the twice-slope model: pretty convincing, but maybe we hold out some hope for the more general model. Remember, the WAIC / loo values are estimates.
Visual Search. Are we confident that the twice-slope model is appropriate? Surely something like 1.9 x slope would work too? Why not put a prior on the multiplier?
# A model where the slope for target absent is some multiple of the slope for target present (prior centered on 2)
model6 <- brm(bf( RT_ms ~ a1 + b1*TargetIsPresent*NumberDistractors + b2*b1*(1-TargetIsPresent)*NumberDistractors, a1 ~ 1, b1 ~ 1, b2 ~ 1, nl=TRUE), family = gaussian(), data = VSdata2, prior= c( prior(normal(0, 2000), nlpar="a1"), prior(normal(0, 300), nlpar="b1"), prior(normal(2, 1), nlpar="b2")), iter = 8000, warmup = 2000, chains = 3)
print(summary(model6))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: RT_ms ~ a1 + b1 * TargetIsPresent * NumberDistractors + b2 * b1 * (1 - TargetIsPresent) * NumberDistractors
  a1 ~ 1
  b1 ~ 1
  b2 ~ 1
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  a1_Intercept
  b1_Intercept
  b2_Intercept
Family Specific Parameters:
  sigma
Visual Search. Compare model5 (common intercept, twice-slope) and model6 (common intercept, variable slope multiplier):
> WAIC(model5, model6)
> model_weights(model5, model6, weights="waic")
> loo(model5, model6)
> model_weights(model5, model6, weights="loo")
The comparison favors model5 (the more specific slope constraint).
Generalize serial search model
If the final RT is determined (roughly) by how many items you have to examine before finding the target, then: when the target is present, on average you should find it after examining half of the searched items; when the target is absent, you always have to search all the items; thus slope(target absent) = 2 x slope(target present). If this model were correct, then there should also be differences in standard deviations for target-present and target-absent trials. Target-absent trials always have to search all the items (small variability). Target-present trials sometimes search all the items, sometimes only one item, and everything in between (high variability).
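The same serial process makes the variance prediction concrete, under the assumptions above: on target-absent trials the number of items examined is always N, so the search process itself contributes no variability, while on target-present trials the number examined is (roughly) uniform over 1, ..., N, which has variance (N^2 - 1)/12, so variability should grow with the number of distractors on target-present trials.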
Visual Search We have to tweak brms a bit to define this kind of model
As before, there might be better ways to do what I am about to do. We again use the dummy variable TargetIsPresent (0 if the target is absent, 1 if present) and the brm non-linear formula syntax (even though we actually define a linear model), but now also include a formula for sigma:
# A model where the slope for target absent is twice that for target present, and different variances for target conditions
model7 <- brm(bf( RT_ms ~ a1 + b1*TargetIsPresent*NumberDistractors + 2*b1*(1-TargetIsPresent)*NumberDistractors, a1 ~ 1, b1 ~ 1, sigma ~ TargetIsPresent, nl=TRUE), family = gaussian(), data = VSdata2, prior= c( prior(normal(0, 2000), nlpar="a1"), prior(normal(0, 300), nlpar="b1") ), iter = 8000, warmup = 2000, chains = 3)
print(summary(model7))
Family: gaussian
Links: mu = identity; sigma = log
Formula: RT_ms ~ a1 + b1 * TargetIsPresent * NumberDistractors + 2 * b1 * (1 - TargetIsPresent) * NumberDistractors
  a1 ~ 1
  b1 ~ 1
  sigma ~ TargetIsPresent
Data: VSdata2 (Number of observations: 40)
Samples: 3 chains, each with iter = 8000; warmup = 2000; thin = 1; total post-warmup samples = 18000
Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
  sigma_Intercept
  a1_Intercept
  b1_Intercept
  sigma_TargetIsPresent
Model comparison Remember, the WAIC / loo values are estimates
> WAIC(model5, model7)
> model_weights(model5, model7, weights="waic")
> loo(model5, model7)
> model_weights(model5, model7, weights="loo")
The twice-slope model with separate sd's has the smallest WAIC / loo. The increased flexibility improves the model fit; the constraint on the original twice-slope model to have a common sd causes that model to under fit the data. How confident are we in this conclusion? The Akaike weight is 0.98 for the twice-slope model with separate sd's: pretty convincing, but maybe we hold out some hope for the other models. Remember, the WAIC / loo values are estimates.
Compare them all!
> WAIC(model1, model2, model3, model4, model5, model6, model7)
> loo(model1, model2, model3, model4, model5, model6, model7)
> model_weights(model1, model2, model3, model4, model5, model6, model7, weights="waic")
> model_weights(model1, model2, model3, model4, model5, model6, model7, weights="loo")
The WAIC and loo output lists the WAIC (or LOOIC) and SE for each of the seven models and for every pairwise difference between models; model_weights returns one weight per model.
Generating models. It might be tempting to consider lots of other models: a model with separate slopes and separate sd's; a model with twice slopes, separate sd's, and separate intercepts; all possibilities; different specific multipliers (1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, ...). You should resist this temptation. There is always noise in your data; the (W)AIC / loo analysis hedges against over fitting that noise, but you can undermine it by searching for a model that happens to line up nicely with the particular noise in your sample. Models to test should be justified by some kind of argument: theoretical processes involved in behavior, previous findings in the literature, or practical implications.
Model selection. Remember: if you have to pick one model, pick the one with the smallest (W)AIC / loo. But consider whether you really have to pick just one model. If you want to predict performance on a visual search task, you will do somewhat better by merging the predictions of the different models instead of just choosing the best model (a sketch of how to do this in brms follows below). It is sometimes wise to just admit that the data do not distinguish between competing models, even if they slightly favor one; this can motivate future work. Do not make a choice unless you have to, such as when you are going to build a robot to search for targets. Even a clearly best model is only best relative to the set of models you have compared; there may be better models that you have not even considered because you did not gather the relevant information.
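If you do want to merge predictions rather than pick a single model, brms can average posterior predictions using the model weights. A hedged sketch, assuming a reasonably recent brms version that provides pp_average():
# Average the posterior predictions of competing models, weighted by their loo weights
avg_pred <- pp_average(model2, model5, model7, weights = "loo")
head(avg_pred)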
Conclusions: Information criterion. AIC relates the number of parameters to the difference between original and replication log likelihoods. WAIC generalizes the idea to more complex models (with posterior distributions over parameters); it approximates loo but is easier and faster to calculate. These criteria can identify the best model: one that fits the data (maximizes log likelihood) with few parameters. Models should be justified. A more specific model is better if it matches the data as well as a more general model.