Bayesian Framework
- Finding the best model
  - Maximum likelihood
  - Maximum a posteriori
  - Posterior mean estimator
- Minimizing model complexity
  - Ockham's razor
  - Minimum Description Length
- Parametrizing models
Lecture 5, CS567
Anatomy of a model
- Model = parameter scheme + values for the parameters
- Model for a DNA sequence: 4 parameters, one for each character
  - Model M(w1): P(A) = P(T) = P(G) = P(C) = 0.25
  - Model M(w2): P(A) = P(G) = 0.3; P(T) = P(C) = 0.2
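As a minimal illustration (only the two parameter sets come from the slide; the helper function and the sequence "ATGGC" are hypothetical), the same parameter scheme with different values gives two models that assign different likelihoods to the same data:

```python
# A "model" here is a parameter scheme (one probability per DNA character)
# plus concrete values for those parameters: M(w1) and M(w2) from the slide.

def sequence_likelihood(seq, params):
    """P(D | M(w)): product of the per-character probabilities."""
    p = 1.0
    for ch in seq:
        p *= params[ch]
    return p

w1 = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}   # M(w1)
w2 = {"A": 0.30, "G": 0.30, "T": 0.20, "C": 0.20}   # M(w2)

seq = "ATGGC"                                        # hypothetical observed data
print(sequence_likelihood(seq, w1), sequence_likelihood(seq, w2))
```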
Maximum Likelihood
- Likelihood: given a particular model, how likely is it that this data would have been observed? L(M(w_i)) = P(D | M(w_i))
- Maximum likelihood: given a number of candidate models, which one has the highest likelihood, i.e., the maximum value of L(M)?
  w_max = argmax_w P(D | M(w))
Maximum Likelihood
- Example
  - Data: HHHTTT (treated as an ordered sequence, i.e., one particular permutation)
  - Model: binomial with parameter p(H)
  - Parameter set 1: p(H) = 0.5; p(T) = 1 - p(H)
  - Parameter set 2: p(H) = 0.25; p(T) = 1 - p(H)
- Likelihoods:
  - P(D | M(w1)) = (0.5)^3 (0.5)^3 ≈ 0.0156
  - P(D | M(w2)) = (0.25)^3 (0.75)^3 ≈ 0.0066
- Maximum likelihood estimate = M(w1); in fact, L(M(w1)) > L(M(w_i)) for all i ≠ 1
Maximum Likelihood
- Example
  - Data: HTTT (again treated as an ordered sequence)
  - Model: binomial with parameter p(H)
  - Parameter set 1: p(H) = 0.5; p(T) = 1 - p(H)
  - Parameter set 2: p(H) = 0.25; p(T) = 1 - p(H)
- Likelihoods:
  - P(D | M(w1)) = (0.5)(0.5)^3 = 0.0625
  - P(D | M(w2)) = (0.25)(0.75)^3 ≈ 0.1055
- Maximum likelihood estimate = M(w2)! In fact, L(M(w2)) > L(M(w_i)) for all i ≠ 2
- So, is something wrong with this coin?
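A short sketch reproducing both coin examples; only the two candidate values of p(H) come from the slides, the helper function is illustrative:

```python
# Likelihood of an ordered coin-flip sequence under a candidate p(H),
# and the maximum-likelihood choice among the candidate parameter sets.

def likelihood(data, p_heads):
    p = 1.0
    for flip in data:
        p *= p_heads if flip == "H" else (1.0 - p_heads)
    return p

candidates = [0.5, 0.25]                      # parameter sets w1 and w2

for data in ["HHHTTT", "HTTT"]:
    scores = {p: round(likelihood(data, p), 4) for p in candidates}
    p_ml = max(candidates, key=lambda p: likelihood(data, p))
    print(data, scores, "ML choice:", p_ml)
```

The HHHTTT case picks p(H) = 0.5, while the short HTTT sample flips the decision to p(H) = 0.25, matching the numbers above.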
Maximum Likelihood
- The maximum likelihood estimate is unreliable when the data set is small
- The prior is important for dealing with such errors
- As the data sample grows larger (more representative), the maximum likelihood estimate of the parameters tends to the 'true' value
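A quick simulation sketch of this convergence (the true p(H) = 0.5 and the sample sizes are assumptions for illustration): with only a handful of flips the ML estimate can be far off, but it approaches the true value as the sample grows.

```python
import random

random.seed(0)
true_p = 0.5                                  # assumed "true" coin bias

for n in [4, 100, 10_000]:
    flips = [random.random() < true_p for _ in range(n)]
    p_ml = sum(flips) / n                     # ML estimate = observed frequency of heads
    print(n, round(p_ml, 3))
```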
Maximum a posteriori
- Need to factor the prior into the maximum likelihood estimate
- Posterior ∝ (likelihood) × (prior) = P(D | M(w)) P(w | M)
- Maximum a posteriori: w_MAP = argmax_w P(D | M(w)) P(w | M)
- From Bayes' theorem: P(w | M, D) = P(D | M(w)) P(w | M) / P(D | M)
- Since P(D | M) does not depend on w, it does not affect the maximum of the left-hand side, so maximizing the numerator is sufficient to find w_MAP
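A sketch of the MAP estimate for the HTTT example; the Beta(5, 5) prior and the grid of candidate values are illustrative choices, not part of the slide. Because P(D | M) is constant in w, the unnormalized product likelihood × prior is enough to locate the maximum.

```python
# MAP sketch for the coin: posterior ∝ likelihood × prior.
# An (assumed) Beta(5, 5) prior pulls the estimate back toward 0.5.

def likelihood(data, p):
    out = 1.0
    for flip in data:
        out *= p if flip == "H" else (1.0 - p)
    return out

def beta_prior(p, a=5, b=5):
    return p ** (a - 1) * (1 - p) ** (b - 1)   # unnormalized Beta(a, b) density

grid = [i / 100 for i in range(1, 100)]        # candidate values of p(H)
data = "HTTT"

p_ml  = max(grid, key=lambda p: likelihood(data, p))
p_map = max(grid, key=lambda p: likelihood(data, p) * beta_prior(p))
print("ML:", p_ml, "MAP:", p_map)              # MAP lies between the data and the prior
```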
Posterior Mean Estimator
- Instead of using the maximum value, use the expectation of the model parameters under the posterior:
  w_PME = ∫ w P(w | D, M) dw, or in the discrete case w_PME = Σ_{i=1..n} w_i P(w_i | D, M), where n = number of parameter combinations
- Makes sense when there is no clearly optimal choice (no sharp peak in parameter space)
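A sketch of the posterior mean on a discrete grid; the grid, the flat prior, and the reuse of the HTTT data are illustrative assumptions.

```python
# Posterior mean estimator: E[w | D] = sum_i w_i * P(w_i | D) over a grid of
# candidate parameter values, each weighted by its normalized posterior probability.

def likelihood(data, p):
    out = 1.0
    for flip in data:
        out *= p if flip == "H" else (1.0 - p)
    return out

grid = [i / 100 for i in range(1, 100)]        # the n parameter combinations w_i
data = "HTTT"
prior = 1.0 / len(grid)                        # flat prior over the grid

posterior = [likelihood(data, p) * prior for p in grid]
z = sum(posterior)                             # plays the role of P(D | M)
posterior = [q / z for q in posterior]

w_pme = sum(p * q for p, q in zip(grid, posterior))
print(round(w_pme, 3))                         # ~1/3: between the ML value 0.25 and 0.5
```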
Dealing with Model Complexity
- Ockham's razor: "The car is stopping at the cross-walk to let me cross, not to shoot a bullet at me"
- Go for the simplest explanation that matches the facts (probabilistically, of course)
- Introduce priors that penalize complex models, so that simpler models are assigned higher prior probability
- Minimum Description Length (a similar idea): prefer the most economical specification of the model (and of the data under it)
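One way to make the description-length idea concrete is the toy comparison below; the ~8-bit cost assumed for stating a fitted p(H) and the example sequence are illustrative assumptions, not the slides' formulation.

```python
import math

# MDL-flavoured sketch: total cost = bits to describe the model
#                                  + bits to describe the data under that model.

def data_bits(data, p_heads):
    """Shannon code length of the sequence under a given p(H)."""
    return sum(-math.log2(p_heads if f == "H" else 1.0 - p_heads) for f in data)

data = "HTTTHTTTHT" * 3                        # hypothetical 30-flip record
h = data.count("H") / len(data)                # fitted p(H)

uniform_cost = 0 + data_bits(data, 0.5)        # fixed model: no parameter to encode
fitted_cost  = 8 + data_bits(data, h)          # assume ~8 bits to state the fitted p(H)
print(round(uniform_cost, 1), round(fitted_cost, 1))
```

With only 30 mildly skewed flips, the cost of describing the extra parameter outweighs the coding savings, so the simpler uniform model wins; with much more data the fitted model would eventually pay for itself.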
Graphical Models
- Real world = massive network of dependencies
- Model = sparsely connected network (reduction of dimensionality)
- Graph representation: edge = dependency; no edge = independence
- Directed / undirected / mixed (chain independence)
- Goal: factor the graph into clusters of local probabilities
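A minimal sketch of that factoring for a hypothetical three-variable directed graph A → B, A → C (the structure and the numbers are invented for illustration): the joint is a product of small local tables instead of one table over every variable combination.

```python
# Factored joint for the directed graph A -> B, A -> C (binary variables):
# P(A, B, C) = P(A) * P(B | A) * P(C | A)

p_a       = {0: 0.6, 1: 0.4}                                # P(A)
p_b_given = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}      # P(B | A)
p_c_given = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}      # P(C | A)

def joint(a, b, c):
    return p_a[a] * p_b_given[a][b] * p_c_given[a][c]

# Sanity check: the factored joint still sums to 1 over all 2**3 assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)
```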
Graphical Models
- Undirected graphs: Markov networks / Markov random fields, Boltzmann machines
  - Symmetric dependencies
  - Applications: statistical mechanics, image processing
- Directed graphs: Bayesian / belief / causal / influence networks
  - Temporal causality
  - Applications: expert systems, neural networks, hidden Markov models
Graphical Models
- Neighborhood: for a single variable, or for a set of inter-dependent variables (boundary)
- Hidden variables (use the Expectation Maximization algorithm)
- Hierarchy: different time scales / length scales
- Hyperparameters (α): P(w) = ∫ P(w | α) P(α) dα, with prior = P(α); computationally easier
- Mixture / hybrid modeling: P = Σ_{i=1..n} λ_i P_i (a weighted sum of component models)
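Two tiny numerical sketches of the last two bullets; the grids, weights, and component distributions are all illustrative assumptions. The hyperparameter integral is approximated by a sum over a few values of α, and the mixture is a weighted sum of two component models.

```python
# (1) Hyperparameters: P(w) = ∫ P(w | α) P(α) dα, approximated on a small grid of α.
#     Here w is a binary variable with P(w = 1 | α) = α.
alphas  = [0.25, 0.50, 0.75]
p_alpha = [1 / 3, 1 / 3, 1 / 3]                   # flat hyperprior P(α)
p_w1 = sum(a * pa for a, pa in zip(alphas, p_alpha))
print(p_w1)                                       # marginal prior P(w = 1)

# (2) Mixture modelling: P(x) = sum_i λ_i * P_i(x), with weights summing to 1.
def p1(x): return 0.9 if x == "H" else 0.1        # component model 1
def p2(x): return 0.2 if x == "H" else 0.8        # component model 2
weights = [0.7, 0.3]
print(weights[0] * p1("H") + weights[1] * p2("H"))
```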