Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex.

Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex systems are built by combining simpler parts Why have a model? Compact and modular representation of complex systems Ability to execute complex reasoning patterns Make predictions Generalize from particular problem

Probability Theory Probability distribution P over ( , S) is a mapping from events in S such that: P(  )  0 for all  S P(  ) = 1 If ,  S and  = , then P(  )=P(  )+P(  ) Conditional Probability: Chain Rule: Bayes Rule: Conditional Independence:

Random Variables & Notation Random variable: Function from  to a value Categorical / Ordinal / Continuous Val(X) – set of possible values of RV X Upper case letters denote RVs (e.g., X, Y, Z) Upper case bold letters denote set of RVs (e.g., X, Y) Lower case letters denote RV values (e.g., x, y, z) Lower case bold letters denote RV set values (e.g., x) Values for categorical RVs with |Val(X)|=k: x 1,x 2,…,x k Marginal distribution over X: P(X) Conditional independence: X is independent of Y given Z if: 

Expectation Discrete RVs: Continuous RVs: Linearity of expectation: Expectation of products: (when X  Y in P) Independence assumption

Variance Variance of RV: If X and Y are independent: Var[X+Y]=Var[X]+Var[Y] Var[aX+b]=a 2 Var[X]

Information Theory Entropy: We use log base 2 to interpret entropy as bits of information Entropy of X is a lower bound on avg. # of bits to encode values of X 0  H p (X)  log|Val(X)| for any distribution P(X) Conditional entropy: Information only helps: Mutual information: 0  I p (X;Y)  H p (X) Symmetry: I p (X;Y)= I p (Y;X) I p (X;Y)=0 iff X and Y are independent Chain rule of entropies:

Distances Between Distributions Relative Entropy: D(P ॥ Q)  0 D(P ॥ Q)=0 iff P=Q Not a distance metric (no symmetry and triangle inequality) L 1 distance: L 2 distance: L  distance:

Independent Random Variables Two variables X and Y are independent if P(X=x|Y=y) = P(X=x) for all values x,y Equivalently, knowing Y does not change predictions of X If X and Y are independent then: P(X, Y) = P(X|Y)P(Y) = P(X)P(Y) If X 1,…,X n are independent then: P(X 1,…,X n ) = P(X 1 )…P(X n ) O(n) parameters All 2 n probabilities are implicitly defined Cannot represent many types of distributions

Conditional Independence X and Y are conditionally independent given Z if P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z Equivalently, if we know Z, then knowing Y does not change predictions of X Notation: Ind(X;Y | Z) or (X  Y | Z)

Conditional Parameterization S = Score on test, Val(S) = {s 0,s 1 } I = Intelligence, Val(I) = {i 0,i 1 } ISP(I,S) i0i0 s0s0 0.665 i0i0 s1s1 0.035 i1i1 s0s0 0.06 i1i1 s1s1 0.24 S Is0s0 s1s1 i0i0 0.950.05 i1i1 0.20.8 I i0i0 i1i1 0.70.3 P(S|I)P(I)P(I,S) Joint parameterizationConditional parameterization 3 parameters Alternative parameterization: P(S) and P(I|S)

Conditional Parameterization S = Score on test, Val(S) = {s 0,s 1 } I = Intelligence, Val(I) = {i 0,i 1 } G = Grade, Val(G) = {g 0,g 1,g 2 } Assume that G and S are independent given I Joint parameterization 2  2  3=12-1=11 independent parameters Conditional parameterization has P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I) P(I) – 1 independent parameter P(S|I) – 2  1 independent parameters P(G|I) - 2  2 independent parameters 7 independent parameters

Biased Coin Toss Example Coin can land in two positions: Head or Tail Estimation task Given toss examples x[1],...x[m] estimate P(H)=  and P(T)=  1-  Assumption: i.i.d samples Tosses are controlled by an (unknown) parameter  Tosses are sampled from the same distribution Tosses are independent of each other

Biased Coin Toss Example Goal: find  [0,1] that predicts the data well “Predicts the data well” = likelihood of the data given  Example: probability of sequence H,T,T,H,H 00.20.40.60.81  L(D:  )

Maximum Likelihood Estimator Parameter  that maximizes L(D:  ) In our example,  =0.6 maximizes the sequence H,T,T,H,H 00.20.40.60.81  L(D:  )

Maximum Likelihood Estimator General case Observations: M H heads and M T tails Find  maximizing likelihood Equivalent to maximizing log-likelihood Differentiating the log-likelihood and solving for  we get that the maximum likelihood parameter is:

Sufficient Statistics For computing the parameter  of the coin toss example, we only needed M H and M T since  M H and M T are sufficient statistics

Sufficient Statistics A function s(D) is a sufficient statistics from instances to a vector in  k if for any two datasets D and D’ and any  we have Datasets Statistics

Sufficient Statistics for Multinomial A sufficient statistics for a dataset D over a variable Y with k values is the tuple of counts such that M i is the number of times that the Y=y i in D Sufficient statistic Define s(x[i]) as a tuple of dimension k s(x[i])=(0,...0,1,0,...,0) (1,...,i-1)(i+1,...,k)

Sufficient Statistic for Gaussian Gaussian distribution: Rewrite as  sufficient statistics for Gaussian: s(x[i])=

Maximum Likelihood Estimation MLE Principle: Choose  that maximize L(D:  ) Multinomial MLE: Gaussian MLE:

MLE for Bayesian Networks Parameters  x 0,  x 1  y 0| x 0,  y 1| x 0,  y 0| x 1,  y 1| x 1 Data instance tuple: Likelihood X Y Y Xy0y0 y1y1 x0x0 0.950.05 x1x1 0.20.8 X x0x0 x1x1 0.70.3  Likelihood decomposes into two separate terms, one for each variable

MLE for Bayesian Networks Terms further decompose by CPDs: By sufficient statistics where M[x 0,y 0 ] is the number of data instances in which X takes the value x 0 and Y takes the value y 0 MLE

MLE for Bayesian Networks Likelihood for Bayesian network  if  X i |Pa(X i ) are disjoint then MLE can be computed by maximizing each local likelihood separately

MLE for Table CPD BayesNets Multinomial CPD For each value x  X we get an independent multinomial problem where the MLE is

Limitations of MLE Two teams play 10 times, and the first wins 7 of the 10 matches  Probability of first team winning = 0.7 A coin is tosses 10 times, and comes out ‘head’ 7 of the 10 tosses  Probability of head = 0.7 Would you place the same bet on the next game as you would on the next coin toss? We need to incorporate prior knowledge Prior knowledge should only be used as a guide

Bayesian Inference Assumptions Given a fixed  tosses are independent If  is unknown tosses are not marginally independent – each toss tells us something about  The following network captures our assumptions  X[1] X[M]X[2] …

Bayesian Inference Joint probabilistic model Posterior probability over   X[1] X[M]X[2] … Likelihood Prior Normalizing factor For a uniform prior, posterior is the normalized likelihood

Bayesian Prediction Predict the data instance from the previous ones Solve for uniform prior P(  )=1 and binomial variable

Example: Binomial Data Prior: uniform for  in [0,1] P(  ) = 1  P(  |D) is proportional to the likelihood L(D:  ) MLE for P(X=H) is 4/5 = 0.8 Bayesian prediction is 5/7 00.20.40.60.81 (M H,M T ) = (4,1)

Dirichlet Priors A Dirichlet prior is specified by a set of (non-negative) hyperparameters  1,...  k so that  ~ Dirichlet(  1,...  k ) if where and Intuitively, hyperparameters correspond to the number of imaginary examples we saw before starting the experiment

Dirichlet Priors – Example 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 00.20.40.60.81 Dirichlet(1,1) Dirichlet(2,2) Dirichlet(0.5,0.5) Dirichlet(5,5)

Dirichlet Priors Dirichlet priors have the property that the posterior is also Dirichlet Data counts are M 1,...,M k Prior is Dir(  1,...  k )  Posterior is Dir(  1 +M 1,...  k +M k ) The hyperparameters  1,…,  K can be thought of as “imaginary” counts from our prior experience Equivalent sample size =  1 +…+  K The larger the equivalent sample size the more confident we are in our prior

Effect of Priors 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 020406080100 0 0.1 0.2 0.3 0.4 0.5 0.6 020406080100 Different strength  H +  T Fixed ratio  H /  T Fixed strength  H +  T Different ratio  H /  T Prediction of P(X=H) after seeing data with M H =1/4M T as a function of the sample size

Effect of Priors (cont.) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 5101520253035404550 P(X = 1|D) N MLE Dirichlet(.5,.5) Dirichlet(1,1) Dirichlet(5,5) Dirichlet(10,10) N 0 1 Toss Result In real data, Bayesian estimates are less sensitive to noise in the data

General Formulation Joint distribution over D,  Posterior distribution over parameters P(D) is the marginal likelihood of the data As we saw, likelihood can be described compactly using sufficient statistics We want conditions in which posterior is also compact

Conjugate Families A family of priors P(  :  ) is conjugate to a model P(  |  ) if for any possible dataset D of i.i.d samples from P(  |  ) and choice of hyperparameters  for the prior over , there are hyperparameters  ’ that describe the posterior, i.e., P(  :  ’)  P(D|  )P(  :  )  Posterior has the same parametric form as the prior Dirichlet prior is a conjugate family for the multinomial likelihood Conjugate families are useful since: Many distributions can be represented with hyperparameters They allow for sequential update within the same representation In many cases we have closed-form solutions for prediction

Parameter Estimation Summary Estimation relies on sufficient statistics For multinomials these are of the form M[x i,pa i ] Parameter estimation Bayesian methods also require choice of priors MLE and Bayesian are asymptotically equivalent Both can be implemented in an online manner by accumulating sufficient statistics MLE Bayesian (Dirichlet)

This Week’s Assignment Compute P(S) Decompose as a Markov model of order k Collect sufficient statistics Use ratio to genome background Evaluation & Deliverable Test set likelihood ratio to random locations & sequences ROC analysis (ranking)

Hidden Markov Model Special case of Dynamic Bayesian network Single (hidden) state variable Single (observed) observation variable Transition probability P(S’|S) assumed to be sparse Usually encoded by a state transition graph SS’ O’ GG G0G0 Unrolled network S0S0 O0O0 S0S0 S1S1 O1O1 S2S2 O2O2 S3S3 O3O3

Hidden Markov Model Special case of Dynamic Bayesian network Single (hidden) state variable Single (observed) observation variable Transition probability P(S’|S) assumed to be sparse Usually encoded by a state transition graph S1S1 S2S2 S3S3 S4S4 s1s1 s2s2 s3s3 s4s4 s1s1 0.20.800 s2s2 0010 s3s3 0.4000.6 s4s4 00.50 P(S’|S) State transition representation

Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex.

Similar presentations

Presentation on theme: "Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex.

Similar presentations

Presentation on theme: "Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex."— Presentation transcript:

Similar presentations

About project

Feedback