Download presentation
Presentation is loading. Please wait.
1
Probabilistic Graphical Models Tool for representing complex systems and performing sophisticated reasoning tasks Fundamental notion: Modularity Complex systems are built by combining simpler parts Why have a model? Compact and modular representation of complex systems Ability to execute complex reasoning patterns Make predictions Generalize from particular problem
2
Probability Theory Probability distribution P over ( , S) is a mapping from events in S such that: P( ) 0 for all S P( ) = 1 If , S and = , then P( )=P( )+P( ) Conditional Probability: Chain Rule: Bayes Rule: Conditional Independence:
3
Random Variables & Notation Random variable: Function from to a value Categorical / Ordinal / Continuous Val(X) – set of possible values of RV X Upper case letters denote RVs (e.g., X, Y, Z) Upper case bold letters denote set of RVs (e.g., X, Y) Lower case letters denote RV values (e.g., x, y, z) Lower case bold letters denote RV set values (e.g., x) Values for categorical RVs with |Val(X)|=k: x 1,x 2,…,x k Marginal distribution over X: P(X) Conditional independence: X is independent of Y given Z if:
4
Expectation Discrete RVs: Continuous RVs: Linearity of expectation: Expectation of products: (when X Y in P) Independence assumption
5
Variance Variance of RV: If X and Y are independent: Var[X+Y]=Var[X]+Var[Y] Var[aX+b]=a 2 Var[X]
6
Information Theory Entropy: We use log base 2 to interpret entropy as bits of information Entropy of X is a lower bound on avg. # of bits to encode values of X 0 H p (X) log|Val(X)| for any distribution P(X) Conditional entropy: Information only helps: Mutual information: 0 I p (X;Y) H p (X) Symmetry: I p (X;Y)= I p (Y;X) I p (X;Y)=0 iff X and Y are independent Chain rule of entropies:
7
Distances Between Distributions Relative Entropy: D(P ॥ Q) 0 D(P ॥ Q)=0 iff P=Q Not a distance metric (no symmetry and triangle inequality) L 1 distance: L 2 distance: L distance:
8
Independent Random Variables Two variables X and Y are independent if P(X=x|Y=y) = P(X=x) for all values x,y Equivalently, knowing Y does not change predictions of X If X and Y are independent then: P(X, Y) = P(X|Y)P(Y) = P(X)P(Y) If X 1,…,X n are independent then: P(X 1,…,X n ) = P(X 1 )…P(X n ) O(n) parameters All 2 n probabilities are implicitly defined Cannot represent many types of distributions
9
Conditional Independence X and Y are conditionally independent given Z if P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z Equivalently, if we know Z, then knowing Y does not change predictions of X Notation: Ind(X;Y | Z) or (X Y | Z)
10
Conditional Parameterization S = Score on test, Val(S) = {s 0,s 1 } I = Intelligence, Val(I) = {i 0,i 1 } ISP(I,S) i0i0 s0s0 0.665 i0i0 s1s1 0.035 i1i1 s0s0 0.06 i1i1 s1s1 0.24 S Is0s0 s1s1 i0i0 0.950.05 i1i1 0.20.8 I i0i0 i1i1 0.70.3 P(S|I)P(I)P(I,S) Joint parameterizationConditional parameterization 3 parameters Alternative parameterization: P(S) and P(I|S)
11
Conditional Parameterization S = Score on test, Val(S) = {s 0,s 1 } I = Intelligence, Val(I) = {i 0,i 1 } G = Grade, Val(G) = {g 0,g 1,g 2 } Assume that G and S are independent given I Joint parameterization 2 2 3=12-1=11 independent parameters Conditional parameterization has P(I,S,G) = P(I)P(S|I)P(G|I,S) = P(I)P(S|I)P(G|I) P(I) – 1 independent parameter P(S|I) – 2 1 independent parameters P(G|I) - 2 2 independent parameters 7 independent parameters
12
Biased Coin Toss Example Coin can land in two positions: Head or Tail Estimation task Given toss examples x[1],...x[m] estimate P(H)= and P(T)= 1- Assumption: i.i.d samples Tosses are controlled by an (unknown) parameter Tosses are sampled from the same distribution Tosses are independent of each other
13
Biased Coin Toss Example Goal: find [0,1] that predicts the data well “Predicts the data well” = likelihood of the data given Example: probability of sequence H,T,T,H,H 00.20.40.60.81 L(D: )
14
Maximum Likelihood Estimator Parameter that maximizes L(D: ) In our example, =0.6 maximizes the sequence H,T,T,H,H 00.20.40.60.81 L(D: )
15
Maximum Likelihood Estimator General case Observations: M H heads and M T tails Find maximizing likelihood Equivalent to maximizing log-likelihood Differentiating the log-likelihood and solving for we get that the maximum likelihood parameter is:
16
Sufficient Statistics For computing the parameter of the coin toss example, we only needed M H and M T since M H and M T are sufficient statistics
17
Sufficient Statistics A function s(D) is a sufficient statistics from instances to a vector in k if for any two datasets D and D’ and any we have Datasets Statistics
18
Sufficient Statistics for Multinomial A sufficient statistics for a dataset D over a variable Y with k values is the tuple of counts such that M i is the number of times that the Y=y i in D Sufficient statistic Define s(x[i]) as a tuple of dimension k s(x[i])=(0,...0,1,0,...,0) (1,...,i-1)(i+1,...,k)
19
Sufficient Statistic for Gaussian Gaussian distribution: Rewrite as sufficient statistics for Gaussian: s(x[i])=
20
Maximum Likelihood Estimation MLE Principle: Choose that maximize L(D: ) Multinomial MLE: Gaussian MLE:
21
MLE for Bayesian Networks Parameters x 0, x 1 y 0| x 0, y 1| x 0, y 0| x 1, y 1| x 1 Data instance tuple: Likelihood X Y Y Xy0y0 y1y1 x0x0 0.950.05 x1x1 0.20.8 X x0x0 x1x1 0.70.3 Likelihood decomposes into two separate terms, one for each variable
22
MLE for Bayesian Networks Terms further decompose by CPDs: By sufficient statistics where M[x 0,y 0 ] is the number of data instances in which X takes the value x 0 and Y takes the value y 0 MLE
23
MLE for Bayesian Networks Likelihood for Bayesian network if X i |Pa(X i ) are disjoint then MLE can be computed by maximizing each local likelihood separately
24
MLE for Table CPD BayesNets Multinomial CPD For each value x X we get an independent multinomial problem where the MLE is
25
Limitations of MLE Two teams play 10 times, and the first wins 7 of the 10 matches Probability of first team winning = 0.7 A coin is tosses 10 times, and comes out ‘head’ 7 of the 10 tosses Probability of head = 0.7 Would you place the same bet on the next game as you would on the next coin toss? We need to incorporate prior knowledge Prior knowledge should only be used as a guide
26
Bayesian Inference Assumptions Given a fixed tosses are independent If is unknown tosses are not marginally independent – each toss tells us something about The following network captures our assumptions X[1] X[M]X[2] …
27
Bayesian Inference Joint probabilistic model Posterior probability over X[1] X[M]X[2] … Likelihood Prior Normalizing factor For a uniform prior, posterior is the normalized likelihood
28
Bayesian Prediction Predict the data instance from the previous ones Solve for uniform prior P( )=1 and binomial variable
29
Example: Binomial Data Prior: uniform for in [0,1] P( ) = 1 P( |D) is proportional to the likelihood L(D: ) MLE for P(X=H) is 4/5 = 0.8 Bayesian prediction is 5/7 00.20.40.60.81 (M H,M T ) = (4,1)
30
Dirichlet Priors A Dirichlet prior is specified by a set of (non-negative) hyperparameters 1,... k so that ~ Dirichlet( 1,... k ) if where and Intuitively, hyperparameters correspond to the number of imaginary examples we saw before starting the experiment
31
Dirichlet Priors – Example 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 00.20.40.60.81 Dirichlet(1,1) Dirichlet(2,2) Dirichlet(0.5,0.5) Dirichlet(5,5)
32
Dirichlet Priors Dirichlet priors have the property that the posterior is also Dirichlet Data counts are M 1,...,M k Prior is Dir( 1,... k ) Posterior is Dir( 1 +M 1,... k +M k ) The hyperparameters 1,…, K can be thought of as “imaginary” counts from our prior experience Equivalent sample size = 1 +…+ K The larger the equivalent sample size the more confident we are in our prior
33
Effect of Priors 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 020406080100 0 0.1 0.2 0.3 0.4 0.5 0.6 020406080100 Different strength H + T Fixed ratio H / T Fixed strength H + T Different ratio H / T Prediction of P(X=H) after seeing data with M H =1/4M T as a function of the sample size
34
Effect of Priors (cont.) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 5101520253035404550 P(X = 1|D) N MLE Dirichlet(.5,.5) Dirichlet(1,1) Dirichlet(5,5) Dirichlet(10,10) N 0 1 Toss Result In real data, Bayesian estimates are less sensitive to noise in the data
35
General Formulation Joint distribution over D, Posterior distribution over parameters P(D) is the marginal likelihood of the data As we saw, likelihood can be described compactly using sufficient statistics We want conditions in which posterior is also compact
36
Conjugate Families A family of priors P( : ) is conjugate to a model P( | ) if for any possible dataset D of i.i.d samples from P( | ) and choice of hyperparameters for the prior over , there are hyperparameters ’ that describe the posterior, i.e., P( : ’) P(D| )P( : ) Posterior has the same parametric form as the prior Dirichlet prior is a conjugate family for the multinomial likelihood Conjugate families are useful since: Many distributions can be represented with hyperparameters They allow for sequential update within the same representation In many cases we have closed-form solutions for prediction
37
Parameter Estimation Summary Estimation relies on sufficient statistics For multinomials these are of the form M[x i,pa i ] Parameter estimation Bayesian methods also require choice of priors MLE and Bayesian are asymptotically equivalent Both can be implemented in an online manner by accumulating sufficient statistics MLE Bayesian (Dirichlet)
38
This Week’s Assignment Compute P(S) Decompose as a Markov model of order k Collect sufficient statistics Use ratio to genome background Evaluation & Deliverable Test set likelihood ratio to random locations & sequences ROC analysis (ranking)
39
Hidden Markov Model Special case of Dynamic Bayesian network Single (hidden) state variable Single (observed) observation variable Transition probability P(S’|S) assumed to be sparse Usually encoded by a state transition graph SS’ O’ GG G0G0 Unrolled network S0S0 O0O0 S0S0 S1S1 O1O1 S2S2 O2O2 S3S3 O3O3
40
Hidden Markov Model Special case of Dynamic Bayesian network Single (hidden) state variable Single (observed) observation variable Transition probability P(S’|S) assumed to be sparse Usually encoded by a state transition graph S1S1 S2S2 S3S3 S4S4 s1s1 s2s2 s3s3 s4s4 s1s1 0.20.800 s2s2 0010 s3s3 0.4000.6 s4s4 00.50 P(S’|S) State transition representation
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.