Probabilistic Graphical Models

Presentation transcript:

Probabilistic Graphical Models
- A tool for representing complex systems and performing sophisticated reasoning tasks.
- Fundamental notion: modularity. Complex systems are built by combining simpler parts.
- Why have a model? Compact and modular representation of complex systems; ability to execute complex reasoning patterns; make predictions; generalize from particular problems.

Probability Theory
- A probability distribution P over (Ω, S) is a mapping from events in S such that: P(α) ≥ 0 for all α ∈ S; P(Ω) = 1; and if α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β).
- Conditional probability: P(α | β) = P(α ∩ β) / P(β).
- Chain rule: P(α ∩ β) = P(β) P(α | β).
- Bayes rule: P(α | β) = P(β | α) P(α) / P(β).
- Conditional independence of α and β given γ: P(α | β ∩ γ) = P(α | γ).
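As a quick numerical check of these identities, here is a minimal Python sketch (not from the slides) that verifies the chain rule and Bayes rule on a made-up 2x2 joint distribution; the probability values are illustrative only.

```python
# Toy joint distribution over two binary events A and B (illustrative numbers).
P = {("a0", "b0"): 0.30, ("a0", "b1"): 0.20,
     ("a1", "b0"): 0.10, ("a1", "b1"): 0.40}

P_a1 = sum(p for (a, b), p in P.items() if a == "a1")   # marginal P(a1)
P_b1 = sum(p for (a, b), p in P.items() if b == "b1")   # marginal P(b1)
P_a1_given_b1 = P[("a1", "b1")] / P_b1                  # conditional P(a1 | b1)
P_b1_given_a1 = P[("a1", "b1")] / P_a1                  # conditional P(b1 | a1)

# Chain rule: P(a1, b1) = P(b1) * P(a1 | b1)
assert abs(P[("a1", "b1")] - P_b1 * P_a1_given_b1) < 1e-12
# Bayes rule: P(a1 | b1) = P(b1 | a1) * P(a1) / P(b1)
assert abs(P_a1_given_b1 - P_b1_given_a1 * P_a1 / P_b1) < 1e-12
```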

Random Variables & Notation
- Random variable: a function from Ω to a value; categorical / ordinal / continuous.
- Val(X): the set of possible values of RV X.
- Upper case letters denote RVs (e.g., X, Y, Z); upper case bold letters denote sets of RVs (e.g., X, Y).
- Lower case letters denote RV values (e.g., x, y, z); lower case bold letters denote RV set values (e.g., x).
- Values for a categorical RV with |Val(X)| = k: x_1, x_2, …, x_k.
- Marginal distribution over X: P(X).
- Conditional independence: X is independent of Y given Z if P(X | Y, Z) = P(X | Z).

Expectation
- Discrete RVs: E[X] = Σ_x x·P(x). Continuous RVs: E[X] = ∫ x·p(x) dx.
- Linearity of expectation: E[aX + b] = a·E[X] + b and E[X + Y] = E[X] + E[Y].
- Expectation of products: E[X·Y] = E[X]·E[Y] when X ⊥ Y in P (independence assumption).

Variance
- Variance of an RV: Var[X] = E[(X − E[X])^2] = E[X^2] − (E[X])^2.
- If X and Y are independent: Var[X + Y] = Var[X] + Var[Y].
- Var[aX + b] = a^2 · Var[X].

Information Theory
- Entropy: H_P(X) = −Σ_x P(x) log P(x). We use log base 2 to interpret entropy as bits of information.
- The entropy of X is a lower bound on the average number of bits needed to encode values of X.
- 0 ≤ H_P(X) ≤ log |Val(X)| for any distribution P(X).
- Conditional entropy: H_P(X | Y) = H_P(X, Y) − H_P(Y).
- Information only helps: H_P(X | Y) ≤ H_P(X).
- Mutual information: I_P(X; Y) = H_P(X) − H_P(X | Y), with 0 ≤ I_P(X; Y) ≤ H_P(X).
- Symmetry: I_P(X; Y) = I_P(Y; X); I_P(X; Y) = 0 iff X and Y are independent.
- Chain rule of entropies: H_P(X, Y) = H_P(X) + H_P(Y | X).
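A minimal sketch (not from the slides) computing these quantities for an illustrative joint distribution over two binary variables; the probabilities are made up for the example.

```python
import math

def entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative joint distribution P(X, Y) over binary X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px = [sum(p for (x, y), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (x, y), p in joint.items() if y == v) for v in (0, 1)]

H_X = entropy(px)
H_XY = entropy(list(joint.values()))
H_X_given_Y = H_XY - entropy(py)      # chain rule: H(X,Y) = H(Y) + H(X|Y)
I_XY = H_X - H_X_given_Y              # mutual information

print(f"H(X) = {H_X:.3f} bits, H(X|Y) = {H_X_given_Y:.3f}, I(X;Y) = {I_XY:.3f}")
assert H_X_given_Y <= H_X + 1e-12     # "information only helps"
```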

Distances Between Distributions
- Relative entropy: D(P‖Q) = Σ_x P(x) log(P(x)/Q(x)); D(P‖Q) ≥ 0, and D(P‖Q) = 0 iff P = Q.
- Relative entropy is not a distance metric (no symmetry, no triangle inequality).
- L_1 distance: Σ_x |P(x) − Q(x)|.
- L_2 distance: sqrt(Σ_x (P(x) − Q(x))^2).
- L_∞ distance: max_x |P(x) − Q(x)|.
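The following sketch (not from the slides) computes these four quantities for two small illustrative distributions and shows that relative entropy is not symmetric.

```python
import math

def kl(p, q):
    """Relative entropy D(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # illustrative distributions over three values
q = [0.4, 0.4, 0.2]

l1   = sum(abs(pi - qi) for pi, qi in zip(p, q))
l2   = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
linf = max(abs(pi - qi) for pi, qi in zip(p, q))

print(f"D(P||Q) = {kl(p, q):.4f}, D(Q||P) = {kl(q, p):.4f}")   # not symmetric
print(f"L1 = {l1:.3f}, L2 = {l2:.3f}, Linf = {linf:.3f}")
```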

Independent Random Variables
- Two variables X and Y are independent if P(X=x | Y=y) = P(X=x) for all values x, y. Equivalently, knowing Y does not change predictions of X.
- If X and Y are independent then P(X, Y) = P(X|Y) P(Y) = P(X) P(Y).
- If X_1, …, X_n are independent then P(X_1, …, X_n) = P(X_1)…P(X_n): O(n) parameters, and all 2^n probabilities are implicitly defined.
- Cannot represent many types of distributions.

Conditional Independence
- X and Y are conditionally independent given Z if P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all values x, y, z.
- Equivalently, if we know Z, then knowing Y does not change predictions of X.
- Notation: Ind(X; Y | Z) or (X ⊥ Y | Z).

Conditional Parameterization
- S = score on test, Val(S) = {s0, s1}; I = intelligence, Val(I) = {i0, i1}.
- Joint parameterization: a table P(I, S) over the four combinations (i0,s0), (i0,s1), (i1,s0), (i1,s1): 3 independent parameters.
- Conditional parameterization: P(I, S) = P(I) P(S | I), specified by P(I) and P(S | I): also 3 independent parameters.
- Alternative parameterization: P(S) and P(I | S).

Conditional Parameterization
- S = score on test, Val(S) = {s0, s1}; I = intelligence, Val(I) = {i0, i1}; G = grade, Val(G) = {g0, g1, g2}.
- Assume that G and S are independent given I.
- Joint parameterization: 2 × 2 × 3 = 12 entries, i.e., 12 − 1 = 11 independent parameters.
- Conditional parameterization: P(I, S, G) = P(I) P(S|I) P(G|I, S) = P(I) P(S|I) P(G|I).
- P(I): 1 independent parameter; P(S|I): 2 × 1 = 2 independent parameters; P(G|I): 2 × 2 = 4 independent parameters.
- Total: 1 + 2 + 4 = 7 independent parameters.

Biased Coin Toss Example
- The coin can land in two positions: head or tail.
- Estimation task: given toss examples x[1], …, x[M], estimate P(H) = θ and P(T) = 1 − θ.
- Assumption: i.i.d. samples. Tosses are controlled by an (unknown) parameter θ, are sampled from the same distribution, and are independent of each other.

Biased Coin Toss Example
- Goal: find θ ∈ [0, 1] that predicts the data well.
- "Predicts the data well" = likelihood of the data given θ: L(D : θ) = P(D | θ).
- Example: the probability of the sequence H,T,T,H,H is L(D : θ) = θ·(1−θ)·(1−θ)·θ·θ = θ^3 (1−θ)^2.

Maximum Likelihood Estimator
- The MLE is the parameter θ that maximizes L(D : θ).
- In our example, θ = 0.6 maximizes the likelihood of the sequence H,T,T,H,H (3 heads out of 5 tosses).

Maximum Likelihood Estimator
- General case. Observations: M_H heads and M_T tails.
- Find θ maximizing the likelihood L(D : θ) = θ^{M_H} (1 − θ)^{M_T}.
- Equivalent to maximizing the log-likelihood: log L(D : θ) = M_H log θ + M_T log(1 − θ).
- Differentiating the log-likelihood and solving for θ, the maximum likelihood parameter is θ = M_H / (M_H + M_T).
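A minimal Python sketch of this estimator (not from the slides): the closed-form MLE for the H,T,T,H,H sequence, plus a grid search confirming that 0.6 maximizes the log-likelihood.

```python
import math
from collections import Counter

def coin_mle(tosses):
    """Closed-form MLE for a biased coin: theta = M_H / (M_H + M_T)."""
    c = Counter(tosses)
    return c["H"] / (c["H"] + c["T"])

def log_likelihood(theta, m_h, m_t):
    """Log-likelihood M_H*log(theta) + M_T*log(1 - theta)."""
    return m_h * math.log(theta) + m_t * math.log(1 - theta)

# The sequence from the slides: H,T,T,H,H  ->  theta = 3/5 = 0.6
print(coin_mle(["H", "T", "T", "H", "H"]))

# Numerical check that 0.6 maximizes the log-likelihood on a fine grid.
grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=lambda t: log_likelihood(t, 3, 2)))   # ~0.6
```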

Sufficient Statistics
- For computing the parameter θ of the coin toss example, we only needed M_H and M_T, since L(D : θ) = θ^{M_H} (1 − θ)^{M_T} depends on the data only through these counts.
- M_H and M_T are sufficient statistics.

Sufficient Statistics
- A function s(D) from instances to a vector in R^k is a sufficient statistic if, for any two datasets D and D' and any θ: s(D) = s(D') implies L(D : θ) = L(D' : θ).

Sufficient Statistics for Multinomial
- A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts ⟨M_1, …, M_k⟩, where M_i is the number of times that Y = y_i in D.
- Per-instance statistic: define s(x[m]) as a tuple of dimension k, s(x[m]) = (0, …, 0, 1, 0, …, 0), with the 1 in position i when the observed value is y_i (positions 1, …, i−1 and i+1, …, k are 0); then s(D) = Σ_m s(x[m]).
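A small sketch (not from the slides) of these count statistics; the dataset and value names are illustrative, and the final assert checks that the dataset statistic is the sum of the per-instance one-hot statistics.

```python
def multinomial_sufficient_stats(data, values):
    """Count vector (M_1, ..., M_k): M_i = number of times values[i] occurs in data."""
    return tuple(sum(1 for x in data if x == v) for v in values)

def indicator(x, values):
    """Per-instance statistic s(x[m]): a one-hot vector of dimension k."""
    return tuple(1 if x == v else 0 for v in values)

data = ["a", "b", "a", "c", "a", "b"]   # illustrative dataset
values = ["a", "b", "c"]

counts = multinomial_sufficient_stats(data, values)
print(counts)                            # (3, 2, 1)

# s(D) equals the sum of the per-instance one-hot statistics.
summed = tuple(sum(col) for col in zip(*(indicator(x, values) for x in data)))
assert summed == counts
```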

Sufficient Statistic for Gaussian
- Gaussian distribution: p(x) = (1 / sqrt(2πσ^2)) exp(−(x − μ)^2 / (2σ^2)).
- Rewriting the exponent as a function of x and x^2 shows that the sufficient statistics for the Gaussian are s(x[m]) = ⟨1, x[m], x[m]^2⟩.

Maximum Likelihood Estimation
- MLE principle: choose θ that maximizes L(D : θ).
- Multinomial MLE: θ_i = M_i / Σ_j M_j.
- Gaussian MLE: μ = (1/M) Σ_m x[m], σ = sqrt((1/M) Σ_m (x[m] − μ)^2).
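A minimal sketch (not from the slides) of both estimators, computed from the sufficient statistics described above; the input data are made up.

```python
import math

def multinomial_mle(counts):
    """theta_i = M_i / sum_j M_j."""
    total = sum(counts)
    return [m / total for m in counts]

def gaussian_mle(xs):
    """mu = (1/M) sum x[m]; sigma^2 = (1/M) sum (x[m]-mu)^2,
    both computable from the sufficient statistics (M, sum x, sum x^2)."""
    M = len(xs)
    s1 = sum(xs)                    # sum of x[m]
    s2 = sum(x * x for x in xs)     # sum of x[m]^2
    mu = s1 / M
    var = s2 / M - mu ** 2
    return mu, math.sqrt(var)

print(multinomial_mle([3, 2, 1]))            # [0.5, 0.333..., 0.166...]
print(gaussian_mle([1.0, 2.0, 2.5, 4.5]))    # (2.5, ~1.27)
```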

MLE for Bayesian Networks
- Example network: X → Y, with parameters θ_{x0}, θ_{x1}, θ_{y0|x0}, θ_{y1|x0}, θ_{y0|x1}, θ_{y1|x1} (the CPD tables P(X) and P(Y|X)).
- Each data instance is a tuple ⟨x[m], y[m]⟩.
- Likelihood: L(D : θ) = Π_m P(x[m], y[m] : θ) = Π_m P(x[m] : θ) P(y[m] | x[m] : θ) = (Π_m P(x[m] : θ)) · (Π_m P(y[m] | x[m] : θ)).
- The likelihood decomposes into two separate terms, one for each variable.

MLE for Bayesian Networks
- The terms further decompose by CPDs: Π_m P(y[m] | x[m] : θ) = Π_{m: x[m]=x0} P(y[m] | x0 : θ) · Π_{m: x[m]=x1} P(y[m] | x1 : θ).
- By sufficient statistics: Π_{m: x[m]=x0} P(y[m] | x0 : θ) = θ_{y0|x0}^{M[x0,y0]} · θ_{y1|x0}^{M[x0,y1]}, where M[x0,y0] is the number of data instances in which X takes the value x0 and Y takes the value y0.
- MLE: θ_{y0|x0} = M[x0,y0] / (M[x0,y0] + M[x0,y1]).

MLE for Bayesian Networks
- Likelihood for a Bayesian network: L(D : θ) = Π_i L_i(D : θ_{Xi|Pa(Xi)}).
- If the parameter sets θ_{Xi|Pa(Xi)} are disjoint, then the MLE can be computed by maximizing each local likelihood separately.

MLE for Table CPD BayesNets
- Multinomial CPD: for each value x ∈ Val(X) of the parents we get an independent multinomial problem, where the MLE is θ_{y|x} = M[x, y] / M[x].
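A minimal sketch (not from the slides) of this count-based estimator for the two-variable network X → Y; the dataset and value labels are illustrative.

```python
from collections import Counter

def table_cpd_mle(data):
    """MLE for P(Y|X) in the network X -> Y from complete data.
    data is a list of (x, y) pairs; returns {x: {y: theta_{y|x}}}."""
    joint = Counter(data)                     # M[x, y]
    parent = Counter(x for x, _ in data)      # M[x]
    cpd = {}
    for (x, y), m in joint.items():
        cpd.setdefault(x, {})[y] = m / parent[x]
    return cpd

# Illustrative dataset of (x, y) instances.
D = [("x0", "y0"), ("x0", "y0"), ("x0", "y1"),
     ("x1", "y1"), ("x1", "y1"), ("x1", "y0")]
print(table_cpd_mle(D))
# {'x0': {'y0': 0.67, 'y1': 0.33}, 'x1': {'y1': 0.67, 'y0': 0.33}} (approximately)
```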

Limitations of MLE
- Two teams play 10 times, and the first wins 7 of the 10 matches, so the estimated probability of the first team winning is 0.7.
- A coin is tossed 10 times and comes out heads in 7 of the 10 tosses, so the estimated probability of heads is 0.7.
- Would you place the same bet on the next game as you would on the next coin toss?
- We need to incorporate prior knowledge, but prior knowledge should only be used as a guide.

Bayesian Inference
- Assumptions: given a fixed θ, the tosses are independent.
- If θ is unknown, the tosses are not marginally independent: each toss tells us something about θ.
- The following network captures our assumptions: θ is a parent of each of X[1], X[2], …, X[M].

Bayesian Inference
- Joint probabilistic model: P(θ, x[1], …, x[M]) = P(x[1], …, x[M] | θ) P(θ).
- Posterior probability over θ: P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) P(θ) / P(x[1], …, x[M]), i.e., likelihood times prior, divided by a normalizing factor.
- For a uniform prior, the posterior is the normalized likelihood.

Bayesian Prediction
- Predict the next data instance from the previous ones: P(x[M+1] | x[1], …, x[M]) = ∫ P(x[M+1] | θ) P(θ | x[1], …, x[M]) dθ.
- Solving for a uniform prior P(θ) = 1 and a binomial variable gives P(X[M+1] = H | D) = (M_H + 1) / (M_H + M_T + 2).

Example: Binomial Data
- Prior: uniform for θ in [0, 1], P(θ) = 1, so P(θ | D) is proportional to the likelihood L(D : θ).
- With (M_H, M_T) = (4, 1): the MLE for P(X = H) is 4/5 = 0.8, while the Bayesian prediction is (4 + 1) / (5 + 2) = 5/7 ≈ 0.71.
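A two-line sketch (not from the slides) comparing the MLE with the uniform-prior Bayesian prediction on the (M_H, M_T) = (4, 1) example.

```python
def mle_head(m_h, m_t):
    """MLE for P(X=H)."""
    return m_h / (m_h + m_t)

def laplace_prediction(m_h, m_t):
    """P(X[M+1]=H | D) under a uniform prior on theta."""
    return (m_h + 1) / (m_h + m_t + 2)

print(mle_head(4, 1))             # 0.8
print(laplace_prediction(4, 1))   # 5/7 ~= 0.714
```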

Dirichlet Priors
- A Dirichlet prior is specified by a set of non-negative hyperparameters α_1, …, α_k, so that θ ~ Dirichlet(α_1, …, α_k) if P(θ) ∝ Π_i θ_i^{α_i − 1}, where Σ_i θ_i = 1.
- Intuitively, the hyperparameters correspond to the number of imaginary examples we saw before starting the experiment.

Dirichlet Priors – Example
- Plots of the density for Dirichlet(1,1), Dirichlet(2,2), Dirichlet(0.5,0.5), and Dirichlet(5,5).

Dirichlet Priors
- Dirichlet priors have the property that the posterior is also Dirichlet: if the data counts are M_1, …, M_k and the prior is Dir(α_1, …, α_k), then the posterior is Dir(α_1 + M_1, …, α_k + M_k).
- The hyperparameters α_1, …, α_k can be thought of as "imaginary" counts from our prior experience.
- Equivalent sample size = α_1 + … + α_k. The larger the equivalent sample size, the more confident we are in our prior.
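A minimal sketch (not from the slides) of the Dirichlet posterior update and the resulting predictive probabilities, applied to the coin example with a Dirichlet(1,1) prior and counts (4, 1).

```python
def dirichlet_posterior(alphas, counts):
    """Posterior hyperparameters: Dir(alpha_1 + M_1, ..., alpha_k + M_k)."""
    return [a + m for a, m in zip(alphas, counts)]

def predictive(alphas):
    """Predictive probability of each value: alpha_i / sum_j alpha_j."""
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0]        # Dirichlet(1,1): uniform prior, equivalent sample size 2
counts = [4, 1]           # M_H = 4, M_T = 1
post = dirichlet_posterior(prior, counts)
print(post)               # [5.0, 2.0]
print(predictive(post))   # [0.714..., 0.285...]: matches the 5/7 prediction above
```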

Effect of Priors
- Plots: prediction of P(X = H) after seeing data with M_H = (1/4) M_T, as a function of the sample size.
- One plot varies the strength α_H + α_T with a fixed ratio α_H / α_T; the other fixes the strength α_H + α_T and varies the ratio α_H / α_T.

Effect of Priors (cont.)
- Plot: P(X = 1 | D) as a function of N on a sequence of toss results, comparing the MLE with Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors.
- In real data, Bayesian estimates are less sensitive to noise in the data.
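A small simulation sketch (not from the slides, with an assumed ground-truth bias of 0.2) that reproduces the qualitative point: running estimates under a stronger Dirichlet prior move more slowly and are less affected by early noise than the MLE.

```python
import random

random.seed(0)
true_theta = 0.2                               # assumed ground-truth bias
tosses = [1 if random.random() < true_theta else 0 for _ in range(30)]

def running_estimates(tosses, alpha_h, alpha_t):
    """Running estimate of P(X=1) with a Dirichlet(alpha_h, alpha_t) prior."""
    m_h = m_t = 0
    out = []
    for x in tosses:
        m_h, m_t = m_h + x, m_t + (1 - x)
        out.append((m_h + alpha_h) / (m_h + m_t + alpha_h + alpha_t))
    return out

mle   = running_estimates(tosses, 0.0, 0.0)    # alpha = 0 recovers the MLE
dir55 = running_estimates(tosses, 5.0, 5.0)
for n in (1, 5, 10, 30):
    print(n, round(mle[n - 1], 3), round(dir55[n - 1], 3))
# The Dirichlet(5,5) estimate changes more slowly than the MLE early on.
```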

General Formulation
- Joint distribution over D, θ: P(D, θ) = P(D | θ) P(θ).
- Posterior distribution over parameters: P(θ | D) = P(D | θ) P(θ) / P(D), where P(D) is the marginal likelihood of the data.
- As we saw, the likelihood can be described compactly using sufficient statistics.
- We want conditions under which the posterior is also compact.

Conjugate Families
- A family of priors P(θ : α) is conjugate to a model P(ξ | θ) if, for any possible dataset D of i.i.d. samples from P(ξ | θ) and any choice of hyperparameters α for the prior over θ, there are hyperparameters α' that describe the posterior, i.e., P(θ : α') ∝ P(D | θ) P(θ : α).
- The posterior then has the same parametric form as the prior.
- The Dirichlet prior is a conjugate family for the multinomial likelihood.
- Conjugate families are useful since: many distributions can be represented with hyperparameters; they allow sequential updates within the same representation; and in many cases we have closed-form solutions for prediction.

Parameter Estimation Summary
- Estimation relies on sufficient statistics; for multinomials these are the counts M[x_i, pa_i].
- Parameter estimation: MLE gives θ_{x|pa} = M[x, pa] / M[pa]; the Bayesian (Dirichlet) estimate gives θ_{x|pa} = (M[x, pa] + α_{x|pa}) / (M[pa] + Σ_{x'} α_{x'|pa}).
- Bayesian methods also require a choice of priors.
- MLE and Bayesian estimates are asymptotically equivalent.
- Both can be implemented in an online manner by accumulating sufficient statistics.

This Week's Assignment
- Compute P(S): decompose it as a Markov model of order k, collect sufficient statistics, and use the ratio to the genome background (a sketch of this decomposition follows below).
- Evaluation & deliverable: test-set likelihood ratio against random locations and sequences; ROC analysis (ranking).
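A rough sketch of the kind of order-k Markov scoring the assignment describes, not a reference solution: the toy sequences, the pseudocount smoothing (alpha), and the helper names are all illustrative assumptions rather than part of the assignment.

```python
import math
from collections import defaultdict

def markov_counts(seqs, k):
    """Sufficient statistics for an order-k Markov model:
    counts of (k-mer context, next symbol) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts

def log_prob(seq, counts, k, alphabet="ACGT", alpha=1.0):
    """Log P(S) under the order-k model, with Dirichlet pseudocounts alpha
    (the smoothing is an assumption made for this sketch)."""
    lp = 0.0
    for i in range(k, len(seq)):
        ctx, x = seq[i - k:i], seq[i]
        m = counts.get(ctx, {})
        lp += math.log((m.get(x, 0) + alpha) / (sum(m.values()) + alpha * len(alphabet)))
    return lp

# Hypothetical toy data standing in for the real training set and genome background.
positives  = ["ACGTACGT", "ACGAACGT"]
background = ["GGGGCCCC", "TTTTAAAA"]
fg = markov_counts(positives, k=2)
bg = markov_counts(background, k=2)
test = "ACGTACGA"
print(log_prob(test, fg, 2) - log_prob(test, bg, 2))   # log-likelihood ratio score
```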

Hidden Markov Model
- A special case of Dynamic Bayesian network: a single (hidden) state variable and a single (observed) observation variable.
- The transition probability P(S'|S) is assumed to be sparse, and is usually encoded by a state transition graph.
- Figures: the two-slice template network (S → S', S' → O') and the unrolled network S0 → S1 → S2 → S3 with observations O0, O1, O2, O3.

Hidden Markov Model
- A special case of Dynamic Bayesian network: a single (hidden) state variable and a single (observed) observation variable.
- The transition probability P(S'|S) is assumed to be sparse, and is usually encoded by a state transition graph.
- Figure: state transition representation, showing a graph over states s1, s2, s3, s4 and the corresponding sparse transition matrix P(S'|S).
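A minimal sketch (not from the slides) of how a sparse P(S'|S) can be stored as a dictionary of nonzero transitions and used in the forward recursion over the unrolled network; all state names, probabilities, and emissions are made up for illustration.

```python
# Sparse transition structure: transitions not listed have probability 0.
trans = {"s1": {"s1": 0.7, "s2": 0.3},
         "s2": {"s2": 0.6, "s3": 0.4},
         "s3": {"s3": 0.5, "s1": 0.5}}
emit  = {"s1": {"a": 0.9, "b": 0.1},
         "s2": {"a": 0.2, "b": 0.8},
         "s3": {"a": 0.5, "b": 0.5}}
init  = {"s1": 1.0, "s2": 0.0, "s3": 0.0}

def forward(obs):
    """P(O_1, ..., O_T) by the forward algorithm on the unrolled network."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in init}
    for o in obs[1:]:
        alpha = {s2: sum(alpha[s1] * trans[s1].get(s2, 0.0) for s1 in alpha) * emit[s2][o]
                 for s2 in init}
    return sum(alpha.values())

print(forward(["a", "b", "b", "a"]))
```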