Intro to comp genomics Lecture 3-4: Examples, Approximate Inference
Example 1: Mixtures of Gaussians
We have experimental results of some value, and we want to describe the behavior of the experimental values: essentially one behavior? Two behaviors? More?
In one dimension it may look very easy: just looking at the distribution will give us a good idea.
We can formulate the model probabilistically as a mixture of normal distributions.
As a generative model: to generate data from the model, we first select the sub-model by sampling from the mixture variable, and then generate a value using the selected normal distribution (see the sketch below).
If the data is multi-dimensional, the problem becomes nontrivial.
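A minimal generative sketch in Python (NumPy assumed; the parameter values are the ones quoted on the next slide, and the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, p=(0.2, 0.8), mu=(0.0, 1.0), sigma=(1.0, 0.2)):
    """Generative process: pick a component s ~ p, then x ~ N(mu[s], sigma[s])."""
    s = rng.choice(len(p), size=n, p=p)                 # hidden mixture variable
    x = rng.normal(np.take(mu, s), np.take(sigma, s))   # emission from the chosen Gaussian
    return s, x

s, x = sample_mixture(1000)   # s is hidden in real data; we observe only x
```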
Inference is trivial
Let's represent the model as a hidden mixture variable s and an observed value x.
What is the inference problem in our model? Inference: computing the posterior probability of a hidden variable given the data and the model parameters.
For p0 = 0.2, p1 = 0.8, μ0 = 0, μ1 = 1, σ0 = 1, σ1 = 0.2, what is Pr(s = 0 | x = 0.8)?
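The answer follows from Bayes' rule: Pr(s = 0 | x) = p0·N(x; μ0, σ0) / Σk pk·N(x; μk, σk). A quick check in Python:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

p, mu, sigma, x = (0.2, 0.8), (0.0, 1.0), (1.0, 0.2), 0.8
joint = [p[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
print(joint[0] / sum(joint))   # ~0.056: the narrow component at mu=1 explains 0.8 far better
```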
Estimation/parameter learning
Given data, how can we estimate the model parameters? Transform it into an optimization problem!
Likelihood: a function of the parameters, defined given the data.
Find parameters that maximize the likelihood: the ML problem.
Can be approached heuristically, using any optimization technique. But it is a nonlinear problem which may be very difficult.
Generic optimization techniques: gradient ascent (iteratively follow the gradient of the likelihood), simulated annealing, genetic algorithms, and more.
The EM algorithm for mixtures: inference allows learning
We start by guessing parameters.
We then go over the samples and compute their posteriors (i.e., inference).
We use the posteriors to compute new estimates for the expected sufficient statistics of each distribution, and for the mixture coefficients.
Continue iterating until convergence (see the sketch below).
The EM theorem: the algorithm will converge and will improve the likelihood monotonically.
But: no guarantee of finding the optimum, or of finding anything meaningful.
The initial conditions are critical: think of starting from μ0 = 0, μ1 = 10, σ0 = σ1 = 1.
Solutions: start from "reasonable" solutions; try many starting points.
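A compact EM sketch for a one-dimensional two-component mixture (NumPy assumed; the initialization and iteration count are our own arbitrary choices, not part of the slides):

```python
import numpy as np

def em_gmm(x, n_iter=100):
    # Crude initialization; in practice, try several starting points.
    p = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = p * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: expected sufficient statistics give the new parameter estimates
        n_k = resp.sum(axis=0)
        p = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    return p, mu, sigma
```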
Example 2: Mixture of sequence models
A probabilistic model for binding sites: a position weight matrix (PWM) w. This is the site-independent model, defining a probability space over k-mers.
Assume a set of sequences contains unknown binding sites (one per sequence). The position of the binding site is a hidden variable h.
We introduce a background model P_b that describes the sequence outside of the binding site (usually a d-order Markov model).
Given complete data we can write down the likelihood of a sequence s of length L with a site at position h as:
Pr(s, h) = Pr(h) · P_b(s_1..s_{h-1}) · Π_{i=1..k} w_i(s_{h+i-1}) · P_b(s_{h+k}..s_L)
Inference of the binding site location posterior: Pr(h | s) ∝ Pr(s | h). Note that only k factors need to be computed for each location (the shared background term P_b(s) is constant).
One hidden variable = trivial inference (see the sketch below).
If we assume some of the sequences may lack a binding site, this should be incorporated into the model (diagram: a "hit" indicator variable alongside the location l and the sequence s). This is sometimes called the ZOOPS model (Zero Or One Occurrence Per Sequence).
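A sketch of the location posterior for a single sequence (NumPy assumed; for simplicity the background is 0-order, and names like `site_posterior` are ours):

```python
import numpy as np

def site_posterior(seq, pwm, bg):
    """Posterior over the site start h, assuming exactly one site per sequence.

    pwm: k x 4 matrix of per-position nucleotide probabilities
    bg:  length-4 background distribution (0-order background for simplicity)
    """
    idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    s = np.array([idx[c] for c in seq])
    k = len(pwm)
    # The background factors outside the window cancel, so each location is
    # scored by its likelihood ratio: prod_i pwm[i, s[h+i]] / bg[s[h+i]]
    scores = np.array([
        np.prod(pwm[np.arange(k), s[h:h + k]] / bg[s[h:h + k]])
        for h in range(len(s) - k + 1)
    ])
    return scores / scores.sum()   # uniform prior over start positions
```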
Hidden Markov Models
We observe only the emissions of the states into some probability space E. Each state x is equipped with an emission distribution Pr(e | x) (x a state, e an emission).
(diagram: the state space with transition arrows, and the emission space E)
Caution! This state diagram is NOT the HMM Bayes net:
1. It has cycles.
2. Its states are NOT random variables!
Example 3: Mixture with "memory"
We sample a sequence of dependent values: at each step, we decide whether to continue sampling from the same distribution or to switch, with probability p (diagram: two states A and B).
We can compute the probability directly only given the hidden variables; P(x) is derived by summing over all possible combinations of hidden variables. This is another form of the inference problem (why?).
There is an exponential number of h assignments; can we still solve the problem efficiently?
Inference in HMM
Forward formula: F(i, s) = Pr(e_1..e_i, state_i = s) = Σ_{s'} F(i-1, s') · Pr(s' → s) · Pr(e_i | s)
Backward formula: B(i, s) = Pr(e_{i+1}..e_n | state_i = s) = Σ_{s'} Pr(s → s') · Pr(e_{i+1} | s') · B(i+1, s')
(diagram: the trellis of states and emissions, from Start to Finish)
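The two recursions in runnable form (NumPy assumed; `trans[s_prev, s_next]` is Pr(s' → s), and for long sequences one would rescale or work in log space to avoid underflow):

```python
import numpy as np

def forward_backward(obs, start, trans, emit):
    """obs: observation indices; start[s]; trans[s_prev, s_next]; emit[s, o]."""
    n, S = len(obs), len(start)
    F = np.zeros((n, S))
    B = np.zeros((n, S))
    F[0] = start * emit[:, obs[0]]
    for i in range(1, n):                       # forward pass
        F[i] = (F[i - 1] @ trans) * emit[:, obs[i]]
    B[-1] = 1.0
    for i in range(n - 2, -1, -1):              # backward pass
        B[i] = trans @ (emit[:, obs[i + 1]] * B[i + 1])
    total = F[-1].sum()                         # Pr(observations)
    post = F * B / total                        # post[i, s] = Pr(state_i = s | obs)
    return F, B, post
```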
Computing posteriors:
The posterior probability of emitting the i'th character from state s: Pr(state_i = s | e) = F(i, s) · B(i, s) / Pr(e).
The posterior probability of a transition from s' to s after character i: Pr(state_i = s', state_{i+1} = s | e) = F(i, s') · Pr(s' → s) · Pr(e_{i+1} | s) · B(i+1, s) / Pr(e).
Example 4: Hidden states
Two Markov models describe our data, and switching between the models occurs at random. How to model this?
(diagram: a hidden, non-emitting state linking the two models)
Example 5: Profile HMM for protein or DNA motifs
(diagram: a chain of M/I/D state triplets between the start state S and the final state F)
M (Match) states emit a certain amino-acid/nucleotide profile.
I (Insert) states emit some background profile.
D (Delete) states are silent (they emit nothing).
Use the model for classification or annotation (both the emission and the transition probabilities are informative!).
Can use EM to train the parameters from a set of examples (how do we determine the right size of the model?).
(google PFAM, Prosite, "HMM profile protein domain")
Example 6: N-order Markov model
In most biological sequences, the (first-order) Markov property is a big problem. N-order relations can be modeled naturally.
Common error: running the standard Forward/Backward on an N-order HMM. Can dynamic programming still work?
15
Emissions StatesFinishStart FinishStart 1-HMM Bayes Net: 2-HMM Bayes Net:
Example 7: Pair-HMM
Given two sequences s1, s2, an alignment is defined by a set of 'gaps' (or indels) in each of the sequences:

ACGCGAACCGAATGCCCAA---GGAAAACGTTTGAATTTATA
ACCCGT-----ATGCCCAACGGGGAAAACGTTTGAACTTATA
      (indel)

A standard dynamic programming algorithm computes the best alignment given such a distance metric. The standard distance metric combines a substitution matrix with an affine gap cost (a gap-open penalty plus a per-position gap-extension penalty), as sketched below.
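A sketch of the affine-gap global alignment score (Gotoh's three-matrix dynamic program; the scoring constants are placeholders, a real implementation would use a substitution matrix, and transitions between the two gap matrices are omitted, a common simplification):

```python
import numpy as np

def affine_align_score(s1, s2, match=1.0, mismatch=-1.0, gap_open=-5.0, gap_extend=-1.0):
    """Global alignment score with affine gaps: three DP matrices, one per end-state."""
    n, m = len(s1), len(s2)
    NEG = -1e9
    M = np.full((n + 1, m + 1), NEG)   # best score ending in a substitution
    X = np.full((n + 1, m + 1), NEG)   # best score ending in a gap in s2
    Y = np.full((n + 1, m + 1), NEG)   # best score ending in a gap in s1
    M[0, 0] = 0.0
    for i in range(1, n + 1):
        X[i, 0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0, j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i, j] = max(M[i - 1, j - 1], X[i - 1, j - 1], Y[i - 1, j - 1]) + sub
            X[i, j] = max(M[i - 1, j] + gap_open, X[i - 1, j] + gap_extend)
            Y[i, j] = max(M[i, j - 1] + gap_open, Y[i, j - 1] + gap_extend)
    return max(M[n, m], X[n, m], Y[n, m])
```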
Pair-HMM
Generalize the HMM concept to probabilistically model alignments. Problem: we are observing two sequences, not a priori related. What will be emitted from our HMM?
(diagram: states S, M, G1, G2, F)
The match state M emits an aligned nucleotide pair; the gap states G1, G2 emit a nucleotide from one of the sequences only.
Pr(M → G_i) plays the role of the "gap open cost"; Pr(G1 → G1) plays the role of the "gap extension cost".
Is it a BN template? Is there a Forward-Backward formula?
Example 8: The simple tree model
(diagram: leaves S1, S2, S3 with internal nodes H1, H2)
Sequences of extant and ancestral species are random variables, with Val(X) = {A, C, G, T}: extant species S_1..S_n, ancestral species H_1..H_{n-1}.
The tree T defines the parent relations pa(S_i), pa(H_i); in the triplet shown, pa(S_1) = pa(S_2) = H_1, pa(S_3) = H_2, and the root is H_2.
For multiple loci we can assume independence and use the same parameters (today).
The model is defined using conditional probability distributions and the root "prior" probability distribution. The model parameters can be the conditional probability distribution tables (CPDs), or we can have a single rate matrix Q and branch lengths.
Ancestral inference
We assume the model (structure, parameters) is given, and denote it by θ.
The total probability of the data s: Pr(s | θ) = Σ_h Pr(h, s | θ) (marginalization over the hidden h_i; naively this sum is exponential).
This is also called the likelihood L(θ). Computing Pr(s) is the inference problem.
Given the total probability it is easy to compute the posterior of h_i given the data: Pr(h_i | s, θ) = Pr(h_i, s | θ) / Pr(s | θ).
Example (a triplet tree with observed leaves A, C, A, unknown internal nodes, and a uniform prior at the root): given the partial observations s, the total probability of the data is obtained by summing Pr(h, s) over all assignments of the internal nodes.
Dynamic programming to compute the total probability (following Felsenstein 1981):

Up(i):
  if i is extant { up[i][a] = (a == S_i ? 1 : 0); return }
  Up(l(i)); Up(r(i))
  for each a:
    up[i][a] = Σ_b Pr(X_l(i) = b | X_i = a) up[l(i)][b] · Σ_c Pr(X_r(i) = c | X_i = a) up[r(i)][c]

Down(i):
  for each a:
    down[i][a] = Σ_{b,c} Pr(X_sib(i) = b | X_par(i) = c) up[sib(i)][b] · Pr(X_i = a | X_par(i) = c) down[par(i)][c]
  Down(l(i)); Down(r(i))

Main:
  Up(root); L = 0
  for each a { L += Pr(root = a) · up[root][a]; down[root][a] = Pr(root = a) }
  LL = log(L)
  Down(l(root)); Down(r(root))

(diagram: the triplet tree with messages up[4] and up[5])
Computing marginals and posteriors (same Up/Down recursions as above):

P(h_i = c | s) = up[i][c] · down[i][c] / Σ_j up[i][j] · down[i][j]

(diagram: the triplet tree with messages down[4], down[5], and up[3])
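A runnable translation for the triplet tree (NumPy assumed; the ε-uniform substitution matrix and the observed leaves are made-up example values):

```python
import numpy as np

ALPHA = 4                               # alphabet {A, C, G, T} -> 0..3
eps = 0.1
P = np.full((ALPHA, ALPHA), eps / 3)    # P[a, b] = Pr(child = b | parent = a)
np.fill_diagonal(P, 1 - eps)
root_prior = np.full(ALPHA, 0.25)

def up_leaf(obs):
    v = np.zeros(ALPHA)
    v[obs] = 1.0
    return v

def up_internal(up_left, up_right):
    # up[i][a] = Pr(leaves below i | X_i = a): product of the children's contributions
    return (P @ up_left) * (P @ up_right)

s1, s2, s3 = 0, 1, 0                    # observed leaves: A, C, A
up_h1 = up_internal(up_leaf(s1), up_leaf(s2))
up_h2 = up_internal(up_h1, up_leaf(s3))           # H2 is the root
likelihood = root_prior @ up_h2                   # total probability Pr(s)

post_root = root_prior * up_h2 / likelihood       # posterior at the root
# For H1, combine its up message with the down message from the root side:
down_h1 = P.T @ (root_prior * (P @ up_leaf(s3)))
post_h1 = up_h1 * down_h1 / (up_h1 @ down_h1)
```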
Simple tree: inference as message passing.
(diagram: messages flow along the edges of the tree; each incoming message tells a node "you are P(H | the data in my subtree)", and combining all incoming messages yields "I am P(H | all data)")
Transition posteriors: not independent!
(diagram: a branch whose two endpoints face data A and C; the down messages are uniform, (0.25), (0.25), (0.25), (0.25), while the up messages at the two endpoints are sharply peaked)
The posterior over the pair of states at the two ends of a branch must therefore be computed jointly, as in the up · Pr(transition) · down term; it is not the product of the two marginal posteriors.
Understanding the tree model (and BNs): reversing edges
The joint probability of the simple tree model factors along the edges: Pr(h, s) = Pr(root) · Π Pr(child | parent).
Can we change the position of the root and keep the joint probability as is? We need the edge-reversal identity Pr(X) Pr(Y | X) = Pr(Y) Pr(X | Y), i.e., Bayes' law.
Inference can become difficult
We want to perform inference in an extended tree model expressing context effects (diagram: a 3x3 grid-like network over nodes 1-9). With undirected cycles, the model is well defined but inference becomes hard.
We may also want to perform inference on the tree structure itself! Each structure imposes a probability on the observed data, so we can perform inference on the space of all possible tree structures, or tree structures plus branch lengths.
What makes these examples difficult?
Factor graphs
Defining the joint probability for a set of random variables given:
1) any set of node subsets (a hypergraph);
2) functions on the node subsets (potentials; these need not sum to 1).
Joint distribution: P(x) = (1/Z) Π_a φ_a(x_a)
Partition function: Z = Σ_x Π_a φ_a(x_a)
If the potentials are conditional probabilities, what will Z be? Not necessarily 1! (Can you think of an example?)
(diagram: a bipartite graph of factor nodes and R.V. nodes)
Things are difficult when there are several modes.
More definitions
The model: potentials can be defined on discrete, real-valued, etc. variables. It is also common to define general log-linear models directly: P(x) ∝ exp(Σ_a θ_a f_a(x_a)).
Inference: compute marginals or posteriors under the joint distribution.
Learning: find the factors' parameterization (e.g., by maximizing the likelihood).
Belief propagation in a factor graph
Remember, a factor graph is defined given a set of random variables (use indices i, j, k, ...) and a set of factors on groups of variables (use indices a, b, ...). x_a refers to an assignment of values to the inputs of factor a; Z is the partition function (which is hard to compute).
The BP algorithm is constructed by computing and updating messages, each a function from the values attainable by x_i to the reals:
messages from factors to variables, m_{a→i}(x_i), and messages from variables to factors, m_{i→a}(x_i).
Think of messages as transmitting beliefs:
a→i: "given my other input variables, and ignoring your message, you are x";
i→a: "given my other input factors and my potential, and ignoring your message, you are x".
Message update rules:
Messages from variables to factors: m_{i→a}(x_i) = Π_{b ∈ N(i) \ a} m_{b→i}(x_i)
Messages from factors to variables: m_{a→i}(x_i) = Σ_{x_a \ x_i} φ_a(x_a) Π_{j ∈ N(a) \ i} m_{j→a}(x_j)
The algorithm proceeds by updating messages. Define the beliefs as approximating the single-variable posteriors p(h_i | s): b_i(x_i) ∝ Π_a m_{a→i}(x_i).
Algorithm (see the sketch below):
Initialize all messages to uniform.
Iterate until no message changes:
  update the factor-to-variable messages;
  update the variable-to-factor messages.
Why is this different from the mean field algorithm?
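A sum-product sketch for the special case of pairwise factors (NumPy assumed; the function name, data layout, and message normalization are our own illustration, not a general-purpose BP implementation):

```python
import numpy as np

def loopy_bp(n_vars, card, factors, n_iter=50):
    """Sum-product BP on a factor graph with pairwise factors.

    factors: list of ((i, j), phi) with phi a card x card potential table.
    Returns beliefs b[i] approximating P(x_i) under the factorized model.
    """
    fv, vf = {}, {}                      # factor->variable and variable->factor messages
    for a, (vars_a, _) in enumerate(factors):
        for i in vars_a:
            fv[(a, i)] = np.ones(card)
            vf[(i, a)] = np.ones(card)
    for _ in range(n_iter):
        # variable -> factor: product of all *other* incoming factor messages
        for (i, a) in vf:
            m = np.ones(card)
            for (b, j) in fv:
                if j == i and b != a:
                    m *= fv[(b, j)]
            vf[(i, a)] = m / m.sum()     # normalize for numerical stability
        # factor -> variable: marginalize the potential against the other message
        for a, ((i, j), phi) in enumerate(factors):
            fv[(a, i)] = phi @ vf[(j, a)]
            fv[(a, j)] = phi.T @ vf[(i, a)]
    beliefs = []
    for i in range(n_vars):
        b_i = np.ones(card)
        for (a, j), m in fv.items():
            if j == i:
                b_i *= m
        beliefs.append(b_i / b_i.sum())
    return beliefs
```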
Beliefs on factor inputs
This is far from mean field since, for example, the factor beliefs b_a(x_a) retain the dependencies among a factor's inputs.
The update rules can be viewed as derived from constraints on the beliefs:
1. a normalization requirement on the variable beliefs (b_i);
2. a normalization requirement on the factor beliefs (b_a);
3. a marginalization requirement: Σ_{x_a \ x_i} b_a(x_a) = b_i(x_i).
BP on a tree = Up-Down
(diagram: the simple tree's factor graph, with variables h_1..h_3, s_1..s_4 and factors a-e; running BP on it reproduces the up/down messages)
Loopy BP is not guaranteed to converge
(diagram: two variables X and Y with potentials favoring the agreeing configurations 11 and 00; the messages can oscillate between the two modes)
This is not a hypothetical scenario: it frequently happens when there is too much symmetry. For example, most mutational effects are double-stranded and therefore symmetric, which can result in non-converging loops.
Sampling is a natural way to do approximate inference
Marginal probability (integration over the whole space): Pr(s) = Σ_h Pr(h, s)
Marginal probability (integration over a sample): approximate the same sum by averaging over a finite sample instead of the whole space.
Sampling from a BN
(diagram: a 3x3 grid-like network over nodes 1-9)
Naively: if we could draw h, s' according to the distribution Pr(h, s'), then Pr(s) ≈ (# samples with s) / (# samples).
Forward sampling: use a topological order on the network; select a node whose parents are already determined and sample from its conditional distribution (all parents are already determined!). How do we sample from the CPD? (See the sketch below.)
Claim: forward sampling is correct.
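A forward-sampling sketch over a toy two-node network (NumPy assumed; the network, its CPDs, and all names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def forward_sample(nodes, parents, cpd):
    """nodes: topologically ordered ids; cpd[v] maps tuple(parent values) -> probs."""
    x = {}
    for v in nodes:
        pa = tuple(x[u] for u in parents[v])      # parents already sampled
        probs = cpd[v][pa]
        x[v] = rng.choice(len(probs), p=probs)    # sample from the CPD row
    return x

# Toy network: A -> B
nodes = ['A', 'B']
parents = {'A': [], 'B': ['A']}
cpd = {'A': {(): [0.6, 0.4]},
       'B': {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}}
samples = [forward_sample(nodes, parents, cpd) for _ in range(10000)]
p_b1 = sum(s['B'] == 1 for s in samples) / len(samples)   # ~0.6*0.1 + 0.4*0.8 = 0.38
```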
Focus on the observations
(diagram: the grid network with the evidence nodes highlighted)
Naive sampling is terribly inefficient. Why? What is the sampling error?
Why don't we constrain the sampling to fit the evidence s? This can be done, but then we no longer sample from P(h, s), nor from P(h | s) (why?).
Two tasks: P(s) and P(f(h) | s). How should we approach each/both?
Likelihood weighting:
weight = 1
Use a topological order on the network; select a node whose parents are already determined.
If the variable was not observed: sample from its conditional distribution.
Else: weight *= P(x_i | pa(x_i)), and fix the observation.
Store the sample x and its weight w[x].
Pr(h | s) = (total weight of samples with h) / (total weight); see the sketch below.
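The same toy network as in the forward-sampling sketch, now with evidence clamped (again NumPy; a sketch, not the slides' own code):

```python
import numpy as np

rng = np.random.default_rng(1)

nodes = ['A', 'B']                       # topological order; toy network A -> B
parents = {'A': [], 'B': ['A']}
cpd = {'A': {(): [0.6, 0.4]},
       'B': {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}}

def likelihood_weighting_sample(evidence):
    """One weighted sample: evidence nodes are clamped; the weight accumulates their CPD terms."""
    x, w = {}, 1.0
    for v in nodes:
        pa = tuple(x[u] for u in parents[v])
        probs = cpd[v][pa]
        if v in evidence:
            x[v] = evidence[v]
            w *= probs[x[v]]             # weight by Pr(observed value | parents)
        else:
            x[v] = rng.choice(len(probs), p=probs)
    return x, w

# Estimate Pr(A=1 | B=1) as a weighted frequency:
samples = [likelihood_weighting_sample({'B': 1}) for _ in range(20000)]
num = sum(w for x, w in samples if x['A'] == 1)
den = sum(w for _, w in samples)
print(num / den)    # ~0.84 = (0.4 * 0.8) / (0.6 * 0.1 + 0.4 * 0.8)
```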
Importance sampling
It can be difficult or inefficient to sample from P. Assume we sample instead from Q; then the unnormalized importance sampling identity is E_P[f] = E_Q[f(x) P(x)/Q(x)].
Our estimator from M samples x_1..x_M ~ Q is (1/M) Σ_m f(x_m) P(x_m)/Q(x_m).
Claim: this estimator has the correct expected value. Prove it! (See the derivation below.)
To minimize the variance, use a Q distribution proportional to the target function.
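The one-line proof the slide asks for, assuming Q(x) > 0 wherever f(x)P(x) ≠ 0:

```latex
\mathbb{E}_{Q}\!\left[f(x)\,\frac{P(x)}{Q(x)}\right]
  \;=\; \sum_{x} Q(x)\, f(x)\, \frac{P(x)}{Q(x)}
  \;=\; \sum_{x} f(x)\, P(x)
  \;=\; \mathbb{E}_{P}\!\left[f(x)\right]
```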
Correctness of likelihood weighting: importance sampling
For the likelihood weighting algorithm, our proposal distribution Q is defined by fixing the evidence at the nodes in a set E and ignoring the CPDs of variables with evidence. We sample from Q just like forward sampling from a Bayesian network in which all edges going into evidence nodes have been removed.
Likelihood weighting is then exactly unnormalized importance sampling with this proposal distribution Q, applied to any function of the hidden variables.
Proposition: the likelihood weighting algorithm is correct (in the sense that it defines an estimator with the correct expected value).
Normalized importance sampling
When sampling from P(h | s) we don't know P, so we cannot compute the weight w = P/Q.
We do know the unnormalized target P'(h) = P(h, s) = P(h | s) P(s).
So we use sampling to estimate both terms of the ratio: E[f | s] ≈ Σ_m f(h_m) w(h_m) / Σ_m w(h_m), with w = P'(h)/Q(h).
Using the likelihood weighting Q, we can compute posterior probabilities in one pass (no need to estimate P(s) and P(h, s) separately).
Limitations of forward sampling
Likelihood weighting is effective when the observed nodes sit upstream, near the roots of the network; it breaks down when the observed nodes sit downstream, since the unobserved ancestors are then sampled blindly, without looking at the evidence.
(diagrams: the same network with observed nodes at the top vs. at the bottom)
Symmetric and reversible Markov processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric: q_ij = q_ji. What would a symmetric process converge to?
Definition: a reversible Markov process is one for which Pr(X_s = i, X_t = j) = Pr(X_s = j, X_t = i) for all times s < t.
Claim: a Markov process is reversible iff there exist π_i such that π_i q_ij = π_j q_ji. If this holds, we say the process is in detailed balance and the π_i are its stationary distribution.
Proof idea: Bayes' law and the definition of reversibility.
Reversibility
Claim: a Markov process is reversible iff we can write q_ij = s_ij π_j, where S is a symmetric matrix.
(diagram: composing the process over intervals t and t' yields the process over t + t')
Recall the equivalent characterization from the previous slide: reversibility iff detailed balance, π_i q_ij = π_j q_ji.
Markov Chain Monte Carlo (MCMC)
We don't know how to sample from P(h) = P(h | s) (or from any complex distribution, for that matter).
The idea: think of P(h | s) as the stationary distribution of a reversible Markov chain. Find a process with transition probabilities T(h → h') whose stationary distribution is P(h | s), then sample a trajectory from it (start from anywhere!).
Theorem (with C a counter of visits): if the process is irreducible (you can reach anywhere from anywhere with p > 0), then C(h)/N converges to P(h | s) from any starting point.
The Metropolis(-Hastings) Algorithm
Why reversible? Because detailed balance makes it easy to define the stationary distribution in terms of the transitions.
So how can we find appropriate transition probabilities? We want detailed balance with respect to the target P.
Define a proposal distribution q(x → y) and an acceptance probability α(x → y) = min(1, [P(y) q(y → x)] / [P(x) q(x → y)]).
What is the big deal? We reduced the problem to computing ratios between P(x) and P(y); see the sketch below.
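A Metropolis sketch with a symmetric proposal, so the q terms cancel (NumPy assumed; the toy target reuses the mixture from Example 1, and all names, step sizes, and iteration counts are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis(log_p, propose, x0, n_steps=20000):
    """Sample from a target known only up to a constant, via its log density.

    log_p:   unnormalized log target (ratios of P are all we ever need)
    propose: a symmetric proposal x -> x'
    """
    x, chain = x0, []
    for _ in range(n_steps):
        y = propose(x)
        # accept with probability min(1, P(y) / P(x))
        if np.log(rng.random()) < log_p(y) - log_p(x):
            x = y
        chain.append(x)
    return np.array(chain)

# Toy target: the two-component Gaussian mixture from Example 1 (up to a constant).
def log_p(x):
    return np.log(0.2 * np.exp(-0.5 * x ** 2)
                  + (0.8 / 0.2) * np.exp(-0.5 * ((x - 1.0) / 0.2) ** 2))

chain = metropolis(log_p, lambda x: x + rng.normal(0.0, 0.5), x0=0.0)
samples = chain[2000:]   # discard a burn-in prefix before using the samples
```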
Acceptance ratio for a BN
We want to sample from P(h | s). For example, if the proposal distribution changes only one variable h_i, what would the ratio be?
We affected only the CPDs of h_i and its children, so we only have to compute the product of those CPD terms under the new value versus the old one.
Definition: the (minimal) Markov blanket of a node in a BN includes its parents, its children, and its children's parents. To compute the ratio, we care only about the values of h_i and its Markov blanket.
Gibbs sampling
A very similar algorithm (in fact, a special case of the Metropolis algorithm):
Start from any state h.
do {
  Choose a variable H_i.
  Form h^{t+1} by sampling a new h_i from P(h_i | h_{-i}, s).
}
This is a reversible process with our target stationary distribution.
Gibbs sampling is easy to implement for BNs: the conditional P(h_i | h_{-i}, s) depends only on the Markov blanket of h_i (see the sketch below).
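A Gibbs sketch on a toy model where the blanket conditional is trivial to write down; we use a small Ising grid rather than a BN (our choice, not the slides'), since the same one-variable resampling loop applies:

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_ising(n=8, beta=0.7, n_sweeps=200):
    """Gibbs sampling on an n x n Ising grid with periodic boundaries."""
    x = rng.choice([-1, 1], size=(n, n))
    for _ in range(n_sweeps):
        for i in range(n):
            for j in range(n):
                # the Markov blanket of x[i, j]: its four grid neighbours
                nb = (x[(i - 1) % n, j] + x[(i + 1) % n, j]
                      + x[i, (j - 1) % n] + x[i, (j + 1) % n])
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))  # Pr(x_ij = +1 | blanket)
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x
```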
Sampling in practice
How much time until convergence to P? (The burn-in time.) (diagram: a trace with burn-in, mixing, and the sampling phase)
Consecutive samples are still correlated! Should we sample only every n steps?
We sample while fixing the evidence, starting from anywhere but waiting some time before beginning to collect data.
A problematic space would be loosely connected (diagram: examples of bad, multi-modal spaces).
More terminology: make sure you know how to define these:
Inference
Parameter learning
Likelihood
Total probability / marginal probability
Exact inference / approximate inference
Z-scores, T-test: the basics
You want to test whether the mean (RNA expression) of a gene set A is significantly different from that of a gene set B.
If you assume the variance of A and B is the same, the t statistic is distributed like T with n_A + n_B - 2 degrees of freedom. If you don't assume the variance is the same, a variant exists, but in this case the whole test becomes rather flaky!
In a common scenario, you have a small set of genes and you screen a large set of conditions for interesting biases, so you need a quick way to quantify deviation of the mean. For a set of k genes sampled from a standard normal distribution, how would the mean be distributed? (As N(0, 1/k).)
So if your conditions are normally distributed and pre-standardized to mean 0, std 1, you can quickly compute the sum of values over your set and generate a z-score (see the sketch below).
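A sketch of the quick z-score screen and the t-test alternative (NumPy/SciPy assumed; the matrix and set sizes are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Toy expression matrix: conditions standardized to mean 0, std 1 per condition.
expr = rng.normal(size=(5000, 20))                  # genes x conditions (synthetic)
gene_set = rng.choice(5000, size=30, replace=False)

# Mean of k standard-normal values ~ N(0, 1/k), so sqrt(k) * mean is a z-score.
k = len(gene_set)
z = np.sqrt(k) * expr[gene_set].mean(axis=0)        # one z-score per condition
p = 2 * stats.norm.sf(np.abs(z))                    # two-sided p-values

# The two-sample alternative: a t-test of the set against the remaining genes.
rest = np.delete(expr, gene_set, axis=0)
t, p_t = stats.ttest_ind(expr[gene_set], rest, axis=0)  # equal-variance t-test
```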
Kolmogorov-Smirnov statistics
The D statistic is non-parametric: you can transform x monotonically (e.g., log x) without changing it. The null distribution of the D statistic has a known closed form (the Kolmogorov distribution).
A non-parametric variant on the T-test theme is the Mann-Whitney test: you take your two sets and rank them together, then sum the ranks of one of the sets (R_1); see the sketch below.
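A quick illustration with SciPy on synthetic data; the final assertion demonstrates the non-parametric (rank-invariance) property:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, size=200)
b = rng.normal(0.3, 1.0, size=150)

D, p_ks = stats.ks_2samp(a, b)          # two-sample Kolmogorov-Smirnov D statistic
U, p_mw = stats.mannwhitneyu(a, b)      # rank-based Mann-Whitney U test

# A strictly monotone transform changes neither statistic:
D2, _ = stats.ks_2samp(np.exp(a), np.exp(b))
assert np.isclose(D, D2)
```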
Hypergeometric and chi-square tests
(diagram: the overlap of two gene sets A and B)
The significance of the overlap between two sets is given by the hypergeometric distribution.
For an m x n contingency table, the chi-square statistic is distributed with (m - 1)(n - 1) = m·n - m - n + 1 degrees of freedom; see the sketch below.
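A sketch of both tests with SciPy (all counts are hypothetical):

```python
from scipy import stats

# Set-overlap enrichment: N genes total, |A| = K, |B| = n, overlap = x.
N, K, n, x = 20000, 300, 500, 25
# Pr(overlap >= x) under random draws: hypergeometric survival function
p_hyper = stats.hypergeom.sf(x - 1, N, K, n)

# The chi-square alternative on the 2x2 contingency table (1 d.o.f.):
table = [[x, K - x], [n - x, N - K - n + x]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
```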