Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.

Probabilistic Approaches to Phylogenies BMI/CS 576 www.biostat.wisc.edu/bmi576.html Sushmita Roy sroy@biostat.wisc.edu Oct 2 nd, 2014

Readings Chapter 8 – Sections 8.1, 8.2, 8.3, 8.4

Key concepts Scoring a tree based on likelihood of observed data – How to compute the likelihood of sequences using a given tree and conditional probabilities Felsenstein’s algorithm General understanding of where the conditional probabilities are obtained from General understanding of search strategies – (similar to parsimony)

Probabilistic methods for phylogenies data: a set of n sequences tree: a phylogenetic tree for the n sequences Two approaches – Maximum likelihood framework P(data|tree) – Bayesian framework P(tree|data) Both need a probabilistic model to compute the probability of a set of sequences, given a tree and branch lengths

Maximum likelihood methods for phylogenetic trees “best” tree is the one that maximizes the likelihood of data (sequence) given model (Tree topology and branch length) Phylogenetic tree construction requires – Scoring a tree Requires computing the likelihood of sequence given the tree That is computing the probability of the sequences given a tree – Searching the space of possible trees Given a probabilistic model of sequence changes, we can compute the tree with the greatest likelihood (maximum likelihood)

Notation for computing the probability of sequences given a tree x j : sequence at node j x j i : character in the i th position for the j th sequence t j : length of the j th branch T : tree topology P(x|y,t): probability of switching from y to x from ancestor to child along a branch of length t – We will come back to defining such probabilistic models later

Computing the probability of sequences on a tree If we know P(x|y,t) we can compute the probability of the sequences This relies on a key assumption: – sequence at a child node i is independent of everything else given i ’s parent – E.g. for a node i whose parent is given by α(i) P(x i |x α(i),x j,x k..)=P(x i |x α(i) ) We will also make additional simplifying assumptions – We have an ungapped alignment – Characters at different sites evolve independently

Example of computing the probability of sequences given a tree x1x1 x2x2 x3x3 x4x4 x5x5 t1t1 t2t2 t3t3 t4t4 The probability of these sequences given this tree is Assume we are given the following tree for three sequences x 1, x 2 and x 3 at the leaf nodes First, assume that the sites evolve independently, and there are a total on N sites Hereafter, let’s just focus on one site, u Also for clarity, we will use t for all branch lengths

Example continued The expression Assume conditional independence, the above is Written more compactly as α(i) denotes the parent of i x1x1 x2x2 x3x3 x4x4 x5x5 t1t1 t2t2 t3t3 t4t4

Example continued Or more generally for n sequences at the leaves as Between internal nodes Between extant and internal nodes

But.. the ancestral sequences cannot be observed So our probability calculation needs to sum over all ancestral states Let us consider a simple example of two sequences xu1xu1 xu2xu2 t1t1 t2t2

Summing of ancestral state for a pair of sequences q a is the probability of observing character a at the root node 3 xu1xu1 xu2xu2 t1t1 t2t2 a

Generalizing this to n sequences Requires us to sum over all of the internal nodes α(i) gives the parent of I a i is a variable storing the character at the i th internal node Felsenstein’s algorithm gives an efficient way to compute this quantity Between internal nodes Between extant and internal nodes

Felsenstein’s algorithm Input: Given a set of n sequences at the leaf nodes, conditional probability distribution of character switch, and a tree topology Output: The likelihood of the sequences Also based on Dynamic programming – Relies on computations performed in subtrees for computations at the root of these subtrees Very similar ideas as in the Weighted Parsimony algorithm

Notation for Felsenstein’s algorithm P(L k |a): probability of the leaves below node k, given that the residue at k is a i and j will denote the children of k a, b, c characters at any node We’ll drop the subscript u and work with only one site

Felsentein’s algorithm Initialize: k=2n-1 Recursion: – If k is a leaf node, – Else, compute P(L i |a) and P(L j |a) for all a at daughters i and j Termination – Likelihood at a site

An observation and a simplication Note that Further more, we will assume that the conditional probabilities are independent of branch length Finally, assume q a =0.25 for all a in {A,T,G,C}

What is probability for the following set of residues ATG 1 23 4 5 ACGT A0.70.1 C 0.70.1 G 0.70.1 T 0.7 Assume the above conditional probability matrix P(b|a) for all branches a b

In class exercise

The probabilities computed for each node Probability of sequence given tree is 0.25(0.0058+0.0022+0.0154 + 0.0058)=0.0073

Felsentein’s algorithm comments Very similar to the weighted parsimony case – Main differences are at Leaf nodes Minimization versus summation for internal nodes Can it be used to infer ancestral states as well? – Instead of summing, we would maximize – As in the parsimony case, we would need to keep track of the maximizing assignment

Errata Slide 16 and slide 17 should have – Replace P(L i |a) by P(L i |b) Slide 20 had typos.

Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.

Similar presentations

Presentation on theme: "Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.

Similar presentations

Presentation on theme: "Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014."— Presentation transcript:

Similar presentations

About project

Feedback