Recitation 5 2/4/09 ML in Phylogeny Comp. Genomics Recitation 5 2/4/09 ML in Phylogeny Based on Slides by Ron Shamir and Nir Friedman
Outline Maximum likelihood (ML) ML in phylogeny Ancestral sequence reconstruction using ML
Maximum likelihood One of the methods for parameter estimation Likelihood: L=P(Data|Parameters) Simple example: Simple coin with P(head)=p 10 coin tosses 6 heads, 4 tails L=P(Data|Params)=(106)p6 (1-p)4
Maximum likelihood We want to find p that maximizes L=(106)p6 (1-p)4 Infi 1, Remember? Log is a monotolical function, we can optimize logL=log[(106)p6 (1-p)4]= log(106)+6logp+4log(1-p)] Deriving by p we get: 6/p-4/(1-p)=0 Estimate for p:0.6 (Makes sense?)
Likelihood of a Tree Input (small problem): n sequences A tree T, with labels on the leafs (X) Find optimal labeled tree : labeling of internal nodes (Y) branch lengths (b) Maximizing the likelihood P(X|T,Y,b)
Likelihood (2) How to compute P(X|T,Y,b)? Assumptions: Each character is independent The branching is a Markov process: The probability of a node having a given label is only a function of the parent node and the branch length b between them. The probabilities P(x|y,t) are known
Example x1 x2 x3 x4 x5 t1 t2 t3 t5
What if we want P(X|T,b)? Assume that the branch lengths b are known. Independence of sites Markov property independence of each branch ALGMB, December 01 © Ron Shamir , TAU
Properties of P Additivity: Reversibility Allows to freely move the root
Efficient Likelihood Calculation (Felsenstein ’73) Use dynamic prog. similar to parsimony Need Sj(v,a) = Pr(subtree rooted in v | vj = x) Initialization: For each leaf v set Sj(v,a) = 1 if i is labeled by a, otherwise Sj(v,a) = 0 Recursion: Traverse the tree in postorder: for each node v with children u and w, for each state x Complexity: O(nmk2) n species, m chars, k states
Ancestral sequence reconstruction Input: Rooted tree + extant (leaf) sequences Substitution matrix + branch lengths Problem: Find the sequence assignment of internal states which maximizes the total tree likelihood
Solving ancestral sequence reconstruction Simple with parsimony methods, ≈ through the Fitch/Sankoff algorithms Here, we’re interested in ML Maximizing P(ancestral S|contemporary S) Joint vs. Marginal Marginal: focus on a single node (e.g., the root), and maximize its likelihood Joint: Infer all the sequences together
Solutions We can enumerate all the possible ancestral states and check their likelihood… cn possible combinations per character n – number of internal nodes Inapplicable when the tree is large Koshi and Goldshtein (1996) – fast algorithm for marginal reconstruction Pupko, Pe’er, Shamir and Graur (2004): fast algorithm for joint reconstruction
Basics We assume different sites evolve independently Working one site at a time Pij(t) – the probability of observing ij in time t We want to maximize P(v|data)=P(data|v)*P(v)/P(Data) Constant!
DP to the rescue DP often suitable for tree problems Idea: Start from the leaves and climb up the tree The subtree under every node is dependent only on the state of its parent! For node x compute Lx(i) and Cx(i) Lx(i) – the likelihood of x’s subtree under the condition that its parent is assigned with i Cx(i) – the state of x that gives rise to this likelihood
Algorithm phase I Initialization: Progression: Termination: For a leaf y assigned with j: Cy(i)=j, Ly(i)=Pij(t) Progression: For an internal node z with sons x,y already visited: for each i we compute Termination: For the root with sons x,y,z – choose k maximizing
Algorithm phase II “Traceback” Traverse the tree from root to the leaves For every internal node x with father y already reconstructed with i Reconstruct the state in x by setting Cx(i) Continue until all the nodes are reconstructed
Example
Complexity For n internal nodes and c possible states we compute a DP table of O(nc) cells. As we maximize in every cell over c states, time is O(nc2) As c is constant – O(n)