Chapter 6: Maximum likelihood (TREE-PUZZLE)

The concept of likelihood applies when, given some data D, a decision must be made about an adequate explanation of the data. In the phylogenetic framework, the hypotheses include the different tree structures, the branch lengths, the parameters of the model of sequence evolution, and so on. (Section 6.1)

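Example: a coin is tossed n = 100 times, giving h = 21 heads and t = 79 tails.

Binomial distribution: $\Pr[H = h] = \binom{n}{h}\,\theta^{h}(1-\theta)^{n-h}$

Likelihood function: $L(\theta) = \binom{100}{21}\,\theta^{21}(1-\theta)^{79}$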

Fig. 6.1

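For ease of computation, first compute the logarithm of the likelihood function, which results in sums rather than products:

$\log L(\theta) = \log\binom{n}{h} + h\log\theta + (n-h)\log(1-\theta)$

It is known that the maximum of a differentiable function y = f(x), when it exists in the interior of the domain, occurs at a value x for which the first derivative of the function equals zero. Here,

$\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta}$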

This derivative is equal to zero if θ = h/n, positive for smaller values of θ, and negative for larger values, so that log[L(θ)] attains its maximum when θ = h/n. Thus $\hat{\theta} = h/n$ is the maximum-likelihood estimate (MLE) of the probability of observing heads in a single coin toss.
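As a quick numerical check of this result, the following Python sketch evaluates the binomial log-likelihood on a grid for the coin-toss example (n = 100, h = 21) and looks for the maximizer. The function name `log_likelihood` and the grid resolution are illustrative choices, not part of the original slides.

```python
import math

def log_likelihood(theta, n=100, h=21):
    """Binomial log-likelihood log L(theta) for h heads in n tosses."""
    return (math.log(math.comb(n, h))
            + h * math.log(theta)
            + (n - h) * math.log(1.0 - theta))

# Evaluate on a grid of theta values strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)

print(theta_hat)  # grid maximizer, to be compared with h/n = 0.21
```

A grid search is used here only because it is easy to verify by eye; the closed-form estimate h/n makes numerical optimization unnecessary in this simple case.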

Odds ratio: $L(21/100) \approx 0.0975$ and $L(1/2) \approx 1.61 \times 10^{-9}$, so comparing θ = 0.21 with θ = 0.5 gives an odds ratio of $0.0975 / (1.61 \times 10^{-9}) \approx 6 \times 10^{7}$.

In evolution, point mutations are considered chance events, just like tossing a coin. Therefore, at least in principle, the probability of finding a mutation along one branch in a phylogenetic tree can be calculated by using the same maximum-likelihood framework.
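The odds ratio for the coin-toss example can be recomputed directly from the binomial likelihood. The sketch below is only an arithmetic check; the function name `likelihood` is an illustrative choice.

```python
import math

def likelihood(theta, n=100, h=21):
    """Binomial likelihood L(theta) for h heads in n tosses."""
    return math.comb(n, h) * theta**h * (1.0 - theta)**(n - h)

l_mle = likelihood(21 / 100)   # likelihood at the MLE
l_fair = likelihood(0.5)       # likelihood under a fair coin
odds_ratio = l_mle / l_fair

print(f"L(0.21) = {l_mle:.4g}, L(0.5) = {l_fair:.3g}, ratio = {odds_ratio:.2g}")
```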

The main idea behind phylogeny inference with maximum likelihood is to determine the tree topology, branch lengths, and parameters of the evolutionary model (e.g., transition/transversion ratio, base frequencies, rate variation among sites) that maximize the probability of observing the sequences at hand. In other words, the likelihood function is the conditional probability of the data given a hypothesis (i.e., a model of substitution with a set of parameters θ and the tree τ, including branch lengths):

$L(\tau, \theta) = \Pr[D \mid \tau, \theta]$

A tree with two taxa, A and B, has only one branch connecting the two sequences; the sole purpose of the exercise is to reconstruct the branch length that produces the data with maximal probability. (Section 6.2)

6.2.1 The simple case: Maximum-likelihood tree for two sequences

The alignment has length l for the two sequences $S_i = (s_i(1), \ldots, s_i(l))$, $i = 1, 2$, where $s_i(j)$ is the nucleotide, the amino acid, or any other letter from a finite alphabet at sequence position j in sequence i.

GATCATC……..ATCATAAAATTTACGCA
GATACCC……..ATCAATAAATTTACCCA

The alignment is summarized by the number of identical pairs of nucleotides ($l_0$) and the number of different pairs ($l_1$), where $l_0 + l_1 = l$.

GATCATC……..ATCATAAAATTTACGCA
GATACCC……..ATCAATAAATTTACCCA
GAACCTC……..AACATAAAATTTAGCCA
…..
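To make the two-sequence case concrete, the sketch below counts identical and different pairs and estimates the branch length. It assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name, because JC69 has a closed-form MLE. The input strings are only the visible right-hand fragments of the first two sequences above (the elided middle is not reconstructed), and all function names are illustrative.

```python
import math

def count_pairs(seq1, seq2):
    """Return (l0, l1): numbers of identical and different aligned pairs."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    l1 = sum(a != b for a, b in zip(seq1, seq2))
    return len(seq1) - l1, l1

def jc69_log_likelihood(d, l0, l1):
    """Log-likelihood of branch length d under JC69 (an assumed model)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # joint probability of one identical pair
    p_diff = 0.25 * (0.25 - 0.25 * e)  # joint probability of one specific differing pair
    return l0 * math.log(p_same) + l1 * math.log(p_diff)

def jc69_mle_distance(l0, l1):
    """Closed-form MLE of the branch length under JC69."""
    p = l1 / (l0 + l1)
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75 under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Visible tail fragments of the first two sequences in the alignment.
l0, l1 = count_pairs("ATCATAAAATTTACGCA", "ATCAATAAATTTACCCA")
d_hat = jc69_mle_distance(l0, l1)
print(l0, l1, round(d_hat, 4))
```

Under a richer model, the branch length would typically be found by numerical optimization of the log-likelihood rather than by a closed form.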

First, it is assumed that each site in the alignment evolves according to the same model M. The assumption also implies that all sites evolve at the same rate μ. The rate at a site is modified by a site-specific rate factor, $\rho_j > 0$. … The probability of each site pattern is then available:

$\Pr[D_j, \tau, M, \rho_j], \quad j = 1, \ldots, l \qquad (6.11)$

$L(\tau, M, \rho) = \Pr[D, \tau, M, \rho] = \prod_{j=1}^{l} \Pr[D_j, \tau, M, \rho_j] \qquad (6.12)$

where $D = (D_1, D_2, D_3, \ldots, D_l)$ is the alignment and τ is a tree.
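Because the alignment likelihood is a product over sites, it is usually accumulated as a sum of logarithms. The sketch below illustrates why, using hypothetical per-site probabilities (2000 sites, each with probability 0.001); the numbers are placeholders, not values from the slides.

```python
import math

def alignment_log_likelihood(site_probs):
    """Sum of per-site log-probabilities, i.e. the log of the product over sites."""
    return sum(math.log(p) for p in site_probs)

# Hypothetical per-site pattern probabilities for a 2000-site alignment.
site_probs = [1e-3] * 2000

naive_product = math.prod(site_probs)            # underflows to 0.0 in double precision
log_lik = alignment_log_likelihood(site_probs)   # remains finite

print(naive_product, log_lik)
```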

First, for a fixed choice of τ, M, and the site rate vector ρ, the probability of observing the alignment D can be computed with Equation 6.12. Second, for a given alignment D, Equation 6.12 can be used to find the MLEs. It is assumed that the site-specific rate factor $\rho_j$ is drawn from a Γ-distribution with expectation 1 and variance 1/α.
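A Γ-distribution with expectation 1 and variance 1/α corresponds to shape α and scale 1/α. The following sketch draws site-specific rate factors with Python's standard library and checks the two moments empirically; the value α = 2.0, the seed, and the function name are arbitrary illustrative choices.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw rate factors from a gamma distribution with shape alpha and scale 1/alpha."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

alpha = 2.0  # illustrative shape parameter
rates = sample_site_rates(alpha, 100_000)

mean_rate = sum(rates) / len(rates)
var_rate = sum((r - mean_rate) ** 2 for r in rates) / len(rates)

print(mean_rate, var_rate)  # should be close to 1 and 1/alpha
```

Small values of α give strongly heterogeneous rates, while large values approach the equal-rates case in which every factor is close to 1.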

Consider the tree τ with its branch lengths (i.e., number of substitutions), the model of sequence evolution M with its parameters (e.g., transition/transversion ratio, stationary base composition), and the site-specific rate factor $\rho_j = 1$ for each site j. The goal is to compute the probability of observing one of the $4^n$ possible patterns in an alignment of n sequences. (Section 6.3)

[Figure: four-taxon tree with leaves s1, s2, s3, s4, inner nodes s5 and s6, and branch lengths d1, d2, d3, d4, d5]

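It is assumed that evolution started from sequence $S_0$ and then proceeded along the branches of tree τ with branch lengths $d_1$, $d_2$, $d_3$, $d_4$, and $d_5$. To compute $\Pr[D_j, \tau, M, 1]$ for a specific site j, where $D_j = (s_1, s_2, s_3, s_4)$ are the nucleotides observed, it is necessary to know the ancestral states $s_0$ and $s_5$. Writing $P_{xy}(d)$ for the probability of changing from nucleotide x to nucleotide y along a branch of length d, the conditional probability of the data given the ancestral states then will be as follows:

$\Pr[D_j, \tau, M, 1 \mid s_0, s_5] = P_{s_0 s_1}(d_1)\, P_{s_0 s_2}(d_2)\, P_{s_0 s_5}(d_5)\, P_{s_5 s_3}(d_3)\, P_{s_5 s_4}(d_4)$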

However, in almost any realistic situation, the ancestral sequences are not available. Therefore, one sums over all possible combinations of ancestral states of nucleotides. The sum can be efficiently assessed by evaluating the likelihoods moving from the end nodes of the tree to the root. In each step, two nodes from the tree are removed and replaced by a single node.

To generalize this equation for more than four sequences, it is necessary to sum over all the possible assignments of nucleotides at the n - 2 inner nodes of the tree. Let $D_j = (s_1, s_2, s_3, \ldots, s_n)$ be a pattern at a site j, with tree τ and a model M fixed. Nucleotides at inner nodes of the tree are abbreviated as $x_i$, $i = n + 1, \ldots, 2n - 2$.

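For an inner node i with offspring $o_1$ and $o_2$, the vector $L^i_j = (L^i_j(A), L^i_j(C), L^i_j(G), L^i_j(T))$ is defined recursively as

$L^i_j(s) = \Big[\sum_{x} P_{sx}(d_{o_1})\, L^{o_1}_j(x)\Big] \cdot \Big[\sum_{x} P_{sx}(d_{o_2})\, L^{o_2}_j(x)\Big], \quad s \in \{A, C, G, T\}$

where $d_{o_1}$ and $d_{o_2}$ are the numbers of substitutions connecting node i and its descendants in the tree. For a leaf i, $L^i_j(s) = 1$ if s is the nucleotide observed at that leaf and 0 otherwise. (Section 6.3.1)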

It is assumed that the node 2n - 2 has three offspring: $o_1$, $o_2$, and $o_3$.
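The leaf-to-root recursion can be sketched for a four-sequence tree as follows. This is a minimal illustration under stated assumptions: the JC69 model with uniform base frequencies stands in for the unspecified model M, the topology groups (s1, s2) against (s3, s4) with d5 as the inner branch, and the node carrying s1 and s2 is treated as the three-offspring node. All function names and branch lengths are hypothetical.

```python
import math
from itertools import product

NUC = "ACGT"

def jc69(d):
    """4x4 JC69 transition probabilities for a branch of length d (assumed model)."""
    e = math.exp(-4.0 * d / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if x == y else diff for y in range(4)] for x in range(4)]

def leaf(nuc):
    """Partial likelihood vector of a leaf: 1 for the observed nucleotide, else 0."""
    return [1.0 if s == nuc else 0.0 for s in NUC]

def combine(children):
    """Partial likelihood vector of a node from (child_vector, branch_length) pairs."""
    out = [1.0] * 4
    for vec, d in children:
        p = jc69(d)
        for s in range(4):
            out[s] *= sum(p[s][x] * vec[x] for x in range(4))
    return out

def site_likelihood(pattern, lengths):
    """Probability of one site pattern (s1, s2, s3, s4) on the tree ((s1,s2),(s3,s4))."""
    s1, s2, s3, s4 = pattern
    d1, d2, d3, d4, d5 = lengths
    inner = combine([(leaf(s3), d3), (leaf(s4), d4)])
    top = combine([(leaf(s1), d1), (leaf(s2), d2), (inner, d5)])  # three offspring
    return sum(0.25 * top[s] for s in range(4))  # weight by stationary frequencies

lengths = (0.1, 0.2, 0.3, 0.4, 0.05)  # hypothetical branch lengths
total = sum(site_likelihood(p, lengths) for p in product(NUC, repeat=4))
print(site_likelihood("AACC", lengths), total)
```

Summing the site likelihood over all 4^4 patterns should give 1, which is a convenient sanity check for any implementation of the recursion.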


The next task is to find those branch lengths for tree τ that maximize the log-likelihood function. When computing the maximum-likelihood tree, the model parameters and branch lengths have to be computed for each tree, and then the tree that yields the highest likelihood has to be selected. Because the number of tree topologies grows very rapidly with the number of sequences, testing all possible trees is infeasible except for very small data sets. (Section 6.4)
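To give a sense of scale, the number of unrooted bifurcating topologies for n sequences is the standard count 3 · 5 · … · (2n - 5). The short sketch below evaluates it; the function name is an illustrative choice.

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n >= 3 labeled sequences."""
    if n < 3:
        raise ValueError("need at least three sequences")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # factors 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```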

Thus, various heuristics are used to suggest reasonable trees, including stepwise addition (e.g., used in the program DNAML of Felsenstein's PHYLIP package) and star decomposition (MOLPHY), as well as the neighbor-joining (NJ) algorithm. (Section 6.4)

Given a set of n aligned nucleotide sequences, any group of four of them is called a quartet. The quartet-puzzling algorithm analyzes all possible quartets in a data set.

In essence, the algorithm is a three-step procedure. The first step, the so-called maximum-likelihood step, computes for each of the $\binom{n}{4} = \frac{n!}{4!\,(n-4)!}$ possible quartets the maximum-likelihood values $L_1$, $L_2$, and $L_3$ for the three possible four-sequence trees $T_1$, $T_2$, and $T_3$.
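The bookkeeping of the maximum-likelihood step can be sketched by enumerating the quartets and their three topologies. The taxon labels below are hypothetical, and the likelihood evaluation itself is not shown.

```python
from itertools import combinations
from math import comb

def quartet_topologies(quartet):
    """The three possible unrooted four-sequence trees, written as pairs of pairs."""
    a, b, c, d = quartet
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

taxa = ["A", "B", "C", "D", "E", "F"]  # hypothetical sequence labels
quartets = list(combinations(taxa, 4))
n_trees = sum(len(quartet_topologies(q)) for q in quartets)

print(len(quartets), comb(len(taxa), 4), n_trees)
```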

The resulting list of $3 \cdot \binom{n}{4}$ likelihoods is then used in the quartet-puzzle step to compute an intermediate tree by inserting sequences sequentially in an already reconstructed subtree.

Eventually, sequence E is inserted at the branch with minimal penalty. For large data sets it is not feasible to compute all intermediate trees. Thus the quartet-puzzle step is repeated at least a thousand times for various input orders of sequences, to avoid reconstruction artifacts due to the ordering of the sequences and to get a representative collection of trees.

Finally, in step three, the majority-rule consensus is computed from the resulting intermediate trees. The resulting tree is called the quartet-puzzling tree. The consensus step provides information about the number of times a particular grouping occurred in the intermediate trees. This so-called reliability value, or support value, measures (in %) how frequently a group of sequences occurs among all intermediate trees.
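The support values described above amount to counting how often each grouping appears. The sketch below assumes a simplified, hypothetical representation in which every intermediate tree is given as a set of groups (frozensets of sequence names); it is not the data structure of any particular program.

```python
from collections import Counter

def support_values(trees):
    """Percentage of intermediate trees in which each group of sequences occurs."""
    counts = Counter(group for tree in trees for group in tree)
    return {group: 100.0 * c / len(trees) for group, c in counts.items()}

def majority_rule(trees):
    """Groups occurring in more than 50% of the intermediate trees."""
    return {g: s for g, s in support_values(trees).items() if s > 50.0}

# Four hypothetical intermediate trees on the sequences A to E.
AB, CD, AC, DE = (frozenset(x) for x in ("AB", "CD", "AC", "DE"))
trees = [{AB, CD}, {AB, CD}, {AB, DE}, {AC, DE}]

support = support_values(trees)
consensus = majority_rule(trees)
print(support[AB], support[CD], sorted("".join(sorted(g)) for g in consensus))
```

Groups that reach exactly 50% are left out here, which is why only the group {A, B} survives in this toy example.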

The accepted strategy is to infer a "reasonable" tree topology with faster reconstruction methods and use that tree to estimate the parameters. A maximum-likelihood tree can then be re-estimated with the new set of parameters. This approach assumes that parameter estimates are not greatly disturbed when using a slightly incorrect topology. Among the fast distance-based tree reconstruction methods, NJ is a natural choice: it has been reported that the NJ tree is generally similar to the true tree. (Section 6.5)

1. Based on reasonable pairwise genetic-distance estimates, an NJ tree is computed.
2. Then, maximum-likelihood branch lengths are computed for this tree topology and the parameters of the sequence evolution model are estimated.
3. Based on these estimates, a new NJ tree is computed and Step 2 is repeated.

Steps 2 and 3 are repeated until the estimates of the model parameters are stable. TREE-PUZZLE employs this idea to obtain approximate estimates of model parameters, saving computation time while still serving as an efficient tool to estimate model parameters.

The quartet-based approach may also be used to study the amount of evolutionary information contained in a data set. If $L_1$, $L_2$, and $L_3$ are the likelihoods of trees $T_1$, $T_2$, and $T_3$, then it is possible to compute the posterior probabilities of each tree $T_i$ as $p_i = L_i / (L_1 + L_2 + L_3)$, where the $p_i$ terms sum to 1 and $0 < p_i < 1$ for each i. The probabilities $p_1$, $p_2$, and $p_3$ can be reported simultaneously as a point P lying inside an equilateral triangle, each corner of the triangle representing one of the three possible tree topologies. (Section 6.6)
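In practice likelihoods are usually stored as logarithms, so the posterior probabilities are best computed after subtracting the largest log-likelihood. The sketch below does this for one quartet; the three log-likelihood values are hypothetical.

```python
import math

def quartet_posteriors(log_l1, log_l2, log_l3):
    """Posterior probabilities p_i = L_i / (L_1 + L_2 + L_3) from log-likelihoods."""
    logs = (log_l1, log_l2, log_l3)
    m = max(logs)  # subtract the maximum to avoid underflow in exp()
    weights = [math.exp(x - m) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log-likelihoods of the trees T1, T2, T3 for one quartet.
p = quartet_posteriors(-1234.5, -1236.0, -1240.2)
best = max(range(3), key=lambda i: p[i]) + 1
print([round(x, 4) for x in p], "best tree: T%d" % best)
```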

If P is close to one corner (for example, the corner $T_1$), the tree $T_1$ receives the highest support. In a maximum-likelihood analysis, the tree $T_i$ that satisfies $p_i = \max\{p_1, p_2, p_3\}$ is selected as the MLE. However, this decision is questionable if P is close to the center of the triangle. In that case, a more realistic representation of the data is a star-like tree rather than an artificially strictly bifurcating tree.

Therefore, the likelihood-mapping method partitions the area of the equilateral triangle into seven regions. The three trapezoids at the corners represent the areas supporting strictly bifurcating trees. The three rectangles on the sides represent regions where the decision between two trees is not obvious. The center of the triangle represents sets of P vectors where all three trees are poorly supported. The three likelihoods for the three tree topologies of each possible quartet are reported as a dot in such an equilateral triangle.

The distribution of dots in the seven major areas of the triangle gives an overall impression of the tree-likeness of the data. What is informative about the mode of evolution of the sequences under investigation is the percentage of dots belonging to the three main types of area in the equilateral triangle. The three corners represent fully resolved tree topologies, that is, the presence of tree-like phylogenetic signal in the data. The center is the area of star-like phylogeny. The three areas on the sides represent network-like phylogeny, in which the data support conflicting tree topologies.
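The percentages per region can be tallied once each dot is assigned to a region. The sketch below assumes that each of the seven regions consists of the points closest (in Euclidean distance) to one of seven reference points: the three corners, the three side midpoints, and the center. This is one common way to define the partition and is stated here as an assumption; the region names and example dots are hypothetical.

```python
import math
from collections import Counter

# Reference points in (p1, p2, p3) coordinates; an assumed definition of the regions.
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0), "T2": (0.0, 1.0, 0.0), "T3": (0.0, 0.0, 1.0),
    "T1/T2": (0.5, 0.5, 0.0), "T1/T3": (0.5, 0.0, 0.5), "T2/T3": (0.0, 0.5, 0.5),
    "star": (1 / 3, 1 / 3, 1 / 3),
}

def region(p):
    """Name of the region whose reference point is closest to P = (p1, p2, p3)."""
    return min(ATTRACTORS, key=lambda name: math.dist(p, ATTRACTORS[name]))

def region_percentages(points):
    """Percentage of dots falling into each of the seven regions."""
    counts = Counter(region(p) for p in points)
    return {name: 100.0 * counts[name] / len(points) for name in ATTRACTORS}

# Hypothetical dots, one per quartet.
dots = [(0.9, 0.05, 0.05), (0.34, 0.33, 0.33), (0.48, 0.48, 0.04), (0.02, 0.95, 0.03)]
percent = region_percentages(dots)
print(percent)
```

Adding up the three corner percentages gives the share of quartets with tree-like signal, the three side regions give the network-like share, and the "star" entry gives the star-like share.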