Maximum likelihood (ML) method


Jarno Tuimala
Thanks to James McInerney for the slides with a darker background!

Maximum likelihood
Historically a new method (Felsenstein, 1980s).
ML assumes a model of sequence evolution.
Using the model, the ML method tries to answer the question: what is the likelihood (a conditional probability) of observing these data, given a certain model?

Maximum Likelihood - goal
To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time: P(data | tree, model of evolution).
The principal objective of maximum likelihood is to estimate the probability of observing a set of sequences (from extant organisms, obtained in a molecular biology laboratory). As we will see later, this probability depends on a number of things, but mainly on what we call a model of sequence evolution. The model has several components: a phylogenetic tree with branch lengths, and a description of how evolution occurred, which often consists of a substitution matrix, a description of rate variation between sites, and the frequencies of the nucleotides, codons or amino acids over the evolutionary time period.

Probability of observing a sequence
What is the probability of observing the sequence ACGT if p(a)=p(c)=p(g)=p(t)=0.25?
Assumption: sequence sites evolve independently.
P(ACGT) = p(a)*p(c)*p(g)*p(t) = 0.25*0.25*0.25*0.25 = 0.00390625
ln P = ln(0.00390625) = -5.545177 (natural logarithm)
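The arithmetic on this slide can be checked with a short sketch. The function name `sequence_log_prob` is an illustrative choice, not something from the slides. The sketch assumes independent sites and uses the natural logarithm.

```python
import math

def sequence_log_prob(seq, freqs):
    """Natural-log probability of a sequence when sites are independent."""
    # Independence turns the product of frequencies into a sum of logs.
    return sum(math.log(freqs[base]) for base in seq.lower())

# Equal base frequencies, as on the slide.
freqs = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

log_p = sequence_log_prob("ACGT", freqs)
p = math.exp(log_p)
print(p, log_p)  # roughly 0.0039 and -5.545
```

Summing logarithms rather than multiplying raw probabilities is the usual choice for long alignments, where the product would underflow.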

Substitution matrix
For nucleotide sequences, substitutions are described by a 4x4 matrix with 16 entries: 12 for the possible changes between different nucleotides and 4 for no change.
By convention, the nucleotides are ordered a, c, g, t.
Note: for amino acids the matrix is 20x20, and for codon-based models it is 61x61 (the 64 codons minus the 3 stop codons).

Does changing a model affect the outcome?
There are different models:
Jukes and Cantor (JC69): All base compositions equal (0.25 each); the rate of change from one base to another is the same for all pairs.
Kimura 2-Parameter (K2P): All base compositions equal (0.25 each); different substitution rates for transitions and transversions.
Hasegawa-Kishino-Yano (HKY): Like K2P, but with base composition free to vary.
General Time Reversible (GTR): Base composition free to vary; all possible substitutions can differ.
All these models can be extended to accommodate invariable sites and site-to-site rate variation.
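One way to see how these models nest is to build each as a special case of a GTR-style rate matrix. The following is a minimal sketch, not taken from the slides: the helper names, the transition/transversion ratio of 2 and the unequal HKY frequencies are all arbitrary illustrations. The matrices are left unscaled, so their time unit is arbitrary.

```python
BASES = "acgt"  # conventional nucleotide order

def gtr_rate_matrix(freqs, rates):
    """Build a 4x4 rate matrix with Q[i][j] = rate(i, j) * freq(j) for i != j.

    freqs: dict mapping base -> equilibrium frequency
    rates: dict mapping frozenset({x, y}) -> exchangeability of that pair
    Each diagonal entry is set so that its row sums to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = rates[frozenset((x, y))] * freqs[y]
        q[i][i] = -sum(q[i])  # the diagonal is still 0.0 when the sum is taken
    return q

def pair_rates(transition, transversion):
    """One exchangeability for transitions (a<->g, c<->t), one for transversions."""
    rates = {}
    for i, x in enumerate(BASES):
        for y in BASES[i + 1:]:
            pair = frozenset((x, y))
            is_transition = pair in (frozenset("ag"), frozenset("ct"))
            rates[pair] = transition if is_transition else transversion
    return rates

EQUAL = {b: 0.25 for b in BASES}
UNEQUAL = {"a": 0.1, "c": 0.2, "g": 0.3, "t": 0.4}  # hypothetical composition

jc69 = gtr_rate_matrix(EQUAL, pair_rates(1.0, 1.0))   # one rate, equal frequencies
k2p = gtr_rate_matrix(EQUAL, pair_rates(2.0, 1.0))    # two rates, equal frequencies
hky = gtr_rate_matrix(UNEQUAL, pair_rates(2.0, 1.0))  # two rates, free frequencies
```

A full GTR matrix would pass six independent exchangeabilities in `rates` instead of the two produced by `pair_rates`.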

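Probability of observing a sequence change 1/2
Alignment:
ACCT
GCCT
Tree: ACCT – GCCT
Change probabilities (Jukes-Cantor, μ=0.1): 0.9815 for no change and 0.0062 for each specific change (the values used in the calculation on the next slide).
Nucleotide frequencies: p(a)=p(c)=p(g)=p(t)=0.25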

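Probability of observing a sequence change 2/2
P(ACCT, GCCT) = product over all sites of (frequency * change probability)
P(ACCT, GCCT) = 0.25*0.0062 * 0.25*0.9815 * 0.25*0.9815 * 0.25*0.9815 = 0.00002289932
log10 P(ACCT, GCCT) = -4.64 (base-10 logarithm; the natural logarithm, as used for P(ACGT) earlier, is -10.68)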
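The product over sites can be reproduced with a few lines of code. This is a sketch using the slide's numbers (frequency 0.25, change probabilities 0.9815 and 0.0062); the function name `pair_likelihood` is illustrative.

```python
import math

P_SAME = 0.9815  # probability of no change, from the slide
P_DIFF = 0.0062  # probability of one specific change, from the slide
FREQ = 0.25      # equal nucleotide frequencies

def pair_likelihood(seq1, seq2):
    """Likelihood of a two-sequence alignment, multiplying over independent sites."""
    like = 1.0
    for x, y in zip(seq1, seq2):
        like *= FREQ * (P_SAME if x == y else P_DIFF)
    return like

like = pair_likelihood("ACCT", "GCCT")
print(like)              # about 2.29e-05
print(math.log10(like))  # about -4.64
print(math.log(like))    # about -10.68
```

Printing both logarithms makes the choice of base explicit, since the two values differ by a factor of ln(10).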

Different Branch Lengths
For very short branch lengths, the probability of a character staying the same is high and the probability of it changing is low (for our particular matrix). For longer branch lengths, the probability of character change becomes higher and the probability of staying the same becomes lower.
The previous calculations assume that the branch length describes one Certain Evolutionary Distance, or CED. To consider a branch that is twice as long (2 CED), we multiply the substitution matrix by itself (matrix squared): P(2 CED) = P(1 CED) x P(1 CED).
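Squaring the matrix can be sketched as follows. The off-diagonal value 0.0062 is the slide's change probability. The diagonal is set to 1 - 3*0.0062 = 0.9814 so that each row sums to exactly 1 (the slides use the rounded value 0.9815). The helper names are illustrative.

```python
P_DIFF = 0.0062             # probability of one specific change over 1 CED
P_SAME = 1.0 - 3 * P_DIFF   # 0.9814, so each row sums to 1

def jc_matrix(p_same, p_diff):
    """4x4 Jukes-Cantor-style matrix: one value on the diagonal, one elsewhere."""
    return [[p_same if i == j else p_diff for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

p1 = jc_matrix(P_SAME, P_DIFF)  # change probabilities over 1 CED
p2 = mat_mul(p1, p1)            # change probabilities over 2 CED

print(p1[0][0], p1[0][1])
print(p2[0][0], p2[0][1])
```

Over the doubled branch the probability of staying the same drops (to about 0.963) and the probability of each specific change rises (to about 0.012), matching the qualitative statement on the slide.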

Optimizing the branch lengths
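As a rough illustration of branch-length optimization, the sketch below searches for the branch length that maximizes the likelihood of the two-sequence alignment ACCT / GCCT under Jukes-Cantor. It assumes branch length is measured in expected substitutions per site and uses a simple grid search. Real programs use faster numerical optimizers, and the function names here are hypothetical.

```python
import math

def jc_probs(d):
    """JC69 probabilities for a branch of d expected substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e  # (no change, one specific change)

def log_likelihood(d, seq1="ACCT", seq2="GCCT"):
    """Natural-log likelihood of the pair for branch length d."""
    p_same, p_diff = jc_probs(d)
    return sum(math.log(0.25 * (p_same if x == y else p_diff))
               for x, y in zip(seq1, seq2))

def best_branch_length(lo=1e-6, hi=5.0, steps=100000):
    """Grid search: return the grid point with the highest log-likelihood."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=log_likelihood)

d_hat = best_branch_length()

# For two sequences under JC69 the optimum has a closed form,
# d = -3/4 * ln(1 - 4p/3), where p is the proportion of differing sites (1/4 here).
closed_form = -0.75 * math.log(1.0 - (4.0 / 3.0) * 0.25)
print(d_hat, closed_form)  # both close to 0.304
```

On a tree with more than two sequences there is no closed form, which is why branch lengths are optimized numerically, one at a time or jointly.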

Invariable sites
For a given dataset we can assume that a certain proportion of sites are not free to vary: purifying selection (related to function) prevents these sites from changing. We can therefore observe constant positions for three reasons: they are under this selective constraint, they have not yet had a chance to vary, or there is homoplasy in the dataset and a reversal (say) has made the site appear constant.
The likelihood that a site is invariable can be calculated by incorporating this possibility into the model and evaluating it for every site. Allowing for a certain proportion of invariable sites might improve the likelihood of the dataset (in a way that is analogous to the preceding discussion).
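A minimal sketch of how a proportion of invariable sites can enter the likelihood: each site is treated as a mixture of an invariable class, which can only produce constant sites, and a variable class that follows the substitution model. The change probabilities are those of the two-sequence example above. The proportion `p_inv = 0.2` and the function names are assumptions made purely for illustration.

```python
P_SAME = 0.9815  # probability of no change (variable class)
P_DIFF = 0.0062  # probability of one specific change (variable class)
FREQ = 0.25      # equal nucleotide frequencies

def site_likelihood(x, y, p_inv):
    """Mixture of an invariable class and a variable class for one site."""
    variable = FREQ * (P_SAME if x == y else P_DIFF)
    # An invariable site can only show the same nucleotide in both sequences.
    invariable = FREQ if x == y else 0.0
    return p_inv * invariable + (1.0 - p_inv) * variable

def alignment_likelihood(seq1, seq2, p_inv):
    """Product of the per-site mixture likelihoods."""
    like = 1.0
    for x, y in zip(seq1, seq2):
        like *= site_likelihood(x, y, p_inv)
    return like

without_inv = alignment_likelihood("ACCT", "GCCT", p_inv=0.0)
with_inv = alignment_likelihood("ACCT", "GCCT", p_inv=0.2)  # hypothetical proportion
print(without_inv, with_inv)
```

In practice `p_inv` is estimated from the data together with the other model parameters rather than fixed in advance.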

Variable sites
Obviously, other sites in the dataset are free to vary. Selection intensity on these sites is rarely uniform, so it is desirable to model site-by-site rate variation. This is done in two ways: with site-specific rates (by codon position, alpha helix, etc.), or with a discrete approximation to a continuous distribution (the gamma distribution).
Again, these variables are modeled over all possibilities of sequence change, over all possibilities of branch length, and over all possibilities of tree topology.

The shape of the gamma distribution for different values of alpha.

Incorporating gamma 1/2
Alignment:
ACCT
GCCT
Tree: ACCT – GCCT
Change probabilities (Jukes-Cantor, μ=0.1)
Nucleotide frequencies: p(a)=p(c)=p(g)=p(t)=0.25
Two gamma classes, p(g1)=0.8, p(g2)=0.2

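Incorporating gamma 2/2
P(ACCT, GCCT) = (0.25*0.0062*0.8 + 0.25*0.0062*0.2) * (0.25*0.9815*0.8 + 0.25*0.9815*0.2) * (0.25*0.9815*0.8 + 0.25*0.9815*0.2) * (0.25*0.9815*0.8 + 0.25*0.9815*0.2) = 0.00002289932
log10 P(ACCT, GCCT) = -4.64
In this simplified example both gamma classes use the same change probabilities, so the result equals the one obtained without gamma. In a real analysis each class has its own rate, and therefore its own change probabilities.
Using gamma, more calculations are done and more time is consumed.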
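The sketch below shows the same per-site weighted sum when the two rate classes actually differ. The class weights 0.8 and 0.2 are from the slide. The relative rates (0.5 and 3.0, chosen so that the weighted mean rate is 1) and the branch length 0.0188 are assumptions for illustration. Change probabilities follow the Jukes-Cantor formula, with branch length in expected substitutions per site.

```python
import math

def jc_probs(d):
    """JC69 probabilities for a branch of d expected substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e  # (no change, one specific change)

def site_likelihood(x, y, d, classes):
    """Weighted sum over rate classes; classes is a list of (weight, relative rate)."""
    total = 0.0
    for weight, rate in classes:
        # A faster class sees an effectively longer branch.
        p_same, p_diff = jc_probs(d * rate)
        total += weight * 0.25 * (p_same if x == y else p_diff)
    return total

def alignment_likelihood(seq1, seq2, d, classes):
    """Product of the per-site likelihoods."""
    like = 1.0
    for x, y in zip(seq1, seq2):
        like *= site_likelihood(x, y, d, classes)
    return like

D = 0.0188  # hypothetical branch length
single = alignment_likelihood("ACCT", "GCCT", D, [(1.0, 1.0)])
same_rates = alignment_likelihood("ACCT", "GCCT", D, [(0.8, 1.0), (0.2, 1.0)])
mixed = alignment_likelihood("ACCT", "GCCT", D, [(0.8, 0.5), (0.2, 3.0)])
print(single, same_rates, mixed)
```

With identical rates in both classes the mixture collapses to the single-rate likelihood, as on the slide. With different rates the value changes, and each site costs one evaluation per class, which is where the extra computing time comes from.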

Selecting the correct model 1/4
It was previously pointed out that parsimony can be inconsistent. ML can be inconsistent too!
If the model used in the ML analysis is incorrect, the method might become inconsistent.
Before analysis, the correct model should be selected.

Selecting the correct model 2/4

Selecting a correct model 3/4

Selecting a correct model 4/4

Checking for saturation

Likelihood mapping with TreePuzzle

Practical issues
There is an ML equivalent to the Wagner method for generating initial trees, but it is very slow.
Many programs create the initial tree using parsimony or distance methods, or use a completely random tree.
The search strategy is similar to parsimony: 100 random addition sequences (RAS) + tree bisection-reconnection (TBR) for a small dataset.
In addition, simulated annealing can be used for larger datasets.