Current challenges for molecular phylogenetics Barbara Holland School of Mathematics & Physics University of Tasmania Mostly statistical.



Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines
Phylogenetics seeks to determine these genetic relationships
[Figure: Darwin’s sketch: the first phylogenetic tree? Portraits of Charles Darwin and Alfred Russel Wallace.]

Since the publication of On the Origin of Species in 1859, people have been trying to infer the evolutionary “tree of life”.
[Figure: Ernst Haeckel’s Tree of Life (1866)]

Haeckel’s Pedigree of Man

Why molecular phylogeny
Most molecules evolve independently of adaptations affecting morphology.
It is fairly easy to find genes that are present in all species of interest, e.g., a 12S rRNA molecule in mitochondria is functional across all mammals.
Useful mathematical models of sequence evolution have been developed that underpin attempts to infer evolutionary trees.

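[Figure: golden mole, mole, whale. Which two are most closely related?]

[Figure: the answer. Mole and whale belong to Laurasiatheria; the golden mole belongs to Afrotheria.]

[Figure: hedgehog, tenrec, elephant. Which two are most closely related?]

[Figure: the answer. The hedgehog belongs to Laurasiatheria; tenrec and elephant belong to Afrotheria.]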

A brief and incomplete history of molecular phylogenetics
[Timeline figure spanning the 60s, 70s, 80s, 90s and 00s. Entries, in the order they appear on the slide:]
Antibodies
DNA-DNA hybridisation
Sarich, Wilson
Assessing support: bootstrap
Explicit models: maximum likelihood
Systematic bias: the Felsenstein Zone
Felsenstein
Sequence data (amino acid, then DNA)
Parsimony
Distance based
More complex models: Bayesian methods
Various perils... anomalous gene trees, non-identifiable models
Population processes, gene trees in species trees
MORE sequence data
PCR

The molecular phylogeny problem
[Figure: a tree drawn against a time axis. Ancestral sequences such as ACCGCTTA, ACTGCTTA and ACCCCTTA mutate along the branches down to the modern-day sequences at the tips.]
We see the aligned modern-day sequences
…ACCCCTTA…
…ACCCCATA…
…ACTGCTTA…
…ACTGCTAA…
and want to recover the underlying evolutionary tree.

Sequence evolution is modelled as a Markov process
Consider a single edge in a phylogeny, i.e. evolution of a single species, and the evolution of a single DNA base amongst the possible states {A, C, G, T}.
The probability of mutating from state i to state j over a length of time t depends only on the current state i and the potential future state j, not on any of the previous history of the sequence, and can be written $p_{ij}(t)$.
[Figure: a single site changing state among A, C, T, G along a time axis of length t.]

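Continuous time Markov chains
Transition matrix (rows and columns ordered A, C, G, T):
\[
M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
\]
Instantaneous rate matrix:
\[
Q = \begin{pmatrix}
-q_{A*} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & -q_{T*}
\end{pmatrix}
\]
where $q_{i*} = \sum_{j \neq i} q_{ij}$, i.e. rows sum to zero.
\[
M = \exp(Qt)
\]
Typically we restrict to stationary, reversible models, with the stationary distribution denoted by $\pi$. So $\pi Q = 0$, and $D(\pi)Q$ is symmetric.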
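The relation M = exp(Qt) can be checked numerically. The sketch below is an illustration, not part of the talk: it assumes the Jukes Cantor rate matrix (every off-diagonal rate equal to a hypothetical value `alpha`), evaluates the exponential with a truncated Taylor series, and compares the result with the well-known JC closed form.

```python
import math

def jc_rate_matrix(alpha):
    """Jukes Cantor Q: every off-diagonal rate is alpha, so rows sum to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=40):
    """exp(Qt) by truncated Taylor series; fine for a 4x4 matrix and modest Qt."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds (Qt)^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

alpha, t = 0.1, 2.0                            # illustrative values only
M = mat_exp(jc_rate_matrix(alpha), t)

# Closed-form JC transition probabilities for comparison.
same = 0.25 + 0.75 * math.exp(-4 * alpha * t)
diff = 0.25 - 0.25 * math.exp(-4 * alpha * t)
```

Each row of `M` is a probability distribution over the state at the end of the edge, so it should sum to one. For larger or stiffer rate matrices a library routine for the matrix exponential would be a safer choice than a hand-rolled series.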

Models of nucleotide substitution
Jukes Cantor (JC)
–All substitutions equally likely
–Base frequencies equal
Kimura 2 Parameter (K2P)
–Transitions and transversions at different rates
–Base frequencies equal
HKY model
–Transitions and transversions at different rates
–Base frequencies different
General Time Reversible (GTR)
[Figure: rate diagram on the nucleotides A, G, C, T, repeated once per model. JC: all six substitution types have rate α. K2P and HKY: two substitution types have rate β and the other four have rate α. GTR: six distinct rates α, β, γ, δ, ε, ζ.]

Models define probability distributions on site patterns
The model θ consists of: the tree topology, edge weights, Q matrix*, and root distribution π.
* More generally, this could be a set of Q matrices.
[Figure: rooted three-taxon tree with root y and internal node x. Edges carry the transition matrices $M_1$, $M_2$, $M_{12}$, $M_3$, with edge weights $t_1$, $t_2$, $t_3$, $t_{12}$.]
\[
M_e = \exp(Q t_e)
\]
\[
p_{ijk} = \sum_{x,y} M_1(x,i)\, M_2(x,j)\, M_{12}(y,x)\, M_3(y,k)\, \pi(y)
\]
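As a rough illustration of the site-pattern formula, the sketch below sums over the two unobserved states x and y for the three-taxon tree. It assumes the Jukes Cantor model with edge weights measured in expected substitutions per site and a uniform root distribution; the function names and edge weights are hypothetical.

```python
import math
from itertools import product

STATES = "ACGT"

def jc_prob(a, b, t):
    """JC probability of ending in state b after time t, starting from a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def pattern_prob(i, j, k, t1, t2, t12, t3):
    """p_ijk: sum over the root state y and the internal state x."""
    total = 0.0
    for x, y in product(STATES, repeat=2):
        total += (jc_prob(x, i, t1) * jc_prob(x, j, t2)
                  * jc_prob(y, x, t12) * jc_prob(y, k, t3)
                  * 0.25)                      # pi(y), uniform under JC
    return total

# Illustrative edge weights; any positive values would do.
edges = dict(t1=0.1, t2=0.1, t12=0.05, t3=0.2)
probs = {p: pattern_prob(*p, **edges) for p in product(STATES, repeat=3)}
```

The 64 values in `probs` form a probability distribution over site patterns, which is what the slide title refers to.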

Tree estimation using maximum likelihood
For a given set of parameters θ we can calculate the probability of any particular site pattern.
The overall probability of an alignment is then taken to be the product of the probabilities for each site (i.i.d. assumption).
This is the likelihood function, i.e. the probability of the data given the model.
We can then use optimisation techniques to find the model parameters (tree topology, edge lengths, parameters of the substitution model) that maximise the likelihood.
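The smallest instance of this recipe has only two sequences and one edge length to estimate. The sketch below is a deliberate simplification (no topology search): it assumes the JC model, multiplies per-site probabilities by summing their logs, and maximises over a grid of candidate edge lengths. The two sequences are illustrative only.

```python
import math

def jc_prob(a, b, t):
    """JC transition probability, with t in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def log_likelihood(seq1, seq2, t):
    """Sum of per-site log probabilities (sites treated as i.i.d.)."""
    return sum(math.log(0.25 * jc_prob(a, b, t)) for a, b in zip(seq1, seq2))

seq1 = "ACCGCTTAACTGCTTA"
seq2 = "ACTGCTAAACTGCTTA"

# Crude optimiser: evaluate the likelihood on a grid and keep the best value.
grid = [k / 1000 for k in range(1, 2001)]
t_hat = max(grid, key=lambda t: log_likelihood(seq1, seq2, t))

# For two sequences under JC the maximum is known in closed form.
p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
t_closed = -0.75 * math.log(1 - 4 * p / 3)
```

On a real tree the same likelihood is maximised jointly over topology, all edge lengths and the substitution parameters, which is why numerical optimisation and tree search are needed.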

Extra features of sequence evolution that can be modelled
Site to site rate variation (usually modelled by a gamma distribution)
Invariant sites
BUT some parts of reality are problematic…
–Base composition bias
–Sites that are free to vary change across the tree
–Non-independence of sites

Likelihood versus parsimony (the Felsenstein Zone)
Prior to the introduction of ML to the phylogenetics community by Joe Felsenstein, Maximum Parsimony (MP) was the most widely used method for estimating phylogenetic trees.
MP chooses the tree that requires the fewest mutations to explain the data.
[Figure: two four-taxon trees on leaves A, B, C, D with nucleotide states (A or G) mapped onto the nodes.]

Likelihood versus parsimony (the Felsenstein Zone)
The MP criterion has been shown to be statistically inconsistent on some trees under the models of nucleotide substitution discussed previously.
Likelihood is statistically consistent (given the correct model).
Felsenstein (1978); Hendy & Penny (1989)
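To make the MP criterion concrete, here is a minimal sketch that counts the fewest changes a tree needs, using Fitch's algorithm (a standard way to score parsimony; the talk does not say which algorithm was used). The toy four-taxon alignment and the function names are assumptions made for illustration.

```python
def fitch_changes(tree, states):
    """Minimum number of changes at one site.
    tree: nested 2-tuples of leaf names; states: leaf name -> character."""
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:                              # children agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

def parsimony_length(tree, alignment):
    """Total changes over all sites of an alignment (name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[s] for name, seq in alignment.items()})
        for s in range(n_sites)
    )

alignment = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}
trees = {
    "((a,b),(c,d))": (("a", "b"), ("c", "d")),
    "((a,c),(b,d))": (("a", "c"), ("b", "d")),
    "((a,d),(b,c))": (("a", "d"), ("b", "c")),
}
scores = {name: parsimony_length(tree, alignment) for name, tree in trees.items()}
mp_tree = min(scores, key=scores.get)           # the tree needing fewest changes
```

With only four taxa all three topologies can be scored exhaustively; for larger data sets the number of trees grows too fast for that, so heuristic searches are used instead.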

Assessing confidence
It is not just of interest to get a point estimate of the phylogenetic tree. We would also like some measure of confidence in our point estimate.
–Is our tree likely to change if we get more data?
–How robust is our result to sampling error?
The bootstrap is a useful tool for answering these sorts of questions.

The bootstrap (Felsenstein 1985)
For each bootstrap sample:
–Create a new alignment (of the same length as the original) by resampling, with replacement, the columns of the observed alignment
–Construct a tree for the ‘bootstrap’ alignment
The bootstrap support for each edge is the number (or proportion) of bootstrap trees that edge appears in.
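The resampling procedure can be sketched in a few lines. In the sketch below the tree-building step is a stand-in chosen for brevity: for four taxa it picks the split supported by the most parsimony-informative columns. The alignment, the number of replicates and the seed are illustrative assumptions.

```python
import random

# The three possible splits of four taxa, as index pairs into (a, b, c, d).
SPLITS = {"ab|cd": (0, 1, 2, 3), "ac|bd": (0, 2, 1, 3), "ad|bc": (0, 3, 1, 2)}

def best_split(columns):
    """Stand-in tree builder: the split backed by most informative columns.
    Ties go to the first split in SPLITS, a simplification of this sketch."""
    counts = {name: 0 for name in SPLITS}
    for col in columns:
        for name, (i, j, k, l) in SPLITS.items():
            if col[i] == col[j] and col[k] == col[l] and col[i] != col[k]:
                counts[name] += 1
    return max(counts, key=counts.get)

def bootstrap_support(seqs, replicates=200, seed=1):
    rng = random.Random(seed)
    columns = list(zip(*seqs))                  # one tuple per alignment column
    tally = {name: 0 for name in SPLITS}
    for _ in range(replicates):
        # Resample columns with replacement, keeping the original length.
        sample = [rng.choice(columns) for _ in columns]
        tally[best_split(sample)] += 1
    return {name: n / replicates for name, n in tally.items()}

seqs = ["ATATAAA", "ATTATAA", "TAAAATA", "TATAAAT"]   # taxa a, b, c, d
support = bootstrap_support(seqs)
```

`support` maps each split to the proportion of bootstrap trees containing it. In practice the stand-in `best_split` would be replaced by whatever tree estimation method was used on the original alignment.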

[Figure: the observed alignment and several bootstrap alignments made by resampling its columns, each with the tree built from it. The edge of interest in the final tree on taxa a, b, c, d is labelled with bootstrap support 0.75.]
Observed alignment:
a ATATAAA
b ATTATAA
c TAAAATA
d TATAAAT

Example where the bootstrap is useful
Simulate data on the four taxon tree below (JC model)
Use sequence lengths of 100, 1000, and
[Figure: the four-taxon tree on taxa a, b, c, d used for the simulation.]
Tree             100      1000     (third length)
((a,b),(c,d))    5.7%     97%      100%
((a,c),(b,d))    42.8%    <5%      0
((a,d),(b,c))    49.8%    <5%

Example where it is not so useful
Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences
Use total sequence lengths of 100, 1000, and
[Figure: two four-taxon trees on taxa a, b, c, d, weighted 55% and 45%.]
Tree             100      1000     (third length)
((a,b),(c,d))    64%      80%      98%
((a,c),(b,d))    33%      20%      <5
((a,d),(b,c))    3%       0%       <

Genome-scale phylogeny
Data sets with many concatenated genes
–Rokas et al., Nature 2003 (106 genes, 8 taxa)
–Goremykin et al., MBE 2004 (61 genes, 14 taxa)
Estimated trees have very high bootstrap support.
BUT... trees are sensitive to: model used, method used, data-coding.

Case study: The Amborella Wars

[Figure labels: Angiosperms; Grasses; A New Caledonian shrub]

[Figure: NJ bootstrap with ML distances using a GTR + gamma model. Horizontal axis: alpha (gamma shape parameter), running from skewed rates to equal rates. Vertical axis: bootstrap support. Labels: Amb+Nym, Grasses, Amb, Nym.]

Sensitivity to model choice Phylogenomic datasets may involve hundreds of genes for many species. These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes. One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa:
C. albicans, S. kluyveri, S. kudriavzevii, S. bayanus, S. cerevisiae, S. paradoxus, S. mikatae, S. castellii

Two extremes
How many parameters do we need to adequately represent the branches of all (unrooted) gene trees?
Between 13 (consensus tree) and 13 x 106 = 1378
Too few parameters introduce bias
Too many parameters increase the variance
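The two numbers follow from the fact that an unrooted binary tree on n taxa has 2n - 3 branches. A short check of the arithmetic (the function name is hypothetical):

```python
def unrooted_branch_count(n_taxa):
    """Number of branches in an unrooted binary tree with n_taxa leaves."""
    return 2 * n_taxa - 3

n_taxa, n_genes = 8, 106

# One shared set of branch lengths for every gene (the consensus-tree extreme).
shared = unrooted_branch_count(n_taxa)

# A separate set of branch lengths for each gene (the other extreme).
separate = shared * n_genes
```

Here `shared` is 13 and `separate` is 1378, matching the two extremes on the slide.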

Stochastic partitioning Attempts to cluster genes into classes that have evolved in a similar fashion. Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution)

Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise parameters for each class.
3. Compute the posterior probability for each gene with the parameters from each class.
4. Move each gene into the class for which it has highest posterior probability.
5. Go to step 2; when no genes change class, STOP.
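The loop structure of the steps above can be sketched with a toy stand-in for the phylogenetic model. Everything model-specific here is an assumption made for illustration: each gene is reduced to a single hypothetical summary number, each class has one parameter (a mean), the likelihood is Gaussian with a fixed spread, and classes have equal prior weight so the posterior is proportional to the likelihood. A real implementation would optimise branch lengths or substitution parameters per class instead.

```python
import random

def fit_class(values):
    """Step 2 (toy version): the class parameter is the mean of its genes."""
    return sum(values) / len(values)

def log_lik(value, mean, sigma=0.05):
    """Gaussian log-likelihood up to a constant; stands in for a tree likelihood."""
    return -((value - mean) ** 2) / (2 * sigma ** 2)

def stochastic_partition(genes, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in genes]            # step 1
    params = [None] * k
    for _ in range(max_iter):
        for c in range(k):                                # step 2
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                params[c] = fit_class(members)
            elif params[c] is None:
                params[c] = rng.choice(genes)             # seed an empty class
        # Steps 3 and 4: with equal priors, highest posterior = highest likelihood.
        new_labels = [max(range(k), key=lambda c: log_lik(g, params[c]))
                      for g in genes]
        if new_labels == labels:                          # step 5: nothing moved
            break
        labels = new_labels
    return labels

# Hypothetical per-gene summaries forming two well-separated groups.
genes = [0.10, 0.11, 0.12, 0.49, 0.50, 0.52]
labels = stochastic_partition(genes, k=2)
```

On this toy input the first three genes end up in one class and the last three in the other. Which class index each group receives depends on the random start, which is the same sensitivity to starting points that real runs show.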

How many classes?

Conclusions regarding stochastic partitioning
Pros
–AIC/BIC give a quantitative way to choose how many parameters are needed.
–Identifies groups of genes under similar constraints
Cons
–Slow
–Randomized algorithm, so different starting points lead to different partitions.
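For reference, the standard definitions are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of observations and ln L the maximised log-likelihood. The sketch below applies them to made-up numbers; none of the values come from the talk.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical fits: (maximised log-likelihood, number of free parameters)
# for partitions with 1, 2 and 3 classes of 13 branch lengths each.
fits = {1: (-1000.0, 13), 2: (-960.0, 26), 3: (-955.0, 39)}
n_sites = 500                                   # hypothetical alignment length

aic_scores = {k: aic(ll, p) for k, (ll, p) in fits.items()}
bic_scores = {k: bic(ll, p, n_sites) for k, (ll, p) in fits.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

With these made-up numbers AIC prefers two classes while BIC, which penalises extra parameters more heavily at this sample size, prefers one. The two criteria need not agree.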

Brief Tour…
Combinatorics of tree space
Graph Theory
Stochastic Models, Inference & Probability Theory
Algebraic Geometry
Lie groups, representation theory
….

Identifiability
[Figure 2 of Matsen and Steel (2007)]
Matsen and Steel (2007): “…the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the ‘correct’ method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.”

Algebraic geometry approach
Elizabeth Allman, John Rhodes

The boundary of phylogenetics and population genetics
[Figure: Fisher-Wright model; phylogenetic tree]

Gene trees in species phylogenies
James Degnan, Noah Rosenberg

Representation theory, Lie groups, Markov invariants, closure of model classes
Jeremy Sumner, Peter Jarvis