Gene tree discordance and multi-species coalescent models. Noah Rosenberg, December 21, 2007. With James Degnan, Randa Tao, David Bryant, and Mike DeGiorgio.



Gene trees and species trees. Different genes may produce different inferences about species relationships.

Coalescent model for evolution within species, conditional on the species tree. Hudson (1983, Evolution); Tajima (1983, Genetics); Nei (1987, Molecular Evolutionary Genetics, book); Pamilo & Nei (1988, Molecular Biology and Evolution); Takahata (1989, Genetics); Wu (1991, Genetics); Hudson (1992, Genetics); Maddison (1997, Systematic Biology). [Figure: species tree with branch lengths $T_2$ and $T_3$.]

Assumptions of the multispecies coalescent model conditional on a species tree:
1. Coalescences occur within species, with the same rate for each lineage pair.
2. The rate of coalescence is proportional to the number of pairs of lineages.
3. When species splits are encountered, lineages from all groups descended from the split are allowed to coalesce.
[Figure: species tree with branch lengths $T_2$ and $T_3$.]

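The probability that $i$ lineages have $j$ ancestors at $T$ coalescent time units ($T = t/N$) in the past is denoted $g_{ij}(T)$ and is written in terms of the falling factorial $a_{[k]} = a(a-1)\cdots(a-k+1)$ and the rising factorial $a_{(k)} = a(a+1)\cdots(a+k-1)$. Takahata and Nei (1985, Genetics); Tavare (1984, Theoretical Population Biology).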
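The expression for this probability is not reproduced in the transcript. As a hedged sketch, the code below assumes the standard closed form from Tavare (1984), written with the falling and rising factorials defined on the slide; the function names `falling`, `rising`, and `g` are illustrative choices, not names from the talk.

```python
import math


def falling(a, k):
    """Falling factorial a_[k] = a(a-1)...(a-k+1), with a_[0] = 1."""
    result = 1
    for m in range(k):
        result *= a - m
    return result


def rising(a, k):
    """Rising factorial a_(k) = a(a+1)...(a+k-1), with a_(0) = 1."""
    result = 1
    for m in range(k):
        result *= a + m
    return result


def g(i, j, T):
    """Probability that i lineages have j ancestors T coalescent units ago.

    Assumed form (Tavare 1984):
      g_ij(T) = sum_{k=j..i} exp(-k(k-1)T/2) (2k-1) (-1)^(k-j)
                * j_(k-1) * i_[k] / ( j! (k-j)! i_(k) )
    """
    total = 0.0
    for k in range(j, i + 1):
        term = math.exp(-k * (k - 1) * T / 2.0)
        term *= (2 * k - 1) * (-1) ** (k - j)
        term *= rising(j, k - 1) * falling(i, k)
        term /= math.factorial(j) * math.factorial(k - j) * rising(i, k)
        total += term
    return total
```

For two lineages the sum collapses to the familiar values: `g(2, 2, T)` is the probability of no coalescence, $e^{-T}$, and `g(2, 1, T)` is $1 - e^{-T}$.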

Probability of a concordant gene tree topology. For 3 taxa, the probability of concordance is a sum of two terms:
1. The probability that the gene tree is determined in the 2-species phase, $1 - e^{-T}$.
2. One third of the probability that the gene tree is determined in the ancestral phase, $(1/3)e^{-T}$.
The probability of concordance therefore equals $1 - (2/3)e^{-T}$. [Figure: a concordant gene tree and a discordant gene tree on a species tree of A, B, C with internal branch length $T$.] Hudson (1983, Evolution); Nei (1987, Molecular Evolutionary Genetics); Tajima (1983, Genetics).
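A minimal sketch of the three-taxon calculation, using only the two terms stated on the slide. The function names are illustrative; $T$ is the internal branch length in coalescent units.

```python
import math


def concordance_probability(T):
    """P(gene tree matches the 3-taxon species tree) = 1 - (2/3) e^{-T}."""
    determined_in_two_species_phase = 1.0 - math.exp(-T)
    determined_in_ancestral_phase = math.exp(-T)
    # In the ancestral phase the three lineages are exchangeable,
    # so the matching topology gets one third of that probability.
    return determined_in_two_species_phase + determined_in_ancestral_phase / 3.0


def discordant_probability(T):
    """P(one particular discordant topology) = (1/3) e^{-T}."""
    return math.exp(-T) / 3.0
```

At $T = 0$ all three topologies are equally likely (1/3 each), and as $T$ grows the concordant topology approaches probability 1.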

[Plot: probability of the matching gene tree ((AB)C) and probability of a particular discordant gene tree ((BC)A).]

It would be desirable to have a general computation of the probability that a particular species tree topology with branch lengths gives rise to a particular gene tree topology

Gene tree probabilities under the multispecies coalescent model. Consider a species tree S (topology and branch lengths). Consider a gene tree G (topology only). A coalescent history gives the list of species tree branches on which gene tree coalescences occur. [Figure: species tree and gene tree on taxa A, B, C.] JH Degnan & LA Salter, Evolution 59: (2005).

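The list of coalescent histories for an example with five taxa. [Table: species tree with leaves A, B, C, D, E and gene tree with leaves A, C, B, D, E; columns for the gene tree coalescences (A,C), ((AC),B), (D,E), (((AC)B),(DE)), and the probability of each history.] $g_{ij}(T)$ is the probability that $i$ lineages coalesce to $j$ lineages during time $T$.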

Computing the probabilities of gene trees:
- What are the properties of the number of coalescent histories?
Using the probabilities of gene trees:
- Is it possible for the most likely gene tree to disagree with the species tree?
- How do species tree inference algorithms behave when applied to multiple gene trees?

The number of coalescent histories

The number of coalescent histories for the matching gene tree. $A_{S,m}$ is the number of coalescent histories for the matching gene tree when we subdivide the species tree root into $m$ pieces. [Figure: species tree on taxa A to F.]

The number of coalescent histories for trees with at most 5 taxa

Number of coalescent histories for special shapes with n taxa. Catalan number $C_{n-1}$ (Degnan 2005): 1, 2, 5, 14, 42, 132, 429, 1430, … Number of taxa in left subtree is $l$: -, -, -, 13, 42, 138, 462, 1573, …
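The first sequence on this slide can be checked directly. The sketch below computes the Catalan number $C_{n-1}$ for $n$ taxa, assuming the sequence as listed starts at $n = 2$; `caterpillar_histories` is an illustrative name, reflecting that the later slide quoting 1430 histories refers to a caterpillar species tree.

```python
from math import comb


def catalan(m):
    """m-th Catalan number, C_m = binom(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)


def caterpillar_histories(n):
    """Number of coalescent histories C_{n-1} for the matching gene tree
    on an n-taxon tree of this shape (Degnan 2005, as cited on the slide)."""
    return catalan(n - 1)


# Values for n = 2, ..., 9, to compare against the slide.
sequence = [caterpillar_histories(n) for n in range(2, 10)]
```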

The number of coalescent histories for up to 11 taxa

Ratio of the largest and smallest number of coalescent histories for n taxa

Which types of shapes have the most coalescent histories? [Figure: the number of coalescent histories for trees with 8 taxa, ordered from most to least.]

Caterpillar-like shapes with n taxa, based on 4- and 5-taxon subtrees: $C_{n-1}$; $\sim(5/4)\,C_{n-1} = 1.25\,C_{n-1}$; $\sim(23/16)\,C_{n-1} = 1.4375\,C_{n-1}$.

Largest values for caterpillar-like shapes based on 7- and 8-taxon subtrees: $\sim(1381/256)\,C_{n-1}$; $\sim(189/64)\,C_{n-1}$.

Can a non-matching gene tree have more coalescent histories? Caterpillar species tree: 1430 coalescent histories for the matching gene tree, 1441 coalescent histories for a non-matching gene tree.

Computing the probabilities of gene trees:
- What are the properties of the number of coalescent histories?
Using the probabilities of gene trees:
- Is it possible for the most likely gene tree to disagree with the species tree?
- How do species tree inference algorithms behave when applied to multiple gene trees?

For n>3 taxa, can species trees be discordant with the gene trees they are most likely to produce?

The labeled history for a gene tree is its sequence of coalescence events. Two different labeled histories can produce the same labeled topology, for example ((AB)(CD)), where either the (AB) or the (CD) coalescence can come first. Randomly joining pairs of lineages leads to a uniform distribution over the set of possible labeled histories. The number of labeled histories possible for four taxa is $\binom{4}{2}\binom{3}{2}\binom{2}{2} = 18$.

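If the branch lengths of the species tree are sufficiently short, coalescences will occur more anciently than the species tree root. [Figure: species tree on A, B, C, D with internal branch lengths $T_2$ and $T_3$, and the labeled histories of the resulting gene trees.] In that case a symmetric gene tree topology, which has two labeled histories, has combined probability 1/9, while an asymmetric topology, which has one labeled history, has probability 1/18.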
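The 1/9 and 1/18 values follow from enumerating every ordered sequence of pairwise joins of four lineages. The sketch below does this by brute force; representing a topology as nested `frozenset` objects is an implementation choice made here, not something from the talk.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def labeled_histories(lineages):
    """Yield the final tree for every ordered sequence of pairwise joins."""
    if len(lineages) == 1:
        yield lineages[0]
        return
    for i, j in combinations(range(len(lineages)), 2):
        merged = frozenset((lineages[i], lineages[j]))
        rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
        yield from labeled_histories(rest + [merged])


histories = list(labeled_histories(["A", "B", "C", "D"]))

# Uniform distribution over labeled histories, summed per labeled topology.
topology_probability = {
    topology: Fraction(count, len(histories))
    for topology, count in Counter(histories).items()
}

symmetric = frozenset((frozenset(("A", "B")), frozenset(("C", "D"))))
asymmetric = frozenset((frozenset((frozenset(("A", "B")), "C")), "D"))
```

Each of the three symmetric topologies collects two histories and each of the twelve asymmetric topologies collects one, which is why a symmetric gene tree can be more probable than the asymmetric species tree topology when all coalescences happen above the root.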

Gene tree frequency distribution for a species tree on A, B, C, D:
((AB)(CD))  0.132
((AC)(BD))  0.094
((AD)(BC))  0.094
(((AB)C)D)  0.125  (matching gene tree)
(((AB)D)C)  0.100
(((AC)B)D)  0.070
(((AC)D)B)  0.062
(((AD)B)C)  0.032
(((AD)C)B)  0.032
(((BC)A)D)  0.070
(((BC)D)A)  0.062
(((BD)A)C)  0.032
(((BD)C)A)  0.032
(((CD)A)B)  0.032
(((CD)B)A)

A species tree topology produces anomalous gene trees if branch lengths can be chosen so that the most likely gene tree topology differs from the species tree topology. [Plot: axes $T_2$ and $T_3$ (units of N generations), showing the region where the species tree is (((AB)C)D) and the most likely gene tree is not (((AB)C)D).] Example: the species tree is (((AB)C)D) but the most likely gene tree is ((AB)(CD)).
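As a rough illustration of the definition, the sketch below simulates gene trees on the species tree (((AB)C)D) with one lineage per species and tallies topology frequencies. It is a Monte Carlo approximation, not the exact computation used in the talk. The parameter names `ab_branch` (branch ancestral to A and B) and `abc_branch` (branch ancestral to A, B, and C) are assumptions introduced here, as is the convention that each pair of lineages coalesces at rate 1 per coalescent unit.

```python
import math
import random
from collections import Counter


def coalesce(lineages, duration, rng):
    """Run the coalescent on `lineages` for `duration` coalescent units."""
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Total rate is the number of lineage pairs, k(k-1)/2.
        elapsed += rng.expovariate(k * (k - 1) / 2.0)
        if elapsed > duration:
            break
        i, j = rng.sample(range(k), 2)  # every pair is equally likely
        merged = frozenset((lineages[i], lineages[j]))
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages


def simulate_gene_tree(ab_branch, abc_branch, rng):
    """One gene tree topology on species tree (((AB)C)D)."""
    ab = coalesce(["A", "B"], ab_branch, rng)
    abc = coalesce(ab + ["C"], abc_branch, rng)
    return coalesce(abc + ["D"], math.inf, rng)[0]  # root branch is unbounded


def topology_frequencies(ab_branch, abc_branch, reps=20000, seed=1):
    rng = random.Random(seed)
    counts = Counter(
        simulate_gene_tree(ab_branch, abc_branch, rng) for _ in range(reps)
    )
    return {topology: c / reps for topology, c in counts.items()}


MATCHING = frozenset((frozenset((frozenset(("A", "B")), "C")), "D"))
SYMMETRIC = frozenset((frozenset(("A", "B")), frozenset(("C", "D"))))
```

Calling `topology_frequencies(0.05, 0.05)` with both internal branches short should show the symmetric topology `SYMMETRIC` more often than `MATCHING`, while long branches make `MATCHING` dominate.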

Does the 4-taxon symmetric species tree topology produce anomalous gene trees? [Figure: species tree on A, B, C, D with branch lengths $T_2$ and $T_3$, and three gene trees; combined probability 1/9 versus probability 1/18.]

3 species: no anomalous gene trees (AGTs). 4 species: asymmetric but not symmetric species trees have AGTs. 5 or more species? [Plot: probability of the concordant gene tree and probability of a particular discordant gene tree.]

With 5 or more species, any species tree topology produces at least one anomalous gene tree. A labeled topology for n taxa is n-maximally probable if its probability under random branching is greater than or equal to that of any other labeled topology with n taxa. Proof: For n > 4, suppose a species tree topology is not n-maximally probable. If its branches are short enough, it produces AGTs that are n-maximally probable. [Figure: example tree shapes for 4, 5, and 6 taxa.]

With 5 or more species, any species tree topology produces at least one anomalous gene tree. Proof (continued): Suppose a species tree topology is n-maximally probable. For n > 8 an inductive argument reduces the problem to the case of n=5, 6, 7, or 8. For n=5, 6, 7, or 8 taxa it remains to show that the n-maximally probable species tree topologies produce AGTs.

With 5 or more species, any species tree topology produces at least one anomalous gene tree. Proof (continued): For n=5 the n-maximally probable species tree topology produces AGTs.

With 5 or more species, any species tree topology produces at least one anomalous gene tree. Proof (continued): For n=5, 6, 7, or 8 the n-maximally probable species tree topologies produce AGTs.

With 5 or more species, any species tree topology produces at least one anomalous gene tree. Proof (continued): An inductive argument for n > 8 reduces the problem to the case of n=5, 6, 7, or 8. For n > 8 one of the two most basal subtrees has between 5 and n-1 taxa inclusive. Choose branch lengths to produce an AGT for that subtree, and make them long for the other subtree.

With 5 or more species, any species tree topology produces at least one anomalous gene tree. Proof (summary): If the species tree topology is not n-maximally probable, it has maximally probable AGTs. By example, n-maximally probable species tree topologies produce AGTs for n=5, 6, 7, or 8. For n > 8, induction reduces the problem to the case of n=5, 6, 7, or 8. This completes the proof.

Some properties of anomalous gene trees

Anomalous gene trees can have the same unlabeled shape as the species tree. [Figure: species tree with leaves ordered A, B, C, D, E and gene tree with leaves ordered D, E, C, A, B.]

There exist mutually anomalous sets of tree topologies (“wicked forests”).

AGTs can occur if some but not all species tree branches are short. [Figure: species tree with branch lengths $T_4$, $T_3$, $T_2$.]

Does the severity of AGTs increase with more taxa? [Plot: axes $T_2$ and $T_3$, in units of N generations.] Maximal value for shared branch length that still produces AGTs:

Does the severity of AGTs increase with more taxa?

Number of AGTs for the 4-taxon asymmetric species tree

Number of AGTs for 5-taxon species trees

Does the number of AGTs increase with more taxa?

What implications do gene tree probabilities have for phylogenetic inference algorithms?

Most commonly observed gene tree topology: statistically inconsistent in estimating the species tree. [Figure: species tree on A, B, C, D with branch lengths $T_2$ and $T_3$ (units of N generations), shown alongside the estimated species tree.]
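A small sketch of why this estimator fails, using the gene tree frequency distribution listed earlier in the talk for the species tree (((AB)C)D). The value for (((CD)B)A) is missing from the transcript, so that topology is left out here; the omission is an assumption of this sketch and does not change which topology is most frequent unless its value exceeded 0.132, which the other entries rule out since they already sum to 0.969.

```python
# Gene tree topology frequencies from the talk (one value is not available).
gene_tree_frequency = {
    "((AB)(CD))": 0.132,
    "((AC)(BD))": 0.094,
    "((AD)(BC))": 0.094,
    "(((AB)C)D)": 0.125,
    "(((AB)D)C)": 0.100,
    "(((AC)B)D)": 0.070,
    "(((AC)D)B)": 0.062,
    "(((AD)B)C)": 0.032,
    "(((AD)C)B)": 0.032,
    "(((BC)A)D)": 0.070,
    "(((BC)D)A)": 0.062,
    "(((BD)A)C)": 0.032,
    "(((BD)C)A)": 0.032,
    "(((CD)A)B)": 0.032,
}

species_tree = "(((AB)C)D)"

# "Most commonly observed topology" estimator: pick the modal gene tree.
estimated_species_tree = max(gene_tree_frequency, key=gene_tree_frequency.get)
```

With these frequencies the estimator returns the symmetric topology rather than `species_tree`, and collecting more loci only makes that wrong answer more certain.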

Estimated gene tree of concatenated sequence: statistically inconsistent in estimating the species tree.

Maximum likelihood based on the frequency distribution of gene tree topologies: statistically consistent even when anomalous gene trees exist. Gene tree frequency distribution for a species tree on A, B, C, D:
((AB)(CD))  0.132  (anomalous gene tree)
((AC)(BD))  0.094
((AD)(BC))  0.094
(((AB)C)D)  0.125  (matching gene tree)
(((AB)D)C)  0.100
(((AC)B)D)  0.070
(((AC)D)B)  0.062
(((AD)B)C)  0.032
(((AD)C)B)  0.032
(((BC)A)D)  0.070
(((BC)D)A)  0.062
(((BD)A)C)  0.032
(((BD)C)A)  0.032
(((CD)A)B)  0.032
(((CD)B)A)
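To make the idea concrete without a general gene tree probability routine, the sketch below restricts the problem to three taxa, where the gene tree distribution is the closed form given earlier in the talk: the matching topology has probability $1 - (2/3)e^{-T}$ and each discordant topology has $(1/3)e^{-T}$. The function names and the example counts are hypothetical, and the multinomial coefficient is dropped because it is the same for every candidate species tree.

```python
import math

TOPOLOGIES = ("((AB)C)", "((AC)B)", "((BC)A)")


def fit_branch_length(counts, species_tree):
    """Maximize the log-likelihood over T for one candidate species tree.

    Returns (T_hat, log_likelihood). The matching-topology probability p
    is constrained to p >= 1/3 because T >= 0.
    """
    n = sum(counts.values())
    n_match = counts.get(species_tree, 0)
    p = max(n_match / n, 1.0 / 3.0)
    if p == 1.0:
        T_hat = math.inf
    else:
        # Invert p = 1 - (2/3) e^{-T}; clamp tiny negative rounding error.
        T_hat = max(0.0, -math.log(1.5 * (1.0 - p)))
    log_likelihood = n_match * math.log(p)
    if n > n_match:
        log_likelihood += (n - n_match) * math.log((1.0 - p) / 2.0)
    return T_hat, log_likelihood


def ml_species_tree(counts):
    """Candidate topology (and its fitted T) with the highest likelihood."""
    fits = {s: fit_branch_length(counts, s) for s in TOPOLOGIES}
    best = max(fits, key=lambda s: fits[s][1])
    return best, fits[best][0]


# Hypothetical counts of gene tree topologies across 100 loci.
example_counts = {"((AB)C)": 50, "((AC)B)": 25, "((BC)A)": 25}
best_tree, best_T = ml_species_tree(example_counts)
```

With four or more taxa the same approach needs the full gene tree probabilities under each candidate species tree, which is where the coalescent history computation comes in.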

Consensus among gene tree topologies:
- Majority rule consensus
- Greedy consensus
- Rooted triple consensus (R*)
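As a hedged sketch of the first step of rooted triple consensus, the code below tallies, for every set of three taxa, which rooted triple is most frequent across a collection of gene trees. It stops short of assembling the final R* tree from the winning triples, and ties are resolved by whichever triple is seen first rather than being treated as unresolved. Gene trees are represented as nested 2-tuples of taxon names, a representation chosen here for illustration.

```python
from collections import Counter
from itertools import combinations


def rooted_triples(tree, out):
    """Append every rooted triple (pair, outgroup) of a binary tree to `out`.

    Returns the list of leaves below `tree`.
    """
    if isinstance(tree, str):
        return [tree]
    left = rooted_triples(tree[0], out)
    right = rooted_triples(tree[1], out)
    # Two leaves on one side of this node are closer to each other
    # than either is to any leaf on the other side.
    for pair_side, single_side in ((left, right), (right, left)):
        for x, y in combinations(pair_side, 2):
            for z in single_side:
                out.append((frozenset((x, y)), z))
    return left + right


def most_frequent_triples(gene_trees):
    """Map each 3-taxon set to its most frequent rooted triple."""
    counts = Counter()
    for tree in gene_trees:
        triples = []
        rooted_triples(tree, triples)
        counts.update(triples)
    best = {}
    for (pair, outgroup), c in counts.items():
        taxa = pair | {outgroup}
        if taxa not in best or c > counts[best[taxa]]:
            best[taxa] = (pair, outgroup)
    return best


# Hypothetical gene trees: two copies of (((AB)C)D) and one ((AC)(BD)).
t1 = ((("A", "B"), "C"), "D")
t2 = (("A", "C"), ("B", "D"))
winners = most_frequent_triples([t1, t1, t2])
```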

Tree obtained by agglomeration using minimum pairwise coalescence times across a large number of loci (“Glass tree”)
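One way to read this construction is as single-linkage clustering on the minimum pairwise coalescence time across loci. The sketch below follows that reading; it is an interpretation of the one-line description, not the authors' implementation. The input format (one dictionary of pairwise coalescence times per locus) and the example times are hypothetical.

```python
from itertools import combinations


def glass_tree(taxa, loci):
    """Agglomerate taxa using minimum pairwise coalescence times.

    `loci` is a list of dicts mapping frozenset({x, y}) to the coalescence
    time of x and y at that locus. Returns a nested tuple.
    """
    # Minimum coalescence time of each pair across all loci.
    min_time = {
        frozenset(pair): min(locus[frozenset(pair)] for locus in loci)
        for pair in combinations(taxa, 2)
    }

    def distance(a, b):
        # Distance between clusters: smallest minimum time over member pairs.
        return min(min_time[frozenset((x, y))] for x in a for y in b)

    clusters = {frozenset([t]): t for t in taxa}  # leaf set -> subtree
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: distance(*ab))
        clusters[a | b] = (clusters.pop(a), clusters.pop(b))
    return next(iter(clusters.values()))


# Hypothetical coalescence times at two loci for taxa A, B, C.
locus_1 = {frozenset("AB"): 1.0, frozenset("AC"): 3.0, frozenset("BC"): 3.0}
locus_2 = {frozenset("AB"): 2.5, frozenset("AC"): 2.5, frozenset("BC"): 1.2}
tree = glass_tree(["A", "B", "C"], [locus_1, locus_2])
```

In this example the second locus has a discordant gene tree, but the minimum A-B time (1.0) is still below the minimum B-C time (1.2), so A and B are joined first.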

Summary:
- There exist algorithms for computing gene tree probabilities on species trees.
- The number of coalescent histories increases quickly; algorithmic improvements in gene tree probability computations are likely possible.
- A species tree can disagree with the gene tree that it is most likely to produce.
- This severe discordance only gets worse with more taxa.
- However, some algorithms can infer the correct species tree even when gene tree discordance is extreme.

Acknowledgments: David Bryant, Mike DeGiorgio, James Degnan, Randa Tao. National Science Foundation DEB