An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree
Yufeng Wu
Dept. of Computer Science & Engineering, University of Connecticut, USA
ISMB 2016, July 11, 2016
Coalescent Theory: Introduction
Wright-Fisher model: non-overlapping generations, constant population size, random mating.
Coalescent: trace backward in time from the sampled lineages.
Coalescence: two sampled lineages find a common ancestor.
Gene genealogy: determined by the coalescent process.
Stochastic: the time at which each coalescence occurs is random.
[Figure: population gene genealogy under the Wright-Fisher model, with generations along the time axis; alleles, sampled alleles, and a coalescence event are marked.]
Coalescent in Multiple Populations
The "outline" tree is the species (population) tree: the evolutionary history of the populations.
The "embedded" tree is the gene tree: the evolutionary history of individual alleles.
Coalescent: tracing lineages backward in time to a common ancestor within a population.
Multispecies coalescent: each extant or ancestral population runs a separate coalescent process. Together these determine the gene genealogy across multiple populations, which allows the probability of gene genealogies from multiple species (populations) to be computed.
Gene tree/species tree discordance: gene trees for a given species tree can have different topologies because coalescence is stochastic.
Stochasticity is inherent in the coalescent process: each gene tree has a certain probability of being observed.
Larger T: B and C are more likely to coalesce within T.
Smaller T: B and C may not coalesce within T, which may lead to a different gene tree.
[Figure: a gene tree embedded in a species tree; T marks the species tree branch within which B and C may coalesce, and a coalescence event is marked.]
Gene Tree Probability
For a species tree, any gene tree topology can arise, but with different probabilities.
For a species tree Ts (with branch lengths) and a gene tree topology Tg, the gene tree probability P(Tg|Ts) is the probability of observing the binary gene tree topology Tg for species tree Ts under coalescent theory. The larger P(Tg|Ts) is, the more likely Tg is to be observed.
It has multiple applications in population and evolutionary genomics, e.g. inferring Ts from multiple gene genealogies Tg by maximum likelihood.
For small gene trees and species trees, calculation by hand may be feasible (e.g. Hudson, 1983; Takahata and Nei, 1985; Rosenberg, 2002), usually by listing all possible gene trees and species trees with, say, five species.
For larger trees, an algorithm is needed: nowadays gene trees and species trees can have hundreds of taxa or alleles.
Key: efficient computation of the gene tree probability.
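To make the maximum likelihood application concrete, here is a minimal sketch for the smallest case: a species tree ((A,B):T,C) with one allele per species, where the matching gene tree topology ((a,b),c) has probability 1 - (2/3)exp(-T) and each of the two discordant topologies has probability (1/3)exp(-T), with T in coalescent units. The function names and the topology counts are hypothetical and chosen only for illustration; this is not the inference procedure of the talk.

```python
from math import exp, log

def log_likelihood(T, n_match, n_other):
    """Log-likelihood of internal branch length T for a three-taxon species tree.

    n_match: number of gene trees whose topology matches the species tree.
    n_other: number of gene trees with one of the two discordant topologies.
    """
    p_match = 1.0 - (2.0 / 3.0) * exp(-T)
    p_discordant = exp(-T) / 3.0  # probability of one specific discordant topology
    return n_match * log(p_match) + n_other * log(p_discordant)

def mle_branch_length(n_match, n_other):
    """Closed-form maximizer of log_likelihood over T >= 0."""
    f = n_match / (n_match + n_other)  # observed fraction of matching trees
    if n_other == 0:
        return float("inf")  # every tree matches: likelihood increases with T
    if f <= 1.0 / 3.0:
        return 0.0  # no more concordance than expected at T = 0
    return -log(1.5 * (1.0 - f))

# Hypothetical data: 80 matching and 20 discordant gene trees.
T_hat = mle_branch_length(80, 20)
print(T_hat, log_likelihood(T_hat, 80, 20))
```

For larger trees the same idea applies, but P(Tg|Ts) no longer has a simple closed form, which is why an efficient algorithm for it matters.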
An Algorithm for Gene Tree Probability (Degnan and Salter, 2005)
Coalescent history: specifies the species tree branch on which each coalescent event occurs. The same gene tree can have several histories.
History 1: a and b coalesce within T, then coalesce with c above T.
History 2: a and b coalesce above T, which gives a different history.
Degnan and Salter: P(Tg|Ts) = Σ_H P(Tg, H|Ts), where H ranges over the coalescent histories of Tg.
Why enumerate H? H specifies, for each species tree branch, that some u gene lineages coalesce into v lineages.
Classic result in coalescent theory (Tavare 1984; Watterson 1984; Takahata and Nei 1985): puv(T), the probability that u (unlabeled) lineages coalesce into v lineages within time T, has a known closed form.
P(Tg, H|Ts): the product of puv(T) over all species tree branches (independence assumed).
[Figure: two coalescent histories (History 1 and History 2) of the same gene tree on alleles a, b, c, embedded in a species tree with internal branch T; one branch is annotated with u lineages entering and v lineages leaving.]
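As a sketch of the classic result cited above, the function below evaluates puv(T) using the closed form as it is commonly written in the multispecies coalescent literature, assuming T is measured in coalescent units: puv(T) = sum over k = v..u of exp(-k(k-1)T/2) * (2k-1) * (-1)^(k-v) / (v! (k-v)! (v+k-1)) * product over y = 0..k-1 of (v+y)(u-y)/(u+y). The function name p_uv is illustrative.

```python
from math import exp, factorial

def p_uv(u, v, T):
    """Probability that u lineages coalesce into v lineages within time T.

    T is in coalescent units; returns 0.0 when v is outside 1..u.
    """
    if not 1 <= v <= u:
        return 0.0
    total = 0.0
    for k in range(v, u + 1):
        # Product term: prod_{y=0}^{k-1} (v + y)(u - y) / (u + y)
        prod = 1.0
        for y in range(k):
            prod *= (v + y) * (u - y) / (u + y)
        coeff = (2 * k - 1) * (-1) ** (k - v) / (
            factorial(v) * factorial(k - v) * (v + k - 1)
        )
        total += exp(-k * (k - 1) * T / 2.0) * coeff * prod
    return total

# Two lineages coalesce within T with probability 1 - exp(-T).
print(p_uv(2, 1, 1.0), 1.0 - exp(-1.0))
```

Two sanity checks that follow from the definition: p_uv(2, 1, T) equals 1 - exp(-T), and for a fixed u the values over v = 1..u sum to 1. The alternating sum can lose precision for large u, so a production implementation would need more careful numerics.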
An Algorithm for Gene Tree Probability (Degnan and Salter, 2005)
Recall: puv(T) is the probability that u (unlabeled) lineages coalesce into v lineages within time T.
Assume coalescent events along different species tree branches are independent. Then, for a fixed coalescent history H, the probability P(Tg, H|Ts) is the product of puv(Ti) over the branches bi, times a combinatorial factor C. In the example:
P(Tg, H|Ts) = p21(T1) * p22(T2) * p31() * C
Main challenge: all possible coalescent histories must be considered.
No polynomial-time algorithm is known for computing the gene tree probability.
STELLS algorithm (Wu, Evolution, 2012): another algorithm for the gene tree probability. It is faster than Degnan and Salter's but still slow for large trees (exponential time in general).
[Figure: species tree with branch lengths T1 and T2.]
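The sum over coalescent histories can be worked through by hand for the smallest case. The sketch below assumes a species tree ((A,B):T,C) with one allele (a, b, c) per species and T in coalescent units, and uses the two-lineage special cases p21(T) = 1 - exp(-T) and p22(T) = exp(-T). The function names are illustrative; this is a toy instance of the enumeration idea, not the Degnan and Salter implementation.

```python
from math import exp

def prob_matching_topology(T):
    """P(Tg|Ts) for Tg = ((a,b),c) and Ts = ((A,B):T,C), summed over two histories."""
    p21 = 1.0 - exp(-T)  # a and b coalesce within the internal branch
    p22 = exp(-T)        # a and b do not coalesce within the internal branch
    # History 1: a and b coalesce within T; the remaining two lineages
    # coalesce above the root with probability 1.
    history_1 = p21
    # History 2: all three lineages reach the root branch; (a,b) is the first
    # pair to coalesce with probability 1/3 (the combinatorial factor).
    history_2 = p22 * (1.0 / 3.0)
    return history_1 + history_2

def prob_discordant_topology(T):
    """Probability of one specific discordant topology, e.g. ((a,c),b)."""
    return exp(-T) * (1.0 / 3.0)  # only possible if nothing coalesces within T

T = 1.0
print(prob_matching_topology(T), 2 * prob_discordant_topology(T))
```

The two histories sum to 1 - (2/3)exp(-T), and the three topologies together have total probability 1. With more taxa the number of histories grows quickly, which is the main cost of the enumeration approach.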
Population genetics: large number of gene alleles (small number of populations)
Phylogenetics study: hundreds of species; one or several alleles per species.
Population genetics study: e.g. the 1000 Genomes Project, with 26 populations and haplotypes from 2504 individuals.
Question: can the gene tree probability be computed efficiently for many gene alleles when the number of populations is small?
Application: inference of demographic history.
This talk: a polynomial-time algorithm for computing the gene tree probability when the number of populations is fixed to a constant.
Basic idea: merge multiple coalescent histories into a compact coalescent history.
Compact Coalescent History (CCH)
Upper lineage count (ULC): the number of gene lineages (not the specific lineages, as in a coalescent history) at the top of a population tree branch.
Compact coalescent history (CCH): the ordered list of ULCs, one for each population tree branch.
Two different coalescent histories can lead to the same CCH, so the number of CCHs is smaller than the number of coalescent histories.
The gene tree probability can be computed from CCHs (details omitted).
[Figure: coalescent history 1 and coalescent history 2 of the same gene tree (internal nodes c to h) on alleles a1-a4 from population A and b1-b3 from population B. Both histories have ULC1 = 3, ULC2 = 1, ULC3 = 1, so both map to the CCH (3, 1, 1).]
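The mapping from a coalescent history to its CCH can be sketched as follows. The ULC of a branch is the number of lineages entering it from below minus the number of coalescent events placed on it. The gene tree node placements below are hypothetical (the exact gene tree of the figure is not specified in the text), but they are chosen to be consistent with the figure's labels: four alleles in A, three in B, and a resulting CCH of (3, 1, 1). All names are illustrative.

```python
from collections import Counter

# Population tree: leaf branches A and B join at the root branch.
ALLELES = {"A": 4, "B": 3}          # sampled alleles per leaf branch
CHILDREN = {"root": ("A", "B")}     # child branches of each internal branch
ORDER = ("A", "B", "root")          # bottom-up order; also the CCH order

# A coalescent history maps each gene tree internal node to the branch
# on which that coalescent event occurs (hypothetical placements).
HISTORY_1 = {"c": "A", "d": "root", "e": "B", "f": "B", "g": "root", "h": "root"}
HISTORY_2 = {"c": "root", "d": "A", "e": "B", "f": "B", "g": "root", "h": "root"}

def compact_history(history, alleles, children, order):
    """Return the CCH: the upper lineage count of each branch, in `order`."""
    events = Counter(history.values())  # coalescent events per branch
    ulc = {}
    for branch in order:
        entering = alleles.get(branch, 0) + sum(
            ulc[child] for child in children.get(branch, ())
        )
        ulc[branch] = entering - events[branch]
    return tuple(ulc[branch] for branch in order)

print(compact_history(HISTORY_1, ALLELES, CHILDREN, ORDER))
print(compact_history(HISTORY_2, ALLELES, CHILDREN, ORDER))
```

The two histories differ in which specific event happens inside branch A, yet both collapse to the same vector of counts, which is what makes the compact representation smaller.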
Number of Compact Coalescent Histories is Polynomially Bounded for a Constant Number of Populations
n: number of gene lineages. m: number of population tree branches. When the number of populations is constant, m is constant.
A CCH is a length-m vector of integers, e.g. (c1, c2, c3, c4, 1) with c1, c2, c3, c4 ≤ n and |CCH| = m (constant). Each position takes a value from 1 to n.
The number of CCHs is ≤ n^m, which is polynomial in n when m is constant.
[Figure: run time (seconds) for computing the gene tree probability of 500 gene trees using the new algorithm and the original STELLS, for two populations. x-axis: number of alleles per population; y-axis: time (seconds).]
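The counting argument can be checked on a small case. The sketch below assumes a two-population tree with branches A, B, and root (so m = 3), and enumerates every candidate vector in which each leaf branch has a ULC between 1 and its allele count and the root has a ULC of 1. The function name is illustrative, and the set of CCHs that are actually feasible for a given gene tree may be smaller than this candidate set.

```python
from itertools import product

def candidate_cchs(n_a, n_b):
    """Candidate CCHs (ULC_A, ULC_B, ULC_root) for a two-population tree."""
    return [
        (c_a, c_b, 1)  # the root branch always ends with a single lineage
        for c_a, c_b in product(range(1, n_a + 1), range(1, n_b + 1))
    ]

n_a, n_b = 4, 3          # alleles sampled from populations A and B
n, m = n_a + n_b, 3      # total gene lineages, number of branches
cchs = candidate_cchs(n_a, n_b)
print(len(cchs), n ** m)  # candidate count versus the n^m bound
```

Doubling the number of alleles roughly quadruples the candidate count here, whereas the number of full coalescent histories can grow much faster.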
Application: Inference of Population Trees Using Pairwise Distance (also see Wu, Bioinformatics, 2015)
Pipeline: population haplotypes -> gene genealogies -> estimated divergence distance for each pair of populations -> population tree.
Assumptions: infinite sites model and no intra-locus recombination.
Pairwise distances: maximum likelihood estimate (using the gene tree probability algorithm). Population tree: neighbor joining.
Haplotypes of two populations A and B (sites 1-4):
     1 2 3 4
a1   A A A A
a2   C A G A
b1   C T G A
b2   A A A C
[Figure: gene genealogy with leaves b1, a2, a1, b2 and root sequence AAAA, with the mutations at sites 1-4 placed on its branches; the same genealogy embedded in the population tree of A and B.]
Pairwise population distance D:
     A    B    C
A    -    2.0  1.0
B         -    2.0
C              -
[Figure: population tree on A, C, B with branch labels 1.1, 0.9, 1.0.]
Comparison with TreeMix (Pickrell and Pritchard, 2012), which does not consider linkage disequilibrium (LD): our approach is more accurate (but slower) than TreeMix.
Simulation: 8 populations with 20 alleles each; population tree height of 0.1. Average Robinson-Foulds error: 0.11 for our method, 0.18 for TreeMix.
Partly supported by U.S. National Science Foundation grants IIS-0953563 and IIS-1447711.
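The last step of the pipeline above, building a tree from pairwise distances, can be sketched with a textbook neighbor-joining routine. This is a generic implementation rather than the code used in the talk; the function name, the edge-list output format, and the four-population distance matrix are all hypothetical.

```python
def neighbor_joining(names, distances):
    """Textbook neighbor joining.

    distances: dict mapping frozenset({x, y}) to the distance between x and y.
    Returns a list of (node, neighbor, branch_length) edges of an unrooted tree.
    """
    nodes = list(names)
    dist = dict(distances)
    edges = []
    next_id = 0

    def d(x, y):
        return dist[frozenset((x, y))]

    while len(nodes) > 2:
        n = len(nodes)
        # Row sums used by the Q criterion.
        r = {x: sum(d(x, y) for y in nodes if y != x) for x in nodes}
        pairs = [(x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]]
        # Join the pair minimizing Q(x, y) = (n - 2) d(x, y) - r[x] - r[y].
        i, j = min(pairs, key=lambda p: (n - 2) * d(*p) - r[p[0]] - r[p[1]])
        u = f"node{next_id}"
        next_id += 1
        len_i = 0.5 * d(i, j) + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d(i, j) - len_i
        edges += [(i, u, len_i), (j, u, len_j)]
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[frozenset((u, k))] = 0.5 * (d(i, k) + d(j, k) - d(i, j))
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    edges.append((nodes[0], nodes[1], d(nodes[0], nodes[1])))
    return edges

# Hypothetical additive distances generated by the tree ((A:1,B:2):1,(C:3,D:4)).
EXAMPLE = {
    frozenset("AB"): 3.0, frozenset("AC"): 5.0, frozenset("AD"): 6.0,
    frozenset("BC"): 6.0, frozenset("BD"): 7.0, frozenset("CD"): 7.0,
}
for edge in neighbor_joining("ABCD", EXAMPLE):
    print(edge)
```

On additive distances such as this example, neighbor joining recovers the generating tree and its branch lengths. With estimated distances the result is only as good as the pairwise estimates that feed it.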