An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree Yufeng Wu.


An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree Yufeng Wu Dept. of Computer Science & Engineering University of Connecticut, USA ISMB 2016, July 11, 2016

Coalescent Theory: Introduction
Wright-Fisher model: non-overlapping generations, constant population size, random mating.
Coalescent: trace the sampled lineages backward in time.
Coalescence: two sampled lineages find a common ancestor.
Gene genealogy: determined by the coalescent process.
Stochastic: the times at which coalescences occur are random.
[Figure: population gene genealogy. Labels: allele, sampled allele, coalescence, generation, time.]
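A minimal sketch of the backward-in-time process described on this slide, assuming the standard Kingman coalescent in coalescent time units (each pair of lineages coalesces at rate 1, so k lineages wait an exponential time with rate k(k-1)/2). The function names are illustrative and do not come from the talk.

```python
import random

def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n sampled lineages."""
    # With k lineages the expected waiting time is 2 / (k (k - 1)).
    return sum(2.0 / (k * (k - 1)) for k in range(n, 1, -1))

def simulate_tmrca(n, rng):
    """Draw one TMRCA by tracing n lineages backward until one remains."""
    t = 0.0
    for k in range(n, 1, -1):
        # Waiting time until the next coalescence among k lineages.
        t += rng.expovariate(k * (k - 1) / 2.0)
    return t

rng = random.Random(1)  # fixed seed so the run is repeatable
mean_tmrca = sum(simulate_tmrca(10, rng) for _ in range(20000)) / 20000
print(expected_tmrca(10), mean_tmrca)
```

The simulated mean should land close to the analytic value 2(1 - 1/n), which illustrates that the genealogy is random while its summary statistics are predictable.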

Coalescent in Multiple Populations
"Outline" tree is the species (population) tree: the evolutionary history of the populations.
"Embedded" tree is the gene tree: the evolutionary history of individual alleles.
Coalescent: tracing lineages backward in time to a common ancestor within a population.
Multispecies coalescent: each extant or ancestral population runs a separate coalescent; together these determine the gene genealogy across multiple populations.
Gene tree/species tree discordance: a gene tree can differ in topology from the species tree that contains it, because coalescence is stochastic.
Stochasticity: inherent in the coalescent process (a gene tree has a certain probability of being observed under the coalescent process).
Larger T: B and C are more likely to coalesce within T.
Small T: B and C may not coalesce within T (which may lead to a different gene tree).
Multispecies coalescent: allows computation of the probability of gene genealogies from multiple species (populations).
[Figure: gene tree embedded in a species tree, with internal branch length T and a coalescence marked.]

Gene Tree Probability
For a species tree, any gene tree topology can arise, but each with a certain probability.
For a species tree Ts (with branch lengths) and a gene tree topology Tg:
Gene tree probability P(Tg|Ts): the probability of observing a binary gene tree topology Tg for species tree Ts under coalescent theory.
The larger P(Tg|Ts) is, the more likely Tg is to be observed.
Has multiple applications in population and evolutionary genomics, e.g. inferring Ts from multiple gene genealogies Tg by maximum likelihood.
For a small gene tree and species tree, calculation by hand may be feasible (e.g. Hudson, 1983; Takahata and Nei, 1985; Rosenberg, 2002), usually by listing all possible gene trees and species trees with, say, five species.
For larger trees, an algorithm is needed. (Nowadays, gene trees and species trees can have hundreds of taxa/alleles.)
Key: efficient computation of the gene tree probability.
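The smallest case can be worked out in a few lines. The sketch below assumes a three-species tree (A,(B,C)) with one allele per species and an internal branch of length T in coalescent units; under those assumptions the matching gene tree has probability 1 - (2/3)e^(-T) and each of the two discordant topologies has probability (1/3)e^(-T). The maximum likelihood step is only an illustration of "infer Ts from gene genealogies" for this toy case, not the method used in the talk.

```python
import math

def three_taxon_probs(T):
    """Gene tree topology probabilities for species tree (A,(B,C))."""
    discordant = math.exp(-T) / 3.0  # b and c fail to coalesce within T
    return {
        "(A,(B,C))": 1.0 - 2.0 * discordant,  # matches the species tree
        "(B,(A,C))": discordant,
        "(C,(A,B))": discordant,
    }

def mle_branch_length(n_concordant, n_total):
    """Maximum likelihood estimate of T from counts of concordant gene trees."""
    p_hat = n_concordant / n_total
    if p_hat <= 1.0 / 3.0:
        return 0.0       # no signal beyond a zero-length branch
    if p_hat >= 1.0:
        return math.inf  # every gene tree agrees with the species tree
    # Invert p = 1 - (2/3) exp(-T).
    return -math.log(1.5 * (1.0 - p_hat))

print(three_taxon_probs(1.0))
print(mle_branch_length(755, 1000))
```

A longer internal branch pushes almost all probability onto the concordant topology, while T = 0 makes the three topologies equally likely.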

An Algorithm for Gene Tree Probability (Degnan and Salter, 2005)
Coalescent history: specifies the species tree branch on which each coalescent event occurs.
History 1: a and b coalesce within T, then coalesce with c above T.
History 2: a and b coalesce above T. This gives a different history for the same gene tree.
Degnan and Salter: P(Tg|Ts) = Σ_H P(Tg, H|Ts), where H ranges over the coalescent histories of Tg.
Why enumerate H? H specifies, for each species tree branch, that some u gene lineages coalesce into v lineages.
Classic result in coalescent theory (Tavare 1984; Watterson 1984; Takahata and Nei 1985): puv(T), the probability that u (unlabeled) lineages coalesce to v lineages within time T, has a closed-form expression.
P(Tg, H|Ts): product of puv(T) over all species tree branches (independence assumed).
[Figure: two coalescent histories of the same gene tree on a three-species tree with leaves a, b, c and internal branch length T; u lineages enter a branch and v lineages leave it.]
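The transcript does not preserve the formula for puv(T). The sketch below uses the closed form as it is usually stated in the literature following Tavare (1984), with T in coalescent units; treat the exact expression as an assumption to check against the original slide or paper. The function name p_uv is illustrative.

```python
import math

def p_uv(u, v, T):
    """Probability that u lineages coalesce to exactly v lineages in time T."""
    total = 0.0
    for k in range(v, u + 1):
        prod = 1.0
        for y in range(k):
            prod *= (v + y) * (u - y) / (u + y)
        coeff = (2 * k - 1) * (-1) ** (k - v) / (
            math.factorial(v) * math.factorial(k - v) * (v + k - 1)
        )
        total += math.exp(-k * (k - 1) * T / 2.0) * coeff * prod
    return total

T = 0.7
print(p_uv(2, 1, T), 1.0 - math.exp(-T))          # two lineages coalesce
print(sum(p_uv(5, v, T) for v in range(1, 6)))    # probabilities over v sum to 1
```

Two quick sanity checks are that p_uv(2, 1, T) reduces to 1 - e^(-T) and that, for fixed u, the values over v = 1..u sum to one.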

An Algorithm for Gene Tree Probability (Degnan and Salter, 2005), continued
Recall: puv(T) is the probability that u (unlabeled) lineages coalesce to v lineages within time T.
Assume that coalescent events along different species tree branches are independent. Then, for a fixed coalescent history H, P(Tg, H|Ts) is the product of puv(Ti) over the branches bi, times a combinatorial factor C.
Example, for a fixed coalescent history H: P(Tg, H|Ts) = p21(T1) * p22(T2) * p31() * C.
Main challenge: all possible coalescent histories must be considered.
No polynomial-time algorithm is known for computing the gene tree probability.
STELLS algorithm (Wu, Evolution, 2012): another algorithm for the gene tree probability. Faster than Degnan and Salter's, but still slow for large trees (exponential time in general).
[Figure: species tree with branch lengths T1 and T2.]
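To make the sum over coalescent histories concrete, here is a hedged sketch for the smallest case: species tree ((A,B),C) with one allele per species, gene tree ((a,b),c), internal branch length T in coalescent units, and a root branch assumed to be infinitely long. Only two histories exist, so the enumeration can be written out by hand; the factor 1/3 is the combinatorial factor for the history in which all three lineages reach the root.

```python
import math

def gene_tree_prob_by_histories(T):
    """P(((a,b),c) | ((A,B),C)) as a sum over its two coalescent histories."""
    p21 = 1.0 - math.exp(-T)  # 2 lineages -> 1 within the internal branch
    p22 = math.exp(-T)        # 2 lineages stay 2 within the internal branch

    # History 1: a and b coalesce within T, then join c in the root branch.
    # In the root, 2 lineages coalesce to 1 with probability 1, in one way.
    history_1 = p21 * 1.0 * 1.0

    # History 2: a and b do not coalesce within T. In the root, 3 lineages
    # coalesce to 1 with probability 1, and 1 of the 3 equally likely first
    # pairings (a with b) yields the required topology.
    history_2 = p22 * 1.0 * (1.0 / 3.0)

    return history_1 + history_2

print(gene_tree_prob_by_histories(1.0))
```

The result agrees with the closed form 1 - (2/3)e^(-T); for larger trees the number of histories grows quickly, which is the bottleneck the slide refers to.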

Population Genetics: Large Number of Gene Alleles, Small Number of Populations
Phylogenetics study: hundreds of species; one or several alleles per species.
Population genetics study: e.g. the 1000 Genomes Project, with 26 populations and haplotypes from 2504 individuals.
Question: can the gene tree probability be computed efficiently for many gene alleles when the number of populations is small?
Application: inference of demographic history.
This talk: a polynomial-time algorithm for computing the gene tree probability when the number of populations is fixed to a constant.
Basic idea: merge multiple coalescent histories into a compact coalescent history.

Compact Coalescent History (CCH)
Upper lineage count (ULC): the number of gene lineages (not the specific lineages, as in a coalescent history) at the top of a population tree branch.
Compact coalescent history (CCH): the ordered list of ULCs, one for each population tree branch.
Two different coalescent histories can lead to the same CCH.
The number of CCHs is smaller than the number of coalescent histories.
The gene tree probability can be computed from the CCHs (details omitted).
[Figure: coalescent history 1 and coalescent history 2 for a gene tree with internal nodes c, d, e, f, g, h on a two-population tree. Population A has alleles a1, a2, a3, a4; population B has alleles b1, b2, b3. Both histories have ULC1 = 3, ULC2 = 1, ULC3 = 1, giving CCH: (3, 1, 1).]
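A small sketch of how a coalescent history collapses to a CCH, assuming a history is given as a map from gene tree internal nodes to population tree branches. The data layout, the function name and the two example histories are hypothetical; they are chosen to be consistent with the figure's counts (4 alleles in A, 3 in B, CCH (3, 1, 1)), and the code does not check that a history is compatible with the gene tree topology.

```python
def compact_history(history, leaf_counts, parent, order):
    """Return the ULC of every branch, listed in the given branch order.

    history:     gene tree internal node -> population branch where it coalesces
    leaf_counts: leaf branch -> number of sampled alleles
    parent:      branch -> parent branch (None for the root branch)
    order:       branches listed so that children come before parents
    """
    coalescences = {b: 0 for b in order}
    for branch in history.values():
        coalescences[branch] += 1

    entering = {b: leaf_counts.get(b, 0) for b in order}
    ulc = {}
    for b in order:
        # Each coalescence in a branch removes exactly one lineage.
        ulc[b] = entering[b] - coalescences[b]
        if parent[b] is not None:
            entering[parent[b]] += ulc[b]
    return tuple(ulc[b] for b in order)

order = ["A", "B", "root"]
parent = {"A": "root", "B": "root", "root": None}
leaf_counts = {"A": 4, "B": 3}

# Two hypothetical histories that differ only in which node coalesces in A.
history_1 = {"c": "A", "e": "B", "f": "B", "d": "root", "g": "root", "h": "root"}
history_2 = {"d": "A", "e": "B", "f": "B", "c": "root", "g": "root", "h": "root"}

cch_1 = compact_history(history_1, leaf_counts, parent, order)
cch_2 = compact_history(history_2, leaf_counts, parent, order)
print(cch_1, cch_2)
```

Both histories map to the same vector because a CCH records only how many lineages leave each branch, not which ones.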

Number of Compact Coalescent Histories is Polynomially Bounded for a Constant Number of Populations
n: number of gene lineages. m: number of population tree branches.
When the number of populations is constant, m is constant.
Example: CCH = (c1, c2, c3, c4, 1) with c1, c2, c3, c4 ≤ n, and |CCH| = m (constant).
A CCH is a length-m vector of integers, and the value at each position ranges from 1 to n.
The number of CCHs is at most n^m, which is polynomial in n when m is constant.
[Figure: run-time (seconds) for computing the gene tree probability for 500 gene trees using the new algorithm and the original STELLS, two populations. Axes: time (seconds) against allele number per population.]
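The counting argument can be checked by brute force on a small tree. This sketch enumerates candidate ULC vectors under two assumptions that are mine rather than the talk's: the ULC of a branch lies between 1 and the number of lineages entering it, and the root branch ends with a single lineage. It gives an upper bound, since not every candidate needs to be realizable for a particular gene tree topology.

```python
def enumerate_cch(order, parent, leaf_counts):
    """List candidate ULC vectors for a population tree.

    order lists branches with children before parents; parent maps each
    branch to its parent (None for the root); leaf_counts gives the number
    of sampled alleles on each leaf branch.
    """
    results = []
    ulc = {}

    def extend(i):
        if i == len(order):
            results.append(tuple(ulc[b] for b in order))
            return
        b = order[i]
        # Lineages entering b: its own samples plus what its children pass up.
        entering = leaf_counts.get(b, 0) + sum(
            ulc[c] for c in order[:i] if parent[c] == b
        )
        choices = [1] if parent[b] is None else range(1, entering + 1)
        for k in choices:
            ulc[b] = k
            extend(i + 1)

    extend(0)
    return results

order = ["A", "B", "root"]
parent = {"A": "root", "B": "root", "root": None}
leaf_counts = {"A": 4, "B": 3}

cchs = enumerate_cch(order, parent, leaf_counts)
n, m = sum(leaf_counts.values()), len(order)
print(len(cchs), "candidates; bound n^m =", n ** m)
```

For the two-population example with 4 and 3 alleles this yields 12 candidates, far below the loose bound of 7^3.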

Application: Inference of Population Trees Using Pairwise Distance (also see Wu, Bioinformatics, 2015)
Pipeline: population haplotypes, then gene genealogies, then an estimate of the divergence distance for each pair of populations, then the population tree.
Gene genealogies: assume the infinite sites model and no intra-locus recombination.
Pairwise distance: maximum likelihood estimate (using the gene tree probability algorithm).
Population tree: neighbor joining on the pairwise distances.

Haplotypes of two populations A and B:
Site  1 2 3 4
a1    A A A A
a2    C A G A
b1    C T G A
b2    A A A C

Pairwise population distance:
D    A    B    C
A    -    2.0  1.0
B         -    2.0
C              -

[Figure: gene genealogy of a1, a2, b1, b2 with root sequence AAAA and mutations at sites 1 to 4 placed on its branches; the genealogy embedded in the two-population tree (A, B); and a population tree on A, C, B with branch lengths 1.1, 0.9, 1.0.]

Comparison with TreeMix (Pickrell and Pritchard, 2012), which does not consider linkage disequilibrium (LD): our approach is more accurate (but slower) than TreeMix.
Simulation: 8 populations with 20 alleles each; population tree height of 0.1.
Average Robinson-Foulds error: 0.11 for our method, 0.18 for TreeMix.

Partly supported by U.S. National Science Foundation grants IIS-0953563 and IIS-1447711.
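For the neighbor joining step of the application slide, the following sketch shows only the pair-selection rule of the standard algorithm: join the pair (i, j) that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from i. The four-population distance matrix is hypothetical (the slide's three-population matrix is too small to be informative, because every pair ties when n = 3), and the function name is illustrative.

```python
from itertools import combinations

def nj_pick_pair(names, dist):
    """Return the pair of populations that neighbor joining merges first."""
    n = len(names)

    def d(x, y):
        # Distances are stored once per unordered pair.
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    row_sum = {x: sum(d(x, y) for y in names if y != x) for x in names}

    best_pair, best_q = None, float("inf")
    for x, y in combinations(names, 2):
        q = (n - 2) * d(x, y) - row_sum[x] - row_sum[y]
        if q < best_q:
            best_pair, best_q = (x, y), q
    return best_pair

# Hypothetical distances consistent with the tree ((A,C),(B,D)).
names = ["A", "B", "C", "D"]
dist = {
    ("A", "C"): 2.0, ("B", "D"): 2.0,
    ("A", "B"): 4.0, ("A", "D"): 4.0,
    ("C", "B"): 4.0, ("C", "D"): 4.0,
}
first_pair = nj_pick_pair(names, dist)
print(first_pair)
```

A full implementation would then replace the chosen pair by a new internal node, update the distances, and repeat until the tree is resolved.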