The multispecies coalescent: implications for inferring species trees

Slides:



Advertisements
Similar presentations
Subspace Embeddings for the L1 norm with Applications Christian Sohler David Woodruff TU Dortmund IBM Almaden.
Advertisements

The methodology used for the 2001 SARs Special Uniques Analysis Mark Elliot Anna Manning Confidentiality And Privacy Group ( University.
The Simple Linear Regression Model Specification and Estimation Hill et al Chs 3 and 4.
Chapter 7 Sampling and Sampling Distributions
THE CENTRAL LIMIT THEOREM
An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University.
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Distributions of sampling statistics Chapter 6 Sample mean & sample variance.
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
The Coalescent Theory And coalescent- based population genetics programs.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
CmpE 104 SOFTWARE STATISTICAL TOOLS & METHODS MEASURING & ESTIMATING SOFTWARE SIZE AND RESOURCE & SCHEDULE ESTIMATING.
Sampling distributions of alleles under models of neutral evolution.
Discordance due to gene flow or horizontal gene transfer.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Lecture 23: Introduction to Coalescence April 7, 2014.
Molecular Evolution Revised 29/12/06
Chapter 6 Introduction to Sampling Distributions
Evaluating Hypotheses Chapter 9. Descriptive vs. Inferential Statistics n Descriptive l quantitative descriptions of characteristics.
Fall 2006 – Fundamentals of Business Statistics 1 Chapter 6 Introduction to Sampling Distributions.
March 2006Vineet Bafna CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner
Descriptive statistics Experiment  Data  Sample Statistics Experiment  Data  Sample Statistics Sample mean Sample mean Sample variance Sample variance.
Dispersal models Continuous populations Isolation-by-distance Discrete populations Stepping-stone Island model.
Parametric Inference.
QMS 6351 Statistics and Research Methods Chapter 7 Sampling and Sampling Distributions Prof. Vera Adamchik.
Lecture 13 – Performance of Methods Folks often use the term “reliability” without a very clear definition of what it is. Methods of assessing performance.
Latent Tree Models Part II: Definition and Properties
Gene Trees and Species Trees: Lessons from morning glories Lauren A. Eserman & Richard E. Miller Department of Biological Sciences Southeastern Louisiana.
Census A survey to collect data on the entire population.   Data The facts and figures collected, analyzed, and summarized for presentation and.
Molecular phylogenetics
6.1 What is Statistics? Definition: Statistics – science of collecting, analyzing, and interpreting data in such a way that the conclusions can be objectively.
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 7 Sampling Distributions.
AP Statistics Chapter 9 Notes.
Random Sampling, Point Estimation and Maximum Likelihood.
Speciation history inferred from gene trees L. Lacey Knowles Department of Ecology and Evolutionary Biology University of Michigan, Ann Arbor MI
Extensions to Basic Coalescent Chapter 4, Part 2.
16 September 2007 Coalescent Consequences for Consensus Cladograms J. H. Degnan 1, M. Degiorgio 2, D. Bryant 3, and N. A. Rosenberg 1,2 1 Dept. of Human.
Phylogenetics and Coalescence Lab 9 October 24, 2012.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Quantifying uncertainty in species discovery with approximate Bayesian computation (ABC): single samples and recent radiations Mike HickersonUniversity.
Targeted next generation sequencing for population genomics and phylogenomics in Ambystomatid salamanders Eric M. O’Neill David W. Weisrock Photograph.
Population assignment likelihoods in a phylogenetic and demographic model. Jody Hey Rutgers University.
Summarising Sets of Phylogenies Consensus Trees and Split/Consensus Networks Aidan Budd EMBL Heidelberg Friday July 2nd 2010 Basic Molecular Evolution.
Molecular phylogenetics 4 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 7 Sampling Distributions.
1 Chapter 7 Sampling Distributions. 2 Chapter Outline  Selecting A Sample  Point Estimation  Introduction to Sampling Distributions  Sampling Distribution.
Gene tree discordance and multi-species coalescent models Noah Rosenberg December 21, 2007 James Degnan Randa Tao David Bryant Mike DeGiorgio.
The star-tree paradox in Bayesian phylogenetics Bengt Autzen Department of Philosophy, Logic and Scientific Method LSE.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
Confidence Interval Estimation For statistical inference in decision making:
Lecture 17: Phylogenetics and Phylogeography
Testing alternative hypotheses. Outline Topology tests: –Templeton test Parametric bootstrapping (briefly) Comparing data sets.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
The 2-state symmetric Markov model 1. Another way to think about it as a ‘random cluster’ model Cut each edge e independently with probability 2p e This.
MODELLING EVOLUTION TERESA NEEMAN STATISTICAL CONSULTING UNIT ANU.
Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore.
By Mireya Diaz Department of Epidemiology and Biostatistics for EECS 458.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
Week 21 Order Statistics The order statistics of a set of random variables X 1, X 2,…, X n are the same random variables arranged in increasing order.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
Species Tree Workshop January 14, 2012 Practice with BEST Please download MrBayes 3.2 for either Windows, Macintos, or UNIX from
Week 21 Statistical Model A statistical model for some data is a set of distributions, one of which corresponds to the true unknown distribution that produced.
Probability and Probability Distributions. Probability Concepts Probability: –We now assume the population parameters are known and calculate the chances.
Monkey Business Bioinformatics Research Center University of Aarhus Thomas Mailund Joint work with Asger Hobolth, Ole F. Christiansen and Mikkel H. Schierup.
Lecture 19 – Species Tree Estimation
An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree Yufeng Wu.
Statistical Modeling of Ancestral Processes
Lecture 19 – Non-tree Based Methods
Presentation transcript:

The multispecies coalescent: implications for inferring species trees James Degnan 21 February 2008

Outline 1. Background --gene trees vs. species trees --coalescence and incomplete lineage sorting 2. Inferring species trees --Concatenation --Consensus Trees 3. Conclusions

Population Genetics and Phylogenetics Population genetics: traditionally used to analyze single populations. Phylogenetics: What is the best way to infer relationships between populations/species? Graphic by Mark A. Klinger, Carnegie Museum of Natural History, Pittsburgh

Desirable properties of species tree estimators 1. Statistical consistency (sample size = # of genes) 2. Efficiency 3. Robustness to violations in assumptions

Bridging the popgen/phylo divide “Incorporation of explicit models of lineage sorting will be needed for continued development of phylogenetic inference near the species level.” –Maddison and Knowles (2006). “Closer integration of population-genetic factors in phylogenetics, including further insights into gene-tree/species tree, and horizontal gene transfer.” --from Mike Steel’s website, My pick for five directions in phylogenetics that will grow in the next five years (2006).

The coalescent process Past Present

One population

Multiple populations/species Present Past

Gene tree in a species tree

Model species tree with gene tree A B C D The gene tree is a random variable. The gene tree distribution is parameterized by the species tree topology and internal branch lengths.

How can we compute probabilities of gene trees given species trees? -Under a coalescent model, probabilities for gene trees with three species were derived by Nei (1987): 1-(2/3)e-T -Probabilities for the gene tree to match the species tree topology for 4 and 5 species given by Pamilo and Nei (1988). -All 30 species tree/gene tree combinations for 4 species given by Rosenberg (2002). -General case solved by Degnan and Salter (2005) and implemented by program COAL. Also allows individuals sampled in species i.

This coalescent history: (1,3,3) Definition: a coalescent history is a list of the populations in which each coalescent event occurs. A B C D This coalescent history: (1,3,3) Other coalescent histories: (2,3,3), (3,3,3)

Gene tree probabilities

Gene tree probabilities combinatorial enumeration, complexity only known in special cases probability coalescences are consistent with g branch length u coalesce into v internal branches of S

Data from Ebersberger et al. 2007. Mol. Biol. Evol. 24:2266-2276. Theoretical distribution based on parameters from Rannala and Yang, 2003. Genetics 164:1645-1656. 1.2 4.2 t/N =

y x

Definition: a gene tree which is more probable than the gene tree matching the species tree is called an anomalous gene tree (Degnan and Rosenberg, 2006). Theorem 1. For the asymmetric species tree topology with four species and for any species tree topology with more than four species, there exist branch lengths such that at least one gene tree is anomalous (Degnan and Rosenberg, 2006).

Is species tree inference consistent in this setting? 1. Concatenation? 2. Consensus?

Species Tree inference—concatenation Species Trees are often estimated by concatenating several gene sequences and analyzing as one (data from Chen and Li, 2001). Gene 1 Human CTTGAATAATTTTTAC Chimp CTTCAATAATTTTTAC Gorilla TTTGAATAATTTTTAC Orang CTTGAATAATTTTTAT Gene 2 TAGAGTTTCCTTGTGGTG TAGAGTTTCCTTGTGGTA CAGAGTTTCCTTGTGGTC Gene 3 CGGTTT TGGTTT CRGTTT

Concatenation and gene tree discordance How does concatenation perform when sequences are generated from different topologies? CGGTTT TGGTTA TAGTTA CGATTA TGATTA TAATTT TGAATT TGCTAT CCCTAT Simulated gene trees Species tree: y = 1.0, x = 0.05 y x CGGTTT TGGTTA TAGTTA CGATTA TGATTA TAATTT TGAATT TGCTAT CCCTAT concatenated sequence

Trees inferred from concatenated sequences (Kubatko and Degnan, 2007) y = 1.0, x = 0.05 Number of genes

Is species tree inference consistent in this setting? 1. Concatenation? No. 2. Consensus?

Consensus (majority-rule)

Types of consensus trees Majority rule—consensus tree has all clades that were observed in > 50% of trees. Greedy—sort clades by their proportions. Accept the most frequently observed clades one at a time that are compatible with already accepted clades. Do this until you have a fully resolved tree. R*—for each set of 3 taxa, find the most commonly occurring triple e.g., (AB)C, (AC)B or (BC)A. Build the tree from the most commonly occurring triple. (AB)D, (CD)B are two rooted triples

Asymptotic consensus trees Consensus trees are usually statistics, functions of data like x-bar. Definition: an asymptotic consensus tree is the tree that is obtained by computing the consensus tree using topology probabilities from the multispecies coalescent model. Motivation: if there are a large number of independent loci, observed gene tree, clade, and rooted triple proportions should approximate their theoretical probabilities.

Simulated gene trees Greedy consensus tree

Greedy consensus tree

Greedy consensus tree Simulated gene trees Greedy consensus tree R* consensus tree

Majority-rule: unresolved zone

Too-greedy zone

Is species tree inference consistent in this setting? 1. Concatenation? No. 2. Consensus? Yes (R*), no for greedy and majority-rule.

Are consensus trees inconsistent estimators of species trees? Theorem 2. (i) Majority-rule asymptotic consensus trees (MACTs) do not have any clades not on the species tree. (ii) Majority-rule unresolved zones exist for any species tree topology with n ≥ 3 species. Theorem 3. Greedy asymptotic consensus trees (GACTs) can be misleading estimators of species trees for the 4-species asymmetric tree and for any species tree with n > 4 species. Theorem 4. R* asymptotic consensus trees (RACTs) always match the species tree.

What about finite samples? If you sample 10 loci, you could have: All 10 match the species tree 9 match the species tree, 1 disagrees 8 match the species tree, 2 disagree, etc. You can consider gene trees as categories and use multinomial probabilities for the probability of your sample

R* consensus, y = 0.4, x = 0.6

Conclusion Coalescent gene tree probabilities can be used to prove or disprove the statistical consistency of species tree estimators.

R* consensus, y = x = 0.1 Number of genes Probability