A simulation study comparing phylogeny reconstruction methods for linguistics. Collaborators: Francois Barbancon, Don Ringe, Luay Nakhleh, Steve Evans. Tandy Warnow.

Presentation transcript:

A simulation study comparing phylogeny reconstruction methods for linguistics
Collaborators: Francois Barbancon, Don Ringe, Luay Nakhleh, Steve Evans
Tandy Warnow, The University of Texas at Austin
The Newton Institute for Mathematical Research

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Phylogenetic Network for IE Nakhleh et al., Language 2005

Controversies for IE history
Subgrouping: Other than the 10 major subgroups, what is likely to be true? In particular, what about
–Italo-Celtic
–Greco-Armenian
–Anatolian + Tocharian
–Satem Core (Indo-Iranian and Balto-Slavic)
–Location of Germanic
Dates?
How tree-like is IE?

Controversies for IE history
Note: many reconstructions of IE have been done, but they produce histories that differ in significant ways (e.g., the location of Germanic).
Possible issues:
–Dataset (modern vs. ancient data, errors in the cognacy judgments, lexical vs. all types of characters, screened vs. unscreened)
–Translation of multi-state data to binary data
–Reconstruction method

The performance of methods on an IE data set (Transactions of the Philological Society, Nakhleh et al. 2005)
Observation: Different datasets (not just different methods) can give different reconstructed phylogenies.
Objective: Explore the differences in reconstructions as a function of data (lexical alone versus lexical, morphological, and phonological), screening (to remove obviously homoplastic characters), and methods. However, use a better basic dataset (where cognacy judgments are more reliable).

Better datasets
Ringe & Taylor
–The screened full dataset of 294 characters (259 lexical, 13 morphological, 22 phonological)
–The unscreened full dataset of 336 characters (297 lexical, 17 morphological, 22 phonological)
–The screened lexical dataset of 259 characters
–The unscreened lexical dataset of 297 characters

Differences between different characters
–Lexical: most easily borrowed (most borrowings detectable), and homoplasy relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary).
–Phonological: can still be borrowed, but much less likely than lexical. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic.
–Morphological: least easily borrowed, least likely to be homoplastic.

Phylogeny reconstruction methods
–Neighbor joining
–UPGMA (technique in glottochronology)
–Maximum parsimony
–Maximum compatibility (weighted and unweighted)
–Gray and Atkinson (Bayesian estimation based upon presence/absence of cognates)
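Methods based on presence/absence of cognates need the multi-state lexical characters recoded as binary ones. The sketch below shows one common way to do that recoding (one binary character per cognate class). It is an illustration only, not the encoding used in any of the studies cited here; the function name `binarize`, the language labels, and the toy data are all hypothetical, and dropping languages with missing data is an assumption of this sketch.

```python
from typing import Dict, Optional


def binarize(character: Dict[str, Optional[int]]) -> Dict[int, Dict[str, int]]:
    """Recode one multi-state character as one binary character per cognate class.

    `character` maps a language name to its cognate-class label, or to None
    when the state is missing. Languages with missing data are left out of
    the binary characters (an assumption of this sketch).
    """
    classes = sorted({s for s in character.values() if s is not None})
    return {
        cls: {
            lang: int(state == cls)
            for lang, state in character.items()
            if state is not None
        }
        for cls in classes
    }


# Hypothetical toy character: four languages, three cognate classes.
toy_character = {"A": 1, "B": 1, "C": 2, "D": 3}
binary_characters = binarize(toy_character)
print(binary_characters[1])  # presence/absence of cognate class 1
```

Note that the binary characters produced this way are not independent of each other: a language showing class 1 necessarily lacks classes 2 and 3 for the same meaning.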

Some observations
–UPGMA (i.e., the tree-building technique for glottochronology) does the worst (e.g., splits Italic and Iranian groups).
–Other than UPGMA, all methods reconstruct the ten major subgroups, as well as Anatolian + Tocharian and Greco-Armenian.
–The Satem Core (Indo-Iranian plus Balto-Slavic) is not always reconstructed.
–Almost all analyses put Italic, Celtic, and Germanic together. (The only exception is weighted maximum compatibility on datasets that include morphological characters.)
–Methods differ significantly on the datasets, and from each other.

GA = Gray+Atkinson Bayesian MCMC method
WMC = weighted maximum compatibility
MC = maximum compatibility (identical to maximum parsimony on this dataset)
NJ = neighbor joining (distance-based method, based upon corrected distance)
UPGMA = agglomerative clustering technique used in glottochronology

Different methods/data give different answers. We don’t know which answer is correct. Which method(s)/data should we use?

Simulation study (cartoon)

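FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: example comparison of a true tree and an estimated tree with a 50% error rate, with the FN and FP edges marked]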
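The FN and FP rates can be computed by comparing the internal edges (bipartitions of the leaf set) of the true and estimated trees. The following is a minimal sketch under stated assumptions: trees are written as nested tuples of leaf names, and the helper names `leaves`, `bipartitions`, and `error_rates`, as well as the five-leaf example trees, are hypothetical and not taken from the study.

```python
def leaves(tree):
    """Return the set of leaf names below `tree` (a leaf name or a tuple of subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (internal edges) of `tree`."""
    all_leaves = leaves(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        # Normalise: always store the side that excludes the reference leaf.
        if reference in side:
            side = all_leaves - side
        # Skip trivial splits (a single leaf against the rest, or the empty split).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate): missing and incorrect internal edges."""
    true_edges = bipartitions(true_tree)
    est_edges = bipartitions(estimated_tree)
    fn = len(true_edges - est_edges) / len(true_edges) if true_edges else 0.0
    fp = len(est_edges - true_edges) / len(est_edges) if est_edges else 0.0
    return fn, fp


# Hypothetical example: the two trees share one of their two internal edges.
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
print(error_rates(true_tree, estimated_tree))
```

When both trees are fully resolved they have the same number of internal edges, so the FN and FP rates coincide; they differ only when one of the trees contains unresolved nodes.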

Phylogenetic Network Evolution

Modelling borrowing: Networks and Trees within Networks

Some useful terminology: homoplasy
[Figure panels: no homoplasy; back-mutation; parallel evolution]

Some useful terminology: lexical clock
[Figure: two trees on leaves A, B, C, D, one with a lexical clock and one with no lexical clock]
Edge lengths represent expected numbers of substitutions.

Heterotachy = departure from rates-across-sites
The underlying tree is fixed, but there are no constraints on edge length variations between characters.
[Figure: the same tree on leaves A, B, C, D drawn with two different sets of edge lengths]

Previous simulations
Most previous simulations of linguistic evolution evolved characters without any borrowing or homoplasy, all under an i.i.d. assumption, and many assumed a strong lexical clock.
Some (notably McMahon and McMahon) evolved characters with small amounts of borrowing but no homoplasy, on small networks (with one contact edge).

Our model (Cambridge University Press, 2006)
“Genetic evolution”
–Characters evolve independently of each other, but under a linguistic equivalent of the Tuffley & Steel “no common mechanism” model.
–We allow for a single homoplastic state, $h^*$. The non-homoplastic states are indicated by $n$ (or $n'$).
–If a character changes state on an edge, it either evolves into the homoplastic state $h^*$ or innovates to a new non-homoplastic state.
–For each character $c$ and tree edge $e$, there is a quintuple of probabilities: $p_{e,c}(n,n)$, $p_{e,c}(n,n')$, $p_{e,c}(n,h^*)$, $p_{e,c}(h^*,h^*)$, and $p_{e,c}(h^*,n)$.
–Binary phonological characters $c$ satisfy $p_{e,c}(h^*,h^*)=1$ and $p_{e,c}(h^*,n)=0$. We make the mild assumption that $0<p_{e,c}(n,h^*)<1$.
–Morphological and lexical characters have an unbounded number of states, so we only require that $0<p_{e,c}(n,n')<1$.
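To make the state-change rules of this model concrete, here is a rough simulation sketch for a single character. It simplifies the model in ways that are assumptions of the sketch, not of the source: the same three probabilities are used on every edge (the model allows a different quintuple per edge and character), the root starts in non-homoplastic state 0, and a character leaving $h^*$ is taken to land on a fresh non-homoplastic state. The tree encoding, the dictionary keys, and all function names are hypothetical.

```python
import itertools
import random

H_STAR = "h*"  # the single homoplastic state


def step(state, probs, rng, fresh):
    """Evolve `state` along one edge and return the state at the far end."""
    u = rng.random()
    if state == H_STAR:
        # Assumption: leaving h* produces a brand-new non-homoplastic state.
        return next(fresh) if u < probs["h*->n"] else H_STAR
    if u < probs["n->h*"]:
        return H_STAR  # change into the homoplastic state
    if u < probs["n->h*"] + probs["n->n'"]:
        return next(fresh)  # innovation: a state never seen before
    return state  # no change on this edge


def evolve(tree, state, probs, rng, fresh, leaf_states):
    """Recursively evolve the character down `tree`, recording leaf states."""
    if isinstance(tree, str):
        leaf_states[tree] = state
        return
    for child in tree:
        child_state = step(state, probs, rng, fresh)
        evolve(child, child_state, probs, rng, fresh, leaf_states)


def simulate_character(tree, probs, seed=0):
    """Return a dict mapping each leaf name to its simulated state."""
    rng = random.Random(seed)
    fresh = itertools.count(1)  # supply of new non-homoplastic states
    leaf_states = {}
    evolve(tree, 0, probs, rng, fresh, leaf_states)
    return leaf_states


# Hypothetical five-leaf tree and parameter values, for illustration only.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
binary_phonological = {"n->h*": 0.3, "n->n'": 0.0, "h*->n": 0.0}
lexical = {"n->h*": 0.05, "n->n'": 0.3, "h*->n": 0.1}
print(simulate_character(toy_tree, binary_phonological, seed=1))
print(simulate_character(toy_tree, lexical, seed=1))
```

With the binary phonological setting, every leaf ends up in either state 0 or `h*`, because `h*` is absorbing and no innovation is allowed; with the lexical setting, leaves sharing an integer state inherited it from a common ancestor, while leaves sharing `h*` may have reached it independently.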

“Borrowing” (horizontal transfer):
–Each contact edge $e=(v,w)$ has a parameter $K_e$, which is the probability of transmission of character states from $v$ to $w$. Note that $K_{(v,w)}$ may not be equal to $K_{(w,v)}$.
–Each character $c$ has a relative probability $B_c$ of being borrowed, so that the probability that character $c$ is borrowed across a contact edge $e$ is $B_c K_e$.
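As a small illustration of the borrowing rule, the sketch below multiplies the per-character and per-edge parameters and uses the product to decide whether the recipient language takes over the donor's state. The numeric values and the function names are hypothetical; only the product $B_c K_e$ comes from the model description.

```python
import random


def borrow_probability(b_c: float, k_e: float) -> float:
    """Probability that character c is borrowed across contact edge e."""
    return b_c * k_e


def maybe_borrow(donor_state, recipient_state, b_c, k_e, rng):
    """Return the recipient's state after one contact event along a directed edge."""
    if rng.random() < borrow_probability(b_c, k_e):
        return donor_state  # the donor's state replaces the recipient's
    return recipient_state


# Hypothetical values: contact is asymmetric, and lexical characters are
# assumed here to be borrowed more readily than morphological ones.
k_v_to_w, k_w_to_v = 0.2, 0.05
b_lexical, b_morphological = 1.0, 0.1

rng = random.Random(0)
print(borrow_probability(b_lexical, k_v_to_w))
print(borrow_probability(b_morphological, k_w_to_v))
print(maybe_borrow("state_at_v", "state_at_w", b_lexical, k_v_to_w, rng))
```

Treating the two directions of a contact edge as separate parameters is what lets the sketch express that $K_{(v,w)}$ and $K_{(w,v)}$ need not be equal.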

Theoretical results - I
The model tree (but not its root or parameters) is identifiable, and can be reconstructed with high probability in polynomial time, given a logarithmic number of morphological and lexical characters (an extension of the result by Mossel and Steel 2004 for the homoplasy-free model).

Theoretical results - II
Other statistically consistent and simpler polynomial-time algorithms exist (compute all bipartitions or compute all quartet trees), with longer sequence length requirements. These apply to the morphological and lexical characters.
Reconstruction from binary phonological characters can be done if they evolve i.i.d., using a distance-based approach.

Theoretical results - III The likelihood of the tree can be computed in linear time for each character, using a dynamic programming algorithm.
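The flavour of such a dynamic program can be shown on the simplest case, a binary phonological character, where $h^*$ is absorbing. The sketch below is a standard bottom-up (pruning-style) computation and is not claimed to be the authors' algorithm for the general model. Assumptions of the sketch: states are coded 0 for $n$ and 1 for $h^*$, the root is in state $n$, each internal node is a list of `(child, q)` pairs where `q` stands for $p_{e,c}(n,h^*)$ on the edge to that child, and all names and numbers are hypothetical.

```python
def partial_likelihoods(node, observed):
    """Return [L0, L1]: probability of the observed leaf states below `node`,
    conditional on `node` being in state 0 (n) or state 1 (h*)."""
    if isinstance(node, str):
        s = observed[node]
        return [1.0 if s == 0 else 0.0, 1.0 if s == 1 else 0.0]
    like = [1.0, 1.0]
    for child, q in node:
        below = partial_likelihoods(child, observed)
        # From n: stay in n with probability 1 - q, move to h* with probability q.
        like[0] *= (1.0 - q) * below[0] + q * below[1]
        # From h*: the state is absorbing, so the child must also be in h*.
        like[1] *= below[1]
    return like


def likelihood(tree, observed):
    """Likelihood of one binary character, assuming the root is in state n."""
    return partial_likelihoods(tree, observed)[0]


# Hypothetical three-leaf tree with per-edge change probabilities.
toy_tree = [("A", 0.1), ([("B", 0.2), ("C", 0.3)], 0.5)]
print(likelihood(toy_tree, {"A": 0, "B": 1, "C": 0}))
```

Each node is visited once and does a constant amount of work per child edge, so the running time is linear in the size of the tree for each character.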

Our simulation study (in press)
–Model phylogenetic networks: each had 30 leaves and up to three contact edges, and they varied in the deviation from a lexical clock.
–Characters: we had up to 360 lexical characters and 60 morphological characters, each type with two rates for homoplasy and borrowing estimated from our “screened” and “unscreened” IE data. We also varied the degree of heterotachy (deviation from the rates-across-sites assumption).
–Performance metric: we compared estimated trees to the “genetic tree” with respect to the missing edge rate.

Observations
1. Choice of reconstruction method does matter.
2. Relative performance between methods is quite stable (distance-based methods worse than character-based methods).
3. Choice of data does matter (good idea to add morphological characters).
4. Accuracy only slightly lessened with small increases in homoplasy, borrowing, or deviation from the lexical clock.
5. Some amount of heterotachy helps!

Relative performance of methods on moderate homoplasy datasets under various model conditions: (i) varying the deviation from the lexical clock, (ii) varying the heterotachy, and (iii) varying the number of contact edges.

Relative performance of methods on low homoplasy datasets under various model conditions: (i) varying the deviation from the lexical clock, (ii) varying the heterotachy, and (iii) varying the number of contact edges.

Impact of homoplasy for characters evolved down a network with three contact edges under a moderate deviation from the lexical clock and moderate heterotachy.

Impact of homoplasy for characters evolved down a tree under a moderate deviation from a lexical clock and moderate heterotachy. (Our weighting is inappropriate for “unscreened” data.)

Impact of the number of contact edges for characters evolved under low homoplasy, moderate deviation from a lexical clock, and moderate heterotachy.

Impact of the deviation from a lexical clock (from low to moderate) for characters evolved down a network with three contact edges under low levels of homoplasy and with moderate heterotachy.

Impact of heterotachy for characters evolved down a network with three contact edges, with low homoplasy, and with moderate deviation from a lexical clock. Heterotachy increases with the parameter.

Impact of data selection for characters evolved down a network with three contact edges, under low homoplasy (“screened” data), moderate deviation from a lexical clock, and moderate heterotachy.

Observations
1. Choice of reconstruction method does matter.
2. Relative performance between methods is quite stable (distance-based methods worse than character-based methods).
3. Choice of data does matter (good idea to add morphological characters).
4. Accuracy only slightly lessened with small increases in homoplasy, borrowing, or deviation from the lexical clock.
5. Some amount of heterotachy helps!

Future research We need more investigation of methods based on stochastic models (Bayesian beyond G+A, maximum likelihood, NJ with better distance corrections), as these are now the methods of choice in biology. This requires better models of linguistic evolution and hence input from linguists!

Future research (continued)
–Should we screen? The simulation uses low homoplasy as a proxy for screening, but real screening throws away data and may introduce bias.
–How do we detect/reconstruct borrowing?
–How do we handle missing data in methods based on stochastic models?
–How do we handle polymorphism?

For more information Please see the Computational Phylogenetics for Historical Linguistics web site for papers, data, and additional material

Acknowledgements
Funding: NSF, the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program for Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at UT-Austin.
Collaborators: Don Ringe, Steve Evans, Luay Nakhleh, and Francois Barbancon.