Modelling language evolution


Modelling language evolution Tandy Warnow The University of Texas at Austin

Species phylogeny [Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human; from the Tree of Life website, University of Arizona.] A phylogeny is a tree representation of the evolutionary history relating the species we are interested in. This is an example of a 13-species phylogeny. At each leaf of the tree is a species, also called a taxon in phylogenetics (plural: taxa). The taxa are all distinct. Each internal node corresponds to a speciation event in the past. When reconstructing the phylogeny we compare the characteristics of the taxa, such as their appearance, physiological features, or the composition of their genetic material.

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Controversies for Indo-European history Subgrouping: Other than the 10 major subgroups, what is likely to be true? In particular, what about Italo-Celtic, Greco-Armenian, Anatolian + Tocharian, Satem Core?

This talk
Empirical evidence of how estimated phylogenies depend upon both the data and the method, and can be wrong.
Models of language evolution (from the earliest ones to more recent ones), why we need them, and what we still need to do. Note: simulations and estimation methods both depend upon model assumptions!
Results of simulation studies based upon some new models.
Comments.

Nakhleh et al., Transactions of the Philological Society 2005
Methods studied: UPGMA (lexico-statistics), neighbor joining, maximum parsimony, maximum compatibility, weighted MP, weighted MC, and Gray & Atkinson.
Datasets: four versions of the Ringe & Taylor IE data (lexical, morphological, and phonological characters): lexical only vs. all, screened vs. unscreened.
Observations:
UPGMA (lexico-statistics) does the worst: it splits known subgroups.
Other than UPGMA, all methods reconstruct the ten major subgroups, Anatolian + Tocharian, and Greco-Armenian. Nothing else is consistently reconstructed.
When using lexical data only, all methods group Italic, Celtic, and Germanic together.
Some methods (not all) will reconstruct different trees on different datasets.
Screening datasets to remove obvious homoplasy can result in better (?) trees.

Question: how to determine which phylogenies are reliable?
Data: need high quality data!
Phylogenetic reconstruction methods need to be tested before being trusted!
Examples of possible tests:
Benchmark real datasets (need good benchmarks! Are there any?)
Simulated datasets (need good models!)

Simulation study (cartoon)

Simulation study (cartoon) FN: false negative (missing edge). FP: false positive (incorrect edge). In the example shown, the error rate is 50%.
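The FN and FP counts in the cartoon can be turned into rates by comparing the edges of the true tree with those of the estimated tree. The sketch below is only illustrative: it assumes each internal edge is represented by the bipartition of the taxa it induces, and the helper names `split` and `error_rates` and the five-taxon example are hypothetical, not taken from the study.

```python
def split(side, taxa):
    """Canonical form of an edge: the unordered pair of taxon sets it separates."""
    side = frozenset(side)
    return frozenset([side, frozenset(taxa) - side])

def error_rates(true_splits, estimated_splits):
    """Return (FN rate, FP rate) for two sets of internal edges."""
    true_splits = set(true_splits)
    estimated_splits = set(estimated_splits)
    missing = true_splits - estimated_splits      # false negatives
    incorrect = estimated_splits - true_splits    # false positives
    return (len(missing) / len(true_splits),
            len(incorrect) / len(estimated_splits))

taxa = "ABCDE"                                        # hypothetical taxa
true_tree = {split("AB", taxa), split("DE", taxa)}    # ((A,B),C,(D,E))
est_tree = {split("AB", taxa), split("CD", taxa)}     # ((A,B),E,(C,D))

fn_rate, fp_rate = error_rates(true_tree, est_tree)
print(fn_rate, fp_rate)
```

Here one of the two true edges is missing and one of the two estimated edges is incorrect, so both rates are 50%, matching the cartoon.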

Modelling language evolution Models of evolution allow reconstruction methods to be evaluated in simulation. This allows us to understand the conditions under which each method will perform well. Models of evolution (for simulation purposes) need to reflect good scholarship, and should be able to reproduce the properties of real data. Models of evolution are also present in estimation methods, whether explicitly (as in ML or Bayesian) or implicitly.

Issues in modelling language evolution Character evolution model. Variation between characters. Cladogenesis model: tree vs. network vs. dialect continuum?

Modelling the evolution of single linguistic characters
Types of linguistic characters:
Phonological (sound changes)
Lexical (meanings based on a wordlist)
Morphological (especially inflectional)
Modelling issues: state space, lexical clock, homoplasy, and polymorphism.
Easy: the lexical clock is not believed, and most linguistic characters have an infinite number of possible states.
More interesting: homoplasy, polymorphism, and variation between characters.

Homoplasy-free evolution When a character changes state, it changes to a new state not in the tree. In other words, there is no homoplasy (character reversal or parallel evolution). First inferred for weird innovations in phonological characters and morphological characters in the 19th century.

Lexical characters can also evolve without homoplasy For every cognate class, the nodes of the tree in that class should form a connected subset, as long as there is no undetected borrowing or parallel semantic shift. However, our research suggests that ~15% of lexical characters evolve homoplastically.
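The connectivity condition can be checked mechanically once every node of the tree carries a state. The following is a minimal sketch under that assumption (in practice only the leaves are observed, and the states of internal nodes would have to be inferred); the function name `is_homoplasy_free` and the small example tree are hypothetical.

```python
from collections import defaultdict

def is_homoplasy_free(edges, state_of):
    """True if, for every state, the nodes carrying it form a connected subset."""
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)

    nodes_by_state = defaultdict(set)
    for node, state in state_of.items():
        nodes_by_state[state].add(node)

    for nodes in nodes_by_state.values():
        # Walk the tree from one node, staying inside this state's node set.
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adjacent[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if seen != nodes:
            return False
    return True

# Hypothetical tree: root r with children x and y; leaves A, B under x and C, D under y.
edges = [("r", "x"), ("r", "y"), ("x", "A"), ("x", "B"), ("y", "C"), ("y", "D")]

compatible = {"r": 0, "x": 1, "A": 1, "B": 1, "y": 0, "C": 0, "D": 2}
parallel = {"r": 0, "x": 0, "A": 1, "B": 0, "y": 0, "C": 1, "D": 0}

print(is_homoplasy_free(edges, compatible))
print(is_homoplasy_free(edges, parallel))
```

In the second labelling, state 1 appears at leaves A and C only, which are not adjacent in the tree, so the character shows parallel evolution and the check fails.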

Polymorphism Polymorphism means two or more states exhibited by the same language for a character. The most common examples are lexical: two or more words for the same basic meaning. Examples: big/large, little/small, rock/stone. Lexical polymorphism results primarily from semantic shift, but polymorphism due to borrowing also occurs. Incidence: lexical polymorphism is very common but transient (almost all polymorphisms are lost within a millennium). It is less frequent for other types of characters.

Modelling variation between characters: Rates-across-sites If a site (i.e., character) is twice as fast as another on one edge, it is twice as fast everywhere.

Modelling variation between characters: The no-common-mechanism model In this model, there is a separate random variable for every combination of site and edge. The underlying tree is fixed, but otherwise there are no constraints on variation between sites.
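The contrast between the rates-across-sites assumption and the no-common-mechanism model can be illustrated with a small numerical sketch. Everything concrete here is an assumption made for illustration: the edge lengths, the site rates, the use of a uniform distribution, and the function names are hypothetical and are not the parameterization used in the studies described in this talk.

```python
import random

def rates_across_sites(site_rates, edge_lengths):
    """One multiplier per site, shared by all edges: value = site rate * edge length."""
    return [[rate * length for length in edge_lengths] for rate in site_rates]

def no_common_mechanism(n_sites, n_edges, rng):
    """An independent random value for every (site, edge) pair."""
    return [[rng.random() for _ in range(n_edges)] for _ in range(n_sites)]

edge_lengths = [0.1, 0.4, 0.2]   # hypothetical edge lengths
site_rates = [1.0, 2.0]          # site 1 is twice as fast as site 0

ras = rates_across_sites(site_rates, edge_lengths)
ras_ratios = [ras[1][e] / ras[0][e] for e in range(len(edge_lengths))]
print(ras_ratios)   # the same ratio on every edge

ncm = no_common_mechanism(2, 3, random.Random(0))
ncm_ratios = [ncm[1][e] / ncm[0][e] for e in range(3)]
print(ncm_ratios)   # the ratio is free to differ from edge to edge
```

Under rates-across-sites the ratio between two sites is fixed across the tree, whereas under no-common-mechanism nothing ties the values on one edge to those on another.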

Homoplasy-free models without polymorphism The earliest models were all tree models that were homoplasy-free and obeyed the lexical clock. Ringe-Warnow: “PP” (perfect phylogeny, i.e., a homoplasy-free, no-common-mechanism, non-parametric tree model).

Cladogenesis The “speciation” model ranges from trees all the way to dialect continuums. Intermediate models include horizontal transfer (borrowing) and hybridization (creoles).

Modelling borrowing: Networks and Trees within Networks

Perfect Phylogenetic Network model Nakhleh et al. Perfect Phylogenetic Network (PPN) model: all characters evolve without homoplasy down a tree contained within the network. Published in Language, 2005. Warnow-Evans-Ringe-Nakhleh (2004): extends PPN model to allow for limited and identifiable homoplasy.

“Perfect Phylogenetic Network” for IE Nakhleh et al., Language 2005

What about polymorphism? Our first model of polymorphism (Bonet et al., 1996) was a non-parametric, no-common-mechanism model for homoplasy-free characters, with polymorphism due to semantic shift. Three problems: (1) because it is non-parametric, it cannot be used for simulation; (2) homoplasy is fairly frequent for lexical characters (15% of characters); (3) what about polymorphism due to borrowing?

Nicholls and Gray model for polymorphism Geoff Nicholls and Russell Gray (2006): a homoplasy-free, rates-across-sites, parametric model in which the character adds and loses states under a stochastic process. The number of states in a lineage can go up and down (including down to 0 and then back up). Problems: (1) homoplasy is frequent in lexical characters; (2) what is the linguistic process?

What needs to be done in modelling We need parametric models of character evolution that include reasonable levels of homoplasy, in which polymorphism arises due to semantic shift (conflation of two characters), by borrowing, or due to other linguistic processes. We also need cladogenesis models that incorporate population-level processes, and can represent dialect continuums.

Simulation study (Barbancon et al.) Simulated evolution down networks with 30 leaves, three contact edges, and with moderate levels of homoplasy and borrowing for 300 lexical characters and 60 morphological characters. Compared trees constructed by various methods to the “genetic tree” contained in the network, for topological accuracy. Methods compared: NJ, UPGMA, weighted and unweighted MP and MC.

Standard Model Conditions
Screened dataset:
Lexical characters: 4% homoplastic, 10% evolve with borrowing
Morphological characters: no homoplasy or borrowing
Unscreened dataset:
Lexical characters: 20% homoplastic, 20% borrowed
Morphological characters: 5% homoplastic, no borrowing
Molecular clock for the cladogenesis model
No-common-mechanism model with moderate variation between characters
Lexical weight=1, morphological weight=50
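To give a feel for how such conditions translate into a simulation, here is a deliberately simplified sketch. It evolves characters down a small rooted tree rather than a 30-leaf network, ignores borrowing, character weights, and the clock, and uses a single change probability per edge. The 300 lexical characters and the 4% homoplasy rate come from the screened conditions above; the tree, the change probability of 0.3, the rule used for homoplastic change, and all names are assumptions made for illustration and are not the simulator used in the study.

```python
import random

def simulate_character(children, root, p_change, homoplastic, rng):
    """Evolve one character down a rooted tree and return {node: state}."""
    states = {root: 0}
    n_states = 1                      # number of distinct states created so far
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            state = states[node]      # inherit the parent's state by default
            if rng.random() < p_change:
                if homoplastic:
                    # Assumption: a homoplastic character may re-use any earlier
                    # state (possibly the parent's own) or create a new one.
                    state = rng.randrange(n_states + 1)
                else:
                    state = n_states  # homoplasy-free: always a brand-new state
                if state == n_states:
                    n_states += 1
            states[child] = state
            stack.append(child)
    return states

# Hypothetical four-leaf tree.
children = {"root": ["x", "y"], "x": ["A", "B"], "y": ["C", "D"]}

rng = random.Random(1)
lexical = [
    simulate_character(children, "root", 0.3, rng.random() < 0.04, rng)
    for _ in range(300)               # 300 lexical characters, 4% homoplastic
]
print(len(lexical), lexical[0])
```

Changing the 0.04 to 0.20 would mimic the unscreened homoplasy level for lexical characters; modelling borrowing would additionally require contact edges, which this sketch does not include.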


Clocklike data


Points
Screening the data helps to improve the phylogenetic accuracy of most methods.
When data are generated under network models, methods which reconstruct trees do not perform well.
Modelling helps us predict the conditions under which different methods will perform well, or poorly. The more accurate the models, the more relevant the predictions.
We need better models!

Future research
Testing other methods in simulation (including some network construction methods)
Formulating improved (more realistic) models of language evolution
Implementing simulation tools under these improved models
Developing estimation methods under these improved models
Reanalyzing IE, and looking at some new families (or subfamilies)

Acknowledgements Funding: NSF, the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Studies, The Program for Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at UT-Austin. Collaborators: Don Ringe, Steve Evans, Luay Nakhleh, and Francois Barbancon.

For more information Please see the Computational Phylogenetics for Historical Linguistics web site for papers, data, and additional material http://www.cs.rice.edu/~nakhleh/CPHL

Differences between characters
Lexical: most easily borrowed (most borrowings are detectable), and homoplasy is relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary). Lexical characters also have a high incidence (80%) of transient polymorphism.
Phonological: can still be borrowed, but this is much less likely than for lexical characters. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic.
Morphological: least easily borrowed, least likely to be homoplastic. Rarely polymorphic.