Computational Phylogenetics for Language: Theory, Applications, and Extensions Claire Bowern, Yale University: claire.bowern@yale.edu UC Berkeley, May,


Computational Phylogenetics for Language: Theory, Applications, and Extensions Claire Bowern, Yale University: claire.bowern@yale.edu UC Berkeley, May, 2016

Today: Evolutionary approaches to language change. General discussion of ‘evolutionary’ modeling. What is phylogenetics? Is it ‘ok’ to use biological thinking in historical linguistics? Illustrations from Pama-Nyungan (Australia): tree construction. Further applications. This is part of a long-term research project looking into the nature of language change and the specifics of the prehistory of Australia. It treats language change not just on its own, but as one of several cultural systems that provide us with evidence about the past.

Pama-Nyungan

What is phylogenetics? Simplistically: constructing trees (computationally) and using them to investigate the prehistory of languages and populations. Underlying assumption: cultural systems (including language) share the same systemic properties as biology when it comes to change. That is, social transmission is similar enough to biological transmission that we can use the same modeling principles (cf. Darwin 1859, Nunn 2010, Mesoudi 2010, etc.). This is, of course, controversial (cf. Richerson, Boyd, Mesoudi, Dawkins, Dennett, Cavalli-Sforza, Sterelny, etc.). The general approach or outlook should be distinguished from the use of computers in reconstruction, though the two are linked.

Cultural Evolution: Fundamental Questions. An approach to the study of the past that looks at: the role of horizontal vs vertical transmission; commonalities and differences between societies; causes of change within lineages; causes of group split. Mesoudi (2010); Richerson et al (2013); Dediu et al (2013); Gray et al (2008); Holden and Mace (2005). These are some of the questions that cultural evolution is interested in; mutatis mutandis, they are very similar to many of the questions that historical linguistics has looked at.

Linguistics and Cultural Evolution. Cf. Croft (2000, 2008), amongst others: theoretical views of language change through biological analogs. Cf. Gray et al (2009), Greenhill (2006), Jordan and Gray (2000), Holden and Mace (2005), Pagel et al (2007), among many others: application of specific methods to solve specific problems. Here: part of a larger research program in cashing out the implications of an evolutionary view of language change, in a particular family, without assuming direct parallels with biological systems. There have, of course, been many approaches to language change that take a more or less biological view of things, and all have their promoters and detractors. The aim here is not to import biological methods, but rather to use methods which are common to all sorts of evolutionary systems, including language, culture, and biology.

Linguistics and Biological Evolution. Broad parallels: units [genes, words, etc] are transmitted within a population; they are subject to variation; skewing in the distribution of variants gives rise to change [aka selection] => descent with modification. But also crucial differences: rates of change; mechanism of transmission; degree of horizontal transmission; changes across the lifespan; transmission of acquired traits. Critiques of evolutionary views of linguistics often base their objections around these differences, e.g. Blench (2015). Other critiques relate to how the models are interpreted and applied to language; more on that below.

Linguistics and Biological Evolution. Do the differences matter? Lewontin (1970); Sterelny (2006); Mameli & Sterelny (2009). Prerequisites to modeling: variation; differential fitness [aka selection]; heritability [aka a transmission mechanism]. That is, any system with these properties can be modeled in general terms using evolutionary methods, though the specifics of the models will vary. In this I argue against Croft (2001) and several others, who look for explicit parallels, and against Andersen (2006) and others, who reject the biological comparisons wholesale. Note that my

© 2006 Arnold Kopff; flickr.com Pama-Nyungan

Problem: The Pama-Nyungan ‘Rake’. O’Grady, Voegelin & Voegelin (1965), Dixon (1980, 2002), Hercus (1994), Bowern & Koch (2004), etc. Missing data? Too many loans? Haven’t looked hard enough? Or indicative of how hunter-gatherer languages expand?

The CHIRILA database (Bowern 2016): 780,000 lexical items; 343 Pama-Nyungan languages; 1140+ doculects; 56 Non-Pama-Nyungan languages, 15+ families; the entire corpus of Tasmanian; grammatical features for 90 languages; morphology collection in progress. Aim for complete lexical records for Australia.

Loans aren’t the problem. That is, they are just as much a problem elsewhere in the world.

Elements of a Bayesian Model. Data [working on a model of lexical replacement]: a matrix of coded data, form + meaning cognate sets. Cognate judgments are coded as strictly as possible according to the comparative method [then converted to a binary matrix]. Loans are left in but tagged. Model of cognate evolution: are gains and losses equally likely? Or are gains indicative of a group, while losses aren’t? Stochastic Dollo model (Nicholls and Gray 2006; developed for linguistic data). Clock model: fixed, with a single rate of change? Or a variable rate, drawn from a distribution of rates? Priors. Note on coding: don’t collapse cognates across meanings; you want words with different meanings, since that’s information about innovation.
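The conversion from cognate judgments to a binary matrix can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the language names, meanings, and cognate-set labels are invented placeholders, and it assumes at most one cognate set per language per meaning, with `None` marking missing data.

```python
def to_binary_matrix(cognates):
    """Turn {language: {meaning: cognate_set or None}} into binary rows.

    Each column is one (meaning, cognate_set) pair. A cell is "1" if the
    language has a word in that set, "0" if it has a word in a different
    set for that meaning, and "?" if the meaning is unattested.
    """
    columns = sorted({(meaning, cset)
                      for forms in cognates.values()
                      for meaning, cset in forms.items()
                      if cset is not None})
    matrix = {}
    for language, forms in cognates.items():
        row = []
        for meaning, cset in columns:
            attested = forms.get(meaning)
            if attested is None:
                row.append("?")          # missing data, not absence
            else:
                row.append("1" if attested == cset else "0")
        matrix[language] = "".join(row)
    return columns, matrix


# Hypothetical toy data: three languages, two meanings.
toy = {
    "LangA": {"hand": "a", "water": "a"},
    "LangB": {"hand": "a", "water": "b"},
    "LangC": {"hand": "b", "water": None},
}
columns, matrix = to_binary_matrix(toy)
for language, row in matrix.items():
    print(language, row)
```

Keeping "?" distinct from "0" matters because a missing word is not evidence that a cognate was lost.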

Bayesian model of cognate evolution. Here: Stochastic Dollo model (Nicholls and Gray 2008). 0 → 1 is indicative of shared descent; 1 → 0 is not. NB: the model doesn’t take borrowing into account, and is therefore likely to be misled by extensive shared borrowing. Other, more flexible models include the covarion, but StDollo is closer to standard historical assumptions.
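The asymmetry between gains and losses can be illustrated with a small Dollo-style count. This is a sketch of the simpler Dollo parsimony idea (a cognate arises once, and every other absence is explained by loss), not the Stochastic Dollo likelihood itself; the tree shape and language labels are hypothetical.

```python
def leaves(tree):
    """Set of leaf names in a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def count_losses(node, present):
    """Losses needed below a node that is assumed to have the cognate."""
    if not (leaves(node) & present):
        return 1                      # one loss covers the whole subtree
    if isinstance(node, str):
        return 0                      # leaf that still has the cognate
    return sum(count_losses(child, present) for child in node)


def dollo_losses(tree, present):
    """Minimum losses if the cognate was gained exactly once (0 -> 1)."""
    present = set(present) & leaves(tree)
    if not present:
        return 0
    node = tree
    # Walk down to the most recent common ancestor of all attesting languages.
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]
    return count_losses(node, present)


toy_tree = ((("A", "B"), "C"), ("D", "E"))
print(dollo_losses(toy_tree, {"A", "C"}))   # gain above (A,B,C); B lost it
print(dollo_losses(toy_tree, {"A", "D"}))   # gain at the root; B, C, E lost it
```

A cognate shared through borrowing would be counted here as one early gain followed by many losses, which is the kind of misleading signal the slide warns about.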

Bayesian tree inference (MCMC): Markov chain Monte Carlo. Construct a random starting tree from the data and from parameter values. Evaluate its likelihood, given the data and our assumptions about how words evolve. Change the tree structure within the parameters specified. Score the new tree. Accept or reject the new tree. Continue for 10,000,000 iterations, sampling every 1000. Summarize the results. Subgroups that appear often in the analysis have high support.
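The propose, score, accept-or-reject loop above can be sketched with a Metropolis sampler on a deliberately small problem. This is an assumption-laden toy: instead of a tree, the state is a single lexical replacement rate, the likelihood (each word lost independently with probability 1 - exp(-rate)) and the Exponential(1) prior are invented for illustration, and the chain is far shorter than the one on the slide.

```python
import math
import random


def log_posterior(rate, lost, total):
    """Unnormalized log posterior for a replacement rate (toy model)."""
    if rate <= 0:
        return -math.inf              # rates must be positive
    p_lost = -math.expm1(-rate)       # 1 - exp(-rate), numerically stable
    log_lik = lost * math.log(p_lost) + (total - lost) * (-rate)
    return log_lik - rate             # Exponential(1) prior on the rate


def run_chain(lost, total, iterations=20_000, thin=100, step=0.05, seed=1):
    """Metropolis sampler: propose, score, accept or reject, thin."""
    rng = random.Random(seed)
    rate = 0.3                                    # arbitrary starting state
    current = log_posterior(rate, lost, total)
    samples = []
    for i in range(1, iterations + 1):
        proposal = rate + rng.gauss(0.0, step)    # symmetric proposal
        candidate = log_posterior(proposal, lost, total)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            rate, current = proposal, candidate
        if i % thin == 0:                         # keep every `thin`-th state
            samples.append(rate)
    return samples


samples = run_chain(lost=20, total=100)
print(len(samples), sum(samples) / len(samples))
```

With the settings on the slide (10,000,000 iterations, sampling every 1000), the same thinning rule would retain 10,000 samples. In a real analysis each state is a tree plus model parameters, and the proposal step rearranges the tree.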

Tree-based Research Questions: Can we recover the uncontroversial lower-level groupings? [testing internal validity of model] What higher-level groupings do we reconstruct? What level of support do they have?
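Support for a grouping is usually summarized as the share of sampled trees in which that grouping appears as a clade. A minimal sketch, assuming trees are written as nested tuples of language labels (the four sampled trees below are invented, not results from the analysis):

```python
from collections import Counter


def clades(tree):
    """All clades of a nested-tuple tree, each as a frozenset of leaves."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def clade_support(sample):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter()
    for tree in sample:
        counts.update(clades(tree))
    return {clade: n / len(sample) for clade, n in counts.items()}


# Hypothetical posterior sample of four trees over languages A-D.
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
support = clade_support(sample)
print(support[frozenset({"A", "B"})])   # 0.75
print(support[frozenset({"C", "D"})])   # 0.5
```

An established subgroup counts as recovered when its full set of languages shows up as a clade with high support; a rake-like history would instead show low support spread over conflicting higher-level clades.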

Predictions: If loans are problematic: Should see the effects in the tree, as per results in Bowern et al (2011); e.g. Ngumpin-Yapa, Yardli. If a rake is the best tree structure: Should see low levels of support for higher nodes, and conflicting evidence for grouping.

Initial data: 194 Pama-Nyungan languages; 189 words of basic vocabulary, coded for cognacy according to the comparative method; Stochastic Dollo model [vs CTMC and Covarion]; relaxed clock.

1) Subgroup recovery. Tracked 28 subgroups; recovered 24. Problems: 4 groups appear as paraphyletic. Western Torres (Mabuiag) has high replacement levels; Paman has missing data and was undersampled; Ngumpin-Yapa and Yardli have very high loan levels; in addition, Yardli has high levels of missing data.

2) and 3): Higher level groups Highest nodes had variable support: mostly strong, but some weak. Therefore, not a rake, but not fully resolved. In particular, multiple bifurcation at the earliest nodes.

Support for rake model?

Extensions

Cognate Coding

More languages and cognate coding. Added 105 languages. Added 20 cognates (body parts, kin terms, ‘camp’, ‘hill’). Made numerous minor coding changes (updates reflecting current knowledge, corrections of typographical errors, etc.); with about 45,000 codes in this database, the occasional typo is to be expected. This solved the lower-level Western Torres, Paman and Karnic (Yardli) problems. That is, we now recover all groups as per established classifications. Implication: even within ‘basic vocabulary’, the wordlist matters. This needs further investigation. Traditional historical linguistics doesn’t have good models of lexical replacement.

Next stages (2012-now): Sample undersampled areas [Paman, Kulin]. Extend cognate coding to additional widespread, well-attested forms. Study the effects of loans on coding [recoding solves Ngumpin, but so does adding more cognates and languages]. Use the tree to probe unidentified wordlists. Examine the unity of Pama-Nyungan by coding relatives and adjacent families [Garrwan, Tangkic, Nyulnyulan, Worrorran]. Use the tree in ancestral state reconstruction [cf. Zhou & Bowern 2015, Bowern et al 2013, Haynie & Bowern (under review)].

Unidentified languages?

Unidentified languages/wordlists. Poorly attested materials: do they belong to languages we already know about? Or are there additional languages not previously identified in classifications? Can we classify languages with doubtful subgroup affiliation? Solution: code for cognacy and investigate phylogenetically. Relevant both for science and for revitalization/reclamation efforts.

Unidentified Languages/Wordlists Most wordlists group closely with already coded varieties

Bigambal: Bandjalangic or Central NSW?

Is Pama-Nyungan really a family?

Unity of Pama-Nyungan. Pama-Nyungan’s nearest relatives: Garrwan and Tangkic. Classed as Pama-Nyungan in early classifications on the basis of typology (e.g. O’Grady, Voegelin & Voegelin 1965). Reclassified in Blake (1988) on the basis of their pronouns.

Garrwan and Tangkic are ‘Western’

‘Outgroups’ are ‘Western’

Conclusions

Ten years ago… no Pama-Nyungan tree; no consensus on how Pama-Nyungan subgroups are related; no data repository; and therefore, no easy way to study change in Australia.

Now… Much better idea of macro-groupings, but still substantial issues about how they might fit together: the data matters; the model matters [not insoluble, just work for the future]. Much better idea of the extent of the diversity on the continent: more than we thought… CHIRILA database: access to extensive data. New ways to investigate language in space, and questions of language diversification in space. Need for further investigation of the internal composition of Pama-Nyungan.

Phylogenetics in linguistics: an important tool for future work in this area. Linguistics, along with archaeology, anthropology, and genetics, is a crucial tool for finding out about the past. Computational (linguistic) phylogenetics is an important way to investigate both individual language histories and processes of change more generally. Explicit evolutionary hypotheses. Explicit quantification of uncertainty. But, as with all work, only as good as the data that go into it.

Acknowledgments. NSF grants BCS-0844550 and BCS-1423711. The Aboriginal and Torres Strait Islanders who have given permission for their languages to be included in the database, and made data available. The 100+ linguists who have given permission for their work to be included in the database. The 50+ research students (undergraduates and graduates) who have been involved in the project since 2007, at Rice Univ. and Yale.

Evolutionary thinking in linguistics. Explicit, precise models of language change. A view of language change that is explicitly connected with what people do (e.g. where they move). Explicit quantification of uncertainty. Rigorous and consistent treatment of data (especially important in regions, such as Australia, without a long research tradition). A way of bridging population-level vs individual-level questions (the old conundrum of what happens in the heads of individuals vs what happens to languages; cf. Hale (1998)). One of many methods we can use to investigate the past. Cf. Gray et al 2008; Gray and Atkinson 2003; Bowern 2010.

Ingredients of a phylogenetic analysis. Evolutionary model: an explicit hypothesis about how the units of analysis (e.g. words) change over time. Quantified data: ‘characters’. Other priors (assumptions about the data, other modeling assumptions). An assumption of uniformitarianism, more or less (that is, mechanisms in the past work the same as in the present). Does not require the assumption that evolution is ‘tree-like’.