Computational Phylogenetics for Language: Theory, Applications, and Extensions Claire Bowern, Yale University: claire.bowern@yale.edu UC Berkeley, May,


1 Computational Phylogenetics for Language: Theory, Applications, and Extensions
Claire Bowern, Yale University: UC Berkeley, May, 2016

2 Today: Evolutionary approaches to language change
General discussion of ‘evolutionary’ modeling.
What is phylogenetics?
Is it ‘ok’ to use biological thinking in historical linguistics?
Illustrations from Pama-Nyungan (Australia): tree construction.
Further applications.
This is part of a long-term research project looking into the nature of language change and the specifics of the prehistory of Australia: thinking about language change not just on its own, but as one of several cultural systems that provide us with evidence about the past.

3 Pama-Nyungan

4 What is phylogenetics?
Simplistically: constructing trees (computationally) and using them to investigate the prehistory of languages and populations.
Underlying assumption: cultural systems (including language) share the same systemic properties as biology when it comes to change. That is, social transmission is similar enough to biological transmission that we can use the same modeling principles (cf. Darwin 1859, Nunn 2010, Mesoudi 2010, etc.).
This is, of course, controversial (cf. Richerson, Boyd, Mesoudi, Dawkins, Dennett, Cavalli-Sforza, Sterelny, etc.).
The general approach or outlook should be distinguished from the use of computers in reconstruction, though the two are linked.

5 Cultural Evolution: Fundamental Questions
An approach to the study of the past that looks at:
the role of horizontal vs vertical transmission
commonalities and differences between societies
causes of change within lineages
causes of group split
Mesoudi (2010); Richerson et al (2013); Dediu et al (2013); Gray et al (2008); Holden and Mace (2005).
These are some of the questions that cultural evolution is interested in; mutatis mutandis, they are very similar to many of the questions that historical linguistics has looked at.

6 Linguistics and Cultural Evolution
cf. Croft (2000, 2008), amongst others: theoretical views of language change through biological analogs.
cf. Gray et al (2009), Greenhill (2006), Jordan and Gray (2000), Holden and Mace (2005), Pagel et al (2007), among many others: application of specific methods to solve specific problems.
Here: part of a larger research program in cashing out the implications of an evolutionary view of language change, in a particular family, without assuming direct parallels with biological systems.
There have, of course, been many approaches to language change that take a more or less biological view of things, and all have their promoters and detractors. The point here is not to import biological methods, but rather to use methods that are common to all sorts of evolutionary systems, including language, culture, and biology.

7 Linguistics and Biological Evolution
Broad parallels:
units [genes, words, etc.] transmitted within a population
subject to variation
skewing in the distribution of variants gives rise to change [aka selection]
=> descent with modification
But also crucial differences:
rates of change
mechanism of transmission, degree of horizontal transmission
changes across lifespan
transmission of acquired traits
Critiques of evolutionary views of linguistics often base their objections around these differences, e.g. Blench (2015). Other critiques relate to how the models are interpreted and applied to language; more on that below.

8 Linguistics and Biological Evolution
Do the differences matter?
Lewontin (1970); Sterelny (2006); Mameli & Sterelny (2009): prerequisites to modeling:
Variation
Differential fitness [aka selection]
Heritability [aka a transmission mechanism]
That is, any system with these properties can be modeled in general terms using evolutionary methods, though the specifics of the models will vary. In this I argue against Croft (2001) and several others, who look for explicit parallels, and against Andersen (2006) and others, who reject the biological comparisons wholesale. Note that my

9 © 2006 Arnold Kopff; flickr.com
Pama-Nyungan

10 Problem: The Pama-Nyungan ‘Rake’
O’Grady, Voegelin & Voegelin (1965), Dixon (1980, 2002), Hercus (1994), Bowern & Koch (2004), etc. Missing data? Too many loans? Haven’t looked hard enough? Or indicative of how hunter-gatherer languages expand?

11 The CHIRILA database (Bowern 2016)
780,000 lexical items
343 Pama-Nyungan languages; doculects
56 Non-Pama-Nyungan languages, 15+ families
The entire corpus of Tasmanian
Grammatical features for 90 languages
Morphology collection in progress
Aim for complete lexical records for Australia

12 Loans aren’t the problem
That is, they are just as much a problem elsewhere in the world.

13 Elements of a Bayesian Model
Data [working on a model of lexical replacement]:
Matrix of coded data: form + meaning cognate sets.
Cognate judgments, coded as strictly as possible according to the comparative method [then converted to a binary matrix].
Loans left in but tagged.
Model of cognate evolution: are gains and losses equally likely? Or are gains indicative of a group, while losses aren’t? Stochastic Dollo model (Nicholls and Gray 2006; developed for linguistic data).
Clock model: fixed, with a single rate of change? Or variable, with rates drawn from a distribution of rates?
Priors.
Don’t collapse cognates: you want words with different meanings, since that is information about innovation.
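The conversion from cognate judgments to a binary matrix can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the language names, cognate-set labels, and the `to_binary_matrix` helper are all hypothetical, and marking unattested meanings as `?` is an assumption about how missing data might be handled.

```python
from collections import defaultdict

# Hypothetical cognate judgments: (language, meaning, cognate-set label).
# Names and labels are invented for illustration only.
judgments = [
    ("LangA", "hand", "hand-1"),
    ("LangB", "hand", "hand-1"),
    ("LangC", "hand", "hand-2"),
    ("LangA", "water", "water-1"),
    ("LangB", "water", "water-2"),
    # LangC has no recorded word for 'water' (missing data).
]


def to_binary_matrix(judgments):
    """Return (columns, rows): one column per (meaning, cognate set)."""
    languages = sorted({lang for lang, _, _ in judgments})
    columns = sorted({(meaning, cset) for _, meaning, cset in judgments})
    attested = defaultdict(set)  # language -> meanings with any data
    present = set()              # (language, meaning, cognate set) triples
    for lang, meaning, cset in judgments:
        attested[lang].add(meaning)
        present.add((lang, meaning, cset))
    rows = {}
    for lang in languages:
        cells = []
        for meaning, cset in columns:
            if meaning not in attested[lang]:
                cells.append("?")  # no data for this meaning
            elif (lang, meaning, cset) in present:
                cells.append("1")  # language has a reflex of this set
            else:
                cells.append("0")  # meaning attested, but a different set
        rows[lang] = "".join(cells)
    return columns, rows


columns, rows = to_binary_matrix(judgments)
for lang, row in rows.items():
    print(lang, row)
```

Each cognate set becomes its own column, so two words for the same meaning in different sets stay distinct rather than being collapsed.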

14 Bayesian model of cognate evolution
Here: Stochastic Dollo model (Nicholls and Gray 2008).
0 → 1 is indicative of shared descent; 1 → 0 is not.
NB: the model doesn’t take borrowing into account, and is therefore likely to be misled by extensive shared borrowing.
Other, more flexible models include the covarion, but Stochastic Dollo is closer to standard historical assumptions.
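The asymmetry between gains and losses can be made concrete with a small sketch. Note that this is a deterministic, parsimony-style reading of the Dollo assumption (one gain, any number of losses), not the stochastic likelihood model of Nicholls and Gray. The tree, the language labels, and the helper names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def mrca(tree, taxa):
    """Return the smallest subtree that contains every taxon in `taxa`."""
    if not isinstance(tree, str):
        for child in tree:
            if taxa <= leaves(child):
                return mrca(child, taxa)
    return tree


def count_losses(tree, has_cognate):
    """Count maximal subtrees in which no language has the cognate."""
    if not (leaves(tree) & has_cognate):
        return 1  # one loss on the branch leading to this subtree
    if isinstance(tree, str):
        return 0
    return sum(count_losses(child, has_cognate) for child in tree)


def dollo_history(tree, has_cognate):
    """Single gain at the MRCA of the attesting languages, then losses."""
    if not has_cognate:
        raise ValueError("at least one language must have the cognate")
    origin = mrca(tree, has_cognate)
    return leaves(origin), count_losses(origin, has_cognate)


# Hypothetical five-language tree.
tree = ((("A", "B"), "C"), ("D", "E"))
print(dollo_history(tree, {"A", "C"}))  # gain above A, B, C; one loss (B)
print(dollo_history(tree, {"A", "D"}))  # gain at the root; three losses
```

Under this reading, a cognate shared by `A` and `D` is explained by one early gain and three later losses, never by two independent gains. That is also why undetected borrowing between `A` and `D` would be misread as inheritance.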

15 Bayesian tree inference (MCMC)
Markov chain Monte Carlo (MCMC):
Construct a random starting tree from the data and from parameter values.
Evaluate its likelihood, given the data and our assumptions about how words evolve.
Change the tree structure within the parameters specified.
Score the new tree.
Accept or reject the new tree.
Continue for 10,000,000 iterations, sampling every 1000.
Summarize the results. Subgroups that appear often in the analysis have high support.
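The propose, score, accept-or-reject loop can be sketched on a toy problem. The example below assumes four languages (so only three unrooted trees exist), a flat prior over trees, and invented log-likelihood values standing in for a real model of cognate evolution. It also uses far fewer iterations than a real analysis and skips burn-in. All names here are hypothetical.

```python
import math
import random
from collections import Counter

# Invented log-likelihoods for the three unrooted trees of four languages.
LOG_LIKELIHOOD = {"AB|CD": -10.0, "AC|BD": -12.0, "AD|BC": -13.0}


def run_chain(iterations=20000, sample_every=10, seed=1):
    """Metropolis sampler over tree topologies; returns the sampled trees."""
    rng = random.Random(seed)
    topologies = sorted(LOG_LIKELIHOOD)
    current = rng.choice(topologies)  # random starting tree
    samples = []
    for step in range(1, iterations + 1):
        # Propose a different topology (the proposal is symmetric).
        proposal = rng.choice([t for t in topologies if t != current])
        log_ratio = LOG_LIKELIHOOD[proposal] - LOG_LIKELIHOOD[current]
        # Always accept a better tree; accept a worse one with
        # probability equal to the likelihood ratio.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        if step % sample_every == 0:
            samples.append(current)
    return samples


def support(samples):
    """Share of sampled trees showing each topology."""
    counts = Counter(samples)
    return {tree: counts[tree] / len(samples) for tree in sorted(counts)}


posterior = support(run_chain())
for tree, share in posterior.items():
    print(tree, round(share, 3))
```

The share of samples in which a grouping appears is its support: here `AB|CD` should dominate, while the other two trees still appear occasionally, which is how the method quantifies uncertainty rather than returning a single tree.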

16 Tree-based Research Questions:
Can we recover the uncontroversial lower-level groupings? [testing internal validity of model] What higher-level groupings do we reconstruct? What level of support do they have?

17 Predictions
If loans are problematic: we should see the effects in the tree, as per results in Bowern et al (2011); e.g. Ngumpin-Yapa, Yardli.
If a rake is the best tree structure: we should see low levels of support for higher nodes, and conflicting evidence for grouping.

18 Initial data
194 Pama-Nyungan languages
189 words of basic vocabulary, coded for cognacy according to the comparative method (CM)
Stochastic Dollo model [vs CTMC and covarion]
Relaxed clock

19 1) Subgroup recovery
Tracked 28 subgroups; recovered 24.
Problems: 4 groups appear as paraphyletic.
Western Torres (Mabuiag) has high replacement levels.
Paman has missing data and was undersampled.
Ngumpin-Yapa and Yardli have very high loan levels; in addition, Yardli has high levels of missing data.
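Checking whether an inferred tree "recovers" an established subgroup amounts to a monophyly test. The sketch below is one simple way to do that on a nested-tuple tree; the tree, the language labels, and the `intruders` helper are hypothetical, and the check does not distinguish paraphyly from polyphyly.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def mrca(tree, taxa):
    """Return the smallest subtree that contains every taxon in `taxa`."""
    if not isinstance(tree, str):
        for child in tree:
            if taxa <= leaves(child):
                return mrca(child, taxa)
    return tree


def intruders(tree, group):
    """Languages outside `group` that sit under the group's common ancestor.

    An empty result means the group is recovered as a clade.
    """
    return leaves(mrca(tree, group)) - group


# Hypothetical inferred tree: A1 and A2 group together, B1 and B2 do not.
tree = ((("A1", "A2"), "B1"), ("B2", "C1"))

print(intruders(tree, {"A1", "A2"}))  # empty set: subgroup recovered
print(intruders(tree, {"B1", "B2"}))  # other languages intrude: not recovered
```

Running such a check for each tracked subgroup against every sampled tree would give a per-subgroup recovery rate, which is one way to test the internal validity of the model.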

20 2) and 3): Higher level groups
Highest nodes had variable support: mostly strong, but some weak. Therefore, not a rake, but not fully resolved. In particular, multiple bifurcation at the earliest nodes.

21 Support for rake model?

22 Extensions

23 Cognate Coding

24 More languages and cognate coding
Added 105 languages.
Added 20 cognates (body parts, kin terms, ‘camp’, ‘hill’).
Numerous minor coding changes: updating to current knowledge, fixing typographical errors, etc. With about 45,000 codes in this database, the occasional typo is to be expected.
This solved the lower-level Western Torres, Paman and Karnic (Yardli) problems. That is, we now recover all groups as per established classifications.
Implication: even within ‘basic vocabulary’, the wordlist matters. This needs further investigation. Traditional historical linguistics doesn’t have good models of lexical replacement.

25 Next stages (2012-now)
Sample undersampled areas [Paman, Kulin].
Extend cognate coding to additional widespread, well-attested forms.
Study the effects of loans on coding [recoding solves Ngumpin, but so does adding more cognates and languages].
Use the tree to probe unidentified wordlists.
Examine the unity of Pama-Nyungan by coding relatives and adjacent families [Garrwan, Tangkic, Nyulnyulan, Worrorran].
Use the tree in ancestral state reconstruction [cf. Zhou & Bowern 2015, Bowern et al 2013, Haynie & Bowern (under review)].

26 Unidentified languages?

27 Unidentified languages/wordlists
Poorly attested materials: do they belong to languages we already know about? Or are there additional languages not previously identified in classifications?
Can we classify languages with doubtful subgroup affiliation?
Solution: code for cognacy and investigate phylogenetically.
Relevant both for science and for revitalization/reclamation efforts.

28 Unidentified Languages/Wordlists
Most wordlists group closely with already coded varieties

29 Bigambal: Bandjalangic or Central NSW?

30 Is Pama-Nyungan really a family?

31 Unity of Pama-Nyungan
Pama-Nyungan’s nearest relatives: Garrwan, Tangkic.
Classed as Pama-Nyungan in early classifications on the basis of typology (e.g. O’Grady, Voegelin & Voegelin 1965).
Reclassified in Blake (1988) on the basis of their pronouns.

32 Garrwan and Tangkic are ‘Western’

33 ‘Outgroups’ are ‘Western’

34 Conclusions

35 Ten years ago…
no Pama-Nyungan tree
no consensus on how Pama-Nyungan subgroups are related
no data repository
and therefore, no easy way to study change in Australia

36 Now…
Much better idea of macro-groupings, but still substantial issues about how they might fit together:
the data matters
the model matters
[not insoluble, just work for the future]
Much better idea of the extent of the diversity on the continent: more than we thought…
CHIRILA database: access to extensive data.
New ways to investigate language in space, and questions of language diversification in space.
Need for further investigation of the internal composition of Pama-Nyungan.

37 Phylogenetics in linguistics
An important tool for future work in this area.
Linguistics, along with archaeology, anthropology, and genetics, is a crucial tool for finding out about the past.
Computational (linguistic) phylogenetics is an important way to investigate both individual language histories and processes of change more generally:
Explicit evolutionary hypotheses
Explicit quantification of uncertainty
But, as with all work, only as good as the data that go into it.

38 Acknowledgments
NSF grants BCS-0844550 and BCS-1423711.
The Aboriginal and Torres Strait Islanders who have given permission for their languages to be included in the database, and made data available.
The 100+ linguists who have given permission for their work to be included in the database.
The 50+ research students (undergraduates and graduates) who have been involved in the project since 2007, at Rice Univ. and Yale.

39

40 Evolutionary thinking in linguistics
Explicit, precise models of language change.
A view of language change that is explicitly connected with what people do (e.g. where they move).
Explicit quantification of uncertainty.
Rigorous and consistent treatment of data (especially important in regions, such as Australia, without a long research tradition).
A way of bridging population-level vs individual-level questions (the old conundrum of what happens in the heads of individuals vs what happens to languages; cf. Hale 1998).
One of many methods we can use to investigate the past.
cf. Gray et al 2008; Gray and Atkinson 2003; Bowern 2010.

41 Ingredients of a phylogenetic analysis
Evolutionary model: explicit hypothesis about how the units of analysis (e.g. words) change over time.
Quantified data: ‘characters’.
Other priors (assumptions about the data, other modeling assumptions).
Assumption of uniformitarianism, more or less (that is, mechanisms in the past work the same as in the present).
Does not require the assumption that evolution is ‘tree-like’.

