Published by Jared Murphy. Modified over 8 years ago.
1
Language Classification
Adding typology to lexicostatistics: a combined approach to language classification
ASJP Consortium
5
Overview
Project: ASJP (Automated Similarity Judgment Program), started January 2007
Overall goal: automatic reconstruction of language relationships
Basis: distance matrix between individual languages, based on lexical elements
Method: lexicostatistics (mass comparison of basic lexical items)
9
Overview
Project: ASJP (Automated Similarity Judgment Program)
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
3. extended by all relevant data available
12
Caveat
ASJP goal: reconstruction of relationships between languages
NOT: doing better than experts in the classification of areas/groups
BUT:
1. Optimize lexicostatistics on the basis of expert knowledge of well-explored areas
2. Provide a method and tools to assess and improve classifications for un(der)explored areas
14
Overview
Current collaborators: Dik Bakker, David Beck, Oleg Belyaev, Cecil H. Brown, Pamela Brown, Matthew Dryer, Dmitry Egorov, Pattie Epps, Anthony Grant, Eric W. Holman, Hagen Jung, Johann-Mattis List, Robert Mailhammer, André Müller, Uri Tadmor, Matthias Urban, Viveka Velupillai, Søren Wichmann, Kofi Yakpo
19
Overview: ASJP system (diagram)
LEX -> ASJP software -> distance matrix -> CLASSIF software
Sample distance matrix entries:
DUTCH    ENGLISH     53.3
DUTCH    FRENCH      72.7
DUTCH    MANDARIN    93.8
…
24
Overview: ASJP system, extended (diagram)
Software: ASJP software, CLASSIF software, STAT software, MAP software
Data: LEX, distance matrix, GEOGRAPH, HIST FACTS, TYPOL DATA
Existing expert classifications (ETHN, WALS, EXPRT): used for EVALUATION and CALIBRATION
26
Today … (diagram)
LEX and TYPOL DATA, ASJP software, distance matrix, CLASSIF software
27
List of basic lexical items
28
Lexical items
Word list: Morris Swadesh (1955), 100 basic meanings
29
1. I        21. dog      41. nose     61. die     81. smoke
2. you      22. louse    42. mouth    62. kill    82. fire
3. we       23. tree     43. tooth    63. swim    83. ash
4. this     24. seed     44. tongue   64. fly     84. burn
5. that     25. leaf     45. claw     65. walk    85. path
6. who      26. root     46. foot     66. come    86. mountain
7. what     27. bark     47. knee     67. lie     87. red
8. not      28. skin     48. hand     68. sit     88. green
9. all      29. flesh    49. belly    69. stand   89. yellow
10. many    30. blood    50. neck     70. give    90. white
11. one     31. bone     51. breasts  71. say     91. black
12. two     32. grease   52. heart    72. sun     92. night
13. big     33. egg      53. liver    73. moon    93. hot
14. long    34. horn     54. drink    74. star    94. cold
15. small   35. tail     55. eat      75. water   95. full
16. woman   36. feather  56. bite     76. rain    96. new
17. man     37. hair     57. see      77. stone   97. good
18. person  38. head     58. hear     78. sand    98. round
19. fish    39. ear      59. know     79. earth   99. dry
20. bird    40. eye      60. sleep    80. cloud   100. name
34
Lexical items
Swadesh list: assumptions
- Word in most languages
- Inherited rather than borrowed
- Relatively stable over time
- Easily accessible (fieldwork / grammars)
35
Lexical items
Languages transcribed to date:
- Over 3500 languages (including dialects; around 45% of the languages of the world)
36
Languages currently collected
42
Lexical items: further reduction
Reduction of the full Swadesh list:
1. Not the complete list, only the most stable items
2. Not full IPA representation, but generalized coding
1. Not the complete list
- Most stable items = least formal variation in well-established genetic groups (Dryer's genera)
- Nichols (1995): $\frac{\text{lg pairs}(\mathrm{word}_k = \mathrm{word}_k)}{\text{all pairs}}$
What is the optimal number … ?
49
[Figure: Ethnologue classification (Goodman-Kruskal) and WALS classification (Pearson); axis: stability, from + to -; the value 40 is marked]
50
40 Most Stable
[Figure: the 100-item Swadesh list with the 40 most stable items marked]
51
Lexical items: transcription
2. NOT full IPA but ASJPcode:
- 7 vowels
- 34 consonants
- All other phonemes mapped to the 'closest sound' (automatic)
53
Abaza (Caucasian):
Meaning   IPA            ASJPcode
PERSON    ʕʷɨʧʼʲʷʕʷɨs    Xw3Cw"yXw3s
LEAF      bɣʲɨ           bxy3
SKIN      ʧʷazʲ          Cwazy
HORN      ʧʼʷɨʕʷa        Cw"3Xwa
NOSE      pɨnʦʼa         p3nc"a
TOOTH     pɨʦ            p3c
55
Loss of information?
Shown for representative groups:
- ASJPcode is as good as full IPA for separating language families
- ASJPcode is more accurate than IPA for precise genetic classification (under our current method)
56
Comparing words and languages
57
Comparing words
Most successful measure to date: Levenshtein Distance
62
Comparing words
Levenshtein Distance (LD) = number of transformations (= changes & additions) to get from the shorter form to the longer form
Example: ALT vs. ASJP: x x x = 3
1. Normalization: $\mathrm{LDN} = \mathrm{LD} / L_{\max}$, range 0.0 to 1.0
2. Eliminate 'background noise': $\mathrm{LDND} = \mathrm{LDN} / \mathrm{LDN}_{\text{different pairs}}$
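The three measures on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the ASJP software itself: the function names are made up here, and `ldnd` assumes that LDN is averaged over same-meaning word pairs and then divided by the average LDN over different-meaning word pairs, which is one reading of "LDN different pairs".

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def ldn(a, b):
    """LD normalized by the length of the longer form (range 0.0 to 1.0)."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(list1, list2):
    """Assumed reading of LDND for two word lists (dicts: meaning -> form).

    Average LDN over same-meaning pairs, divided by the average LDN over
    pairs of words with different meanings (the 'background noise').
    """
    shared = [m for m in list1 if m in list2]
    same = [ldn(list1[m], list2[m]) for m in shared]
    diff = [ldn(list1[m], list2[n]) for m in shared for n in shared if m != n]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


print(levenshtein("ALT", "ASJP"))  # 3, as on the slide
print(ldn("ALT", "ASJP"))          # 0.75
```

The word lists passed to `ldnd` need at least two shared meanings whose forms differ from each other, otherwise the denominator is empty or zero.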
63
Classifying languages
64
LNG             SIL   I, YOU, WE, ONE, TWO (forms run together)
CANTONESE       yue   NohneihdeihNhdeihyatyih
HAINAN_MINNAN   nan   valuvaneNzy~a7no*|no
HAKKA           hak   NaiNiNaiteuyitly~oN|Ni
MANDARIN        cmn   wonimenwomeniel
SUZHOU_WU       wuu   NonESia*nj3ji7lia*
A_TONG          aot   aNnaNniNsani
MIKIR           mjw   nenEngnetumisihini
TARAON          mhu   ha*nu*niNkiNkaiN
NAXI            nbf   N3nvN3Ng3d35i
CHIANGRAI_MIEN  ium   yiameibuayeti
HMONG_DAW       mww   kukopeio
SUYONG_HMONG    mww   ko peiau
TAK_HMONG       mww   kukopeio
…
66
Diagram: Swadesh (3500) -> ASJP -> distance matrix
70
LG1       LG2               LDND
MANDARIN  MIDDLE_CHINESE    81.75
MANDARIN  OLD_CHINESE       94.30
MANDARIN  SUZHOU_WU         85.87
MANDARIN  DHAMMAI           97.48
MANDARIN  A_TONG            97.91
MANDARIN  KAYAH_LI_EASTERN  94.75
MANDARIN  MIKIR             99.05
MANDARIN  LEPCHA            97.24
MANDARIN  APATANI           92.24
MANDARIN  BENGNI            96.91
MANDARIN  BOKAR             95.28
…
3500 languages ~ 240,000,000 comparisons
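The figure of roughly 240 million comparisons can be checked with a quick calculation. The sketch below assumes, which the slide does not state, that each language pair is compared once on each of the 40 retained items; the variable names are illustrative only.

```python
from math import comb

languages = 3500   # word lists mentioned on the slide
items = 40         # assumed: the 40 most stable Swadesh items

language_pairs = comb(languages, 2)        # unordered pairs of languages
word_comparisons = language_pairs * items  # one comparison per item per pair

print(language_pairs)    # 6123250
print(word_comparisons)  # 244930000
```

Under that assumption the count comes out at about 245 million, in line with the order of magnitude given on the slide.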
71
Processing problems …
72
Solution: parallel processing
74
Diagram: Swadesh (3500) -> ASJP -> distance matrix -> MEGA4 (http://www.megasoftware.net/, DNA patterns): Neighbour Joining
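The deck hands the distance matrix to MEGA4 for Neighbour Joining. To show what that step does in principle, here is a minimal Python sketch of the standard neighbour-joining procedure that returns only the tree topology. It is not MEGA4's implementation, the function name is made up, and the four-taxon distances are toy values chosen for illustration, not ASJP data.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Toy neighbour joining.

    names: list of labels.
    dist:  dict mapping frozenset({a, b}) -> distance.
    Returns a nested-tuple topology; branch lengths are not computed.
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] = sum of distances from a to every other active node
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # join the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r_i - r_j
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (i, j)
        # distance from the new internal node to each remaining node
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (d[frozenset((i, k))]
                                                + d[frozenset((j, k))]
                                                - d[frozenset((i, j))])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)


# Toy distances (invented): A is close to B, C is close to D.
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}.items()}
tree = neighbour_joining(["A", "B", "C", "D"], toy)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

With real LDND values the same procedure would be applied to the full language-by-language matrix; a dedicated package is the practical choice at that scale.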
75
See the complete tree of the month on: email.eva.mpg.de/~wichmann/ASJPHomePage
81
[Figure: LDND + MEGA4 tree for Mayan (34 / 69 Ethn), with the Cholan, Tzeltalan and Yucatecan groups marked and compared with Ethnologue/experts]
84
ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
- Refinement necessary at the intermediate level
85
Adding typological data
86
Trying to improve the fit …
Enrich lexical with typological data:
Haspelmath, M., M. Dryer, D. Gil & B. Comrie (eds) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press.
WALS Online: http://wals.info/
87
Lexical plus typological data (diagram)
Swadesh (3500) + WALS (2580) -> ASJP -> distance matrix -> TREE SFTW
90
Diagram: 'SWALSH' (1250) -> ASJP -> distance matrix -> TREE SFTW
92
Improving the fit
Enrich lexical with typological data:
- NOT 1:1 with ASJP languages
- WALS matrix very unevenly filled (16%); cf. Cysouw (2008), STUF 61.3
Determine most stable features
98
Feature Stability
Nichols (1995): metric for the stability $S(\mathrm{Ftr}_k)$ of feature $k$ in group $G_x$:
$$S(\mathrm{Ftr}_k) = \frac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$$
Size differences between the groups $G$
ASJP: metric for the stability of $\mathrm{Ftr}_k$:
$$S_{F_k} = \frac{R - U}{1 - U}, \qquad R = \frac{\text{pairs}(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$$
$U$: the corresponding ratio of equal-value pairs to all pairs that serves as 'background noise'
Normalization by $(1 - U)$: the $S_{F_k}$ values become comparable
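A small Python sketch of this stability metric follows. It rests on an assumption about the slide: that the first ratio is taken over language pairs inside the groups and the background ratio U over pairs that only share a value by chance. The function and parameter names are hypothetical, and the counts in the example are invented.

```python
def feature_stability(within_same, within_total, background_same, background_total):
    """Stability of one feature as (R - U) / (1 - U).

    R: share of within-group language pairs with the same feature value.
    U: share of background pairs with the same value ('background noise').
    Both definitions are assumptions about what the slide intends.
    """
    r = within_same / within_total
    u = background_same / background_total
    return (r - u) / (1.0 - u)


# Invented counts: 90% agreement inside groups, 50% agreement in the background.
print(round(feature_stability(90, 100, 50, 100), 2))  # 0.8

# When agreement inside groups is below the background level, the value is negative.
print(round(feature_stability(20, 100, 50, 100), 2))  # -0.6
```

A value of 1 means every within-group pair agrees, and a value near 0 means within-group agreement is no better than the background.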
99
Most stable WALS features
31. Sex-based and Non-sex-based Gender Systems        0.81
118. Predicative Adjectives                           0.74
30. Number of Genders                                 0.73
119. Nominal and Locational Predication               0.71
29. Syncretism in Verbal Person/Number Marking        0.71
100
Least stable WALS features
128. Utterance Complement Clauses                          0.07
115. Negative Indefinite Pronouns/Predicate Negation       0.07
59. Possessive Classification                              0.01
135. Red and Yellow                                       -0.07
58. Obligatory Possessive Inflection                      -0.25
111
[Figure: correlation with Ethnologue; axis: stability, from + to -; one series per minimum number of features (Min ftrs: 100, 80, 60, 40, 20); the values 40, 60 and 85 are marked]
113
[Figure: WALS compared with Swadesh40]
115
Improving the fit
Typological variables* do not perform better than lexical ones for establishing genetic relationships
*WALS!
What about a combination?
121
[Figure: range from "Only WALS" to "Only Sw40", with the mixes 85:15, 70:30, 50:50 (value 0.91) and 35:65 marked]
Ftrs   Lgs
100    79
80     109
60     139
40     218
20     341
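The mixes between "Only WALS" and "Only Sw40" suggest a weighted combination of a typological and a lexical distance. The deck does not give the formula, so the following Python sketch is only one plausible way to do it: the typological distance is assumed to be the share of differing values among features attested for both languages, both distances are assumed to lie between 0 and 1, and all names and feature values are hypothetical.

```python
def typological_distance(features_a, features_b):
    """Share of differing values among features attested in both languages."""
    shared = [f for f in features_a if f in features_b]
    if not shared:
        raise ValueError("no shared features")
    differing = sum(features_a[f] != features_b[f] for f in shared)
    return differing / len(shared)


def combined_distance(lexical, typological, typ_weight=0.5):
    """Weighted mix; typ_weight=0.5 corresponds to a 50:50 combination."""
    return typ_weight * typological + (1.0 - typ_weight) * lexical


# Hypothetical feature values keyed by WALS feature number.
lang_a = {"31": "sex-based", "118": "verbal"}
lang_b = {"31": "sex-based", "118": "nonverbal", "30": "two"}

typ = typological_distance(lang_a, lang_b)  # 0.5: one of two shared features differs
print(combined_distance(0.8, typ, typ_weight=0.5))
```

Setting `typ_weight` to 0 or 1 reproduces the lexical-only and typology-only extremes.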
124
Improving the fit
Typological variables do not perform better than lexical ones for establishing genetic relationships
A combined, balanced approach is superior, but …
… at a much higher cost per language than just lexicostatistics: 84% of WALS still to be filled in …
Continue extension/optimization of the lexical method
125
Publications 2008 - 2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai (2008). Automated classification of the world's languages: a description of the method and preliminary results. Sprachtypologie und Universalienforschung 61: 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Advances in automated language classification. In A. Arppe, K. Sinnemäki & U. Nikanne (eds), Quantitative Investigations in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Explorations in automated language classification. Folia Linguistica 42-2, 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant & E. W. Holman (2009). Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13, 167-179.
126
?
127
ASJP
Overall goal:
- Method + tools for reconstruction of language relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings