Language Classification1 Adding typology to lexicostatistics: a combined approach to language classification ASJP Consortium.

1 Language Classification1 Adding typology to lexicostatistics: a combined approach to language classification ASJP Consortium

2 Language Classification2 Overview Project ASJP (Started January 2007): (Automated Similarity Judgment Program)

3 Language Classification3 Overview Project: ASJP (Automated Similarity Judgment Program) Overall goal: Automatic reconstruction of language relationships

4 Language Classification4 Overview Project: ASJP (Automated Similarity Judgment Program) Overall goal: Automatic reconstruction of language relationships Basis: Distance matrix between individual languages based on lexical elements

5 Language Classification5 Overview Project: ASJP (Automated Similarity Judgment Program) Overall goal: Automatic reconstruction of language relationships Basis: Distance matrix between individual languages Method: Lexicostatistics: mass comparison of basic lexical items

6 Language Classification6 Overview Project: ASJP (Automated Similarity Judgment Program) As in traditional lexicostatistics, but:

7 Language Classification7 Overview Project: ASJP (Automated Similarity Judgment Program) As in traditional lexicostatistics, but: 1. use of computational algorithms and tools

8 Language Classification8 Overview Project: ASJP (Automated Similarity Judgment Program) As in traditional lexicostatistics, but: 1. use of computational algorithms and tools 2. methodology from classification in biology

9 Language Classification9 Overview Project: ASJP (Automated Similarity Judgment Program) As in traditional lexicostatistics, but: 1. use of computational algorithms and tools 2. methodology from classification in biology 3. extended by all relevant data available

10 Language Classification10 ASJP goal: Reconstruction of relationships between languages NOT: better than experts in classification of areas/groups Caveat:

11 Language Classification11 ASJP goal: Reconstruction of relationships between languages NOT: better than experts in classification of areas/groups BUT: 1. Optimize lexicostatistics on basis of expert knowledge on well-explored areas Caveat:

12 Language Classification12 ASJP goal: Reconstruction of relationships between languages NOT: better than experts in classification of areas/groups BUT: 1. Optimize lexicostatistics on basis of expert knowledge 2. Provide method and tools to assess and improve classifications for un(der)explored areas Caveat:

13 Language Classification13 Overview Current collaborators: Dik Bakker, David Beck, Oleg Belyaev, Cecil H. Brown, Pamela Brown, Matthew Dryer, Dmitry Egorov, Pattie Epps, Anthony Grant, Eric W. Holman, Hagen Jung, Johann-Mattis List, Robert Mailhammer, André Müller, Uri Tadmor, Matthias Urban, Viveka Velupillai, Søren Wichmann, Kofi Yakpo

15 Language Classification15 LEX Overview ASJP system

16 Language Classification16 LEX ASJP software Overview ASJP system Method

17 Language Classification17 LEX ASJP software distance matrix Overview ASJP system

18 Language Classification18 LEX ASJP software distance matrix Overview ASJP system
DUTCH  ENGLISH   53.3
DUTCH  FRENCH    72.7
DUTCH  MANDARIN  93.8
…

19 Language Classification19 LEX distance matrix ASJP software CLASSIF software Overview ASJP system

20 Language Classification20 LEX distance matrix ETHN WALS EXPRT EVALUATION CLASSIF software STAT software ASJP software Existing Expert Classifications:

21 Language Classification21 LEX distance matrix ETHN WALS EXPRT ASJP software CLASSIF software STAT software CALIBRATION Existing Expert Classifications: Method

22 Language Classification22 LEX distance matrix ETHN WALS EXPRT GEO GRAPH MAP software ASJP software CLASSIF software STAT software

23 Language Classification23 LEX distance matrix ETHN WALS EXPRT GEO GRAPH HIST FACTS ASJP software CLASSIF software STAT software MAP software

24 Language Classification24 LEX distance matrix ETHN WALS EXPRT GEO GRAPH HIST FACTS TYPOL DATA CLASSIF software STAT software MAP software ASJP software

25 Language Classification25 LEX distance matrix ASJP software CLASSIF software Today …

26 Language Classification26 LEX distance matrix TYPOL DATA ASJP software CLASSIF software Today …

27 Language Classification27 List of basic lexical items

28 Language Classification28 Lexical items Word list Morris Swadesh (1955): 100 basic meanings

29 Language Classification29
1. I, 2. you, 3. we, 4. this, 5. that, 6. who, 7. what, 8. not, 9. all, 10. many, 11. one, 12. two, 13. big, 14. long, 15. small, 16. woman, 17. man, 18. person, 19. fish, 20. bird
21. dog, 22. louse, 23. tree, 24. seed, 25. leaf, 26. root, 27. bark, 28. skin, 29. flesh, 30. blood, 31. bone, 32. grease, 33. egg, 34. horn, 35. tail, 36. feather, 37. hair, 38. head, 39. ear, 40. eye
41. nose, 42. mouth, 43. tooth, 44. tongue, 45. claw, 46. foot, 47. knee, 48. hand, 49. belly, 50. neck, 51. breasts, 52. heart, 53. liver, 54. drink, 55. eat, 56. bite, 57. see, 58. hear, 59. know, 60. sleep
61. die, 62. kill, 63. swim, 64. fly, 65. walk, 66. come, 67. lie, 68. sit, 69. stand, 70. give, 71. say, 72. sun, 73. moon, 74. star, 75. water, 76. rain, 77. stone, 78. sand, 79. earth, 80. cloud
81. smoke, 82. fire, 83. ash, 84. burn, 85. path, 86. mountain, 87. red, 88. green, 89. yellow, 90. white, 91. black, 92. night, 93. hot, 94. cold, 95. full, 96. new, 97. good, 98. round, 99. dry, 100. name

30 Language Classification30 Lexical items Swadesh list: assumptions

31 Language Classification31 Lexical items Swadesh list: - Word in most languages

32 Language Classification32 Lexical items Swadesh list: - Word in most languages - Inherited rather than borrowed

33 Language Classification33 Lexical items Swadesh list: - Word in most languages - Inherited rather than borrowed - Relatively stable over time

34 Language Classification34 Lexical items Swadesh list: - Word in most languages - Inherited rather than borrowed - Relatively stable over time - Easily accessible (fieldwork / grammars)

35 Language Classification35 Lexical items Languages transcribed to date: - Over 3500 languages (incl. dialects; around 45% of lgs of the world)

36 Language Classification36 Languages currently collected

37 Language Classification37 Lexical items: further reduction Reduction of the full Swadesh list:

38 Language Classification38 Lexical items: further reduction Reduction of the full Swadesh list: 1. Not the complete list, only most stable items

39 Language Classification39 Lexical items: further reduction Reduction of the full Swadesh list: 1. Not the complete list, only most stable items 2. Not full IPA representation, but generalized coding

40 Language Classification40 Lexical items: further reduction 1. Not the complete list - Most stable items = least formal variation in well-established genetic groups (Dryer’s genera)

41 Language Classification41 Lexical items: further reduction 1. Not the complete list - Most stable items = least formal variation in well-established genetic groups (Dryer’s genera) Nichols (1995): $\frac{\text{lg pairs}\,(\mathrm{word}_k = \mathrm{word}_k)}{\text{all pairs}}$

42 Language Classification42 Lexical items: further reduction 1. Not the complete list - Most stable items = least formal variation in well-established genetic groups (Dryer’s genera) Nichols (1995): $\frac{\text{lg pairs}\,(\mathrm{word}_k = \mathrm{word}_k)}{\text{all pairs}}$ → What is optimal number … ?

43-49 [Figure, built up over slides 43-49: two panels, Ethnologue Classification (Goodman-Kruskal) and WALS Classification (Pearson); axis: Stability, from + to -; slide 47 marks the value 40.]

50 Language Classification50 I dog nose die smoke you louse mouth kill fire we tree tooth swim ash this seed tongue fly burn that leaf claw walk path who root foot come mountain what bark knee lie red not skin hand sit green all flesh belly stand yellow many blood neck give white one bone breast say black two grease heart sun night big egg liver moon hot long horn drink star cold small tail eat water full woman feather bite rain new man hair see stone good person head hear sand round fish ear know earth dry bird eye sleep cloud name 40 Most Stable

51 Language Classification51 Lexical items: transcription 2. NOT full IPA but ASJPcode: 7 vowels, 34 consonants; all other phonemes mapped to the ‘closest sound’ (automatic)

52 Language Classification52 Abaza (Caucasian):
Meaning  IPA
PERSON   ʕʷɨʧʼʲʷʕʷɨs
LEAF     bɣʲɨ
SKIN     ʧʷazʲ
HORN     ʧʼʷɨʕʷa
NOSE     pɨnʦʼa
TOOTH    pɨʦ

53 Language Classification53 Abaza (Caucasian):
Meaning  IPA           ASJPcode
PERSON   ʕʷɨʧʼʲʷʕʷɨs   Xw3Cw"yXw3s
LEAF     bɣʲɨ          bxy3
SKIN     ʧʷazʲ         Cwazy
HORN     ʧʼʷɨʕʷa       Cw"3Xwa
NOSE     pɨnʦʼa        p3nc"a
TOOTH    pɨʦ           p3c

54 Language Classification54 Loss of information? Shown for representative groups: - ASJP as good for separating language families as full IPA

55 Language Classification55 Loss of information? Shown for representative groups: - ASJP as good for separating language families as full IPA - More accurate for precise genetic classification than IPA (under our current method)

56 Language Classification56 Comparing words and languages

57 Language Classification57 Comparing words Most successful measure to date: Levenshtein Distance

58 Language Classification58 Comparing words Levenshtein Distance (LD) = Number of transformations (=changes & additions) to get from the shorter form to the longer form

59 Language Classification59 Comparing words Levenshtein Distance (LD) = Number of transformations (=changes & additions) to get from the shorter form to the longer form. Example: A L T versus A S J P

60 Language Classification60 Comparing words Levenshtein Distance (LD) = Number of transformations (=changes & additions) to get from the shorter form to the longer form. Example: A L T versus A S J P; the three marked positions (L/S, T/J, -/P) give x x x = 3

61 Language Classification61 Comparing words Levenshtein Distance (LD) = Number of transformations (=changes & additions) to get from the shorter form to the longer form 1. Normalization: $\mathrm{LDN} = \mathrm{LD} / L_{\max}$ → 0.0 – 1.0

62 Language Classification62 Comparing words Levenshtein Distance (LD) = Number of transformations (=changes & additions) to get from the shorter form to the longer form 1. Normalization: $\mathrm{LDN} = \mathrm{LD} / L_{\max}$ → 0.0 – 1.0 2. Eliminate ‘background noise’: $\mathrm{LDND} = \mathrm{LDN} / \mathrm{LDN}_{\text{different pairs}}$
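The three measures on the slides above (LD, LDN, LDND) can be sketched in a few lines of Python. This is only an illustration, not the ASJP software itself. It assumes that LDN is averaged over all shared meanings of two word lists, and that the ‘different pairs’ term is the average LDN over word pairs with different meanings. The function names and the second word list (`other`) are hypothetical. The first list reuses three Abaza ASJPcode forms from slide 53.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD normalized by the length of the longer form (both forms non-empty)."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(list1: dict, list2: dict) -> float:
    """Average same-meaning LDN divided by average different-meaning LDN.

    Assumption: both averages run over the meanings attested in both lists,
    and at least two meanings are shared.
    """
    shared = sorted(set(list1) & set(list2))
    same = [ldn(list1[m], list2[m]) for m in shared]
    diff = [ldn(list1[m], list2[n]) for m in shared for n in shared if m != n]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


# The slide example: ALT versus ASJP needs 3 transformations.
print(levenshtein("ALT", "ASJP"))  # 3

abaza = {"tooth": "p3c", "leaf": "bxy3", "skin": "Cwazy"}   # forms from slide 53
other = {"tooth": "pic", "leaf": "bla", "skin": "Cway"}     # hypothetical forms
print(ldnd(abaza, other))
```

Dividing by the different-meaning average corrects for chance similarity, for example between languages with small, overlapping sound inventories.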

63 Language Classification63 Classifying languages

64 Language Classification64
LNG SIL I YOU WE ONE TWO
CANTONESE yue NohneihdeihNhdeihyatyih
HAINAN_MINNAN nan valuvaneNzy~a7no*|no
HAKKA hak NaiNiNaiteuyitly~oN|Ni
MANDARIN cmn wonimenwomeniel
SUZHOU_WU wuu NonESia*nj3ji7lia*
A_TONG aot aNnaNniNsani
MIKIR mjw nenEngnetumisihini
TARAON mhu ha*nu*niNkiNkaiN
NAXI nbf N3nvN3Ng3d35i
CHIANGRAI_MIEN ium yiameibuayeti
HMONG_DAW mww kukopeio
SUYONG_HMONG mww ko peiau
TAK_HMONG mww kukopeio
… …

65 Language Classification65 Swadesh (3500) ASJP

66 Language Classification66 Swadesh (3500) distance matrix ASJP

67 Language Classification67
LG1       LG2               LDND
MANDARIN  MIDDLE_CHINESE    81.75
MANDARIN  OLD_CHINESE       94.30
MANDARIN  SUZHOU_WU         85.87
MANDARIN  DHAMMAI           97.48
MANDARIN  A_TONG            97.91
MANDARIN  KAYAH_LI_EASTERN  94.75
MANDARIN  MIKIR             99.05
MANDARIN  LEPCHA            97.24
MANDARIN  APATANI           92.24
MANDARIN  BENGNI            96.91
MANDARIN  BOKAR             95.28
…

70 Language Classification70 3500 languages ~ 240.000.000 comp
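One way to arrive at a figure of this size is sketched below. It is an assumption, not something the slide states: it supposes that the count refers to same-meaning word comparisons over a 40-item list for every pair of languages. The function name is hypothetical.

```python
def language_pairs(n_languages: int) -> int:
    """Number of unordered language pairs: n * (n - 1) / 2."""
    return n_languages * (n_languages - 1) // 2


pairs = language_pairs(3500)      # 6_123_250 language pairs
word_comparisons = pairs * 40     # assumption: 40 word comparisons per pair
print(pairs, word_comparisons)    # 6123250 244930000
```

Under that assumption the total is about 245 million, the same order of magnitude as the figure on the slide.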

71 Language Classification71 Processing problems …

72 Language Classification72 Solution: parallel processing

73 Language Classification73 Swadesh (3500) distance matrix ASJP MEGA4 http://www.megasoftware.net/ DNA patterns

74 Language Classification74 Swadesh (3500) distance matrix ASJP MEGA4 Neighbour Joining
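The slides delegate tree building to MEGA4. For readers unfamiliar with Neighbour Joining, the following is a minimal, topology-only sketch of the standard algorithm. It is not MEGA4's implementation and it omits branch lengths. The function name, the labels A to E and the toy distances are hypothetical.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return (tree, merges) for labels `names` and pairwise distances `dist`.

    `dist` maps (a, b) tuples to distances; each unordered pair appears once.
    Internal nodes are represented as 2-tuples of their children.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = list(names)
    merges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums: total distance from each node to all other nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        new = (a, b)
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c != a and c != b:
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c != a and c != b] + [new]
        merges.append(new)
    return tuple(nodes), merges


# Toy distance matrix (hypothetical values, not ASJP data).
names = ["A", "B", "C", "D", "E"]
dist = {
    ("A", "B"): 5, ("A", "C"): 9, ("A", "D"): 9, ("A", "E"): 8,
    ("B", "C"): 10, ("B", "D"): 10, ("B", "E"): 9,
    ("C", "D"): 8, ("C", "E"): 7,
    ("D", "E"): 3,
}
tree, merges = neighbor_joining(names, dist)
print(merges[0])  # ('A', 'B') is joined first
```

Unlike simple clustering on the smallest distance, the Q criterion also takes into account how far each candidate pair is from all other nodes, which is why A and B are joined before the closer pair D and E.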

75 Language Classification75 SEE COMPLETE TREE-OF-THE-MONTH ON: email.eva.mpg.de/~wichmann/ASJPHomePage

76-81 [Figure, built up over slides 76-81: LDND + Mega4 tree for Mayan (34 / 69 Ethn), with the cholan, tzeltalan and yucatecan subgroups labelled; slide 81 adds the comparison "Ethnologue/experts:".]

82 Language Classification82 ASJP and genetic classification - Method works at a global level

83 Language Classification83 ASJP and genetic classification - Method works at a global level - Often also at the lowest levels

84 Language Classification84 ASJP and genetic classification - Method works at a global level - Often also at the lowest levels - Refinement necessary at intermediate level

85 Language Classification85 Adding typological data

86 Language Classification86 Trying to improve the fit … Enrich lexical with typological data: Haspelmath, M., M. Dryer, D. Gil & B. Comrie (eds) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press. WALS Online: http://wals.info/

87 Language Classification87 Swadesh (3500) distance matrix ASJP TREE SFTW WALS (2580) + Lexical plus typological data

88 Language Classification88 distance matrix ASJP TREE SFTW ‘SWALSH’

89 Language Classification89 Improving the fit Enrich lexical with typological data: - NOT 1:1 with ASJP languages

90 Language Classification90 distance matrix ASJP TREE SFTW SWALSH (1250)

91 Language Classification91 Improving the fit Enrich lexical with typological data: - NOT 1:1 with ASJP languages - WALS matrix very UNevenly filled (16%) cf. Cysouw (2008) – STUF 61.3

92 Language Classification92 Improving the fit Enrich lexical with typological data: - NOT 1:1 with ASJP languages - WALS features very unevenly filled → Determine most stable features

93 Language Classification93 Feature Stability Nichols (1995): metric for $S(\mathrm{Ftr}_k)$ in $G_x$: $\frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$

94 Language Classification94 Feature Stability ASJP: metric for stability of $\mathrm{Ftr}_k$: for $G_x$: $\frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$

95 Language Classification95 Feature Stability ASJP: metric for stability of $\mathrm{Ftr}_k$: for $G_x$: $\frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$ Size differences between G

96 Language Classification96 Feature Stability ASJP: metric for stability of $\mathrm{Ftr}_k$: $S_{F_k} = \frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$

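97 Language Classification97 Feature Stability ASJP: metric for stability of $\mathrm{Ftr}_k$: $S_{F_k} = \frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}} - U$, where $U$ is the same proportion, $\frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}}$, computed over unrelated language pairs (‘Background noise’)

98 Language Classification98 Feature Stability ASJP: metric for stability of $\mathrm{Ftr}_k$: $S_{F_k} = \dfrac{\frac{\text{pairs}\,(\mathrm{val}_k = \mathrm{val}_k)}{\text{all pairs}} - U}{1 - U}$ Normalization: $S_{F_k}$ comparable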
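The stability metric can be illustrated with a short Python sketch. It rests on several assumptions that the transcript does not confirm: that the operator lost in extraction is a subtraction, that U is the proportion of matching values among language pairs from different genera, and that pairs are pooled over all genera rather than averaged per genus. The function names and the toy values are hypothetical.

```python
from itertools import combinations


def match_rate(pairs):
    """Proportion of pairs whose two feature values are identical."""
    pairs = list(pairs)
    return sum(x == y for x, y in pairs) / len(pairs)


def feature_stability(values_by_genus):
    """S = (R - U) / (1 - U) for one feature.

    values_by_genus: dict mapping a genus name to the list of values that the
    feature takes in the languages of that genus (hypothetical input format).
    R: match rate among pairs within the same genus (pooled over genera).
    U: match rate among pairs from different genera ('background noise').
    Undefined when U == 1, i.e. when all languages share the same value.
    """
    genera = list(values_by_genus.values())
    related = [p for vals in genera for p in combinations(vals, 2)]
    unrelated = [(x, y) for g1, g2 in combinations(genera, 2) for x in g1 for y in g2]
    r = match_rate(related)
    u = match_rate(unrelated)
    return (r - u) / (1 - u)


# Toy data: a feature that is constant within each genus but differs across genera.
stable = {"G1": ["a", "a"], "G2": ["b", "b"]}
# Toy data: related languages agree less often than unrelated ones.
unstable = {"G1": ["a", "b"], "G2": ["a", "b"]}
print(feature_stability(stable))    # 1.0
print(feature_stability(unstable))  # -1.0
```

Under this reading a negative score, such as those on the least-stable list, means that related languages share a value less often than unrelated ones do.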

99 Language Classification99 Most stable WALS features
31. Sex-based and Non-sex-based Gender Systems: 0.81
118. Predicative Adjectives: 0.74
30. Number of Genders: 0.73
119. Nominal and Locational Predication: 0.71
29. Syncretism in Verbal Person/Number Marking: 0.71

100 Language Classification100 Most unstable WALS features
128. Utterance Complement Clauses: 0.07
115. Negative Indefinite Pronouns/Predicate Negation: 0.07
59. Possessive Classification: 0.01
135. Red and Yellow: -0.07
58. Obligatory Possessive Inflection: -0.25

101-111 [Figure, built up over slides 101-111: Correlation with Ethnologue, one curve per "Min ftrs" value (20, 40, 60, 80, 100); axis: Stability, from + to -; slides 108-110 mark the values 40, 60 and 85.]

112 Language Classification112 WALS

113 Language Classification113 WALS Swadesh40

114 Language Classification114 Improving the fit Typological variables* do not perform better than lexical ones to establish genetic relationships *WALS!

115 Language Classification115 Improving the fit Typological variables do not perform better than lexical ones to establish genetic relationships What about a combination?

116 Language Classification116 Only WALS; Only Sw40
Ftrs  Lgs
100   79
80    109
60    139
40    218
20    341

118 Language Classification118 Only WALS; Only Sw40; 85:15

119 Language Classification119 Only WALS; Only Sw40; 70:30

120 Language Classification120 Only WALS; Only Sw40; 50:50 0.91

121 Language Classification121 Only WALS; Only Sw40; 35:65
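The slides do not say how the lexical and typological distances are mixed, nor which side of a ratio such as 85:15 refers to WALS. The sketch below shows one simple possibility, a weighted average, purely as an assumption. It also assumes that both distances are scaled to the range 0 to 1, and that the typological distance is the share of differing values among the features attested for both languages. All names and values are hypothetical.

```python
def typological_distance(feats1: dict, feats2: dict):
    """Share of differing values among features attested in both languages.

    Returns None when the two languages have no attested feature in common.
    """
    shared = [f for f in feats1 if f in feats2]
    if not shared:
        return None
    differing = sum(feats1[f] != feats2[f] for f in shared)
    return differing / len(shared)


def combined_distance(lexical: float, typological: float, wals_weight: float = 0.5) -> float:
    """Weighted average of two distances; wals_weight=0.5 corresponds to 50:50."""
    return wals_weight * typological + (1 - wals_weight) * lexical


# Hypothetical feature values, keyed by WALS feature number.
lang_a = {"30": "two", "31": "sex-based"}
lang_b = {"30": "three", "31": "sex-based", "29": "syncretic"}
typ = typological_distance(lang_a, lang_b)           # 1 of 2 shared features differs
print(combined_distance(0.8, typ, wals_weight=0.5))  # lexical distance 0.8 is a toy value
```

Because only 16% of the WALS matrix is filled, a measure of this kind has to be restricted to features attested in both languages, which is one reason the number of usable languages drops as the minimum feature count rises.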

122 Language Classification122 Improving the fit Typological variables do not perform better than lexical ones to establish genetic relationships A combined, balanced approach is superior, but …

123 Language Classification123 Improving the fit Typological variables do not perform better than lexical ones to establish genetic relationships A combined, balanced approach is superior, but … … at a much higher cost per language than just lexicostatistics: 84% WALS to be filled in …

124 Language Classification124 Improving the fit Typological variables do not perform better than lexical ones to establish genetic relationships A combined, balanced approach is superior, but … … at a much higher cost per language Continue extension/optimization of lexical method

125 Language Classification125 Publications 2008 - 2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai (2008). Automated classification of the world’s languages: a description of the method and preliminary results. Sprachtypologie und Universalienforschung 61: 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). ‘Advances in automated language classification’. In A. Arppe, K. Sinnemäki and U. Nikanne (eds) Quantitative Investigations in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). ‘Explorations in automated language classification’. Folia Linguistica 42-2, 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant, E. W. Holman (2009). ‘Adding typology to lexicostatistics: a combined approach to language classification’. Linguistic Typology 13, 167-179.

126 Language Classification126 ?

127 Language Classification127 ASJP
Overall goal:
- Method + Tools for Reconstruction of Language Relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings

