Download presentation
Presentation is loading. Please wait.
Published byMelinda Shannon Anthony Modified over 9 years ago
1
Vladimir Polyakov APPROACHES TO IMPROVEMENT OF SIMILARITY MEASURE, BASED ON THE STRUCTURE OF LANGUAGE DESCRIPTION IN THE DB "LANGUAGES OF THE WORLD"
2
0. Introduction in the DB JM JM is the new tool for linguistic and cognitive researches It allows to carry out researches by new quantitative techniques in typology, historical and areal linguistics It allows to receive scientific results in the field of modeling of evolution of languages It allows to spend diachronic researches on the fact sheet in sphere of an origin of language and its evolution
3
1. Introduction in the problem Similarity measure is a basis for phylogenetic calculations with the purpose of an establishment of genetic relationship between languages Recently (2005-2007) in works (Polyakov and Solovyev; Wichmann et al.) it has been established, that the measures constructed on typological data, reflect also genetic relationship, BUT... = noise in WALS data (mainly because of absence of data) makes strong impact on results of calculations; = areal contacts in DB JM makes strong impact on results of calculations also. Thus, in case of application of data from DB JM, the problem of a choice of a similarity measure as much as possible independent from areal contacts by the current moment is actual.
4
2. Technique of an estimation of quality of a measure Is based on the following aprioristic postulates: At first test set of languages is formed for which there are reliable expert data about genetic relationship. The technique and the formula of an estimation of the quality is offered for quantitative calculation of degree of approximation of the numerical result received by the program and an expert rating. In case of reception of reliable results on test set, the procedure of calculation of a measure of similarity can be transferred on the unstudied languages for check of hypotheses about their origin and genetic similarity.
5
3. The previous results The set of 48 languages (further «A.A. Kibrik's set») has been offered by group «World Languages» from Institute of Linguistics of RAS. The technique of estimations of quality of a similarity measure has been offered, based on ranging of languages concerning prototype language in each of eight families of the test set (Polyakov, Solovyev 2006). The formula of an estimation of quality of a similarity measure has been offered also.
6
3.1. A.A. Kibrik's set (48 languages) NLanguage FamilyGroup 1АБХАЗСКИЙAbkhazNorthwest Caucasian 2АГУЛЬСКИЙAghulNakh-DaghestanianLezgic 3АЗЕРБАЙДЖАНСКИЙAzerbaijaniAltaicTurkic 4АККАДСКИЙAkkadianAfro-AsiaticSemitic 5АНГЛИЙСКИЙEnglishIndo-EuropeanGermanic 6АРМЯНСКИЙArmenianIndo-EuropeanArmenian 7АССАМСКИЙAssameseIndo-EuropeanIndic 8БАГВАЛИНСКИЙBagvalalNakh-DaghestanianAvar-Andic-Tsezic 9БАШКИРСКИЙBashkirAltaicTurkic 10БЕЛОРУССКИЙBelarusanIndo-EuropeanSlavic 11БЕНГАЛЬСКИЙBengaliIndo-EuropeanIndic 12БИРМАНСКИЙBurmeseSino-TibetanBurmese-Lolo 13БОЛГАРСКИЙBulgarianIndo-EuropeanSlavic 14БУРУШАСКИBurushaski
7
3.1. A.A. Kibrik's set (48 languages) 15БУРЯТСКИЙBuriatAltaicMongolic 16ВЕНГЕРСКИЙHungarianUralicUgric 17ВЕПССКИЙVepsUralicFinnic 18ГАЛИСИЙСКИЙGalicianIndo-EuropeanRomance 19ГРУЗИНСКИЙGeorgianKartvelian 20ДАРИDariIndo-EuropeanIranian 21ДАТСКИЙDanishIndo-EuropeanGermanic 22ИСЛАНДСКИЙIcelandicIndo-EuropeanGermanic 23ИСПАНСКИЙSpanishIndo-EuropeanRomance 24ИТАЛЬЯНСКИЙItalianIndo-EuropeanRomance 25ИТЕЛЬМЕНСКИЙItelmenChukotko-Kamchatkan Southern Chukotko- Kamchatkan 26КАЛМЫЦКИЙKalmyk_OiratAltaicMongolic 27КОРЯКСКИЙKoryakChukotko-Kamchatkan Northern Chukotko- Kamchatkan 28ЛЕЗГИНСКИЙLezgiNakh-DaghestanianLezgic
8
29МАКЕДОНСКИЙMacedonianIndo-EuropeanSlavic 30МОГОЛЬСКИЙMogholiAltaicMongolic 31МОНГОРСКИЙTuAltaicMongolic 32НЕМЕЦКИЙGermanIndo-EuropeanGermanic 33НИВХСКИЙGilyakNivkh 34НОРВЕЖСКИЙ Norwegian, Bokmål & NynorskIndo-EuropeanGermanic 35ПЕРСИДСКИЙWestern FarsiIndo-EuropeanIranian 36ПОЛЬСКИЙPolishIndo-EuropeanSlavic 37ПОРТУГАЛЬСКИЙPortugueseIndo-EuropeanRomance 38РУМЫНСКИЙRomanianIndo-EuropeanRomance 39РУССКИЙRussianIndo-EuropeanSlavic 40ТАДЖИКСКИЙTajikIndo-EuropeanIranian 41ТАТАРСКИЙTatarAltaicTurkic 42ТУРЕЦКИЙTurkishAltaicTurkic 43ТУРКМЕНСКИЙTurkmenAltaicTurkic 44ФИНСКИЙFinnishUralicFinnic 45ХАНТЫЙСКИЙKhantyUralicUgric 46ЧУКОТСКИЙChukot Chukotko- Kamchatkan Northern Chukotko- Kamchatkan 47ШУГНАНСКИЙShughniIndo-EuropeanIranian 48ЭСТОНСКИЙEstonianUralicFinnic
9
3.2. The formula of an estimation of quality of a similarity measure GroupLanguagesLanguage- prototype NgКiКi 1 Uralic Hungarian, Veps, Finnish, Khanty, EstonianFinnish5K1K1 2 Turkic Azerbaijani, Bashkir, Tatar, Turkish, TurkmenTurkish5 3 Mongolian Buriat, Kalmyk_Oirat, Mogholi, TuKalmyk_Oira t 4K3K3 4 Slavic Belarusan, Bulgarian, Macedonian, Polish, Russian Belarusan5K4K4 5 Iranian Dari, Western Farsi, Tajik, ShughniWestern Farsi4K5K5 6 Germanian English, Danish, Icelandic, German, (Norwegian, Bokmål & Nynorsk) German5K6K6 7 Romance Galician, Spanish, Italian, Portuguese, Romanian Spanish5K7K7 8Caucasian-1 (Nakh- Daghestanian) Aghul, Bagvalal, LezgiLezgi3K8K8 9 Caucasian-2 Abkhaz, Georgian--- 10 Paleoasian Burushaski, Itelmen, Koryak, Gilyak (Nivkh), Chukot --- 11 Others Akkadian, Burmese, Armenian, Assamese, Bengali --- All languages from A.A.Kibrik’s set were divided on 11 groups according to genetic relationship
10
3.2. The formula of an estimation of quality of a similarity measure After calculation of a measure all languages are sorted eight times relatively to prototype languages in each group. Quality of measure K: K = ( К 1 + К 2 + К 3 + К 4 + К 5 + К 6 + К 7 + К 8 )/8 Ki = N р /Ng Np – a number of related languages placed after prototype language. Ng – a number of related languages in each group. Example See tables with other measures at www.dblang2008.narod.ru
11
3.3. Results of calculations (Polyakov, Solovyev 2006) During DB testing great volume of works has been spent for choice the best variant of a similarity measure and to research of influence of different factors on quality of a measure. Among these factors there are types of features, their frequency, hierarchy in abstract structure, the contribution of various sections of the language description. Calculation of one variant of a measure on the set of 48 languages occupies about 20 minutes on the computer with processor Intel Pentium of 1,6 GHz. Calculation on one section of the language description lasts about 5 minutes. Full calculation on all data base (315 languages) is carried out over 10 hours. It has actually been established, that the best values of the measure quality reaches at simple additive sum of all conterminous features without restrictions on their frequency, hierarchy or an accessory to section of description (see table low). In this case on two groups (Ural, Turkic) it is reached full coincidence to traditional genetic representation and factor of quality K is equal 0,67. All other combinations of features yielded the worst result. The separate measure for each of sections of the description of language in DB is less than total measure under all model.
12
3.3. Results of calculations (Polyakov, Solovyev 2006) All features Only present in the languages Only absent in the languages Only classifying Only factographic Only present classifying Only present factographic K 1 -Uralic (5 lang.)10,50,750,510,250,5 K 2 -Turkic (5 lang.)10,750,50,75 0 K 3 -Mongolian (4 lang.)0,670,331 0,670,33 K 4 -Slavic (5 lang.)0,5 00,250,50,250,75 K 5 -Iranian (4 lang.)0,670,330 0,6700,33 K 6 -Germanian (5 lang.)0,50,750,250,750,50,750,5 K 7 -Romance (5 lang.)0,5 0,250,75 K 8 -Caucasian-1 (3 lang.)0,50 00 K-Total0,670,460,440,490,640,230,49
13
Section of essay All features Only present in the languages Only classifying Only factographic Only present classifying Only present factographic 2.1.1Phonological structure 0,300,350,320,300,230,36 2.1.2Prosody 0,290,340,240,260,190,33 2.1.3.Phonetics 0,090,100,170,06 0,14 2.1.4.The syllable 0,200,160,500,090,290,14 2.2.1.Phonotactics 0,130,220,130,160,190,16 2.2.2.Phonological opposition between morphological categories 0,160,19 0,13 0,16 2.2.3.Morphologically motivated alternations 0,330,170,180,170,090,10 2.3.0Morphological type 0,340,460,330,260,210,23 2.3.1.Criteria for parts of speech assignment 0,07 2.3.2.Nouns 0,230,090,150,230,030,20 2.3.3.Number 0,160,130,210,200,170,20 2.3.4.Case 0,460,440,300,530,260,51 2.3.5.Verbal categories 0,320,290,200,290,140,20 2.3.6.Deictic categories 0,470,290,410,390,360,29 2.3.7.Parts of speech 0,440,560,200,530,070,53 2.4.0.Structure of morphological paradigms 0,420,400,300,320,220,27 2.5.1.Word structure 0,170,300,180,130,170,27 2.5.2.Word formation 0,180,200,170,180,100,17 2.5.3.The simple sentence 0,410,340,360,330,060,34 2.5.4.The complex sentence 0,100,170,180,100,090,20 Total sum in column: 5,265,184,774,733,144,86
14
3.4. Preliminary conclusions The measure reflects genetic similarity The contribution of structure of the description of language is insignificant The contribution of sections is rather essential
15
4. Directions of the further researches
16
5. Aims of the investigation To choose new set of languages (comparable to content WALS, project ASJP and DB JM) To develop new, more thin technique of quality estimation To find new heuristics, allowing to improve quality of a similarity measure
17
5.1. The new set of languages comparable to content of WALS, project ASJP and DB JM The set is offered by Valery Solovyev and specified by Søren Wichmann in 2007 The set includes the list from 39 languages presented in WALS, JM and ASJP Thus there is a possibility not only to estimate quality of a similarity measure calculated on DB JM, but also to compare the genetic trees received from three linguistic sources. Also there is a possibility of quantitative comparison of three projects on degree of coincidence of trees with the etalon.
18
5.2. Alternatives on sets of languages
19
5.2. Set of Solovyev-Wichmann (39 languages) LanguageFamilyGenus 1 Modern HebrewAfro-AsiaticSemitic 2 Chuvash AltaicTurkic 3 Yakut AltaicTurkic 4 Uzbek AltaicTurkic 5 Bashkir AltaicTurkic 6 TatarAltaic Turkic 7 Azerbaijani AltaicTurkic 8 Kirghiz AltaicTurkic 9 Burushaski 10 ChukchiChukotko-Kamchatkan Northern Chukotko- Kamchatkan 11 Itelmen Chukotko-KamchatkanSouthern Chukotko- Kamchatkan 12 BretonIndo-EuropeanCeltic 13 DutchIndo-EuropeanGermanic 14 SwedishIndo-European Germanic 15 Icelandic Indo-EuropeanGermanic 16 Danish Indo-EuropeanGermanic 17 Bengali Indo-European Indic
20
LanguageFamilyGenus 18 Persian Indo-European Iranian 19 French Indo-EuropeanRomance 20 Italian Indo-EuropeanRomance 21 Portugese Indo-EuropeanRomance 22 Catalan Indo-EuropeanRomance 23 Russian Indo-EuropeanSlavic 24 Polish Indo-EuropeanSlavic 25 Bulgarian Indo-EuropeanSlavic 26 Czech Indo-EuropeanSlavic 27 Ukrainian Indo-EuropeanSlavic 28 GeorgianKartvelian 29 LezgianNakh-DaghestanianLezgic 30 Chechen Nakh-DaghestanianNakh 31 AbkhazNorthwest Caucasian 32 KabardianNorthwest Caucasian 33 Finnish UralicFinnic 34 KomiZyrian UralicFinnic 35 Nenets UralicSamoyedic 36 Selkup UralicSamoyedic 37 Hungarian UralicUgric 38 Khanty=Yakut UralicUgric 39 KetYeniseian
21
5.2. Set of Solovyev-Wichmann (39 languages) Examples of trees, built on different data. Tree from JM data Tree from WALS data Tree from ASJP data ASJP tree is the most reliable in its quality. JM tree is placed at the second place and WALS tree is at the third place.
22
5.3. New more thin techniques of an estimation of quality of similarity measures After calculation of a measure all languages are sorted 39 times relatively to each languages. Quality of measure K: K = ∑( К i )/39, i = i…39 Ki = N р /Ng, i = i…39 Np – a number of related languages placed after each language. Ng – a number of related languages in each group.
23
5.3. New more thin techniques of an estimation of quality of similarity measures Also different techniques exist that allow to compare trees immediately. In this case a quality measure is calculated as editorial distance.
24
5.4. Alternatives on techniques of estimation of measure quality
25
5.5. New heuristics, allowing to improve quality of similarity measure Restriction on frequency of features gives increase in a measure to 0,697 (on A.A.Kibrik's set) Restriction on description sections gives increase in a measure to 0,76 (on A.A.Kibrik's s et )
26
5.5. New heuristics, allowing to improve quality of similarity measure The sections of essay were chosen that has a quality value more than 0,25. The list of these sections includes numbers {1,2,7,8,12,13,14,15,16,19}. See table at slide 13.
27
5.5. Alternatives on heuristics
28
6. Conclusions and the future researches New heuristics (frequency and the filter on sections) allow to improve quality of a measure In the future: - It is planned to use such factors as stability and genetic markers; - It is supposed to apply more thin measures of an estimation of quality; - It is more preferable to use the set comparable to other linguistic resources (WALS, ASJP, etc.)
29
Contacts: Vladimir Polyakov Institute of Linguistics of RAS www.dblang.ru www.dblang2008.narod.ru The research is supported by RFBR grant (www.rfbr.ru), № 07-06-00229 а
30
Thanks!
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.