Vladimir Polyakov APPROACHES TO IMPROVEMENT OF SIMILARITY MEASURE, BASED ON THE STRUCTURE OF LANGUAGE DESCRIPTION IN THE DB "LANGUAGES OF THE WORLD"

Slides:



Advertisements
Similar presentations
José Luis Otárola. Refers to Language family Lgs. That contains similar features of Lexicon, Phonology, Morphology and Syntax.
Advertisements

U.S. Government Language Requirements U.S. Government Language Requirements 7 September 2000 Everette Jordan Department of Defense
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Issues in developing narrative structures Postgraduate writing, seminar 7 John Morgan.
 Aim in building a phylogenetic tree is to use a knowledge of the characters of organisms to build a tree that reflects the relationships between them.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
Scaling and Attitude Measurement in Travel and Hospitality Research Research Methodologies CHAPTER 11.
Indo-European Languages
Review of the paper entitled “The development of a phonetically balanced word recognition test in the Ilocano language” written by Renita Sagon, Doctor.
Indo-European Language Branch
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Languages in Action Translating for the European Commission
The Influence of First Language on Reading and Spelling in English Linda Siegel University of British Columbia Vancouver, CANADA
Chapter 6 Language.
September 8, 2015S. Mathews1 Human Geography By James Rubenstein Chapter 5 Key Issue 2 Why Is English Related to Other Languages?
Indo-European Branches
Introducing English Linguistics Charles F. Meyer Chapter 1: the study of language Language Change.
Lindsey Miller and Reid Scholz
Globalization & Diversity: Rowntree, Lewis, Price, Wyckoff 1.
DATA BASE “LANGUAGES OF THE WORLD” DB JM SOFTWARE SURVEY: 2010 Vladimir Polyakov (Institute of Linguistics of RAS )
Chapter Eight The Concept of Measurement and Attitude Scales
Statistical Analysis A Quick Overview. The Scientific Method Establishing a hypothesis (idea) Collecting evidence (often in the form of numerical data)
INF 384 C, Spring 2009 Ontologies Knowledge representation to support computer reasoning.
An crann teanga L'arbre des langues De taal boom kalbos medis Drzewo języka The Language Tree.
Families of Languages. Family of languages  It is a group of languages that are related to one another in terms of (genetic) origin  They share a common.
LANGUAGE FAMILY Groups of languages are related to each other Common ancestry Indo- European Languages Indo- European Languages Vocabulary, phonology,
Analysis of two-way tables - Formulas and models for two-way tables - Goodness of fit IPS chapters 9.3 and 9.4 © 2006 W.H. Freeman and Company.
Language Families BBI 3101-HISTORY OF ENGLISH -LECTURE 1.
Some objectives of the European Day of Languages ( EDL ) To draw public attention to the importance of learning languages, in order to increase plurilingualism.
A Language Independent Method for Question Classification COLING 2004.
Linguistics The first week. Chapter 1 Introduction 1.1 Linguistics.
6. Evaluation of measuring tools: validity Psychometrics. 2012/13. Group A (English)
Knowledge Representation of Statistic Domain For CBR Application Supervisor : Dr. Aslina Saad Dr. Mashitoh Hashim PM Dr. Nor Hasbiah Ubaidullah.
Language family 1 BBI LANGUAGE FAMILIES - LECTURE TWO.
An Investigation Into the Impact of Gender on Major Choice at Dartmouth College Shannah Feldman Kate Schuerman Tricia Shalka Kate Wendell.
© 2012 IBM Corporation Introducing IBM Cognos Insight.
European Day of Languages QUIZ. 1. How many living languages are there estimated in the world today? a.Around 500 b.Around 2000 c.Around 6000 d.Around.
Copyright © Cengage Learning. All rights reserved. Chi-Square and F Distributions 10.
Velina Slavova (Bulgaria) Vladimir Polyakov (Russia) THE METRICS OF COMPLEXITY BASED ON SYSTEM OF CASE RELATIONS IN TYPOLOGICAL STRUCTURE OF THE LANGUAGE.
1 European Association for Language Testing and Assessment
“LANGUAGES of the WORLD”: Ongoing projects Andrej A. Kibrik (Institute of Linguistics, RAN) CML-2008 Montenegro, September 2008.
Lecture №4 METHODS OF RESEARCH. Method (Greek. methodos) - way of knowledge, the study of natural phenomena and social life. It is also a set of methods.
1.What is a language family?. A group of languages that came from the same ancestor language and have words in common.
CODILANG: MIXED MODEL OF DIVERGENCE-CONVERGENCE OF LANGUAGES OF THE WORLD Vladimir Polyakov (Russia)
Language Families. A group of languages descended from a single, earlier tongue.
1 CARAP FREPA Awakening to languages - Workshop 22 Novembre 2012 Jean-François de Pietro - Ildikó Lőrincz 1.
 Jan Niecislaw Baudouin de Courtenay (March 13, November 3, 1929) was a Polish linguist and Slavist, best known for his theory of the phoneme.
An Introduction to VOA. Type : international public broadcaster Country: United States Founded: 1942 Headquarter: Washington Owner: Federal government.
英语词汇学课程课件 课件名称:英语词汇的发展 制作人:寻阳、孙红梅 单位:曲阜师范大学外国语学院.
Unit 2 Language Change… English across time…. The Subsystems Phonetics Phonology Morphology Lexicology Syntax Discourse Analysis Semantics Pragmatics.
1 Standardisation supporting cultural diversity: From 5 to 28 STF QD Expanding the language coverage of the ETSI spoken command vocabulary standard. Mike.
Tel: Fax: P.O. Box: 22392, Dubai - UAE
EUROPEAN DAY OF LANGUAGES. The European Year of Languages 2001 was organised by the Council of Europe and the European Union. Its activities celebrated.
Geography (nb: strange names) Families Indo-European: most primary branches Greek (Hellenic) Italic (> Romance) Celtic Germanic Balto-Slavic Albanian.
European Day of Languages (September, 26 ). The European Day of Languages is 26 th September, as proclaimed by the Council of Europe on 6 th December.
Text Linguistics. Definition of linguistics Linguistics can be defined as the scientific or systematic study of language. It is a science in the sense.
Languages of Europe Romance, Germanic, and Slavic.
EUROPEAN DAY OF LANGUAGES QUIZ
European Day of Languages QUIZ
The influence of binomials frequency
European Day of Languages QUIZ
European Day of Languages QUIZ
Indo-European.
Language… Chapter 5 – Key Issue 2.

Why is English Related to Other Languages?
BBI LANGUAGE FAMILIES - LECTURE TWO
AIM: How is English related to other languages?
Part of Speech Tagging with Neural Architecture Search

Presentation transcript:

Vladimir Polyakov APPROACHES TO IMPROVEMENT OF SIMILARITY MEASURE, BASED ON THE STRUCTURE OF LANGUAGE DESCRIPTION IN THE DB "LANGUAGES OF THE WORLD"

0. Introduction in the DB JM JM is the new tool for linguistic and cognitive researches It allows to carry out researches by new quantitative techniques in typology, historical and areal linguistics It allows to receive scientific results in the field of modeling of evolution of languages It allows to spend diachronic researches on the fact sheet in sphere of an origin of language and its evolution

1. Introduction in the problem Similarity measure is a basis for phylogenetic calculations with the purpose of an establishment of genetic relationship between languages Recently ( ) in works (Polyakov and Solovyev; Wichmann et al.) it has been established, that the measures constructed on typological data, reflect also genetic relationship, BUT... = noise in WALS data (mainly because of absence of data) makes strong impact on results of calculations; = areal contacts in DB JM makes strong impact on results of calculations also. Thus, in case of application of data from DB JM, the problem of a choice of a similarity measure as much as possible independent from areal contacts by the current moment is actual.

2. Technique of an estimation of quality of a measure Is based on the following aprioristic postulates: At first test set of languages is formed for which there are reliable expert data about genetic relationship. The technique and the formula of an estimation of the quality is offered for quantitative calculation of degree of approximation of the numerical result received by the program and an expert rating. In case of reception of reliable results on test set, the procedure of calculation of a measure of similarity can be transferred on the unstudied languages for check of hypotheses about their origin and genetic similarity.

3. The previous results The set of 48 languages (further «A.A. Kibrik's set») has been offered by group «World Languages» from Institute of Linguistics of RAS. The technique of estimations of quality of a similarity measure has been offered, based on ranging of languages concerning prototype language in each of eight families of the test set (Polyakov, Solovyev 2006). The formula of an estimation of quality of a similarity measure has been offered also.

3.1. A.A. Kibrik's set (48 languages) NLanguage FamilyGroup 1АБХАЗСКИЙAbkhazNorthwest Caucasian 2АГУЛЬСКИЙAghulNakh-DaghestanianLezgic 3АЗЕРБАЙДЖАНСКИЙAzerbaijaniAltaicTurkic 4АККАДСКИЙAkkadianAfro-AsiaticSemitic 5АНГЛИЙСКИЙEnglishIndo-EuropeanGermanic 6АРМЯНСКИЙArmenianIndo-EuropeanArmenian 7АССАМСКИЙAssameseIndo-EuropeanIndic 8БАГВАЛИНСКИЙBagvalalNakh-DaghestanianAvar-Andic-Tsezic 9БАШКИРСКИЙBashkirAltaicTurkic 10БЕЛОРУССКИЙBelarusanIndo-EuropeanSlavic 11БЕНГАЛЬСКИЙBengaliIndo-EuropeanIndic 12БИРМАНСКИЙBurmeseSino-TibetanBurmese-Lolo 13БОЛГАРСКИЙBulgarianIndo-EuropeanSlavic 14БУРУШАСКИBurushaski

3.1. A.A. Kibrik's set (48 languages) 15БУРЯТСКИЙBuriatAltaicMongolic 16ВЕНГЕРСКИЙHungarianUralicUgric 17ВЕПССКИЙVepsUralicFinnic 18ГАЛИСИЙСКИЙGalicianIndo-EuropeanRomance 19ГРУЗИНСКИЙGeorgianKartvelian 20ДАРИDariIndo-EuropeanIranian 21ДАТСКИЙDanishIndo-EuropeanGermanic 22ИСЛАНДСКИЙIcelandicIndo-EuropeanGermanic 23ИСПАНСКИЙSpanishIndo-EuropeanRomance 24ИТАЛЬЯНСКИЙItalianIndo-EuropeanRomance 25ИТЕЛЬМЕНСКИЙItelmenChukotko-Kamchatkan Southern Chukotko- Kamchatkan 26КАЛМЫЦКИЙKalmyk_OiratAltaicMongolic 27КОРЯКСКИЙKoryakChukotko-Kamchatkan Northern Chukotko- Kamchatkan 28ЛЕЗГИНСКИЙLezgiNakh-DaghestanianLezgic

29МАКЕДОНСКИЙMacedonianIndo-EuropeanSlavic 30МОГОЛЬСКИЙMogholiAltaicMongolic 31МОНГОРСКИЙTuAltaicMongolic 32НЕМЕЦКИЙGermanIndo-EuropeanGermanic 33НИВХСКИЙGilyakNivkh 34НОРВЕЖСКИЙ Norwegian, Bokmål & NynorskIndo-EuropeanGermanic 35ПЕРСИДСКИЙWestern FarsiIndo-EuropeanIranian 36ПОЛЬСКИЙPolishIndo-EuropeanSlavic 37ПОРТУГАЛЬСКИЙPortugueseIndo-EuropeanRomance 38РУМЫНСКИЙRomanianIndo-EuropeanRomance 39РУССКИЙRussianIndo-EuropeanSlavic 40ТАДЖИКСКИЙTajikIndo-EuropeanIranian 41ТАТАРСКИЙTatarAltaicTurkic 42ТУРЕЦКИЙTurkishAltaicTurkic 43ТУРКМЕНСКИЙTurkmenAltaicTurkic 44ФИНСКИЙFinnishUralicFinnic 45ХАНТЫЙСКИЙKhantyUralicUgric 46ЧУКОТСКИЙChukot Chukotko- Kamchatkan Northern Chukotko- Kamchatkan 47ШУГНАНСКИЙShughniIndo-EuropeanIranian 48ЭСТОНСКИЙEstonianUralicFinnic

3.2. The formula of an estimation of quality of a similarity measure GroupLanguagesLanguage- prototype NgКiКi 1 Uralic Hungarian, Veps, Finnish, Khanty, EstonianFinnish5K1K1 2 Turkic Azerbaijani, Bashkir, Tatar, Turkish, TurkmenTurkish5 3 Mongolian Buriat, Kalmyk_Oirat, Mogholi, TuKalmyk_Oira t 4K3K3 4 Slavic Belarusan, Bulgarian, Macedonian, Polish, Russian Belarusan5K4K4 5 Iranian Dari, Western Farsi, Tajik, ShughniWestern Farsi4K5K5 6 Germanian English, Danish, Icelandic, German, (Norwegian, Bokmål & Nynorsk) German5K6K6 7 Romance Galician, Spanish, Italian, Portuguese, Romanian Spanish5K7K7 8Caucasian-1 (Nakh- Daghestanian) Aghul, Bagvalal, LezgiLezgi3K8K8 9 Caucasian-2 Abkhaz, Georgian Paleoasian Burushaski, Itelmen, Koryak, Gilyak (Nivkh), Chukot Others Akkadian, Burmese, Armenian, Assamese, Bengali --- All languages from A.A.Kibrik’s set were divided on 11 groups according to genetic relationship

3.2. The formula of an estimation of quality of a similarity measure After calculation of a measure all languages are sorted eight times relatively to prototype languages in each group. Quality of measure K: K = ( К 1 + К 2 + К 3 + К 4 + К 5 + К 6 + К 7 + К 8 )/8 Ki = N р /Ng Np – a number of related languages placed after prototype language. Ng – a number of related languages in each group. Example See tables with other measures at

3.3. Results of calculations (Polyakov, Solovyev 2006) During DB testing great volume of works has been spent for choice the best variant of a similarity measure and to research of influence of different factors on quality of a measure. Among these factors there are types of features, their frequency, hierarchy in abstract structure, the contribution of various sections of the language description. Calculation of one variant of a measure on the set of 48 languages occupies about 20 minutes on the computer with processor Intel Pentium of 1,6 GHz. Calculation on one section of the language description lasts about 5 minutes. Full calculation on all data base (315 languages) is carried out over 10 hours. It has actually been established, that the best values of the measure quality reaches at simple additive sum of all conterminous features without restrictions on their frequency, hierarchy or an accessory to section of description (see table low). In this case on two groups (Ural, Turkic) it is reached full coincidence to traditional genetic representation and factor of quality K is equal 0,67. All other combinations of features yielded the worst result. The separate measure for each of sections of the description of language in DB is less than total measure under all model.

3.3. Results of calculations (Polyakov, Solovyev 2006) All features Only present in the languages Only absent in the languages Only classifying Only factographic Only present classifying Only present factographic K 1 -Uralic (5 lang.)10,50,750,510,250,5 K 2 -Turkic (5 lang.)10,750,50,75 0 K 3 -Mongolian (4 lang.)0,670,331 0,670,33 K 4 -Slavic (5 lang.)0,5 00,250,50,250,75 K 5 -Iranian (4 lang.)0,670,330 0,6700,33 K 6 -Germanian (5 lang.)0,50,750,250,750,50,750,5 K 7 -Romance (5 lang.)0,5 0,250,75 K 8 -Caucasian-1 (3 lang.)0,50 00 K-Total0,670,460,440,490,640,230,49

Section of essay All features Only present in the languages Only classifying Only factographic Only present classifying Only present factographic 2.1.1Phonological structure 0,300,350,320,300,230, Prosody 0,290,340,240,260,190, Phonetics 0,090,100,170,06 0, The syllable 0,200,160,500,090,290, Phonotactics 0,130,220,130,160,190, Phonological opposition between morphological categories 0,160,19 0,13 0, Morphologically motivated alternations 0,330,170,180,170,090, Morphological type 0,340,460,330,260,210, Criteria for parts of speech assignment 0, Nouns 0,230,090,150,230,030, Number 0,160,130,210,200,170, Case 0,460,440,300,530,260, Verbal categories 0,320,290,200,290,140, Deictic categories 0,470,290,410,390,360, Parts of speech 0,440,560,200,530,070, Structure of morphological paradigms 0,420,400,300,320,220, Word structure 0,170,300,180,130,170, Word formation 0,180,200,170,180,100, The simple sentence 0,410,340,360,330,060, The complex sentence 0,100,170,180,100,090,20 Total sum in column: 5,265,184,774,733,144,86

3.4. Preliminary conclusions The measure reflects genetic similarity The contribution of structure of the description of language is insignificant The contribution of sections is rather essential

4. Directions of the further researches

5. Aims of the investigation To choose new set of languages (comparable to content WALS, project ASJP and DB JM) To develop new, more thin technique of quality estimation To find new heuristics, allowing to improve quality of a similarity measure

5.1. The new set of languages comparable to content of WALS, project ASJP and DB JM The set is offered by Valery Solovyev and specified by Søren Wichmann in 2007 The set includes the list from 39 languages presented in WALS, JM and ASJP Thus there is a possibility not only to estimate quality of a similarity measure calculated on DB JM, but also to compare the genetic trees received from three linguistic sources. Also there is a possibility of quantitative comparison of three projects on degree of coincidence of trees with the etalon.

5.2. Alternatives on sets of languages

5.2. Set of Solovyev-Wichmann (39 languages) LanguageFamilyGenus 1 Modern HebrewAfro-AsiaticSemitic 2 Chuvash AltaicTurkic 3 Yakut AltaicTurkic 4 Uzbek AltaicTurkic 5 Bashkir AltaicTurkic 6 TatarAltaic Turkic 7 Azerbaijani AltaicTurkic 8 Kirghiz AltaicTurkic 9 Burushaski 10 ChukchiChukotko-Kamchatkan Northern Chukotko- Kamchatkan 11 Itelmen Chukotko-KamchatkanSouthern Chukotko- Kamchatkan 12 BretonIndo-EuropeanCeltic 13 DutchIndo-EuropeanGermanic 14 SwedishIndo-European Germanic 15 Icelandic Indo-EuropeanGermanic 16 Danish Indo-EuropeanGermanic 17 Bengali Indo-European Indic

LanguageFamilyGenus 18 Persian Indo-European Iranian 19 French Indo-EuropeanRomance 20 Italian Indo-EuropeanRomance 21 Portugese Indo-EuropeanRomance 22 Catalan Indo-EuropeanRomance 23 Russian Indo-EuropeanSlavic 24 Polish Indo-EuropeanSlavic 25 Bulgarian Indo-EuropeanSlavic 26 Czech Indo-EuropeanSlavic 27 Ukrainian Indo-EuropeanSlavic 28 GeorgianKartvelian 29 LezgianNakh-DaghestanianLezgic 30 Chechen Nakh-DaghestanianNakh 31 AbkhazNorthwest Caucasian 32 KabardianNorthwest Caucasian 33 Finnish UralicFinnic 34 KomiZyrian UralicFinnic 35 Nenets UralicSamoyedic 36 Selkup UralicSamoyedic 37 Hungarian UralicUgric 38 Khanty=Yakut UralicUgric 39 KetYeniseian

5.2. Set of Solovyev-Wichmann (39 languages) Examples of trees, built on different data. Tree from JM data Tree from WALS data Tree from ASJP data ASJP tree is the most reliable in its quality. JM tree is placed at the second place and WALS tree is at the third place.

5.3. New more thin techniques of an estimation of quality of similarity measures After calculation of a measure all languages are sorted 39 times relatively to each languages. Quality of measure K: K = ∑( К i )/39, i = i…39 Ki = N р /Ng, i = i…39 Np – a number of related languages placed after each language. Ng – a number of related languages in each group.

5.3. New more thin techniques of an estimation of quality of similarity measures Also different techniques exist that allow to compare trees immediately. In this case a quality measure is calculated as editorial distance.

5.4. Alternatives on techniques of estimation of measure quality

5.5. New heuristics, allowing to improve quality of similarity measure Restriction on frequency of features gives increase in a measure to 0,697 (on A.A.Kibrik's set) Restriction on description sections gives increase in a measure to 0,76 (on A.A.Kibrik's s et )

5.5. New heuristics, allowing to improve quality of similarity measure The sections of essay were chosen that has a quality value more than 0,25. The list of these sections includes numbers {1,2,7,8,12,13,14,15,16,19}. See table at slide 13.

5.5. Alternatives on heuristics

6. Conclusions and the future researches New heuristics (frequency and the filter on sections) allow to improve quality of a measure In the future: - It is planned to use such factors as stability and genetic markers; - It is supposed to apply more thin measures of an estimation of quality; - It is more preferable to use the set comparable to other linguistic resources (WALS, ASJP, etc.)

Contacts: Vladimir Polyakov Institute of Linguistics of RAS The research is supported by RFBR grant ( № а

Thanks!