NooJ International Conference, Komotini, May 2010
Portability of Armenian Corpus by NooJ
Anaid Donabedian & Victoria Khurshudian
Institut National des Langues et Civilisations Orientales (INALCO), Paris
Armenian: preliminaries
  an Indo-European language
  right-branching
  of the accusative type
  typically with an SOV structure
  predominantly agglutinative in morphology
Historical Armenia
Republic of Armenia
Periodization
  Pre-alphabetical period
  Alphabetical period (405 A.D. to the present):
    1. Old Armenian, or Grabar (V-XI)
    2. Middle Armenian (XII-XVI)
    3. Modern Armenian (XVII to the present)
       Western (based on the Constantinople dialect): dialects…
       Eastern (based on the Ararat dialect): dialects…
Objective: provide data compatibility and portability between NooJ and the Eastern Armenian National Corpus (EANC) platform
What is the Eastern Armenian National Corpus?
Corpus Technologies
Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov
Diagram: Source texts → PARSER (annotation algorithm, grammatical dictionary) → Annotated texts
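The parser slide names four components: source texts, an annotation algorithm, a grammatical dictionary, and annotated texts. A minimal sketch of how such a dictionary-lookup annotator could be wired together is given here. It is not the EANC parser: the dictionary entries, the tag names (`N`, `sg`, `pl`, `nom`), and the names `Analysis`, `GRAMMATICAL_DICTIONARY`, and `annotate` are all assumptions made for illustration.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Analysis(NamedTuple):
    lemma: str
    grams: str  # grammatical tags; the tag names here are hypothetical


# Hypothetical toy grammatical dictionary: wordform -> possible analyses.
GRAMMATICAL_DICTIONARY: Dict[str, List[Analysis]] = {
    "տուն": [Analysis("տուն", "N,sg,nom")],
    "տներ": [Analysis("տուն", "N,pl,nom")],
}

# \w matches Unicode letters in Python 3, so Armenian script is covered.
TOKEN_RE = re.compile(r"\w+")


def annotate(
    text: str, dictionary: Dict[str, List[Analysis]]
) -> List[Tuple[str, List[Analysis]]]:
    """Tokenize the text and attach every dictionary analysis to each token.

    Tokens missing from the dictionary get an empty analysis list, and
    ambiguous tokens keep all of their analyses (no disambiguation).
    """
    annotated = []
    for match in TOKEN_RE.finditer(text):
        wordform = match.group(0)
        analyses = dictionary.get(wordform.lower(), [])
        annotated.append((wordform, analyses))
    return annotated
```

Calling `annotate("տուն և տներ", GRAMMATICAL_DICTIONARY)` yields three tokens, the second of which has no analysis because it is absent from the toy dictionary.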
EANC History (Moscow, Russia)
  March 2006: project launch
  July 2007: 1st release
  May 2008: 2nd release
  March 2009: 3rd release
Eastern Armenian National Corpus (EANC):
  about 110 million tokens
  morphological and other markup
  English translations for frequent tokens
  coverage of SEA (Standard Eastern Armenian) from the mid-19th century to the present
  both written and oral discourse
  full-text view for over 100 Armenian classic titles
  open internet access
Written Discourse (over 106 mln. tokens)
  510 authors
  1039 fiction texts (including 206 translated texts)
  7858 press issues
  non-fiction (scientific and other) texts
Oral Discourse (3.5 mln. tokens)
  Spontaneous discourse
  Polylogues
  Task-oriented discourse
  TV-show transcripts
  Movies
  …
  Note: the entire EANC oral corpus was recorded and transcribed by the project.
EANC Functionality
Search Functionality
  Token queries
  Context queries
  Subcorpus selection
Simple token queries:
  lexeme search
  wordform search
  gram search
  translation search
  lexeme + gram search
Advanced options for token queries:
  case sensitivity
  punctuation marks
  position in the sentence
  wildcard (*)
  logical functions (e.g. 'or': |)
  negated features
  inclusion/exclusion of grammatical/lexical homonymy
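To make the token-query options concrete, a small sketch of how wildcard, 'or', negated-feature, and case-sensitivity options could combine when testing one annotated token is given here. It does not reproduce the EANC query engine: the token layout (a dict with `wordform`, `lexeme`, `grams`), the transliterated sample values, and the function names are assumptions. The wildcard is implemented with Python's `fnmatch.fnmatchcase`, which also treats `?` and `[...]` as special characters.

```python
from fnmatch import fnmatchcase


def pattern_matches(pattern, value, case_sensitive=False):
    """True if value matches any '|'-separated alternative; '*' is a wildcard."""
    if not case_sensitive:
        pattern, value = pattern.lower(), value.lower()
    return any(fnmatchcase(value, alt) for alt in pattern.split("|"))


def token_matches(token, wordform=None, lexeme=None,
                  grams=(), not_grams=(), case_sensitive=False):
    """Check one annotated token against a simple token query.

    token is a hypothetical dict: {"wordform": str, "lexeme": str, "grams": set}.
    grams lists features that must all be present; not_grams lists
    features that must all be absent (negated features).
    """
    if wordform is not None and not pattern_matches(
            wordform, token["wordform"], case_sensitive):
        return False
    if lexeme is not None and not pattern_matches(
            lexeme, token["lexeme"], case_sensitive):
        return False
    token_grams = set(token["grams"])
    if not set(grams) <= token_grams:
        return False
    if token_grams & set(not_grams):
        return False
    return True
```

With a token such as `{"wordform": "Tner", "lexeme": "tun", "grams": {"N", "pl", "nom"}}`, the query `wordform="tn*"` matches unless `case_sensitive=True` is set, and `lexeme="tun|gir"` matches through the first alternative.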
Subcorpus selection by:
  time
  author(s) / title(s)
  genres
  types of texts (translated vs. original)
  superposition of any of the above
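Subcorpus selection amounts to filtering texts by their metadata, and the "superposition" option to applying several filters at once. A minimal sketch under that reading is given here; the `TextMeta` fields and the `select_subcorpus` function are hypothetical and do not describe the EANC metadata schema.

```python
from dataclasses import dataclass
from typing import Collection, Iterable, List, Optional


@dataclass(frozen=True)
class TextMeta:
    # Hypothetical metadata record for one corpus text.
    title: str
    author: str
    year: int
    genre: str
    translated: bool


def select_subcorpus(
    texts: Iterable[TextMeta],
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    authors: Optional[Collection[str]] = None,
    genres: Optional[Collection[str]] = None,
    translated: Optional[bool] = None,
) -> List[TextMeta]:
    """Keep the texts that satisfy every criterion that was supplied.

    A criterion left as None is ignored, so any combination of
    criteria can be superposed.
    """
    selected = []
    for text in texts:
        if year_from is not None and text.year < year_from:
            continue
        if year_to is not None and text.year > year_to:
            continue
        if authors is not None and text.author not in authors:
            continue
        if genres is not None and text.genre not in genres:
            continue
        if translated is not None and text.translated != translated:
            continue
        selected.append(text)
    return selected
```

For example, `select_subcorpus(texts, year_from=1900, genres={"fiction"}, translated=False)` would keep only original fiction from 1900 onwards.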
Display options
  context expansion
  'sort by' (time, lexeme, wordform, etc.)
  Latin transliteration
  glossed display
  KWIC (keyword in context)
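The KWIC display can be illustrated with a short sketch: every hit of a keyword is shown with a fixed window of tokens to its left and right. The function name `kwic` and the `width` parameter are assumptions for illustration, not part of the EANC interface.

```python
from typing import List, Sequence, Tuple


def kwic(tokens: Sequence[str], keyword: str,
         width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit.

    Matching is case-insensitive; width is the number of tokens
    kept on each side of the keyword.
    """
    lines = []
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Sorting the returned tuples by left or right context would give a simple counterpart of the 'sort by' option.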
Transliterated samples:
Glossed samples:
KWIC samples:
Main Current Tasks:
  Make the NooJ-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure
  Make the EANC and NooJ Western Armenian platforms interportable
  Achieve mutual full coverage of NooJ and EANC capacities (e.g. the syntactic annotation of NooJ)
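The first task, making the two annotation schemes compatible, can be pictured as a tag-mapping step. The sketch here assumes a source annotation written as `lemma,CATEGORY+feature+feature` and a target record with a lexeme and a list of grams. Both layouts, the `TAG_MAP` contents, and the function `nooj_to_eanc` are hypothetical; the actual NooJ and EANC tag inventories would have to be taken from the respective resources.

```python
from typing import Dict, List

# Hypothetical mapping from source feature tags to target gram tags.
TAG_MAP: Dict[str, str] = {
    "N": "N",
    "s": "sg",
    "p": "pl",
    "Nom": "nom",
}


def nooj_to_eanc(annotation: str, tag_map: Dict[str, str]) -> Dict[str, object]:
    """Convert 'lemma,CAT+feat+feat' into a lexeme/grams record.

    Tags without an entry in tag_map are reported under 'unmapped'
    instead of being dropped silently, so coverage gaps between the
    two schemes stay visible.
    """
    lemma, _, features = annotation.partition(",")
    if not lemma or not features:
        raise ValueError(f"expected 'lemma,TAG+TAG...', got {annotation!r}")
    source_tags: List[str] = features.split("+")
    grams = [tag_map[tag] for tag in source_tags if tag in tag_map]
    unmapped = [tag for tag in source_tags if tag not in tag_map]
    return {"lexeme": lemma, "grams": grams, "unmapped": unmapped}
```

Collecting the `unmapped` tags over a whole annotated text gives a quick list of features that one scheme expresses and the other does not yet cover.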