Published by Clifford Hunt. Modified over 9 years ago.
1
NooJ International Conference, Komotini, May 2010
Portability of Armenian Corpus by NooJ
Anaid Donabedian & Victoria Khurshudian
Institut National des Langues et Civilisations Orientales (INALCO), Paris
2
Armenian: preliminaries
an Indo-European language
right-branching
of an accusative type
typically with an SOV structure
predominantly agglutinative in morphology
3
Historical Armenia
4
Republic of Armenia
5
Periodization
prealphabetical
alphabetical (405 A.D. – up to present):
1. Old Armenian or Grabar (V-XI)
2. Middle Armenian (XII-XVI)
3. Modern Armenian (XVII – up to present)
Western (based on Constantinople dialect): dialects…
Eastern (based on Ararat dialect): dialects…
6
Objective
Provide data compatibility and portability between NooJ and the Eastern Armenian National Corpus (EANC) platform
7
What is the Eastern Armenian National Corpus?
www.eanc.net
Corpus Technologies
Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov
8
[Diagram labels: Source texts; PARSER; Annotated texts; Annotation algorithm; Grammatical dictionary]
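As a rough illustration of the components named on this slide, the following Python sketch annotates source tokens by lookup in a toy grammatical dictionary. The slides do not describe the real EANC parser or its algorithm, so the dictionary entries, the tag labels, and the `annotate` function are all hypothetical.

```python
import re

# Hypothetical toy grammatical dictionary: wordform -> list of (lexeme, tags).
# The transliterated forms and tag labels are illustrative, not EANC data.
GRAMMATICAL_DICTIONARY = {
    "tun": [("tun", "N,nom,sg")],
    "tner": [("tun", "N,nom,pl")],
    "tan": [("tun", "N,dat,sg")],
}

def annotate(text):
    """Return one (token, analyses) pair per token.

    A token missing from the dictionary gets an empty analysis list;
    a homonymous wordform would simply carry several analyses.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [(tok, GRAMMATICAL_DICTIONARY.get(tok, [])) for tok in tokens]

annotated = annotate("Tun tner xyz")
for token, analyses in annotated:
    print(token, analyses)
```

Storing a list of analyses per wordform, rather than a single one, leaves room for the grammatical and lexical homonymy that the corpus search options refer to.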
9
EANC History
Moscow, Russia
March 2006: project launch
July 2007: 1st release
May 2008: 2nd release
March 2009: 3rd release
10
The Eastern Armenian National Corpus (EANC) is:
about 110 million tokens
morphological and other markup
English translations for frequent tokens
covers Standard Eastern Armenian (SEA) from the mid-19th century to the present
both written and oral discourse
full-text view for over 100 Armenian classic titles
open internet access
11
Written Discourse
over 106 mln. tokens
510 authors (1841-2009)
1039 fiction texts (including 206 translated texts)
7858 press issues
non-fiction (scientific and other) texts
12
Oral Discourse (3.5 mln. tokens)
Spontaneous discourse
Polylogues
Task-oriented discourse
TV-show transcripts
Movies
…
☼ The entire EANC oral corpus was recorded and transcribed by the project.
13
EANC Functionality
14
Search Functionality
Token queries
Context queries
Subcorpus selection
15
Search Functionality
Simple token queries:
lexeme search
wordform search
gram search
translation search
lexeme + gram search
16
Search Functionality
Advanced options for token queries:
case-sensitivity
punctuation marks
position in the sentence
wildcard (*)
logical functions (e.g. 'or': |)
negated features
grammatical/lexical homonymy inclusion/exclusion
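To make the wildcard, 'or', and case-sensitivity options concrete, here is a minimal Python sketch of token matching. It is not the EANC query engine: the `matches` function and the sample wordforms are hypothetical, and the standard-library `fnmatchcase` is used as a stand-in matcher (it also treats `?` and `[...]` as special, which the slide does not mention).

```python
from fnmatch import fnmatchcase

def matches(query, wordform, case_sensitive=False):
    """True if wordform matches any '|'-separated alternative in query.

    '*' stands for any run of characters, as in shell-style patterns.
    """
    if not case_sensitive:
        query, wordform = query.lower(), wordform.lower()
    return any(fnmatchcase(wordform, alt.strip()) for alt in query.split("|"))

# Illustrative transliterated wordforms, not corpus data.
wordforms = ["tun", "tner", "Tan", "girk"]
hits = [w for w in wordforms if matches("t*n | girk", w)]
print(hits)
```

With the default case-insensitive setting, the query `t*n | girk` selects `tun`, `Tan`, and `girk`; passing `case_sensitive=True` would exclude `Tan`.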
17
Search Functionality
Subcorpus selection by:
time
author(s) / title(s)
genres
types of texts (translated vs. original)
superposition of any of the above
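The "superposition" of selection criteria can be read as a conjunction of metadata filters. The Python sketch below shows one way this might look; the `TextMeta` record, its fields, the `select_subcorpus` function, and the sample entries are all hypothetical and do not reflect how EANC stores its metadata.

```python
from dataclasses import dataclass

@dataclass
class TextMeta:
    """Hypothetical per-text metadata record."""
    title: str
    author: str
    year: int
    genre: str
    translated: bool

def select_subcorpus(texts, year_range=None, authors=None, genres=None, translated=None):
    """Keep texts that satisfy every criterion that was supplied (logical AND)."""
    selected = []
    for t in texts:
        if year_range is not None and not (year_range[0] <= t.year <= year_range[1]):
            continue
        if authors is not None and t.author not in authors:
            continue
        if genres is not None and t.genre not in genres:
            continue
        if translated is not None and t.translated != translated:
            continue
        selected.append(t)
    return selected

# Placeholder entries invented for the example.
texts = [
    TextMeta("Title A", "Author A", 1890, "fiction", False),
    TextMeta("Title B", "Author B", 1950, "fiction", True),
    TextMeta("Title C", "Author A", 2005, "press", False),
]
subcorpus = select_subcorpus(texts, year_range=(1841, 1960), genres={"fiction"}, translated=False)
print([t.title for t in subcorpus])
```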
18
Search Functionality
Display options:
context expanding
'sort by' (time, lexeme, wordform etc.)
Latin transliteration
glossed display
KWIC (key word in context)
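For readers unfamiliar with the KWIC display, the following Python sketch builds key-word-in-context rows from a token list. It is a generic illustration, not EANC code; the `kwic` function, the `width` parameter, and the sample tokens are assumptions made for the example.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Placeholder tokens; only "tun" stands in for a real search term.
tokens = "a b tun c d e tun f".split()
rows = kwic(tokens, "tun")
for left, key, right in rows:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>10} [{key}] {right}")
```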
19
Transliterated samples:
20
Glossed samples:
21
KWIC samples:
22
Main Current Tasks:
Make the NooJ-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure
Make the EANC and NooJ Western Armenian platforms mutually portable
Achieve full mutual coverage of NooJ and EANC capabilities (e.g. the syntactic annotation of NooJ)
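One small piece of the compatibility task is translating annotation labels between the two tagsets. The Python sketch below shows the general idea under stated assumptions: a '+'-separated feature string on the NooJ side and a comma-separated gram string on the EANC side. The label inventory in `NOOJ_TO_EANC` and the `convert_tags` function are hypothetical; the slides do not give either tagset.

```python
# Hypothetical label mapping; neither column is taken from the actual tagsets.
NOOJ_TO_EANC = {"N": "N", "pl": "pl", "s": "sg", "Dat": "dat", "Nom": "nom"}

def convert_tags(nooj_features):
    """Convert a '+'-separated feature string to a comma-separated one.

    Labels missing from the mapping are returned separately instead of
    being dropped, so gaps in coverage stay visible.
    """
    converted, unmapped = [], []
    for label in nooj_features.split("+"):
        if label in NOOJ_TO_EANC:
            converted.append(NOOJ_TO_EANC[label])
        else:
            unmapped.append(label)
    return ",".join(converted), unmapped

result = convert_tags("N+pl+Dat+Foo")
print(result)
```

Reporting unmapped labels is one way to track how far mutual coverage of the two annotation schemes has progressed.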