
1 NooJ international Conference, Komotini, May 2010
Portability of Armenian Corpus by Nooj
Anaid Donabedian & Victoria Khurshudian
Institut National des Langues et Civilisations Orientales (INALCO), Paris

2 Armenian: preliminaries
• an Indo-European language
• right-branching
• of an accusative type
• typically with an SOV structure
• predominantly agglutinative in morphology

3 Historical Armenia

4 Republic of Armenia

5 Periodization
• prealphabetical
• alphabetical (405 A.D. – up to present):
  1. Old Armenian or Grabar (V-XI)
  2. Middle Armenian (XII-XVI)
  3. Modern Armenian (XVII – up to present)
     Western (based on Constantinople dialect): dialects…
     Eastern (based on Ararat dialect): dialects…

6 Objective
Provide data compatibility and portability between NooJ and the Eastern Armenian National Corpus (EANC) platform

7 What is the Eastern Armenian National Corpus?
www.eanc.net
Corpus Technologies
Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov

8 [Diagram: Source texts → PARSER (Annotation algorithm, Grammatical dictionary) → Annotated texts]
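The components named on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions, not EANC's actual parser: the `annotate` function, the dictionary layout, and the tag names are hypothetical, and transliterated forms stand in for Armenian script.

```python
import re

# Hypothetical grammatical dictionary: wordform -> list of (lexeme, tags).
# The entries and tag names are invented for illustration only.
GRAMMATICAL_DICTIONARY = {
    "tun": [("tun", "N,nom,sg")],
    "tan": [("tun", "N,gen,sg"), ("tun", "N,dat,sg")],
}


def annotate(text):
    """Tokenize the source text and attach every dictionary analysis to each token."""
    annotated = []
    for token in re.findall(r"\w+", text):
        # Unknown wordforms get an empty analysis list instead of being dropped.
        analyses = GRAMMATICAL_DICTIONARY.get(token.lower(), [])
        annotated.append({"token": token, "analyses": analyses})
    return annotated


result = annotate("Tun tan xyz")
```

Keeping all analyses for an ambiguous wordform (here `tan`) leaves homonymy visible in the annotated text, so a later step or a query option can include or exclude it.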

9 EANC History
Moscow, Russia
• March 2006: project launch
• July 2007: 1st release
• May 2008: 2nd release
• March 2009: 3rd release

10 Eastern Armenian National Corpus (EANC) is:
• about 110 million tokens
• morphological and other markup
• English translations for frequent tokens
• covers SEA from the mid-19th century to the present
• both written and oral discourse
• full-text view for over 100 Armenian classic titles
• open internet access

11 Written Discourse
• over 106 million tokens
• 510 authors (1841-2009)
• 1039 fiction texts (including 206 translated texts)
• 7858 press issues
• non-fiction (scientific and other) texts

12 Oral Discourse (3.5 million tokens)
• Spontaneous discourse
• Polylogues
• Task-oriented discourse
• TV show transcripts
• Movies
• …
☼ The EANC oral corpus has been recorded and transcribed entirely by the project.

13 EANC Functionality

14 Search Functionality
• Token queries
• Context queries
• Subcorpus selection

15 Search Functionality
Simple token queries:
• lexeme search
• wordform search
• gram search
• translation search
• lexeme + gram search

16 Search Functionality
Advanced options for token queries:
• case-sensitivity
• punctuation marks
• position in the sentence
• wildcard (*)
• logical functions (e.g. 'or': |)
• negated features
• grammatical/lexical homonymy inclusion/exclusion
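A few of these options (wildcard, 'or', negated features, case-sensitivity) can be illustrated with a small matcher. This is a sketch under assumptions, not the EANC query engine: the `matches` function, the token layout, and the gram tags are hypothetical, and the wildcard handling is borrowed from Python's `fnmatch` module.

```python
from fnmatch import fnmatchcase


def matches(token, pattern, case_sensitive=False, exclude_grams=()):
    """Return True if the wordform matches any '|' alternative of the
    wildcard pattern and the token carries none of the excluded gram tags."""
    form = token["form"]
    if not case_sensitive:
        form, pattern = form.lower(), pattern.lower()
    # fnmatchcase supports '*' (and also '?' and '[...]') as wildcards.
    if not any(fnmatchcase(form, alt.strip()) for alt in pattern.split("|")):
        return False
    # Negated features: reject the token if it has any excluded tag.
    return not (set(exclude_grams) & set(token["grams"]))


# Illustrative tokens in transliteration; the tags are invented.
tokens = [
    {"form": "tun", "grams": ["N", "sg"]},
    {"form": "tner", "grams": ["N", "pl"]},
    {"form": "Tan", "grams": ["N", "gen", "sg"]},
]

plural_excluded = [t["form"] for t in tokens if matches(t, "t*", exclude_grams=["pl"])]
```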

17 Search Functionality
Subcorpus selection by:
• time
• author(s) / title(s)
• genres
• types of texts (translated vs. original)
• superposition of any of the above
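Subcorpus selection amounts to combining metadata filters. The sketch below assumes each text carries a small metadata record; the field names, the `select_subcorpus` function, and the sample records are all hypothetical.

```python
def select_subcorpus(texts, year_range=None, authors=None, genres=None, translated=None):
    """Keep the texts that satisfy every filter that was supplied.
    Filters left as None are ignored, so any combination can be superposed."""
    selected = []
    for text in texts:
        if year_range and not (year_range[0] <= text["year"] <= year_range[1]):
            continue
        if authors and text["author"] not in authors:
            continue
        if genres and text["genre"] not in genres:
            continue
        if translated is not None and text["translated"] != translated:
            continue
        selected.append(text)
    return selected


# Invented metadata records, for illustration only.
texts = [
    {"title": "T1", "author": "Author A", "year": 1890, "genre": "fiction", "translated": False},
    {"title": "T2", "author": "Author B", "year": 1995, "genre": "press", "translated": False},
    {"title": "T3", "author": "Author A", "year": 2001, "genre": "fiction", "translated": True},
]

original_fiction = select_subcorpus(texts, genres={"fiction"}, translated=False)
```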

18 Search Functionality
Display options:
• context expanding
• 'sort by' (time, lexeme, wordform etc.)
• Latin transliteration
• glossed display
• KWIC (key word in context)
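The KWIC display option can be illustrated with a minimal concordancer. This is a generic sketch of the technique, not EANC's implementation; the `kwic` function and its `width` parameter are hypothetical names.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence
    of the keyword, with up to `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "a b tun c d e tun f".split()
concordance = kwic(sample, "tun", width=2)
```

Widening `width` corresponds to the 'context expanding' option, and sorting the returned tuples by any of their fields gives a simple 'sort by'.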

19 Transliterated samples:

20 Glossed samples:

21 KWIC samples:

22 Main Current Tasks:
• Make NooJ-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure
• Make the EANC and NooJ Western Armenian platforms mutually portable
• Achieve full mutual coverage of NooJ and EANC capacities (e.g. NooJ's syntactic annotation)
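The first task is essentially a mapping between two annotation schemes. The sketch below shows one possible shape for such a converter, assuming a NooJ-style `lemma,CATEGORY+feature+...` analysis string. The correspondence table, the feature codes, the output field names, and the `convert_analysis` function are all hypothetical; neither platform's real tagset is reproduced here.

```python
# Hypothetical correspondence table between NooJ-style codes and
# EANC-style gram tags; both sides are invented for illustration.
NOOJ_TO_EANC = {"N": "N", "V": "V", "s": "sg", "p": "pl", "Nom": "nom", "Gen": "gen"}


def convert_analysis(nooj_analysis):
    """Split 'lemma,CAT+feat+...' and map each code through the table.
    Codes without a counterpart are reported rather than silently dropped."""
    lemma, _, info = nooj_analysis.partition(",")
    codes = info.split("+") if info else []
    mapped, unmapped = [], []
    for code in codes:
        if code in NOOJ_TO_EANC:
            mapped.append(NOOJ_TO_EANC[code])
        else:
            unmapped.append(code)
    return {"lexeme": lemma, "gram": ",".join(mapped), "unmapped": unmapped}


converted = convert_analysis("tun,N+Gen+s+Hum")
```

Collecting unmapped codes makes gaps in coverage between the two schemes explicit, which is where work on mutual coverage would start.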
