What is a national corpus. Primary objective of a national corpus is to provide linguists with a tool to investigate a language in the diversity of types.

Presentation transcript:

1 What is a national corpus

2 The primary objective of a national corpus is to provide linguists with a tool for investigating a language across diverse types of texts by means of complex lexical and grammatical queries. The corpus allows researchers to investigate various linguistic phenomena by observing the possible range of contexts in which they occur.

3 Examples of searchable corpora online: British National Corpus; Russian National Corpus; Eastern Armenian National Corpus; Czech National Corpus.

4 To show just one example, the Eastern Armenian National Corpus: about 90 million tokens; a powerful search engine for making complex lexical and morphological queries; a diachronic corpus covering Standard Eastern Armenian (SEA) texts from the mid-19th century to the present; both written and oral discourse; open access.

5 A national corpus is a large-scale, linguistically diversified and balanced collection of texts provided with a flexible search engine.

6 How large? RNC: 150 million; BNC: 100 million; EANC: 90 million. Essentially, the size depends on the type of research envisaged.

7 How diversified? As diversified as practicable. EANC: extension of the press subcorpus to cover the early Armenian press, soon to cover internet forums. RNC: an effort to cover snail mail and electronic communication.

8 EANC: subcorpus form

9 How balanced? Balance is a vague notion. At the least, the corpus should not be disproportionate: less poetry than prose, etc. Even an unbalanced corpus can be balanced by creating predefined subcorpora.

10 As an example: EANC

11 Multicomponent corpora: oral subcorpus (RNC, BNC, EANC); dialectal subcorpus (RNC); poetic subcorpus (RNC); educational subcorpus (RNC); …

12 Library or corpus? An electronic library is intended for readers; a corpus is intended for researchers. They differ in target audience and intended usage. Implied differences: a corpus must be able to respond to queries; a library has major problems related to copyright.

13 Technical requirement: reasonable waiting time. Functional requirement: complex queries. You cannot parse texts on the fly at query time, so texts need to contain markup. In large corpora, you cannot simply scan the markup, so you have to index files, create data files, and use special search algorithms.
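To illustrate why indexing keeps waiting time reasonable, here is a minimal sketch of an inverted index over tokens. It is not the data format of any of the corpora named here; the function names `build_index` and `lookup` and the sample sentences are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each lowercased token to the (sentence_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sent_id, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            index[token].append((sent_id, pos))
    return index

def lookup(index, token):
    """Return all occurrences of a token without rescanning the texts."""
    return index.get(token.lower(), [])

# Hypothetical sample data; a real corpus would index annotated tokens, not raw words.
sentences = [
    "The corpus is large",
    "A corpus answers queries",
    "The library serves readers",
]
index = build_index(sentences)
print(lookup(index, "corpus"))
```

The index is built once, ahead of time; each query then costs a dictionary lookup instead of a pass over every text.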

14 Parsing. The classification of inflectional types needs to be as exhaustive and formal as a logical calculus. The parser creates a list of endings and a list of stems; when parsing a wordform, it tries to match the ending of the word with an ending in the list, then tries to match the rest with a stem, and checks whether this ending is allowed to attach to this stem. Input: a wordlist with an inflection type assigned to each item.
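The stem-and-ending matching described on this slide can be sketched as follows. This is a toy illustration, not the parser of any actual corpus: the English stems, the inflection type labels `N1` and `V1`, and the grammatical tags are all hypothetical.

```python
# Hypothetical wordlist: stem -> list of (lemma, inflection type).
STEMS = {
    "walk": [("walk", "V1")],
    "book": [("book", "N1"), ("book", "V1")],
}

# Hypothetical ending list: ending -> list of (inflection type, grammatical tag).
ENDINGS = {
    "": [("N1", "sg"), ("V1", "inf")],
    "s": [("N1", "pl"), ("V1", "3sg")],
    "ed": [("V1", "past")],
}

def parse(wordform):
    """Try every stem/ending split and keep those the inflection types allow."""
    analyses = []
    for cut in range(len(wordform), -1, -1):
        stem, ending = wordform[:cut], wordform[cut:]
        if ending not in ENDINGS or stem not in STEMS:
            continue
        for lemma, stem_type in STEMS[stem]:
            for ending_type, gram in ENDINGS[ending]:
                # The ending may attach only to a stem of the same inflection type.
                if stem_type == ending_type:
                    analyses.append((lemma, stem_type, gram))
    return analyses

print(parse("walked"))  # one analysis
print(parse("books"))   # two analyses: noun plural or verb 3sg
print(parse("blog"))    # no analysis: the stem is not in the wordlist
```

A wordform can thus come back with one analysis, several, or none, depending on what the wordlist and the ending list contain.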

15 Parsing. Some tokens are not recognized at all: recent loanwords; neologisms; elements of code-switching; abbreviations; proper names; technical terms; distorted spellings; cases of inflectional variance not included in the wordlist; scanning errors; typos and misspellings in the original texts. These tokens cannot be searched by means of lexical or grammatical queries.

16 Parsing. Some tokens receive several analyses. Which of these analyses actually applies depends on the context and cannot be evaluated by the parser.

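17

| # of analyses | Comment | Fiction | Science | Press | Other Written | Oral Discourse | EANC Total |
|---|---|---|---|---|---|---|---|
| 1 | unambiguous | 73.9% | 65.9% | 70.4% | 68.0% | 63.0% | 70.9% |
| 2 | ambiguous (homonymous) | 15.4% | 9.8% | 12.4% | 12.3% | 14.1% | 13.2% |
| 3 | ambiguous (homonymous) | 2.7% | 2.0% | 1.9% | 3.8% | 2.4% | 2.3% |
| 4-7 | ambiguous (homonymous) | 1.4% | 1.8% | | 1.6% | 1.5% | 1.6% |
| Subtotal | ambiguous | 19.5% | 13.7% | 16.0% | 17.7% | 18.0% | 17.1% |
| 1? | hypothetic (not in dictionary) | 0.0% | 1.3% | 0.6% | 0.7% | 0.2% | 0.5% |
| 0 | not recognized | 6.2% | 12.8% | 9.9% | 8.0% | 13.9% | 8.9% |
| Special tokens | Cyrillic, Latin, digits | 0.3% | 6.3% | 3.1% | 5.6% | 4.9% | 2.6% |

Total: 100%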

18 Search Functionality. Once again: the corpus allows researchers to investigate various linguistic phenomena by observing the range of contexts in which they occur. Query types: token queries; context queries; subcorpus queries.

19 Search Functionality. Simple token queries: lexeme search; wordform search; gram search. Combined token queries: lexeme + gram search.

20 Search Functionality. Additional and advanced options for token queries: case sensitivity; punctuation marks; position in the sentence; wildcard queries; logical functions; negated features.
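A token query that combines lexeme, wordform, and gram conditions with a few of the options listed above (case sensitivity, wildcards, negated features) might be sketched like this. The token representation, the `matches` function, and its parameter names are assumptions for illustration, not the query syntax of any of the corpora mentioned.

```python
from fnmatch import fnmatchcase

def matches(token, lexeme=None, wordform=None, grams=(), not_grams=(),
            case_sensitive=False):
    """Check one annotated token against a combined token query."""
    form = token["form"]
    if wordform is not None:
        if not case_sensitive:
            form, wordform = form.lower(), wordform.lower()
        # Shell-style wildcards: * matches any run of characters, ? a single one.
        if not fnmatchcase(form, wordform):
            return False
    if lexeme is not None and token["lemma"] != lexeme:
        return False
    tags = token["grams"]
    # All required features must be present and no negated feature may be.
    return all(g in tags for g in grams) and not any(g in tags for g in not_grams)

# Hypothetical annotated tokens.
tokens = [
    {"form": "Books", "lemma": "book", "grams": {"N", "pl"}},
    {"form": "booked", "lemma": "book", "grams": {"V", "past"}},
    {"form": "walks", "lemma": "walk", "grams": {"V", "3sg"}},
]

# Lexeme + gram search: forms of "book" that are not verbs.
print([t["form"] for t in tokens if matches(t, lexeme="book", not_grams=["V"])])
# Wildcard wordform search, case-insensitive by default.
print([t["form"] for t in tokens if matches(t, wordform="*s")])
```

Each condition is optional, so the same function covers simple lexeme, wordform, and gram searches as well as their combinations.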

21 Search Functionality. Context queries: a combination of several token queries; search for tokens at a specified distance; search for tokens within one sentence; search for tokens in adjacent sentences; increasing the number of tokens ad infinitum.
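A context query for two tokens at a specified distance within one sentence can be sketched as below. The function `find_pairs`, the (lemma, part-of-speech) token format, and the sample sentence are hypothetical; a real engine would run this over an index rather than over a list.

```python
def find_pairs(sentence, first, second, min_dist=1, max_dist=1):
    """Return index pairs (i, j) where first(sentence[i]) and second(sentence[j])
    hold and j follows i at a distance between min_dist and max_dist tokens."""
    hits = []
    for i, left in enumerate(sentence):
        if not first(left):
            continue
        for dist in range(min_dist, max_dist + 1):
            j = i + dist
            if j < len(sentence) and second(sentence[j]):
                hits.append((i, j))
    return hits

# Hypothetical sentence as (lemma, part-of-speech) pairs.
sentence = [
    ("the", "DET"), ("national", "ADJ"), ("corpus", "N"),
    ("contain", "V"), ("many", "ADJ"), ("text", "N"),
]

def is_adj(tok):
    return tok[1] == "ADJ"

def is_noun(tok):
    return tok[1] == "N"

# An adjective followed by a noun one or two tokens later.
print(find_pairs(sentence, is_adj, is_noun, min_dist=1, max_dist=2))
```

Chaining further token conditions onto each hit extends the query to three or more tokens.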

22 Search Functionality. Subcorpus selection: searching in a specified type of texts only; search within a specific period of time; search in texts of specified authors; search in specified genres/types of texts.
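Subcorpus selection amounts to filtering texts by their metadata before the search runs. The sketch below assumes a hypothetical metadata record with `genre`, `author`, and `year` fields; the field names and sample records are illustrative only.

```python
def select_subcorpus(texts, genre=None, author=None, year_from=None, year_to=None):
    """Keep only the texts whose metadata satisfies every condition given."""
    selected = []
    for text in texts:
        if genre is not None and text["genre"] != genre:
            continue
        if author is not None and text["author"] != author:
            continue
        if year_from is not None and text["year"] < year_from:
            continue
        if year_to is not None and text["year"] > year_to:
            continue
        selected.append(text)
    return selected

# Hypothetical metadata records.
texts = [
    {"id": 1, "genre": "press", "author": "A", "year": 1885},
    {"id": 2, "genre": "fiction", "author": "B", "year": 1950},
    {"id": 3, "genre": "press", "author": "B", "year": 2005},
]

# Press texts from 1900 onward.
print([t["id"] for t in select_subcorpus(texts, genre="press", year_from=1900)])
```

Token and context queries are then evaluated only over the selected texts.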

23 Search Functionality. Working with the results: expanding the context; pop-up grammar; sort by…

24 Extras: translations (EANC); disambiguation (RNC); electronic library (EANC); syntactic markup; statistics (RNC?).

25 Possible applications: linguistics (corpus-based grammar projects under way); education (www.studiorum.ruscorpora.ru, to appear); normative linguistics; literature and culture studies; etc.

