

What is a national corpus

The primary objective of a national corpus is to provide linguists with a tool for investigating a language across diverse types of texts by means of complex lexical and grammatical queries. The corpus makes it possible to investigate various linguistic phenomena by observing the range of contexts in which they occur.

Examples of searchable corpora online:
- British National Corpus (BNC)
- Russian National Corpus (RNC)
- Eastern Armenian National Corpus (EANC)
- Czech National Corpus

To show just one example: Eastern Armenian National Corpus
- about 90 million tokens
- powerful search engine for making complex lexical and morphological queries
- a diachronic corpus covering SEA (Standard Eastern Armenian) texts from the mid-19th century to the present
- both written and oral discourse
- open access

A national corpus is a large-scale, linguistically diversified and balanced collection of texts provided with a flexible search engine.

How large?
- RNC: 150 million
- BNC: 100 million
- EANC: 90 million
Essentially, it depends on the type of research envisaged.

How diversified? As diversified as practicable.
- EANC: extension of the press subcorpus to cover the early Armenian press; internet forums to be covered soon
- RNC: an effort to cover snail mail and electronic communication

EANC: subcorpus form

How balanced? Balance is a vague notion. At the very least, the corpus should not be disproportionate (less poetry than prose, etc.). Even an unbalanced corpus can be balanced by creating predefined subcorpora.

As an example: EANC

Multicomponent corpora
- Oral subcorpus (RNC, BNC, EANC)
- Dialectal subcorpus (RNC)
- Poetic subcorpus (RNC)
- Educational subcorpus (RNC)
- …

Library or corpus?
- An electronic library is intended for readers.
- A corpus is intended for researchers.
The difference lies in the target audience and the intended usage. Implied differences:
- a corpus must be able to respond to queries
- a library has major problems related to copyright

Technical requirement: reasonable waiting time
Functional requirement: complex queries
- You cannot parse texts on the fly, so the texts need to contain markup.
- In large corpora, you cannot simply search the markup, so you have to index the files, create data files and use special search algorithms.
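The indexing requirement above can be illustrated with a minimal sketch of an inverted index. The function names and the two toy documents are assumptions made for illustration; nothing here describes how the corpora named in this talk are actually implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to its (doc_id, position) pairs, in text order."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
    return index

def lookup(index, token):
    """Return all occurrences of a token without rescanning the texts."""
    return index.get(token.lower(), [])

# Hypothetical toy documents.
docs = ["the corpus is large", "a corpus needs an index"]
index = build_index(docs)
print(lookup(index, "corpus"))  # [(0, 1), (1, 1)]
```

The index is built once, so each query costs a dictionary lookup rather than a pass over every text, which is what keeps the waiting time reasonable as the corpus grows.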

Parsing
The classification of inflectional types needs to be as exhaustive and formal as a logical calculus. The parser creates a list of endings and a list of stems; when parsing a wordform, it tries to match the end of the word with an ending in the list, then tries to match the remainder with a stem, and checks whether this ending is allowed to attach to this stem.
This requires a wordlist in which each item is assigned an inflection type.
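The matching procedure described above can be sketched roughly as follows. The stems, endings, inflection type labels (V1, N1) and tags are hypothetical English toy data, chosen only to keep the example readable; the sketch is not the EANC or RNC parser.

```python
# Hypothetical toy lexicon: stem -> (lemma, inflection type).
STEMS = {"walk": ("walk", "V1"), "book": ("book", "N1")}

# Hypothetical endings: ending -> list of (inflection type, grammatical tags).
ENDINGS = {
    "": [("V1", "V,inf"), ("N1", "N,sg")],
    "s": [("V1", "V,prs,3sg"), ("N1", "N,pl")],
    "ed": [("V1", "V,pst")],
}

def parse(wordform):
    """Return every (lemma, tags) analysis licensed by the stem and ending lists."""
    analyses = []
    # Try every split of the wordform into stem + ending.
    for cut in range(len(wordform), -1, -1):
        stem, ending = wordform[:cut], wordform[cut:]
        if stem not in STEMS or ending not in ENDINGS:
            continue
        lemma, infl_type = STEMS[stem]
        # Keep the analysis only if this ending is allowed for this inflection type.
        for allowed_type, tags in ENDINGS[ending]:
            if allowed_type == infl_type:
                analyses.append((lemma, tags))
    return analyses

print(parse("walked"))  # [('walk', 'V,pst')]
print(parse("booked"))  # [] because "ed" is not allowed for type N1
```

In this toy lexicon "book" is listed only as a noun, so "booked" is rejected even though both the stem and the ending exist; that is the compatibility check mentioned in the text.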

Parsing
Some tokens are not recognized at all; these tokens cannot be searched by means of lexical or grammatical queries. Typical cases:
- recent loanwords
- neologisms
- elements of code-switching
- abbreviations
- proper names
- technical terms
- distorted spellings
- inflectional variants not included in the wordlist
- scanning errors
- typos and misspellings in the original texts

Parsing
Some tokens receive several analyses. The actual applicability of these analyses depends on the context and cannot be evaluated by the parser.

# of analyses | Comment                                 | Fiction | Science | Press | Other Written | Oral Discourse | EANC Total
1             | unambiguous                             | 73.9%   | 65.9%   | 70.4% | 68.0%         | 63.0%          | 70.9%
2             | ambiguous (homonymous)                  | 15.4%   | 9.8%    | 12.4% | 12.3%         | 14.1%          | 13.2%
3             | ambiguous (homonymous)                  | 2.7%    | 2.0%    | 1.9%  | 3.8%          | 2.4%           | 2.3%
4 - 7         | ambiguous (homonymous)                  | 1.4%    | 1.8%    |       | 1.6%          | 1.5%           | 1.6%
              | Subtotal ambiguous                      | 19.5%   | 13.7%   | 16.0% | 17.7%         | 18.0%          | 17.1%
1?            | hypothetical (not in dictionary)        | 0.0%    | 1.3%    | 0.6%  | 0.7%          | 0.2%           | 0.5%
0             | not recognized                          | 6.2%    | 12.8%   | 9.9%  | 8.0%          | 13.9%          | 8.9%
              | Special tokens: Cyrillic, Latin, digits | 0.3%    | 6.3%    | 3.1%  | 5.6%          | 4.9%           | 2.6%
Total: 100%
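A breakdown of this kind can be derived from the number of analyses the parser returns for each token. The sketch below is only an illustration under that assumption: the function name, the three simplified categories and the sample counts are hypothetical, not the procedure used to produce the EANC figures.

```python
from collections import Counter

def ambiguity_profile(analysis_counts):
    """Return the share of tokens per category, given one analysis count per token."""
    def category(n):
        if n == 0:
            return "not recognized"
        if n == 1:
            return "unambiguous"
        return "ambiguous"

    counts = Counter(category(n) for n in analysis_counts)
    total = len(analysis_counts)
    return {cat: counts[cat] / total for cat in counts}

# Hypothetical sample: four tokens with 1, 1, 2 and 0 analyses.
print(ambiguity_profile([1, 1, 2, 0]))
```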

Search Functionality
Once again: the corpus makes it possible to investigate various linguistic phenomena by observing the range of contexts in which they occur.
- token queries
- context queries
- subcorpus queries

Search Functionality
Simple token queries:
- lexeme search
- wordform search
- gram search
Combined token queries:
- lexeme + gram search
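As a rough sketch of how simple and combined token queries relate, assume each token is stored with its wordform, its lexeme and a set of grammatical tags ("grams"). The Token class, the token_query function and the sample data are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    wordform: str
    lexeme: str
    grams: frozenset

def token_query(tokens, wordform=None, lexeme=None, grams=()):
    """Return indices of tokens matching every criterion that is supplied."""
    wanted = set(grams)
    hits = []
    for i, tok in enumerate(tokens):
        if wordform is not None and tok.wordform != wordform:
            continue
        if lexeme is not None and tok.lexeme != lexeme:
            continue
        if not wanted <= tok.grams:  # all requested grams must be present
            continue
        hits.append(i)
    return hits

# Hypothetical annotated tokens.
TOKENS = [
    Token("books", "book", frozenset({"N", "pl"})),
    Token("book", "book", frozenset({"N", "sg"})),
    Token("walked", "walk", frozenset({"V", "pst"})),
]

print(token_query(TOKENS, lexeme="book"))                # [0, 1]
print(token_query(TOKENS, lexeme="book", grams={"pl"}))  # [0]
```

A combined lexeme + gram query is simply the conjunction of two simple queries, which is why one function with optional criteria covers all four cases.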

Search Functionality
Additional and advanced options for token queries:
- case-sensitivity
- punctuation marks
- position in the sentence
- wildcard queries
- logical functions
- negated features
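Three of these options (case-sensitivity, wildcards and negated features) can be sketched with the Python standard library. The helper names and the shell-style wildcard syntax are assumptions for illustration; the actual query syntax of each corpus may differ.

```python
from fnmatch import fnmatchcase

def match_wordform(wordform, pattern, case_sensitive=False):
    """Match a wordform against a shell-style wildcard pattern such as 'corp*'."""
    if not case_sensitive:
        wordform, pattern = wordform.lower(), pattern.lower()
    return fnmatchcase(wordform, pattern)

def match_grams(grams, required=(), excluded=()):
    """True if all required grams are present and no excluded gram is."""
    grams = set(grams)
    return set(required) <= grams and not (set(excluded) & grams)

print(match_wordform("Corpora", "corp*"))                       # True
print(match_wordform("Corpora", "corp*", case_sensitive=True))  # False
print(match_grams({"N", "pl"}, required={"N"}, excluded={"sg"}))  # True
```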

Search Functionality
Context queries: a combination of several token queries
- search for tokens at a specified distance
- search for tokens within one sentence
- search for tokens in adjacent sentences
- increasing the number of tokens ad infinitum
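A context query that combines two token queries at a specified distance within one sentence might look like the sketch below. The function name, the predicate-based interface and the sample sentences are hypothetical; a production system would run this against an index rather than scanning sentences.

```python
def context_query(sentences, first, second, max_distance=1):
    """Find (sentence_no, i, j) where `first` matches token i and `second`
    matches a later token j of the same sentence with j - i <= max_distance."""
    hits = []
    for s_no, sent in enumerate(sentences):
        for i, tok in enumerate(sent):
            if not first(tok):
                continue
            last = min(i + max_distance, len(sent) - 1)
            for j in range(i + 1, last + 1):
                if second(sent[j]):
                    hits.append((s_no, i, j))
    return hits

# Hypothetical tokenized sentences.
sentences = [["a", "national", "corpus"], ["a", "large", "balanced", "corpus"]]
is_a = lambda t: t == "a"
is_corpus = lambda t: t == "corpus"

print(context_query(sentences, is_a, is_corpus, max_distance=2))  # [(0, 0, 2)]
print(context_query(sentences, is_a, is_corpus, max_distance=3))  # [(0, 0, 2), (1, 0, 3)]
```

Adding further tokens to the query amounts to chaining more predicates in the same way.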

Search Functionality
Subcorpus selection: searching in a specified type of texts only
- search within a specific period of time
- search in texts by specified authors
- search in specified genres/types of texts
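Subcorpus selection can be thought of as filtering texts by their metadata before any token query is run. The sketch below assumes each text carries an author, a genre and a year; the records, field names and function name are hypothetical.

```python
# Hypothetical metadata records; a real corpus would read these from its catalogue.
TEXTS = [
    {"id": 1, "author": "Author A", "genre": "fiction", "year": 1890},
    {"id": 2, "author": "Author B", "genre": "press", "year": 1925},
    {"id": 3, "author": "Author A", "genre": "press", "year": 2005},
]

def select_subcorpus(texts, genre=None, author=None, year_from=None, year_to=None):
    """Return ids of texts that satisfy every filter that was supplied."""
    selected = []
    for meta in texts:
        if genre is not None and meta["genre"] != genre:
            continue
        if author is not None and meta["author"] != author:
            continue
        if year_from is not None and meta["year"] < year_from:
            continue
        if year_to is not None and meta["year"] > year_to:
            continue
        selected.append(meta["id"])
    return selected

print(select_subcorpus(TEXTS, genre="press"))                      # [2, 3]
print(select_subcorpus(TEXTS, author="Author A", year_from=1900))  # [3]
```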

Search Functionality
Working with the results:
- expanding the context
- pop-up grammar
- sort by…
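Expanding the context of a hit corresponds to widening a keyword-in-context (KWIC) window. This is a minimal sketch under that assumption; the function name and the sample sentence are hypothetical.

```python
def kwic(tokens, hit, width=2):
    """Return (left context, keyword, right context) for the token at index `hit`."""
    left = tokens[max(0, hit - width):hit]
    right = tokens[hit + 1:hit + 1 + width]
    return " ".join(left), tokens[hit], " ".join(right)

# Hypothetical sample sentence.
tokens = "a national corpus is a large collection of texts".split()

print(kwic(tokens, 2))           # ('a national', 'corpus', 'is a')
print(kwic(tokens, 2, width=4))  # ('a national', 'corpus', 'is a large collection')
```

Sorting the results by the left or right context string is then an ordinary sort on the first or third element of each tuple.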

Extras
- Translations (EANC)
- Disambiguation (RNC)
- Electronic library (EANC)
- Syntactic markup
- Statistics (RNC?)

Possible applications
- Linguistics (corpus-based grammar projects under way)
- Education (www.studiorum.ruscorpora.ru, to appear)
- Normative linguistics
- Literature and culture studies
- etc.