CORPUS LINGUISTICS Corpus linguistics is the study of language as expressed in samples (corpora) of "real world" text. It is an approach to deriving a set of abstract rules by which a natural language is governed, or by which it relates to another language. Corpora were originally compiled by hand; they are now largely derived by an automated process.
Corpus "Corpus", derived from the Latin word meaning "body", may be used to refer to any text in written or spoken form. In modern linguistics, the term refers to a large collection of texts, presented in machine-readable form, that represents a sample of a particular variety or use of language(s).
Scope of Studies: the possible words, structures or uses in a language; the probable occurrence of each of these in a language; the description and explanation of the nature, structure and use of language, with particular attention to matters such as language acquisition, variation and change.
Types of Corpora: spoken (transcribed) language; written language from modern or old texts; texts from one language or several languages; texts from whole books, newspapers, journals and speeches, or extracts of varying length; online data.
Corpus linguistics is now seen as the study of linguistic phenomena through large collections of machine-readable texts (corpora). Corpora are used in a number of research areas, ranging from the descriptive study of the syntax of a language to language learning.
List of corpora
Examples of Corpora: Brown Corpus. The Brown Corpus of Standard American English was the first of the modern, computer-readable, general corpora. It was compiled by W.N. Francis and H. Kucera at Brown University, Providence, RI. The corpus consists of one million words of American English texts printed in 1961. The texts were sampled from 15 different text categories to make the corpus a good standard reference. The LOB Corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown Corpus. The availability of corpora that are so similar in structure is a valuable resource for researchers interested in, for example, comparing different language varieties.
BNC - British National Corpus The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The latest edition is the BNC XML Edition, released in 2007.
Sample Corpus Sample
Quranic Corpus
Corpora of CMC
Role of the Computer in Corpus Linguistics: to store huge amounts of text; to retrieve huge amounts of text quickly; to retrieve words, phrases or whole texts in context; to sort linguistic items; to increase reliability in searching, counting and sorting linguistic items; to provide accurate probabilities of occurrence for specific linguistic items.
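The counting, sorting and probability tasks listed above can be illustrated with a minimal Python sketch. The sample sentence, the tokenizing rule and the function names are assumptions made for illustration only; real corpus tools use far more careful tokenization and much larger texts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(tokens):
    """Return each word's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Hypothetical miniature "corpus" used only for demonstration
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)
counts = Counter(tokens)
probs = relative_frequencies(tokens)

# Sort by descending frequency, then alphabetically, and print a small table
for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{word:<5} {count:>2} {probs[word]:.3f}")
```

The relative frequency is only an estimate of the probability of occurrence; it becomes more trustworthy as the corpus grows.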
Corpus-Related Research: Computational Linguistics, Cultural Studies, Discourse Analysis and Pragmatics, Grammar/Syntax, Historical Linguistics, Language Acquisition, Language Teaching, Language Variation, Lexicography, Linguistics, Machine Translation, Natural Language Processing (NLP), Psycholinguistics, Semantics, Social Psychology, Sociolinguistics, Speech, Stylistics.
Computational Linguistics (The use of computers to process or produce human language) Corpora are used as a resource to solve various problems.
Cultural Studies The existence of comparable corpora makes it possible to compare language use in different countries. The results can point to differences in culture.
Grammar/Syntax The existence of large corpora allows for the study of language as it is actually produced, that is, the study of speakers' performance. Confronting a grammar with unrestricted corpus data tests its correctness and its completeness.
Historical Linguistics Machine-readable corpora from different times allow historical linguists to conduct research into the development of a language over time.
Language Acquisition Corpora can provide data from learners of a target language from different countries, of different ages, etc.
Language Teaching: a corpus is used for data-driven learning; more suited to higher-level learners; used to investigate idiolect, idiosyncrasy, or certain aspects of grammar usage. READING ASSIGNMENT Corpus Linguistics: What It Is and How It Can Be Applied to Teaching Daniel Krieger dannykrieger99 [at] Siebold University of Nagasaki (Nagasaki, Japan)
Language Variation To study or compare how language varies between different text types, domains, regions, speakers, writers, etc.
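Comparing text types of different sizes usually means normalising raw counts first. The sketch below reports a word's frequency per 1,000 tokens in two tiny, invented samples; the sample texts, the variable names and the `per_thousand` helper are hypothetical and stand in for real sub-corpora.

```python
from collections import Counter

def per_thousand(tokens, word):
    """Normalised frequency: occurrences of `word` per 1,000 tokens."""
    return 1000 * Counter(tokens)[word] / len(tokens)

# Hypothetical samples standing in for two text types
news = "the minister said the bill would pass the house".split()
chat = "well i said i would go but i did not".split()

for word in ("the", "i", "said"):
    print(f"{word:<5} news={per_thousand(news, word):7.1f}  chat={per_thousand(chat, word):7.1f}")
```

Normalising in this way lets counts from a small spoken sample be set beside counts from a much larger written one.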
Lexicography Corpora are used in the production of dictionaries and grammar books. Examples: Collins Cobuild, British National Corpus (BNC) and Longman Corpus Network.
Linguistics To provide traditional linguistic descriptions.
Psycholinguistics To contribute to the creation of hypotheses about the way language is processed by the mind.
Semantics To study the meanings of words or utterances by looking at the contexts in which the words or phrases occur.
Sociolinguistics To study how language use relates to the speakers' age, sex and social class, the writers' age, etc.
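One simple way to look at the contexts of a word is to count the words that appear within a few positions of it. The following is a minimal sketch under stated assumptions: the sample sentences, the window size of two and the `collocates` function are invented for illustration, and the window ignores sentence boundaries.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words found within `window` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    found = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            # Neighbours on both sides, excluding the node word itself
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            found.update(neighbours)
    return found

# Hypothetical sentences showing two senses of "bank"
sample = ("She sat on the river bank. He opened a bank account. "
          "The bank approved the loan. They fished from the bank of the river.")
context = collocates(sample, "bank")
print(context.most_common(5))
```

Neighbours such as "river" and "account" hint at the different senses of the node word; function words such as "the" dominate the raw counts, which is why real studies filter or weight them.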
Speech To be used for speech science and speech technology. To compare spoken and written language. To teach computers to produce and understand speech. Example: London-Lund Corpus (LLC)
Stylistics To find specific features of text types. To compare different texts. To detect changes of style in authors' writings.
Computational Stylistics The style of a text is a function of the aggregate of the ratios between the frequencies of its phonological, grammatical and lexical items, and the frequencies of the corresponding items in a contextually related norm. Computers are used to study the stylistic characteristics of particular texts, authors, genres, periods, etc.
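The definition above can be made concrete for lexical items only. This minimal sketch divides each word's relative frequency in a text by its relative frequency in a norm; the two token lists and the function names are hypothetical, and phonological and grammatical items are left out entirely.

```python
from collections import Counter

def relative_freq(tokens):
    """Relative frequency of every item in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def frequency_ratios(text_tokens, norm_tokens):
    """Ratio of each item's relative frequency in the text to that in the norm.

    Items absent from the norm are skipped to avoid division by zero.
    """
    text_rf = relative_freq(text_tokens)
    norm_rf = relative_freq(norm_tokens)
    return {w: text_rf[w] / norm_rf[w] for w in text_rf if w in norm_rf}

# Hypothetical text and norm, far smaller than any usable sample
text_tokens = "i think i shall go and i shall stay".split()
norm_tokens = "i think you will go and we will stay here now please".split()

ratios = frequency_ratios(text_tokens, norm_tokens)
for word, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{word:<6} {ratio:.2f}")
```

A ratio well above 1 marks an item the text uses more than the norm does. Skipping items missing from the norm is a simplification; a fuller treatment would smooth the counts instead.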
Forensic Linguistics Forensic linguistics is the application of linguistic knowledge, methods and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of applied linguistics. There are three areas of application for linguists working in forensic contexts: 1) understanding the language of the written law, 2) understanding language use in forensic and judicial processes and 3) the provision of linguistic evidence.
BASIC TOOL: Concordancer. An example of software used for corpus linguistics. What is a concordancer? Examples of concordance programs. How does it assist in the field of corpus linguistics and in teaching and learning? Simple demonstration of the usage of a concordancer.
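A concordancer lists every occurrence of a search word together with the words around it, usually in a keyword-in-context (KWIC) layout. The sketch below is a minimal illustration of that idea, not a model of any particular concordance program; the `kwic` function, its `width` parameter and the sample text are assumptions.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, node, right) triples for every occurrence of `keyword`.

    `width` is the number of context tokens kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text
sample = ("The corpus was compiled in 1961. A corpus is a body of text. "
          "Researchers search the corpus with a concordancer.")
concordance = kwic(sample, "corpus")

# Right-align the left context so the search word lines up in one column
for left, node, right in concordance:
    print(f"{left:>25}  {node}  {right}")
```

Lining up the search word in one column is what lets a learner or researcher scan its typical left and right neighbours at a glance.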
Studies on Corpus Linguistics International Journal of Education and Development using Information and Communication Technology (IJEDICT), 2011, Vol. 7, Issue 3, pp. 96-101 EDICT-2011-1303.pdf
Journal of Corpus Linguistics International Journal of Corpus Linguistics
Reflection In what ways could the availability of corpora enrich your studies as a BENL student? Include a suggestion for a possible corpus linguistics topic for your MA thesis.