McEnery, T., Xiao, R. and Tono, Y. 2006. Corpus-Based Language Studies. Routledge. Unit A1. Corpus linguistics: the basics (pp. 3-11) http://www.cl2011.org.uk/history-of-corpus-linguistics.html
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT The term "corpus linguistics" first appeared in the early 1980s, but corpus-based language study has a substantial history. The earliest examples of corpus linguistics date to late 19th-century Germany. In 1897, the German linguist J. Käding used a large corpus of about 11 million words to analyse the distribution of letters and their sequences in the German language.
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Other early linguists who used corpora to study language include Franz Boas (Handbook of American Indian Languages, 1911), Zellig Harris (Methods in Structural Linguistics, 1951), Charles C. Fries (The Structure of English, 1952),
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Leonard Bloomfield (Language, 1933), Archibald A. Hill and others, mostly American structural and field linguists (see TERMS AND CONCEPTS). Some of them also started to use corpora in the pedagogical study of foreign languages. Thus, the corpus methodology dates back to the pre-Chomskyan period.
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT In the late 1950s the corpus methodology was severely criticized and became marginalized. Chomsky rejected the use of a corpus as a tool for linguistic studies, arguing that linguists must model language on competence (Lithuanian: kompetencija) rather than performance (Lithuanian: atliktis).
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT The early linguists used “shoeboxes” filled with paper slips instead of computers, but their methodology was essentially “corpus-based” (empirical, i.e. based on observable data).
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Using paper slips and human hands and eyes, it was impossible to analyse large bodies of language data. Consequently the corpora of the time could rarely avoid being ‘skewed’.
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Corpus linguistics was not abandoned completely; however, it was not until the 1980s that linguists began to show an increased interest in the use of corpora for research.
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT With the development of powerful computers (especially their processing power and massive storage at relatively low cost), the exploitation of massive corpora became possible. The marriage of corpora with computer technology rekindled interest in the corpus methodology.
A1.2 CORPUS LINGUISTICS: PAST AND PRESENT Nowadays, the corpus methodology enjoys widespread popularity. It has opened up or foregrounded many new areas of research. Corpora have revolutionized nearly all branches of linguistics.
A1.3 WHAT IS A CORPUS? Sinclair (1996): “A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.” Thus a corpus is different from a random collection of texts.
A1.3 WHAT IS A CORPUS? Definition of a corpus: “A collection of sampled texts, written or spoken, in machine-readable form which may be annotated with various forms of linguistic information” (McEnery et al. 2006:4)
A1.3 WHAT IS A CORPUS? There are many ways to define a corpus, but there is an increasing consensus that a corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is
A1.3 WHAT IS A CORPUS? (3) sampled to be (4) representative of a particular language or language variety. The problem: what can be counted as representative? (See Unit A2.)
A1.4 WHY USE COMPUTERS TO STUDY LANGUAGE? Advantages of electronic corpora: the speed of processing data; the accuracy of processing data; the ease of manipulating data (e.g. searching, selecting, sorting); minimal costs; computers avoid human bias in an analysis, so the result is more reliable.
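The searching, selecting and sorting mentioned above can be illustrated with a small keyword-in-context (KWIC) concordance. This is only a minimal sketch: the tokenizer is deliberately crude, and the sample text and the names `tokenize` and `concordance` are invented for illustration, not taken from the source.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Toy "corpus" invented for illustration only.
sample_text = "The corpus is large. A corpus can be annotated. We searched the corpus quickly."
tokens = tokenize(sample_text)

hits = concordance(tokens, "corpus")             # searching and selecting
hits_sorted = sorted(hits, key=lambda h: h[2])   # sorting by right-hand context

for left, keyword, right in hits_sorted:
    print(f"{left:>20} [{keyword}] {right}")
```

Sorting by the right-hand context groups similar continuations together, which is one common way of reading concordance lines; sorting by the left-hand context would only require changing the sort key.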
A1.5 THE CORPUS-BASED APPROACH VS. THE INTUITION-BASED APPROACH By using the intuition-based approach, researchers can invent purer examples for analysis. However, it should be applied with caution. WHY?
A1.5 THE CORPUS-BASED APPROACH VS. THE INTUITION-BASED APPROACH It is possible to be influenced by one’s dialect or sociolect: what appears to be acceptable to one speaker may not be so to another. When one invents an example to support or disprove an argument, the utterance may not represent typical language use.
A1.5 THE CORPUS-BASED APPROACH VS. THE INTUITION-BASED APPROACH The corpus-based approach, in contrast, draws upon authentic or real texts. Results based on introspection (i.e. intuition) are difficult to verify – introspection is not observable. But a corpus can yield reliable quantitative data.
A1.6 CORPUS LINGUISTICS: A METHODOLOGY OR A THEORY? Corpus linguistics is indeed a methodology rather than an independent branch of linguistics (but see Tognini-Bonelli 2001:1). CL is not restricted to a particular aspect of language. CL is a whole system of methods and principles of how to apply corpora in language studies and teaching/learning.
A1.7 CORPUS-BASED VS. CORPUS-DRIVEN APPROACHES Corpus linguistics is a research approach that has developed over the past few decades to support empirical investigations of language variation and use. Corpus studies have used two major research approaches: ‘corpus-based’ (Lithuanian: tekstynais paremtas tyrimas) and ‘corpus-driven’ (Lithuanian: tekstyno inspiruotas tyrimas).
A1.7 A CORPUS-BASED APPROACH Corpus-based research assumes the validity (Lithuanian: pagrįstumas) of linguistic forms and structures derived from linguistic theory. The primary goal of research is to analyse the systematic patterns of variation and use for those pre-defined linguistic features. Corpora are used mainly to test or exemplify theories formulated before large corpora became available.
A1.7 A CORPUS-DRIVEN APPROACH Corpus-driven research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus. The theoretical statements reflect directly the evidence provided by the corpus. However, the distinction between ‘corpus-based’ and ‘corpus-driven’ is overstated. In this book the term ‘corpus-based’ is used in a broad sense, encompassing both approaches.
THE SCOPE OF CORPUS LINGUISTICS Kennedy, G. 1998. An Introduction to Corpus Linguistics. Longman.
Corpus linguistics Corpus linguistics is based on bodies of text as the source of evidence for linguistic description and argumentation. It is a methodology for linguistic description. The focus of study is on performance rather than competence, and on observation of language in use leading to theory rather than vice versa (cf. a corpus-driven approach). It is NOT a separate branch of linguistics.
Corpus-based research In the case of corpus-based research, the evidence of what is possible in a language is derived directly from texts. Work in corpus linguistics is currently associated with several quite different activities.
Activities in Corpus linguistics The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.
Activities in Corpus linguistics A second group of researchers is concerned with developing tools for the analysis of corpora. A third group of researchers consists of descriptive linguists whose main concern has been to describe reliably the lexicon and grammar of languages.
Concerns of descriptive linguistics Corpus-based descriptive linguistics is concerned with how often particular forms are used. This approach allows us to study variation in text types, language change, and regional and other varieties of language.
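To compare how often a form is used in different text types, raw counts are usually normalized, because subcorpora differ in size. The following is a minimal sketch under stated assumptions: the two "subcorpora" are invented one-sentence samples, the function name `relative_frequency` is hypothetical, and the base of 100 tokens is chosen only to keep the toy numbers readable (real studies often report rates per million words).

```python
import re

def relative_frequency(text, form, per=1_000_000):
    """Occurrences of `form` per `per` tokens of running text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(form) * per / len(tokens)

# Invented samples standing in for two text types.
subcorpora = {
    "news": "the minister said that the vote would go ahead",
    "chat": "well i said no and then she said fine",
}

# Normalized rate of the form "said" in each text type.
rates = {
    name: relative_frequency(text, "said", per=100)
    for name, text in subcorpora.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} occurrences per 100 tokens")
```

The same comparison could be run across time periods or regional varieties by changing what the dictionary keys stand for.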
Concerns of descriptive linguistics The corpus provides contexts for the study of meaning in use. The corpus makes it possible to extract linguistic information from texts on a scale previously undreamed of.
Activities in Corpus linguistics A fourth area of activity is concerned with using corpus material for language learning and teaching, and natural language processing by machine, including speech recognition and translation.
Activities in Corpus linguistics Corpus linguistics is also concerned with the statistical distribution of linguistic items in the context of use, e.g. word counts that reveal the most frequent words and grammatical structures for language teaching purposes.
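A word count of this kind can be sketched in a few lines. This is an illustrative example only: the sample sentence is invented, the tokenizer is simplistic, and the function name `frequency_list` is an assumption rather than anything from the source.

```python
from collections import Counter
import re

def frequency_list(text, top_n=3):
    """Return the top_n most frequent word forms with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

# Invented sample text for illustration.
sample = "the cat sat on the mat and the dog sat on the rug"
top_words = frequency_list(sample)

for word, count in top_words:
    print(f"{word}\t{count}")
```

On a real corpus the top of such a list is typically dominated by function words, which is one reason frequency lists are useful when deciding what to teach first.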
The current concerns of corpus linguistics The current concerns of corpus linguistics include: improved ways of annotating corpora (i.e. adding short notes to explain something), the tagging (i.e. attaching a word-class label) of parts of speech and the senses of polysemous word forms, and improved automated parsing (i.e. syntactic analysis).
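Part-of-speech tagging can be pictured as attaching a label to every word. One simple way to store the result is a `word_TAG` format, read back into (word, tag) pairs as in the sketch below. The tag labels (DET, NOUN, VERB, ADJ), the underscore separator and the function name `parse_tagged` are illustrative assumptions, not the scheme of any particular corpus.

```python
def parse_tagged(line, sep="_"):
    """Split whitespace-separated 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last separator, so words containing
        # the separator themselves are still handled.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs

# Invented, hand-tagged example sentence.
tagged_line = "The_DET corpus_NOUN contains_VERB authentic_ADJ texts_NOUN"
pairs = parse_tagged(tagged_line)

# Once the text is tagged, forms can be selected by word class.
nouns = [word for word, tag in pairs if tag == "NOUN"]
print(nouns)
```

With tags in place, searches can target word classes rather than spellings, for example all nouns or every adjective followed by a noun.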
Analyses of the corpus can contribute to: the making of dictionaries, word lists, descriptive grammars (cf. LGSWE), diachronic and synchronic comparative studies of speech varieties, and to stylistic, pedagogical and other applications.
TERMS AND CONCEPTS Annotation – the process of encoding interpretive linguistic information in a corpus. Annotating: adding short notes to explain something. Tagging: an alternative term for annotation, i.e. attaching a word-class label. Parsing (also treebanking or bracketing): syntactic analysis of a sentence into its constituents. POS: part of speech.
TERMS AND CONCEPTS (Chomsky) competence (Lithuanian: kompetencija; internalised knowledge of a language) vs. performance (Lithuanian: atliktis; external evidence of language competence, its usage on particular occasions). Empirical data - the empirical method is generally taken to mean the approach of using a collection of data as the basis for a theory or for deriving a conclusion in science.
TERMS AND CONCEPTS Corpus-based (Lithuanian: tekstynais paremtas tyrimas) research assumes the validity of linguistic forms and structures derived from linguistic theory. Corpus-driven (Lithuanian: tekstyno inspiruotas tyrimas) research is more inductive, so that the linguistic constructs themselves emerge from analysis of a corpus. The theoretical statements reflect directly the evidence provided by the corpus.
TERMS AND CONCEPTS Field linguists (e.g. Boas 1940) - field linguistics is concerned with the description and analysis of previously undescribed languages. Because many undescribed languages are spoken by small groups of people, many field linguists dedicate a great deal of their time to language documentation and language revitalization.
TERMS AND CONCEPTS Structuralism - any approach to linguistic description which views the grammar of a language primarily as a system of relations. Structuralism in this sense derives largely from the work of the Swiss linguist Ferdinand de Saussure (1857-1913). Virtually all 20th-century approaches to linguistics are structuralist in this sense.