Corpus 3 Corpus-based Description
Aspects of corpus-based studies: lexis, morphology, syntax and discourse. Fig. 3.1 A classification of corpus-based research on English
Lexical description The most obvious use of corpora for lexical description is in lexicography: not only to identify the set of different words and show when new types enter the language, but also to identify the various senses or uses of particular types and their relative frequencies. E.g. London-Lund Corpus: the polysemous word good (Table 3.1). Corpora can also be used to identify neologisms.
Pre-Electronic Lexical Description for Pedagogical Purposes Thorndike (1921): word frequency counts based on a 4.5-million-word corpus of literary works and books read by younger children. The principle of vocabulary control in the design and editing of reading materials owes much to Thorndike's pioneering work. Michael West: General Service List of English Words (1953)
Pre-Electronic Lexical Description for Pedagogical Purposes Description of the most frequent 2,000 words in the written English of the time, supplemented by information on the frequency of the meanings or uses of these words, based on the work of Lorge (Fig. 3.2). The Thorndike-Lorge corpus was biased towards more literary and formal styles of writing, and did not include speech at all.
Computer-based studies of lexicon With a computerized corpus and appropriate software, both significant and more trivial but interesting facts about the lexicon of a language can be uncovered. Table 3.2 The rank ordering of the 50 most frequent words in various corpora shows remarkable consistency and systematic differences.
Computer-based studies of lexicon Consistency: all the words except said are function words.
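A rank-ordered frequency list of the kind compared across corpora can be produced with very little code. The following is a minimal sketch, assuming a crude regular-expression tokenizer and a made-up sample sentence; the function names tokenize and rank_frequency are illustrative and do not come from any corpus tool mentioned in these slides.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def rank_frequency(text, n=50):
    """Return the n most frequent word types with their counts, highest first."""
    return Counter(tokenize(text)).most_common(n)


# Hypothetical sample text; a real study would read a whole corpus file.
sample = "The cat sat on the mat and the dog sat by the door."
for rank, (word, count) in enumerate(rank_frequency(sample, 5), start=1):
    print(rank, word, count)
```

Even in this tiny sample the top rank goes to a function word (the), which is the pattern the frequency tables show on a much larger scale.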
Word Occurrence About 40% of the word types in a corpus of over five million words occur only once. This shows that even a corpus of that size is not a sound basis for lexicographical studies of low-frequency words.
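The share of words that occur only once (hapax legomena) is easy to check on any tokenized text. This is a minimal sketch, assuming the percentage is measured over word types rather than running words; hapax_proportion is an illustrative name and the token list is invented.

```python
from collections import Counter


def hapax_proportion(tokens):
    """Return the fraction of word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Hypothetical token list: four types, two of which occur once.
tokens = "a b b c c c d".split()
print(hapax_proportion(tokens))
```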
Word Occurrence Sharman found that there was an almost linear relationship between vocabulary size and corpus size. A new word appeared in the text approximately every 30 words on average. The more narrowly focused the corpus, the more content words find their way into the higher frequency levels.
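The relationship between corpus size and vocabulary size can be traced by counting the distinct word types seen so far at regular intervals through the text. The sketch below is one simple way to do this, not Sharman's own procedure; vocabulary_growth and the step parameter are assumptions made for illustration.

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens read, types seen) pairs sampled every `step` tokens.

    `tokens` must be a list so that the final point can always be recorded.
    """
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# Hypothetical toy input; a real corpus would use a much larger step.
print(vocabulary_growth(["a", "b", "a", "c", "b", "d"], step=2))
```

Dividing the number of tokens read by the number of types seen at any point gives the average interval at which a new word appears.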
Word Classes Table 3.5 (written English): Relative proportions of major word classes in the Brown and LOB corpora As shown in Table 3.6 (spoken English), fewer nouns and a considerable proportion of discourse items characteristic of spoken English are noteworthy.
Word Classes Table 3.7 shows that some sequences such as adjective + noun or noun + noun are very frequent indeed. Johansson and Hofland: occurrence of the 40 most frequent sequences of word-class tags at the beginnings and ends of sentences. Findings: the ends of sentences may be more predictable grammatically than the beginnings.
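Counts of word-class tag sequences, overall and at the edges of sentences, can be sketched as follows. The example assumes sentences that are already tagged as (word, tag) pairs; the tag labels (DET, ADJ, NOUN, VERB) and the two sentences are invented for illustration and are not the Brown or LOB tagsets.

```python
from collections import Counter


def tag_bigrams(tagged_sentences):
    """Count every adjacent pair of tags within each sentence."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        counts.update(zip(tags, tags[1:]))
    return counts


def edge_bigrams(tagged_sentences):
    """Count the first and last tag pair of each sentence separately."""
    starts, ends = Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        if len(tags) >= 2:
            starts[tuple(tags[:2])] += 1
            ends[tuple(tags[-2:])] += 1
    return starts, ends


# Hypothetical hand-tagged sentences.
sentences = [
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN"), ("stood", "VERB")],
    [("a", "DET"), ("stone", "NOUN"), ("wall", "NOUN"), ("fell", "VERB")],
]
print(tag_bigrams(sentences).most_common(3))
starts, ends = edge_bigrams(sentences)
print(len(starts), "distinct opening pairs;", len(ends), "distinct closing pairs")
```

If far fewer distinct pairs account for sentence endings than for sentence beginnings, the endings are the more predictable position, which is the kind of comparison the finding above rests on.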
Register studies Table 3.8 There are certain characteristics of the vocabulary of scientific English. Certain relational words are disproportionately more frequent in scientific English. Comparative adjectives and adverbs are similarly disproportionately frequent, whereas locative adverbs of space or time are disproportionately less frequent in scientific texts than in general written American English. Some items occur in one variety but are highly unlikely to occur in the other.
Semantic information Longman Dictionary of Contemporary English noun entries 23,800 67% one sense % two sense % three senses % four senses 595
Semantic information Verb % one sense % 2 senses % three senses % four senses 348
Collocation Some words tend to occur in the company of other words in certain contexts, e.g. pouring rain, statistically significant, intrinsic value, strong tendency. Lexicalized unit: set phrase, idiomatic usage, cliché
Collocation Interest in recurring word combinations: Wong-Fillmore (1976): The strategy of acquiring formulaic speech is central to the learning of language.
Collocation Peters (1980): unanalyzed sequences of words have a significant role among the units of language acquisition; Peters proposed ways of identifying such unanalyzed sequences. Nattinger and DeCarrico (1992): since first language learners can be seen to use varying, apparently unanalyzed, prefabricated chunks of speech, second language teaching might similarly be built around the establishment of what they call lexical phrases.
Collocation Different characteristics of the sequence: Allow for no alteration: it's as easy as falling off a log. Allow certain changes (at the moment/at certain moments) Relatively free within a framework (too...to, n... Of)
Collocation Problems in the definition of collocation: How often does a combination have to recur to be habitual? Who decides what sounds natural? Does a combination have to be well-formed or canonical to be a collocation? Do collocations have to be syntactic or are they primarily semantic? Do collocations have to consist of adjacent words or can they be discontinuous?
Collocation Can a sequence which occurs only once in a particular corpus but which is intuitively recognized by native speakers as a sequence they have heard before be listed as a collocation? How big does a corpus have to be in order to establish that a collocation does exist? Are there degrees of collocationality based on the flexibility of the bonding between words?
Collocation Can we lemmatize collocations so that similar or inflectionally related sequences are counted as a single collocation type? Can degrees of collocationality be established on the basis of the number of tokens of a type in a particular corpus?
Collocation Sinclair (1991) suggested that a span of up to four words on each side of a word is the environment in which collocation is most likely to occur, although computer software makes it possible to explore much larger spans, up to the size of a whole text.
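Collecting the words that fall within a fixed span on either side of a node word can be sketched as below. This is a minimal illustration of the span idea only: it returns raw co-occurrence counts and applies no statistical test of association. The function name collocates, the sample sentence and the default span=4 (taken from the four-word suggestion above) are assumptions.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` positions of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Hypothetical sample; tokens must be a list of strings.
tokens = "the pouring rain fell and the heavy rain stopped".split()
print(collocates(tokens, "rain", span=1))
print(collocates(tokens, "rain", span=2).most_common(1))
```

Widening the span brings in more function words such as the, which is one reason raw counts alone are a weak guide to collocational strength.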
Tense and aspect of verbs Table 3.14 Rank order of the most frequent simple and complex finite verb forms Table 3.15 Relative frequencies of use of finite verb forms Table 3.16 Perfect and progressive verb forms in the Brown Corpus Table 3.17 Finite and non-finite verb forms Table 3.18 past participle
Modals Table 3.19 frequency of nine modals Table 3.20 use of modals Table 3.21 use of modals in verb-phrase structures
Voice Table 3.22 active and passive predications Table 3.23 use of passives in different registers Table 3.24 verb-phrase structure of agentive passives
Verb and particle use Subjunctive Prepositions Conjunctions
Grammatical studies Corpus-based grammatical studies have revealed considerable genre differences in the use of syntactic patterns and in sentence length. Syntactic constructions are not in free variation. Grammatical study is more of a challenge than lexical study because the tagging and parsing needed for the automatic analysis of texts, and the software to support it, have not been widely available or user-friendly.
Sentence length Sentence length is related to genre. The mean number of words per sentence in informative categories is much greater than in imaginative prose. There is much closer consistency in the number of predications per sentence regardless of genre. Table Sentence length and predications
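Mean sentence length in words is simple to compute once a text is split into sentences. The sketch below assumes that sentences end at a full stop, question mark or exclamation mark, which is a rough rule that mishandles abbreviations; mean_sentence_length and the two sample texts are invented for illustration.

```python
import re


def mean_sentence_length(text):
    """Return the mean number of whitespace-separated words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)


# Hypothetical samples standing in for two genres.
informative = "The committee reviewed the proposal and approved the revised budget for next year."
imaginative = "She ran. The door slammed. Nobody came."
print(mean_sentence_length(informative))
print(mean_sentence_length(imaginative))
```

Counting predications per sentence would need tagged or parsed text, so it is not attempted here.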
Syntactic processes Clause patterning Table 3.42 Distribution of recurrent verb-complement patterns: SVC (adj.) 45%, SVO 20.9%
Syntactic processes About half of the clauses are matrix clauses and half are embedded. Of the matrix clauses, 97.8% are finite, 1.5% are nonfinite, and 0.7% are elliptical. The vast majority of all informational subject clauses are extraposed (it is necessary that), reflecting a principle of end-focus from a functional sentence perspective or preferences in sentence organization for processing purposes.
Syntactic processes In informative prose the verb which precedes a finite that clause is more likely to be a communication verb such as say or state, whereas in spoken conversation affective or cooperative verbs such as think, feel and hope tend to predominate.
Noun modification 98% of postmodifying clauses had one or other of the simpler clause patterns SVO (37%), SVO (38%), SVC (38%), suggesting that embedding tends to favor less complex sentence patterns. 70% of noun phrases function as subjects or prepositional complements, and noun phrases with postmodifying clauses tend to be disfavoured in subject function.
Noun modification Postmodification is less frequent in nonfinal positions of sentences. This is because the subject or topic is familiar enough not to need identification or elaboration through postmodification, or because brief subjects are easier to process.
Causation The marking of causation can be lexicalized (because, cause), expressed through syntactic structure (because of), or left to implicature. The choice of how to express causation is seldom free, but is influenced by various semantic, pragmatic, stylistic, cognitive and textual variables.
Pragmatics Table 3.5 Distribution of discourse items Comparisons of spoken and written English Table 3.58 pretty Table 3.59 really, just, right