Corpus 3 Corpus-based Description. Aspects of corpus-based studies: lexis, morphology, syntax and discourse. Fig. 3.1 A classification of corpus-based research on English.

Presentation transcript:

Corpus 3 Corpus-based Description

Aspects of corpus-based studies: lexis, morphology, syntax and discourse. Fig. 3.1 A classification of corpus-based research on English

Lexical description The most obvious use of corpora for lexical description is in lexicography. Corpora are used not only to identify the set of different words and show when new types enter the language, but also to identify the various senses or uses of particular types and their relative frequencies, e.g. London-Lund Corpus: the polysemous word good (Table 3.1). Identify neologisms.

Pre-Electronic Lexical Description for Pedagogical Purposes Thorndike (1921): word frequency on the basis of a 4.5-million-word corpus of literary works and books read by younger children. The principle of vocabulary control in the design and editing of reading materials owes much to Thorndike's pioneering work. Michael West: General Service List of English Words (1953)

Pre-Electronic Lexical Description for Pedagogical Purposes Description of the most frequent 2,000 words in the written English of the time, supplemented by information on the frequency of the meanings or uses of these words, based on the work of Lorge. Fig. 3.2. The Thorndike-Lorge corpus was biased towards more literary and formal styles of writing, and did not include speech at all.

Computer-based studies of lexicon With a computerized corpus and appropriate software, both significant and more trivial but interesting facts about the lexicon of a language can be uncovered. Table 3.2 The rank ordering of the 50 most frequent words in various corpora shows remarkable consistency and systematic differences.

Computer-based studies of lexicon Consistency: all the words except said are function words.
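
A rank ordering of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the procedure behind Table 3.2: the tokenizer, the function names and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word tokens (letters and apostrophes only).
    This crude tokenizer is an assumption for the example."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(text, n=50):
    """Return the n most frequent word types with their counts, most frequent first."""
    return Counter(tokenize(text)).most_common(n)

# Hypothetical sample text, not taken from any of the corpora discussed.
sample = "The cat sat on the mat and the dog sat by the door."
for rank, (word, count) in enumerate(top_words(sample, 3), start=1):
    print(rank, word, count)
```

Even in a toy text the top rank goes to a function word (the); on a real corpus the same counting would be run over millions of tokens.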

Word Occurrence The fact that 40% of the words in a corpus of over five million words occur only once shows that a corpus of even that size is not a sound basis for lexicographical studies of low-frequency words.
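
The share of words that occur only once (hapax legomena) can be computed as below. This is a sketch under one stated assumption: the proportion is taken over word types, not running tokens. The function name and sample tokens are hypothetical.

```python
from collections import Counter

def hapax_ratio(tokens):
    """Proportion of word types that occur exactly once in the token list.
    Assumption: the ratio is taken over types, not over running tokens."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Hypothetical toy input: 4 types (a, b, c, d), of which b, c and d occur once.
print(hapax_ratio("a b a c d".split()))
```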

Word Occurrence Sharman found that there was an almost linear relationship between vocabulary size and corpus size. A new word appeared in the text approximately every 30 words on average. The more narrowly focused the corpus, the more content words find their way into the higher frequency levels.
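
The relationship between corpus size and vocabulary size can be traced by recording the number of distinct types seen after every so many tokens. The sketch below is an illustration only; the function name, the step parameter and the toy token list are assumptions, and it does not reproduce Sharman's method.

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens_read, types_seen) pairs sampled every `step` tokens.
    A final pair is added if the text does not end on a step boundary."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    if tokens and len(tokens) % step != 0:
        points.append((len(tokens), len(seen)))
    return points

# Hypothetical toy input; a real study would read a tokenized corpus.
print(vocabulary_growth("a b a c b d".split(), step=2))
```

Plotting the pairs for a real corpus shows how quickly new types keep appearing as more text is read.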

Word Classes Table 3.5 (written English): Relative proportions of major word classes in the Brown and LOB corpora. As shown in Table 3.6 (spoken English), fewer nouns and a considerable proportion of discourse items characteristic of spoken English are noteworthy.

Word Classes Table 3.7 shows that some sequences such as adjective + noun or noun + noun are very frequent indeed. Johansson and Hofland: occurrence of the 40 most frequent sequences of word-class tags at the beginnings and ends of sentences. Findings: the ends of sentences may be more predictable grammatically than the beginnings.
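
Counting sequences of word-class tags presupposes a tagged corpus. The sketch below assumes the input is already a list of (word, tag) pairs; the tag labels are simplified placeholders, not the Brown or LOB tagsets, and the function name is hypothetical.

```python
from collections import Counter

def tag_bigrams(tagged_sentence):
    """Count adjacent pairs of word-class tags in one tagged sentence.
    `tagged_sentence` is a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged_sentence]
    return Counter(zip(tags, tags[1:]))

# Hypothetical hand-tagged sentence with placeholder tag labels.
tagged = [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"),
          ("bought", "VERB"), ("fresh", "ADJ"), ("bread", "NOUN")]
for pair, count in tag_bigrams(tagged).most_common(2):
    print(pair, count)
```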

Register studies Table 3.8: There are certain characteristics of the vocabulary of scientific English. Certain relational words are disproportionately more frequent in scientific English. Comparative adjectives and adverbs are similarly disproportionately frequent, whereas locative adverbs of space or time are disproportionately less frequent in scientific texts than in general written American English. Some items occur in one variety but are highly unlikely to occur in the other.

Semantic information Longman Dictionary of Contemporary English noun entries 23,800 67% one sense % two senses % three senses % four senses 595

Semantic information Verb % one sense % 2 senses % three senses % four senses 348

Collocation Some words tend to occur in the company of other words in certain contexts, e.g. pouring rain, statistically significant, intrinsic value, strong tendency. Lexicalized unit: set phrase, idiomatic usage, cliché

Collocation Interest in recurring word combinations: Wong-Fillmore (1976): The strategy of acquiring formulaic speech is central to the learning of language.

Collocation Peters (1980): unanalyzed sequences of words had a significant role among the units of language acquisition and proposed ways for identifying such unanalyzed sequences. Nattinger and DeCarrico (1992): since first language learners can be seen to use varying, apparently unanalyzed, prefabricated chunks of speech, second language teaching might similarly be concentrated around the establishment of what they call lexical phrases.

Collocation Different characteristics of the sequence: Allow for no alteration (it's as easy as falling off a log). Allow certain changes (at the moment/at certain moments). Relatively free within a framework (too...to, n...of).

Collocation Problems in the definition of collocation: How often does a combination have to recur to be habitual? Who decides what sounds natural? Does a combination have to be well-formed or canonical to be a collocation? Do collocations have to be syntactic or are they primarily semantic? Do collocations have to consist of adjacent words or can they be discontinuous?

Collocation Can a sequence which occurs only once in a particular corpus but which is intuitively recognized by native speakers as a sequence they have heard before be listed as a collocation? How big does a corpus have to be in order to establish that a collocation does exist? Are there degrees of collocationality based on the flexibility of the bonding between words?

Collocation Can we lemmatize collocations so that similar or inflectionally related sequences are counted as a single collocation type? Can degrees of collocationality be established on the basis of the number of tokens of a type in a particular corpus?

Collocation Sinclair (1991) suggested that a span of up to four words on each side of a word is the environment in which collocation is most likely to occur although, of course, computer software makes it possible to explore much larger spans, including the size of a whole text.
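
The span idea can be illustrated with a simple window count around a node word. This is a minimal sketch: the function name, parameters and sample sentence are assumptions, and it returns raw co-occurrence counts only, with no test of statistical significance.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on each side of every
    occurrence of `node`. Returns raw co-occurrence counts only."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]    # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]   # up to `span` words after the node
        counts.update(left + right)
    return counts

# Hypothetical sample sentence, already tokenized and lowercased.
tokens = "the pouring rain fell and the rain stopped".split()
print(collocates(tokens, "rain", span=1))
```

Widening span beyond four, up to the length of the whole text, corresponds to the larger environments that software makes it possible to explore.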

Tense and aspect of verbs Table 3.14 Rank order of the most frequent simple and complex finite verb forms Table 3.15 Relative frequencies of use of finite verb forms Table 3.16 Perfect and progressive verb forms in the Brown Corpus Table 3.17 Finite and non-finite verb forms Table 3.18 past participle

Modals Table 3.19 frequency of nine modals Table 3.20 use of modals Table 3.21 use of modals in verb-phrase structures

Voice Table 3.22 active and passive predications Table 3.23 use of passives in different registers Table 3.24 verb-phrase structure of agentive passives

Verb and particle use Subjunctive Prepositions Conjunctions

Grammatical studies Corpus-based grammatical studies have revealed considerable genre differences in the use of syntactic patterns and in sentence length. Syntactic constructions are not in free variation. Grammatical study is more of a challenge than lexical study because the tagging and parsing needed for the automatic analysis of texts, and the software developed for them, have not been widely available or user-friendly.

Sentence length Sentence length is related to genre. The mean number of words per sentence in informative categories is much greater than in imaginative prose. There is much closer consistency in the number of predications per sentence regardless of genre. Table Sentence length and predications
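
Mean sentence length in words can be estimated as sketched below. The sentence splitter is a deliberate simplification (it breaks on ., ! and ? and would mishandle abbreviations), and the function name and sample text are assumptions for the example.

```python
import re

def mean_sentence_length(text):
    """Mean number of whitespace-separated words per sentence.
    Assumption: sentences end at '.', '!' or '?', which is only a rough heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)

# Hypothetical sample: one 3-word sentence and one 6-word sentence.
print(mean_sentence_length("The cat sat. It purred loudly on the mat!"))
```

Running the same function over texts from different genre categories would allow the kind of comparison summarized in the table.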

Syntactic processes Clause patterning Table 3.42 Distribution of recurrent verb-complement patterns: SVC (adj.) 45%, SVO 20.9%

Syntactic processes About half of the clauses are matrix clauses and half are embedded. Of the matrix clauses, 97.8% are finite, 1.5% are nonfinite, and 0.7% are elliptical. The vast majority of all informational subject clauses are extraposed (it is necessary that), reflecting a principle of end-focus from a functional sentence perspective or preferences in sentence organization for processing purposes.

Syntactic processes In informative prose the verb which precedes a finite that clause is more likely to be a communication verb such as say, state, whereas in spoken conversation affective or cooperative verbs such as think, feel, hope tend to predominate.

Noun modification 98% of postmodifying clauses had one or other of the simpler clause patterns SVO (37%), SVO (38%), SVC (38%), suggesting that embedding tends to favor less complex sentence patterns. 70% of noun phrases function as subjects or prepositional complements, and noun phrases with postmodifying clauses tend to be disfavoured in subject functions.

Noun modification Postmodification is less frequent in nonfinal positions of sentences. This is because the subject or topic is familiar enough not to need identification or elaboration through postmodification, or because brief subjects are easier to process.

Causation Causation can be marked lexically (because, cause), syntactically (because of) or by implicature. The choice of how to express causation is seldom free, but is influenced by various semantic, pragmatic, stylistic, cognitive and textual variables.

Pragmatics Table 3.5 Distribution of discourse items Comparisons of spoken and written English Table 3.58 pretty Table 3.59 really just right