

Definition of a corpus
Research on written or spoken texts can now be carried out with the methods of corpus linguistics. The notion of a corpus as the basis for a form of empirical linguistics differs from the examination of single texts in several fundamental ways. In principle, any collection of more than one text can be called a corpus (corpus is Latin for "body", hence a corpus is any body of text). In modern linguistics, however, the term "corpus" usually carries more specific connotations than this simple definition.

Size
The term "corpus" also implies a body of text of finite size, for example 1,000,000 words. This is not universally so: at Birmingham University, John Sinclair's COBUILD team has been engaged in the construction and analysis of a monitor corpus. This "collection of texts", as Sinclair's team prefers to call it, is an open-ended entity: texts are constantly being added to it, so it keeps growing. Monitor corpora are of interest to lexicographers, who can trawl a stream of new texts looking for the occurrence of new words or for changing meanings of old words.

Helsinki Corpus
Text identifier
Name of text
Author's name
Sub-period
Date of original
Date of manuscript
Contemporaneity of original and manuscript
Dialect
Verse or prose
Text type
Relationship to foreign original
Language of foreign original
Relationship to spoken language
Sex of author
Age of author
Author's social status
Audience description
Participant relationship
Interactive/non-interactive
Formal/informal
Prototypical text category
Sample

Corpora in language teaching
Resources and practices in the teaching of languages and linguistics tend to reflect the division between the empirical and rationalist approaches. Many textbooks contain only invented examples, and their descriptions are based upon intuition or second-hand accounts. Other books, however, are explicitly empirical and use examples and descriptions from corpora or other sources of real-life language data. Corpus examples are important in language learning because they expose students to the kinds of sentences they will encounter when using the language in real-life situations.

Frequency counts
This is the most straightforward approach to working with quantitative data. Items are classified according to a particular scheme, and an arithmetical count is made of the number of items (or tokens) within the text that belong to each classification (or type) in the scheme. For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word appears in the corpus, resulting in a list which might look something like:
abandon: 5
abandoned: 3
abandons: 2
ability: 5
able: 28
about: 128
etc.
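The word-by-word count described above can be sketched in a few lines of Python. This is only an illustration: the function name word_frequencies, the sample sentence and the very simple tokenisation rule (lowercase runs of letters and apostrophes) are assumptions made for the example, not part of the original presentation.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Count how often each word form (type) occurs among the tokens of text."""
    # Assumed tokenisation: lowercase the text and keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, invented for illustration only.
sample = ("She was able to abandon the plan. "
          "He abandoned it too, but was not able to say why.")
freqs = word_frequencies(sample)

# Print an alphabetical frequency list in the "word: count" style shown above.
for word, count in sorted(freqs.items()):
    print(f"{word}: {count}")
```

A real corpus would need a more careful tokeniser (hyphens, numbers, punctuation inside words), but the counting step itself stays the same.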

Proportions
Frequency counts are useful, but they have certain disadvantages when one wishes to compare one data set with another, for example a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurrences of each type; they do not indicate the prevalence of a type as a proportion of the total number of tokens in the text. This is not a problem when the two corpora being compared are of the same size, but when they are of different sizes, raw frequency counts are of little use.

Proportions (cont.)
The following example compares two such corpora, looking at the frequency of the word boot:
Type of corpus     Number of words   Number of instances of boot
English Spoken     50,000            50
English Written    500,000           500
A brief look at the table seems to show that boot is more frequent in written than in spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus), we get:
spoken English: 50/50,000 × 100 = 0.1%
written English: 500/500,000 × 100 = 0.1%
Relative to corpus size, then, boot is equally frequent in both corpora.
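The same calculation can be written as a small Python sketch. The function name relative_frequency and the dictionary layout are assumptions made for illustration; the figures are the ones from the boot example.

```python
def relative_frequency(count, corpus_size):
    """Return a raw count as a percentage of all tokens in the corpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 100 * count / corpus_size


# (instances of "boot", total number of words), taken from the example.
corpora = {
    "English Spoken": (50, 50_000),
    "English Written": (500, 500_000),
}

for name, (hits, size) in corpora.items():
    print(f"{name}: {relative_frequency(hits, size):.1f}%")
# prints:
# English Spoken: 0.1%
# English Written: 0.1%
```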

Collocations
The idea of collocations is an important one in many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.

Collocations 2
Given a text corpus, it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them, by comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that their co-occurrence is simply the result of chance. For example, the words riding and boots may occur as a joint event because they belong to the same multiword unit (riding boots), while the words formula and borrowed may simply co-occur through a one-off juxtaposition and have no special relationship.
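One common way to make this comparison concrete is pointwise mutual information, a form of the mutual information score mentioned in connection with Church et al. elsewhere in this presentation. The sketch below is a minimal illustration only: it assumes adjacent word pairs (bigrams), simple relative-frequency estimates of the probabilities, and an invented toy token list; the function name pmi is hypothetical.

```python
import math
from collections import Counter


def pmi(tokens, w1, w2):
    """Pointwise mutual information of w1 immediately followed by w2.

    Computes log2( P(w1, w2) / (P(w1) * P(w2)) ): values well above 0 suggest
    the pair co-occurs more often than chance alone would predict.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = joint / (n - 1)    # n - 1 adjacent pairs in a text of n tokens
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))


# Toy token list, invented for illustration only.
tokens = "riding boots are boots for riding and riding boots last".split()
score = pmi(tokens, "riding", "boots")
print(f"PMI(riding, boots) = {score:.2f}")
```

On very small samples such scores are unreliable, and rare pairs tend to receive inflated values, so in practice a minimum frequency threshold is usually applied before ranking pairs.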

Collocations 3
We can group similar collocates of a word together to help identify its different senses. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words such as investment, indicating the financial sense.
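As a rough sketch of the first step, the collocates of a node word can be gathered by counting the words that appear within a few tokens of it. The function name collocates, the window size and the two toy sentences are assumptions made for illustration; grouping the resulting words into senses would still be done by the analyst.

```python
from collections import Counter


def collocates(tokens, node, window=3):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Left context plus right context, excluding the node itself.
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


# Toy text, invented for illustration only.
text = ("we walked along the river bank at dusk "
        "the investment bank raised its rates")
counts = collocates(text.split(), "bank", window=2)
print(counts.most_common())
```

Here river points to the landscape sense and investment to the financial sense, while a function word such as the appears with both and says little about either.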

Collocations 4
We can discriminate the differences in usage between words which are similar. For example, Church et al. (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students who learn English as a foreign language.

the thing that started, at least to the naked eye
the surface that showed itself to the naked eye
smooth and featureless as glass to the naked eye
almost too small to see with the naked eye
colonies, often quite visible to the naked eye
devoid of plants at least to the naked eye
small, they are not always visible to the naked eye
these marine plants, which the naked eye
stars the size of Earth, invisible to the naked eye
The tape was defective, even to the naked eye