Corpus design See G Kennedy, Introduction to Corpus Linguistics, Ch.2


Corpus design
See G. Kennedy, Introduction to Corpus Linguistics, Ch. 2; C. F. Meyer, English Corpus Linguistics, Ch. 2

What is a corpus?
Corpus (pl. corpora) = ‘body’
Collection of written text or transcribed speech
Usually but not necessarily purposefully collected
Usually but not necessarily structured
Usually but not necessarily annotated
(Usually stored on and accessible via computer)
Corpus ~ text archive

Issues in corpus design
General purpose vs specialized
Dynamic (monitor) vs static
Representativeness and balance
Size
Storage and access
Permission
Text capture and markup
Organizations

General purpose vs specialized
Probably obvious how to assemble a specialized corpus: appropriateness of texts for inclusion is self-defined
General-purpose corpus implies very careful planning to ensure balance
Implies making some assumptions about the nature of language, even though (as corpus linguists) that may go against the grain

Dynamic vs static
Static corpus will give a snapshot of language use at a given time
  Easier to control balance of content
  May limit usefulness, esp. as time passes (e.g. Brown corpus now of historical interest, in some respects BNC already out of date)
Dynamic corpus is ever-changing
  Called “monitor” corpus because it allows us to monitor language change over time
  But more or less impossible to ensure balance

Planned balance: example of BNC
Sampling and representativeness very difficult to ensure
BNC designers very explicit about their assumptions
Acknowledge that many decisions are subjective in the end
100m words of contemporary spoken and written British English
Representative of BrE “as a whole”
Balanced with regard to genre, subject matter and style
Also designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)

BNC
4,124 texts: 90% written, 10% spoken
Largest collection of spoken English ever collected (10m words), but reflects typical imbalance in favour of written text (for understandable practical reasons)
Written portion: 75% informative, 25% imaginative
Amount of fiction is slightly disproportionately high compared to amount published during the sampling period, justified because of cultural importance of fiction and creative writing

Subject coverage
Planned to reflect pattern of book publishing in UK over last 20 years

Subject            Number of texts   % of total written
Imaginative        625               22
World affairs      453               18
Social science     510               15
Leisure            374               11
Applied science    364               8
Commerce           284               8
Arts               259               8
Natural science    144               4
Belief & thought   146               3
Unclassified       50                3

Sources of written material
60% books
25% periodicals
5% brochures and other ephemera, e.g. bus tickets, produce containers, junk mail
5% unpublished letters, essays, minutes
5% plays, speeches (written to be spoken)

Register “levels”
30% literary or technical “high”
45% “middle”
25% informal “low”
Obvious difficulty of how to judge levels a priori

Spoken corpus
Context-governed material
  Lectures, tutorials, classrooms
  News reports
  Product demonstrations, consultations, interviews
  Sermons, political speeches, public meetings, parliamentary debates
  Sports commentaries, phone-ins, chat shows
  Samples from 12 different regions

Spoken corpus
Ordinary conversation
  2000 hrs from 124 volunteers, 38 different regions
  Four different socio-economic groupings
  Equal male and female, age range 15 to 60+
  All conversations over a 2-day period recorded
  No secret recording, and participants allowed to erase
  Systematic details kept of time, location, details of participants (sex, age, race, occupation, education, social group), topic, etc.
Transcription issues: include false starts, hesitations, etc.; some paralinguistic features (shouting, whispering); use of dialect words/grammar; but no phonetic information

Another example: ICE
Collection of samples of English as spoken/written around the world
Common design (as well as common annotation scheme, and shared tools for exploitation)
  500 texts of approximately 2,000 words each
  60% spoken, 40% written
  Specific domains and genres prescribed
Prescribing common design in this way makes the corpora comparable

ICE text categories
Each sample should be 2000 words

Spoken (300)
  Dialogues (180)
    Private (100)
      Conversations (90)
      Phone calls (10)
    Public (80)
      Class lessons (20)
      Broadcast discussions (20)
      Broadcast interviews (10)
      Parliamentary debates (10)
      Cross-examinations (10)
      Business transactions (10)
  Monologues (120)
    Unscripted (70)
      Commentaries (20)
      Unscripted speeches (30)
      Demonstrations (10)
      Legal presentations (10)
    Scripted (50)
      Broadcast news (20)
      Broadcast talks (20)
      Non-broadcast talks (10)
Written (200)
  Non-printed (50)
    Student writing (20)
      Student essays (10)
      Exam scripts (10)
    Letters (30)
      Social letters (15)
      Business letters (15)
  Printed (150)
    Academic (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Popular (40)
    Reportage (20)
      Press reports (20)
    Instructional (20)
      Administrative writing (10)
      Skills/hobbies (10)
    Persuasive (10)
      Editorials (10)
    Creative (20)
      Novels (20)
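
The category counts above can be checked mechanically. The following is a minimal sketch, assuming the hierarchy is copied into a nested dictionary; the name `ICE_DESIGN` and the helper `count_texts` are hypothetical and not part of any ICE tooling. "Popular" is entered as a single leaf of 40 texts because the slide lists no subcategories for it.

```python
# Text counts per category, copied from the slide; leaves are numbers of texts.
# ICE_DESIGN and count_texts are illustrative names, not an official ICE format.
ICE_DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"Conversations": 90, "Phone calls": 10},
            "Public": {
                "Class lessons": 20,
                "Broadcast discussions": 20,
                "Broadcast interviews": 10,
                "Parliamentary debates": 10,
                "Cross-examinations": 10,
                "Business transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "Commentaries": 20,
                "Unscripted speeches": 30,
                "Demonstrations": 10,
                "Legal presentations": 10,
            },
            "Scripted": {
                "Broadcast news": 20,
                "Broadcast talks": 20,
                "Non-broadcast talks": 10,
            },
        },
    },
    "Written": {
        "Non-printed": {
            "Student writing": {"Student essays": 10, "Exam scripts": 10},
            "Letters": {"Social letters": 15, "Business letters": 15},
        },
        "Printed": {
            "Academic": {
                "Humanities": 10,
                "Social Sciences": 10,
                "Natural Sciences": 10,
                "Technology": 10,
            },
            "Popular": 40,  # no subcategories given on the slide
            "Reportage": {"Press reports": 20},
            "Instructional": {"Administrative writing": 10, "Skills/hobbies": 10},
            "Persuasive": {"Editorials": 10},
            "Creative": {"Novels": 20},
        },
    },
}

WORDS_PER_TEXT = 2000  # target sample length stated on the slide


def count_texts(node):
    """Sum the leaf counts below a node of the design tree."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


total_texts = count_texts(ICE_DESIGN)
spoken_texts = count_texts(ICE_DESIGN["Spoken"])
written_texts = count_texts(ICE_DESIGN["Written"])

print("texts:", total_texts)
print("spoken share:", spoken_texts / total_texts)
print("target size in words:", total_texts * WORDS_PER_TEXT)
```

Run as written, the leaves sum to 500 texts (300 spoken, 200 written), which matches the 60%/40% split and gives a target size of one million words per ICE component.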

Length of corpus
Resources available to create and manage corpus determine how long it can be
  Funding, researchers, computing facilities
Speech is easy to capture, but much more time-consuming to process than written language
Transcription and annotation requires
  6 person-hours per 1 minute of speech (Santa Barbara Corpus of Spoken American English)
  4 person-hours per 1,000 words of written sample, but between 5 and 10 person-hours per 1,000 words of speech (more for dialogues due to overlapping speech) (International Corpus of English)
On this basis, the American component of ICE would take one researcher working 40 hrs/week about 3 years to complete
BNC is 100 times bigger than that
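
The "3 years" figure can be reproduced from the ICE rates quoted above. This is a rough sketch, not the original calculation: it assumes the standard ICE design (500 texts of 2,000 words, 60% spoken) and a hypothetical 52 working weeks per year with no holidays.

```python
# Back-of-envelope effort estimate for one ICE component.
# Rates are the ICE figures quoted on the slide; WEEKS_PER_YEAR is an assumption.
TOTAL_WORDS = 500 * 2000            # 500 texts of about 2,000 words
SPOKEN_SHARE = 0.6                  # 60% spoken, 40% written
HOURS_PER_1000_WRITTEN = 4
HOURS_PER_1000_SPOKEN = (5, 10)     # lower and upper bound
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52                 # assumed: no holidays or other duties

spoken_thousands = TOTAL_WORDS * SPOKEN_SHARE / 1000
written_thousands = TOTAL_WORDS * (1 - SPOKEN_SHARE) / 1000

written_hours = written_thousands * HOURS_PER_1000_WRITTEN
low_hours = written_hours + spoken_thousands * HOURS_PER_1000_SPOKEN[0]
high_hours = written_hours + spoken_thousands * HOURS_PER_1000_SPOKEN[1]

hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR
low_years = low_hours / hours_per_year
high_years = high_hours / hours_per_year

print(f"{low_hours:.0f} to {high_hours:.0f} person-hours")
print(f"{low_years:.1f} to {high_years:.1f} person-years")
```

Under these assumptions the estimate is 4,600 to 7,600 person-hours, or roughly 2.2 to 3.7 years for a single full-time researcher, which is consistent with the figure of about 3 years.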

Length of corpus
Length is also determined by the use to which the corpus will be put
Corpora for lexicographic use need to be (much) bigger
Early corpora (1m words) seemed huge, mainly due to limitations of computers to process them
Sinclair (1991) described a 20m word corpus as “small but nevertheless useful”
Even in a billion-word corpus, data for some words/constructions would be sparse
How many tokens of a linguistic item are needed for descriptive adequacy?
Typically 40-50% of all word types occur only once in a given text (or corpus)
For polysemous words at least half of the possible meanings will occur only once (if at all)
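
The share of word types that occur only once can be measured directly on any text. A minimal sketch in Python follows; the function name `hapax_share`, the simple regular-expression tokeniser and the toy sentence are all illustrative assumptions, and a sample this small says nothing about the 40-50% figure for real corpora.

```python
import re
from collections import Counter


def hapax_share(text):
    """Proportion of word types that occur exactly once in `text`."""
    # Crude tokenisation: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0
    once = sum(1 for n in counts.values() if n == 1)
    return once / len(counts)


sample = "the cat sat on the mat and the dog sat on the rug"
print(hapax_share(sample))  # 5 of the 8 types occur once: 0.625
```

On a real corpus, the same function would be applied to the full text; the result depends heavily on tokenisation choices such as case folding and the treatment of punctuation.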

“Type” and “token”
“Token” means an individual occurrence of a word
“Type” means a distinct word, counted once however often it occurs
The man saw the girl with the telescope: 8 tokens, 6 types
“Type” may refer to lexeme, or individual word form
  run, runs, ran, running: 1 or 4 types?
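
The counts for the example sentence can be reproduced in a few lines. This sketch assumes whitespace tokenisation and case folding (so "The" and "the" are one type); the `LEMMAS` table is a hypothetical hand-written mapping used only to illustrate the lexeme versus word-form question.

```python
from collections import Counter

sentence = "The man saw the girl with the telescope"

# Tokens: every running word. Types: distinct words after lower-casing.
tokens = sentence.lower().split()
type_counts = Counter(tokens)

print(len(tokens), "tokens")       # 8 tokens
print(len(type_counts), "types")   # 6 types ("the" occurs three times)

# Word-form types vs lexeme types for the forms of RUN.
# LEMMAS is a toy lookup table, not a real lemmatiser.
LEMMAS = {"run": "run", "runs": "run", "ran": "run", "running": "run"}
forms = ["run", "runs", "ran", "running"]

word_form_types = set(forms)
lexeme_types = {LEMMAS[form] for form in forms}

print(len(word_form_types), "word-form types")  # 4
print(len(lexeme_types), "lexeme type")         # 1
```

Whether a study counts 1 or 4 types here depends on whether the corpus is lemmatised, so type counts should always state which definition is used.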

Also, how much data can a lexicographer absorb?
Some attempts to base corpus size on known statistics of existing corpora
Biber (1993): “reliable information” on frequently occurring linguistic items such as nouns can be got from a 120k-word sample, while an infrequently occurring construction such as the conditional clause would need 2.4m words
How are such figures arrived at? Observe the point at which measures stabilise
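
One way to picture "observing the point at which measures stabilise" is to track the rate of an item per 1,000 tokens as the sample grows. The sketch below is an illustration only and is not Biber's (1993) procedure: the function names, the tolerance criterion and the synthetic token stream are all assumptions made for the example.

```python
def cumulative_rates(tokens, is_target, step):
    """Rate of target items per 1,000 tokens at each cumulative sample size."""
    rates = []
    hits = 0
    for i, token in enumerate(tokens, start=1):
        if is_target(token):
            hits += 1
        if i % step == 0:
            rates.append((i, 1000 * hits / i))
    return rates


def stabilisation_point(rates, tolerance):
    """Smallest sample size from which every later rate stays within
    `tolerance` of the rate measured on the full sample."""
    final_rate = rates[-1][1]
    point = None
    for size, rate in rates:
        if abs(rate - final_rate) <= tolerance:
            if point is None:
                point = size
        else:
            point = None  # drifted out again, so restart the search
    return point


# Synthetic stream: a bursty opening (50 targets in the first 100 tokens),
# then a steady 1-in-5 pattern. "N" stands for the item of interest.
tokens = ["N"] * 50 + ["o"] * 50 + ["N", "o", "o", "o", "o"] * 180

rates = cumulative_rates(tokens, lambda t: t == "N", step=100)
for size, rate in rates:
    print(size, round(rate, 1))

print("stable from", stabilisation_point(rates, tolerance=10), "tokens")
```

With this toy data the rate falls from 500 per 1,000 at 100 tokens to 230 per 1,000 at 1,000 tokens, and stays within 10 of the final value from 800 tokens onward. A rarer item would need a much longer stream before its rate settled, which is the intuition behind the different sample sizes for nouns and conditional clauses.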