Introduction: corpora, corpus use, and the British National Corpus
Dr. Ylva Berglund Prytz



Outline
- Presentation: Corpora, corpus use, and the BNC
- Demonstration: How to use the BNC with Xaira
- Hands-on: BNC with Xaira
- Presentation: Using the BNC for teaching and research
- More hands-on: exploring more
- Questions and answers

At the end of today you should
- have a basic working knowledge of
  - corpora and corpus use
  - the BNC
  - Xaira
- feel confident using Xaira
- be able to explore the area on your own
- know where to turn for help and advice

Approaches to linguistic study
Intuition
- “Feel” what is right/wrong/possible
- One person’s language
- Subjective
Study of usage
- Examine what is actually said/written
- Several people
- Objective

How do you study usage?
- Examine naturally occurring language
- Draw conclusions
Need a sample of language, produced by different people in various contexts
Find a corpus!

What is a corpus?
- A collection of naturally occurring language data compiled to mirror a language/language variety
- (Usually) computer-readable
- (Usually) contains more than text (annotation, metadata)

What is a corpus? – some definitions
- A corpus can be defined as a collection of texts assumed to be representative of a given language. (Tognini-Bonelli 2001: 2)
- A corpus is a collection of naturally-occurring language text, chosen to characterise a state or variety of language. (Sinclair 1991: 171)
- All the material included in a corpus, whether spoken, written […] is assumed to be taken from genuine communications of people going about their normal business. (ibid: 55)

How can a corpus help?
- Look for patterns to see regularities
- Quantify
- See several examples
- Real language – language in use
- Based on a variety of sources

Types of corpora
- Balanced corpora (= reference or general corpora)
- Specialised corpora
  - Genre-specific, LSP (e.g. English for Academic Purposes) …
  - Varieties (dialectal, social, historical)
  - Learner language, English as a Lingua Franca
- Multilingual corpora
  - Parallel corpora (translations; alignable)
  - Comparable corpora (similar texts)
- Fixed size / monitor corpora
- Mode and medium
  - Written, spoken and transcribed, spoken with audio, video

Famous corpora
- Brown family (Brown, LOB, FLOB)
  - 1 million words, different text categories
- Bank of English
  - Monitor corpus, grows with time
- International Corpus of English (ICE)
  - Different national varieties of English; 1 million words, written and spoken
- British National Corpus
  - Reference corpus, fixed, 100 million words, written and spoken

British National Corpus (BNC)

What is the BNC?
- A snapshot of British English, taken at the end of the 20th century
- 100 million words in approx. 4,000 different text samples, both spoken (10%) and written (90%)
- Synchronic (1990-4), sampled, general-purpose corpus
- Available under licence; latest edition is the BNC XML edition (March 2007)

More than text
- Metadata
  - About text, author/speaker, audience
- Structural & typographical information
  - Paragraph, sentence, heading, list, bold
- Extra-linguistic information
  - Voice quality, noise, pauses, overlap
- Linguistic information
  - Part-of-speech

Who produced the BNC and why?
- a consortium of dictionary publishers and academic researchers
  - OUP, Longman, Chambers
  - OUCS, UCREL, BL R&D
- with funding from DTI/SERC under JFIT
- Lexicographers, NLP researchers
- But not language teachers!

Stated project goals
- A synchronic (1990-4) corpus of samples, both spoken and written, from the full range of British English language production
- of non-opportunistic design, for generic applicability
- with word class annotation
- and contextual information

Actual (?) project goals
- Better ELT dictionaries
  - authoritative
  - both speech and writing
- A model for European corpus work
  - design and encoding
- Industrial-academic co-operation
- A REALLY BIG corpus

Production of the BNC
- took three years (at least)
- cost GBP 1.6 million (at least)
- came about through an unusual coincidence of interests amongst:
  - Lexicographical publishers
  - Government (DTI)
  - Science and Engineering Research Council (SERC)

Project consequences
- industrial-scale text production system
- necessary compromises?
- technically over-ambitious?
- IPR and profitability
The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy.

How was the corpus created?

- Corpus design
- Text selection
- Clearance
- Capture
- Add additional information
- Merge
- (Documentation)
- Distribution

The BNC “sausage machine”
Stages shown in the diagram, in order:
- OUP
- Written (OUP/Chambers)
- Spoken (Longman)
- Initial CDIF conversion and validation (OUCS)
- Word class annotation (UCREL)
- Header generation and final validation (OUCS)
Phases: Selection, clearance, and capture; Enrichment and encoding; Documentation, distribution, maintenance

Text selection
- Design criteria
  - Types of texts
  - Sources
  - Number of samples
  - Size of samples
- Descriptive criteria
  - Additional information where available

Selection criteria: written texts
- Domain: imaginative (c. 25%), informative
- Medium: book, periodicals, misc. published, unpublished, written to be spoken
- Time ( , )

“Descriptive” criteria: written texts
- Sample size (number of words) and extent (start and end points)
- Topic or subject of the text
- Author's name, age, gender, region of origin, and domicile
- Target age group and gender
- "Level" of writing (reading difficulty): the more literary or technical a text, the "higher" its level

Selection criteria: spoken texts
Demographic (spoken conversation)
- transcriptions of spontaneous natural conversations made by recruited volunteers
- original recordings are available from the British Library
Context-governed (other spoken material)
- transcriptions of recordings made at specific types of meeting and event

Spoken texts: context-governed
Four broad categories of social context:
- Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials
- Business events, such as sales demonstrations, trades union meetings, consultations, interviews
- Institutional and public events, such as sermons, political speeches, council meetings
- Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins

Descriptive criteria: spoken texts
- Features relating to the speaker (age, sex, social class, dialect)
- Context of recording (place, time)
- Features of the recording (non-verbal events, paralinguistic phenomena, unclear instances)
- Included when known
- Sometimes provided by respondent

What is the result?

What is the BNC?
- 4,000+ texts
- Ca. 100,000,000 words
- 10% spoken
- Information about
  - the texts
  - the speakers/writers
  - the words
- Delivered with a search tool: XAIRA
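Because the spoken part is only about a tenth of the corpus, raw hit counts from the spoken and written parts are not directly comparable. A minimal sketch of the usual normalisation (frequency per million words) follows; the hit counts are hypothetical, and the part sizes are rounded from the 10% / 90% split given above.

```python
def per_million(count, corpus_size):
    """Convert a raw hit count to a frequency per million words."""
    return count * 1_000_000 / corpus_size

# Rounded part sizes, assuming a 100-million-word corpus split 10% / 90%.
SPOKEN_WORDS = 10_000_000
WRITTEN_WORDS = 90_000_000

# Hypothetical raw counts for some search term, for illustration only.
hits = {"spoken": 1_500, "written": 4_500}

spoken_rate = per_million(hits["spoken"], SPOKEN_WORDS)
written_rate = per_million(hits["written"], WRITTEN_WORDS)
print(f"spoken:  {spoken_rate:.1f} per million words")
print(f"written: {written_rate:.1f} per million words")
```

With these invented numbers the term has more hits in writing but is relatively more frequent in speech, which is the kind of contrast raw counts would hide.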

What's in the BNC?

What topics?

Post-hoc text-type classification

Format Corpus header (1) Corpus texts (4,000+) Text Text header …

Annotation, encoding, markup
A means of making explicit, and thus processable:
- structure: texts, sections, paragraphs, turns, sentences, words...
- metadata: text-type, situational parameters, context
- analysis: morphology, syntactic function, translation

Word class annotation
- CLAWS (Leech, Garside et al) approach
- What counts as a word? This isn't prima facie obvious, in spite of spelling conventions.
- In BNC-XML, each word is explicitly marked and annotated with
  - a root form or lemma
  - an automatically assigned C5 word class code
  - a simplified POS code

Example: word class annotation
Difficulty is being expressed with the method to be used to launch the scheme.
- c5 = detailed part-of-speech
- hw = head word (new)
- pos = simple part-of-speech (new)
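The markup of the example sentence did not survive in this transcript. As a rough sketch of what word-level annotation with the c5, hw and pos attributes can look like, the fragment below tags the first four words of the sentence and reads them back with Python's standard XML parser. The `<w>`, `<c>` and `<s>` element names and all attribute values are assumptions made for illustration; they are not copied from the corpus files.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the style of BNC-XML. The tagging below is an
# assumption for this sketch and covers only the start of the sentence.
SAMPLE = (
    '<s n="1">'
    '<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="VBG" hw="be" pos="VERB">being </w>'
    '<w c5="VVN" hw="express" pos="VERB">expressed </w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def word_annotations(sentence_xml):
    """Return (form, headword, c5, pos) for every <w> element."""
    sentence = ET.fromstring(sentence_xml)
    return [
        (w.text.strip(), w.get("hw"), w.get("c5"), w.get("pos"))
        for w in sentence.iter("w")
    ]

for row in word_annotations(SAMPLE):
    print(row)
```

Note how "is" and "being" share the head word "be": searching on hw finds all forms of a lemma, while c5 and pos narrow a search to a word class.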

Some BNC-XML elements
- […] or […] = section
- <p> = paragraph or <u> = utterance
- <s> = “sentence”
- <w> = word and <c> = punctuation
- <mw> = multiword unit

What is the markup for?
It makes it possible for you to
- distinguish aids=SUBST from aids=VERB
- distinguish occurrences in writing from ones in speech
- distinguish occurrences in headings from ones in paragraphs
- identify contextual units like sentences and paragraphs
FACTSHEET WHAT IS AIDS? AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
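A small sketch of how such distinctions can be drawn once the markup is present: it counts the form "aids" separately by simplified POS code and by whether it occurs in a heading or a paragraph. The fragment is invented for illustration (the third sentence is not from the corpus), and the `<head>`, `<p>`, `<s>`, `<w>` element names and tag values are assumptions about BNC-XML conventions rather than verified corpus content.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; tagging and the last sentence are invented.
SAMPLE = """
<div>
  <head><s n="1"><w c5="DTQ" hw="what" pos="PRON">WHAT </w><w c5="VBZ" hw="be" pos="VERB">IS </w><w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s></head>
  <p><s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="condition" pos="SUBST">condition</w><c c5="PUN">.</c></s></p>
  <p><s n="3"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VVZ" hw="aid" pos="VERB">aids </w><w c5="NN1" hw="recovery" pos="SUBST">recovery</w><c c5="PUN">.</c></s></p>
</div>
""".strip()

def count_form(root, form):
    """Count a word form by (simplified POS, enclosing element)."""
    counts = Counter()
    for container in root.iter():
        if container.tag not in ("head", "p"):
            continue
        for w in container.iter("w"):
            if w.text.strip().lower() == form:
                counts[(w.get("pos"), container.tag)] += 1
    return counts

counts = count_form(ET.fromstring(SAMPLE), "aids")
for (pos, place), n in sorted(counts.items()):
    print(f"aids as {pos} in <{place}>: {n}")
```

A plain-text search for "aids" would return all three hits undifferentiated; the annotation is what lets the noun in the heading be separated from the verb in the paragraph.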

Who uses the BNC (and how?)
- Linguists: research on (English) language
- Teachers: reference, generating teaching materials, in the classroom
- Publishers: dictionaries, EFL textbooks
- Language engineers: language + computer tools, AI, NLP
- Students/language learners
- Computer scientists: information retrieval
- Psychologists/neurologists: general ‘norm’ or reference
- Lexicographers
- NLP researchers

What makes the BNC so special?
- Size
- Design
- General availability
- Standardized markup system
- Structural annotation
- Word class annotation
- Contextual information
- Model for other projects
...in these respects, the BNC remains distinctive, twenty years on!

How to use the BNC (with Xaira)

The BNC can be used in different ways and with different tools
- User needs to know
  - What information is available
  - Where/how information is coded
XAIRA can help

Search for
- Words or phrases
- Word class information
- Annotation/mark-up
- or a combination of them

Display
- Search term with context
  - with or without mark-up
- Information about text
- Collocations (co-occurring words)
- Distribution across parts of the corpus
and much more
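Two of the displays listed above, the search term in context and collocations, can be sketched in a few lines of plain Python. This is not how Xaira does it; it is a toy illustration on an invented sentence, with whitespace tokenisation and a fixed window as simplifying assumptions.

```python
from collections import Counter

def kwic(tokens, node, width=3):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w.lower() for w in window)
    return counts

# Invented sample text, for illustration only.
TEXT = ("a corpus is a collection of texts and a large corpus "
        "is a useful sample of language in use")
tokens = TEXT.split()

for line in kwic(tokens, "corpus"):
    print(line)
print(collocates(tokens, "corpus").most_common(3))
```

Raw co-occurrence counts like these are dominated by frequent function words, which is why corpus tools usually also offer association scores for ranking collocates.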

XAIRA – XML-aware retrieval application
- Searches an index of the corpus
- Uses information in the headers and the texts
- Often more than one way to make a search
- Can be used with other corpora (if they are indexed first)
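To give a feel for why a corpus is indexed before it is searched, here is a toy inverted index: each word form maps to the places where it occurs, so a query looks up positions instead of scanning every text. This is a generic sketch and makes no claim about Xaira's actual index format; the text identifiers and contents are hypothetical.

```python
from collections import defaultdict

def build_index(texts):
    """Map each lower-cased word form to a list of (text_id, position)."""
    index = defaultdict(list)
    for text_id, text in texts.items():
        for position, token in enumerate(text.split()):
            index[token.lower()].append((text_id, position))
    return index

def find_phrase(index, phrase):
    """Locate a multi-word phrase by checking adjacent index positions."""
    words = phrase.lower().split()
    hits = []
    for text_id, start in index.get(words[0], []):
        if all((text_id, start + k) in index.get(w, [])
               for k, w in enumerate(words[1:], start=1)):
            hits.append((text_id, start))
    return hits

TEXTS = {  # hypothetical text identifiers and contents
    "T01": "the corpus is indexed before it is searched",
    "T02": "an index makes a large corpus quick to search",
}
index = build_index(TEXTS)

print(index["corpus"])
print(find_phrase(index, "large corpus"))
```

The same idea extends to annotation: indexing attribute values such as word class codes alongside word forms is what makes combined word-plus-annotation queries fast.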
