Lou Burnard BNC-XML: an introduction.


What is the BNC?
- a snapshot of British English, taken at the end of the 20th century
- 100 million words in approx. 4000 different text samples, both spoken (10%) and written (90%)
- synchronic (1990-4), sampled, general-purpose corpus
- available under licence; latest edition is BNC-XML (13 Mar 2007)

Production of the BNC
- managed by an academic-industrial consortium
- with significant government funding
- took three years (at least)
- cost GBP 1.6 million (at least)
- came about through an unusual coincidence of interests amongst:
  - Lexicographical publishers
  - Government (DTI)
  - Science and Engineering Research Council
- Target audience: lexicographers, NLP researchers
- But not language teachers!

Remember the Nineties?
- WinWord or WP5? The choice is yours
- On your desk... a 386 with 50 Mb disk space (just about enough to run Windows 3)
- In your lab... a VAX or a Sparc for serious work
- On the WWW (maybe)... Mosaic for X
- Little text in digital format
- Text encoding (under development)
  - TEI
  - SGML

Corpus linguistics 90s-style
- a world without the web!
- corpus linguistics
  - Traditionalists (ICAME)
  - Expansionists (LDC, monitor corpora)
- text encoding theory
- language engineering and NLP
- the JFIT mentality

Project Goals
- Stated
  - a synchronic (1990-4) corpus of samples, both spoken and written, from the full range of British English language production
  - of non-opportunistic design, for generic applicability
  - with word class annotation and contextual information
- Unstated
  - better, more authoritative learner dictionaries
  - a new template for European language resources
  - a REALLY BIG corpus

The BNC “sausage machine”
[Diagram: processing pipeline]
- Stages: selection, clearance, and capture; enrichment and encoding; documentation, distribution, maintenance
- Boxes, in diagram order: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF conversion and validation (OUCS); Word class annotation (UCREL); Header generation and final validation (OUCS)

Distinctive features of the BNC
- non-opportunistic design
- standardized markup system
  - structural annotation
  - word class annotation
- contextual information
- general availability
... in these respects, the BNC remains distinctive, twenty years on!

Why BNC XML?
- The BNC is still widely used... but the technology has moved on
- XML tools are everywhere... so using the corpus is much easier
- Conversion to XML was easy and (fairly) automatic... but with more tractable markup, some dusty corners needed sweeping out

What's in the BNC?

Needles and haystacks
- The BNC has an extraordinary range
  - travel agent brochures, weather reports, formal invitations, advertising, publicity leaflets, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, business letters, personal diaries and correspondence...
- The problem is finding the specific texts you want
  - Selection criteria
  - Descriptive criteria
  - Post-hoc categorization
  - (or use the WLD principle)

BNC Design
- Criteria for written texts (90%)
  - Medium (books, newspapers, unpublished…)
  - Domain (informative, entertaining…)
- Criteria for transcribed speech events (10%)
  - Context-governed half: predefined list of speech situations
  - Demographically sampled half: 200 volunteers, sampled for age, sex, region
- These selection criteria make up a taxonomy, which is defined in the corpus header

What topics?

Descriptive criteria
- spoken texts
  - speaker occupation, perceived accent, education level, personal relationship…
  - speech domain, region, locale…
- written texts
  - author age, sex, type
  - audience, circulation, status
  - text-type classification
- These criteria were used to maximize variation once selectional constraints had been applied

Post-hoc text-type classification

Annotation, encoding, markup
A means of making explicit, and thus processable:
- structure: texts, sections, paragraphs, turns, sentences, words...
- metadata: text-type, situational parameters, context
- analysis: morphology, syntactic function, translation
Adopting a single framework facilitates integration and sharing of fragmentary resources
- thus enhancing research outcomes
- also makes tool development much easier

BNC structure
[Diagram: the bnc root element contains bncDoc elements; each bncDoc holds a teiHeader followed by either a wtext (written) or an stext (spoken) element.]

BNC-XML structure
[Diagram: element hierarchy. wtext > div > p > s > w for written texts; stext > div > u > s > w for spoken texts. Counts shown on the diagram: 6,026,284; 98,363, ; ,484; 1,599,692.]

Word class annotation
- CLAWS (Leech, Garside et al.) approach
- What counts as a word? This isn't prima facie obvious, in spite of spelling conventions.
- In BNC-XML, each word is explicitly marked and annotated with
  - a root form or lemma
  - an automatically assigned C5 word class code
  - a simplified POS code

Words and multiwords
- English orthography can be misleading
- In BNC-XML, some “multiwords” are explicitly marked: "in spite of", as in "in spite of common sense..."
- Contractions are split: "it wasn't me" is marked up as "it was n't me"
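A minimal sketch of reading this word-level annotation with Python's standard library. It assumes the BNC-XML element names `w` and `mw` and the attribute names `hw` (lemma), `c5` (C5 code) and `pos` (simplified code); the fragment and its attribute values are made up for illustration and were not copied from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of BNC-XML (values are illustrative).
SAMPLE = """
<s n="1">
  <mw c5="PRP"><w c5="PRP" hw="in" pos="PREP">in </w><w c5="NN1" hw="spite" pos="SUBST">spite </w><w c5="PRF" hw="of" pos="PREP">of </w></mw>
  <w c5="AJ0" hw="common" pos="ADJ">common </w>
  <w c5="NN1" hw="sense" pos="SUBST">sense</w>
</s>
"""


def tokens(sentence_xml):
    """Return (form, lemma, C5 code, simplified POS) for every w element."""
    root = ET.fromstring(sentence_xml)
    return [
        ((w.text or "").strip(), w.get("hw"), w.get("c5"), w.get("pos"))
        for w in root.iter("w")
    ]


def multiwords(sentence_xml):
    """Return (joined form, C5 code) for every mw element."""
    root = ET.fromstring(sentence_xml)
    found = []
    for mw in root.iter("mw"):
        form = "".join(w.text or "" for w in mw.iter("w")).strip()
        found.append((form, mw.get("c5")))
    return found


if __name__ == "__main__":
    for token in tokens(SAMPLE):
        print(token)
    print(multiwords(SAMPLE))
```

Each word inside a multiword keeps its own annotation, so a query can work at either level.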

Structure of written texts
- Most written texts are organized hierarchically into various kinds of division, shown by headings or other features.
- Some divisions are typed: e.g. chapter, section, story, subsection, column, front, part, recipe, leaflet...
- All spoken texts are divided into “conversations”...

Features of written texts
- Paragraph-like
  - <p> marks paragraphs
  - <head> marks headings or captions
  - <list> marks lists
  - <quote> marks quotes
  - <l> marks verse lines
- Paragraph-parts
  - <hi> for typographic highlighting
  - <corr> for corrected passages
  - <gap> for deliberate omissions
  - <pb> for page breaks

Speech in writing...
Mr. Skinner: ... That millionaire mammy's boy —
[Interruption]
Mr. Speaker: Order. That is not wholly unparliamentary.

Structure of spoken texts
- <u who="XXX"> marks a stretch of speech initiated by the speaker identified as XXX
- <align> marks a synchronization point
- detailed information on speakers is given in the text header
- other features of transcribed speech are also marked...
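A small sketch of how speaker-attributed utterances might be tallied. It assumes utterances are `u` elements carrying a `who` attribute and containing `s` and `w` elements, as in the structure outlined earlier; the fragment itself is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical spoken-text fragment; speaker codes and words are illustrative.
SAMPLE = """
<stext>
  <div>
    <u who="PS0X2"><s n="1"><w>Hello </w><w>there</w></s></u>
    <u who="PS000"><s n="2"><w>Hi</w></s><pause/></u>
    <u who="PS0X2"><s n="3"><unclear/><w>Right</w></s></u>
  </div>
</stext>
"""


def speaker_totals(xml_text):
    """Return two dicts keyed by speaker id: utterance counts and word counts."""
    root = ET.fromstring(xml_text)
    utterances = Counter()
    words = Counter()
    for u in root.iter("u"):
        speaker = u.get("who")
        utterances[speaker] += 1
        words[speaker] += sum(1 for _ in u.iter("w"))
    return dict(utterances), dict(words)


if __name__ == "__main__":
    print(speaker_totals(SAMPLE))
```

Non-word elements such as pauses or unclear passages are skipped here because only `w` elements are counted.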

Features of spoken texts
- <shift> marks changes in voice quality, e.g. whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance.
- <vocal> marks non-verbal but vocalised sounds, e.g. coughs, humming noises, etc.
- <event> marks non-verbal and non-vocal events, e.g. passing lorries, animal noises, and other matters considered worthy of note.
- <pause> marks significant pauses: silence, within or between utterances, longer than was judged normal for the speaker or speakers.
- <unclear> marks unclear passages: whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons.

Event description
Sample values: baby; baby burped; baby cries; baby cry; baby crying; baby crying in background; baby gurgling; baby laughing; baby noise; baby noises; baby screaming; baby shouting; baby shouting over the top; baby shouts; baby speaking; baby squealing; baby talk; baby talking; background chatter; background chatter in pub; background chatting shuffling etcetera; background conversation

Vocal descriptions

Contextual information
- each text has a TEI header
  - identification and classification
  - specific details (e.g. speakers)
- all common data is in the corpus header
- classification(s) in the corpus header are pointed to by individual texts

Structure of the TEI Header
- File Description
  - Title Statement
  - Responsibility Statement/s
  - Edition Statement
  - Extent
  - Publication Statement
  - Identification numbers
  - Source Description
- Encoding Description
  - Tagging Declaration
- Profile Description
  - Creation
  - [Participant Description]
  - Text Classification
- Revision Description

The title statement
- How we won the open: the caddies' stories. Sample containing about words from a book (domain: leisure)
- Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context
- 32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, words, and 3 hours 22 minutes 23 seconds of recordings.
- [Leaflets advertising goods and products]. Sample containing about words of miscellanea (domain: commerce)
- The age of capital. Sample containing about words from a book (domain: world affairs)
- Data capture and transcription: Oxford University Press

The edition statement
BNC XML Edition, December
tokens; w-units; 1436 s-units
Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.
This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at for full licencing and distribution conditions.
J0P
AgeCap

The source description 1
- Title: The age of capital
- Author: Hobsbawm, E J
- Publisher: Abacus
- Place: London

The source description 2
<recording xml:id="KE5RE000" n="035201" date=" " time="11:50+" type="Walkman"/>
<recording xml:id="KE5RE001" n="035202" date=" " time="11:50+" type="Walkman"/>
<recording xml:id="KE5RE002" n="035203" date=" " time="17:05+" type="Walkman"/>
<recording xml:id="KE5RE003" n="035204" date=" " type="Walkman"/>
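The recording entries above can be read with a few lines of Python. This is a sketch only: the `recordingStmt` wrapper is an assumption added so that the four elements parse as one document, and blank or missing attributes are returned as `None`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The four recording elements from the slide; the wrapper element is assumed.
SOURCE = """<recordingStmt>
<recording xml:id="KE5RE000" n="035201" date=" " time="11:50+" type="Walkman"/>
<recording xml:id="KE5RE001" n="035202" date=" " time="11:50+" type="Walkman"/>
<recording xml:id="KE5RE002" n="035203" date=" " time="17:05+" type="Walkman"/>
<recording xml:id="KE5RE003" n="035204" date=" " type="Walkman"/>
</recordingStmt>"""


def clean(value):
    """Turn a missing or whitespace-only attribute value into None."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def recordings(xml_text):
    """Return one dict per recording element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": rec.get(XML_ID),
            "n": rec.get("n"),
            "date": clean(rec.get("date")),
            "time": clean(rec.get("time")),
            "type": rec.get("type"),
        }
        for rec in root.iter("recording")
    ]


if __name__ == "__main__":
    for row in recordings(SOURCE):
        print(row)
```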

The encoding description

The profile description (written)
W nonAc: humanities arts
History, Modern - 19th century
Capitalism - History - 19th century
World,

Classification codes
- Codes used are predefined in the corpus header
- Written Domain
  - Imaginative
  - Natural and pure sciences
  - Applied sciences
  - ...
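A sketch of how codes in a text header could be resolved against categories predefined in the corpus header. It assumes TEI-style `category` and `catDesc` elements and a `catRef` element whose `targets` attribute lists codes; the identifiers `DOM`, `DOM1`, `DOM2` and `DOM3` are hypothetical stand-ins, not the real BNC codes.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical corpus-header fragment; identifiers are invented.
CORPUS_HEADER = """
<taxonomy>
  <category xml:id="DOM">
    <catDesc>Written Domain</catDesc>
    <category xml:id="DOM1"><catDesc>Imaginative</catDesc></category>
    <category xml:id="DOM2"><catDesc>Natural and pure sciences</catDesc></category>
    <category xml:id="DOM3"><catDesc>Applied sciences</catDesc></category>
  </category>
</taxonomy>
"""

# Hypothetical text-header fragment pointing at one of the codes above.
TEXT_HEADER = '<textClass><catRef targets="DOM2"/></textClass>'


def code_table(header_xml):
    """Map each category identifier to its own description."""
    root = ET.fromstring(header_xml)
    return {
        cat.get(XML_ID): (cat.findtext("catDesc") or "").strip()
        for cat in root.iter("category")
    }


def describe(text_header_xml, table):
    """Resolve the codes referenced by a text; unknown codes pass through."""
    root = ET.fromstring(text_header_xml)
    descriptions = []
    for ref in root.iter("catRef"):
        for code in (ref.get("targets") or "").split():
            descriptions.append(table.get(code, code))
    return descriptions


if __name__ == "__main__":
    print(describe(TEXT_HEADER, code_table(CORPUS_HEADER)))
```

Keeping the descriptions in one place means each text header only needs to carry short codes.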

The profile description (spoken)
<person ageGroup="Ag1" xml:id="PS0X2" role="self" sex="m" soc="DE" dialect="XSS">
20 Wayne unemployed Central South-west England....
<setting xml:id="KE5SE000" n="035201" who="PS000 PS0X2">
Hampshire: Andover local shop visiting friends...
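A sketch of linking a setting to the speakers it names. The `who` attribute holds a space-separated list of speaker identifiers, which can be looked up among the `person` elements. The attribute values are taken from the slide; the wrapper elements (`profileDesc`, `particDesc`, `settingDesc`) and the empty-element form are assumptions made so the fragment parses.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Attributes come from the slide; wrapper elements are assumed.
HEADER = """
<profileDesc>
  <particDesc>
    <person ageGroup="Ag1" xml:id="PS0X2" role="self" sex="m" soc="DE" dialect="XSS"/>
  </particDesc>
  <settingDesc>
    <setting xml:id="KE5SE000" n="035201" who="PS000 PS0X2"/>
  </settingDesc>
</profileDesc>
"""


def participants(header_xml):
    """Map each setting id to a list of (speaker id, person attributes or None)."""
    root = ET.fromstring(header_xml)
    people = {p.get(XML_ID): dict(p.attrib) for p in root.iter("person")}
    result = {}
    for setting in root.iter("setting"):
        ids = (setting.get("who") or "").split()
        # Speakers not described in this header resolve to None.
        result[setting.get(XML_ID)] = [(pid, people.get(pid)) for pid in ids]
    return result


if __name__ == "__main__":
    for setting_id, speakers in participants(HEADER).items():
        print(setting_id, speakers)
```

Here PS000 has no person record in the fragment, so it resolves to `None` rather than raising an error.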

Has English moved on?
- types of text
  - web pages / blogs
  - SMS
  - personal letters
- topics
  - globalization
  - internet
  - Elvis
  - Word Perfect

Out of date?
- The composition (and date) of any corpus affects inferences drawn from it
- There aren't many alternatives
  - Web-as-corpus
  - sources of spoken texts?
  - monitor corpora are non-replicable
  - copyright permissions unrepeatable
- Quantitative and qualitative comparative evaluations of BNC coverage are needed
  - but “it's surprising how much is there”

Why is it still useful?
- The BNC is a problematizing resource...
  - complements (and corrects) intuition
  - increases learner autonomy
  - critiques the myth of the native speaker
  ... for teacher and learner alike
- XML makes it more usable by non-specialist software
- Its range and availability make it unique

Where can I get one?
- BNC XML:
  - now available on DVD
  - standalone single-user licence or institutional licence
  - existing licensees should renew
- XAIRA
  - Delivered free with the BNC (and also available free from ...)
  - Usable with any XML corpus
  - Usable/ish on any platform