Lou Burnard BNC-XML: an introduction.


1 Lou Burnard http://www.natcorp.ox.ac.uk BNC-XML: an introduction

2 What is the BNC?  a snapshot of British English, taken at the end of the 20th century  100 million words in approx. 4000 different text samples, both spoken (10%) and written (90%)  synchronic (1990-4), sampled, general-purpose corpus  available under licence; latest edition is BNC-XML (13 Mar 2007)

3 Production of the BNC  managed by an academic-industrial consortium  with significant government funding  took three years (at least)  cost GBP 1.6 million (at least)  came about through an unusual coincidence of interests amongst:  Lexicographical publishers  Government (DTI)  Science and Engineering Research Council  Target audience: lexicographers, NLP researchers  But not language teachers!

4 Remember the Nineties?  WinWord or WP5? the choice is yours  On your desk... a 386 with 50 MB disk space (just about enough to run Windows 3)  In your lab... a VAX or a SPARC for serious work  On the WWW (maybe)... Mosaic for X  Little text in digital format  Text encoding (under development)  TEI  SGML

5 Corpus linguistics 90s-style  a world without the web!  corpus linguistics  Traditionalists (ICAME)  Expansionists (LDC, monitor corpora)  text encoding theory  language engineering and NLP  the JFIT mentality

6 Project Goals  Stated  A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production  of non-opportunistic design, for generic applicability  with word class annotation and contextual information  Unstated  better, more authoritative, learner dictionaries  a new template for European language resources  a REALLY BIG corpus

7 The BNC “sausage machine”  OUP  Written (OUP/Chambers)  Spoken (Longman)  Initial CDIF conversion and validation (OUCS)  Word class annotation (UCREL)  Header generation and final validation (OUCS)  Selection, clearance, and capture  Enrichment and encoding  Documentation, distribution, maintenance

8 Distinctive features of the BNC  non-opportunistic design  standardized markup system  structural annotation  word class annotation  contextual information  general availability  ...in these respects, the BNC remains distinctive, twenty years on!

9 Why BNC XML?  The BNC is still widely used... but the technology has moved on  XML tools are everywhere... so using the corpus is much easier  Conversion to XML was easy and (fairly) automatic... but with more tractable markup some dusty corners needed sweeping out

10 What's in the BNC?

11 Needles and haystacks  The BNC has an extraordinary range  travel agent brochures, weather reports, formal invitations, advertising, publicity leaflets, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, business letters, personal diaries and correspondence...  The problem is finding the specific texts you want  Selection criteria  Descriptive criteria  Post-hoc categorization  (or use the WLD principle)

12 BNC Design  Criteria for written texts (90%)  Medium (books, newspapers, unpublished…)  Domain (informative, entertaining…)  Criteria for transcribed speech events (10%)  Context-governed half: predefined list of speech situations  Demographically sampled half: 200 volunteers, sampled for age, sex, region  These selection criteria make up a taxonomy, which is defined in the corpus header

13 What topics?

14 Descriptive criteria  spoken texts  speaker occupation, perceived accent, education level, personal relationship…  speech domain, region, locale …  written texts  author age, sex, type  audience, circulation, status  text-type classification  These criteria were used to maximize variation once selectional constraints had been applied

15 Post-hoc text-type classification

16 Annotation, encoding, markup  A means of making explicit, and thus processable:  structure: texts, sections, paragraphs, turns, sentences, words...  metadata: text-type, situational parameters, context  analysis: morphology, syntactic function, translation  Adopting a single framework facilitates integration and sharing of fragmentary resources  thus enhancing research outcomes  also makes tool development much easier

17 BNC structure  [Diagram: the corpus element (bnc) contains bncDoc elements; each bncDoc holds a teiHeader plus either a wtext or an stext. Counts shown in the figure: 4049, 908]

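18 BNC-XML structure  [Diagram: wtext and stext are each divided into div elements; div contains p elements in written texts and u elements in spoken texts; these contain s elements, which contain w elements. Counts shown in the figure: 6,026,284; 98,363,784; 784,484; 1,599,692]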
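Counts like those in the diagram can be gathered for a single file with one streaming pass. The following is a minimal Python sketch, assuming the element names that appear in the diagram (div, p, u, s, w); the miniature document and the function name count_elements are invented for illustration and are not part of the BNC distribution.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Invented miniature document; real BNC-XML files are far larger.
SAMPLE = b"""<bncDoc>
  <teiHeader/>
  <wtext>
    <div>
      <p><s><w>Hello </w><w>world</w></s></p>
      <p><s><w>Second </w><w>paragraph</w></s><s><w>Done</w></s></p>
    </div>
  </wtext>
</bncDoc>"""


def count_elements(stream, names=("div", "p", "u", "s", "w")):
    """Count the named elements while streaming, without keeping the full tree."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag in names:
            counts[elem.tag] += 1
        elem.clear()  # drop the content of subtrees that are already counted
    return counts


if __name__ == "__main__":
    print(dict(count_elements(io.BytesIO(SAMPLE))))
```

Streaming with iterparse keeps memory use flat, which matters once the same function is pointed at a real corpus file instead of the toy sample.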

19 Word class annotation  CLAWS (Leech, Garside et al.) approach  What counts as a word? This isn't prima facie obvious, in spite of spelling conventions.  In BNC-XML, each word is explicitly marked and annotated with  a root form or lemma  an automatically assigned C5 word class code  a simplified POS code
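To make the three annotations concrete, here is a small Python sketch that reads them back from a marked-up sentence. It assumes each word is a w element carrying the lemma in an hw attribute, the C5 code in c5, and the simplified code in pos; those attribute names, the sample sentence, and the function name word_annotations are assumptions made for illustration, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Invented sentence. The attribute names hw, c5 and pos are assumptions.
SENTENCE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN2" hw="corpus" pos="SUBST">corpora </w>'
    '<w c5="VBD" hw="be" pos="VERB">were </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def word_annotations(s_xml):
    """Return (surface form, lemma, C5 code, simplified POS) for every <w>."""
    root = ET.fromstring(s_xml)
    return [
        ((w.text or "").strip(), w.get("hw"), w.get("c5"), w.get("pos"))
        for w in root.iter("w")
    ]


if __name__ == "__main__":
    for row in word_annotations(SENTENCE):
        print(row)
```

Punctuation is held in a separate c element in this sketch, so it does not show up among the words.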

20 Words and multiwords  English orthography can be misleading  In BNC XML, some “multiwords” are explicitly marked: “in spite of” counts as a single unit, as in “in spite of common sense...”  while “it wasn't me” is split into “it was n't me”
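The two cases on this slide can be sketched in Python as follows. The sketch assumes a multiword is wrapped in an mw element around its component w elements, and that a contraction is stored as separate w elements; the element name mw, the sample strings, and the function name units are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Invented samples. The <mw> wrapper element is an assumption.
MULTIWORD = (
    "<s>"
    "<mw><w>in </w><w>spite </w><w>of </w></mw>"
    "<w>common </w><w>sense</w>"
    "</s>"
)
CONTRACTION = "<s><w>it </w><w>was</w><w>n't </w><w>me</w></s>"


def units(s_xml):
    """Return one string per top-level unit: a whole <mw> or a single <w>."""
    root = ET.fromstring(s_xml)
    result = []
    for child in root:
        if child.tag == "mw":
            # Join the component words so the multiword counts once.
            result.append("".join(child.itertext()).strip())
        elif child.tag == "w":
            result.append((child.text or "").strip())
    return result


if __name__ == "__main__":
    print(units(MULTIWORD))
    print(units(CONTRACTION))
```

Counting top-level units rather than every w element is one way to decide "what counts as a word"; counting every w instead would give five words for the first sample.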

21 Structure of written texts  Most written texts are organized hierarchically into various kinds of division, shown by headings or other features:  Some divisions are typed: e.g. chapter, section, story, subsection, column, front, part, recipe, leaflet...  all spoken texts are divided into “conversations”...

22 Features of written texts  Paragraph-like  marks paragraphs  marks headings or captions  marks lists  marks quotes  marks verse lines  Paragraph-parts  for typographic highlighting  for corrected passages  for deliberate omissions  for page breaks

23 Speech in writing... Mr. Skinner... That millionaire mammy 's boy — Interruption Mr. Speaker Order. That is not wholly unparliamentary.

24 Structure of spoken texts   marks a stretch of speech initiated by speaker identified as XXX   marks a synchronization point  detailed information on speakers is given in the text header  other features of transcribed speech are also marked...

25 Features of spoken texts  marks changes in voice quality e.g. whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance.  marks non-verbal but vocalised sounds e.g. coughs, humming noises etc.  marks non-verbal and non-vocal events e.g. passing lorries, animal noises, and other matters considered worthy of note.  marks significant pauses silence, within or between utterances, longer than was judged normal for the speaker or speakers.  marks unclear passages whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons.
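A short Python sketch of how these spoken-text features might be tallied per speaker. The element names used here (u with a who attribute, shift, vocal, event, pause, unclear) do not survive in this transcript and are assumptions, as are the sample text and the function name summarize; the speaker identifiers are borrowed from the header example elsewhere in the talk.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed element names for the five features described above.
FEATURES = ("shift", "vocal", "event", "pause", "unclear")

# Invented transcription fragment.
SAMPLE = """<stext>
  <u who="PS0X2"><s><w>Hello </w><pause/><w>there</w></s></u>
  <u who="PS000"><s><vocal desc="laugh"/><w>Hi</w></s><event desc="door slams"/></u>
  <u who="PS0X2"><s><unclear/><w>okay</w></s></u>
</stext>"""


def summarize(stext_xml):
    """Return (words grouped by speaker, counts of non-word features)."""
    root = ET.fromstring(stext_xml)
    words_by_speaker = {}
    features = Counter()
    for u in root.iter("u"):
        words = [(w.text or "").strip() for w in u.iter("w")]
        words_by_speaker.setdefault(u.get("who"), []).extend(words)
        for tag in FEATURES:
            features[tag] += sum(1 for _ in u.iter(tag))
    return words_by_speaker, features


if __name__ == "__main__":
    by_speaker, feature_counts = summarize(SAMPLE)
    print(by_speaker)
    print(dict(feature_counts))
```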

26 Event description  baby, baby burped, baby cries, baby cry, baby crying, baby crying in background, baby gurgling, baby laughing, baby noise, baby noises, baby screaming, baby shouting, baby shouting over the top, baby shouts, baby speaking, baby squealing, baby talk, baby talking, background chatter, background chatter in pub, background chatting shuffling etcetera, background conversation

27 Vocal descriptions

28 Contextual information  each text has a TEI header  identification and classification  specific details (e.g. speakers)  all common data in the corpus header  classification(s) in header are pointed to by individual texts

29 Structure of the TEI Header  File Description: Title Statement, Responsibility Statement/s, Edition Statement, Extent, Publication Statement, Identification numbers, Source Description  Encoding Description: Tagging Declaration  Profile Description: Creation, [Participant Description], Text Classification  Revision Description

30 The title Statement How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure) Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context 32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings. [Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce) The age of capital 1848-1875. Sample containing about 41650 words from a book (domain: world affairs) Data capture and transcription Oxford University Press

31 The edition statement BNC XML Edition, December 2006 41650 tokens; 41573 w-units; 1436 s-units Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium. This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions. J0P AgeCap

32 The source description 1 The age of capital 1848-1875. Hobsbawm, E J Abacus London 1977 203-316

33 The source description 2 <recording xml:id="KE5RE000" n="035201" date="1992-02-20" time="11:50+" type="Walkman"/> <recording xml:id="KE5RE001" n="035202" date="1992-02-20" time="11:50+" type="Walkman"/> <recording xml:id="KE5RE002" n="035203" date="1992-02-23" time="17:05+" type="Walkman"/> <recording xml:id="KE5RE003" n="035204" date="1992-02-22" type="Walkman"/>
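The recording elements above can be read with any XML parser. Below is a minimal Python sketch that groups them by date; the four recording elements are copied from the slide, while the enclosing recordings wrapper and the function name recordings_by_date are hypothetical additions needed to make a parseable fragment.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# <recording> elements as on the slide; the <recordings> wrapper is hypothetical.
SAMPLE = """<recordings>
<recording xml:id="KE5RE000" n="035201" date="1992-02-20" time="11:50+" type="Walkman"/>
<recording xml:id="KE5RE001" n="035202" date="1992-02-20" time="11:50+" type="Walkman"/>
<recording xml:id="KE5RE002" n="035203" date="1992-02-23" time="17:05+" type="Walkman"/>
<recording xml:id="KE5RE003" n="035204" date="1992-02-22" type="Walkman"/>
</recordings>"""


def recordings_by_date(xml_text):
    """Map each date to a list of (recording id, time); time may be None."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for rec in root.iter("recording"):
        grouped[rec.get("date")].append((rec.get(XML_ID), rec.get("time")))
    return dict(grouped)


if __name__ == "__main__":
    for date, items in sorted(recordings_by_date(SAMPLE).items()):
        print(date, items)
```

The time values are kept as plain strings because the trailing plus sign is not explained on the slide, and the last recording has no time attribute at all.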

34 The encoding description

35 The profile description (written)  W nonAc: humanities arts  History, Modern - 19th century  Capitalism - History - 19th century  World, 1848-1875

36 Classification codes  Codes used are predefined in the Corpus header Written Domain Imaginative Natural and pure sciences Applied sciences...
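Because the codes are defined once in the corpus header and only pointed to from each text, looking up a text's classification is a two-step join. The Python sketch below illustrates the idea under stated assumptions: the element names (category, catDesc, catRef with a targets attribute), the code identifiers, and the function name resolve are all illustrative; only the three category descriptions come from the slide.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical corpus-header fragment; descriptions are from the slide,
# identifiers and element names are assumptions.
CORPUS_HEADER = """<taxonomy>
  <category xml:id="WRIDOM1"><catDesc>Imaginative</catDesc></category>
  <category xml:id="WRIDOM2"><catDesc>Natural and pure sciences</catDesc></category>
  <category xml:id="WRIDOM3"><catDesc>Applied sciences</catDesc></category>
</taxonomy>"""

# Hypothetical text-header fragment pointing at one known and one unknown code.
TEXT_HEADER = '<textClass><catRef targets="WRIDOM3 XYZ9"/></textClass>'


def resolve(corpus_header_xml, text_header_xml):
    """Return (code, description) pairs; description is None for unknown codes."""
    descriptions = {
        cat.get(XML_ID): cat.findtext("catDesc")
        for cat in ET.fromstring(corpus_header_xml).iter("category")
    }
    pairs = []
    for ref in ET.fromstring(text_header_xml).iter("catRef"):
        for code in (ref.get("targets") or "").split():
            pairs.append((code, descriptions.get(code)))
    return pairs


if __name__ == "__main__":
    print(resolve(CORPUS_HEADER, TEXT_HEADER))
```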

37 The profile description (spoken)  1992-02-23 <person ageGroup="Ag1" xml:id="PS0X2" role="self" sex="m" soc="DE" dialect="XSS"> 20 Wayne unemployed Central South-west England.... <setting xml:id="KE5SE000" n="035201" who="PS000 PS0X2"> Hampshire: Andover local shop visiting friends...
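Speaker details of this kind are easy to turn into a lookup record. In the Python sketch below, the attributes on person are copied from the slide, but the child element names (age, persName, occupation, dialect) are assumptions, since the transcript preserves only their text content; the function name person_record is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Attributes as on the slide; the child element names are assumptions.
SAMPLE = """<person ageGroup="Ag1" xml:id="PS0X2" role="self" sex="m" soc="DE" dialect="XSS">
  <age>20</age>
  <persName>Wayne</persName>
  <occupation>unemployed</occupation>
  <dialect>Central South-west England</dialect>
</person>"""


def person_record(person_xml):
    """Flatten a <person> element into a dict of attributes plus child details."""
    elem = ET.fromstring(person_xml)
    record = {("id" if key == XML_ID else key): value
              for key, value in elem.attrib.items()}
    # Children go under their own key so the dialect code (attribute)
    # and the dialect description (child text) do not collide.
    record["details"] = {child.tag: (child.text or "").strip() for child in elem}
    return record


if __name__ == "__main__":
    print(person_record(SAMPLE))
```

A record keyed by the speaker id can then be joined to the speaker references that appear in the setting and in the transcribed text.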

38 Has English moved on?  types of text  e-mail  web pages / blogs  SMS  personal letters  topics  globalization  internet  Elvis  Word Perfect

39 Out of date?  The composition (and date) of any corpus affects inferences drawn from it  There aren't many alternatives  Web-as-corpus  sources of spoken texts?  monitor corpora are non-replicable  copyright permissions unrepeatable  Quantitative and qualitative comparative evaluations of BNC coverage are needed  but “it's surprising how much is there”

40 Why is it still useful?  The BNC is a problematizing resource...  complements (and corrects) intuition  increases learner autonomy  critiques the myth of the native speaker ... for teacher and learner alike  XML makes it more usable by non-specialist software  Its range and availability make it unique

41 Where can I get one?  BNC XML: http://www.natcorp.ox.ac.uk  now available on DVD  standalone single user licence or institutional licence  existing licensees should renew  XAIRA  Delivered free with the BNC (and also available free from http://xaira.sf.net)  Usable with any XML corpus  Usable/ish on any platform

