Lou Burnard, Humanities Computing Unit, Oxford University Computing Services: Introducing the British National Corpus


1 Lou Burnard, Humanities Computing Unit, Oxford University Computing Services http://info.ox.ac.uk/bnc/ Introducing the British National Corpus

2 What is a corpus?
• How do we find out what words mean?
  – algorithm
  – authority
  – usage
• Corpus linguistics re-centres the last of these

3 What is a corpus?
1. The body of a man or animal. (Cf. corpse.) Formerly frequent; now only humorous or grotesque...
2. Phys. A structure of a special character or function in the animal body, as corpus callosum,...
3. A body or complete collection of writings or the like; the whole body of literature on any subject....
4. The body of written or spoken material upon which a linguistic analysis is based...
5. The body or material substance of anything; principal, as opposed to interest or income.

4 Salience vs. typicality
subject. 1727-51 Chambers Cycl. s.v., Corpus is also used in matters of learning, for s
d, and bound together.. We have also a corpus of the Greek poets.. The corpus of the ci
also a corpus of the Greek poets.. The corpus of the civil law is composed of the diges
16 Bound up inseparably with the whole corpus of Christian tradition. 4. The body of wr
e informant.. and in particular upon a corpus of material, of which a large proporti
al objection one may make against the `corpus' method is that two investigators operati
lore the possibilities and problems of corpus-based research by reference to first-h
incurred they ought to be paid out of corpus and not out of income. phr. corpus delic
of corpus and not out of income. phr. corpus delicti (see quot. 1832); also, in lay u
, esp. the body of a murdered person. corpus juris: a body of law; esp. the body of Rom
; esp. the body of Roman or civil law (corpus juris civilis). 1891 Fortn. Rev. Sept.
ev. Sept. 338 The translation.. of the Corpus Juris into French. 1922 Joyce Ulysses
o.) We have here damning evidence, the corpus delicti, my lord, a specimen of my mature
r, dam and hollow log in search of the corpus delicti, found some important evidence
important evidence in a fallen tree. corpus vile Pl. corpora vilia Orig. in phr. (se
ugh who would submit to serve as the corpus vile for their charitable treatment. 1953 E
(OED citations)

5 Typicality vs. salience
FLY 49 working for the British National Corpus and they are looking at the
GT9 0 At Oxford he designed Corpus Christi College, built in 1512-18.
F98 104 difficult topic of the Pauline corpus I concluded that the evidence was equally
F98 135 in the chronology of the entire corpus. In the study of literature, quantitative methods
H47 6 which will go towards making a corpus of information from which will draw the
F98 56 The aim is to provide a corpus of 100 million words of contemporary spoken
J2H 20 lasers have provided a rich corpus of both experimental and theoretical work :
F98 100 sent time, that in the Pauline corpus only Romans, 1 and 2 Corinthians, and
KCN 22 unusual, ? be good. These are corpus records. For the er. Stick it in your bedroom.
F98 54 studies is the British National Corpus. This is a collaborative venture,
F98 112 presented by the Aristotelian corpus, which contains two ethical treatises
(BNC Sampler)
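
A listing like the one on this slide is a key-word-in-context (KWIC) concordance: each occurrence of the search word is shown with a fixed window of text on either side. The following is a minimal sketch of the idea in Python. The `kwic` function, its `width` parameter and the sample string are assumptions made for illustration; it says nothing about how the Sampler's own search engines are implemented.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit of `keyword`, with up to `width`
    characters of context on each side (case-insensitive match)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword always starts
        # in the same column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines

# Illustrative sample built from two of the lines on the slide.
sample = ("The aim is to provide a corpus of 100 million words. "
          "At Oxford he designed Corpus Christi College.")

for line in kwic(sample, "corpus", width=25):
    print(line)
```

Aligning the keyword in a fixed column is what makes recurrent patterns to its left and right easy to scan by eye.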

6 What is the BNC?
• 100 million words of modern British English
• produced by a consortium of dictionary publishers and academic researchers
  – OUP, Longman, Chambers
  – OUCS, UCREL, BL R&D
• funded as pre-competitive resource by DTI/SERC under JFIT 1990-1994

7 Production of the BNC
• took three years
• cost approximately GBP 1.2 million
  – publishers
  – DTI
  – Research Councils
• would be cheaper now...

8 Project management
• The BNC Consortium
  – Oxford University Press (co-ordinator)
  – Addison-Wesley Longman
  – Oxford University (OUCS)
  – Lancaster University (UCREL)
  – British Library R&D Dept
• The Advisory Board

9 Task groups
• permissions
• selection, design criteria
• encoding and markup
• enrichment and annotation
• retrieval software

10 Project Goals
• A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production
• of non-opportunistic design, for generic applicability
• with word class annotation
• and contextual information

11 Who needs this?
• lexicographers
• NLP researchers
• teachers and learners of English
• social science, cultural studies...
... in short, anywhere there is a need for real-life data about the English language

12 The BNC “sausage machine”
[Flow diagram. Boxes, in the order shown: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF Conversion and Validation (OUCS); Word Class Annotation (UCREL); Header generation and final validation (OUCS). Stage labels: Text selection and capture; Text enrichment and encoding; Text analysis and distribution]

13 Through-put (million words/quarter)

14 "non opportunistic design"
• representative
  – of what?
• sampled
  – how?
• uniform encoding
  – how and what

15 “language” Language In Use abstraction selection Texts

16 Sampling issues
• what kinds of texts
• how many texts
• which texts
• which parts of texts
• how much of each text

17 Representativeness
• what is the population?
• what are the variables?
"the extent to which a sample includes the full range of variability in a population" (Biber 1993)

18 BNC Composition
• written
  – predefined proportions of
    » different media (books, newspapers, unpublished…)
    » different domains (informative, entertaining…)
  – maximum sample size 45000 words
  – all texts incomplete
• spoken
  – context-governed
  – demographically-sampled
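
The 45000-word ceiling on written samples can be pictured with a short Python sketch. This is only an illustration: the `take_sample` function is hypothetical, and the choice of a contiguous extract starting at a random position is an assumption, not a description of how the BNC extracts were actually chosen.

```python
import random

MAX_SAMPLE_WORDS = 45_000  # maximum sample size given on the slide

def take_sample(words, max_words=MAX_SAMPLE_WORDS, rng=None):
    """Return a contiguous extract of at most `max_words` words.

    Assumption for illustration: the extract starts at a random
    position. Texts shorter than the limit are returned whole here,
    which is a simplification.
    """
    rng = rng or random.Random()
    if len(words) <= max_words:
        return list(words)
    start = rng.randrange(len(words) - max_words + 1)
    return words[start:start + max_words]

# Stand-in for a 120,000-word book: w0, w1, w2, ...
book = [f"w{i}" for i in range(120_000)]
extract = take_sample(book, rng=random.Random(0))
print(len(extract), extract[0], extract[-1])
```

Passing a seeded `random.Random` makes the choice of extract reproducible, which matters if a sample ever has to be re-drawn and checked.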

19 Sampling frame
• Production or reception?
  – context-governed is needed to balance the demographically sampled
  – titles for written part are selected to maximize variability according to a range of descriptive criteria (target audience, popularity, region, author etc.).

20 Encoding
• CDIF, TEI, SGML, CES...
• Features marked up
  – basic text structure
  – metadata and annotation
  – segmentation and tokenization
  – paralinguistic features in speech
• Use of SGML
  – XML remains impractical?

21 Architecture
[Diagram of the element hierarchy. Elements shown: bnc, header, bncdoc, header, text, stext. Counts shown: 4124, 863]

22 Basic structure...
[Tree diagram. Elements shown: text and stext at the top; div (and div 1) below them; p under text and u under stext; s units; w elements at the lowest level. Counts shown: 6,250,000 and 100,106,008]
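
The nesting named on this slide (text, div, p, s, w) can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the fragment below is hand-written, well-formed XML invented for the example, using words from the sample written text, whereas the corpus itself is encoded in SGML; the standard-library `xml.etree.ElementTree` parser is used only because it ships with Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the slide, but the
# markup is simplified and is not copied from the corpus.
fragment = """
<text>
  <div>
    <p>
      <s><w>How</w> <w>beer</w> <w>is</w> <w>brewed</w></s>
    </p>
    <p>
      <s><w>Beer</w> <w>seems</w> <w>such</w> <w>a</w> <w>simple</w> <w>drink</w></s>
    </p>
  </div>
</text>
""".strip()

root = ET.fromstring(fragment)

# Count how many elements of each kind the fragment contains.
counts = {tag: sum(1 for _ in root.iter(tag)) for tag in ("div", "p", "s", "w")}
print(counts)

# Recover the words of each s unit in document order.
for s in root.iter("s"):
    print(" ".join(w.text for w in s.iter("w")))
```

Counting elements this way over the whole corpus is how totals such as the number of s units and w elements would be obtained.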

23 Sample written text CAMRA FACT SHEET No 1 How beer is brewed Beer seems such a simple drink that we tend to take it for granted.

24 Sample spoken text Mm yes I told Paul that he can bring a lady up at Christmas-time. Is he not going home then ? No and erm I 'm leaving a turkey in the freezer Paul is quite good at cooking standard cooking.

25 Corpus Annotation
• automatic part-of-speech tagging
  – CLAWS4
• 99% accuracy claimed
• now improved
• tokenization problems

26 Word tagging
The Queen ‘s real annus horribilis began Sunday.
• word-pos pair
• white space problems
• validation problems
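
The word-pos pairing and the white space problem can be sketched in Python as follows. Everything here is an assumption made for illustration: the `word_TAG` notation, the helper names `tag_pairs` and `detokenize`, and the tags attached to each word, which are plausible-looking labels chosen by hand and not actual CLAWS4 output.

```python
from typing import List, Tuple

def tag_pairs(tagged: str) -> List[Tuple[str, str]]:
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # Split on the LAST underscore so words containing '_' survive.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

def detokenize(pairs: List[Tuple[str, str]]) -> str:
    """Rebuild running text, attaching clitics and punctuation
    to the preceding word instead of inserting a space."""
    out = ""
    for word, tag in pairs:
        if out and not (word.startswith("'") or tag == "PUN"):
            out += " "
        out += word
    return out

# Illustrative tags only; not actual CLAWS4 output.
tagged = ("The_AT0 Queen_NP0 's_POS real_AJ0 annus_NN1 "
          "horribilis_NN1 began_VVD Sunday_NP0 ._PUN")

pairs = tag_pairs(tagged)
print(pairs[:3])
print(detokenize(pairs))
```

The possessive is a separate token with its own tag, so the original spacing can only be restored by a rule such as the one in `detokenize`; that is one form the white space problem takes.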

27 Metadata
• each text has a TEI header
  – identification and classification
  – specific details (e.g. speakers)
• all common data in the corpus header
• classification(s) in header pointed to by individual texts
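
The pointer arrangement described on this slide, with shared classifications defined once in the corpus header and referenced from each text, can be sketched in Python. All identifiers and descriptions below (`med-book`, `text-001` and so on) are hypothetical and chosen only to show the lookup; they are not taken from the BNC headers.

```python
# Hypothetical category identifiers and descriptions, for illustration only.
corpus_header = {
    "med-book": "medium: book",
    "med-periodical": "medium: periodical",
    "dom-leisure": "domain: leisure",
}

# Each text header stores only pointers into the corpus header.
text_headers = {
    "text-001": ["med-periodical", "dom-leisure"],
    "text-002": ["med-book"],
}

def classify(text_id):
    """Resolve a text's category pointers against the shared corpus header."""
    return [corpus_header[ref] for ref in text_headers[text_id]]

def texts_in(category_id):
    """List the texts whose headers point at a given category."""
    return sorted(t for t, refs in text_headers.items() if category_id in refs)

print(classify("text-001"))
print(texts_in("med-book"))
```

Keeping the definitions in one place means a category description can be changed once without touching thousands of individual text headers.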

28 Text classifications
• spoken texts
  – age, sex, class (of respondent)
  – domain, region, type
• written texts
  – author age, sex, type
  – audience, circulation, status
  – medium, domain

29 Written text: medium

30 Spoken texts: region

31 Availability issues
• distribution within EU under licence
  – commercial exploitation of the corpus is forbidden
  – commercial exploitation of derived works is permitted, subject to veto by the consortium
• mounting pressure to distribute outside the EU

32 Availability of the BNC
• within EU only (at present)
  – licence and order forms at http://info.ox.ac.uk/bnc
• online service http://sara.natcorp.ox.ac.uk/
  – Coming shortly … BNC World Edition

33 Distribution methods
• 100 million words is a lot of data
• all or nothing distribution policy
• non profit distribution requirement
• the options are...
  – install it yourself
  – online access
  – the sampler

34 Install it yourself
• You need...
  – £220 for a licence
  – £3000 for a Unix box with min 6 Gb disk
  – some Unix expertise
• You get...
  – access to the whole corpus
  – using the tools of your choice
  – configurable for a local network

35 BNC Online service
• You need...
  – access to the Internet from a PC running Windows
• You get...
  – free (but limited) access using web browsers
  – free (temporary) access using SARA
  – software and documentation for a small annual fee
http://sara.natcorp.ox.ac.uk

36 The BNC Sampler
• You need...
  – $50 for a CD
  – A PC with a CD drive and (preferably) 90 Mb disk space
• You get...
  – 2% sample, half written, half spoken
  – four different search engines
  – documentation
now in beta test, for release 1 March 1999

37 BNC World Edition
• DTI has now agreed to worldwide distribution
• However, some 30-40 texts do not have world rights cleared
• Tagging enhancement project at Lancaster
• Much improved version of SARA
• … new release now planned for summer 1999

