Published by Elisabeth Russell. Modified over 9 years ago.
1
English Corpora and Language Learning Tamás Váradi varadi@nytud.hu
2
Outline
- What is a Corpus?
- Compiling a corpus
- First generation of corpora: BROWN, LOB
- The Age of Mega Corpora
- British National Corpus
- International Corpus of English
- International Corpus of Learner English
- The Web as a corpus?
- Availability
3
Corpora?
- (1) A collection of texts, especially if complete and self-contained; the corpus of Anglo-Saxon verse
- (2) In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language and usually stored as an electronic database (The Oxford Companion to the English Language, 1992)
- A collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair, Corpus, Concordance, Collocation, OUP 1991)
4
The pre-electronic era
- Huge, painstaking manual effort
- Covering a closed body of texts
  - Bible Concordance
  - Shakespeare Concordance
- Attempt to capture the whole language
5
Compiling a corpus
- Aim: provide solid empirical evidence about language
- Design: geographical and chronological bounds; speakers, genres; defined by future use
- Representative corpora?
- Annotation
- Output
6
Corpus Linguistics: the early phase
- Early Sixties
- BROWN Corpus: 500 texts of 2000 words each
- LOB corpus: British counterpart
- Classic reference works
- Part-of-speech tagged
7
Survey of English Usage
- A major undertaking at UCL led by Sidney Greenbaum
- 1 m word compilation
- very careful annotation
- 500,000 words of spoken material: LONDON-LUND Corpus
8
Structure of SEU
9
LOB corpus: a sample
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
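The word_TAG layout of the LOB sample can be read mechanically. The following is a minimal sketch, not the official LOB tooling: it assumes tokens are separated by whitespace, that the first two fields are a text identifier and a line number, and that "^" marks the start of a sentence. The function name parse_tagged_line is hypothetical.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split one LOB-style line into ((text_id, line_no), [(word, tag), ...]).

    Assumptions (not taken from the LOB manual): whitespace-separated
    tokens, a leading text id and line number, and "^" as a sentence marker.
    """
    fields = line.split()
    text_id, line_no = fields[0], int(fields[1])
    pairs = []
    for token in fields[2:]:
        if token == "^":  # assumed sentence-start marker, carries no tag
            continue
        # Split on the LAST underscore so that "._." gives (".", ".")
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return (text_id, line_no), pairs


sample = [
    "A01 3 ^ by_IN Trevor_NP Williams_NP ._.",
    "A01 5 of_IN labour_NN \\0MPs_NPTS tomorrow_NR ._.",
]

# Count how often each part-of-speech tag occurs in the sample lines
tag_counts = Counter()
for line in sample:
    _, pairs = parse_tagged_line(line)
    tag_counts.update(tag for _, tag in pairs)

print(tag_counts.most_common())
```

Splitting on the last underscore rather than the first is a deliberate choice here, since punctuation tokens such as "._." would otherwise lose their word part.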
10
Concordance output
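A concordance of the keyword-in-context (KWIC) kind can be sketched in a few lines. This is an illustrative sketch only, not the tool shown on the slide: the function name kwic, the width parameter, and the plain whitespace tokenisation are all assumptions, and the example sentence is the untagged text of the LOB sample.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Matching is case-insensitive; `width` is the number of tokens
    kept on each side of the keyword.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


text = ("a move to stop Mr Gaitskell from nominating any more labour "
        "life peers is to be made at a meeting of labour MPs tomorrow")

# Right-align the left context so the keyword lines up in one column
for left, kw, right in kwic(text.split(), "labour"):
    print(f"{left:>30}  {kw}  {right}")
```

Aligning the keyword in a fixed column is what makes recurring patterns to its left and right easy to spot by eye.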
11
The age of Mega Corpora
- COBUILD: John Sinclair at University of Birmingham
- originally 20 m words
- now over 300 m words: the Bank of English
- the more the better
- no fixed size: the idea of a Monitor corpus
12
British National Corpus
- A major undertaking in the mid-nineties
- Birmingham, Lancaster – OUP, Longman, Chambers
- 100 m words, carefully compiled
- 10 m words of spoken data!
- up-to-date standard SGML encoding
- still the paradigm example of a reference corpus
13
Accessing the BNC
14
BNC-Baby
15
Searching LOB/BROWN
16
International Corpus of English
- A network of corpora covering regional varieties of English
- Project organized by UCL London
- Each containing c. 1 m words
- GB, Hong Kong, Australia, East Africa
- more in preparation
17
ICE-HK
18
ICE-GB: sociolinguistic variation
19
ICE-GB: syntactic annotation
20
Treebanks
- Geoffrey Sampson: meticulously hand-crafted syntactic annotation
  - SUSANNE
  - CHRISTINE
  - LUCY
- Penn Treebank, University of Pennsylvania: massive amounts of automatically annotated data, aimed at natural language processing work
21
International Corpus of Learner English
- International Centre of English Corpus Linguistics, Catholic University of Louvain
- led by Sylviane Granger
- collection of essays
- student profiles
- Hungarian-English in preparation
22
Susanne Corpus: Aims of the Scheme
- comprehensive: covering all features of surface and logical English grammar that are definite enough to be susceptible of formal annotation, and including all phenomena that occur in practice in modern English
- explicit: if two researchers at separate sites are given the same sample of English and asked to annotate it according to the SUSANNE standards, their annotations should be identical
- nonpartisan: where aspects of grammar are the subject of theoretical controversy, the SUSANNE scheme aims to embody a neutral analysis which rival theoreticians can interpret in their own preferred terms
23
The Web as a corpus
- Why sample when you can access the whole?
- Huge and ever changing
- The ultimate in authenticity?
- Not necessarily …
24
The Webcorp project
25
http://devoted.to/corpora