BTANT 129 w5 Introduction to corpus linguistics


1 BTANT 129 w5 Introduction to corpus linguistics

2 Corpus
The old school concept
– A collection of texts, especially if complete and self-contained: "the corpus of Anglo-Saxon verse" (The Oxford Companion to the English Language)
The modern view
– A collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair, Corpus, Concordance, Collocation, OUP)

3 Corpus vs. archive
Text archive
– Collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
Corpus
– Texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)


6 Short history
Brief mention of just a select few!
Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of 2000 words each
– Area: US
– Time: 1961
LOB Corpus (Lancaster-Oslo/Bergen)
– GB replica of Brown

7 Cobuild
Major corpus initiative by Collins and Birmingham Univ.
John Sinclair 1991 20 m -> Bank of English, currently 450 m words
http://www.cobuild.collins.co.uk

8 British National Corpus
100 m words
careful selection
10 % spoken material
time span: 1960 (fiction) – 1975 (non-fiction)
40-50 000 word texts
TEI compliant SGML coding
http://www.comp.lancs.ac.uk/ucrel/bncindex/


10 International Corpus of English
20 corpora of 1 m words each, devoted to varieties of English around the world
500 texts (300 spoken, 200 written) of 2000 words each
time span: 1990-1996
ICE-GB available in demo version
syntactic annotation, graphical tool ICECUP


12 Corpus processing: tokenization
Preprocessing
– Tokenization
  segmenting the text into sentences (sometimes tricky: sentence delimiters in mid-sentence positions)
  segmenting sentences into words (multi-word units are a problem)
– Normalization
  restoring clitics, abbreviations ("can't", "I've")
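
The preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used by any of the corpora mentioned: the abbreviation list, the clitic table and the function names are all hypothetical, and a real tokenizer needs far larger resources.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Hypothetical clitic table for the normalization step.
CLITICS = {"can't": ["can", "not"], "i've": ["I", "have"]}


def split_sentences(text):
    """Close a sentence after ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split into words (keeping word-internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def normalize(tokens):
    """Expand the clitic forms listed in CLITICS; leave other tokens unchanged."""
    expanded = []
    for token in tokens:
        expanded.extend(CLITICS.get(token.lower(), [token]))
    return expanded


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith can't come. I've told him twice!"):
        print(normalize(tokenize(sentence)))
```

The full stop after "Dr." is the mid-sentence delimiter problem in miniature: without the abbreviation list the text would be cut into three sentences instead of two.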

13 Corpus processing: tagging
Tagging
– labelling every word with its part-of-speech (POS) category
– Problem: ambiguity
  out of context, words can belong to different parts of speech or have different analyses within the same POS
– set N vs. set V
– Hungarian bánt: past tense of 'bánik' ("regret", VBD) or present tense of 'bánt' ("hurt", VBZ)
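
To make the ambiguity problem concrete, here is a minimal sketch of a lexicon lookup that returns every tag a word form could carry out of context. The lexicon entries, the tag labels and the fallback tag for unknown words are assumptions chosen for illustration, not taken from any real tagger.

```python
# Hypothetical toy lexicon: each word form maps to all tags it can carry.
LEXICON = {
    "the": {"DT"},
    "they": {"PRP"},
    "set": {"NN", "VB", "VBD", "VBN"},
    "table": {"NN", "VB"},
    "new": {"JJ"},
}


def candidate_tags(tokens, lexicon=LEXICON, unknown="NN"):
    """Return (token, sorted list of possible tags); unknown words get a default tag."""
    return [(tok, sorted(lexicon.get(tok.lower(), {unknown}))) for tok in tokens]


def ambiguity_rate(tokens, lexicon=LEXICON):
    """Share of tokens that have more than one possible tag out of context."""
    tagged = candidate_tags(tokens, lexicon)
    return sum(1 for _, tags in tagged if len(tags) > 1) / len(tagged)


if __name__ == "__main__":
    sentence = ["They", "set", "the", "table"]
    for token, tags in candidate_tags(sentence):
        print(token, tags)
    print("ambiguous share:", ambiguity_rate(sentence))
```

In this toy sentence half of the tokens are ambiguous, and "set" is ambiguous both across parts of speech (N vs. V) and within the verb category.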

14 Corpus processing: disambiguation
Disambiguation
– determining the correct analysis in context
Two approaches: both need a manually corrected training corpus
– statistical
  Hidden Markov model
  calculating probability within a span of usually one or two words
  rate of success can be around 98%
– rule-based
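
The statistical approach can be illustrated with a small bigram Hidden Markov model decoded by the Viterbi algorithm, where each tag depends on a span of one preceding tag. All probabilities below are invented for the example; in practice they would be estimated from a manually corrected training corpus. The tag set and vocabulary are likewise hypothetical.

```python
import math

TAGS = ["DT", "NN", "VB"]

# Hypothetical probabilities, NOT estimated from any real corpus.
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}            # P(tag at sentence start)
TRANS = {                                             # P(next tag | previous tag)
    "DT": {"DT": 0.01, "NN": 0.90, "VB": 0.09},
    "NN": {"DT": 0.20, "NN": 0.30, "VB": 0.50},
    "VB": {"DT": 0.60, "NN": 0.30, "VB": 0.10},
}
EMIT = {                                              # P(word | tag)
    "DT": {"the": 0.9},
    "NN": {"set": 0.3, "table": 0.4, "waiters": 0.3},
    "VB": {"set": 0.6, "table": 0.1},
}
FLOOR = 1e-6  # small probability for unseen word/tag pairs


def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # Each column maps tag -> (log probability of best path, previous tag).
    columns = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], FLOOR)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            score, prev = max(
                (columns[-1][p][0] + math.log(TRANS[p][t]), p) for p in TAGS
            )
            column[t] = (score + math.log(EMIT[t].get(word, FLOOR)), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["the", "set"]))
    print(viterbi(["waiters", "set", "the", "table"]))
```

With these made-up numbers the same word form "set" comes out as a noun after a determiner and as a verb after a noun, which is the kind of contextual decision the slide describes.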

15 Syntactic annotation
Difficult to do on such a scale
shallow parsing
Treebank: collection of syntactically analyzed sentences
Penn Treebank http://www.cis.upenn.edu/~treebank/
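
Treebank sentences are commonly stored as labelled bracketings. The sketch below reads one such bracketing into nested tuples and pulls out the word strings of every phrase with a given label. The example sentence, its analysis and the function names are made up for illustration and are not drawn from the Penn Treebank itself.

```python
import re


def parse_brackets(text):
    """Parse '(LABEL child child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(tree, wanted):
    """Return the word strings of all constituents labelled `wanted`."""
    found = [" ".join(leaves(tree))] if tree[0] == wanted else []
    for child in tree[1]:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found


if __name__ == "__main__":
    tree = parse_brackets(
        "(S (NP (DT the) (NN waiter)) (VP (VBD set) (NP (DT the) (NN table))))"
    )
    print(leaves(tree))
    print(phrases(tree, "NP"))
```

Extracting only the noun phrases, as in the last call, gives roughly the kind of output a shallow parser aims for without building the full tree.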

16 Recent trends
Word sense disambiguation (SENSEVAL) http://www.itri.brighton.ac.uk/events/senseval/
Message understanding http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
SEMANTIC WEB
making information on the web understandable for machines
a vision requiring a huge effort, not clear whether feasible at all

17 Representative sample?
A corpus of any size is inevitably a sample
Of what?
Two approaches
– sampling speakers: demographic sampling
– sampling their output: text type sample

18 The notion of representativeness
Sample vs. population
sample should be proportional to the population for a given feature
– example for demographic sampling
  if we know from census figures that 48% of people living in Budapest are male, we should compile our sample so that 48% of the informants are male -> our sample is representative of Budapest residents for gender
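
The proportional-sampling idea can be worked through as a small quota calculation. Only the 48% figure comes from the slide; the complementary 52%, the sample sizes and the function name are assumptions added for the example.

```python
def quotas(shares_percent, sample_size):
    """Split sample_size across groups in proportion to integer percentage shares.

    Uses largest-remainder rounding so the quotas always sum to sample_size.
    """
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {g: p * sample_size // 100 for g, p in shares_percent.items()}
    remainders = {g: p * sample_size % 100 for g, p in shares_percent.items()}
    leftover = sample_size - sum(counts.values())
    # Hand out the remaining places to the groups with the largest remainders.
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[group] += 1
    return counts


if __name__ == "__main__":
    # 48% male is the slide's figure; 52% female is assumed as the complement.
    population = {"male": 48, "female": 52}
    print(quotas(population, 200))
    print(quotas(population, 10))
```

With 200 informants the quotas are exact (96 and 104). With only 10 informants the 48% target cannot be met exactly, which shows that small samples can only approximate the population proportions.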

19 Trouble with representativeness
What should be the units of sampling?
Registers, text types, genres etc.
But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal but impossible to implement

20 Approaches to representativeness
Douglas Biber:
Rejects notion of proportional sampling
Sample should be as varied as possible
Representativeness measured in terms of the wide variety of text types included in the sample

21 The Web as a corpus?
Pro:
immense database
dynamically growing
ideal 'quick and dirty' method
Cons:
lots of rubbish, irrelevant data
difficult to extract hits
no language analysis
only string query, which is crude
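
Why a plain string query is crude can be shown on a handful of invented sentences: a substring search returns irrelevant hits, while even a whole-word search still misses inflected forms because no language analysis is involved. The sample texts and function names below are assumptions for the sketch and say nothing about how any particular search engine matches queries.

```python
import re


def substring_hits(query, texts):
    """Case-insensitive substring match, the crudest possible string query."""
    return [t for t in texts if query.lower() in t.lower()]


def word_hits(query, texts):
    """Case-insensitive whole-word match; still blind to inflected forms."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


if __name__ == "__main__":
    texts = [
        "They set the table.",
        "The dispute was settled.",
        "She was upset.",
        "Two sets of data.",
    ]
    print(substring_hits("set", texts))  # all four sentences, two of them irrelevant
    print(word_hits("set", texts))       # only the first; "sets" is missed
```

A tagged and lemmatized corpus would let the query ask for the lemma "set" as a noun or as a verb, which neither of these string searches can do.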

22 One quick example
Representativity or representativeness?
Throw the two words at Google and have a look at the figures
Think about the conclusions
There are special front-end sites


