Published by Robert Bell. Modified over 8 years ago.
1
What is a Corpus?
What is not a corpus?
– the Web
– collection of citations
– a text
Definition of a corpus
“A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.” (Sinclair, 2005)
2
Corpus design (some notions)
Criteria (external): language or language variety, mode, text type, domain, text location, etc.
– Criteria will form cells
Sampling
Balance
– Range of text categories in the corpus
Representativeness
“Representativeness refers to the extent to which a sample includes the full range of variability in a population.” (Biber, 1993)
3
Corpus size
There is no maximum size!
The minimum size depends on:
– The kind of query (e.g. frequent words, technical terms)
– The methodology used for studying the data
Zipf’s law: roughly half of the distinct words (types) in a corpus occur once only, roughly a quarter twice only, etc.
Words occurring once only (hapax legomena) are unlikely to be of interest for more general study of a language
Moving from single words to sequences of words, frequency drops
Studies of collocations require bigger corpora
Corpora for specialised studies can be much smaller
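The share of hapax legomena in a text can be checked with a few lines of code. The following is a minimal sketch, assuming a naive lowercase regex tokenizer; the sample sentence and the function name `frequency_spectrum` are illustrative and not part of the slides.

```python
import re
from collections import Counter

def frequency_spectrum(text):
    """Return (spectrum, n_types), where spectrum[n] is the number of
    types (distinct words) that occur exactly n times in the text."""
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    type_freq = Counter(tokens)
    return Counter(type_freq.values()), len(type_freq)

# Illustrative sample; a real corpus would be far larger.
sample = "the cat sat on the mat and the dog sat on the rug"
spectrum, n_types = frequency_spectrum(sample)

hapax_share = spectrum[1] / n_types  # proportion of types occurring once
print(f"{spectrum[1]} of {n_types} types are hapax legomena ({hapax_share:.0%})")
```

On a text this small the proportions are not meaningful; the same function applied to a larger corpus gives a rough impression of how steeply frequencies fall off.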
4
Why use a corpus?
“As language teachers and professionals, we often have strong intuitions about language use… Corpus-based research, however, shows us that our intuitions are often completely wrong.” (Biber 2005)
Even if our intuition is correct, the language we produce may not represent typical language use (McEnery et al., 2006)
Corpus-based research: authentic data
Using a computer to study language:
– quick processing of data
– accurate and consistent
– non-biased
– allows enriching data with additional information
5
What can a corpus be used for?
Quantitative analysis – what can be counted?
– characters
– word-forms
– parts of speech
– sentences
– paragraphs
– sections, chapters…
– utterances
– turns
Qualitative analysis
– meaning
– patterns
– semantic prosody
6
Types of corpora (1)
Reference corpus – large, includes spoken and written texts representing various social and situational strata
Monitor corpus – growing regularly, reflects language changes
Balanced corpus – balanced according to text type, genre, or domain
Sampled corpus – finite collection of carefully selected texts
Annotated corpus – enhanced with various types of linguistic information
Unannotated (raw) corpus – contains only plain texts with no additional linguistic information
7
Types of corpora (2)
General (represents a language or language variety) and specialized corpora (domain or genre specific)
Monolingual and multilingual corpora; parallel corpora
Comparable corpora
Spoken and written corpora
Synchronic and diachronic corpora
Native speaker and learner corpora
Examples:
– The British National Corpus (BNC)
– The Michigan Corpus of Academic Spoken English (MICASE)
8
Basic notions in corpus linguistics
type / token
Example: A corpus is a collection of pieces of language text in electronic form,
– 13 tokens
– 11 types (a, corpus, is, collection, of, pieces, language, text, in, electronic, form)
word-form / lemma
– play, plays, playing, played (word-forms of lemma play)
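The counts in the example can be reproduced with a short script. This is a minimal sketch, assuming case-insensitive matching and a simple regex tokenizer; the lemma lookup table `LEMMAS` is a hypothetical toy stand-in for a real lemmatizer.

```python
import re

def tokens_and_types(text):
    """Split text into lowercase word tokens and return (tokens, types)."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens, set(tokens)

sentence = ("A corpus is a collection of pieces of language text "
            "in electronic form,")
tokens, types = tokens_and_types(sentence)
print(len(tokens), len(types))  # prints: 13 11

# Toy lemma lookup (hypothetical); real lemmatizers use much larger resources.
LEMMAS = {"plays": "play", "playing": "play", "played": "play"}

def lemmatize(word_form):
    """Map a word-form to its lemma, falling back to the form itself."""
    return LEMMAS.get(word_form, word_form)

lemmas = {lemmatize(w) for w in ["play", "plays", "playing", "played"]}
print(lemmas)  # prints: {'play'}
```

"a" and "of" each occur twice, which is why 13 tokens reduce to 11 types once case is ignored.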
9
Types of output/analyses
Word/phrase frequency: wordlists, N-grams (clusters, lexical bundles)
Concordance (node, KWIC, sorting, expanded context)
Collocation (span, T-score, Mutual information)
Keywords
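The outputs listed above can be illustrated on a toy text. The sketch below assumes a simple regex tokenizer and uses one common formulation of the collocation measures: MI = log2(O/E) and t-score = (O − E)/√O, with expected frequency E = f(node) × f(collocate) / N. Corpus tools differ in how they compute E (for example, some adjust for span size), so the numbers are indicative only. All function names and the sample text are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, node, span=3):
    """Key Word In Context: (left context, node, right context) per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

def collocation_scores(tokens, node, collocate, span=3):
    """Return (observed, MI, t-score) for collocate within +/- span of node."""
    freq = Counter(tokens)
    observed = 0
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed += window.count(collocate)
    if observed == 0:
        return 0, None, None  # scores are undefined without co-occurrences
    # Simplified expected frequency (an assumption; tools vary here).
    expected = freq[node] * freq[collocate] / len(tokens)
    mi = math.log2(observed / expected)
    t_score = (observed - expected) / math.sqrt(observed)
    return observed, mi, t_score

text = "strong tea and strong coffee but powerful computers and strong tea"
toks = tokenize(text)

bigram_freq = Counter(ngrams(toks, 2))
concordance = kwic(toks, "tea", span=2)
observed, mi, t_score = collocation_scores(toks, "strong", "tea", span=1)

for left, node, right in concordance:
    print(f"{left:>12} [{node}] {right}")
```

Sorting the concordance lines by left or right context, as concordancers do, amounts to sorting the returned tuples on the first or third element.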
10
Building your own corpus (1)
Design:
– Research question
– Criteria → cells
Identifying sources of data
– Existing corpora
– Data archives: the Oxford Text Archive
– Other: Nexis UK, Lexis Nexis, WebBootCaT
Copyright
11
Building your own corpus (2)
Data collection
– Downloading
– Recording
– Scanning
– Keyboarding
Documentation
Preparing data:
– data conversion to txt format (including transcription)
– character encoding
– clean-up
– mark-up
– alignment (parallel data)
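The conversion, encoding, and clean-up steps might look like the following in practice. This is a minimal sketch, assuming the source encoding is known in advance and that mark-up to be removed consists only of simple HTML-like tags; the function name `clean_text` and the sample input are illustrative.

```python
import re
import unicodedata

def clean_text(raw_bytes, encoding="utf-8"):
    """Decode raw bytes and return normalized plain text."""
    # Character encoding: decode with the known source encoding;
    # undecodable bytes become U+FFFD instead of raising an error.
    text = raw_bytes.decode(encoding, errors="replace")
    # Use one canonical Unicode form so identical words compare equal.
    text = unicodedata.normalize("NFC", text)
    # Clean-up: drop simple HTML-like tags (a crude heuristic, not a parser).
    text = re.sub(r"<[^>]+>", " ", text)
    # Unify line endings, then collapse runs of spaces and blank lines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Illustrative input: Latin-1 bytes with tags and Windows line endings.
raw = "<p>Caf\u00e9  au\r\nlait</p>".encode("latin-1")
cleaned = clean_text(raw, encoding="latin-1")
print(cleaned)
```

Writing the result back with `open(path, "w", encoding="utf-8")` gives every file in the corpus the same encoding, which most corpus tools expect.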
12
Building your own corpus (3)
Selecting a corpus tool for analysis
Adding value to data – annotation:
– part-of-speech (POS) tagging
– tokenization
– lemmatization
– parsing
– other: semantic annotation, pragmatic annotation, error tagging, etc.
13
Corpus tools
Freely available vs commercial tools
Standalone vs online tools
– AntConc (free, standalone)
– WordSmith Tools (not free, standalone)
– Sketch Engine (not free, online)
14
What can corpora not tell us?
No negative evidence
Corpora rarely provide explanations
Their usefulness depends on the research question
Findings cannot be generalised (unless the corpus is representative)
15
Suggested reading
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Kennedy, G. D. 1998. An Introduction to Corpus Linguistics. Harlow: Longman.
McEnery, T. and Wilson, A. 2001. Corpus Linguistics. (2nd Ed.) Edinburgh: Edinburgh University Press.
McEnery, T., Xiao, R. and Tono, Y. 2006. Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
16
Corpora
ACORN (Aston students and staff only): http://acorn.aston.ac.uk/private/language.php
Collins COBUILD Corpus [56 million words]: http://www.collins.co.uk/Corpus/CorpusSearch.aspx
British National Corpus (BNC): http://sara.natcorp.ox.ac.uk/lookup.html
CORPUS.BYU.EDU (Brigham Young University): http://view.byu.edu/
MICASE (Michigan Corpus of Academic Spoken English): http://quod.lib.umich.edu/m/micase/
International Corpus of English (ICE), need to send the agreement (no fee) to get the password: http://www.ucl.ac.uk/english-usage/ice/index.htm
VLC, Hong Kong [English, French, Bilingual, Chinese, Japanese]: http://www.edict.com.hk/concordance/
17
Corpus Tools, and more
Sketch Engine: http://www.sketchengine.co.uk/
AntConc: http://www.antlab.sci.waseda.ac.jp/software.html
WordSmith Tools: http://www.lexically.net/wordsmith/
Websites with general information about corpora and corpus linguistics:
– http://tiny.cc/corpora
– http://www.corpus-linguistics.de/