
What is a Corpus?

What is not a corpus?
- the Web
- a collection of citations
- a text

Definition of a corpus
“A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.” (Sinclair, 2005)

Corpus design (some notions)
- Criteria (external): language or language variety, mode, text type, domain, text location, etc.
  - The criteria form the cells of the design
- Sampling
- Balance
  - The range of text categories in the corpus
- Representativeness
  “Representativeness refers to the extent to which a sample includes the full range of variability in a population.” (Biber, 1993)

Corpus size
- There is no maximum size!
- The minimum size depends on:
  - the kind of query (e.g. frequent words, technical terms)
  - the methodology used for studying the data
- Zipf’s law: roughly half of the distinct words (types) in a corpus occur once only, about a quarter twice only, and so on
- Words occurring once only (hapax legomena) are unlikely to be of interest for a more general study of a language
- Moving from single words to sequences of words, frequencies drop further
- Studies of collocations therefore require bigger corpora
- Corpora for specialised studies can be much smaller
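As a rough illustration of the point about words that occur once only, the following Python sketch counts how many word types occur exactly once, twice, and so on in a text. The toy sentence, the simple regular-expression tokenizer, and the function names are assumptions made for this example; a real corpus would be far larger and would need a more careful tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_spectrum(tokens):
    """Map each frequency n to the number of word types occurring exactly n times."""
    type_freqs = Counter(tokens)          # word type -> how often it occurs
    return Counter(type_freqs.values())   # frequency -> how many types have it

# Toy text, invented for illustration only.
sample = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(sample)
spectrum = frequency_spectrum(tokens)

n_types = len(set(tokens))
hapax_share = spectrum[1] / n_types  # share of types that occur once only

for freq in sorted(spectrum):
    print(f"{spectrum[freq]} type(s) occur {freq} time(s)")
print(f"hapax legomena: {hapax_share:.0%} of all types")
```

Running the same functions on corpora of different sizes is one way to see how many of the types of interest occur often enough to be studied.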

Why use a corpus?
- “As language teachers and professionals, we often have strong intuitions about language use… Corpus-based research, however, shows us that our intuitions are often completely wrong.” (Biber 2005)
- Even if our intuition is correct, the language we produce may not represent typical language use (McEnery et al., 2006)
- Corpus-based research: authentic data
- Using a computer to study language:
  - quick processing of data
  - accurate and consistent
  - non-biased
  - allows enriching data with additional information

What can a corpus be used for?
- Quantitative analysis: what can be counted?
  - characters
  - word-forms
  - parts of speech
  - sentences
  - paragraphs
  - sections, chapters…
  - utterances
  - turns
- Qualitative analysis
  - meaning
  - patterns
  - semantic prosody
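A minimal sketch of counting some of these units in plain text is given below. It assumes that paragraphs are separated by blank lines and that sentences end in ".", "!" or "?"; both are simplifications, and the sample text and the function name are invented for this example. Units such as parts of speech, utterances, or turns need annotation or transcription conventions that plain text does not provide.

```python
import re

def count_units(text):
    """Count characters, word-forms, sentences and paragraphs in plain text."""
    # Assumption: paragraphs are separated by one blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = []
    for paragraph in paragraphs:
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        sentences.extend(re.split(r"(?<=[.!?])\s+", paragraph.strip()))
    word_forms = re.findall(r"[A-Za-z]+", text)
    return {
        "characters": len(text),
        "word_forms": len(word_forms),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
    }

# Toy text, invented for illustration only.
text = "Corpora are useful. They show real usage!\n\nDo intuitions agree? Often not."
counts = count_units(text)
print(counts)
```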

Types of corpora (1)
- Reference corpus: large; includes spoken and written texts representing various social and situational strata
- Monitor corpus: grows regularly and reflects language change
- Balanced corpus: balanced according to text type, genre, or domain
- Sampled corpus: a finite collection of carefully selected texts
- Annotated corpus: enhanced with various types of linguistic information
- Unannotated (raw) corpus: contains only plain text, with no additional linguistic information

Types of corpora (2)
- General corpora (representing a language or language variety) and specialized corpora (domain or genre specific)
- Monolingual and multilingual corpora; parallel corpora
- Comparable corpora
- Spoken and written corpora
- Synchronic and diachronic corpora
- Native speaker and learner corpora
- The British National Corpus (BNC)
- The Michigan Corpus of Academic Spoken English (MICASE)

Basic notions in corpus linguistics
- type / token
  Example: “A corpus is a collection of pieces of language text in electronic form,”
  - 13 tokens
  - 11 types (a, corpus, is, collection, of, pieces, language, text, in, electronic, form)
- word-form / lemma
  - play, plays, playing, played (word-forms of the lemma play)
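The token and type counts for the example sentence can be checked with a few lines of Python. This is a sketch under two assumptions: tokens are runs of letters, and types are compared case-insensitively (so "A" and "a" count as one type). The small lemma table is hand-written for illustration and is not a real lemmatizer.

```python
import re

sentence = "A corpus is a collection of pieces of language text in electronic form,"

# Tokens: every running word, in order of occurrence.
tokens = re.findall(r"[A-Za-z]+", sentence)

# Types: distinct word-forms, ignoring case.
types = sorted({token.lower() for token in tokens})

print(len(tokens), "tokens")
print(len(types), "types:", ", ".join(types))

# Hypothetical hand-made lookup from word-form to lemma.
LEMMAS = {"play": "play", "plays": "play", "playing": "play", "played": "play"}
word_forms = ["play", "plays", "playing", "played"]
lemmas = {LEMMAS[form] for form in word_forms}
print(lemmas)  # all four word-forms share one lemma
```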

Types of output/analyses
- Word/phrase frequency: wordlists, N-grams (clusters, lexical bundles)
- Concordance (node, KWIC, sorting, expanded context)
- Collocation (span, T-score, Mutual information)
- Keywords
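The Python sketch below illustrates three of these outputs on a toy text: n-gram counts, a KWIC (key word in context) concordance, and two association scores for a word pair. The formulas used are the commonly cited ones, MI = log2(O / E) and t = (O - E) / sqrt(O), with E = f(w1) * f(w2) / N; the sketch assumes a span of exactly one word to the right (adjacent pairs only). Corpus tools may use different spans or variants of these formulas, so the numbers are illustrative rather than a reproduction of any particular tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, node, span=3):
    """Return (left context, node, right context) for every hit of the node word."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines

def collocation_scores(tokens, w1, w2):
    """Mutual information and t-score for the adjacent pair (w1, w2)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    observed = Counter(ngrams(tokens, 2))[(w1, w2)]
    if observed == 0:
        return None  # the pair never occurs, so no score is computed
    expected = unigrams[w1] * unigrams[w2] / n
    mi = math.log2(observed / expected)
    t_score = (observed - expected) / math.sqrt(observed)
    return mi, t_score

# Toy text, invented for illustration only.
tokens = "the cat sat on the mat and the dog sat on the log".split()

bigram_counts = Counter(ngrams(tokens, 2))
concordance = kwic(tokens, "sat", span=2)
mi, t_score = collocation_scores(tokens, "sat", "on")

for left, node, right in concordance:
    print(f"{left:>10}  [{node}]  {right}")
print(f"MI = {mi:.2f}, t-score = {t_score:.2f}")
```

Sorting the concordance lines by their left or right context is a small extension of `kwic` and corresponds to the "sorting" option mentioned in the list.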

Building your own corpus (1)
- Design:
  - Research question
  - Criteria → cells
- Identifying sources of data
  - Existing corpora
  - Data archives: the Oxford Text Archive
  - Other: Nexis UK, Lexis Nexis, WebBootCaT
- Copyright

Building your own corpus (2)
- Data collection
  - Downloading
  - Recording
  - Scanning
  - Keyboarding
- Documentation
- Preparing data:
  - data conversion to txt format (including transcription)
  - character encoding
  - clean-up
  - mark-up
  - alignment (parallel data)
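Two of the preparation steps, character encoding and clean-up, can be sketched in Python as follows. The source encoding (Latin-1), the particular clean-up rules, and the function name are assumptions chosen for this example; the right choices depend on where the data comes from, and steps such as mark-up or alignment are not covered here.

```python
import re
import unicodedata

def clean_text(raw_bytes, source_encoding="latin-1"):
    """Decode raw bytes and apply a few simple clean-up rules."""
    text = raw_bytes.decode(source_encoding)             # assumed source encoding
    text = unicodedata.normalize("NFC", text)            # one form for accented letters
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[ \t]+", " ", text)                  # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                 # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)               # at most one blank line in a row
    return text.strip()

# Toy input, invented for illustration: Latin-1 bytes with messy spacing.
raw = "Caf\xe9  society\r\n\r\n\r\n\r\nwas   thriving. ".encode("latin-1")
cleaned = clean_text(raw)

# Store the result as UTF-8, a common choice for corpus files.
utf8_bytes = cleaned.encode("utf-8")
print(repr(cleaned))
```

In practice the cleaned text would be written to a .txt file, and the encoding and clean-up decisions recorded in the corpus documentation.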

Building your own corpus (3)
- Selecting a corpus tool for analysis
- Adding value to data through annotation:
  - part-of-speech (POS) tagging, tokenization
  - lemmatization
  - parsing
  - other: semantic annotation, pragmatic annotation, error tagging, etc.

Corpus tools
- Freely available vs commercial tools
- Standalone vs online tools
- AntConc (free, standalone)
- WordSmith Tools (not free, standalone)
- Sketch Engine (not free, online)

What can corpora not tell us?
- No negative evidence: a corpus shows what occurs, not what is impossible
- Corpora rarely provide explanations
- Their usefulness depends on the research question
- Findings cannot be generalised (unless the corpus is representative)

Suggested reading
- Hunston, S. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
- Kennedy, G. D. An Introduction to Corpus Linguistics. Harlow: Longman.
- McEnery, T. and Wilson, A. Corpus Linguistics (2nd ed.). Edinburgh: Edinburgh University Press.
- McEnery, T., Xiao, R. and Tono, Y. Corpus-based Language Studies: An Advanced Resource Book. London: Routledge.
- Sinclair, J. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Corpora
- ACORN (Aston students and staff only)
- Collins COBUILD Corpus [56 million words]
- British National Corpus (BNC)
- CORPUS.BYU.EDU (Brigham Young University)
- MICASE (Michigan Corpus of Academic Spoken English)
- International Corpus of English (ICE): the agreement must be sent (no fee) to get the password
- VLC, Hong Kong [English, French, Bilingual, Chinese, Japanese]

Corpus tools, and more
- Sketch Engine
- AntConc
- WordSmith Tools
- Websites with general information about corpora and corpus linguistics