Corpus Linguistics I ENG 617


Corpus Linguistics I ENG 617 Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) rsabbagh@alsun.asu.edu.eg Week 2

Recap
Last time, we talked about the following:
- A corpus is a collection of real-world texts.
- Corpus linguistics cannot answer the "why" questions.
- Corpus linguistics is a quantitative, descriptive field of study.
- There is a difference between corpus-based and corpus-driven studies.
- Computers are used to assist corpus analysis, but they are not essential.
- Corpus linguistics is used in many fields, such as translation and lexicography.

Types of Corpora: General vs. Specialized Corpora (1)
There are many types of corpora, and which one to select depends on the type of question you are trying to answer. For example, if you want to know the most frequent words in American English, you should use a general corpus of American English. A general corpus is a collection of texts from different genres (e.g. academic, business, legal, newspapers, social media posts) and registers (i.e. spoken and written). One example of a general corpus is the Brown Corpus.

Types of Corpora: General vs. Specialized Corpora (2)
However, if you want to know the most frequently used words in academic discourse, then you need a specialized corpus of academic discourse. A specialized corpus is a collection of texts from one genre and one register. One example of a specialized corpus is the BioScope Corpus.

Types of Corpora: Synchronic vs. Diachronic Corpora
If you want to study language at a specific period of time, then you probably need a synchronic corpus. A synchronic corpus is a collection of texts that belong to one period of time, which can be either a past or a present period. One example of a synchronic corpus is the Corpus of Contemporary American English (COCA). However, if you want to trace changes in word usage, for instance, then you need a diachronic (or historical) corpus that comprises texts from more than one period of time. One example of a diachronic corpus is the Corpus of Historical American English (COHA).
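
Tracing a change in word usage across periods can be sketched as follows. This is a minimal illustration only: the two snippets, the period labels, and the helper name `relative_frequency` are invented for the example and are not taken from COHA or any other corpus.

```python
from collections import Counter

# Toy diachronic corpus: invented snippets keyed by period.
texts_by_period = {
    "1900s": "we shall go and we shall see what we shall find",
    "2000s": "we will go and we will see what we shall find",
}

def relative_frequency(text, word):
    """Return the occurrences of `word` per 100 tokens of `text`."""
    tokens = text.lower().split()
    return 100 * Counter(tokens)[word] / len(tokens)

# One relative frequency per period, so periods of different size stay comparable.
trend = {
    period: relative_frequency(text, "shall")
    for period, text in texts_by_period.items()
}
print(trend)
```

Relative rather than absolute frequencies are used so that periods with more text do not look as if they use the word more often simply because they are larger.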

Types of Corpora: Raw vs. Annotated
If you are only interested in word frequencies, then a raw corpus can be enough. A raw corpus is a corpus without any linguistic analysis: it contains only plain text. One example is the Charles Dickens Corpus from Project Gutenberg. However, if you need to know the frequencies of a particular grammatical class or a certain syntactic structure, then you need an annotated corpus. An annotated corpus is a corpus that has undergone some sort of linguistic analysis. An example of an annotated corpus is the Quranic Arabic Corpus.
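
The difference between the two kinds of question can be sketched with a toy sentence stored once as raw text and once with a tag on every word. The sentence, the word/TAG notation, the tag labels, and the function names are assumptions made for this illustration; real annotated corpora use their own formats and tagsets.

```python
from collections import Counter

# Raw corpus: plain text only (toy example, not from a real corpus).
raw_text = "the cat sat on the mat and the dog sat too"

# Annotated corpus: the same text with a hypothetical part-of-speech tag per word.
annotated_text = (
    "the/DET cat/NOUN sat/VERB on/PREP the/DET mat/NOUN "
    "and/CONJ the/DET dog/NOUN sat/VERB too/ADV"
)

def word_frequencies(text):
    """Count word forms; a raw corpus is enough for this."""
    return Counter(text.lower().split())

def tag_frequencies(tagged_text):
    """Count grammatical classes; this needs the annotation."""
    tags = [token.rsplit("/", 1)[1] for token in tagged_text.split()]
    return Counter(tags)

word_counts = word_frequencies(raw_text)
tag_counts = tag_frequencies(annotated_text)

print(word_counts.most_common(2))
print(tag_counts.most_common(3))
```

The raw text can tell us that "the" is the most frequent word, but only the tagged version can tell us how many nouns or verbs the text contains.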

Types of Corpora: Monolingual vs. Multilingual
If you are interested in studying a single language, then a monolingual corpus, that is, a collection of texts in one language, is enough. One example of an English monolingual corpus is the British National Corpus. However, if you are doing a contrastive study or you need to know how specific words are translated, then you need a multilingual corpus, which is a collection of texts in more than one language. Multilingual corpora can be either parallel or comparable.

Types of Corpora: Parallel vs. Comparable
Parallel corpora comprise texts that are exact translations of one another. One example is the MultiJur Parallel Corpus of Legal Texts. Comparable corpora comprise texts that tackle the same topics in multiple languages; yet, the texts are not exact translations of one another. One well-known comparable corpus is the Wikipedia Corpus.
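
Because the texts in a parallel corpus are translations of one another, they can be stored as aligned sentence pairs and searched for translation candidates. The sketch below assumes a tiny invented English-German sample and a hypothetical helper, `translation_candidates`; it only counts co-occurring words, which is far cruder than the word alignment used in real translation research.

```python
from collections import Counter

# Toy English-German parallel corpus: each pair is a sentence and its translation.
# The sentences are invented for illustration.
parallel_corpus = [
    ("the court rejected the appeal", "das Gericht wies die Berufung zurück"),
    ("the appeal was filed late", "die Berufung wurde verspätet eingelegt"),
    ("the court is closed", "das Gericht ist geschlossen"),
]

def translation_candidates(pairs, source_word):
    """Count target-side words in pairs whose source side contains source_word."""
    counts = Counter()
    for source, target in pairs:
        if source_word in source.split():
            counts.update(target.lower().split())
    return counts

candidates = translation_candidates(parallel_corpus, "appeal")
print(candidates.most_common(3))
```

Function words such as articles will score as high as the real translation in a sample this small, which is one reason larger parallel corpora and proper alignment are needed in practice.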

Types of Corpora: Monitor and Learner Corpora
Monitor corpora are dynamically growing corpora. They are set up to be updated regularly; one example is the Bank of English. Monitor corpora are frequently referred to as diachronic corpora. Learner corpora are compiled from the writings of language learners for pedagogical purposes. An example is the Arabic Learner Corpus.

Quiz
Read the description of each corpus and then answer the questions.
- News on the Web (NOW): it comprises 5.1 billion words from Web-based newspapers and magazines from 2010 to the present. It is updated on a daily basis.
- TIME Magazine Corpus: it is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923 to 2006.
- Hansard Corpus: it has speeches from the British Parliament from 1803 to 2015.
True or False?
1. All three corpora can be considered monitor corpora.
2. All three corpora are general corpora.

Quiz
True or False?
1. Raw corpora are linguistically analyzed.
2. Translation studies use comparable corpora.
3. To find archaic words, we can use synchronic corpora.
4. A corpus of political newspaper articles is a general corpus.
5. Learner corpora are typically used for educational purposes.
6. A corpus of clinical discharge reports is a specialized corpus.
7. A corpus of UN resolutions is an example of a comparable corpus.
8. Specialized corpora comprise one single genre, unlike general corpora.
9. A corpus of English, Swedish, and German texts is a multilingual corpus.
10. A corpus in which each word is labeled for its grammatical category is an annotated corpus.

Finding Off-The-Shelf Corpora
Off-the-shelf corpora, also known as ready-made corpora, can be found in:
- Enterprises such as the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA)
- Free online corpora such as: Brigham Young University Arabic Corpus Tool
Off-the-shelf corpora can also be obtained by contacting individual researchers.

Criteria of Well-Designed Corpora
When we pick a corpus for our research, we need to ask ourselves two main questions:
- What do we want to do with the corpus? The answer helps us pick the right type of corpus.
- Is the corpus we picked well designed?
So what are the criteria of a well-designed corpus?

Criteria of Well-Designed Corpora: Machine-Readable
Although we said that corpus analysis can be done manually, it is ideal to use computer software to do the analysis for you. Since computers are preferable, the texts of the corpus must be in a machine-readable format; that is, in a format that the computer can process. The formats that corpus analysis software can process are:
- Plain text files
- Tab-delimited files
- Comma-Separated Values (CSV) files
- eXtensible Markup Language (XML) files
How about Word and PDF files?
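
To make "machine-readable" concrete, here is a minimal sketch that reads the same two toy sentences from plain text, CSV, and XML using only the Python standard library. The column name `text`, the element names `corpus` and `s`, and the reader function names are assumptions chosen for this example; an actual corpus defines its own layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same two toy sentences stored in three machine-readable formats.
plain_text = "Corpora are useful.\nThey show real usage.\n"
csv_text = "id,text\n1,Corpora are useful.\n2,They show real usage.\n"
xml_text = (
    "<corpus>"
    "<s id='1'>Corpora are useful.</s>"
    "<s id='2'>They show real usage.</s>"
    "</corpus>"
)

def read_plain(text):
    """One sentence per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]

def read_csv(text):
    """One sentence per row, taken from the 'text' column."""
    return [row["text"] for row in csv.DictReader(io.StringIO(text))]

def read_xml(text):
    """One sentence per <s> element."""
    return [s.text for s in ET.fromstring(text).iter("s")]

sentences_plain = read_plain(plain_text)
sentences_csv = read_csv(csv_text)
sentences_xml = read_xml(xml_text)
print(sentences_plain == sentences_csv == sentences_xml)
```

In all three cases a few lines of code recover exactly the same sentences, which is what makes these formats convenient for corpus software.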

Criteria of Well-Designed Corpora: Authentic
Authenticity means that the texts of the corpus must have been produced in natural communicative settings, without being manipulated for the purposes of the researcher. By definition, newspaper articles, movie scripts, songs, novels, poetry, etc. are authentic. Why? Because their writers did not tailor their language usage to match the purposes of any given study.

Criteria of Well-Designed Corpora: Representative
The texts of the corpus must reflect real-world variation. For example, if we want to know whether men or women swear more frequently on social media, our corpus should include posts from both men and women. If we want to know the most frequent word in American English, then our corpus should comprise as many genres and registers as possible.

Criteria of Well-Designed Corpora: Balanced
Balanced means that every variety should be equally represented. Again, if we want to know whether men or women swear more frequently on social media, our corpus should include the same number of posts from both men and women. If we want to know the most frequent word in American English, then our corpus should comprise as many genres and registers as possible, and each genre and register should have the same number of words.
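
One simple way to balance an uneven collection is to downsample every group to the size of the smallest one. The sketch below assumes placeholder posts and a hypothetical helper, `balance_by_downsampling`; it is one possible strategy, not the only way to build a balanced corpus.

```python
import random

# Toy data: 100 posts by men and 50 by women (placeholder strings, not real posts).
posts = {
    "men": [f"post by a man #{i}" for i in range(100)],
    "women": [f"post by a woman #{i}" for i in range(50)],
}

def balance_by_downsampling(groups, seed=0):
    """Randomly keep the same number of posts from every group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    target = min(len(items) for items in groups.values())
    return {name: rng.sample(items, target) for name, items in groups.items()}

balanced = balance_by_downsampling(posts)
print({name: len(items) for name, items in balanced.items()})
```

Downsampling throws data away, so when more posts from the smaller group can be collected, enlarging that group is usually the better choice.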

Criteria of Well-Designed Corpora: Large (1)
Since we live in the era of BIG DATA, the more, the merrier. However, the ideal size of your corpus depends on a number of factors:
- What you are studying: if you are studying a very common phenomenon such as prepositions, then a few thousand words are enough. However, if you are studying idioms, then you may need millions or even billions of words.
- How accessible the data is: sometimes, there are restrictions on certain types of data, such as the results of the IELTS.

Criteria of Well-Designed Corpora: Large (2)
- What type of data you want: for example, the Quran contains only tens of thousands of words, and there are no more Qurans to enlarge the corpus.
- How much time and money you have: sometimes, corpus compilation needs both money and time. Why?
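
The link between what you are studying and how large the corpus must be can be shown with simple arithmetic: the expected number of hits is the rate of the phenomenon multiplied by the corpus size. The two rates below are assumed, purely illustrative figures, not measurements from any corpus, and `expected_hits` is a name made up for this sketch.

```python
def expected_hits(rate_per_million, corpus_size_in_words):
    """Expected number of occurrences given a rate per million words."""
    return rate_per_million * corpus_size_in_words / 1_000_000

# Assumed, purely illustrative rates (not measured from any corpus).
preposition_rate = 100_000  # very common: 100,000 occurrences per million words
idiom_rate = 0.5            # very rare: one occurrence every two million words

for size in (10_000, 1_000_000, 1_000_000_000):
    print(size, expected_hits(preposition_rate, size), expected_hits(idiom_rate, size))
```

Under these assumptions, a corpus of ten thousand words already yields plenty of prepositions, while the rare idiom is unlikely to appear at all until the corpus reaches millions of words.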

Quiz
True or False?
1. PDF files are the best format to store your corpus.
2. Authenticity is crucial for a well-designed corpus.
3. A corpus of 100 posts from men and 50 posts from women is skewed.
4. A two-sentence corpus is large enough to study sentence structure in English.
5. Sometimes, there are logistical restrictions that can prevent you from compiling the ideal corpus.