Corpus Linguistics I ENG 617


1 Corpus Linguistics I ENG 617
Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Week 2

2 Recap
Last time, we talked about:
A corpus is a collection of real-world texts.
Corpus linguistics cannot answer the why questions.
Corpus linguistics is a quantitative, descriptive field of study.
There is a difference between corpus-based and corpus-driven studies.
Computers are used to assist corpus analysis, but they are not essential.
Corpus linguistics is used in many fields such as translation and lexicography.

3 Types of Corpora: General vs. Specialized Corpora 1
There are many types of corpora, and which one to select depends on what type of questions you are trying to answer. For example, if you want to know the most frequent words in American English, you should use a general corpus of American English. A general corpus is a collection of texts from different genres (e.g. academic, business, legal, newspapers, social media posts) and registers (i.e. spoken and written). One example of a general corpus is the Brown Corpus.

4 Types of Corpora: General vs. Specialized Corpora 2
However, if you want to know the most frequently used words in academic discourse, then you need a specialized corpus of academic discourse. A specialized corpus is a collection of texts from one genre and one register. One example of a specialized corpus is the BioScope Corpus.

5 Types of Corpora: Synchronic vs. Diachronic Corpora
If you want to study language at a specific period of time, then you probably need a synchronic corpus. A synchronic corpus is a collection of texts that belong to one period of time, which can be either a past or a present period. One example of a synchronic corpus is the Corpus of Contemporary American English (COCA). However, if you want to trace changes in word usage, for instance, then you need a diachronic (or historical) corpus that comprises texts from more than one period of time. One example of a diachronic corpus is the Corpus of Historical American English (COHA).
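
Tracing a change in word usage across periods can be sketched as follows. This is only a minimal illustration: the two "periods" are invented toy sentences, not data from COHA, and the function name `per_million` is an assumption made for this example. Frequencies are normalized per million tokens so that periods of different sizes can be compared.

```python
from collections import Counter
import re


def per_million(text: str, word: str) -> float:
    """Return the occurrences of `word` per million tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * 1_000_000 / len(tokens)


# Hypothetical toy data: one short "text" per period.
# A real diachronic corpus holds millions of words per period.
periods = {
    "1850s": "thou art kind and thou art wise",
    "2000s": "you are kind and you are wise",
}

# Normalized frequency of "thou" in each period.
rates = {period: per_million(text, "thou") for period, text in periods.items()}

for period, rate in rates.items():
    print(f"{period}: {rate:.0f} per million tokens")
```

A synchronic study would run the same count on a single period only.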

6 Types of Corpora: Raw vs. Annotated
If you are only interested in word frequencies, then a raw corpus can be enough. A raw corpus is a corpus without any linguistic analysis; it contains only plain text. One example is the Charles Dickens Corpus from Project Gutenberg. However, if you need to know the frequencies of a particular grammatical class or a certain syntactic structure, then you need an annotated corpus. An annotated corpus is a corpus that has undergone some sort of linguistic analysis. An example of an annotated corpus is the Quranic Arabic Corpus.
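
The difference between the two kinds of corpus can be sketched in a few lines. In this illustrative example, the sentences and the part-of-speech labels (`DET`, `NOUN`, `VERB`, `ADP`) are made up by hand for the purpose of the example; they are not taken from any of the corpora named above.

```python
from collections import Counter
import re

# A raw corpus is plain text only, so we can count word forms.
raw_text = "The cat sat on the mat. The dog sat too."
word_freq = Counter(re.findall(r"[a-z]+", raw_text.lower()))

# An annotated corpus carries linguistic labels, here a hypothetical
# hand-assigned part-of-speech tag for each word.
annotated = [
    ("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"),
]
tag_freq = Counter(tag for _, tag in annotated)

print(word_freq.most_common(2))  # most frequent word forms
print(tag_freq["NOUN"])          # frequency of one grammatical class
```

Counting nouns is only possible in the second case, because the raw text does not say which words are nouns.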

7 Types of Corpora: Monolingual vs. Multilingual
If you are interested in studying just one language, then a monolingual corpus, that is, a collection of texts in one language, is enough. One example of an English monolingual corpus is the British National Corpus. However, if you are doing a contrastive study or you need to know how specific words are translated, then you need a multilingual corpus, which is a collection of texts in more than one language. Multilingual corpora can be either parallel or comparable.

8 Types of Corpora: Parallel vs. Comparable
Parallel corpora comprise texts that are exact translations of one another. One example is the MultiJur Parallel Corpus of Legal Texts. Comparable corpora comprise texts that tackle the same topics in multiple languages; yet, the texts are not exact translations of one another. One well-known comparable corpus is the Wikipedia Corpus.

9 Types of Corpora: Monitor and Learner Corpora
Monitor corpora are dynamically growing corpora. They are set to be regularly updated; one example is the Bank of English. Monitor corpora are frequently referred to as diachronic corpora. Learner corpora are compiled from the writings of language learners for pedagogical purposes. An example is the Arabic Learner Corpus.

10 Quiz
Read the description of each corpus and then answer the questions:
News on the Web (NOW): it comprises 5.1 billion words from Web-based newspapers and magazines from 2010 to the present time. It is updated on a daily basis.
TIME Magazine Corpus: it is based on 100 million words of text in about 275,000 articles from TIME magazine from 1923 to 2006.
Hansard Corpus: it has speeches from the British Parliament from 1803 –
True or False?
All corpora can be considered monitor corpora.
All corpora are general corpora.

11 Quiz
True or False?
Raw corpora are linguistically analyzed.
Translation studies use comparable corpora.
To find archaic words, we can use synchronic corpora.
A corpus of political newspaper articles is a general corpus.
Learner corpora are typically used for educational purposes.
A corpus of clinical discharge reports is a specialized corpus.
A corpus of UN resolutions is an example of a comparable corpus.
Specialized corpora comprise one single genre unlike general corpora.
A corpus of English, Swedish, and German texts is a multilingual corpus.
A corpus in which each word is labeled for its grammatical category is an annotated corpus.

12 Finding Off-The-Shelf Corpora
Off-the-shelf corpora, also known as ready-made corpora, can be found in:
Organizations such as:
Linguistic Data Consortium (LDC)
European Language Resources Association (ELRA)
Free online corpora such as:
Brigham Young University Arabic Corpus Tool
Off-the-shelf corpora can also be obtained by contacting individual researchers.

13 Criteria of Well-Designed Corpora
Now, when we pick a corpus for our research, we need to ask ourselves two main questions: What do we want to do with that corpus? The answer will help us pick the right type of corpus. Is the corpus we picked well designed? So what are the criteria of a well-designed corpus?

14 Criteria of Well-Designed Corpora: Machine-Readable
Although we said that corpus analysis can be done manually, it is ideal to use computer software to do the analysis for you. Since computers are preferable, the texts of the corpus must be in a machine-readable format; that is, in a format that the computer can process. The formats that corpus analysis software can process are:
Plain text files
Tab-delimited files
Comma-Separated Values (CSV) files
eXtensible Markup Language (XML) files
How about Word and PDF files?
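
To make "machine-readable" concrete, here is a minimal sketch of reading texts from three of these formats with the Python standard library. The sample data, the column name `text`, the XML element name `text`, and the helper names are all assumptions made for this example; real corpora define their own layouts. In-memory strings stand in for files so the sketch runs as-is.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data in three machine-readable formats.
plain_data = "Corpora are collections of texts.\n"
csv_data = "id,text\n1,First post\n2,Second post\n"
xml_data = (
    "<corpus>"
    "<text id='1'>First article</text>"
    "<text id='2'>Second article</text>"
    "</corpus>"
)


def from_plain(data: str) -> list[str]:
    """Treat the whole plain-text input as a single text."""
    return [data.strip()]


def from_csv(data: str, column: str = "text") -> list[str]:
    """Read one text per row from the given column of a CSV input."""
    return [row[column] for row in csv.DictReader(io.StringIO(data))]


def from_xml(data: str, tag: str = "text") -> list[str]:
    """Read one text per element with the given tag name."""
    return [element.text or "" for element in ET.fromstring(data).iter(tag)]


texts = from_plain(plain_data) + from_csv(csv_data) + from_xml(xml_data)
print(len(texts), "texts loaded")
```

A tab-delimited file could be read with the same `csv` module by passing `delimiter="\t"` to `csv.DictReader`.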

15 Criteria of Well-Designed Corpora: Authentic
Authenticity means that the texts of the corpus must have been produced in natural communicative settings, without being manipulated for the purposes of the researcher. By definition, newspaper articles, movie scripts, songs, novels, poetry, etc. are authentic. Why? Because their writers did not tailor their language usage to match the purposes of any given study.

16 Criteria of Well-Designed Corpora: Representative
The texts of the corpus must reflect real-world variation. For example, if we want to know who swears more frequently on social media: men or women, our corpus should include posts from both men and women. If we want to know the most frequent word in American English, then our corpus should comprise as many genres and registers as possible.

17 Criteria of Well-Designed Corpora: Balanced
Balanced means that every variation should be equally represented. Again, if we want to know who swears more frequently on social media: men or women, our corpus should include the same number of posts from both men and women. If we want to know the most frequent word in American English, then our corpus should comprise as many genres and registers as possible, and each genre and register should have the same number of words.
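
A balance check of this kind can be sketched as below. The posts, the group labels, the function names, and the `tolerance` parameter are hypothetical, introduced only for illustration; how much imbalance is acceptable is a judgment the researcher has to make.

```python
from collections import Counter

# Hypothetical mini-corpus: (group, post text) pairs.
posts = [
    ("men", "this is great"),
    ("men", "so annoying today"),
    ("women", "what a day"),
    ("women", "love it"),
]


def posts_per_group(posts):
    """Count the number of posts contributed by each group."""
    return Counter(group for group, _ in posts)


def words_per_group(posts):
    """Count the number of whitespace-separated words per group."""
    totals = Counter()
    for group, text in posts:
        totals[group] += len(text.split())
    return totals


def is_balanced(totals, tolerance=0.0):
    """Return True if the largest and smallest groups differ by at most
    `tolerance` (a fraction of the largest group's size)."""
    sizes = list(totals.values())
    if not sizes:
        return True
    return max(sizes) - min(sizes) <= tolerance * max(sizes)


print(posts_per_group(posts), is_balanced(posts_per_group(posts)))
print(words_per_group(posts), is_balanced(words_per_group(posts)))
```

In this toy data the groups are balanced by number of posts but not by number of words, which shows why it matters to decide which unit the balance is measured in.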

18 Criteria of Well-Designed Corpora: Large 1
Since we live in the era of BIG DATA, the more, the merrier. However, the ideal size of your corpus depends on a number of factors: What you are studying: if you are studying a very common phenomenon such as prepositions, then a few thousand words are enough. However, if you are studying idioms, then maybe you need millions or even billions of words. How accessible the data is: sometimes, there are restrictions on certain types of data such as the results of the IELTS.

19 Criteria of Well-Designed Corpora: Large 2
What type of data you want: for example, the Quran is a closed text of only tens of thousands of words; there are no more Qurans to enlarge the corpus. How much time and money you have: sometimes, corpus compilation needs both money and time. Why?
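
The first factor, how common the phenomenon is, can be turned into a rough size estimate. The sketch below is back-of-the-envelope arithmetic only: the function name `words_needed` and both frequency rates are invented for illustration and are not measurements from any corpus.

```python
import math


def words_needed(desired_hits: int, rate_per_million: float) -> int:
    """Estimate the corpus size (in words) needed to expect `desired_hits`
    occurrences of an item that occurs `rate_per_million` times per
    million words."""
    return math.ceil(desired_hits * 1_000_000 / rate_per_million)


# Hypothetical rates: a very common class of words versus a rare idiom.
common = words_needed(desired_hits=100, rate_per_million=100_000)
rare = words_needed(desired_hits=100, rate_per_million=0.5)

print(f"Common phenomenon: about {common:,} words")
print(f"Rare idiom: about {rare:,} words")
```

Under these assumed rates, the common item needs a corpus in the thousands of words while the rare idiom needs one in the hundreds of millions, which matches the contrast drawn above between prepositions and idioms.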

20 Quiz
True or False?
PDF files are the best format to store your corpus.
Authenticity is crucial for a well-designed corpus.
A corpus of 100 posts from men and 50 posts from women is skewed.
A two-sentence corpus is large enough to study sentence structure in English.
Sometimes, there are logistic restrictions that can prevent you from compiling the ideal corpus.

