Corpus Linguistics I ENG 617

1 Corpus Linguistics I ENG 617
Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Week 3

2 Corpora from Brigham Young University 1
Brigham Young University (BYU) is a private research university in Provo, Utah, USA. It has developed a large number of online corpus processors, mostly for English. Online corpus processors have Web interfaces that let users search the corpora without downloading those large texts to their local machines.
Pros: (1) portable, (2) saves memory, (3) free, and (4) ready to use
Cons: (1) restricted to certain functions and (2) limited to certain texts

3 Corpora from Brigham Young University 2
For illustration purposes, we will use three of the BYU corpora:
The Corpus of Contemporary American English (COCA): a corpus representing American English. It comprises 520 million words of text (20 million words each year, 1990–2015) and is equally divided among spoken, fiction, popular magazine, newspaper, and academic texts.
The Corpus of Historical American English (COHA): a corpus representing historical American English, with more than 400 million words from the 1810s to the 2000s. The corpus is balanced by genre, decade by decade.
The Arabic Corpus Tools: a corpus of 173 million words of text representing Modern Standard and Egyptian Arabic.

4 COCA: Signing Up
Although the Web interfaces of COCA, COHA, and the Arabic Corpus Tools are available for free, you need to sign up to use them.

5 COCA: Searching for Single Words
To search for a single word in COCA, all you need to do is type it in the search box. Each word you look for in the corpus is called a query. For example, typing ‘jump’, we get the following result: What does the figure 19,993 stand for?

6 COCA: Single Words and Raw Frequencies
It is the raw frequency of the word ‘jump’ in COCA. It means that the word ‘jump’ occurs 19,993 times in the corpus. Put differently, ‘jump’ occurs in 19,993 contexts in the corpus. How do we get these 19,993 contexts? By clicking the word itself.
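To make the idea of raw frequency concrete, here is a minimal Python sketch that counts word forms in a short text. The sample sentence and the function name `raw_frequencies` are invented for illustration; this is not how COCA itself computes its counts.

```python
import re
from collections import Counter

def raw_frequencies(text: str) -> Counter:
    """Count how many times each lowercased word form occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy text invented for illustration; it is not taken from COCA.
sample = "Jump in. I saw her jump, then jump again, but he never jumps."
freqs = raw_frequencies(sample)

print(freqs["jump"])   # raw frequency of the exact form 'jump'
print(freqs["jumps"])  # 'jumps' is counted as a separate word form
```

Note that the exact form ‘jump’ and the form ‘jumps’ are counted separately, just as a single-word query matches only the form that was typed.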

7 COCA: Filtering by Part of Speech
When we type ‘jump’ as a query, the results include all the parts of speech of ‘jump’, i.e. ‘jump’ as a verb and as a noun. What if we want the raw frequency of ‘jump’ as a noun? You will need to use the ‘POS’ option to the right of the search box.
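The effect of the POS filter can be sketched in Python on a tiny hand-tagged sample. The tags used here (NOUN, VERB, and so on) are simplified labels assumed for illustration, not the tagset used by COCA, and `frequency_by_pos` is a hypothetical helper.

```python
from collections import Counter

# Hand-tagged toy sample; the tag labels are simplified assumptions.
tagged = [
    ("They", "PRON"), ("jump", "VERB"), ("the", "DET"), ("fence", "NOUN"),
    ("The", "DET"), ("jump", "NOUN"), ("was", "VERB"), ("high", "ADJ"),
    ("A", "DET"), ("big", "ADJ"), ("jump", "NOUN"),
]

def frequency_by_pos(tagged_tokens, word):
    """Return the raw frequency of `word`, split by part-of-speech tag."""
    target = word.lower()
    return Counter(tag for token, tag in tagged_tokens if token.lower() == target)

counts = frequency_by_pos(tagged, "jump")
print(sum(counts.values()))  # raw frequency of 'jump', all parts of speech
print(counts["NOUN"])        # raw frequency of 'jump' as a noun only
```

Filtering by part of speech only works because every token carries a tag; an untagged corpus cannot answer the noun-versus-verb question.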

8 COCA: Searching for Phrases
You can search for phrases the same way you search for single words. For example, if you search for ‘kick the bucket’, 24 is the raw frequency of the entire phrase. To view the contexts in which your query phrase is used, click the phrase itself. However, with phrases we can’t use the POS filter.

9 COCA: Searching with the Wildcard 1
What if you want to search for kick the bucket, kicks the bucket, kicked the bucket, and kicking the bucket? One option is to enter each phrase as a separate query. This is tedious. Another option is to use the asterisk, also called the wildcard (*), as in: kick* the bucket. The wildcard means anything: anything that comes in the position of the wildcard.
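As a rough illustration of what a wildcard query does, the following Python sketch translates a query such as kick* the bucket into a regular expression and runs it over a toy text. The function `wildcard_to_regex` and the sample text are assumptions made for this example; the actual matching rules of the COCA interface may differ.

```python
import re

def wildcard_to_regex(query: str) -> re.Pattern:
    """Build a regex from a query in which '*' stands for any letters.

    A '*' attached to a word (kick*) matches zero or more letters;
    a '*' standing alone matches exactly one whole word.
    """
    token_patterns = []
    for token in query.split():
        if token == "*":
            token_patterns.append(r"[A-Za-z']+")
        else:
            pieces = [re.escape(piece) for piece in token.split("*")]
            token_patterns.append(r"[A-Za-z]*".join(pieces))
    return re.compile(r"\b" + r"\s+".join(token_patterns) + r"\b", re.IGNORECASE)

# Toy text invented for illustration; it is not taken from COCA.
text = ("He may kick the bucket. She kicked the bucket last year; "
        "nobody kicks the bucket twice. Kicking the bucket is an idiom.")

matches = wildcard_to_regex("kick* the bucket").findall(text)
print(matches)
```

One query therefore retrieves all four forms, instead of four separate queries.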

10 COCA: Searching with the Wildcard 2
If a wildcard means anything, then we can use it to identify fixed and flexible expressions. Fixed expressions do not allow any other words to come in between. For example, kick the bucket will always be the same; never kick the big bucket or kick the last bucket. To make sure, try this query in COCA: kick the * bucket. How about the expression ‘at first glance’? Is it fixed or flexible? Can ‘first’ be replaced by something else? To find out, we can try ‘at * glance’. What did you get?
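The fixed-versus-flexible test can be sketched in Python by listing every word that fills the wildcard slot. The helper `slot_fillers` and the sample sentences are invented for illustration; the counts say nothing about what COCA actually returns.

```python
import re
from collections import Counter

def slot_fillers(text: str, before: str, after: str) -> Counter:
    """Count the single words that occur between `before` and `after`."""
    pattern = re.compile(
        rf"\b{re.escape(before)}\s+([A-Za-z'-]+)\s+{re.escape(after)}\b",
        re.IGNORECASE,
    )
    return Counter(word.lower() for word in pattern.findall(text))

# Toy text invented for illustration; it is not taken from COCA.
sample = ("At first glance it looked fine. At second glance, it did not. "
          "At a glance, the chart is clear. At first glance he agreed.")

fillers = slot_fillers(sample, "at", "glance")
print(fillers.most_common())
```

If the slot is filled by several different words, the expression is flexible; if a query such as kick the * bucket finds nothing, the expression behaves as fixed, at least in the text searched.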

11 COCA: Searching with the Wildcard 3
Wildcards can also be used for morphological searches. What if we want to know which words are used with the suffix ‘-icity’? To find out, we can use the wildcard as in *icity. Notice the difference between *icity and *˽icity (where ˽ marks a space). What different results does each of them give you?

12 Quiz
True or False?
POS stands for Point of Selection.
Fixed expressions are unmodifiable expressions.
Online corpus processors come with Web interfaces.
Online corpus processors use corpora stored on some servers.
Raw frequency is the total number of occurrences of a word in a corpus.
The wildcard is a good way to look for all the derivations of pick in one step.
Use the COCA online corpus processor to get the raw frequency of:
book (v. & n.)
book (v.)
book a ticket
books, booking, and booked combined

13 Quiz
Use the COCA Web interface to find out:
the top three frequent words starting with the prefix ‘anti-’
the most frequent stem attached to the suffix ‘-ness’
whether ‘once upon a time’ is a fixed or a flexible expression

14 COCA: Searching for Parts of Speech
What if we want to know the most common noun in COCA? We can search for parts of speech using the tags provided in the interface. If we want the most common noun in COCA, we can use the following: Try it and see what the most common noun is.

15 COCA: Searching for Lemmas
Although wildcards can be used to find word derivations, they only find affix-based derivations. What about zero-affix derivations such as ‘ate’? To find all the derivations of a given lemma, including both affix-based and zero-affix derivations, we can try the following: What is the result of your query?
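The difference between a wildcard search and a lemma search can be sketched in Python. The lemma table below is hand-written for this example and covers only ‘eat’; a real corpus processor relies on a full lemmatized index, and the names `LEMMA_FORMS` and `lemma_frequency` are hypothetical.

```python
from collections import Counter

# Hand-written toy lemma table, an assumption for illustration only.
LEMMA_FORMS = {
    "eat": {"eat", "eats", "eating", "ate", "eaten"},
}

def lemma_frequency(tokens, lemma):
    """Count every token that is a listed form of the given lemma."""
    forms = LEMMA_FORMS[lemma]
    lowered = (token.lower() for token in tokens)
    return Counter(token for token in lowered if token in forms)

# Toy text invented for illustration; it is not taken from COCA.
tokens = ("She ate early , he eats late , and they have eaten already ; "
          "eating can wait").split()

lemma_hits = lemma_frequency(tokens, "eat")
wildcard_hits = [t for t in tokens if t.lower().startswith("eat")]  # like eat*

print(sum(lemma_hits.values()))  # all forms of the lemma, including 'ate'
print(len(wildcard_hits))        # the wildcard misses 'ate'
```

The wildcard finds only the forms that share the typed string, whereas the lemma lookup also finds ‘ate’.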

16 COCA: Searching for Synonyms
We can search COCA for synonyms as well. To do so, all we need is the following: Do you see something wrong in the results? How can we get better results?

17 Quiz
Use the COCA Web interface to get:
the synonym of skip as a verb
the derivations of speak
the most frequent preposition in COCA

18 COCA: Searching Genres and Periods of Time 1
COCA is a general corpus with many genres, including spoken, fiction, magazine, newspaper, and academic genres. What if we want to know the frequency of Egypt in each genre? COCA also includes texts from different periods of time: 1990–2015. What if we want to know the frequency of Egypt in each period? The best way to do so is to use the chart option.

19 COCA: Searching Genres and Periods of Time 2
There are three different numbers in the chart for Egypt. Freq. stands for raw frequency. Size (M) stands for the size, in millions of words, of the texts in a given genre or period of time. How about Per Mil?

20 COCA: Raw vs. Normalized Frequencies
Per Mil is the normalized frequency per million. Raw frequency is the number of occurrences in the corpus. It does not always give an accurate idea of which word is more frequent, especially when the counts come from corpora or sections of different sizes. Hence, we typically use normalized frequency, which is calculated as follows:

$$\text{Normalized Frequency}(w) = \frac{C(w)}{N} \times \text{common base}$$

where C(w) is the raw frequency of the given word, N is the total number of words in the corpus, and the common base ranges from 10 to 1,000,000 depending on the size of the corpus.
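The formula can be tried out with a short Python sketch. The function name `normalized_frequency` and the two counts below are hypothetical; they are not figures from COCA.

```python
def normalized_frequency(raw_count: int, corpus_size: int, base: int = 1_000_000) -> float:
    """Return C(w) / N * base, i.e. occurrences per `base` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiplying before dividing is mathematically the same as C(w) / N * base.
    return raw_count * base / corpus_size

# Hypothetical counts: the same word in two sections of different sizes.
per_mil_a = normalized_frequency(50, 4_000_000)  # 50 hits in 4 million words
per_mil_b = normalized_frequency(30, 1_000_000)  # 30 hits in 1 million words

print(per_mil_a, per_mil_b)
```

Section A has the higher raw frequency (50 against 30), but section B has the higher normalized frequency, which is why comparisons across sections of different sizes should use the normalized figure.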

21 Quiz
In a corpus of 2,000 words, book as a noun occurs 120 times, whereas book as a verb occurs 30 times. Calculate the normalized frequency of book as a noun.
In a corpus of 300,000 words, withdrawal occurs 20 times. Calculate the normalized frequency of withdrawal.
What are the raw and normalized frequencies of Cairo in the different genres of COCA?

22 COCA: Key Word In Context (KWIC) 1
Key Word In Context (KWIC) is the concordance function, which displays up to 1,000 random contexts of the query word. Two questions: What if I want to see more than 1,000 contexts? What is the difference between the KWIC and clicking the word frequency to see the contexts? The main difference is that with the KWIC, we get the contexts with the parts of speech encoded in colors.
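A bare-bones concordance can be sketched in Python to show what a KWIC display contains. This is only an illustration under simple assumptions: the function `kwic` and the sample text are invented, and the sketch neither samples contexts at random nor color-codes parts of speech as the COCA interface does.

```python
import re

def kwic(text: str, keyword: str, width: int = 3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Toy text invented for illustration; it is not taken from COCA.
sample = "The cat made a big jump over the fence. Do not jump to conclusions."

lines = kwic(sample, "jump")
for left, key, right in lines:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>20}  {key}  {right}")
```

Aligning the keyword in a central column is what makes patterns on either side of it easy to spot.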

23 COCA: Key Word In Context (KWIC) 2
What do these colors stand for?

