Published by Constance Hudson. Modified over 6 years ago.
1
Introduction to Corpus Linguistics: Exploring Collocation
John Corbett and Wendy Anderson
2
So far on corpus linguistics…
General introduction … what are corpora and why are they useful? Frequencies and how to understand them… Concordance lines and how to interpret them… Familiarising ourselves with online corpora such as BNC, COCA, TIME, etc.
3
This session
Understanding collocation
What is collocation?
Measuring collocation
Searching for collocates and interpreting the data
4
What is collocation?
Collocation is a statistical measure of how strongly expressions (or types of expression) tend to co-occur. For example: Which words is ‘fish’ likely to co-occur with? Which words is ‘student’ likely to co-occur with?
5
Collocates of ‘fish’ in BNC
6
Searching for collocates of ‘student’
7
Collocates of ‘student’ in BNC and COCA
8
Measuring collocation
Collocation is a statistical measure that offers a more objective way of analysing concordance lines than manual reading and interpretation. Collocation programs usually look at a span of four or five words to the left and right of the node, eg:
music. The fear, the terror of detection, each time
Programs can be set to look only at the specific word-form ‘terror’ or at the lemma ‘[terror]’ (which would include related word-forms, in this case terror and terrors).
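The span idea can be sketched in a few lines of Python. This is only an illustration, not how any particular corpus tool works: the function name `collocates_in_span` is hypothetical, punctuation is ignored, and the lemma is approximated by a hand-listed set of word-forms.

```python
import re
from collections import Counter

def collocates_in_span(tokens, node_forms, left=4, right=4):
    """Count the words found within `left`/`right` words of any node form."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            start = max(0, i - left)
            # Words before the node plus words after it, excluding the node itself.
            window = tokens[start:i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts

# The example line from the slide, lower-cased, with punctuation dropped.
line = "music. The fear, the terror of detection, each time"
tokens = re.findall(r"[a-z]+", line.lower())

# Word-form search: only the exact form 'terror'.
word_form_hits = collocates_in_span(tokens, {"terror"})

# Lemma search: a hand-listed set of forms standing in for [terror] (an assumption).
lemma_hits = collocates_in_span(tokens, {"terror", "terrors"})

print(word_form_hits)
```

With a span of 4 and 4, the whole example line falls inside the window: four words to the left of ‘terror’ and four to the right.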
9
A lemma search
10
Frequency and Mutual Information
11
What can collocation tell us?
12
Measuring collocation
MI score (Mutual Information), t-score, z-score
13
MI, t-score and z-score
All calculations ask:
How many instances of the collocate are found in the designated span of the node word? (= the Observed) How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected)
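As a rough sketch of the Expected value: one common textbook estimate assumes the collocate is spread evenly through the corpus, so the number of hits expected in the span is the collocate's relative frequency multiplied by the number of word slots around the node. The function name and the figures below are hypothetical, and individual corpus tools may calculate this differently.

```python
def expected_in_span(node_freq, collocate_freq, corpus_size, span=8):
    """Estimate how often the collocate should land near the node by chance.

    span is the total window size, e.g. 4 words left + 4 words right = 8.
    """
    slots_near_node = node_freq * span              # word positions inside all windows
    collocate_rate = collocate_freq / corpus_size   # chance that any slot holds the collocate
    return slots_near_node * collocate_rate

# Hypothetical figures: node occurs 1,000 times, collocate 500 times,
# in a corpus of 1,000,000 words, with a 4-and-4 span.
expected = expected_in_span(node_freq=1000, collocate_freq=500, corpus_size=1_000_000)
print(expected)
```

The Observed is simply the count of the collocate actually found inside those windows; each measure then compares the two figures in a different way.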
14
t-score
How many instances of the collocate are found in the designated span of the node word? (= the Observed) How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected) Subtract the Expected from the Observed and divide the result by the standard deviation (ie consider the probability of co-occurrence of node and number of tokens in the designated span). A t-score of 2 or higher is normally considered significant.
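A minimal sketch of the t-score, assuming the formulation often used in corpus linguistics, in which the standard deviation is approximated by the square root of the Observed. The function name and the counts are hypothetical; a given tool may use a slightly different variant.

```python
import math

def t_score(observed, expected):
    """(Observed - Expected) divided by an estimate of the standard deviation.

    Assumes the standard deviation is approximated by sqrt(observed).
    """
    if observed <= 0:
        raise ValueError("observed must be positive")
    return (observed - expected) / math.sqrt(observed)

# Hypothetical counts: 36 co-occurrences observed where 4 were expected.
t = t_score(observed=36, expected=4.0)
print(round(t, 2), "significant" if t >= 2 else "not significant")
```

Because the Observed sits in the denominator as a square root, the t-score grows with raw frequency, which is one reason it tends to favour frequent collocates.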
15
MI (Mutual Information)
How many instances of the collocate are found in the designated span of the node word? (= the Observed) How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected) Compare the actual co-occurrence of node and collocate with expected co-occurrence, if the words in the corpus were to occur in totally random order. An MI score of 3 and above is said to be significant.
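A minimal sketch of the MI score, assuming the usual formulation as the base-2 logarithm of Observed over Expected. The function name and the counts are hypothetical.

```python
import math

def mutual_information(observed, expected):
    """log2 of how many times more often the pair occurs than chance predicts."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected must be positive")
    return math.log2(observed / expected)

# Hypothetical counts: 32 observed against 4 expected is an 8-fold excess,
# and log2(8) = 3, the threshold mentioned on the slide.
mi = mutual_information(observed=32, expected=4.0)
print(mi, "significant" if mi >= 3 else "not significant")
```

Since MI is a ratio, a single co-occurrence of a very rare word can produce a very high score.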
16
z-score (or z-test)
How many instances of the collocate are found in the designated span of the node word? (= the Observed) How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected) Compare the Observed with the Expected, and calculate the number of standard deviations from the mean frequency. The higher the z-score, the greater the strength of collocation.
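A minimal sketch of the z-score, assuming a common simplification in which the standard deviation is approximated by the square root of the Expected (fuller versions also multiply by one minus the collocate's probability). The function name and the counts are hypothetical.

```python
import math

def z_score(observed, expected):
    """Number of (approximate) standard deviations between Observed and Expected.

    Assumes the standard deviation is approximated by sqrt(expected).
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (observed - expected) / math.sqrt(expected)

# Hypothetical counts: 36 co-occurrences observed where 4 were expected.
z = z_score(observed=36, expected=4.0)
print(z)
```

Because the Expected is in the denominator, a tiny Expected value inflates the score, which is why z-scores can look impressive for very infrequent collocations.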
17
Differences between t-score, MI and z-score
The MI score does not depend on the size of the corpus, so MI scores can be compared across corpora of different sizes. Absolute t-scores cannot be compared across corpora, though t-score rankings can. The t-score tends to give information about a word’s grammatical behaviour; MI and z-scores tend to give information about its lexical behaviour. (z-scores can over-estimate the significance of infrequent collocations.) MI, z-scores and t-scores thus have different uses.
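The contrast between the three measures can be seen by scoring a rare and a frequent collocate side by side. This is a sketch under assumed textbook formulas (MI as log2 of Observed over Expected, t with sqrt(Observed) and z with sqrt(Expected) as the standard deviation); the function name and all counts are hypothetical.

```python
import math

def association_scores(observed, expected):
    """Return (MI, t-score, z-score) under the assumed textbook formulas."""
    mi = math.log2(observed / expected)
    t = (observed - expected) / math.sqrt(observed)
    z = (observed - expected) / math.sqrt(expected)
    return mi, t, z

# Hypothetical counts: one collocate seen once, another seen 183 times.
rare = association_scores(observed=1, expected=0.001)
frequent = association_scores(observed=183, expected=20.0)

for label, (mi, t, z) in (("rare", rare), ("frequent", frequent)):
    print(f"{label:9s} MI={mi:6.2f}  t={t:6.2f}  z={z:6.2f}")
```

Under these assumptions the rare pair gets the higher MI and a large z-score but a t-score below 2, while the frequent pair gets a lower MI and a high t-score.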
18
Exploring MI online
Go to http://corpus.byu.edu
Choose TIME
Click on Collocates
Enter [terror]
Put an * in the Collocates box
Set the span to 4 and 4
Sort by FREQUENCY
Set MINIMUM MI to 3
19
Results
20
Understanding the results
You chose to sort the results by frequency. What happens if you choose to sort them by MI? Remember to put an * in the Collocates box. Why is it important to consider both MI and frequency statistics?
21
Not by MI alone…
22
Not by MI alone…
‘Afghan-trained’ collocates with [terror] 100% of the time – it is therefore a strong collocation, BUT it co-occurs with ‘terror’ only once in 100 million words. On the other hand, ‘reign’ has a significant MI – and it co-occurs with ‘terror’ 183 times in the corpus. For this reason we might want to set a MINIMUM FREQUENCY of, say, 10 occurrences before we consider the results of an MI search. We can also limit the co-occurrences searched for to a particular part of speech, eg nouns.
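Combining the two thresholds can be sketched as a simple filter. The function name and row layout are hypothetical, and the MI values are invented for illustration; only the frequencies for ‘Afghan-trained’ (1) and ‘reign’ (183) come from the slide.

```python
def filter_collocates(rows, min_freq=10, min_mi=3.0):
    """Keep rows that pass both thresholds, then sort by MI, highest first."""
    kept = [r for r in rows if r["freq"] >= min_freq and r["mi"] >= min_mi]
    return sorted(kept, key=lambda r: r["mi"], reverse=True)

rows = [
    {"collocate": "afghan-trained", "freq": 1, "mi": 12.0},  # MI value invented
    {"collocate": "reign", "freq": 183, "mi": 8.0},          # MI value invented
    {"collocate": "the", "freq": 5000, "mi": 0.2},           # whole row invented
]

result = filter_collocates(rows)
print([r["collocate"] for r in result])
```

The minimum frequency removes the one-off ‘Afghan-trained’, and the minimum MI removes the very frequent but uninformative ‘the’, leaving ‘reign’.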
23
Refining a collocate search
24
Results of a refined collocate search
Sorted by MI, not frequency. Collocates restricted to nouns.
25
Take-home messages
Statistical measures of collocation are a more objective measure of co-occurrence of expressions within a designated span than manual reading of concordance lines, especially in a large corpus.
Different statistical tests measure degree of ‘association’ between collocates: frequency of co-occurrence, t-score, MI score and z-score. These different measures are good for different types of word (eg open and closed class items).
The analysis of collocation can tell you different things, eg about the grammatical environment of an item (t-score) or its lexical environment (MI, z-score).
The analysis of lexical environments in individual texts and larger corpora can shed light on recurrent themes that are important in a particular work, a group of texts (eg disciplinary texts) or a culture as a whole.
26
You should now be able to…
Search for co-occurrences of words or phrases
Choose between a specific word-form or a lemma
Change the span in which the co-occurrence is found
Sort the results by frequency (setting a minimum MI)
Sort the results by MI (setting a minimum frequency)
Look at the results and begin to find patterns of co-occurrence, eg in British and American English; in and across different registers (speech, academic writing, fiction, and so on); in different periods of time