Introduction to Corpus Linguistics: Exploring Collocation


Introduction to Corpus Linguistics: Exploring Collocation
John Corbett and Wendy Anderson

So far on corpus linguistics…
General introduction: what are corpora and why are they useful?
Frequencies and how to understand them
Concordance lines and how to interpret them
Familiarising ourselves with online corpora such as BNC, COCA, TIME, etc.

This session: understanding collocation
What is collocation?
Measuring collocation
Searching for collocates and interpreting the data

What is collocation?
Collocation, as treated here, is a statistical measure of how likely expressions (or types of expression) are to co-occur.
For example:
What is fish likely to co-occur with?
What is student likely to co-occur with?

Collocates of ‘fish’ in BNC

Searching for collocates of ‘student’

Collocates of ‘student’ in BNC and COCA

Measuring collocation
Collocation statistics offer a more objective way of examining concordance lines than manual analysis and interpretation.
Collocation programs usually look at a span of four or five words to the left and right of the node, e.g. (node: terror):
music. The fear, the terror of detection, each time
Programs can be set to look only at the specific word-form ‘terror’ or at the lemma ‘[terror]’, which includes related word-forms (in this case terror and terrors).
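A minimal sketch of how a span-based collocate count might work, assuming the text has already been lowercased and tokenised with punctuation dropped. The function name `collocates_in_span` and the hand-written set of word-forms standing in for a lemma are illustrative assumptions, not a description of how any particular corpus tool is implemented.

```python
from collections import Counter


def collocates_in_span(tokens, node_forms, left=4, right=4):
    """Count tokens within `left`/`right` positions of any form of the node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token not in node_forms:
            continue
        # Words before and after the node, excluding the node itself.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# The example line from the slide, lowercased with punctuation removed (an assumption).
tokens = "music the fear the terror of detection each time".split()

# A lemma search approximated by listing the word-forms by hand.
terror_lemma = {"terror", "terrors"}
span_counts = collocates_in_span(tokens, terror_lemma, left=4, right=4)
print(span_counts.most_common())
```

Passing `{"terror"}` instead of `terror_lemma` would restrict the search to the single word-form.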

A lemma search

Frequency and Mutual Information

What can collocation tell us?

Measuring collocation
MI score (Mutual Information)
t-score
z-score

MI, t-score and z-score
All calculations ask:
How many instances of the collocate are found in the designated span of the node word? (= the Observed)
How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected)
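One common way to estimate the Expected value is to assume the collocate is spread evenly through the corpus. The sketch below follows that assumption; the function name and every figure in it are invented for illustration, and a given corpus interface may compute the value differently.

```python
def expected_in_span(node_freq, collocate_freq, corpus_size, span_width):
    """Co-occurrences expected if the collocate were spread evenly through the corpus."""
    return node_freq * collocate_freq * span_width / corpus_size


# All figures below are invented for illustration.
expected = expected_in_span(
    node_freq=1_000,          # occurrences of the node word
    collocate_freq=5_000,     # occurrences of the collocate
    corpus_size=100_000_000,  # tokens in the whole corpus
    span_width=8,             # 4 words to the left + 4 to the right
)
observed = 100                # invented count of collocate tokens found in the span
print(observed, expected)
```

Here the Observed count is far larger than the Expected one, which is the situation the association measures are designed to quantify.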

t-score
How many instances of the collocate are found in the designated span of the node word? (= the Observed)
How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected)
Subtract the Expected from the Observed and divide the result by the standard deviation (i.e. take into account the probability of co-occurrence with the node and the number of tokens in the designated span).
A t-score of 2 or higher is normally considered significant.
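A sketch of the t-score under a widely used simplification, in which the standard deviation is approximated by the square root of the Observed count. That approximation and the input figures are assumptions made for illustration; they are not taken from the slides.

```python
import math


def t_score(observed, expected):
    """(Observed - Expected) / sqrt(Observed); requires observed > 0."""
    return (observed - expected) / math.sqrt(observed)


# Invented figures: 100 co-occurrences observed where 0.4 were expected.
t = t_score(observed=100, expected=0.4)
print(round(t, 2), t >= 2)
```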

MI (Mutual Information)
How many instances of the collocate are found in the designated span of the node word? (= the Observed)
How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected)
Compare the actual co-occurrence of node and collocate with the co-occurrence that would be expected if the words in the corpus occurred in totally random order.
An MI score of 3 or above is said to be significant.
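The comparison is commonly expressed as the base-2 logarithm of Observed over Expected. The sketch below assumes that formulation; the figures are invented, chosen so that the result lands on the threshold of 3.

```python
import math


def mi_score(observed, expected):
    """log2(Observed / Expected); both values must be greater than zero."""
    return math.log2(observed / expected)


# Invented figures: the pair occurs 8 times as often as chance would predict.
mi = mi_score(observed=8, expected=1.0)
print(mi)
```

Because the score is a ratio on a log scale, each extra point of MI means the pair co-occurs twice as often relative to chance.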

z-score (or z-test)
How many instances of the collocate are found in the designated span of the node word? (= the Observed)
How many instances of the collocate might be expected to fall into that span, given the frequency of the collocate in the corpus as a whole? (= the Expected)
Compare the Observed with the Expected, and calculate the number of standard deviations from the mean frequency.
The higher the z-score, the greater the strength of collocation.
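A sketch of the z-score under the simplifying assumption that the standard deviation can be approximated by the square root of the Expected count (fuller formulations also include a factor for the collocate's probability). All figures are invented for illustration.

```python
import math


def z_score(observed, expected):
    """(Observed - Expected) / sqrt(Expected); requires expected > 0."""
    return (observed - expected) / math.sqrt(expected)


# Invented figures for a frequent pair and a very rare pair.
frequent_pair = z_score(observed=100, expected=64.0)
rare_pair = z_score(observed=2, expected=0.01)
print(frequent_pair, rare_pair)
```

With these inputs the rare pair, seen only twice, scores higher than the frequent one, because a tiny Expected value makes the denominator very small.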

Differences between t-score, MI and z-score
MI score does not depend on the size of the corpus, so MI scores can be compared across corpora of different sizes.
Absolute t-scores cannot be compared across corpora, though t-score rankings can.
t-score tends to give information about a word’s grammatical behaviour; MI and z-scores tend to give information about its lexical behaviour. (z-scores can over-estimate the significance of infrequent collocations.)
MI, z-scores and t-scores thus have different uses.
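The point about corpus size can be illustrated with a small sketch. It assumes the simplified formulas MI = log2(O/E) and t = (O - E) / sqrt(O), with E estimated from an even spread of the collocate; the two "corpora" are invented, the second being the first with every count multiplied by ten.

```python
import math


def scores(observed, node_freq, collocate_freq, corpus_size, span_width=8):
    """Return (MI, t-score) for one node/collocate pair."""
    expected = node_freq * collocate_freq * span_width / corpus_size
    mi = math.log2(observed / expected)
    t = (observed - expected) / math.sqrt(observed)
    return mi, t


# Same pattern of use, invented counts, at two corpus sizes.
small = scores(observed=50, node_freq=1_000, collocate_freq=2_000,
               corpus_size=10_000_000)
large = scores(observed=500, node_freq=10_000, collocate_freq=20_000,
               corpus_size=100_000_000)
print(small, large)
```

Under these assumptions the MI values match while the t-score grows with the corpus, which is why only t-score rankings, not absolute values, are comparable.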

Exploring MI online
Go to http://corpus.byu.edu
Choose TIME
Click on Collocates
Enter [terror]
Put an * in the Collocates box
Set the span to 4 and 4
Sort by FREQUENCY
Set MINIMUM MI to 3

Results

Understanding the results
You chose to sort the results by frequency. What happens if you choose to sort them by MI? (Remember to put an * in the Collocates box.)
Why is it important to consider both MI and frequency statistics?

Not by MI alone…

Not by MI alone…
‘Afghan-trained’ collocates with [terror] 100% of the time, so it counts as a strong collocation, BUT it co-occurs with ‘terror’ only once in 100 million words.
On the other hand, ‘reign’ has a significant MI and it co-occurs with ‘terror’ 183 times in the corpus.
For this reason we might want to set a MINIMUM FREQUENCY of, say, 10 occurrences before we consider the results of an MI search.
We can also limit the co-occurrences searched for to a particular part of speech, e.g. nouns.
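The combined use of a minimum frequency and a minimum MI can be sketched as a simple filter. The two frequencies (1 and 183) come from the slide; the MI values, the third row, and the function name `filter_collocates` are invented for illustration.

```python
# Frequencies for the first two rows are from the slide; MI values are invented.
candidates = [
    {"collocate": "Afghan-trained", "freq": 1, "mi": 12.0},
    {"collocate": "reign", "freq": 183, "mi": 8.0},
    {"collocate": "the", "freq": 2500, "mi": 0.4},  # wholly invented row
]


def filter_collocates(rows, min_freq=10, min_mi=3.0):
    """Keep rows meeting both thresholds, strongest MI first."""
    kept = [r for r in rows if r["freq"] >= min_freq and r["mi"] >= min_mi]
    return sorted(kept, key=lambda r: r["mi"], reverse=True)


result = filter_collocates(candidates)
print([r["collocate"] for r in result])
```

The frequency threshold removes the one-off ‘Afghan-trained’, and the MI threshold removes the very frequent but weakly associated function word.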

Refining a collocate search

Results of a refined collocate search Sorted by MI, not frequency Collocates restricted to nouns

Take-home messages
Statistical measures of collocation are a more objective measure of co-occurrence of expressions within a designated span than manual reading of concordance lines, especially in a large corpus.
Different statistical tests measure the degree of ‘association’ between collocates: frequency of co-occurrence, t-score, MI score and z-score. These different measures are good for different types of word (e.g. open and closed class items).
The analysis of collocation can tell you different things, e.g. about the grammatical environment of an item (t-score) or its lexical environment (MI, z-score).
The analysis of lexical environments in individual texts and larger corpora can shed light on recurrent themes that are important in a particular work, a group of texts (e.g. disciplinary texts) or a culture as a whole.

You should now be able to…
Search for co-occurrences of words or phrases
Choose between a specific word-form or a lemma
Change the span in which the co-occurrence is found
Sort the results by frequency (setting a minimum MI)
Sort the results by MI (setting a minimum frequency)
Look at the results and begin to find patterns of co-occurrence, e.g.
  in British and American English
  in and across different registers (speech, academic writing, fiction, and so on)
  in different periods of time