LIN 3098 Corpus Linguistics Lecture 6 Albert Gatt

In this lecture… More on corpora for lexicography:
- collocations as a window on lexical semantics
- uses of collocations: distinguishing near-synonyms; cross-register variation
- case study: synonymy and the "contextual" view of meaning

Part 1 What is a collocation?

The empiricist tradition in lexical semantics
- Main exponent: Firth (1957)
- Fundamental position: the meaning of words is best discovered through an analysis of the contexts in which they occur.
- Contrast with more traditional, rationalist approaches: meaning is usually defined in terms of concepts or features, and words can be distinguished based on distinctions among their features.

Collocations and collocational strength
- Example 1: adjective-noun combinations
  - large number vs. big number
  - large distinction vs. big distinction
  - Why are large and big not equally acceptable with different nouns?
- Example 2: noun compounds
  - computer scientist, computer terminal, computer desk
  - Are these compounds equally well-established?

Uses of collocations
- Collocations can tell us something about:
  - distinctions in word meaning between apparently synonymous words
  - whether certain expressions should be considered "frozen" or nearly so
- We should view such phrases as falling on a continuum:
  - at one extreme: simple syntactic combination (kick the door)
  - at the other extreme: fully frozen idiomatic expressions (kick the bucket)
  - plenty of intermediate cases

Properties of collocations
1. Frequency and regularity
2. Textual proximity
3. Limited compositionality
4. Non-substitutability
5. Non-modifiability
6. Category restrictions

Frequency and regularity
- We know that language is regular (non-random) and rule-based; this aspect is emphasised by rationalist approaches to grammar.
- We also need to acknowledge that frequency of usage is an important factor in language development: why do big and large collocate differently with different nouns?

Regularity/frequency
- f(strong tea) > f(powerful tea)
- f(credit card) > f(credit bankruptcy)
- f(white wine) > f(yellow wine) (even though white wine is actually yellowish)
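Such contrasts can be checked directly by counting bigrams in a corpus. Below is a minimal sketch in Python; the corpus file name (corpus.txt) and the crude regex tokenisation are illustrative assumptions, not part of the original lecture.

```python
# Minimal sketch: counting bigram frequencies in a plain-text corpus.
# The file name and the crude tokenisation are assumptions.
import re
from collections import Counter

def bigram_counts(text):
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenisation
    return Counter(zip(tokens, tokens[1:]))

with open("corpus.txt", encoding="utf-8") as f:
    counts = bigram_counts(f.read())

print("f(strong tea)   =", counts[("strong", "tea")])
print("f(powerful tea) =", counts[("powerful", "tea")])
```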

Narrow window (textual proximity)
- Usually, we specify an n-gram window within which to analyse collocations:
  - bigram: credit card, credit crunch
  - trigram: credit card fraud, credit card expiry, …
- The idea is to look at the co-occurrence of words within a specific n-gram window.
- We can also count n-grams with intervening words: federal (.*) subsidy matches federal farm subsidy, federal manufacturing subsidy, …
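As an illustration, a gapped pattern like federal (.*) subsidy can be searched for with an ordinary regular expression. The sample text below is invented for the example.

```python
# Sketch: matching "federal X subsidy" with one intervening word.
import re

text = ("The federal farm subsidy was cut, while the "
        "federal manufacturing subsidy survived.")

for match in re.finditer(r"\bfederal\s+\w+\s+subsidy\b", text):
    print(match.group(0))
# federal farm subsidy
# federal manufacturing subsidy
```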

Textual proximity (continued)
- Usually, collocates of a word occur close to that word, but they may still occur across a longer span.
- Examples:
  - bigram: white wine, powerful tea
  - longer than a bigram: knock on the door; knock on X's door

Non-compositionality
- white wine: not really "white"; the meaning is not fully predictable from the component words plus syntax.
- signal interpretation: a term used in Intelligent Signal Processing whose connotations go beyond the compositional meaning.
- Similarly: regression coefficient, good practice guidelines.
- Extreme cases: idioms such as kick the bucket, whose meaning is completely frozen.

Non-substitutability
- If a phrase is a collocation, we cannot substitute a near-synonym for one of its words and still have the same overall meaning.
- E.g.: white wine vs. yellow wine; powerful tea vs. strong tea; …

Non-modifiability
- Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms.
- Example: kick the bucket vs. ?kick the large bucket
- NB: this is a matter of degree! Non-idiomatic collocations are more flexible.

Category restrictions
- Frequency alone doesn't indicate collocational strength: by the is a very frequent phrase in English, but not a collocation.
- Collocations tend to be formed from content words:
  - A+N: powerful tea
  - N+N: regression coefficient, mass demonstration
  - N+PREP+N: degrees of freedom
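One common way to apply such category restrictions automatically is a part-of-speech filter over candidate bigrams. The sketch below uses NLTK's off-the-shelf tagger (it assumes the relevant NLTK data packages are installed); the sentence and the set of allowed tag patterns are illustrative.

```python
# Sketch: keep only content-word bigrams (A+N, N+N) as collocation
# candidates, so frequent function-word pairs like "by the" are excluded.
# Assumes NLTK and its tokenizer/tagger data are installed.
import nltk

sentence = "A mass demonstration against the powerful tea lobby began by the gate."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

ALLOWED = {("JJ", "NN"), ("NN", "NN"), ("NN", "NNS"), ("NNS", "NN")}
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in ALLOWED:
        print(w1, w2)   # e.g. "powerful tea", "tea lobby"
```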

Part 2 Distinguishing near-synonyms: a case study (from Biber et al. 1993)

Near-synonyms
- What's the difference between big, large, and great?
- A traditional dictionary (OED online):
  - large, adj.: of considerable or relatively great size, extent, or capacity
  - big, adj.: of considerable size, physical power, or extent
  - great, adj.: of an extent, amount, or intensity considerably above average
- Is this informative enough?

The frequency of the adjectives (Longman-Lancaster corpus)

                           f(big)   f(large)   f(great)   ordering
  overall (5.7m words)      1,319      2,342      2,254
  academic prose (2.7m)        84      1,641        772   large >> great >> big
  fiction (3m words)        1,235        701      1,482   great >> big >> large
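Since the three registers differ in size, raw counts are best compared as rates per million words. A quick sketch using the academic-prose figures above:

```python
# Normalising raw counts to rates per million words (academic prose).
counts = {"big": 84, "large": 1641, "great": 772}
size_in_millions = 2.7
for word, freq in counts.items():
    print(f"{word}: {freq / size_in_millions:.0f} per million words")
# big: 31, large: 608, great: 286
```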

Immediate right collocates: big
- academic prose: big enough (2.2/m), big traders (1.1/m)
- fiction: big man (9.6/m), big enough (8.9/m), big house (7.6/m)
- big seems to be used for the physical size of an object, person, or organisation; big enough is usually used for size as well (the house is big enough).
- It also occurs often with descriptive adjectives: big black X, etc.

Immediate right collocates: large
- academic prose: large number (48.3/m), large numbers (31.3/m), large scale (29.4/m)
- fiction: large black (4.3/m), large enough (3.6/m), large room (2.7/m), large number (2.3/m)
- Used more often than big for quantities or proportions; large enough is usually used for such quantities too (the ratio is large enough).
- Lemmatisation would allow us to combine the counts for large number and large numbers, as in the sketch below.
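A minimal sketch of that lemmatisation step, using NLTK's WordNet lemmatiser (this assumes the WordNet data is installed; it is one possible tool, not the one used in the original study):

```python
# Mapping "numbers" and "number" to a single lemma so the two
# collocate counts can be combined. Assumes NLTK's WordNet data.
from nltk.stem import WordNetLemmatizer

lemmatise = WordNetLemmatizer().lemmatize
print(lemmatise("numbers", pos="n"))  # number
print(lemmatise("number", pos="n"))   # number
```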

Immediate right collocates: great
- academic prose: great deal (44.6/m), great importance (12.6/m), great variety (7.0/m), great detail (2.6/m)
- fiction: great deal (40.4/m), great man (6.6/m)
- In academic prose, mostly used for amount or quantity, rather like large, but it also occurs with deal; great is also used for intensity: great importance, great care, …
- In fiction, mostly used for amounts: a great deal of apple juice.

Salient differences
- This is a very brief overview of the uses and senses of the three adjectives.
- It helps explain the different frequency distributions across registers:
  - fiction often contains physical descriptions (thus big is more frequent there than in academic prose)
  - academic prose is more often concerned with proportions, amounts, and quantities (thus large and great far outnumber big here)

Widening the window
- Two words can co-occur regularly even with a few words between them.
- academic prose: large X of, large X in, large X open, large X that
- fiction: large X of, large X and, large X in, large X eyes

Widening the window - II
- The most frequent collocate of large in a three-word window is of.
- What nouns intervene between large and of? large amounts of, large numbers of, …; again, typically quantities or proportions.
- Large X eyes is very frequent in fiction (not academic prose): his large hazel eyes… This confirms the earlier conclusion that fiction has more physical descriptions.
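A sketch of how one might extract the words intervening between large and of; the sample text stands in for a real corpus.

```python
# Counting the fillers of the frame "large X of".
import re
from collections import Counter

text = ("large amounts of data, large numbers of speakers, "
        "and large quantities of wine; her large hazel eyes")

fillers = Counter(re.findall(r"\blarge\s+(\w+)\s+of\b", text))
print(fillers.most_common())
# [('amounts', 1), ('numbers', 1), ('quantities', 1)]
```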

Interim summary
- This brief overview shows that collocations:
  - help to distinguish between near-synonyms
  - can also help to discover patterns of variation across registers

Part 3 The contextual theory of synonymy and similarity: Corpus-based and psycholinguistic evidence

Synonymy
- Different phonological words with highly related meanings:
  - sofa / couch
  - boy / lad
  - żgħir (small) / ċkejken (little)
- Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence salva veritate (preserving truth).
- Is this ever the case? Can we replace one word with another and keep the meaning of our sentence identical?

Imperfect synonymy
- Synonyms often exhibit slight differences, especially in their connotations:
  - żgħir ("small") is fairly neutral with respect to the thing spoken of
  - ċkejken ("small"/"little") might be used of a little child, but not a teenager; it may carry connotations of dependence, cuteness, etc.

The importance of register
- With near-synonyms, there are often register-governed conditions of use.
- E.g. naive vs. gullible vs. ingenuous:
  - gullible / naive seem critical, or even offensive
  - ingenuous is more likely in a formal context

Synonymy vs. Similarity
- The contextual theory of synonymy is based on the work of Wittgenstein (1953) and Firth (1957): "You shall know a word by the company it keeps" (Firth 1957).
- Under this view, perfect synonyms might not exist.
- But words can be judged as highly similar if people put them into the same linguistic contexts and judge the change to be slight.

Synonymy vs. similarity: example
- Miller & Charles (1991), the weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words.
- E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts; it is much more likely that snake/serpent will occur in similar contexts than snake/toad.
- NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles experiment
- Subjects were given sentences with missing words and asked to place words they felt were OK in each context.
- Method to compare words A and B:
  1. find sentences containing A
  2. find sentences containing B
  3. delete A and B from the sentences and shuffle them
  4. ask people to choose which sentences to place A and B in
- The experiment showed that people will put similar words in the same contexts, and that this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity
- "Similar" is a much broader concept than "synonymous":
  - contextually related, though differing in meaning: man / woman, boy / girl, master / pupil
  - contextually related, but with opposite meanings: big / small, clever / stupid

Part 4 Bonus Topic: Mutual Information for ranking collocations

General idea
- Suppose we identify several multiword units in a corpus (e.g. several N-N compounds).
- We would like to know to what extent the words making them up are "strongly collocated".
- It could be that these words occur together purely by chance.

An analogy
- Suppose Tom and Viv are an item and never turn up anywhere unless they're together.
- From your point of view:
  - seeing Tom increases your certainty (your information) that Viv is around
  - seeing Viv does the same with respect to Tom
- Your assumptions would be very different if you knew that Tom and Viv had never been able to stand each other, or if you only knew them separately and had no idea they had a relationship.

The reasoning (I)
- Example: collocations involving post.
- A search through a corpus throws up lots of co-occurring words, e.g.: the post, post in, post office, post mortem.
- We don't want to call all of these collocations.
- E.g. the is extremely frequent, and this is probably why it occurs very frequently with post (remember Zipf's law).

The reasoning (II)
- Suppose we suspect a strong relationship between post and mortem.
- There are two possibilities:
  1. post + mortem just co-occur randomly, so they're as likely to occur together as separately.
  2. post + mortem is indeed a collocation, so finding mortem should increase our certainty that we'll also find post in its immediate environment; thus, the two words have high mutual information.
- Given the word mortem, the mutual information score tells us how much our certainty increases that post is in the vicinity.

Mutual information: post mortem
1. Compute the frequencies of post, mortem and post mortem, denoted f(post), f(mortem), f(post mortem).
2. Compute the probability of each: this is just the frequency divided by the corpus size N, e.g. p(post) = f(post)/N. This is a better estimate than the raw frequency because it is a proportion, comparable across corpora of different sizes. We denote these p(post), p(mortem), etc.

Mutual information: post mortem
3. Compare the probability of finding post + mortem together to the probability of finding either word on its own:

   p(post mortem) / (p(post) × p(mortem))

   where p(post mortem) is the probability of finding the two words together within a certain window, and p(post) × p(mortem) is the probability of the two words occurring independently.

Mutual information: post mortem
4. Finally, we turn this probability ratio into a measure of information.
5. Information is measured in bits.
6. The ratio is turned into an information value by taking its log (usually to base 2):

   MI(post, mortem) = log2 [ p(post mortem) / (p(post) × p(mortem)) ]

   The amount of information about post increases by this amount if we know that there is an accompanying word mortem.

Interpreting MI
- If MI is positive and reasonably high (usually 2 or higher), the two words are strongly collocated.
- If MI is negative, the two words are actually unlikely to occur together.
- If MI is approximately zero, the two words tend to occur independently.
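Putting the pieces together, here is a minimal sketch of the MI computation for post mortem; the counts and the corpus size are invented for illustration.

```python
# Mutual information for a candidate collocation. All figures invented.
import math

N = 1_000_000            # corpus size in tokens (assumed)
f_post = 320             # f(post)
f_mortem = 42            # f(mortem)
f_post_mortem = 38       # f(post mortem) as a bigram

p_post = f_post / N
p_mortem = f_mortem / N
p_both = f_post_mortem / N

mi = math.log2(p_both / (p_post * p_mortem))
print(f"MI(post, mortem) = {mi:.2f} bits")  # ~11.5: strongly collocated
```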

Summary
- This lecture has focused on another use of corpora for lexicography:
  - the dominant paradigm is the empiricist view of meaning and language
  - it takes a very different approach to issues of synonymy than rationalist approaches do
- We have also introduced the concept of mutual information as a way of measuring collocational strength.