Published by Braydon Gerrard. Modified over 10 years ago.
1/32 Assignments
Basic idea is to choose a topic of your own, or to take a study found in the literature
Report is in two parts
–Description of problem and review of relevant literature (not just the study you are going to replicate, but related things too)
–Description and discussion of your own results
First part (1000-1500 words) due in Friday 25 April
Second part (1500-2000 words) due in Friday 9 May
No overlap allowed with LELA30122 projects
–Though you are free to use that list of topics for inspiration
–See LELA30122 WebCT page, “project report”
Church et al. 1991
K Church, W Gale, P Hanks, D Hindle (1991) Using Statistics in Lexical Analysis, in U Zernik (ed) Lexical Acquisition: Exploiting on-line resources to build a lexicon. Hillsdale NJ: Lawrence Erlbaum, pp. 115-164.
3/32 Background
Corpora were becoming more widespread and bigger
Computers becoming more powerful
But tools for handling them still relatively primitive
Use of corpora for lexicology
Written for the First International Workshop on Lexical Acquisition, Detroit 1989
In fact there was no “Second IWLA”
But this paper (and others in the collection) became much cited and well known
4/32 The problem
Assuming a lexicographer has at their disposal a reference corpus of considerable size, …
A typical concordance listing only works well with
–words with just two or three major sense divisions
–preferably well distinct
–and generating only a pageful of hits
Even then, the information you may be interested in may not be in the immediate vicinity
6/32 The solution
Information Retrieval faces a comparable problem (overwhelming data), and suggests a solution
1. Choose an appropriate statistic to highlight information “hidden” in the corpus
2. Preprocess the corpus to highlight properties of interest
3. Select an appropriate unit of text to constrain the information extracted
7/32 Mutual Information
MI: a measure of similarity
Compares the joint probability of observing two words together with the probabilities of observing them independently (chance)
If there is a genuine association, I(x;y) >> 0
If no association, P(x,y) ≈ P(x)P(y), I(x;y) ≈ 0
If complementary distribution, I(x;y) << 0
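The comparison described on this slide is usually written as I(x;y) = log2 [ P(x,y) / (P(x)P(y)) ]. A minimal sketch of that calculation follows, assuming probabilities are estimated as raw counts divided by corpus size; the function name and the toy counts are illustrative and not taken from the AP corpus.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI in bits: log2 of joint probability over the chance product.

    f_xy: times x and y were seen together; f_x, f_y: individual counts;
    n: corpus size. Probabilities are plain relative frequencies (an assumption).
    """
    if f_xy == 0:
        return float("-inf")  # never seen together: the log is undefined
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))

# Toy counts (hypothetical)
associated = mutual_information(f_xy=30, f_x=1000, f_y=500, n=1_000_000)
independent = mutual_information(f_xy=1, f_x=1000, f_y=1000, n=1_000_000)
print(associated, independent)
```

With these toy counts the first pair scores well above 0 and the second lands at about 0, matching the "genuine association" and "no association" cases.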
8/32 Top ten scoring pairs of strong y and powerful y
Data from AP corpus, N=44.3m words
9/32 Mutual Information
Can be used to demonstrate a strong association
Counts can be based on immediate neighbourhood, as in previous slide, or on co-occurrence within a window (to left or right or both), or within same sentence, paragraph, etc.
MI shows strongly associated word pairs, but cannot show the difference between, eg strong and powerful
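The choice between immediate neighbourhood and a wider window can be sketched as a counting function. This is only an illustration under simple assumptions: the text is already tokenized, and `cooccurrence_counts`, `window` and `direction` are hypothetical names.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=1, direction="right"):
    """Count ordered pairs (x, y) where y falls within `window` tokens of x.

    direction: "right", "left" or "both", relative to x.
    """
    counts = Counter()
    for i, x in enumerate(tokens):
        if direction in ("right", "both"):
            for y in tokens[i + 1 : i + 1 + window]:
                counts[(x, y)] += 1
        if direction in ("left", "both"):
            for y in tokens[max(0, i - window) : i]:
                counts[(x, y)] += 1
    return counts

tokens = "strong tea and strong support but powerful car".split()
adjacent = cooccurrence_counts(tokens, window=1)                 # immediate right neighbour
wider = cooccurrence_counts(tokens, window=3, direction="both")  # 3 tokens either side
print(adjacent[("strong", "tea")], wider[("support", "strong")])
```

Counts from either setting can be fed into an association statistic; a wider window picks up looser associations at the cost of more noise.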
10/32 t-test
A measure of dissimilarity
How to explain relative strength of collocations such as
–strong tea ~ powerful tea
–powerful car ~ strong car
The less usual combination is either rejected, or has a marked contrastive meaning
Use example of {strong|powerful} support because tea rather infrequent in AP corpus
11/32 {strong|powerful} support
MI can’t help: very difficult to get value for I(powerful;support)<<0 because of size of corpus
–Say x and y both occur about 10 times per 1m words in a corpus
–P(x) = P(y) = 10^-5 and chance P(x)P(y) = 10^-10
–I(powerful;support)<<0 means P(x,y) << 10^-10
–ie much less than 1 in 10,000,000,000
–Hard to say with confidence
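The arithmetic behind this slide can be checked directly. The sketch below uses the slide's own figures (10 occurrences per 1m words, and the AP corpus size of 44.3m words given earlier); the variable names are illustrative.

```python
n = 44_300_000               # AP corpus size from the slides
p_x = p_y = 10 / 1_000_000   # each word occurs about 10 times per 1m words
p_chance = p_x * p_y         # joint probability expected by chance, about 1e-10
expected_pairs = n * p_chance

print(p_chance)        # about 1e-10
print(expected_pairs)  # about 0.0044 expected co-occurrences in the whole corpus
```

Since chance alone predicts far fewer than one co-occurrence, seeing none is not good evidence that the pair is rarer than chance, which is one way to read "hard to say with confidence".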
12/32 Rephrase the question
Can’t ask “what doesn’t collocate with powerful?”
Also, can’t show that powerful support is less likely than chance: in fact it isn’t
–I(powerful;support)=1.74
–ie about 3 times greater than chance!
Try to compare what words are more likely to appear after strong than after powerful
Show that strong support relatively more likely than powerful support
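The step from the MI value to "times greater than chance" is just undoing the logarithm. The sketch assumes MI is measured in bits (log base 2), which the slides do not state explicitly.

```python
mi = 1.74        # I(powerful;support) from the slide
ratio = 2 ** mi  # P(x,y) / (P(x)P(y)), assuming a base-2 logarithm
print(round(ratio, 1))  # roughly 3.3
```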
13/32 t-test
Null hypothesis (H0)
–H0 says that there is no significant difference between the scores
H0 can be rejected if
–Difference of at least 1.65 sd’s
–95% confidence
–ie the difference is real
14/32 t-test
Comparison of powerful support with chance is not significant
–t = 0.99 (less than 1 sd!)
But if we compare powerful support with strong support, t = –13
Strongly suggests there is a difference
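One common simplification of this kind of comparison treats the two collocation counts as the observations and approximates the variance of each count by the count itself. The sketch below uses that approximation. It is not necessarily the exact estimator behind the values on the slides, so it should not be expected to reproduce them to the decimal. The function names and toy counts are hypothetical.

```python
import math

def t_score_contrast(f_a, f_b):
    """Approximate t for the difference between two collocation counts.

    f_a: count of "a w" (eg powerful w); f_b: count of "b w" (eg strong w).
    Negative values mean w is relatively more typical of b.
    """
    if f_a + f_b == 0:
        return 0.0  # no evidence either way
    return (f_a - f_b) / math.sqrt(f_a + f_b)

CRITICAL = 1.65  # threshold quoted on the slides for 95% confidence

def differs(f_a, f_b, critical=CRITICAL):
    """True if the difference is at least `critical` standard deviations."""
    return abs(t_score_contrast(f_a, f_b)) >= critical

# Toy counts (hypothetical)
print(t_score_contrast(4, 100), differs(4, 100))
print(t_score_contrast(5, 4), differs(5, 4))
```

A large imbalance between the two counts gives a score far from zero, while two similar small counts stay under the threshold.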
16/32 MI and t-score show different things
17/32 How is this useful?
Helps lexicographers recognize significant patterns
Especially useful for learners’ dictionaries to make explicit the difference in distribution between near synonyms
eg what is the difference between a strong nation and a powerful nation?
–Strong as in strong defense, strong economy, strong growth
–Powerful as in powerful posts, powerful figure, powerful presidency
18/32 Taking advantage of POS tags
Looking at context in terms of POS rather than lexical items may be more informative
Example, how can we distinguish to as an infinitive marker from to as a preposition?
Look at words which immediately precede to
–able to, began to, … vs back to, according to, …
t-score can show that they have a different distribution
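Collecting the words that precede each tagged use of to might look like the sketch below. It assumes the corpus is available as (word, tag) pairs and that the infinitive marker and the preposition carry the tags "to" and "in"; those tag names, the function name and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter

def preceding_word_counts(tagged_tokens, target="to"):
    """Map each tag of `target` to a Counter of the words immediately before it."""
    counts = {}
    for (prev_word, _), (word, tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word.lower() == target:
            counts.setdefault(tag, Counter())[prev_word.lower()] += 1
    return counts

# Toy tagged text (hypothetical tags)
tagged = [
    ("She", "pps"), ("began", "vbd"), ("to", "to"), ("walk", "vb"),
    ("back", "rb"), ("to", "in"), ("town", "nn"), ("and", "cc"),
    ("was", "bedz"), ("able", "jj"), ("to", "to"), ("rest", "vb"),
]
counts = preceding_word_counts(tagged)
print(counts["to"])  # words seen before the infinitive marker
print(counts["in"])  # words seen before the preposition
```

The two counters are the raw material for a per-word contrast between the two uses.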
20/32 Similar investigation with subordinate conjunction that (fact that, say that, that the, that he) and demonstrative pronoun that (that of, that is, in that, to that)
Look at both preceding and following word
Distribution is so distinctive that this process can help us to spot tagging errors
22/32
subordinate conjunction
t        w that/cs   w that/dt   w
14.19    227         2           so/cs
demonstrative pronoun
t        w that/cs   w that/dt   w
-12.25   1           151         of/in
23/32 If your corpus is parsed
Looking for word sequences can be limiting
More useful if you can extract things like subjects and objects of verbs
(Can be done to some extent by specifying POS tags within a window, but that’s very noisy)
Assuming you can easily extract, eg Ss, Vs, and Os …
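Once subjects, verbs and objects have been extracted, asking which verbs go with a given subject is a simple tally. The sketch assumes the parser output has already been reduced to (subject, verb, object) triples; the triple format, the function name and the toy data are hypothetical.

```python
from collections import Counter

def verbs_for_subject(triples, subject):
    """Count the verbs whose subject is `subject` in (S, V, O) triples."""
    return Counter(v for s, v, o in triples if s == subject)

# Toy triples (hypothetical); object is None for intransitive uses
triples = [
    ("boat", "sail", None),
    ("boat", "sink", None),
    ("boat", "carry", "passengers"),
    ("boat", "sail", None),
    ("ship", "carry", "cargo"),
]
boat_verbs = verbs_for_subject(triples, "boat")
print(boat_verbs.most_common())
```

These counts can then be scored with MI or a t-score in the same way as adjacent word pairs.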
24/32 What kinds of things do boats do?
26/32 What is an appropriate unit of text?
Mostly we have looked at neighbouring words, or words within a defined context
Bigger discourse units can also provide useful information
eg taking entire text as the unit:
–How do stories that mention food differ from stories that mention water?
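Taking the whole text as the unit means counting, for each word, how many stories containing the keyword also contain that word. A minimal sketch, assuming each story is a plain string split on whitespace; the function name and the toy stories are hypothetical.

```python
from collections import Counter

def document_frequency(texts, keyword):
    """For texts that mention `keyword`, count how many of them contain each other word."""
    df = Counter()
    for text in texts:
        words = set(text.lower().split())
        if keyword in words:
            df.update(words - {keyword})  # each word counted once per text
    return df

# Toy stories (hypothetical)
stories = [
    "the food aid reached the village",
    "food prices rose again",
    "the water supply failed in the village",
    "flood water rose near the river",
]
df_food = document_frequency(stories, "food")
df_water = document_frequency(stories, "water")
print(df_food["village"], df_water["river"], df_food["river"])
```

The two sets of counts can then be contrasted word by word, for instance with a t-score, to see which words characterize one kind of story rather than the other.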
28/32 More subtle distinctions can be brought out in this way
What’s the difference between a boat and a ship?
Notice how immediately neighbouring words won’t necessarily tell much of a story
But words found in stories that mention boats/ships help to characterize the difference in distribution, and give a clue as to the difference in meaning
Notice that the human lexicographer still has to interpret the data
30/32 Word-sense disambiguation
The article also shows how you can distinguish two senses of bank
–Identify words which occur in the same text as bank and river on the one hand, and bank and money on the other
31/32 bank (river) vs bank (money)
t        bank&river   bank&money   w
6.63     45           4            river
4.90     28           13           River
4.01     20           13           water
3.57     16           11           feet
3.46     23           39           miles
3.44     21           32           near
3.27     12           5            boat
3.06     14           16           south
2.83     8            1            fisherman
2.83     21           49           along
2.76     11           12           border
2.74     17           35           area
2.72     9            6            village
2.71     7            0            drinking
2.70     16           32           across
2.66     9            7            east
2.58     7            2            century
2.53     10           13           missing

t        bank&river   bank&money   w
-15.95   6            467          money
-10.70   2            199          Bank
-10.60   0            134          funds
-10.46   0            131          billion
-10.13   0            124          Washington
-10.13   0            124          Federal
-9.43    0            110          cash
-9.03    1            134          interest
-8.79    1            129          financial
-8.79    0            98           Corp
-8.38    1            121          loans
-8.17    0            87           loan
-7.57    0            77           amount
-7.44    0            75           fund
-7.38    1            102          William
-7.31    1            101          company
-7.25    1            101          account
-7.25    0            72           deposits
32/32 Bank vs bank
t        Bank   bank   w
35.02    1324   24     Gaza
34.03    1301   36     Palestinian
33.60    1316   48     Israeli
33.18    1206   26     Strip
32.98    1204   29     Palestinians
32.68    1339   72     Israel
31.56    4116   1284   Bank
31.13    1151   47     occupied
30.79    1104   40     Arab
27.97    867    21     territories

t        Bank   bank   w
-36.48   1284   3362   bank
-10.93   900    1161   money
-10.43   624    859    federal
-9.59    586    786    company
-8.47    282    430    accounts
-8.26    544    693    central
-8.21    408    554    cash
-8.21    675    816    business
-7.74    546    676    loans
-7.54    52     140    robbery