Clustering Word Senses Eneko Agirre, Oier Lopez de Lacalle IxA NLP group


1 Clustering Word Senses Eneko Agirre, Oier Lopez de Lacalle IxA NLP group http://ixa.si.ehu.es

2 Eneko Agirre – IXA NLP group – University of the Basque Country – GWNC 2004
Introduction: motivation
- The desired granularity of word-sense distinctions is controversial
- Fine-grained word senses are unnecessary for some applications
  - MT: channel (tv, strait) → kanal
  - The Senseval-2 WSD competition also provides coarse-grained senses
- The desired sense groupings depend on the application:
  - MT: same translation (language-pair dependent)
  - IR: some related senses: metonymic, diathesis, specialization
  - Dialogue (deeper NLP): in principle, all word senses, in order to draw proper inferences
- WSD needs to be tuned and multiple senses returned → clustering of word senses

3 Introduction: a sample word
Channel has 7 senses and 4 coarse-grained senses (Senseval-2)

4 Introduction
Work presented here: test the quality of 4 clustering methods
- 2 based on distributional similarity
- Confusion matrix of Senseval-2 systems
- Translation equivalences
Result: hierarchical clusters
Clustering algorithms: CLUTO toolkit
Evaluation: Senseval-2 coarse-grained senses

5 Clustering toolkit used
CLUTO (Karypis 2001)
Possible inputs:
- a context vector for each word sense (from corpora)
- a similarity matrix (built from any source)
A number of clustering parameters
Output: hierarchical or flat clusters
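The pipeline the slide describes (a similarity matrix in, hierarchical or flat clusters out) can be sketched with SciPy's agglomerative clustering as a stand-in for CLUTO; the 4×4 similarity matrix over four hypothetical word senses and the two-cluster cut are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy similarity matrix over 4 hypothetical word senses (1.0 = identical).
sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

# Convert similarity to distance and cluster hierarchically (average link).
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist), method="average")

# Cut the dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # senses 0-1 and 2-3 end up grouped together
```

CLUTO itself additionally accepts raw context vectors and exposes many more clustering criteria; this only mirrors the similarity-matrix input mode.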

6 Distributional similarity methods
Hypothesis: two word senses are similar if they are used in similar contexts
1. Clustering directly over the examples
2. Clustering over similarity among topic signatures

7 1. Clustering directly from examples
1. Take examples from tagged data (Senseval-2), OR retrieve sense examples from the web
   - E.g., for examples of the first sense of channel, use examples of its monosemous synonym: transmission channel
   - We use: synonyms, hypernyms, all hyponyms, siblings
   - 1000 snippets for each monosemous term from Google
   - Resource freely available (contact us)
2. Cluster the examples as if they were documents

8 2. Clustering over similarity among TS
1. Retrieve the examples
2. Build topic signatures: a vector of the words in the context of a word sense, with high weights for distinguishing words:
   1. sense: channel, transmission_channel "a path over which electrical signals can pass"
      medium(3110.34) optic(2790.34) transmission(2547.13) electronic(1553.85) channel(1352.44) mass(1191.12) fiber(1070.28) public(831.41) fibre(716.95) communication(631.38) technology(368.66) system(363.39) datum(308.50) ...
3. Build a similarity matrix of the TS
4. Cluster
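Step 3 above (turning topic signatures into a similarity matrix) can be sketched with cosine similarity; the shared vocabulary and the weights below are a toy excerpt loosely modeled on the channel signatures, not the real data.

```python
import numpy as np

# Hypothetical topic signatures: chi-square weights per context word,
# restricted here to a tiny shared toy vocabulary.
vocab = ["medium", "transmission", "station", "television", "service", "mail"]
ts = {
    "channel#1": np.array([3110.3, 2547.1, 0.0, 0.0, 0.0, 0.0]),
    "channel#5": np.array([0.0, 0.0, 0.0, 0.0, 3360.3, 1402.3]),
    "channel#7": np.array([0.0, 0.0, 24288.5, 13759.8, 0.0, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

senses = list(ts)
sim = np.array([[cosine(ts[a], ts[b]) for b in senses] for a in senses])
print(np.round(sim, 2))  # identity-like here: the three toy signatures share no words
```

In practice the vectors span the whole vocabulary, so signatures of related senses share weighted words and get a non-zero similarity; the matrix then feeds the clustering step.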

9 3. Confusion matrix method
Hypothesis: sense A is similar to sense B if many WSD algorithms tag occurrences of A as B
Implemented using the results of all Senseval-2 systems
4. Translation similarity method
Hypothesis: two word senses are similar if they are translated in the same way in a number of languages (Resnik & Yarowsky, 2000)
Similarity matrix kindly provided by Chugur & Gonzalo (2002)

10 Experiment and results: by method
Best results for distributional similarity: topic signatures from web data

Method                   Purity (worst)  Purity (best)
Random                   0.748
Confusion matrices       0.768
Multilingual similarity  0.799
TS Senseval              0.744           0.806
TS Web                   0.764           0.840
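The purity figures above can be computed as follows; the gold coarse classes and the toy clustering are hypothetical, chosen so the score comes out at 5/7 ≈ 0.714, the value reported for the channel example on a later slide.

```python
from collections import Counter

def purity(clusters, gold):
    """Purity of a clustering against gold classes: for each induced cluster
    take the count of its majority gold class, sum over clusters, and divide
    by the total number of items."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(gold[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / total

# Toy example: 7 fine-grained senses, hypothetical gold coarse classes A-D.
gold = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C", 7: "D"}
clusters = [[1, 2, 5], [3, 4], [6, 7]]
print(purity(clusters, gold))  # 5/7 ≈ 0.714
```

A perfect clustering scores 1.0, while the "Random" baseline row shows that even chance groupings score fairly high under purity, which is why it is reported alongside the methods.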

11 word by word

12 Conclusions
- Meaningful hierarchical clusters
  - For all WordNet nominal synsets (soon)
  - Using web data and distributional similarity
  - All data freely available (MEANING)
But...
- Are the clusters useful for the detection of relations (homonymy, metonymy, metaphor, ...) among word senses? Which clusters?
- Are the clusters useful for applications?
  - WSD (ongoing work)
  - MT, IR, CLIR, Dialogue
  - Which clusters?

13 Thank you!

14 An example of a topic signature
http://ixa3.si.ehu.es/cgi-bin/signatureak/signaturecgi.cgi
Source: web examples using monosemous relatives
1. sense: channel, transmission_channel "a path over which electrical signals can pass"
   medium(3110.34) optic(2790.34) transmission(2547.13) electronic(1553.85) channel(1352.44) mass(1191.12) fiber(1070.28) public(831.41) fibre(716.95) communication(631.38) technology(368.66) system(363.39) datum(308.50)
5. sense: channel, communication_channel, line "(often plural) a means of communication or access"
   service(3360.26) postal(2503.25) communication(1868.81) mail(1402.33) communicate(1086.16) us(651.30) channel(479.36) communicating(340.82) united(196.55) protocol(170.02) music(165.93) london(162.61) drama(160.95)
7. sense: channel, television_channel, TV_channel "a television station and its programs"
   station(24288.54) television(13759.75) tv(13226.62) broadcast(1773.82) local(1115.18) radio(646.33) newspaper(333.57) affiliated(301.73) programming(283.02) pb(257.88) own(233.25) independent(230.88)

15 Experiment and results: an example
Sample cluster built for channel: entropy 0.286, purity 0.714.

16 1. Clustering directly from examples: retrieving sense examples from the web
- Examples of word senses are scarce
- Alternative: automatically acquire examples from corpora (or the web)
- In this paper we follow the monosemous-relatives method (Leacock et al. 1998)
  - E.g., for examples of the first sense of channel, use examples of its monosemous synonym: transmission channel
  - We use: synonyms, hypernyms, all hyponyms, siblings
  - 1000 snippets for each monosemous term from Google
  - Heuristics to extract partial or full meaningful sentences
- More details of the method in (Agirre et al. 2001)
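The monosemous-relatives selection can be sketched as below; a tiny hand-coded lexicon stands in for WordNet, and the relative words and sense counts are illustrative, not real WordNet data.

```python
# A tiny hand-coded stand-in for WordNet: each entry lists relatives
# (synonyms, hypernyms, hyponyms, siblings) of one sense of "channel",
# together with how many senses each relative word has overall.
relatives = {
    "channel#1": [("transmission_channel", 1), ("communication", 3)],
    "channel#7": [("television_channel", 1), ("tv_channel", 1), ("station", 4)],
}

def monosemous_relatives(sense):
    """Keep only relatives with exactly one sense: any corpus example of such
    a word can be attributed to `sense` without disambiguation."""
    return [word for word, n_senses in relatives[sense] if n_senses == 1]

print(monosemous_relatives("channel#7"))  # ['television_channel', 'tv_channel']
```

Queries for these monosemous words (here, to a search engine for snippets) then yield training examples for the polysemous target sense for free, which is the point of the method.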

17 2. Clustering over similarity among TS: building topic signatures
Given a set of examples for each word sense, build a vector for each word sense; each word in the vocabulary is a dimension.
Steps:
1. Get frequencies for each word in context
2. Use χ² to assign a weight to each word/dimension, in contrast to the other word senses
3. Filtering step
More details of the method in (Agirre et al. 2001)
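The χ² weighting in step 2 can be sketched for a single word/dimension; the counts below are hypothetical, and the exact contrast used in the paper may differ in details such as smoothing and filtering.

```python
def chi2_weight(k_sense, n_sense, k_other, n_other):
    """Chi-square score of a context word for one sense versus the rest,
    from the 2x2 contingency table of (word present / absent) x
    (contexts of this sense / contexts of the other senses)."""
    a, b = k_sense, n_sense - k_sense    # this sense: word present / absent
    c, d = k_other, n_other - k_other    # other senses: word present / absent
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Hypothetical counts: "station" appears in 80 of 100 contexts of channel#7
# but in only 5 of 200 contexts of the other senses.
print(round(chi2_weight(80, 100, 5, 200), 1))  # 197.2: a strong signature word
```

Words with high scores for one sense and low scores elsewhere survive the filtering step and end up with the large weights seen in the signature examples.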

18 3. Confusion matrix method
Hypothesis: sense A is similar to sense B if WSD algorithms tag occurrences of A as B
Implemented using the results of all Senseval-2 systems
Algorithm to produce the similarity matrix:
  M = number of systems
  N(x) = number of occurrences of word sense x
  n(a,b) = number of times sense a is tagged as b
  confusion-similarity(a,b) = n(a,b) / (N(a) × M)
Not symmetric
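The slide's formula is terse about precedence; the sketch below assumes division by N(a)·M, i.e. the fraction of all system taggings of occurrences of sense a that chose sense b, so the score stays in [0,1]. The system count and tagging counts are hypothetical.

```python
def confusion_similarity(n_ab, n_a, n_systems):
    """Fraction of all system taggings of occurrences of sense a
    (n_a occurrences x n_systems systems) that assigned sense b.
    Note the measure is asymmetric: sim(a,b) != sim(b,a) in general."""
    return n_ab / (n_a * n_systems)

# Hypothetical numbers: 21 participating systems, sense a occurs 40 times,
# and across all systems it is tagged as sense b 168 times in total.
print(confusion_similarity(168, 40, 21))  # 0.2
```

The asymmetry is natural: systems may often collapse a rare fine sense a into a frequent sense b while rarely making the opposite mistake.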

19 4. Translation similarity method
Hypothesis: two word senses are similar if they are translated in the same way in a number of languages (Resnik & Yarowsky, 2000)
Similarity matrix kindly provided by Chugur & Gonzalo (2002)
Simplified algorithm:
  L = number of languages (= 4)
  n(a,b) = number of languages in which a and b share a translation
  similarity(a,b) = n(a,b) / L
The actual formula is more elaborate
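The simplified algorithm above can be sketched directly; the language set and the per-language translations of the two channel senses are hypothetical stand-ins (only the Basque kanal comes from an earlier slide), and the actual Chugur & Gonzalo formula is more elaborate.

```python
LANGUAGES = ["eu", "es", "fr", "de"]  # L = 4 languages, as on the slide

# Hypothetical translation sets for two senses of "channel", per language.
translations = {
    "channel#1": {"eu": {"kanal"}, "es": {"canal"},
                  "fr": {"canal"}, "de": {"kanal"}},
    "channel#7": {"eu": {"kanal", "kate"}, "es": {"cadena"},
                  "fr": {"chaine"}, "de": {"sender"}},
}

def translation_similarity(a, b):
    """Simplified slide formula: the fraction of languages in which
    senses a and b share at least one translation."""
    shared = sum(1 for lang in LANGUAGES
                 if translations[a][lang] & translations[b][lang])
    return shared / len(LANGUAGES)

print(translation_similarity("channel#1", "channel#7"))  # 0.25: only "kanal" is shared
```

Senses that a language pair systematically lexicalizes the same way thus score high, which is exactly the grouping an MT application wants.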

20 Previous work on WordNet clustering
Use of the WordNet structure:
- Peters et al. 1998: WordNet hierarchy, trying to identify systematic polysemy
- Tomuro 2001: WordNet hierarchy (MDL), trying to identify systematic polysemy (60% precision against WordNet cousins, increase in inter-tagger agreement)
  → Our proposal does not look for systematic polysemy; we get individual relations among word senses, e.g. television channel and transmission channel
- Mihalcea & Moldovan 2001: heuristics on WordNet, WSD improvement (polysemy reduction 26%, error 2.1% on SemCor)
  → Provides complementary information

21 Previous work (continued)
- Resnik & Yarowsky 2000 (also Chugur & Gonzalo 2002): translations across different languages, improving evaluation metrics (very high correlation with the Hector sense hierarchies)
  → We only get 80% purity using (Chugur & Gonzalo). Unfortunately the dictionaries are rather different (Senseval-2 results dropped compared to Senseval-1). Difficult to scale to all words.
- Pantel & Lin 2002: induce word senses using soft clustering of word occurrences (overlap with WordNet over 60% precision)
  → Uses syntactic dependencies rather than bag-of-words vectors
- Palmer et al. (submitted): criteria for grouping verb senses


