Published by Jean Ellis. Modified over 9 years ago.
1
Information Retrieval using Word Senses: Root Sense Tagging Approach
Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim
Natural Language Processing Lab., Department of Computer Science and Engineering, Korea University
2
Introduction
Because natural language is lexically ambiguous, text retrieval systems are expected to benefit from resolving the ambiguity of every word in a given collection.
Previous IR experiments using word senses have shown disappointing results.
Reasons for the previous failures include skewed sense frequencies, the collocation problem, and inaccurate sense disambiguation.
3
Introduction
WSD for IR tasks should be performed on all ambiguous words in a collection, because user queries cannot be known in advance.
WSD performance reaches at most about 75% precision and recall on the all-words task of the SENSEVAL competition (about 95% on the lexical sample task).
4
Introduction
Some observations suggest that sense disambiguation for crude tasks such as IR differs from traditional word sense disambiguation.
It is questionable whether fine-grained word sense disambiguation is necessary to improve retrieval performance.
ex: "stock" has 17 different senses in WordNet.
For IR, consistent disambiguation is more important than accurate disambiguation, and flexible disambiguation is better than strict disambiguation.
5
Root Sense Tagging Approach
This approach aims to improve the performance of large-scale text retrieval by conducting coarse-grained, consistent, and flexible WSD.
25 root senses for the nouns in WordNet 2.0 are used.
ex: "story" has 6 senses in WordNet
- {message, fiction, history, report, fib} are from the same root sense of "relation".
- {floor} is from the root sense of "artifact".
6
Root Sense Tagging Approach
The root sense tagger classifies each noun in the documents and queries into one of the 25 root senses, so it is called coarse-grained disambiguation.
7
Root Sense Tagging Approach
When classifying a given ambiguous word, the most informative neighboring clue word, that is, the one with the highest MI with the given word, is selected.
The single most probable sense among the candidate root senses for the given word is then chosen according to the MI between the selected clue word and each candidate root sense.
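The two-step selection above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes MI means pointwise mutual information estimated from raw co-occurrence counts, and the names `pmi`, `tag_root_sense`, and all toy counts are hypothetical.

```python
import math
from collections import Counter


def pmi(pair_count, x_count, y_count, total):
    """Pointwise MI from raw counts (assumed estimator); -inf if never seen together."""
    if pair_count == 0 or x_count == 0 or y_count == 0:
        return float("-inf")
    return math.log((pair_count * total) / (x_count * y_count))


def tag_root_sense(target, context, candidates,
                   word_pairs, sense_pairs, word_counts, sense_counts, total):
    """Return (clue word, chosen root sense), or (None, None) without context."""
    if not context or not candidates:
        return None, None
    # Step 1: the clue word is the neighbor with the highest MI with the target.
    clue = max(context, key=lambda w: pmi(word_pairs[(w, target)],
                                          word_counts[w], word_counts[target], total))
    # Step 2: the chosen sense is the candidate with the highest MI with the clue.
    sense = max(candidates, key=lambda s: pmi(sense_pairs[(clue, s)],
                                              word_counts[clue], sense_counts[s], total))
    return clue, sense


# Toy counts, invented purely for illustration.
TOTAL = 100
WORD_COUNTS = Counter({"system": 10, "computer": 8, "the": 50})
WORD_PAIRS = Counter({("computer", "system"): 6, ("the", "system"): 9})
SENSE_COUNTS = Counter({"artifact": 20, "cognition": 15, "body": 5})
SENSE_PAIRS = Counter({("computer", "artifact"): 7, ("computer", "cognition"): 1})

result = tag_root_sense("system", ["the", "computer"],
                        ["artifact", "cognition", "body"],
                        WORD_PAIRS, SENSE_PAIRS, WORD_COUNTS, SENSE_COUNTS, TOTAL)
print(result)
```

In the toy data, "the" co-occurs with "system" more often in absolute terms, but "computer" has the higher MI, so it becomes the clue word.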
8
Root Sense Tagging Approach
There are 101,778 non-ambiguous units in WordNet 2.0.
ex:
- "actor" = {role player, doer} → person
- "computer system" → artifact
9
Co-occurrence Data Construction
The steps to extract co-occurrence information from each document:
1. Assign a root sense to each non-ambiguous noun in the document.
2. Assign a root sense to the second noun of each non-ambiguous compound noun in the document.
3. If a noun tagged in step 2 also occurs alone elsewhere in the document, assign it the same root sense as in step 2.
4. For each sense-assigned noun in the document, extract all (context word, sense) pairs within a predefined window.
5. Extract all (word, word) pairs.
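The five steps above can be sketched in code. This is only an illustration under several assumptions: the document is already reduced to a list of noun tokens, the lookup tables `unambiguous` and `compounds` are hypothetical stand-ins for the WordNet-derived resources, and the (word, word) pairs of step 5 are taken from the same window as step 4, which the slides do not state.

```python
from collections import Counter


def extract_cooccurrences(tokens, unambiguous, compounds, window=2):
    """Return (senses per token, (context word, sense) counts, (word, word) counts)."""
    senses = [None] * len(tokens)

    # Step 1: tag nouns that have a single root sense.
    for i, tok in enumerate(tokens):
        if tok in unambiguous:
            senses[i] = unambiguous[tok]

    # Step 2: tag the second noun of each non-ambiguous compound noun.
    head_sense = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair in compounds:
            senses[i + 1] = compounds[pair]
            head_sense[tokens[i + 1]] = compounds[pair]

    # Step 3: propagate that sense to standalone occurrences of the same noun.
    for i, tok in enumerate(tokens):
        if senses[i] is None and tok in head_sense:
            senses[i] = head_sense[tok]

    # Steps 4 and 5: collect pairs inside the window around each position.
    sense_pairs, word_pairs = Counter(), Counter()
    for i, tok in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            word_pairs[(tokens[j], tok)] += 1
            if senses[i] is not None:
                sense_pairs[(tokens[j], senses[i])] += 1
    return senses, sense_pairs, word_pairs


# Toy document and lookup tables, invented for illustration.
doc = ["actor", "computer", "system", "story", "system"]
senses, sense_pairs, word_pairs = extract_cooccurrences(
    doc,
    unambiguous={"actor": "person"},
    compounds={("computer", "system"): "artifact"},
    window=1,
)
print(senses)
```

The second, standalone "system" receives the sense "artifact" through step 3, so the ambiguous noun "story" contributes two (story, artifact) training pairs.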
10
Co-occurrence Data Construction
11
MI-based Root Sense Tagging
"system" has 9 fine-grained senses in WordNet, and 5 root senses: artifact, cognition, body, substance and attribute.
12
Indexing and Retrieval
A 26-bit sense field is added to each term posting element in the index.
One bit is used for unk, which is assigned to unknown words.
If s(w) is set to null or w is not a noun, all the bits are 0.
Two situations must be considered:
- Several different root senses may be assigned to the same word within a document.
- Only nouns are sense tagged, but a verb with the same indexing keyword form may exist in the document.
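One way to picture the 26-bit sense field is as a bit mask with one bit per root sense plus one bit for unk. The sketch below is an assumption-laden illustration: the slides do not give the bit layout, so the alphabetical ordering, the use of the WordNet noun lexicographer category names as the 25 root senses, the position of the unk bit, and the function names are all hypothetical. Setting several bits at once is one plausible way to handle a word that receives different root senses within a document.

```python
# Assumed list: the 25 WordNet noun lexicographer categories, in alphabetical order.
ROOT_SENSES = [
    "act", "animal", "artifact", "attribute", "body", "cognition",
    "communication", "event", "feeling", "food", "group", "location",
    "motive", "object", "person", "phenomenon", "plant", "possession",
    "process", "quantity", "relation", "shape", "state", "substance", "time",
]
assert len(ROOT_SENSES) == 25

UNK = "unk"
UNK_BIT = 1 << 25  # hypothetical position of the 26th bit


def encode_sense_field(senses, is_noun=True):
    """Pack the root senses seen for a term in one document into a 26-bit integer."""
    if not is_noun or not senses:
        return 0  # null sense or non-noun: all bits are 0
    field = 0
    for sense in senses:
        if sense == UNK:
            field |= UNK_BIT
        else:
            field |= 1 << ROOT_SENSES.index(sense)
    return field


def decode_sense_field(field):
    """Unpack a sense field into root sense names (unk last)."""
    names = [s for i, s in enumerate(ROOT_SENSES) if field & (1 << i)]
    if field & UNK_BIT:
        names.append(UNK)
    return names


field = encode_sense_field(["cognition", "artifact"])
print(f"{field:026b}", decode_sense_field(field))
```

Because the field fits in a single 32-bit integer per posting, the extra storage per term posting stays small.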
13
Indexing and Retrieval
14
A sense-oriented term weighting method is proposed.
The index is a traditional term-based index.
The term weight is transformed by a sense weight sw, which is calculated from the sense field.
The sense weight sw_ij for term t_i in document d_j is defined as:
sw_ij = 1 + α · q(dsf_ij, qsf_i)
where dsf_ij and qsf_i denote the sense fields of term t_i in document d_j and in the query, respectively.
The sense-matching function q is defined as:
15
Data and Evaluation Methodologies
Two document collections and two query sets are used.
Documents
- 210,157 documents of the Financial Times collection in TREC CD vol. 4
- 127,742 documents of the LA Times collection in vol. 5
Queries
- TREC 7 (351-400) and TREC 8 (401-450) queries
Three baseline term weighting methods
- W1: simple idf weighting
- W2: tf · idf weighting
- W3: (1 + log(tf)) · idf weighting
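The three baseline weights can be written out as small functions. This is a sketch only: the slides do not specify the idf formula or the logarithm base, so idf = ln(N / df) is an assumption, and the function names are hypothetical.

```python
import math


def idf(n_docs, df):
    """Inverse document frequency, assumed here to be ln(N / df)."""
    return math.log(n_docs / df)


def w1(tf, n_docs, df):
    """W1: simple idf weighting (term frequency is ignored)."""
    return idf(n_docs, df)


def w2(tf, n_docs, df):
    """W2: tf * idf weighting."""
    return tf * idf(n_docs, df)


def w3(tf, n_docs, df):
    """W3: (1 + log(tf)) * idf weighting; requires tf >= 1."""
    return (1 + math.log(tf)) * idf(n_docs, df)


# A term occurring 8 times in a document and in 10 of 1000 documents.
for name, fn in (("W1", w1), ("W2", w2), ("W3", w3)):
    print(name, round(fn(8, 1000, 10), 3))
```

W3 dampens the effect of term frequency compared with W2: when tf = 1 both reduce to W1, and for larger tf the W3 weight grows only logarithmically.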
16
Experiment Results
The sense weight parameter α is set to 0.5.
Larger improvements are obtained in the long-query experiments.
Overgrown weight in W3(W2)+sense
17
Pseudo Relevance Feedback
Five terms from the top ten documents are selected by the probabilistic term selection method.
For the sense fields of the new query terms in the +sense experiments, a voting method is used, in which the most frequent root sense in the top 10 documents is assigned to the new terms.
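The voting step for expansion terms can be sketched as a simple frequency count. The data layout (one dictionary per retrieved document mapping a term to the root senses assigned to its occurrences) and the function name are hypothetical, and counting every occurrence rather than one vote per document is an assumption the slides do not settle.

```python
from collections import Counter


def vote_root_sense(term, top_docs):
    """Return the most frequent root sense of `term` across the feedback documents."""
    votes = Counter()
    for doc in top_docs:
        # Each tagged occurrence of the term in a document casts one vote.
        votes.update(doc.get(term, ()))
    if not votes:
        return None  # the term was never sense tagged in the feedback set
    return votes.most_common(1)[0][0]


# Toy feedback set, invented for illustration (three documents instead of ten).
top_docs = [
    {"bank": ["possession"], "loan": ["possession"]},
    {"bank": ["object"]},
    {"bank": ["possession", "possession"]},
]
print(vote_root_sense("bank", top_docs))
```

A term that never receives a tag in the feedback documents yields None, which corresponds to leaving its sense field empty.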
18
Result of Pseudo Relevance Feedback
19
BM25 using Root Senses
20
Conclusions
A coarse-grained, consistent, and flexible sense tagging method is proposed to improve large-scale text retrieval performance.
This approach can be applied to retrieval systems in other languages in cases where there are lexical resources much more roughly constructed than expensive resources like WordNet.
The proposed sense-field based indexing and sense-weight oriented ranking do not seriously increase system overhead.
21
Conclusions
Experiment results show good performance even with the relevance feedback method or the state-of-the-art BM25 retrieval model.
Verbs should also be assigned senses in future work.
It is essential to develop an elaborate retrieval model, i.e., a term weighting model that considers word senses.