
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee 2007-08-17




1 An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee 2007-08-17 yonggulee@hotmail.com

2 Contents
- Introduction
- Related Works
- Research Goals
- Effective Word Sense Disambiguation Model and Evaluation
- Conclusion

3 Introduction
Word Sense Disambiguation (WSD)
- The problem of selecting a sense for a word from a set of predefined possibilities.
- An "intermediate task": not an end in itself, but necessary at one level or another.
- Essential for language understanding applications:
  - Machine translation
  - Information retrieval and hypertext navigation
  - Content and thematic analysis
  - Speech processing and text processing

4 Related Works (1/3)
Approaches to WSD:
- Knowledge-based disambiguation: uses external lexical resources such as dictionaries and thesauri, and discourse properties.
- Corpus-based disambiguation
- Hybrid disambiguation

5 Related Works (2/3)
Corpus-based disambiguation:
- Supervised disambiguation: based on a labeled training set. The learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories).
- Unsupervised disambiguation: based on unlabeled corpora. The learning system has a training set of feature-encoded inputs but NOT their appropriate sense labels (categories).

6 Related Works (3/3)
Lexical resources for WSD:
- Machine-readable formats
  - Machine Readable Dictionaries (MRDs): Longman, Oxford, etc.
  - Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc.
- Sense-tagged data
  - Senseval-1, 2, 3 (www.senseval.org): provides sense-annotated data for many languages and several tasks. Languages: English, Romanian, Chinese, Basque, Spanish, etc. Tasks: Lexical Sample, All Words, etc.
  - SemCor, Hector, etc.

7 Research Motivation
- Manual sense tagging is labor-intensive and costly.
- Available sense-tagged corpora are limited: apart from English, most languages have few corpora for WSD.
- Coverage of sense-tagged words is narrow: some corpora tag the senses of only one or a few words (the "line" corpus, the "interest" corpus, etc.).
- With a supervised disambiguation method, only the words that appear in the sense-tagged corpus can be disambiguated.

8 Research Goals
- Minimize or eliminate the cost of manual labeling: automatic sense tagging using an MRD and heuristic rules.
- Improve word sense disambiguation performance: supervised disambiguation with a Naïve Bayes classifier.

9 Effective Word Sense Disambiguation Model
- Automatic Tagging Technique
- Experimental Environment
- Evaluation of Automatic Tagging Technique
- Evaluation of Sense Classification
- Evaluation of Fusion Method

10 An Outline Diagram for the Proposed Research
[Diagram: an automatic sense tagging and training stage (key word extraction and collocation extraction from the collection, a sense-tagging dictionary, and an auto sense tagging module producing the training set) feeds a sense classification stage (context extraction of the target word, a Naïve Bayes classifier applied to test contexts, and evaluation of the classified word senses).]

11 Automatic Tagging Technique
- Dictionary Information-based Method
- Collocation Overlap-based Method
- Data Fusion Method (Dictionary Information-based Method + Collocation Overlap-based Method)

12 Dictionary Information-based Method (1/2)
Extract the necessary information from the dictionary.
- Heuristic 1: One sense per collocation / one sense per discourse. Example: "telephone line"; 景氣展望 / Gyeonggi-jeonmang (economic prospect).
- Heuristic 2: Use of corresponding Chinese characters. Example: 감자 /Gamja: 柑子 (potato) / 減資 (reduction of capital).
- Heuristic 3: Co-occurrence of synonyms, antonyms, and related terms.
- Heuristic 4: Occurrence of derived words.
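The first two heuristics above can be sketched as ordered rules where the first match wins. This is a hypothetical illustration; the data structures (`collocation_senses`, `hanja_senses`) and function names are assumptions, not the author's implementation:

```python
# Hypothetical sketch of the first two tagging heuristics, applied in order.

def tag_by_collocation(word, context, collocation_senses):
    """Heuristic 1: one sense per collocation (e.g. 'telephone' fixes 'line')."""
    for colloc, sense in collocation_senses.get(word, []):
        if colloc in context:
            return sense
    return None

def tag_by_hanja(word, hanja, hanja_senses):
    """Heuristic 2: Chinese characters printed beside the word fix its sense."""
    return hanja_senses.get((word, hanja))

def auto_tag(word, context, hanja, collocation_senses, hanja_senses):
    """Try the heuristics in order; return None to leave the word untagged."""
    sense = tag_by_collocation(word, context, collocation_senses)
    if sense is None and hanja is not None:
        sense = tag_by_hanja(word, hanja, hanja_senses)
    return sense
```

A word left untagged by every heuristic simply stays out of the automatically built training set.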

13 Dictionary Information-based Method (2/2)
Heuristic 5: Co-occurrence of key features extracted from the definitions of the target word's entry, as in Lesk (1986).
Algorithm:
1. Retrieve from the MRD all sense definitions of the word to be disambiguated.
2. Determine the overlap between each sense definition and the current context.
3. Choose the sense with the highest overlap.
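The three steps above are the simplified Lesk algorithm; a minimal sketch (the sense inventory and tokenization are illustrative assumptions):

```python
def simplified_lesk(context_words, sense_definitions):
    """Pick the sense whose dictionary definition shares the most
    words with the current context (Lesk, 1986)."""
    best_sense, best_overlap = None, -1
    for sense, definition_words in sense_definitions.items():
        overlap = len(set(definition_words) & set(context_words))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```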

14 Collocation Overlap-based Method
A semantic similarity metric using collocation overlap.
Algorithm:
1. Retrieve from the MRD the keywords of all sense definitions of the word to be disambiguated.
2. Extract the collocation words of those keywords from the test collection, cut off by a threshold.
3. Extract the collocation words of the target word from the test collection.
4. Determine the overlap between the collocation sets from steps 2 and 3.
5. Choose the sense with the highest overlap.
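The five steps above can be sketched as follows; the window size, top-k cutoff, and tokenized-corpus representation are assumptions for illustration:

```python
from collections import Counter

def collocations(word, corpus, window=5, top_k=30):
    """Top-k words co-occurring with `word` within a +/- `window` token span."""
    counts = Counter()
    for sent in corpus:
        if word in sent:
            i = sent.index(word)
            for w in sent[max(0, i - window): i + window + 1]:
                if w != word:
                    counts[w] += 1
    return {w for w, _ in counts.most_common(top_k)}

def tag_by_collocation_overlap(target, corpus, sense_keywords):
    """Choose the sense whose definition keywords' collocations overlap
    most with the target word's own collocations."""
    target_colloc = collocations(target, corpus)
    scores = {}
    for sense, keywords in sense_keywords.items():
        sense_colloc = set()
        for kw in keywords:
            sense_colloc |= collocations(kw, corpus)
        scores[sense] = len(target_colloc & sense_colloc)
    return max(scores, key=scores.get)
```

The `top_k` parameter corresponds to the Top10/Top30/… thresholds evaluated later in the slides.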

15 Feature Selection
By document frequency:
- docDF: document frequency in the test collection.
- dicDF: document frequency treating each dictionary definition as a document.
- Keep features with docDF <= 5000 and dicDF <= 300.
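The thresholded selection above amounts to a simple filter; a minimal sketch (the dictionary-of-frequencies representation is an assumption):

```python
def select_features(vocab, doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep only features whose document frequency in the test collection
    (docDF) and in the dictionary definitions treated as documents (dicDF)
    are both at or below the thresholds given on the slide."""
    return {w for w in vocab
            if doc_df.get(w, 0) <= max_doc_df and dic_df.get(w, 0) <= max_dic_df}
```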

16 Sense Classification: Naïve Bayes Classifier
Algorithm: see Manning and Schütze (1999), Foundations of Statistical Natural Language Processing.
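The Naïve Bayes decision rule from Manning and Schütze picks the sense s maximizing log P(s) + Σ_v log P(v|s) over the context words v. Below is a generic add-one-smoothed sketch under that rule, not the author's exact configuration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes sense classifier in the style of Manning & Schütze (1999):
    choose the sense s maximizing log P(s) + sum_v log P(v|s) over context
    words v, with add-one smoothing. A minimal sketch."""

    def fit(self, tagged_contexts):
        # tagged_contexts: list of (context_words, sense) pairs.
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, sense in tagged_contexts:
            self.sense_counts[sense] += 1
            for w in words:
                self.word_counts[sense][w] += 1
                self.vocab.add(w)
        self.total = sum(self.sense_counts.values())
        return self

    def predict(self, context_words):
        def score(sense):
            s = math.log(self.sense_counts[sense] / self.total)  # log prior
            n = sum(self.word_counts[sense].values())
            v = len(self.vocab)
            for w in context_words:  # add-one smoothed log likelihoods
                s += math.log((self.word_counts[sense][w] + 1) / (n + v))
            return s
        return max(self.sense_counts, key=score)
```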

17 Experimental Environment (1/2)
Test collection:
- All 127,641 articles from three Korean daily newspapers for the year 2004.
- A part-of-speech tagger and lexical analysis were applied.
Evaluation measure: accuracy.

18 Target Words for WSD

Word             No. of Senses  No. of Articles  Total Frequency
감자 /Gamja      2              622              1,115
경기 /Gyeonggi   4              18,484           37,763
기간 /Gigan      2              11,255           15,803
신병 /Sinbyeong  3              360              469
신장 /Sinjang    4              703              952
연기 /Yongi      4              3,227            5,147
인도 /Indo       5              2,022            2,750
지구 /Jigu       2              4,017            9,372
지원 /Jiwon      3              12,577           21,320

19 Evaluation of Automatic Sense Tagging: Dictionary Information-based Method, by Rule

No  Information Type    Total    Correct  Accuracy
1   Collocation         3,229    2,931    0.9077
2   Chinese characters  74       74       1.0000
3   Synonym             2,107    1,598    0.7584
3   Antonym             237      195      0.8228
3   Related Terms       846      791      0.9350
4   Derived Words       1,078    1,071    0.9935
5   Definitions         128,520  60,810   0.5091
    SUM                 136,091  67,470   0.4958

20 Results of Feature Selection, by Word (All Information Types)

Word             Total   Correct  Accuracy
감자 /Gamja      802     800      0.9975
경기 /Gyeonggi   6,200   4,833    0.7795
기간 /Gigan      2,128   1,271    0.5973
신병 /Sinbyeong  299     265      0.8863
신장 /Sinjang    653     471      0.7213
연기 /Yongi      4,732   4,169    0.8810
인도 /Indo       3,956   2,274    0.5748
지구 /Jigu       2,207   2,124    0.9624
지원 /Jiwon      4,826   1,870    0.3875
SUM              25,803  18,077   0.7006

21 Results of Feature Selection, by Rule

No  Information Type    Total   Correct  Accuracy
1   Collocation         1,603   1,548    0.9657
2   Chinese characters  74      74       1.0000
3   Synonym             1,650   1,556    0.9430
3   Antonym             237     195      0.8228
3   Related Terms       846     791      0.9350
4   Derived Words       1,078   1,071    0.9935
5   Definitions         20,315  12,842   0.6321
    SUM                 25,803  18,077   0.7006


23 Evaluation of Automatic Sense Tagging: Collocation Overlap-based Method, Performance by Threshold

Rank    Total   Correct  Accuracy
Top10   6,155   3,727    0.6055
Top30   9,258   5,215    0.5633
Top50   11,544  6,264    0.5426
Top100  13,432  6,751    0.5026
All     19,436  7,796    0.4009


25 Auto Tagging Results at Top30, by Target Word

Word             Main Source                 Total  Correct  Accuracy
감자 /Gamja      Definitions                 273    251      0.9194
경기 /Gyeonggi   Definitions                 3,540  2,951    0.8336
기간 /Gigan      Synonym, Definitions        1,205  365      0.3029
신병 /Sinbyeong  Definitions                 112    67       0.5982
신장 /Sinjang    Definitions                 101    77       0.7624
연기 /Yongi      Definitions                 520    435      0.8365
인도 /Indo       Antonym, Definitions        277    195      0.7040
지구 /Jigu       Definitions                 609    546      0.8966
지원 /Jiwon      Related Words, Definitions  2,621  328      0.1251
SUM                                          9,258  5,215    0.5633

26 Auto Tagging Results at Top30, by Information Type

Information Type  Total  Correct  Accuracy
Synonym           544    402      0.7390
Antonym           129    119      0.9225
Related Terms     230    166      0.7217
Definitions       8,355  4,528    0.5420
SUM               9,258  5,215    0.5633

27 Comparison of Two Auto Tagging Methods

28 Build a Classifier
- Train set: 600 examples; test set: the remainder.
- Window size: 50-byte length.
- Rule for building the train set: the automatic sense tagging contains errors; to reduce them and improve the tagging accuracy of the train set, information types with high accuracy are used first.
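The train-set rule above (take examples from the most accurate information types first) can be sketched as a sort-then-truncate; the data shapes are illustrative assumptions:

```python
def build_train_set(auto_tagged, accuracy_by_type, size=600):
    """Take auto-tagged examples from the most accurate information types
    first (e.g. Chinese characters before definitions), so the `size`
    training examples contain as few tagging errors as possible.
    auto_tagged: list of (context, info_type) pairs (illustrative shape)."""
    ordered = sorted(auto_tagged,
                     key=lambda ex: accuracy_by_type[ex[1]],
                     reverse=True)
    return ordered[:size]
```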

29 Sense Classification: Dictionary Information-based Method (All Information Types)

Word             Total   Correct  Accuracy
감자 /Gamja      1,270   1,139    0.8966
경기 /Gyeonggi   37,763  30,897   0.8182
기간 /Gigan      15,803  9,278    0.5871
신병 /Sinbyeong  469     386      0.8230
신장 /Sinjang    953     671      0.7043
연기 /Yongi      5,147   4,302    0.8359
인도 /Indo       2,750   1,212    0.4408
지구 /Jigu       9,375   8,373    0.8932
지원 /Jiwon      21,321  8,345    0.3914
SUM              94,851  64,604   0.6811

30 Sense Classification: Collocation Overlap-based Method, by Rank

Rank    Total   Correct  Accuracy
Top10   94,851  57,201   0.6031
Top30   94,851  58,891   0.6209
Top50   94,851  56,871   0.5996
Top100  94,851  56,916   0.6001
All     94,851  53,218   0.5611

31 Sense Classification: Collocation Overlap-based Method, by Target Word

Word             Total   Correct  Accuracy
감자 /Gamja      1,270   1,016    0.8000
경기 /Gyeonggi   37,763  30,787   0.8153
기간 /Gigan      15,803  10,239   0.6479
신병 /Sinbyeong  469     290      0.6183
신장 /Sinjang    953     622      0.6527
연기 /Yongi      5,147   2,834    0.5507
인도 /Indo       2,750   1,683    0.6118
지구 /Jigu       9,375   8,499    0.9065
지원 /Jiwon      21,321  2,922    0.1371
SUM              94,851  58,891   0.6209

32 Comparison of Two Sense Classifications

33 Data Fusion of Two Auto Tagging Methods
- Dictionary Information-based Method: uses all information types except definitions.
- Collocation Overlap-based Method: uses only the Top-10 collocations.
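The fusion above combines the taggings produced by the two restricted methods; a minimal sketch, assuming each method yields a mapping from word occurrences to (sense, information type) pairs and letting the higher-accuracy dictionary tags win on conflict (the conflict rule is an assumption, not stated on the slide):

```python
def fuse_taggings(dict_tags, colloc_tags):
    """Data fusion: keep dictionary-based tags from every information type
    except definitions, plus collocation-overlap tags (already restricted
    upstream to the Top-10 cutoff). dict_tags / colloc_tags map an
    occurrence key -> (sense, info_type)."""
    fused = {}
    for key, (sense, info_type) in colloc_tags.items():
        fused[key] = sense
    for key, (sense, info_type) in dict_tags.items():
        if info_type != "definitions":
            fused[key] = sense  # dictionary tag overrides on conflict
    return fused
```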

34 Results of the Auto Tagging Method in Data Fusion, by Word

Word             Total   Correct  Accuracy
감자 /Gamja      503     485      0.9642
경기 /Gyeonggi   3,086   2,582    0.8367
기간 /Gigan      2,189   1,507    0.6884
신병 /Sinbyeong  96      56       0.5833
신장 /Sinjang    305     271      0.8885
연기 /Yongi      950     888      0.9347
인도 /Indo       367     273      0.7439
지구 /Jigu       939     918      0.9776
지원 /Jiwon      2,917   1,657    0.5680
SUM              11,352  8,637    0.7608

35 Results of the Auto Tagging Method in Data Fusion, by Information Type

No  Information Type    Total   Correct  Accuracy
1   Collocation         1,603   1,548    0.9657
2   Chinese characters  74      74       1.0000
3   Synonym             2,064   1,856    0.8992
3   Antonym             336     290      0.8631
3   Related Terms       978     907      0.9274
4   Derived Words       1,078   1,071    0.9935
5   Definitions         5,219   2,891    0.5539
    SUM                 11,352  8,637    0.7608

36 Comparison of the Three Auto Tagging Methods

Auto Tagging Method                  Total   Correct  Accuracy
Dictionary Information-based Method  25,803  18,077   0.7006
Collocation Overlap-based Method     9,258   5,215    0.5633
Fusion Method                        11,352  8,637    0.7608

37 Sense Classification in Data Fusion, by Word

Word             Total   Correct  Accuracy
감자 /Gamja      1,270   1,087    0.8559
경기 /Gyeonggi   37,763  32,128   0.8508
기간 /Gigan      15,803  13,055   0.8261
신병 /Sinbyeong  469     121      0.2580
신장 /Sinjang    953     702      0.7366
연기 /Yongi      5,147   4,437    0.8621
인도 /Indo       2,750   1,251    0.4547
지구 /Jigu       9,375   8,205    0.8752
지원 /Jiwon      21,321  11,251   0.5277
SUM              94,851  72,237   0.7616

38 Comparison of Three WSD Methods

WSD Method                           Total   Correct  Accuracy  Improvement (%)
Fusion Method                        94,851  72,237   0.7616    -
Dictionary Information-based Method  94,851  64,604   0.6811    11.82
Collocation Overlap-based Method     94,851  58,891   0.6209    22.66
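The improvement column is the relative gain of the fusion method over each baseline, e.g. (0.7616 − 0.6811) / 0.6811 × 100 ≈ 11.82%:

```python
def improvement_pct(fusion_acc, base_acc):
    """Relative improvement of the fusion method over a baseline, in percent."""
    return round((fusion_acc - base_acc) / base_acc * 100, 2)
```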

39 Conclusion (1/2)
- The performance of the automatic tagging technique differed depending on the type of information source in the dictionary.
- For frequently used keywords extracted from the dictionary, a feature selection method needs to be applied.

40 Conclusion (2/2)
- The word sense disambiguation model using the automatic tagging method based on dictionary information showed performance comparable to a supervised learning method using manual tagging information.
- The WSD model using the data fusion technique, combining the two automatic tagging methods, outperforms the models using a single tagging method.

41 Q&A




