Page 1: SenDiS
Sectoral Operational Programme "Increase of Economic Competitiveness"
"Investments for your future"
Project co-financed by the European Regional Development Fund
General Word Sense Disambiguation System applied to the Romanian and English Languages (SenDiS)
Andrei Mincă
SenDiS: WSD model, components, algorithms, methods & results
Page 2: SenDiS WSD model
Page 3: SenDiS system components
Page 4: WSD phases
- Order Lexicon Network (OLN)
- Build Meaning Semantic Signatures (BMSS)
- Compare Meaning Semantic Signatures (CMSS)
- Compute WSD Variants (CwsdV)
Page 5: OLN algorithms
Input: an unordered lexicon network
Lexicon network optimizations consider:
- the number of edges
- loops or strongly connected components
- the number of roots and leaves
- the number of levels (when the LN is levelled)
Output: an ordered lexicon network
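The ordering step can be illustrated with a minimal Python sketch: Kahn-style levelling of a directed graph, which assigns each node its longest-path level from the roots and, as a by-product, exposes the nodes trapped in loops or strongly connected components. The edge-list representation and the function name are illustrative, not SenDiS's actual data model.

```python
from collections import defaultdict, deque

def level_lexicon_network(edges):
    """Level a directed lexicon network (level 0 = roots, i.e. nodes
    with no incoming edge); nodes left unlevelled sit on a cycle.
    `edges` is a list of (source, target) pairs -- a toy stand-in
    for the real lexicon-network representation."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    level = {n: 0 for n in nodes if indeg[n] == 0}   # roots
    queue = deque(level)
    longest = defaultdict(int)                       # longest path seen so far
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            longest[v] = max(longest[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:                        # all predecessors levelled
                level[v] = longest[v]
                queue.append(v)
    cyclic = sorted(nodes - set(level))              # on or behind a cycle
    return level, cyclic
```

Nodes that never reach in-degree zero are exactly those on (or only reachable through) a loop or strongly connected component, which is why the slide lists them among the optimization concerns.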
Page 6: BMSS algorithms
Input:
- a lexicon network (not necessarily ordered)
- a meaning (ID)
Builds a semantic interpretation for the specified meaning over the lexicon network, as:
- spanning trees
- sets of nodes
- sequences of edges
- combinations of the above
Output: a semantic interpretation (signature) for the meaning
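One of the signature shapes listed above, a set of nodes, can be sketched as a bounded reachability set over the network. The `succ` adjacency mapping and the depth cut-off are assumptions made for the sketch, not SenDiS's actual BMSS algorithm.

```python
def node_set_signature(succ, meaning_id, depth=2):
    """Build a set-of-nodes signature: all nodes reachable from
    `meaning_id` in at most `depth` steps over the lexicon network.
    `succ` maps node -> list of successor nodes (toy representation)."""
    frontier = {meaning_id}
    signature = {meaning_id}
    for _ in range(depth):
        # expand one step, keeping only nodes not yet in the signature
        frontier = {v for u in frontier for v in succ.get(u, [])} - signature
        signature |= frontier
    return signature
```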
Page 7: CMSS algorithms
Input: two or more semantic signatures
The comparison depends on the nature of the semantic signatures.
Output: degrees of similarity
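For set-of-nodes signatures, one plausible degree of similarity is set overlap (Jaccard). As the slide notes, the comparison depends on the signature's nature, so this is only one example of a CMSS measure, not the system's actual choice.

```python
def jaccard(sig_a, sig_b):
    """Degree of similarity between two node-set signatures:
    |intersection| / |union|, in [0, 1]."""
    if not sig_a and not sig_b:
        return 0.0          # convention: two empty signatures share nothing
    return len(sig_a & sig_b) / len(sig_a | sig_b)
```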
Page 8: CwsdV algorithms
Input: a matrix of degrees of similarity between the senses of the context words
Output: one or several WSD variants with the highest cost
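A brute-force sketch of the variant search: pick one sense per context word, score each assignment by the sum of pairwise similarities, and return every variant tied for the highest cost. The `frozenset`-keyed similarity dict is a toy stand-in for the similarity matrix; real CwsdV algorithms would avoid the exponential enumeration.

```python
from itertools import product

def best_wsd_variants(words_senses, sim):
    """Exhaustively score every assignment of one sense per word.
    `words_senses` is a list of candidate-sense lists, one per context
    word; `sim` maps frozenset({sense_a, sense_b}) -> similarity."""
    best, best_cost = [], float("-inf")
    for variant in product(*words_senses):
        cost = sum(sim.get(frozenset((a, b)), 0.0)
                   for i, a in enumerate(variant)
                   for b in variant[i + 1:])
        if cost > best_cost:
            best, best_cost = [variant], cost
        elif cost == best_cost:
            best.append(variant)    # keep ties: "one or several" variants
    return best, best_cost
```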
Page 9: WSD methods
Input:
- text
- a list of meanings
- a lexicon network
Computing:
- tokenization of the text
- annotation of text tokens with meaning interpretations
- selection of a window-text for WSD
- other context filters or topologies
- building meaning semantic signatures for each word sense
- comparing meaning semantic signatures and filling the matrix
- computing the best WSD variants
Output: one or more WSD variants, with one or more meaning interpretations for each text token
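The steps above can be glued into a toy end-to-end pipeline. Whitespace tokenization, bounded-reachability signatures, set-overlap comparison, and exhaustive variant search are all naive stand-ins for SenDiS's actual algorithms; every name and parameter here is illustrative.

```python
from itertools import product

def wsd_pipeline(text, inventory, succ, window=3):
    """Toy WSD method. `inventory` maps a word to its candidate
    meaning ids; `succ` is the lexicon network as node -> successors."""
    tokens = text.lower().split()                             # tokenization
    targets = [t for t in tokens if t in inventory][:window]  # window-text

    def signature(mid, depth=2):                              # BMSS step
        seen, frontier = {mid}, {mid}
        for _ in range(depth):
            frontier = {v for u in frontier for v in succ.get(u, [])} - seen
            seen |= frontier
        return seen

    def compare(a, b):                                        # CMSS step
        return len(a & b) / len(a | b) if a | b else 0.0

    sigs = {m: signature(m) for t in targets for m in inventory[t]}
    best, best_cost = None, float("-inf")                     # CwsdV step
    for variant in product(*(inventory[t] for t in targets)):
        cost = sum(compare(sigs[a], sigs[b])
                   for i, a in enumerate(variant) for b in variant[i + 1:])
        if cost > best_cost:
            best, best_cost = variant, cost
    return dict(zip(targets, best)) if best else {}
```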
Page 10: General WSD requirements
- tokenization
- part-of-speech tagging
- lemmatization
- sense interpretations
- chunking
- parsing
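A minimal sketch of the per-token annotation record that such a preprocessing chain produces; all field names are illustrative, not SenDiS's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """One annotated text token after the preprocessing chain above."""
    surface: str                                 # from tokenization
    pos: str = ""                                # from part-of-speech tagging
    lemma: str = ""                              # from lemmatization
    senses: list = field(default_factory=list)   # candidate sense interpretations
    chunk: str = ""                              # from chunking / parsing
```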
Page 11: Testing WSD
Performance indicators:
- P (precision) = noCorrectlyDisambiguated_TargetWords / noDisambiguated_TargetWords
- R (recall) = noCorrectlyDisambiguated_TargetWords / noTargetWords
- F-measure = 2 * P * R / (P + R)
State-of-the-art results (F-measure):
- lexical sample task: coarse-grained ~90%, fine-grained ~73%
- all-words task: coarse-grained ~83%, fine-grained ~65%
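The three indicators translate directly into code (with guards against empty denominators):

```python
def wsd_scores(n_correct, n_disambiguated, n_targets):
    """Precision, recall and F-measure exactly as defined above."""
    p = n_correct / n_disambiguated if n_disambiguated else 0.0
    r = n_correct / n_targets if n_targets else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, 60 correct out of 80 disambiguated target words from a total of 100 target words gives P = 0.75, R = 0.6 and F-measure = 2/3.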
Page 12: Testing SenDiS
A test configuration for SenDiS consists of:
- a meaning inventory
- a lexicon network
- an OLN algorithm
- a BMSS algorithm
- a CMSS algorithm
- a CwsdV algorithm
- a WSD method
- a test corpus
Number of configurations: nMIs x nLNs x nOLNs x nBMSSs x nCMSSs x nCwsdVs x nWSDMs x nCorpusTests
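The size of the test matrix is simply the product of the choices along each axis:

```python
from math import prod

def n_test_configurations(counts):
    """Number of SenDiS test configurations: the product of the number
    of choices per axis (meaning inventories, lexicon networks,
    OLN/BMSS/CMSS/CwsdV algorithms, WSD methods, test corpora)."""
    return prod(counts)
```

For example, 2 meaning inventories, 3 lexicon networks, 1 OLN, 2 BMSS, 2 CMSS, 1 CwsdV algorithm, 2 WSD methods and 3 test corpora already yield 144 configurations, which is why exhaustive testing grows expensive quickly.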
Page 13: Results

Senseval 2 (no POS tagging)
No. Texts | LexNet | P | R | F-measure | Time (h) | Observations
224 | WN_ex | | | | | meaning interpretations only for recognized lemmas
225 | WN_ex | | | | | % coverage for GRAALAN Inflection Form Entries
225 | WN_ex | | | | | % IFEs + corpus target words lemmas tags

Senseval 3 (no POS tagging)
No. Texts | LexNet | P | R | F-measure | Time (h) | Observations
254 | WN_ex | | | | | no IFEs
265 | WN_ex | | | | | % IFEs
256 | WN_ex | | | | | % IFEs + corpus target words lemmas tags

Semcor (no POS tagging)
No. Texts | LexNet | P | R | F-measure | Time (h) | Observations
33,855 | WN_ex | | | | | % IFEs
33,866 | WN_ex | | | | | % IFEs + corpus target words lemmas tags
Page 14: Tagged glosses as a test corpus

WN_ex (no POS tagging)
No. Texts | LexNet | P | R | F-measure | Time (h) | Observations
206,941 | WN_ex | | | | | only corpus target words lemmas tags
158,378 | WN_ex | | | | | % IFEs
158,667 | WN_ex | | | | | % IFEs + corpus target words lemmas tags

LLR_99% (no POS tagging)
No. Texts | LexNet | P | R | F-measure | Time (h) | Observations
106,899 | LLR_99% | | | | | no IFEs
110,596 | LLR_99% | | | | | % IFEs
110,635 | LLR_99% | | | | | % IFEs + corpus target words lemmas tags

LLE_2% (no POS tagging)
No. Texts | LexNet | P | R | F-measure | Time (h) | Observations
2,927 | LLE_2% | | | | | no IFEs
3,125 | LLE_2% | | | | | % IFEs
3,071 | LLE_2% | | | | | % IFEs + corpus target words lemmas tags
Page 15