1
SENSEVAL: Evaluating WSD Systems
Jason Blind & Lisa Norman
College of Computer and Information Science, Northeastern University, Boston, MA 02115
January 25, 2006
2
What is SENSEVAL?
Mission: To organize and run evaluations that test the strengths and weaknesses of WSD systems.
Underlying Goal: To further human understanding of lexical semantics and polysemy.
History: Began as a workshop in April 1997, organized by ACL-SIGLEX.
When: 1998, 2001, 2004, 2007?
3
What is WSD?
Word sense disambiguation (WSD) matters wherever a word's intended meaning must be resolved. For example:
Machine Translation: English drug translates into French as either drogue or médicament.
Information Retrieval: If a user queries for documents about drugs, do they want documents about illegal narcotics or about medicine?
How do people disambiguate word senses?
Grammatical context: “AIDS drug” (drug modified by a proper name)
Lexical context: If drug is followed by {addict, trafficker, etc.}, the proper translation is most likely drogue (a toy sketch of this idea follows below).
Domain-based context: If the text/document/conversation is about {disease, medicare, etc.}, then médicament is most likely the correct translation.
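A toy illustration of the lexical/domain-context idea: the sketch below chooses a French translation for drug from nearby context words. The cue lists, function name, and fallback choice are invented for illustration and are not taken from any SENSEVAL resource.

```python
# Minimal sketch of context-based disambiguation for "drug" (drogue vs. médicament).
# The cue-word lists below are illustrative assumptions, not an actual SENSEVAL resource.

NARCOTIC_CUES = {"addict", "trafficker", "dealer", "abuse", "smuggling"}
MEDICAL_CUES = {"disease", "patient", "treatment", "prescription", "medicare"}

def translate_drug(context_words):
    """Pick a French translation of 'drug' from nearby context words."""
    words = {w.lower() for w in context_words}
    narcotic_score = len(words & NARCOTIC_CUES)
    medical_score = len(words & MEDICAL_CUES)
    if narcotic_score > medical_score:
        return "drogue"
    if medical_score > narcotic_score:
        return "médicament"
    return "médicament"  # fall back to the (assumed) more frequent sense

print(translate_drug("the drug trafficker was arrested".split()))           # drogue
print(translate_drug("a new AIDS drug entered treatment trials".split()))   # médicament
```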
4
Evaluating WSD Systems
1. Definition of the task
2. Selection of the data to be used for the evaluation
3. Production of correct answers for the evaluation data
4. Distribution of the data to the participants
5. Participants use their programs to tag the data
6. Administrators score the participants' tagging (sketched below)
7. Participants and administrators meet to compare notes and learn lessons
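Steps 1-5 and 7 are largely organizational; the scoring step is mechanical. Below is a minimal sketch of exact-match scoring of system answers against a gold standard, assuming one sense tag per instance; the real SENSEVAL scorer also handles multiple correct senses, weighted answers, and fine- vs. coarse-grained scoring.

```python
# Minimal scoring sketch: exact-match scoring of system answers against a gold standard.
# Real SENSEVAL scoring is richer (multiple correct senses, weights, grain levels);
# this only illustrates the basic precision/recall/coverage bookkeeping.

def score(gold, answers):
    """gold and answers map instance-id -> sense-id."""
    attempted = {i for i in answers if i in gold}
    correct = sum(1 for i in attempted if answers[i] == gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    coverage = len(attempted) / len(gold) if gold else 0.0
    return precision, recall, coverage

gold = {"art.001": "art%1", "art.002": "art%2", "art.003": "art%1"}
answers = {"art.001": "art%1", "art.002": "art%1"}  # system skipped art.003
print(score(gold, answers))  # (0.5, 0.333..., 0.666...)
```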
5
Evaluation Tasks
All-Words
Lexical Sample
Multilingual Lexical Sample
Translation
Automatic Sub-categorization Acquisition (?)
WSD of WordNet Glosses
Semantic Roles (FrameNet)
Logic Forms (FOPC)
6
SENSEVAL-1
Comprised of WSD tasks for English, French, and Italian.
Timetable:
A plan for selecting evaluation materials was agreed.
Human annotators generated the ‘gold standard’ set of correct answers.
The gold-standard materials, without answers, were released to participants, who then had a short time to run their programs over them and return their answer sets to the organizers.
The organizers scored the returned answer sets; scores were announced and discussed at the workshop.
17 systems were evaluated.
7
SENSEVAL-1: Tasks
Lexical Sample: First, carefully select a sample of words from the lexicon (based on BNC frequency and WordNet polysemy levels; see the sketch after this slide); systems must then tag several corpus instances of the sample words in short extracts of text.
Advantages over All-Words:
More efficient human tagging.
The all-words task requires access to a full dictionary.
Many systems needed either sense-tagged training data or some manual input for each dictionary entry, so all-words would have been infeasible.
Task breakdown: 15 noun tasks, 13 verb tasks, 8 adjective tasks, 5 indeterminate tasks.
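A rough sketch of the word-selection step: choose target words that have at least a few senses and spread them across corpus-frequency bands. The toy lexicon, band boundaries, and function names are illustrative assumptions, not the actual SENSEVAL-1 selection procedure.

```python
# Sketch of selecting lexical-sample target words, stratified by corpus frequency
# and number of dictionary senses (in the spirit of BNC counts and WordNet polysemy).
# The frequency/polysemy values and band boundaries are invented for illustration.
import random

lexicon = {
    # word: (corpus_frequency, number_of_senses) -- toy values
    "bank": (25000, 10), "shake": (8000, 8), "float": (5000, 12),
    "onion": (1200, 2), "promise": (9000, 5), "sack": (1500, 6),
}

def frequency_band(freq):
    return "high" if freq >= 10000 else "mid" if freq >= 2000 else "low"

def sample_words(lexicon, per_band=1, min_senses=2, seed=0):
    """Pick words with at least min_senses senses, spread across frequency bands."""
    rng = random.Random(seed)
    bands = {"high": [], "mid": [], "low": []}
    for word, (freq, senses) in lexicon.items():
        if senses >= min_senses:
            bands[frequency_band(freq)].append(word)
    return {band: rng.sample(words, min(per_band, len(words)))
            for band, words in bands.items()}

print(sample_words(lexicon))
```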
8
SENSEVAL-1: Dictionary & Corpus
Hector: A joint Oxford University Press/Digital project that developed a database with a linked dictionary and a 17M-word corpus.
It was chosen because it was already sense-tagged, at a time when SENSEVAL was unsure whether it would have extra funding to pay humans to sense-tag text.
One disadvantage is that the corpus instances delivered by OUP came with very little context (usually 1-2 sentences).
9
SENSEVAL-1: Data
Dry-run Distribution: Systems must tag almost all of the content words in a sample of running text.
Training-data Distribution: First, carefully select a sample of words from the lexicon; systems must then tag several instances of the sample words in short extracts of text. 20,000+ instances of 38 words.
Evaluation Distribution: A set of corpus instances for each task. Each instance had been tagged by at least 3 humans (the human answers were obviously not part of the distribution). There were 8,448 corpus instances in total; most tasks had between 80 and 400 instances.
10
SENSEVAL-1: Baselines
Lesk's algorithm (sketched below)
Dictionary-based: unsupervised systems
Corpus-based: supervised systems
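The Lesk baseline picks, for each target word, the sense whose dictionary gloss shares the most words with the surrounding context. A minimal simplified-Lesk sketch over a toy sense inventory follows; the glosses and stopword list are invented for illustration (the actual baseline worked over Hector dictionary entries).

```python
# Minimal simplified-Lesk sketch: choose the sense whose dictionary gloss shares
# the most words with the target word's context. Toy sense inventory only.

SENSES = {
    "drug": {
        "drug%narcotic": "an illegal substance such as heroin or cocaine taken for its effects",
        "drug%medicine": "a substance used as a medicine in the treatment of disease",
    }
}

STOPWORDS = {"a", "an", "the", "of", "in", "for", "as", "such", "its", "or", "used", "taken"}

def lesk(word, context):
    """Return the sense of `word` whose gloss overlaps most with the context."""
    context_words = {w.lower().strip(".,") for w in context.split()} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        gloss_words = set(gloss.split()) - STOPWORDS
        overlap = len(context_words & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(lesk("drug", "patients received the drug as treatment for the disease"))
# -> drug%medicine
```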
11
SENSEVAL-1: Results (English)
State of the art, where training data is available, is 75%-80% accuracy.
When training data is available, systems that use it perform substantially better than those that do not.
A well-implemented simple Lesk algorithm is hard to beat.
12
SENSEVAL-2: Tasks
All-Words: Systems must tag almost all of the content words in a sample of running text.
Lexical Sample: First, carefully select a sample of words from the lexicon; systems must then tag several instances of the sample words in short extracts of text. 73 words = 29 nouns + 15 adjectives + 29 verbs.
Translation (Japanese only): a task, new in SENSEVAL-2, in which word senses are defined according to translation distinctions. (By contrast, SENSEVAL-1 evaluated systems only on lexical sample tasks, in English, French, and Italian.)
13
SENSEVAL-2: Dictionary & Corpus
Sense Dictionary: WordNet (1.7)
Corpus: Penn Treebank II Wall Street Journal articles; British National Corpus (BNC)
14
SENSEVAL-2: Data
Dry-run Distribution: Systems must tag almost all of the content words in a sample of running text.
Training-data Distribution: 12,000+ instances of 73 words.
Evaluation Distribution: A set of corpus instances for each task. Each instance had been tagged by at least 3 humans (the human answers were obviously not part of the distribution). There were ? corpus instances in total.
15
SENSEVAL-2: Results
34 teams, 93 systems.

Language   Task  # of Submissions  # of Teams  IAA   Baseline  Best System
Czech      AW    1                 —           —     —         .94
Basque     LS    3                 2           .75   .65       .76
Estonian   —     —                 —           .72   .85       .67
Italian    —     —                 —           —     —         .39
Korean     —     —                 —           —     .71       .74
Spanish    —     12                5           .64   .48       —
Swedish    —     8                 —           .95   —         .70
Japanese   —     7                 —           .86   —         .78
Japanese   TL    9                 —           .81   .37       .79
English    AW    21                —           —     .57       .69
English    LS    26                15          —     .51/.16   .64/.40
(— = value not given on the slide)
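The IAA column reports inter-annotator agreement among the human taggers. One simple way to compute such a figure is pairwise observed agreement over the multiply-tagged instances, sketched below; this is an illustrative measure, not necessarily the exact procedure the SENSEVAL-2 organizers used.

```python
# Minimal sketch of pairwise inter-annotator agreement (IAA): the fraction of
# annotator pairs, over all instances, that assigned the same sense tag.
# Illustrative only; not the exact SENSEVAL procedure.
from itertools import combinations

def pairwise_agreement(taggings):
    """taggings: list of per-instance lists of sense tags, one tag per annotator."""
    agree = total = 0
    for tags in taggings:
        for a, b in combinations(tags, 2):
            total += 1
            agree += (a == b)
    return agree / total if total else 0.0

instances = [
    ["bank%1", "bank%1", "bank%1"],   # all three annotators agree
    ["bank%1", "bank%2", "bank%1"],   # two of three agree
    ["bank%2", "bank%2", "bank%2"],
]
print(pairwise_agreement(instances))  # 0.777...
```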
16
SENSEVAL-3: Tasks
All-Words: English, Italian
Lexical Sample: Basque, Catalan, Chinese, English, Italian, Romanian, Spanish, Swedish
Multilingual Lexical Sample
Automatic Sub-categorization Acquisition (?)
WSD of WordNet Glosses
Semantic Roles (FrameNet)
Logic Forms (FOPC)
17
SENSEVAL-3: Dictionary & Corpus
Sense Dictionaries: WordNet (2.0), eXtended WordNet, EuroWordNet, ItalWordNet (1.7), MiniDir-Cat, FrameNet
Corpora: British National Corpus (BNC); Penn Treebank, Los Angeles Times, Open Mind Common Sense; SI-TAL (Integrated System for the Automatic Treatment of Language), MiniCors-Cat, etc.
18
SENSEVAL-3: Data
Dry-run Distribution: Systems must tag almost all of the content words in a sample of running text.
Training-data Distribution: 12,000+ instances of 57 words.
Evaluation Distribution: A set of corpus instances for each task. Each instance had been tagged by at least 3 humans (the human answers were obviously not part of the distribution).
19
Approaches to WSD
Lesk-based methods
Most Common Sense heuristic
Domain Relevance Estimation
Latent Semantic Analysis (LSA)
Kernel methods
EM-based clustering
Ensemble classification
Maximum Entropy
Naïve Bayes (sketched below)
SVM
Boosting
KPCA
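As a concrete example of the supervised approaches listed above, here is a minimal Naïve Bayes word-sense classifier over bag-of-words context features, with add-one smoothing and log probabilities. The toy training data and sense labels are invented for illustration.

```python
# Minimal Naive Bayes WSD sketch over bag-of-words context features, in the spirit
# of the supervised lexical-sample systems listed above. Toy training data;
# add-one smoothing; log probabilities to avoid underflow.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (context_words, sense). Returns (sense counts, word counts, vocab)."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def predict(model, context_words):
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_logp = None, float("-inf")
    for sense, count in sense_counts.items():
        logp = math.log(count / total)  # prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context_words:
            logp += math.log((word_counts[sense][w] + 1) / denom)  # smoothed likelihood
        if logp > best_logp:
            best_sense, best_logp = sense, logp
    return best_sense

examples = [
    (["interest", "loan", "money"], "bank%finance"),
    (["deposit", "account", "money"], "bank%finance"),
    (["river", "water", "fishing"], "bank%river"),
]
model = train(examples)
print(predict(model, ["loan", "account"]))  # bank%finance
```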
20
References
A. Kilgarriff. “An Exercise in Evaluating Word Sense Disambiguation Programs”, 1998.
A. Kilgarriff. “Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs”, 1998.
R. Mihalcea, T. Chklovski, and A. Kilgarriff. “The SENSEVAL-3 English Lexical Sample Task”, 2004.
J. Rosenzweig and A. Kilgarriff. “English SENSEVAL: Report and Results”, 1998.
P. Edmonds. “The Evaluation of Word Sense Disambiguation Systems”. ELRA Newsletter, Vol. 7, No. 3, 2002.
M. Carpuat, W. Su, and D. Wu. “Augmenting Ensemble Classification for Word Sense Disambiguation with a Kernel PCA Model”. SENSEVAL-3, 2004.