1 Representing Meaning in Unsupervised Word Sense Disambiguation Bridget T. McInnes 5 September 2008 University of Minnesota Twin Cities
2 What is WSD? The culture count doubled. Culture Laboratory Culture Anthropological Culture Sense Inventory
3 Approaches to WSD Supervised Advantages: obtains high accuracy Disadvantages: manually annotated training data is required for each word that needs to be disambiguated, so it does not scale Unsupervised Advantages: does not require manually annotated training data Disadvantages: generally does not obtain as high an accuracy as supervised approaches
4 Unsupervised Approaches Similarity and Relatedness Based
5 Unsupervised Approaches Similarity and Relatedness Based Patwardhan, Banerjee and Pedersen 2005 Pedersen, et al 2006 Budanitsky and Hirst 2006
6 Unsupervised Approaches Similarity and Relatedness based Vector Based
7 Unsupervised Approaches Similarity and Relatedness Based Vector-based Mohammad and Hirst, 2006 Patwardhan, 2003 Pedersen, et al 2006 Humphrey, et al 2006
8 Unsupervised Approaches Similarity and Relatedness-based Vector-based Clustering
9 Unsupervised Approaches Similarity and Relatedness based Vector-based Clustering Pedersen and Bruce, 1997 Schütze, 1998 Pedersen and Bruce, 1998 Purandare and Pedersen, 2004 Kulkarni and Pedersen, 2005
10 Road Map Previous Approaches Our vector approach Future Work
11 Previous Approaches Similarity and Relatedness Based SenseRelate (Banerjee and Pedersen, 2003) Vector-based Semantic Type Indexing (Humphrey et al 2006) Clustering SenseClusters (Kulkarni and Pedersen, 2005)
12 Banerjee and Pedersen 2003 SenseRelate
13 SenseRelate Target Word: Transport Concept 1: Biological Transport (C ) Concept 2: Patient Transport (C ) Transport of glutathione S-linked conjugates. [Diagram: SS + SS = Total SS for concept 1]
14 SenseRelate Target Word: Transport Concept 1: Biological Transport (C ) Concept 2: Patient Transport (C ) Transport of glutathione S-linked conjugates. [Diagram: SS + SS = Total SS for concept 1; SS + SS = Total SS for concept 2]
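The scoring on the SenseRelate slides can be sketched roughly as follows: each candidate concept gets the sum of its similarity scores against the context, and the concept with the largest total wins. This is a minimal sketch, not the SenseRelate implementation; the function names and the toy similarity scores are hypothetical, and the context words stand in for whatever concepts the real system maps them to.

```python
def sense_relate(context_terms, candidate_concepts, similarity):
    """Pick the candidate concept with the largest summed similarity score."""
    totals = {
        concept: sum(similarity(concept, term) for term in context_terms)
        for concept in candidate_concepts
    }
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical similarity scores, invented for illustration only.
TOY_SCORES = {
    ("Biological Transport", "glutathione"): 0.6,
    ("Biological Transport", "conjugates"): 0.5,
    ("Patient Transport", "glutathione"): 0.1,
    ("Patient Transport", "conjugates"): 0.2,
}


def toy_similarity(concept, term):
    # Unknown pairs get a score of zero in this toy table.
    return TOY_SCORES.get((concept, term), 0.0)


best, totals = sense_relate(
    ["glutathione", "conjugates"],
    ["Biological Transport", "Patient Transport"],
    toy_similarity,
)
print(best)
```

Passing the similarity measure in as a function keeps the sketch independent of any particular similarity or relatedness measure.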
15 Humphrey et al, 2006 Semantic Type Indexing for WSD
16 Semantic Type Indexing (STI) Target Word: Transport Concept 2 Vector Concept 1 Vector Target Word Vector Cosine 2 Cosine 1 Concept 1: Biological Transport Semantic type: Cell Function Concept 2: Patient Transport Semantic type: Health Care Activity JDI CV1 – JDI vector CV2 – JDI vector TW – JDI vector Transport of glutathione S-linked conjugates.
17 Target Word Vector Transport of glutathione S-linked conjugates. Contains the words surrounding the ambiguous word
18 STI - Target Word Vectors Transport of glutathione S-linked conjugates. Contains the words surrounding the ambiguous word
19 STI - Concept Vectors The concept vectors are created based on their semantic type(s). Transport: C : Biological Transport; C : Patient Transport. Cell Function: one-word terms in the Metathesaurus associated with Cell Function. Health Care Activity: one-word terms in the Metathesaurus associated with Health Care Activity
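The comparison in the STI slides (a cosine between the target word vector and each concept vector, with the larger cosine winning) can be sketched as below. This is only an illustration: the real system uses JDI vectors, whereas the sparse word-count dictionaries, the function names, and the toy counts here are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def pick_concept(target_vector, concept_vectors):
    """Return the concept whose vector is closest to the target word vector."""
    scores = {name: cosine(target_vector, vec) for name, vec in concept_vectors.items()}
    return max(scores, key=scores.get), scores


# Toy vectors, invented for illustration only.
target = {"glutathione": 1.0, "conjugates": 1.0}
concepts = {
    "Biological Transport": {"glutathione": 2.0, "membrane": 1.0},
    "Patient Transport": {"ambulance": 2.0, "hospital": 1.0},
}
chosen, scores = pick_concept(target, concepts)
print(chosen)
```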
20 Kulkarni and Pedersen, 2005 SenseClusters
21 SenseClusters (SC) Target Word: Transport Concept 1: Biological Transport Concept 2: Patient Transport Instance 1 Instance 2 Instance 3 Instance 4 Instance 5 Instance 6 Instance 7 Instance 8 Instance 9 Instance 10 Instance 11 Instance 12 Instance 13 … Concept 1 Concept 2 Transport of glutathione S-linked conjugates.
22 SenseClusters (SC) Instance 1 Instance 2 Instance 3 Instance 4 Instance 5 Instance 6 Instance 7 Instance 8 Instance 9 Instance 10 Instance 11 Instance 12 Instance 13 … Concept 1 Concept 2 Target Word: Transport Concept 1: Biological Transport Concept 2: Patient Transport Transport of glutathione S-linked conjugates.
23 SenseClusters Concept 2 Vector Concept 1 Vector Target Word Vector Cosine 2 Cosine 1 Target Word: Transport Concept 1: Biological Transport Concept 2: Patient Transport Transport of glutathione S-linked conjugates.
24 SC - Vectors Contain the words surrounding the ambiguous word. Created using: first order co-occurrences; second order co-occurrences
25 First Order Co-occurrence Vectors glutathione S-linked conjugates Word 1 Word 2 Word N Target Vector
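A first order vector like the one pictured can be sketched as a count of how often each vocabulary word (Word 1 … Word N) appears in the context of the ambiguous word. This is a minimal sketch under that assumption; the function name, the tokenized context, and the three-word vocabulary are hypothetical.

```python
from collections import Counter


def first_order_vector(context_tokens, vocabulary):
    """Count occurrences of each vocabulary word in the context window."""
    known = set(vocabulary)
    counts = Counter(token for token in context_tokens if token in known)
    # One dimension per vocabulary word, in vocabulary order.
    return [counts[word] for word in vocabulary]


# Toy context and vocabulary, invented for illustration only.
vocabulary = ["conjugates", "glutathione", "membrane"]
context = ["glutathione", "s-linked", "conjugates", "glutathione"]
vector = first_order_vector(context, vocabulary)
print(vector)
```

Words outside the vocabulary (here "s-linked") are simply dropped, so the vector length always equals the vocabulary size.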
26 Second Order Co-occurrence Vectors [Figure: the 1st order vector for glutathione over Word 1, Word 2, …, Word N; a co-occurrence matrix with rows Word 1, Word 2, …, Word N; and the resulting 2nd order vector for glutathione]
27 Second Order Co-occurrence Vectors S-linked conjugates Word 1 Word 2 Word N Target Vector glutathione
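One way to read the second order slides: each context word is replaced by its own row from a word-by-word co-occurrence matrix built on training text, and the rows are averaged into the target vector. The sketch below assumes that reading; the function name and the toy matrix are hypothetical, and averaging (rather than summing) is an assumption.

```python
def second_order_vector(context_tokens, cooccurrence, vocabulary):
    """Average the co-occurrence rows of the context words that have a row."""
    rows = [cooccurrence[token] for token in context_tokens if token in cooccurrence]
    if not rows:
        return [0.0] * len(vocabulary)
    return [sum(row.get(word, 0) for row in rows) / len(rows) for word in vocabulary]


# Toy co-occurrence counts, invented for illustration only.
cooccurrence = {
    "glutathione": {"enzyme": 2, "cell": 2},
    "conjugates": {"enzyme": 4},
}
vocabulary = ["enzyme", "cell", "patient"]
vector = second_order_vector(["glutathione", "s-linked", "conjugates"], cooccurrence, vocabulary)
print(vector)
```

Because the context words contribute their neighbours rather than themselves, two contexts can end up close together even when they share no words directly.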
28 Our unsupervised approach
29 CuiTools Approach Our approach uses a general vector approach with SenseClusters vectors
30 CuiTools Concept 2 Vector Concept 1 Vector Target Word Vector Cosine 2 Cosine 1 Target Word: Transport Concept 1: Biological Transport (C ) Concept 2: Patient Transport (C ) Transport of glutathione S-linked conjugates.
31 CuiTools Approach Our approach uses a general vector approach with SenseClusters vectors. We explore using: first-order co-occurrence vectors; second-order co-occurrence vectors
32 Target Word Vector Contains the words surrounding the ambiguous word Transport of glutathione S-linked conjugates.
33 CuiTools - Concept Vectors How to create a vector that can represent the meaning of a concept for word sense disambiguation?
34 To answer this question We explore information in the UMLS that can be used to represent the meaning of a concept.
35 CuiTools - Concept Vectors Adjustment Individual Adjustment Conceptually broad term referring to a state of harmony between internal needs and external … Adjustment Action The act of making necessary corrections or modifications … Psychological Adjustment A state of harmony between internal needs and external demands and the processes used … CUI definition
36 CuiTools - Concept Vectors Blood Pressure Force exerted by the blood on the walls of the arteries and other vessels. Blood Pressure Determination Actions performed to measure the diastolic and systolic pressure of the blood. Arterial Pressure NO DEFINITION CUI definition
37 CuiTools - Concept Vectors CUI definition. Use the CUI definition, but if it doesn’t exist: PARent definition; Semantic Type definition; SYNonymous terms. For example, C : Laboratory Culture: laboratory culture, microbial culture, sample culture
38 CuiTools - Concept Vectors CUI definition. If the CUI definition doesn’t exist: PARent definition; Semantic Type definition; SYNonymous terms; SIBlings. For example, C : Anthropological Culture: archeology family social groups
39 CuiTools - Concept Vectors CUI definition. If the CUI definition doesn’t exist: PARent definition; Semantic Type definition; SIBlings; SYNonymous terms; TOP 50 most frequent words surrounding the terms associated with the CUI
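The fallback described on the concept vector slides (use the CUI definition, otherwise another source of text) can be sketched as below. This is one possible reading, in which the first non-empty fallback is used on its own; the slides do not say whether fallbacks are tried in order or combined. The record layout, the field names, and the function name are all hypothetical, and the toy record does not reflect actual UMLS content.

```python
def concept_text(record, fallbacks=("parent_definition", "semantic_type_definition", "synonyms")):
    """Return the CUI definition, or the first non-empty fallback field."""
    definition = record.get("definition")
    if definition:
        return definition
    for field in fallbacks:
        value = record.get(field)
        if value:
            # Term lists (synonyms, siblings) are joined into one string.
            return " ".join(value) if isinstance(value, (list, tuple)) else value
    return ""


# Toy record without a definition, invented for illustration only.
no_definition = {
    "definition": None,
    "synonyms": ["laboratory culture", "microbial culture", "sample culture"],
}
text = concept_text(no_definition)
print(text)
```

The resulting text would then be turned into a concept vector in the same way as the context of the target word.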
40 Dataset National Library of Medicine's Word Sense Disambiguation (NLM-WSD) Dataset 50 words from the 1998 MEDLINE abstracts 100 instances for each of the 50 words The target word was manually assigned a UMLS concept or None All instances of None were removed Average number of concepts per ambiguous word is 2.26
41 Data subsets Humphrey subset (Humphrey, et al): out of the 50 words in NLM-WSD, 5 words were excluded because at least two of the possible concepts associated with these words have the same semantic type. Instances that were assigned “None” were removed
42 Training Data The training data used to create the 1st and 2nd order co-occurrence vectors is the 2005 MEDLINE baseline
43 Results
45 Results of Co-occurrence Vectors
46 Results of the Representations of Meaning
47 Results of the Representations of Meaning - CUI Adding the parent and semantic type definitions decreased the accuracy by 6 and 7 percentage points Parent and semantic type definitions are too broad to define the meaning of a concept
48 Results of the Representations of Meaning - SYN Using the synonymous terms associated with a concept is too narrow to represent the meaning. Adjustment Action Adjustment – action Adjustments Adjustment, NOS Adjustment – action qualifier value Adjustment – action procedure
49 Results of the Representations of Meaning - SIB Using the terms associated with the siblings of a concept is too broad to represent the meaning. Adjustment Action Biopsy Cauterisation Cautery Cold Therapy Desiccation Drainage procedure Electrolysis
50 Results of the Representations of Meaning
51 Supervised versus Unsupervised [Chart comparing: Joshi et al 04, McInnes et al 07, Stevenson et al 08, SenseClusters, Humphrey et al 06, CuiTools]
52 To recap How to create a vector that can represent the meaning of a concept for word sense disambiguation?
53 Conclusions To answer this we explored information in the UMLS that could be used to represent the meaning of a concept Finding a context to represent the meaning of a concept is difficult We found using the top 50 most frequent words surrounding the terms associated with the concept best represented the concept for the task of word sense disambiguation
54 Take away message Unsupervised approaches are showing promise. Their disadvantage relative to supervised approaches, a lower disambiguation accuracy, is slowly disappearing. But we are not there yet … so there is more work to do
55 Future Work UMLS-Similarity package Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors
56 First Order Co-occurrence Vectors glutathione S-linked conjugates Word 1 Word 2 Word N Target Vector FREQ(glutathione, Word N) Average
57 First Order Co-occurrence Vectors glutathione S-linked conjugates Word 1 Word 2 Word N Target Vector Similarity(glutathione, Word N) Average
58 First Order Co-occurrence Vectors glutathione S-linked conjugates Word 1 Word 2 Word N Target Vector Similarity(glutathione, Word N) Sum (like SenseRelate)
59 First Order Co-occurrences glutathione Word 1 Word 2 Word N [Diagram: for Word N (C ), Similarity = .3 + .2 = .5]
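The proposal on these slides, filling the first order vector with similarity scores instead of frequencies and combining them by sum or average, might look like the sketch below. The function name and the `combine` parameter are hypothetical, and assigning the .3 and .2 from the slide to two particular context words is an assumption made only to reproduce the arithmetic.

```python
def similarity_weighted_vector(context_tokens, vocabulary, similarity, combine="sum"):
    """Fill each vocabulary dimension with combined similarity scores."""
    vector = []
    for feature in vocabulary:
        scores = [similarity(token, feature) for token in context_tokens]
        total = sum(scores)
        if combine == "average":
            vector.append(total / len(scores) if scores else 0.0)
        else:
            vector.append(total)  # summing mirrors the SenseRelate-style total
    return vector


# Toy similarity scores, invented for illustration only.
TOY_SIMILARITY = {
    ("glutathione", "Word N"): 0.3,
    ("conjugates", "Word N"): 0.2,
}


def toy_similarity(token, feature):
    return TOY_SIMILARITY.get((token, feature), 0.0)


summed = similarity_weighted_vector(["glutathione", "conjugates"], ["Word N"], toy_similarity)
averaged = similarity_weighted_vector(
    ["glutathione", "conjugates"], ["Word N"], toy_similarity, combine="average"
)
print(summed, averaged)
```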
60 Future Work UMLS-Similarity package Creating 2nd order co-occurrence matrices based on highly similar concepts rather than words in text Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors
61 Second Order Co-occurrence Vectors Word 1 Word 2 Word N … … … … … …… Word1 Word 2 … Word N Words come from training corpus Frequency counts
62 Second Order Co-occurrence Vectors CUI 1 CUI 2 CUI N … … … … … …… CUI1 CUI2 … CUI N Use concepts from the UMLS Similarity scores
63 Future Work UMLS-Similarity package Creating 2nd order co-occurrence matrices based on highly similar concepts rather than co-occurrences in text Use terms associated with CUIs that have a high similarity score with the possible concept to represent the meaning of the concept Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors
64 Similarity Scores What is potentially gained by using the similarity or relatedness measures? May catch words/concepts that are similar but do not frequently occur together in the training data. Example: culture and ethnology. Ethnology is a branch of anthropology; ethnology appears with culture only five times in the training data. The concepts Anthropological Culture and Ethnology would have a high similarity score, whereas Laboratory Culture and Ethnology would not
65 Software CuiTools version
66 Thank you Lan Aronson François Lang Jim Mork Aurélie Névéol Will Rogers Olivier Bodenreider Allen Browne May Chey Dina Demner- Fushman Guy Divita Kin Wah Fung Susanne Humphrey Dwayne McCully Tom Rindflesch Suresh Srinivasan