1 Representing Meaning in Unsupervised Word Sense Disambiguation Bridget T. McInnes 5 September 2008 University of Minnesota Twin Cities.



2 2 What is WSD? Example: "The culture count doubled." Sense Inventory for "culture": Laboratory Culture; Anthropological Culture.

3 3 Approaches to WSD. Supervised. Advantage: obtains high accuracy. Disadvantage: manually annotated training data is required for each word that needs to be disambiguated, so it cannot scale. Unsupervised. Advantage: does not require manually annotated training data. Disadvantage: generally does not obtain as high an accuracy as supervised approaches.

4 4 Unsupervised Approaches: Similarity and Relatedness-based

5 5 Unsupervised Approaches: Similarity and Relatedness-based (Patwardhan, Banerjee and Pedersen, 2005; Pedersen et al., 2006; Budanitsky and Hirst, 2006)

6 6 Unsupervised Approaches: Similarity and Relatedness-based; Vector-based

7 7 Unsupervised Approaches: Similarity and Relatedness-based; Vector-based (Mohammad and Hirst, 2006; Patwardhan, 2003; Pedersen et al., 2006; Humphrey et al., 2006)

8 8 Unsupervised Approaches: Similarity and Relatedness-based; Vector-based; Clustering

9 9 Unsupervised Approaches: Similarity and Relatedness-based; Vector-based; Clustering (Pedersen and Bruce, 1997; Schütze, 1998; Pedersen and Bruce, 1998; Purandare and Pedersen, 2004; Kulkarni and Pedersen, 2005)

10 10 Road Map: Previous Approaches; Our vector approach; Future Work

11 11 Previous Approaches. Similarity and Relatedness-based: SenseRelate (Banerjee and Pedersen, 2003). Vector-based: Semantic Type Indexing (Humphrey et al., 2006). Clustering: SenseClusters (Kulkarni and Pedersen, 2005).

12 12 Banerjee and Pedersen, 2003: SenseRelate

13 13 SenseRelate. Target Word: Transport. Concept 1: Biological Transport (C0005528). Concept 2: Patient Transport (C0150390). Context: "Transport of glutathione S-linked conjugates." CUIs shown for the context words "glutathione S-linked conjugates": C0017817, C0522529, C0301869. C0005528 = SS + SS = Total SS for Concept 1.

14 14 SenseRelate. Target Word: Transport. Concept 1: Biological Transport (C0005528). Concept 2: Patient Transport (C0150390). Context: "Transport of glutathione S-linked conjugates." CUIs shown for the context words "glutathione S-linked conjugates": C0017817, C0522529, C0301869. C0005528 = SS + SS = Total SS for Concept 1. C0150390 = SS + SS = Total SS for Concept 2.
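The scoring on the SenseRelate slides can be sketched roughly as follows: each candidate concept receives the sum of its scores against the concepts in the context, and the candidate with the highest total wins. This is only an illustration. The CUIs come from the slide, but every numeric score, the `SIMILARITY` table, and the function names `total_score` and `disambiguate` are hypothetical, and no particular similarity measure is implied.

```python
from typing import Dict, List, Tuple

# Hypothetical scores between candidate concepts and context concepts.
# The CUIs appear on the slide; the numbers are invented for illustration.
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("C0005528", "C0017817"): 0.6,
    ("C0005528", "C0522529"): 0.3,
    ("C0005528", "C0301869"): 0.2,
    ("C0150390", "C0017817"): 0.1,
    ("C0150390", "C0522529"): 0.2,
    ("C0150390", "C0301869"): 0.1,
}


def total_score(candidate: str, context_cuis: List[str],
                sim: Dict[Tuple[str, str], float]) -> float:
    """Sum the candidate's scores against every context concept."""
    # Pairs missing from the table contribute 0.0.
    return sum(sim.get((candidate, cui), 0.0) for cui in context_cuis)


def disambiguate(candidates: List[str], context_cuis: List[str],
                 sim: Dict[Tuple[str, str], float]) -> str:
    """Return the candidate concept with the highest total score."""
    return max(candidates, key=lambda c: total_score(c, context_cuis, sim))


context = ["C0017817", "C0522529", "C0301869"]
print(disambiguate(["C0005528", "C0150390"], context, SIMILARITY))
```

With these made-up scores the Biological Transport concept (C0005528) is selected; real scores would come from a similarity or relatedness measure.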

15 15 Humphrey et al, 2006 Semantic Type Indexing for WSD

16 16 Semantic Type Indexing (STI). Target Word: Transport. Context: "Transport of glutathione S-linked conjugates." Concept 1: Biological Transport (semantic type: Cell Function). Concept 2: Patient Transport (semantic type: Health Care Activity). [Figure: JDI vectors are built for the target word (TW) and for each concept (CV1, CV2). Cosine 1 compares the Target Word Vector with the Concept 1 Vector; Cosine 2 compares it with the Concept 2 Vector.]

17 17 Target Word Vector: contains the words surrounding the ambiguous word. Example: "Transport of glutathione S-linked conjugates."

18 18 STI - Target Word Vectors: contain the words surrounding the ambiguous word. Example: "Transport of glutathione S-linked conjugates."

19 19 STI - Concept Vectors. The concept vectors are created based on their semantic type(s). Transport: C0005528 (Biological Transport); C0150390 (Patient Transport). C0005528, Cell Function: one-word terms in the Metathesaurus associated with Cell Function. C0150390, Health Care Activity: one-word terms in the Metathesaurus associated with Health Care Activity.

20 20 Kulkarni and Pedersen, 2005 SenseClusters

21 21 SenseClusters (SC). Target Word: Transport. Concept 1: Biological Transport. Concept 2: Patient Transport. Context: "Transport of glutathione S-linked conjugates." [Figure: Instance 1 through Instance 13, … of the target word, shown alongside Concept 1 and Concept 2.]

22 22 SenseClusters (SC). Target Word: Transport. Concept 1: Biological Transport. Concept 2: Patient Transport. Context: "Transport of glutathione S-linked conjugates." [Figure: Instance 1 through Instance 13, … of the target word, shown alongside Concept 1 and Concept 2.]

23 23 SenseClusters. Target Word: Transport. Concept 1: Biological Transport. Concept 2: Patient Transport. Context: "Transport of glutathione S-linked conjugates." [Figure: Cosine 1 compares the Target Word Vector with the Concept 1 Vector; Cosine 2 compares it with the Concept 2 Vector.]

24 24 SC - Vectors: contain the words surrounding the ambiguous word. Created using: first-order co-occurrences; second-order co-occurrences.

25 25 First Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione          50        6   ...        5
S-linked              5        6   ...        1
conjugates            5        0   ...       15
Target Vector        20        4   ...        7
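The Target Vector values on this slide (20, 4, 7) match the column-wise average of the three context-word rows. A minimal sketch of that arithmetic, using the slide's numbers; the function name `average_vector` is an assumption made for this example, not part of any package named in the talk.

```python
from typing import List, Sequence


def average_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise average of equally long row vectors."""
    if not rows:
        return []
    return [sum(column) / len(rows) for column in zip(*rows)]


# Rows copied from the slide: counts against Word 1, Word 2, Word N.
context_rows = {
    "glutathione": [50, 6, 5],
    "S-linked": [5, 6, 1],
    "conjugates": [5, 0, 15],
}

target_vector = average_vector(list(context_rows.values()))
print(target_vector)  # [20.0, 4.0, 7.0]
```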

26 26 Second Order Co-occurrence Vectors. [Figure: the 1st-order vector for glutathione, a word-by-word co-occurrence matrix over Word 1, Word 2, …, Word N, and the resulting 2nd-order vector for glutathione.]

27 27 Second Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione          10       30   ...        2
S-linked              0        6   ...        0
conjugates            5        0   ...       13
Target Vector         5       12   ...        5
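One way to read the second-order slides: each context word is replaced by its row in a word-by-word co-occurrence matrix built from a training corpus, and the rows are averaged into the target vector. The sketch below follows that reading under stated assumptions. The toy corpus, the window size, and the function names `cooccurrence_counts` and `second_order_vector` are all hypothetical; this is not SenseClusters' or CuiTools' actual code.

```python
from collections import defaultdict
from typing import Dict, List


def cooccurrence_counts(sentences: List[List[str]],
                        window: int = 1) -> Dict[str, Dict[str, int]]:
    """Count how often each word appears within `window` tokens of another."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    # Convert to plain dicts so later lookups never create new entries.
    return {word: dict(row) for word, row in counts.items()}


def second_order_vector(context: List[str],
                        counts: Dict[str, Dict[str, int]],
                        vocabulary: List[str]) -> List[float]:
    """Average the co-occurrence rows of the context words found in `counts`."""
    rows = [[counts[w].get(v, 0) for v in vocabulary]
            for w in context if w in counts]
    if not rows:
        return [0.0] * len(vocabulary)
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy training corpus, invented for illustration.
corpus = [
    ["glutathione", "conjugates", "transport"],
    ["glutathione", "transport", "membrane"],
]
counts = cooccurrence_counts(corpus, window=1)
vocabulary = sorted(counts)
print(vocabulary)
print(second_order_vector(["glutathione", "conjugates"], counts, vocabulary))
```

The averaging step is the same as for first-order vectors; what changes is where the rows come from.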

28 28 Our unsupervised approach

29 29 CuiTools Approach: our approach uses a general vector approach with SenseClusters vectors.

30 30 CuiTools. Target Word: Transport. Concept 1: Biological Transport (C0005528). Concept 2: Patient Transport (C0150390). Context: "Transport of glutathione S-linked conjugates." [Figure: Cosine 1 compares the Target Word Vector with the Concept 1 Vector; Cosine 2 compares it with the Concept 2 Vector.]
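The comparison in this figure can be sketched as: compute the cosine between the target word vector and each concept vector, then assign the concept with the larger cosine. The snippet below is an illustration only. The target vector reuses the values from slide 25, while both concept vectors and the function names `cosine` and `choose_concept` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)


def choose_concept(target: Sequence[float],
                   concept_vectors: Dict[str, Sequence[float]]) -> str:
    """Return the CUI whose vector has the highest cosine with the target."""
    return max(concept_vectors,
               key=lambda cui: cosine(target, concept_vectors[cui]))


target_vector = [20.0, 4.0, 7.0]      # Target Vector from slide 25
concept_vectors = {
    "C0005528": [18.0, 5.0, 6.0],     # hypothetical Biological Transport vector
    "C0150390": [1.0, 9.0, 2.0],      # hypothetical Patient Transport vector
}
print(choose_concept(target_vector, concept_vectors))
```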

31 31 CuiTools Approach: our approach uses a general vector approach with SenseClusters vectors. We explore using first-order co-occurrence vectors and second-order co-occurrence vectors.

32 32 Target Word Vector: contains the words surrounding the ambiguous word. Example: "Transport of glutathione S-linked conjugates."

33 33 CuiTools - Concept Vectors. How can we create a vector that represents the meaning of a concept for word sense disambiguation?

34 34 To answer this question, we explore information in the UMLS that can be used to represent the meaning of a concept.

35 35 CuiTools - Concept Vectors: CUI definition. Example: Adjustment. Individual Adjustment: "Conceptually broad term referring to a state of harmony between internal needs and external …" Adjustment Action: "The act of making necessary corrections or modifications …" Psychological Adjustment: "A state of harmony between internal needs and external demands and the processes used …"

36 36 CuiTools - Concept Vectors: CUI definition. Blood Pressure: "Force exerted by the blood on the walls of the arteries and other vessels." Blood Pressure Determination: "Actions performed to measure the diastolic and systolic pressure of the blood." Arterial Pressure: NO DEFINITION.

37 37 CuiTools - Concept Vectors. CUI definition: use the CUI definition, but if it doesn’t exist: PARent definition; Semantic Type definition. SYNonymous terms. For example, C0430400 (Laboratory Culture): laboratory culture; microbial culture; sample culture.

38 38 CuiTools - Concept Vectors. CUI definition; if the CUI definition doesn’t exist: PARent definition; Semantic Type definition. SYNonymous terms. SIBlings. For example, C0010453 (Anthropological Culture): archeology; family; social groups.

39 39 CuiTools - Concept Vectors. CUI definition; if the CUI definition doesn’t exist: PARent definition; Semantic Type definition. SYNonymous terms. SIBlings. TOP 50 most frequent words surrounding the terms associated with the CUI.
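The last representation, the most frequent words surrounding the terms associated with a CUI, can be sketched as a windowed frequency count. The snippet below is a rough illustration under several assumptions: whitespace tokenization, a fixed window, a tiny invented corpus, and a hypothetical function name `top_context_words`. The slides do not say how CuiTools tokenizes text or sizes the window.

```python
from collections import Counter
from typing import List


def top_context_words(sentences: List[str], terms: List[str],
                      window: int = 2, k: int = 50) -> List[str]:
    """Return the k most frequent words within `window` tokens of any term."""
    term_tokens = [tuple(term.lower().split()) for term in terms]
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for term in term_tokens:
            n = len(term)
            if n == 0:
                continue  # skip empty terms
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == term:
                    left = tokens[max(0, i - window):i]
                    right = tokens[i + n:i + n + window]
                    counts.update(left + right)
    return [word for word, _ in counts.most_common(k)]


# Terms listed on slide 37 for C0430400; the sentences are invented.
terms = ["laboratory culture", "microbial culture", "sample culture"]
sentences = [
    "the laboratory culture grew bacteria overnight",
    "a microbial culture grew bacteria slowly",
]
print(top_context_words(sentences, terms, window=2, k=2))
```

The returned word list would then serve as the text from which the concept vector is built, in place of a definition.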

40 40 Dataset: National Library of Medicine's Word Sense Disambiguation (NLM-WSD) Dataset. 50 words from the 1998 MEDLINE abstracts. 100 instances for each of the 50 words. The target word was manually assigned a UMLS concept or None. All instances of None were removed. The average number of concepts per ambiguous word is 2.26.

41 41 Data subsets. Humphrey subset (Humphrey et al., 2006): 45 out of the 50 words in NLM-WSD. 5 words were excluded because at least two of the possible concepts associated with these words have the same semantic type. Instances that were assigned “None” were removed.

42 42 Training Data: the training data used to create the 1st- and 2nd-order co-occurrence vectors is the 2005 MEDLINE baseline.

43 43 Results

45 45 Results of Co-occurrence Vectors

46 46 Results of the Representations of Meaning

47 47 Results of the Representations of Meaning - CUI. Adding the parent and semantic type definitions decreased the accuracy by 6 and 7 percentage points, respectively. Parent and semantic type definitions are too broad to define the meaning of a concept.

48 48 Results of the Representations of Meaning - SYN. Using the synonymous terms associated with a concept is too narrow to represent the meaning. Example, Adjustment Action: Adjustment – action; Adjustments; Adjustment, NOS; Adjustment – action qualifier value; Adjustment – action procedure.

49 49 Results of the Representations of Meaning - SIB. Using the terms associated with the siblings of a concept is too broad to represent the meaning. Example, Adjustment Action: Biopsy; Cauterisation; Cautery; Cold Therapy; Desiccation; Drainage procedure; Electrolysis.

50 50 Results of the Representations of Meaning

51 51 Supervised versus Unsupervised. [Chart comparing Joshi et al. 04, McInnes et al. 07, Stevenson et al. 08, SenseClusters, Humphrey et al. 06, and CuiTools.]

52 52 To recap: how can we create a vector that represents the meaning of a concept for word sense disambiguation?

53 53 Conclusions. To answer this, we explored information in the UMLS that could be used to represent the meaning of a concept. Finding a context to represent the meaning of a concept is difficult. We found that the top 50 most frequent words surrounding the terms associated with the concept best represented the concept for the task of word sense disambiguation.

54 54 Take away message. Unsupervised approaches are showing promise. Their main disadvantage, lower disambiguation accuracy than supervised approaches, is slowly disappearing. But we are not there yet … so there is more work to do.

55 55 Future Work: UMLS-Similarity package. Use the Semantic Similarity scores rather than frequency in the 1st-order co-occurrence vectors.

56 56 First Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione          50        6   ...        5
S-linked              5        6   ...        1
conjugates            5        0   ...       15
Target Vector        20        4   ...        7
Cell values: FREQ (glutathione, Word N). Target Vector: Average.

57 57 First Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione          .5       .6   ...       .5
S-linked             .5       .6   ...       .1
conjugates           .5        0   ...      .15
Target Vector       .75       .6   ...      .25
Cell values: Similarity (glutathione, Word N). Target Vector: Average.

58 58 First Order Co-occurrence Vectors
                 Word 1   Word 2   ...   Word N
glutathione          .5       .6   ...       .5
S-linked             .5       .6   ...       .1
conjugates           .5        0   ...      .15
Target Vector       1.5      1.2   ...      .75
Cell values: Similarity (glutathione, Word N). Target Vector: Sum (like SenseRelate).

59 59 First Order Co-occurrences. [Figure: in the glutathione row (.5 .6 .5 over Word 1, Word 2, Word N), the entry for Word N (C0005528) is built from the similarity scores of C0005528 with C0000000 and C0000001: Similarity = .3 + .2 = .5.]
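One reading of this slide is that a cell of the similarity-based vector is obtained by summing concept-level similarity scores between the CUIs of the row word and the CUI(s) of the column word, here .3 + .2 = .5. The sketch below follows that reading. Which word each placeholder CUI belongs to is an interpretation of the figure, and the function name `word_similarity` and the score table are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Concept-level scores from the slide's example (.3 and .2). Assigning
# C0000000 and C0000001 to "glutathione" is an assumption about the figure.
CUI_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("C0000000", "C0005528"): 0.3,
    ("C0000001", "C0005528"): 0.2,
}


def word_similarity(cuis_a: List[str], cuis_b: List[str],
                    sim: Dict[Tuple[str, str], float]) -> float:
    """Sum the similarity scores over every CUI pair of two words."""
    total = 0.0
    for a in cuis_a:
        for b in cuis_b:
            # Look the pair up in either order; unknown pairs count as 0.
            total += sim.get((a, b), sim.get((b, a), 0.0))
    return total


cell = word_similarity(["C0000000", "C0000001"], ["C0005528"], CUI_SIMILARITY)
print(math.isclose(cell, 0.5))
```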

60 60 Future Work: UMLS-Similarity package. Use the Semantic Similarity scores rather than frequency in the 1st-order co-occurrence vectors. Create 2nd-order co-occurrence matrices based on highly similar concepts rather than words in text.

61 61 Second Order Co-occurrence Vectors. [Figure: word-by-word matrix with Word 1, Word 2, …, Word N on both axes. Words come from the training corpus; cells hold frequency counts.]

62 62 Second Order Co-occurrence Vectors. [Figure: CUI-by-CUI matrix with CUI 1, CUI 2, …, CUI N on both axes. Concepts come from the UMLS; cells hold similarity scores.]

63 63 Future Work: UMLS-Similarity package. Use the Semantic Similarity scores rather than frequency in the 1st-order co-occurrence vectors. Create 2nd-order co-occurrence matrices based on highly similar concepts rather than co-occurrences in text. Use terms associated with CUIs that have a high similarity score with the possible concept to represent the meaning of the concept.

64 64 Similarity Scores. What is potentially gained by using the similarity or relatedness measures? They may catch words/concepts that are similar but do not frequently occur together in the training data. Example: culture and ethnology. Ethnology is the study of anthropology. Ethnology appears with culture only five times in the training data. The concepts Anthropological Culture and Ethnology would have a high similarity score, whereas Laboratory Culture and Ethnology would not.

65 65 Software CuiTools version 0.19 http://cuitools.sourceforge.net

66 66 Thank you: Lan Aronson, François Lang, Jim Mork, Aurélie Névéol, Will Rogers, Olivier Bodenreider, Allen Browne, May Chey, Dina Demner-Fushman, Guy Divita, Kin Wah Fung, Susanne Humphrey, Dwayne McCully, Tom Rindflesch, Suresh Srinivasan

