LREC 2008 Marrakech: Clustering Related Terms with Definitions. Scott Piao, John McNaught and Sophia Ananiadou

1 LREC 2008 Marrakech
Clustering Related Terms with Definitions
Scott Piao, John McNaught and Sophia Ananiadou
{scott.piao,john.mcnaught,sophia.ananiadou}@manchester.ac.uk
National Centre for Text Mining, School of Computer Science, The University of Manchester

2 Outline of talk
– Task: match related terms in an ontology.
– Approach: detect and cluster related terms based on their definitions.
– Implementation: definition matching, term clustering and a user interface.
– Evaluation on GO terms.
– Conclusion.

3 Task: matching terms for ontology enrichment
Matching similar or related terms/expressions is an important task in NLP and text mining applications. Ontology term matching is also closely related to ontology enrichment. In the EU BOOTStrep Project, some techniques have been tested for ontology entity matching and alignment. Our work focuses on testing and evaluating a text matching tool that identifies related ontology terms from their definitions.

4 Definitions of term definitions
Ontology terms, such as GO (Gene Ontology) terms, often contain detailed definitions:
– id: GO:0000124
– name: SAGA complex
– def: "A large multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, several proteins of the Spt and Ada families, and several TBP-associate proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."
– id: GO:0005671
– name: Ada2/Gcn5/Ada3 transcription activator complex
– def: "A multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, two proteins of the Ada family, and two TBP-associate proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."

5 Our approach to the issue
– Definitions can provide a fundamental information source for detecting relations between terms.
– Lexicon definitions have previously been used for analyzing relations between words/terms (Castillo et al., 2003).
– We assume that text matching tools can be used to detect related terms based on their definitions.

6 A tool for clustering related texts
– Align similar sentences between texts.
– Measure the distances between texts based on the aligned sentences.
– Cluster similar texts based on a distance matrix.
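The first two steps above can be sketched roughly as follows. This is only an illustration, not the tool described in the talk: the sentence splitter, the Dice token overlap, the 0.5 alignment threshold and all function names are assumptions made for the example (the actual metric is given in the paper).

```python
import re

def sentences(text):
    """Split a text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    """Lower-cased word tokens; a fuller system would also stem and map synonyms."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def dice(a, b):
    """Dice coefficient between two token sets (1.0 means identical sets)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def align(text_a, text_b, threshold=0.5):
    """Pair each sentence of text_a with its best match in text_b, if good enough."""
    pairs = []
    sents_b = sentences(text_b)
    for sa in sentences(text_a):
        scored = [(dice(tokens(sa), tokens(sb)), sb) for sb in sents_b]
        if scored:
            score, best = max(scored)
            if score >= threshold:
                pairs.append((sa, best, score))
    return pairs

def directed_similarity(text_a, text_b):
    """Mean alignment score over the sentences of text_a (unaligned ones count 0)."""
    n = len(sentences(text_a))
    if n == 0:
        return 0.0
    return sum(score for _, _, score in align(text_a, text_b)) / n

def distance(text_a, text_b):
    """Symmetric distance in [0, 1]: 1 minus the mean of both directed scores."""
    sim = (directed_similarity(text_a, text_b) + directed_similarity(text_b, text_a)) / 2
    return 1.0 - sim

def_a = "A multiprotein complex that possesses histone acetyltransferase. It is involved in regulation of transcription."
def_b = "A large multiprotein complex that possesses histone acetyltransferase. It binds DNA."
print(round(distance(def_a, def_b), 3))
```

Running `distance` over every pair of definitions would give the distance matrix that the clustering step takes as input.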

7 Metrics for pairwise text comparison
(δ1 = 0.85, δ2 = 0.05, δ3 = 0.1), (0 <= d <= 1). For further details, see the paper.

8 An effective algorithm for text comparison
Cited from Clough et al. (2002).

9 Clustering texts
– Using the text comparison tool, produce a distance matrix.
– Matrix elements: e_ij = 1 – d_ij, (0 <= e_ij <= 1).
– Error Sum of Squares (ESS) hierarchical clustering.
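As a rough illustration of this step, the sketch below converts a toy matrix of pairwise scores d_ij into distances e_ij = 1 – d_ij and merges clusters with the Lance-Williams update for Ward's (ESS) criterion. The score values, the function name `ward_merges` and the choice to apply the update to squared text distances are assumptions for the example; the slides do not say how the tool implements ESS clustering.

```python
def ward_merges(dist):
    """Agglomerative clustering with the Ward (ESS) Lance-Williams update.

    dist is a symmetric n x n list of distances. Returns the merge history as
    (members_a, members_b, cost) tuples, where cost is the squared
    dissimilarity at which the two clusters were merged.
    """
    n = len(dist)
    # The Ward update is defined on squared distances.
    d = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(members) > 1:
        # Pick the closest pair of current clusters.
        (a, b), cost = min(d.items(), key=lambda item: item[1])
        na, nb = len(members[a]), len(members[b])
        updated = {}
        for k in members:
            if k in (a, b):
                continue
            nk = len(members[k])
            dka = d[tuple(sorted((k, a)))]
            dkb = d[tuple(sorted((k, b)))]
            # Lance-Williams formula for Ward's method.
            updated[(k, next_id)] = (
                (na + nk) * dka + (nb + nk) * dkb - nk * cost
            ) / (na + nb + nk)
        # Drop every entry that involves the two merged clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(updated)
        merges.append((sorted(members[a]), sorted(members[b]), cost))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges

# Toy pairwise scores d_ij in [0, 1] for four definitions (illustrative values).
scores = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.15, 0.1],
    [0.2, 0.15, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
# Distance matrix as on the slide: e_ij = 1 - d_ij.
e = [[1.0 - s for s in row] for row in scores]

merges = ward_merges(e)
for left, right, cost in merges:
    print(left, right, round(cost, 4))
```

The merge history can be read bottom-up as a cluster tree: the earliest merges correspond to the deepest layers.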

10 Sample of cluster tree
{layer=9
  {layer=10
    {layer=11
      {layer=12 GO:0009897 GO:0010339 }
      {layer=12 GO:0010282 }
    }
    {layer=11
      {layer=12 GO:0045284 }
      {layer=12 GO:0045293 }
    }
  {layer=10
    {layer=11
      {layer=12 GO:0017117 GO:0033202 }
      {layer=12 GO:0017119 }
    }

11 A package for definition comparison and term clustering
[Architecture diagram. Labels: pairwise definitions comparison; term clusterer; user interface (check, update); synonym lexicon; extended Porter's stemmer; distance matrix; clusters; term database.]

12 User interface for checking and updating terms

13 Evaluation
The text comparison and clustering components are evaluated on a set of GO terms as test data. In the evaluation, we consider GO terms to be related if they:
– share a parent term within three layers of ancestor trees via IS_A relation, or
– have direct parent/child relations (e.g. X is_a Y), or
– have direct part-of relations (e.g. X is part of Y).
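The relatedness criteria above can be sketched as a small predicate. This is one possible reading, not the evaluation code used in the talk: it treats "share a parent term within three layers" as "have a common IS_A ancestor at most `depth` steps above each term", and the toy ontology, term names and function names are all hypothetical.

```python
def ancestors_within(term, is_a, depth):
    """IS_A ancestors of term reachable in at most `depth` steps."""
    found, frontier = set(), {term}
    for _ in range(depth):
        frontier = {parent for t in frontier for parent in is_a.get(t, ())}
        found |= frontier
    return found

def related(x, y, is_a, part_of, depth=3):
    """True if x and y are related under the three criteria listed above."""
    # Direct parent/child relation (X is_a Y, in either direction).
    if y in is_a.get(x, ()) or x in is_a.get(y, ()):
        return True
    # Direct part-of relation (X is part of Y, in either direction).
    if y in part_of.get(x, ()) or x in part_of.get(y, ()):
        return True
    # Shared IS_A ancestor within `depth` layers above both terms.
    return bool(ancestors_within(x, is_a, depth) & ancestors_within(y, is_a, depth))

# Hypothetical toy ontology: child -> list of parents.
is_a = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["X"]}
part_of = {"P": ["C"]}

print(related("B", "C", is_a, part_of))            # siblings under A
print(related("D", "C", is_a, part_of, depth=1))   # A is two layers above D
print(related("D", "C", is_a, part_of, depth=2))
```

Varying `depth` from 1 to 3 corresponds to the "depths of parent nodes considered" reported in the result tables.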

14 Evaluation
Test data:
– GO terms under the namespace of cellular_component.
– 2,027 terms found, of which 2,010 have definitions; these 2,010 terms form the actual test data.
– Each of the 2,010 test terms is related, as defined previously, to one or more other test terms.
Our evaluation strategy is to examine:
– how many clustered terms have the relations defined previously, and
– how many of the related terms can be covered by the clusters.

15 Evaluation of bottom-layer clusters
Total_clustered_terms = 1,076
Depth of parent nodes considered | Clustered true pairs | Precision (%) | Coverage (%)
1 | 417 (834 terms) | 76.09 | 41.49
2 | 489 (978 terms) | 89.23 | 48.66
3 | 531 (1,062 terms) | 96.90 | 52.84
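The two evaluation questions can be turned into a small calculation. The sketch below is an assumed reading rather than the authors' scoring script: a clustered term counts as correct when its cluster contains at least one term related to it, precision divides by the number of clustered terms, and coverage divides by the number of test terms. The function name and toy data are hypothetical. The last lines check that the coverage column above matches the number of correctly clustered terms divided by the 2,010 test terms.

```python
def evaluate(clusters, is_related, n_test_terms):
    """Return (precision %, coverage %) for a list of clusters of terms."""
    clustered = [t for cluster in clusters for t in cluster]
    # A term is counted as correct if some other term in its cluster is related to it.
    correct = [
        t
        for cluster in clusters
        for t in cluster
        if any(is_related(t, u) for u in cluster if u != t)
    ]
    precision = 100 * len(correct) / len(clustered)
    coverage = 100 * len(correct) / n_test_terms
    return precision, coverage

# Hypothetical toy data: three pair clusters, two of which hold related terms.
toy_clusters = [["a", "b"], ["c", "d"], ["e", "f"]]
toy_related = {frozenset(("a", "b")), frozenset(("c", "d"))}

precision, coverage = evaluate(
    toy_clusters,
    lambda x, y: frozenset((x, y)) in toy_related,
    n_test_terms=10,
)
print(round(precision, 2), round(coverage, 2))

# Coverage figures from the table: correctly clustered terms / 2,010 test terms.
table_coverage = [round(100 * terms / 2010, 2) for terms in (834, 978, 1062)]
print(table_coverage)
```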

16 Distribution of relation types IS_A and PART_OF in the clustered terms
Parent nodes considered | Type | Number | Percent
1 | is-a | 122 | 29.3
1 | part-of | 49 | 11.75
2 | is-a | 128 | 26.2
2 | part-of | 50 | 10.2
3 | is-a | 128 | 24.1
3 | part-of | 50 | 9.4

17 Evaluation of the second layer clusters
Total_clustered_terms = 2,010
Depth of parent nodes considered | Correctly clustered terms | Precision/coverage (%)
1 | 1,163 | 57.86
2 | 1,474 | 73.33
3 | 1,685 | 83.83

18 Evaluation of the third layer clusters
Total_clustered_terms = 2,010
Depth of parent nodes considered | Correctly clustered terms | Precision/coverage (%)
1 | 1,284 | 63.88
2 | 1,642 | 81.69
3 | 1,843 | 91.69

19 Application of this package
This package can be used as an assistant tool for modifying and enriching ontologies and terminologies. (Brief demo of interface)

20 Conclusion
– Ontology term definitions provide an important information source for term matching.
– A text comparison and clustering tool can be useful for matching the terms.
– For better performance, the tool needs domain knowledge resources.

21 Acknowledgements
This research was supported by the EC BOOTStrep Project (ref. FP6-028099). The UK National Centre for Text Mining is sponsored by the JISC/BBSRC/EPSRC.

22 References
– BOOTStrep Project website: http://www.BOOTStrep.org.
– Castillo, Gabriel, Gerardo Sierra, John McNaught (2003). An improved Algorithm for Semantic Clustering. Proceedings of the 1st international symposium on Information and communication technologies, Dublin.
– Clough, Paul, Robert Gaizauskas, Scott Piao, Yorick Wilks (2002). METER: MEasuring TExt Reuse. In Proceedings of the ACL-2002, University of Pennsylvania, Philadelphia, USA, pp. 152-159.
– Gene Ontology: http://www.geneontology.org.
– Piao, Scott and Tony McEnery (2003). A tool for text comparison. Proceedings of the Corpus Linguistics

