1
Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web
Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G.M. Petrakis, Evangelos Milios
5/5/2019 ACM WIDM'2005
2
Semantic Similarity
Semantic similarity relates to computing the conceptual similarity between terms that are not lexicographically similar, e.g., “car” and “automobile”
Map two terms to an ontology and compute their relationship in that ontology
3
Objectives
We investigate several semantic similarity methods and evaluate their performance
We propose the Semantic Similarity Retrieval Model (SSRM) for computing similarity between documents containing semantically similar but not necessarily lexicographically similar terms
4
Ontologies
Tools of information representation on a subject
Hierarchical categorization of terms from general to most specific terms, e.g., object -> artifact -> construction -> stadium
Domain ontologies: represent knowledge of a domain, e.g., the MeSH medical ontology
General ontologies: represent common-sense knowledge about the world, e.g., WordNet
5
WordNet
A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms
More than 100,000 terms
An ontology of natural language terms
Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)
Synsets represent terms or concepts
  e.g., stadium, bowl, arena, sports stadium: (a large structure for open-air sports or entertainments)
6
WordNet Hierarchies
The synsets are also organized into senses
  Senses: different meanings of the same term
The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.,
  Hyponym/Hypernym (Is-A relationships)
  Meronym/Holonym (Part-Of relationships)
Nine noun and several verb Is-A hierarchies
7
A Fragment of the WordNet Is-A Hierarchy
8
Semantic Similarity Methods
Map terms to an ontology and compute their relationship in that ontology
Four main categories of methods:
  Edge counting: path length between terms
  Information content: a function of the terms' probability of occurrence in a corpus
  Feature based: similarity between the terms' properties (e.g., definitions) or based on their relationships to other similar terms
  Hybrid: combine the above ideas
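To make the edge-counting idea concrete, here is a minimal sketch that counts Is-A edges between two terms in a toy taxonomy. Only the object/artifact/construction/stadium chain comes from the slides; the `vehicle` and `car` nodes, the `PARENT` table and the function names are assumptions added for illustration.

```python
# Toy Is-A taxonomy stored as child -> parent.
# "vehicle" and "car" are hypothetical nodes, not taken from the slides.
PARENT = {
    "artifact": "object",
    "construction": "artifact",
    "stadium": "construction",
    "vehicle": "artifact",
    "car": "vehicle",
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ..., root]."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of Is-A edges on the path between a and b via their
    lowest common ancestor."""
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(ancestors(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("terms do not share a root")
```

In this toy tree `path_length("stadium", "car")` walks up to `artifact` from both sides; a shorter path is read as higher similarity.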
9
Example
The edge counting distance between “conveyance” and “ceramic” is 2
An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
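As a sketch of the information-content view, the snippet below scores two terms by the information content, -log p(c), of their common subsumer (a Resnik-style measure). The taxonomy fragment assumes a single unnamed shared parent because the slide only gives a distance of 2, and every probability is made up.

```python
import math

# Hypothetical fragment: one shared parent is assumed from the stated
# distance of 2; its real name is not given in the text.
PARENT = {
    "conveyance": "shared_parent",
    "ceramic": "shared_parent",
    "shared_parent": "root",
}

# Made-up occurrence probabilities; a concept's probability includes its
# descendants, so values grow toward the root.
PROB = {"root": 1.0, "shared_parent": 0.05, "conveyance": 0.01, "ceramic": 0.002}


def information_content(concept):
    """IC(c) = -log p(c): rarer concepts carry more information."""
    return -math.log(PROB[concept])


def ancestors(term):
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def common_subsumer(a, b):
    """Lowest node that is an ancestor of (or equal to) both terms."""
    above_a = set(ancestors(a))
    for node in ancestors(b):
        if node in above_a:
            return node
    raise ValueError("terms do not share a root")


def resnik_similarity(a, b):
    """Resnik-style score: information content of the common subsumer."""
    return information_content(common_subsumer(a, b))
```

A subsumer near the root has probability close to 1 and therefore contributes almost no similarity.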
10
Semantic Similarity on WordNet
The most popular methods are evaluated
All methods are applied to a set of 38 term pairs
Their similarity values are correlated with scores obtained from humans
The higher the correlation of a method, the better the method is
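The evaluation step can be sketched as a correlation between human ratings and a method's scores over the same term pairs. The slides do not say which correlation coefficient was used, so Pearson's is an assumption here, and the four data points are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for four term pairs (not the 38 pairs of the study).
human_scores = [3.9, 3.1, 1.2, 0.4]
method_scores = [0.95, 0.70, 0.35, 0.10]
r = pearson(human_scores, method_scores)
```

A method whose scores rise and fall with the human ratings gets a value close to 1.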
11
Evaluation
Method            Type            Correlation
Rada 1989         Edge Counting   0.59
Wu 1994           Edge Counting   0.74
Li 2003           Edge Counting   0.82
Leacock 1998      Edge Counting
Richardson 1994   Edge Counting   0.63
Resnik 1999       Info. Content   0.79
Lin 1993          Info. Content
Lord 2003         Info. Content
Jiang 1998        Info. Content   0.83
Tversky 1977      Feature Based   0.73
Rodriguez 2003    Hybrid          0.71
12
Observations
Edge counting and information content methods work by exploiting structure information
Good methods take the position of the terms into account
  Higher similarity for terms which are close together but lower in the hierarchy, e.g., [Li et al. 2003]
Information content is measured on WordNet rather than on a corpus [Seco2002]
Similarity only for nouns and verbs
  No taxonomic structure for other parts of speech
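The effect of position in the hierarchy can be sketched with a function in the style of Li et al., which multiplies a term that decays with path length by a term that grows with the depth of the common subsumer. The exact functional form and the constants `alpha=0.2` and `beta=0.6` are assumptions for illustration, not values stated in this deck.

```python
import math


def li_style_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.6):
    """Similarity that falls with path length and rises with the depth of
    the common subsumer. alpha and beta are assumed tuning constants."""
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


# Same path length, but the second pair sits deeper in the hierarchy.
shallow = li_style_similarity(path_length=2, subsumer_depth=1)
deep = li_style_similarity(path_length=2, subsumer_depth=5)
```

With equal path lengths, the pair whose subsumer lies deeper scores higher, which is the behaviour the slide describes.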
14
Semantic Similarity Retrieval Model (SSRM)
Classic retrieval models retrieve documents with the same query terms
SSRM will retrieve documents which also contain semantically similar terms
Queries and documents are initially assigned tf·idf weights
  q = (q1, q2, …, qN), d = (d1, d2, …, dN)
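The limitation of the classic model can be sketched with tf·idf vectors and cosine similarity: a query for "car" scores zero against a document that only says "automobile". The three tiny documents, the log(N/df) form of idf and the unit query weight are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, with weight = tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical collection and query.
docs = [["car", "engine"], ["automobile", "engine"], ["stadium", "arena"]]
vecs = tfidf_vectors(docs)
query = {"car": 1.0}
```

Here `cosine(query, vecs[1])` is zero although the second document is about the same concept, which is the gap SSRM is meant to close.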
15
SSRM
Query term re-weighting: similar terms reinforce each other
Query term expansion with synonyms and similar terms
Document similarity
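The formulas for these steps did not survive in the text, so the following is only a sketch of one plausible formulation of re-weighting and document similarity, not the authors' definition. The similarity table `SIM`, the cutoff `t` and both function bodies are assumptions.

```python
# Hypothetical term-to-term similarities in [0, 1]; in practice these would
# come from one of the WordNet similarity methods.
SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "stadium")): 0.1,
}


def sim(a, b):
    if a == b:
        return 1.0
    return SIM.get(frozenset((a, b)), 0.0)


def reweight(query, t=0.8):
    """Assumed re-weighting: each query weight gains the weights of the
    other query terms whose similarity to it is at least t, scaled by
    that similarity."""
    return {
        i: qi + sum(qj * sim(i, j) for j, qj in query.items()
                    if j != i and sim(i, j) >= t)
        for i, qi in query.items()
    }


def document_similarity(query, doc):
    """Assumed similarity: similarity-weighted sum over all term pairs,
    normalised by the unweighted sum of weight products."""
    num = sum(qi * dj * sim(i, j)
              for i, qi in query.items() for j, dj in doc.items())
    den = sum(qi * dj for qi in query.values() for dj in doc.values())
    return num / den if den else 0.0
```

Under these assumptions a query containing only "car" still matches a document containing only "automobile", with a score equal to their term similarity.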
16
Query Term Expansion
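Query expansion can be sketched as adding, for each query term, the neighbouring terms whose similarity passes a cutoff. The `NEIGHBOURS` table, the weighting rule and the `threshold` parameter are hypothetical; in particular, treating the cutoff as a minimum similarity is an assumption, and any normalisation of the added weights is left out.

```python
# Hypothetical neighbourhoods: candidate expansion terms and their
# similarity to a query term.
NEIGHBOURS = {
    "car": {"automobile": 1.0, "motor vehicle": 0.85, "conveyance": 0.55},
}


def expand(query, threshold):
    """Return a copy of the query with similar terms added. Each added
    term receives the originating term's weight scaled by similarity."""
    expanded = dict(query)
    for term, weight in query.items():
        for candidate, s in NEIGHBOURS.get(term, {}).items():
            if s >= threshold:
                expanded[candidate] = expanded.get(candidate, 0.0) + weight * s
    return expanded


strict = expand({"car": 1.0}, threshold=0.8)
loose = expand({"car": 1.0}, threshold=0.5)
```

Loosening the cutoff admits more distant terms such as "conveyance", so the expanded query covers a broader topic than the original one.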
17
Observations
Specification of T?
  Large T may lead to topic drift
Word sense disambiguation for expanding with the correct sense
Expansion with co-occurring terms?
  SVD, local/global analysis
Semantic similarity between terms of different parts of speech?
Work with compound terms (phrases)
18
Evaluation of SSRM
SSRM is evaluated through IntelliSearch, a system for information retrieval on the WWW
1.5 million Web pages with images
Images are described by their surrounding text
The problem of image retrieval is transformed into a problem of text retrieval
20
Methods
Vector Space Model (VSM)
SSRM
Each method is represented by a precision/recall plot
Each point is the average precision/recall over 20 queries
The 20 queries are taken from the list of the most frequent Google image queries
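How one point of such a plot might be computed can be sketched as follows: precision and recall are measured at a rank cutoff for each query and then averaged over the queries. The choice of rank cutoffs as the plotting variable and the function names are assumptions; the slides do not state how the points were sampled.

```python
def precision_recall_at(ranked, relevant, k):
    """Precision and recall of the top-k results for one query."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_curve(runs, cutoffs):
    """runs: list of (ranked_results, relevant_set), one per query.
    Returns one (avg_precision, avg_recall) point per cutoff."""
    curve = []
    for k in cutoffs:
        points = [precision_recall_at(ranked, rel, k) for ranked, rel in runs]
        avg_p = sum(p for p, _ in points) / len(points)
        avg_r = sum(r for _, r in points) / len(points)
        curve.append((avg_p, avg_r))
    return curve
```

For example, `average_curve(runs, [10, 20, 30])` would yield three points per method, which can then be plotted as one curve for VSM and one for SSRM.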
21
Experimental Results
22
MeSH and MedLine
MeSH: ontology for medical and biological terms by the N.L.M.
  22,000 terms
MedLine: the premier bibliographic medical database of the N.L.M.
  13 million references
23
Evaluation on MedLine
24
Conclusions
Semantic similarity methods approximated the human notion of similarity, reaching correlation up to 83%
SSRM exploits this information for improving the performance of retrieval
SSRM can work with any semantic similarity method and any ontology
25
Future Work
Experimentation with more data sets (TREC) and ontologies
Extend SSRM to work with
  Compound terms
  More parts of speech (e.g., adverbs)
  Co-occurring terms
  More term relationships in WordNet
More elaborate methods for specification of thresholds
26
Try our system on the Web
Semantic Similarity System:
SSRM: