Ciro Cattuto, Dominik Benz, Andreas Hotho, Gerd Stumme Presented by Smitashree Choudhury.


Overview
- Motivation
- Measures of semantic relatedness
- Semantic grounding of measures
- Result analysis

Motivation
- A folksonomy is an open-ended, noisy and large system.
- The tag space lacks explicit semantic relations.
- Existing similarity measures lack a robust semantic grounding.
- Possible applications:
  - Ontology learning
  - Tag recommendation
  - Query expansion

Folksonomies and tagging
- A folksonomy is the result of social annotation of shared resources.
- A folksonomy is a tuple F := (U, T, R, Y)
  - U: the set of users
  - T: the set of tags
  - R: the set of resources
  - Y ⊆ U × T × R: the ternary "tagging" relation, whose elements are tag assignments
- A post is the set of tags assigned by one user to one resource.
[Figure: a post linking user1, resource1 and tag1]
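The definition above can be made concrete with a minimal sketch. The user, tag and resource names are made up for illustration, and reading U, T and R off the assignments is a simplifying assumption (in the formal definition they are given independently of Y).

```python
from collections import defaultdict

# Toy tag assignments (user, tag, resource); all names are hypothetical.
Y = {
    ("u1", "t1", "r1"), ("u1", "t2", "r1"), ("u1", "t3", "r1"),
    ("u2", "t1", "r1"), ("u2", "t2", "r1"),
    ("u3", "t1", "r2"), ("u3", "t2", "r2"), ("u3", "t5", "r2"),
}

# Assumption: derive U, T and R from the assignments themselves.
U = {u for u, _, _ in Y}
T = {t for _, t, _ in Y}
R = {r for _, _, r in Y}

def posts(assignments):
    """Group tag assignments into posts: (user, resource) -> set of tags."""
    result = defaultdict(set)
    for user, tag, resource in assignments:
        result[(user, resource)].add(tag)
    return dict(result)

P = posts(Y)
```

Here each key of `P` identifies one post, and its value is the set of tags that user assigned to that resource.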

Data under study
- Del.icio.us tags for November 2006:
  - 667,128 users (U)
  - 2,454,546 tags (T)
  - 18,782,132 resources (R)
  - 140,333,714 tag assignments (Y)
- The study focused on the |T| = 10,000 most frequent tags, together with their users (|U| = 476,378), resources (|R| = 12,660,470) and |Y| = 101,491,722 tag assignments.

Similarity and relatedness
- The measures aim to capture the emergent semantics of the folksonomy.
- Similarity can be considered a special case of relatedness.
- There are (at least) two options for defining similarity metrics:
  - mapping the tags into a domain where similarity is well defined
  - using the network structure of the folksonomy itself

Measures of relatedness
- Co-occurrence
- Contextual (distributional) measures, based on three different vector space representations of a tag:
  - Tag context
  - Resource context
  - User context
- FolkRank (graph based)

Co-occurrence
- Given a folksonomy (U, T, R, Y), the tag-tag co-occurrence graph is a weighted undirected graph whose set of nodes is the set of tags T.
- Two tags t1 and t2 are connected by an edge if at least one post contains both of them.
- The weight of this edge is the number of posts that contain both t1 and t2.
- Example posts (user, tags, resource):
  - U1-{t1,t2,t3}-r1
  - U2-{t1,t2}-r1
  - U3-{t1,t2,t5}-r2
[Figure: example tag-tag co-occurrence graph over the tags t1 to t5 with weighted edges]
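As a rough sketch of the definition above, the edge weights can be counted directly from the three example posts on this slide. The dictionary layout and the function name `cooccurrence` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# The three example posts from the slide: (user, resource) -> set of tags.
posts = {
    ("U1", "r1"): {"t1", "t2", "t3"},
    ("U2", "r1"): {"t1", "t2"},
    ("U3", "r2"): {"t1", "t2", "t5"},
}

def cooccurrence(posts):
    """Count, for every unordered tag pair, the posts containing both tags."""
    weights = Counter()
    for tags in posts.values():
        # Sorting gives each unordered pair one canonical key (a, b) with a < b.
        for a, b in combinations(sorted(tags), 2):
            weights[(a, b)] += 1
    return weights

w = cooccurrence(posts)
```

For these posts, `w[("t1", "t2")]` is 3 because all three posts contain both tags, while pairs that never share a post (such as t3 and t5) have no edge and a count of 0.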

Contextual measures (cosine similarity)
- Three measures of tag relatedness are based on three different vector space representations of tags. Depending on the representation, the entries of a tag's vector are weights over tags, users or resources.
- If two tags t1 and t2 are represented by the vectors v1 and v2, their cosine similarity is defined as cossim(t1, t2) := cos(v1, v2) = (v1 · v2) / (||v1|| ||v2||).
- Cosine similarity is independent of the length of the vectors; this normalisation avoids a bias towards frequently used tags.
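A minimal sketch of the cosine measure on sparse vectors follows. Storing a vector as a dictionary from dimension to weight, and returning 0.0 when one vector is all zeros, are assumptions made here and not choices stated on the slide.

```python
import math

def cossim(v1, v2):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v2.get(key, 0.0) for key, weight in v1.items())
    norm1 = math.sqrt(sum(weight * weight for weight in v1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: similarity with an empty vector is 0
    return dot / (norm1 * norm2)

# Scaling a vector does not change the result: only the direction matters.
same_direction = cossim({"a": 1, "b": 2}, {"a": 2, "b": 4})
no_overlap = cossim({"a": 1}, {"b": 1})
```

The first call compares a vector with a scaled copy of itself, which is why a frequently used tag and a rarely used tag with the same usage profile still come out as similar.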

Contextual measures (tag context)
- The Tag Context Similarity (TagCont) is computed in the vector space ℝ^T: for a tag t, the entry of the vector v_t for another tag t' is w(t, t'), where w is the co-occurrence weight defined above.
[Table: example tag-context vectors over the columns t1, t2, t3, ..., t10000; row t1 = (0, 3, 1, 0), row t2 = (3, 0, 0, 2)]

Contextual measures (resource and user context)
- The resource context vector of a tag t is computed from how often t is used to annotate each resource r.
- The user context similarity is built in the same way as the resource context similarity, with the roles of the sets R and U swapped.
[Table: example resource-context vectors for the tags t1 and t2, one column per resource]
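The two context representations can be sketched as follows, assuming the tag assignments are available as distinct (user, tag, resource) triples. The toy data and the function name `context_vectors` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy tag assignments (user, tag, resource); all names are hypothetical.
Y = [
    ("u1", "t1", "r1"), ("u1", "t2", "r1"), ("u1", "t3", "r1"),
    ("u2", "t1", "r1"), ("u2", "t2", "r1"),
    ("u3", "t1", "r2"), ("u3", "t2", "r2"), ("u3", "t5", "r2"),
]

def context_vectors(assignments, context):
    """Build one sparse vector per tag, indexed by resource or by user."""
    if context not in ("resource", "user"):
        raise ValueError("context must be 'resource' or 'user'")
    vectors = defaultdict(Counter)
    for user, tag, resource in assignments:
        # Swapping the roles of R and U only changes which field is the key.
        key = resource if context == "resource" else user
        vectors[tag][key] += 1
    return vectors

res_ctx = context_vectors(Y, "resource")
usr_ctx = context_vectors(Y, "user")
```

In this toy data the tag t1 annotates r1 twice (by u1 and u2) and r2 once, so its resource context vector is {r1: 2, r2: 1}.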

FolkRank
- An adaptation of PageRank to folksonomies: "A resource which is tagged with important tags by important users becomes important itself" [Hotho].
- FolkRank computes a ranked list of relevant tags based on a random surfer vector.
- It treats a folksonomy (U, T, R, Y) as an undirected graph.
- Initially each tag is assigned weight 1; the weights are then adjusted iteratively by spreading them along the edges.
- The tags that obtain the highest FolkRank weight for a given tag t1 are considered the most relevant to t1.
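The following is a simplified weight-spreading sketch in the spirit of FolkRank, not the authors' implementation. It assumes a PageRank-style update with a damping factor `d`, a preference vector that boosts the query tag by a hypothetical amount `boost`, and a final score taken as the difference between the biased and the unbiased run. The values 0.7 and 10.0, the iteration count and the toy data are all illustrative assumptions.

```python
from collections import defaultdict

# Toy tag assignments (user, tag, resource); all names are hypothetical.
Y = [
    ("u1", "t1", "r1"), ("u1", "t2", "r1"), ("u1", "t3", "r1"),
    ("u2", "t1", "r1"), ("u2", "t2", "r1"),
    ("u3", "t1", "r2"), ("u3", "t2", "r2"), ("u3", "t5", "r2"),
]

def build_graph(assignments):
    """Undirected weighted graph whose nodes are users, tags and resources."""
    adj = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in assignments:
        nodes = [("u", user), ("t", tag), ("r", resource)]
        for a in nodes:
            for b in nodes:
                if a != b:
                    adj[a][b] += 1.0
    return adj

def spread(adj, preference, d=0.7, iterations=100):
    """Iterate w <- d * A^T w + (1 - d) * p with row-normalised edge weights."""
    total = sum(preference.values())
    p = {node: preference[node] / total for node in adj}
    w = dict(p)
    for _ in range(iterations):
        new_w = {node: (1.0 - d) * p[node] for node in adj}
        for a, neighbours in adj.items():
            out = sum(neighbours.values())
            for b, weight in neighbours.items():
                new_w[b] += d * w[a] * weight / out
        w = new_w
    return w

def folkrank_scores(assignments, tag, d=0.7, boost=10.0):
    """Score every other tag by its weight gain when `tag` is preferred."""
    adj = build_graph(assignments)
    uniform = {node: 1.0 for node in adj}
    biased = dict(uniform)
    biased[("t", tag)] += boost  # assumption: extra preference for the query tag
    w0 = spread(adj, uniform, d)
    w1 = spread(adj, biased, d)
    return {name: w1[(kind, name)] - w0[(kind, name)]
            for kind, name in adj if kind == "t" and name != tag}

scores = folkrank_scores(Y, "t3")
ranking = sorted(scores, key=scores.get, reverse=True)
```

Taking the difference between the two runs is meant to discount tags that are highly ranked regardless of the query tag; `ranking` then lists the remaining tags from most to least related under these assumptions.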

Related tags according to various similarity measures
[Table: related tags per measure; columns: co-occurrence, cosine similarity, FolkRank]

Result analysis
- The most related tags were computed for the 10,000 most frequent tags.
- Tag and resource context similarity provide more synonyms than the other measures. For instance, for the tag web2.0 they return some of its alternative spellings, such as web-2.0, web and web2.
- For the tag games, tag and resource context similarity also provide tags that could be regarded as semantically similar: morphological variations such as game and gaming, or corresponding words in other languages, such as spiel (German), jeu (French) and juegos (Spanish).
- The FolkRank and co-occurrence measures, in contrast, provide more general related tags and categories.
- An interesting observation about the tag java is that python, perl and c++ (provided by tag context similarity) could all be considered siblings in some suitable concept hierarchy, presumably under a common parent concept such as programming languages.

Result analysis: are related tags shared across relatedness measures?
- Related tags obtained via tag context or resource context appear to be "synonyms" or "siblings" of the original tag.
- Co-occurrence and FolkRank seem to provide "more general" tags.
- In terms of shared tags, the co-occurrence and FolkRank measures are the most similar to each other, overlapping in 6.81 tags out of 10 on average, while cosine similarity shows little overlap with either of them.
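An overlap figure of this kind can be computed along the following lines. This is a sketch that assumes each measure yields a ranked list of related tags per query tag; the lists below are made-up placeholders, not results from the study.

```python
def average_overlap(related_a, related_b, k=10):
    """Mean number of tags shared by the top-k lists of two measures."""
    shared = [
        len(set(related_a[tag][:k]) & set(related_b[tag][:k]))
        for tag in related_a
        if tag in related_b
    ]
    return sum(shared) / len(shared) if shared else 0.0

# Hypothetical top-3 lists for two query tags under two measures.
measure_a = {"x": ["a", "b", "c"], "y": ["a", "b", "c"]}
measure_b = {"x": ["b", "c", "d"], "y": ["e", "f", "g"]}

overlap = average_overlap(measure_a, measure_b, k=3)
```

For the placeholder lists, the two measures share two tags for x and none for y, so the average overlap is 1.0 out of 3.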

Semantic grounding
- The strategy is to ground the relations between the original and the related tags by looking up the tags in a formal representation of word meanings.
- Mapping tags to WordNet synsets allows the relatedness measures to be compared against well-studied similarity measures.
- In WordNet, words are grouped into synsets, sets of synonyms that represent one concept. Synsets are nodes in a network, and links between synsets represent semantic relations.
- Only is-a relationships are considered.
- Roughly 61% of the 10,000 most frequent tags in del.icio.us are covered by WordNet.

WordNet similarity
- In WordNet, semantic similarity is measured using both:
  - the taxonomic shortest-path length
  - the Jiang-Conrath metric, which combines taxonomic path length with an information-theoretic similarity measure and has been validated in user studies
- A first assessment of the relatedness measures is carried out by measuring, in WordNet, the average semantic distance between a tag and its most closely related tag according to each of the relatedness measures.
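The taxonomic shortest-path length can be illustrated on a toy is-a hierarchy. This sketch does not use WordNet itself: the hierarchy below is made up, and it assumes every concept has a single parent, whereas a WordNet synset can have several.

```python
# Hypothetical is-a hierarchy: concept -> parent (None marks the root).
PARENT = {
    "java": "programming_language",
    "python": "programming_language",
    "perl": "programming_language",
    "programming_language": "language",
    "language": None,
}

def ancestors(concept):
    """Return the chain concept, parent, grandparent, ... up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def path_length(c1, c2):
    """Number of is-a edges on the path through the lowest common ancestor."""
    depth_from_c1 = {c: steps for steps, c in enumerate(ancestors(c1))}
    for steps, c in enumerate(ancestors(c2)):
        if c in depth_from_c1:
            return depth_from_c1[c] + steps
    return None  # no common ancestor

siblings = path_length("java", "python")
to_root = path_length("java", "language")
```

Sibling concepts such as java and python are two edges apart in this hierarchy (one step up to the shared parent and one step down), which is the kind of short distance a grounding measure would report for them.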

Wordnet similarity

Analysis
- The Jiang-Conrath measure has been validated in user studies [Budanitsky], so its semantic distances correspond to distances cognitively perceived by human subjects.
- Tag and resource context relatedness point to tags that are semantically closer according to both grounding measures.
- The resource context measure performs best but is expensive to compute.
- The tag context measure performs about as well as resource context while being computationally lighter.

Summary
- The work introduces a systematic methodology for characterizing measures of tag relatedness in a folksonomy.
- It grounds several measures of tag relatedness by mapping the tags of the folksonomy to synsets in WordNet and measuring semantic distance there.
- A semantic characterization of similarity measures computed on a folksonomy is possible and insightful in terms of the types of relations that can be extracted.
- Given an appropriate measure, globally meaningful tag relations can be harvested from an aggregated and uncontrolled folksonomy vocabulary.
- Admittedly, in their current status, none of the measures we studied can be seen as the way to instant ontology creation, but further analysis and combination of measures will help to close the gap towards the Semantic Web.
- The tag and resource context similarities are clearly the first measures to choose when one would like to discover synonyms; they are also useful for query expansion.
- Both FolkRank and co-occurrence relatedness seem to extract taxonomic relationships between tags and to be suited to tag recommendation.

References
- Jiang, J.J., Conrath, D.W.: Semantic Similarity based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the International Conference on Research in Computational Linguistics (ROCLING), Taiwan (1997)
- Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies: Search and ranking. In: Sure, Y., Domingue, J., eds.: The Semantic Web: Research and Applications. Volume 4011 of LNAI, Heidelberg, Springer (2006) 411–426
- Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics 32(1) (2006) 13–47
- Salton, G.: Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1989)
- And others.
