Instance-based mapping between thesauri and folksonomies Christian Wartena Rogier Brussee Telematica Instituut.

Outline
Interoperability of Keywords
Wikipedia and del.icio.us
Keyword similarity
Experiment
Conclusion

Interoperability of Keywords
Documents (pictures, movies, …) are annotated with keywords for organization and retrieval.
In different collections/communities different sets of keywords are used.
–The set of selectable keywords is often organized in and delimited by a thesaurus.
–The set of freely generated end-user keywords (“tags”) forms a folksonomy.
Align keywords/tags by comparing usage.
Tested on del.icio.us tags and Wikipedia categories.

del.icio.us and Wikipedia
Del.icio.us
–Social bookmarking site
–Bookmarks can in most cases be interpreted as labels or tags for the bookmarked URL.
–Many Wikipedia articles are tagged by del.icio.us users.
Wikipedia
–Articles are labeled with one or more categories by the article authors.
–Categories are organized hierarchically.
–Categories are organized consciously, as in a thesaurus: new categories are introduced after discussions between active Wikipedians.

Keyword alignment
Problem
–Given a keyword k in a system A, what is the most similar keyword k’ in system B?
–Given a tag from del.icio.us, what is the most similar Wikipedia category (or vice versa)?
Approach
–Interpret similarity as similarity of usage.
–Compute similarity of usage on a common subcollection.
Evaluation
–Compare results to human judgment of similarity.

Keyword similarity
Basic assumption: similarity is similarity of usage.
–If two keywords have similar usage they will give similar results in retrieval tasks.
Two keywords have similar usage if they
–Have a similar distribution over documents: divergence (relative entropy) of distributions, cosine
–Often co-occur: Jaccard coefficient
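
The two classical measures named on this slide can be sketched as follows. This is only an illustration, not the authors' code: the toy table `annotations` (keyword to per-document assignment counts) and the function names `jaccard` and `cosine` are assumptions made for the example.

```python
import math

# Hypothetical toy data: keyword -> {document id: number of times assigned}.
annotations = {
    "python": {"d1": 3, "d2": 1, "d3": 2},
    "programming": {"d1": 2, "d2": 2, "d4": 1},
    "cooking": {"d5": 4},
}


def jaccard(a, b):
    """Jaccard coefficient of the document sets two keywords occur on."""
    docs_a, docs_b = set(a), set(b)
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine between two keyword-over-document count vectors."""
    dot = sum(count * b.get(doc, 0) for doc, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    print(jaccard(annotations["python"], annotations["programming"]))
    print(cosine(annotations["python"], annotations["programming"]))
```

Note that the Jaccard coefficient looks only at whether two keywords share documents, while the cosine also uses the assignment counts.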

New measure for keyword similarity
Keywords have similar usage if they co-occur with similar frequency with all other keywords.
–We use the frequency with which a tag/keyword is assigned to a document.
–We include co-occurrence information with other terms. Helps to cope with sparse data.
In other words:
–Terms are similar if they have similar co-occurrence patterns.
Similar to Tag Context Similarity of Cattuto et al.’s presentation tomorrow (Semantic Social Networks Session)

Formalization: Distribution of co-occurring terms
$$\bar{p}_z(t) = \sum_d q(t|d)\, Q(d|z)$$
where
–q(t|d) is the keyword distribution of d
–Q(d|z) is the document distribution of z: “the fraction of z’s that is found in d”
Weighted average of the keyword distributions of documents
–The weight is the relevance of d for z, given by the probability Q(d|z)
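
The weighted average described on this slide (the keyword distributions q(t|d) of the documents, each weighted by Q(d|z)) can be sketched in a few lines. This is a minimal illustration under assumptions: the toy corpus `docs`, the function name `cooccurrence_distribution`, and the estimation of q(t|d) and Q(d|z) as relative assignment frequencies are choices made for this example, not details confirmed by the slides.

```python
from collections import defaultdict

# Hypothetical toy data: document id -> {keyword: assignment count}.
docs = {
    "d1": {"python": 2, "code": 2},
    "d2": {"python": 1, "snake": 3},
}


def cooccurrence_distribution(z, docs):
    """Return {t: sum over d of q(t|d) * Q(d|z)} for keyword z.

    Assumption: q(t|d) is the share of d's annotations that are t, and
    Q(d|z) is the share of all assignments of z that fall on d.
    """
    total_z = sum(counts.get(z, 0) for counts in docs.values())
    if total_z == 0:
        raise ValueError(f"keyword {z!r} is not assigned to any document")
    dist = defaultdict(float)
    for counts in docs.values():
        n_dz = counts.get(z, 0)
        if n_dz == 0:
            continue  # documents without z get weight Q(d|z) = 0
        weight = n_dz / total_z            # Q(d|z)
        doc_total = sum(counts.values())
        for t, n_dt in counts.items():
            dist[t] += weight * n_dt / doc_total  # weight * q(t|d)
    return dict(dist)


if __name__ == "__main__":
    print(cooccurrence_distribution("python", docs))
```

Because each q(·|d) sums to one and the weights Q(d|z) sum to one, the result is again a probability distribution over keywords.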

Distance of keywords
For each keyword there is a distribution over all (other) keywords.
Similarity is expressed by the divergence of these distributions.
Kullback-Leibler divergence:
$$D(p \,\|\, q) = \sum_t p(t) \log_2 \frac{p(t)}{q(t)}$$
Bits per keyword saved by compressing a subcollection with keyword distribution p using p instead of a general distribution q.

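Distance of keywords (cont’d)
Jensen-Shannon divergence:
$$JSD(p \,\|\, q) = \tfrac{1}{2} D(p \,\|\, m) + \tfrac{1}{2} D(q \,\|\, m)$$
–Mean distribution: $m = \tfrac{1}{2}(p + q)$; D is the Kullback-Leibler divergence.
Jensen-Shannon divergence is symmetric.
Jensen-Shannon divergence is the square of a non-negative distance satisfying the triangle inequality.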
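
As a rough sketch, the Kullback-Leibler and Jensen-Shannon divergences can be computed as below, using the standard textbook definitions with base-2 logarithms (bits). Representing distributions as dictionaries and the function names are assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in bits.

    p and q map terms to probabilities. Returns infinity if q assigns
    zero probability to a term that p considers possible.
    """
    total = 0.0
    for term, p_t in p.items():
        if p_t == 0:
            continue  # terms with p(t) = 0 contribute nothing
        q_t = q.get(term, 0.0)
        if q_t == 0:
            return math.inf
        total += p_t * math.log2(p_t / q_t)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the mean distribution."""
    terms = set(p) | set(q)
    mean = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * kl_divergence(p, mean) + 0.5 * kl_divergence(q, mean)


if __name__ == "__main__":
    p = {"python": 0.6, "code": 0.4}
    q = {"python": 0.5, "snake": 0.5}
    print(js_divergence(p, q), js_divergence(q, p))
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is always finite, because the mean distribution is non-zero wherever either input is; with base-2 logarithms it lies between 0 and 1.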

Alignment
Consider a collection of documents annotated with different sets of keywords.
Represent a keyword by a distribution over terms from both collections.
For each term find the closest term from the other collection.
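
The last step, finding for each term the closest term from the other collection, amounts to a nearest-neighbour search. The sketch below assumes that each keyword is already represented by a distribution over terms and uses the Jensen-Shannon divergence as the distance; the toy distributions and the names `align`, `tags` and `categories` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two term distributions."""
    mean = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl_to_mean(dist):
        return sum(v * math.log2(v / mean[t]) for t, v in dist.items() if v > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


def align(source, target):
    """Map each source keyword to (closest target keyword, divergence)."""
    return {
        s: min(
            ((t, js_divergence(p, q)) for t, q in target.items()),
            key=lambda pair: pair[1],
        )
        for s, p in source.items()
    }


# Hypothetical co-occurrence distributions over a shared term vocabulary.
tags = {
    "py": {"python": 0.6, "code": 0.4},
    "recipes": {"food": 0.7, "cooking": 0.3},
}
categories = {
    "Programming languages": {"python": 0.5, "code": 0.5},
    "Cuisine": {"food": 0.6, "cooking": 0.4},
}

if __name__ == "__main__":
    for tag, (category, distance) in align(tags, categories).items():
        print(f"{tag} -> {category} ({distance:.3f} bits)")
```

Calling `align(categories, tags)` gives the mapping in the opposite direction. The two mappings need not be inverses of each other, since each side simply picks its own nearest neighbour.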

Experiment I
Mapping between Teleblik keywords and user tags
Educational videos.
Professional keywords from public broadcasting archive.
Keywords assigned in an experiment by high school students.
Data
–100 videos
–12,414 tags
–4,348 different tags
–269 different keywords

Experiment II
Mapping between del.icio.us tags and Wikipedia categories
Del.icio.us tags collected by Mathias Lux (Klagenfurt Univ.)
Data
–58,345 Wikipedia articles
–500,618 tags and category annotations
–42,425 different Wikipedia categories
–49,603 different tags
Mappings computed for tags occurring on at least 10 documents.
–Mappings for 2,355 tags
–Mappings for 1,827 categories
–Using co-occurrence data with all 49,603 tags/categories

Evaluation of mapping
Manual evaluation
Classification of a sample of mappings into:
–b: Broader term
–n: Narrower term
–r: Related term
–u: Unrelated
–x: Source term is not a keyword (e.g. “to read”)
–q: Meaning unknown

Evaluation of aligning Wikipedia and del.icio.us

Distance vs. mapping quality
Pairs with a small distance are evaluated better than pairs with a large distance.
Evaluation of mappings with smallest and largest distance
–a) Categories to tags
–b) Tags to categories

Effect of keyword frequency
No correlation between keyword frequency and the divergence to the best mapping found.

Comparison with Jaccard coefficient
Evaluation of mapping using two different distance measures.
The categories broader, narrower and related are merged.
Results for
–a) Categories to tags
–b) Tags to categories

Discussion of results
Method works very well in test
–Good mapping results
–Distance is a good indication of quality
–Insensitive to frequency (up to a certain degree)
Better than Jaccard, because it uses:
–co-occurrence with other tags (‘tag context’)
–frequency with which a tag is assigned to a document.
Frequency information is typical for user-generated tags.
We expect this method to perform less well for aligning keywords with other keywords (without assignment frequencies).
Distance measure also works well for clustering tags.

Future work
Evaluating relatedness using external sources (e.g. WordNet)
Compare to other distance measures
We used documents annotated completely according to two annotation schemes.
–How large does the overlap have to be to obtain decent results?
–We can create partial overlap of disjoint document sets by a partial identification of the keywords.
Detect asymmetry in relations (broader vs. narrower term)

Conclusion
Using co-occurrence patterns is a fruitful approach.
Frequent terms from folksonomies behave similarly to carefully assigned keywords.
–Because the usage-based similarity measure yields good mappings.
–Folksonomy seems to work!
