Instance-based mapping between thesauri and folksonomies Christian Wartena Rogier Brussee Telematica Instituut.

Presentation transcript:

Instance-based mapping between thesauri and folksonomies Christian Wartena Rogier Brussee Telematica Instituut

Outline
Interoperability of Keywords
Wikipedia and del.icio.us
Keyword similarity
Experiment
Conclusion

Interoperability of Keywords
Documents (pictures, movies, …) are annotated with keywords for organization and retrieval.
Different collections and communities use different sets of keywords.
–The set of selectable keywords is often organized in and delimited by a thesaurus.
–The set of freely generated end-user keywords ("tags") forms a folksonomy.
Align keywords and tags by comparing their usage.
Tested on del.icio.us tags and Wikipedia categories.

del.icio.us and Wikipedia
Del.icio.us
–Social bookmarking site.
–In most cases, bookmarks can be interpreted as labels or tags for the bookmarked URL.
–Many Wikipedia articles are tagged by del.icio.us users.
Wikipedia
–Articles are labeled with one or more categories by the article authors.
–Categories are organized hierarchically.
–Categories are organized deliberately, as in a thesaurus: new categories are introduced after discussion among active Wikipedians.

Keyword alignment
Problem
–Given a keyword k in a system A, what is the most similar keyword k’ in system B? For example: given a tag from del.icio.us, what is the most similar Wikipedia category (or vice versa)?
Approach
–Interpret similarity as similarity of usage.
–Compute similarity of usage on a common subcollection.
Evaluation
–Compare results to human judgment of similarity.

Keyword similarity
Basic assumption: similarity is similarity of usage.
–If two keywords have similar usage, they will give similar results in retrieval tasks.
Two keywords have similar usage if they
–have a similar distribution over documents (measured by the divergence, i.e. relative entropy, of the distributions, or by the cosine)
–often co-occur (measured by the Jaccard coefficient)
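The two baseline measures named on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the dictionary layout (document id mapped to assignment count) and the toy data are hypothetical and do not come from the presentation.

```python
import math
from collections import Counter


def jaccard(docs_a, docs_b):
    """Jaccard coefficient of the sets of documents two keywords are assigned to."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(counts_a, counts_b):
    """Cosine of two keyword-by-document count vectors (dicts: doc -> count)."""
    dot = sum(c * counts_b.get(d, 0) for d, c in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy data: how often each keyword was assigned to each document.
python_tag = Counter({"d1": 3, "d2": 1})
programming_cat = Counter({"d1": 1, "d2": 1, "d3": 1})

print(jaccard(python_tag, programming_cat))
print(cosine(python_tag, programming_cat))
```

Note that the Jaccard coefficient only looks at whether a keyword occurs on a document, while the cosine also uses how often it was assigned.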

New measure for keyword similarity
Keywords have similar usage if they co-occur with similar frequency with all other keywords.
–We use the frequency with which a tag/keyword is assigned to a document.
–We include co-occurrence information with other terms. This helps to cope with sparse data.
In other words:
–Terms are similar if they have similar co-occurrence patterns.
Similar to the Tag Context Similarity in Cattuto et al.’s presentation tomorrow (Semantic Social Networks session).

Formalization: Distribution of co-occurring terms
$$\bar{p}_z(t) = \sum_d q(t|d)\, Q(d|z)$$
where
–q(t|d) is the keyword distribution of d
–Q(d|z) is the document distribution of z: “the fraction of z’s that is found in d”
Weighted average of the keyword distributions of documents
–The weight is the relevance of d for z, given by the probability Q(d|z)
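A minimal sketch of this weighted average, assuming that both q(t|d) and Q(d|z) are estimated as relative frequencies from assignment counts n(d, t). That estimate, the function name and the toy data are assumptions for illustration, and the keyword z itself is kept in its own distribution here, which the slides leave open.

```python
from collections import defaultdict


def cooccurrence_distribution(z, assignments):
    """Return {t: p_z(t)} with p_z(t) = sum over d of q(t|d) * Q(d|z).

    assignments maps each document to {term: number of times assigned}.
    q(t|d) = n(d, t) / n(d) and Q(d|z) = n(d, z) / n(z) (assumed estimates).
    """
    n_z = sum(terms.get(z, 0) for terms in assignments.values())
    if n_z == 0:
        raise ValueError(f"keyword {z!r} is not assigned to any document")
    p_z = defaultdict(float)
    for terms in assignments.values():
        n_dz = terms.get(z, 0)
        if n_dz == 0:
            continue  # documents without z get weight Q(d|z) = 0
        n_d = sum(terms.values())
        weight = n_dz / n_z  # Q(d|z)
        for t, n_dt in terms.items():
            p_z[t] += weight * n_dt / n_d  # weight * q(t|d)
    return dict(p_z)


# Hypothetical toy collection.
assignments = {
    "d1": {"python": 2, "programming": 2},
    "d2": {"python": 1, "snakes": 3},
}
p_python = cooccurrence_distribution("python", assignments)
print(p_python)
```

Because each q(·|d) sums to one and the weights Q(d|z) sum to one, the result is again a probability distribution over terms.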

Distance of keywords
For each keyword there is a distribution over all (other) keywords.
Similarity is expressed by the divergence of these distributions.
Kullback-Leibler divergence: $D(p\,\|\,q) = \sum_t p(t) \log_2 \frac{p(t)}{q(t)}$
–Bits per keyword saved by compressing a subcollection with keyword distribution p using p instead of a general distribution q.
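The Kullback-Leibler divergence can be computed directly from two term distributions. The sketch below assumes distributions are stored as dictionaries from term to probability and uses base-2 logarithms so that the result is in bits; the function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for dicts term -> probability."""
    total = 0.0
    for t, p_t in p.items():
        if p_t == 0:
            continue  # terms with p(t) = 0 contribute nothing
        q_t = q.get(t, 0.0)
        if q_t == 0:
            return math.inf  # p puts mass where q has none
        total += p_t * math.log2(p_t / q_t)
    return total


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ: the divergence is not symmetric, and it is infinite whenever p assigns probability to a term that q does not cover.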

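Distance of keywords (cont’d)
Jensen-Shannon divergence: $JSD(p\,\|\,q) = \tfrac{1}{2} D(p\,\|\,m) + \tfrac{1}{2} D(q\,\|\,m)$, where D is the Kullback-Leibler divergence
–Mean distribution: $m = \tfrac{1}{2}(p + q)$
Jensen-Shannon divergence is symmetric.
Jensen-Shannon divergence is the square of a metric, i.e. of a non-negative distance satisfying the triangle inequality.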
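A short sketch of the Jensen-Shannon divergence as the average Kullback-Leibler divergence of two distributions to their mean. As before, the dictionary representation, the base-2 logarithm and the function name are assumptions made for illustration.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits for dicts term -> probability."""
    terms = set(p) | set(q)
    # Mean distribution m = (p + q) / 2.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mean(r):
        # m(t) > 0 wherever r(t) > 0, so this sum is always finite.
        return sum(r_t * math.log2(r_t / m[t]) for t, r_t in r.items() if r_t > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(jensen_shannon(p, q), jensen_shannon(q, p))
print(jensen_shannon({"a": 1.0}, {"b": 1.0}))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no terms in common), and it stays finite even when the two distributions cover different terms.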

Alignment
Consider a collection of documents annotated with different sets of keywords.
Represent a keyword by a distribution over terms from both collections.
For each term find the closest term from the other collection.
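The alignment step amounts to a nearest-neighbour search between the two keyword sets. The sketch below assumes each keyword already has a co-occurrence distribution and uses the Jensen-Shannon divergence as the distance; the function names and the toy distributions are hypothetical.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence in bits for dicts term -> probability."""
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}
    kl = lambda r: sum(v * math.log2(v / m[t]) for t, v in r.items() if v > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)


def align(source, target):
    """Map each source keyword to (closest target keyword, distance)."""
    mapping = {}
    for s, p in source.items():
        best = min(target, key=lambda t: jsd(p, target[t]))
        mapping[s] = (best, jsd(p, target[best]))
    return mapping


# Hypothetical co-occurrence distributions for tags and categories.
tags = {
    "py": {"python": 0.6, "code": 0.4},
    "zoo": {"animals": 0.7, "snakes": 0.3},
}
categories = {
    "Programming languages": {"python": 0.5, "code": 0.5},
    "Reptiles": {"snakes": 0.6, "animals": 0.4},
}
for tag, (category, distance) in align(tags, categories).items():
    print(tag, "->", category, round(distance, 3))
```

Calling `align(categories, tags)` gives the mapping in the other direction. The two directions need not be inverses of each other, since each keyword simply takes its own nearest neighbour.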

Experiment I
Mapping between Teleblik keywords and user tags
Educational videos.
Professional keywords from public broadcasting archive.
Keywords assigned in an experiment by high school students.
Data
–100 videos
– tags
–4,348 different tags
–269 different keywords

Experiment II
Mapping between del.icio.us tags and Wikipedia categories
Del.icio.us tags collected by Mathias Lux (Klagenfurt Univ.)
Data
– Wikipedia articles
– tags and category annotations
– different Wikipedia categories
– different tags
Mappings computed for tags occurring on at least 10 documents.
–Mappings for 2355 tags
–Mappings for 1827 categories
–Using co-occurrence data with all tags/categories

Evaluation of mapping
Manual evaluation
Classification of a sample of mappings into:
–b: broader term
–n: narrower term
–r: related term
–u: unrelated
–x: source term is not a keyword (e.g. “to read”)
–q: meaning unknown

Evaluation of aligning Wikipedia and del.icio.us

Distance vs. mapping quality
Pairs with a small distance are rated better than pairs with a large distance.
Evaluation of mappings with smallest and largest distance
–a) Categories to tags
–b) Tags to categories

Effect of keyword frequency
No correlation between a keyword's frequency and its divergence from the best mapping found.

Comparison with Jaccard coefficient
Evaluation of mapping using two different distance measures.
The categories broader, narrower and related are merged.
Results for
–a) Categories to tags
–b) Tags to categories

Discussion of results
Method works very well in test
–Good mapping results
–Distance is a good indication of quality
–Insensitive to frequency (up to a certain degree)
Better than Jaccard, because it uses:
–co-occurrence with other tags (‘tag context’)
–the frequency with which a tag is assigned to a document.
Frequency information is typical for user-generated tags.
–We expect this method to perform less well for aligning keywords with other keywords (without assignment frequencies).
Distance measure also works well for clustering tags.

Future work
Evaluate relatedness using external sources (e.g. WordNet).
Compare to other distance measures.
We used documents annotated completely according to two annotation schemes.
–How large must the overlap be to obtain decent results?
–We can create a partial overlap of disjoint document sets by a partial identification of the keywords.
Detect asymmetry in relations (broader vs. narrower term).

Conclusion
Using co-occurrence patterns is a fruitful approach.
Frequent terms from folksonomies behave similarly to carefully assigned keywords.
–Evidence: a usage-based similarity measure yields good mappings.
–Folksonomy seems to work!