On Stability, Clarity, and Co-occurrence of Self-Tagging
Aixin Sun and Anwitaman Datta
Nanyang Technological University, Singapore
2 Outline
Collaborative-tagging vs. Self-tagging
Dataset overview and characteristics
Experiments
  Tag Usage and Stability
  Tag Clarity vs. Popularity
  Tag Co-occurrence vs. Semantic distance
Conclusion
Questions/suggestions forwarded
3 Collaborative-tagging vs. Self-tagging
Collaborative tagging: a resource may be tagged by multiple users with multiple tags, e.g., del.icio.us and CiteULike.
Self-tagging: a resource can only be tagged by its creator, e.g., most blog posts.
Questions
  Are there any differences in tagging behavior?
  Do observations made on collaborative tagging hold in self-tagging?
  When tags are used in an application (e.g., tag recommendation, classification/clustering), should the two systems be treated differently?
4 Dataset
Overview: blogs listed in http://dir.blogflux.com/ and hosted by blogspot.com
Categories: Academic – Zookeeping
Blogs: 15,244
Posts: 3.3M
Posts with tag(s): 983K
Distinct tags: 29K
Characteristics [Marlow06]
5 Tag Usage
6 Tag Dynamics
Collaborative tagging systems [Halpin07]
  The distribution of tags used to collaboratively annotate a particular resource becomes stable after a certain period of time.
  Tags that describe the resource well are repeatedly received from multiple users.
  Possible reasons [Golder06]: imitation of others; shared knowledge.
Self-tagging systems?
  Bloggers have no direct interaction through which to influence and imitate each other.
  Bloggers may read each other's posts and tags. Shared background? An implicit consensus of tag usage.
7 Tag Stability
A relatively small set of tags is used to annotate most blog posts.
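One way to make the observation above concrete is to measure what fraction of tagged posts carry at least one of the k most used tags. The following is a minimal sketch under stated assumptions, not the measurement used in the talk: the function name `coverage_by_top_tags` and the toy posts are hypothetical.

```python
from collections import Counter

def coverage_by_top_tags(post_tags, k):
    """Fraction of tagged posts that carry at least one of the k most used tags."""
    # Ignore untagged posts and duplicate tags within a post.
    tagged = [set(tags) for tags in post_tags if tags]
    if not tagged:
        return 0.0
    # Count in how many posts each tag appears.
    usage = Counter(tag for tags in tagged for tag in tags)
    top = {tag for tag, _ in usage.most_common(k)}
    covered = sum(1 for tags in tagged if tags & top)
    return covered / len(tagged)

# Toy data (hypothetical): each inner list holds the tags of one post.
posts = [
    ["travel", "food"],
    ["travel"],
    ["food", "recipe"],
    ["politics"],
    ["travel", "photo"],
]
top1 = coverage_by_top_tags(posts, 1)
top2 = coverage_by_top_tags(posts, 2)
print(top1, top2)
```

If a small k already yields coverage close to 1, a small set of tags annotates most posts.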
8 Tag Clarity
Question: does the same tag tend to be assigned to topically similar blog posts?
Tag clarity
  A tag receives a high clarity score if all posts annotated by the tag are topically cohesive.
  Inspired by the query clarity score in ad-hoc retrieval [Cronen-Townsend02].
  The clarity score of a tag is the distance between the tag language model and the collection language model.
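The clarity definition can be sketched in a few lines. This sketch assumes, as in query clarity, that the distance is the KL-divergence (in bits) from the tag language model to the collection language model, and it uses plain maximum-likelihood unigram models over whitespace-separated words. Both choices, the function names, and the toy posts are illustrative assumptions rather than the authors' exact estimation procedure.

```python
import math
from collections import Counter

def unigram_model(texts):
    """Maximum-likelihood unigram model over whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def tag_clarity(tagged_texts, collection_texts):
    """KL(tag model || collection model) in bits.

    Assumes the tagged posts are part of the collection, so every word
    with non-zero tag probability also has non-zero collection probability.
    """
    p_tag = unigram_model(tagged_texts)
    p_coll = unigram_model(collection_texts)
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_tag.items())

# Toy collection (hypothetical posts).
collection = [
    "pasta sauce recipe",
    "pasta oven recipe",
    "election vote poll",
    "travel beach photo",
]
focused = tag_clarity(collection[:2], collection)  # tag used on two cooking posts
scattered = tag_clarity(collection, collection)    # tag used on every post
print(focused, scattered)
```

A tag spread over the whole collection has a language model identical to the collection model, so its clarity is zero; a tag confined to topically similar posts scores higher.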
9 Tag Clarity vs. Tag Popularity
The number of tags decreases as tag popularity increases.
The clarity scores of tags decrease as popularity increases.
10 Tag Clarity vs. Tag Popularity
Less popular tags have clarity scores close to those of dummy tags.
More popular tags have higher clarity scores than dummy tags.
11 Tag Co-occurrence vs. Semantic distance
Co-occurrence
Semantic distance: KL-divergence between the two tag language models
Question: if two tags co-occur in annotating blog posts, is their semantic distance small?
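The two quantities compared on this slide can be sketched as follows. Several details are assumptions, since the slide does not state them: the tag language models are unigram models smoothed with the collection model (mixing weight `lam`), the KL-divergence is taken in one direction only, and co-occurrence is measured with the Jaccard coefficient as a stand-in for the measure used in the talk. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

def unigram_model(texts):
    """Maximum-likelihood unigram model over whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def smoothed(model, background, lam=0.5):
    """Mix a tag model with the collection model so no word has zero probability."""
    return {w: lam * model.get(w, 0.0) + (1 - lam) * p for w, p in background.items()}

def kl_divergence(p, q):
    """KL(p || q) in bits; q must be non-zero wherever p is."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def semantic_distance(tag_a, tag_b, tag_posts, posts):
    """KL-divergence between the smoothed language models of two tags."""
    background = unigram_model(posts.values())
    model_a = smoothed(unigram_model(posts[i] for i in tag_posts[tag_a]), background)
    model_b = smoothed(unigram_model(posts[i] for i in tag_posts[tag_b]), background)
    return kl_divergence(model_a, model_b)

def jaccard_cooccurrence(tag_a, tag_b, tag_posts):
    """Assumed co-occurrence measure: shared posts over posts carrying either tag."""
    a, b = tag_posts[tag_a], tag_posts[tag_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy data (hypothetical): post id -> text, tag -> ids of posts carrying it.
posts = {
    1: "pasta sauce recipe",
    2: "pasta oven recipe",
    3: "election vote poll",
    4: "pasta dinner photo",
}
tag_posts = {"cooking": {1, 2}, "food": {1, 2, 4}, "politics": {3}}

d_food = semantic_distance("cooking", "food", tag_posts, posts)
d_politics = semantic_distance("cooking", "politics", tag_posts, posts)
co_food = jaccard_cooccurrence("cooking", "food", tag_posts)
co_politics = jaccard_cooccurrence("cooking", "politics", tag_posts)
print(d_food, d_politics, co_food, co_politics)
```

Because KL-divergence is asymmetric, a symmetric variant (for example, the sum of both directions) may be preferable when the order of the two tags should not matter.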
12 Tag Co-occurrence vs. Semantic distance
13 Tag Co-occurrence vs. Semantic distance
Observations
  The co-occurrence of two tags does not suggest any semantic relationship between them (correlation coefficient = 0.017).
  Tag pairs are much clearer in describing posts, as supported by their clarity scores.
  Tag pairs are likely to be semantically orthogonal, partially consistent with [Weinberger08].
Possible reasons
  Tags are more for personal use than for others' benefit.
  Because a blogger has a clear understanding of her post, she does not need to tag it with many similar tags. Rather, she may tag the post with tags from different perspectives.
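A correlation coefficient like the one reported above can be computed over per-pair values of co-occurrence and semantic distance. The slide does not say which coefficient was used; the sketch below assumes Pearson's r, and the numbers are made-up illustrative values, not results from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, one entry per tag pair.
cooccurrence = [0.1, 0.2, 0.3, 0.4]
distance = [2.0, 1.0, 1.0, 2.0]
r = pearson(cooccurrence, distance)
print(r)
```

A value of r near zero, as in this toy example, means that knowing how often two tags co-occur says little about how far apart their language models are.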
14 Tag Clarity vs. Tag Popularity (Revisit)
15 Conclusion
A preliminary study on tags in self-tagging systems
  Tag dynamics
  Tag clarity vs. popularity
  Tag co-occurrence vs. semantic distance
Observations
  Measured by tag clarity, tags are often assigned to topically similar blog posts.
  Co-occurring tags are not necessarily semantically similar to each other, but are likely to be semantically orthogonal.
16 Questions/suggestions forwarded
For resources tagged only by their owner, people will avoid redundancy but provide different aspects of a single resource. How does this feature influence applications on such a system? Can we expect that different facets can be extracted from a self-tagging system?
This system allows only one user to tag a resource, and allows the user to use multi-word/phrase tags. It must be sparsely linked data, and the co-occurrence of tags must be lower than in a free tagging system. Could we expect some differences from this point of view?
17 More questions/suggestions
How does this difference make research and applications on self-tagging systems challenging?
I wonder if the convergence of tags to the final set of tags is represented primarily by the dominance of a few tags. If you omit the most common handful of topics, does the remainder also converge?
Several blogging systems separately show author tags and reader tags. It would be interesting to see the overlap between these and the effect of one on the other.
18 Acknowledgement This work was supported by A*STAR Public Sector R&D, Singapore
19 References
[Cronen-Townsend02] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proc. of SIGIR’02, pages 299–306, Tampere, Finland, 2002.
[Golder06] S. A. Golder and B. A. Huberman. Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006.
[Halpin07] H. Halpin, V. Robu, and H. Shepherd. The complex dynamics of collaborative tagging. In Proc. of WWW’07, pages 211–220, Banff, Alberta, Canada, 2007.
[Marlow06] C. Marlow, M. Naaman, D. Boyd, and M. Davis. Ht06, tagging paper, taxonomy, flickr, academic article, to read. In Proc. of ACM HyperText’06, pages 31–40, Odense, Denmark, 2006.
[Weinberger08] K. Weinberger, M. Slaney, and R. van Zwol. Resolving tag ambiguity. In ACM Multimedia, Vancouver, Canada, 2008.
Thank you