Tag-based Social Interest Discovery Xin Li, Lei Guo, Eric Zhao Yahoo! International Social Search
Internet Social Networks Are Emerging! Internet social networks are self-organized by online users Del.icio.us, Facebook, Flickr, MySpace, YouTube Users are driven by their interests Fetch and bookmark contents Create new contents Share contents Interest discovery is crucial to a social network Discover interests of users in different contents Locate users with similar interests Link people with similar interests to form communities
Important Features of Social Networks Organize users and contents Cluster users into communities Categorize contents into interesting topics Provide search functions Given a topic, locate all matching contents and all users that are interested in the topic Given a user, locate all his fetched/created contents and the topics of his interests Given a user, locate all other users that have similar interests
The Problem: Social Interest Discovery Questions to answer How to discover a user’s interests based on his fetched/created contents? How to use individual users’ interests to find interesting topics shared by users? How to use the topics to create interest-based user communities?
Existing Solutions and Limitations User-centric Using social network graph to discover users with common interests Problem: online/offline user connections are hard to identify Object-centric Detect common interests based on the common objects fetched by users Problem: discovered interests are object-based, non-descriptive and implicit Predefined categorization Not flexible, cannot capture the most recent popular or hot user interests Cannot reflect the various user interest groups, which may keep changing over time
Our approach Leverage user-generated tags Compute frequent co-occurrences of tag patterns Use the tag patterns as topics of interests Cluster users and content around the topics to build communities
Overview Motivation and Problem Analysis of tags in a social network ISID system design Evaluation Conclusion
Tags in Social Networks User-generated labels for annotating the contents Descriptive, summary, reflecting human judgment Metadata between users and contents Widely used in social networks Del.icio.us: http://del.icio.us/help/tags YouTube: http://www.google.com/support/youtube/bin/answer.py?hl=en&answer=55769 Facebook: http://www.facebook.com/help.php?hq=tag
del.icio.us Social Network A pioneering social bookmarking system http://del.icio.us/ Our Data Set Dump for a limited period of time 4.3 M public, tagged bookmarks, 0.2 M users, 1.4 M bookmarked URLs
URL Popularity Follows Power Law The distribution of URL bookmarking frequency. Most URLs are unpopular.
User Activity Follows a Heavy-tailed Distribution The distribution of user bookmarking frequency. Most users are not very active.
Tags vs. Keywords
URL: http://ka1fsb.home.att.net/resolve.html
Top tf keywords: domain, name, file, resolver, server, conf, network, nameserver, ip, org, ampr
Top tf-idf keywords: ampr, domain, jnos, nameserver, conf, ka1fsb, resolver, ip, file, name, server
All tags: linux, howto, network, sysadmin, dns
Tag Vocabulary Tag coverage for tf keywords Tag coverage for tf-idf keywords User tags missed ≤ 20% of tf keywords for ≥ 98% of docs and ≤ 10% of tf-idf keywords for ≥ 90% of docs. Tags covered the most important keywords. But the total number of unique tags is ~10x smaller than that of keywords.
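The coverage measurement above can be illustrated with a minimal sketch. It assumes pre-tokenized documents, a plain log(N/df) idf, and that "missed" means the fraction of a document's top-k keywords absent from the tag vocabulary. The function names `top_keywords` and `missed_fraction` and the toy data are hypothetical, not taken from the study.

```python
import math
from collections import Counter

def top_keywords(docs, doc_id, k=5, use_idf=True):
    """Return the k best keywords of docs[doc_id], ranked by tf or tf-idf.

    docs maps a document id to its list of tokens (assumed pre-tokenized).
    """
    tf = Counter(docs[doc_id])
    n_docs = len(docs)
    scores = {}
    for word, count in tf.items():
        if use_idf:
            # Document frequency: number of documents containing the word.
            df = sum(1 for tokens in docs.values() if word in tokens)
            scores[word] = count * math.log(n_docs / df)
        else:
            scores[word] = count
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]

def missed_fraction(keywords, tag_vocabulary):
    """Fraction of the given keywords that do not appear among the tags."""
    if not keywords:
        return 0.0
    missed = [w for w in keywords if w not in tag_vocabulary]
    return len(missed) / len(keywords)

# Toy corpus and tag vocabulary (hypothetical data).
docs = {
    "a": ["dns", "dns", "server", "linux"],
    "b": ["linux", "kernel", "kernel"],
}
tags = {"dns", "linux"}
print(missed_fraction(top_keywords(docs, "a", k=2), tags))
```

In this toy corpus "linux" appears in every document, so its idf is zero and it drops out of the tf-idf ranking even though it is a top tf keyword.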
Tag Convergence The total number of different tags users can use for a given document is limited no matter how popular the URL is.
Tags Capture Concepts of Contents Nearly 50% of all URLs have a tag match ratio of 1 70% of all URLs have a tag match ratio > 0.5 Only 10% of the URLs have no matched tags
From Tags to User Interests Bookmarks reflect user interests Tags summarize/describe bookmarked contents Meta data between users and contents Connect users and bookmarked contents Frequently used tag patterns reflect user interests The key is the co-occurrences of tags
System Design Find topics of interest: for a given set of tagged bookmarks, find all topics of interest, i.e., frequent co-occurrences of tags. Clustering: for each topic, find all the URLs and the users such that those users have labeled each of the URLs with all the tags in the topic. Indexing: import the topics and their user and URL clusters into an indexing system for application queries.
ISID Architecture [Figure: architecture diagram with components Data Source, Topic Discovery, Clustering, Indexing; data labels: Posts; Topics, posts; Topics, Clusters.] Posts = (user, content, tags)
Topic Discovery Use association rule algorithms to discover co-occurring tag patterns Originally invented for identifying frequently bought items in supermarkets E.g., bread and milk Use a support number to define the frequency threshold Efficient in finding frequent patterns in a large set of transactions for a given support number (threshold) The rule building part is not used One more step: remove pattern A if A is a sub-pattern of some other pattern B, and both A & B have the same support number To remove duplicate clusters
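The topic discovery step can be sketched as follows, treating each post as one transaction of tags. This is an illustration under stated assumptions, not the ISID implementation: it counts tag sets by brute-force enumeration up to a hypothetical `max_size` instead of using an optimized association rule miner, and the function names are invented for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_patterns(posts, min_support, max_size=3):
    """Map each tag set occurring in >= min_support posts to its support.

    posts is a list of (user, url, tags) tuples; each post is a transaction.
    """
    support = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for _user, _url, tags in posts:
            for combo in combinations(sorted(set(tags)), size):
                counts[frozenset(combo)] += 1
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            # No frequent pattern of this size, so none of a larger size.
            break
        support.update(level)
    return support

def drop_redundant_subpatterns(support):
    """Remove pattern A when a strict superset B has the same support."""
    return {
        a: s
        for a, s in support.items()
        if not any(a < b and s == support[b] for b in support)
    }

# Toy posts (hypothetical data).
posts = [
    ("u1", "x", ["linux", "dns"]),
    ("u2", "x", ["linux", "dns", "howto"]),
    ("u3", "y", ["linux"]),
]
topics = drop_redundant_subpatterns(frequent_tag_patterns(posts, min_support=2))
for pattern, count in topics.items():
    print(sorted(pattern), count)
```

Here {dns} is dropped because it never occurs without "linux": it has the same support as {dns, linux} and would produce a duplicate cluster. {linux} is kept because its support is higher.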
Clustering
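A minimal sketch of the clustering step as described in the system design: for each topic, collect the users and URLs of every post whose tags include all tags of the topic. The function name `cluster_posts` and the per-post matching rule are assumptions made for illustration.

```python
def cluster_posts(posts, topics):
    """Group users and URLs under each topic.

    posts is a list of (user, url, tags) tuples; topics is an iterable of
    frozensets of tags. A post joins a topic when the topic's tags are a
    subset of the post's tags (assumed matching rule).
    """
    topics = list(topics)
    clusters = {topic: {"users": set(), "urls": set()} for topic in topics}
    for user, url, tags in posts:
        tag_set = set(tags)
        for topic in topics:
            if topic <= tag_set:
                clusters[topic]["users"].add(user)
                clusters[topic]["urls"].add(url)
    return clusters

# Toy posts and topics (hypothetical data).
posts = [
    ("u1", "x", ["linux", "dns"]),
    ("u2", "x", ["linux", "dns", "howto"]),
    ("u3", "y", ["linux"]),
]
topics = [frozenset({"linux"}), frozenset({"linux", "dns"})]
clusters = cluster_posts(posts, topics)
print(sorted(clusters[frozenset({"linux", "dns"})]["users"]))
```

The nested loop costs one subset test per (post, topic) pair, which is acceptable for a sketch. At the scale of the data set described earlier, an index from tags to posts would likely be needed.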
Indexing Find all URLs that contain a topic, i.e. tagged with same sets of tags Find all users interested in a topic Find all topics containing a tag Find all topics for a user Find all topics for a URL Combination of the above
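The queries listed above map naturally onto inverted indexes. The sketch below assumes in-memory dictionaries and a hypothetical cluster layout (topic mapped to its user and URL sets); the slides do not say which indexing system ISID uses.

```python
from collections import defaultdict

def build_indexes(clusters):
    """Build inverted indexes from topic clusters.

    clusters maps a topic (frozenset of tags) to
    {"users": set_of_users, "urls": set_of_urls}.
    """
    tag_to_topics = defaultdict(set)
    user_to_topics = defaultdict(set)
    url_to_topics = defaultdict(set)
    for topic, members in clusters.items():
        for tag in topic:
            tag_to_topics[tag].add(topic)
        for user in members["users"]:
            user_to_topics[user].add(topic)
        for url in members["urls"]:
            url_to_topics[url].add(topic)
    return tag_to_topics, user_to_topics, url_to_topics

# Toy clusters (hypothetical data).
clusters = {
    frozenset({"linux", "dns"}): {"users": {"u1", "u2"}, "urls": {"x"}},
    frozenset({"css"}): {"users": {"u2"}, "urls": {"y"}},
}
tag_to_topics, user_to_topics, url_to_topics = build_indexes(clusters)

# All URLs and users of a topic come straight from the cluster itself.
dns_topic = frozenset({"linux", "dns"})
print(clusters[dns_topic]["urls"], clusters[dns_topic]["users"])

# Combined query: topics of user u2 that contain the tag "dns".
u2_dns_topics = user_to_topics["u2"] & tag_to_topics["dns"]
print(u2_dns_topics)
```

Combined queries reduce to set intersections over the per-tag, per-user, and per-URL topic sets.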
Content Similarity of Topic Clusters Similarity of two documents Inner product of tf-idf document vectors Keyword-based vector Tag-based vector (comparison) Intra-topic similarity Average cosine similarity of all document pairs Inter-topic similarity Similarity of two topics Average similarity of one topic to all other topics
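The similarity measures above can be sketched with sparse vectors stored as dictionaries. Intra-topic similarity follows the slide (average cosine similarity over all document pairs in a topic). The similarity of two topics is not defined on the slide, so averaging over all cross-topic document pairs is an assumption, as are the function names.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term-to-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def intra_topic_similarity(vectors):
    """Average cosine similarity over all document pairs within one topic."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # Assumed convention for a topic with fewer than 2 docs.
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def topic_pair_similarity(vectors_a, vectors_b):
    """Average cosine similarity over cross-topic document pairs (assumed)."""
    if not vectors_a or not vectors_b:
        return 0.0
    total = sum(cosine(a, b) for a in vectors_a for b in vectors_b)
    return total / (len(vectors_a) * len(vectors_b))

def inter_topic_similarity(topic, other_topics):
    """Average similarity of one topic to all other topics."""
    if not other_topics:
        return 0.0
    sims = [topic_pair_similarity(topic, other) for other in other_topics]
    return sum(sims) / len(sims)

# Toy tf-idf vectors (hypothetical weights).
dns_topic = [{"dns": 1.0, "resolver": 0.5}, {"dns": 2.0, "resolver": 1.0}]
css_topic = [{"css": 1.0}]
print(intra_topic_similarity(dns_topic))
print(inter_topic_similarity(dns_topic, [css_topic]))
```

The same functions apply to keyword-based and tag-based vectors, which is what makes the two representations directly comparable.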
Inter- and Intra-Topic Similarity Keyword based (tf-idf) Tag based (tf-idf) Intra-topic similarity is significantly higher than inter-topic similarity Tag co-occurrence clusters similar content well Tag-based similarity is quite close to keyword-based similarity
Inter-topic Similarity Similarity of two topics with different numbers of overlapping tags Keyword-based (tf-idf) Tag-based (tf-idf) Inter-topic similarity increases with the number of co-occurring tags. Tag co-occurrences capture similar contents.
User Interest Coverage 90% users have ≥ 90% top 5 tags covered 87% users have ≥ 90% top 10 tags covered 90% users have ≥ 80% tags covered The topics discovered by ISID capture the interests of users.
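One way such coverage figures could be computed is sketched below. The slides do not define coverage, so the sketch assumes that a tag counts as covered when it appears in at least one discovered topic, and that a user's top tags are those the user applies most often. `top_tag_coverage` is a hypothetical name.

```python
from collections import Counter

def top_tag_coverage(user_tags, topics, k=5):
    """Fraction of a user's k most-used tags that appear in any topic.

    user_tags is the list of tags the user applied (with repetition);
    topics is a list of frozensets of tags.
    """
    counts = Counter(user_tags)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = [tag for tag, _count in ranked[:k]]
    if not top:
        return 0.0
    covered_tags = set().union(*topics) if topics else set()
    return sum(1 for tag in top if tag in covered_tags) / len(top)

# Toy data (hypothetical).
user_tags = ["linux", "linux", "dns", "css"]
topics = [frozenset({"linux", "dns"})]
print(top_tag_coverage(user_tags, topics, k=3))
```

Running this per user and taking the share of users above a threshold would yield statements of the form "90% of users have at least 90% of their top 5 tags covered".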
Human Reviews Scores: 1, Highly unrelated 2, Unrelated 3, Not sure 5, Highly related According to human reviewers, ISID does group related URLs into the cluster of each topic defined by user tags.
Cluster Properties Cluster size follows a power law User interests follow a power law. There exist really hot topics!
Cluster Properties Most topics have fewer than 6 tags. Beyond 6, the number of clusters quickly drops.
Conclusion Tags reflect human judgments on contents Co-occurring tags are effective at representing user interests Reflect human understanding of different but similar web contents Consensus of judgments among users ISID system Topic discovery, Clustering, Indexing Evaluation results are promising