Tag-based Social Interest Discovery

Xin Li, Lei Guo, Yihong Zhao (Yahoo! Inc.). Tag-based Social Interest Discovery. Proceedings of the 17th International World Wide Web Conference (WWW 2008).

Tag-based Social Interest Discovery
Xin Li, Lei Guo, Eric Zhao
Yahoo! International Social Search

Internet Social Networks Are Emerging!
- Internet social networks are self-organized by online users
  - Del.icio.us, Facebook, Flickr, MySpace, YouTube
- Users are driven by their interests
  - Fetch and bookmark contents
  - Create new contents
  - Share contents
- Interest discovery is crucial to a social network
  - Discover interests of users in different contents
  - Locate users with similar interests
  - Link people with similar interests to form communities

Important Features of Social Networks
- Organize users and contents
  - Cluster users into communities
  - Categorize contents into interesting topics
- Provide search functions
  - Given a topic, locate all matching contents and all users that are interested in the topic
  - Given a user, locate all of that user's fetched/created contents and topics of interest
  - Given a user, locate all other users that have similar interests

The Problem: Social Interest Discovery
Questions to answer:
- How to discover a user's interests based on the contents that user fetched or created?
- How to use individual users' interests to find interesting topics shared by users?
- How to use the topics to create interest-based user communities?

Existing Solutions and Limitations
- User-centric
  - Use the social network graph to discover users with common interests
  - Problem: online/offline user connections are hard to identify
- Object-centric
  - Detect common interests based on the common objects fetched by users
  - Problem: discovered interests are object-based, non-descriptive, and implicit
- Predefined categorization
  - Not flexible; cannot catch the most recent popular or hot user interests
  - Cannot reflect the various user interest groups, which may keep changing over time

Our Approach
- Leverage user-generated tags
- Compute frequent co-occurrences of tag patterns
- Use the tag patterns as topics of interests
- Cluster users and content around the topics to build communities

Overview
- Motivation and Problem
- Analysis of tags in a social network
- ISID system design
- Evaluation
- Conclusion

Tags in Social Networks
- User-generated labels for annotating the contents
  - Descriptive, summary, reflecting human judgment
  - Metadata between users and contents
- Widely used in social networks
  - Del.icio.us: http://del.icio.us/help/tags
  - YouTube: http://www.google.com/support/youtube/bin/answer.py?hl=en&answer=55769
  - Facebook: http://www.facebook.com/help.php?hq=tag

del.icio.us Social Network
- A pioneering social bookmarking system: http://del.icio.us/
- Our data set
  - Dump for a limited period of time
  - 4.3 M public, tagged bookmarks; 0.2 M users; 1.4 M bookmarked URLs

URL Popularity Follows Power Law
[Figure: the distribution of URL bookmarking frequency]
Most URLs are unpopular.

User Activity Follows Heavy-tail
[Figure: the distribution of user bookmarking frequency]
Most users are less active.

Tags vs. Keywords
URL: http://ka1fsb.home.att.net/resolve.html
Top tf keywords: domain, name, file, resolver, server, conf, network, nameserver, ip, org, ampr
Top tf-idf keywords: ampr, domain, jnos, nameserver, conf, ka1fsb, resolver, ip, file, name, server
All tags: linux, howto, network, sysadmin, dns

Tag Vocabulary
[Figures: tag coverage for tf keywords; tag coverage for tf-idf keywords]
- User tags missed ≤ 20% of tf keywords for ≥ 98% of docs, and ≤ 10% of tf-idf keywords for ≥ 90% of docs
- Tags covered the most important keywords
- But the total number of unique tags is ~10x smaller than that of keywords

Tag Convergence
The total number of different tags users can use for a given document is limited, no matter how popular the URL is.

Tags Capture Concepts of Contents
- Nearly 50% of all URLs have a tag match ratio of 1
- 70% of all URLs have a tag match ratio > 0.5
- Only 10% of the URLs have no matched tags

From Tags to User Interests
- Bookmarks reflect user interests
- Tags summarize/describe bookmarked contents
- Metadata between users and contents
  - Connect users and bookmarked contents
- Frequently used tag patterns reflect user interests
  - The key is the co-occurrences of tags

Overview
- Motivation and Problem
- Analysis of tags in a social network
- ISID system design
- Evaluation
- Conclusion

System Design
- Find topics of interests: for a given set of tagged bookmarks, find all topics of interests, i.e., frequent co-occurrences of tags
- Clustering: for each topic, find all the URLs and the users such that those users have labeled each of the URLs with all the tags in the topic
- Indexing: import the topics and their user and URL clusters into an indexing system for application queries

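ISID Architecture
[Diagram: Data Source -> Topic Discovery -> Clustering -> Indexing. The data source supplies posts, where a post = (user, content, tags); topic discovery passes topics and posts to clustering; clustering passes topics and clusters to indexing.]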

Topic Discovery
- Use association rule algorithms to discover co-occurring tag patterns
  - Originally invented to identify items frequently bought together in supermarkets, e.g., bread and milk
  - Use a support number to define the frequency threshold
  - Efficient at finding frequent patterns in a large set of transactions for a given support number (threshold)
  - The rule-building part is not used
- One more step: remove pattern A if A is a sub-pattern of some other pattern B and both A and B have the same support number
  - This removes duplicate clusters
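
The two steps on this slide (keep tag patterns that meet a support threshold, then drop sub-patterns with the same support as a superset) can be sketched as follows. This is only an illustration under stated assumptions: brute-force counting of tag combinations stands in for a real frequent-itemset miner such as Apriori, and the function names, the `max_size` cap, and the sample posts are hypothetical, not taken from the ISID system.

```python
from collections import Counter
from itertools import combinations


def frequent_tag_patterns(posts, min_support, max_size=3):
    """Count tag combinations (up to max_size tags) and keep those that
    appear in at least min_support posts.

    Brute-force enumeration is an assumption made for brevity; it is not
    the pruned search a real association-rule miner would perform.
    """
    counts = Counter()
    for _user, _url, tags in posts:
        unique = sorted(set(tags))
        for size in range(1, min(max_size, len(unique)) + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {p: s for p, s in counts.items() if s >= min_support}


def drop_redundant_subpatterns(patterns):
    """Remove pattern A when a strict superset B has the same support."""
    return {
        a: s
        for a, s in patterns.items()
        if not any(a < b and s == sb for b, sb in patterns.items())
    }


# Hypothetical posts in the (user, content, tags) shape used by the slides.
posts = [
    ("alice", "http://example.org/dns", ["linux", "dns", "howto"]),
    ("bob", "http://example.org/dns", ["linux", "dns"]),
    ("carol", "http://example.org/bind", ["linux", "dns", "sysadmin"]),
    ("dave", "http://example.org/kernel", ["linux", "kernel"]),
]

topics = drop_redundant_subpatterns(frequent_tag_patterns(posts, min_support=2))
```

With these sample posts, `{dns}` is dropped because `{linux, dns}` has the same support, while `{linux}` survives because it occurs in one more post than the pair.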

Clustering
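
The clustering step described on the System Design slide can be sketched as below. The reading taken here is an assumption: a post belongs to a topic's cluster when its tag set contains every tag in the topic, and the cluster is the set of URLs and users from those posts. The function name and sample posts are hypothetical.

```python
def cluster_topic(posts, topic):
    """Return (urls, users) for the posts whose tags include every tag in topic."""
    topic = frozenset(topic)
    urls, users = set(), set()
    for user, url, tags in posts:
        if topic <= set(tags):  # the post carries all tags of the topic
            urls.add(url)
            users.add(user)
    return urls, users


# Hypothetical posts in the (user, content, tags) shape used by the slides.
posts = [
    ("alice", "http://example.org/dns", ["linux", "dns", "howto"]),
    ("bob", "http://example.org/dns", ["linux", "dns"]),
    ("carol", "http://example.org/bind", ["linux", "dns", "sysadmin"]),
    ("dave", "http://example.org/kernel", ["linux", "kernel"]),
]

urls, users = cluster_topic(posts, {"linux", "dns"})
```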

Indexing
- Find all URLs that contain a topic, i.e., that are tagged with the same set of tags
- Find all users interested in a topic
- Find all topics containing a tag
- Find all topics for a user
- Find all topics for a URL
- Combinations of the above
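
One simple way to support the queries listed on this slide is a handful of in-memory inverted indexes. The sketch below is an assumption about how such lookups could work, not a description of the indexing system ISID actually used; the names and sample clusters are hypothetical.

```python
from collections import defaultdict


def build_indexes(clusters):
    """Build inverted indexes from {topic: (urls, users)}.

    Each topic is a frozenset of tags. Returns three dicts mapping a tag,
    a user, or a URL to the set of topics it belongs to.
    """
    topics_by_tag = defaultdict(set)
    topics_by_user = defaultdict(set)
    topics_by_url = defaultdict(set)
    for topic, (urls, users) in clusters.items():
        for tag in topic:
            topics_by_tag[tag].add(topic)
        for user in users:
            topics_by_user[user].add(topic)
        for url in urls:
            topics_by_url[url].add(topic)
    return topics_by_tag, topics_by_user, topics_by_url


# Hypothetical topic clusters, as produced by a clustering step.
linux_dns = frozenset({"linux", "dns"})
python_web = frozenset({"python", "web"})
clusters = {
    linux_dns: ({"http://example.org/dns"}, {"alice", "bob"}),
    python_web: ({"http://example.org/flask"}, {"bob"}),
}

topics_by_tag, topics_by_user, topics_by_url = build_indexes(clusters)

# URLs and users for a topic come straight from the cluster table.
urls_for_topic, users_for_topic = clusters[linux_dns]

# A combined query: topics that bob follows and that contain the tag "linux".
bob_linux_topics = topics_by_user["bob"] & topics_by_tag["linux"]
```

Combined queries reduce to set intersections over the individual lookups, which is why the indexes store sets of topics rather than lists.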

Overview
- Motivation and Problem
- Analysis of tags in a social network
- ISID system design
- Evaluation
- Conclusion

Content Similarity of Topic Clusters
- Similarity of two documents
  - Inner product of tf-idf document vectors
  - Keyword-based vector
  - Tag-based vector (for comparison)
- Intra-topic similarity
  - Average cosine similarity of all document pairs
- Inter-topic similarity
  - Similarity of two topics
  - Average similarity of one topic to all other topics
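
The similarity measures named on this slide can be sketched as follows. Several details are assumptions, since the slide does not spell them out: the tf-idf weighting is the plain `tf * log(N / df)` form, and the similarity of two topics is taken to be the mean cosine similarity over all cross-topic document pairs. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Turn token lists into sparse tf-idf vectors (dict: term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity: inner product of the two vectors after normalization."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def intra_topic_similarity(vectors):
    """Average cosine similarity over all document pairs within one topic."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0


def inter_topic_similarity(a_vectors, b_vectors):
    """Assumed definition: mean cosine similarity over all cross-topic pairs."""
    pairs = [(u, v) for u in a_vectors for v in b_vectors]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0


# Hypothetical keyword lists for four documents in two topic clusters.
docs = [
    ["dns", "linux", "dns"],
    ["dns", "bind"],
    ["python", "django"],
    ["python", "flask"],
]
vectors = tfidf_vectors(docs)
topic_a, topic_b = vectors[:2], vectors[2:]

intra_a = intra_topic_similarity(topic_a)
inter_ab = inter_topic_similarity(topic_a, topic_b)
```

The same functions apply to tag-based vectors by passing each document's tags instead of its keywords.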

Inter- and Intra-Topic Similarity
[Figures: keyword-based (tf-idf); tag-based (tf-idf)]
- Intra-topic similarity is significantly higher than inter-topic similarity
  - Tag co-occurrence clusters similar content well
- Tag-based similarity is quite close to keyword-based similarity

Inter-topic Similarity
Similarity of two topics with different numbers of overlapping tags
[Figures: keyword-based (tf-idf); tag-based (tf-idf)]
- Inter-topic similarity increases with the number of co-occurring tags
- Tag co-occurrences capture similar contents

User Interest Coverage
- 90% of users have ≥ 90% of their top 5 tags covered
- 87% of users have ≥ 90% of their top 10 tags covered
- 90% of users have ≥ 80% of their tags covered
The topics discovered by ISID capture the interests of users.
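
A per-user coverage figure of the kind reported on this slide could be computed as sketched below. The definition is an assumption, because the slide does not state one: a tag counts as covered when it appears in at least one discovered topic, and a user's top tags are the ones that user applied most often. The function name and sample data are hypothetical.

```python
from collections import Counter


def top_tag_coverage(user_tags, topics, k):
    """Fraction of a user's k most frequently used tags that occur in any topic.

    user_tags is the list of tags the user applied (with repeats);
    topics is an iterable of tag sets.
    """
    topic_tags = set().union(*topics)
    top = [tag for tag, _count in Counter(user_tags).most_common(k)]
    if not top:
        return 0.0
    return sum(tag in topic_tags for tag in top) / len(top)


# Hypothetical discovered topics and one user's tagging history.
topics = [frozenset({"linux"}), frozenset({"linux", "dns"})]
user_tags = ["linux", "linux", "dns", "dns", "cooking"]

coverage_top2 = top_tag_coverage(user_tags, topics, k=2)
coverage_top3 = top_tag_coverage(user_tags, topics, k=3)
```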

Human Reviews
Scores:
- 1: Highly unrelated
- 2: Unrelated
- 3: Not sure
- 5: Highly related
According to the human reviewers, ISID does group related URLs into the cluster for each topic defined by user tags.

Cluster Properties
- Cluster size follows a power law, which implies that user interests follow a power law
- There exist really hot topics!

Cluster Properties
- Most topics have fewer than 6 tags
- Beyond 6 tags, the number of clusters drops quickly

Overview
- Motivation and Problem
- Data and Their Properties
- ISID system
- Evaluation
- Conclusion

Conclusion
- Tags reflect human judgments on contents
- Co-occurring tags represent user interests effectively
  - They reflect human understanding of different but similar web contents
  - They reflect a consensus of judgments among users
- ISID system
  - Topic discovery, clustering, indexing
  - Evaluation results are promising