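Tag-based Social Interest Discovery. By yjhuang. Yahoo! Inc researchers: Xin Li, Lei Guo, Yihong (Eric) Zhao. These slides remain the property of the original authors and are used here for explanation only; the source is given at the end.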
Outline: Introduction; Data Set; Analysis of Tags; The Architecture; Evaluation
Introduction: Social network systems (Del.icio.us, Facebook, MySpace, YouTube). Discovering social interests. Main challenge: interests are difficult to detect and represent. Existing approaches: based on online connections.
This paper's work: Based on user-generated tags. Analyze real-world traces of tags and web content. Develop the Internet Social Interest Discovery system (ISID): discover the common user interests; cluster users and URLs by topic. Evaluation.
Data Set: Delicious bookmarks. 4.3M bookmarks, 0.2M users, 1.4M URLs.
Data Collection and Pre-Processing: Crawl the URLs and download the pages. Discard all non-HTML objects. Convert the encoding to UTF-8 and remove non-English pages. Apply a stopword list and the Porter stemming algorithm. Result: 298,350 distinct tags and 4,072,265 keywords.
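The keyword-extraction step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the stopword list is a tiny hypothetical one, the tokenizer keeps alphabetic runs only, and Porter stemming is left out because it needs a third-party library.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def extract_keywords(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

# Porter stemming would be applied to each surviving token at this point.
keywords = extract_keywords("The resolver is configured in the resolv.conf file")
```

Note that this tokenizer splits "resolv.conf" into two tokens; whether the original pipeline did the same is not stated on the slide.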
Users, URLs and Tags. Figure 1: Distribution of how often each URL was bookmarked in our data set (log-log scale).
Users, URLs and Tags. Figure 2: Distribution of the bookmarking activity (log-log scale).
Users, URLs and Tags. Figure 3: Distribution of tag frequencies.
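Analysis of Tags: Use the vector space model (VSM). Each URL has two vectors: one in the space of all tags and one in the space of document keywords. A corpus with t terms and d documents is represented by a term-document matrix $A = [a_{ij}] \in \mathbb{R}^{t \times d}$, where $a_{ij}$ is the weight of term i in document j.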
Weight Measurements: tf-based; tf-idf-based.
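The formulas for the two weightings did not survive on this slide. A minimal sketch of the usual definitions is given below, assuming raw counts for tf and idf = log(d / df) for tf-idf; the paper may use a different variant, and the function names are illustrative.

```python
import math
from collections import Counter

def tf_weights(doc_terms):
    """Raw term frequency: how many times each term occurs in one document."""
    return dict(Counter(doc_terms))

def tfidf_weights(docs):
    """tf-idf per document, using idf = log(d / df) (an assumed variant)."""
    d = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(d / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [["linux", "dns", "dns"], ["linux", "kernel"]]
tf = tf_weights(docs[0])
tfidf = tfidf_weights(docs)
```

In this toy corpus "linux" occurs in every document, so its tf-idf weight is zero, while "dns" keeps a positive weight.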
An Example of Tags vs. Keywords: A URL bookmarked by users, about resolv.conf in Linux. The table shows the top 10 keywords.
The Vocabulary of Tags: Compare the vocabulary of tags with that of keywords in web documents to see whether the most important words are covered. Figures 4 and 5: the coverage by user-generated tags of the tf (Figure 4) and tf-idf (Figure 5) keywords of 7,000 random documents.
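One plausible reading of the coverage measurement is the fraction of a document's keywords that also occur in the tag vocabulary. The sketch below assumes that reading; the exact definition used for Figures 4 and 5 is not given on the slide, and `tag_coverage` is a hypothetical name.

```python
def tag_coverage(doc_keywords, tag_vocabulary):
    """Fraction of a document's distinct keywords that also occur as tags."""
    keywords = set(doc_keywords)
    if not keywords:
        return 0.0
    return len(keywords & set(tag_vocabulary)) / len(keywords)

coverage = tag_coverage(
    ["linux", "dns", "resolver", "nameserver"],
    {"linux", "dns", "howto"},
)
```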
The Convergence of Tag Selections: Measure the convergence of tags for all URLs. X-axis: the popularity of URLs. Y-axis: the number of distinct tags.
Tags Matched by Documents: Do tags capture the main concepts of documents? Are they matched by the content of the URL? Statistical analysis: the number of occurrences of a tag is its weight. Tag match ratio e(T, U), where T = {t_i} is the set of tags attached to a given URL U: based on the total weight of the tags that also appear in the keyword set of U.
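A small sketch of the tag match ratio follows. The slide only names the numerator (the weight of tags that also appear among the keywords of U); dividing by the total weight of all tags in T is an assumption made here so that the result is a ratio between 0 and 1.

```python
def tag_match_ratio(tag_weights, url_keywords):
    """e(T, U): matched tag weight divided by total tag weight (assumed form)."""
    total = sum(tag_weights.values())
    if total == 0:
        return 0.0
    # Only tags that also occur in the URL's keyword set count as matched.
    matched = sum(w for tag, w in tag_weights.items() if tag in url_keywords)
    return matched / total

# Weight = number of times the tag was attached to the URL (toy numbers).
tag_weights = {"linux": 6, "dns": 3, "toread": 1}
ratio = tag_match_ratio(tag_weights, {"linux", "dns", "resolver"})
```

With these toy numbers, 9 of the 10 units of tag weight are matched by the page content; only the personal tag "toread" is not.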
Tags Matched by Documents
Architecture for Social Interest Discovery: 1. Find topics of interest 2. Clustering 3. Indexing
Topic Discovery: Find the frequent tag patterns for a given set. Association rule algorithms: support, implication rules. Identify the frequent tag patterns: a frequent tag pattern {a, b}, if w({a, b}) = w({a}) = w({b})
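To make "frequent tag pattern" concrete, the sketch below counts how many posts contain each tag subset and keeps those whose support meets a threshold. It is a brute-force enumeration, not the association rule algorithm the slide refers to, and the names `frequent_tag_patterns`, `min_support` and `max_size` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_patterns(posts, min_support, max_size=3):
    """Count every tag subset (up to max_size) per post; keep the frequent ones."""
    support = Counter()
    for tags in posts:
        unique = sorted(set(tags))  # sorted so each pattern has one canonical tuple
        for size in range(1, max_size + 1):
            for pattern in combinations(unique, size):
                support[pattern] += 1
    return {p: n for p, n in support.items() if n >= min_support}

posts = [
    ["linux", "dns"],
    ["linux", "dns", "howto"],
    ["linux", "kernel"],
]
patterns = frequent_tag_patterns(posts, min_support=2)
```

The enumeration grows quickly with the number of tags per post, which is why a real system would prune candidates the way association rule miners do.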
Clustering
Indexing
Evaluation: The URL similarity of intra- and inter-topics. Cosine similarity of tf-idf keyword term vectors; cosine similarity of tag term vectors. 500 interest topics with more than 30 bookmarked URLs each, sharing 5-6 co-occurring tags. Inter-topic: 10,000 topic pairs.
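Cosine similarity between two URL vectors can be sketched as follows, assuming each vector is stored as a sparse dict from term (keyword or tag) to weight. The storage format and the sample weights are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

url_a = {"linux": 3.0, "dns": 4.0}
url_b = {"linux": 3.0, "dns": 4.0}
url_c = {"cooking": 2.0}
same = cosine_similarity(url_a, url_b)       # identical vectors
different = cosine_similarity(url_a, url_c)  # no shared terms
```

Intra-topic similarity would average this value over pairs of URLs in the same topic, and inter-topic similarity over pairs drawn from different topics.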
User Interest Coverage: For each user, sort the user's tags by the number of times the user has used them. Top-5: the 5 most-used tags of each user. Top-10: All:
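Ranking a user's tags by use count is straightforward with a counter. The sketch below assumes the input is the flat list of tags from all of one user's posts; `top_tags` is a hypothetical helper, and how coverage is then computed from these tags is not specified on the slide.

```python
from collections import Counter

def top_tags(user_tag_uses, k=None):
    """Return a user's tags ordered by use count, truncated to k if given."""
    ranked = Counter(user_tag_uses).most_common(k)
    return [tag for tag, _ in ranked]

uses = ["linux", "dns", "linux", "python", "linux", "dns"]
top2 = top_tags(uses, k=2)
all_tags = top_tags(uses)
```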
Human Reviews: 4 human editors; 10 topics; the 20 most frequent URLs for each topic; scores from 1 to 5.
Cluster Properties (Add). This slide is not from the original authors' slides; for the original version, please refer to the source.
Conclusion (Add): Propose a tag-based social interest discovery approach. Justify the use of user-generated tags to represent user interests. Implement a system on a social network such as Delicious. This slide is not from the original authors' slides; for the original version, please refer to the source.
References: Xin Li, Lei Guo, Yihong Zhao. Tag-based Social Interest Discovery. WWW 2008. Yahoo! Inc.
Notes: Slide download source: achments/1313/Tag- based+Socail+Interest+Discovery- by+yjhuang.ppt?version=1 Data Set web page