Published by Nelson Singleton. Modified over 9 years ago.
1
«Tag-based Social Interest Discovery»
Proceedings of the 17th International World Wide Web Conference (WWW 2008)
Xin Li, Lei Guo, Yihong Zhao, Yahoo! Inc., CA
Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH
2
Paper Outline
- Introduction
- Previous work
- Data collection and pre-processing
- Tag analysis
- System architecture
- Evaluation
- Conclusions and future work
3
Introduction
- Problem statement: discover common interests shared by users in a social network system [1]
- Two approaches: user-centric (analyzing online connections between users) and object-centric (analyzing the objects users exchange, which can also be done offline)
- Paper's approach: concentrate on user-defined tags (examining tag-URL pairs)
[1] The best-known commercial systems of this kind are: http://del.icio.us/ http://www.facebook.com/ http://www.myspace.com/ http://www.youtube.com/
4
Why study tags: 4 key observations
- The tag vocabulary is rich and large enough
- For each URL, the number of unique tags associated with it is smaller than the number of keywords in the referenced web page
- The same URL may receive different tags; the tag and keyword vectors are nevertheless quite similar
- Tags carry the variation of human judgement and can therefore help identify social interests concisely and at a finer granularity
5
Previously …
- User-centric approach: relies on relations formed online (e.g. through blogging), which are non-trivial to extract
- Object-centric approach: locates common objects that different users share through the network, but the objects are non-descriptive and implicit to users
- Tagging techniques have already been used in social networks and blogs (often under the label "collaborative tagging"). Tag frequency in such networks has also been shown to obey a power law
- The novel idea here is to analyze the co-occurrence of multiple tags instead of single tags
6
Data collection/pre-processing
- Partial dump of del.icio.us database activity
- All non-HTML and non-English objects discarded; pages encoded to UTF-8
- Pages then filtered for stopwords (producing keywords)
- Tags and keywords then normalized with the Porter stemming algorithm
- Tag vocabulary size: ~300,000
- Keyword vocabulary size: ~4,000,000
7
Distribution of data
- The distribution of tags (Zipfian) differs fundamentally from that of customers in online shopping systems
8
Tag analysis (1), VSM
- The table shows intuitively that user-generated tags describe the content at a higher level of abstraction (initial observation) and are therefore also well suited to representing web page content
1. Use of the Vector Space Model for tf and idf calculation
2. Each URL is represented by two vectors, one in tag space and the other in keyword space
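The two-vector representation can be illustrated with a minimal sketch. It assumes a plain tf-idf weighting (term frequency times log inverse document frequency) and cosine similarity; the slide does not give the paper's exact weighting, and the function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each URL to a sparse tf-idf vector {term: weight}.

    docs: dict of URL -> list of terms (tags or keywords).
    Assumed weighting: (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per URL
    vectors = {}
    for url, terms in docs.items():
        counts = Counter(terms)
        vectors[url] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (invented for illustration): the same kind of input is used
# once with tags and once with page keywords to obtain the two vectors.
tag_docs = {
    "url_a": ["python", "code", "python"],
    "url_b": ["python", "tutorial"],
    "url_c": ["recipe", "food"],
}
tag_vectors = tfidf_vectors(tag_docs)
```

Running `tfidf_vectors` a second time on the keywords of each page gives the keyword-space vector for the same URL, so every URL ends up with one vector per space.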
9
Tag analysis (2), statistical estimators
- Tag vocabulary coverage is up to 90% of URL keywords (satisfactory)
- Tag matching by URL is almost complete (the opposite)
- The total number of tags that users generate for a given page is limited, no matter how popular the page is
- When multiple tags are used together, they define a topic of interest. This topic corresponds to a virtual community of users (who may have no physical or online connection in the real world)
10
Proposed software architecture
- Post stream: p = (user, URL, tags), where (user, URL) is the key
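A post record keyed by (user, URL) might be modelled as follows. This is only a sketch: the names `Post` and `PostStore` are hypothetical, and the choice that a repeated post from the same user for the same URL replaces the earlier one is an assumption drawn from "(user, URL) is the key", not something the slide states.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Optional, Tuple


class Post(NamedTuple):
    """One element of the post stream: p = (user, URL, tags)."""
    user: str
    url: str
    tags: FrozenSet[str]


class PostStore:
    """Holds posts indexed by their (user, URL) key."""

    def __init__(self) -> None:
        self._posts: Dict[Tuple[str, str], Post] = {}

    def add(self, user: str, url: str, tags: Iterable[str]) -> Post:
        post = Post(user, url, frozenset(tags))
        # Assumption: a later post with the same key overwrites the earlier one.
        self._posts[(user, url)] = post
        return post

    def get(self, user: str, url: str) -> Optional[Post]:
        return self._posts.get((user, url))

    def __len__(self) -> int:
        return len(self._posts)


store = PostStore()
store.add("alice", "http://example.org/", ["python", "code"])
store.add("alice", "http://example.org/", ["python"])  # same key, replaces
store.add("bob", "http://example.org/", ["web"])
```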
11
Topic Discovery (1)
- Problem: find a set of frequent tag patterns within a given set of posts (well studied in other domains, e.g. supermarket transactions)
- Solution: classical association rule learning algorithms (e.g. Apriori)
- Another approach: probabilistic learning by the EM algorithm (A. Plangprasopchok, K. Lerman, AAAI 2007)
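The frequent-pattern step can be sketched with a small level-wise (Apriori-style) miner. This is a generic textbook version, not necessarily the variant used in the paper; `frequent_tag_sets`, `min_support`, and `max_size` are hypothetical names, and the posts are toy data.

```python
from collections import Counter
from itertools import combinations


def frequent_tag_sets(posts, min_support, max_size=3):
    """Return {tag set: support} for tag sets found in >= min_support posts.

    posts: iterable of tag collections, one per post.
    Grows candidates one tag at a time and keeps a candidate only if all
    of its smaller subsets were frequent (the Apriori pruning rule).
    """
    posts = [frozenset(p) for p in posts]
    single_counts = Counter(tag for post in posts for tag in post)
    current = {
        frozenset([tag]): n for tag, n in single_counts.items() if n >= min_support
    }
    result = dict(current)
    size = 1
    while current and size < max_size:
        size += 1
        items = sorted({tag for tag_set in current for tag in tag_set})
        candidates = [
            frozenset(combo)
            for combo in combinations(items, size)
            if all(frozenset(sub) in current for sub in combinations(combo, size - 1))
        ]
        counts = Counter()
        for post in posts:
            for candidate in candidates:
                if candidate <= post:  # candidate tags all appear in the post
                    counts[candidate] += 1
        current = {c: n for c, n in counts.items() if n >= min_support}
        result.update(current)
    return result


posts = [
    {"python", "programming", "tutorial"},
    {"python", "programming"},
    {"python", "web"},
    {"food", "recipe"},
    {"food", "recipe", "cooking"},
]
topics = frequent_tag_sets(posts, min_support=2)
```

With a support threshold of 2, the pairs {python, programming} and {food, recipe} survive as multi-tag patterns, while tags that appear only once are pruned at the first level.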
12
Clustering (naive approach) (2)
- Step 6 is computationally intensive. A prefix tree implementation over the merged topics can reduce complexity
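The algorithm listing that "Step 6" refers to is not part of this transcript, so the following only illustrates the data structure itself: a prefix tree over sorted tag sets, which lets all stored topics contained in a post's tags be found without scanning every topic. The class name `TopicTrie` and its methods are hypothetical.

```python
class TopicTrie:
    """Prefix tree whose root-to-node paths are topics stored as sorted tags."""

    def __init__(self):
        self.children = {}
        self.is_topic = False

    def insert(self, tags):
        """Store one topic (a set of tags) along its sorted-tag path."""
        node = self
        for tag in sorted(tags):
            node = node.children.setdefault(tag, TopicTrie())
        node.is_topic = True

    def subsets_of(self, tags):
        """Return every stored topic whose tags all occur in `tags`."""
        ordered = sorted(set(tags))
        found = []

        def walk(node, start, path):
            if node.is_topic:
                found.append(tuple(path))
            # Only follow branches labelled with a tag present in the query.
            for i in range(start, len(ordered)):
                child = node.children.get(ordered[i])
                if child is not None:
                    walk(child, i + 1, path + [ordered[i]])

        walk(self, 0, [])
        return found


trie = TopicTrie()
trie.insert({"python", "programming"})
trie.insert({"python"})
trie.insert({"food", "recipe"})
matches = trie.subsets_of({"python", "programming", "web"})
```

Topics that share leading tags share a path in the tree, so a lookup touches only branches that match the query's tags.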
13
Indexing (3)
Kinds of queries executed by the system:
- For a given topic, a) list all URLs that contain this topic and b) list all users that are interested in this topic
- For given tags, list all topics containing the tags
- For a given URL, list all topics this URL belongs to
- For a given URL and topic, list all appropriate users
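The queries listed above can be served by a handful of in-memory inverted indexes. The sketch below is one possible layout, assuming topics are sets of tags and each index entry records a user who posted a URL under a topic; `TopicIndex` and its method names are hypothetical and not taken from the paper.

```python
from collections import defaultdict


class TopicIndex:
    """In-memory indexes answering the topic, tag, URL, and user queries."""

    def __init__(self):
        self.urls_by_topic = defaultdict(set)
        self.users_by_topic = defaultdict(set)
        self.topics_by_tag = defaultdict(set)
        self.topics_by_url = defaultdict(set)
        self.users_by_url_topic = defaultdict(set)

    def add(self, topic, user, url):
        """Record that `user` posted `url` under `topic` (a set of tags)."""
        topic = frozenset(topic)
        self.urls_by_topic[topic].add(url)
        self.users_by_topic[topic].add(user)
        self.topics_by_url[url].add(topic)
        self.users_by_url_topic[(url, topic)].add(user)
        for tag in topic:
            self.topics_by_tag[tag].add(topic)

    def urls_for_topic(self, topic):
        return set(self.urls_by_topic.get(frozenset(topic), ()))

    def users_for_topic(self, topic):
        return set(self.users_by_topic.get(frozenset(topic), ()))

    def topics_for_tags(self, tags):
        """Topics that contain every one of the given tags."""
        tag_list = list(tags)
        if not tag_list:
            return set()
        per_tag = [self.topics_by_tag.get(tag, set()) for tag in tag_list]
        return set.intersection(*per_tag)

    def topics_for_url(self, url):
        return set(self.topics_by_url.get(url, ()))

    def users_for_url_and_topic(self, url, topic):
        return set(self.users_by_url_topic.get((url, frozenset(topic)), ()))


index = TopicIndex()
index.add({"python", "programming"}, "alice", "u1")
index.add({"python", "programming"}, "bob", "u2")
index.add({"food", "recipe"}, "carol", "u1")
```

Each query is then a dictionary lookup, at the cost of storing every (topic, user, URL) association several times.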
14
Evaluation (1)
- Metrics: compare intra-topic with inter-topic similarity (cosine) to see how well the clusters are formed
- Tag-based topic clustering and similarity computation is simple, accurate, and computationally cost-effective, because the dimension of the term vector space is significantly reduced
- Topic clustering is also accurate because it is based on multiple co-occurring tags
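The intra-topic versus inter-topic comparison can be sketched as average pairwise cosine similarity within one cluster and across two clusters. Averaging over all pairs is an assumption made here for illustration; the slide does not state how the paper aggregates the similarities. The function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def intra_similarity(cluster):
    """Average cosine similarity over all pairs of URLs inside one topic."""
    return mean(cosine(a, b) for a, b in combinations(cluster, 2))


def inter_similarity(cluster_a, cluster_b):
    """Average cosine similarity over all cross-topic pairs of URLs."""
    return mean(cosine(a, b) for a, b in product(cluster_a, cluster_b))


# Toy URL vectors, invented for illustration.
programming = [{"python": 1.0, "code": 1.0}, {"python": 1.0, "tutorial": 1.0}]
cooking = [{"recipe": 1.0, "food": 1.0}, {"recipe": 1.0}]

intra = intra_similarity(programming)
inter = inter_similarity(programming, cooking)
```

Well-formed clusters should show intra-topic similarity clearly above inter-topic similarity, as in this toy case where the two topics share no terms.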
15
Evaluation (2)
- The topics discovered capture almost 90% of the interests of users
- To evaluate the quality of URL clusters, a review by 4 human editors was conducted
- Cluster sizes follow a power law distribution (a few hot topics on the internet capture a large number of users)
- Each topic usually contains no more than 5 tags
16
Conclusions
- The paper justifies the use of tags as more appropriate for representing user interest
- No information on the online or offline social connections among users was necessary
- The paper provides insight into document semantics (by comparing tags and keywords)
- The paper presents extensive statistical computations and graphical results, and can readily be characterized as a complete report
17
Any questions? Thank you for your attention!