Published by Michael Franklin. Modified over 8 years ago.
1
Topical Semantics of Twitter Links
March 23, 2011
In-seok An, SNU Internet Database Lab.
Michael J. Welch, Uri Schonfeld
Yahoo! Inc., UCLA Computer Science Dept.
WSDM '11
2
Outline
Introduction
  –Twitter
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
Conclusion
3
Introduction
Twitter
  –Microblogging site
  –10th worldwide in total traffic
  –28 million unique monthly visitors
  –Provider of information for breaking news events
4
Introduction
Simple graphical modeling for the Web
  –Text-based pages connected by hyperlinks (directed edges)
  –Applied to Twitter, it fails to capture all that the information has to offer
  –Produces less than ideal results
A rich graphical model for Twitter
  –Multiple semantic edges
    Follow, RT, Mention, List
  –Not all edges are created equal
In this paper
  –Web graph vs. Twitter graph
  –Follow link vs. Retweet link
5
Introduction: Twitter
Twitter
  –Blogging platform
    Maximum of 140 characters per post
    Micro-blogging platform
  –Multiple interfaces
    Web, SMS, mobile application, instant messaging, etc.
6
Introduction: Twitter
Dual role
  –Reader
    A user may choose to follow another user’s posts
      –Accessible via a private stream (timeline)
      –Sorted by their publication timestamp
    Friends / followers
  –Writer
    Posting messages
    Retweeting messages
    Replying to or mentioning other Twitter users
7
Introduction: Twitter
Mention
  –A user is referred to by their username prefixed with the character “@”
Retweet
  –A user chooses to repeat another user’s post
  –New style retweet
  –Old style retweet
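A minimal sketch of how mentions and old-style retweets could be pulled out of post text. The slide does not spell out the old-style retweet format; the code assumes the common convention of prefixing the repeated text with "RT @username". The function names are hypothetical.

```python
import re

# "@" followed by a username made of word characters (assumed username alphabet).
MENTION_RE = re.compile(r"@(\w+)")
# Assumed old-style retweet convention: the token "RT" followed by "@username".
OLD_STYLE_RT_RE = re.compile(r"\bRT\s+@(\w+)", re.IGNORECASE)


def extract_mentions(text):
    """Return every username referred to with an "@" prefix, in order."""
    return MENTION_RE.findall(text)


def old_style_retweet_source(text):
    """Return the retweeted username if the text looks like an old-style retweet."""
    match = OLD_STYLE_RT_RE.search(text)
    return match.group(1) if match else None
```

New-style retweets carry no such marker in the text, so they would have to be identified from metadata rather than from the post content.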
8
Introduction: Twitter
List
  –Added in late 2009
  –Allows users to construct and organize a group of users, referred to as a list
  –Helps a user focus on the posts of certain subsets of their friends
Two broad categories
  –Topical lists
    Centered around the discussion of common interests or subjects
    Example: “politics”
  –Classification lists
    Formed to group users who share a common trait
    Examples: “Celebrities”, “professional athletes”
  –Lists generate meaningful, manually created categorizations of users
9
Outline
Introduction
Modeling Twitter
  –The Full Twitter Graph Model
  –Additional Twitter Information
  –The Simplified Twitter Graph
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
Conclusion
10
Modeling Twitter
Web graph model
  –Nodes
    Web pages
  –Edges
    Hyperlinks connecting them
  –Enables the application of many graph analysis techniques
    Inlink & outlink distributions
    PageRank
N by N matrix M
  –The Web graph is commonly represented as a matrix
  –N is the number of pages on the Web
  –A non-zero entry M_ij represents a hyperlink from page i to page j
11
Modeling Twitter: The Full Twitter Graph Model
The Twitter graph is inherently more complex
  –At least two different types of entities (nodes)
    Users and Tweets
  –At least four types of relationships (edges)
    Follows, Publish, Retweets and Mentions
Twitter Graph Edges
  –Follow edge
    User a follows the posts of user b
  –Publish edge
    Authorship of the post
  –Retweet edge
    Post a is a retweet of post b
  –Mention edge
    Post a mentions user b
12
Modeling Twitter: The Full Twitter Graph Model
Matrix representation of the Twitter graph
  –Identical to the Web graph representation
  –(|U| + |P|) by (|U| + |P|) matrix
    |U| : the number of users
    |P| : the number of posts
  –A non-zero value in M_ij represents an edge between node i and node j
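The matrix above can be sketched in a few lines. This is an illustrative construction, not the authors' code: the function name, the dense list-of-lists layout, the 0/1 entries, and the user-to-post direction of publish edges are all assumptions. Because each edge type connects a different pair of node types, the four edge types fall into four different blocks of the matrix.

```python
def build_full_graph_matrix(users, posts, follows, publishes, retweets, mentions):
    """Return (M, index): a (|U|+|P|) x (|U|+|P|) 0/1 matrix and the node-to-row map.

    follows:   (user a, user b) pairs, a follows b
    publishes: (user, post) pairs, assumed direction author -> post
    retweets:  (post a, post b) pairs, a is a retweet of b
    mentions:  (post, user) pairs, post mentions user
    """
    # Users take rows/columns 0..|U|-1; posts take |U|..|U|+|P|-1.
    index = {("user", u): i for i, u in enumerate(users)}
    for j, p in enumerate(posts):
        index[("post", p)] = len(users) + j

    n = len(users) + len(posts)
    M = [[0] * n for _ in range(n)]

    for a, b in follows:      # user-user block
        M[index[("user", a)]][index[("user", b)]] = 1
    for u, p in publishes:    # user-post block
        M[index[("user", u)]][index[("post", p)]] = 1
    for a, b in retweets:     # post-post block
        M[index[("post", a)]][index[("post", b)]] = 1
    for p, u in mentions:     # post-user block
        M[index[("post", p)]][index[("user", u)]] = 1
    return M, index
```

A dense matrix is only workable for toy inputs; at the scale of a real crawl a sparse edge list would be the practical choice.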
13
Modeling Twitter: Additional Twitter Information
Time
  –Twitter includes timestamp information
    When each post was written
    When accounts were created
  –When a follow link was created
    No explicit way to determine
    Can be approximated with repeated crawling
  –Valuable for studying factors such as
    Evolution of the graph
    Charting popularity over time
14
Modeling Twitter: Additional Twitter Information
Hyperlinks
  –Standard hyperlinks embedded in the posts
  –Third node type
    Web page
    Uniquely identified by a URL
  –Difficulty modeling hyperlinks in Twitter
    Common use of URL shortening services
      –TinyURL and bit.ly
    Prevents making direct use of keywords or other interesting artifacts the URL may contain
    Makes additional processing of the data necessary
15
Modeling Twitter: Additional Twitter Information
Post Content
  –Use the content of a post
    To extract metadata
      –User name mentions
      –Identification of retweets
  –Remaining textual content of a post
    Can also help determine the topics of interest to a user
  –Difficulties
    Small size of the posts
      –Sparsity of data
      –Sparsity of tokens
    Frequent use of nonstandard shorthand notation
16
Modeling Twitter: The Simplified Twitter Graph
Simplified Twitter Graph
  –Only includes user nodes
  –Still captures the most important information
    From the original representation as it pertains to the users
  –The user-user follow links remain
    As they are in the Full Twitter graph
  –Add retweet edges to the simplified Twitter Graph
    If user a retweets user b at least one time
      –There is a retweet edge from user a to user b
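The reduction from the full graph to the simplified graph can be sketched as follows. The function name and the tuple-based edge lists are assumptions made for illustration; the rule itself (one retweet edge from a to b if a retweeted b at least once) is the one stated on the slide.

```python
def simplify_twitter_graph(follows, publishes, retweets):
    """Collapse the full graph to user nodes only.

    follows:   (user a, user b) pairs, a follows b
    publishes: (user, post) pairs giving the author of each post
    retweets:  (post a, post b) pairs, a is a retweet of b
    Returns (follow_edges, retweet_edges) as sets of (user, user) pairs.
    """
    author = {post: user for user, post in publishes}

    retweet_edges = set()
    for post_a, post_b in retweets:
        user_a = author.get(post_a)
        user_b = author.get(post_b)
        # Skip retweets whose authors are unknown (for example, outside the crawl).
        if user_a is not None and user_b is not None:
            # A set keeps a single edge no matter how often a retweets b.
            retweet_edges.add((user_a, user_b))

    # Follow edges are carried over unchanged.
    return set(follows), retweet_edges
```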
17
Outline
Introduction
Modeling Twitter
Analysis of The Graph
  –Link Distributions
  –Graph Formation
Exploring Link Semantics
Experiments on Link Semantics
Conclusion
18
Analysis of The Graph
Data specification
  –Collected between October 2009 and January 2010
  –1.1 million Twitter users
  –More than 273 million follow edges
  –2.9 million retweet edges
Crawling method
  –Beginning with an initial seed set of the top 1000 users on twitterholic.com
  –Crawling in a BFS manner
  –Traversing the follow links in a forward direction
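The crawling method can be sketched as a breadth-first traversal over forward follow links. This is a minimal illustration rather than the crawler used for the data set: `get_friends` is a hypothetical callable that returns the users a given user follows, and `max_users` is an assumed stopping condition.

```python
from collections import deque


def bfs_crawl(seeds, get_friends, max_users):
    """Crawl follow links breadth-first, starting from the seed users.

    get_friends(u) is assumed to return the users that u follows.
    Returns (visited users, list of (follower, followed) edges).
    """
    seen = set()
    queue = deque()
    for s in seeds:
        if s not in seen and len(seen) < max_users:
            seen.add(s)
            queue.append(s)

    follow_edges = []
    while queue:
        u = queue.popleft()
        for friend in get_friends(u):
            # Forward direction: from the follower to the user being followed.
            follow_edges.append((u, friend))
            if friend not in seen and len(seen) < max_users:
                seen.add(friend)
                queue.append(friend)
    return seen, follow_edges
```

Usage with an in-memory graph standing in for the live service: `bfs_crawl(["s"], lambda u: graph.get(u, []), 1000)`.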
19
Analysis of The Graph: Link Distributions
Follow Edges
  –Power-law distribution
  –Two abnormal spikes in the outlink distribution
    20 friends
      –Twitter provides an initial set of 20 “recommended” users to follow
    2000 friends
      –Twitter places restrictions on following more than 2000 users
20
Analysis of The Graph: Link Distributions
Retweet Edges
  –Retweet Inlink
    Power-law distribution
  –Retweet Outlink
    Does not follow a power-law distribution
  –While the number of friends one has is generally power-law, the number of users one finds truly interesting does not appear to scale in a similar fashion
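Inlink and outlink distributions like the ones discussed here can be tabulated from an edge list. The sketch below only counts how many users have each degree; fitting or testing a power law is a separate step that it does not attempt. The function name is hypothetical.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_hist, out_hist) for a list of directed (source, target) edges.

    Each histogram maps a degree d to the number of nodes with that degree.
    Nodes with degree zero never appear in the edge list on that side,
    so they are not counted.
    """
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    in_hist = Counter(in_degree.values())
    out_hist = Counter(out_degree.values())
    return in_hist, out_hist
```

Plotting degree against count on log-log axes is the usual way to eyeball whether a distribution is roughly power-law, and spikes such as the ones at 20 and 2000 friends would show up as isolated outliers.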
21
Analysis of The Graph: Link Distributions
Posting Frequency
  –417,613 users who published at least one tweet
  –Most recent 200 posts per user
  –58,000 users published only a single post during the month
  –A large number of users wrote more than 100 posts
22
Analysis of The Graph: Graph Formation
Readers and Writers
  –Three potential scenarios
    A user acts primarily as a reader
      –Few or no posts
    A user frequently retweets posts
      –Writes little to no original content
    A user contributes significant new content
  –User’s reading and writing behavior
    Each dot : a unique user
    X-axis : # of posts published by friends
    Y-axis : # of posts published by the user
    Shade : originality
      –Lighter shades indicate less originality
    Size : PageRank of each user (based on follow edges)
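The slide does not define how originality is measured. One plausible reading, used here purely as an assumption, is the fraction of a user's posts that are not retweets; the function name and the boolean input format are hypothetical.

```python
def originality(is_retweet_flags):
    """Fraction of a user's posts that are original (assumed definition).

    is_retweet_flags: one boolean per post, True when the post is a retweet.
    Returns None for a user with no posts, since the ratio is undefined.
    """
    total = len(is_retweet_flags)
    if total == 0:
        return None
    original = sum(1 for flag in is_retweet_flags if not flag)
    return original / total
```

Under this reading, a pure reader has no score, a user who mostly retweets scores near 0, and a user who contributes significant new content scores near 1.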
23
Analysis of The Graph: Graph Formation
General trend
  –For users who post very frequently
    A larger fraction of their posts are actually retweets
  –Many users retweeted at least one post which they did not read from one of their friends
    Despite the explicit friendship links available in the site structure, it is still not possible to know exactly what a user reads
  –Many websites are adding modules which display Twitter results
24
Outline
Introduction
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
  –Retweet vs. Follow based Ranking
  –Link Virality
Experiments on Link Semantics
Conclusion
25
Exploring Link Semantics
Web graph
  –A link from page a to page b
    Endorsement of the quality of page b
    To some extent, an indication of its relevance to page a
Twitter graph
  –Follow link
    Endorsement of quality or interest
    The actual semantics of the link
      –User a, acting as a reader, is interested in user b acting as a writer
  –Retweet link
    Endorsement of quality
      –The user is interested in the topic
      –The user expects their readers to be interested in this post
    A retweet edge signifies a connection from user a as a writer to user b as a writer
26
Exploring Link Semantics: Retweet vs. Follow based Ranking
PageRank based on two edges
  –Retweet-based
    Simple power-law distribution
  –Follow-based
    Two different segments with different power-law coefficients
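The two rankings compared here come from running the same PageRank computation over different edge sets. Below is a minimal power-iteration sketch under stated assumptions: the damping factor of 0.85 and the fixed iteration count are conventional choices that the slides do not specify, and rank held by users with no outgoing edges is spread uniformly.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges.

    damping and iterations are assumed defaults, not values from the slides.
    Returns a dict mapping each node to its rank; ranks sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank sitting on nodes without outlinks is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for w in out[v]:
                    new_rank[w] += share
        rank = new_rank
    return rank
```

Calling `pagerank(follow_edges)` and `pagerank(retweet_edges)` on the simplified graph yields the follow-based and retweet-based rankings respectively.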
27
Exploring Link Semantics: Retweet vs. Follow based Ranking
PageRank over Retweet links vs. Follow links
  –Follow links
    Twitter recommended celebrities (barackobama)
      –Rich get richer phenomenon
    The top-ranked user has a lower rank in RT-based PageRank
  –Retweet links
    Tweetmeme
      –Social bookmarking site
    The top-ranked user has a lower rank in Follow-based PageRank
28
Exploring Link Semantics: Retweet vs. Follow based Ranking
Follow-based
  –Public figures or celebrities
Retweet-based
  –News generating entities
Aplusk is the only user who appears in the top 10 for both rankings
These ranks can be affected by spam or marketing techniques
  –ddlovatoRT simply retweets all posts mentioning Demi Lovato
  –Twitter’s research team estimates that less than 1% of Tweets are now spam
29
Exploring Link Semantics: Link Virality
Retweet Virality and Follow Virality are defined over the following sets
  –RoF(u) : the set of users from whom u has seen at least one post via a retweet
  –FoF(u) : the set of all users who are reachable from u by traversing exactly two directed follow edges
  –Fr(u) : the set of users whom user u follows
Retweet Virality is consistently higher than Follow Virality
  –Retweets demonstrate a stronger notion of importance or influence to users
  –Users are more likely to follow people they see retweeted than those who are merely “Friends of Friends”
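The formulas for the two virality measures are not legible in this transcript. One reading that is consistent with the set definitions and with the stated conclusion is the fraction of each candidate set that u actually follows, that is |RoF(u) ∩ Fr(u)| / |RoF(u)| and |FoF(u) ∩ Fr(u)| / |FoF(u)|. The sketch below uses that reading as an assumption. It also assumes that u sees a retweet whenever someone u follows has retweeted another user, and it leaves u itself out of both candidate sets.

```python
def virality(u, follow_edges, retweet_edges):
    """Return (retweet_virality, follow_virality) for user u (assumed definitions).

    follow_edges:  (a, b) pairs, a follows b
    retweet_edges: (a, b) pairs, a has retweeted b at least once
    A measure is None when its candidate set is empty.
    """
    follows, retweeted = {}, {}
    for a, b in follow_edges:
        follows.setdefault(a, set()).add(b)
    for a, b in retweet_edges:
        retweeted.setdefault(a, set()).add(b)

    fr = follows.get(u, set())  # Fr(u)
    fof, rof = set(), set()
    for friend in fr:
        fof |= follows.get(friend, set())    # two follow hops from u
        rof |= retweeted.get(friend, set())  # authors retweeted by a friend
    fof.discard(u)
    rof.discard(u)

    def followed_fraction(candidates):
        return len(candidates & fr) / len(candidates) if candidates else None

    return followed_fraction(rof), followed_fraction(fof)
```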
30
Outline
Introduction
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
  –Empirical Results
  –Topic Sensitive PageRank
Conclusion
31
Experiments on Link Semantics
Topical relevance
  –Follow links quickly diffuse into a broad range of topics
  –Retweet links remain more concentrated on the original topic
Data
  –1.1 million users
  –273 million follow edges
  –2.9 million retweet edges
32
Experiments on Link Semantics: Empirical Results
Empirical evaluation
  –Starting from a seed set of users
    Members of the same topical list
      –Photography and design
  –Generate two sets of users
    Users followed by at least one seed member
    Users with at least one post retweeted by a seed member
  –Random sample of 25 users from each of these sets
  –Manually assessed them for topical relevance
Result
  –The follow-generated samples contained 4 and 5 relevant users (out of 25)
  –The retweet-generated samples contained 19 and 20 relevant users (out of 25)
33
Experiments on Link Semantics: Topic Sensitive PageRank
PageRank
  –Recursive ranking formula
  –A page is as important as the pages pointing to it
Topic Sensitive PageRank (TSPR) [1]
  –Biased PageRank
    Generates query-specific importance scores for pages at query time
  –Used here to quantify the difference in topical relevance carried by follow and retweet links
[1] T. H. Haveliwala. Topic-sensitive PageRank. WWW 2002.
34
Experiments on Link Semantics: Topic Sensitive PageRank
Experiments
  –Beginning with a topical Twitter list
  –Compute topic sensitive PageRank for
    Follow edges
    Retweet edges
  –If the links carry the topicality well
    The high-ranking users are likely to be topically relevant to the original seed topic
  –Evaluate the resulting highest ranked users for relevance to the original topic with a user survey
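The experiment can be sketched as a PageRank whose random jumps land only on the seed users of a topical list. This is an illustrative personalized PageRank, not the authors' implementation: the damping factor, the iteration count, the uniform jump distribution over seeds, and sending rank from users without outgoing edges back to the seeds are all assumptions.

```python
def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=50):
    """PageRank biased toward a seed set; returns a dict of node -> rank."""
    seeds = set(seeds)
    nodes = {n for edge in edges for n in edge} | seeds
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)

    # Random jumps go to seed users only, uniformly (assumed).
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank on nodes without outlinks is returned to the seeds.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new_rank = {v: ((1.0 - damping) + damping * dangling) * teleport[v]
                    for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for w in out[v]:
                    new_rank[w] += share
        rank = new_rank
    return rank


def top_non_seed(rank, seeds, k):
    """Return the k highest-ranked users that are not seeds (ties broken by name)."""
    seeds = set(seeds)
    candidates = [v for v in rank if v not in seeds]
    return sorted(candidates, key=lambda v: (-rank[v], v))[:k]
```

Running this once over follow edges and once over retweet edges, with the same seeds, gives the two candidate lists whose topical relevance is then judged by survey participants.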
35
Experiments on Link Semantics: Topic Sensitive PageRank
Experimental Setup
  –Collected 9 topical lists from listorious.com
    19 to 437 users per list
      –Average 155, median 49
    Seed users have 14,284 followers on average
  –Compute personalized PageRank
  –Selected the 30 highest ranking non-seed users
  –Conduct a survey
    Participants were shown a topic description and the 30 highest ranked users for either a follow-based or a retweet-based PageRank
    Ordered randomly
    Mixed with a random set of 10 of the seed users for that topic
    Participants made a binary judgment of each user’s relevance
    A total of 12 people participated in the survey
    Each list was evaluated by at least 2 people
36
Experiments on Link Semantics: Topic Sensitive PageRank
Accuracy of the highly ranked users
  –Precision
    The average relevancy of a set of users
  –Relevance
    The fraction of users who were judged relevant by at least one survey taker
  –Both are computed from the set of users from U judged relevant in evaluation k of a particular list
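The exact formulas are not legible in this transcript. One reading consistent with the verbal definitions is sketched below, writing R_k(U) (an assumed notation) for the set of users from U judged relevant in evaluation k: precision averages |R_k(U)| / |U| over the evaluations of a list, and relevance is the fraction of U contained in the union of all R_k(U). Function names are hypothetical.

```python
def precision(users, judged_relevant):
    """Average, over evaluations, of the fraction of users judged relevant.

    users:           the evaluated set U
    judged_relevant: one set per evaluation k, the users judged relevant in it
    """
    users = set(users)
    fractions = [len(r & users) / len(users) for r in judged_relevant]
    return sum(fractions) / len(fractions)


def relevance(users, judged_relevant):
    """Fraction of users judged relevant by at least one survey taker."""
    users = set(users)
    relevant_to_anyone = set().union(*judged_relevant) & users
    return len(relevant_to_anyone) / len(users)
```

For example, with U = {a, b, c, d} and two evaluations marking {a, b} and {b, c} as relevant, this reading gives a precision of 0.5 and a relevance of 0.75.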
37
Experiments on Link Semantics: Topic Sensitive PageRank
Result
  –Precision can be improved by simply using retweet links instead of follow links
    Precision of the top ranked users improved by over 30%
38
Experiments on Link Semantics: Topic Sensitive PageRank
Cohesiveness of Seed
  –To verify the seed users
    Include 10 randomly selected seed users in each evaluation
Result
  –Average Precision : 0.931
    Minimum of 0.838
    Maximum of 1.0
  –The seed users represented their topics well
  –Our survey takers understood and agreed upon the topic definitions
39
Conclusion
We have described a detailed model of Twitter as a graph
  –Key statistics about the graph
  –Provided some initial insights as to how the graph forms
Important distinctions between edge types in the graph
  –Follow and retweet
  –The varying semantics and properties of these edges will have significant implications for graph algorithms such as PageRank
  –Retweet edges preserve topical relevance
    Better than follow edges