Topical Semantics of Twitter Links MARCH 23, 2011 In-seok An SNU Internet Database Lab. Michael J. Welch, Uri Schonfeld Yahoo! Inc., UCLA Computer Science.


Topical Semantics of Twitter Links
Michael J. Welch, Uri Schonfeld (Yahoo! Inc., UCLA Computer Science Dept.), WSDM'11
In-seok An, SNU Internet Database Lab., March 23, 2011

Outline
• Introduction
  – Twitter
• Modeling Twitter
• Analysis of the Graph
• Exploring Link Semantics
• Experiments on Link Semantics
• Conclusion

Introduction
• Twitter
  – Microblogging site
  – Ranked 10th worldwide in total traffic
  – 28 million unique monthly visitors
  – Provider of information for breaking news events

Introduction
• Simple graph model of the Web
  – Text-based pages connected by hyperlinks (directed edges)
  – Applied to Twitter, it fails to capture all that this information has to offer
  – Produces less than ideal results
• A rich graph model for Twitter
  – Multiple semantic edges
    • Follow, RT, Mention, List
  – Not all edges are created equal
• In this paper
  – Web graph vs. Twitter graph
  – Follow link vs. Retweet link

Introduction: Twitter
• Twitter
  – Blogging platform
    • Maximum of 140 characters
    • Micro-blogging platform
  – Multiple interfaces
    • Web, SMS, mobile application, instant messaging, etc.

Introduction: Twitter
• Dual role
  – Reader
    • A user may choose to follow another user’s posts
      – Accessible via a private stream (timeline)
      – Sorted by publication timestamp
    • Friends / followers
  – Writer
    • Posting messages
    • Retweeting messages
    • Replying to or mentioning other Twitter users

Introduction: Twitter
• Mention
  – A user is referred to by their username prefixed with the "@" character
• Retweet
  – A user chooses to repeat another user’s post
  – New-style retweet
  – Old-style retweet

Introduction: Twitter
• List
  – Added in late 2009
  – Allows users to construct and organize a group of users, referred to as a list
  – Helps a user focus on the posts of certain subsets of their friends
• Two broad categories
  – Topical lists
    • Centered around the discussion of common interests or subjects
    • Example: “politics”
  – Classification lists
    • Formed to group users who share a common trait
    • Examples: “Celebrities”, “professional athletes”
  – Lists generate meaningful, manually created categorizations of users

Outline
• Introduction
• Modeling Twitter
  – The Full Twitter Graph Model
  – Additional Twitter Information
  – The Simplified Twitter Graph
• Analysis of the Graph
• Exploring Link Semantics
• Experiments on Link Semantics
• Conclusion

Modeling Twitter
• Web graph model
  – Nodes
    • Web pages
  – Edges
    • Hyperlinks connecting them
  – Enables the application of many graph analysis techniques
    • Inlink and outlink distributions
    • PageRank
• N by N matrix M
  – The Web graph is commonly represented as a matrix
  – N is the number of pages on the Web
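As a rough illustration of the matrix view on this slide, the sketch below builds an N by N matrix from a small, made-up list of hyperlinks. The page names, the function names, and the convention that M[i][j] = 1 means "page i links to page j" are all assumptions for illustration; the slide's own formula is not available here.

```python
def adjacency_matrix(pages, links):
    """Return an N x N 0/1 matrix; M[i][j] = 1 when page i links to page j (assumed convention)."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in links:
        matrix[index[src]][index[dst]] = 1
    return matrix


def outlink_counts(matrix):
    """Number of outgoing links per page (row sums)."""
    return [sum(row) for row in matrix]


def inlink_counts(matrix):
    """Number of incoming links per page (column sums)."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix))]


# Hypothetical three-page Web.
pages = ["a", "b", "c"]
links = [("a", "b"), ("a", "c"), ("b", "c")]
M = adjacency_matrix(pages, links)
```

Row sums and column sums of such a matrix give the outlink and inlink counts that the distribution analysis later in the talk relies on.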

Modeling Twitter: The Full Twitter Graph Model
• The Twitter graph is inherently more complex
  – At least two different types of entities (nodes)
    • Users and tweets
  – At least four types of relationships (edges)
    • Follow, Publish, Retweet and Mention
• Twitter graph edges
  – Follow edge
    • User a follows the posts of user b
  – Publish edge
    • Authorship of the post
  – Retweet edge
    • Post a is a retweet of post b
  – Mention edge
    • Post a mentions user b

Modeling Twitter: The Full Twitter Graph Model
• Matrix representation of the Twitter graph
  – Same form as the Web graph matrix
  – (|U| + |P|) by (|U| + |P|) matrix
    • |U|: the number of users
    • |P|: the number of posts
  – A non-zero value in entry (i, j)
    • Represents an edge between node i and node j
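One way to picture the full graph matrix is sketched below: users occupy the first |U| rows and columns, posts the remaining |P|, and each non-zero entry records an edge. The integer codes for the four edge types, the node-tuple naming, and the direction of the publish edge (user to post) are assumptions made for this example, not details taken from the slides.

```python
# Hypothetical integer codes for the four edge types named on the slide.
FOLLOW, PUBLISH, RETWEET, MENTION = 1, 2, 3, 4


def full_twitter_matrix(users, posts, edges):
    """Build a (|U|+|P|) x (|U|+|P|) matrix; a non-zero entry (i, j) is an edge i -> j."""
    nodes = [("user", u) for u in users] + [("post", p) for p in posts]
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for edge_type, src, dst in edges:
        matrix[index[src]][index[dst]] = edge_type
    return matrix, index


users = ["alice", "bob"]
posts = ["p1", "p2"]
edges = [
    (FOLLOW, ("user", "alice"), ("user", "bob")),   # alice follows bob
    (PUBLISH, ("user", "bob"), ("post", "p1")),     # bob wrote p1 (direction assumed)
    (PUBLISH, ("user", "alice"), ("post", "p2")),   # alice wrote p2
    (RETWEET, ("post", "p2"), ("post", "p1")),      # p2 is a retweet of p1
    (MENTION, ("post", "p2"), ("user", "bob")),     # p2 mentions bob
]
matrix, index = full_twitter_matrix(users, posts, edges)
```

Storing the edge type as the matrix value keeps everything in one matrix; a separate matrix per edge type would be an equally reasonable design.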

Modeling Twitter: Additional Twitter Information
• Time
  – Twitter includes timestamp information
    • When each post was written
    • When accounts were created
  – When a follow link was created
    • No explicit way to determine this
    • Can be approximated with repeated crawling
  – Valuable for studying factors such as
    • Evolution of the graph
    • Charting popularity over time

Modeling Twitter: Additional Twitter Information
• Hyperlinks
  – Standard hyperlinks embedded in the posts
  – Third node type
    • Web page
    • Uniquely identified by a URL
  – Difficulty modeling hyperlinks in Twitter
    • Common use of URL shortening services
      – TinyURL and bit.ly
    • Prevents direct use of keywords or other interesting artifacts the URL may contain
    • Makes additional processing of the data necessary

Modeling Twitter: Additional Twitter Information
• Post content
  – Use the content of a post
    • To extract metadata
      – Username mentions
      – Identification of retweets
    • Remaining textual content of a post
      – Can also help determine the topics of interest to a user
  – Difficulties
    • Small size of the posts
      – Sparsity of data
      – Sparsity of tokens
    • Frequent use of nonstandard shorthand notation
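The metadata extraction described on this slide could be sketched as follows. The regular expressions assume the common conventions of "@username" for mentions and a leading "RT @username" for old-style retweets; the function name and the returned dictionary layout are hypothetical, and real posts would need more careful handling (for example, e-mail addresses or unusual retweet formats).

```python
import re

# Assumed conventions: "@name" marks a mention, a leading "RT @name" marks an old-style retweet.
MENTION_RE = re.compile(r"@(\w+)")
OLD_STYLE_RT_RE = re.compile(r"^\s*RT\s+@(\w+)", re.IGNORECASE)


def parse_post(text):
    """Split a post into retweet source, mentioned users, and remaining word tokens."""
    rt = OLD_STYLE_RT_RE.match(text)
    retweet_of = rt.group(1) if rt else None
    body = text[rt.end():] if rt else text
    mentions = MENTION_RE.findall(body)
    # Drop the mentions, then keep simple lowercase word tokens from what is left.
    tokens = re.findall(r"[a-z0-9']+", MENTION_RE.sub(" ", body).lower())
    return {"retweet_of": retweet_of, "mentions": mentions, "tokens": tokens}


example = parse_post("RT @bob: great photo by @carol")
```

With only a handful of tokens left per post, the sparsity problem noted on the slide becomes easy to see.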

Modeling Twitter: The Simplified Twitter Graph
• Simplified Twitter graph
  – Only includes user nodes
  – Still captures the most important information
    • From the original representation, as it pertains to the users
  – The user-user follow links remain
    • As they are in the full Twitter graph
  – Add retweet edges to the simplified Twitter graph
    • If user a retweets user b at least once
      – There is a retweet edge from user a to user b
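The collapse from the full graph to the user-only graph can be sketched as below. The input shapes (a set of follow pairs, a post-to-author mapping standing in for publish edges, and a list of post-level retweet pairs) are assumptions chosen for the example.

```python
def simplify(follows, author_of, retweet_posts):
    """Collapse post-level retweet edges into user-level edges.

    follows:       set of (a, b) pairs, user a follows user b
    author_of:     dict mapping post id -> authoring user (the publish edges)
    retweet_posts: iterable of (post_a, post_b), post_a is a retweet of post_b
    """
    retweet_edges = set()
    for post_a, post_b in retweet_posts:
        # One user-level edge no matter how many times a retweets b.
        retweet_edges.add((author_of[post_a], author_of[post_b]))
    return {"follow": set(follows), "retweet": retweet_edges}


follows = {("alice", "bob"), ("carol", "bob")}
author_of = {"p1": "bob", "p2": "alice", "p3": "alice", "p4": "bob"}
retweet_posts = [("p2", "p1"), ("p3", "p4")]  # alice retweets bob twice
graph = simplify(follows, author_of, retweet_posts)
```

Using a set for the retweet edges reflects the "at least once" rule: repeated retweets of the same user add no extra edge.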

Outline
• Introduction
• Modeling Twitter
• Analysis of the Graph
  – Link Distributions
  – Graph Formation
• Exploring Link Semantics
• Experiments on Link Semantics
• Conclusion

Analysis of the Graph
• Data specification
  – Collected between October 2009 and January 2010
  – 1.1 million Twitter users
  – More than 273 million follow edges
  – 2.9 million retweet edges
• Crawling method
  – Begins with an initial seed set of the top 1000 users on twitterholic.com
  – Crawls in a breadth-first search (BFS) manner
  – Traverses the follow links in the forward direction
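The crawling method on this slide amounts to a breadth-first traversal over follow links. A minimal sketch follows; `get_friends` is a hypothetical callback standing in for whatever fetches the list of users someone follows (it is not a real Twitter API call), and `max_users` is an assumed stopping rule.

```python
from collections import deque


def bfs_crawl(seeds, get_friends, max_users):
    """Breadth-first crawl over follow links in the forward direction."""
    queue = deque(dict.fromkeys(seeds))  # de-duplicate seeds, keep their order
    visited = set(queue)
    crawled = []        # users in the order they were crawled
    follow_edges = []   # (follower, followed) pairs discovered so far
    while queue and len(crawled) < max_users:
        user = queue.popleft()
        crawled.append(user)
        for friend in get_friends(user):
            follow_edges.append((user, friend))
            if friend not in visited:
                visited.add(friend)
                queue.append(friend)
    return crawled, follow_edges


# Toy follow graph used in place of a live service.
toy_graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
crawled, follow_edges = bfs_crawl(["a"], lambda u: toy_graph.get(u, []), max_users=10)
```

Because the crawl only walks outgoing follow links, users whom nobody in the reachable set follows are never discovered.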

Analysis of the Graph: Link Distributions
• Follow edges
  – Power-law distribution
  – Two abnormal spikes in the outlink distribution
    • At 20 friends
      – Twitter provides an initial set of 20 “recommended” users to follow
    • At 2000 friends
      – Due to the restrictions Twitter places on following more than 2000 users

Analysis of the Graph: Link Distributions
• Retweet edges
  – Retweet inlinks
    • Power-law distribution
  – Retweet outlinks
    • Do not follow a power-law distribution
      – While the number of friends one has is generally power-law distributed, the number of users one finds truly interesting does not appear to scale in a similar fashion
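The link distributions discussed on these slides can be computed from an edge list along the following lines. The least-squares fit on log-log axes is only a crude, assumed way to eyeball a power-law exponent; it is not presented as the method used in the study.

```python
import math
from collections import Counter


def degree_histogram(edges, use_inlinks=True):
    """Map degree -> number of users with that degree (users with degree 0 are omitted)."""
    per_user = Counter(dst if use_inlinks else src for src, dst in edges)
    return Counter(per_user.values())


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree); needs at least two degrees."""
    points = [(math.log(k), math.log(c)) for k, c in histogram.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


edges = [("a", "c"), ("b", "c"), ("a", "b")]
in_hist = degree_histogram(edges, use_inlinks=True)
out_hist = degree_histogram(edges, use_inlinks=False)
```

A distribution that is close to a straight line on log-log axes is consistent with a power law, while spikes such as those at 20 and 2000 friends show up as points far above the fitted line.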

Analysis of the Graph: Link Distributions
• Posting frequency
  – 417,613 users published at least one tweet
  – Most recent 200 posts per user
  – 58,000 users published only a single post during the month
  – A large number of users wrote more than 100 posts

Analysis of the Graph: Graph Formation
• Readers and writers
  – Three potential scenarios
    • A user acts primarily as a reader
      – Few or no posts
    • A user frequently retweets posts
      – Writes little to no original content
    • A user contributes significant new content
  – Plot of users’ reading and writing behavior
    • Each dot: a unique user
    • X-axis: number of posts published by the user’s friends
    • Y-axis: number of posts published by the user
    • Shade: originality
      – Lighter shades indicate less originality
    • Size: PageRank of each user (based on follow edges)

Analysis of the Graph: Graph Formation
• General trend
  – For users who post very frequently
    • A larger fraction of their posts are actually retweets
  – Many users retweeted at least one post which they did not read from one of their friends
    • Despite the explicit friendship links available in the site structure, it is still not possible to know exactly what a user reads
      – Many websites are adding modules which display Twitter results

Outline
• Introduction
• Modeling Twitter
• Analysis of the Graph
• Exploring Link Semantics
  – Retweet vs. Follow based Ranking
  – Link Virality
• Experiments on Link Semantics
• Conclusion

Exploring Link Semantics
• Web graph
  – A link from page a to page b
    • Endorsement of the quality of page b
    • Indication of its relevance to page a
• Twitter graph
  – Follow link
    • Endorsement of quality or interest
    • The actual semantics of the link
      – User a, acting as a reader, is interested in user b acting as a writer
  – Retweet link
    • Endorsement of quality
      – The user is interested in the topic
      – The user expects their readers to be interested in this post
    • A retweet edge signifies a connection from user a as a writer to user b as a writer

Exploring Link Semantics: Retweet vs. Follow based Ranking
• PageRank based on two edge types
  – Retweet-based
    • Simple power-law distribution
  – Follow-based
    • Two different segments with different power-law coefficients

Exploring Link Semantics: Retweet vs. Follow based Ranking
• PageRank over retweet links vs. follow links
  – Follow links
    • Twitter-recommended celebrities (barackobama)
      – Rich-get-richer phenomenon
    • Top-ranked users rank lower under retweet-based PageRank
  – Retweet links
    • Tweetmeme
      – Social bookmarking site
    • Top-ranked users rank lower under follow-based PageRank

Exploring Link Semantics: Retweet vs. Follow based Ranking
• Follow-based
  – Public figures or celebrities
• Retweet-based
  – News-generating entities
• aplusk is the only user who appears in the top 10 for both rankings
• These ranks can be affected by spam or marketing techniques
  – ddlovatoRT simply retweets all posts mentioning Demi Lovato
  – Twitter’s research team estimates that less than 1% of tweets are now spam

Exploring Link Semantics: Link Virality
• Retweet Virality and Follow Virality
  – Defined in terms of
    • RoF(u): the set of users from whom u has seen at least one post via a retweet
    • FoF(u): the set of all users who are reachable by traversing exactly two directed follow edges
    • Fr(u): the set of users whom user u follows
• Retweet Virality is consistently higher than Follow Virality
  – Retweets demonstrate a stronger notion of importance or influence to users
  – Users are more likely to follow people they see retweeted than those who are merely “Friends of Friends”
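The virality formulas themselves are not available in this transcript. The sketch below shows one assumed reading that is consistent with the set definitions and with the conclusion drawn on the slide: virality as the fraction of a candidate set that user u actually follows, that is, |RoF(u) ∩ Fr(u)| / |RoF(u)| and |FoF(u) ∩ Fr(u)| / |FoF(u)|. The ratios, the function names, and the choice to exclude u from both candidate sets are assumptions, not the authors' confirmed definitions.

```python
def friends_of_friends(fr, u):
    """FoF(u): users reachable over two directed follow edges (u itself excluded, by assumption)."""
    return {w for f in fr.get(u, set()) for w in fr.get(f, set())} - {u}


def retweeted_by_friends(fr, retweets, u):
    """RoF(u): users whose posts u has seen because one of u's friends retweeted them."""
    return {w for f in fr.get(u, set()) for w in retweets.get(f, set())} - {u}


def virality(candidates, friends):
    """Assumed definition: share of the candidate set that the user follows."""
    return len(candidates & friends) / len(candidates) if candidates else 0.0


# fr[x] is the set of users x follows; retweets[x] is the set of users x has retweeted.
fr = {"u": {"a", "b", "c"}, "a": {"c", "d"}, "b": {"d", "e"}, "c": set()}
retweets = {"a": {"c"}, "b": {"x"}}

follow_virality = virality(friends_of_friends(fr, "u"), fr["u"])
retweet_virality = virality(retweeted_by_friends(fr, retweets, "u"), fr["u"])
```

In this toy data the retweet-based ratio is higher than the follow-based one, which mirrors the direction of the slide's finding but is of course not evidence for it.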

Outline
• Introduction
• Modeling Twitter
• Analysis of the Graph
• Exploring Link Semantics
• Experiments on Link Semantics
  – Empirical Results
  – Topic Sensitive PageRank
• Conclusion

Experiments on Link Semantics
• Topical relevance
  – Follow links quickly diffuse into a broad range of topics
  – Retweet links remain more concentrated on the original topic
• Data
  – 1.1 million users
  – 273 million follow edges
  – 2.9 million retweet edges

Experiments on Link Semantics: Empirical Results
• Empirical evaluation
  – Start from a seed set of users
    • Members of the same topical list
      – Photography and design
  – Generate two sets of users
    • Users followed by at least one seed member
    • Users with at least one post retweeted by a seed member
  – Take a random sample of 25 users from each of these sets
  – Manually assess them for topical relevance
• Result
  – The follow-generated samples contained 4 and 5 relevant users
  – The retweet-generated samples contained 19 and 20 relevant users
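The procedure on this slide (build a follow-generated and a retweet-generated candidate set from the seeds, then sample 25 users from each for manual review) might be sketched as follows. The dictionary inputs, the function names, and the decision to drop seed users from the candidate sets are assumptions made for illustration.

```python
import random


def candidate_sets(seeds, friends, retweeted):
    """Return (follow-generated, retweet-generated) candidate users for a seed set.

    friends[s]:   set of users that seed s follows
    retweeted[s]: set of users whose posts seed s has retweeted
    """
    seeds = set(seeds)
    followed, rt = set(), set()
    for s in seeds:
        followed |= friends.get(s, set())
        rt |= retweeted.get(s, set())
    # Assumption: seed members themselves are not candidates.
    return followed - seeds, rt - seeds


def sample_for_review(candidates, k=25, seed=None):
    """Draw up to k users uniformly at random for manual relevance assessment."""
    pool = sorted(candidates)
    return random.Random(seed).sample(pool, min(k, len(pool)))


friends = {"s1": {"a", "s2"}, "s2": {"b"}}
retweeted = {"s1": {"c"}, "s2": {"c", "a"}}
follow_set, retweet_set = candidate_sets(["s1", "s2"], friends, retweeted)
review_batch = sample_for_review(follow_set, k=25, seed=0)
```

Relevance itself was judged by hand in the study, so the code stops at producing the samples.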

Experiments on Link Semantics: Topic Sensitive PageRank
• PageRank
  – Recursive ranking formula
  – A page is as important as the pages pointing to it
• Topic Sensitive PageRank (TSPR)
  – Biased PageRank
    • Generates query-specific importance scores for pages at query time
  – Used here to quantify the difference in topical relevance carried by follow and retweet links [1]

[1] T. H. Haveliwala. Topic-sensitive PageRank. WWW.

Experiments on Link Semantics: Topic Sensitive PageRank
• Experiments
  – Begin with a topical Twitter list
  – Compute topic sensitive PageRank over
    • Follow edges
    • Retweet edges
  – If the links carry topicality well
    • The high-ranking users are likely to be topically relevant to the original seed topic
  – Evaluate the resulting highest-ranked users for relevance to the original topic with a user survey
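A biased PageRank of the kind described here can be sketched with plain power iteration, where the random jump lands only on the seed users of the topical list. The damping factor of 0.85, the fixed iteration count, and the handling of users with no outgoing edges (their rank is returned to the seeds) are assumptions of this sketch rather than settings reported in the talk.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank whose teleport vector is uniform over the seed users."""
    seeds = set(seeds)
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    nxt[dst] += share
            else:
                # Dangling user: hand its rank back to the seed set (assumed policy).
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank


def top_non_seed(rank, seeds, k=30):
    """Highest-ranked users outside the seed set, ties broken by name."""
    return sorted((n for n in rank if n not in seeds), key=lambda n: (-rank[n], n))[:k]


# Edge (x, y) means x follows y, or x retweeted y, depending on the graph being ranked.
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("c", "a")]
rank = personalized_pagerank(edges, {"s"})
best = top_non_seed(rank, {"s"}, k=2)
```

Running the same function once over follow edges and once over retweet edges, with the same seeds, yields the two rankings that the survey then compares.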

Experiments on Link Semantics: Topic Sensitive PageRank
• Experimental setup
  – Collected 9 topical lists from listorious.com
    • 19 to 437 users per list
      – Average 155, median 49
    • Seed users have 14,284 followers on average
  – Computed personalized PageRank
  – Selected the 30 highest-ranking non-seed users
  – Conducted a survey
    • Participants were shown a topic description and the 30 highest-ranked users for either a follow-based or a retweet-based PageRank
    • Ordered randomly
    • Mixed with a random set of 10 of the seed users for that topic
    • Participants made a binary judgment of each user’s relevance
    • A total of 12 people participated in the survey
    • Each list was evaluated by at least 2 people

Experiments on Link Semantics: Topic Sensitive PageRank
• Accuracy of the highly ranked users
  – Precision
    • The average relevancy of a set of users
  – Relevance
    • The fraction of users who were judged relevant by at least one survey taker
    • Based on the set of users from U judged relevant in evaluation k of a particular list
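The exact formulas for these two measures are not available in this transcript. The sketch below follows one assumed reading of the wording: with R_k(U) denoting the users from U judged relevant in evaluation k, relevance is the share of U judged relevant by at least one evaluator, and precision is the per-evaluation relevant share averaged over evaluations. The symbol R_k(U), the function names, and both formulas are assumptions.

```python
def relevance(users, judged_relevant):
    """Share of users judged relevant in at least one evaluation (assumed reading)."""
    users = set(users)
    relevant_any = set().union(*judged_relevant) & users
    return len(relevant_any) / len(users)


def precision(users, judged_relevant):
    """Average over evaluations of the share of users judged relevant (assumed reading)."""
    users = set(users)
    shares = [len(set(r) & users) / len(users) for r in judged_relevant]
    return sum(shares) / len(shares)


ranked_users = {"a", "b", "c", "d"}
# judged_relevant[k] plays the role of R_k(U) for evaluation k.
judged_relevant = [{"a", "b"}, {"b", "c"}]
rel = relevance(ranked_users, judged_relevant)
prec = precision(ranked_users, judged_relevant)
```

Under this reading, relevance is never lower than precision, since a user counted in any single evaluation is also counted in the union.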

Experiments on Link Semantics: Topic Sensitive PageRank
• Result
  – Precision can be improved simply by using retweet links instead of follow links
    • Precision of the top-ranked users improved by over 30%

Experiments on Link Semantics: Topic Sensitive PageRank
• Cohesiveness of the seed set
  – To verify the seed users
    • Include 10 randomly selected seed users in each evaluation
• Result
  – Average Precision :
    • Minimum of
    • Maximum of 1.9
  – The seed users represented their topics well
  – Our survey takers understood and agreed upon the topic definitions

Conclusion
• We have described a detailed model of Twitter as a graph
  – Key statistics about the graph
  – Some initial insights into how the graph forms
• Important distinctions between edge types in the graph
  – Follow and retweet
  – The varying semantics and properties of these edges will have significant implications for graph algorithms such as PageRank
  – Retweet edges preserve topical relevance
    • Better than follow edges