TwitterRank: Finding Topic-sensitive Influential Twitterers. Presenter: 吴伟涛

Presentation transcript:

Outline 1. Introduction 2. Dataset 3. Topic modeling and Homophily in Twitter 4. TwitterRank 5. Experiment and results 6. Conclusions

Introduction. Motivation: The number of followers is currently the main metric used to identify influential twitterers, but a twitterer's influence may vary across topics. Solution: Identify influential twitterers by taking both the topical similarity between users and the link structure into account.

Introduction Two contributions of this paper: 1. First to report homophily in Twitter 2. Introduce TwitterRank to measure the topic-sensitive influence of the twitterers.

Introduction. Framework of the proposed approach: (1) topic distillation, (2) topic-specific relationship network construction, (3) topic-sensitive user influence ranking.

Twitter Dataset
1. Obtain a set of top-1000 Singapore-based twitterers. Denote the set as S, |S| = 1000.
2. Crawl all the followers and friends of each s ∈ S and store them in set S’.
3. Let S’’ = S ∪ S’, and S* = {s | s ∈ S’’, and s is from Singapore}; |S*| = [value missing].
4. For each s ∈ S*, crawl all the tweets she has published so far. Denote the set of tweets as T, |T| = 1,021,039.

Tweet Distribution

Friends/Followers

Reciprocity in Following Relationships

72.4% of the twitterers follow more than 80% of their followers. 80.5% of the twitterers have 80% of their friends follow them back. Is this casual following, or homophily?

Homophily in Twitter Q1: Are twitterers with “following” relationships more similar than those without according to the topics they are interested in? Q2: Are twitterers with reciprocal “following” relationships more similar than those without according to the topics they are interested in?

Topic modeling workflow: define the distance Dist(i,j); compute the average distances; test Q1 and Q2 against those averages; conclusion: homophily.

Topic Modeling Goal: Automatically identify the topics that twitterers are interested in based on the tweets they published. Latent Dirichlet Allocation (LDA) model is applied

Topic Modeling
LDA-based generative process for generating a document:
1. Pick a topic from the document's distribution over topics.
2. Sample a word from the distribution over words associated with the chosen topic.
3. Repeat the process for all the words in the document.
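The generative steps above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name `generate_document`, the toy vocabulary, and the two topic distributions are all hypothetical, and the Dirichlet priors that LDA places on these distributions are left out.

```python
import random

def generate_document(doc_topic_dist, topic_word_dists, vocab, n_words, rng):
    """Generate one document word by word, following the LDA generative story."""
    topics = list(range(len(doc_topic_dist)))
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic from the document's distribution over topics.
        z = rng.choices(topics, weights=doc_topic_dist, k=1)[0]
        # Step 2: sample a word from the chosen topic's distribution over words.
        w = rng.choices(vocab, weights=topic_word_dists[z], k=1)[0]
        words.append(w)
    # Step 3 is the loop itself: repeat for every word in the document.
    return words

# Hypothetical toy data: two topics over a four-word vocabulary.
vocab = ["food", "laksa", "goal", "match"]
topic_word_dists = [
    [0.5, 0.5, 0.0, 0.0],  # a "food" topic
    [0.0, 0.0, 0.5, 0.5],  # a "sports" topic
]
doc = generate_document([0.9, 0.1], topic_word_dists, vocab, 20, random.Random(0))
food_only = generate_document([1.0, 0.0], topic_word_dists, vocab, 10, random.Random(1))
```

In practice the direction is reversed: the tweets are observed and inference recovers the two sets of distributions.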

Topic Modeling Results 1. DT: a D×T matrix, where D is the number of users, T is the number of topics, and $DT_{ij}$ is the number of times a word in user $s_i$'s tweets has been assigned to topic $t_j$.

Topic Modeling. We first row-normalize the DT matrix as DT’ such that $\lVert DT'_{i\cdot} \rVert_1 = 1$ for each row $DT'_{i\cdot}$. Each row of DT’ is thus the probability distribution of twitterer $s_i$'s interest over the T topics, i.e. each element $DT'_{ij}$ captures the probability that twitterer $s_i$ is interested in topic $t_j$.
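A small sketch of the row normalization, assuming DT is held as a list of per-user count rows. The function name `row_normalize` and the uniform fallback for an all-zero row are assumptions made for illustration; the slides do not say how such rows are handled.

```python
def row_normalize(dt):
    """Turn each row of topic-assignment counts into a probability distribution."""
    normalized = []
    for row in dt:
        total = sum(row)
        if total == 0:
            # Assumption: a user with no assigned words gets a uniform distribution.
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([count / total for count in row])
    return normalized

# Hypothetical counts for two users over three topics.
dt = [[8, 2, 0], [1, 1, 2]]
dt_prime = row_normalize(dt)
```

Each row of `dt_prime` sums to 1, so it can be read directly as a user's interest distribution over topics.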

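Topic Difference. Definition 1: the topical difference between two twitterers $s_i$ and $s_j$ is calculated from $D_{JS}(i,j)$, the Jensen-Shannon divergence between the two probability distributions $DT'_{i\cdot}$ and $DT'_{j\cdot}$, which is defined as:
$$D_{JS}(i,j) = \tfrac{1}{2}\left( D_{KL}(DT'_{i\cdot} \,\|\, M) + D_{KL}(DT'_{j\cdot} \,\|\, M) \right)$$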

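Topic Difference. M is the average of the two probability distributions, i.e.
$$M = \tfrac{1}{2}\left( DT'_{i\cdot} + DT'_{j\cdot} \right)$$
$D_{KL}$ is the Kullback-Leibler divergence, which defines the divergence from distribution Q to P as:
$$D_{KL}(P \,\|\, Q) = \sum_{k} P(k) \log \frac{P(k)}{Q(k)}$$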
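The divergences above can be computed as in the sketch below. The Jensen-Shannon and Kullback-Leibler parts follow the standard definitions. The final step, `topical_difference = sqrt(2 * D_JS)`, is an assumption recalled from the TwitterRank paper, since the formula for the topical difference itself is not preserved in this transcript. All function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q); terms with P(k) = 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the average distribution M."""
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def topical_difference(p, q):
    # Assumption: dist(i, j) = sqrt(2 * D_JS(i, j)), as recalled from the paper.
    # max() guards against a tiny negative value from floating-point rounding.
    return math.sqrt(max(0.0, 2 * js_divergence(p, q)))

# Hypothetical interest distributions of two twitterers over three topics.
user_a = [0.8, 0.2, 0.0]
user_b = [0.1, 0.3, 0.6]
d_ab = topical_difference(user_a, user_b)
```

Because M is positive wherever either input is positive, the KL terms never divide by zero, which is one practical reason to prefer Jensen-Shannon over plain KL for sparse topic distributions.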

Hypothesis Testing. Note that hypothesis testing, like topic distillation, is applied only to the set of twitterers who have published more than 10 tweets in total. This set contains 4050 twitterers.

Hypothesis Testing (I). Formalize Q1 as a two-sample t-test comparing two means: the mean topical difference of the pairs of users with a “following” relationship, and the mean topical difference of those without.

Hypothesis Testing (I). Result: The null hypothesis $H_0$ is rejected at the chosen significance level.

Hypothesis Testing (II). Formalize Q2 as a two-sample t-test comparing two means: the mean topical difference of the pairs of users with a reciprocal “following” relationship, and the mean topical difference of the pairs of users with only a one-directional relationship.

Hypothesis Testing (II). Result: The null hypothesis $H_0$ is rejected at the chosen significance level.
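For illustration, a two-sample t statistic on two groups of topical differences might be computed as below. The slides do not say which variant of the test was used, so this sketch assumes Welch's unequal-variance form. The function name `welch_t` and the sample values are hypothetical, not data from the study.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Per-group squared standard errors, using the sample variance.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical topical differences for pairs with and without a "following" link.
with_follow = [1.0, 2.0, 3.0]
without_follow = [2.0, 3.0, 4.0]
t_stat, dof = welch_t(with_follow, without_follow)
```

A negative t statistic here means the pairs with a link have the smaller mean topical difference. Rejecting the null hypothesis additionally requires comparing the statistic against the t distribution with `dof` degrees of freedom at the chosen significance level.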

Implication. The homophily phenomenon does exist: the answer to Q1 is yes; the answer to Q2 is also yes; there are twitterers who are serious about following others.

Topic-specific TwitterRank. A topic-specific random walk model is applied to calculate each user's influence score. The transition matrix for topic t is denoted $P_t$; each entry of $P_t$ is the transition probability of the surfer from follower $s_i$ to friend $s_j$.

Topic-specific TwitterRank. A topic-specific teleportation vector is used; the influence scores of twitterers are calculated iteratively; the topic-specific TwitterRank scores are then aggregated.
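The slide formulas are not preserved in this transcript, so the sketch below assumes the formulation recalled from the TwitterRank paper (Weng et al., WSDM 2010). The transition probability is $P_t(i,j) = \frac{|T_j|}{\sum_{a:\, s_i \text{ follows } s_a} |T_a|} \cdot sim_t(i,j)$ with $sim_t(i,j) = 1 - |DT'_{it} - DT'_{jt}|$, where $|T_j|$ is the number of tweets by $s_j$. The teleportation vector $E_t$ is column t of DT normalized to sum to 1, and the update is $TR_t = \gamma P_t \times TR_t + (1-\gamma) E_t$. All function names, the toy data, and the default $\gamma = 0.85$ are illustrative assumptions.

```python
def row_normalize(matrix):
    # Assumes every row has a positive sum.
    return [[c / sum(row) for c in row] for row in matrix]

def transition_matrix(follows, tweet_counts, dt_prime, topic):
    """follows[i] lists the users that user i follows (i's friends)."""
    n = len(follows)
    p = [[0.0] * n for _ in range(n)]
    for i, friends in enumerate(follows):
        total = sum(tweet_counts[a] for a in friends)
        if total == 0:
            continue  # no friends, or only silent friends: row stays zero
        for j in friends:
            sim = 1.0 - abs(dt_prime[i][topic] - dt_prime[j][topic])
            p[i][j] = tweet_counts[j] / total * sim
    return p

def twitterrank(follows, tweet_counts, dt, topic, gamma=0.85, iterations=100):
    n = len(follows)
    p = transition_matrix(follows, tweet_counts, row_normalize(dt), topic)
    # Teleportation vector: column `topic` of DT, normalized to sum to 1.
    column_total = sum(row[topic] for row in dt)
    e = [row[topic] / column_total for row in dt]
    tr = list(e)
    for _ in range(iterations):
        # Influence flows from each follower i to each friend j.
        tr = [gamma * sum(p[i][j] * tr[i] for i in range(n)) + (1 - gamma) * e[j]
              for j in range(n)]
    return tr

def aggregate(tr_by_topic, topic_weights):
    """Weighted sum of topic-specific score vectors."""
    n = len(tr_by_topic[0])
    return [sum(w * tr[u] for w, tr in zip(topic_weights, tr_by_topic))
            for u in range(n)]

# Hypothetical network: users 0 and 1 both follow user 2.
follows = [[2], [2], []]
tweet_counts = [10, 10, 50]
dt = [[5, 5], [5, 5], [9, 1]]
scores = twitterrank(follows, tweet_counts, dt, topic=0)
overall = aggregate([scores, twitterrank(follows, tweet_counts, dt, topic=1)], [0.5, 0.5])
```

In this toy network user 2 ends up with the highest score for topic 0, because both followers pass influence to it and it is also the most active on that topic.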

Comparison with other Algorithms. TwitterRank is compared to In-degree, PageRank, and Topic-sensitive PageRank. The comparison is made in a recommendation scenario.

Recommendation task
[Figure: recommendation task setup, with labels $S_t$, $s_0$, $s_f$, and L]

Evaluation. Assume A is a ranked list recommended by any of the algorithms, and let $A(s_i)$ be the rank of $s_i$ in A. The quality of the recommendation is measured as $Q(A) = |\{ s_i \mid s_i \in S_t,\ A(s_i) < A(s_f) \}|$. The lower the value of Q(A), the higher the quality of the corresponding algorithm.
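The quality measure can be computed directly from a ranked list, as in this sketch. The function name `recommendation_quality` and the example identifiers are hypothetical. The list is assumed to be ordered best first, so a smaller position means a better rank.

```python
def recommendation_quality(ranked, s_t, s_f):
    """Q(A): number of users in S_t that the algorithm ranks ahead of s_f."""
    rank = {user: position for position, user in enumerate(ranked)}
    return sum(1 for s in s_t if rank[s] < rank[s_f])

# Hypothetical ranked list in which only "a" is placed ahead of the true friend "f".
ranked = ["a", "f", "b", "c"]
q = recommendation_quality(ranked, s_t={"a", "b", "c"}, s_f="f")
```

A value of 0 would mean the algorithm ranked $s_f$ ahead of every user in $S_t$.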

Criteria to generate the L set: the number of followers that $s_f$ has; the number of tweets that $s_f$ has published; the topical difference between $s_0$ and $s_f$; whether there is a reciprocal relationship between $s_0$ and $s_f$.

Experiment Results

All algorithms perform better in $L_{df}$ than in $L_{dh}$: there are twitterers who “follow” because of the topical similarity between them and their friends. This supports the homophily phenomenon. TR is outperformed in $L_{fh}$, $L_{tl}$ and $L_{dh}$: InD performs the best in $L_{fh}$, because twitterers' “following” behaviors are already biased toward those with more followers.

Experiment Results. TR performs the worst in $L_{tl}$, because LDA-based topic distillation needs more content to achieve reasonable accuracy. TR outperforms all the other algorithms except InD in $L_{dh}$: there still exist some twitterers who do not “follow” based on topical similarity, although homophily is observed.

Conclusion and future work. Homophily does exist: not all users “follow” at random. Future work: make the algorithm more robust to manipulation, e.g. purposely publishing a large number of tweets; classify different categories of users by studying their following behaviors more closely; incremental topic distillation and event detection.

Thank you