Xintao Wu Jan 18, 2013 Retweeting Behavior and Spectral Graph Analysis in Social Media
Social Media Customer Analytics
- Network topology
- Structured profile
    name  sex  age  disease  salary
    Ada   F    18   cancer   25k
    Bob   M    25   heart    110k
    …
    id  sex  age  address  income
    5   F    Y    NC       25k
    3   M    Y    SC       110k
- Retweet sequence
- Unstructured text (e.g., blog, tweet)
- Customer profile, customer transaction, inventory, product description and review, …
Entity resolution, patterns, temporal/spatial, scalability, visualization, sentiment, privacy
Outline
- Examining retweeting behavior to understand information propagation
  - Multi-factor interaction analysis
  - Coverage prediction
  - Burst detection
- Spectral graph analysis
  - Community partition
  - Fraud detection
Multi-factor interaction analysis
For each following relationship, which factors affect user A’s decision on whether to forward messages from user B to A’s followers?
We examine users’ retweet behavior using various features:
- Power ratio (A)
- Link structure (B)
- Location factor (C)
- Gender factor (D)
- …
We fit a log-linear model to capture and interpret interaction patterns among features A-D and retweet (E).
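A minimal sketch of the log-linear idea, reduced to two factors so that it fits in a few lines: link structure (B) against retweet (E). The counts are hypothetical and not taken from the talk. The independence model log m_ij = u + u_B(i) + u_E(j) is fitted in closed form, and the likelihood-ratio statistic G² measures how badly it fits; a large G² suggests that a B-E interaction term is needed. The model in the talk covers more factors and higher-order interactions, which this sketch does not attempt.

```python
import math

# Hypothetical counts (assumption, not data from the talk):
# rows = link structure B (friend, non-friend), columns = retweet E (yes, no).
counts = [[120, 380],
          [40, 460]]

def independence_fit(table):
    """Expected counts under log m_ij = u + u_B(i) + u_E(j)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_sums] for r in row_sums]

def g_squared(observed, expected):
    """Likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E))."""
    return 2.0 * sum(o * math.log(o / e)
                     for o_row, e_row in zip(observed, expected)
                     for o, e in zip(o_row, e_row) if o > 0)

def log_odds_ratio(table):
    """B-E interaction of the saturated 2x2 model, as a log odds ratio."""
    (a, b), (c, d) = table
    return math.log(a * d / (b * c))

expected = independence_fit(counts)
g2 = g_squared(counts, expected)   # compare with chi-square, 1 degree of freedom
lor = log_odds_ratio(counts)       # > 0: friends are retweeted more often
print(expected, round(g2, 2), round(lor, 3))
```

With one degree of freedom, a G² above about 3.84 rejects independence at the 5% level, so in this toy table the link-structure effect on retweeting would count as significant.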
Interpreting interaction effect
Interpretation example
Neither gender nor location has a significant effect on retweeting on its own. However, once link structure is taken into account:
- Females are more conservative: they have a lower tendency to retweet messages from non-friend (especially female) users, but a higher tendency to retweet messages from friends or superstars.
- Males are more open-minded: they have a higher tendency to retweet messages from non-friend (especially female) users.
Retweet Sequence
Information dynamically flows through a social network.
Figure labels: Alice, Bob, Cathy, David, Ellen, Fred; D1, D2, D3, …
- t1 m1 A
- t2 m2 B | t1 m1 A

Flow Through Tree Structure
Information dynamically flows through a social network.
- t1 m1 A
- t2 m2 B | t1 m1 A
- t3 m3 D | B | t1 m1 A
- t4 m4 C | t1 m1 A
- …
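The tree structure behind a retweet sequence can be sketched as follows. The record layout (time, retweeter, retweeted-from) is an assumption made for illustration; the slides only show the chain Alice, then Bob via Alice, then David via Bob, then Cathy via Alice. The sketch rebuilds the tree from such records and computes each user's depth, that is, the path length from the original author.

```python
from collections import defaultdict

# Hypothetical records mirroring the slide: (time, retweeter, retweeted_from).
# The original author (Alice) has no parent.
records = [
    (1, "Alice", None),
    (2, "Bob", "Alice"),
    (3, "David", "Bob"),
    (4, "Cathy", "Alice"),
]

def build_tree(records):
    """Return the root user and a parent -> children adjacency map."""
    children = defaultdict(list)
    root = None
    for _, user, parent in sorted(records, key=lambda r: r[0]):
        if parent is None:
            root = user
        else:
            children[parent].append(user)
    return root, children

def depths(root, children):
    """Path length from the original author to every retweeter."""
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            depth[child] = depth[node] + 1
            stack.append(child)
    return depth

root, children = build_tree(records)
depth = depths(root, children)
print(root, depth)
```

Path length computed this way is one of the tree-based quantities that later slides relate to bursts.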
WISE12 Challenge
Sina Weibo
- # of users: 5,636,858
- # of tweets: 46,584,914
- # of retweets: 190,920, test messages each with 100 initial retweets composed by 27 users from 6 events
For each message, predict
- M1: the number of retweets in 30 days
- M2: the number of possible-views in 30 days
Idea
We treat the retweeting activity of each original message in the training data as a time series.
Each value corresponds to the number of times that the original message was retweeted during time period t.
For each message in the test data:
- The early part of the series is known from the 100 initial retweets
- Use ARMA to predict the remainder
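A minimal sketch of the forecasting step, using AR(1) fitted by least squares, which is the simplest special case of ARMA(p, q) (p = 1, q = 0). It is not the model used in the challenge entry, whose orders and fitting procedure are not given here. The per-period retweet counts are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next)
              for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi

def forecast(series, steps, c, phi):
    """Iterate the fitted recurrence forward from the last observation."""
    preds = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds

# Hypothetical retweet counts per time period for one message (assumption).
observed = [64, 32, 16, 8, 4]
c, phi = fit_ar1(observed)
preds = forecast(observed, 3, c, phi)
predicted_total = sum(observed) + sum(preds)
print(c, phi, preds, predicted_total)
```

Summing the observed and forecast values gives an estimate of the total number of retweets over the horizon, which corresponds to target M1. A full ARMA fit would add moving-average terms and would normally come from a statistics library rather than from hand-written code.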
Prediction Result
Runner-up award (2nd place) in the WISE 2012 Challenge, Mining Track.
Plot panels: Death of Steve Jobs; Xiaomi Release; Yao Jiaxin Murder Case; Xiaomi Release
Bursts
Figure annotations: peak time, duration time
Topic
Retweet vs. Time
Retweet vs. Time (continued)
Burst Analysis: Users
Top 100 users tend to have shorter path length, shorter peak time, and shorter duration time.
Burst Prediction
Extract features
- User-related, including profile and history information
- Tweet-related, including time series and retweet tree
Run classifiers
- Logistic regression
- Random forest
- Decision tree
- Naive Bayes
- SVM
- KNN
Achieve 83.2% accuracy
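A minimal sketch of the classification step, using logistic regression (the first classifier listed) trained by plain gradient descent. The two features and all values are hypothetical stand-ins for the user-related and tweet-related features; the feature set behind the reported accuracy is not specified here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (burst) when the predicted probability is at least 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical, already scaled features per message (assumption):
# [early retweet rate, author popularity]; label 1 = burst, 0 = no burst.
X = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9],
     [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
print(w, b, accuracy)
```

Accuracy here is measured on the training points only, which is enough for a toy check; a real evaluation would use held-out messages.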
Spectral graph analysis
Spectral coordinate: Polbook Network
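A minimal sketch of spectral coordinates, assuming they mean each node's entries in the leading eigenvectors of the adjacency matrix. The toy graph (two triangles joined by one edge) is hypothetical and stands in for the Polbook network. The eigenvectors are obtained by power iteration on the shifted matrix A + dI, where d is the maximum degree; the shift makes every eigenvalue non-negative, so the iteration converges to the algebraically largest ones.

```python
import math

def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def top_eigenvectors(A, k, iters=500):
    """Leading k eigenvectors of a symmetric adjacency matrix."""
    n = len(A)
    shift = max(sum(row) for row in A)  # max degree bounds |eigenvalue|
    found = []
    for _ in range(k):
        x = normalize([float(i + 1) for i in range(n)])
        for _ in range(iters):
            y = mat_vec(A, x)
            y = [yi + shift * xi for yi, xi in zip(y, x)]  # (A + shift*I) x
            for v in found:  # stay orthogonal to earlier eigenvectors
                dot = sum(a * b for a, b in zip(y, v))
                y = [a - dot * b for a, b in zip(y, v)]
            x = normalize(y)
        found.append(x)
    return found

# Hypothetical graph: triangle 0-1-2, triangle 3-4-5, bridge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

v1, v2 = top_eigenvectors(adjacency, 2)
coords = list(zip(v1, v2))  # 2-D spectral coordinate of each node
# Nodes on the same side of the second eigenvector share a community.
communities = [0 if c * v2[0] > 0 else 1 for c in v2]
print(coords, communities)
```

In this toy graph the sign of the second coordinate separates the two triangles, which illustrates why nodes of the same community tend to cluster in the spectral space.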
Accuracy of AdjCluster
Compared methods:
- Lap [Miller and Teng, 1998]: Laplacian based
- Ncut [Shi and Malik, 2000]: Normalized cut
- HE’ [Wakita and Tsurumi, 2007]: Modularity-based agglomerative clustering
- SpokEn [Prakash et al., 2010]: EigenSpoke
Accuracy: [equation], where [symbol] is the i-th community produced by different algorithms
Refer to IJCAI 11 for details
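The accuracy formula itself did not survive on this slide, so the sketch below uses one common definition as an assumption, not necessarily the one used in the talk: the fraction of nodes that fall in their true community under the best one-to-one matching between produced and true communities. The labels are hypothetical, and the brute-force matching is only practical for a small number of communities.

```python
from itertools import permutations

def partition_accuracy(truth, predicted):
    """Best-match fraction of nodes placed in their true community.

    Assumes the predicted partition has no more communities than the truth.
    """
    true_labels = sorted(set(truth))
    pred_labels = sorted(set(predicted))
    best = 0
    # Try every one-to-one assignment of predicted labels to true labels.
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(1 for t, p in zip(truth, predicted) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)

# Hypothetical ground truth and algorithm output for six nodes (assumption).
truth = [0, 0, 0, 1, 1, 1]
predicted = ["b", "b", "a", "a", "a", "a"]
acc = partition_accuracy(truth, predicted)
print(acc)
```

Matching labels before counting matters because a clustering algorithm names its communities arbitrarily; without it, a perfect partition with swapped labels would score zero.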
Evaluation on Web spam challenge data
SPCTRA fraud detection
GREEDY: based on outer-triangles [Shrivastava, ICDE, 2008]
times faster
Refer to ICDE 11 for details.
Acknowledgments
This work was supported in part by U.S. National Science Foundation CNS and CCF, and UNC Charlotte Chancellor’s Special Fund.
Thank You! Questions?