Published by Audrey Morris. Modified over 9 years ago.
1
SPOTTING FAKE RETWEETING ACTIVITY IN TWITTER
Maria Giatsoglou (1), Despoina Chatzakou (1), Neil Shah (2), Alex Beutel (2), Christos Faloutsos (2), Athena Vakali (1)
(1) Informatics Department, Aristotle University of Thessaloniki, Greece
(2) School of Computer Science, Carnegie Mellon University, USA
2
Content in Twitter
Great topic diversity
Varying attention levels (#views, #RTs, #favorites)
12/4/2015
3
Retweet Fraud: overview
Users typically retweet a post because of its high quality / interestingness and its author's influence / popularity.
The number of retweets serves as an indicator of a post's popularity.
Retweet fraud: falsely creating the impression of popularity by artificially generating a high volume of retweets.
Twitter estimates that 14% of user accounts are bots and 5% are spam bots; the problem is probably much bigger.
Such content is vacuous, spammy / malicious, and detracts from the credibility of Twitter content and from users' experience.
4
Retweet Fraud: example
5
Retweet Fraud: examples
6
Retweet Fraud: dimensions
A complex problem with multiple dimensions:
Accounts of varying automation level (bots, humans, semi-automated)
Mixed honest and fake retweets for the same post
Promiscuous vs. subtle fraudsters, based on the ratio of fraudulent to honest(-like) activity
Figure labels (examples): occasional retweet buyer; honest humans; paid human; bots; professional content / user promoter
7
How can we spot fake retweeting activity?
8
Hypotheses and problems addressed
What features tell fake from genuine reactions? How do they relate to the targeted problems? RTSCOPE addresses these issues.
There are distinctive patterns in retweet fraud in terms of:
H1. the timing of retweets (use of automation tools)
H2. the accounts that retweet (fraudsters acting in lockstep)
H3. the connectivity of retweeters (bot networks, "camouflage")
Retweet-thread level problem
  Given: the i-th tweet of user u and its induced retweet activity (user IDs and timestamps).
  Identify: whether the activity is organic or not.
User level problem
  Given: a user u, a set of tweets of user u, and their induced retweet activity.
  Identify: whether u is a spammer.
Figure labels: promiscuous fraudsters; cautious fraudsters
9
Background
User u: a given Twitter account (can be honest OR fraudster)
Tweet tw_{u,i}: the i-th post of user u
Retweet thread R_{u,i}: all re-posts of a tweet
Figure: timeline of u's tweets tw_{u,1}, tw_{u,2}, tw_{u,3}, tw_{u,4} posted at times t1, t2, t3, t4; retweet thread R_{u,1} of tw_{u,1} with retweeters Alex, Mary, Peter, Debbie, Tim; "R" network (of R_{u,1}) with nodes Alex, Mary, Peter, Debbie
10
Introducing the RTSCOPE approach
RTSCOPE: a series of tests for spotting fraudsters with varying behaviors
Reference: Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Christos Faloutsos, and Athena Vakali. Retweeting Activity on Twitter: Signs of Deception. In PAKDD 2015.
11
Connectivity: TRIANGLES pattern
Figure: honest "R" network vs. fraudulent "R" network (annotation: degree 2)
12
Connectivity: DEGREES pattern
Figure: degree distributions of an honest "R" network and a fraudulent "R" network (annotations: power-law; spike at 30)
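The two connectivity patterns rest on quantities that are easy to compute for any retweeter network. The sketch below is illustrative only: it assumes the "R" network is available as an undirected edge list among retweeters, and the helper names (build_adjacency, degree_histogram, count_triangles) are hypothetical rather than part of RTSCOPE.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def degree_histogram(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


def count_triangles(adj):
    """Count triangles; each one is seen once from each of its 3 corners."""
    seen = 0
    for node, neighbours in adj.items():
        for v, w in combinations(neighbours, 2):
            if w in adj[v]:
                seen += 1
    return seen // 3


# Toy network: one triangle (a, b, c) plus a pendant node d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
adj = build_adjacency(edges)
print(degree_histogram(adj), count_triangles(adj))
```

A degree histogram with an isolated spike, or a triangle count far from what networks of similar size show, would be the kind of signal these two slides point at.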
13
Activity Summarization: Features
Temporal and popularity features are computed per retweet thread:
ratio of activated followers: share of the author's followers who retweeted
response time: time between the tweet's posting and the first retweet
lifespan: time between the first and last retweet (constrained to 1 month)
Arr-IQR: inter-quartile range of the inter-arrival times of retweets
Log-log pairwise feature scatterplots of retweet threads reveal dense microclusters for fraudsters.
14
Activity Summarization: Patterns
ENTHUSIASM: high infection probability for followers of fraudsters
MACHINE-GUN: fraudsters retweet all at once / with similar time delay
REPETITION: fake retweet threads form microclusters due to similar response time, Arr-IQR, and activated followers ratio
Figure legend: Popular++, Popular+, Popular, Fraudulent users (panels: ENTHUSIASM, MACHINE-GUN)
15
Retweeters activation: Disparity
Given the posts of user u_i, what is the distribution of retweets across retweeters?
Disparity reveals whether retweeting activity spreads homogeneously over retweeters or is skewed towards a few dedicated users.
Disparity for u_i and a retweet thread size of k: (equation not preserved in this transcript)
16
Disparity: Intuition
Honest case: user u_i has 100 posts and k = 5 retweeters (Alex, Mary, Peter, Debbie, Tim), with r_{i,1} = 100, r_{i,2} = 2, r_{i,3} = 2, r_{i,4} = 1, r_{i,5} = 1.
Fraudulent case: user u_i has 100 posts and k = 5 retweeters (bot1, bot2, bot3, bot4, bot5), with r_{i,1} = r_{i,2} = r_{i,3} = r_{i,4} = r_{i,5} = 100.
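The two cases on this slide can be compared numerically. The exact disparity formula is not preserved in this transcript, so the sketch below assumes the common definition of disparity as the sum of squared retweet shares, Y = sum_j (r_{i,j} / s_i)^2 with s_i = sum_j r_{i,j}; the function name disparity is hypothetical. Under that assumption, Y equals 1/k for perfectly homogeneous retweeters and approaches 1 when one retweeter dominates.

```python
def disparity(retweet_counts):
    """Sum of squared shares of retweets per retweeter (assumed definition)."""
    total = sum(retweet_counts)
    if total <= 0:
        raise ValueError("need at least one retweet")
    return sum((r / total) ** 2 for r in retweet_counts)


honest = [100, 2, 2, 1, 1]   # one dedicated retweeter, four occasional ones
bots = [100] * 5             # every retweeter reposts all 100 posts

for name, counts in (("honest", honest), ("bots", bots)):
    k = len(counts)
    y = disparity(counts)
    # k * y stays near 1 for homogeneous activity and grows with skew.
    print(f"{name}: Y = {y:.3f}, k*Y = {k * y:.3f}")
```

With these numbers the skewed (honest) case gives a value close to 1, while the homogeneous (bot) case gives exactly 1/k.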
17
FAVORITISM & HOMOGENEITY patterns
FAVORITISM: participation of honest users in retweets follows a Zipf law.
HOMOGENEITY: participation of fraudulent users in retweets is homogeneous.
Detail: disparity of a Zipf distribution (proof in paper)
Figure annotations: homogeneity; favoritism; super-skewed favoritism
18
Findings
Patterns: we discovered several patterns for spotting retweet fraud.
All tests:
  are content independent
  can catch more sophisticated fraudsters
  are language independent
But: is there a golden number of tests for flagging fraudsters?
19
Can we come up with a more generalizable approach?
20
Synchronization Fraud
Group of unnaturally synchronized events/entities
Collective / group anomaly, e.g. retweets, Facebook likes, subgraphs, image subregions
Example 1: a tweet got 3K retweets in 10 minutes (from Alex, Mary, Peter, Debbie, Tim, John, ...). Suspicious? Not necessarily.
Example 2: several tweets each got 3K retweets in 10 minutes from the same accounts (Alex, Mary, Peter, Debbie, Tim, John, ...). Suspicious? Probably!
21
Our goals
Given: N groups of entities and a representation of each entity in a p-dimensional space.
Identify: groups of entities that are abnormally synchronized in some feature subspaces.
G1. Design a general, effective approach for collective anomaly detection
G2. Customize it for Retweet Fraud detection
G3. Find features that help distinguish fraudsters from honest users
22
Background: Measuring group strangeness
Figure labels: fd; average closeness
Reference: Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, Shiqiang Yang. CatchSync: Catching Synchronized Behavior in Large Directed Graphs. KDD 2014.
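The slide gives only the label "average closeness", so the following is a rough sketch of one way such a score can be computed, not the definition used by the authors. It assumes a grid-based closeness in the spirit of CatchSync: two entities are "close" when they fall into the same grid cell of the feature space, and a group's synchronicity is the average closeness over all ordered pairs of its entities. The function name synchronicity and the cell_size parameter are hypothetical.

```python
from collections import Counter


def synchronicity(points, cell_size):
    """Average pairwise closeness of a group of points.

    Closeness is 1 when two points share a grid cell and 0 otherwise.
    All ordered pairs are counted, including each point with itself.
    """
    if not points:
        raise ValueError("group must contain at least one point")
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    cells = Counter(
        tuple(int(coord // cell_size) for coord in point) for point in points
    )
    n = len(points)
    # A cell holding c points contributes c * c close ordered pairs.
    return sum(c * c for c in cells.values()) / (n * n)


tight_group = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.1), (0.9, 0.9)]
spread_group = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
print(synchronicity(tight_group, 1.0), synchronicity(spread_group, 1.0))
```

A group whose entities all land in one cell scores 1, while a group spread over n distinct cells scores 1/n.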
23
Background: Robust outlier detection
ROBPCA-AO: robust dimensionality reduction approach; finds outlying points
Suitable for multivariate, high-dimensional data; independent of the features' distribution; non-deterministic
1. Finds the "best" k-D space to project data, based on a subset of points
2. Detects outliers based on two distance scores: orthogonal distance and robust score distance
Reference: M. Hubert, P.J. Rousseeuw, T. Verdonck. Robust PCA for skewed data and its outlier map. Comput. Stat. Dat. An., 53 (2009), 2264-2274.
24
Problem definition
SYNCFRAUD Problem.
  Given: a set of groups of entities G, with a variable number of entities e_{m,i} for each group g_m, and p features for the entities' representation,
  Extract: a set of features at the group level, and
  Identify: suspicious groups S with highly synchronized characteristics.
RTFRAUD Problem.
  groups of entities -> users
  entities -> retweet threads
  suspicious groups -> RTFraudsters
25
ND-SYNC pipeline
Given: N groups of p-D entities and I iterations
Do:
  1. Feature subspace sweeping
  2. Group scoring
  3. Multivariate outlier detection
Extract: suspicious groups
26
ND-SYNC: Feature subspace sweeping
Figure annotations: sign of synchronicity; all, for simplicity
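The transcript does not say which subspaces are swept, so the following is only a sketch of what enumerating feature subspaces could look like. It assumes that all 2-D combinations of the retweet-thread features are considered; the snake_case feature names and the helpers feature_subspaces and project are hypothetical labels for the features listed later in the deck.

```python
from itertools import combinations

# Hypothetical identifiers for the retweet-thread features named in the deck.
FEATURES = [
    "retweets", "response_time", "lifespan",
    "rt_q3", "rt_q2", "arr_mad", "arr_iqr",
]


def feature_subspaces(features, dim=2):
    """Return every subset of `dim` features, in a stable order."""
    return list(combinations(features, dim))


def project(entity, subspace):
    """Project an entity (dict of feature values) onto a feature subspace."""
    return tuple(entity[name] for name in subspace)


subspaces = feature_subspaces(FEATURES)
print(len(subspaces), subspaces[0])
```

With seven features this yields 21 two-dimensional subspaces, each of which can then be scored separately for every group.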
27
ND-SYNC: Group scoring
28
ND-SYNC: Multivariate outlier detection
Aim: given the suspiciousness score vectors, identify the suspicious groups.
1. Apply ROBPCA-AO for I iterations and find outliers.
2. Flag a group as suspicious based on a majority vote over all iterations.
To eliminate parameters:
  automatic selection of the dimensionality k via the 95% cumulative variance explained criterion
  heuristic use of all entities for estimating the robust feature subspaces
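The majority vote and the 95% cumulative-variance rule are simple enough to sketch in isolation. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the outlier detector itself (ROBPCA-AO) is not reproduced, each iteration is assumed to return a set of flagged group indices, and the names flag_suspicious and choose_k are hypothetical.

```python
def flag_suspicious(outlier_runs, n_groups):
    """Flag groups reported as outliers in more than half of the iterations.

    outlier_runs: one set of flagged group indices per iteration.
    """
    votes = [0] * n_groups
    for run in outlier_runs:
        for group in run:
            votes[group] += 1
    majority = len(outlier_runs) / 2
    return {group for group, count in enumerate(votes) if count > majority}


def choose_k(explained_variances, threshold=0.95):
    """Smallest number of components whose cumulative variance share
    reaches `threshold` (components taken in decreasing variance order)."""
    total = sum(explained_variances)
    if total <= 0:
        raise ValueError("variances must sum to a positive value")
    cumulative = 0.0
    for k, variance in enumerate(sorted(explained_variances, reverse=True), 1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(explained_variances)


# Three hypothetical iterations over four groups: only group 2 wins the vote.
print(flag_suspicious([{0, 2}, {2}, {2, 3}], n_groups=4))
print(choose_k([6.0, 3.0, 0.8, 0.2]))
```

Voting across iterations is one way to dampen the run-to-run variation of a non-deterministic detector, which matches the motivation given on the slide.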
29
Features for retweet threads
Retweets: number of retweets
Response time: from the tweet's posting to the first retweet
Lifespan: from the first to the last (observed) retweet, constrained to 3 weeks
RT-Q3 response time: from the tweet's posting to the first ¾ of retweets
RT-Q2 response time: from the tweet's posting to the first ½ of retweets
Arr-MAD: mean absolute deviation of the retweets' inter-arrival times
Arr-IQR: inter-quartile range of the retweets' inter-arrival times
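These features can be derived from nothing more than the tweet's timestamp and the timestamps of its retweets. The sketch below is one possible reading of the definitions, with several assumptions: timestamps are plain numbers (for example seconds), the quartile response times are taken at the retweet that completes the first half or three quarters of the thread, quartiles of inter-arrival times use linear interpolation, and the 3-week lifespan cap is omitted. The names thread_features and percentile are hypothetical.

```python
import math
from statistics import mean


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted, non-empty list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    low, high = math.floor(position), math.ceil(position)
    fraction = position - low
    return sorted_values[low] * (1 - fraction) + sorted_values[high] * fraction


def thread_features(tweet_time, retweet_times):
    """Compute per-thread features from numeric timestamps."""
    if not retweet_times:
        raise ValueError("a retweet thread needs at least one retweet")
    rts = sorted(retweet_times)
    n = len(rts)
    gaps = sorted(later - earlier for earlier, later in zip(rts, rts[1:]))
    gap_mean = mean(gaps) if gaps else 0.0
    return {
        "retweets": n,
        "response_time": rts[0] - tweet_time,
        "lifespan": rts[-1] - rts[0],
        # Time until the first half / three quarters of retweets have arrived.
        "rt_q2": rts[math.ceil(0.5 * n) - 1] - tweet_time,
        "rt_q3": rts[math.ceil(0.75 * n) - 1] - tweet_time,
        # Spread of inter-arrival times; both are 0 for perfectly regular bursts.
        "arr_mad": mean(abs(g - gap_mean) for g in gaps) if gaps else 0.0,
        "arr_iqr": percentile(gaps, 0.75) - percentile(gaps, 0.25) if gaps else 0.0,
    }


# A perfectly regular thread: one retweet every 10 time units.
print(thread_features(0, [10, 20, 30, 40, 50]))
```

For the regular thread in the example, both Arr-MAD and Arr-IQR are zero, which is the kind of value a scripted burst of retweets would produce.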
30
Microclusters of fraudulent retweet threads
Figure: 2D feature subspaces, showing high synchronicity for RTFraudsters
31
Dataset generation
Selection of target users (both honest and fraudulent):
  users with the most retweeted tweets and heavy use of spammy keywords (casino, buy, followback, etc.) in a 2-day Twitter sample
  active (frequent tweets) and popular (> 100 retweets) users (http://twittercounter.com/)
  topic experts (European affairs and Automobile), based on Twitter lists
Target users were tracked for 2-6 months (all tweets and their retweets).
Pruned "unpopular" users (all retweet threads < 50 retweets, or fewer than 20).
32
Dataset overview
Type         #Retweet threads   #Retweets
honest       83,587             2,939,455
fraudulent   50,435             8,787,803
BOTH         134,022            11,727,258
User categorization:
  28 fraudulent: tweets with spammy links and terms, repetitive promotions; fabricated profiles
  278 honest
(Available at http://oswinds.csd.auth.gr/project/NDSYNC)
33
ND-SYNC effectiveness & robustness
Highly accurate and robust to the selection of k
Best performance at k = 6 (selected with the 95% cumulative variance explained criterion)
Only 1% decrease in F1-score when using just 2D feature subspaces
Reported results: 97% accuracy, 0.82 F1-score
34
Detected outliers
Figure annotations: professional promoters; promiscuous; 65 retweet threads in 4 months, 80% > 1k retweets, 60% > 10k retweets; news media account; news media account; politician
35
Conclusions
G1. Design a general, effective approach for collective anomaly detection
  ND-SYNC is a general, effective pipeline that automatically detects group anomalies.
G2. Customize it for Retweet Fraud detection
  A carefully designed set of features for the retweet fraud case.
G3. Find features that help distinguish fraudsters from honest users
  ND-SYNC achieves 97% accuracy in distinguishing fraudulent from honest users on real Twitter data.
36
Questions?
Download datasets at:
http://oswinds.csd.auth.gr/project/RTSCOPE
http://oswinds.csd.auth.gr/project/NDSYNC