CrowdTarget: Target-based Detection of Crowdturfing in Online Social Networks
Jenny (Bom Yi) Lee
Introduction: What is Crowdturfing?
Introduction
▹ Crowdturfing
▸ Crowdsourcing + astroturfing
▸ Malicious crowdsourcing
▹ Crowdsourcing
▸ Process of outsourcing tasks to a crowd of human workers
▹ Astroturfing
▸ False impression of widespread support
Crowdturfing
Twitter
▹ Tweets and retweets
▹ Manipulation of account popularity using artificial retweets
▸ Unjust gain of money through sponsored tweets
Black-market vs Crowdturfing Sites for OSN
▹ Black-market sites
▸ Operate using a large number of bots
▸ Synchronised group activities
▹ Crowdturfing sites
▸ Human workers
▸ No synchronised group activities
Existing Detection Methods
▹ Legitimate user?
▸ Account-based features
▸ Synchronised group activities
Analysing Accounts
▹ Account popularity
▸ Follower to following ratio
▸ Number of received retweets per tweet
▸ Klout score
▹ Synchronised group activity
▸ Following similarity
▸ Retweet similarity
Account Popularity
▹ Percentage of accounts with a larger number of followers than following: 20%, 37%, 70%
▹ Percentage of tweets that are retweeted more than once: 4%, 5%, 43%
▹ Median Klout scores: 20, 33, 41
Synchronised Group Activity
▹ Following similarity: similarity of followers between two accounts
▸ Black-market: HIGH
▸ Normal: LOW
▸ Crowdturfing: LOW
▹ Retweet similarity: similarity of retweets between two accounts
▸ Black-market: HIGH
▸ Normal: LOW
▸ Crowdturfing: LOW
▹ Crowdturfing workers
▸ Perform malicious activities alongside normal behaviour
▸ Work independently of each other
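One way to make "similarity between two accounts" concrete is the Jaccard index over sets of account or tweet IDs. The slides do not name the measure, so treating it as Jaccard is an assumption here, and the ID sets below are hypothetical.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of IDs (0.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical follower ID sets (illustrative only, not data from the study).
bot_1 = {1, 2, 3, 4}
bot_2 = {2, 3, 4, 5}
worker_1 = {10, 11, 12}
worker_2 = {12, 20, 21, 22}

# Accounts driven by the same bot controller tend to overlap heavily.
bot_similarity = jaccard_similarity(bot_1, bot_2)
# Independent human workers tend to overlap very little.
worker_similarity = jaccard_similarity(worker_1, worker_2)
```

The same function applies to retweet similarity by passing sets of retweeted tweet IDs instead of follower IDs.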
Solution: CrowdTarget
Solution
CrowdTarget:
▹ Focus on the targets of crowdturfing accounts
▹ Discover manipulation patterns of target objects
▸ Analyse retweets generated by normal, crowdturfing, and black-market accounts
Analysing Crowdturfing Targets
▹ Tweets receiving artificial retweets generated by crowdturfing workers
▹ Characteristics:
▸ Retweet time distribution
▸ Twitter application
▸ Unreachable retweeters
▸ Click information
Data Collection
▹ Normal tweets
▸ Collected from 1044 Twitter accounts with ≥ 100,000 followers
▹ Crowdturfing tweets
▸ Registered at 9 crowdturfing sites and retrieved tasks requesting retweets
▹ Black-market tweets
▸ Wrote 282 tweets and registered at black-market sites to purchase retweets
Retweet Time Distribution
▹ Count the number of retweets generated every hour since a tweet is created
▹ Normal, crowdturfing, and black-market tweets differ significantly in the mean, standard deviation, skewness, and kurtosis of this distribution
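The feature can be sketched as follows: bin retweets by the hour elapsed since the tweet, then summarise the histogram with its first four moments. The function names are hypothetical, timestamps are assumed to be in seconds, and the choice of population moments with non-excess kurtosis is an assumption, since the slides do not specify the exact estimator.

```python
import math
from collections import Counter


def hourly_retweet_counts(tweet_time, retweet_times, hours):
    """Count retweets in each of the first `hours` one-hour bins after the tweet."""
    bins = Counter(
        int((t - tweet_time) // 3600) for t in retweet_times if t >= tweet_time
    )
    return [bins.get(h, 0) for h in range(hours)]


def distribution_moments(counts):
    """Return (mean, std, skewness, kurtosis) of the elapsed hour, weighted by counts."""
    total = sum(counts)
    if total == 0:
        return 0.0, 0.0, 0.0, 0.0
    mean = sum(h * c for h, c in enumerate(counts)) / total
    variance = sum(c * (h - mean) ** 2 for h, c in enumerate(counts)) / total
    std = math.sqrt(variance)
    if std == 0:
        # All retweets fall in one bin; higher moments are undefined, so report 0.
        return mean, 0.0, 0.0, 0.0
    skewness = sum(c * (h - mean) ** 3 for h, c in enumerate(counts)) / (total * std ** 3)
    kurtosis = sum(c * (h - mean) ** 4 for h, c in enumerate(counts)) / (total * std ** 4)
    return mean, std, skewness, kurtosis


# Hypothetical tweet posted at t = 0 with four retweets (times in seconds).
counts = hourly_retweet_counts(0, [600, 1800, 4000, 8000], hours=4)
mean, std, skewness, kurtosis = distribution_moments(counts)
```

The four values form one part of a tweet's feature vector; a burst of retweets shortly after posting and a slow, steady trickle give visibly different skewness and kurtosis.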
Twitter Application, Unreachable Retweeters, Click Information
▹ Ratio of retweets generated by the dominant application: 99%, 40%, 90%
▹ Ratio of “non followers” (unreachable retweeters)
▸ 80% of tweets have 80% unreachable retweeters
▸ Normal: < 10%
▹ Number of clicks per retweet
▸ > 80% receive more clicks than retweets
▸ Most tweets are never clicked
▸ > 90% receive a smaller number of clicks than retweets
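The three ratios on this slide can be computed per tweet roughly as below. This is a minimal sketch: the function names and sample values are hypothetical, and "unreachable" is taken to mean "retweeter does not follow the posting user", following the slide's “non followers” wording.

```python
from collections import Counter


def dominant_application_ratio(apps):
    """Share of retweets posted through the single most common application."""
    if not apps:
        return 0.0
    return Counter(apps).most_common(1)[0][1] / len(apps)


def unreachable_retweeter_ratio(retweeters, followers):
    """Share of retweeters who do not follow the posting user."""
    if not retweeters:
        return 0.0
    follower_set = set(followers)
    return sum(1 for r in retweeters if r not in follower_set) / len(retweeters)


def clicks_per_retweet(clicks, retweets):
    """Link clicks divided by retweets (0.0 when there are no retweets)."""
    if retweets == 0:
        return 0.0
    return clicks / retweets


# Hypothetical values for a single tweet.
app_ratio = dominant_application_ratio(["web", "web", "web", "android"])
unreachable = unreachable_retweeter_ratio(["a", "b", "c", "d", "e"], {"a"})
click_rate = clicks_per_retweet(clicks=3, retweets=12)
```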
CrowdTarget
▹ Prepare training & testing data
▸ Set the ratio of malicious tweets to 1% of total tweets
▹ Build classifiers
▸ Using the retweet features explained previously
▹ Test classifiers
▸ Select the top classifier with the highest accuracy
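The data preparation step can be sketched as sampling malicious tweets until they make up 1% of the combined set. The helper name `mix_dataset`, the fixed seed, and the placeholder items are assumptions for illustration; the slides do not describe how the sampling was done.

```python
import random


def mix_dataset(normal, malicious, malicious_ratio=0.01, seed=0):
    """Combine normal items with a sample of malicious ones at the given ratio.

    Returns a shuffled list of (item, label) pairs, label 1 = malicious.
    """
    rng = random.Random(seed)
    # Solve m / (n + m) = ratio for m, given n normal items.
    wanted = round(len(normal) * malicious_ratio / (1 - malicious_ratio))
    wanted = min(wanted, len(malicious))
    sampled = rng.sample(malicious, wanted)
    data = [(x, 0) for x in normal] + [(x, 1) for x in sampled]
    rng.shuffle(data)
    return data


# Hypothetical placeholders standing in for per-tweet feature vectors.
normal_tweets = list(range(990))
malicious_tweets = list(range(1000, 1100))
dataset = mix_dataset(normal_tweets, malicious_tweets)
malicious_count = sum(label for _, label in dataset)
```

Keeping the malicious share this low mimics a realistic class imbalance, which matters when comparing classifiers by false-positive rate.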
CrowdTarget
▹ Classifier on retweet time distribution, Twitter application, and unreachable retweeters
▸ AdaBoost: TPR 0.95
▸ Gaussian Bayes: TPR 0.87
▸ K-nearest neighbours: TPR 0.96
▹ Click information classifier
▸ K-nearest neighbours: TPR 0.98
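To illustrate how a k-nearest-neighbours classifier and the TPR metric fit together, here is a minimal pure-Python sketch. It is not the implementation behind the numbers above: the two-dimensional feature vectors, the value of k, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def tpr_fpr(y_true, y_pred):
    """Return (true-positive rate, false-positive rate); label 1 = malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Hypothetical features: (dominant application ratio, unreachable retweeter ratio).
train = [
    ((0.10, 0.05), 0), ((0.20, 0.10), 0), ((0.15, 0.00), 0),
    ((0.90, 0.80), 1), ((0.95, 0.85), 1), ((0.85, 0.90), 1),
]
test = [((0.12, 0.07), 0), ((0.90, 0.90), 1), ((0.80, 0.75), 1), ((0.30, 0.20), 0)]

predictions = [knn_predict(train, features) for features, _ in test]
tpr, fpr = tpr_fpr([label for _, label in test], predictions)
```

TPR is the fraction of malicious tweets that are caught, and FPR is the fraction of normal tweets wrongly flagged, so a classifier is only useful when high TPR comes with low FPR.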
Results
▹ False negatives
▸ Misjudgement of tweets that receive a small number of retweets
▸ 50% of undetected crowdturfing tweets were mostly retweeted by reachable accounts
▸ Followers bought from the same crowdturfing service
▹ False positives
▸ Verified accounts that received retweets from automated applications
Feature Robustness
▹ Artificially manipulate the retweet time distribution
▸ Requires cooperation (workers are independent)
▸ Use bot accounts to manipulate the retweet time distribution (costly)
▹ Eliminate dominant applications
▹ Reduce the number of unreachable retweeters
▸ Follow the posting user (decreases popularity)
▹ Manipulate click information (spam?)
Summary
▹ Novel crowdturfing detection method
▹ CrowdTarget can detect crowdturfing retweets on Twitter with a TPR of 0.98 at an FPR of 0.01
▹ Manipulation patterns of the target objects are maintained regardless of which evasion techniques the crowdturfing accounts use
Criticism
▹ Identification of crowdturfing targets
▸ No identification of crowdturfing accounts
▹ Data collection
▸ Same set of tweets used for training AND testing: biased results
▸ Data set not representative of black-market tweets
▹ Unaccounted cases:
▸ Indirect retweets via a “popular” user? Ratio of unreachable retweeters ↑
THANKS! Any questions?