Group 4: 1. Maithili Gokhale 2. Swati Sisodia 3. Aman Chanana 4. Piyush Agade “Uncovering Social Network Sybils in the Wild” - Zhi Yang, Christo Wilson, Xiao Wang, Tingting Gao, Ben Y. Zhao, Yafei Dai
The Renren Network o Renren is one of the most popular OSNs in China (220 million users). o Functions: users maintain personal profiles, upload photos, write diary entries (blogs), and establish bidirectional social links with friends. o The most popular type of user activity is sharing blog entries, which can be forwarded across social hops like “retweets” on Twitter.
What are Sybils? o Sybils are fake identities created to unfairly increase the power or resources of a single malicious user. o Sybil accounts on Renren blend in extremely well with normal users so that they can attract friends and disseminate advertisements. o They have completely filled user profiles with realistic background information, coupled with attractive profile photos. o As its user population has grown, Renren has become an attractive venue for companies to disseminate information about their products. o This has created opportunities for Sybil accounts to spam advertisements for companies.
Previous detectors on Renren o Previously, Renren had already deployed a few techniques to detect Sybil accounts: (1) using thresholds to detect spamming; (2) scanning content for suspect keywords and blacklisted URLs; (3) providing Renren users with the ability to flag accounts and content as abusive. o Disadvantages of these techniques: (1) generally ad hoc; (2) require significant human effort; (3) effective only after spam content has been posted.
Identifying Malicious Activities o Definition: Malicious activities are actions taken by an attacker that directly or indirectly support a monetization strategy. o Example: targeting users with spam and phishing attacks. o The definition does not cover legitimate monetization strategies, such as keyword, banner, or news-feed advertising. o In order for attackers to reach a user on OSNs, the attacker must first be friends with that user.
Profiles that are NOT considered o Benign Fake Accounts: Although an attacker could create benign Sybils that behave identically to normal users and appear on the surface to be real, we are only interested in detecting Sybil accounts that perform attacks. o Inactive Accounts: Determining whether an inactive account is a malicious Sybil is challenging because there is no behavioral data (e.g., friend requests, status updates). The goal of the detector is to catch these accounts as quickly as possible once they become active to minimize the amount of damage they can do to normal users.
Characterizing Sybil Accounts The features that would help distinguish Sybil accounts from normal users are: o Invitation Frequency o Outgoing Requests Accepted o Incoming Requests Accepted o Clustering Coefficient
Characterizing Sybil Accounts 1) Invitation Frequency o The number of friend requests that a user has sent within a fixed time period o Figure shows the friend invitation frequency of our dataset, averaged over long- term (400-hour) and short-term (1-hour) time scales. o Sybil accounts are much more aggressive in sending requests than normal users. There is a clear separation: accounts sending more than 20 invites per time interval are Sybils.
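A minimal sketch of the invitation-frequency feature, assuming each account's friend requests are available as timestamps in hours (a hypothetical input format). It counts requests in fixed, non-overlapping windows and compares the busiest window against the 20-invite separation point from the slide; taking the maximum rather than an average is a simplification made here.

```python
from collections import Counter

def max_invites_per_window(request_times_h, window_h=1.0):
    """Largest number of friend requests sent in any fixed window.

    request_times_h: timestamps in hours (hypothetical input format).
    Windows are fixed, non-overlapping buckets of length window_h.
    """
    buckets = Counter(int(t // window_h) for t in request_times_h)
    return max(buckets.values(), default=0)

def exceeds_invite_cutoff(request_times_h, window_h=1.0, cutoff=20):
    # Separation point from the slide: more than 20 invites per interval.
    return max_invites_per_window(request_times_h, window_h) > cutoff
```

Passing `window_h=400.0` gives the long-term view; the same cutoff may not suit both time scales.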
Characterizing Sybil Accounts 2) Outgoing Requests Accepted o It is the fraction of outgoing friend requests confirmed by the recipient. o Figure shows a distinct difference between Sybils and normal users o Non-Sybil users have high accepted percentages, with an average of 79%. o On average, only 26% of all friend requests sent by Sybil accounts are accepted.
Characterizing Sybil Accounts 3) Incoming Requests Accepted o It is the fraction of incoming friend requests that a user accepts. o Sybil accounts are nearly uniform: they accept almost all incoming friend requests (e.g., 80% of Sybils accepted all friend requests). o The incoming requests accepted by non-Sybil users are spread across the board. o Because Sybil accounts receive few friend requests, a detection mechanism based on this feature can incur significant delay.
Characterizing Sybil Accounts 4) Clustering Coefficient o It is a graph metric that measures the mutual connectivity of a user’s friends. o Sybil accounts are likely to befriend users who have no mutual friendships. o Figure plots the CDF of cc values for each user’s first 50 friends (sorted by time). o Non-Sybil users have cc values orders of magnitude larger than Sybil users.
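The clustering coefficient can be sketched as follows, assuming the social graph is held as a dictionary that maps each user to a list of friends ordered by creation time (a hypothetical structure chosen for illustration). Only the first 50 friends are considered, as on the slide.

```python
from itertools import combinations

def clustering_coefficient(user, friends_of, first_k=50):
    """Fraction of pairs among the user's first_k friends that are friends.

    friends_of: dict mapping user -> list of friends sorted by creation
    time (hypothetical structure).
    """
    friends = friends_of.get(user, [])[:first_k]
    if len(friends) < 2:
        return 0.0
    pairs = list(combinations(friends, 2))
    linked = sum(1 for a, b in pairs if b in friends_of.get(a, []))
    return linked / len(pairs)
```

An account whose friends do not know each other scores 0, which is the pattern the slide attributes to Sybils.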
Building and Running a Sybil Detector o A Support Vector Machine (SVM) classifier is applied to a dataset of 1,000 normal users and 1,000 Sybils. o Partition: the dataset is split into five subsamples, four for training the classifier and one for testing it. o The results show that the classifier is very accurate, correctly identifying 99% of both Sybil and non-Sybil accounts. o Threshold values: outgoing requests accepted % < 0.5 ∧ invitation frequency > 20 ∧ cc < 0.01 o A properly tuned threshold-based detector can achieve performance similar to the computationally expensive SVM.
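A threshold-based detector of this kind might look like the sketch below. It assumes a conjunction over three of the features characterized earlier; the `Features` class is hypothetical, and the default cutoffs (in particular the 0.5 acceptance cutoff) are illustrative assumptions rather than confirmed deployed values.

```python
from dataclasses import dataclass

@dataclass
class Features:
    invite_freq: float  # friend requests sent per time interval
    out_accept: float   # fraction of outgoing requests accepted (0..1)
    cc: float           # clustering coefficient of the first friends

def is_sybil(f, max_accept=0.5, min_freq=20, max_cc=0.01):
    """Flag an account only when all three conditions hold."""
    return (f.out_accept < max_accept
            and f.invite_freq > min_freq
            and f.cc < max_cc)
```

Because the rule is a conjunction, a normal user who trips a single condition is not flagged, which keeps false positives low at the cost of missing cautious Sybils.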
Real time Sybil Detection o Uses the ground-truth dataset to bootstrap an adaptive, threshold-based Sybil detector. o Monitors the characteristics of Sybil accounts. o After the detector has been bootstrapped, it uses an adaptive feedback scheme (drawn from the customer complaint rate) to dynamically tune the threshold parameters on the fly. o Tuning the thresholds minimizes the likelihood of false-positive classifications of normal accounts as Sybils. o It is unlikely to detect Sybils that behave like normal users. o Drawback: will not catch benign inactive Sybils. Inactive Sybils will not be detected until after they begin friending normal users.
Real time Sybil Detection o The detector incorporates real-time changes in friendship links when calculating acceptance percentages. o In some cases, normal users accept friend requests from Sybils only to later revoke the friendship. This causes the accept percentage for the Sybil to drop. o When Renren bans Sybils, all of their edges are destroyed. o This causes the acceptance percentages for other Sybils with which they are linked to drop. o In both cases, the decrease in acceptance percentage helps the detector to more accurately detect Sybils.
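The effect of revoked or destroyed edges can be illustrated with a small sketch. It assumes the acceptance percentage is recomputed from the friendships that exist right now, not from historical accept events; the argument names are hypothetical.

```python
def acceptance_percentage(sent_requests, current_friends):
    """Share of a user's outgoing requests that are friendships right now.

    sent_requests: recipients of the user's friend requests.
    current_friends: set of accounts currently linked to the user.
    """
    if not sent_requests:
        return 0.0
    live = sum(1 for target in sent_requests if target in current_friends)
    return 100.0 * live / len(sent_requests)
```

If one of two accepting users later revokes the friendship, the value for four sent requests drops from 50% to 25%, moving the account closer to the detection threshold.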
False Positives
Analysis of Structural and Behavioral Attributes of Sybils 1.Topological Analysis 2. Clickstream Analysis
Topological Analysis [Figure: network diagram of honest nodes and Sybil nodes, connected by normal edges, Sybil edges, and attack edges]
Topological Analysis: Community Detection o Given the network structure, is it possible to detect Sybil nodes? o Community detection algorithms work under the assumption that Sybils form tight-knit communities.
Topological Analysis o Normal users follow the same general trend as Sybil users. o Only 20% of Sybils have one or more Sybil edges.
Topological Analysis Is it still possible that the connected minority are vulnerable to community detection ? o Community detection is not a viable option o Is this edge creation intentional ?
Topological Analysis o Most Sybil edge creation is interspersed randomly with edges created to normal users. o For each Sybil, sequence of edges is plotted, with the edges sorted chronologically by creation time.
Topological Analysis o Majority of Sybils do not form communities. o Even the Sybil Edges that are formed are unintentional.
Clickstream Analysis o Each click is characterized by USER ID : TIMESTAMP : URL o Clicks were grouped into five categories: Photo, Message, Share, Friending, Profile o Various aspects of the clickstream were analyzed: (1) Number of clicks for each category; (2) Sequence of clicks for a particular session; (3) Session Duration: time between first and last click; (4) Session Frequency: how often a user logs in
Clickstream Analysis: Session Frequency o Sixty-four percent of normal users access Renren no more than once per day. o Only 8% of Sybils fall in this low-frequency range. o Sybils averaged 3.9 sessions per day versus 1.5 for normal users.
Clickstream Analysis: Session Duration o The median session duration for normal users is 6 minutes, whereas the median for Sybils is 48 seconds. o Fewer than 25% of normal sessions are as short as 48 seconds. o A very small percentage of Sybils exhibit sessions that are hours long.
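Session duration can be derived from raw click timestamps roughly as follows. The slides do not say how sessions are delimited, so this sketch assumes a new session starts after an idle gap of more than 20 minutes; both that cutoff and the function names are assumptions.

```python
def split_sessions(timestamps, max_gap_s=1200):
    """Group click timestamps (seconds) into sessions.

    A new session starts when the gap to the previous click exceeds
    max_gap_s (an assumed cutoff, not taken from the slides).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap_s:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def session_durations(timestamps, max_gap_s=1200):
    # Duration = time between the first and last click of each session.
    return [s[-1] - s[0] for s in split_sessions(timestamps, max_gap_s)]
```

Session frequency follows from the same grouping: count the sessions that start on each day.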
Clickstream Analysis Click Activity
Clickstream Analysis: Clickstream Modelling To analyze the sequences of clicks from normal and Sybil nodes, a Markov model was created. o Each state represents a click category. o Initial and final states are added to mark the beginning and end of each click sequence. o Each edge represents the probability of transition from one state to the next.
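A model of this shape can be estimated by counting transitions, as in the sketch below. It assumes each session is already a list of category names; the `INITIAL` and `FINAL` state labels are placeholders chosen here.

```python
from collections import Counter, defaultdict

START, END = "INITIAL", "FINAL"

def fit_markov(sequences):
    """Estimate transition probabilities from click-category sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Wrap every sequence with the added initial and final states.
        path = [START, *seq, END]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}
```

Fitting one model on normal users' sessions and one on Sybils' sessions gives the two models that the later classification step compares.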
Clickstream Analysis
There is a stark difference in the click activity, click sequences, and sessions of normal and Sybil users. Can this difference be leveraged?
Clickstream Analysis: SVM (Support Vector Machine) Train an SVM on the following clickstream features: o Session-level features: average session length; average sessions per day o Features from click activities: percentage of clicks in each category; transition probabilities between categories
Clickstream Analysis: MLE (Maximum Likelihood Estimation) MLE categorizes a user from their clickstream by examining which clickstream model better explains the user’s click sequence. o For a click sequence $\{s_1, s_2, \ldots, s_n\}$, the individual likelihood $P_M(s_i, s_{i+1})$ is the probability that the user transits from category $s_i$ to category $s_{i+1}$ according to model $M$. o The likelihood that model $M$ reproduces the given clickstream is $\prod_{i=1}^{n-1} P_M(s_i, s_{i+1})$.
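The comparison can be sketched as below. The sketch sums log probabilities rather than multiplying them (to avoid underflow on long sequences) and gives unseen transitions a small floor probability; both choices, and the toy transition values, are assumptions made for illustration.

```python
import math

def log_likelihood(seq, model, floor=1e-6):
    """Sum of log transition probabilities; unseen transitions get `floor`."""
    return sum(math.log(model.get(a, {}).get(b, floor))
               for a, b in zip(seq, seq[1:]))

def classify(seq, normal_model, sybil_model):
    # Pick whichever model explains the click sequence better.
    ln = log_likelihood(seq, normal_model)
    ls = log_likelihood(seq, sybil_model)
    return "sybil" if ls > ln else "normal"

# Toy models with made-up probabilities (not measured values).
NORMAL = {"Photo": {"Photo": 0.8, "Friending": 0.2},
          "Friending": {"Photo": 0.9, "Friending": 0.1}}
SYBIL = {"Photo": {"Photo": 0.1, "Friending": 0.9},
         "Friending": {"Photo": 0.1, "Friending": 0.9}}
```

With these toy models, a run of consecutive Friending clicks is better explained by the Sybil model, while a photo-heavy sequence is better explained by the normal model.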
Spam Strategies and Collusion o Share Spam on Renren o Case Study: Spam Blogs o Content-Based Sybil Components o Temporal Correlation Between Sybils
Share Spam on Renren o Sybils dominantly share links to spam content to disseminate spam. o Shares per Sybil is much greater than status updates or wall posts.
Share Spam on Renren o 25% of the 237K Sybils share once before they are caught and banned. o Less than 1% of Sybils go uncaught long enough to share 100 or more links.
Share Spam on Renren o The shares of a random sample of 1,000 Sybils were manually examined. o Sybils on Renren share two types of links: o Blogs (62.5% of shares link to spam blog posts) o Videos (37.5% of shares link to bogus online videos)
Case Study: Spam Blogs o Classifying Spam Blogs o Identifying Collusion o Information Dissemination
Classifying Spam Blogs The subset of blogs shared by Sybils were manually verified to be spam. These blogs: o Include links to phishing sites. o Include links to websites selling contraband goods o Majority of them were banned by Renren’s security system.
Identifying Collusion o Fundamental question: are Sybils colluding to promote spam blogs, or is each Sybil operating independently? o Answer: the amount of duplication among the spam blogs was calculated. o Only 302,333 unique spam blogs were promoted, among the 3 million individual spam shares in the dataset.
Identifying Collusion o Top 30 spam blogs were shared more than 10,000 times. o 25% of spam blogs received 2 or more shares from Sybils.
Information Dissemination o Sybils collude so that the spam blogs get featured on the trending content section on Renren. o Sybils can inflate the popularity of spam blogs by making them artificially trend. o Currently, Renren relies on manual inspection by humans to filter spam out of the trending section.
Content-Based Sybil Components o Can content similarity be used to group Sybils into connected components? o Intuitively, a single attacker is likely to control strongly connected components. o Understanding these components allows us to estimate the number of attackers threatening Renren. o Collusion between Sybils is modeled as a content similarity graph.
Content-Based Sybil Components o In a content similarity graph, Sybils are nodes and two Sybils are connected if they share similar content. o Content similarity between two sets $s_i$ and $s_j$ is $s_{ij} = |s_i \cap s_j| / |s_i \cup s_j|$, where $s_i$ and $s_j$ are the sets of content shared by the two Sybils, respectively. o It ranges from 0 to 1, where o 0: no duplication o 1: the Sybils share exactly the same content
Content-Based Sybil Components o Two Sybils i and j share similar content if $s_{ij}$ is larger than some threshold $T_s$ (or equal to $T_s$ in the special case of $T_s = 1$). o $T_s = 0$ is the most lax threshold. o $T_s = 1$ is the strictest threshold. o For $T_s = 1$, more than 50% of Sybils have at least one Sybil partner forwarding exactly the same content.
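The content similarity graph and its connected components can be sketched as follows, assuming the similarity is the Jaccard coefficient (which matches the 0-to-1 range described) and that each Sybil's shares are available as a set (a hypothetical input). The pairwise loop is quadratic and is meant only to illustrate the rule, not to scale to hundreds of thousands of accounts.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def components(shared, t_s):
    """Connected components of the content similarity graph.

    shared: dict mapping Sybil id -> set of shared items (hypothetical).
    Two Sybils are linked if similarity > t_s, or == 1 when t_s == 1.
    """
    parent = {s: s for s in shared}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(shared, 2):
        sim = jaccard(shared[i], shared[j])
        if sim > t_s or (t_s == 1 and sim == 1):
            parent[find(i)] = find(j)

    groups = {}
    for s in shared:
        groups.setdefault(find(s), set()).add(s)
    return sorted(groups.values(), key=len, reverse=True)
```

Raising the threshold splits large components into smaller ones, which is the trend reported for the thresholds 0, 0.5, and 1.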
Content-Based Sybil Components o Figure shows the quantity and sizes of connected components for different thresholds, ordered from largest to smallest.
T_s | Connected components | Giant component
0 | 4.9K | 219K (90%) Sybils
0.5 | 76K | 84K (35%) Sybils
1 | 114K | 3,700 Sybils
Temporal Correlation Between Sybils o Are there temporal correlations between Sybils that exhibit content similarity? o We suspect that Sybils under the control of a single attacker will be active at similar times. o If $t_i$ and $t_j$ are the sets of links that two Sybils i and j share during time interval S, the temporal similarity between them is defined as
Temporal Correlation Between Sybils o Temporal similarity ranges from 0 to 1, with 0 meaning no overlap and 1 meaning exact overlap. o The size of the time interval S can be varied to control the granularity of comparisons. o We evaluate time similarity over two time intervals: 1 hour and 1 day.
Temporal Correlation Between Sybils o Each line plots the average time similarity for discrete sets of Sybil pairs with close content similarity. o For example, the first point of the hour-scale line represents the average time similarity for all pairs of Sybils with content similarity in the range of 0 to 0.1.
Temporal Correlation Between Sybils o Figure reveals that time similarity is roughly proportional to content similarity. o Sybils that share similar content tend to do so at similar times. o At the 1-day time scale, Sybils that share near-identical content also exhibit a time similarity of nearly 0.92.
Making Sybil Defense Future-proof o We discussed a scalable, and accurate system that has been really effective in detecting Sybils in Renren OSN. o Can attackers try adapting and circumvent the defense strategy discussed earlier? o If yes, what are the options that an attacker has? What can an attacker control and manipulate? o Invitation frequency? o Incoming requests acceptance rate? o Outgoing requests acceptance rate? o Clustering Coefficient?
Making Sybil Defense Future-proof o Outgoing requests acceptance rate? o Clustering Coefficient? o The only way a Sybil can influence these two features is by forming tight-knit communities with other Sybils. o What will sending friend requests to other Sybils accomplish? o Other Sybils will accept the requests, so the outgoing acceptance rate of the sender will be inflated. o A tight community of Sybils will imply a high clustering coefficient.
Have the Sybils won? There should be something more that could be done. Fortunately, there is!
Making Sybil Defense Future-proof
Making Sybil Defense Future-proof o The new attack model was simulated on a regional network in Renren with 170K nodes; the resulting Sybil graph structure changed according to the input parameters. o In the simulations, two models for directing the creation of Sybil edges were used: o Erdős-Rényi: the attacker links randomly chosen Sybils. o Preferential Attachment: the destination of each Sybil edge is chosen proportionally to the destination Sybil’s degree. o In the simulations, α = 0.26, β = 0.5, and p = 0.33, and Blondel’s algorithm was used to detect communities in the regional graph.
Making Sybil Defense Future-proof o A table of results was obtained for various values of n and N, where N is the number of Sybil nodes and n is the number of friend requests sent per Sybil node. o The results are mixed. For n ≤ 300, the community detector is able to identify Sybils with high accuracy. However, as n grows, so does the false-positive rate. * Uncovering Social Network Sybils in the Wild, Zhi Yang, et al.
So community detection algorithms alone are not as precise as we want them to be: as n increases, the number of false positives increases.
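The two edge-creation models can be sketched as follows. This is an illustrative reading of the descriptions above, not the simulation code used in the study: the function names are hypothetical, and the `degree + 1` weighting is an assumption added so that Sybils with no edges yet can still be chosen.

```python
import random

def erdos_renyi_edges(sybils, n_edges, rng):
    """Link uniformly random pairs of Sybils (n_edges must be feasible)."""
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(sybils, 2)
        edges.add((min(a, b), max(a, b)))
    return edges

def preferential_edges(sybils, n_edges, rng):
    """Pick each destination with probability proportional to degree + 1."""
    degree = {s: 0 for s in sybils}
    edges = set()
    while len(edges) < n_edges:
        src = rng.choice(sybils)
        others = [s for s in sybils if s != src]
        weights = [degree[s] + 1 for s in others]
        dst = rng.choices(others, weights=weights)[0]
        edge = (min(src, dst), max(src, dst))
        if edge not in edges:
            edges.add(edge)
            degree[src] += 1
            degree[dst] += 1
    return edges
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing community-detection results across the two models.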
Making Sybil Defense Future-proof o In order for Sybil community detectors to be accurate (i.e., not generate false positives), they must leverage additional features beyond the graph topology (detecting communities). o External Acceptance Rate – The external acceptance percentage is the fraction of friend requests sent by members of a community to users outside the community that are accepted. This should work. Why? Because for Sybils the vast majority of accepted friend requests are from other Sybils inside the local community. Conversely, rejections are from normal users outside the local community.
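The external acceptance rate of a detected community can be computed roughly as below, assuming friend requests are logged as `(sender, recipient, accepted)` tuples (a hypothetical format).

```python
def external_acceptance_rate(community, requests):
    """Fraction of requests leaving the community that were accepted.

    community: set of member ids.
    requests: iterable of (sender, recipient, accepted) tuples
    (hypothetical log format).
    """
    external = [ok for sender, recipient, ok in requests
                if sender in community and recipient not in community]
    return sum(external) / len(external) if external else 0.0
```

Requests between members are ignored, so Sybils cannot raise this value by accepting each other's invitations.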
Conclusion o We discussed the behaviour of Sybils and used it to build a feature-based Sybil detector that catches 99% of Sybils, with low false-positive and false-negative rates. o Next we saw a characterization of Sybil graph topology on a major OSN (Renren). We found that Sybils on Renren do not obey the behavioural assumptions that underlie previous work on decentralized Sybil detectors: 80% of Sybils do not connect to other Sybils and instead focus on connecting with normal users. o We also analyzed Sybil clickstreams and learnt that Sybils do not waste time browsing photos; they prefer visiting profiles. o Finally, we learnt that social links between Sybils are inadequate for identifying colluding behaviour. Sybils with no social connections still act in concert to spread spam.
Question?
Thank you!