Leveraging Asset Reputation Systems to Detect and Prevent Fraud and Abuse at LinkedIn Jenelle Bray Staff Data Scientist Strata + Hadoop World New York,


1 Leveraging Asset Reputation Systems to Detect and Prevent Fraud and Abuse at LinkedIn Jenelle Bray Staff Data Scientist Strata + Hadoop World New York, NY October 1, 2015

2 Why do people abuse LinkedIn? Trusted site Money!

3 Types of abuse at LinkedIn Scraping Spam Scams Fraud

4 How is the abuse carried out? Fake accounts Automation Account Take Over

5 How do we stop the abuse? Supervised models – have training data (e.g. restricted accounts) Heuristic rules with thresholds determined from examining the data (e.g. scraping)

6 Online models Run in production in real time Registration model – stop fake accounts Login model – stop account take over Scraping models – stop automated stealing of data Invitation spam Message spam

7 Offline models Run in Hadoop with data delay Look at longer term patterns of behavior to stop the bad behavior not stopped online Look at clusters of bad behavior Latency doesn’t matter, so can use more data and more complicated algorithms and features

8 How to measure how well we’re doing? What we got right divided by what we got wrong: (Fake accounts caught by models) / (Real accounts incorrectly caught by models + Fake accounts caught manually)
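
The ratio on this slide can be sketched in a few lines of Python. The function name, argument names, and the handling of a zero denominator are assumptions made for illustration; the slide gives only the ratio itself.

```python
def catch_ratio(caught_by_models, real_incorrectly_caught, caught_manually):
    """What we got right divided by what we got wrong (hypothetical helper)."""
    wrong = real_incorrectly_caught + caught_manually
    if wrong == 0:
        # Assumption: with no errors at all, report infinity if anything was caught.
        return float("inf") if caught_by_models > 0 else 0.0
    return caught_by_models / wrong


# Made-up counts, for illustration only.
print(catch_ratio(caught_by_models=900, real_incorrectly_caught=40, caught_manually=60))
```

A higher value is better: it rises when the models catch more fake accounts and falls with both false positives and fakes that had to be caught by hand.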

9 We’re getting better at catching fake accounts How to get even better?

10 Asset reputation systems! Take advantage of our data – use the history of past abuse Predict the likelihood of abuse given past abuse seen on – IPs – ASNs – Countries – Email domains – Browser types – Profile picture – Etc.

11 Offline vs online reputation systems Offline: more complicated features; takes into account all types of abuse across LinkedIn; uses long-term abuse data; catches persistent patterns of abuse. Online: simple features (counters and ratios); specific to the model used (reg counters at reg model); takes into account very recent data; catches new patterns of abuse.

12 Leveraging reputation systems Use reputation scores or labels as features in offline and online models Online, use reputation scores and labels to create rules Example rules: – Add friction to registrations from low reputation email domains – Force login for guest traffic from low reputation IP addresses

13 Example: Email domain reputation Every member at LinkedIn is associated with at least one email address For each email domain, want a label of corporate, public, or abusive Use data from profiles, restrictions, email confirmation and bounces, connections, etc. to train models to label and score email domains Use the domain reputation at registration to decide how much friction to give

14 Example: Browser type reputation Online, suspicious if see a burst of activity from a browser type Offline, need to label browser types as good bots, bad bots, rare and common browser types

15 Deep dive: IP reputation

16 Why do we care about IPs? Helps us group activity An IP that has a history of abuse is more likely to abuse LinkedIn in the future

17 Every member action on LinkedIn involves an IP address We have data from more than 380 million members: – Viewing pages – Sending messages – Sending invitations – Getting restricted by our fake account models – Getting scored by our abuse models

18 Every guest action also involves an IP address Public page views by guests (users not logged in) Registrations and registration attempts Logins and login attempts Getting scored by our abuse models

19 Defining an IP reputation score Want to give something like a probability (between 0 and 1) of abuse from a given IP Take into account different types of abuse seen on LinkedIn Need to balance bad member and bad guest activity

20 Signals contributing to IP reputation score Member IP abuse score is: – Percent of restricted members Guest IP abuse score is max of – If IP has been caught scraping – Percent of increased friction (challenges) shown at registration – Percent of increased friction at login The maximum allows new features to be added easily
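
A minimal sketch of the two scores described on this slide. All function and parameter names are hypothetical, percentages are expressed as fractions between 0 and 1, and mapping "has been caught scraping" to a score of 1.0 is an assumption, since the slide does not say how that signal is scaled.

```python
def member_abuse_score(restricted_members, total_members):
    """Fraction of members seen on the IP that were restricted."""
    return restricted_members / total_members if total_members else 0.0


def guest_abuse_score(caught_scraping, reg_challenges, reg_attempts,
                      login_challenges, login_attempts):
    """Maximum over the guest signals; a new signal is just one more list entry."""
    signals = [
        1.0 if caught_scraping else 0.0,  # assumption: scraping counts as 1.0
        reg_challenges / reg_attempts if reg_attempts else 0.0,
        login_challenges / login_attempts if login_attempts else 0.0,
    ]
    return max(signals)
```

Because the guest score is a maximum rather than a weighted sum, adding a signal does not require re-tuning the existing ones.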

21 Balancing member and guest signals Weight member and guest abuse by the fraction of member and guest pageviews Exploring different weightings – possibly just max
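
One way to read the weighting on this slide, as a hedged sketch: the combined score is a pageview-weighted average of the member and guest scores, with the plain maximum as the alternative being explored. The function names and the zero-traffic fallback are assumptions.

```python
def ip_reputation_score(member_score, guest_score, member_pageviews, guest_pageviews):
    """Weight each score by its share of the IP's pageviews."""
    total = member_pageviews + guest_pageviews
    if total == 0:
        return 0.0  # assumption: no traffic means no evidence of abuse
    member_weight = member_pageviews / total
    return member_weight * member_score + (1.0 - member_weight) * guest_score


def ip_reputation_score_max(member_score, guest_score):
    """Alternative weighting mentioned on the slide: just take the maximum."""
    return max(member_score, guest_score)
```

With the weighted form, an IP whose traffic is mostly well-behaved members is not dominated by a small amount of bad guest traffic; the maximum form is stricter.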

22 Predicting future IP reputation Given the past IP reputation scores, predict likelihood of abuse on the IP in the future How far back in the past to look? – 3 months Calculate score for each week Take most recent non-zero abuse score as predicted score for the current week Recalculate daily
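
The prediction rule on this slide (take the most recent non-zero weekly score) can be sketched as follows. The function name and the oldest-to-newest list layout are assumptions; a three-month lookback would give roughly 13 weekly entries.

```python
def predict_current_week(weekly_scores):
    """Return the most recent non-zero score, or 0.0 if the IP has no abuse history.

    weekly_scores is assumed to be ordered from oldest week to newest week.
    """
    for score in reversed(weekly_scores):
        if score > 0:
            return score
    return 0.0
```

For example, `predict_current_week([0.0, 0.4, 0.0, 0.0])` returns `0.4`: two quiet weeks do not erase earlier abuse seen within the lookback window.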

23 Model performance Trained on 3 months of data to predict for one day Left 20% out for validation R² of 0.84 on validation set
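
For reference, the coefficient of determination reported here is conventionally computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that standard definition; the talk does not show how the figure was actually computed, and the function name is an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means the predictions match the held-out scores exactly, and 0.0 means they do no better than always predicting the mean.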

24 Choosing a threshold Choose a threshold defining an Abusive IP False positives (labeling a good IP as bad) are worse than false negatives (labeling a bad IP as good) In most use cases, use abuse label (in case distribution changes) Special cases can use the score itself
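
Since false positives are treated as the more costly error, one plausible way to pick the cutoff is to take the lowest threshold whose false positive rate on labeled data stays under a target. This is an illustrative assumption, not the procedure described in the talk; all names are hypothetical.

```python
def choose_threshold(scores, is_abusive, max_false_positive_rate):
    """Lowest threshold such that the share of good IPs labeled abusive is within the target.

    An IP is labeled abusive when its score is >= the threshold.
    Returns None if no candidate threshold meets the target.
    """
    good_total = sum(1 for bad in is_abusive if not bad)
    for threshold in sorted(set(scores)):
        false_positives = sum(
            1 for s, bad in zip(scores, is_abusive) if s >= threshold and not bad
        )
        fpr = false_positives / good_total if good_total else 0.0
        if fpr <= max_false_positive_rate:
            return threshold
    return None
```

Downstream rules would then consume the resulting label rather than the raw score, which keeps them stable if the score distribution shifts and the threshold is re-chosen.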

25 New IPs A new IP is more suspicious than a good IP, but less suspicious than an abusive IP Create rules to give friction to bursts of activity on new IPs

26 Number of members behind an IP An IP with a lot of (nonrestricted) members is generally better than one with few or no members Store the number of members seen on each IP along with the reputation score and label Leverage member count in online rules
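
A hedged sketch of how an online rule might combine the reputation label, a burst counter, and the stored member count from the last two slides. The label strings, every threshold, and the multipliers are made-up values for illustration; the talk does not give the actual rules.

```python
def needs_friction(ip_label, registrations_last_hour, member_count,
                   new_ip_burst_limit=3, large_ip_member_count=100):
    """Decide whether a registration from this IP should get extra friction."""
    if ip_label == "abusive":
        return True
    if ip_label == "new":
        # New IPs are more suspicious than good ones: a small burst is enough.
        return registrations_last_hour > new_ip_burst_limit
    # Known, non-abusive IP: tolerate larger bursts when many members share it.
    multiplier = 10 if member_count >= large_ip_member_count else 3
    return registrations_last_hour > new_ip_burst_limit * multiplier
```

With these illustrative defaults, 20 registrations in an hour pass without friction on a good IP shared by 500 members but trigger friction on a good IP with only 10.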

27 Leveraging IP reputation Use IP reputation label in the online models along with other signals to decide when to give challenges or restrict Use IP reputation score and/or labels in the offline models as a feature

28 IP reputation system infrastructure [Diagram, divided into an Offline side and an Online side. Labeled components: User Request; Online Models; Decision & Scoring; Accept or Reject; Tracking Data; Daily Offline Workflow; Offline IP Reputation Scores; IP Reputation Voldemort Store.]

29 IP reputation at registration Give increased friction to sign ups coming from abusive IP addresses Increased phone challenge coverage of suspicious registrations by 20%

30 Accounts made from bad IPs more likely to be fake Accounts given a phone challenge at registration from abusive IPs are half as likely to solve the phone challenge as regular accounts

31 Examples of other online models using IP reputation Login – stricter challenges (email instead of captcha) for suspicious logins (suspected account takeovers) from abusive IPs Phone challenge – Allow fewer attempts to solve phone challenge from bad IPs

32 False positives Some good IPs (e.g. schools) can have bursts of activity that make them look bad to a specific model, thus making their score bad Either whitelist or have separate models to classify good IPs (e.g. CORP IPs)

33 Rolling up IP reputation scores Roll up IP reputation into ASN and country reputation scores Take average of IP reputation score for each IP in the larger entity – Don’t weight on total number of pageviews for each IP - biased towards catching scraping
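
The roll-up described here is an unweighted mean over the IPs in each larger entity. A minimal sketch, assuming the scores and the IP-to-entity mapping are available as dictionaries (the names and data layout are hypothetical):

```python
from collections import defaultdict


def roll_up(ip_scores, ip_to_group):
    """Average IP reputation per group (ASN or country), one vote per IP.

    Deliberately not weighted by pageviews, so high-volume scraping IPs
    do not dominate the group score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ip, score in ip_scores.items():
        group = ip_to_group.get(ip)
        if group is None:
            continue  # assumption: skip IPs with no known group
        totals[group] += score
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Documentation-range IPs and ASNs, with made-up scores.
scores = {"192.0.2.1": 0.2, "192.0.2.2": 0.6, "198.51.100.7": 0.9}
groups = {"192.0.2.1": "AS64500", "192.0.2.2": "AS64500", "198.51.100.7": "AS64501"}
print(roll_up(scores, groups))
```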

34 Instantaneous IP reputations In online models for registration, login, etc., we also have instantaneous IP reputations – Calculated from very recent data – Uses counters or ratios like number of registration attempts from the IP in the last hour, or number of logins from the IP in the last day – Detects new abuse occurring since the last long-term IP reputations were calculated
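
A counter such as "registration attempts from the IP in the last hour" can be sketched as a per-IP sliding window. This in-memory version is only an illustration under the assumption that events arrive in timestamp order; the class and method names are hypothetical, and the talk does not describe how the production counters are built.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per IP within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, ip, timestamp):
        # Timestamps are assumed to be non-decreasing for each IP.
        self._events[ip].append(timestamp)

    def count(self, ip, now):
        events = self._events[ip]
        # Drop events that have fallen out of the window.
        while events and events[0] <= now - self.window_seconds:
            events.popleft()
        return len(events)
```

One instance with a 3600-second window would cover hourly registration attempts and another with an 86400-second window would cover daily logins; ratios can be formed by dividing two such counts.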

35 Conclusions Reputation models help us detect and stop abuse at LinkedIn Offline reputation systems can use a lot of historical data from all different sources Online reputation systems are simple counters or ratios, but take into account very recent events

36 Questions? jbray@linkedin.com

