+ Collective Spammer Detection in Evolving Multi-Relational Social Networks Shobeir Fakhraei (University of Maryland) James Foulds (University of California,

Slides:



Advertisements
Similar presentations
CS188: Computational Models of Human Behavior
Advertisements

Complex Networks for Representation and Characterization of Images For CS790g Project Bingdong Li 9/23/2009.
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)
1 VLDB 2006, Seoul Mapping a Moving Landscape by Mining Mountains of Logs Automated Generation of a Dependency Model for HUG’s Clinical System Mirko Steinle,
Oracle Labs Graph Analytics Research Hassan Chafi Sr. Research Manager Oracle Labs Graph-TA 2/21/2014.
Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.
Facebook for RSVP’ers You can do it!. What Questions Do You Have? What are you wanting to learn at this training?
Foundations of Comparative Analytics for Uncertainty in Graphs Lise Getoor, University of Maryland Alex Pang, UC Santa Cruz Lisa Singh, Georgetown University.
UNDERSTANDING VISIBLE AND LATENT INTERACTIONS IN ONLINE SOCIAL NETWORK Presented by: Nisha Ranga Under guidance of : Prof. Augustin Chaintreau.
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
The max-divergence of E’ is: Intuitively, p-divergence of d means that the probability of at least X E’,p edges occurring p-recently is 1/d A (maximal)
BotGraph: Large Scale Spamming Botnet Detection Yao Zhao Yinglian Xie *, Fang Yu *, Qifa Ke *, Yuan Yu *, Yan Chen and Eliot Gillum ‡ EECS Department,
Jacinto C. Nascimento, Member, IEEE, and Jorge S. Marques
Defect prediction using social network analysis on issue repositories Reporter: Dandan Wang Date: 04/18/2011.
SocialFilter: Introducing Social Trust to Collaborative Spam Mitigation Michael Sirivianos Telefonica Research Telefonica Research Joint work with Kyungbaek.
BEGIN. 1. When was the Facebook found? A. February 14,2004. February 14,2004 B. February 04, 2004 C. February 05, 2004 Show Score.
Signatures As Threats to Privacy Brian Neil Levine Assistant Professor Dept. of Computer Science UMass Amherst.
Using Friendship Ties and Family Circles for Link Prediction Elena Zheleva, Lise Getoor, Jennifer Golbeck, Ugur Kuter (SNAKDD 2008)
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
SOCIAL NETWORKS AND THEIR IMPACTS ON BRANDS Edwin Dionel Molina Vásquez.
Social Media Attacks By Laura Jung. How the Attacks Start Popularity of these sites with millions of users makes them perfect places for cyber attacks.
Network and Systems Security By, Vigya Sharma (2011MCS2564) FaisalAlam(2011MCS2608) DETECTING SPAMMERS ON SOCIAL NETWORKS.
MARCH 27, Meeting Agenda  Prototype 1 Design Goals  Prototype 1 Demo  Framework Overview  Prototype 2 Design Goals  Timeline Moving Forward.
SI485i : NLP Set 5 Using Naïve Bayes.
Using Transactional Information to Predict Link Strength in Online Social Networks Indika Kahanda and Jennifer Neville Purdue University.
Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Stenography.
HyPER: A Flexible and Extensible Probabilistic Framework for Hybrid Recommender Systems Pigi Kouki, Shobeir Fakhraei, James Foulds, Magdalini Eirinaki,
Simple Shutterfly Classroom Website Creation Presented by Kristen Frame First Grade Teacher, Propel Montour
Discriminative Models for Spoken Language Understanding Ye-Yi Wang, Alex Acero Microsoft Research, Redmond, Washington USA ICSLP 2006.
Collective Classification A brief overview and possible connections to -acts classification Vitor R. Carvalho Text Learning Group Meetings, Carnegie.
Joint Models of Disagreement and Stance in Online Debate Dhanya Sridhar, James Foulds, Bert Huang, Lise Getoor, Marilyn Walker University of California,
Center for E-Business Technology Seoul National University Seoul, Korea BrowseRank: letting the web users vote for page importance Yuting Liu, Bin Gao,
Clever Framework Name That Doesn’t Violate Copyright Laws MARCH 27, 2015.
Data Mining Jim King. What is Data Mining?  A.k.a. knowledge discovery The search for previously unknown relationships in large data setsThe search for.
Leveraging Asset Reputation Systems to Detect and Prevent Fraud and Abuse at LinkedIn Jenelle Bray Staff Data Scientist Strata + Hadoop World New York,
Mtivity Client Support System Quick start guide. Mtivity Client Support System We are very pleased to announce the launch of a new Client Support System.
CHAPTER 6 Naive Bayes Models for Classification. QUESTION????
Approximate Inference: Decomposition Methods with Applications to Computer Vision Kyomin Jung ( KAIST ) Joint work with Pushmeet Kohli (Microsoft Research)
Link Prediction Topics in Data Mining Fall 2015 Bruno Ribeiro
Contribution and Proposed Solution Sequence-Based Features Collective Classification with Reports Results of Classification Using Reports Collective Spammer.
Context-Aware Query Classification Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-Tao Sun, Enhong Chen, Qiang Yang Microsoft Research Asia SIGIR.
Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks Authors: Pegna, J.M., Lozano, J.A., Larragnaga, P., and Inza, I. In.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Don’t Follow me : Spam Detection in Twitter January 12, 2011 In-seok An SNU Internet Database Lab. Alex Hai Wang The Pensylvania State University International.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.
By Sam Whisenhunt 11/14/2014. Featured Personalization Login with your personal account to keep up with your w/l ratio and bragging rights!
By Samantha Kozar.  What are social networks?  What is Facebook?  What is Gowalla?  What are the capabilities of these sites?  Privacy Settings 
AN INTRODUCTION TO FACEBOOK. Learning Objectives A brief introduction to the social networking site Facebook. Instructions to create an account. How to.
Using the World’s #1 social networking site to promote your event HOW TO USE FACEBOOK AS AN EVENT MARKETING TOOL.
Defect Prediction Techniques He Qing What is Defect Prediction? Use historical data to predict defect. To plan for allocation of defect detection.
Learning Bayesian Networks for Complex Relational Data
Data mining in web applications
Uncovering Social Spammers: Social Honeypots + Machine Learning
Social Media Attacks.
Welcome! Back To School for Families.
Social Media & You Let’s take a look at your social media use.
Welcome! Back To School for Families.
Data Mining Jim King.
Discover How Your Business Can Benefit from a Facebook Fanpage
Discover How Your Business Can Benefit from a Facebook Fanpage
Dieudo Mulamba November 2017
Using Friendship Ties and Family Circles for Link Prediction
COS 518: Advanced Computer Systems Lecture 12 Mike Freedman
Facebook Immune System
Speaker: Jim-An Tsai Advisor: Professor Jia-ling Koh
Prepared by: Mahmoud Rafeek Al-Farra
ADVANCED ANOMALY DETECTION IN CANARY TESTING
GhostLink: Latent Network Inference for Influence-aware Recommendation
Presentation transcript:

+ Collective Spammer Detection in Evolving Multi-Relational Social Networks Shobeir Fakhraei (University of Maryland) James Foulds (University of California, Santa Cruz) Madhusudana Shashanka (if(we) Inc., Currently Niara Inc.) Lise Getoor (University of California, Santa Cruz)

Spam in Social Networks Recent study by Nexgate in 2013: Spam grew by more than 300% in half a year 2

Spam in Social Networks Recent study by Nexgate in 2013: Spam grew by more than 300% in half a year 1 in 200 social messages are spam 3

Spam in Social Networks Recent study by Nexgate in 2013: Spam grew by more than 300% in half a year 1 in 200 social messages are spam 5% of all social apps are spammy 4

Spam in Social Networks What’s different about social networks? Spammers have more ways to interact with users 5

Spam in Social Networks What’s different about social networks? Spammers have more ways to interact with users Messages, comments on photos, winks,… 6

Spam in Social Networks What’s different about social networks? Spammers have more ways to interact with users Messages, comments on photos, winks,… They can split spam across multiple messages 7

Spam in Social Networks What’s different about social networks? Spammers have more ways to interact with users Messages, comments on photos, winks,… They can split spam across multiple messages More available info about users on their profiles! 8

Spammers are getting smarter! 9 Want some replica luxury watches? Click here: Traditional Spam: GeorgeShobeir

Spammers are getting smarter! 10 Want some replica luxury watches? Click here: Traditional Spam:   [Report Spam] GeorgeShobeir

Spammers are getting smarter! 11 Want some replica luxury watches? Click here: Traditional Spam:(Intelligent) Social Spam: Hey Shobeir! Nice profile photo. I live in Bay Area too. Wanna chat? Hey Shobeir! Nice profile photo. I live in Bay Area too. Wanna chat?   [Report Spam] GeorgeShobeir Mary

Spammers are getting smarter! 12 Want some replica luxury watches? Click here: Traditional Spam:(Intelligent) Social Spam: Hey Shobeir! Nice profile photo. I live in Bay Area too. Wanna chat? Hey Shobeir! Nice profile photo. I live in Bay Area too. Wanna chat? Sure! :)   [Report Spam] GeorgeShobeir Mary

Spammers are getting smarter! 13 Want some replica luxury watches? Click here: Traditional Spam:(Intelligent) Social Spam: Hey Shobeir! Nice profile photo. I live in Bay Area too. Wanna chat? Hey Shobeir! Nice profile photo. I live in Bay Area too. Wanna chat? Sure! :)   [Report Spam] GeorgeShobeir Mary … I’m logging off here., too many people pinging me! I really like you, let’s chat more here: I’m logging off here., too many people pinging me! I really like you, let’s chat more here: Mary Realistic Looking Conversation

Tagged.com Founded in 2004, is a social networking site which connects people through social interactions and games Over 300 million registered members Data sample for experiments (on a laptop): 5.6 Million users (3.9% Labeled Spammers) 912 Million Links 14

Social Networks: Multi-relational and Time-Evolving 15

Social Networks: Multi-relational and Time-Evolving 16 Legitimate users

Social Networks: Multi-relational and Time-Evolving 17 Legitimate users Spammers

Social Networks: Multi-relational and Time-Evolving Link = Action at time t Actions = Profile view, message, poke, report abuse, etc 18 Legitimate users Spammers

Social Networks: Multi-relational and Time-Evolving Link = Action at time t Actions = Profile view, message, poke, report abuse, etc 19

Social Networks: Multi-relational and Time-Evolving Link = Action at time t Actions = Profile view, message, poke, report abuse, etc 20 Profile view

Social Networks: Multi-relational and Time-Evolving Link = Action at time t Actions = Profile view, message, poke, report abuse, etc 21 Profile view Message

Social Networks: Multi-relational and Time-Evolving Link = Action at time t Actions = Profile view, message, poke, report abuse, etc 22 Profile view Message Poke

Social Networks: Multi-relational and Time-Evolving Link = Action at time t Actions = Profile view, message, poke, report abuse, etc 23 Profile view Message Poke Report spammer

Our Approach 24 Predict spammers based on: Graph structure Action sequences Reporting behavior

Our Approach 25 Predict spammers based on: Graph structure Action sequences Reporting behavior

Graph Structure Feature Extraction 26 Graphs for each relation

Graph Structure Feature Extraction 27 Graphs for each relation Features

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 28 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 29 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 30 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 31 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 32 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 33 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 34 (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 35 X (8 features for each of 10 relations)

Graph Structure Features Extract features for each relation graph es for each of 10 rel PageRank Degree statistics Total degree In degree Out degree k-Core Graph coloring Connected components Triangle count 36 Viewing profile Friend requests Message Luv Wink Pets game Buying Wishing MeetMe game Yes No Reporting abuse X (8 features for each of 10 relations)

Graph Structure Features 37 Graph Structure … PageRankTriangle CountIn-DegreeOut-Degreek-CoreGraph Coloring … PageRankTriangle CountIn-DegreeOut-Degreek-CoreGraph Coloring t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3) Classification method: Gradient Boosted Trees Viewing profileReporting abuse…

Graph Structure Features 38 ExperimentsAU-PRAU-ROC 1 Relation, 8 Feature types ± ± Relations, 1 Feature type ± ± Relations, 8 Feature types ± ± Multiple relations/features better performance!

Graph Structure Features 39 ExperimentsAU-PRAU-ROC 1 Relation, 8 Feature types ± ± Relations, 1 Feature type ± ± Relations, 8 Feature types ± ± Multiple relations/features better performance!

Graph Structure Features 40 ExperimentsAU-PRAU-ROC 1 Relation, 8 Feature types ± ± Relations, 1 Feature type ± ± Relations, 8 Feature types ± ± Multiple relations/features better performance!

Graph Structure Features 41 ExperimentsAU-PRAU-ROC 1 Relation, 8 Feature types ± ± Relations, 1 Feature type ± ± Relations, 8 Feature types ± ± Multiple relations/features better performance!

Our Approach 42 Predict spammers based on: Graph structure Action sequences Reporting behavior

Sequence of Actions Sequential Bigram Features: Short sequence segment of 2 consecutive actions, to capture sequential information 43 User1 Actions: Message, Profile_view, Message, Friend_Request, ….

Sequence of Actions Mixture of Markov Models (MMM): A.k.a. chain-augmented, tree-augmented naive Bayes 44

Sequence of Actions 45 Action Sequence … Bigram Features + Chain Augmented NB t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3)

Sequence of Actions ExperimentsAU-PRAU-ROC Bigram Features0.471 ± ± MMM0.246 ± ± Bigram + MMM ± ± Little benefit from MMM (although little overhead)

Results 47 Precision-Recall ROC We can classify 70% of the spammers that need manual labeling with about 90% accuracy

Deployment and Example Runtimes We can: Run the model on short intervals, with new snapshots of the network Update the features as events occur Example runtimes with Graphlab Create TM on a Macbook Pro: 5.6 million vertices and 350 million edges: PageRank: 6.25 minutes Triangle counting: minutes k-core: 14.3 minutes 48

Our Approach 49 Predict spammers based on: Graph structure Action sequences Reporting behavior

Refining the abuse reporting systems Abuse report systems are very noisy People have different standards Spammers report random people to increase noise Personal gain in social games Goal is to clean up the system using: Reporters’ previous history Collective reasoning over reports 50

Collective Classification with Reports 51 Report Subgraph Probabilistic Soft Logic t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3)

HL-MRFs & Probabilistic Soft Logic (PSL) 52 Probabilistic Soft Logic (PSL), a declarative modeling language based on first-order logic Weighted logical rules define a probabilistic graphical model: Instantiated rules reduce the probability of any state that does not satisfy the rule, as measured by its distance to satisfaction 52

Collective Classification with Reports Model using only reports: 53

Collective Classification with Reports Model using reports and credibility of the reporter: 54

Collective Classification with Reports Model using reports, credibility of the reporter, and collective reasoning: 55

Results of Classification Using Reports ExperimentsAU-PRAU-ROC Reports Only0.674 ± ± Reports & Credibility0.869 ± ± Reports & Credibility & Collective Reasoning ± ±

Results of Classification Using Reports ExperimentsAU-PRAU-ROC Reports Only0.674 ± ± Reports & Credibility0.869 ± ± Reports & Credibility & Collective Reasoning ± ±

Results of Classification Using Reports ExperimentsAU-PRAU-ROC Reports Only0.674 ± ± Reports & Credibility0.869 ± ± Reports & Credibility & Collective Reasoning ± ±

Results of Classification Using Reports ExperimentsAU-PRAU-ROC Reports Only0.674 ± ± Reports & Credibility0.869 ± ± Reports & Credibility & Collective Reasoning ± ±

Conclusion 60 t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3) Graph Structure … PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring … Multiple relations are more predictive than multiple features AUPR: → Code and part of the data will be released soon:

Conclusion 61 t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3) Graph Structure … PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring … Action Sequence … Bigram Features + Chain Augmented NB Multiple relations are more predictive than multiple features Even simple bigrams are highly predictive AUPR: → AUPR: Code and part of the data will be released soon:

Conclusion 62 t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3) Graph Structure … PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring … Action Sequence … Bigram Features + Chain Augmented NB Multiple relations are more predictive than multiple features Even simple bigrams are highly predictive AUPR: → AUPR: Can classify 70% of the spammers that needed manual labeling with 90% accuracy AUPR: Code and part of the data will be released soon:

Conclusion 63 t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3) Graph Structure … PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring … Action Sequence … Bigram Features + Chain Augmented NB Report Subgraph Probabilistic Soft Logic Multiple relations are more predictive than multiple features Even simple bigrams are highly predictive Jointly refining the credibility of the source is highly effective! AUPR: → AUPR: AUPR: → Can classify 70% of the spammers that needed manual labeling with 90% accuracy AUPR: Code and part of the data will be released soon:

Acknowledgements Collaborators: If(we) Inc. (Formerly Tagged Inc.): Johann Schleier-Smith, Karl Dawson, Dai Li, Stuart Robinson, Vinit Garg, and Simon Hill Dato (Formerly Graphlab): Danny Bickson, Brian Kent, Srikrishna Sridhar, Rajat Arya, Shawn Scully, and Alice Zheng 64 Shobeir Fakhraei Univ. of Maryland Lise Getoor Univ. California, Santa Cruz Madhusudana Shashanka if(we) Inc., currently Niara Inc.

Conclusion 65 t(1) t(3) t(2) t(4) t(5) t(6) t(7) t(8) t(9) t(10) t(8) t(3) Graph Structure … PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring PageRankTriangle Count In-Degree Out-Degreek-Core Graph Coloring … Action Sequence … Bigram Features + Chain Augmented NB Report Subgraph Probabilistic Soft Logic Multiple relations are more predictive than multiple features Even simple bigrams are highly predictive Jointly refining the credibility of the source is highly effective! AUPR: → AUPR: AUPR: → Can classify 70% of the spammers that needed manual labeling with 90% accuracy AUPR: Code and part of the data will be released soon: Thank you!