BotGraph: Large Scale Spamming Botnet Detection. Yao Zhao, EECS Department, Northwestern University.

Presentation transcript:

1 BotGraph: Large Scale Spamming Botnet Detection
Yao Zhao
EECS Department, Northwestern University

2 Outline
Motivation and Problem Definition
BotGraph Algorithms
–History based algorithm on signup detection
–Graph-based algorithm on login detection
Parallel Implementation on DryadLINQ
Detection Results
Discussion
Conclusion

3 Web-Account Abuse Attack
[Figure labels: zombie, server, CAPTCHA solver, CAPTCHA text "RDSXXTD3", User/Pwd]

4 Problems and Challenges
Web-account Abuse
–Signup abuse
–Spam sending
Challenges
–Accuracy requirement
–Stealthy (in terms of spam sending)
–Large scale attack and huge Hotmail log data
Our Behavior-based Solutions
–Correlate bot-users by their activities and identify the group properties
–Design parallel algorithms on DryadLINQ to efficiently process large data

5 System Architecture
[Figure: system architecture. Labels recoverable from the slide, grouped by pipeline:]
–Signup path: Signup data, EWMA based change detection, Aggressive signups, Verification & prune, Signup botnets
–Login path: Login data, Graph generation, Login graph, Random graph based clustering, Suspicious clusters, Verification & prune, Sendmail data, Spamming botnets
–Annotations: Run on DryadLINQ clusters; Output locally

6 Outline
Motivation and Problem Definition
BotGraph Algorithms
–History based algorithm on signup detection
–Graph-based algorithm on login detection
Parallel Implementation on DryadLINQ
Detection Results
Discussion
Conclusion

7 History Based Change Detection
[Figure annotations: "Large prediction error", "Back to normal"]

8 EWMA based Change Detection
EWMA (Exponentially Weighted Moving Average)
–$Y_t$: observation at time $t$; $S_t$: prediction at time $t$
–$S_t = \alpha Y_{t-1} + (1 - \alpha) S_{t-1}$
Large Prediction Error Implies Change (or Abnormal)
–$E_t = Y_t - S_t$ (prediction error)
–$R_t = Y_t / \max(S_t, \varepsilon)$ (relative prediction error)
Apply EWMA based Change Detection to Signup Time Series of Each IP Address
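
The recurrence on slide 8 can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name, the seeding of the first prediction with the first observation, and the values of `alpha`, `ratio_threshold`, and `eps` are all assumptions, since the slide does not give them.

```python
def ewma_change_points(counts, alpha=0.5, ratio_threshold=5.0, eps=1.0):
    """Return the time indices whose relative prediction error R_t is large.

    counts[t] plays the role of Y_t (e.g. signups per day from one IP).
    alpha, ratio_threshold and eps are illustrative values, not from the slides.
    """
    flagged = []
    prediction = float(counts[0])  # assumed seed: S_0 = Y_0
    for t in range(1, len(counts)):
        # S_t = alpha * Y_{t-1} + (1 - alpha) * S_{t-1}
        prediction = alpha * counts[t - 1] + (1 - alpha) * prediction
        # R_t = Y_t / max(S_t, eps); eps guards against tiny predictions
        relative_error = counts[t] / max(prediction, eps)
        if relative_error > ratio_threshold:
            flagged.append(t)
    return flagged


if __name__ == "__main__":
    # One IP's daily signup counts with a burst on day 4.
    print(ewma_change_points([2, 3, 2, 3, 40, 2, 3]))
```

In the setting of the slide, a function like this would be applied independently to the signup time series of each IP address.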

9 Outline
Motivation and Problem Definition
BotGraph Algorithms
–History based algorithm on signup detection
–Graph-based algorithm on login detection
Parallel Implementation on DryadLINQ
Detection Results
Discussion
Conclusion

10 Normal and Bot-user Behaviors
General Behaviors of Normal Users
–Log in to Hotmail from home and/or office
–If dynamic IPs are used, the account shares IPs within one AS with other users
General Behaviors of Bot-users
–A pool of bots (e.g. thousands) and a pool of bot-users (e.g. hundreds of thousands); each bot hosts multiple bot-users
–A bot-user is assigned to different random bots every day; fixed binding is not adopted now
–A pair of bot-users has a large chance of sharing several different IPs in different ASes

11 User-user Graph
Graph Model
–A Hotmail account => a node
–A pair of accounts sharing IPs => an edge
–Edge weight = number of different ASes the shared IPs belong to
–Consider only edges with weight > 1
Key Observations
–Bot-users form a giant connected component
–Normal users do not form large connected components
–Both observations can be explained by random graph theory
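
The graph model on slide 11 can be illustrated with a small in-memory sketch. The input format (a list of `(user, ip, asn)` login records) and the function name are assumptions made for this example; the slides do not describe the log schema, and the real system runs on DryadLINQ rather than in a single process.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(logins, min_weight=2):
    """Build weighted user-user edges from (user, ip, asn) login records.

    The weight of an edge is the number of distinct ASes among the IPs
    the two users share. Only edges with weight > 1 are kept by default.
    """
    users_by_ip = defaultdict(set)
    asn_of_ip = {}
    for user, ip, asn in logins:
        users_by_ip[ip].add(user)
        asn_of_ip[ip] = asn

    shared_asns = defaultdict(set)
    for ip, users in users_by_ip.items():
        # Every pair of users seen on this IP shares the IP's AS.
        for u, v in combinations(sorted(users), 2):
            shared_asns[(u, v)].add(asn_of_ip[ip])

    return {pair: len(asns)
            for pair, asns in shared_asns.items()
            if len(asns) >= min_weight}


if __name__ == "__main__":
    records = [
        ("a", "1.1.1.1", 100), ("b", "1.1.1.1", 100),
        ("a", "2.2.2.2", 200), ("b", "2.2.2.2", 200), ("c", "2.2.2.2", 200),
    ]
    print(build_user_graph(records))
```

Enumerating all user pairs per IP is quadratic in the number of users on that IP, which is one reason a real deployment would need the parallel graph construction the talk mentions.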

12 Random Graph Theory
Random Graph G(n, p)
–n nodes; each pair of nodes has an edge with probability p
Theorem
–A graph generated by G(n, p) has average degree d = n · p.
–If d < 1, then with high probability the largest component in the graph has size at most O(log n).
–If d > 1, then with high probability the graph contains a giant component with size on the order of O(n).
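
The threshold behavior stated in the theorem can be observed empirically. The following sketch samples G(n, p) and measures the largest connected component with a union-find structure; the function name, the graph size, and the two values of d are arbitrary choices for illustration.

```python
import random


def largest_component_size(n, p, seed=0):
    """Sample G(n, p) and return the size of its largest connected component."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # each pair gets an edge with probability p
                parent[find(u)] = find(v)

    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


if __name__ == "__main__":
    n = 1000
    for d in (0.5, 2.0):  # average degree d = n * p
        print(d, largest_component_size(n, d / n))
```

With d below 1 the largest component should stay small relative to n, while with d above 1 a component covering a constant fraction of the nodes should appear.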

13 Typical Bot-user Graphs
Strategy 1
–Bot-user accounts are randomly assigned to bots.
Strategy 2
–The botnet keeps a queue of the bot-users.
–A bot comes online and gets the top k available (currently not used) bot-users in the queue.
Strategy 3
–Similar to Strategy 2, except that there is no limit on the number of bot-users a bot can request in one day.

14 Typical Bot-user Graphs
–10,000 bot-users, 10-day activity, k = 20

15 Bot-user Detection Algorithm
Issues
–Different bot-user groups may be connected (in the graph with weight threshold 2) through shared bots or shared bot-users
–No fixed weight threshold T
–Exceptions: large connected components formed by normal users exist
Detection Algorithm
–Hierarchical algorithm to extract connected components
–Pruning

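16 Hierarchical Connected Component Extraction
[Figure: hierarchy of connected components extracted from graph G at thresholds T=2, T=3, and T=4; component labels A, B, C, D, E]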
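
One way to read the hierarchical extraction is as a recursion over increasing weight thresholds: extract the connected components at threshold T, then re-examine each sufficiently large component at T+1. The sketch below follows that reading. The stopping rule (`min_size`, `max_threshold`) and all names are assumptions, because the slides do not state when the recursion ends.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def hierarchical_components(nodes, weighted_edges, threshold=2,
                            min_size=3, max_threshold=4):
    """Return (threshold, component) pairs found while raising the threshold.

    min_size and max_threshold are hypothetical stopping parameters.
    """
    kept = [(u, v, w) for u, v, w in weighted_edges if w >= threshold]
    found = []
    for comp in connected_components(nodes, [(u, v) for u, v, _ in kept]):
        if len(comp) < min_size:
            continue  # too small to be treated as a group
        found.append((threshold, frozenset(comp)))
        if threshold < max_threshold:
            inside = [(u, v, w) for u, v, w in kept if u in comp and v in comp]
            found.extend(hierarchical_components(
                comp, inside, threshold + 1, min_size, max_threshold))
    return found


if __name__ == "__main__":
    nodes = {"a", "b", "c", "d", "e", "f"}
    edges = [("a", "b", 4), ("b", "c", 4), ("c", "d", 2),
             ("d", "e", 3), ("e", "f", 3)]
    for level, comp in hierarchical_components(nodes, edges):
        print(level, sorted(comp))
```

In this toy input the weight-2 edge between c and d joins two groups at T=2, and raising the threshold to 3 separates them, which is the situation the "Issues" list describes for bot-user groups that share bots or accounts.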

17 Exceptions: Connected Subgraphs of Normal Users
Potential Reasons
–Some web service providers log in to Hotmail accounts on behalf of users (e.g. Facebook, LinkedIn)
–National proxies
–Cell phones (e.g. iPhone)
–Tor (onion routing)
Solutions
–Filter out some IPs
–Prune potential good connected components

18 Prune Good Groups
Sending Frequency
–Normal users: generally don’t send many emails on average
–Bot-users: to be efficient, send several spam emails every day
Email Size
–Normal users: random size
–Bot-users: (currently) similar size

19 Prune Good Groups
[Figure: example groups labeled "Bad" and "Good"]

20 Prune Good Groups
Metrics
–$s_1$: the percentage of users who send more than 3 emails per day
–$s_2$: the percentage of users who send out emails with similar size (peak detection)
Pruning
–Threshold of $s_1$ is 0.8 (conservative, with wide margins around 0.8)
–$s_2$ is used in validation
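
The first pruning metric can be sketched as follows. Two details are assumptions: "more than 3 emails per day" is interpreted as a per-user daily average, and a group is kept as suspicious when its score is at least the 0.8 threshold. The data layout (user name mapped to a list of daily send counts) is also hypothetical.

```python
def s1_score(daily_counts_by_user, per_day_cutoff=3):
    """Fraction of users in a group averaging more than per_day_cutoff emails per day."""
    heavy_senders = sum(
        1 for counts in daily_counts_by_user.values()
        if sum(counts) / len(counts) > per_day_cutoff
    )
    return heavy_senders / len(daily_counts_by_user)


def is_suspicious(daily_counts_by_user, s1_threshold=0.8):
    """Keep the group as suspicious only if most members are heavy senders."""
    return s1_score(daily_counts_by_user) >= s1_threshold


if __name__ == "__main__":
    bot_like = {"u1": [5, 6, 4], "u2": [4, 4, 4], "u3": [10, 2, 6]}
    normal_like = {"v1": [0, 1, 0], "v2": [9, 8, 7], "v3": [1, 0, 0], "v4": [0, 0, 2]}
    print(s1_score(bot_like), is_suspicious(bot_like))
    print(s1_score(normal_like), is_suspicious(normal_like))
```

Groups that fail this check would be pruned as likely groups of normal users; the second metric is described on the slide as being used for validation rather than pruning, so it is not modeled here.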

21 Outline
Motivation and Problem Definition
BotGraph Algorithms
–History based algorithm on signup detection
–Graph-based algorithm on login detection
Parallel Implementation on DryadLINQ
Detection Results
Discussion
Conclusion

22 Parallel Implementation on DryadLINQ
EWMA Algorithm of Signup Abuse Detection
–Partition data by IP (straightforward)
Graph Construction
–Two algorithms
Connected Component Extraction
–Divide and conquer

23 Connected Component Extraction
Partitions of Edges
–Edge format: (User1, User2, weight)
–Partition 1: (A, B) (D, G)
–Partition 2: (B, C) (C, E)
–Partition 3: (C, D) (E, F)
–Partition 4: (B, G) (G, D)

24 Connected Component Extraction
Before the local algorithm: (A, B) (D, G) | (B, C) (C, E) | (C, D) (E, F) | (B, G) (G, D)
After the local algorithm: (A, B) (D, G) | (B, C) (B, E) | (C, D) (E, F) | (B, G) (B, D)

25 Connected Component Extraction
Before merging: (A, B) (D, G) | (B, C) (B, E) | (C, D) (E, F) | (B, G) (B, D)
After merge and local algorithm: (A, B), (A, C), (A, E), (D, G) | (B, D), (B, C), (B, G), (E, F)

26 Connected Component Extraction
Before the final merge: (A, B), (A, C), (A, E), (D, G) | (B, D), (B, C), (B, G), (E, F)
After the final merge: (A, B), (A, C), (A, D), (A, E), (A, F), (A, G)

27 Connected Component Extraction
Analysis
–M partitions and log(M) steps
–Partition size ≤ N (number of users)
–Overall communication overhead O(N·log(M))
–Computational overhead
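
The divide-and-conquer example on slides 23 to 26 can be reproduced with a sequential sketch. Each partition is reduced locally to star edges that point from a representative to every other member of a component, and partitions are then merged pairwise until one remains. Choosing the smallest label as the representative is an assumption that happens to match the slide's example; the code runs in a single process and does not use DryadLINQ.

```python
def to_star_edges(edges):
    """Local step: replace edges with (representative, member) star edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Keep the smaller label as the component representative.
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv
    return sorted((find(x), x) for x in parent if find(x) != x)


def parallel_components(partitions):
    """Reduce each partition locally, then merge pairwise in about log2(M) rounds."""
    partitions = [to_star_edges(p) for p in partitions]
    while len(partitions) > 1:
        merged = []
        for i in range(0, len(partitions), 2):
            combined = [e for part in partitions[i:i + 2] for e in part]
            merged.append(to_star_edges(combined))
        partitions = merged
    return partitions[0]


if __name__ == "__main__":
    parts = [
        [("A", "B"), ("D", "G")],
        [("B", "C"), ("C", "E")],
        [("C", "D"), ("E", "F")],
        [("B", "G"), ("G", "D")],
    ]
    print(parallel_components(parts))
```

Because a star representation of components over N users has fewer than N edges, each partition stays bounded by N after the local step, which is consistent with the partition-size bound on the analysis slide.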

28 Outline
Motivation and Problem Definition
BotGraph Algorithms
–History based algorithm on signup detection
–Graph-based algorithm on login detection
Parallel Implementation on DryadLINQ
Detection Results
Discussion
Conclusion

29 Detection of Signup Abuse

30 Detection by User-user Graph

31 Validations
Manual Check
–Verified by the Hotmail group
Comparison with Known Spamming Users
–Hotmail accounts that received complaints
Sending Patterns
–Email size
False Positive Estimation
–Naming pattern
–Signup time

32 Comparison with Complained Users
$K_s$: known spammer accounts signed up in the studied month
$H$: set of bot-users detected by EWMA

33 Comparison with Complained Users
$K_s$: known spammer accounts that log in from at least 2 ASes
$L$: set of bot-users detected by user-user graph

34 Validation of Sending Pattern

35 False Positive Estimation (1)
Naming Pattern
–Clear pattern in names of (current) bot-users, e.g. w9168d4dc8c5c25f9
–Naming pattern score: the largest fraction of users that follow a single naming template from a regular expression pool
–The regular expressions generally do not match normal user names

36 False Positive Estimation (1) Naming Score

37 False Positive Estimation (1)
Naming Score
–A majority of the bot-user groups have naming pattern scores close to 1
–A few small bot-user groups have scores lower than 95%
–In total, 0.44% of identified bot-users do not strictly follow the naming templates of their corresponding groups
–Take this 0.44% as the false positive bound
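
The naming pattern score defined on slide 35 is the largest fraction of a group's account names that match a single template from a pool of regular expressions. A minimal sketch follows. The templates below are hypothetical and written only to match the one example name shown on the slide; the actual regular expression pool is not given.

```python
import re


def naming_pattern_score(usernames, templates):
    """Largest fraction of usernames fully matched by any single template."""
    best = 0
    for template in templates:
        pattern = re.compile(template)
        matched = sum(1 for name in usernames if pattern.fullmatch(name))
        best = max(best, matched)
    return best / len(usernames)


if __name__ == "__main__":
    # Hypothetical templates, not the authors' pool.
    pool = [r"w[0-9a-f]{16}", r"[a-z]+\d{4}"]
    group = ["w9168d4dc8c5c25f9", "w0123456789abcdef",
             "wdeadbeefdeadbeef", "alice.smith"]
    print(naming_pattern_score(group, pool))
```

Under this definition, the members that do not match their group's best template are the ones counted toward the 0.44% bound on the slide.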

38 False Positive Estimation (2)
Signup dates of the detected bot-users
–Conservatively take all the accounts signed up before 2007 as legitimate
–0.08% of the detected bot-users signed up before 2007
–Among all the accounts in the 2008 dataset, about 59.1% signed up before 2007
–False positive estimate, assuming normal users' behaviors don’t change: 0.08% / 59.1% = 0.13%
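
The arithmetic behind this estimate can be spelled out. Let $F$ be the fraction of detected accounts that are actually legitimate, and assume, as the slide does, that legitimate accounts behave the same way regardless of signup date, so that about 59.1% of the misclassified legitimate accounts signed up before 2007. The symbol $F$ is introduced here only for illustration.

```latex
\[
F \times 59.1\% \approx 0.08\%
\quad\Longrightarrow\quad
F \approx \frac{0.08\%}{59.1\%} \approx 0.135\%
\]
```

This agrees with the roughly 0.13% figure reported on the slide, up to rounding of the inputs.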

39 Outline
Motivation and Problem Definition
BotGraph Algorithms
–History based algorithm on signup detection
–Graph-based algorithm on login detection
Parallel Implementation on DryadLINQ
Detection Results
Discussion
Conclusion

40 Evasion
Signup detection
–Be stealthy
Login detection
–Fixed binding: low utilization rate; bot-accounts bound to one host are easy to group
–Fixed AS assignment: redefine the edge weight to consider IP prefix; similar to fixed binding
–Be stealthy (sending as few emails as a normal user)

41 Related Work
Botnet Detection
–Hard in general
–HoneyNet
Content-based Spam Detection
–Bayesian filtering, AutoRE
–Countermeasures: good words, image
Behavior-based Spam Detection
–SpamTracker

42 Conclusions
BotGraph
–History-based change detection on signup
–Graph-based component to detect stealthy bot-user logins
Parallel Algorithms on DryadLINQ
–Quick processing of the huge Hotmail log
Detection
–Detected more than 26M bot-accounts in a two-month log
–Low false positive rate

43 Q & A? Thanks!