Published by Antony Chambers. Modified over 9 years ago.
1
BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection
Reporter: Fong-Ruei, Li
Machine Learning and Bioinformatics Lab, 2009/6/22
Source paper: Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. In Proceedings of the 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008.
2
Outline
Introduction
BotMiner: Detection Framework
  Problem statement
  Architecture overview
Experiments
Conclusion
3
Introduction
Botnets are becoming one of the most serious threats to Internet security; they are used for attacks such as spam and DDoS.
A botnet is a network of compromised machines (bots) under the influence of malware code and commanded by a botmaster.
4
Introduction
Most current botnet detection approaches work only on:
a specific botnet command and control (C&C) protocol, e.g., IRC
a specific structure, e.g., centralized
5
Introduction
Almost all of these approaches are designed for detecting botnets that use IRC- or HTTP-based C&C.
Rishi is designed to detect IRC botnets using known bot nickname patterns as signatures.
Another recent system, BotSniffer, is designed for detecting C&C activities with centralized servers.
6
Introduction
We need to develop a next-generation botnet detection system that is independent of the C&C protocol and structure.
7
Problem Statement
A botnet is characterized by:
a C&C communication channel
malicious activities
a botnet structure (centralized or P2P)
8
Assumptions
We assume that bots within the same botnet will be characterized by similar malicious activities and similar C&C communications.
9
Architecture overview
Diagram labels: clustering similar malicious activities; clustering similar communication; cross-checking.
10
C-plane Monitor
The C-plane monitor captures network flows and records information on who is talking to whom.
We limit our interest to TCP and UDP flows.
Each flow record contains:
time and duration
source and destination IP address and port
number of packets
bytes transferred
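The flow record listed above can be pictured as a small data structure. This is a minimal sketch, not the monitor's actual log format: the class name `FlowRecord`, the field names, the explicit `protocol` field, and the helper `is_monitored` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One C-plane flow record (hypothetical field names)."""
    start_time: float        # seconds since the start of monitoring
    duration: float          # seconds
    protocol: str            # assumed field; the slide only says TCP and UDP are kept
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    packets: int             # number of packets in the flow
    bytes_transferred: int   # total bytes in the flow


def is_monitored(flow: FlowRecord) -> bool:
    """Keep only TCP and UDP flows, as stated on the slide."""
    return flow.protocol in ("TCP", "UDP")


example = FlowRecord(0.0, 2.5, "TCP", "10.0.0.5", 51000, "203.0.113.7", 6667, 12, 1800)
icmp = FlowRecord(1.0, 0.1, "ICMP", "10.0.0.5", 0, "203.0.113.7", 0, 1, 64)
```

Here `is_monitored(example)` is true and `is_monitored(icmp)` is false, so only the TCP flow would be logged.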
11
A-plane Monitor
The A-plane monitor logs information on who is doing what.
It analyzes outbound traffic through the monitored network and detects several types of malicious activity that internal hosts may perform.
12
C-plane Clustering
C-plane clustering is responsible for:
reading the logs generated by the C-plane monitor
finding clusters of machines that share similar communication patterns
13
C-plane Clustering - Flow Chart
Flow Record -> Basic Filtering -> White Listing -> Aggregation (C-Flow) -> Feature Extraction -> Feature Reduction -> Coarse-grain Clustering -> Refined Clustering -> Clustering Report
Basic filtering and white listing filter out irrelevant traffic flows.
14
C-plane Clustering - Basic Filtering
Filter Rule 1 (F1): Ignore flows that are not directed from internal hosts to external hosts.
Filter Rule 2 (F2): Ignore flows that contain only one-way traffic.
15
C-plane Clustering - White Listing
Filter Rule 3 (F3): Ignore flows whose destinations are well-known legitimate servers, e.g., Google and Yahoo!
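Filter rules F1 to F3 can be read as a single predicate over flows. The sketch below is illustrative only: the internal network range, the whitelist contents, the `Flow` type, and the per-direction packet counters used to detect one-way traffic are assumptions, since the slides do not say how these are represented.

```python
import ipaddress
from typing import NamedTuple

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # hypothetical monitored network
WHITELIST = {"198.51.100.10"}                      # hypothetical well-known servers


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    packets_out: int  # packets from source to destination (assumed field)
    packets_in: int   # packets from destination back to source (assumed field)


def is_internal(ip: str) -> bool:
    return ipaddress.ip_address(ip) in INTERNAL_NET


def keep_flow(flow: Flow) -> bool:
    # F1: keep only flows from an internal host to an external host.
    if not is_internal(flow.src_ip) or is_internal(flow.dst_ip):
        return False
    # F2: drop flows that carry traffic in one direction only.
    if flow.packets_out == 0 or flow.packets_in == 0:
        return False
    # F3: drop flows to whitelisted destinations.
    if flow.dst_ip in WHITELIST:
        return False
    return True


flows = [
    Flow("10.0.0.5", "203.0.113.7", 12, 9),     # passes all three rules
    Flow("10.0.0.5", "10.0.0.9", 4, 4),         # removed by F1 (internal to internal)
    Flow("10.0.0.5", "203.0.113.8", 3, 0),      # removed by F2 (one-way)
    Flow("10.0.0.5", "198.51.100.10", 20, 18),  # removed by F3 (whitelisted)
]
kept = [f for f in flows if keep_flow(f)]
```

Only the first flow survives, which is the intended effect of the filtering stage: later steps see a much smaller set of candidate C&C flows.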
16
C-plane Clustering - Aggregation (C-Flow)
Aggregate related flows into communication flows (C-flows).
Within a given period (epoch), all m TCP/UDP flows that share the same protocol, source IP, destination IP and port are aggregated into the same C-flow.
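Aggregation into C-flows amounts to grouping flows by a shared key. The sketch below assumes flows are plain dictionaries with hypothetical field names and that "port" means the destination port; neither detail is stated on the slide.

```python
from collections import defaultdict


def aggregate_c_flows(flows):
    """Group the flows of one epoch by (protocol, source IP, destination IP, destination port)."""
    c_flows = defaultdict(list)
    for flow in flows:
        key = (flow["protocol"], flow["src_ip"], flow["dst_ip"], flow["dst_port"])
        c_flows[key].append(flow)
    return dict(c_flows)


flows = [
    {"protocol": "TCP", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 6667, "packets": 12},
    {"protocol": "TCP", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 6667, "packets": 7},
    {"protocol": "UDP", "src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 53, "packets": 2},
]
c_flows = aggregate_c_flows(flows)
```

The two TCP flows to port 6667 end up in one C-flow and the UDP flow in another, so `c_flows` has two entries.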
17
C-plane Clustering - Vector representation
Extract a number of statistical features from each C-flow c_i and translate them into a d-dimensional pattern vector.
18
C-plane Clustering - Vector representation
Each C-flow c_i is described by the discrete sample distributions of four random variables:
1. The number of flows per hour (fph). fph is computed by counting the number of TCP/UDP flows in c_i that are present for each hour of the epoch E.
2. The number of packets per flow (ppf). ppf is computed by summing the total number of packets sent within each TCP/UDP flow in c_i.
19
C-plane Clustering - Vector representation
3. The average number of bytes per packet (bpp). For each TCP/UDP flow f_j ∈ c_i, we divide the overall number of bytes transferred within f_j by the number of packets sent within f_j.
4. The average number of bytes per second (bps). bps is computed as the total number of bytes transferred within each f_j ∈ c_i divided by the duration of f_j.
20
C-plane Clustering - Vector representation
The range of each variable is divided into 13 intervals: [0, k1], (k1, k2], ..., (k12, ∞).
The breakpoints k1, ..., k12 are the quantiles q5%, q10%, q15%, q20%, q25%, q30%, q40%, q50%, q60%, q70%, q80%, q90%.
The quantile q_l% of a random variable X is the value q for which P(X < q) = l%.
With 13 intervals for each of the four variables, each C-flow is described by d = 4 × 13 = 52 features.
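The quantile-based binning can be sketched as follows. This is one possible reading under stated assumptions: the breakpoints are estimated from values pooled over all C-flows with a simple order-statistic rule, each feature is the fraction of a C-flow's samples falling in an interval, and all function and variable names are hypothetical.

```python
from bisect import bisect_left

# The 12 quantile levels from the slide, in percent.
QUANTILE_PERCENTS = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]
VARIABLES = ("fph", "ppf", "bpp", "bps")


def breakpoints(pooled_values):
    """Estimate k1..k12 from pooled samples (simple empirical quantile rule)."""
    ordered = sorted(pooled_values)
    n = len(ordered)
    return [ordered[min(n - 1, (p * n) // 100)] for p in QUANTILE_PERCENTS]


def histogram_features(values, breaks):
    """Fraction of values in each of [0, k1], (k1, k2], ..., (k12, inf)."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        # bisect_left counts the breakpoints strictly below v, which is the bin index.
        counts[bisect_left(breaks, v)] += 1
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def c_flow_vector(samples, breaks_by_var):
    """Concatenate the 13 bin fractions of the four variables into 52 features."""
    vector = []
    for name in VARIABLES:
        vector.extend(histogram_features(samples[name], breaks_by_var[name]))
    return vector


pooled = list(range(1, 101))
breaks = breakpoints(pooled)
features = histogram_features([1, 6, 7, 100], breaks)
vector = c_flow_vector(
    {name: [1, 6, 7, 100] for name in VARIABLES},
    {name: breaks for name in VARIABLES},
)
```

With the pooled values 1 to 100, the first breakpoint is 6, so the samples 1 and 6 fall in [0, k1], 7 falls in (k1, k2], and 100 falls in the last, open-ended interval.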
21
C-plane Clustering - Two-step clustering
22
C-plane Clustering - Two-step clustering
First step: coarse-grained clustering of the data set on a reduced feature space, in which the d = 52 features are reduced to d' = 8 features.
The X-means clustering algorithm is used.
The result is a set of coarse-grained clusters.
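The slide does not say how the 52 features are reduced to 8. One plausible reduction, assumed here purely for illustration, summarizes each of the four variables (fph, ppf, bpp, bps) by its mean and variance, giving 4 × 2 = 8 features. The function name and sample values are hypothetical, and the X-means clustering step itself is not shown.

```python
from statistics import fmean, pvariance

VARIABLES = ("fph", "ppf", "bpp", "bps")


def reduced_vector(samples):
    """Summarize each variable's samples by (mean, variance): 4 x 2 = 8 features."""
    vector = []
    for name in VARIABLES:
        values = samples[name]
        vector.append(fmean(values))      # mean of the sample distribution
        vector.append(pvariance(values))  # population variance of the same samples
    return vector


samples = {
    "fph": [2, 4],
    "ppf": [10, 10, 10],
    "bpp": [100.0, 300.0],
    "bps": [50.0],
}
reduced = reduced_vector(samples)
```

A low-dimensional summary like this keeps the first clustering pass cheap; the full 52-feature vectors are then only needed in the second, refining step.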
23
C-plane Clustering - Two-step clustering
Second step: we use all the d = 52 available features to represent the C-flows.
The X-means clustering algorithm is used.
The result is a set of refined clusters.
24
A-plane Clustering
25
Cross-plane Correlation
The idea is to cross-check clusters in the two planes to find intersections that give evidence of a host being part of a botnet.
To do this, we compute a botnet score s(h) for each host h.
26
Cross-plane Correlation - botnet score
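The score formula on this slide was an image and did not survive extraction. The sketch below is therefore an assumption, not the slide's definition: it scores a host by weighted Jaccard overlaps between the A-plane clusters and C-plane clusters that contain it, which matches the cross-checking idea. The tuple layout, activity type labels, weights, and function names are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def botnet_score(host, a_clusters, c_clusters):
    """a_clusters: list of (activity_type, weight, hosts); c_clusters: list of host sets."""
    mine_a = [c for c in a_clusters if host in c[2]]
    mine_c = [c for c in c_clusters if host in c]
    score = 0.0
    # Overlap between A-clusters of different activity types that both contain the host.
    for (t1, w1, h1), (t2, w2, h2) in combinations(mine_a, 2):
        if t1 != t2:
            score += w1 * w2 * jaccard(h1, h2)
    # Overlap between each A-cluster and each C-cluster that contain the host.
    for _, weight, hosts in mine_a:
        for c_hosts in mine_c:
            score += weight * jaccard(hosts, c_hosts)
    return score


a_clusters = [("scan", 1.0, {"h1", "h2"}), ("spam", 1.0, {"h1", "h2", "h3"})]
c_clusters = [{"h1", "h2"}, {"h4", "h5"}]
score_h1 = botnet_score("h1", a_clusters, c_clusters)
score_h3 = botnet_score("h3", a_clusters, c_clusters)
score_h4 = botnet_score("h4", a_clusters, c_clusters)
```

In this toy data, h1 appears in two activity clusters and one communication cluster and gets a positive score, while h3 (activity only) and h4 (communication only) score zero, reflecting that evidence must come from both planes or from several activity types.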
27
Cross-plane Correlation - similarity
28
Cross-plane Correlation - similarity
We define a similarity measure between each pair of bots h_i and h_j.
29
Cross-plane Correlation - similarity
30
Setup and Collection
We set up traffic monitors on a router in the campus network of the College of Computing at Georgia Tech.
We ran the C-plane and A-plane monitors for a continuous 10-day period in late 2007.
31
Setup and Collection
Botnet trace annotations: generated by executing modified bot code; generated based on Web-based C&C communication; a real-world trace containing two P2P botnets.
32
Evaluation Results
Filtration and aggregation
33
Evaluation Results
Two-step clustering
34
Evaluation Results
35
Conclusion
We proposed a novel network anomaly-based botnet detection system that is independent of the protocol and structure used by the botnet.
36
Thank you for listening
The end