Information Fusion Ganesh Godavari
DDoS Data Set
DARPA DDoS data set (2000) is available
–MIT Lincoln Laboratory
–Data set spans approximately 3 hours
The five phases of the attack scenario depicted [1]:
–IPsweep of the Air Force Base (AFB) from a remote site
–Probe of live IPs to look for the sadmind daemon running on Solaris hosts
–Break-ins via the sadmind vulnerability, both successful and unsuccessful, on those hosts
–Installation of the trojan mstream DDoS software on three hosts at the AFB
–Launching the DDoS
Related Work
–Charu C. Aggarwal and Philip S. Yu (2001) “Outlier detection for high dimensional data”, ACM SIGMOD International Conference on Management of Data, Pg: 37 – 46
–John McHugh (2000) “Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory”, ACM TISSEC, 3(4) Pg:
–Risto Vaarandi (2003) “A data clustering algorithm for mining patterns from event logs”, IEEE Workshop on IP Operations and Management
Attack Scenario [1]
Phase 1 Attack (DDoS Data Set)
Id  Date    Time    Duration  SrcIP  Target IP  Analyzer        Service
1   03/07/  :51:36  00:00:                      tcpdump_inside  icmp-E-R
2   03/07/  :51:36  00:00:                      tcpdump_inside  icmp-E-Rp
3   03/07/  :51:36  00:00:                      tcpdump_inside  icmp-E-R
4   03/07/  :51:36  00:00:                      tcpdump_inside  icmp-E-Rp
5   03/07/  :51:38  00:00:                      tcpdump_inside  icmp-E-R
6   03/07/  :51:38  00:00:                      tcpdump_inside  icmp-E-Rp
7   03/07/  :51:41  00:00:                      tcpdump_inside  icmp-E-R
8   03/07/  :51:50  00:00:                      tcpdump_inside  icmp-E-R
9   03/07/  :51:50  00:00:                      tcpdump_inside  icmp-E-Rp
10  03/07/  :51:51  00:00:                      tcpdump_inside  icmp-E-R
11  03/07/  :51:51  00:00:                      tcpdump_inside  icmp-E-Rp
12  03/07/  :51:51  00:00:                      tcpdump_inside  icmp-E-R
13  03/07/  :51:51  00:00:                      tcpdump_inside  icmp-E-Rp
14  03/07/  :51:52  00:00:                      tcpdump_inside  icmp-E-R
::::::
32  03/07/  :52:00  00:00:                      tcpdump_inside  icmp-E-R
33  03/07/  :52:00  00:00:                      tcpdump_inside  icmp-E-R
icmp-E-R => icmp-echo-request
icmp-E-Rp => icmp-echo-reply
Algorithm
Step 1: Go over the data file and build the vocabulary
–Read all the unique fields in the data file
Step 2: Identify the frequent vocabulary in the data file
–How should frequency be determined? How can one determine the frequency threshold?
Step 3: Generate cluster candidates
–Lines containing the same frequent words form a cluster
Step 4: Identify temporal relationships between cluster candidates
–The 24 relationships of data
Step 5: Generate unique lines
–Lines in the data file, based on the candidate clusters
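Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under assumptions, not the finished algorithm: the function names `frequent_words` and `cluster_candidates` are hypothetical, a "word" is taken to be a (column, value) pair, the threshold is assumed to be supplied by the caller, and the sample rows are made up rather than read from the data set.

```python
from collections import Counter, defaultdict


def frequent_words(rows, threshold):
    """Steps 1-2: count every (column, word) pair and keep those seen more than `threshold` times."""
    counts = Counter(
        (col, word) for row in rows for col, word in enumerate(row)
    )
    return {key for key, n in counts.items() if n > threshold}


def cluster_candidates(rows, frequent):
    """Step 3: lines that contain the same frequent words form one candidate."""
    clusters = defaultdict(list)
    for line_no, row in enumerate(rows, start=1):
        signature = tuple(
            (col, word) for col, word in enumerate(row) if (col, word) in frequent
        )
        if signature:  # lines without any frequent word join no candidate
            clusters[signature].append(line_no)
    return dict(clusters)


# Hypothetical records (date, analyzer, service); not taken from the data set.
rows = [
    ("03/07/2000", "tcpdump_inside", "icmp-echo-request"),
    ("03/07/2000", "tcpdump_inside", "icmp-echo-reply"),
    ("03/07/2000", "tcpdump_inside", "icmp-echo-request"),
    ("03/07/2000", "tcpdump_inside", "icmp-echo-reply"),
]
frequent = frequent_words(rows, threshold=1)
candidates = cluster_candidates(rows, frequent)
```

The result depends strongly on the threshold: with `threshold=1` the two services stay frequent and the lines split into two candidates, while with `threshold=2` only the date and analyzer remain frequent and all four lines collapse into one candidate.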
Need Suggestions
–Is it safe to assume that a threshold parameter is provided?
–Cluster candidate generation can generate too much data (next slide shows how)
Cluster Candidate Generation
Data set has 8 dimensions
Frequent words (4-byte column number followed by the word) occurring more than 10 times (threshold = 10) are:
– repeated 22
–000103/07/2000 repeated 33
–000300:00:00 repeated 31
–0007icmp-echo-request repeated 22
–0007icmp-echo-reply repeated 11
–0006tcpdump_inside repeated 33
– repeated 11
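The column-tagged word format in the list above can be illustrated with a small sketch. It assumes that the "4-byte column number" is a zero-padded, four-character decimal prefix and that columns are numbered from 0; both are readings of the slide, and the helper names `encode_word`, `decode_word` and `count_encoded` are hypothetical.

```python
from collections import Counter


def encode_word(col, word):
    """Prefix a word with its zero-padded, four-character column number (assumed format)."""
    return f"{col:04d}{word}"


def decode_word(key):
    """Split an encoded key back into (column number, word)."""
    return int(key[:4]), key[4:]


def count_encoded(rows):
    """Count how often each column-tagged word occurs across all rows."""
    return Counter(
        encode_word(col, word) for row in rows for col, word in enumerate(row)
    )


key = encode_word(7, "icmp-echo-request")
```

Tagging each word with its column keeps identical strings in different columns apart, so an address seen as a source is counted separately from the same address seen as a target.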
Candidate Generation Example
Example:
line 1: 03/07/ :51:36 00:00: tcpdump_inside icmp-E-R
line 2: 03/07/ :51:36 00:00: tcpdump_inside icmp-E-Rp
line 3: 03/07/ :51:36 00:00: tcpdump_inside icmp-E-R
line 4: 03/07/ :51:36 00:00: tcpdump_inside icmp-E-Rp
The first field is common to all lines, so should it be considered a candidate cluster?
Cluster 1 = { line 1, line 2, line 3, line 4}
Cluster 2 = { line 1, line 3, line 4}
Cluster 3 = { line 1, line 3}
Cluster 4 = { line 2, line 4}
Cluster 5 = { line 1, line 2, line 3, line 4}
Cluster 6 = { line 1, line 3}
Cluster 7 = { line 2, line 4}
Reduction, but with loss of information?
–Cluster 1 = { line 1, line 3}
–Cluster 2 = { line 2}
–Cluster 3 = { line 4}
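The trade-off raised on this slide can be made concrete with a small sketch. It contrasts one candidate per word (overlapping sets of lines) with a partition in which each line belongs to exactly one cluster. This is one possible reading of the reduction, not necessarily the intended method; the function names are hypothetical, and the rows are invented, with addresses from the documentation range 192.0.2.0/24 rather than from the data set.

```python
from collections import defaultdict


def per_word_candidates(rows):
    """One candidate per (column, word): the set of lines that contain it."""
    members = defaultdict(set)
    for line_no, row in enumerate(rows, start=1):
        for col, word in enumerate(row):
            members[(col, word)].add(line_no)
    return dict(members)


def partition_by_row(rows):
    """Reduced form: each line joins exactly one cluster, keyed by its full row."""
    groups = defaultdict(list)
    for line_no, row in enumerate(rows, start=1):
        groups[tuple(row)].append(line_no)
    return sorted(groups.values())


# Hypothetical rows (date, source address, service); values are illustrative only.
rows = [
    ("03/07/2000", "192.0.2.1", "icmp-echo-request"),
    ("03/07/2000", "192.0.2.2", "icmp-echo-reply"),
    ("03/07/2000", "192.0.2.1", "icmp-echo-request"),
    ("03/07/2000", "192.0.2.3", "icmp-echo-reply"),
]
overlapping = per_word_candidates(rows)
reduced = partition_by_row(rows)
```

On these four rows the per-word view yields six overlapping candidates, several of which cover the same lines, while the partition yields three disjoint clusters. The partition is smaller, but it no longer records that lines 2 and 4 share the same service, which is the kind of information loss the slide asks about.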
Work to be done Complete the algorithm and coding part
References
[1] MIT Lincoln Laboratory _data_index.html