1 Characterizing Botnets from Email Spam Records
Presenter: Yi-Ren Yeh (葉倚任)
Authors: L. Zhuang, J. Dunagan, D. R. Simon, H. J. Wang, I. Osipkov, G. Hulten, and J. Tygar
USENIX LEET 2008
2 Outline
- Introduction
- Overview
- Methodology
- Metrics and Findings
- Conclusion
3 Introduction
- Spam is a driving force in the economics of botnets
- This work maps botnet membership and other botnet characteristics using spam traces
- By grouping similar messages into spam campaigns and merging related campaigns, the authors identify a set of botnets
- The analysis uses a large trace of spam from the Hotmail Web mail service
4 Pros and Cons of Using Spam
- Pro: the analysis can be done on an existing trace from one of the small number of large Web mail providers
- Pro: spam is directly related to the economic motivation behind many botnets
- Pro: potentially a less ad hoc and easier task than analyzing IRC/DNS logs
- Con: cannot uncover botnets that are not involved in spamming
5 Contributions
- The first to analyze the behavior of entire botnets (in contrast to individual bots) from spam messages
- The first to study botnet traces based on economic motivation and monetizing activities
- New findings about botnets involved in spamming
6 Overview
The major steps in the proposed method:
- Cluster messages into spam campaigns
  - Spam messages with identical or similar content are assumed to be sent from the same controlling entity
  - Use fingerprints to cluster messages
- Assess IP dynamics
  - Estimate the average time until an IP address gets reassigned
  - Estimate the IP reassignment range (both are measured per C-subnet)
- Merge spam campaigns into botnets
  - Via the overlap of the sending hosts
7 Methodology
- Datasets and initial processing
- Identifying spam campaigns
- Skipping spam from non-bots
- Assessing IP dynamics
- Identifying botnets
- Estimating botnet size
8 Datasets and Initial Processing
- Collected from the Hotmail Web mail service (Junk Mail)
- Randomly sampled 5 million spam messages collected over a 9-day period, from May 21, 2007 to May 29, 2007
- Heuristically extract a reliable sender IP address for each message
- Parse the body parts to get both the HTML and the text of each message
9 Identifying Spam Campaigns
- Use ad hoc approaches to pre-clean the raw content and keep only the rendered content
- Use the shingling algorithm to cluster near-duplicate content together
- Associate each spam campaign with the list {(IP_i, t_i)} of IP events, each consisting of an IP address IP_i and a sending time t_i
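As a rough illustration of near-duplicate detection with shingles, the sketch below builds word-level shingle sets and compares them with Jaccard similarity. It is a minimal sketch, not the authors' pipeline: the shingle width `w=4`, the tokenizer, and the use of raw shingle sets instead of hashed fingerprints are all assumptions made for illustration.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles of a text (w=4 is an assumed width)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Short texts yield a single shingle covering all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


msg_a = "Buy cheap pills online now today"
msg_b = "Buy cheap pills online now friend"
similarity = jaccard(shingles(msg_a), shingles(msg_b))
```

Messages whose similarity exceeds some chosen cutoff would be placed in the same campaign; the cutoff itself is not given on the slide.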
10 Skipping Spam from Non-bots
- Exclude a message if its sender IP address is on the white list
- Remove campaigns whose senders are all within a single C-subnet
- Remove campaigns with senders from fewer than three geographic locations (cities)
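The three filters above can be sketched as a single predicate over a campaign's IP events. This is a minimal sketch under stated assumptions: a C-subnet is taken to be an IPv4 /24, `city_of` is a hypothetical geolocation lookup supplied by the caller, and the function and parameter names are illustrative.

```python
import ipaddress


def c_subnet(ip):
    """Map an IPv4 address string to its /24 network (assumed meaning of C-subnet)."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def looks_like_bot_campaign(events, city_of, whitelist=frozenset()):
    """events: iterable of (ip, time) pairs; city_of: hypothetical ip -> city lookup."""
    # Drop events whose sender is on the white list.
    ips = {ip for ip, _ in events if ip not in whitelist}
    if not ips:
        return False
    # Remove campaigns whose senders all sit in one C-subnet.
    if len({c_subnet(ip) for ip in ips}) <= 1:
        return False
    # Remove campaigns with senders from fewer than three cities.
    return len({city_of(ip) for ip in ips}) >= 3
```

For example, a campaign whose three senders fall in three different /24 networks and three different cities passes, while one sent entirely from `10.0.0.0/24` does not.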
11 Assessing IP Dynamics
- Assume that IP address reassignment is a Poisson process
- Measure two IP address reassignment parameters in each C-subnet (via MSN)
  - J_t: the average lifetime of an IP address on a particular host
  - J_r: the maximum distance between IP addresses assigned to the same host
12 Assessing IP Dynamics: Rule of Aggregation
- Consider two events (IP_1, t_1) and (IP_2, t_2) among the IP addresses in the same C-subnet
- If either IP_1 or IP_2 is outside the distance range J_r of the other, regard the two events as coming from two different machines
- If IP_1 and IP_2 are within the distance range J_r of each other, consider two cases
  - The host keeps the same IP address over an interval of duration t_2 - t_1
  - An IP reassignment happens during an interval of duration t_2 - t_1
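Under the Poisson assumption, if an address lives on a host for J_t on average, the probability that no reassignment happens within an interval of length dt is exp(-dt / J_t). The sketch below turns the aggregation rule into a pairwise weight on that basis. It is an illustrative reading, not the authors' exact formula: the function name, the use of the absolute difference of integer addresses as the "distance", and the unscaled reassignment probability for differing addresses are all assumptions.

```python
import ipaddress
import math


def connection_weight(ip1, t1, ip2, t2, j_t, j_r):
    """Weight that two IP events come from the same machine (assumed form).

    j_t: average address lifetime, in the same time unit as t1 and t2.
    j_r: maximum address distance, measured here as an integer difference.
    """
    a = int(ipaddress.ip_address(ip1))
    b = int(ipaddress.ip_address(ip2))
    if abs(a - b) > j_r:
        # Out of reassignment range: treated as two different machines.
        return 0.0
    dt = abs(t2 - t1)
    p_keep = math.exp(-dt / j_t)  # no reassignment during dt (Poisson assumption)
    if a == b:
        return p_keep
    # Different address within range: a reassignment must have happened.
    return 1.0 - p_keep
```

With `j_t = 24` hours, two events from the same address 24 hours apart get weight exp(-1), roughly 0.37.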
13 Assessing IP Dynamics
14 Identifying Botnets
- Given two spam campaigns SC_1 and SC_2, how do we know whether they share the same controller?
- A measure W gives the fraction of the events in SC_1 that are connected to some event in SC_2, where i and j index the IP events in SC_1 and SC_2
- W, called the connectivity degree, ranges from 0 to 1
- 0.2 is selected as a reasonable threshold
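One way to read the connectivity degree is sketched below: W is computed as the share of events in SC_1 that have at least one connected event in SC_2, and campaigns are merged whenever W reaches the threshold. This is a sketch under assumptions: the `connected` predicate is supplied by the caller (here it stands in for whatever pairwise rule links two IP events), and merging on `W >= threshold` in either direction with a union-find structure is an illustrative choice, not something the slide specifies.

```python
def connectivity_degree(sc1, sc2, connected):
    """Fraction of events in sc1 connected to some event in sc2 (in [0, 1])."""
    if not sc1:
        return 0.0
    hits = sum(1 for i in sc1 if any(connected(i, j) for j in sc2))
    return hits / len(sc1)


def merge_campaigns(campaigns, connected, threshold=0.2):
    """Group campaign indices into botnets; the merge rule is an assumption."""
    parent = list(range(len(campaigns)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(campaigns)):
        for b in range(len(campaigns)):
            if a != b and connectivity_degree(campaigns[a], campaigns[b], connected) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for idx in range(len(campaigns)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())
```

For instance, with `connected` defined as "same IP address", two campaigns sharing one of two senders have W = 0.5 and end up in the same group.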
15 Identifying Botnets
16 Estimating Botnet Size
- Assumption: each bot sends approximately the same number of spam messages
- Known quantities
  - r: downsampling rate of the dataset
  - N: number of spam messages observed
  - N_1: number of bots observed with only one spam message in the dataset
- The goal is to estimate
  - s: the mean number of spam messages sent per bot
  - b: the number of bots (i.e., the botnet size)
17 Estimating Botnet Size
- The estimated number of spam messages from a botnet is N/r = sb
- The expected number of bots observed with only one spam message, N_1, can be written in terms of s, b, and r
- Solving these two relations yields the average number of spam messages sent per bot (s) and the botnet size (b)
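One derivation consistent with these assumptions, offered as a sketch: if each bot sends s messages and each message is kept independently with probability r, the number of a bot's messages that appear in the sample is Binomial(s, r), so a bot is seen exactly once with probability s r (1 - r)^(s - 1). Then N_1 = b s r (1 - r)^(s - 1) = N (1 - r)^(s - 1), which gives s = 1 + ln(N_1 / N) / ln(1 - r) and b = N / (r s). The independent-sampling model and the function name below are assumptions, not taken from the slide.

```python
import math


def estimate_botnet_size(n_observed, n_single, r):
    """Estimate (s, b) from N, N_1 and r under an assumed binomial sampling model."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must be strictly between 0 and 1")
    if not 0.0 < n_single <= n_observed:
        raise ValueError("need 0 < N_1 <= N")
    # From N_1 = N * (1 - r)**(s - 1):
    s = 1.0 + math.log(n_single / n_observed) / math.log(1.0 - r)
    # From N / r = s * b:
    b = n_observed / (r * s)
    return s, b


# Synthetic check: 1000 bots sending 5 messages each, sampled at r = 0.1.
s_hat, b_hat = estimate_botnet_size(500, 500 * 0.9 ** 4, 0.1)
```

In the synthetic check, the inputs are generated from s = 5 and b = 1000, and the estimator recovers those values up to floating-point error.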
18 Metrics and Findings
- Spam campaign duration
- Botnet sizes
- Per-day aspect: life span of botnets and spam campaigns
- Geographic distribution of botnets
19 Spam Campaign Duration
- Spam campaign duration: the time between the first and the last message seen from a campaign
- Over 50% of spam campaigns finish within 12 hours
20 Spam Campaign Duration
- Short-lived spam campaigns actually have larger volume
- More than 70% of spam messages are sent by spam campaigns lasting less than 8 hours
21 Botnet Sizes
22 Botnet Sizes
23 Botnet Sizes
24 Per-day Aspect: Life Span of Botnets and Spam Campaigns
- 60% of the spam received from botnets each day is sent from long-lived botnets
25 Geographic Distribution of Botnets
- About half of the botnets detected from the JMS dataset control machines in over 30 countries
- The total number of bots seen during the 9-day observation period of the JMS dataset is about 460,000 machines
26 Conclusion
- This work is a first step toward studying botnets from their economic motivations
- It gets a picture of bot activity by directly tracing the actual operation of bots through one of their primary revenue sources (spam)
- It estimates the size, behavioral characteristics, and geographic distribution of botnets