Clustering Spam
MIT Spam Conference 2008
Phil Tom
Simple Clustering Algorithm

Expand clusters to include similar messages:
1. Identical originating IP addresses.
2. Identical subject lines.
3. Identical message bodies.

Clustering pseudocode:

    for each cluster in clusters
        expand cluster
    for each message in unclustered messages
        create a new cluster
        add message to cluster
        expand cluster
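The pseudocode above can be sketched in Python as follows. This is a minimal in-memory illustration, not the implementation from the talk: the function name `cluster_messages`, the tuple layout, and the sample data are assumptions, the run starts with no existing clusters (so the first loop of the pseudocode has nothing to do), and no minimum-length filters are applied to subjects or bodies.

```python
from collections import defaultdict


def cluster_messages(messages):
    """Cluster messages that share an IP, a subject, or a body.

    messages: dict of message_id -> (sender_ip, subject, body).
    Returns a dict of message_id -> cluster_id.
    (Hypothetical helper; the data layout is an assumption.)
    """
    # One lookup table per attribute: value -> ids of messages carrying it.
    index = [defaultdict(set) for _ in range(3)]
    for mid, attrs in messages.items():
        for i, value in enumerate(attrs):
            index[i][value].add(mid)

    cluster_of = {}
    next_id = 1
    for mid in messages:
        if mid in cluster_of:
            continue
        # Create a new cluster and add the unclustered message to it.
        cid = next_id
        next_id += 1
        cluster_of[mid] = cid
        # Expand: keep pulling in messages that share any attribute
        # with a message already in the cluster.
        frontier = [mid]
        while frontier:
            current = frontier.pop()
            for i, value in enumerate(messages[current]):
                for other in index[i][value]:
                    if other not in cluster_of:
                        cluster_of[other] = cid
                        frontier.append(other)
    return cluster_of


# Made-up sample data (documentation-range IP addresses).
sample = {
    1: ("192.0.2.1", "cheap watches", "body A"),
    2: ("192.0.2.1", "hello there", "body B"),
    3: ("192.0.2.2", "hello there", "body C"),
    4: ("192.0.2.9", "unrelated subject", "body D"),
}
clusters = cluster_messages(sample)
```

Messages 1 and 2 share an IP and messages 2 and 3 share a subject, so all three end up in one cluster, while message 4 stays alone. The chaining is what lets a single cluster grow very large.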
Dimensional Model
Expand Cluster By IP

    update sdbf_message
       set cluster_id = ?
     where (cluster_id <> ? or cluster_id is null)
       and sender_ip_id in (select sender_ip_id
                              from sdbf_message
                             where cluster_id = ?)
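To see what this statement does, it can be run against a toy table. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the talk does not say which database was used. The `message_id` column, the helper name `expand_by_ip`, and the sample rows are assumptions, and all three placeholders are bound to the id of the cluster being expanded.

```python
import sqlite3

# The statement from the slide, unchanged apart from line breaks.
EXPAND_BY_IP = """
    update sdbf_message
       set cluster_id = ?
     where (cluster_id <> ? or cluster_id is null)
       and sender_ip_id in (select sender_ip_id
                              from sdbf_message
                             where cluster_id = ?)
"""


def expand_by_ip(conn, cluster_id):
    """Pull every message sharing a sender IP into cluster_id.

    Returns the number of rows that changed cluster.
    (Hypothetical helper, not from the talk.)
    """
    return conn.execute(EXPAND_BY_IP, (cluster_id,) * 3).rowcount


conn = sqlite3.connect(":memory:")
# Toy schema: only the columns the statement needs, plus an assumed key.
conn.execute(
    "create table sdbf_message ("
    "message_id integer primary key, sender_ip_id integer, cluster_id integer)"
)
conn.executemany(
    "insert into sdbf_message values (?, ?, ?)",
    [
        (1, 100, 7),     # already in cluster 7
        (2, 100, None),  # same IP, unclustered
        (3, 100, 8),     # same IP, currently in another cluster
        (4, 200, None),  # different IP, must stay untouched
    ],
)

moved = expand_by_ip(conn, 7)
rows = conn.execute(
    "select message_id, cluster_id from sdbf_message order by message_id"
).fetchall()
```

Because the condition is `cluster_id <> ? or cluster_id is null`, the statement both absorbs unclustered messages and moves messages over from other clusters. In a full run the IP, subject, and body steps would presumably be repeated until none of them changes any row.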
Expand Cluster By Body

    update sdbf_message m
       set cluster_id = ?
      from sdbd_body b
     where (m.cluster_id <> ? or m.cluster_id is null)
       and m.body_id in (select body_id
                           from sdbf_message
                          where cluster_id = ?)
       and m.body_id = b.body_id
       and b.size_in_bytes > 25
Expand Cluster By Subject

    update sdbf_message m
       set cluster_id = ?
      from sdbd_subject s
     where (m.cluster_id <> ? or m.cluster_id is null)
       and m.subject_id in (select subject_id
                              from sdbf_message
                             where cluster_id = ?)
       and m.subject_id = s.subject_id
       and (s.word_count > 1 or length(s.subject) > 10)
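The body and subject statements only match on values that pass a minimum-size test, presumably so that very short or generic values do not merge unrelated clusters. The two tests can be written as plain predicates. The function names are hypothetical, and counting words by splitting on whitespace is an assumption about how `word_count` was computed.

```python
def body_is_informative(body: bytes) -> bool:
    """Mirror of `size_in_bytes > 25` from the body statement."""
    return len(body) > 25


def subject_is_informative(subject: str) -> bool:
    """Mirror of `word_count > 1 or length(subject) > 10`.

    Word counting by whitespace split is an assumption.
    """
    return len(subject.split()) > 1 or len(subject) > 10


checks = {
    "Hi": subject_is_informative("Hi"),                    # one short word
    "Re: hello": subject_is_informative("Re: hello"),      # two words
    "Unsubscribe": subject_is_informative("Unsubscribe"),  # one word, 11 characters
}
```

A one-word subject still qualifies if it is longer than 10 characters, which keeps long single tokens usable as a match key.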
Test Data Set

Dec 22, Dec 29, 2007
Single “Received:” header tag only
No multi-part messages
1.7 million messages
Roughly 20%
Cluster Results
Messages per Cluster Size

*Not including the big cluster
Top Clusters by IPs

cluster_id | messages | subject | bodies | ips | networks | countries
 | | | | 8940 |
 | | 451 | | 1313 | 57 | 2
59 | | 19 | 15 | 962 | 4 | 1
68 | 1065 | 2 | 1065 | 609 | 12 | 4
69 | 4476 | 59 | 85 | 514 | 17 |
 | 5521 | 5 | 9 | 283 | 4 |
 | 722 | 149 | 333 | 275 | 16 |
 | 307 | 2 | 306 | 208 | 179 |
 | 240 | 7 | 9 | 184 | 4 |
 | 5581 | 15 | 5212 | 153 | 119 |
 | 2934 | 20 | 2934 | 150 | 1 |
 | 377 | 22 | 377 | 125 | 3 |
 | 307 | 4 | 3 | 124 | 5 |
 | 3399 | 48 | 169 | 114 | 17 |
 | 156 | 4 | 155 | 105 | 96 |
 | 1117 | 174 | 1100 | 101 | 4 | 1
The Big One

Cluster 1 summary:
messages | subject | bodies | ips | networks | countries
 | | | | 8940 | 177

Top 10 countries by IP count:
messages | subjects | bodies | ips | networks | country_name
 | | | | 1453 | United States
 | 5110 | | | 170 | Germany
 | 6558 | | | 147 | Spain
 | 4705 | | | 48 | Turkey
 | 4624 | | | 209 | United Kingdom
 | 3194 | | | 42 | Peru
 | 2848 | | | 148 | Colombia
 | 3059 | | | 152 | Chile
 | 5063 | | 9664 | 12 | Brazil
 | 4381 | | 9372 | 126 | Italy
Clustering the Big One

Create clusters on subject and body

messages | cluster_id | ips | subjects | bodies
 | | | 34 | 136 fake watches
 | | | 330 | penis enlargement
 | | | 27 | online casino
 | | | 55 | fake name brand goods
 | | 7190 | 81 | viagra
 | | | 20 | valium
 | | 5990 | | online pharmacy
 | | 3391 | 45 | 5 stock investment
 | | 4149 | 3 | 5 porn
 | | 3483 | 9 | software
 | | 9240 | 17 | 9273 russian dating

messages unique IPs
Clustering the Big One (cont)

Number of overlapping IPs between clusters
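An overlap count of this kind is the size of the intersection of two clusters' IP sets. A minimal sketch, assuming each sub-cluster's IPs are already available as a Python set; the function name `ip_overlap` and the addresses are made up for illustration, and only the cluster labels come from the slides.

```python
from itertools import combinations


def ip_overlap(cluster_ips):
    """Count shared IPs for every pair of clusters.

    cluster_ips: dict of cluster label -> set of IP strings.
    Returns a dict of (label_a, label_b) -> number of IPs in both.
    (Hypothetical helper, not from the talk.)
    """
    return {
        (a, b): len(cluster_ips[a] & cluster_ips[b])
        for a, b in combinations(sorted(cluster_ips), 2)
    }


# Made-up addresses from the documentation ranges.
example = {
    "viagra": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    "valium": {"192.0.2.2", "192.0.2.3"},
    "software": {"198.51.100.7"},
}
overlap = ip_overlap(example)
```

Pairs with a high count relative to the smaller cluster's size suggest that the same senders are behind both campaigns.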
Am I Bot or Not?

cluster_id | messages | subjects | bodies | ips | networks | countries
 | | 451 | | 1313 | 57 | 2

Subject content widely varied
Many blocks of consecutive IPs
Some blocks cover all or most of a /24

messages | subjects | bodies | ips | networks | country_name
 | 87 | 1246 | 5 | 3 | Canada
 | 443 | | 1308 | 54 | United States
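One way to spot senders that occupy all or most of a /24 is to count distinct addresses per /24 network. The sketch below uses the standard `ipaddress` module and assumes IPv4 addresses. The function names and the cut-off of 128 addresses (half of the 256 in a /24) are assumptions for illustration, not figures from the talk.

```python
import ipaddress
from collections import Counter


def slash24_counts(ips):
    """Return a Counter of /24 network -> number of distinct IPs seen in it."""
    counts = Counter()
    for ip in set(ips):
        # strict=False masks the host bits, e.g. 192.0.2.17 -> 192.0.2.0/24.
        counts[ipaddress.ip_network(f"{ip}/24", strict=False)] += 1
    return counts


def dense_blocks(ips, min_addresses=128):
    """Keep only the /24s with at least min_addresses distinct senders.

    The default threshold is a hypothetical choice.
    """
    return {
        net: n for net, n in slash24_counts(ips).items() if n >= min_addresses
    }


# Made-up data: 200 senders in one /24 plus a single stray address.
sample_ips = [f"192.0.2.{i}" for i in range(1, 201)] + ["198.51.100.5"]
dense = dense_blocks(sample_ips)
```

A cluster dominated by a few dense blocks looks more like mail from dedicated server ranges than from a botnet spread thinly across many networks.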
Failure is Success

Delivery Notification cluster:
cluster_id | messages | subject | bodies | ips | networks | countries
 | 1065 | 2 | 1065 | 609 | 12 | 4

Subject Detail:
messages | subject
 | Delivery failure
452 | failure delivery

Delivery notification from legitimate mail servers
Not clustered with spam or sources of spam
Chinese Spam

All Chinese messages:
messages | ips | networks | clusters | country_name
 | 5179 | 197 | 922 | China
139 | 2 | 1 | 2 | Thailand
78 | 12 | 3 | 4 | United States
5 | 4 | 1 | 2 | Germany

Top 10 Chinese Clusters:
cluster_id | messages | subject | bodies | ips | networks | countries
 | | 19 | 15 | 962 | 4 |
 | 9987 | 1803 | 8 | 19 | 3 | 1
12 | 8054 | 9 | 8 | 26 | 1 |
 | 5521 | 5 | 9 | 283 | 4 | 1
69 | 4476 | 59 | 85 | 514 | 17 |
 | 3399 | 48 | 169 | 114 | 17 |
 | 2347 | 10 | 10 | 1 | 1 |
 | 2187 | 21 | 73 | 41 | 6 | 1
56 | 2047 | 29 | 45 | 61 | 14 |
 | 1944 | 3 | 4 | 5 | 1 | 1
Small Clusters

Varied subjects and bodies.
Manual clustering of “online pharmacy” spam

Coalesced clusters:
messages | ips | subjects | bodies | clusters
 | 9685 | | | 3651

Example subjects:
Buy sugar pills online cheap!!!!11one
Buy sugar pills online cheap!!!1cos(0)
Buy sugar pills online cheap!111pi^0
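Subjects like the examples above differ only in a short suffix, so exact matching never joins them. One possible way to coalesce them automatically is fuzzy string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; this is a swapped-in technique, not the manual clustering described in the talk, and the function name and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def coalesce_subjects(subjects, threshold=0.8):
    """Greedily group subjects that closely resemble a group's first member.

    Returns a list of groups (lists of subjects). The threshold is a
    hypothetical value and would need tuning on real data.
    """
    groups = []
    for subject in subjects:
        for group in groups:
            representative = group[0]
            ratio = SequenceMatcher(None, representative, subject).ratio()
            if ratio >= threshold:
                group.append(subject)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([subject])
    return groups


subjects = [
    "Buy sugar pills online cheap!!!!11one",
    "Buy sugar pills online cheap!!!1cos(0)",
    "Buy sugar pills online cheap!111pi^0",
    "Delivery failure",
]
groups = coalesce_subjects(subjects)
```

Comparing each subject only against the first member of a group keeps the sketch simple, but it makes the result depend on input order. Pairwise comparison also scales poorly to millions of messages, so a real run would need a cheaper pre-filter.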
What’s Next?

Improve the similarity metrics
Cluster a population or random sample
Add time to the analysis