Published by Alicia Banks. Modified over 9 years ago.
Exploiting Network Structure for Proactive Spam Mitigation
Shobha Venkataraman*
Joint work with Subhabrata Sen§, Oliver Spatscheck§, Patrick Haffner§ & Dawn Song*
* Carnegie Mellon University
§ AT&T Labs - Research
2 Daily Mail at Real Server
[Plot: all incoming mail and legitimate mail]
- Over 90% of the mail received on any day is spam!
3 Spam Mitigation
[Diagram: mail servers, content-based spam-filtering]
- Content-based spam-filtering is a scalability bottleneck!
4 Spam Mitigation at Network-Level
[Diagram: mail servers, content-based spam-filtering, IP address info]
- IP address info is computationally efficient to use
- IP addresses are difficult to spoof: a handshake is required
- Apply a coarse-grained but effective technique first?
5 Spam Mitigation under Overload
[Diagram: mail servers]
- Overload: the server gets much more mail than it can process
- Goal: bias the mail processed towards legitimate mail
6 Contributions
- Use history & structure of IP addresses as an effective coarse-grained mechanism to differentiate spam from legit mail
- Extensive analysis of IP-based properties
  - Individual IP analysis: infer significant legitimate senders
  - Analysis of IP aggregates with network-aware clustering: infer significant (often transient) spammers
- Application to server overload
  - Solution techniques derived from analysis results
  - Trace-driven simulations show up to a factor of 3 improvement in legit mail accepted
7 Outline
- Introduction
- IP Analysis
- Cluster Analysis
- Application under Server Overload
- Conclusion
8 Data
- Logs from a Postfix mail server at an enterprise location of a large corporation
  - 700+ user mailboxes
  - Includes all mail sent to the mail server
- Legitimate mail: mail deemed legitimate by SpamAssassin
- Spam: all the rest
- Total: 28+ million messages over 6 months
  - 27 million spam, 1.4 million legitimate
9 IP Analysis
- Spam characteristics at the granularity of the sending mail server's IP address
- Find historical communication patterns of IPs to distinguish the bulk of legitimate mail & spam
- Use IP spam-ratio to characterize IP behaviour
  - Def: fraction of mail sent by an IP address that is spam
  - e.g., only legit mail: spam-ratio = 0% (good); only spam: spam-ratio = 100% (bad)
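The spam-ratio definition above can be made concrete with a minimal sketch. The record shape (pairs of sender IP and a spam flag) and the function name are assumptions for illustration, not taken from the talk; the addresses are documentation-range placeholders.

```python
from collections import defaultdict


def ip_spam_ratios(messages):
    """Return {ip: fraction of that IP's mail that is spam}.

    `messages` is an iterable of (ip, is_spam) pairs; this record
    shape is an assumption made for illustration.
    """
    totals = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in messages:
        totals[ip] += 1
        if is_spam:
            spam[ip] += 1
    return {ip: spam[ip] / totals[ip] for ip in totals}


# Hypothetical log: one all-legit sender, one all-spam sender, one mixed.
log = [
    ("192.0.2.1", False), ("192.0.2.1", False),
    ("198.51.100.7", True), ("198.51.100.7", True),
    ("203.0.113.9", True), ("203.0.113.9", False),
]
ratios = ip_spam_ratios(log)
```

A ratio of 0.0 corresponds to the slide's "good" case and 1.0 to the "bad" case.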
10 IP Analysis
Questions:
- Is IP spam-ratio a good discriminating feature?
  - How are IPs/spam/legit mail distributed by spam-ratio?
  - Effect on spam mitigation if spam-ratio is perfectly predicted
- Can long-term IP history differentiate legit mail from spam?
11 IP Addresses
[Plot: CDF across IP addresses]
- Good IPs: ~10%; spam-ratio: 0%
- Bad IPs: ~90%; spam-ratio: > 99%
- Nearly all IPs have spam-ratios of 0% or 100%, i.e., they send only legit mail or only spam
12 Distribution of Spam Volume
[Plot: fraction of spam sent by IPs with spam-ratio < x, where x is the IP spam-ratio]
- IPs with spam-ratio 90%-100% contribute over 99% of spam!
- Almost all spam comes from IPs with very high spam-ratio
13 Distribution of Legitimate Mail
[Plot: fraction of legit mail sent by IPs with spam-ratio < x, where x is the IP spam-ratio]
- IPs with spam-ratio over 95% contribute only a tiny fraction (5%)
- Very little legit mail comes from IPs with very high spam-ratios
14 Effect on Spam Mitigation
[Plot: spam and legit mail]
- IP spam-ratio, if perfectly predicted every day, could identify most legitimate mail!
- e.g., accepting mail only from IPs with spam-ratio < 95% accepts very little spam and most legit mail
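The threshold example on this slide can be sketched as follows. The per-IP counts are invented for illustration and the function name is hypothetical; the sketch only shows how the two accepted fractions would be computed if each IP's spam-ratio were known exactly.

```python
def accepted_fractions(ip_counts, threshold):
    """Accept all mail from IPs whose spam-ratio is below `threshold`.

    `ip_counts` maps ip -> (legit_count, spam_count). Returns the pair
    (fraction of all legit mail accepted, fraction of all spam accepted).
    Assumes the data contains at least one legit and one spam message.
    """
    total_legit = sum(legit for legit, _ in ip_counts.values())
    total_spam = sum(spam for _, spam in ip_counts.values())
    accepted_legit = accepted_spam = 0
    for legit, spam in ip_counts.values():
        sent = legit + spam
        if sent and spam / sent < threshold:
            accepted_legit += legit
            accepted_spam += spam
    return accepted_legit / total_legit, accepted_spam / total_spam


# Invented counts: two mostly-legit senders, two near-pure spammers.
counts = {
    "a": (100, 0),
    "b": (40, 10),
    "c": (2, 198),
    "d": (0, 500),
}
legit_frac, spam_frac = accepted_fractions(counts, threshold=0.95)
```

With these made-up counts, senders "a" and "b" fall under the 95% threshold, so nearly all legit mail and only a small share of spam is accepted.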
15 IP Analysis
Questions:
- Is IP spam-ratio a good discriminating feature? YES, if perfectly predicted every day
- Can long-term IP history differentiate legit mail from spam?
  - Temporal stability: do most IP addresses fluctuate significantly in their spamming behaviour every day? No
  - Persistence: how much legit mail/spam is contributed by long-lived IPs? Next…
16 IP Persistence: Legit mail & spam
[Plots: legit mail sent by good IPs (low spam-ratio); spam sent by bad IPs (high spam-ratio)]
- Plot annotations: "20% comes from IPs present 20+ days"; "52% comes from IPs present 20+ days"
- Also: less than 5% of total IPs are present for 20+ days
- IPs present on many days contribute the bulk of legit mail & little spam
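A persistence figure of the kind quoted on this slide could be computed roughly as below. This is a sketch under assumed inputs: the (day, ip, message count) record shape and the function name are hypothetical, and the sample numbers are invented.

```python
from collections import defaultdict


def share_from_persistent_ips(records, min_days):
    """Fraction of messages sent by IPs seen on at least `min_days` days.

    `records` is an iterable of (day, ip, n_messages) tuples; this
    shape is an assumption made for illustration.
    """
    days_seen = defaultdict(set)
    volume = defaultdict(int)
    for day, ip, n_messages in records:
        days_seen[ip].add(day)
        volume[ip] += n_messages
    total = sum(volume.values())
    persistent = sum(
        count for ip, count in volume.items()
        if len(days_seen[ip]) >= min_days
    )
    return persistent / total


# Invented sample: one sender seen on three days, one seen only once.
records = [
    (1, "192.0.2.1", 10), (2, "192.0.2.1", 10), (3, "192.0.2.1", 10),
    (1, "198.51.100.7", 70),
]
share = share_from_persistent_ips(records, min_days=3)
```

Running the same function separately on legit mail and on spam would give the two percentages that the plots compare.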
17 IP Analysis Summary
- The bulk of legit mail comes from a small number of IPs that appear frequently and are consistently good
  - Use the history of legit senders to distinguish legit mail
- Most spam comes from transient IPs
  - A purely blacklisting approach has limitations (also consistent with findings in [RF06])
[RF06] Understanding the Network-level Behavior of Spammers, Ramachandran & Feamster, SIGCOMM '06
18 Outline
- Introduction
- IP Analysis
- Cluster Analysis
- Application under Server Overload
- Conclusion
19 Why IP Clusters?
- Since spamming IPs are transient, can coarser IP aggregations help?
  - Incorporate the collective history of individual transient spammers
  - Exploit network structure for guilt-by-association
- Network-aware clustering [KW00]
  - Set of IP prefixes collected from BGP routing tables
  - Each IP prefix represents a cluster of IP addresses: an IP belongs to the cluster with the longest matching prefix in the set
  - Clusters are topologically close and often under common administrative control
[KW00] On Network-Aware Clustering of Web Clients, Krishnamurthy & Wang, SIGCOMM '00
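The longest-matching-prefix rule described above can be sketched with Python's standard `ipaddress` module. The prefix list here is a made-up stand-in for prefixes taken from BGP routing tables, and the linear scan is only for clarity; a real implementation would more likely use a trie or similar structure.

```python
import ipaddress


def build_clusters(prefixes):
    """Parse CIDR strings into network objects (one per cluster)."""
    return [ipaddress.ip_network(p) for p in prefixes]


def cluster_of(ip, clusters):
    """Return the cluster with the longest prefix containing `ip`.

    Returns None if no prefix in the set covers the address.
    """
    addr = ipaddress.ip_address(ip)
    matches = [net for net in clusters if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda net: net.prefixlen)


# Hypothetical prefix set; real sets would come from BGP tables.
clusters = build_clusters(["10.0.0.0/8", "10.1.0.0/16", "192.0.2.0/24"])
specific = cluster_of("10.1.2.3", clusters)    # covered by /8 and /16
general = cluster_of("10.200.0.1", clusters)   # covered by /8 only
unknown = cluster_of("203.0.113.5", clusters)  # covered by no prefix
```

Once each IP maps to a cluster, mail counts can be aggregated per cluster so that a previously unseen IP inherits the history of its neighbours.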
20 Cluster Analysis
- Use cluster spam-ratio to capture cluster behaviour
  - Fraction of mail sent by a cluster that is spam (sent by cluster = sent by IPs belonging to the cluster)
Questions:
- Granularity: does cluster spam-ratio approximate IP spam-ratio well, for distinguishing spam? Yes
- Persistence: how much of spam & legit mail do long-lived clusters contribute? Next…
21 Cluster Persistence: Spam
[Plot: bad IPs and bad clusters]
- Over 95% of total spam comes from IPs in bad clusters (with high spam-ratios)
- 90% of total spam comes from bad clusters present for 60+ days
- Most spam comes from bad clusters present for many days
22 Cluster Analysis: Results
- Most spam originates from long-lived clusters with high spam-ratios.
- Much less legit mail originates from clusters with low spam-ratio, but these clusters are still long-lived.
- Network-aware clusters may provide a history for spamming IPs, even if individual IPs are transient.
23 Outline
- Introduction
- IP Analysis
- Cluster Analysis
- Application under Server Overload
- Conclusion
24 Server Overload Problem
[Diagram: mail servers]
- Problem: the server receives much more mail than it can process
- Goal: maximize legitimate mail accepted for processing
25 Motivation to Overload
Server capacity: 100/min
- No overload: incoming legit 20/min, spam 80/min; server processes spam 80, legit 20
- Overloaded by factor of 2: incoming legit 20/min, spam 180/min; server processes spam 90, legit 10
- Spammer has incentive to overload the server with spam
- Spammer has capacity to overload the server greatly with spam
  - With large botnets, spammers can increase spam sent
  - Not sufficient to increase server capacity
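The numbers on this slide follow from a simple proportional model, sketched below. The assumption (not stated on the slide) is that an overloaded server processes a uniformly random subset of arrivals, so each class keeps the same fraction; the function name is hypothetical.

```python
def accepted_per_minute(legit_rate, spam_rate, capacity):
    """Mail processed per minute under indiscriminate acceptance.

    Assumes the server cannot tell legit mail from spam, so when
    arrivals exceed capacity every class is cut by the same factor.
    Returns (legit processed, spam processed).
    """
    offered = legit_rate + spam_rate
    if offered <= capacity:
        return legit_rate, spam_rate
    keep = capacity / offered
    return legit_rate * keep, spam_rate * keep


no_overload = accepted_per_minute(20, 80, capacity=100)
overload_x2 = accepted_per_minute(20, 180, capacity=100)
```

Doubling the offered load by adding only spam halves the legit mail processed, which is the incentive the slide points to.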
26 Approach
- Use history and structure of IP addresses
  - IP/cluster history assigns a reputation to incoming mail
  - Based on IP reputation & server load, decide which mail is accepted/refused for processing
  - Details left to paper
- Validate by simulation of server & policies on traces
  - Details of simulation in paper
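The slide defers the actual policy to the paper, so the following is not the authors' algorithm. It is only a toy illustration of the general idea of combining reputation with load: when a batch exceeds capacity, prefer mail from senders with the lowest historical spam-ratio. The function name, the batch shape, and the default score for unknown senders are all assumptions.

```python
def select_for_processing(batch, reputation, capacity, unknown_score=0.5):
    """Pick which messages to process when the server may be overloaded.

    `batch` is a list of (msg_id, ip); `reputation` maps ip -> historical
    spam-ratio in [0, 1]. Senders without history get `unknown_score`,
    which is an arbitrary illustrative choice.
    """
    if len(batch) <= capacity:
        # No overload: process everything, in arrival order.
        return [msg_id for msg_id, _ in batch]
    # Overload: lowest spam-ratio first; sorted() is stable, so ties
    # keep their arrival order.
    ranked = sorted(batch, key=lambda m: reputation.get(m[1], unknown_score))
    return [msg_id for msg_id, _ in ranked[:capacity]]


reputation = {"good-ip": 0.0, "bad-ip": 1.0}
batch = [("m1", "bad-ip"), ("m2", "good-ip"), ("m3", "new-ip"), ("m4", "bad-ip")]
chosen = select_for_processing(batch, reputation, capacity=2)
```

Cluster-level history could be substituted for the per-IP lookup when an individual IP has no record of its own.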
27 Simulation Results
- Performance measure, computed for each hour:
  - Goodput of policy: % of available legit mail accepted by the policy (i.e., accepted for processing by the mail server)
Summary: server goodput (%), averaged over all hours

Overload factor   Default policy   IP-history policy
No overload       93.7             96.7
2                 61.7             79.6
3                 39.5             68.6
4                 26.8             64.5
5                 20.3             63.0

- Factor of 3 improvement!
- Detailed analysis of performance in paper
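The "factor of 3" claim can be checked directly against the goodput figures reported on this slide. The sketch below only divides the IP-history goodput by the default goodput for each overload factor; the variable and function names are illustrative.

```python
# Goodput (%) as reported on the slide:
# (overload factor, default policy, IP-history policy).
GOODPUT = [
    ("none", 93.7, 96.7),
    ("2", 61.7, 79.6),
    ("3", 39.5, 68.6),
    ("4", 26.8, 64.5),
    ("5", 20.3, 63.0),
]


def improvement_factors(rows):
    """Ratio of IP-history goodput to default goodput per overload factor."""
    return {factor: ip_history / default for factor, default, ip_history in rows}


factors = improvement_factors(GOODPUT)
for factor, ratio in factors.items():
    print(f"overload {factor}: {ratio:.2f}x")
```

The ratio grows with the overload factor and exceeds 3 only at overload factor 5, which is consistent with the "up to factor of 3" wording used earlier in the talk.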
28 Conclusion
- Use history & structure of IP addresses to prioritize legitimate mail over spam efficiently.
- Measurement-based analysis of IP properties
  - Individual IP-address analysis helps identify legitimate senders
  - Analysis with network-aware clusters helps identify transient spammers
- Application to server overload problem
  - Trace-driven simulation demonstrates that the analysis can help prioritize most legitimate mail over spam
29 Thank you! Questions?
(Contact: shobha@cs.cmu.edu)