Exploiting Network Structure for Proactive Spam Mitigation Shobha Venkataraman * Joint work with Subhabrata Sen §, Oliver Spatscheck §, Patrick Haffner.


1 Exploiting Network Structure for Proactive Spam Mitigation
Shobha Venkataraman*
Joint work with Subhabrata Sen§, Oliver Spatscheck§, Patrick Haffner§ & Dawn Song*
* Carnegie Mellon University
§ AT&T Labs - Research

2 Daily Mail at Real Server
[Figure: all incoming mail vs. legitimate mail]
Over 90% of the mail received on any day is spam!

3 Spam Mitigation
[Diagram labels: Mail Servers; Content-based spam-filtering]
Scalability bottleneck!

4 Spam Mitigation at Network-Level
[Diagram labels: Mail Servers; Content-based spam-filtering]
IP address info
- Computationally efficient
- Difficult to spoof: handshake required
Coarse-grained but effective technique first?

5 Spam Mitigation under Overload
[Diagram label: Mail Servers]
- Overload: Server gets much more mail than it can process
- Goal: bias mail processed towards legitimate mail

6 Contributions
Use history & structure of IP addresses as an effective coarse-grained mechanism to differentiate spam from legit mail
- Extensive analysis of IP-based properties
  - Individual IP analysis: → infer significant legitimate senders
  - Analysis of IP aggregates with network-aware clustering: → infer significant (often transient) spammers
- Application to server overload
  - Solution techniques derived from analysis results
  - Trace-driven simulations show up to a factor of 3 improvement in legit mail accepted

7 Outline
- Introduction
- IP Analysis
- Cluster Analysis
- Application under Server Overload
- Conclusion

8 Data
- Logs from Postfix mail server at enterprise location of large corporation
  - 700+ user mailboxes
  - Includes all mail sent to mail server
- Legitimate mail: mail deemed legitimate by SpamAssassin
- Spam: all the rest
- Total 28+ million messages over 6 months
  - 27 million spam, 1.4 million legitimate

9 IP Analysis
- Spam characteristics at granularity of sending mail server’s IP address
- Find historical communication patterns of IPs to distinguish bulk of legitimate mail & spam
- Use IP spam-ratio to characterize IP behaviour
  - Def: fraction of mail sent by an IP address that is spam
  - e.g., only legit mail: spam-ratio = 0% (good); only spam: spam-ratio = 100% (bad)
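
The spam-ratio definition above can be illustrated with a minimal Python sketch. The log format (pairs of sender IP and a spam label) and the sample addresses are assumptions made for illustration; they are not the format of the traces used in the talk.

```python
from collections import defaultdict

def ip_spam_ratios(log):
    """Map each sender IP to the fraction of its messages labelled spam."""
    total = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in log:
        total[ip] += 1
        spam[ip] += int(is_spam)
    return {ip: spam[ip] / total[ip] for ip in total}

# Hypothetical log entries: (sender IP, labelled as spam?)
sample_log = [
    ("192.0.2.1", False), ("192.0.2.1", False),      # only legit mail
    ("198.51.100.7", True), ("198.51.100.7", True),  # only spam
    ("203.0.113.5", True), ("203.0.113.5", False),   # mixed sender
]
ratios = ip_spam_ratios(sample_log)
```

Here the first IP gets spam-ratio 0 (good), the second gets 1 (bad), and the mixed sender gets 0.5.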

10 IP Analysis
Questions:
- Is IP spam-ratio a good discriminating feature?
  - How are IPs/spam/legit mail distributed by spam-ratio?
  - Effect on spam mitigation if spam-ratio is perfectly predicted
- Can long-term IP history differentiate legit mail from spam?

11 IP Addresses
[Figure: CDF across IP addresses]
- Good IPs: ~10%; spam-ratio: 0%
- Bad IPs: ~90%; spam-ratio: > 99%
- Nearly all IPs have spam-ratios of 0% or 100%, i.e., they send only legit mail or only spam

12 Distribution of Spam Volume
[Figure: fraction of spam sent by IPs with spam-ratio < x, where x is the IP spam-ratio]
- IPs with spam-ratio 90%-100% contribute over 99% of spam!
- Almost all spam comes from IPs with very high spam-ratio

13 Distribution of Legitimate Mail
[Figure: fraction of legit mail sent by IPs with spam-ratio < x, where x is the IP spam-ratio]
- IPs with spam-ratio over 95% contribute a tiny fraction (5%) of legit mail
- Very little legit mail comes from IPs with very high spam-ratios

14 Effect on Spam Mitigation
[Figure curves: Spam; Legit mail]
- IP spam-ratio, if perfectly predicted every day, could identify most legitimate mail!
- e.g., accept mail from IPs with spam-ratio < 95%: → accept very little spam, and most legit mail
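
A threshold rule like the one in the example above can be sketched as follows. This assumes the spam-ratio of every sending IP is already known (the "perfectly predicted" case); the function name, the message format, and the toy data are hypothetical.

```python
def accepted_fractions(messages, ratios, threshold=0.95):
    """Return (fraction of legit mail accepted, fraction of spam accepted)
    when mail is accepted only from IPs with spam-ratio below threshold."""
    legit_total = spam_total = legit_ok = spam_ok = 0
    for ip, is_spam in messages:
        accepted = ratios[ip] < threshold
        if is_spam:
            spam_total += 1
            spam_ok += accepted
        else:
            legit_total += 1
            legit_ok += accepted
    return legit_ok / legit_total, spam_ok / spam_total

# Toy data (assumed): sender "a" is good, "b" is bad, "c" is mixed.
toy_ratios = {"a": 0.0, "b": 1.0, "c": 0.5}
toy_messages = ([("a", False)] * 3 + [("b", True)] * 8
                + [("c", True), ("c", False)])
legit_frac, spam_frac = accepted_fractions(toy_messages, toy_ratios)
```

On this toy input all legit mail is accepted, while only the single spam message from the mixed sender gets through (1 of 9).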

15 IP Analysis
Questions:
- Is IP spam-ratio a good discriminating feature? YES, if perfectly predicted every day
- Can long-term IP history differentiate legit mail from spam?
  - Temporal stability: Do most IP addresses fluctuate significantly in their spamming behaviour every day? No
  - Persistence: How much legit mail/spam is contributed by long-lived IPs? Next…

16 IP Persistence: Legit mail & spam
[Figure panels: legit mail sent by good IPs (low spam-ratio); spam sent by bad IPs (high spam-ratio)]
- Figure callouts: 20% comes from IPs present 20+ days; 52% comes from IPs present 20+ days
- Also: less than 5% of total IPs are present for 20+ days
- IPs present on many days contribute bulk of legit mail & little spam
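
One way to measure persistence of this kind is to count the distinct days on which each IP appears, then ask what share of mail comes from IPs seen on at least some number of days. The sketch below assumes records of the form (day, sender IP); the names and sample data are illustrative only.

```python
from collections import defaultdict

def fraction_from_persistent_ips(records, min_days):
    """Fraction of messages sent by IPs seen on at least min_days distinct days."""
    days_seen = defaultdict(set)
    volume = defaultdict(int)
    for day, ip in records:
        days_seen[ip].add(day)
        volume[ip] += 1
    total = sum(volume.values())
    persistent = sum(n for ip, n in volume.items()
                     if len(days_seen[ip]) >= min_days)
    return persistent / total

# Hypothetical records: one IP active on 5 days, two IPs active on 1 day each.
records = ([(day, "10.0.0.1") for day in range(5)]
           + [(0, "10.0.0.2")] * 3
           + [(1, "10.0.0.3")] * 2)
share = fraction_from_persistent_ips(records, min_days=3)
```

In this toy input, 5 of the 10 messages come from the one IP present on three or more days, so `share` is 0.5.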

17 IP Analysis Summary
- Bulk of legit mail comes from a small number of IPs that appear frequently and are consistently good
  → History of legit senders to distinguish legit mail
- Most spam comes from transient IPs
  → Purely blacklisting approach has limitations (also consistent with findings in [RF06])
[RF06] Understanding the Network-level Behavior of Spammers, Ramachandran & Feamster, SIGCOMM ’06

18 Outline
- Introduction
- IP Analysis
- Cluster Analysis
- Application under Server Overload
- Conclusion

19 Why IP Clusters?
- Since spamming IPs are transient, can coarser IP aggregations help?
  - Incorporate collective history of individual transient spammers
  - Exploit network structure for guilt-by-association
- Network-aware clustering [KW00]
  - Set of IP prefixes collected from BGP routing tables
  - Each IP prefix represents a cluster of IP addresses: an IP belongs to the cluster with the longest matching prefix in the set
  - Topologically close, often under common admin control
[KW00] On Network-Aware Clustering of Web Clients, Krishnamurthy & Wang, SIGCOMM ’00
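
The longest-matching-prefix rule can be sketched with Python's standard `ipaddress` module. The prefix list below is made up for illustration (it is not taken from any BGP table), and the linear scan is chosen for brevity; a real deployment would more likely use a prefix trie.

```python
import ipaddress

def build_clusters(prefixes):
    """Parse prefix strings such as '10.0.0.0/8' into network objects."""
    return [ipaddress.ip_network(p) for p in prefixes]

def cluster_of(ip, clusters):
    """Return the cluster with the longest prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in clusters if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda net: net.prefixlen)

# Hypothetical prefix set standing in for prefixes collected from BGP tables.
clusters = build_clusters(["10.0.0.0/8", "10.1.0.0/16", "192.0.2.0/24"])
```

With this set, `cluster_of("10.1.2.3", clusters)` picks `10.1.0.0/16` over the shorter `10.0.0.0/8`, and an address outside every prefix maps to `None`.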

20 Cluster Analysis
- Use cluster spam-ratio to capture cluster behaviour
  - Fraction of mail sent by cluster that is spam (sent by cluster = sent by IPs belonging to cluster)
- Questions:
  - Granularity: Does cluster spam-ratio approximate IP spam-ratio well, for distinguishing spam? Yes
  - Persistence: How much of spam & legit mail do long-lived clusters contribute? Next…

21 Cluster Persistence: Spam
[Figure curves: Bad IPs; Bad Clusters]
- Over 95% of total spam comes from IPs in bad clusters (with high spam-ratios)
- 90% of total spam comes from bad clusters present for 60+ days
- Most of spam comes from bad clusters present for many days

22 Cluster Analysis: Results
- Most spam originates from long-lived clusters with high spam-ratios.
- Much less legit mail originates from clusters with low spam-ratio, but these clusters are still long-lived.
- Network-aware clusters may provide a history for spamming IPs, even if individual IPs are transient.

23 Outline
- Introduction
- IP Analysis
- Cluster Analysis
- Application under Server Overload
- Conclusion

24 Server Overload Problem
[Diagram label: Mail Servers]
- Problem: Server receives much more mail than it can process
- Goal: Maximize legitimate mail accepted for processing

25 Motivation to Overload
Server capacity: 100/min
- No overload: incoming legit 20/min, spam 80/min; server processes legit 20, spam 80
- Overloaded by factor of 2: incoming legit 20/min, spam 180/min; server processes legit 10, spam 90
- Spammer has incentive to overload server with spam
- Spammer has capacity to overload server greatly with spam
  - With large botnets, spammers can increase spam sent
  - Not sufficient to increase server capacity
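
The numbers on this slide follow from simple proportional arithmetic. The sketch below assumes an overloaded server processes a uniformly random share of incoming mail up to its capacity, so the accepted mix mirrors the incoming mix; that model is an assumption used to reproduce the slide's figures, and the function name is hypothetical.

```python
def expected_accepted(legit_rate, spam_rate, capacity):
    """Expected (legit, spam) processed per minute when the server keeps a
    uniformly random share of incoming mail, capped at its capacity."""
    total = legit_rate + spam_rate
    keep = min(1.0, capacity / total)  # share of incoming mail processed
    return legit_rate * keep, spam_rate * keep

no_overload = expected_accepted(20, 80, capacity=100)   # (20.0, 80.0)
overloaded = expected_accepted(20, 180, capacity=100)   # (10.0, 90.0)
```

Doubling the total load by adding spam halves the legit mail processed, from 20 to 10 per minute, which is the spammer's incentive to overload.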

26 Approach
- Use history and structure of IP addresses
  - IP/cluster history assigns reputation to incoming mail
  - Based on IP reputation & server load, decide which mail is accepted/refused for processing
  - Details left to paper
- Validate by simulation of server & policies on traces
  - Details of simulation in paper
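
The slides leave the policy details to the paper, so the following is only a rough illustration of the general idea, not the authors' algorithm. It assumes reputation is a predicted spam-ratio, that an unknown IP falls back to its cluster's reputation and then to a neutral default, and that an overloaded server simply keeps the best-reputed mail up to capacity. All names and values are hypothetical.

```python
def choose_mail(queue, ip_rep, cluster_rep, cluster_of, capacity, default=0.5):
    """Pick which queued messages (identified by sender IP) to process.

    Illustrative policy only: if the queue fits within capacity, accept all;
    otherwise accept the messages with the lowest predicted spam-ratio.
    """
    def score(ip):
        if ip in ip_rep:                 # individual IP history, if any
            return ip_rep[ip]
        # otherwise fall back to the cluster's history, then to the default
        return cluster_rep.get(cluster_of.get(ip), default)

    if len(queue) <= capacity:
        return list(queue)
    return sorted(queue, key=score)[:capacity]

# Hypothetical reputations and queue.
ip_rep = {"good": 0.0, "bad": 1.0}
cluster_rep = {"c1": 0.9}
cluster_map = {"new_in_bad_cluster": "c1"}
queue = ["good", "bad", "new_in_bad_cluster", "unknown", "good"]
chosen = choose_mail(queue, ip_rep, cluster_rep, cluster_map, capacity=3)
```

With capacity 3, the two messages from the known-good sender are kept first, followed by the sender with no history, while the known-bad sender and the new IP in a bad cluster are refused.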

27 Simulation Results
- Performance measure, computed for each hour:
  - Goodput of policy: % of available legit mail accepted by policy (i.e., accepted for processing by mail server)
Summary: server goodput (%), averaged over all hours
  Overload factor | Default policy | IP-history policy
  No overload     | 93.7           | 96.7
  2               | 61.7           | 79.6
  3               | 39.5           | 68.6
  4               | 26.8           | 64.5
  5               | 20.3           | 63.0
- Factor of 3 improvement!
- Detailed analysis of performance in paper
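
The "factor of 3" claim can be checked directly from the goodput figures on this slide. In the sketch below the goodput values are copied from the slide; the helper function and variable names are illustrative.

```python
def goodput(accepted_legit, available_legit):
    """Goodput in percent: share of available legit mail that was accepted."""
    return 100.0 * accepted_legit / available_legit

# Goodput (%) from the slide: overload factor -> (default, IP-history policy)
reported = {
    "no overload": (93.7, 96.7),
    2: (61.7, 79.6),
    3: (39.5, 68.6),
    4: (26.8, 64.5),
    5: (20.3, 63.0),
}

# Ratio of IP-history goodput to default goodput at each overload factor.
improvement = {k: round(hist / dflt, 2) for k, (dflt, hist) in reported.items()}
```

The ratio grows with load and reaches about 3.1 at overload factor 5 (63.0 vs. 20.3), which is where the factor-of-3 improvement appears.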

28 Conclusion
Use history & structure of IP addresses to prioritize legitimate mail over spam efficiently.
- Measurement-based analysis of IP properties
  - Individual IP-address analysis helps identify legitimate senders
  - Analysis with network-aware clusters helps identify transient spammers
- Application to server overload problem
  - Trace-driven simulation demonstrates that analysis can help prioritize most of legitimate mail over spam

29 Thank you! Questions? (Contact: shobha@cs.cmu.edu)

