1. Network-Level Spam Detection
Nick Feamster, Georgia Tech
2. Spam: More than Just a Nuisance
Spam makes up 95% of all email traffic
– Image and PDF spam (PDF spam ~12%)
As of August 2007, one in every 87 emails constituted a phishing attack
Targeted attacks on the rise
– 20k–30k unique phishing attacks per month
Source: CNET (January 2008), APWG
3. Detection
Goal: Keep unwanted mail from reaching a user's inbox by distinguishing spam from ham.
Question: What features best differentiate spam from legitimate mail?
– Content-based filtering: What is in the mail?
– IP address of sender: Who is the sender?
– Behavioral features: How is the mail sent?
4. Content-Based Detection: Problems
Low cost of evasion: the features of an email's content can be easily adjusted and changed by spammers
Customized emails are easy to generate: content-based filters need fuzzy hashes over content, etc.
High cost to filter maintainers: filters must be continually updated as content-changing techniques become more sophisticated
5. Another Approach: IP Addresses
Problem: IP addresses are ephemeral. Every day, 10% of senders are from previously unseen IP addresses.
Possible causes:
– Dynamic addressing
– New infections
6. Idea: Network-Based Detection
Filter email based on how it is sent, in addition to simply what is sent.
Network-level properties are less malleable:
– Hosting or upstream ISP (AS number)
– Membership in a botnet (spammer, hosting infrastructure)
– Network location of sender and receiver
– Set of target recipients
7. Behavioral Blacklisting
Idea: Blacklist sending behavior
– Identify sending patterns commonly used by spammers
Intuition: It is much more difficult for a spammer to change the technique by which mail is sent than to change the content.
8. Improving Classification
Goals: lower overhead, faster detection, better robustness (e.g., to evasion and dynamism)
Use additional features and combine them for more robust classification:
– Temporal: interarrival times, diurnal patterns (see the sketch below)
– Spatial: sending patterns of groups of senders
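A minimal sketch of how the temporal features above might be derived from one sender's message timestamps. The function name, the use of numpy, and the choice of summary statistics are illustrative assumptions; the slides do not specify an implementation.

```python
import numpy as np

def temporal_features(timestamps):
    """Compute simple temporal features for one sender.

    timestamps: Unix times (seconds) at which this sender's
    messages arrived; assumes at least one message.
    """
    ts = np.sort(np.asarray(timestamps, dtype=float))

    # Interarrival times: gaps between consecutive messages.
    gaps = np.diff(ts)

    # Diurnal pattern: fraction of messages sent in each hour of the day.
    hours = ((ts % 86400) // 3600).astype(int)
    diurnal = np.bincount(hours, minlength=24) / len(ts)

    return {
        "mean_gap": gaps.mean() if len(gaps) else 0.0,
        "std_gap": gaps.std() if len(gaps) else 0.0,
        "diurnal": diurnal,  # 24-bin histogram over hours of day
    }
```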
9. SNARE: Automated Sender Reputation
Goal: sender reputation from a single packet (or at least from as little information as possible)?
– Lower overhead
– Faster classification
– Less malleable
Key challenge: which features satisfy these properties and can still distinguish spammers from legitimate senders?
10. Sender–Receiver Geodesic Distance
90% of legitimate messages travel 2,200 miles or less.
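Once sender and receiver IPs have been geolocated (the IP-to-latitude/longitude lookup is assumed to come from an external database), this feature needs only a great-circle computation. A minimal haversine sketch:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0

def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Example: Atlanta (sender) to Seattle (receiver) is roughly 2,180 miles,
# just under the 2,200-mile bound that covers 90% of legitimate mail.
print(geodesic_miles(33.75, -84.39, 47.61, -122.33))
```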
11. Density of Senders in IP Space
For spammers, the k nearest senders are much closer in IP space.
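A hedged sketch of this density feature: treat each IPv4 address as a 32-bit integer and average the distance to the k nearest previously seen senders. The linear scan, the choice of k, and the sample addresses are illustrative only.

```python
import ipaddress

def avg_distance_to_k_nearest(sender_ip, known_sender_ips, k=20):
    """Average numeric distance from sender_ip to its k nearest
    neighbors among previously seen sender IPs, treating each IPv4
    address as a 32-bit integer. Dense regions of IP space
    (e.g., botnet-heavy prefixes) yield small values.
    """
    x = int(ipaddress.ip_address(sender_ip))
    dists = sorted(abs(int(ipaddress.ip_address(ip)) - x)
                   for ip in known_sender_ips)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

# Hypothetical usage: compare a new sender against recently seen senders.
recent = ["76.17.114.5", "76.17.114.9", "76.17.115.200", "24.99.146.3"]
print(avg_distance_to_k_nearest("76.17.114.7", recent, k=3))
```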
12. Other Network-Level Features
– Time of day at the sender
– Upstream AS of the sender
– Message size (and its variance)
– Number of recipients (and its variance)
13. Combining Features
– Feed the features into the RuleFit classifier (a hedged sketch follows)
– 10-fold cross-validation on one day of query logs from a large spam-filtering appliance provider
– Uses only network-level features
– Completely automated
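RuleFit is not shipped with scikit-learn, so the sketch below substitutes a gradient-boosted tree ensemble as a stand-in classifier to illustrate the 10-fold cross-validation setup. The synthetic feature matrix and labels are placeholders for the appliance provider's logs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# X: one row per message with network-level features such as
# [geodesic_miles, avg_ip_space_distance, sender_hour, message_size, ...]
# y: labels from the provider's logs (1 = spam, 0 = ham).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # placeholder feature matrix
y = rng.integers(0, 2, size=1000)   # placeholder labels

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())
```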
14. Cluster-Based Features
– Construct a behavioral fingerprint for each sender
– Cluster senders with similar fingerprints
– Filter new senders that map to existing clusters
15. Identifying Invariants
[Diagram: a known spammer at IP 76.17.114.xxx sends spam across domain1.com, domain2.com, and domain3.com, yielding a behavioral fingerprint. After DHCP reassignment (or a new infection), an unknown sender at IP 24.99.146.xxx exhibits a similar fingerprint across the same domains; clustering on sending behavior links the two.]
16. Building the Classifier: Clustering
Feature: distribution of email sending volumes across recipient domains
Clustering approach (a sketch follows):
– Build an initial seed list of bad IP addresses
– For each IP address, compute a feature vector: volume per domain per time interval
– Collapse the vectors into a single IP × domain matrix
– Compute clusters
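A minimal sketch of the clustering step, assuming the per-interval feature vectors have already been collapsed into per-domain totals. KMeans is used as a stand-in; the slides do not name the clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_senders(volumes, domains, n_clusters=10):
    """Cluster sender IPs by how their sending volume is distributed
    across recipient domains.

    volumes: dict mapping sender IP -> {recipient domain: message count},
    already summed over the training interval (the collapsing step).
    Returns the ordered IP list, one cluster label per IP, and the
    row-normalized IP x domain matrix.
    """
    ips = sorted(volumes)
    # IP x domain matrix: row i holds the per-domain volumes of ips[i].
    M = np.array([[volumes[ip].get(d, 0) for d in domains] for ip in ips],
                 dtype=float)
    # Row-normalize so clusters reflect the *shape* of the distribution
    # across domains rather than the raw sending volume.
    M = normalize(M, norm="l1")
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(M)
    return ips, labels, M
```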
17. Clustering: Fingerprint
For each cluster, compute a fingerprint vector; new IPs will be compared against this fingerprint.
[Figure: IP × IP matrix; intensity indicates pairwise similarity between senders.]
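A sketch of the fingerprint step under the same assumptions as the clustering sketch above: the fingerprint is taken to be the mean vector of a cluster's members, and new senders are scored by cosine similarity. Both are illustrative choices, not the talk's stated method.

```python
import numpy as np

def cluster_fingerprints(M, labels):
    """Fingerprint of each cluster = mean row of its members in the
    row-normalized IP x domain matrix M (M and labels as returned by
    the clustering sketch on the previous slide)."""
    return {c: M[labels == c].mean(axis=0) for c in np.unique(labels)}

def score_new_sender(v, fingerprints):
    """Score a new sender's domain-volume vector v by its maximum
    cosine similarity to any known (bad) cluster fingerprint;
    a high score means the sender behaves like a known spam cluster."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(cos(v, f) for f in fingerprints.values())
```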
18. Evaluation
Emulate the performance of a system that could observe sending patterns across many domains
– Build clusters/train on a given time interval
Evaluate classification:
– Relative to labeled logs
– Relative to IP addresses that were eventually blacklisted
19. Early Detection Results
Compare SpamTracker scores on accepted mail to the Spamhaus database
– About 15% of accepted mail was later determined to be spam
– Can SpamTracker catch this?
Of 620 emails that were accepted but sent from IPs that were blacklisted within one month:
– 65 emails had a score larger than 5 (85th percentile)
20. Small Samples Work Well
Relatively small training samples can achieve low false-positive rates.
21. Extensions to Phishing
Goal: Detect phishing attacks based on behavioral properties of the hosting site (vs. static properties of the URL)
Features (a lookup sketch follows):
– URL regular expressions
– Registration time of the domain
– Uptime of the hosting site
– DNS TTL and redirections
Next time: discussion of phishing detection/integration
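Two of the features above can be looked up directly. A hedged sketch using dnspython and requests; the library choices are assumptions, not part of the talk.

```python
import dns.resolver  # pip install dnspython
import requests

def hosting_features(url, domain):
    """Collect two of the behavioral features listed above."""
    # DNS TTL of the A record: fast-flux phishing hosts often use very
    # short TTLs so the domain can hop between compromised machines.
    answer = dns.resolver.resolve(domain, "A")
    ttl = answer.rrset.ttl

    # Redirections: number of HTTP redirects before the final page.
    resp = requests.get(url, allow_redirects=True, timeout=10)
    n_redirects = len(resp.history)

    return {"dns_ttl": ttl, "n_redirects": n_redirects}
```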
22. Integration with SMITE
Sensors
– Extract network features (e.g., IP addresses) from traffic
– Combine with auxiliary data (routing, time, etc.)
Algorithms
– Clustering algorithm to identify behavioral fingerprints
– Learning algorithm to classify based on multiple features
Correlation
– Clusters are formed by aggregating sending behavior observed across multiple sensors
– Various features also require data collected across collections of IP addresses
23. Summary
Spam is increasing, and spammers are becoming more agile
– Content filters are falling behind
– IP-based blacklists are evadable
Up to 30% of spam is not listed in common blacklists at receipt; ~20% remains unlisted after a month
Complementary approach: behavioral blacklisting based on network-level features
– Blacklist based on how messages are sent
– SNARE: automated sender reputation, achieving roughly the accuracy of existing systems (~90%) with lightweight features
– Cluster-based features to improve accuracy and reduce the need for labeled data
25. Evasion
Problem: Malicious senders could add noise
– Solution: use a smaller number of trusted domains
Problem: Malicious senders could change their sending behavior to emulate normal senders
– Need a more robust set of features…
26. Improvements
Accuracy
– Synthesizing multiple classifiers
– Incorporating user feedback
– Learning algorithms with bounded false positives
Performance
– Caching/sharing
– Streaming
Security
– Learning in adversarial environments
27. Sampling: Training Time
28. Dynamism: Accuracy over Time