Network-Level Spam and Scam Defenses

Network-Level Spam and Scam Defenses Nick Feamster Georgia Tech with Anirudh Ramachandran, Shuang Hao, Maria Konte Alex Gray, Sven Krasser, Santosh Vempala, Jaeyeon Jung

Spam: More than Just a Nuisance Spam makes up roughly 95% of all email traffic. Image and PDF spam are common (PDF spam ~12%). As of August 2007, one in every 87 emails was a phishing attack. Targeted attacks are on the rise, with 20k-30k unique phishing attacks per month. Source: APWG

Approach: Filter Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham. Question: What features best differentiate spam from legitimate mail? Content-based filtering: What is in the mail? IP address of sender: Who is the sender? Behavioral features: How is the mail sent?

Approach #1: Content Filters PDFs Excel sheets Images ...even mp3s!

Content Filtering: More Problems Customized emails are easy to generate: content-based filters need fuzzy hashes over content, etc. Low cost to evasion: an email’s content can be easily adjusted and changed. High cost to filter maintainers: filters must be continually updated as content-changing techniques become more sophisticated.

Approach #2: IP Addresses Received: from mail-ew0-f217.google.com (mail-ew0-f217.google.com [209.85.219.217]) by mail.gtnoise.net (Postfix) with ESMTP id 2A6EBC94A1 for <feamster@gtnoise.net>; Fri, 23 Oct 2009 10:08:24 -0400 (EDT) Problem: IP addresses are ephemeral. Every day, 10% of senders are from previously unseen IP addresses. Possible causes: dynamic addressing, new infections.
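
A minimal sketch of the idea behind this slide: pull the relay IP out of a Received header (the format follows the example above; real headers vary widely) and track which sender IPs have been seen before. The in-memory set is a hypothetical stand-in for a persistent store.

```python
import re

# Extract the literal relay IP from a Received header and flag first-time
# senders. Hypothetical sketch; production parsers must handle many formats.
RECEIVED_IP = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

seen_ips = set()   # stand-in for a persistent store of known sender IPs

def sender_is_new(received_header: str) -> bool:
    """Return True if the relay IP in this Received header is unseen."""
    match = RECEIVED_IP.search(received_header)
    if match is None:
        return False        # no literal IP; treat conservatively
    ip = match.group(1)
    if ip in seen_ips:
        return False
    seen_ips.add(ip)
    return True

header = ("from mail-ew0-f217.google.com (mail-ew0-f217.google.com "
          "[209.85.219.217]) by mail.gtnoise.net (Postfix)")
print(sender_is_new(header))   # True on first sighting, False afterward
```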

Main Idea: Network-Based Filtering Filter email based on how it is sent, in addition to simply what is sent. Network-level properties: lightweight, less malleable Network/geographic location of sender and receiver Set of target recipients Hosting or upstream ISP (AS number) Membership in a botnet (spammer, hosting infrastructure)

Why Network-Level Features? Lightweight: Don’t require inspecting details of packet streams Can be done at high speeds Can be done in the middle of the network Less Malleable: Perhaps more difficult to change some network-level features than message contents

Challenges Understanding network-level behavior What network-level behaviors do spammers have? How well do existing techniques (e.g., DNS-based blacklists) work? Building classifiers using network-level features Key challenge: Which features to use? Two Algorithms: SNARE and SpamTracker Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006 Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007 Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009

Data: Spam and BGP 17-month study, August 2004 to December 2005. Spam traps: two domains that receive only spam. BGP monitors: watch network-level reachability.

Data Collection: MailAvenger Configurable SMTP server Collects many useful statistics

Surprising: BGP “Spectrum Agility” Hijack IP address space using BGP, send spam, and withdraw the address space, all within about 10 minutes. A small club of persistent players appears to be using this technique. Common short-lived prefixes and origin ASes: 61.0.0.0/8 (AS 4678), 66.0.0.0/8 (AS 21562), 82.0.0.0/8 (AS 8717). Accounts for somewhere between 1% and 10% of all spam (some clearly intentional, others route “flapping”).
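
A hedged sketch of how one might flag this pattern from a BGP update feed: prefixes that are announced and then withdrawn within roughly ten minutes. The (timestamp, prefix, announce/withdraw) tuple format is an assumption for illustration, not a real feed API.

```python
from collections import defaultdict

# Flag prefixes announced and withdrawn within a short window, matching the
# "spectrum agility" behavior above. Input is a hypothetical stream of
# (timestamp_seconds, prefix, "A" or "W") tuples.
SHORT_LIVED = 10 * 60   # ~10 minutes, per the observed behavior

def short_lived_prefixes(updates):
    announced = {}                  # prefix -> announcement time
    flagged = defaultdict(list)     # prefix -> observed lifetimes (seconds)
    for ts, prefix, kind in sorted(updates):
        if kind == "A":
            announced[prefix] = ts
        elif kind == "W" and prefix in announced:
            lifetime = ts - announced.pop(prefix)
            if lifetime <= SHORT_LIVED:
                flagged[prefix].append(lifetime)
    return dict(flagged)

updates = [(0, "61.0.0.0/8", "A"), (540, "61.0.0.0/8", "W"),
           (100, "66.0.0.0/8", "A"), (7200, "66.0.0.0/8", "W")]
print(short_lived_prefixes(updates))   # only 61.0.0.0/8 is flagged
```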

Spectrum Agility: Big Prefixes? Flexibility: Client IPs can be scattered throughout dark space within a large /8 Same sender usually returns with different IP addresses Visibility: Route typically won’t be filtered (nice and short)

Other “Basic” Findings Top senders: Korea, China, Japan, with still about 40% of spam coming from the U.S. More than half of sender IP addresses appear only once. ~90% of spam sent to the traps came from Windows hosts.

Top ISPs Hosting Spam Senders

How Well do IP Blacklists Work? Completeness: The fraction of spamming IP addresses that are listed in the blacklist Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam

Completeness and Responsiveness 10-35% of spam is unlisted at the time of receipt. 8.5-20% of these IP addresses remain unlisted even after one month. Data: spam trap data from March 2007; Spamhaus data from March and April 2007.
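
Both metrics are simple to compute once each sender IP’s first-spam time and first-listing time are known. A small sketch under that assumption; the dictionaries below are hypothetical stand-ins for real trap and DNSBL feeds.

```python
# Completeness: fraction of spamming IPs already listed when their spam
# arrived. Responsiveness: delay between first spam and first listing.

def completeness(spam_sightings, listing_times):
    """Fraction of spamming IPs listed at the time their spam was received."""
    listed = sum(1 for ip, t in spam_sightings.items()
                 if ip in listing_times and listing_times[ip] <= t)
    return listed / len(spam_sightings)

def responsiveness(spam_sightings, listing_times):
    """Per-IP delay (seconds) between first spam and first listing."""
    return {ip: listing_times[ip] - t
            for ip, t in spam_sightings.items()
            if ip in listing_times and listing_times[ip] > t}

sightings = {"10.0.0.1": 100, "10.0.0.2": 100, "10.0.0.3": 100}
listings  = {"10.0.0.1": 50, "10.0.0.2": 4000}   # .3 never listed
print(completeness(sightings, listings))    # 1/3 listed at receipt
print(responsiveness(sightings, listings))  # {'10.0.0.2': 3900}
```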

Why Do IP Blacklists Fall Short? Based on ephemeral identifier (IP address) More than 10% of all spam comes from IP addresses not seen within the past two months Dynamic renumbering of IP addresses Stealing of IP addresses and IP address space Compromised machines Often require a human to notice/validate the behavior Spamming is compartmentalized by domain and not analyzed across domains

Other Possible Approaches Option 1: Stronger sender identity [AIP, Pedigree] Stronger sender identity/authentication may make reputation systems more effective May require changes to hosts, routers, etc. Option 2: Behavior-based filtering [SNARE, SpamTracker] Can be done on today’s network Identifying features may be tricky, and some may require network-wide monitoring capabilities

Outline Understanding the network-level behavior What behaviors do spammers have? How well do existing techniques work? Classifiers using network-level features Key challenge: Which features to use? Two algorithms: SNARE and SpamTracker Network-level Scam Defenses

Finding the Right Features Goal: Sender reputation from a single packet? Low overhead Fast classification In-network Perhaps more evasion-resistant Key challenge What features satisfy these properties and can distinguish spammers from legitimate senders?

Set of Network-Level Features Single-packet: geodesic distance, distance to k nearest senders, time of day, AS of sender’s IP, status of email service ports. Single-message: number of recipients, length of message. Aggregate: multiple messages/recipients.

Sender-Receiver Geodesic Distance 90% of legitimate messages travel 2,200 miles or less
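
The geodesic-distance feature reduces to the haversine formula over geolocated endpoints. A minimal sketch: the coordinates below are illustrative, and a real system would pull them from an IP geolocation database.

```python
from math import radians, sin, cos, asin, sqrt

# Haversine great-circle distance for the sender-receiver geodesic feature.
EARTH_RADIUS_MILES = 3959.0

def geodesic_miles(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Atlanta -> Seattle, roughly 2,180 miles: near the 90th-percentile
# threshold for legitimate mail cited above.
print(round(geodesic_miles(33.75, -84.39, 47.61, -122.33)))
```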

Density of Senders in IP Space For spammers, the k nearest senders are much closer in IP space than they are for legitimate senders.
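
One plausible way to compute this feature is to treat IPv4 addresses as 32-bit integers and average the numeric distance to the k closest previously seen senders; small values indicate the bot-like clumping described above. A sketch under that assumption.

```python
import ipaddress

# Mean numeric distance to the k nearest previously seen sender IPs.
# A real system would keep the seen set sorted (e.g., with bisect)
# rather than scanning it on every lookup.

def knn_ip_distance(sender, seen_senders, k=3):
    s = int(ipaddress.ip_address(sender))
    dists = sorted(abs(s - int(ipaddress.ip_address(x))) for x in seen_senders)
    return sum(dists[:k]) / k

seen = ["203.0.113.5", "203.0.113.9", "198.51.100.20", "192.0.2.1"]
# A sender numerically close to a dense clump of prior senders (bot-like)
# yields a small value; an isolated legitimate sender yields a large one.
print(knn_ip_distance("203.0.113.7", seen))
```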

Local Time of Day at Sender Spammers “peak” at different local times of day

Combining Features: RuleFit Put the features into the RuleFit classifier; 10-fold cross validation on one day of query logs from a large spam filtering appliance provider. Performance is comparable to SpamHaus, and incorporating the classifier into the existing system can further reduce false positives, using only network-level features in a completely automated pipeline.
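
RuleFit is not part of scikit-learn, so the sketch below substitutes a gradient-boosted tree ensemble to illustrate the 10-fold cross-validation setup; the feature matrix and labels are synthetic placeholders, not the paper’s data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the RuleFit evaluation: X would hold network-level feature
# vectors (geodesic distance, kNN IP distance, local hour, AS number, open
# email ports, ...); the random data here is purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))            # 6 network-level features
y = rng.integers(0, 2, size=1000)         # 1 = spam, 0 = ham (fake labels)

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross validation
print(scores.mean())
```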

Ranking of Features

SNARE: Putting it Together (Figure: pipeline from email arrival through whitelisting, greylisting, and retraining.)

Benefits of Whitelisting Whitelisting top 50 ASes: False positives reduced to 0.14%

Another Possible Feature: Coordination Idea: Blacklist sending behavior (“Behavioral Blacklisting”) Identify sending patterns commonly used by spammers Intuition: More difficult for a spammer to change the technique by which mail is sent than it is to change the content

SpamTracker: Clustering Construct a behavioral fingerprint for each sender Cluster senders with similar fingerprints Filter new senders that map to existing clusters

SpamTracker: Identify the Invariant (Figure: a known spammer at 76.17.114.xxx and an unknown sender at 24.99.146.xxx, after DHCP reassignment or a new infection, each send spam to domain1.com, domain2.com, and domain3.com. Clustering on sending behavior yields a behavioral fingerprint for each; the unknown sender’s fingerprint is similar to the known spammer’s.)

Building the Classifier: Clustering Feature: distribution of email sending volumes across recipient domains. Clustering approach: build an initial seed list of bad IP addresses; for each IP address, compute the feature vector (volume per domain per time interval); collapse into a single IP × domain matrix; compute clusters.
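
A toy sketch of that step, with k-means standing in for the spectral clustering the paper uses. Rows are sender IPs, columns are recipient domains, and row-normalization makes the clustering see sending patterns rather than raw volumes; the matrix values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny IP x domain matrix: entries are message volumes in one time interval.
ip_domain = np.array([
    [40,  0,  1],   # spammer pattern A: blankets domain 1
    [38,  1,  0],
    [ 0, 20, 22],   # spammer pattern B: splits domains 2 and 3
    [ 1, 19, 25],
], dtype=float)

# Row-normalize so clustering sees sending *patterns*, not raw volumes.
patterns = ip_domain / ip_domain.sum(axis=1, keepdims=True)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patterns)
print(km.labels_)                   # e.g., [0 0 1 1]: two behavioral clusters
fingerprints = km.cluster_centers_  # per-cluster "fingerprint" vectors
```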

Clustering: Output and Fingerprint For each cluster, compute a fingerprint vector; new IPs will be compared against this fingerprint. (Figure: an IP × IP matrix where intensity indicates pairwise similarity.)

Clustering Results (Figure: distributions of SpamTracker scores for ham and spam.) The separation may not be sufficient on its own, but it could be a useful feature.

Deployment: SpamSpotter Approach: as mail arrives, lookups are received at the blacklist; the queries provide a proxy for sending behavior. Train based on the received data and return a score.

Challenges Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead? Dynamism: When to retrain the classifier, given that sender behavior changes? Reliability: How should the system be replicated to better defend against attack or failure? Evasion resistance: Can the system still detect spammers when they are actively trying to evade?

Latency (Figure: lookup latency measurements.) Performance overhead is small.

Sampling Relatively small samples can achieve low false positive rates

Improvements Accuracy: synthesizing multiple classifiers; incorporating user feedback; learning algorithms with bounded false positives. Performance: caching/sharing; streaming. Security: learning in adversarial environments.

Summary Spam is increasing and spammers are becoming agile. Content filters are falling behind, and IP-based blacklists are evadable: up to 30% of spam is not listed in common blacklists at receipt, and ~20% remains unlisted after a month. Complementary approach: behavioral blacklisting based on network-level features. Key idea: blacklist based on how messages are sent. SNARE: automated sender reputation with accuracy comparable to existing systems (~90%) using lightweight features. SpamTracker: spectral clustering catches a significant amount of spam faster than existing blacklists. SpamSpotter: putting it together in an RBL system.

Network-Level Scam Defenses

Network-Level Scam Defenses Scammers host Web sites on dynamic scam hosting infrastructure, using DNS to redirect users to different sites as the sites’ locations move. State of the art: URL blacklisting. Our approach: blacklist based on network-level fingerprints. Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009

Online Scams Often advertised in spam messages, with URLs pointing to various point-of-sale sites. These scams continue to be a menace: as of August 2007, one in every 87 emails constituted a phishing attack. Scams are often hosted on bullet-proof domains. Goal: study the dynamics of online scams, as seen at a large spam sinkhole.

Online Scam Hosting is Dynamic A URL received in an email message may resolve to different hosting sites over time. This maintains agility as sites are shut down, blacklisted, etc. One mechanism for hosting sites this way: fast flux.

Mechanism for Dynamics: “Fast Flux” (Figure: fast-flux DNS redirection. Source: HoneyNet Project.)

Summary of Findings What are the rates and extents of change? They differ from legitimate load balancing, and they differ across scam campaigns. How are the dynamics implemented? Many scam campaigns change DNS mappings at all three locations in the DNS hierarchy: the A record, the NS record, and the IP address of the NS record. Conclusion: it might be possible to detect scams by monitoring the dynamic behavior of URLs.

Data Collection Method Three months of spamtrap data 384 scam hosting domains 21 unique scam campaigns Baseline comparison: Alexa “top 500” Web sites

Time Between Record Changes Fast-flux domains tend to change their records much more frequently than legitimately hosted sites.
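
The measurement behind this comparison amounts to repeatedly resolving each domain and timing how often the answer set changes. A standard-library sketch under that assumption; a real monitor would also watch NS records and honor TTLs, and `example.com` is just a placeholder target.

```python
import socket
import time

# Poll a domain's A records and log each time the returned IP set changes.
# Fast-flux domains would show frequent changes; stable sites, few or none.

def watch_a_records(domain, polls=5, interval=2.0):
    previous, changes = None, []
    for _ in range(polls):
        try:
            _, _, ips = socket.gethostbyname_ex(domain)
        except socket.gaierror:
            ips = []
        current = frozenset(ips)
        if previous is not None and current != previous:
            changes.append((time.time(), sorted(current)))
        previous = current
        time.sleep(interval)
    return changes   # change timestamps -> time between record changes

print(watch_a_records("example.com"))
```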

Location: Many Distinct Subnets Scam sites appear in many more distinct networks than legitimate load-balanced sites.

Summary Scam campaigns rely on a dynamic hosting infrastructure Studying the dynamics of that infrastructure may help us develop better detection methods Dynamics Rates of change differ from legitimate sites, and differ across campaigns Dynamics implemented at all levels of DNS hierarchy Location Scam sites distributed across distinct subnets Data: http://www.gtnoise.net/scam/fast-flux.html TR: http://www.cc.gatech.edu/research/reports/GT-CS-08-07.pdf

Final Thoughts and Next Steps Duality between host security and network security. Can programmable networks (e.g., OpenFlow, NetFPGA, etc.) offer a better refactoring? Resonance: Inference-based Dynamic Access Control for Enterprise Networks, A. Nayak, A. Reimers, N. Feamster, R. Clark ACM SIGCOMM Workshop on Research on Enterprise Networks. Can better security primitives at the host help the network make better decisions about the security of network traffic? Securing Enterprise Networks with Traffic Tainting, A. Ramachandran, Y. Mundada, M. Tariq, N. Feamster. In submission.

References Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006 Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007 Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009 Anirudh Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh Vempala, “A Dynamic Reputation Service for Spotting Spammers”, GT-CS-08-09 Maria Konte, Nick Feamster, Jaeyeon Jung, “Dynamics of Online Scam Hosting Infrastructure”, Passive and Active Measurement Conference, April 2009.

Design Choice: Augment the DNSBL Expressive queries. SpamHaus: $ dig 55.102.90.62.zen.spamhaus.org Ans: 127.0.0.3 (=> listed in exploits block list) SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net Ans: 127.1.3.97 (SpamSpotter score = -3.97) The queries are also a source of data: unsupervised algorithms work with unlabeled data.
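
A hedged sketch of decoding the A-record answer shown above. The slide maps 127.1.3.97 to a score of -3.97, but the exact encoding is not spelled out; this sketch assumes the second octet is a sign flag and the last two octets carry the integer and hundredths parts. Treat the scheme itself as an assumption.

```python
# Decode a SpamSpotter-style DNSBL answer into a numeric score.
# ASSUMED encoding (inferred from one example, not documented here):
# 127.<sign>.<whole>.<hundredths>, with sign == 1 meaning negative.

def decode_spamspotter(answer: str) -> float:
    _, sign, whole, frac = (int(octet) for octet in answer.split("."))
    score = whole + frac / 100.0
    return -score if sign == 1 else score

print(decode_spamspotter("127.1.3.97"))   # -3.97, matching the example
```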

Evaluation Emulate the performance of a system that could observe sending patterns across many domains Build clusters/train on given time interval Evaluate classification Relative to labeled logs Relative to IP addresses that were eventually listed

Data 30 days of Postfix logs from email hosting service Time, remote IP, receiving domain, accept/reject Allows us to observe sending behavior over a large number of domains Problem: About 15% of accepted mail is also spam Creates problems with validating SpamTracker 30 days of SpamHaus database in the month following the Postfix logs Allows us to determine whether SpamTracker detects some sending IPs earlier than SpamHaus

Classifying IP Addresses Given “new” IP address, build a feature vector based on its sending pattern across domains Compute the similarity of this sending pattern to that of each known spam cluster Normalized dot product of the two feature vectors Spam score is maximum similarity to any cluster
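
This scoring rule is concrete enough to sketch directly: normalize the new IP’s per-domain sending vector, take the normalized dot product (cosine similarity) against each known cluster fingerprint, and report the maximum. The fingerprint and pattern values below are invented for illustration.

```python
import numpy as np

# Spam score of a new IP = maximum cosine similarity between its sending
# pattern across domains and any known spam-cluster fingerprint.

def spam_score(sending_pattern, fingerprints):
    v = sending_pattern / np.linalg.norm(sending_pattern)
    sims = [np.dot(v, f / np.linalg.norm(f)) for f in fingerprints]
    return max(sims)

fingerprints = [np.array([0.95, 0.02, 0.03]),   # cluster A fingerprint
                np.array([0.02, 0.45, 0.53])]   # cluster B fingerprint
new_ip_pattern = np.array([50.0, 1.0, 2.0])     # volumes across 3 domains
print(spam_score(new_ip_pattern, fingerprints)) # near 1.0 -> matches A
```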

Sampling: Training Time

Additional History: Message Size Variance Senders of legitimate mail have a much higher variance in the sizes of messages they send. (Figure: message size range versus classification, ranging from “certain spam” through “likely spam” and “likely ham” to “certain ham”.) Surprising: including this feature (and others that require more history) can actually decrease the accuracy of the classifier.
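
The feature itself is just the variance of observed message sizes per sender. A two-line illustration with made-up byte counts.

```python
import statistics

# Templated spam campaigns send near-identical message sizes (low variance);
# ordinary mailbox traffic varies widely (high variance). Sizes are invented.
spam_sizes = [1021, 1020, 1023, 1019, 1022]    # templated campaign
ham_sizes  = [850, 15400, 2300, 98000, 4100]   # ordinary legitimate traffic

print(statistics.pvariance(spam_sizes))   # tiny variance -> spam-like
print(statistics.pvariance(ham_sizes))    # large variance -> ham-like
```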

Completeness of IP Blacklists ~95% of bots are listed in one or more blacklists, with ~80% listed on average, but only about half of the IPs spamming from short-lived BGP routes are listed in any blacklist. (Figure: fraction of all spam received versus the number of DNSBLs listing the spammer.) Spam from IP-agile senders tends to be listed in fewer blacklists.

Low Volume to Each Domain Most spammers send very little spam, regardless of how long they have been spamming. (Figure: amount of spam versus sender lifetime in seconds.)

Some Patterns of Sending are Invariant (Figure: the same spammer, at 76.17.114.xxx and then at 24.99.146.xxx after DHCP reassignment, sends spam to domain1.com, domain2.com, and domain3.com in both cases.) The spammer’s sending pattern has not changed, but IP blacklists cannot make this connection.

Characteristics of Agile Senders IP addresses are widely distributed across the /8 space IP addresses typically appear only once at our sinkhole Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked Some IP addresses were in allocated, albeit unannounced space Some AS paths associated with the routes contained reserved AS numbers

Early Detection Results Compare SpamTracker scores on “accepted” mail to the SpamHaus database. About 15% of accepted mail was later determined to be spam; can SpamTracker catch this? Of 620 emails that were accepted but sent from IPs that were blacklisted within one month, 65 had a score larger than 5 (the 85th percentile).

Evasion Problem: malicious senders could add noise. Solution: use a smaller number of trusted domains. Problem: malicious senders could change their sending behavior to emulate “normal” senders. Need a more robust set of features…