Tracking Malicious Regions of the IP Address Space Dynamically
Introduction
Recent focus on network-level behaviour of malicious activity:
- Spam originating from different /24 prefixes [RF06]
- Spam originating from long-lived network-aware clusters [VSSHS07]
- Malicious activity originating from different /16 prefixes [CSFJWS07]
Application: a real-time, effective mechanism for blocking traffic at line speed
Challenges:
- Transience of individual IP addresses, e.g., due to bots and DHCP effects
- Effective clustering: many different clusterings of IP prefixes are possible
- Scaling to large volumes of data
[RF06] Ramachandran & Feamster, Understanding the Network-level Behavior of Spammers, SIGCOMM '06
[VSSHS07] Venkataraman et al., Exploiting Network Structure for Proactive Spam Mitigation, Security '07
[CSFJWS07] Collins et al., Using Uncleanliness to Predict Future Botnet Addresses, IMC '07
Study: Transience of Individual IPs
Spamming IPs present on many days contribute little spam!
Measurement study: analysis of spamming IP addresses & regions
Data: enterprise mail logs over 6 months, 28 million emails
Network-Aware Clusters [KW00]: a pre-defined set of IP prefixes that reflects Internet routing structure
- An IP address belongs to the cluster with the longest matching prefix
Study: Persistence of IP Prefix Clusters
- Most spam comes from "bad" clusters present for many days
- 90% of total spam comes from "bad" clusters present for 60+ days
[KW00] Krishnamurthy & Wang, On Network-Aware Clustering of Web Clients, SIGCOMM '00
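The longest-matching-prefix rule can be sketched with Python's ipaddress module. A minimal illustration only; the cluster list here is hypothetical, not from the paper's data:

```python
import ipaddress

# Hypothetical set of network-aware clusters, given as CIDR prefixes.
clusters = [ipaddress.ip_network(p) for p in
            ["10.0.0.0/8", "10.1.0.0/16", "10.1.2.0/24"]]

def cluster_of(ip_str):
    """Return the cluster with the longest prefix containing the IP."""
    ip = ipaddress.ip_address(ip_str)
    matches = [n for n in clusters if ip in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)

print(cluster_of("10.1.2.3"))   # longest match: 10.1.2.0/24
print(cluster_of("10.9.9.9"))   # only the /8 matches
```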
Introduction
Recent focus on network-level behaviour of malicious activity:
- Spam originating from different /24 prefixes [RF06]
- Spam originating from long-lived network-aware clusters [VSSHS07]
- Malicious activity originating from different /16 prefixes [CSFJWS07]
Challenges:
- Transience of individual IP addresses, e.g., due to bots and DHCP effects
- Scaling to large volumes of data
- Effective clustering: many different clusterings of IP prefixes are possible
Question: Can we automatically compute the optimal clustering for tracking and predicting malicious activity?
Problem, v.1 (naïve)
Input: IP addresses labeled normal(+)/malicious(-)
- e.g., mail logs with the sending mail server's IP; SpamAssassin labels each message spam(-) or non-spam(+)
Output: a tree of IP prefixes that
- is optimal for classifying IPs as normal(+)/malicious(-)
- contains no more than k leaves
Limiting the output IP tree to k leaves:
- keeps the output state small
- avoids overfitting
[Figure: IP prefix tree with leaves such as x.x.x.x/24, x.x.x.x/30, and x.x.x.x/32 (a single IP address)]
Challenges
As stated, version 1 is easy: use dynamic programming!
However:
- IP tree over dynamic and evolving data:
  - Data is collected and labelled over time (e.g., mail logs, traffic logs)
  - Compromised/malicious IP prefixes may change over time
  - Want to compute an updated tree without going back to past data
- Low space & low overhead: up to 2^32 leaves in the IP tree!
  - Want the algorithm to use space near-linear in k
- Adversarially-generated IP addresses:
  - Want guarantees as a function of the data seen & the adversary's power
Approach: design online algorithms to address all of these issues
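For intuition, the offline version can be solved by a dynamic program over the prefix tree that splits the leaf budget k between the two children of each node. A toy sketch on 4-bit "addresses"; the data and recursion here are illustrative, not the authors' code:

```python
# Toy dynamic program for Problem v.1: classify all labeled points with a
# prefix tree of at most k leaves, minimizing mistakes. 4-bit "addresses"
# keep the example tiny.
DATA = [("0000", -1), ("0001", -1), ("0010", -1), ("1111", +1), ("1110", +1)]
BITS = 4

def dp(prefix, k):
    """Min mistakes on points under `prefix` using a subtree with <= k leaves."""
    pts = [y for ip, y in DATA if ip.startswith(prefix)]
    if not pts:
        return 0
    pos = sum(1 for y in pts if y > 0)
    as_leaf = min(pos, len(pts) - pos)   # label the leaf with the majority sign
    if k == 1 or len(prefix) == BITS:
        return as_leaf
    best = as_leaf
    for kl in range(1, k):               # split the leaf budget between children
        best = min(best, dp(prefix + "0", kl) + dp(prefix + "1", k - kl))
    return best

print(dp("", 1))   # single leaf: 2 mistakes (the minority class)
print(dp("", 2))   # split at the root: 0 mistakes
```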
Online Learning
Low space & overhead:
- IPs are seen one at a time
- Algorithm only needs to maintain a small internal state
- Naturally incorporates the dynamic aspect of the data
Guarantees: a function of the mistakes made by the optimal offline tree
- Quantifies the cost of learning the tree online
- Data may be generated adversarially
[Figure: online learning loop; the algorithm receives an IP from the data (e.g., mail logs labelled +/-), outputs a predicted label, then observes the correct label]
Problem, v.2: Online Learning of a Static Tree
Input: a stream of IPs labelled normal(+)/malicious(-)
Output: a tree of IP prefixes that
- predicts nearly as well as the OPTIMAL offline tree with k leaves
- uses low space in the online model of learning
[Figure: IP prefix tree with leaves such as x.x.x.x/24, x.x.x.x/30, and x.x.x.x/32 (a single IP address)]
Dynamic IP Prefix Tree
Malicious IP regions may change over time
New goal: predict nearly as well as the optimal changing tree
The optimal tree may make two kinds of changes:
- #1: a leaf changes sign (+ to -, or - to +)
- #2: a leaf splits
Prediction guarantee: our mistakes = O(OPTIMAL tree's cost)
- Optimal tree's cost: a function of the mistakes it makes and the changes it makes
Problem: Online Learning of a Dynamic Tree
Compute an IP tree online that predicts nearly as well as the best changing tree with k leaves, using space Õ(k)
[Figure: IP prefix tree with nodes at prefixes such as x.x.x.x/24]
Related Problems
Machine learning:
- Predicting as well as the best pruning of a decision tree [HS97]
  - Our requirements: low space, tracking a dynamic tree online, real-world implementation
- Learning decision trees in streaming models [DH00]
  - Our requirement: adversarial data
  - Simpler problem: tree fixed; no need to "learn" a tree
Online algorithms:
- Paging algorithms
[HS97] Helmbold & Schapire, Predicting Nearly As Well As the Best Pruning of a Decision Tree, Machine Learning '97
[DH00] Domingos & Hulten, Mining High-Speed Data Streams, KDD 2000
Overview of Our Algorithm
Given an IP address, predict:
- Trace the IP's path in the tree & flag all nodes on the path (i.e., all prefixes of the input IP present in the tree)
- Label the IP by combining the weights of the flagged nodes
Given the correct label, update:
- If predicted correctly, do nothing
- If predicted incorrectly:
  - update the flagged nodes' weights
  - grow the tree by splitting a leaf (if necessary)
[Figure: an IP a.b.c.d traced down the prefix tree; the algorithm outputs a predicted label and receives the correct label]
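The predict step can be sketched as follows. An illustrative skeleton, not the authors' implementation; the per-node weight and sign fields are simplified stand-ins for the expert algorithms on the later slides:

```python
# Sketch: trace the IP's prefix path, then combine the flagged nodes' votes.
class Node:
    def __init__(self):
        self.weight = 1.0    # relative importance (slide: sleeping experts)
        self.sign = +1       # current label of the node (slide: shifting experts)
        self.mistakes = 0
        self.children = {}   # bit '0'/'1' -> Node

root = Node()

def flagged_nodes(ip_bits):
    """All nodes whose prefix matches the IP, i.e., the root-to-leaf path."""
    node, path = root, [root]
    for b in ip_bits:
        if b not in node.children:
            break
        node = node.children[b]
        path.append(node)
    return path

def predict(ip_bits):
    """Combine the flagged nodes' weighted votes into a +/- label."""
    score = sum(n.weight * n.sign for n in flagged_nodes(ip_bits))
    return +1 if score >= 0 else -1
```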
Four Algorithmic Questions
Fixed tree structure:
- Relative importance of flagged nodes?
- Label of a flagged node: positive or negative?
Changing the tree structure:
- When to grow the tree?
- How to maintain a small tree?
Relative Importance of Nodes
Use the sleeping experts algorithm:
- Every node is an expert
- Each flagged node is an "awake" expert
- Best expert: a leaf of the optimal tree
Algorithm:
- Each node has a "relative" weight
- Predict with all experts that are awake, e.g., by a weighted coin toss
- Update by redistributing weight among the awake experts
[Figure: tree with node weights w0, w1, w2, ..., w10, w15, w16, w17]
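One round of the weight redistribution could look like this. A simplified sketch of a sleeping-experts update; the penalty factor and conservation scheme are assumptions, not the paper's exact update:

```python
def sleeping_experts_update(weights, erred, eps=0.5):
    """weights: weights of the awake (flagged) nodes.
    erred: parallel list of bools, True if that expert predicted wrongly.
    Penalize erring awake experts multiplicatively, then rescale so the total
    weight of the awake set is conserved: weight flows from wrong experts to
    right ones, while sleeping (unflagged) experts are untouched."""
    penalized = [w * (1 - eps) if e else w for w, e in zip(weights, erred)]
    scale = sum(weights) / sum(penalized)
    return [w * scale for w in penalized]
```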
Label of Flagged Nodes
Shifting experts problem:
- Each node has 2 internal experts:
  - "positive" expert: label +
  - "negative" expert: label -
- Best expert: the true label of the node
Algorithm:
- Predict by a weighted coin toss between the + and - experts
- Update by shifting weight from the incorrect expert to the correct one
Tracking a dynamic tree: this automatically incorporates a leaf changing sign
[Figure: node with internal weights w+ and w-]
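A single node's two sign experts can be sketched as follows; the shift amount eps is an assumption:

```python
import random

def predict_sign(w_pos, w_neg, rng=random.random):
    """Weighted coin toss between the + and - experts."""
    return +1 if rng() < w_pos / (w_pos + w_neg) else -1

def shift(w_pos, w_neg, correct, eps=0.1):
    """Move eps of the incorrect expert's weight to the correct expert;
    total weight is conserved, so a persistently mislabeled node flips."""
    if correct > 0:
        moved = min(eps, w_neg)
        return w_pos + moved, w_neg - moved
    moved = min(eps, w_pos)
    return w_pos - moved, w_neg + moved
```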
Growing the Tree
Algorithm:
- Track the number of mistakes made by each leaf
- Split a leaf after it makes sufficiently many mistakes
Tracking a dynamic tree: this also incorporates changes caused by a leaf splitting
[Figure: a leaf accumulating mistakes until it splits]
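The split rule might be sketched as below. The 1/ε threshold follows the Analysis slide; the rest (splitting along the observed IP's next bit, resetting the counter) is an assumption:

```python
class Leaf:
    def __init__(self):
        self.mistakes = 0
        self.children = {}   # bit '0'/'1' -> Leaf

def maybe_split(leaf, ip_bits, depth, eps=0.1):
    """Split the leaf along the next bit of the observed IP once it has
    accumulated 1/eps mistakes, provided the prefix is not yet a full /32."""
    if leaf.mistakes >= 1 / eps and depth < len(ip_bits):
        leaf.children[ip_bits[depth]] = Leaf()
        leaf.mistakes = 0
        return True
    return False
```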
Maintaining a Small-Space Tree
Convert to a paging problem:
- Each node is a page
- Size of the optimal cache: 2k
- Question: which node to discard?
Use paging algorithms & competitive analysis, e.g., Flush-When-Full:
- Start from scratch after the tree has grown to its maximum size
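Flush-When-Full applied to the node set can be sketched as below. The 2k budget follows the slide; representing the tree as a bare set of nodes is an illustrative simplification:

```python
def fwf_insert(tree_nodes, new_node, k):
    """Flush-When-Full: each node is a page; once the cache of 2k pages is
    full, discard everything and start from scratch, then admit the new node."""
    if len(tree_nodes) >= 2 * k:
        tree_nodes.clear()   # flush: restart tree-building from scratch
    tree_nodes.add(new_node)
    return tree_nodes
```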
Analysis
Define ε so that:
- additive change to an internal node's weight per mistake: ε
- multiplicative loss factor to a relative node weight per mistake: 1/ε
- mistakes needed before splitting a node: 1/ε
Then:
E[# mistakes of algorithm] = (1+ε)·(mistakes of OPTIMAL) + (1/ε)·(sign-changes of OPTIMAL's leaves) + ((log k)/ε)·(splits of OPTIMAL's leaves)
Space required using FWF: O((k log k)/ε²)
Implementation Issues
Sparse data from some IP prefixes:
- Might not see any IP addresses from some prefixes
- Clusters might be too big to be meaningful
- Include a loss function to penalize prefixes for "nothingness"
Efficient implementation for large-scale data:
- Coalesce nodes appropriately in the binary tree (it effectively becomes a trie)
- Randomization calls are computationally expensive
Experimental performance: in progress