Anomaly Detection Studies in the IP Backbone Tao Ye Sprint Burlingame, CA
1 24: “Stop that packet at the router!” Detected an anomaly Specify and activate a new ACL At OC-192 In 26μs? Anomaly Detection at the IP Backbone
2 Outline Tier-1 backbone: an overview TAPS: connectionless port scan detection and tracking on the backbone Scaling up: sampling and anomaly detection
3 Today’s Tier-1 Backbones Topology – high speed routers in points-of-presence (POPs) connected by long-haul fiber >numerous small POPs (e.g., UUNet) >relatively few large POPs (e.g., Sprint) Technologies >IP over SONET (POS) >IP over ATM (phasing out) >MPLS, VPN tunnel Common Engineering Practice >failure protection implemented at IP layer >“over-provisioned” core
4 What we (Research Sprint) do Measurement: Collect a lot of data from the Internet backbone to understand its current state Monitoring: Use measurements to detect events of (operational) interest Hardware >CMON monitoring boxes in the POPs >Storage (30T) and analysis platform at the lab >Website for sharing results Algorithms and Software tools >Continuous monitoring >Anomaly detection >Active measurement Other: >Wireless Paging attacks Fairness implementations TCP over wireless
5 Outline Measurement and Monitoring at a tier-1 backbone: an overview from the industry perspective TAPS: Connectionless port scan detection and tracking on the backbone Scaling up: sampling and anomaly detection
6 Motivation and Challenges Our goals >Detect and track >Understand long term behavior of scanners >On the backbone network Why Backbone? >Detection: Existing work is mostly at stub networks, with limited visibility >Tracking: Honeypots can be evaded >More scanning activity is visible at the core >Peering point is a unique vantage point Challenges >Backbone traffic is unidirectional and asymmetric >High speed (OC-48, OC-192) links need fast algorithms >Diverse traffic mix needs efficient data structures
7 Intuition: Access Patterns
8 TAPS: Time-based Access Pattern Sequential hypothesis testing Based on 5-tuple flow summaries on a unidirectional link Scanner suspects: source IPs whose accessed IP/port (or port/IP) ratio > k in a time bin Sequential Hypothesis Testing
9 TAPS [Diagram: a per-source score is incremented when IP/port > k and decremented when IP/port < k. An upper threshold tags the source as a scanner; a lower threshold tags the source as benign.]
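The update rule on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the TAPS implementation: it assumes unit steps of +1/-1 per time bin (a full sequential hypothesis test would derive step sizes from likelihood ratios), and the names `taps_sketch`, `close_bin`, `upper`, `lower` and `bin_seconds`, as well as their default values, are hypothetical.

```python
from collections import defaultdict

def close_bin(bin_state, score, verdict, k, upper, lower):
    """Update each undecided source's score from one finished time bin."""
    for src, (ips, ports) in bin_state.items():
        if src in verdict:
            continue  # already tagged, stop testing this source
        # IP/port or port/IP ratio, whichever is larger
        ratio = max(len(ips) / len(ports), len(ports) / len(ips))
        if ratio > k:
            score[src] += 1  # assumed unit step toward "scanner"
        elif ratio < k:
            score[src] -= 1  # assumed unit step toward "benign"
        if score[src] >= upper:
            verdict[src] = "scanner"
        elif score[src] <= lower:
            verdict[src] = "benign"

def taps_sketch(flows, k=3.0, bin_seconds=10, upper=3, lower=-3):
    """flows: (timestamp, src_ip, dst_ip, dst_port) tuples sorted by time."""
    score = defaultdict(int)
    verdict = {}
    bin_state = defaultdict(lambda: (set(), set()))
    current_bin = None
    for ts, src, dst_ip, dst_port in flows:
        b = int(ts // bin_seconds)
        if current_bin is not None and b != current_bin:
            close_bin(bin_state, score, verdict, k, upper, lower)
            bin_state.clear()
        current_bin = b
        ips, ports = bin_state[src]
        ips.add(dst_ip)
        ports.add(dst_port)
    if bin_state:
        close_bin(bin_state, score, verdict, k, upper, lower)
    return verdict

# Example: one source sweeps 5 hosts on port 80 per bin, another
# talks to a single host and port.
example_flows = []
for b in range(3):
    for i in range(5):
        example_flows.append((b * 10 + i, "10.0.0.1", f"192.0.2.{i}", 80))
    example_flows.append((b * 10 + 6, "10.0.0.2", "198.51.100.7", 443))
example_verdict = taps_sketch(example_flows)
```

Only destination sets per source and bin are kept, so the sketch works on a unidirectional link where replies are never seen.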
10 Performance: TCP
11 Online Implementation Architecture Use CMON to produce flows in NetFlow v5 format Flow Daemon distributes flows Keep flows in circular buffer [Diagram labels: CMON, Flow Collector, Flow Daemon, Core, App Handler, TAPS, Other, Disk Writer, Disk Reader, Circular Buffer, Disk]
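A minimal sketch of the circular flow buffer, assuming a fixed capacity and an overwrite-oldest policy when the consumer falls behind (the slide does not say which policy is used). The class name `FlowRing` and its methods are hypothetical.

```python
from collections import deque

class FlowRing:
    """Fixed-size buffer of flow records; the oldest record is overwritten when full."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)
        self.dropped = 0  # records overwritten before any handler read them

    def push(self, flow):
        if len(self._buf) == self._buf.maxlen:
            self.dropped += 1
        self._buf.append(flow)  # deque(maxlen=...) discards the oldest entry

    def drain(self):
        """Hand all buffered flows to an application handler and empty the buffer."""
        out = list(self._buf)
        self._buf.clear()
        return out

ring = FlowRing(capacity=3)
for flow_id in range(5):
    ring.push(flow_id)
batch = ring.drain()
```

Counting overwritten records gives the operator a direct signal that a handler such as the detector is not keeping up with the link.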
12 Detector and Tracker Architecture
13 Design choices: Approximation Counters Issues: >Need to keep the fan-out count for each IP >Heap implementation has prohibitively high memory requirements Probabilistic Counters: >Many recently proposed counters: Small SRAM Implementation: Multi-resolution bitmap, trigger bitmap >Simple Flajolet-Martin (FM) counter FM counter performance >8 hash functions accurate enough for the comparison against k >Compared 256, 32 and 8 hash functions
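A minimal sketch of a Flajolet-Martin distinct-value counter with several hash functions, as a way to approximate a source's fan-out in fixed memory. It assumes the simple variant that averages the bitmap positions across hash functions and applies the standard correction constant 0.77351; salted SHA-256 stands in for the hash family, and the class name `FMCounter` and its parameters are hypothetical.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

class FMCounter:
    def __init__(self, num_hashes=8, width=32):
        self.num_hashes = num_hashes
        self.width = width
        self.bitmaps = [0] * num_hashes  # one small bitmap per hash function

    def _hash(self, i, item):
        # Salting with the index i simulates independent hash functions.
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") & ((1 << self.width) - 1)

    def add(self, item):
        for i in range(self.num_hashes):
            h = self._hash(i, item)
            # Index of the least-significant 1 bit (last position if h == 0).
            rho = (h & -h).bit_length() - 1 if h else self.width - 1
            self.bitmaps[i] |= 1 << rho

    def estimate(self):
        if not any(self.bitmaps):
            return 0.0
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while (bitmap >> r) & 1:  # position of the lowest unset bit
                r += 1
            total += r
        return 2 ** (total / self.num_hashes) / PHI

fanout = FMCounter(num_hashes=64)
for i in range(1000):
    fanout.add(f"10.0.{i // 256}.{i % 256}")
fanout_estimate = fanout.estimate()
```

Memory per source is `num_hashes` bitmaps of `width` bits regardless of how many destinations are seen, and re-adding a destination never changes the estimate. Fewer hash functions give a coarser estimate, which may still be enough when the counter only has to decide which side of k the ratio falls on.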
14 Results Data set >OC-48 peering link incoming, ~320 Mbps, 22 days >OC-48 peering link outgoing, ~560 Mbps, 3 days
15 Scanner Duration [Figure panels: 22-day trace; 3-day trace]
16 Scanner Rate
17 Number of Scanners Detected (1) Time series of the number of scanners detected (3 days)
18 Scanning Ports Port accessed
19 Conclusion Online Scan Detection and Tracking >Targets unidirectional backbone link >Detector: Time-based Access Pattern Sequential Hypothesis (TAPS) Combines rate limiting with statistical tests on destination IP and port access patterns >Implementation design: Queue model and FM counter Scanner Behavior >90-10 split of scanning rate, scanning duration behavior >Spike in number of scanners detected
20 Outline Tier-1 backbone: an overview TAPS: connectionless port scan detection on the backbone Scaling up: sampling and anomaly detection
21 Motivation Sampling to reduce processing overhead in traffic monitoring Sampled data used in: >Traffic Engineering -- computing traffic matrices >Inferring flow statistics from sampled data (Duffield03, Hohn03) Anomaly Detection (DDoS attacks, worm scans): Does sampled data contain sufficient information for effective anomaly detection? The brief answer … it depends >On sampling method >On sampling rate The impact of sampling >Number of anomalies detected: decreased >False positives: increased
22 Methodology [Diagram: traffic traces feed one Anomaly Detection Module directly and a second Anomaly Detection Module through a Sampling Module; the two sets of results are compared.]
23 Anomalies and Detection Algorithms
Type of Anomaly | Detection Algorithms
Volume anomaly: DoS attacks, flash crowds | 1. Wavelet-based change detection [Barford02]
Port scanning: worm/virus propagation | 2. Threshold Random Walk [Jung04]; 3. Access Pattern: TAPS [Sridharan06]
24 Sampling Methods Random packet sampling: each packet sampled with probability r < 1 >Simple implementation (good for busy routers) >Widely deployed (Cisco NetFlow) >Flow statistics hard to recover Random flow sampling: classify packets into flows, each flow sampled with probability p < 1 >High resource requirement >Accurate estimation of flow statistics
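The difference between the two methods can be sketched as follows. This is an illustration under simple assumptions: packets are dictionaries with a hypothetical `"flow"` key standing in for the 5-tuple, and the function names `packet_sample` and `flow_sample` are made up for this example.

```python
import random

def packet_sample(packets, r, rng):
    """Random packet sampling: keep each packet independently with probability r."""
    return [pkt for pkt in packets if rng.random() < r]

def flow_sample(packets, p, rng):
    """Random flow sampling: decide once per flow, then keep or drop all its packets."""
    decision = {}  # flow key -> keep? (this table is the extra resource cost)
    kept = []
    for pkt in packets:
        key = pkt["flow"]
        if key not in decision:
            decision[key] = rng.random() < p
        if decision[key]:
            kept.append(pkt)
    return kept

rng = random.Random(1)
packets = [{"flow": f, "seq": s} for f in range(100) for s in range(10)]
kept_by_flow = flow_sample(packets, 0.5, rng)
kept_by_packet = packet_sample(packets, 0.1, rng)
```

Flow sampling either keeps a flow whole or drops it, so per-flow statistics of the surviving flows are exact. Packet sampling keeps fragments of many flows, which is why flow statistics are hard to recover from it.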
25 Sampling (continued) Designer flow sampling: for catching heavy-hitters >Smart Sampling [Duffield02] – flow records selected with a probability that depends on flow size >Sample-and-Hold [Estan02]: Each byte of a packet is sampled with a small probability h. Once a packet in a flow gets sampled, all following packets in that flow are sampled as well.
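A minimal sketch of sample-and-hold as described on this slide. Sampling each byte with probability h means a packet of `size` bytes creates a flow entry with probability 1 - (1 - h)^size; after that, every packet of the flow is counted. The function name and the `(flow, size)` packet representation are assumptions for this example.

```python
import random

def sample_and_hold(packets, h, rng):
    """packets: iterable of (flow_key, size_in_bytes). Returns held flows and byte counts."""
    table = {}
    for flow, size in packets:
        if flow in table:
            table[flow] += size  # flow is held: count every later packet
        elif rng.random() < 1 - (1 - h) ** size:
            table[flow] = size  # at least one byte of this packet was sampled
    return table

rng = random.Random(7)
# One heavy flow of 200 large packets and 50 single-packet small flows.
trace = [("heavy", 1500)] * 200 + [(f"small-{i}", 40) for i in range(50)]
held = sample_and_hold(trace, h=0.0005, rng=rng)
```

Large flows are caught early with high probability while small flows are mostly missed, which is the heavy-hitter bias this family of methods is designed for.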
26 Comparing Sampling Algorithms How to compare: normalizing CPU load, or memory consumption Our choice – the percentage of flows sampled >Input to the anomaly detection based on flows, >Number of flows translates to memory consumption. Example of sampling parameter settings:
27 Impact of Sampling on Volume Anomaly Detection (1) Wavelet-based change detection on flow rate Decomposition Re-synthesize into three bands High ~ 1sec Mid ~ 1min Low ~ 15min Detection on high/mid Sliding window Deviation score
28 Impact of Sampling on Volume Anomaly Detection (2) Original detection: 21 False negatives >Random flow sampling introduces more local variance >Random packet sampling introduces even more variance >Smart sampling and sample-and-hold flatten the time series
29 Impact of Sampling on Port Scan Detection Performance Metrics Definition >Success Ratio R_s = (number of true scanners detected) / (number of true scanners) >False Positive Ratio R_f+ = (number of false scanners detected) / (number of true scanners) R_s => effectiveness, R_f+ => errors Ground truth: True scanner set examined by hand.
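The two ratios can be computed directly from the detected set and the hand-verified ground truth. A small sketch, with a hypothetical function name and made-up example addresses:

```python
def scan_detection_ratios(detected, true_scanners):
    """Return (R_s, R_f+) as defined on the slide; both are divided by the true scanner count."""
    detected, true_scanners = set(detected), set(true_scanners)
    if not true_scanners:
        raise ValueError("ground-truth scanner set must be non-empty")
    r_s = len(detected & true_scanners) / len(true_scanners)
    r_f = len(detected - true_scanners) / len(true_scanners)
    return r_s, r_f

truth = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
flagged = {"10.0.0.1", "10.0.0.2", "192.0.2.9"}
ratios = scan_detection_ratios(flagged, truth)
```

Because R_f+ is normalized by the number of true scanners rather than by the number of detections, it can exceed 1 when a detector flags more benign hosts than there are real scanners.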
30 TRWSYN results
31 TAPS results Flow count reduction – false negatives Flow shortening – false positives shoot up in random packet sampling. >A multi-packet TCP flow is shrunk to a single SYN-packet flow >The result: scanners and benign hosts become statistically indistinguishable.
32 Conclusion Implications of Our Results: >Random flow sampling is generally robust for both volume anomaly detection and port scan detection. >Random packet sampling is oblivious to any underlying traffic features, and causes information loss and distortion which degrade the performance of anomaly detection algorithms. >Smart sampling and sample-and-hold target heavy-hitters, and are thus not well suited to anomaly detection. Ongoing work: >Design anomaly detection algorithms robust to sampling, >Design new anomaly-detection-friendly sampling methods.
33 The End! Tier-1 backbone: an overview TAPS: Connectionless port scan detection on the backbone and scanner profiling Sampled data is NOT sufficient for anomaly detection purposes
34 A Backbone POP [Diagram labels: Peer, Core Router, Other POPs, Edge Router]