New Streaming Algorithms for Fast Detection of Superspreaders Shobha Venkataraman* Joint work with: Dawn Song*, Phillip Gibbons ¶, Avrim Blum* *Carnegie Mellon University, ¶ Intel Research Pittsburgh
2 Superspreaders k-superspreader: host that contacts at least k distinct destinations in a short time period. Goal: given a stream of packets, find k-superspreaders. Why care about superspreaders? Indicators of possible network attacks. E.g., compromised host in worm propagation contacts many distinct destinations. Slammer worm contacted up to 26,000 hosts per second! Automatic identification useful in logging and throttling attack traffic.
3 Heavy Distinct-Hitters General problem: given a stream of (x, y) pairs, find all x paired with at least k distinct y: the heavy distinct-hitter problem. Applications: Find dests contacted by many distinct srcs. Find ports contacted by many distinct srcs/dests, or with high ICMP traffic. Find potential spammers without per-src information. Find nodes that contact many other nodes in peer-to-peer networks.
4 Challenges Need very efficient algorithms for high-speed links Superspreaders often tiny fraction of network traffic: e.g., in traces, < 0.004% of total traffic Need algorithms in streaming model: Allow only one pass over data Much less storage than data Distributed monitoring desirable, must have little communication between monitors
5 Strawman Approaches Approach 1: track every src with list of distinct destinations contacted, e.g. Snort Too much storage! Approach 2: track every src with a distinct counter per src. [Estan et al 03] Also too much storage! Approach 3: Use multiple-cache data structure of Weaver et al 04. Designed for different problem, does not scale for finding superspreaders
6 Outline Introduction Problem Definition Algorithms Extensions Experiments Conclusions
7 Formal Problem Definition Given k, b > 1, and probability of failure δ: any k-superspreader is output with probability at least 1 - δ; any src that contacts < k/b distinct dests is output with probability < δ; srcs in between may or may not be output. Thus, expect to identify a src as a superspreader after it contacts more than k/b and fewer than k distinct dests.
8 Example k = 1000, b = 2, δ = 0.05. Then Pr[src output | contacts ≥ 1000 dests] > 0.95 and Pr[src output | contacts < 500 dests] < 0.05. Expect gap between normal behaviour and superspreaders. [Figure: no. of distinct destinations contacted by srcs s1, s2, s3: d1 = 1000, d2 = 750, d3 = 500]
9 Theoretical Guarantees Given k, b > 1, and δ, can set parameters so that, for N distinct flows: Pr[k-superspreader output] > 1 - δ; Pr[false positive output] < δ. Expected memory (fixed b): O((N/k) log(1/δ)). Note: as many as N/k k-superspreaders possible, so within O(log(1/δ)) of lower bound. Per-packet processing time: constant. At most 2 hashes and 2 memory accesses per packet. Most packets get 1 hash, or 1 hash and 1 memory access.
10 Outline Introduction Problem Definition Algorithms One-Level Filtering Algorithm Two-Level Filtering Algorithm Extensions Experiments Conclusions
11 One-Level Filtering Algorithm
For each packet (s, d):
Step 1: Compute h(s, d).
Step 2: If h(s, d) > c, discard packet.
Step 3: If h(s, d) < c, insert into hash table.
Step 4: Report all srcs with more than r destinations in hash table.
(We're effectively sampling distinct flows at rate c.)
[Figure: hash table with one entry per sampled src s1, ..., sm, each holding its list of sampled dests]
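The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `flow_hash` and `one_level_filter` are hypothetical, and using SHA-1 truncated to 53 bits as a uniform hash into [0, 1) is a choice made only for illustration. The parameters `c` (sampling rate) and `r` (reporting threshold) are the ones named on the slide.

```python
import hashlib
from collections import defaultdict


def flow_hash(src, dst):
    """Map a (src, dst) flow to a pseudo-uniform value in [0, 1).

    Assumption: SHA-1 over the pair, keeping the top 53 bits so the
    result is an exact float strictly below 1.0.
    """
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def one_level_filter(packets, c, r):
    """Return the srcs with more than r sampled distinct dests."""
    table = defaultdict(set)              # src -> set of sampled dests
    for src, dst in packets:
        if flow_hash(src, dst) < c:       # Steps 1-3: hash, then keep or discard
            table[src].add(dst)
    # Step 4: report srcs whose sampled dest count exceeds the threshold
    return {src for src, dsts in table.items() if len(dsts) > r}
```

Because the hash depends only on the (src, dst) pair, repeated packets of one flow are either always kept or always dropped, so the table counts distinct destinations rather than packets.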
12 Example: One-Level Filtering k = 1000, b = 2, δ = 0.05. Compute that c = 0.052, r = 39. In expectation: 94.8% of packets require only one hash computation; the remaining 5.2% require more processing and storage.
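A quick arithmetic check shows why a threshold of r = 39 separates the two cases at sampling rate c = 0.052. The slide does not say how c and r are derived, so this sketch only computes the expected number of sampled flows for a src at k and at k/b distinct dests; the helper name `expected_sampled` is hypothetical.

```python
def expected_sampled(distinct_dests, c):
    """Expected number of a src's distinct flows that survive sampling at rate c."""
    return distinct_dests * c


k, b, c, r = 1000, 2, 0.052, 39

at_k = expected_sampled(k, c)             # src contacting k = 1000 dests
at_k_over_b = expected_sampled(k / b, c)  # src contacting k/b = 500 dests
discarded = 1 - c                         # fraction of flows dropped after one hash

print(f"expected sampled dests at k:   {at_k:.1f}")
print(f"expected sampled dests at k/b: {at_k_over_b:.1f}")
print(f"threshold r:                   {r}")
print(f"fraction discarded:            {discarded:.3f}")
```

The threshold sits between the two expectations (about 26 and about 52), so a src near k/b is unlikely to exceed it while a src at k is likely to.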
13 Two-Level Filtering: Intuition (I) One-level filtering stores many small-dest srcs. Need threshold sampling rate to distinguish between srcs that contact k and k/b dests. Expected distribution: most srcs contact few dests. But all srcs are sampled at the threshold rate. Use two-level filtering to reduce memory usage on such traffic distributions. Coarse rate: decide whether to sample at fine rate. Fine rate: distinguish between srcs sending to k and k/b dests.
14 Two-Level Filtering: Intuition (II) Example: k = 1000, b = 2. Suppose coarse rate is 1/100. Expect that a 1000-superspreader will show up once in first 100 dests; w.h.p. in, say, first 200 dests. Use the remaining 800 dests to distinguish it, w.h.p., from a source that sends to only 500 dests. Only store 1% of the sources that send to few dests. Similar worst-case guarantees, but significantly better under some natural distributions.
15 Two-Level Filtering Algorithm
For each packet (s, d):
Step 1: Compute h1(s, d). Sample: if h1(s, d) < r1 and s is present in C, compute k = r1/m and insert s into hash table F_k.
Step 2: Compute h2(s, d). Sample: if h2(s, d) < r2, store s in C.
Return all the sources that appear in at least r of the hash tables F_i.
[Figure: fine filters F_1, ..., F_m, each a hash table of srcs, and coarse filter C, a hash table of srcs]
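The two steps above can be sketched in Python as follows. This is a rough sketch under loudly stated assumptions, not the authors' implementation: the names `unit_hash` and `two_level_filter` are hypothetical, SHA-1 stands in for the hash functions h1 and h2, and the rule for choosing which fine table receives the src (splitting the range [0, r1) into m equal buckets) is an assumption, since the slide's index formula did not survive extraction.

```python
import hashlib


def unit_hash(src, dst, salt):
    """Pseudo-uniform value in [0, 1) for a flow; salt selects h1 or h2 (assumed)."""
    digest = hashlib.sha1(f"{salt}|{src}|{dst}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def two_level_filter(packets, r1, r2, m, r):
    """Return srcs that appear in at least r of the m fine tables."""
    coarse = set()                        # C: srcs that passed coarse sampling
    fine = [set() for _ in range(m)]      # F_1..F_m: tables of srcs
    for src, dst in packets:
        # Step 1: fine sampling, only for srcs already in the coarse filter
        h1 = unit_hash(src, dst, "h1")
        if h1 < r1 and src in coarse:
            # ASSUMPTION: bucket index from the position of h1 within [0, r1)
            idx = min(int(h1 * m / r1), m - 1)
            fine[idx].add(src)
        # Step 2: coarse sampling at the lower rate r2
        if unit_hash(src, dst, "h2") < r2:
            coarse.add(src)
    counts = {}
    for table in fine:
        for src in table:
            counts[src] = counts.get(src, 0) + 1
    return {src for src, n in counts.items() if n >= r}
```

A src that contacts few dests rarely passes the coarse sample, so it never occupies the fine tables; only srcs already in C pay the fine-rate storage cost.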
16 Example: Two-Level Filtering k = 1000, b = 2, δ = 0.05. Compute r1 = 0.15, r2 = 0.006, m = 100. Case 1: srcs that contact 1 distinct dest each: 85% of flows discarded; 0.6% entered into coarse filter; 15% checked for presence in coarse filter. Case 2: srcs that are superspreaders: 85% of flows discarded per superspreader; 15% of flows require entry into fine filter.
18 Extension: Deletions in Stream Goal: find superspreaders when deletions are allowed in stream. Application: find srcs with many distinct connection failures. Connection initiated: (src, dst) pair appears in stream. Response received: that (src, dst) pair gets deleted.
Stream: (s1, d1, 1), (s1, d2, 1), (s1, d3, 1), (s2, d2, 1), (s1, d4, 1), (s2, d2, -1) ...
Stream: (s1, d1, 1), (s1, d2, 1), (s1, d3, 1), (s2, d2, 1), (s1, d4, 1), (s2, d2, -1), (s1, d2, -1) ...
19 Extension: Sliding Windows Goal: find superspreaders over sliding windows of packets, e.g. in only the most recent t packets, or the last 1 hour.
Stream: ... (s1, d1), (s1, d2), (s1, d3), (s2, d2), (s2, d4) ...
Stream: ... (s1, d1), (s1, d2), (s1, d3), (s2, d2), (s2, d4), (s1, d5) ...
Stream: ... (s1, d1), (s1, d2), (s1, d3), (s2, d2), (s2, d4), (s1, d5), (s3, d4) ...
20 Extension: Distributed Monitoring Given: set of monitoring points, each point sees a stream of packets. Goal: find superspreaders in union of streams. One-level filtering algorithm needs very little communication.
Streams seen at monitoring points A, B, C:
(s1, d1), (s1, d2), (s2, d3), (s1, d1) ...
(s1, d1), (s1, d3), (s2, d4), (s2, d5) ...
(s1, d1), (s2, d2), (s3, d3), (s4, d4) ...
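One plausible way to see why one-level filtering needs little communication: if every monitor uses the same hash function and sampling rate, each can sample locally and ship only its small table, and the union of the tables equals what a single monitor would have built from the merged stream. The sketch below illustrates that idea only. The slide does not spell out the protocol, so the names `local_sample` and `merge_and_report`, the SHA-1 hash, and the union-at-a-coordinator design are all assumptions.

```python
import hashlib
from collections import defaultdict


def flow_hash(src, dst):
    """Shared hash into [0, 1); every monitor must use the same one (assumed SHA-1)."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def local_sample(packets, c):
    """Run at each monitor: keep only flows whose hash falls below c."""
    table = defaultdict(set)              # src -> sampled dests seen locally
    for src, dst in packets:
        if flow_hash(src, dst) < c:
            table[src].add(dst)
    return table


def merge_and_report(tables, r):
    """Run at a coordinator: union the local tables, then apply threshold r."""
    merged = defaultdict(set)
    for table in tables:
        for src, dsts in table.items():
            merged[src] |= dsts           # set union removes cross-monitor duplicates
    return {src for src, dsts in merged.items() if len(dsts) > r}
```

Since a flow's sampling decision is identical at every monitor, a flow seen at several points is counted once after the union.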
22 Experimental Setup Experiments run on Pentium IV, 1.8 GHz with 1 GB RAM. Traces taken from NLANR archive, ranging from 2.8 million packets (65 sec) to 4.5 million packets (4.5 min). Added 100 srcs that contact k distinct dests and 100 srcs that contact k/b distinct dests. Use randomly generated SHA1 hash function for each run. For all experiments, δ = 0.05.
23 Experimental Results (I) Accuracy. Discussion: Both algorithms have desired accuracy. False positive rate much less than 0.05, since most (eligible) srcs send to many fewer than k/b dests. Observed false positives only come from srcs close to the boundary.
24 Experimental Results (II) 1LF = 1-Level Filtering; 2LF-T = 2-Level Filtering, hash-table implementation; 2LF-B = 2-Level Filtering, Bloom-filter implementation. As expected, when b increases, sampling rates decrease, and total memory usage decreases. 2LF-B has least memory usage. [Figure panels: k = 200, b = 2; k = 200, b = 5; k = 200, b = 10]
25 Experimental Results (III) 1LF = 1-Level Filtering; 2LF-T = 2-Level Filtering, hash-table implementation; 2LF-B = 2-Level Filtering, Bloom-filter implementation. As expected, when k increases, sampling rates decrease, and total memory usage decreases. 2LF-B has least memory usage. [Figure panels: k = 500, b = 2; k = 1000, b = 2; k = 5000, b = 2]
26 Related Work Networking: Related problems: finding heavy-hitters [Estan-Varghese 02], multidimensional traffic clusters [Estan+ 03], distribution of flow lengths [Duffield+ 03], large changes in network traffic [Cormode-Muthukrishnan 03]. Streaming Algorithms: Most closely related: counting number of distinct values in a stream [Flajolet-Martin 85, Alon-Matias-Szegedy 99, Cohen 97, Gibbons-Tirthapura 02, Bar-Yossef+ 02, Cormode+ 02].
27 Summary Defined superspreader (and heavy distinct-hitter) problem One-pass streaming algorithms: Theoretical guarantees on accuracy and overhead Experimental analysis validates theoretical results Extensions to model with deletions, sliding windows and distributed monitoring Novel two-level filtering scheme may be of independent interest
28 Thank you!
29 Motivation (II) Superspreaders different from heavy-hitters! Care about many distinct destinations Few large file transfers => heavy-hitter, but not superspreader Superspreaders not necessarily heavy-hitters In test traces, superspreaders < 0.004% total traffic analyzed
30 Theoretical Guarantees Given k, b > 1, and δ, can set parameters for both algorithms so that: Pr[k-superspreader output] > 1 - δ; Pr[false positive output] < δ. Expected memory (fixed b): O((N/k) log(1/δ)). Per-packet processing time: constant. At most 2 hashes and 2 memory accesses per packet. Most packets get 1 hash, or 1 hash and 1 memory access. Optimization: implement Two-Level Filtering with Bloom filters: decreases memory usage, increases computational cost.