Published by Octavia Morris. Modified over 9 years ago.
1
Measurement Algorithms: Bloom Filters and Beyond. George Varghese, University of California, San Diego
2
Network Evolution? 1. Basic: stateless, transparent. Tools: protocol design (e.g., soft-state). 2. Active: customizable, re-configurable. Tools: code safety (e.g., sandboxing). 3. Introspective: pattern detection/response. Tools: streaming algorithms, statistical inference (e.g., Bloom Filters, sampling). Hawkeye enables introspection for measurement & security.
3
What is Introspection? Detecting patterns in data traffic, either in real-time or based on packet logs. Examples: Measurement Introspection: Identify resource usage patterns for better resource management. Security Introspection: Identify attack patterns to mitigate or prevent attacks. Fault Introspection: Identify fault or anomaly patterns to allow automated fault repair. Motivated by market pull and technology push.
4
Market Pull Better ROI: Optimize network resources (BGP policy, OSPF weights, light up fibers, add bandwidth) based on resource usage patterns. Better security: Allowing an organization to stay open for business during mass or targeted attacks is a major differentiator. Better Fault Detection: Many performance anomalies can be detected by better measurement primitives (e.g., Goldman-Sachs). [Figure: Customer Sites 1, 2 and 3; reroute or add B/W]
5
Technology Push: Streaming Algorithms and Hardware Gates Algorithms: Recent major thrust in streaming algorithms in database, web analysis, theory, networks Hardware: Memory accesses remain expensive (< 100) and SRAM not scaling as fast as number of connections (< 32 Mbits), but gates are plentiful. Mapping: Many randomized streaming algorithms (e.g., Bloom Filters, Min-wise hashing) developed to find patterns in disk logs map well to network ASICs. Opportunity: Invent or adapt streaming algorithms for networking patterns.
6
Concerns about Network Introspection Speed: Can hardware run fast enough? Recall IP lookups in the 1990s; surprisingly complex things (branch predictors, TCP Offload) are done routinely today. Most of the algorithms described below are being implemented at 24 Gbps in Hawkeye. Inflexible: Hardware is not easy to change. Design hardware to identify useful “primitive” patterns that can be combined (exactly what Hawkeye does). Network Processors can offer flexibility & speed. End-to-end argument: Not a simple, stateless core. Not required for correctness of basic forwarding, but only as an optimization or value-add.
7
Introspection as Pattern Detection Within Packet Patterns: Prefix matches, classification, signature detection (e.g., Code Red Payload) Across Packet Patterns: Scheduling, Timing, Membership Checks Heavy-hitters, large flows, partial completion, counting flows S1 S2 S5S2S1 ROUTER
8
Pattern Detection Algorithm Requirements Low memory: On-chip SRAM limited to around 10-32 Mbits. Not constant but is not scaling with number of concurrent conversations. May need to replicate. Small processing: For wire-speed at 40 Gbps, using 40 byte packets, have 8 nsec. Using 1 nsec SRAM, 8 memory accesses. Factor of 30 in parallelism buys 240 accesses.
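The per-packet budget on this slide is plain arithmetic, and a short sketch makes it easy to replay with other link speeds. The function name and parameters below are illustrative, not from the talk.

```python
def per_packet_budget(link_gbps, packet_bytes, sram_ns, parallelism=1):
    """Return (time per packet in ns, memory accesses that fit in that time)."""
    packet_bits = packet_bytes * 8
    # bits divided by Gbit/s gives nanoseconds directly
    time_ns = packet_bits / link_gbps
    accesses = int(time_ns / sram_ns) * parallelism
    return time_ns, accesses

# Numbers from the slide: 40 Gbps, 40-byte packets, 1 ns SRAM
time_ns, serial = per_packet_budget(40, 40, 1.0)
_, parallel = per_packet_budget(40, 40, 1.0, parallelism=30)
print(time_ns, serial, parallel)
```

With the slide's numbers this gives 8 ns per packet, 8 serial accesses, and 240 accesses with a factor of 30 in parallelism.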
9
Talk Outline Part 1: Motivation Part 2: Basic Patterns and Algorithms (membership checks, heavy-hitters, many flows, partial completion) Part 3: Combining patterns to solve useful application problems Part 4: Conclusions.
10
Pattern 1: Membership Check Membership Check: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) on a link that belong to a pre-specified set (e.g., a black list). Sequence: S1 S6 S2 S5 S2 S8. Set contains only S2, S5. B. Bloom, Comm. ACM, July 1970
11
Field Extraction Equal to 1 ? Equal to 1 Equal to 1 ? BitMap Hash 1 Hash 2 Hash 3 Stage 1 Stage 2 Stage 3 ALERT ! If all bits are set Membership Check via Bloom Filter Set
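The staged structure in this diagram (one hash and one bitmap per stage, alert only if every stage finds its bit set) can be sketched in software as follows. This is a minimal illustration, not the hardware design: the class name, the use of BLAKE2 with a per-stage salt as the hash, and the one-byte-per-bit storage are all assumptions made for readability.

```python
import hashlib

class StagedBloomFilter:
    """One bitmap per stage; a key is a member only if every stage has its bit set."""

    def __init__(self, num_stages=4, bits_per_stage=10_000):
        self.num_stages = num_stages
        self.bits_per_stage = bits_per_stage
        # One byte per bit keeps the sketch simple; hardware would pack bits.
        self.stages = [bytearray(bits_per_stage) for _ in range(num_stages)]

    def _index(self, stage, key):
        # Assumed hash: BLAKE2b salted with the stage number, one hash per stage.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=stage.to_bytes(2, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.bits_per_stage

    def add(self, key):
        for s in range(self.num_stages):
            self.stages[s][self._index(s, key)] = 1

    def __contains__(self, key):
        return all(self.stages[s][self._index(s, key)] for s in range(self.num_stages))

# The pre-specified set from the earlier slide: S2 and S5
blacklist = StagedBloomFilter()
for member in ("S2", "S5"):
    blacklist.add(member)

# Raise an alert for every packet whose source passes all stages
alerts = [src for src in ["S1", "S6", "S2", "S5", "S2", "S8"] if src in blacklist]
print(alerts)
```

Members of the set always pass; a non-member passes only if it collides with a set bit in every stage, which is the false-positive case analysed on the next slides.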
12
Trivial Bloom Filter Analysis Assume set of size 1000. Bound probability that a flow F not in set gets through 4 stages of size 10,000 each. Why trouble?: F can pass a stage if it hashes to a bit set by some real member of the set. Single stage probability: At most 1000 of the 10,000 bits in a stage can be set. Thus probability of F passing a stage is at most 1000/10,000 = 0.1. Multistage probability: To be branded, F must be unlucky in all 4 stages, with a probability of no more than 0.1^4 = 10^-4, which is very small. Can play with numbers.
13
Accurate Bloom Filter Analysis Assume set of size 1000. Bound probability that a flow F not in set gets through 4 stages of size 10,000 each. Previous analysis ignores bit collisions. Single stage probability: Probability of F passing a stage is s = 1 - (1 - 1/10,000)^1000 ≈ 1 - e^{-0.1} ≈ 0.095. Multistage probability: To be branded, F must be unlucky in all 4 stages, with a probability of no more than s^4 (about 8.2 x 10^-5), which is very small.
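To "play with numbers", the two analyses can be evaluated directly. The sketch below computes the exact single-stage probability, compares it with the exponential approximation and with the trivial bound of 0.1, and raises it to the number of stages; the function names are illustrative and the stage count is a parameter.

```python
import math

def stage_pass_probability(set_size, bits_per_stage):
    """Probability that a given bit in a stage was set by some member."""
    return 1.0 - (1.0 - 1.0 / bits_per_stage) ** set_size

def false_positive_bound(set_size, bits_per_stage, num_stages):
    """Probability that a non-member passes every stage (stages independent)."""
    return stage_pass_probability(set_size, bits_per_stage) ** num_stages

s = stage_pass_probability(1000, 10_000)
approx = 1.0 - math.exp(-0.1)           # the e^{-0.1} form on the slide
trivial = 1000 / 10_000                 # bound that ignores bit collisions

print(f"exact s = {s:.5f}, approximation = {approx:.5f}, trivial bound = {trivial}")
for stages in (4, 6):
    print(stages, "stages:", false_positive_bound(1000, 10_000, stages))
```

The exact value is slightly below the trivial bound because several members can set the same bit.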
14
Applications Replacement for a hash table: useful when storage is important, identifiers are long, false positives are acceptable, & membership check suffices. Example 1: String Matching: exact matching against up to 4000 strings of 40 bytes each using only on-chip SRAM. Example 3: Reporting
15
Example 1: String Matching A0 A1 An String Database to Block A2 ST0 ST1 ST2 STn Anchor Strings Multi Stage Filter Hash Function Sushil Singh, G. Varghese, J. Huber, Sumeet Singh, Patent Application
16
String Matching Continued: String Grouping A0 A1 An A2 ST0 ST1 ST2 STn Hash Function Hash Bucket-0 Hash Bucket-1 Hash Bucket-m
17
String Matching Continued: Bit Trees A2 A8 A11 ST2 ST8 ST11 A17 ST17 1 0 0 1 LOC L1 A8 A11 ST8 ST11 A2 ST2 A17 ST17 L1 L2 L3 ST8 ST11 ST2 ST17 0 1 1 1 0 0 1 0 1 0 LOC L2 LOC L3 Strings in a single hash bucket
18
Example 3: Scalable Reporting (Carousel, NSDI 2010) Problem: When a worm breaks out, how do we report all infected machines? Logging packets that match the worm pattern can result in millions of sources and many duplicates. Solution: Use a sampled Bloom filter and a “more” bit. Start with no sampling. Any source IP in a worm packet is reported and placed in a Bloom filter of size B to suppress duplicates. Stop when B sources are reported and set the “more” bit. Recursive Solution: If the “more” bit is set, repeat the algorithm twice, once for LSB of Hash(SourceIP) = 0 and once for 1. If there are still more, repeat it four times. Nearly optimal solution.
19
Timed Bloom Filters Question: How can we add a notion of time to Bloom Filters without lots of memory? Solution: Use 2 Bloom Filters, Old and New. Insert: Insert into New. Search: Search in both New and Old. Age every T seconds: Old := New; New := Empty. Property: Any entry not refreshed for 2T is deleted. An entry refreshed within T is in. U.S. Patent Application: Paul Owen & Andy Fingerhut et al.
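The Old/New scheme on this slide is small enough to sketch end to end. In the version below the caller invokes `age()` every T seconds; the class names, filter size, and the BLAKE2-based hash are assumptions for illustration only.

```python
import hashlib

class Bloom:
    """Plain Bloom filter: num_hashes hash functions over a single bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            d = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(d, "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

class TimedBloom:
    """Two filters, Old and New; entries disappear after at most two ageing steps."""

    def __init__(self, **bloom_args):
        self._bloom_args = bloom_args
        self.old = Bloom(**bloom_args)
        self.new = Bloom(**bloom_args)

    def insert(self, key):
        self.new.add(key)                 # Insert: insert into New

    def search(self, key):
        return key in self.new or key in self.old   # Search: both New and Old

    def age(self):
        # Called every T seconds: Old := New; New := Empty
        self.old = self.new
        self.new = Bloom(**self._bloom_args)

timed = TimedBloom()
timed.insert("S2")
found_now = timed.search("S2")
timed.age()
found_after_one_age = timed.search("S2")   # still visible through Old
timed.age()
found_after_two_ages = timed.search("S2")  # never refreshed, so it is gone
print(found_now, found_after_one_age, found_after_two_ages)
```

An entry that is re-inserted after each ageing step stays visible indefinitely, which is the refresh behaviour the slide describes.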
20
Pattern 2a: Heavy-hitters with Threshold Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) on a link that send more than a threshold T (say 1% of the traffic). S1 S6 S2 S5 S2 Source S2 is 30 percent of traffic sequence. Estan, Varghese, ACM TOCS 2003
21
Field Extraction Equal to T? Counters Hash 1 Hash 2 Hash 3 Stage 1 Stage 2 Stage 3 ALERT ! If all counters above threshold T Heavy Hitters with Multistage Filters Increment
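The diagram replaces the Bloom filter's bits with counters: each packet adds its size to one counter per stage, and the flow is flagged only when all of its counters have reached the threshold. A minimal sketch, assuming byte counts as the traffic measure and a salted BLAKE2 hash per stage (both are choices made here, not taken from the slide):

```python
import hashlib

class MultistageFilter:
    """Counting version of the staged filter, used to flag heavy hitters."""

    def __init__(self, threshold, num_stages=3, counters_per_stage=1000):
        self.threshold = threshold
        self.num_stages = num_stages
        self.counters_per_stage = counters_per_stage
        self.counters = [[0] * counters_per_stage for _ in range(num_stages)]

    def _index(self, stage, flow):
        d = hashlib.blake2b(
            flow.encode(), digest_size=8, salt=stage.to_bytes(2, "big")
        ).digest()
        return int.from_bytes(d, "big") % self.counters_per_stage

    def update(self, flow, size):
        """Add size to the flow's counter in every stage.

        Returns True if all of the flow's counters have reached the threshold.
        """
        all_hot = True
        for s in range(self.num_stages):
            i = self._index(s, flow)
            self.counters[s][i] += size
            if self.counters[s][i] < self.threshold:
                all_hot = False
        return all_hot

f = MultistageFilter(threshold=100)
first = f.update("S2", 60)    # 60 < 100 in every stage
small = f.update("S1", 10)    # total traffic so far is 70, so no counter can reach 100
second = f.update("S2", 60)   # S2's counters now hold at least 120
print(first, small, second)
```

A flow that truly exceeds the threshold is always flagged, since its own traffic alone pushes every one of its counters over; small flows are flagged only when they share hot counters in all stages.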
22
Multistage filters in Action Grey = other flows Yellow = small flow Green = large flow Stage 1 Stage 3 Stage 2 Counters Threshold...
23
Multistage Filter Analysis Assume 1 percent threshold. Bound probability that a flow F of 0.1% or less gets through 6 stages of size 1000 each. Why trouble?: F can fall into a "hot" bucket if and only if the sum of traffic of all other flows in that bucket is more than 0.9%. Single stage probability: At most 100/0.9 = 111 buckets can be over 0.9% before we bring on F. Thus probability F falls in a "hot" bucket is less than 111/1000 = 0.111. Multistage probability: To be branded, F must be unlucky in all 6 stages, with a probability of no more than 0.111^6 (about 1.9 x 10^-6), which is very small. Thus at most 1000 false positives with very high probability.
24
Pattern 2b: Top K heavy-hitters Heavy-hitters: In a measurement interval (e.g., 10 minutes), detect the flows (e.g., sources) that are the top K talkers on a link. S1 S2 S5 S2 Source S2 and S1 are top K talkers. Bonomi, Prabhakar, Zhang, Wu, Cisco Internal
25
Two simpler proposals SIFT (Prabhakar) and Sample-and-Hold (Estan-Varghese) both suggest sampling a packet with small probability p. Once sampled, place in CAM and watch all packets Idea: large flows are more likely to be sampled, and then we get exact counts Problem: CAM quickly gets muddied with mice and then elephants can be lost.
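The shared idea behind both proposals can be sketched as below, with a Python dict standing in for the CAM. This follows the slide's per-packet description; the function name, the `(flow, size)` packet representation, and the unbounded table are assumptions (a real CAM is small, which is exactly the problem the slide points out).

```python
import random

def sample_and_hold(packets, p, seed=0):
    """Sample packets with probability p; once a flow is sampled, count all its later packets.

    packets: iterable of (flow_id, size) pairs.
    Returns a dict mapping each held flow to the bytes counted since it was sampled.
    """
    rng = random.Random(seed)   # seeded so that runs are repeatable
    held = {}                   # stands in for the CAM
    for flow, size in packets:
        if flow in held:
            held[flow] += size          # flow already held: watch every packet
        elif rng.random() < p:
            held[flow] = size           # newly sampled flow enters the table
    return held

packets = [("S2", 100), ("S1", 40), ("S2", 100), ("S6", 40), ("S2", 100)]
everything = sample_and_hold(packets, p=1.0)   # degenerate case: exact counts
nothing = sample_and_hold(packets, p=0.0)      # degenerate case: nothing sampled
print(everything, nothing)
```

With a small p, a flow that sends many packets gets many chances to be sampled, so large flows are likely to enter the table early and be counted almost exactly.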
26
Elephant Traps: Sample and Recycle
27
Pattern 3: Partial Completion Partial Completion: In a measurement interval, detect the flows (e.g., destinations) which have several Start Packets (e.g., SYN) without the corresponding End (e.g., FIN). Destination X has 3 partial completions in sequence SYN x SYN Y SYN z FIN Y SYN x FIN Z Kompella,Singh,Varghese, IMC 2003
28
Field Extraction Equal to T? Counters Hash 1 Hash 2 Hash 3 Stage 1 Stage 2 Stage 3 ALERT ! If all counters above threshold Partial Completion Filters Increment for SYN, Decrement for FIN
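The partial completion filter reuses the staged counters, but increments on a start packet and decrements on the matching end packet, so completed connections cancel out. A minimal sketch keyed on destination, with an assumed salted BLAKE2 hash per stage and illustrative names:

```python
import hashlib

class PartialCompletionFilter:
    """Staged counters: +1 for SYN, -1 for FIN, keyed by destination."""

    def __init__(self, threshold, num_stages=3, counters_per_stage=1000):
        self.threshold = threshold
        self.num_stages = num_stages
        self.counters_per_stage = counters_per_stage
        self.counters = [[0] * counters_per_stage for _ in range(num_stages)]

    def _index(self, stage, dest):
        d = hashlib.blake2b(
            dest.encode(), digest_size=8, salt=stage.to_bytes(2, "big")
        ).digest()
        return int.from_bytes(d, "big") % self.counters_per_stage

    def update(self, dest, kind):
        """Process one SYN or FIN; return True if all counters for dest exceed the threshold."""
        delta = {"SYN": 1, "FIN": -1}[kind]
        all_above = True
        for s in range(self.num_stages):
            i = self._index(s, dest)
            self.counters[s][i] += delta
            if self.counters[s][i] <= self.threshold:
                all_above = False
        return all_above

pcf = PartialCompletionFilter(threshold=2)
events = [("X", "SYN"), ("Y", "SYN"), ("X", "SYN"), ("Y", "FIN"), ("X", "SYN")]
results = [pcf.update(dest, kind) for dest, kind in events]
print(results)
```

In this run destination Y opens and closes one connection and contributes nothing, while destination X ends with three unanswered SYNs and is flagged on the last packet.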
29
Analysis 1: Benign but Malformed Connections Model benign but malformed connections as adding an extra SYN or FIN to an interval with probability 0.5. [Figure: timeline over Intervals 1 to 4 showing a long-lived connection, SYN y retransmissions, FIN z retransmissions, and a SYN x / FIN x pair]
30
Analysis 2: using Gaussian approximation Probability of false positives = 0.0013. Probability of false negatives = 0.0013. [Figure: probability versus counter values, with the region greater than 6 marked]
31
Pattern 4: Many Flows Many Flows: In a measurement interval, find if number of tuples exceeds a threshold. S1 S6S2S5S2 6 packets but only 4 distinct sources Estan, Fisk, Varghese, IMC 2003, ACM TONS to appear
32
Simple Bitmap counting Hash based on flow identifier F to set a bit in the bitmap. Estimate: based on the number of bits set. Problem: bitmap takes too much memory to count a large number of flows.
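A sketch of the simple bitmap counter follows. The slide only says the estimate is based on the number of bits set; the specific formula used here, n ≈ b ln(b / z) for b bits of which z are still zero, is the standard linear-counting estimator and is an assumption about what the talk intends. The hash and the function name are also illustrative.

```python
import hashlib
import math

def bitmap_estimate(flow_ids, num_bits=1024):
    """Estimate the number of distinct flow ids with a direct bitmap."""
    bitmap = bytearray(num_bits)          # one byte per bit, for clarity
    for flow in flow_ids:
        d = hashlib.blake2b(str(flow).encode(), digest_size=8).digest()
        bitmap[int.from_bytes(d, "big") % num_bits] = 1   # duplicates hit the same bit
    zeros = num_bits - sum(bitmap)
    if zeros == 0:
        return float("inf")               # bitmap saturated: too many flows for this size
    # Linear-counting correction for several flows hashing to the same bit
    return num_bits * math.log(num_bits / zeros)

one_flow = bitmap_estimate(["S2"])
repeated = bitmap_estimate(["S2"] * 10)
many = bitmap_estimate(range(500))
print(one_flow, repeated, many)
```

Repeated packets from one flow do not change the estimate, and the saturation case shows why a fixed bitmap needs memory proportional to the largest count it must handle.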
33
Sampled Bitmap counting Problem: inaccurate if too few or too many flows Solution: keep only a sample of the bitmap 11 Estimate: scale up sampled count
34
Multi-resolution Bitmap counting Solution: multiple bitmaps, each covering a different range Estimate: use first bitmap that has less than 93.1% of its bits set, count, scale 1-10 flows 10-100 100-1000
35
Scalable Bitmap counting Solution: one bitmap with an additional scale factor that is increased when all bits are set. At time 0, start with scale = 1 (1-10 flows). Later use with scale = 100 (100-1000 flows). Estimate: count bits, correct, multiply by scale factor. Can count to 1 million using 15 bit scale factor and 32-bit vector.
36
Scaled Multi-resolution Bitmap counting Solution: multiple bitmaps, each covering a different range but each with a scale factor Estimate: use first bitmap that has less than 93.1% of its bits set, count, scale 1-10 flows 10-100 100-1000 Scale = 5 Scale = 2 Scale = 8 F. Shahid et al, U.S. Patent Application
37
Pattern 5: Concurrent Approximate State Machines State Machine: In a measurement interval, detect the flows which hit a specified state machine (Bloom filter is a special case where state machine is a membership check). Flow X has two packets in B frame BxBx IYIY XY PxPx x BxBx Bonomi, Mitzenmacher, Panigrahy, Singh, Varghese 2006
38
Concurrent State Machines Implementation: We know 3 good implementations. The best of these uses a good hash table implementation (d-left) and simply substitutes the identifier of a flow with a smaller signature for the flow. Results: For 64 K flows, we need roughly 1 Mbit of memory. Applications: First, for video congestion control. We found good results by dropping B-frames during congestion and then tail-dropping till the next I-frame. Can handle twice the loss rates with same quality. Second, for P2P identification.
39
Outline of Talk Part 1: Motivation Part 2: Basic Patterns and Algorithms Part 3: Combining base patterns to solve useful application problems (traffic matrix, DoS, worms) Part 4: Conclusions.
40
Application 1: Traffic Matrix Each entry router uses a multistage filter on traffic to destination prefixes to isolate subnets to which there is large traffic. Aggregating across all entry routers gives the “dominant” part of traffic matrix. ATT reports 80-20 rule for prefixes. ISP Customer Site 1 Customer Site 3 Customer Site 2 reroute or add B/W
41
Application 2: DoS Attacks Bandwidth attacks (e.g., Smurf): Pound victim with large traffic of a certain type. Use heavy-hitter pattern relative to traffic type (e.g., ICMP) to find attacked destinations. Partial Completion attacks (e.g., TCP SYN-Flood): May not be unusual bandwidth but characterized by partial connections. Use partial completion pattern as a front-end for Riverhead Guard module in Jaffa.
42
Application 4: Worm Detection Manual signature extraction: slow and enormous effort for each new worm. Automatic signature extraction of a specific worm by automatically detecting an abstract worm ISP Infected 1 Infected N New Victim Inactive Address Sumeet Singh, G. Varghese, C. Estan, S. Savage, OSDI 2004, more in next class
43
Abstract Worm Definition and Detection F1, Content Repetition: Payload of worm is seen frequently at router. Use heavy-hitter pattern with hash H of content as index. NetSift used large multistage filters. A variant of elephant traps invented by John Huber and Sumeet Singh seems to be the best solution for Hawkeye. F2, Increasing Infection Levels: Same content is disbursed to increasing number of distinct source-destination pairs. Use many flows pattern with content hash H as index.
44
Hashing Implementation Need a hash function, especially for content, that is easy to compute and random. NetSift used a Rabin hash function, but that requires multiplies. For Bloom Filters, can compute 1 large hash and take a portion of it per stage. Much nicer hash function using Galois multiplication (Xor and Shift).
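The "one large hash, one portion per stage" idea can be sketched as follows. BLAKE2 is used purely as a stand-in for a well-mixed 64-bit hash; it is not the Rabin or Galois hash mentioned on the slide, and the function name and bit widths are assumptions.

```python
import hashlib

def stage_indexes(key, num_stages=4, bits_per_index=14):
    """Compute one 64-bit hash and slice it into one index per stage."""
    if num_stages * bits_per_index > 64:
        raise ValueError("not enough hash bits for the requested stages")
    digest = hashlib.blake2b(key, digest_size=8).digest()   # stand-in 64-bit hash
    h = int.from_bytes(digest, "big")
    mask = (1 << bits_per_index) - 1
    # Stage i uses bits [i * bits_per_index, (i + 1) * bits_per_index) of the hash
    return [(h >> (i * bits_per_index)) & mask for i in range(num_stages)]

indexes = stage_indexes(b"example payload")
print(indexes)
```

Each stage then needs a table of 2^14 entries, and only one hash computation is paid per packet instead of one per stage.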
45
Conclusions Introspection/Pattern detection can be useful for the next generation of networks. Beyond faster-cheaper Can implement base patterns at high speeds. Base patterns can be combined to solve useful application issues (traffic matrix, DoS, worms, etc.) Only scratching surface: need to build a library of patterns.
46
Introspection at UCSD: Ramana Kompella, Cristian Estan, Sumeet Singh