Published by Martin Armstrong. Modified over 9 years ago.
1
Space-Time Tradeoffs in Software-Based Deep Packet Inspection
Anat Bremler-Barr, Yotam Harchol ⋆, David Hay
IDC Herzliya, Israel; Hebrew University, Israel
IEEE HPSR 2011
Parts of this work were supported by European Research Council (ERC) Starting Grant no. 259085
⋆ Supported by the Check Point Institute for Information Security
2
Outline
- Motivation
- Background
- New Compression Techniques
- Experimental Results
- Conclusions
3
Network Intrusion Detection Systems
Classify packets according to:
- Header fields: source IP and port, destination IP and port, protocol, etc.
- Packet payload (data): Deep Packet Inspection
[Figure: IP packets arriving from the Internet]
4
Deep Packet Inspection: The Environment
- (D)RAM: high capacity, slow memory
- Cache memory: low capacity, fast memory, locality-based
5
Our Contributions
- Literature assumption: try to fit the data structure in cache, hence the efforts to compress the data structures
- Our paper: is it beneficial?
- In reality, even in a non-compressed implementation, most memory accesses go to the cache
- BUT one can attack the non-compressed implementation by reducing its locality, pushing it out of the cache and making it much slower
- How to mitigate this attack? Compress even further. Our new techniques use 60% less memory
6
Complexity DoS Attack
- Find a gap between the average case and the worst case
- Engineer input that exploits this gap
- Launch a Denial of Service attack on the system
[Figure: throughput under real-life traffic from the Internet]
8
Aho-Corasick Algorithm [Aho, Corasick; 1975]
- Build a Deterministic Finite Automaton (DFA)
- Traverse the DFA, byte by byte
- Accepting state: pattern found
Example pattern set: {E, BE, BD, BCD, CDBCAB, BCAA}
Input: BCDBCAB
States visited: s0, s2, s5, s6, s9, s10, s11, s12
[Figure: DFA for the example pattern set, states s0 to s14]
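The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`build_dfa`, `scan`) and the five-symbol alphabet "ABCDE" are assumptions made for the example, and states are numbered in insertion order, so they do not match the s0 to s14 numbering of the figure.

```python
from collections import deque

def build_dfa(patterns, alphabet):
    """Build a full Aho-Corasick DFA: one transition per (state, symbol)."""
    trie = [{}]            # trie[s][symbol] -> child state; state 0 is the root
    accepts = [set()]      # accepts[s] = patterns that end when s is reached
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in trie[s]:
                trie.append({})
                accepts.append(set())
                trie[s][ch] = len(trie) - 1
            s = trie[s][ch]
        accepts[s].add(p)

    fail = [0] * len(trie)         # longest proper suffix that is also a prefix
    delta = [{} for _ in trie]     # the complete transition function
    queue = deque()
    for ch in alphabet:
        child = trie[0].get(ch, 0)  # 0 means "no child": stay at the root
        delta[0][ch] = child
        if child:
            queue.append(child)
    while queue:                    # breadth-first: shallower states are completed first
        s = queue.popleft()
        accepts[s] |= accepts[fail[s]]
        for ch in alphabet:
            if ch in trie[s]:
                child = trie[s][ch]
                fail[child] = delta[fail[s]][ch]
                delta[s][ch] = child
                queue.append(child)
            else:
                delta[s][ch] = delta[fail[s]][ch]
    return delta, accepts

def scan(delta, accepts, data):
    """Traverse the DFA symbol by symbol; return (start offset, pattern) pairs.
    Every symbol of data must belong to the alphabet used to build the DFA."""
    s, matches = 0, []
    for i, ch in enumerate(data):
        s = delta[s][ch]            # exactly one table lookup per input symbol
        for p in sorted(accepts[s]):
            matches.append((i - len(p) + 1, p))
    return matches

PATTERNS = ["E", "BE", "BD", "BCD", "CDBCAB", "BCAA"]
delta, accepts = build_dfa(PATTERNS, "ABCDE")
matches = scan(delta, accepts, "BCDBCAB")
print(len(delta), matches)
```

On the slide's input BCDBCAB the scan reports BCD at offset 0 and CDBCAB at offset 1, using one lookup per symbol.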
9
Aho-Corasick Algorithm [Aho, Corasick; 1975]
Naïve implementation:
- Represent the transition function in a table of |Σ|×|S| entries
  - Σ: alphabet
  - S: set of states
- Lookup time: one memory access per input symbol
- Space: |Σ|×|S| table entries. In reality: 70MB to gigabytes…

State   A    B    C    D    E
s0      0    2    7    0    1
s1      0    2    7    0    1
s2      0    2    5    4    3
s3      0    2    7    0    1
s4      0    2    7    0    1
s5      13   2    7    6    1
s6      0    9    7    0    1
s7      0    2    7    8    1
s8      0    9    7    0    1
:
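To get a feel for the "70MB to gigabytes" figure, the table size can be computed directly. The sketch below assumes a 256-symbol alphabet (one row entry per byte value) and 4-byte table entries; both are assumptions for illustration, not values stated on the slide. The state counts are the ones reported for Snort and ClamAV in the Experimental Setup slide.

```python
def naive_table_bytes(num_states, alphabet_size=256, entry_bytes=4):
    """Size of a full |Σ| x |S| transition table.
    alphabet_size and entry_bytes are assumed values, not taken from the slides."""
    return num_states * alphabet_size * entry_bytes

snort_bytes = naive_table_bytes(77_182)     # states in the naïve Snort automaton
clamav_bytes = naive_table_bytes(745_303)   # states in the naïve ClamAV automaton

print(f"Snort:  {snort_bytes / 2**20:.1f} MiB")
print(f"ClamAV: {clamav_bytes / 2**20:.1f} MiB")
```

Under these assumptions both tables are orders of magnitude larger than the caches listed in the Experimental Setup slide, which is why locality of reference, rather than footprint alone, determines how the naïve implementation behaves.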
10
Potential Complexity DoS Attack
1. Exhaustive Traversal Adversarial Traffic
   - Traverses as many states of the automaton as possible
   - Bad locality: bad for the naïve implementation (will not utilize the cache)
[Figure: automaton for the example pattern set]
11
Alternative Implementation [Aho, Corasick; 1975]
- A failure transition goes to the state that matches the longest suffix of the input so far
- Lookup time: at most two memory accesses per input symbol (via amortized analysis)
- Space: at most the number of symbols in the pattern set; depends on implementation
[Figure: automaton for the example pattern set, showing forward transitions and failure transitions]
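The amortized bound can be observed with a small sketch of the failure-transition variant. The names below (`build_failure_automaton`, `scan_counting`) are hypothetical, and "transitions" is used here as a stand-in for memory accesses: every forward or failure step counts as one.

```python
from collections import deque

def build_failure_automaton(patterns):
    """Trie of forward transitions plus one failure pointer per state."""
    goto = [{}]        # goto[s][symbol] -> child state; state 0 is the root
    out = [set()]      # patterns recognized on entering a state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())     # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]          # inherit matches from the suffix state
        for ch, child in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            queue.append(child)
    return goto, fail, out

def scan_counting(goto, fail, out, data):
    """Scan data and count every forward or failure transition taken."""
    s, transitions, matches = 0, 0, []
    for i, ch in enumerate(data):
        while s and ch not in goto[s]:
            s = fail[s]                 # failure transition
            transitions += 1
        s = goto[s].get(ch, 0)          # forward transition (or stay at the root)
        transitions += 1
        matches.extend((i - len(p) + 1, p) for p in sorted(out[s]))
    return matches, transitions

PATTERNS = ["E", "BE", "BD", "BCD", "CDBCAB", "BCAA"]
goto, fail, out = build_failure_automaton(PATTERNS)
matches, transitions = scan_counting(goto, fail, out, "BCDBCAB")
print(matches, transitions, 2 * len("BCDBCAB"))
```

Each failure step moves to a shallower state and each forward step deepens the state by at most one, so the total number of transitions never exceeds twice the input length.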
12
Potential Complexity DoS Attack
1. Exhaustive Traversal Adversarial Traffic
   - Traverses as many states of the automaton as possible
   - Bad locality: bad for the naïve implementation (will not utilize the cache)
2. Failure-path Traversal Adversarial Traffic
   - Traverses as many failure transitions as possible
   - Bad for the failure-path-based automaton (as many memory accesses per input symbol as possible)
[Figure: automaton for the example pattern set]
13
Prior Work: Compress the State Representation
Three encodings of the same state (forward transitions on A and D):
- Lookup table: one entry per symbol (A, B, C, D, E); forward: 13 (on A), 6 (on D); failure: 7; match: False
- Bitmap encoded: bitmap of length |Σ| over (A, B, C, D, E) = 10010; forward: 13, 6; failure: 7; match: False. Bits can be counted using the popcnt instruction
- Linear encoded: symbols: A, D; forward: 13, 6; failure: 7; match: False; size: 2
[Figure: automaton for the example pattern set, with the encoded state highlighted]
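The bitmap and linear encodings can be illustrated on the slide's example state (forward transitions A → 13 and D → 6). This is a sketch under stated assumptions: the names `encode_bitmap`, `lookup_bitmap`, and `lookup_linear` are hypothetical, the alphabet is the five-symbol example alphabet, and Python's bit counting stands in for the hardware popcnt instruction.

```python
ALPHABET = "ABCDE"   # example alphabet from the slides

def encode_bitmap(forward):
    """Pack forward transitions as (bitmap, pointers in alphabet order)."""
    bitmap, packed = 0, []
    for i, sym in enumerate(ALPHABET):
        if sym in forward:
            bitmap |= 1 << i            # bit i set: symbol i has a forward transition
            packed.append(forward[sym])
    return bitmap, packed

def lookup_bitmap(bitmap, packed, sym):
    """Return the forward target for sym, or None if a failure transition is needed."""
    i = ALPHABET.index(sym)
    if not (bitmap >> i) & 1:
        return None
    # Rank = number of set bits below position i (popcnt in a native implementation).
    rank = bin(bitmap & ((1 << i) - 1)).count("1")
    return packed[rank]

def lookup_linear(symbols, packed, sym):
    """Linear encoding: scan the short list of symbols that have transitions."""
    for s, target in zip(symbols, packed):
        if s == sym:
            return target
    return None

bitmap, packed = encode_bitmap({"A": 13, "D": 6})
print(format(bitmap, "05b")[::-1], packed)   # bits printed in A..E order
print(lookup_bitmap(bitmap, packed, "D"), lookup_linear("AD", packed, "D"))
```

The bitmap form costs |Σ| bits per state regardless of fan-out, while the linear form costs one symbol per transition, so which one is smaller depends on how many forward transitions a state has.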
15
Path Compression
- One-way branches can be represented using a single state, similarly to PATRICIA tries
- Problem: incoming failure transitions
- Solution: compress only states with no incoming failure transitions
[Figure: the example automaton before and after path compression; the one-way branch s9 to s12 is merged into a single state s9' reached on BCAB]
Space relative to Tuck et al. (2004): Tuck et al. 100%, our path compression 75%
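The eligibility condition (no incoming failure transitions) can be checked mechanically. The sketch below only computes which states of the example automaton satisfy that condition; it does not perform the merge itself, and the function names are hypothetical.

```python
from collections import deque

def build_trie_with_failures(patterns):
    """Return forward transitions, the string each state spells, and failure pointers."""
    children, label = [{}], [""]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in children[s]:
                children.append({})
                label.append(label[s] + ch)
                children[s][ch] = len(children) - 1
            s = children[s][ch]
    fail = [0] * len(children)
    queue = deque(children[0].values())     # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, child in children[s].items():
            f = fail[s]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            queue.append(child)
    return children, label, fail

def states_without_incoming_failures(children, label, fail):
    """Labels of non-root states that no failure pointer targets."""
    targets = set(fail[1:])
    return sorted(label[s] for s in range(1, len(children)) if s not in targets)

PATTERNS = ["E", "BE", "BD", "BCD", "CDBCAB", "BCAA"]
children, label, fail = build_trie_with_failures(PATTERNS)
eligible = states_without_incoming_failures(children, label, fail)
print(eligible)
```

For the example pattern set, the consecutive eligible states CDB, CDBC, CDBCA, and CDBCAB form the one-way branch that the slide merges into a single state.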
16
Pointer Compression
In the Snort IDS pattern-set, 79% of the fail pointers point to states at depths 0, 1, 2.

Depth     Pointers
0 (s0)    13%
1         31%
2         35%
≥ 3       21%

Add two bits to encode the depth of the pointer:
- 00: depth 0
- 01: depth 1
- 10: depth 2
- 11: depth 3 and deeper

Pointer layout:
- Depth ≤ 2: the 16-bit pointer is replaced by 2 bits
- Depth > 2: the 16-bit pointer becomes 2 bits (11) followed by the 16-bit pointer
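A back-of-the-envelope calculation shows why the two-bit tag pays off. The sketch below uses only the figures on this slide (the Snort depth distribution and the 16-bit original pointer) and deliberately ignores the space taken by the shared depth-1 and depth-2 tables, so it is an estimate of the per-pointer cost rather than a measured result.

```python
# Share of fail pointers per target depth in Snort (key 3 stands for depth >= 3).
DEPTH_SHARE = {0: 0.13, 1: 0.31, 2: 0.35, 3: 0.21}
ORIGINAL_BITS = 16   # pointer width shown on the slide
TAG_BITS = 2         # depth tag added to every pointer

def pointer_bits(depth):
    """Bits stored for one fail pointer after compression."""
    return TAG_BITS if depth <= 2 else TAG_BITS + ORIGINAL_BITS

average_bits = sum(share * pointer_bits(depth) for depth, share in DEPTH_SHARE.items())
print(f"average: {average_bits:.2f} bits per pointer, versus {ORIGINAL_BITS} uncompressed")
```

Because 79% of the pointers shrink to 2 bits and only 21% grow to 18 bits, the average stored width falls well below the original 16 bits.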
17
Pointer Compression
Determine the next state from the pointer depth:
- 0: go to the root
- 1: use a lookup table indexed by the last symbol
- 2: use a hash table keyed by the last two symbols
- ≥ 3: use the stored pointer

Depth 1 lookup table:
Symbol   State
A        -
B        s2
C        s7
D        -
E        s1

Depth 2 hash table: last 2 symbols → next state

Space relative to Tuck et al. (2004): Tuck et al. 100%, our path compression 75%, pointer compression 41%
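The decoding rule can be sketched as follows. State numbers follow the slides' example automaton (s0 is the root, s2 = B, s7 = C, s1 = E); the depth-2 entries are read off the naïve transition table shown earlier. The function names and the use of a Python dict in place of the hash table are assumptions for illustration.

```python
# Depth-1 states, indexed by the last input symbol (from the slide's lookup table).
DEPTH1_TABLE = {"E": 1, "B": 2, "C": 7}
# Depth-2 states, keyed by the last two input symbols (a dict stands in for the hash table).
DEPTH2_TABLE = {"BE": 3, "BD": 4, "BC": 5, "CD": 8}

def compress(target, target_depth):
    """Encode a fail pointer as (2-bit tag, stored pointer or None)."""
    if target_depth <= 2:
        return (target_depth, None)     # tags 00, 01, 10: no pointer is stored
    return (3, target)                  # tag 11: keep the original pointer

def resolve(pointer, last_two_symbols):
    """Recover the fail target from the tag and the most recent input symbols."""
    tag, stored = pointer
    if tag == 0:
        return 0                                    # go to the root
    if tag == 1:
        return DEPTH1_TABLE[last_two_symbols[-1]]   # last symbol picks the depth-1 state
    if tag == 2:
        return DEPTH2_TABLE[last_two_symbols]       # last two symbols pick the depth-2 state
    return stored                                   # depth 3 and deeper

# Fail pointers along the branch C, D, B, C, A of the example automaton.
print(resolve(compress(0, 0), "CD"))    # after "CD"
print(resolve(compress(2, 1), "DB"))    # after "CDB"
print(resolve(compress(5, 2), "BC"))    # after "CDBC"
print(resolve(compress(13, 3), "CA"))   # after "CDBCA"
```

The scheme works because a fail target at depth 1 or 2 is fully determined by the last one or two symbols read, so the pointer itself carries no extra information.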
19
Experimental Setup

Test systems:
            System 1                        System 2
Type        MacBook Pro                     iMac
CPU         Core 2 Duo 2.53GHz dual core    Core i7 2.93GHz quad core
L1 Cache    16KB (data, per core)           16KB (data, per core)
L2 Cache    3MB (shared)                    256KB (per core)
L3 Cache    -                               8MB (shared)

Pattern-sets:
                                  Snort     ClamAV*
Patterns                          31,094    16,710
States in naïve implementation    77,182    745,303

Real-life traffic logs taken from MIT DARPA
* We used only half of the ClamAV signatures for our tests
20
Space Requirement
[Figure: memory footprint in MB; values shown on the chart: 722.14, 1.5, 2.59]
21
Memory Accesses per Input Symbol
[Figure: memory accesses per input symbol]
22
L1 Data Cache Miss Rate
Intel Core 2 Duo (2 cores), 16KB L1 data cache, 3MB L2 cache
[Figure: L1 data cache miss rate]
23
L2 Cache Miss Rate
Intel Core 2 Duo (2 cores), 16KB L1 data cache, 3MB L2 cache
- Real-life traffic: 0.7% L2 cache miss rate
- Adversarial traffic: 23% L2 cache miss rate
- Maximal L2 miss rate: 0.06%
[Figure: L2 cache miss rate]
24
Space vs. Time
[Figure: space vs. time for our implementation and the naïve implementation; annotated drop: -86%]
26
Conclusions
It is crucial to model the cache in software-based Deep Packet Inspection:
- The naïve Aho-Corasick implementation has a huge memory footprint, but works well on real-life traffic due to locality of reference
- The naïve implementation can be easily attacked, making it 7 times slower, even though it has a constant number of memory accesses
We also show new compression techniques:
- 60% less memory than the best prior-art compression
- Stable throughput, better performance under attacks
27
Questions? Thank you!
28
Our Contributions
We analyze the Aho-Corasick algorithm, the standard for exact string matching in Deep Packet Inspection:
- Single memory access per input symbol
- Throughput is independent of the input
We suggest:
- The Aho-Corasick algorithm does not run in constant time (throughput depends on the input!)
- A complexity attack on Aho-Corasick that exploits the cache-RAM architecture
- Several new compression techniques: 60% less memory
29
Our Contributions
- Literature: compress data structures to fit in cache
  - Can data structures fit in cache?
  - Is it always beneficial?
- In reality, even in a non-compressed implementation, most memory accesses are to the cache
- One can attack the non-compressed implementation by reducing locality to get it out of cache
- Several new compression techniques: 60% less memory
30
Path Compression
Tuck et al., 2004 (hardware solution):
- Compress one-way branches of some fixed length (e.g. 4) into a single transition
- Add a skip counter to each failure pointer
In software, we can compress one-way branches of any length:
- Problem: unbounded skip counter width
- Solution: compress only states which have no incoming failure transition
Results:
- 85% fewer states
- About 25% space reduction on real-life pattern-sets
[Figure: the example automaton, its fixed-length compression with skip counters, and our path compression]
31
Leaves Compression
- By definition, leaves have no forward transitions
- Their single purpose is to indicate a match
  - We can push this indication up by adding a bit to each pointer
  - Then leaves can be eliminated from the automaton by copying their failure transition up
- 3% more space reduction
- Reduces the number of transitions taken
[Figure: the path-compressed automaton before and after leaves compression; pointers that indicate a match are marked with *]
32
Pointer Compression
In the Snort IDS pattern-set, 79% of the fail pointers point to states at depths 0, 1, 2. If we compact the representation of these pointers, we can significantly reduce the memory footprint.

Depth     Pointers
0 (s0)    13%
1         31%
2         35%

Variable-size pointers:
- Depth 0: go to s0
- Depth 1: use the last symbol to find the depth 1 state, using a global lookup table of |Σ| entries that links the last symbol to the next state
- Depth 2: use the last two symbols to find the depth 2 state. We cannot use a global table of |Σ|×|Σ| entries, because it is too big. Instead, use a global hash table to map the last two chars to the next state (of depth 2). (In Snort, 1524 hash entries replace 26784 pointers)
- Depth > 2 (only 21% of the pointers in Snort): original pointer width, log2 |S|
45% more space reduction over path compression!
[Figure: bit string 011001011010111101001011 split into variable-size pointers]
33
Pointer Compression
In the Snort IDS pattern-set, 79% of the fail pointers point to states at depths 0, 1, 2.

Depth     Pointers
0 (s0)    13%
1         31%
2         35%

Example: 4 fail pointers of the path-compressed state s9':
Symbol   Target   Depth   Original pointer   Compressed
B        s0       0       0000               00
C        s2       1       0010               01
A        s5       2       0101               10
B        s13      3       1101               11 1101

Depth 1 lookup table:
Symbol   State
A        -
B        s2
C        s7
D        -
E        s1

Depth 2 hash table: last 2 symbols → next state

In Snort:
- 9,901 pointers to s0 are compressed to two bits
- Depth 1 lookup table: 256 pointers compress 23,745 pointers
- Depth 2 hash table: 1,524 hash entries compress 26,748 pointers
- Only 21% of the pointers are widened by two bits

Space relative to Tuck et al. (2004): Tuck et al. 100%, our path compression 75%, pointer compression 41%
34
Traffic Types
- Real-life traffic logs
  - Taken from MIT DARPA
- Exhaustive Traversal Adversarial Traffic
  - Traverses as many states of the automaton as possible
  - Bad locality: bad for the naïve automaton
- Failure-path Traversal Adversarial Traffic
  - Traverses as many failure transitions as possible
  - Bad for the failure-path-based automaton
[Figure: automaton for the example pattern set]
35
The impact of O(1) lookup complexity
[Figure: comparison of Linear Encoding (Compressed), Bitmap Encoding, Lookup Table, and Naïve]
36
Throughput - Snort
Intel Core 2 Duo, 16KB L1 data cache, 3MB L2 cache, two threads
[Figure: throughput under good-locality traffic and adversarial traffic; annotated drops: -88% and -30%]
37
Throughput - ClamAV
Intel Core 2 Duo, 16KB L1 data cache, 3MB L2 cache, two threads
[Figure: throughput; annotated drops: -73% and -50%]
Not scalable: 40% slower than Snort
38
Throughput
Intel Core 2 Duo, 16KB L1 data cache, 3MB L2 cache
[Figure: two bar charts (Snort and ClamAV) of throughput in Mbps, axis from 0 to 1400, for Linear Encoding (Compressed), Bitmap Encoding, Lookup Table, Linear Encoding (Non-Compressed), and Naïve, under good-locality and adversarial traffic; annotated drops: -88% and -30% (Snort), -73% and -50% (ClamAV)]
39
Cache
Intel Core 2 Duo (2 cores), 16KB L1 data cache, 3MB L2 cache
- Real-life traffic: 0.7% L2 cache miss rate
- Adversarial traffic: 23% L2 cache miss rate
- Maximal L2 miss rate: 0.06%
[Figure: cache miss rates for Linear Encoding (Compressed), Bitmap Encoding, Lookup Table, and Naïve]
40
Is it all because of the cache?
Intel Core i7 (4 cores), 16KB L1 data cache, 256KB L2 cache per core, 8MB shared L3 cache
- The naïve implementation achieves much higher throughput
- Still, adversarial traffic drops its throughput by 56%
- CPU cores only work on pattern matching. What if they had some more tasks?