Efficient Memory Utilization on Network Processors for Deep Packet Inspection
Piti Piyachon, Yan Luo
Electrical and Computer Engineering Department, University of Massachusetts Lowell
ANCS 2006, UMass Lowell
Our Contributions
  Study the parallelism of a pattern matching algorithm
  Propose the Bit-Byte Aho-Corasick Deterministic Finite Automaton (DFA)
  Construct a memory model to find the optimal settings that minimize the memory usage of the DFA
DPI and Pattern Matching
  Deep Packet Inspection
    – Inspect: packet header and payload
    – Detect: computer viruses, worms, spam, etc.
    – Network intrusion detection applications: Bro, Snort, etc.
  Pattern matching requirements
    1. Match multiple predefined patterns (keywords, or strings) at the same time
    2. Keywords can be any size
    3. Keywords can be anywhere in the payload of a packet
    4. Match at line speed
    5. Be flexible enough to accommodate new rule sets
Classical Aho-Corasick (AC) DFA: example 1
  A set of keywords
    – {he, her, him, his}
  [Figure: state diagram with the start state and accept states labeled]
  Failure edges back to state 1 are shown as dashed lines.
  Failure edges back to state 0 are not shown.
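
As a rough illustration of example 1, the sketch below builds an Aho-Corasick automaton for {he, her, him, his} and scans a short string. It uses explicit failure links rather than a fully expanded DFA table, and the names `build_ac` and `search` as well as the test string are assumptions made for this sketch, not part of the slides.

```python
from collections import deque

def build_ac(keywords):
    """Build an Aho-Corasick automaton: goto trie, failure links, outputs."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # keywords recognised on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: the failure link of a state is the longest proper
    # suffix of its path that is also a path from the start state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def search(text, automaton):
    """Return (end_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i, word) for word in sorted(out[state]))
    return hits

automaton = build_ac(["he", "her", "him", "his"])
hits = search("this is her hat", automaton)
print(len(automaton[0]), hits)
```

With this keyword set the trie has 7 states, and state 1 is the state reached after `h`, matching the numbering used on the slide.
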
Memory Matrix Model of AC DFA
  Snort (Dec’05): 2733 keywords
  256 next state pointers
    – width = 15 bits
  > 27,000 states
  keyword-ID width = 2733 bits
  27,000 x (2733 + 256 x 15) bits = 22 MB
  22 MB is too big for on-chip RAM
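
The memory-matrix arithmetic can be checked with a few lines of Python. This is a minimal sketch assuming one matrix row per state, each row holding a keyword-ID bit vector plus one pointer per input symbol; the function name `matrix_bits` is hypothetical, and the state count is rounded to exactly 27,000.

```python
def matrix_bits(states, keyword_id_bits, pointers, pointer_bits):
    """Bits in a memory matrix that stores one row per DFA state."""
    return states * (keyword_id_bits + pointers * pointer_bits)

states = 27_000  # assumption: the slide only says "> 27,000 states"
# Pointer width: enough bits to address every state.
pointer_bits = (states - 1).bit_length()

bits = matrix_bits(states, keyword_id_bits=2733, pointers=256,
                   pointer_bits=pointer_bits)
megabytes = bits / 8 / 1e6
print(f"pointer width = {pointer_bits} bits, total = {megabytes:.1f} MB")
```

The pointer width comes out as 15 bits and the total rounds to the 22 MB quoted on the slide.
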
Bit-AC DFA (Tan-Sherwood’s Bit-Split)
  Need 8 bit-DFAs
Memory Matrix of Bit-AC DFA
  Snort (Dec’05): 2733 keywords
  2 next state pointers
    – width = 9 bits
  361 states
  keyword-ID width = 16 bits
  1368 DFAs
  1368 x 361 x (16 + 2 x 9) bits = 2 MB
Bit-AC DFA Techniques
  Shrinking the width of keyword-ID
    – From 2733 to 16 bits
    – By dividing 2733 keywords into 171 subsets, each with at most 16 keywords
  Reducing next state pointers
    – From 256 to 2 pointers
    – By dividing each input byte into 8 single bits
    – Need 8 bit-DFAs
  Extra benefits
    – The number of states (per DFA) drops from ~27,000 to ~300.
    – The width of a next state pointer drops from 15 to 9 bits.
  Memory
    – Reduced from 22 MB to 2 MB
  The number of DFAs
    – With 171 subsets, each subset has 8 DFAs.
    – Total DFAs = 171 x 8 = 1,368
  What more can we do to reduce the memory usage?
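
To make the bit-split idea concrete, here is a sketch for one small keyword subset. Each of the 8 bit-DFAs is an Aho-Corasick DFA over {0, 1}, built directly from the keywords projected onto one bit position, and a keyword is reported only where all 8 partial-match vectors agree. Building from projected keywords (rather than converting a byte-level DFA), numbering bit 0 as the least significant bit, and the names `build_bit_dfa` and `bit_split_search` are assumptions of this sketch.

```python
from collections import deque

def build_bit_dfa(bit_strings):
    """Aho-Corasick DFA over {0, 1}.

    bit_strings[j] is keyword j projected onto one bit position.
    Returns (delta, masks): delta[state][bit] is the next state, and
    masks[state] has bit j set when keyword j's projection ends here.
    """
    trie, masks = [{}], [0]
    for j, bits in enumerate(bit_strings):
        s = 0
        for b in bits:
            if b not in trie[s]:
                trie.append({})
                masks.append(0)
                trie[s][b] = len(trie) - 1
            s = trie[s][b]
        masks[s] |= 1 << j
    delta = [[0, 0] for _ in trie]
    fail = [0] * len(trie)
    queue = deque()
    for b in (0, 1):
        if b in trie[0]:
            delta[0][b] = trie[0][b]
            queue.append(trie[0][b])
    while queue:  # breadth-first, so fail[s] is always finished before s
        s = queue.popleft()
        masks[s] |= masks[fail[s]]
        for b in (0, 1):
            if b in trie[s]:
                t = trie[s][b]
                fail[t] = delta[fail[s]][b]
                delta[s][b] = t
                queue.append(t)
            else:
                delta[s][b] = delta[fail[s]][b]
    return delta, masks

def bit_split_search(payload, keywords):
    """Run 8 bit-DFAs in lockstep and AND their partial-match vectors."""
    dfas = [build_bit_dfa([[(byte >> i) & 1 for byte in kw] for kw in keywords])
            for i in range(8)]
    states = [0] * 8
    hits = []
    for pos, byte in enumerate(payload):
        match = (1 << len(keywords)) - 1
        for i, (delta, masks) in enumerate(dfas):
            states[i] = delta[states[i]][(byte >> i) & 1]
            match &= masks[states[i]]
        hits.extend((pos, keywords[j]) for j in range(len(keywords))
                    if (match >> j) & 1)
    return hits

keywords = [b"he", b"her", b"him", b"his"]
hits = bit_split_search(b"this is her hat", keywords)
print(hits)
```

Each bit-DFA has only 2 next state pointers per state, and the keyword-ID vector is as wide as the subset (4 bits here, 16 bits in the slide's setting).
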
Classical AC DFA: example 2
  28 states
  Failure edges are not shown.
Byte-AC DFA
  Considering 4 bytes at a time
  4 DFAs
  < 9 states / DFA
  256 next state pointers!
  Similar to Dharmapurikar-Lockwood’s JACK DFA, ANCS’05
Bit-Byte-AC DFA
  4 bytes at a time
  Each byte is divided into bits.
  32 DFAs (= 4 x 8)
  < 6 states / DFA
  2 next state pointers
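
The following sketch shows one plausible way a keyword could be sliced into the 32 symbol sequences (4 byte offsets x 8 bits) from which the bit-byte DFAs are built. It assumes that byte offset j receives keyword bytes j, j+B, j+2B, ..., and that bit 0 is the least significant bit; the function name `bit_byte_slices` and the sample keyword are hypothetical, and alignment handling at search time is not covered.

```python
def bit_byte_slices(keyword, B=4, b=1):
    """Split a keyword into the symbol sequences for each bit-byte DFA.

    Returns a dict keyed by (byte_offset, bit_group). Byte offset j takes
    keyword bytes j, j+B, j+2B, ...; from each of those bytes the b-bit
    group starting at bit position bit_group * b is extracted
    (bit 0 = least significant bit, an assumed numbering).
    """
    groups = 8 // b
    mask = (1 << b) - 1
    slices = {}
    for j in range(B):
        for g in range(groups):
            slices[(j, g)] = [(byte >> (g * b)) & mask for byte in keyword[j::B]]
    return slices

slices = bit_byte_slices(b"memory")
print(len(slices))       # number of DFAs this keyword contributes to
print(slices[(0, 3)])    # bit 3 of byte offset 0: taken from 'm' and 'r'
```

Each sequence is much shorter than the keyword itself (at most 2 symbols for a 6-byte keyword with B = 4), which is consistent with the small per-DFA state counts on the slide.
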
Memory Matrix of Bit-Byte-AC DFA
  Snort (Dec’05): 2733 keywords
  4 bytes at a time
  < 36 states / DFA
  2 next state pointers
    – width = 6 bits
  keyword-ID width = 3 bits
  29,152 DFAs (= 911 x 32)
  29,152 x 36 x (3 + 2 x 6) bits = 1.9 MB
  1.9 MB is only a little better than 2 MB, because:
    – This is not an optimal setting.
    – Each DFA has a different number of states.
    – There is no need to provide the same size of memory matrix for every DFA.
Bit-Byte-AC DFA Techniques
  Keeping the width of keyword-ID as low as in the Bit-DFA
  Keeping the next state pointers as few as in the Bit-DFA
  Reducing states per DFA by
    – Skipping bytes
    – Exploiting more shared states than the Bit-DFA
  Results of reducing states per DFA
    – From ~27,000 to 36 states
    – The width of a next state pointer drops from 15 to 6 bits.
Construction of Bit-Byte AC DFA
  4 bytes (considered) at a time
  [Figure sequence: step-by-step construction; labeled example: bit 3 of byte 0]
  Failure edges are not shown.
  32 bit-byte DFAs need to be constructed.
Bit-Byte-DFA: Searching
  [Figure sequence: search walk-through]
  A failure edge is shown where necessary.
Bit-Byte-DFA: Searching
  Match => keyword ‘memory’
  A match is reported only when all 32 bit-DFAs find it on their own.
Find the optimal settings to minimize memory
  k = keywords per subset
    – The width of keyword-ID = k bits
    – k = 1, 2, 3, …, K
    – where K = the number of keywords in the whole set
    – Snort (Dec. 2005): K = 2733 keywords
  b = bit(s) extracted from each byte
    – b = 1, 2, 4, 8
    – # of next state pointers = 2^b
    – Example 2: b = 1
    – Beyond b = 8: > 256 next state pointers
  B = bytes considered at a time
    – B = 1, 2, 3, …
    – Example 2: B = 4
  Total memory (T) is a function of k, b, and B.
    – T = f(k, b, B)
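
The three parameters fix how many subsets, DFAs, and next state pointers there are. The sketch below reproduces the counts quoted on the earlier slides; the function name `dfa_counts` and the dictionary keys are assumptions for illustration.

```python
import math

def dfa_counts(K, k, b, B):
    """Counts implied by k keywords per subset, b bits per symbol, B bytes at a time."""
    subsets = math.ceil(K / k)
    dfas_per_subset = B * (8 // b)
    return {
        "subsets": subsets,
        "pointers_per_state": 2 ** b,
        "dfas_per_subset": dfas_per_subset,
        "total_dfas": subsets * dfas_per_subset,
    }

bit_ac = dfa_counts(K=2733, k=16, b=1, B=1)    # Bit-AC setting
bit_byte = dfa_counts(K=2733, k=3, b=1, B=4)   # Bit-Byte setting of example 2
print(bit_ac)
print(bit_byte)
```

The Bit-AC setting gives 171 subsets and 1,368 DFAs, and the Bit-Byte setting gives 911 subsets and 29,152 DFAs, in line with the memory-matrix slides.
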
T’s Formula
  Total memory of all bit-ACs in all subsets
  [Equation for T and its symbol definitions not recoverable]
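
The equation itself is not legible here. One form that is consistent with the memory-matrix arithmetic on the earlier slides, and that may differ from the authors' actual formula, is given below. The symbol s_{ij} (the number of states of the j-th DFA in subset i) and the per-DFA pointer width \lceil \log_2 s_{ij} \rceil are assumptions introduced for this sketch.

```latex
T(k, b, B) \;=\; \sum_{i=1}^{\lceil K/k \rceil} \; \sum_{j=1}^{8B/b}
    s_{ij} \,\bigl( k + 2^{b} \,\lceil \log_2 s_{ij} \rceil \bigr) \quad \text{bits}
```

If every DFA is given the same worst-case size S, this reduces to \lceil K/k \rceil \cdot (8B/b) \cdot S \cdot (k + 2^{b} \lceil \log_2 S \rceil), which reproduces 1368 x 361 x (16 + 2 x 9) for k = 16, b = 1, B = 1 and 29,152 x 36 x (3 + 2 x 6) for k = 3, b = 1, B = 4.
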
Find the optimal k
  [Plot; x-axis: keywords per subset]
  Each pair (b, B) has one optimal k that gives a minimal T.
  T_min at k = 12
Find the optimal b
  [Plot; x-axis: keywords per subset]
  Each setting of k, b, and B has a different optimal point.
    – Only the optimal settings are compared.
  b = 2 is the best.
Find the optimal B
  [Plot; x-axis: keywords per subset]
  b = 2
  T decreases as B increases.
    – Non-linearly
  For B > 16,
    – T begins to increase.
  B = 16 is the best for Snort (Dec’05).
Comparing with Existing Works
  [Plot; x-axis: keywords per subset]
  Tan-Sherwood’s, Brodie-Cytron-Taylor’s, and ours
  Our Bit-Byte DFA when B = 16
    – The optimal point is at b = 2 and k = 12
    – 272 KB
    – 14% of 2001 KB (Tan’s)
    – 4% of 6064 KB (Brodie’s)
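
The quoted ratios follow directly from the three memory figures; a quick check, using hypothetical variable names:

```python
ours_kb, tan_kb, brodie_kb = 272, 2001, 6064

vs_tan = 100 * ours_kb / tan_kb          # our memory as a percentage of Tan's
vs_brodie = 100 * ours_kb / brodie_kb    # our memory as a percentage of Brodie's
reduction_vs_tan = 100 - vs_tan          # memory saved relative to Tan's

print(f"{vs_tan:.1f}% of Tan's, {vs_brodie:.1f}% of Brodie's, "
      f"{reduction_vs_tan:.1f}% saved vs. Tan's")
```

Rounded to whole percentages these are 14%, 4%, and an 86% reduction relative to Tan's.
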
Comparing with Existing Works
  [Plot; x-axis: keywords per subset]
  Tan-Sherwood’s and ours, at B = 1
  Tan’s (on ASIC)
    – 2001 KB
    – k = 16 is not the optimal setting for B = 1.
    – Each bit-DFA uses the same storage capacity, sized to fit the largest one (worst case).
  Ours (on NP)
    – 396 KB < 2001 KB
    – k = 3 is the optimal setting for B = 1.
    – Each bit-DFA uses exactly the memory space needed to hold it.
Results with an NP Simulator
  NePSim2
    – An open source IXP24xx/28xx simulator
  NP architecture based on IXP2855
    – 16 MicroEngines (MEs)
    – 512 KB
    – 1.4 GHz
  Bit-Byte AC DFA: b = 2, B = 16, k = 12
    – T = 272 KB
    – 5 Gbps
Conclusion
  The Bit-Byte DFA model can reduce memory usage by up to 86%.
  An NP implementation uses on-chip memory more efficiently than an ASIC, without wasting space.
  An NP has the flexibility to accommodate
    – The optimal setting of k, b, and B
    – Different sizes of Bit-Byte DFA
    – New rule sets in the future, for which the optimal setting may change
  The performance (using an NP simulator) satisfies line speed, up to 5 Gbps throughput.
Thank you
Questions?