Data plane algorithms in routers From prefix lookup to deep packet inspection Cristian Estan, University of Wisconsin-Madison
What is the data plane?
- The part of the router handling the traffic
- Data plane algorithms applied to every packet
  - Successive packets typically treated independently
  - Example: deciding on which link to send a packet
- Throughput, defined as number of packets or bytes handled per second, is very important
  - “Line speed”: keeping up with the rate at which traffic can be transmitted over the wire or fiber
  - Example: 10 Gbps router has 32 ns to handle a 40-byte packet
- Memory usage limited by technology and costs
  - Can afford at most tens of megabits of fast on-chip memory
DIMACS Tutorial on Algorithms for Next Generation Networks August 6-8 2007
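The 32 ns figure follows directly from the link rate and the packet size. A minimal sketch of the arithmetic; the function name `packet_time_ns` is an illustrative choice, not something from the slides:

```python
def packet_time_ns(link_rate_bps: float, packet_bytes: int) -> float:
    """Time needed to transmit one packet at the given link rate, in nanoseconds."""
    return packet_bytes * 8 / link_rate_bps * 1e9


# A 40-byte packet is 320 bits; at 10 Gbps that leaves about 32 ns per packet.
budget = packet_time_ns(10e9, 40)
print(f"{budget:.0f} ns per 40-byte packet at 10 Gbps")
```

Every memory access the lookup algorithm makes has to fit inside this per-packet budget, which is why the later slides count memory accesses rather than instructions.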
A generic data plane problem
- Router has many directives composed of a guard and an associated action (all guards distinct)
- There is a simple procedure for testing how well a guard matches a packet
- For each packet, find the guard that matches “best” and take the associated action
- Example: routing table lookup
  - Each guard is an IP prefix (between 0 and 32 bits)
  - Matching procedure: is the guard a prefix of the 32-bit destination IP address?
  - “Best” defined as longest matching prefix
The rules of the game
- Matching against all guards in sequence is too slow
- We build a data structure that captures the semantics of all guards and use it for matching
- Primary metrics
  - How fast the matching algorithm is
  - How much memory the data structure needs
  - Time to build data structure also has some importance
- We can cheat (but we won’t today) by:
  - Using binary or ternary content-addressable memories
  - Using other forms of hardware support
Measuring “algorithm complexity”
- Execution cost measured in number of memory accesses to read data structure
  - Actual data manipulation operations typically very simple
  - On some platforms we can read wide words
- Worst case performance most important
  - Worst case defined with respect to input, not guards
  - Caching has been proven ineffective for many settings
  - Using algorithms with good amortized complexity but bad worst case requires large buffers
Overview
- Longest matching prefix
  - Trie-based algorithms
    - Uni-bit and multi-bit tries (fixed stride and variable stride)
    - Leaf pushing
    - Bitmap compression of multi-bit trie nodes
    - Tree bitmap representation for multi-bit trie nodes
  - Binary search on ranges
  - Binary search on prefix lengths
- Classification on multiple fields
- Signature matching
Longest matching prefix
- Used in routing table lookup (a.k.a. forwarding) for finding the link on which to send a packet
- Guard: a bit string of 0 to w bits called IP prefix
- Action: a single byte interface identifier
- Input: a w-bit string representing the destination IP address of the packet (w is 32 for IPv4, 128 for IPv6)
- Output: the interface associated with the longest guard matching the input
- Size of problem: hundreds of thousands of prefixes
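As a reference point for the faster structures, the problem statement can be written as a linear scan over all guards, which is exactly the approach that is too slow at line speed. This is only a sketch: addresses and prefixes are modelled as strings of '0' and '1', the table is the nine-prefix example routing table from the trie slide, and the labels P1 to P9 stand in for interface identifiers.

```python
# Example routing table from the trie slide; labels stand in for actions.
ROUTES = {
    "0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
    "101": "P6", "110": "P7", "11001": "P8", "111": "P9",
}


def longest_matching_prefix(routes, addr_bits):
    """Return the action of the longest prefix matching addr_bits, or None."""
    best_prefix, best_action = None, None
    for prefix, action in routes.items():
        if addr_bits.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix, best_action = prefix, action
    return best_action


print(longest_matching_prefix(ROUTES, "11000010"))  # P7
```

The scan touches every guard for every packet, so its cost grows with the table size; the structures on the following slides aim for a cost that depends on w instead.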
Tries for longest matching prefix
(figure: the example routing table drawn as a uni-bit trie, a multi-bit trie with fixed stride, the same trie after leaf pushing, and a multi-bit trie with variable stride)
Routing table
  P1 0*
  P2 1*
  P3 100*
  P4 1000*
  P5 100000*
  P6 101*
  P7 110*
  P8 11001*
  P9 111*
Controlled prefix expansion with stride 3
  P1 000* 001* 010* 011*
  P2 100* 101* 110* 111* (all overridden by the longer prefixes P3, P6, P7, P9)
  P3 100*
  P4 100001* 100010* 100011*
  P5 100000*
  P6 101*
  P7 110*
  P8 110010* 110011*
  P9 111*
- Uni-bit trie
- Multi-bit trie with fixed stride
- Leaf pushing reduces memory usage but increases update time
- Multi-bit trie with variable stride
  - Given a maximum trie height h and a routing table of size n, dynamic programming algorithm computes optimal variable stride trie in O(n w^2 h)
- Example lookup: input 11000010, longest matching prefix P7
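The uni-bit trie and controlled prefix expansion can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation from the cited papers: prefixes are bit strings, `Node`, `build_unibit_trie`, `lookup`, and `expand` are hypothetical names, and a real router would use packed arrays rather than Python objects.

```python
ROUTES = {
    "0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
    "101": "P6", "110": "P7", "11001": "P8", "111": "P9",
}


class Node:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.action = None            # set if a prefix ends at this node


def build_unibit_trie(routes):
    root = Node()
    for prefix, action in routes.items():
        node = root
        for bit in prefix:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        node.action = action
    return root


def lookup(root, addr_bits):
    """Walk one bit at a time, remembering the last prefix seen on the path."""
    node, best = root, None
    for bit in addr_bits:
        if node.action is not None:
            best = node.action
        node = node.children[int(bit)]
        if node is None:
            return best
    return node.action if node.action is not None else best


def expand(routes, stride):
    """Controlled prefix expansion: pad every prefix to a multiple of stride.

    When two expanded prefixes collide, the one derived from the longer
    original prefix wins.
    """
    expanded = {}  # expanded prefix -> (original length, action)
    for prefix, action in routes.items():
        target = max(stride, -(-len(prefix) // stride) * stride)  # round up
        pad = target - len(prefix)
        for i in range(2 ** pad):
            suffix = format(i, "b").zfill(pad) if pad else ""
            key = prefix + suffix
            if key not in expanded or expanded[key][0] < len(prefix):
                expanded[key] = (len(prefix), action)
    return {key: action for key, (_, action) in expanded.items()}


trie = build_unibit_trie(ROUTES)
print(lookup(trie, "11000010"))          # P7
print(sorted(expand(ROUTES, 3).items()))
```

With stride 3 the expansion yields 14 entries of length 3 or 6, matching the expanded table on the slide; a fixed-stride trie built from them inspects 3 bits per memory access instead of 1.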
Compressing multi-bit trie nodes
(figure: a node of the example trie shown as a Lulea compressed node, as a bitmap supporting fast counting, and as a tree bitmap)
- Lulea bitmap compression
  - Repeating entries are stored only once in the compressed array.
  - An auxiliary bitmap is needed to find the right entry in the compressed node. It stores a 0 for positions that do not differ from the previous one.
  - Example node: 000 P1, 101 P5, 111 P9 (other positions repeat the previous entry); compressed array P1, P5, P9
- Bitmap supporting fast counting
  - When the compression bitmaps are large it is expensive to count bits during lookup.
  - The bitmap is divided into chunks and a pre-computed auxiliary array stores the number of bits set before each chunk.
  - The lookup algorithm needs to count only bits set within one chunk.
  - Example: 32-position bitmap (00000 to 11111) in four chunks; auxiliary array entries 01: 4, 10: 8, 11: 13; count for the example input is 13+0=13
- Representing node as tree bitmap
  - Pointers to children and prefixes are stored in separate structures.
  - Prefixes of all lengths are stored, thus leaf pushing is not needed and update is fast.
  - Bitmaps have 1s corresponding to entries that are not empty.
- Example lookup: input 11001010, longest matching prefix P7
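The compressed-node lookup with chunked bit counting can be sketched as follows. The class `CompressedNode` and its field names are hypothetical, the bitmap is kept as a Python list rather than a machine word, and the example node (five P1 entries, two P5 entries, one P9 entry) is one reading of the figure rather than something the slide states outright.

```python
class CompressedNode:
    def __init__(self, entries, chunk_bits=4):
        self.chunk_bits = chunk_bits
        self.bitmap = []   # 1 where an entry differs from the previous one
        self.values = []   # each run of repeated entries stored once
        for i, entry in enumerate(entries):
            is_new = i == 0 or entry != entries[i - 1]
            self.bitmap.append(1 if is_new else 0)
            if is_new:
                self.values.append(entry)
        # Auxiliary array: number of bits set before each chunk.
        self.ones_before = []
        total = 0
        for start in range(0, len(self.bitmap), chunk_bits):
            self.ones_before.append(total)
            total += sum(self.bitmap[start:start + chunk_bits])

    def lookup(self, index):
        """Return the entry at position index of the uncompressed node."""
        chunk = index // self.chunk_bits
        start = chunk * self.chunk_bits
        # Count only the bits inside one chunk, up to and including index.
        ones = self.ones_before[chunk] + sum(self.bitmap[start:index + 1])
        return self.values[ones - 1]


entries = ["P1"] * 5 + ["P5"] * 2 + ["P9"]
node = CompressedNode(entries)
print(node.bitmap, node.values, node.ones_before)
print(node.lookup(0b110))  # P5
```

In hardware or optimized software the per-chunk count would be a single population-count operation on a word, so a lookup costs one read of the auxiliary array, one read of the chunk, and one read of the compressed array.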
Binary search on ranges
- Divide w-bit address space into maximal contiguous ranges covered by the same prefix
- Build array or balanced (binary) search tree with boundaries of ranges
- At lookup time perform O(log(n)) search
- Not better than multi-bit tries with compression, but it is not covered by patents
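A sketch of the range-based lookup using a sorted array and Python's standard `bisect` module. The helper names `build_ranges` and `lookup_range` are illustrative, the table is the example routing table from the trie slide with w shortened to 8 bits, and the build step uses a simple linear scan because only the lookup path is performance critical.

```python
from bisect import bisect_right

ROUTES = {
    "0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
    "101": "P6", "110": "P7", "11001": "P8", "111": "P9",
}


def build_ranges(routes, w):
    """Return (starts, actions): range i begins at starts[i] and maps to actions[i]."""
    points = {0}
    for prefix in routes:
        low = int(prefix.ljust(w, "0"), 2)
        high = int(prefix.ljust(w, "1"), 2) + 1
        points.add(low)
        if high < 2 ** w:
            points.add(high)
    starts, actions = [], []
    for start in sorted(points):
        bits = format(start, f"0{w}b")
        match = max((p for p in routes if bits.startswith(p)), key=len, default=None)
        action = routes.get(match)
        if not actions or actions[-1] != action:  # merge equal neighbours
            starts.append(start)
            actions.append(action)
    return starts, actions


def lookup_range(starts, actions, addr):
    """Binary search for the last range boundary that is <= addr."""
    return actions[bisect_right(starts, addr) - 1]


starts, actions = build_ranges(ROUTES, 8)
print(lookup_range(starts, actions, 0b11000010))  # P7
```

Each prefix contributes at most two boundaries, so the array has at most about 2n entries and a lookup takes O(log(n)) comparisons.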
Binary search on prefix lengths
- Core idea: for each prefix length represented in the routing table, have a hash table with the prefixes
- Can find longest matching prefix after looking up in each hash table the prefix of the address with corresponding length
- Binary search on prefix lengths is faster
  - Simple but wrong algorithm: if you find prefix at length x, store it as best match and look for longer matching prefixes; otherwise look for shorter prefixes
  - Problem: what if there is both a shorter and a longer prefix, but no prefix at length x?
  - Solution: insert marker at length x when there are longer prefixes. Must store with marker longest matching shorter prefix.
  - Markers lead to moderate increase in memory usage.
- Promising algorithm for IPv6 (w=128)
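The marker scheme can be sketched with one Python dict per prefix length. This is a simplified illustration, not the full algorithm from the Waldvogel et al. paper: `build_tables` and `lookup` are hypothetical names, markers are inserted only at the lengths the binary search visits on its way to each prefix, and the best match stored with each entry is precomputed by a linear scan.

```python
ROUTES = {
    "0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
    "101": "P6", "110": "P7", "11001": "P8", "111": "P9",
}


def build_tables(routes):
    lengths = sorted({len(p) for p in routes})
    tables = {length: {} for length in lengths}
    for prefix in routes:
        target = lengths.index(len(prefix))
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:  # replay the binary search that must reach this prefix
            mid = (lo + hi) // 2
            length = lengths[mid]
            if mid < target:
                tables[length].setdefault(prefix[:length], None)  # marker
                lo = mid + 1
            elif mid > target:
                hi = mid - 1
            else:
                tables[length][prefix] = None
                break
    # Store with every entry (prefix or marker) the best match for its bits.
    for table in tables.values():
        for key in list(table):
            match = max((p for p in routes if key.startswith(p)), key=len, default=None)
            table[key] = routes.get(match)
    return lengths, tables


def lookup(lengths, tables, addr_bits):
    best = None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        length = lengths[mid]
        key = addr_bits[:length]
        if key in tables[length]:
            if tables[length][key] is not None:
                best = tables[length][key]
            lo = mid + 1   # entry found: longer prefixes may exist
        else:
            hi = mid - 1   # nothing here: only shorter prefixes can match
    return best


lengths, tables = build_tables(ROUTES)
print(lookup(lengths, tables, "11000010"))  # P7
```

For the input 11000010 the first probe at length 4 hits the marker 1100, which was inserted because of P8 (11001*) and carries P7 as its stored best match; the probe at length 5 then misses, so the answer P7 comes from the marker rather than from a real prefix of length 4.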
Papers on longest matching prefix
- G. Varghese, “Network algorithmics: an interdisciplinary approach to designing fast networked devices”, chapter 11, Morgan Kaufmann, 2005
- V. Srinivasan, G. Varghese, “Faster IP lookups using controlled prefix expansion”, ACM Trans. on Comp. Sys., Feb. 1999
- M. Degermark, A. Brodnik, S. Carlsson, S. Pink, “Small forwarding tables for fast routing lookups”, ACM SIGCOMM, 1997
- W. Eatherton, Z. Dittia, G. Varghese, “Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates”, http://www-cse.ucsd.edu/~varghese/PAPERS/willpaper.pdf
- B. Lampson, V. Srinivasan, G. Varghese, “IP lookups using multiway and multicolumn search”, IEEE Infocom, 1998
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, “Scalable high-speed IP lookups”, ACM Trans. on Comp. Sys., Nov. 2001
Overview
- Longest matching prefix
- Classification on multiple fields
  - Solution for two-dimensional case: grid of tries
  - Bit vector linear search
  - Cross-producting
  - Decision tree approaches
- Signature matching
Packet classification problem
- Required for security and for recognizing packets with quality of service requirements
- Guard: prefixes or ranges for k header fields
  - Typically source and destination prefix, source and destination port range, and exact value or * for protocol
  - All fields must match for rule to apply
- Action: drop, forward, map to a certain traffic class
- Input: a tuple with the values of the k header fields
- Output: the action associated with the first rule that matches the packet (rules are strictly ordered)
- Size of problem: thousands of classification rules
Example of classification rule set
(figure: network diagram with a router that filters traffic, external time server TO, mail gateway M, Internet, Net, internal time server TI, secondary name server S)
  Destination IP  Source IP  Dest Port  Src Port  Protocol  Action
  M               *          25         *         *         R1
  M               *          53         *         UDP       R2
  M               S          53         *         *         R3
  M               *          23         *         *         R4
  TI              TO         123        123       UDP       R5
  *               Net        *          *         *         R6
  Net             *          *          *         TCP/ACK   R7
  *               *          *          *         *         R8
A geometric view of packet classification
(figure: rules such as R2 drawn as regions in a plane whose axes are the source address space and the destination address space)
- In theory number of regions defined can be much larger than number of rules
- Any algorithm that guarantees O(n) space for all rule sets of size n needs O(log(n)^(k-1)) time for classification
The two dimensional case: source and destination IP addresses
- For each destination prefix in rule set, link to corresponding node in destination IP trie a trie with source prefixes of rules using this destination prefix
  - Matching algorithm must use backtracking to visit all source tries
- Grid of tries: by pre-computing “switch pointers” in destination tries and propagating some information about more general rules, matching may proceed without backtracking
  - Memory used proportional to number of rules
  - Matching time O(w) with constant depending on stride
- Extended grid of tries handles 5 fields and has good run time and memory in practice
Bit vector approaches do linear search through rule set
(figure: the example rule table with rules R1 to R8, with one bitmap per value recognized in each field)
Per-field bitmaps (leftmost bit is R1, rightmost bit is R8)
  Dest IP:    M 11110111, TI 00001111, Net 00000111, * 00000101
  Source IP:  S 11110011, TO 11011011, Net 11010111, * 11010011
  Dest Port:  25 10000111, 53 01100111, 23 00010111, 123 00001111, * 00000111
  Src Port:   123 11111111, * 11110111
  Proto:      UDP 11111101, TCP 10110111, * 10110101
Example: 00000101 AND 11010011 AND 00000111 AND 11110111 AND 10110111 = 00000001, so the first matching rule is R8
- For each field we pre-compute a structure (e.g. trie) to find most specific prefix or range distinguished by rule set
- For each rule, a single bit represents whether a given most specific prefix matches rule or not
- We associate with each prefix or range a bitmap of size n encoding which of the rules may match a packet in that prefix or range
- Classification algorithm first computes for each field of the packet the most specific prefix/range it belongs to
- By then AND-ing together the k bitmaps of size n we find matching rules
- Works well for hardware solutions that allow wide memory reads
- Scales poorly to large rule sets
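The AND step can be reproduced with the bitmaps from the slide. In this sketch the per-field lookups are assumed to have been done already, so `classify` (a hypothetical name) receives the most specific label for each field; the field keys such as `dst_ip` are also illustrative.

```python
N_RULES = 8

# Bitmaps copied from the slide; the leftmost bit stands for R1.
BITMAPS = {
    "dst_ip": {"M": 0b11110111, "TI": 0b00001111, "Net": 0b00000111, "*": 0b00000101},
    "src_ip": {"S": 0b11110011, "TO": 0b11011011, "Net": 0b11010111, "*": 0b11010011},
    "dst_port": {"25": 0b10000111, "53": 0b01100111, "23": 0b00010111,
                 "123": 0b00001111, "*": 0b00000111},
    "src_port": {"123": 0b11111111, "*": 0b11110111},
    "proto": {"UDP": 0b11111101, "TCP": 0b10110111, "*": 0b10110101},
}


def classify(labels):
    """AND the k per-field bitmaps and return the first (leftmost) matching rule."""
    acc = (1 << N_RULES) - 1
    for field, label in labels.items():
        acc &= BITMAPS[field][label]
    if acc == 0:
        return None
    first = N_RULES - acc.bit_length() + 1  # position of the highest set bit
    return f"R{first}"


packet = {"dst_ip": "*", "src_ip": "*", "dst_port": "*", "src_port": "*", "proto": "TCP"}
print(classify(packet))  # R8
```

The work per packet is k reads of n-bit bitmaps, which is why the approach suits hardware with wide memory reads and scales poorly as n grows.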
Cross-producting
(figure: the example rule table with rules R1 to R8; per-field lookups for Dest IP, Src IP, Dest Port, Src Port, and Proto are combined into the final result)
Values recognized per field
  Dest IP:    M, TI, Net, *
  Src IP:     S, TO, Net, *
  Dest Port:  25, 53, 23, 123, *
  Src Port:   123, *
  Proto:      UDP, TCP, *
- Cross-producting performs longest prefix matching separately for all fields and combines the results in a single step by looking up the matching rule in a pre-computed table explicitly listing the first matching rule for each element of the cross-product.
  - The size of this table is the product of the numbers of recognized prefixes/ranges for the individual fields: 4*4*5*2*3=480
  - Example index computation: 3*120+3*30+4*6+1*3+1=478
  - Due to its memory requirements this method is not feasible.
  Index  Cross product      Action
  0      M,S,25,123,UDP     R1
  1      M,S,25,123,TCP
  2      M,S,25,123,*
  3      M,S,25,*,UDP
  4      M,S,25,*,TCP
  5      M,S,25,*,*
  …
  478    *,*,*,*,TCP        R8
  479    *,*,*,*,*
- Equivalenced cross-producting (a.k.a. recursive flow classification or RFC) combines the results of the per-field longest matching prefix operations two by two.
  - The pairs of values are grouped in equivalence classes, and in general there are many fewer equivalence classes than pairs of values.
  - This leads to significant memory savings as compared to simple cross-producting.
  - This algorithm provides fast packet classification, but compared to other algorithms the memory requirements are relatively large (but feasible in some settings).
  - Example for Dest IP and Src IP: 16 entries, 8 distinct classes
  Index  Dest IP, Src IP  Rule bitmap  Class
  0      M,S              11110011     C1
  1      M,TO             11010011     C2
  2      M,Net            11010111     C3
  3      M,*
  4      TI,S             00000011     C4
  5      TI,TO            00001011     C5
  6      TI,Net           00000111     C6
  7      TI,*
  8      Net,S
  9      Net,TO
  10     Net,Net
  11     Net,*
  12     *,S              00000001     C7
  13     *,TO
  14     *,Net            00000100     C8
  15     *,*
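The table size, the index arithmetic, and the count of equivalence classes can all be checked mechanically. This sketch assumes the field order and value order shown on the slide and derives each pair's rule bitmap by AND-ing the per-field bitmaps from the bit vector slide; `cross_product_index` and the other names are illustrative.

```python
from itertools import product
from math import prod

FIELDS = [
    ("dst_ip", ["M", "TI", "Net", "*"]),
    ("src_ip", ["S", "TO", "Net", "*"]),
    ("dst_port", ["25", "53", "23", "123", "*"]),
    ("src_port", ["123", "*"]),
    ("proto", ["UDP", "TCP", "*"]),
]

table_size = prod(len(values) for _, values in FIELDS)


def cross_product_index(labels):
    """Mixed-radix index of one combination of per-field results."""
    index = 0
    for (_, values), label in zip(FIELDS, labels):
        index = index * len(values) + values.index(label)
    return index


# Per-field bitmaps from the bit vector slide (leftmost bit is R1).
DST = {"M": 0b11110111, "TI": 0b00001111, "Net": 0b00000111, "*": 0b00000101}
SRC = {"S": 0b11110011, "TO": 0b11011011, "Net": 0b11010111, "*": 0b11010011}

# Pairs with the same combined bitmap fall into the same equivalence class.
classes = {}
pair_class = {}
for dst, src in product(DST, SRC):
    bitmap = DST[dst] & SRC[src]
    pair_class[(dst, src)] = classes.setdefault(bitmap, len(classes) + 1)

print(table_size)                                          # 480
print(cross_product_index(("*", "*", "*", "*", "TCP")))    # 478
print(len(pair_class), len(classes))                       # 16 8
```

AND-ing the bitmaps for (*, Net) gives 00000101 where the slide's table lists 00000100; the count of 8 distinct classes is the same either way.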
Decision tree approaches
- At each node of the tree test a bit in a field or perform a range test
  - Large fan-out leads to shallow trees and fast classification
  - Leaves contain a few rules traversed linearly
  - Interior nodes may contain rules that match also
- Tests may look at bits from multiple fields
- A rule may appear in multiple nodes of the decision tree; this can lead to increased memory usage
- Tree built using heuristics that pick fields to compare on that divide remaining rules relatively evenly among descendants
- Fast and compact on rule sets used today
Papers on packet classification
- G. Varghese, “Network algorithmics …”, chapter 12
- V. Srinivasan, G. Varghese, S. Suri, M. Waldvogel, “Fast and Scalable Layer Four Switching”, ACM SIGCOMM, Sep. 1998
- F. Baboescu, S. Singh, G. Varghese, “Packet classification for core routers: Is there an alternative to CAMs?”, IEEE Infocom, 2003
- P. Gupta, N. McKeown, “Packet classification on multiple fields”, ACM SIGCOMM, 1999
- T. Woo, “A modular approach to packet classification: Algorithms and results”, IEEE Infocom, 2000
- S. Singh, F. Baboescu, G. Varghese, “Packet classification using multidimensional cutting”, SIGCOMM, 2003
Overview
- Longest matching prefix
- Classification on multiple fields
- Signature matching
  - String matching
  - Regular expression matching w/ DFAs and D2FAs
Signature matching
- Used in intrusion prevention/detection, application classification, load balancing
- Guard: a byte string or a regular expression
- Action: drop packet, log alert, set priority, direct to specific server
- Input: byte string from the payload of packet(s)
  - Hence the name “deep packet inspection”
- Output: the positions at which various signatures match or the identifier of the “highest priority” signature that matches
- Size of problem: hundreds of signatures per protocol
String matching
- Most widely used early form of deep packet inspection, but the more expressive regular expressions have superseded strings by now
  - Still used as pre-filter to more expensive matching operations by popular open source IDS/IPS Snort
- Matching multiple strings is a well-studied problem
  - A. Aho, M. Corasick, “Efficient string matching: An aid to bibliographic search”, Communications of the ACM, June 1975
  - Many hardware-based solutions published in last decade
  - Matching time independent of number of strings, memory requirements proportional to sum of their sizes
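A compact sketch of the Aho-Corasick construction cited above, written with Python dicts and lists. The function names and the (goto, fail, out) representation are choices made for this example. This variant follows failure links during matching, so the cost per input character is constant only in the amortized sense; a line-speed implementation would typically precompute a full transition table so that every byte costs exactly one lookup.

```python
from collections import deque


def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]  # state 0 is the root
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: failure link = longest proper suffix that is also a trie path.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (start position, pattern) for every occurrence of every pattern."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(sorted(search("ushers", automaton)))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The number of trie states is bounded by the total length of the patterns plus one, which matches the memory claim on the slide.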
Regular expression matching
- Deterministic and non-deterministic finite automata (DFAs and NFAs) can match regular expressions
  - NFAs more compact but require backtracking or keeping track of sets of states during matching
  - Both representations used in hardware and software solutions, but only DFA based solutions can guarantee throughput in software
- DFAs have a state space explosion problem
  - From DFAs recognizing individual signatures we can build a DFA that recognizes entire signature set in a single pass
  - Size of combined DFA much larger than sum of sizes for DFAs recognizing individual signatures
  - Multiple combined DFAs are used to match signature set
Delayed Input DFA (D2FA)
S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, “Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection”, ACM SIGCOMM, September 2006
(figure: transition tables of a deterministic finite automaton (DFA) and of the corresponding delayed input DFA (D2FA) for states 0 to 3, with default transitions, processing the input …410052… and tracking the current state)
- If the “current state” variable meets an acceptance condition (e.g. whether the state identifier is larger than a given threshold), the automaton raises an alert.
- D2FAs build on the observation that for many pairs of states, the transition tables are very similar and it is enough to store the differences.
- The lookup algorithm may need to follow multiple default transitions until it finds a state that explicitly stores a pointer to the next state it needs to transition to.
- Since this is a throughput concern, the algorithm for constructing D2FAs allows the user to set a limit on the maximum length of a default path.
- The memory columns report the ratio between the number of transitions used by the D2FA and the corresponding DFA.
  Set of regular expr.  Avg. d.p.l.  Max d.p.l.  Memory (no bound on d.p.l.)  Memory (d.p.l.≤4)
  Cisco590              18.32        57          0.80%                        1.56%
  Cisco103              16.65        54          0.98%                        1.54%
  Cisco7                19.61        61          2.58%                        3.31%
  Linux56               7.68         30          1.64%                        1.87%
  Linux10               5.14         20          8.59%                        9.08%
  Snort11               5.86         9           1.57%                        1.66%
  Bro648                6.45         17          0.45%                        0.51%
  (d.p.l. = default path length; the average and maximum are for D2FAs with no bound on default path length)
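The idea of storing only the differences from a default state can be sketched on a toy automaton. Everything here is illustrative: the four-state DFA recognizes the string "abc" over the alphabet a to d, the default pointers are picked by hand rather than by the construction algorithm from the Kumar et al. paper, and `compress` and `step` are hypothetical names.

```python
ALPHABET = "abcd"


def make_row(**special):
    """Full DFA row: every symbol goes to state 0 unless listed otherwise."""
    row = {ch: 0 for ch in ALPHABET}
    row.update(special)
    return row


DFA = {
    0: make_row(a=1),
    1: make_row(a=1, b=2),
    2: make_row(a=1, c=3),
    3: make_row(a=1),       # state 3 means "abc" was just seen
}
DEFAULT = {0: None, 1: 0, 2: 0, 3: 0}  # hand-picked default transitions


def compress(dfa, default):
    """Keep only the transitions that differ from the default state's row."""
    d2fa = {}
    for state, row in dfa.items():
        parent = default[state]
        if parent is None:
            d2fa[state] = dict(row)
        else:
            d2fa[state] = {ch: t for ch, t in row.items() if dfa[parent][ch] != t}
    return d2fa


def step(d2fa, default, state, symbol):
    """Follow default transitions until a state stores the symbol explicitly."""
    hops = 0
    while symbol not in d2fa[state]:
        state = default[state]
        hops += 1
    return d2fa[state][symbol], hops


D2FA = compress(DFA, DEFAULT)
stored = sum(len(row) for row in D2FA.values())
total = sum(len(row) for row in DFA.values())
print(f"{stored} of {total} transitions stored")  # 6 of 16 transitions stored

state, trace = 0, []
for ch in "aabcabdabc":
    state, _ = step(D2FA, DEFAULT, state, ch)
    trace.append(state)
print(trace)
```

Here every default path has length at most 1, so a lookup costs at most two memory accesses per input symbol; longer default paths save more memory but cost more accesses, which is the trade-off the table on the slide quantifies.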
Conclusions
- Networking devices implement more and more complex data plane processing to better control traffic
- The algorithms and data structures used have big performance impact
- Often set of rules to be matched against has specific structure
  - Algorithms exploiting this structure may give good performance even if it is impossible to find an algorithm that gives good performance on all possible rule sets
That’s all folks!