A Search Memory Substrate for High Throughput and Low Power Packet Processing Sangyeun Cho, Michel Hanna and Rami Melhem Dept. of Computer Science University of Pittsburgh
Background
[diagram: end users connect through routers and subnets to ISPs and the Internet]
Network packet processing tasks
Packet forwarding
Given an IP address, look up a matching prefix in a table (IP table)
Make sure the chosen prefix is the longest: the LPM (Longest Prefix Matching) requirement
Rule-based packet filtering
Given a set of packet fields (src/dst IP, src/dst port, protocol, …), look up matching entries in a rule database
Deep packet inspection
Given a string in the packet payload, look up matching entries in a signature database
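The LPM requirement above can be shown with a short sketch (a hedged illustration, not any router's actual implementation; prefixes are bit strings and the table contents are made up):

```python
# Minimal sketch of longest-prefix matching (LPM) over an IP table.
# Prefixes and addresses are bit strings; the entries are illustrative.

def lpm_lookup(ip_table, addr):
    """Return the longest prefix in ip_table that matches addr, or None."""
    best = None
    for prefix in ip_table:
        if addr.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return best

table = ["0", "10", "0100", "01000", "01101"]
print(lpm_lookup(table, "01000110"))  # "01000": the longest matching prefix
```

Both "0" and "0100" also match here; LPM requires returning the most specific (longest) one.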
Lookup performance scalability
Lookup performance must match increasing line speeds
For OC-768, up to 104M packets must be processed per second
Network traffic has doubled every year [McKeown03]; router capacity doubles every 18 months
Capacity pressure
Routing tables (~200K prefixes in a core router) are growing [RIS]
# of firewall rules increases; 100K rules are practical [Baboescu04]
IPv6
Power and thermal issues are already a critical limiting factor in network processing device design [McKeown03]
Two conventional lookup solutions
Software methods (tries, hash tables, …)
Hardware methods (TCAM, Bloom filters, …)
IP lookup using a trie
Consider an IP address: [diagram: a binary trie built from the prefix table; lookup walks the trie bit by bit]
Pros: “flexibility”
Cons: high memory capacity requirement, low memory bandwidth utilization
Not SCALABLE
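As a rough sketch of the trie approach (node layout and names are illustrative, not the slide's exact data structure): one node is visited per address bit, which is why each lookup takes many narrow memory accesses and bandwidth utilization is low.

```python
# Hedged sketch of a binary trie for IP lookup. Each node has child
# links for bits '0'/'1' and an optional next-hop for a prefix that
# ends at that node; lookup remembers the last prefix seen (LPM).

class TrieNode:
    def __init__(self):
        self.child = {}
        self.next_hop = None

def trie_insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        node = node.child.setdefault(bit, TrieNode())
    node.next_hop = next_hop

def trie_lookup(root, addr):
    """Walk one node per address bit, tracking the longest match so far."""
    node, best = root, None
    for bit in addr:
        if node.next_hop is not None:
            best = node.next_hop
        if bit not in node.child:
            break
        node = node.child[bit]
    else:
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
trie_insert(root, "0", "hopA")
trie_insert(root, "0100", "hopB")
print(trie_lookup(root, "01001"))  # hopB: "0100" is the longest match
```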
IP lookup using TCAM
Consider an IP address and a prefix table stored in a TCAM: 0*, 10*, 0100*, 0110*, 1101*, 01000*, 01100*, 01101*, 11011*, …
Sort before storing; choose the first among the matched
Pros: high bandwidth, constant-time lookup
Cons: TCAMs are relatively small and expensive; power consumption is very high
Not SCALABLE
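Conceptually, a TCAM compares the key against every stored ternary entry at once and a priority encoder picks the first match, which is why entries must be pre-sorted by prefix length. A hedged software model (the hardware does all comparisons in parallel; the loop here is only a stand-in):

```python
# Sketch of TCAM semantics: ternary entries with '*' don't-care bits,
# stored longest-prefix first; the first matching entry wins.

def ternary_match(entry, key):
    return all(e == '*' or e == k for e, k in zip(entry, key))

def tcam_lookup(entries, key):
    """entries must be pre-sorted so longer (more specific) prefixes
    appear first; returns the index chosen by the priority encoder."""
    for i, entry in enumerate(entries):
        if ternary_match(entry, key):
            return i
    return None

# 5-bit example; longer prefixes stored before shorter ones.
tcam = ["01000", "0100*", "0110*", "10***", "0****"]
print(tcam_lookup(tcam, "01001"))  # 1: "0100*" is the first match
```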
Recap: Why is TCAM inefficient?
All bits are involved in matching
Large embedded match logic
“Large” means more work in this case
CA-RAM–a hybrid approach
Can we do better than the existing schemes?
Flexibility and search performance
Exploit optimized RAM designs
Hardware approach (software is too slow)
CA-RAM combines hashing w/ hardware parallel matching
CA-RAM design goals
High lookup performance
Low power consumption
Smaller chip area per stored datum
Straightforward system-level integration
CA-RAM–Content Addressable RAM
Separate match logic and memory
Match logic for a single row, not every row
Allows the use of dense RAM technology
Enables highly reconfigurable match logic
(Keep keys sorted in each row, not in the entire array)
[diagram: match logic and memory cells in a conventional CAM/TCAM vs. in CA-RAM]
Very simple, yet efficient
Use hashing to store keys in a particular row
To look up, hash the key and retrieve one row
Perform matching on the entire row in parallel
Achieve (full) content addressability w/o paying the overhead!
[diagram: index generator hashes the search key to one row; match processors compare the row's keys in parallel]
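The lookup path above can be modeled in a few lines (a hedged sketch; the row count, bucket layout, and index function are arbitrary choices, and the per-row comparison that is parallel in hardware appears as a loop here):

```python
# Minimal software model of the CA-RAM lookup path.

NUM_ROWS = 4                       # 2^R rows in a real design
rows = [[] for _ in range(NUM_ROWS)]

def index(key):
    """Stands in for the index generator (a hash of the search key)."""
    return key % NUM_ROWS

def store(key):
    rows[index(key)].append(key)   # bucket overflow ignored for now

def lookup(key):
    """One memory access fetches a row; the match processors then
    compare every key in that row against the search key in parallel
    (modeled here as a sequential scan)."""
    return any(k == key for k in rows[index(key)])

store(0b01000110)
print(lookup(0b01000110))  # True
```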
Pipelined CA-RAM operation
Step 1: index generation
Step 2: memory access
Step 3: key matching
Step 4: result forwarding
[diagram: the four pipeline stages, with match processors operating on the keys fetched from one row]
Dealing w/ bucket overflows
Careful design of the hash function
Increase bucket size
Reduce the load factor α (α = # of occupied entries / # of total entries)
Trade off space for performance
Use “chaining”: store overflows in subsequent rows
Multiple accesses per lookup
Use a small overflow CAM, accessed in parallel
Similar to popular “victim caching” in computer architecture
Use two-level hashing and employ multiple CA-RAM banks
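Two of the overflow strategies listed above, chaining into subsequent rows (which costs extra accesses) and a small parallel overflow CAM, can be sketched together; all sizes and names here are illustrative:

```python
# Hedged sketch of overflow handling: chaining plus an overflow CAM.

NUM_ROWS, BUCKET_SIZE = 8, 2
rows = [[] for _ in range(NUM_ROWS)]
overflow_cam = {}

def add_key(key, data):
    r = key % NUM_ROWS
    for _ in range(NUM_ROWS):           # chaining: spill to the next row
        if len(rows[r]) < BUCKET_SIZE:
            rows[r].append((key, data))
            return
        r = (r + 1) % NUM_ROWS
    overflow_cam[key] = data            # table full: fall back to the CAM

def find_key(key):
    """Returns (data, row_accesses); the CAM probe is free (parallel)."""
    if key in overflow_cam:
        return overflow_cam[key], 0
    r = key % NUM_ROWS
    for accesses in range(1, NUM_ROWS + 1):
        for k, d in rows[r]:
            if k == key:
                return d, accesses
        if len(rows[r]) < BUCKET_SIZE:  # a non-full row ends the chain
            return None, accesses
        r = (r + 1) % NUM_ROWS
    return None, NUM_ROWS

def load_factor():
    """alpha = occupied entries / total entries."""
    return sum(len(b) for b in rows) / (NUM_ROWS * BUCKET_SIZE)
```

With a lower load factor, almost all lookups complete in one row access; chaining only kicks in for the rare overflowed bucket.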
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Adapting key size
Adapting key size is straightforward
Will benefit supporting multiple apps/standards
[diagram: reconfigurable match logic selects which key bits participate in matching; the same row holds either several short keys or fewer long keys]
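The key idea, selecting which bit positions participate in the comparison, reduces to masking in software (a hedged sketch; the widths and values are illustrative, e.g. 32-bit IPv4 vs. 128-bit IPv6 keys in real use):

```python
# Sketch of key-size adaptation: compare only the configured key bits.

def match(stored, search, key_width):
    """Compare only the low key_width bits of each stored word."""
    mask = (1 << key_width) - 1
    return (stored & mask) == (search & mask)

word = 0xDEADBEEF
print(match(word, 0x0000BEEF, key_width=16))  # True: low 16 bits agree
print(match(word, 0x0000BEEF, key_width=32))  # False: high bits differ
```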
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Binary and ternary matching
Some apps require ternary matching, some don’t
Supporting binary/ternary matching
Developed a configurable comparator
Ternary matching requires 2 bits per symbol (a key bit plus a mask bit)
Supporting different types of matching in different bit positions is feasible
[diagram: reconfigurable match logic compares the search key against stored key/mask pairs; mask bits are considered or ignored depending on the mode]
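The configurable comparator reduces to a few bitwise operations (a hedged sketch, not the paper's circuit; a mask bit of 1 marks a don't-care position, and in binary mode the mask word is simply ignored):

```python
# Sketch of a binary/ternary-configurable comparator.

def compare(stored_key, mask, search_key, ternary=True):
    """Bitwise compare; masked positions match anything in ternary mode."""
    diff = stored_key ^ search_key
    if ternary:
        diff &= ~mask            # clear differences at don't-care positions
    return diff == 0

# 8-bit example: stored ternary entry 0100**** (mask covers the low 4 bits)
key, mask = 0b01000000, 0b00001111
print(compare(key, mask, 0b01001010, ternary=True))   # True
print(compare(key, mask, 0b01001010, ternary=False))  # False
```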
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Binary and ternary matching
Some apps require ternary matching, some don’t
Storing data and keys in a CA-RAM module
Cuts # of memory accesses for IP lookup by half
Simult. key matching & data access
With TCAM, a separate data access follows the lookup
CA-RAM supports data embedding: keys and data stored in the same row
Cuts memory traffic & latency by half
[diagram: match logic matches the key and bypasses the embedded data to the output]
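The access-count difference can be made concrete with a small sketch (hedged; keys, next-hop names, and the two-array TCAM model are illustrative):

```python
# TCAM-style lookup returns an index, then a second memory access
# fetches the associated data from a separate array -> 2 accesses.
keys = ["0100", "0110", "1101"]
data = ["hopA", "hopB", "hopC"]

def tcam_style_lookup(key):
    idx = keys.index(key)      # access 1: search the key array
    return data[idx], 2        # access 2: read the data array

# CA-RAM data embedding: (key, data) pairs sit in the same row, so one
# row fetch serves both the match and the data -> 1 access.
row = [("0100", "hopA"), ("0110", "hopB"), ("1101", "hopC")]

def caram_lookup(key):
    for k, d in row:           # one row fetch; matching is parallel in HW
        if k == key:
            return d, 1
    return None, 1

print(tcam_style_lookup("0110"))  # ('hopB', 2)
print(caram_lookup("0110"))       # ('hopB', 1)
```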
CA-RAM reconfig. opportunities
Reconfigurable match logic allows:
Adapting key size to apps
Same hardware to support multiple apps or standards
Binary and ternary matching
Some apps require ternary matching, some don’t
Storing data and keys in a CA-RAM module
Cuts # of memory accesses for IP lookup by half
Providing range checking capabilities
Beneficial for rule-based packet filtering
Supporting range checking
In TCAM, range checking causes trouble: entries must be expanded
CA-RAM can support range checking efficiently
[diagram: match logic matches the key and checks the range fields in parallel]
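In a range-capable match mode, a rule field stores explicit bounds and the comparator performs two magnitude comparisons instead of a bitwise match (a hedged sketch; the rule format and port values are illustrative):

```python
# Sketch of range checking for rule-based packet filtering.

def range_match(lo, hi, value):
    """Two magnitude comparisons replace the bitwise match."""
    return lo <= value <= hi

# A firewall-style rule on destination port: match ports 1024..65535.
rule = {"dst_port": (1024, 65535)}
print(range_match(*rule["dst_port"], 80))    # False
print(range_match(*rule["dst_port"], 8080))  # True
```

A ternary-only device would have to expand such a range into several prefix entries, which is the expansion trouble the slide refers to.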
Evaluation We implemented a CA-RAM design (w/ reconfigurability) and evaluated its power and area advantages over state-of-the-art TCAMs We experimented with real routing tables to estimate the load factor and the average memory accesses per lookup
Mapping a large IP routing table
Consider multiple design points (Designs A–F):
2,048 rows (32 entries/row) or 4,096 rows (64 entries/row)
Resulting load factors: α = 0.47, 0.40, 0.36, 0.24, 0.36
Mapping a large IP routing table
[chart: spilled entries and average memory access latency for α = 0.47, 0.40, 0.36, 0.24, 0.36, under “uniform” and “skewed” traffic]
With a properly chosen α, CA-RAM achieves near-constant AMAL (average memory access latency)
Comparing CA-RAM and TCAM
[chart: per-cell area (µm²) and power (W) for a 4.5Mb CA-RAM vs. TCAM in the same CMOS technology]
CA-RAM area advantage: 4.5×–11×
CA-RAM power advantage: 4×–14×
Conclusions
Compared w/ software methods
Fewer memory accesses; higher lookup performance
Compared w/ TCAM
Density matching that of DRAM enables large lookup tables
Exceeds the speed of TCAM
Low power: a critical advantage for cost-effective system design
Reconfigurability
Can accommodate apps with different key/record sizes, binary vs. ternary searching requirements, range checking, …
Can adopt new standards much more easily, e.g., IPv6
Mapping a large IP routing table
CA-RAM (Design B) is advantageous over TCAM
CA-RAM components
[diagram: index generator feeding a memory array of 2^R rows × C bits (N-bit keys), match processors operating on each fetched row, and a result bus]