ECE 526 – Network Processing Systems Design
System Implementation Principles II
Varghese, Chapter 3
Outline
Review of Principles 1–7
Implementation principles
─ Reflect what we have learned
Example: TCAM updating
Cautionary questions
Review
P1: Avoid Obvious Waste
─ Example: copy a packet pointer instead of the packet
P2: Shift Computation in Time
─ Precompute (table lookup)
─ Evaluate lazily (network forensics)
─ Share expenses (batch processing)
P3: Relax Subsystem Requirements
─ Trade certainty for time (random sampling)
─ Trade accuracy for time (hashing, Bloom filters)
─ Shift computation in space (fast path/slow path)
Review
P4: Leverage Off-System Components
─ Examples: on-board address recognition and filtering, caches
P5: Add Hardware to Improve Performance
─ Use memory interleaving and pipelining (= parallelism)
─ Use wide-word parallelism (saves memory accesses)
─ Combine SRAM and DRAM (keep the low-order bits of each counter in SRAM when maintaining a large number of counters)
P6: Replace Inefficient General Routines with Efficient Specialized Ones
─ Example: NAT using forwarding and reverse tables
P7: Avoid Unnecessary Generality
─ Examples: RISC, microengines
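The SRAM/DRAM counter-splitting idea under P5 can be sketched in software. This is a minimal simulation, not an implementation from the chapter; the class and constant names (`HybridCounters`, `SRAM_BITS`) are illustrative.

```python
# Sketch of the P5 hybrid-counter idea: keep only the low-order bits of each
# counter in fast SRAM and spill overflow into wide DRAM counters, so the
# (slow) DRAM is touched only once every ~2^SRAM_BITS increments.

SRAM_BITS = 8                          # narrow per-counter field in SRAM
SRAM_MAX = (1 << SRAM_BITS) - 1

class HybridCounters:
    def __init__(self, n):
        self.sram = [0] * n            # fast, narrow counters
        self.dram = [0] * n            # slow, wide counters

    def increment(self, i):
        self.sram[i] += 1
        if self.sram[i] > SRAM_MAX:    # overflow: one (rare) DRAM access
            self.dram[i] += self.sram[i]
            self.sram[i] = 0

    def read(self, i):
        return self.dram[i] + self.sram[i]
```

Per-packet increments thus touch only SRAM in the common case, which is the point of the principle.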
P8: Don't Be Tied to Reference Implementations
Key Concept:
─ Implementations are sometimes given (e.g., by manufacturers) as a way to make the specification of an interface precise, or to show how to use a device
─ These do not necessarily show the right way to think about the problem; they are chosen for conceptual clarity!
Example:
─ Using parallel packet classification instead of sequential demultiplexing in TCP/IP protocols
P9: Pass Hints Across Interfaces
Key Concept: if the caller knows something the callee will have to compute, pass it (or something that makes it easier to compute) as an argument!
─ "hint" = something that makes the recipient's life easier, but may not be correct
─ "tip" = a hint that is guaranteed to be correct
─ Caveat: the callee must either trust the caller or verify the hint (probably should do both)
Example:
─ Active messages: each message carries the address of its interrupt handler for fast dispatch
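The trust-but-verify caveat can be sketched as a tiny lookup routine. This is an illustrative example, not from the chapter; the table and names (`flows`, `lookup`) are hypothetical.

```python
# Sketch of P9: the caller passes a hint (a guessed table index). The callee
# verifies the hint before trusting it; a wrong or stale hint only costs a
# fall-through to the slow path, never an incorrect answer.

flows = [("10.0.0.1", "A"), ("10.0.0.2", "B"), ("10.0.0.3", "C")]

def lookup(dst, hint=None):
    # Fast path: verify the caller's hint, then use it.
    if hint is not None and 0 <= hint < len(flows) and flows[hint][0] == dst:
        return flows[hint][1]
    # Slow path: full search when the hint is absent or stale.
    for key, action in flows:
        if key == dst:
            return action
    return None
```

A correct hint turns the lookup into one compare; an incorrect one is caught by the verification step.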
P10: Pass Hints in Protocol Headers
Key Concept: if the sender knows something the receiver will have to compute, pass it in the header
Example:
─ Tag switching: the packet carries extra information besides the destination address to enable fast lookup
P11: Optimize the Expected Case
Key Concept: if 80% of the cases can be handled similarly, optimize for those cases
P11a: Use Caches
─ A form of using state to improve performance
Example:
─ TCP input "header prediction": if an incoming packet is in order and does what is expected, it can be processed in a small number of instructions
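The header-prediction idea can be sketched as a fast-path check. This is a highly simplified illustration in the spirit of TCP header prediction, not real TCP code; the connection state and field names are assumptions.

```python
# Sketch of P11: if a segment is exactly the next in-order data segment with
# no special flags, take a short fast path; anything unusual falls through to
# the general slow path.

def receive(conn, seg):
    if seg["seq"] == conn["rcv_nxt"] and not seg["flags"]:
        # Expected case: in-order data, nothing unusual.
        conn["rcv_nxt"] += len(seg["data"])
        conn["buffer"] += seg["data"]
        return "fast"
    return slow_path(conn, seg)

def slow_path(conn, seg):
    # General processing (reordering, flags, retransmission, ...) elided.
    return "slow"
```

The fast path is a couple of compares and two updates; the full generality lives only in the slow path.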
P12: Add or Exploit State to Gain Speed
Key Concept: remember things to make them easier to compute later
P12a: Compute Incrementally
─ The idea is to "accumulate" as you go, rather than computing all at once at the end
Example:
─ Incremental computation of the IP checksum
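The incremental IP checksum example can be made concrete. The update formula below follows RFC 1624 (HC' = ~(~HC + ~m + m') in ones'-complement arithmetic); the helper names are illustrative.

```python
# Sketch of P12a: update the 16-bit ones'-complement Internet checksum
# incrementally when one 16-bit header word changes (e.g., a TTL decrement),
# instead of recomputing over the whole header.

def checksum(words):
    s = sum(words)
    while s >> 16:                     # fold carries back into the low 16 bits
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def incremental_update(old_cksum, old_word, new_word):
    # HC' = ~(~HC + ~m + m')   (RFC 1624, ones'-complement arithmetic)
    s = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + (new_word & 0xFFFF)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF
```

A router forwarding a packet changes only the TTL word, so two adds replace a full pass over the header.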
P13: Optimize Degrees of Freedom
Key Concept: be aware of the variables under one's control and the evaluation criteria used to determine good performance
Example: memory-based string matching
─ In a standard automaton, each state has 256 possible transitions per input character (2^8, 8-bit ASCII)
─ The bit-split algorithm uses 8 machines, each checking only one bit, so the total number of possible transitions per character is 16 (2^1 × 8)
P14: Use Special Techniques for Finite Universes (e.g., small integers)
Key Concept: when the domain of a function is small, techniques like bucket sorting and bitmaps become feasible
Example:
─ Bucket sorting for NAT table lookup
  The NAT table is very sparse
  Each bucket is accessed by hashing
─ Bucket sort: partition an array into a finite number of buckets, then sort each bucket individually
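The bucket-sort idea described above can be sketched directly. This is a generic illustration of the technique for a small integer universe, not NAT-specific code; the parameters (`universe`, `nbuckets`) are assumptions.

```python
# Sketch of P14: bucket sort over a small finite universe of keys.
# Assumes keys are non-negative integers below `universe`.

def bucket_sort(keys, universe=256, nbuckets=16):
    width = universe // nbuckets
    buckets = [[] for _ in range(nbuckets)]
    for k in keys:
        buckets[k // width].append(k)  # partition keys by value range
    out = []
    for b in buckets:
        out.extend(sorted(b))          # sort each bucket individually
    return out
```

Because the universe is finite and known, the partitioning step is a single arithmetic operation per key, which is what makes the technique feasible here.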
P15: Use Algorithmic Techniques to Create Efficient Data Structures
Key Concept: once P1–P14 have been applied, think about how to build an ingenious data structure that exploits what you know
Example:
─ IP forwarding lookups: PATRICIA trees came first (a special trie with each edge labeled with a sequence of characters); many more efficient approaches followed
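A minimal version of the trie idea behind such lookup structures can be sketched as follows. This is a plain binary trie without PATRICIA's path compression, for illustration only; prefixes are given as bit strings and the names are hypothetical.

```python
# Sketch of P15: a binary trie for longest-prefix matching, the structural
# idea underlying PATRICIA-style IP lookup (without path compression).

class TrieNode:
    def __init__(self):
        self.children = {}             # keys are the bits "0" and "1"
        self.next_hop = None

def insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
    node.next_hop = next_hop

def longest_prefix_match(root, addr_bits):
    node, best = root, root.next_hop
    for bit in addr_bits:
        node = node.children.get(bit)
        if node is None:
            break                      # no longer prefix exists
        if node.next_hop is not None:
            best = node.next_hop       # remember the deepest match so far
    return best
```

PATRICIA compresses one-child chains into edges labeled with bit sequences; the lookup logic is otherwise the same walk.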
TCAM
Ternary: 0, 1, and * (wildcard)
A TCAM stores fixed-length keys with associated actions
TCAM lookup: compare the query against all keys in parallel and output (in one cycle) the lowest memory location whose key matches the input
IP forwarding uses longest-prefix matching
─ A DIP may match both * and 01*
Using a TCAM for IP forwarding requires that all longer prefixes be placed before any shorter ones
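The lookup semantics can be sketched in software. A real TCAM does all comparisons in parallel in one cycle; this sequential simulation only illustrates the "lowest matching location wins" rule, with illustrative names.

```python
# Software sketch of a TCAM lookup: each entry is a ternary key of 0/1/*
# characters; the result is the lowest memory location whose key matches.
# Keys shorter than the query are treated as prefixes (missing positions
# match anything, like trailing wildcards).

def matches(key, query):
    return all(k == "*" or k == q for k, q in zip(key, query))

def tcam_lookup(entries, query):
    # entries: list of (ternary_key, action), longer prefixes stored first
    for index, (key, action) in enumerate(entries):
        if matches(key, query):
            return index, action
    return None
```

Because the lowest location wins, storing longer prefixes at lower addresses makes the priority encoder implement longest-prefix matching for free.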
IP Lookup
All prefixes of the same length are grouped together; the shortest prefix 0* sits at the highest memory address
A packet whose DIP matches the prefixes of both P3 and P5: P5 is chosen because of longest-prefix matching
Routing Table Update
Suppose 11* with next hop P1 must be inserted into the routing table
Naïve approach: create space in the group of length-2 prefixes by pushing up one position all prefixes of length 2 and longer
A core routing table has ~100,000 entries → ~100,000 memory accesses
Routing Table Update
P13: understand and exploit degrees of freedom
─ We can add 11* at any position within the length-2 group; it is not required to come right after 10*
─ So it suffices to add it at the boundary between group 2 and group 3
Clever Routing Table Update
Since order within a group does not matter, create a free slot by moving only the single entry at each prefix-length boundary
To insert a prefix of length i, the maximum number of memory accesses is 32 − i
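The update scheme can be simulated with a plain list. This is a minimal sketch under assumptions: 4-bit "addresses" (so the worst case is MAXLEN − i moves, mirroring the 32 − i bound), the free slot kept at the lowest address, and one insertion at a time; the bookkeeping names (`end`, `hole`) are illustrative.

```python
# Simulation of the clever TCAM update: groups of equal prefix length are
# stored contiguously in decreasing-length order, with a free slot at index 0.
# To insert a prefix of length i, move only the boundary entry of each longer
# group into the hole (order within a group is irrelevant), so at most
# MAXLEN - i entries move. A real table would keep a pool of free slots.

MAXLEN = 4   # 4-bit addresses keep the example small; IPv4 would use 32

def insert(table, end, prefix, action):
    """end[l] = index one past the length-l group; end[MAXLEN+1] marks the
    start of the longest group. Returns the number of entries moved."""
    i = len(prefix.rstrip("*"))        # prefix length
    hole, moves = 0, 0
    end[MAXLEN + 1] -= 1               # the longest group absorbs the hole
    for l in range(MAXLEN, i, -1):
        last = end[l] - 1              # boundary entry of group l
        if last > hole:                # nonempty group: exactly one move
            table[hole] = table[last]
            moves += 1
        end[l] = last                  # group l slides up by one slot
        hole = last                    # the hole crosses into the next group
    table[hole] = (prefix, action)     # hole now lies inside group i's region
    return moves
```

Each boundary crossing costs one write rather than shifting a whole group, which is exactly the degree of freedom identified on the previous slide.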
Cautionary Questions
Q1: Is improvement really needed?
Q2: Is this really the bottleneck?
Q3: What impact will the change have on the rest of the system?
Q4: Does back-of-the-envelope analysis indicate significant improvement?
Q5: Is it worth adding custom hardware?
Q6: Can a protocol change be avoided?
Q7: Do prototypes confirm the initial promise?
Q8: Will performance gains be lost if the environment changes?
Summary
P1–P5: Systems-oriented principles
─ These recognize/leverage the fact that a system is made up of components
─ Basic idea: move the problem to somebody else's subsystem
P6–P10: Improve efficiency without destroying modularity
─ "Pushing the envelope" of module specifications
─ Basic engineering: a system should satisfy its spec but not do more
P11–P15: Local optimization techniques
─ Speeding up a key routine
─ Apply these after you have looked at the big picture
Reminder