Virtually Pipelined Network Memory
Banit Agrawal, Tim Sherwood
UC Santa Barbara
Memory Design is Hard
- Increasing functionality, increasing size of data structures, increasing line rates
- Throughput must hold in the worst case: traffic has to be serviced at the advertised rate

  Line rate:                    10 Gbps   40 Gbps   160 Gbps
  IPv4 routing table size:      100k      200k      360k
  Packet classification rules:  2000      5000      10000

Banit Agrawal, 11/14/2018
What Do Programmers Think?
- Low cost, low power, high capacity
- High bandwidth, but only for some access patterns
- (Diagram: network programmers → network system → memory (DRAM))
- So what is the problem?
DRAM Bank Conflicts
- Variable latency and variable throughput
- (Diagram: banks 0-3, each with its own row decoder, sense amplifier, and column decoder, sharing one address bus and one data bus; two banks shown busy)
- Bank interleaving hides the DRAM macro latency, but accesses that conflict on a busy bank reduce efficiency
- Worst case: every access conflicts
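The conflict behavior above can be sketched with a toy model (illustrative parameters, not from the talk): banks stay busy for a fixed latency after each access, so strided accesses pipeline cleanly while same-bank accesses serialize.

```python
# Toy DRAM model: BANKS and BANK_LATENCY are assumed values for
# illustration; real DRAM timing has many more parameters.
BANKS = 4
BANK_LATENCY = 15  # cycles a bank stays busy per access

def service_times(addresses):
    """Return the cycle at which each access's data is ready,
    issuing one request per cycle with simple bank interleaving."""
    busy_until = [0] * BANKS      # next free cycle for each bank
    ready = []
    for cycle, addr in enumerate(addresses):
        bank = addr % BANKS       # low bits select the bank
        start = max(cycle, busy_until[bank])
        busy_until[bank] = start + BANK_LATENCY
        ready.append(start + BANK_LATENCY)
    return ready

# Best case: accesses stride across banks and overlap.
print(service_times([0, 1, 2, 3]))    # → [15, 16, 17, 18]
# Worst case: every access maps to bank 0 and serializes.
print(service_times([0, 4, 8, 12]))   # → [15, 30, 45, 60]
```

The second run shows why the worst case matters: latency grows linearly with the number of conflicting requests.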
Prior Work
- Reducing bank conflicts in common access patterns
  - Prefetching and memory-aware layout [Lin-HPCA'01, Mathew-HPCA'00]
  - Reordering of requests [Hong-HPCA'99, Rixner-ISCA'00]
  - Vector processing domain [Espasa-Micro'97]
  - Good for desktop computing, but no guarantees for the worst case
- Reducing bank conflicts for special access patterns
  - Packet buffering: each packet is written once and read once
  - Low bank conflicts: optimizations including row-locality and scheduling [Hasan-ISCA'03, Nikologiannis-ICC'01]
  - No bank conflicts: reordering and clever memory management algorithms [Garcia-Micro'03, Iyer-StanTechReport'02]
  - Not applicable to arbitrary access patterns
Where Do Network Systems Stand?
- A spectrum from best effort to full determinism:
  - Best effort (co-operative): common-case optimized parts
  - Network systems: no exploitable deadline failures
  - Full determinism required: 0% deadline failures
Virtually Pipelined Memory
- Normalize the overall latency using randomization and buffering
- Deterministic latency for all accesses: a request entering the memory controller at time t returns from DRAM at time t + D
- Trillions of accesses without any bank conflicts, under any access pattern
Outline
- Memory for networking systems
- Memory controller
- Design analysis
- Hardware design
- How do we compare?
- Conclusion
Memory Controller
- (Diagram: per-bank controllers for banks 0, 2, and 3; a hash H of the incoming address selects a bank and per-bank key, e.g. 5 → 2,A; 6 → 0,F; 7 → 2,B; 8 → 3,A; a bus scheduler arbitrates the shared address and data buses so that a request at time t is answered at time t + D)
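One way to realize the address-to-bank mapping sketched in the diagram is a universal hash, so that no fixed traffic pattern can deterministically target a single bank. The constants below are illustrative assumptions, not the talk's design:

```python
# Universal-hash bank mapping (sketch): h(a) = ((M*a + C) mod P) mod B.
# P, M, C are assumed constants chosen for illustration only.
P = 2_147_483_647   # Mersenne prime larger than any address
M, C = 48_271, 11   # fixed, randomly chosen coefficients (0 < M < P)
B = 4               # number of banks

def map_address(addr):
    """Map an address to a (bank, per-bank key) pair."""
    h = (M * addr + C) % P
    return h % B, h // B   # low part picks the bank, rest is the key

for addr in (5, 6, 7, 8):
    bank, key = map_address(addr)
    print(f"address {addr} -> bank {bank}")
```

Because the mapping is fixed at configuration time, repeated lookups of the same address always reach the same bank, while an adversary who does not know M and C cannot construct a stream that always collides.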
Non-conflicting Accesses
- Bank latency (L): 15 cycles; normalized delay (D): 30 cycles
- Requests A, B, and C do not conflict; each one's data is ready exactly D cycles after the request
Redundant Accesses
- Bank latency (L): 15 cycles; normalized delay (D): 30 cycles
- Requests A, B, A, A, B: repeated requests for data already in flight do not cause new bank accesses; each copy's data is still ready at t + D
Conflicting Accesses
- Bank latency (L): 15 cycles; normalized delay (D): 30 cycles
- Requests A, B, C, D, E all conflict on one bank; once the queued work exceeds the D-cycle budget, a stall occurs
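The three timing cases above can be replayed with a small controller model. It assumes, as in the slides, L = 15 and D = 30, merges redundant requests for in-flight data, and flags a stall once a bank accumulates more than D cycles of queued work; this is a simplification of the real design, not its implementation.

```python
# Toy virtually pipelined bank: L and D follow the slides' examples.
L, D = 15, 30

def schedule(requests):
    """requests: list of (cycle, row). Returns (ready_times, stalled)."""
    in_flight = {}        # row -> time its buffered data expires
    backlog_done = 0      # cycle at which the bank finishes queued work
    ready, stalled = [], False
    for cycle, row in requests:
        if row in in_flight and in_flight[row] > cycle:
            ready.append(cycle + D)      # redundant: reuse buffered data
            continue
        start = max(cycle, backlog_done)
        backlog_done = start + L         # queue one bank access
        if backlog_done - cycle > D:
            stalled = True               # more than D cycles of work queued
                                         # (a real controller would stall input)
        in_flight[row] = cycle + D
        ready.append(cycle + D)
    return ready, stalled

print(schedule([(10, 'A'), (20, 'B'), (30, 'C')]))                       # no stall
print(schedule([(10, 'A'), (15, 'B'), (20, 'A')]))                       # merged, no stall
print(schedule([(10, 'A'), (15, 'B'), (20, 'C'), (25, 'D'), (30, 'E')])) # stalls
```

In every case each request's data is reported ready at exactly t + D; only the last stream, with five distinct rows hitting one bank back to back, overflows the delay budget.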
Implementing Virtually Pipelined Banks
- (Diagram) Per-bank structures:
  - Delay storage buffer: valid bit, address, increment/decrement reference count, data words, first-zero logic
  - Bank access queue: read/write flag, row id, scheduled-access address and data
  - Circular delay buffer (sets 0 and 1, in/out pointers): entries access[t-d] ... access[t]
  - Write buffer (FIFO): address and data words
  - Control logic connecting the interface address and data to memory
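The circular delay buffer in the diagram can be modeled as a D-entry ring indexed by cycle mod D: a result written at time t is read out exactly D cycles later, which is what makes the fixed t + D latency cheap to implement. A minimal sketch, not the RTL:

```python
# Minimal circular delay buffer: one slot per cycle of delay.
class CircularDelayBuffer:
    def __init__(self, d):
        self.d = d
        self.slots = [None] * d

    def tick(self, cycle, incoming=None):
        """Emit the entry scheduled for this cycle, then store
        `incoming` so it will be emitted at cycle + d."""
        idx = cycle % self.d
        out = self.slots[idx]
        self.slots[idx] = incoming
        return out

buf = CircularDelayBuffer(d=4)
outputs = [buf.tick(t, incoming=f"data{t}") for t in range(8)]
print(outputs)  # → [None, None, None, None, 'data0', 'data1', 'data2', 'data3']
```

The hardware needs no search logic: a single pointer advancing each cycle finds both the slot to emit and the slot to fill.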
Delay Storage Buffer Stall
- Mean time to stall (MTS): with B banks, 1/B is the probability that a request targets a given bank
- A stall happens when more than K accesses to one bank fall within an interval of D cycles
- Illustration: normalized latency (D) = 30 cycles, delay storage buffer entries (K) = 3
Delay Storage Buffer Stall
- Timeline: each request A-F increments the delay storage buffer occupancy (+1) on arrival, and each entry is released (-1) when its data is delivered
- MTS = log(1/2) / log(1 - C(D-1, K-1) * (1/B)^(K-1)) + D
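Plugging numbers into the MTS expression: treating a stall as K or more accesses to one bank landing in a D-cycle window, with per-access hit probability 1/B, gives a per-window stall probability of roughly C(D-1, K-1) * (1/B)^(K-1). The B, D, K values below are illustrative, not the talk's design point.

```python
import math

def mts(B, D, K):
    """Accesses until the cumulative stall probability reaches 1/2,
    plus the D cycles of the final window."""
    p = math.comb(D - 1, K - 1) * (1.0 / B) ** (K - 1)
    return math.log(0.5) / math.log(1.0 - p) + D

print(f"MTS = {mts(B=64, D=30, K=5):.0f} cycles")
```

Note how steeply MTS grows with K: each extra delay storage buffer entry multiplies the denominator of p by another factor of B, which is why modest buffering yields mean times to stall measured in seconds or hours.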
Markovian Analysis
- Bank access queue stall: state-based analysis
- With B banks, 1/B is the probability of an access to a given bank each cycle
- If more than D cycles of work are queued, a stall occurs
- Example: bank access latency (L) = 3, normalized delay (D) = 6
- (Diagram: states 1-6 of queued work; each advances toward the stall state with probability 1/B and holds with probability 1 - 1/B)
- MTS is reached when the probability of being in the stall state becomes 0.5
Markovian Analysis
- P = I * M^n, where I is the initial state vector and M is the one-step transition matrix
- Find n such that P(stall) = 50%
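The state-based analysis can be evaluated numerically by propagating the distribution over queued work one cycle at a time, rather than forming M^n explicitly. A sketch using the slide's L = 3 and D = 6, with an assumed bank count B = 32; the state encoding is a simplification chosen here for illustration.

```python
def mts_markov(B, L, D, limit=10**6):
    """Cycles until the cumulative stall probability reaches 0.5."""
    q = 1.0 / B
    dist = [0.0] * (D + 1)   # dist[w] = P(w cycles of work queued)
    dist[0] = 1.0
    p_stall = 0.0
    for n in range(1, limit + 1):
        new = [0.0] * (D + 1)
        for w, pw in enumerate(dist):
            if pw == 0.0:
                continue
            drained = max(w - 1, 0)          # one cycle of work retires
            new[drained] += pw * (1 - q)     # no new access this cycle
            if drained + L <= D:
                new[drained + L] += pw * q   # access adds L cycles of work
            else:
                p_stall += pw * q            # backlog would exceed D: stall
        dist = new
        if p_stall >= 0.5:
            return n
    return limit

print(mts_markov(B=32, L=3, D=6))
```

Because the stall state is absorbing, p_stall only grows, and the returned n is exactly the "find n such that P(stall) = 50%" step from the slide.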
Hardware Design and Overhead
- Verilog implementation
  - Verified using ModelSim and a C++ simulation model
  - Synthesized using Synopsys Design Compiler
- Hardware overhead tool
  - Uses Cacti parameters; one design point verified against the synthesized design
  - Yields optimal design parameters, e.g.:
    - MTS of 45.7 seconds with an area overhead of 34.1 mm^2 at 77% efficiency
    - MTS of 10 hours with an area overhead of 34 mm^2 at 71.4% efficiency
How Does VPNM Perform?
- Packet buffering: 35% less area; only head and tail pointers need to be stored per queue, so an arbitrarily large number of logical queues can be supported
- Packet reassembly: 35% less area, 10x less latency, 5x more queues

  Scheme        Line rate (Gbps)  Area (mm^2)  Total delay (ns)  Supported interfaces
  RADS [17]     40                10           53                130
  CFDS [12]     160               60           10000             850
  Our approach  -                 41.9         960               4096
Conclusion
- VPNM provides:
  - Deterministic latency: a request at time t completes at t + D, via randomization and normalization
  - Higher throughput, with a worst case that is impossible to exploit
  - Handles any access pattern
  - Ease of programmability and mapping: packet buffering, packet reassembly
Thanks for your attention. Questions?
http://www.cs.ucsb.edu/~arch/