Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim Department of Computer Science and Engineering Texas A&M University
Multi-Core Wave & Networks-On-Chip: Uniprocessors hit the power wall. Multiprocessors provide high performance at a lower power budget. Shared-bus architectures have limited scalability. Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors. MIT Raw (0.18 µm, 300 MHz): 16-core chip, four 4x4 mesh networks. Intel Polaris (65 nm, 4 GHz): 80-core chip, 8x10 mesh network. First, let's look at two changes in our processor design. Lei Wang - NOCS 2009
Challenges in On-Chip Communication: High performance: low communication latency is critical for high system performance. Bandwidth efficiency: well-designed routing algorithms provide high network throughput. Power and area constraints: simple topologies and slim routers reduce communication power consumption and save chip area. Efficient multicast support: cache coherence protocols rely heavily on multicast or broadcast communication. We propose a bandwidth-efficient routing scheme for multicast communication in NOCs with low latency and power consumption.
Prior Work in Multicast Communication: Routing Evaluation Criteria for Multicast Communication [Ni93]: multicast in multicomputer systems. Tree-based Multicast Routing for DSM Multiprocessors [Torrellas96]: short-message multicast in DSM systems. Virtual Circuit Tree Multicasting for NOCs [Lipasti08]: demonstrates the necessity of on-chip multicasting and proposes table-based multicast routing. Region-based Multicast for CMPs [Duato08]: multicast routing for irregular topologies in CMPs.
Outline: Motivation; Multicast Router Design (State-of-the-Art Unicast Router Architecture, Replication Schemes, Destination List Management); Recursive Partitioning Multicast (RPM) (Network Partitioning, Routing Rules, Example, Deadlock Avoidance); Evaluation; Conclusion
Different Bandwidth Usage Example (figure: two 4x4 meshes, nodes 0 to 15, each showing one source and its destinations). The left path requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals. The right path requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.
State-of-the-Art Wormhole Unicast Router: Per-hop pipeline: RC -> VA -> SA -> ST (router), then LT (link). RC: Route Computation; VA: VC Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal
What do we need in a multicast router? Packet replication: synchronous replication or asynchronous replication. Destination list management: all-destination encoding, bit string encoding, or multiple-region broadcast encoding.
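Bit string encoding can be illustrated with a short sketch. The slides do not specify the bit layout, so the mapping below (bit i set means node i is a destination) and the helper names `encode_destinations` and `decode_destinations` are assumptions made for illustration only.

```python
def encode_destinations(dest_ids, num_nodes):
    """Pack a destination list into one integer used as a bit string.

    Assumed layout (not taken from the slides): bit i is 1 when node i
    is a destination of the multicast packet.
    """
    bits = 0
    for d in dest_ids:
        if not 0 <= d < num_nodes:
            raise ValueError(f"node id {d} is outside 0..{num_nodes - 1}")
        bits |= 1 << d
    return bits


def decode_destinations(bits):
    """Recover the sorted list of destination node ids from the bit string."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


if __name__ == "__main__":
    header = encode_destinations([1, 5, 10], num_nodes=16)
    print(format(header, "016b"))
    print(decode_destinations(header))
```

The header size in this layout grows with the number of nodes rather than the number of destinations, which is the usual trade-off of a bit string against an explicit destination list.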
Synchronous Replication (timing diagram: head (H), middle (M), and tail (T) flits across input and output ports 0 to 3, over time in cycles). Packet replication happens at the Switch Traversal stage.
Asynchronous Replication (timing diagram: head (H), middle (M), and tail (T) flits across input and output ports 0 to 3, over time in cycles).
Network Partitioning (figure: a source node with its N, W, E, and S directions dividing the mesh into eight parts; four further cases with three parts each: (5, 6, 7), (0, 1, 7), (3, 4, 5), and (1, 2, 3)).
Basic Routing Rules: North: top right corner. West: top left corner. South: bottom left corner. East: bottom right corner. (figure: source, destinations, and the N, W, E, or S output port chosen for each)
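The basic rules can be sketched as a recursive forwarding function. This is a minimal illustration under stated assumptions, not the authors' router logic: nodes are (x, y) tuples with x growing eastward and y growing northward, destinations straight along an axis use that axis port, and the names `output_port` and `rpm_multicast` are hypothetical. Optimized rules, flits, and virtual networks are not modeled.

```python
from collections import defaultdict

# Unit step taken when leaving through each output port.
# Assumed coordinates: x grows eastward, y grows northward.
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def output_port(cur, dst):
    """Choose the port for one destination under the basic rules:
    top-right -> North, top-left -> West, bottom-left -> South,
    bottom-right -> East; axis-aligned destinations go straight."""
    cx, cy = cur
    dx, dy = dst
    if dst == cur:
        return "LOCAL"
    if dy > cy:
        return "N" if dx >= cx else "W"
    if dy == cy:
        return "W" if dx < cx else "E"
    return "S" if dx <= cx else "E"


def rpm_multicast(cur, dests):
    """Re-partition the destination list at every hop and forward one
    copy per used port. Returns (delivered nodes, link traversals)."""
    groups = defaultdict(list)
    for d in dests:
        groups[output_port(cur, d)].append(d)
    delivered, links = set(), 0
    for port, group in groups.items():
        if port == "LOCAL":
            delivered.update(group)  # eject at this node
            continue
        nxt = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
        sub_delivered, sub_links = rpm_multicast(nxt, group)
        delivered |= sub_delivered
        links += 1 + sub_links  # one shared link plus the subtree
    return delivered, links


if __name__ == "__main__":
    src = (1, 1)
    dests = [(2, 3), (3, 3), (3, 2)]
    delivered, links = rpm_multicast(src, dests)
    unicast = sum(abs(x - src[0]) + abs(y - src[1]) for x, y in dests)
    print(sorted(delivered), links, unicast)
```

Because destinations that share an output port travel as one copy until their paths diverge, the link count of the multicast tree never exceeds the sum of the separate unicast path lengths.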
Optimized Routing Rules (figure legend: source, destination; annotation: Deadlock!)
RPM Example, step 1 (figure legend: multicast packet, source, destination, partitioning)
RPM Example, step 2 (figure legend: multicast packet, source, destination, partitioning, ejection)
RPM Example, step 3 (figure legend: multicast packet, source, destination, partitioning)
RPM Example, step 4 (figure legend: multicast packet, source, destination, partitioning, ejection)
RPM Example, step 5 (figure legend: multicast packet, source, destination, partitioning, ejection)
Deadlock Avoidance: RPM has no turn restrictions, which can introduce deadlock. We use virtual networks (VNs) to avoid deadlock. Two VNs share the same physical network. The virtual channels of each port are divided equally between the two virtual networks. The virtual network ID (0 or 1) of each packet is decided at the source. (figure: the 4x4 mesh, nodes 0 to 15, shown once for Virtual Network 0 and once for Virtual Network 1)
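The equal division of virtual channels can be sketched as follows. The slides state only that the VCs of each port are split equally between two VNs and that the VN ID is fixed at the source; the contiguous assignment of VC indices and the names `vc_ranges` and `allowed_vcs` are assumptions made for illustration. How the source picks the VN ID is not modeled here.

```python
def vc_ranges(vcs_per_port=4, num_vns=2):
    """Split a port's virtual channels equally among virtual networks.

    Assumption (not from the slides): each VN owns a contiguous block
    of VC indices, e.g. VN 0 -> [0, 1] and VN 1 -> [2, 3] for 4 VCs.
    """
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs per port must divide evenly among the VNs")
    per_vn = vcs_per_port // num_vns
    return {vn: list(range(vn * per_vn, (vn + 1) * per_vn))
            for vn in range(num_vns)}


def allowed_vcs(packet_vn, vcs_per_port=4, num_vns=2):
    """VCs a packet may request at any port; the VN ID is chosen once
    at the source and stays the same for the packet's whole route."""
    return vc_ranges(vcs_per_port, num_vns)[packet_vn]


if __name__ == "__main__":
    print(vc_ranges())
    print(allowed_vcs(1))
```

Since a packet never requests a VC outside its own VN, the two virtual networks behave as independent networks that happen to share the same links.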
Evaluation Methodology
Performance model: cycle-accurate network simulator; models all router pipeline stages in detail; highly parameterized.
Power model: Orion, with both dynamic and leakage power models.
Network configuration:
Topology: 8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
Routing: RPM
VC/Port: 4
VC Depth: [value missing]
Packet Length (flits): [value missing]
Unicast Traffic Pattern: Uniform Random (Bit Complement, Transpose)
Multicast Packet Portion: 10% (5%, 20%, 40%, 80%)
Multicast Destination Number: 0 to 16 (uniformly distributed)
Uniform Random Traffic: Latency improves by around 50% before network saturation. Network throughput is extended by 40%.
Link Utilization: Under low workload, RPM reduces link utilization by 33%. Under high workload, RPM reduces link utilization by 45%.
Dynamic Power Consumption 50% 40%
Scalability Study-Network Size Over 50%
Scalability Study-Multicast Traffic Portion
Scalability Study-Destination Number
Conclusion: We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM), that is bandwidth-efficient and scalable. Performance improvement: up to 50% latency reduction; 33% link utilization reduction. Power savings: up to 40% total dynamic power savings; 25% crossbar and link power savings.
Thank you!
Backup
Hardware Implementation of Routing Logic
Bit Complement Traffic
Transpose Traffic