Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems
Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*
* University of Minnesota, Twin Cities, USA
+ ShanghaiTech University, China
28th IEEE International Parallel & Distributed Processing Symposium
Heterogeneous Multicore System
[Figure: CPU, GPU, L2, and MEM connected by an Interconnection Network]
On-chip Traffic Characteristics
                      CPU                                   GPU
Traffic Pattern       Erratic, Random, Latency-sensitive    Streaming, Dedicated, Throughput-intensive
Switching Mechanism   Packet Switching                      Circuit Switching
NoCs must handle different traffic differently
Packet Switching vs. Circuit Switching: Performance Perspective
[Figure: two timing diagrams across Src node, Intm. node1, Intm. node2, Intm. node3, and Dest node. Packet-switched: data only; the total is the network delay. Circuit-switched: setup and ack (setup delay), then data (network delay). Legend: link traversal, router pipeline.]
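The two timelines can be compared with a back-of-the-envelope latency model. This is a minimal sketch, not the authors' model: the cycle counts, the shorter circuit-switched router traversal, and the assumption that setup costs one packet-switched round trip (setup plus ack) are all illustrative.

```python
def packet_latency(hops, router_cycles=4, link_cycles=1):
    """Packet switching: every hop pays the full router pipeline plus the link."""
    return hops * (router_cycles + link_cycles)


def circuit_data_latency(hops, cs_router_cycles=1, link_cycles=1):
    """Circuit switching: data crosses each router in fewer cycles (assumed)."""
    return hops * (cs_router_cycles + link_cycles)


def circuit_latency_per_transfer(hops, transfers):
    """Amortize the setup round trip (setup + ack, assumed to travel as
    packet-switched messages) over the transfers that reuse the path."""
    setup_delay = 2 * packet_latency(hops)
    return setup_delay / transfers + circuit_data_latency(hops)


if __name__ == "__main__":
    hops = 4  # Src -> three intermediate nodes -> Dest
    print(packet_latency(hops))
    print(circuit_latency_per_transfer(hops, transfers=1))
    print(circuit_latency_per_transfer(hops, transfers=10))
```

Under these assumed numbers a path used once is slower than packet switching, while a path reused many times amortizes its setup delay, which is why circuit switching suits streaming traffic to a fixed destination.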
Packet Switching vs. Circuit Switching: Energy Perspective
[Figure: packet-switched vs. circuit-switched router energy, highlighting Allocation & Arbitration]
Circuit-switched NoC: potentially energy efficient for certain traffic patterns
Packet Switching or Circuit Switching
– Packet Switching: Flexible, Scalable (advantages); Latency, Energy (drawbacks)
– Circuit Switching: Latency, Energy (advantages); Setup, Maintenance (drawbacks)
[Figure: traffic classified by Frequency (Regular, Erratic) and Destination (Fixed, Random), with regions labeled Packet Switching, Circuit Switching, Packet Switching]
NoC with both packet and circuit switching?
Multi-plane vs. Single-plane
[Figure: separate CS and PS planes vs. a combined PS+CS plane]
– Multi-plane: Independent packet-switched (PS) and circuit-switched (CS) planes
– Single-plane: Packet and circuit switching sharing the same communication fabric
– Increasing hardware requirement
– Low resource utilization
How can Packet and Circuit Switching share the same fabric?
Space-Division Multiplexing (SDM)
[Figure: a PS+CS channel shared by A, B, C, and D as sub-channels of 4 bits, 2 bits, and 1 bit]
Physically divide a channel into sub-channels
SDM suffers from packet serialization problem
K. Lusala et al., IJRC 2012
S. Secchi et al., DSD 2008
A. K. Lusala, ReCoSoC 2011
M. Modarressi et al., DATE 2009
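The serialization problem can be made concrete with a small calculation. This is a sketch under assumptions: the 8-bit full channel width, the 32-bit payload, and the width of D's sub-channel are illustrative (the slide lists only three widths for four flows).

```python
import math

# Sub-channel widths in bits. A, B, C follow the slide; D is assumed.
SDM_WIDTH_BITS = {"A": 4, "B": 2, "C": 1, "D": 1}
FULL_CHANNEL_BITS = 8  # assumed total channel width


def serialization_cycles(payload_bits, width_bits):
    """Cycles needed to push a payload through a link of the given width."""
    return math.ceil(payload_bits / width_bits)


if __name__ == "__main__":
    payload = 32  # hypothetical payload size in bits
    print("full channel:", serialization_cycles(payload, FULL_CHANNEL_BITS))
    for flow, width in SDM_WIDTH_BITS.items():
        print(flow, serialization_cycles(payload, width))
```

A narrower sub-channel stretches every transfer over more cycles, so the flows with the smallest share pay the largest serialization penalty.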
Time-Division Multiplexing (TDM)
[Figure: an 8-bit PS+CS channel shared by A, B, C, and D over time slots 0 to 7: slot 0 to D, slot 1 to C, slots 2 and 3 to B, slots 4 to 7 to A]
We propose TDM-based hybrid-switched NoC!
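The slot assignment in the figure can be written as a lookup table indexed by the cycle count. This is a minimal sketch; the function names are hypothetical, and the assumption is that the eight-slot schedule simply repeats.

```python
# Slot owners for slots 0..7, as read from the figure.
SLOT_OWNERS = ["D", "C", "B", "B", "A", "A", "A", "A"]


def owner_at(cycle):
    """Flow that owns the full-width channel in a given cycle."""
    return SLOT_OWNERS[cycle % len(SLOT_OWNERS)]


def bandwidth_share(flow):
    """Fraction of the channel bandwidth a flow receives over one period."""
    return SLOT_OWNERS.count(flow) / len(SLOT_OWNERS)


if __name__ == "__main__":
    print([owner_at(c) for c in range(10)])
    print({f: bandwidth_share(f) for f in "ABCD"})
```

Each flow gets the same bandwidth share as a proportional space-division split would give it, but it uses the whole channel width during its slots, so a flit is not serialized over a narrow sub-channel.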
Outline
– Introduction
– Design
– TDM-based Hybrid-switching NoC
– Optimizations for Hybrid Switching
– Conclusion
Hybrid-switched Router
[Figure: router with Input 1 to Input n, each with a packet-switched path, a circuit-switched path, and a Slot Table; Routing Logic, VC Allocator, SW Allocator, and Crossbar; Output 1 to Output n]
Packet-switched Pipeline: BW, RC, VA, SA, ST
Circuit-switched Pipeline: HP, ST
Circuit-switched Path Setup
[Figure: routers R0 to R5; a circuit-switched (CS) path through R0, R1, R2, R3 with time slots t0 to t7]
– Set up the path before transmission
– Setup messages are sent through the packet-switched network
– Acknowledge the source upon successful setup
– Keep time-slot assignment in Slot Tables
Slot Table Configuration Walkthrough
[Figure: a slot table with rows in_1 and in_2 and columns s0 to s3, each entry holding a valid bit (v) and an output port (out), shown in four states ① to ④]
– setup 1 (succeed): in_1 → out_4, slot_id = 2, duration = 2
– setup 2 (fail): in_1 → out_3, slot_id = 3, duration = 1
– teardown 1: in_1 → out_4, slot_id = 2, duration = 2
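The walkthrough can be replayed with a small model of one router's slot table. This is a sketch, not the authors' hardware: the class and method names are hypothetical, and two behaviors are assumptions (a reservation wraps around the end of the table, and a setup also fails when another input already drives the same output in that slot).

```python
class SlotTable:
    """One row per input port, one column per time slot."""

    def __init__(self, in_ports, n_slots):
        self.n_slots = n_slots
        # Each entry is [valid, out_port]; valid == 0 means the slot is free.
        self.rows = {p: [[0, None] for _ in range(n_slots)] for p in in_ports}

    def _slots(self, slot_id, duration):
        # Assumption: reservations wrap around the end of the table.
        return [(slot_id + i) % self.n_slots for i in range(duration)]

    def setup(self, in_port, out_port, slot_id, duration):
        slots = self._slots(slot_id, duration)
        for s in slots:
            if self.rows[in_port][s][0]:
                return False  # this input is already reserved in slot s
            for row in self.rows.values():
                if row[s][0] and row[s][1] == out_port:
                    return False  # assumed check: output already claimed
        for s in slots:
            self.rows[in_port][s] = [1, out_port]
        return True

    def teardown(self, in_port, out_port, slot_id, duration):
        for s in self._slots(slot_id, duration):
            entry = self.rows[in_port][s]
            if entry[0] and entry[1] == out_port:
                entry[0] = 0  # clear only the valid bit


if __name__ == "__main__":
    table = SlotTable(["in_1", "in_2"], n_slots=4)
    print(table.setup("in_1", "out_4", slot_id=2, duration=2))  # setup 1
    print(table.setup("in_1", "out_3", slot_id=3, duration=1))  # setup 2
    table.teardown("in_1", "out_4", slot_id=2, duration=2)      # teardown 1
    print(table.rows["in_1"])
```

Setup 2 fails because setup 1 already holds slots 2 and 3 of in_1. Teardown clears the valid bit only, which matches the figure's final state where out_4 remains in the entry with v = 0.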
Slot Table Size
– Smaller slot table: less energy overhead, shorter packet waiting time, coarser-grain multiplexing
– Larger slot table: more energy overhead, longer packet waiting time, finer-grain multiplexing
[Figure: a slot table at its initial (reset) size vs. after more requests, with active and inactive entries]
Slot table size should be adjusted dynamically
Circuit-Switched Path Exclusiveness
[Figure: a Slot Table (s0 to s7, with valid and out fields) sending configuration signals to the Crossbar alongside the SW Allocator; outputs out_1, out_2 (PS), out_3 (PS)]
– The crossbar must be configured before a circuit-switched flit's arrival.
– A time slot is wasted if no circuit-switched flit is present.
– Reserved slots are exclusively occupied by circuit-switched paths.
Time-slot Stealing
[Figure: Slot Table (valid, out) with Line Address Decoder, VC Allocator, SW Allocator, and Crossbar; a CS flit enable signal arrives from the upstream router, and configuration signals go to the crossbar]
Enable path reuse between packet- and circuit-switched data paths
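One way to read the stealing rule is as a per-cycle grant decision at the crossbar. This is a minimal sketch with hypothetical signal names; it assumes the upstream router's CS flit enable tells the router in advance whether a circuit-switched flit will actually use the reserved slot.

```python
def crossbar_grant(slot_reserved, cs_enable, ps_request, stealing=True):
    """Decide who uses the crossbar output in this time slot.

    slot_reserved: the slot table entry for this slot is valid
    cs_enable:     upstream router signals that a CS flit is coming
    ps_request:    a packet-switched flit is waiting for this output
    Returns "CS", "PS", or None (slot idle).
    """
    if slot_reserved and cs_enable:
        return "CS"  # the reserved slot is used as intended
    if slot_reserved and not stealing:
        return None  # exclusive reservation: the slot is wasted
    return "PS" if ps_request else None


if __name__ == "__main__":
    print(crossbar_grant(True, True, True))
    print(crossbar_grant(True, False, True, stealing=False))
    print(crossbar_grant(True, False, True))
```

With stealing disabled, a reserved slot without a circuit-switched flit stays idle; with stealing enabled, a waiting packet-switched flit reuses it.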
Hybrid-switched Network
Path Setup
– Endpoint Selection: Frequent communication pairs
– Route Selection: Adaptive Routing (routing decision is made based on the utilization of slot tables in neighbor routers)
Switching Decision
– Referring to packet slack*
* J. Yin et al., ISLPED 2012
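The two decisions on this slide can be sketched as small functions. The slide does not give the exact policies, so everything here is an assumption: route selection picks the candidate neighbor with the lowest slot-table utilization, and a packet takes the circuit only if its slack covers the wait for the next reserved slot. All names are hypothetical.

```python
def pick_next_hop(candidates, slot_utilization):
    """Adaptive route selection: prefer the neighbor whose slot table
    has the lowest fraction of reserved slots (assumed policy)."""
    return min(candidates, key=lambda router: slot_utilization[router])


def cycles_until_reserved_slot(cycle, reserved_slots, n_slots):
    """Wait, in cycles, until the next slot reserved for this path."""
    return min((s - cycle) % n_slots for s in reserved_slots)


def use_circuit(slack, cycle, reserved_slots, n_slots):
    """Switching decision (assumed policy): circuit-switch a packet only
    if waiting for the reserved slot fits within the packet's slack."""
    return cycles_until_reserved_slot(cycle, reserved_slots, n_slots) <= slack


if __name__ == "__main__":
    print(pick_next_hop(["east", "north"], {"east": 0.75, "north": 0.25}))
    print(cycles_until_reserved_slot(cycle=5, reserved_slots=[2, 3], n_slots=8))
    print(use_circuit(slack=6, cycle=5, reserved_slots=[2, 3], n_slots=8))
    print(use_circuit(slack=3, cycle=5, reserved_slots=[2, 3], n_slots=8))
```

Under this reading, latency-critical packets with little slack stay on the packet-switched path, while packets that can tolerate the wait use the reserved circuit.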
Full System Evaluation
Platform
[Figure: tiled system; each tile is a CPU Core / GPU SM / L2 Cache / MC attached to a router (R)]
Benchmarks
– CPU: ammp, applu, art, equake, gafort, mgrid, swim, wupwise
– GPU: blackscholes, lps, lib, nn, hotspot, pathfinder, sto
Performance Evaluation
[Figure: CPU and GPU performance; CPU ↑ 0.3%, GPU ↑ 4.1%]
– GPU performance is improved
– CPU performance impact is negligible
Network Energy Evaluation
[Figure: network energy; 6.3% saving]
Overall: Basic Hybrid-switched NoC
[Figure: CPU Speedup, GPU Speedup, Network Energy]
– 0.3% CPU performance improvement
– 4.1% GPU performance improvement
– 6.3% Network energy reduction
Can we do better?
Outline
– Introduction
– Design
– TDM-based Hybrid-switching NoC
– Optimizations for Hybrid Switching
– Conclusion
Opportunity: Low Path Utilization
Circuit-switched paths are underutilized
– Large number of overlapped circuit-switched paths
– Circuit-switched paths are not fully utilized
– Waste of on-chip resources (slot tables)
[Figure: overlapped paths]
Optimization: Path Sharing
[Figure: hitchhiker-sharing (labels: Circuit-switched Path, Sources) and vicinity-sharing (labels: Circuit-switched Path, Destinations)]
– Hitchhiker-sharing
– Vicinity-sharing
Enable path reuse among circuit-switched data paths
Performance Evaluation
[Figure: CPU and GPU performance; CPU ↑ 0.3%, ↑ 0.2%; GPU ↑ 4.1%, ↑ 3.7%]
Network Energy Evaluation
[Figure: network energy; 6.3% saving, 9.0% saving]
Can we do EVEN better?
Opportunity: Lower Buffer Pressure
Percentage of flits that are circuit-switched
GPU benchmark    Circuit-switched flits (%)
Blackscholes     55.7
Hotspot          29.1
Lib              34.4
Lps              55.0
Nn               38.9
Pathfinder       49.1
Sto              18.5
Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.
Optimization: Aggressive Power-gating
[Figure: router input port (Input 1) with packet-switched and circuit-switched paths and a Slot Table; components marked active or inactive]
– Circuit switching some of the packets alleviates buffer pressure and facilitates more aggressive power gating.
– Reduce dynamic and leakage power dissipation
Performance Evaluation
[Figure: CPU and GPU performance; CPU ↑ 0.3%, ↑ 0.2%, ↓ 1.6%; GPU ↑ 4.1%, ↑ 3.7%, ↑ 2.6%]
Network Energy Evaluation
[Figure: network energy; 6.3% saving, 9.0% saving, 17.1% saving]
Energy saving is significant
Overall
[Figure: CPU Speedup, GPU Speedup, Network Energy]
– 1.6% CPU performance degradation
– 2.6% GPU performance improvement
– 17.1% Network energy reduction
Conclusion
TDM-based Hybrid-switched Network
– TDM is an efficient way to enable on-chip resource sharing
– Hybrid-switched NoC handles different traffic differently
– Performance
– Energy efficiency
– Scalability (in paper)