Efficient Interconnects for Clustered Microarchitectures Joan-Manuel Parcerisa Antonio González Universitat Politècnica de Catalunya – Barcelona, Spain {jmanel,antonio}@ac.upc.es Julio Sahuquillo José Duato Universitat Politècnica de València – València, Spain {jsahuqui,jduato}@disca.upv.es
Why Clustered Microarchitectures: Larger issue width, window length, and predictor sizes bring more complexity, hence more latency and power. Even worse, wire delays do not scale across technologies. Deeper pipelines leave fewer logic levels per stage, so tight loops (e.g. issue logic, bypass) are difficult to fit in a single cycle. Partitioning critical structures attacks both problems, e.g. clustered microarchitectures.
A Typical Clustered uArch: Partitioned processor core; instructions dynamically steered. Each cluster: RF, IQ, FUs (faster issue, read, bypass). Inter-cluster communications go through slow interconnects and take 1 cycle or more. Steering must maximize communication locality. [Figure: Fetch/Decode/Rename and Steering Logic feeding clusters C0 to C3, each with a local I-Queue, a local Register File and FUs, joined by an interconnection network (ICN)]
Motivation: The ICN is a critical part of the architecture; performance is very sensitive to communication latency! ICNs assumed by previous works: cross-bar (does not scale), ring (simple, but long delays), idealized. Our proposals: several point-to-point ICNs for 4 and 8 clusters (implementable, simple and efficient) and a topology-aware steering scheme.
Outline Clustered architecture Topology-aware steering Proposed Interconnects Experimental results Summary and conclusions
Our Assumed Clustered uArch Distributed RF Results only written to local RF Values are communicated with copy instructions Automatically inserted Each copy creates a new instance Rename Table tracks locations of multiple instances
Communication Timing [Figure: pipeline timing (F, D, Ex, WB) of the instruction R1 := R2 + R2 (to C1) and of the copy R1 C1->C2 (to C1), with the labels "Wait for R1", "Wakeup signals" and "ICN delay"]
Baseline Steering Scheme (dependence-based): 1. Minimize communication penalty: if all source operands are available, select the clusters that minimize the number of communications; if any source operand is not available, select the producer cluster. 2. Maximize workload balance: choose the least loaded of the clusters selected by rule 1. One exception: if the workload imbalance exceeds a threshold, ignore rule 1.
Topology-Aware Steering Scheme Also minimize distance Change part of rule 1: If all source operands are available: Baseline: “Select clusters that minimize # communications” Topology-aware: “Select clusters that minimize the longest communication distance”
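The two variants of rule 1 can be compared with a minimal Python sketch. Everything here is illustrative rather than the authors' implementation: the function names, the representation of operand locations as sets of cluster ids, the definition of imbalance as the difference between the most and least loaded cluster, and the tie-break by lowest cluster id are all assumptions.

```python
def ring_distance(a, b, n=8):
    """Hop count between clusters a and b on a bidirectional ring (assumed topology)."""
    d = abs(a - b) % n
    return min(d, n - d)


def rule1_candidates(ready_locs, pending_producers, n_clusters, distance, topology_aware):
    """Rule 1. ready_locs holds, per available operand, the set of clusters with an instance."""
    clusters = range(n_clusters)
    if pending_producers:              # some operand not yet available: go to its producer
        return set(pending_producers)
    if not ready_locs:                 # no register sources: any cluster will do
        return set(clusters)
    if topology_aware:
        def cost(c):                   # longest communication distance
            return max(min(distance(src, c) for src in locs) for locs in ready_locs)
    else:
        def cost(c):                   # number of communications
            return sum(1 for locs in ready_locs if c not in locs)
    best = min(cost(c) for c in clusters)
    return {c for c in clusters if cost(c) == best}


def steer(ready_locs, pending_producers, load, distance, threshold, topology_aware):
    n = len(load)
    if max(load) - min(load) > threshold:      # exception: ignore rule 1 (assumed metric)
        candidates = set(range(n))
    else:
        candidates = rule1_candidates(ready_locs, pending_producers, n,
                                      distance, topology_aware)
    return min(candidates, key=lambda c: (load[c], c))   # rule 2: least loaded


operands = [{0}, {2}]     # one operand lives in cluster 0, the other in cluster 2
idle = [0] * 8
print(steer(operands, [], idle, ring_distance, 4, topology_aware=False))  # 0
print(steer(operands, [], idle, ring_distance, 4, topology_aware=True))   # 1
```

In this example the baseline picks cluster 0 (one communication, but over 2 hops), whereas the topology-aware rule picks cluster 1 (two communications, each over 1 hop).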
Design Issues: Bandwidth. Each additional input bypass path costs 1 tag across the IQ, 1 RF write port and 1 entry to the FU input MUXes, and it increases the wakeup and bypass delays. Bandwidth requirements are rather low: 1 input bypass path per cluster (1 RF write port) and 2 links per connected cluster pair. [Figure: cluster and router]
Design Issues: Latency Performance very sensitive to communication latency Simple routing structures and algorithms Source routing No intermediate buffering In-transit messages have priority over newly injected ones
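A minimal Python sketch of one router cycle under these rules follows. The `Message` class, the port names and the `route_cycle` function are hypothetical, and the sketch assumes that two in-transit messages never ask for the same output link in the same cycle.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    tag: str
    route: list = field(default_factory=list)  # remaining output ports, fixed at the source


def route_cycle(arriving, to_inject):
    """Return (link_use, delivered, stalled) for one cycle of one router."""
    link_use, delivered, stalled = {}, [], []
    for msg in arriving:                 # in-transit messages are served first
        if not msg.route:                # empty route: this cluster is the destination
            delivered.append(msg)
            continue
        port = msg.route.pop(0)          # source routing: next hop is read from the message
        assert port not in link_use, "assumed: no conflict among in-transit messages"
        link_use[port] = msg
    for msg in to_inject:                # new messages only get links that are still free
        port = msg.route[0]
        if port in link_use:
            stalled.append(msg)          # no buffering in the router: wait at the source
        else:
            msg.route.pop(0)
            link_use[port] = msg
    return link_use, delivered, stalled


passing = Message("r1", ["right"])
blocked = Message("r7", ["right"])
free = Message("r9", ["left", "left"])
links, done, waiting = route_cycle([passing], [blocked, free])
print(sorted(links), [m.tag for m in waiting])   # ['left', 'right'] ['r7']
```

Because a stalled message stays in its source cluster, the router itself needs no storage for messages that are only passing through.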
Design Issues: Connectivity Assumed 1-cycle communication delay between adjacent clusters Number of “adjacents” dictated by technology and layout Study topologies with different connectivity degrees
Design Issues: Point-to-point vs Buses Point-to-point advantages Access to links is arbitrated locally Wires are shorter and less loaded Shared buses are studied for comparison
Interconnects for 4 clusters (I): Bus2. 1 bus per cluster, each connected to 1 write port. Latency = 4 cycles (2 for arbitration + 2 for transmission); arbitration overlaps with transmission. [Figure: clusters C0, C1, C2, C3]
Interconnects for 4 clusters (II): Synchronous Ring. Injection rules prevent two messages from arriving at a cluster at once. Even cycles: inject 1-hop messages (or forward in-transit ones) counter-clockwise and 2-hop messages clockwise. Odd cycles: reverse the directions. No conflict! [Figure: ring link usage in even and odd cycles]
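The injection rules can be checked with a small Python simulation. This is a sketch under stated assumptions, not the authors' simulator: each hop takes 1 cycle, each cluster offers at most one 1-hop and one 2-hop message per cycle, traffic is random, and a 1-hop message that finds its link taken by an in-transit message is simply dropped here (in the real design it would wait at the source).

```python
import random

N = 4
CW, CCW = +1, -1


def injection_direction(cycle, hops):
    """Direction in which a message of the given length may be injected this cycle."""
    even = cycle % 2 == 0
    if hops == 1:
        return CCW if even else CW
    return CW if even else CCW


def max_simultaneous_arrivals(cycles, seed=0):
    """Largest number of messages reaching their destination cluster in one cycle."""
    rng = random.Random(seed)
    in_transit = []                      # (node, direction): 2-hop messages with one hop left
    worst = 0
    for cycle in range(cycles):
        busy = set()                     # (node, direction) output links used this cycle
        arrivals = [0] * N
        for node, d in in_transit:       # in-transit messages have priority
            assert (node, d) not in busy
            busy.add((node, d))
            arrivals[(node + d) % N] += 1
        in_transit = []
        d1 = injection_direction(cycle, 1)
        d2 = injection_direction(cycle, 2)
        for node in range(N):
            if rng.random() < 0.5 and (node, d1) not in busy:   # 1-hop injection
                busy.add((node, d1))
                arrivals[(node + d1) % N] += 1
            if rng.random() < 0.5:                              # 2-hop injection
                assert (node, d2) not in busy                   # this link is always free
                busy.add((node, d2))
                in_transit.append(((node + d2) % N, d2))
        worst = max(worst, max(arrivals))
    return worst


print(max_simultaneous_arrivals(1000))   # 1
```

Under these assumptions every message that ends its trip in a given cycle travels in the same direction, so each cluster receives it over a single incoming link and one RF write port is enough.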
Interconnects for 4 clusters (III): Partially Asynchronous Ring. Messages may issue in any cycle, so 2 messages may arrive at once; this is handled with small input queues. [Figure: ring of clusters c0, c1, c2, c3 with input queues]
Interconnects for 4 clusters (IV) Ideal Ring Contention-free unlimited number of links unlimited number of RF write ports For comparison purposes (upper-bound performance)
Interconnects for 8 Clusters (I) Buses Analogous to those for 4 clusters Bus2: same latency (optimistic): 2+2 cycles Bus4: twice the latency (realistic): 4+4 cycles Rings Synchronous and Asynchronous Max. Distance = 4 hops (average 2.29 hops)
Interconnects for 8 Clusters (II): Mesh. Max. distance = 4 hops (average = 2 hops). 2 in-transit messages may compete for the same output link: constrained connectivity. Only for last hop of messages. [Figure: router with Left, Right and Top links attached to the cluster datapath]
Interconnects for 8 Clusters (III) Torus Max. distance = 3 hops Same connectivity constraints as the mesh Only for last hop of messages
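The hop counts quoted for the 8-cluster topologies can be reproduced with a breadth-first search over each graph. The Python sketch below assumes the mesh and torus are laid out as 2 rows of 4 clusters, which the slides do not state explicitly; the helper names are illustrative.

```python
from collections import deque


def ring(n):
    """Bidirectional ring of n clusters."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(rows, cols, wrap):
    """Mesh (wrap=False) or torus (wrap=True) of rows x cols clusters."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = set()
            for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                rr, cc = r + dr, c + dc
                if wrap:
                    rr, cc = rr % rows, cc % cols
                if 0 <= rr < rows and 0 <= cc < cols and (rr, cc) != (r, c):
                    nbrs.add(rr * cols + cc)
            adj[r * cols + c] = nbrs
    return adj


def hop_stats(adj):
    """Return (maximum, average) hop distance over all ordered pairs of distinct nodes."""
    dists = []
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dists.extend(d for node, d in seen.items() if node != src)
    return max(dists), sum(dists) / len(dists)


for name, graph in (("ring", ring(8)),
                    ("mesh", grid(2, 4, wrap=False)),
                    ("torus", grid(2, 4, wrap=True))):
    longest, average = hop_stats(graph)
    print(name, longest, round(average, 2))
```

With this layout the ring and mesh figures match the ones on the slides (4 hops maximum, 2.29 and 2 hops on average), and the torus maximum comes out at 3 hops.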
Interconnects for 8 Clusters (IV) Ideal Torus Contention-free unlimited number of links unlimited number of RF write ports For comparison purposes (upper-bound performance)
Router Structures. Common features to all ICNs: no intermediate buffering. Partially asynchronous ICNs: competition for a write port, so add small input queues. Topologies with 3 adjacent nodes: competition for the same output link, so constrained connectivity. [Figure: router with Top, Left and Right links and an input queue (Qin) attached to the cluster datapath]
Experimental Setup Simulation Architecture Extended version of sim-outorder (SimpleScalar v3.0) 14 Mediabench programs Compiled with –O4 for an Alpha AXP Architecture L1 D-cache: 64KB, 2-way, 3-cycle hit 128 ROB, 64 LSQ Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.
Performance: 4 Clusters Poor performance of Bus2 Asynchronous Ring Better than Synchronous Ring Close to Ideal (within 1%)
Synchronous / Asynchronous: Contention delays. Lower for the Async. Ring: a message issues as soon as the link is available. Higher for 1-hop messages (a single path); in the Sync. Ring they can issue only 1 cycle out of every 2.
Length of Input Queues: Max. observed occupancy < 9 entries. Handle overflows by flushing the pipeline, rather than including complex control flow.
Sample statistics (djpeg); columns: # occupied entries, # messages, Distribution (% times): 1327534 96.20 1 47136 3.42 2 4807 0.35 3 484 0.04 4 26 5 >=6
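The percentage column of the djpeg sample can be recomputed from the message counts. The Python sketch below uses only the five counts that survive on the slide and assumes they account for all messages; the counts for the emptiest rows are missing, so the total is approximate.

```python
# Message counts as listed on the slide, in their original order.
counts = [1327534, 47136, 4807, 484, 26]

total = sum(counts)
percentages = [100.0 * c / total for c in counts]

for count, pct in zip(counts, percentages):
    print(f"{count:>8}  {pct:6.2f}%")
```

The first four values round to the 96.20, 3.42, 0.35 and 0.04 shown on the slide, and the last one is below 0.01%.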
Performance: 8 Clusters Poor performance of buses Connectivity degree has a significant impact Asynchronous Torus close to Ideal (within 1.5%)
Topology-Aware Steering 16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)
Summary An efficient topology-aware steering scheme Cluster point-to-point interconnects For 4 clusters and 8 clusters Designed to minimize complexity and latency Compared to Bus-based models Idealized models with unlimited bandwidth
Conclusions: The choice of ICN is crucial for performance. Point-to-point is better than buses. Asynchronous rings are better than synchronous ones. Asynchronous interconnects perform close to ideal with minimal complexity. Higher connectivity significantly improves performance. Topology-aware steering is essential to reduce latency, especially with many clusters.
The main conclusion is that the choice of interconnect is key for performance. We have found that point-to-point interconnects outperform bus-based models, and that partially asynchronous rings outperform synchronous rings, because the issue rules of the latter excessively constrain the available bandwidth. We also found that partially asynchronous interconnects perform close to an ideal interconnect with unlimited bandwidth, despite having minimal complexity (just 1 RF write port required), and they do not require complex control flow, just a tiny queue. The three topologies studied (ring, mesh and torus) differ in their connectivity degree; in that respect, we have shown that higher connectivity significantly improves performance. Finally, we have found that the topology-aware steering scheme is essential to reduce the latency of communications, and that its impact on performance grows with the number of clusters.