Efficient Interconnects for Clustered Microarchitectures

Joan-Manuel Parcerisa, Antonio González
Universitat Politècnica de Catalunya – Barcelona, Spain
{jmanel,antonio}@ac.upc.es
Julio Sahuquillo, José Duato
Universitat Politècnica de València – València, Spain
{jsahuqui,jduato}@disca.upv.es

Why Clustered Microarchitectures
- Larger issue width, window length, predictor sizes
  - More complexity → more latency and power
- Even worse: wire delays do not scale across technologies
- Deeper pipelines, fewer logic levels per stage
  - Tight loops difficult to fit in a single cycle
  - E.g. issue logic, bypass
- Partitioning critical structures attacks both problems
  - E.g. clustered microarchitectures

A Typical Clustered uArch
[Figure: Fetch/Decode/Rename and Steering Logic feeding clusters C0, C1, C2 and C3, each with a local I-Queue, a local Register File and FUs, connected by an interconnect network (ICN)]
- Partitioned processor core
  - Instructions dynamically steered
- Each cluster: RF, IQ, FUs
  - Faster issue, read, bypass
- Inter-cluster communications
  - Go through slow interconnects
  - Take 1 cycle or more
  - Steering must maximize communication locality

Motivation
- ICN is a critical part of the architecture
  - Performance is very sensitive to communication latency
- ICN assumed by previous works
  - Crossbar → does not scale
  - Ring → simple, but long delays
  - Idealized
- Our proposals
  - Several point-to-point ICNs for 4 and 8 clusters
    - Implementable, simple and efficient
  - A topology-aware steering scheme

Outline
- Clustered architecture
- Topology-aware steering
- Proposed interconnects
- Experimental results
- Summary and conclusions

Our Assumed Clustered uArch
- Distributed RF
  - Results only written to local RF
- Values are communicated with copy instructions
  - Automatically inserted
  - Each copy creates a new instance
  - Rename Table tracks locations of multiple instances
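To make the bookkeeping concrete, here is a minimal Python sketch of a rename table that tracks several instances of one logical register. The dictionary layout, the name `locate_source` and the rule for choosing which cluster a copy reads from are assumptions made for illustration; the slides do not specify them.

```python
def locate_source(table, reg, cluster, new_phys):
    """Find the instance of `reg` that an instruction steered to `cluster` reads.

    table maps a logical register to {cluster: physical register}.
    Returns (physical register, copy), where copy is None when a local
    instance already exists, or (reg, source cluster, destination cluster)
    when a copy instruction has to be inserted.
    """
    instances = table[reg]
    if cluster in instances:
        # A local instance already exists: no communication needed.
        return instances[cluster], None
    # Assumption: any cluster holding an instance may serve as the source.
    src_cluster = next(iter(instances))
    # The copy creates a new instance, which the table records.
    instances[cluster] = new_phys
    return new_phys, (reg, src_cluster, cluster)


# R1 lives only in cluster 1; a consumer in cluster 2 triggers one copy.
table = {"R1": {1: "p7"}}
first = locate_source(table, "R1", 2, "p3")
second = locate_source(table, "R1", 2, "p9")
```

In this sketch the first lookup requests a copy from cluster 1 to cluster 2, and the second lookup reuses the instance that copy created.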

Communication Timing
[Figure: pipeline timing (F, D, Ex, WB) of "R1 := R2 + R2" (to C1) and "copy R1 C1->C2" (to C1). Labels: Wait for R1, ICN delay, Wakeup signals]

Baseline Steering Scheme (dependence-based)
1. Minimize communication penalty
   - If all source operands are available: select the clusters that minimize # communications
   - If any source operand is not available: select the producer cluster
2. Maximize workload balance
   - Choose the least loaded of the clusters selected by rule 1
One exception: if workload imbalance > threshold, ignore rule 1
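The two rules and the exception can be sketched in Python as follows. This is only an illustrative model: the `Operand` type, the per-cluster `load` counters, the imbalance measure (maximum minus minimum load) and the handling of several pending operands are assumptions, since the slides do not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    locations: frozenset  # clusters that hold (or will produce) the value
    ready: bool           # False while the producer has not finished


def baseline_steer(operands, load, threshold):
    """Return the cluster chosen for an instruction with these operands."""
    clusters = range(len(load))
    if max(load) - min(load) > threshold:
        # Exception: imbalance too high, so rule 1 is ignored.
        candidates = list(clusters)
    else:
        pending = [op for op in operands if not op.ready]
        if pending:
            # Rule 1, operand not available: go to a producer cluster.
            candidates = sorted(set().union(*(op.locations for op in pending)))
        else:
            # Rule 1, all operands available: fewest copies needed.
            def copies_needed(c):
                return sum(1 for op in operands if c not in op.locations)

            fewest = min(copies_needed(c) for c in clusters)
            candidates = [c for c in clusters if copies_needed(c) == fewest]
    # Rule 2: least loaded of the candidates.
    return min(candidates, key=lambda c: load[c])


both_in_c0 = [Operand(frozenset({0}), True), Operand(frozenset({0, 1}), True)]
waiting_on_c3 = [Operand(frozenset({3}), False), Operand(frozenset({0}), True)]
```

With loads `[5, 3, 0, 0]`, `both_in_c0` is steered to cluster 0 under a loose threshold and to an idle cluster once the threshold is exceeded.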

Topology-Aware Steering Scheme
- Also minimize distance
- Change part of rule 1, if all source operands are available:
  - Baseline: "Select clusters that minimize # communications"
  - Topology-aware: "Select clusters that minimize the longest communication distance"
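A small Python sketch of the modified rule, under stated assumptions: clusters are numbered around a ring, each operand is described by the set of clusters that hold an instance of it, and the function names are hypothetical. Only the selection criterion (minimize the longest communication distance) comes from the slide.

```python
def ring_distance(a, b, n):
    """Hops between clusters a and b on a bidirectional ring of n clusters."""
    d = abs(a - b) % n
    return min(d, n - d)


def topology_aware_candidates(operand_locations, n, distance=ring_distance):
    """Clusters that minimize the longest distance any operand must travel.

    operand_locations: one set of clusters per source operand, listing
    where an instance of that operand is available.
    """
    def longest(c):
        # Each operand comes from its nearest instance; the slowest one counts.
        return max(
            (min(distance(src, c, n) for src in locs) for locs in operand_locations),
            default=0,
        )

    best = min(longest(c) for c in range(n))
    return [c for c in range(n) if longest(c) == best]


# Two operands at opposite sides of an 8-cluster ring.
opposite = topology_aware_candidates([{0}, {4}], 8)
adjacent = topology_aware_candidates([{0}, {1}], 8)
```

For operands held in clusters 0 and 4, this criterion selects clusters 2 and 6 (two 2-hop transfers), whereas counting communications alone would favor cluster 0 or 4 with a single 4-hop transfer.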

Design Issues: Bandwidth
- For each additional input bypass path
  - 1 tag across the IQ
  - 1 RF write port
  - 1 entry to FU input MUXes
  - It increases the wakeup and bypass delays
- Bandwidth requirements are rather low
  - 1 input bypass path per cluster (1 RF write port)
  - 2 links per connected cluster pair
[Figure: cluster and router]

Design Issues: Latency
- Performance is very sensitive to communication latency
- Simple routing structures and algorithms
  - Source routing
  - No intermediate buffering
  - In-transit messages have priority over newly injected ones
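The priority rule can be illustrated with a short Python sketch of the arbitration for a single output link. The function name `drive_link` and the use of a local queue for messages waiting to be injected are assumptions for the example, not the authors' router design.

```python
from collections import deque


def drive_link(in_transit, inject_queue):
    """Choose the message sent on one output link in the current cycle.

    in_transit: message arriving from upstream that continues on this
    link, or None. inject_queue: deque of locally produced messages
    waiting to use the link.
    """
    if in_transit is not None:
        # No intermediate buffering: an in-transit message cannot wait,
        # so it always wins the link.
        return in_transit
    if inject_queue:
        return inject_queue.popleft()
    return None


waiting = deque(["local"])
cycle0 = drive_link("transit", waiting)  # local message stays queued
cycle1 = drive_link(None, waiting)       # link is free, local message goes
cycle2 = drive_link(None, waiting)       # nothing left to send
```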

Design Issues: Connectivity
- Assumed 1-cycle communication delay between adjacent clusters
- Number of "adjacents" dictated by technology and layout
- Study topologies with different connectivity degrees

Design Issues: Point-to-point vs Buses
- Point-to-point advantages
  - Access to links is arbitrated locally
  - Wires are shorter and less loaded
- Shared buses are studied for comparison

Interconnects for 4 clusters (I): Bus2
- 1 bus per cluster, each connected to 1 write port
- Latency = 4 cycles (2 for arbitration + 2 for transmission)
- Arbitration overlaps with transmission
[Figure: clusters C0, C1, C2, C3]

Interconnects for 4 clusters (II): Synchronous Ring
- Injection rules prevent 2 messages from arriving at once:
  - Even cycles: 1-hop messages go counter-clockwise, 2-hop messages go clockwise
  - Odd cycles: reverse directions
[Figure: ring in even cycles and in odd cycles. Labels: Inject 1-hop message (or forward in-transit), Inject 2-hops message, No conflict!]
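One way to see why the injection rules avoid simultaneous arrivals is the Python sketch below. It assumes 1 cycle per hop, no buffering, at most one message per link per cycle, and that "clockwise" means from cluster i to cluster i+1; these conventions and the function names are illustrative assumptions. Under them, every message that can terminate at a given cluster in a given cycle uses the same final link, so at most one can arrive.

```python
N = 4             # clusters on the ring
CW, CCW = +1, -1  # assumed convention: clockwise goes from cluster i to i+1


def injection_direction(cycle, hops):
    """Direction allowed for a message of `hops` hops injected at `cycle`."""
    if hops not in (1, 2):
        raise ValueError("a 4-cluster ring only needs 1-hop and 2-hop messages")
    even = cycle % 2 == 0
    if hops == 1:
        return CCW if even else CW
    return CW if even else CCW


def final_links_into(dest, arrival_cycle):
    """Links (from_cluster, dest) on which a message can end at dest."""
    links = set()
    for hops in (1, 2):
        inject_cycle = arrival_cycle - hops  # assumed 1 cycle per hop
        direction = injection_direction(inject_cycle, hops)
        last_node = (dest - direction) % N   # cluster that sends the last hop
        links.add((last_node, dest))
    return links


# For every destination and cycle, all terminating messages share one link.
single_link = all(
    len(final_links_into(dest, cycle)) == 1
    for dest in range(N)
    for cycle in range(2, 20)
)
```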

Interconnects for 4 clusters (III): Partially Asynchronous Ring
- Messages may issue in any cycle
- 2 messages may arrive at once
  - Small input queues
[Figure: ring of clusters c0, c1, c2, c3 with input queues]

Interconnects for 4 clusters (IV): Ideal Ring
- Contention-free
  - Unlimited number of links
  - Unlimited number of RF write ports
- For comparison purposes (upper-bound performance)

Interconnects for 8 Clusters (I)
- Buses
  - Analogous to those for 4 clusters
  - Bus2: same latency (optimistic): 2+2 cycles
  - Bus4: twice the latency (realistic): 4+4 cycles
- Rings
  - Synchronous and Asynchronous
  - Max. distance = 4 hops (average 2.29 hops)
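The ring figures can be checked with a few lines of Python, assuming a bidirectional ring in which each message takes the shorter way around and the average is taken over the 7 other clusters. The helper name `ring_hop_counts` is hypothetical.

```python
def ring_hop_counts(n):
    """Hops from one cluster to each of the other n-1 clusters on a ring."""
    return [min(d, n - d) for d in range(1, n)]


hops8 = ring_hop_counts(8)            # [1, 2, 3, 4, 3, 2, 1]
max8 = max(hops8)                     # 4 hops
avg8 = sum(hops8) / len(hops8)        # 16 / 7, about 2.29 hops

hops4 = ring_hop_counts(4)            # [1, 2, 1]
avg4 = sum(hops4) / len(hops4)        # 4 / 3, about 1.33 hops
```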

Interconnects for 8 Clusters (II): Mesh
- Max. distance = 4 hops (average = 2 hops)
- 2 in-transit messages may compete for the same output link
  - Constrained connectivity
  - Only for last hop of messages
[Figure: cluster datapath with Left, Right and Top links]

Interconnects for 8 Clusters (III): Torus
- Max. distance = 3 hops
- Same connectivity constraints as the mesh
  - Only for last hop of messages
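The hop counts quoted for the mesh and the torus are consistent with 8 clusters laid out as a 2 x 4 grid. That layout is an assumption here (the transcript does not state the dimensions), as is the helper name `grid_stats`; the sketch below simply computes the maximum and average distance over all pairs of distinct clusters.

```python
from itertools import product


def grid_stats(rows, cols, wrap):
    """(max hops, average hops) over all ordered pairs of distinct nodes.

    wrap=False models a mesh, wrap=True a torus with wrap-around links.
    """
    nodes = list(product(range(rows), range(cols)))

    def axis(a, b, size):
        d = abs(a - b)
        return min(d, size - d) if wrap else d

    dists = [
        axis(r1, r2, rows) + axis(c1, c2, cols)
        for (r1, c1), (r2, c2) in product(nodes, nodes)
        if (r1, c1) != (r2, c2)
    ]
    return max(dists), sum(dists) / len(dists)


mesh_max, mesh_avg = grid_stats(2, 4, wrap=False)    # 4 hops, 2.0 on average
torus_max, torus_avg = grid_stats(2, 4, wrap=True)   # 3 hops, about 1.71
```

Under this assumed layout the torus shortens both the worst-case and the average distance relative to the mesh and the 8-cluster ring.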

Interconnects for 8 Clusters (IV): Ideal Torus
- Contention-free
  - Unlimited number of links
  - Unlimited number of RF write ports
- For comparison purposes (upper-bound performance)

Router Structures
- Common features to all ICN
  - No intermediate buffering
- Partially asynchronous ICN
  - Competition for a write port
  - Add small input queues
- Topologies with 3 adjacent nodes
  - Competition for the same output link
  - Constrained connectivity
[Figure: router with Left Link, Right Link, Top Link, input queue (Qin) and Cluster Datapath]

Experimental Setup
- Simulation
  - Extended version of sim-outorder (SimpleScalar v3.0)
  - 14 Mediabench programs
  - Compiled with -O4 for an Alpha AXP
- Architecture
  - L1 D-cache: 64KB, 2-way, 3-cycle hit
  - 128-entry ROB, 64-entry LSQ
  - Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.

Performance: 4 Clusters
- Poor performance of Bus2
- Asynchronous Ring
  - Better than Synchronous Ring
  - Close to Ideal (within 1%)

Synchronous / Asynchronous
- Contention delays
  - Lower for Async. Ring
    - Message issues as soon as the link is available
  - Higher for 1-hop messages
    - A single path
    - Sync. Ring: issue 1 cycle every 2

Length of Input Queues
- Max. observed occupancy < 9 entries
- Handle overflows by flushing the pipeline
  - Rather than including complex control flow

Sample statistics (djpeg):

# occupied entries   # messages   Distribution (% times)
0                    1327534      96.20
1                    47136        3.42
2                    4807         0.35
3                    484          0.04
4                    26
5
>=6

Performance: 8 Clusters
- Poor performance of buses
- Connectivity degree has a significant impact
- Asynchronous Torus close to Ideal (within 1.5%)

Topology-Aware Steering
- 16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)

Summary
- An efficient topology-aware steering scheme
- Cluster point-to-point interconnects
  - For 4 clusters and 8 clusters
  - Designed to minimize complexity and latency
- Compared to
  - Bus-based models
  - Idealized models with unlimited bandwidth

Conclusions
- The choice of ICN is crucial for performance
  - Point-to-point better than buses
  - Asynchronous rings better than synchronous
  - Asynchronous interconnects perform close to ideal with minimal complexity
  - Higher connectivity significantly improves performance
- Topology-aware steering essential to reduce latency
  - Especially with many clusters

The main conclusion is that the choice of interconnect is key for performance. We have found that point-to-point interconnects outperform bus-based models, and that partially asynchronous rings outperform synchronous rings, because the synchronous issue rules over-constrain the available bandwidth. We also found that partially asynchronous interconnects perform close to an ideal interconnect with unlimited bandwidth, despite having minimal complexity (just 1 RF write port required). And they do not require complex control flow, just a tiny queue. The 3 topologies studied (ring, mesh and torus) differ in their connectivity degree; in that respect, we have shown that higher connectivity significantly improves performance. Finally, we have found that the topology-aware steering scheme is essential to reduce the latency of communications, and its impact on performance grows with the number of clusters.