Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator
Paper Presentation
Yifeng (Felix) Zeng
University of Missouri
Outline
- Context & Introduction
- Rigel Design Goals
- Rigel Architecture
- Design Elements
- Estimates for the Rigel Design
- Conclusion
Context & Introduction
Accelerator (e.g. GPU): a hardware entity designed to provide advantages for a specific class of applications, including higher performance, lower power, or lower unit cost compared to a general-purpose CPU.
Accelerator: maximize throughput (operations/sec)
CPU: minimize latency (sec/operation)
Context & Introduction
Challenges:
- Inflexible programming models
- Lack of conventional memory model
- Hard to scale irregular parallel apps
Challenges lead to:
- Operations / area ($)
- Operations / Watt (power)
- Operations / Programmer Effort
Rigel Design Goals
What: Future programming models
- Apps and models may not exist yet
- Flexible design: easier to retarget
How: Focus on scalability, programmer effort
- Raised hardware/software interface
- Focusing design effort: five elements
Rigel Architecture
Area-optimized Dual-issue In-order RISC-like ISA (instruction set architecture) Single-precision Floating-point Registers
Rigel Architecture
45 nm technology, 320 mm² Rigel chip (1024 cores) at a frequency of 1.2 GHz: a peak throughput of 2.4 TFLOPS
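The peak-throughput figure can be checked with a short calculation. The sketch below assumes each core retires one fused multiply-add per cycle, counted as 2 floating-point operations; that per-core rate is an assumption for illustration and is not stated on the slide. Under that assumption the product comes to about 2.46 TFLOPS, which is consistent with the quoted 2.4 TFLOPS.

```python
# Back-of-the-envelope check of the peak-throughput figure.
CORES = 1024
CLOCK_HZ = 1.2e9
# Assumption (not from the slide): one fused multiply-add per core per
# cycle, counted as 2 floating-point operations.
FLOPS_PER_CYCLE_PER_CORE = 2


def peak_tflops(cores, clock_hz, flops_per_cycle):
    """Return peak throughput in TFLOPS for the given configuration."""
    return cores * clock_hz * flops_per_cycle / 1e12


peak = peak_tflops(CORES, CLOCK_HZ, FLOPS_PER_CYCLE_PER_CORE)
print(f"{peak:.2f} TFLOPS")
```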
Design Elements
1. Execution Model: ISA, SIMD vs. MIMD, VLIW vs. OoOE, MT
2. Memory Model: Caches vs. scratchpad, ordering, coherence
3. Work Distribution: Scheduling, spectrum of SW/HW choices
4. Synchronization: Scalability, influence on programming model
5. Locality Management
Element 1: Execution Model
Tradeoff 1: MIMD vs. SIMD
- Irregular data parallelism
- Task parallelism
Tradeoff 2: Latency vs. Throughput
- Simple in-order cores
Tradeoff 3: Full RISC ISA vs. Specialized Cores
Element 2: Memory Model
Tradeoff 1: Single vs. Multiple address space
Tradeoff 2: Hardware caches vs. scratchpads
- Hardware exploits locality
- Software manages global sharing
Tradeoff 3: Hierarchical vs. Distributed
- Cluster cache/global cache hierarchy
- ISA provides local/global memory operations
- Non-uniformity: Programmer effort
Element 3: Work Distribution
Tradeoff (Spectrum): HW vs. SW Implementation
- Software task management: hierarchical queues
- Flexible policies + little specialized hardware
Rigel Task Model
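The idea of software task management with hierarchical queues can be sketched as a toy model: one shared global queue feeds per-cluster local queues in batches, so most dequeues stay local. This is only an illustration under stated assumptions, not the Rigel runtime; the class name `HierarchicalTaskQueue`, the `batch_size` parameter, and the `global_refills` counter are all hypothetical.

```python
from collections import deque


class HierarchicalTaskQueue:
    """Toy model: one global queue feeding per-cluster local queues.

    All names here are hypothetical; this is not the Rigel task API.
    """

    def __init__(self, num_clusters, batch_size):
        self.global_queue = deque()
        self.local_queues = [deque() for _ in range(num_clusters)]
        self.batch_size = batch_size
        self.global_refills = 0  # trips a cluster makes to the global queue

    def enqueue(self, task):
        """Add a task to the shared global queue."""
        self.global_queue.append(task)

    def dequeue(self, cluster_id):
        """Take a task for a cluster, refilling its local queue if empty."""
        local = self.local_queues[cluster_id]
        if not local:
            # Local queue is empty: fetch up to batch_size tasks at once.
            self.global_refills += 1
            for _ in range(min(self.batch_size, len(self.global_queue))):
                local.append(self.global_queue.popleft())
        return local.popleft() if local else None


queue = HierarchicalTaskQueue(num_clusters=2, batch_size=4)
for task_id in range(8):
    queue.enqueue(task_id)

first_batch = [queue.dequeue(cluster_id=0) for _ in range(4)]
second_batch = [queue.dequeue(cluster_id=1) for _ in range(4)]
print(first_batch, second_batch, queue.global_refills)
```

In this model eight dequeues cost only two trips to the global queue. The batch size is the policy knob: larger batches cut global traffic but can leave work stranded in one cluster, which is the kind of tradeoff a software implementation keeps flexible.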
Rigel Task Model
Rigel Task Model Evaluation
Element 4: Synchronization
Coherence mechanisms:
1. Control synchronization
2. Data sharing
Broadcast update
- Use cases: flags and barriers
- Reduces contention from polling
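Why a broadcast update reduces polling contention can be shown with a small counting model. The sketch assumes that a broadcast pushes the new value into every cluster's cached copy, so waiting cores poll a local copy instead of the shared global cache. The class `BroadcastFlag` and its methods are hypothetical and do not correspond to Rigel instructions.

```python
class BroadcastFlag:
    """Toy model of a flag kept up to date by broadcast update.

    Hypothetical names; this only counts accesses to the shared copy.
    """

    def __init__(self, num_clusters):
        self.global_value = 0
        self.cluster_copies = [0] * num_clusters
        self.global_reads = 0  # polls that reach the shared global copy
        self.broadcasts = 0

    def poll_global(self):
        """Baseline: every poll goes to the shared global copy."""
        self.global_reads += 1
        return self.global_value

    def poll_local(self, cluster_id):
        """With broadcast update: a poll reads the cluster's own copy."""
        return self.cluster_copies[cluster_id]

    def broadcast_update(self, value):
        """Write the flag once and push the value to every cluster copy."""
        self.global_value = value
        self.cluster_copies = [value] * len(self.cluster_copies)
        self.broadcasts += 1


# Four clusters each poll ten times while waiting on the flag.
flag = BroadcastFlag(num_clusters=4)
for _ in range(10):
    for cluster_id in range(4):
        flag.poll_local(cluster_id)
flag.broadcast_update(1)
seen = [flag.poll_local(cluster_id) for cluster_id in range(4)]

# Baseline without broadcast update: the same 40 polls all go global.
baseline = BroadcastFlag(num_clusters=4)
for _ in range(40):
    baseline.poll_global()

print(flag.global_reads, flag.broadcasts, seen, baseline.global_reads)
```

The model ignores latency and the cost of the broadcast itself; it only illustrates that waiting on a flag or barrier generates no traffic to the shared copy until the value actually changes.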
Area estimates for the Rigel Design
Conclusions
Although Rigel is not yet a physical chip, the whole idea is novel and feasible.
Future Work: Element five: Locality Management
The Rigel design strikes a balance between performance and programmability.
References
age/Rigel.html
Rigel: A Scalable Architecture for 1000+ Core Accelerators, Daniel R. Johnson et al., SAAHPC'09.
The PowerPoint presented at the 36th Annual International Symposium on Computer Architecture, June 22nd, 2009, by John H. Kelm et al., UIUC.