Published by Doris McLaughlin. Modified over 7 years ago.
1
Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Yakun Sophia Shao&, Sam Xi, Viji Srinivasan*, Gu-Yeon Wei, and David Brooks
Harvard University, &NVIDIA, *IBM Research
2
Accelerator Design
[Figure: accelerator design space. Recovered labels: Algorithm; Power; Cycle; # of Lanes; Lane …; Local SRAM Bandwidth; SRAM Size.]
3
System-level interactions are not considered.
In a complex SoC, an accelerator's execution is more than just computation.
[Figure: SoC block diagram with numbered steps 1 to 6. Recovered labels: CPU 0, CPU 1, L1 $, L2 $, ACC, MEM, System Bus, MC, DRAM, DMA (Transfer Descriptors: SRC ADDR, DEST ADDR, LENGTH; Channel Selection: CHAN 0, CHAN 3).]
4
Accelerator Execution Flow
[Figure: the same SoC block diagram with execution steps 1 to 6. Recovered labels: CPU 0, CPU 1, L1 $, L2 $, ACC, MEM, System Bus, MC, DRAM, DMA, Transfer Descriptors.]
md-knn running on Zynq: only 20%
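The execution flow can be made concrete with a small timing breakdown. The sketch below assumes that the "only 20%" annotation refers to the share of total time spent in the accelerator datapath, and that the flush, DMA, and compute phases run back to back. The class name, field names, and cycle counts are all hypothetical; the numbers are chosen only so the share comes out at 20% and are not measurements.

```python
from dataclasses import dataclass


@dataclass
class OffloadTiming:
    """Cycle counts for one accelerator invocation (illustrative values only)."""
    flush_cycles: int    # CPU cache flush/invalidate before the transfer
    dma_cycles: int      # DMA transfers into and out of accelerator memory
    compute_cycles: int  # accelerator datapath execution

    def total(self) -> int:
        # Assumption: the three phases are serialized, with no overlap.
        return self.flush_cycles + self.dma_cycles + self.compute_cycles

    def compute_fraction(self) -> float:
        return self.compute_cycles / self.total()


# Hypothetical numbers, not taken from the slides.
timing = OffloadTiming(flush_cycles=30_000, dma_cycles=50_000, compute_cycles=20_000)
print(f"compute share: {timing.compute_fraction():.0%}")
```

Under this model, speeding up only the datapath can remove at most the compute share of the total time, which is why the data-movement phases matter for the design.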
5
Balanced Accelerator Design
[Figure: accelerator lanes (Lane …) backed by local SRAM.]
Compute throughput: # of lanes, local SRAM bandwidth, size of local SRAM
Data throughput (SoC interfaces): data movement, coherence handling, shared resource contention
6
Over-designed Accelerator
[Figure: accelerator lanes (Lane …) backed by local SRAM.]
Compute throughput: # of lanes, local SRAM bandwidth, size of local SRAM
Data throughput (SoC interfaces): data movement, coherence handling, shared resource contention
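The difference between a balanced and an over-designed accelerator can be sketched with a simple bottleneck model: sustained throughput is the minimum of compute throughput and data throughput. The function names and the one-word-per-lane-per-cycle assumption are illustrative, not part of gem5-Aladdin.

```python
def sustained_throughput(lanes: int, sram_ports: int, soc_words_per_cycle: float) -> float:
    """Words processed per cycle under a simple bottleneck model."""
    # Assumption: each lane consumes one word per cycle and each local
    # SRAM port supplies one word per cycle.
    compute = min(lanes, sram_ports)
    # The SoC interface caps how fast data can reach the accelerator.
    return min(compute, soc_words_per_cycle)


def smallest_balanced_design(max_lanes: int, soc_words_per_cycle: float) -> int:
    """Fewest lanes (with ports == lanes) that reach the best sustained throughput."""
    best = sustained_throughput(max_lanes, max_lanes, soc_words_per_cycle)
    for lanes in range(1, max_lanes + 1):
        if sustained_throughput(lanes, lanes, soc_words_per_cycle) >= best:
            return lanes
    return max_lanes


soc_bw = 4.0  # hypothetical SoC data throughput, in words per cycle
for lanes in (1, 2, 4, 8, 16, 32):
    print(lanes, sustained_throughput(lanes, lanes, soc_bw))
print("balanced:", smallest_balanced_design(32, soc_bw))
```

In this toy model every lane beyond the fourth adds hardware without adding throughput, which is the over-designed case.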
7
Isolated vs. Co-Designed
10
Executive Summary
Goal:
- Co-design accelerators and SoC interfaces for balanced accelerator design.
Methodology:
- gem5-Aladdin: an SoC simulator.
- Models the interactions between accelerators and SoCs.
- < 6% error, validated against the Xilinx Zynq platform.
Takeaways:
- Co-designed accelerators are less aggressively parallel, leading to more balanced designs and improved energy efficiency.
- The choice of local memory, i.e., cache or DMA, is highly dependent on the memory characteristics of the workload and the system architecture.
11
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
[Figure: Aladdin tool flow. Inputs: unmodified C code; accelerator design parameters (e.g., # FU, mem. BW). Modeled components: accelerator-specific datapath; private L1/scratchpad; shared memory/interconnect models. Outputs: power/area; performance.]
[ISCA'14, TopPicks'15]
12
gem5-Aladdin: An SoC Simulator
[Figure: gem5-Aladdin SoC block diagram. Recovered labels: ACC0 (MEM; Lane 0 to Lane 4; ARR 0, ARR 1; BUF 0, BUF 1; SPAD Interface), ACC1 (MEM; Lane 0 to Lane 3; TLB; Cache; Cache Interface), CPU 0, CPU 1, L1 $, L2 $, System Bus, MC, DRAM, DMA (Transfer Descriptors: SRC ADDR, DEST ADDR, LENGTH; Channel Selection: CHAN 0, CHAN 3).]
13
gem5-Aladdin: An SoC Simulator
DMA Engine
- Leverage the DMA engine in gem5.
- Insert dmaLoad and dmaStore APIs into Aladdin's trace.
- Model CPU cache flush/invalidate latency through Zynq board characterization.
Cache Interface
- Model Intel's HARP and IBM's CAPI-like platforms.
- Use gem5's cache model.
- Implement Aladdin's TLB for address translation.
CPU-Accelerator Interface
- Invoke accelerators through the ioctl system call.
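The DMA path can be illustrated with a toy descriptor-driven engine. The descriptor fields (source address, destination address, length) come from the block diagram; everything else is an assumption made for illustration: the class names, the four-channel count, the drain-in-index-order policy, and the use of byte arrays as host memory and scratchpad. This is not the gem5 DMA model and not the dmaLoad/dmaStore API itself.

```python
from dataclasses import dataclass


@dataclass
class TransferDescriptor:
    src_addr: int   # byte offset in the source memory
    dest_addr: int  # byte offset in the destination memory
    length: int     # number of bytes to copy


class ToyDma:
    """Hypothetical DMA engine: per-channel queues of transfer descriptors."""

    def __init__(self, num_channels: int = 4):  # channel count is an assumption
        self.channels = [[] for _ in range(num_channels)]

    def enqueue(self, channel: int, desc: TransferDescriptor) -> None:
        self.channels[channel].append(desc)

    def run(self, src: bytearray, dst: bytearray) -> int:
        """Drain all channels in index order and return the bytes moved."""
        moved = 0
        for queue in self.channels:
            while queue:
                d = queue.pop(0)
                if d.src_addr + d.length > len(src) or d.dest_addr + d.length > len(dst):
                    raise ValueError("descriptor exceeds memory bounds")
                dst[d.dest_addr:d.dest_addr + d.length] = src[d.src_addr:d.src_addr + d.length]
                moved += d.length
        return moved


host = bytearray(range(16))  # stand-in for DRAM
spad = bytearray(8)          # stand-in for the accelerator scratchpad
dma = ToyDma()

# Load phase: host memory -> scratchpad (the role a dmaLoad plays).
dma.enqueue(0, TransferDescriptor(src_addr=4, dest_addr=0, length=8))
dma.run(src=host, dst=spad)

spad[:] = bytes(b * 2 for b in spad)  # stand-in for accelerator computation

# Store phase: scratchpad -> host memory (the role a dmaStore plays).
dma.enqueue(3, TransferDescriptor(src_addr=0, dest_addr=4, length=8))
dma.run(src=spad, dst=host)
print(list(host))
```

The load, compute, store ordering mirrors the flow in which data is explicitly moved into local memory before the datapath runs and moved back afterwards.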
14
Validation
[Figure: validation setup. Recovered labels: Xilinx Zynq SoC (ARM Core, FPGA, DMA IP Block); Application; Accelerated Kernel; Vivado HLS; Verilog; Flush Latency; DMA Latency; Accel Latency.]
15
Validation
16
Designed-in-Isolation vs. Co-Designed
[Figure: four design scenarios.]
1. Isolated Design: ACC with SPAD only.
2. Co-Design: ACC with SPAD and DMA (transfer descriptors: SRC ADDR, DEST ADDR, LENGTH; channel selection: CHAN 0, CHAN 3), CPU 1 with L1 $, System Bus, MC, DRAM.
3. Co-Design: ACC with Cache, CPU 1 with L1 $, System Bus, MC, DRAM.
4. Co-Design w/ a wider bus: ACC with Cache, CPU 1 with L1 $, System Bus, MC, DRAM.
17
Designed-in-Isolation vs. Co-Designed
Compare: # of lanes, size of local SRAM, BW of local SRAM
1. Isolated Design
2. DMA Co-Design
3. Cache Co-Design
4. Cache Co-Design w/ a wider bus
18
Designed-in-Isolation vs. Co-Designed
Design                           Lanes   Local SRAM size   Local SRAM ports
Isolated Design                  32      45KB              16
DMA Co-Design                    4       45KB              4
Cache Co-Design                  8       16KB              4
Cache Co-Design w/ a wider bus   32      16KB              4
19
Designed-in-Isolation vs. Co-Designed
20
EDP Improvement
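EDP is the energy-delay product, so an EDP improvement is the ratio of the baseline's product to the candidate's. The sketch below shows the arithmetic; the function names and the energy and delay values are hypothetical placeholders, not results from the talk.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_improvement(baseline: tuple, candidate: tuple) -> float:
    """Ratio of baseline EDP to candidate EDP; values above 1 favor the candidate."""
    return edp(*baseline) / edp(*candidate)


# Hypothetical (energy, delay) pairs, for illustration only.
isolated = (2.0e-3, 1.0e-3)
co_designed = (1.0e-3, 0.8e-3)
print(f"EDP improvement: {edp_improvement(isolated, co_designed):.2f}x")
```

Because the metric multiplies energy by delay, a design that is both slightly faster and lower in energy improves EDP by more than either factor alone.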
21
Also in the paper…
DMA Optimizations:
- Pipelined DMA
- DMA-Triggered Computation
Cache Design Space Exploration:
- Latency vs. Bandwidth Time
- Impact of Datapath Parallelism
- DMA vs. Cache Comparisons
- Pareto-Frontier Designs
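The slide does not define pipelined DMA. One plausible reading, assumed here, is that the data is split into blocks so the cache flush of one block overlaps with the DMA transfer of the previous one. Under that assumption, the standard two-stage pipeline timing gives a rough estimate of the benefit. The function names and cycle counts are hypothetical.

```python
def serial_cycles(blocks: int, flush: int, dma: int) -> int:
    """Flush everything, then transfer everything: no overlap."""
    return blocks * (flush + dma)


def pipelined_cycles(blocks: int, flush: int, dma: int) -> int:
    """Two-stage pipeline with uniform per-block stage times."""
    if blocks == 0:
        return 0
    # Fill with the first flush, advance at the slower stage's rate,
    # then drain with the last DMA transfer.
    return flush + (blocks - 1) * max(flush, dma) + dma


# Hypothetical per-block costs in cycles.
print(serial_cycles(8, flush=100, dma=300))
print(pipelined_cycles(8, flush=100, dma=300))
```

With a single block the two models coincide, so the benefit in this sketch comes entirely from splitting the transfer into multiple blocks.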
22
Conclusions
- Architects must take a holistic view when it comes to accelerator design.
- Accelerators that are designed in isolation tend to overprovision hardware resources.
- gem5-Aladdin enables co-design of accelerators and SoC interfaces.
Download gem5-Aladdin here: