Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin


1 Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Yakun Sophia Shao&, Sam Xi, Viji Srinivasan*, Gu-Yeon Wei, and David Brooks
&NVIDIA, *IBM Research, Harvard University

2 Accelerator Design
[Figure: accelerator design space. Labels: Algorithm; Power; Cycle; # of Lanes; Local SRAM Bandwidth; SRAM Size.]

3 System-level interactions are not considered.
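In a complex SoC, an accelerator's execution is more than just computation.
[Figure: SoC block diagram with steps numbered 1 to 6. Components: CPU 0 and CPU 1 with L1 and L2 caches; accelerator (ACC) with local memory (MEM); DMA engine with transfer descriptors (SRC ADDR, DEST ADDR, LENGTH) and channel selection (CHAN 0 to CHAN 3); system bus; memory controller (MC); DRAM.]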
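The transfer descriptors in the diagram carry three fields: source address, destination address, and length. As a minimal sketch of how a large copy might be broken into such descriptors and spread over the four DMA channels, the Python below uses hypothetical names (`TransferDescriptor`, `build_descriptors`, `assign_channels`) and assumes a round-robin channel selection policy; neither the names nor the policy come from the slides.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class TransferDescriptor:
    """One DMA transfer, with the three fields named on the slide."""
    src_addr: int
    dest_addr: int
    length: int  # bytes


def build_descriptors(src, dest, total_bytes, max_bytes):
    """Split one large copy into descriptors of at most max_bytes each."""
    descriptors = []
    offset = 0
    while offset < total_bytes:
        chunk = min(max_bytes, total_bytes - offset)
        descriptors.append(TransferDescriptor(src + offset, dest + offset, chunk))
        offset += chunk
    return descriptors


def assign_channels(descriptors, num_channels=4):
    """Round-robin channel selection (an assumed policy, not from the slides)."""
    channels = cycle(range(num_channels))
    return [(next(channels), d) for d in descriptors]


# Hypothetical addresses and sizes, chosen only for illustration.
descs = build_descriptors(src=0x1000, dest=0x8000, total_bytes=10_000, max_bytes=4096)
schedule = assign_channels(descs)
for chan, d in schedule:
    print(f"CHAN {chan}: {d.length} bytes from {d.src_addr:#x} to {d.dest_addr:#x}")
```

Each descriptor is independent, which is what lets a DMA engine process several of them on different channels.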

4 Accelerator Execution Flow
[Figure: SoC block diagram (CPU 0, CPU 1, L1 and L2 caches, system bus, DMA with transfer descriptors, ACC with local MEM, MC, DRAM) with the execution flow numbered 1 to 6.]
md-knn running on Zynq: Only 20%

5 Balanced Accelerator Design
Compute throughput: # of lanes, local SRAM bandwidth, size of local SRAM.
Data throughput (SoC interfaces): data movement, coherence handling, shared resource contention.
[Figure: accelerator lanes and local SRAM.]

6 Over-designed Accelerator
Compute throughput: # of lanes, local SRAM bandwidth, size of local SRAM.
Data throughput (SoC interfaces): data movement, coherence handling, shared resource contention.
[Figure: accelerator lanes and local SRAM.]
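One way to picture the difference between a balanced and an over-designed accelerator is a simple bottleneck model: delivered throughput is the smaller of what the lanes can compute and what the SoC interfaces can supply. The sketch below is an illustrative simplification under that assumption, not the model gem5-Aladdin uses; the function names and the numbers are hypothetical.

```python
def effective_throughput(num_lanes, ops_per_lane, data_limit):
    """Delivered throughput is capped by the slower of compute and data supply."""
    return min(num_lanes * ops_per_lane, data_limit)


def smallest_balanced_lanes(lane_options, ops_per_lane, data_limit):
    """Return the fewest lanes that already reach the best achievable throughput."""
    best = max(effective_throughput(n, ops_per_lane, data_limit) for n in lane_options)
    return min(n for n in lane_options
               if effective_throughput(n, ops_per_lane, data_limit) == best)


lane_options = [1, 2, 4, 8, 16, 32]
OPS_PER_LANE = 1.0  # hypothetical compute rate of a single lane
DATA_LIMIT = 8.0    # hypothetical rate at which the SoC can feed the accelerator

for n in lane_options:
    print(n, effective_throughput(n, OPS_PER_LANE, DATA_LIMIT))

balanced = smallest_balanced_lanes(lane_options, OPS_PER_LANE, DATA_LIMIT)
print("balanced lane count:", balanced)
```

Under these made-up numbers, lanes beyond the balanced count add hardware without adding throughput, which is the over-designed case.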

7 Isolated vs. Co-Designed

8 Isolated vs. Co-Designed

9 Isolated vs. Co-Designed

10 Executive Summary
Goal: Co-design accelerators and SoC interfaces for balanced accelerator design.
Methodology: gem5-Aladdin, an SoC simulator. Models the interactions between accelerators and SoCs. < 6% error, validated against the Xilinx Zynq platform.
Takeaways: Co-designed accelerators are less aggressively parallel, leading to more balanced designs and improved energy efficiency. The choice of local memory, i.e., cache or DMA, is highly dependent on the memory characteristics of the workload and the system architecture.

11 Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
Inputs: unmodified C code; accelerator design parameters (e.g., # FU, mem. BW).
Aladdin: accelerator-specific datapath; private L1/scratchpad; shared memory/interconnect models.
Outputs: power/area; performance.
[ISCA'14, TopPicks'15]

12 gem5-Aladdin: An SoC Simulator
[Figure: SoC modeled by gem5-Aladdin. CPU 0 and CPU 1 with L1 and L2 caches; ACC0 with lanes 0 to 4, local memory (ARR 0, ARR 1, BUF 0, BUF 1), and a scratchpad (SPAD) interface; ACC1 with lanes 0 to 3, local memory, a TLB, and a cache behind a cache interface; DMA engine with transfer descriptors (SRC ADDR, DEST ADDR, LENGTH) and channel selection (CHAN 0 to CHAN 3); system bus; memory controller (MC); DRAM.]

13 gem5-Aladdin: An SoC Simulator
DMA Engine: Leverage the DMA engine in gem5; insert dmaLoad and dmaStore APIs into Aladdin's trace; model CPU cache flush/invalidate latency through Zynq board characterization.
Cache Interface: Model platforms like Intel's HARP and IBM's CAPI; use gem5's cache model; implement Aladdin's TLB for address translation.
CPU-Accelerator Interface: Invoke accelerators through the ioctl system call.
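To make the role of a TLB on the cache interface concrete, here is a toy address-translation model. It is only a sketch: the class `ToyTLB`, the 4 KB page size, the LRU replacement policy, and the page-table contents are all assumptions for illustration and do not describe how Aladdin's TLB is actually implemented.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class ToyTLB:
    """Small fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, num_entries, page_table):
        self.num_entries = num_entries
        self.page_table = page_table  # virtual page number -> physical page number
        self.entries = OrderedDict()  # cached translations, least recent first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.num_entries:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = self.page_table[vpn]  # stand-in for a page walk
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}  # hypothetical mappings
tlb = ToyTLB(num_entries=2, page_table=page_table)
paddrs = [tlb.translate(a) for a in (0x0010, 0x0020, 0x1004, 0x2008, 0x0010)]
print(paddrs, "hits:", tlb.hits, "misses:", tlb.misses)
```

In a timing simulator each miss would also add latency to the accelerator's memory access; this sketch only counts hits and misses.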

14 Validation
[Figure: validation setup on a Xilinx Zynq SoC (ARM core, FPGA, DMA IP block). The application's accelerated kernel goes through Vivado HLS to Verilog. Latencies shown: flush latency, DMA latency, accel latency.]
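The validation compares three latency components (flush, DMA, accelerator) between simulation and hardware. A minimal sketch of such a comparison follows; it assumes the three phases run back to back with no overlap, and all cycle counts are hypothetical placeholders rather than measurements from the talk.

```python
def modeled_total_cycles(flush, dma, accel):
    """Total invocation time, assuming the three phases do not overlap."""
    return flush + dma + accel


def percent_error(simulated, measured):
    """Absolute error of the simulated value relative to the measured one."""
    return abs(simulated - measured) / measured * 100.0


# Hypothetical cycle counts, for illustration only.
simulated = modeled_total_cycles(flush=1200, dma=2500, accel=6300)
measured = 10_400

error = percent_error(simulated, measured)
print(f"simulated={simulated} measured={measured} error={error:.2f}%")
```

The same `percent_error` helper could be applied to each component separately to see which part of the model contributes most to the total error.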

15 Validation

16 Designed-in-Isolation vs. Co-Designed
[Figure: four design scenarios.
1. Isolated Design: ACC with scratchpad (SPAD) only.
2. DMA Co-Design: ACC with SPAD, DMA engine (SRC ADDR, DEST ADDR, LENGTH; CHAN 0 to CHAN 3), system bus, CPU with L1 cache, MC, DRAM.
3. Cache Co-Design: ACC with cache, system bus, CPU with L1 cache, MC, DRAM.
4. Cache Co-Design w/ a wider bus: ACC with cache, system bus, CPU with L1 cache, MC, DRAM.]

17 Designed-in-Isolation vs. Co-Designed
Designs: (1) Isolated Design; (2) DMA Co-Design; (3) Cache Co-Design; (4) Cache Co-Design w/ a wider bus.
Compare: # of lanes, size of local SRAM, BW of local SRAM.

18 Designed-in-Isolation vs. Co-Designed
Isolated Design: 32 lanes, 45 KB local SRAM, 16 ports
DMA Co-Design: 4 lanes, 45 KB local SRAM, 4 ports
Cache Co-Design: 8 lanes, 16 KB local SRAM, 4 ports
Cache Co-Design w/ a wider bus: 32 lanes, 16 KB local SRAM, 4 ports

19 Designed-in-Isolation vs. Co-Designed

20 EDP Improvement
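EDP is the energy-delay product: energy consumed multiplied by execution time, so lower is better. As a hedged sketch of how an improvement factor between two designs could be computed, the snippet below uses hypothetical helper names and made-up energy and delay values; none of the numbers come from the talk.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product; lower is better."""
    return energy_joules * delay_seconds


def edp_improvement(baseline, candidate):
    """How many times lower the candidate's EDP is than the baseline's."""
    return edp(*baseline) / edp(*candidate)


# (energy in joules, delay in seconds); hypothetical values for illustration.
isolated = (2.0e-3, 1.0e-3)
co_designed = (1.0e-3, 1.0e-3)

improvement = edp_improvement(isolated, co_designed)
print(f"EDP improvement: {improvement:.2f}x")
```

Because EDP multiplies the two quantities, a design that saves energy at equal delay, or delay at equal energy, improves EDP by the same factor.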

21 Also in the paper…
DMA Optimizations: Pipelined DMA; DMA-Triggered Computation.
Cache Design Space Exploration: Latency vs. Bandwidth Time; Impact of Datapath Parallelism.
DMA vs. Cache Comparisons.
Pareto-Frontier Designs.

22 Conclusions
Architects must take a holistic view when it comes to accelerator design.
Accelerators that are designed in isolation tend to overprovision hardware resources.
gem5-Aladdin enables co-design of accelerators and SoC interfaces.
Download gem5-Aladdin here:

