Please do not distribute 4/17/2017

Tutorial Outline

Time                   Topic
9:00 am – 9:30 am      Introduction
9:30 am – 10:10 am     Standalone Accelerator Simulation: Aladdin
10:10 am – 10:30 am    Standalone Accelerator Generation: High-Level Synthesis
10:30 am – 11:00 am    HLS-Based Accelerator-Rich Architecture Simulation: PARADE
11:00 am – 11:30 am    Break
11:30 am – 12:00 pm    Pre-RTL SoC Simulation: gem5-Aladdin
12:00 pm – 12:30 pm    FPGA Prototyping: ARACompiler
12:30 pm – 2:00 pm     Lunch
2:00 pm – 3:00 pm      Panel on Accelerator Research
3:00 pm – 3:30 pm      Accelerator Benchmarks and Workload Characterization
3:30 pm – 4:00 pm
4:00 pm – 5:00 pm      Hands-on Exercise

Amortize optimization phase

Integration for Heterogeneous SoC Modeling
Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
Harvard University

Accelerator-CPU Integration: Today’s Conventional SoCs
    Easy to integrate lots of IP, simple accelerator design
    Hard to program and share data
[Figure: block diagram with labels Core, L2 $, …, L3 $, DMA, On-Chip System Bus, Acc #1 (Scratchpad), Acc #n]

Accelerator Integration Trend
    Users design application-specific hardware accelerators.
    System vendors provide Host Service Layer with virtual memory and cache coherence support
        Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
        IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
[Figure: block diagram with labels Main CPU/SoC (Core, …, Core, L2 $, L2 $, L3 $, Acc Agent) and FPGA or user-defined ASIC (Accelerator, Host Service Layer)]

IBM CAPI: Two part solution
    Example of state-of-the-art: IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
    Virtual Addressing & Data Caching
    Easier, Natural Programming Model

IBM CAPI: Two part solution
    Coherent Accelerator Processor Proxy (CAPP)
        Snoops PowerBus on behalf of accelerator
    Power Service Layer (PSL)
        Performs address translations, page table walker support
        Provides cache and interface logic
[Figure: block diagram with labels Accelerator, Core, …, Core, PCIe, L2 $, L2 $, PSL, CAPP, L3 $, On-Chip Coherent PowerBus, …, Memory, Cache, TLB]

But… accelerators are not one size fits all
    Problem: the PSL consumes ~20–30% of FPGA resources… for one accelerator
    Applications have drastically different requirements.
    Memory design customization is often more important than datapath customization

gem5-Aladdin Integration
[Figure: block diagram with labels CPU, Acc Datapath, Cache, Scratchpad, TLB, DMA Engine, Cache, LLC, DRAM]

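Code example: Sift

    void imsmooth(F2D* array, float sigma, F2D* product);

    void sift() {
      …
      // Original software call to the kernel.
      imsmooth(I, temp, gss[0]);
      // Offload: expose the kernel's arrays to the accelerator by name,
      // then invoke it and block until it finishes.
      mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
      mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
      invokeAcceleratorAndBlock(imsmooth);
    }
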
Code example: Sift

    void imsmooth(F2D* array, float sigma, F2D* product);

    void sift() {
      …
      // The software call is replaced by the accelerator invocation.
      // imsmooth(I, temp, gss[0]);
      mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
      mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
      invokeAccelerator(imsmooth);  // Start Aladdin Simulation
    }

Simulating Accelerator with Memory System using Aladdin
[Figure: block diagrams, one with labels Acc, Cache, Memory and one with labels CPU, Cache, Memory]

Modeling Accelerators in an SoC-like Environment
[Figure: block diagram with labels Acc, Core, Core, Cache, Memory]

Accelerator Research Infrastructure

                Standalone              System Integration
Modeling        Aladdin                 gem5-Aladdin
RTL             High-Level Synthesis    PARADE
Prototyping     FPGA

Tutorial References
    Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS’13.
    B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED’13.
    Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA’14.
    B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC’14.