Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin


Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Yakun Sophia Shao&, Sam Xi, Viji Srinivasan*, Gu-Yeon Wei, and David Brooks
Harvard University; &NVIDIA; *IBM Research

Accelerator Design
[Diagram: an algorithm maps onto an accelerator of parallel lanes backed by local SRAM; the design space spans power, cycle count, number of lanes, local SRAM bandwidth, and SRAM size.]

Problem: system-level interactions are not considered when accelerators are designed in isolation. In a complex SoC, the accelerator's execution is more than just computation.
[Diagram: an SoC with CPU 0/CPU 1 (L1/L2 caches), an accelerator with local memory, a system bus, a memory controller to DRAM, and a multi-channel DMA engine whose transfer descriptors hold SRC ADDR, DEST ADDR, and LENGTH, with channel selection across CHAN 0 to CHAN 3; numbered steps 1 to 6 mark the stages of one accelerator invocation.]
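The transfer descriptors shown in the DMA engine can be pictured as a small struct queued per channel. A minimal C sketch; the field names and widths here are illustrative assumptions, not taken from any particular DMA controller:

```c
#include <stdint.h>

/* Illustrative DMA transfer descriptor: source, destination, length,
 * matching the SRC ADDR / DEST ADDR / LENGTH fields on the slide.
 * Field widths are assumptions for this sketch. */
typedef struct {
  uint64_t src_addr;   /* physical source address */
  uint64_t dest_addr;  /* physical destination address */
  uint32_t length;     /* bytes to transfer */
} dma_descriptor_t;

/* A multi-channel controller selects among per-channel descriptor queues
 * (CHAN 0 .. CHAN 3 on the slide). */
#define NUM_CHANNELS 4
typedef struct {
  dma_descriptor_t* queue[NUM_CHANNELS]; /* head descriptor per channel */
  int head[NUM_CHANNELS];                /* per-channel queue position */
} dma_controller_t;
```

The CPU programs such descriptors (step by step in the diagram) and the engine then moves data without further CPU involvement.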

Accelerator Execution Flow
[Diagram: steps 1 to 6 of an accelerator invocation across the CPUs, caches, DMA engine with transfer descriptors, system bus, memory controller, and DRAM.]
For md-knn running on Zynq, only 20% of execution time is spent on computation; the rest goes to data movement and coordination.

Balanced Accelerator Design
Compute throughput is determined by the number of lanes, the local SRAM bandwidth, and the size of the local SRAM.
Data throughput is determined by the SoC interfaces: data movement, coherence handling, and shared resource contention.
A balanced design provisions compute to match what the SoC interfaces can actually deliver.
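The balance argument can be captured with a toy model: sustained throughput is the minimum of the compute side and the data side. A hedged C sketch; all parameter names and constants are illustrative, not values from the paper:

```c
/* Toy model of accelerator balance: sustained throughput is bounded by
 * the slower of the compute side and the data side. The constants below
 * are illustrative assumptions, not measurements from the paper. */
static double compute_throughput(int lanes, double ops_per_lane) {
  return lanes * ops_per_lane; /* ops per cycle the datapath can issue */
}

static double data_throughput(int sram_ports, double bytes_per_port,
                              double bytes_per_op) {
  /* ops per cycle the local SRAM and SoC interfaces can feed */
  return sram_ports * bytes_per_port / bytes_per_op;
}

static double achievable_throughput(int lanes, int sram_ports) {
  double c = compute_throughput(lanes, 1.0);
  double d = data_throughput(sram_ports, 4.0, 8.0);
  return c < d ? c : d; /* an over-designed accelerator has c >> d */
}
```

With these made-up numbers, 32 lanes fed by 4 SRAM ports are data-bound at 2 ops/cycle; the extra lanes only cost power and area, which is exactly the over-design the next slide illustrates.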

Over-designed Accelerator
When compute throughput (lanes, local SRAM bandwidth, local SRAM size) is provisioned without regard to the data throughput the SoC interfaces can sustain (data movement, coherence handling, shared resource contention), the extra compute resources sit idle.

Isolated vs. Co-Designed
[Results charts comparing isolated and co-designed accelerator configurations.]


Executive Summary
Goal: co-design accelerators and SoC interfaces for balanced accelerator design.
Methodology: gem5-Aladdin, an SoC simulator that models the interactions between accelerators and SoCs, validated to < 6% error against the Xilinx Zynq platform.
Takeaways:
- Co-designed accelerators are less aggressively parallel, leading to more balanced designs and improved energy efficiency.
- The choice of local memory, i.e., cache or DMA, is highly dependent on the memory characteristics of the workload and the system architecture.

Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator [ISCA'14, Top Picks'15]
Inputs: unmodified C code and accelerator design parameters (e.g., number of FUs, memory bandwidth).
Aladdin models an accelerator-specific datapath with a private L1/scratchpad, alongside shared memory/interconnect models, and outputs power, area, and performance estimates.
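Aladdin takes ordinary C as input and sweeps design parameters supplied separately. As a hedged illustration, a kernel of the kind it profiles; the function and parameter choices here are invented for this example, not taken from Aladdin's distribution:

```c
/* Example of the kind of unmodified C kernel Aladdin takes as input.
 * Design parameters (loop unrolling, array partitioning, pipelining,
 * memory bandwidth) live in a separate config file, so the same source
 * can be swept across many candidate accelerator designs. */
#define N 128

void stencil1d(const int in[N], int out[N]) {
  /* This loop is the natural target for unrolling (more lanes) and
   * array partitioning (more SRAM ports) in the design sweep. */
  for (int i = 1; i < N - 1; i++)
    out[i] = in[i - 1] + in[i] + in[i + 1];
}
```

Because the source is unmodified, the same kernel can also be compiled natively for functional testing, which is how the pre-RTL flow stays fast.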

gem5-Aladdin: An SoC Simulator
[Diagram: two accelerators attached to the system bus alongside CPU 0/CPU 1 with L1/L2 caches. ACC0 uses a scratchpad (SPAD) interface with lanes, arrays (ARR 0/ARR 1), and buffers (BUF 0/BUF 1); ACC1 uses a cache interface with a TLB. A multi-channel DMA engine (transfer descriptors with SRC ADDR, DEST ADDR, LENGTH; channel selection across CHAN 0 to CHAN 3) and a memory controller connect to DRAM.]

gem5-Aladdin: An SoC Simulator
DMA Engine:
- Leverages the DMA engine in gem5.
- Inserts dmaLoad and dmaStore APIs into Aladdin's trace.
- Models CPU cache flush/invalidate latency through Zynq board characterization.
Cache Interface:
- Models Intel HARP-like and IBM CAPI-like platforms.
- Uses gem5's cache model and implements Aladdin's TLB for address translation.
CPU-Accelerator Interface:
- Invokes accelerators through the ioctl system call.
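A kernel written for the DMA interface explicitly stages data between host memory and the accelerator's scratchpad. A minimal sketch, assuming dmaLoad/dmaStore take (dst, src, size) arguments as in gem5-Aladdin's headers; for a native build this sketch models them as plain memcpy so the code also runs on the host:

```c
#include <string.h>

/* Hedged assumption: outside a gem5-Aladdin build, treat the DMA
 * primitives as functionally equivalent memcpy calls so the kernel
 * can be compiled and tested natively. */
#ifndef GEM5
#define dmaLoad(dst, src, size) memcpy((dst), (src), (size))
#define dmaStore(dst, src, size) memcpy((dst), (src), (size))
#endif

#define N 64

void vector_add(int* host_a, int* host_b, int* host_c) {
  /* Local arrays stand in for the accelerator's scratchpad. */
  int a[N], b[N], c[N];
  dmaLoad(a, host_a, N * sizeof(int));   /* DMA: host -> scratchpad */
  dmaLoad(b, host_b, N * sizeof(int));
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];                  /* compute out of scratchpad */
  dmaStore(host_c, c, N * sizeof(int));  /* DMA: scratchpad -> host */
}
```

These dmaLoad/dmaStore calls are what appear in Aladdin's trace, which is how the simulator knows where transfers (and the associated flush/invalidate costs) occur.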

Validation
[Flow: the application's accelerated kernel is compiled with Vivado HLS to Verilog and run on the FPGA fabric of a Xilinx Zynq SoC (ARM core plus DMA IP block); measured flush latency, DMA latency, and accelerator latency are compared against simulation.]

Validation
[Chart of validation results against the Zynq platform.]

Designed-in-Isolation vs. Co-Designed
[Diagram: four configurations. (1) Isolated Design: accelerator with scratchpad only. (2) DMA Co-Design: scratchpad accelerator plus a multi-channel DMA engine on the system bus, sharing the memory controller and DRAM with a CPU. (3) Cache Co-Design: accelerator with a cache on the system bus. (4) Cache Co-Design with a wider bus.]

Designed-in-Isolation vs. Co-Designed
Four designs are compared on the number of lanes and the size and bandwidth of local SRAM: (1) Isolated Design, (2) DMA Co-Design, (3) Cache Co-Design, (4) Cache Co-Design with a wider bus.

Designed-in-Isolation vs. Co-Designed

  Design                           Lanes   Local SRAM   SRAM Ports
  Isolated Design                  32      45KB         16
  DMA Co-Design                    4       45KB         4
  Cache Co-Design                  8       16KB         4
  Cache Co-Design w/ a wider bus   32      16KB         4

Designed-in-Isolation vs. Co-Designed
[Results chart for the four designs.]

EDP Improvement
[Chart of energy-delay product (EDP) improvement from co-design.]

Also in the paper…
DMA Optimizations: pipelined DMA; DMA-triggered computation.
Cache Design Space Exploration: latency vs. bandwidth; impact of datapath parallelism.
DMA vs. Cache Comparisons: Pareto-frontier designs.

Conclusions
- Architects must take a holistic view of accelerator design: accelerators designed in isolation tend to overprovision hardware resources.
- gem5-Aladdin enables co-design of accelerators and SoC interfaces.
- Download gem5-Aladdin: http://vlsiarch.eecs.harvard.edu/gem5-aladdin/