Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin


Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Yakun Sophia Shao&, Sam Xi, Viji Srinivasan*, Gu-Yeon Wei, and David Brooks
Harvard University; &NVIDIA; *IBM Research

Accelerator Design
[Diagram: an algorithm maps onto an accelerator of parallel lanes backed by local SRAM; the design space spans power, cycle count, number of lanes, local SRAM bandwidth, and SRAM size.]

Problem: system-level interactions are not considered when accelerators are designed in isolation. In a complex SoC, the accelerator's execution is more than just computation.
[Diagram: an SoC with CPU 0/CPU 1 (L1/L2 caches), an accelerator with local memory, a system bus, a memory controller to DRAM, and a multi-channel DMA engine whose transfer descriptors hold SRC ADDR, DEST ADDR, and LENGTH, with channel selection across CHAN 0 to CHAN 3; numbered steps 1 to 6 mark the stages of one accelerator invocation.]
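The transfer descriptors shown in the DMA engine can be pictured as a small struct queued per channel. A minimal C sketch; the field names and widths here are illustrative assumptions, not taken from any particular DMA controller:

```c
#include <stdint.h>

/* Illustrative DMA transfer descriptor: source, destination, length,
 * matching the SRC ADDR / DEST ADDR / LENGTH fields on the slide.
 * Field widths are assumptions for this sketch. */
typedef struct {
  uint64_t src_addr;   /* physical source address */
  uint64_t dest_addr;  /* physical destination address */
  uint32_t length;     /* bytes to transfer */
} dma_descriptor_t;

/* A multi-channel controller selects among per-channel descriptor queues
 * (CHAN 0 .. CHAN 3 on the slide). */
#define NUM_CHANNELS 4
typedef struct {
  dma_descriptor_t* queue[NUM_CHANNELS]; /* head descriptor per channel */
  int head[NUM_CHANNELS];                /* per-channel queue position */
} dma_controller_t;
```

The CPU programs such descriptors (step by step in the diagram) and the engine then moves data without further CPU involvement.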

Accelerator Execution Flow
[Diagram: steps 1 to 6 of an accelerator invocation across the CPUs, caches, DMA engine with transfer descriptors, system bus, memory controller, and DRAM.]
For md-knn running on Zynq, only 20% of execution time is spent on computation; the rest goes to data movement and coordination.

Balanced Accelerator Design
Compute throughput is determined by the number of lanes, the local SRAM bandwidth, and the size of the local SRAM.
Data throughput is determined by the SoC interfaces: data movement, coherence handling, and shared resource contention.
A balanced design provisions compute to match what the SoC interfaces can actually deliver.
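The balance argument can be captured with a toy model: sustained throughput is the minimum of the compute side and the data side. A hedged C sketch; all parameter names and constants are illustrative, not values from the paper:

```c
/* Toy model of accelerator balance: sustained throughput is bounded by
 * the slower of the compute side and the data side. The constants below
 * are illustrative assumptions, not measurements from the paper. */
static double compute_throughput(int lanes, double ops_per_lane) {
  return lanes * ops_per_lane; /* ops per cycle the datapath can issue */
}

static double data_throughput(int sram_ports, double bytes_per_port,
                              double bytes_per_op) {
  /* ops per cycle the local SRAM and SoC interfaces can feed */
  return sram_ports * bytes_per_port / bytes_per_op;
}

static double achievable_throughput(int lanes, int sram_ports) {
  double c = compute_throughput(lanes, 1.0);
  double d = data_throughput(sram_ports, 4.0, 8.0);
  return c < d ? c : d; /* an over-designed accelerator has c >> d */
}
```

With these made-up numbers, 32 lanes fed by 4 SRAM ports are data-bound at 2 ops/cycle; the extra lanes only cost power and area, which is exactly the over-design the next slide illustrates.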

Over-designed Accelerator
When compute throughput (lanes, local SRAM bandwidth, local SRAM size) is provisioned without regard to the data throughput the SoC interfaces can sustain (data movement, coherence handling, shared resource contention), the extra compute resources sit idle.

Isolated vs. Co-Designed
[Results charts comparing isolated and co-designed accelerator configurations.]


Executive Summary
Goal: co-design accelerators and SoC interfaces for balanced accelerator design.
Methodology: gem5-Aladdin, an SoC simulator that models the interactions between accelerators and SoCs, validated to < 6% error against the Xilinx Zynq platform.
Takeaways:
- Co-designed accelerators are less aggressively parallel, leading to more balanced designs and improved energy efficiency.
- The choice of local memory, i.e., cache or DMA, is highly dependent on the memory characteristics of the workload and the system architecture.

Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator [ISCA'14, Top Picks'15]
Inputs: unmodified C code and accelerator design parameters (e.g., number of FUs, memory bandwidth).
Aladdin models an accelerator-specific datapath with a private L1/scratchpad, alongside shared memory/interconnect models, and outputs power, area, and performance estimates.
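Aladdin takes ordinary C as input and sweeps design parameters supplied separately. As a hedged illustration, a kernel of the kind it profiles; the function and parameter choices here are invented for this example, not taken from Aladdin's distribution:

```c
/* Example of the kind of unmodified C kernel Aladdin takes as input.
 * Design parameters (loop unrolling, array partitioning, pipelining,
 * memory bandwidth) live in a separate config file, so the same source
 * can be swept across many candidate accelerator designs. */
#define N 128

void stencil1d(const int in[N], int out[N]) {
  /* This loop is the natural target for unrolling (more lanes) and
   * array partitioning (more SRAM ports) in the design sweep. */
  for (int i = 1; i < N - 1; i++)
    out[i] = in[i - 1] + in[i] + in[i + 1];
}
```

Because the source is unmodified, the same kernel can also be compiled natively for functional testing, which is how the pre-RTL flow stays fast.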

gem5-Aladdin: An SoC Simulator
[Diagram: two accelerators attached to the system bus alongside CPU 0/CPU 1 with L1/L2 caches. ACC0 uses a scratchpad (SPAD) interface with lanes, arrays (ARR 0/ARR 1), and buffers (BUF 0/BUF 1); ACC1 uses a cache interface with a TLB. A multi-channel DMA engine (transfer descriptors with SRC ADDR, DEST ADDR, LENGTH; channel selection across CHAN 0 to CHAN 3) and a memory controller connect to DRAM.]

gem5-Aladdin: An SoC Simulator
DMA Engine:
- Leverages the DMA engine in gem5.
- Inserts dmaLoad and dmaStore APIs into Aladdin's trace.
- Models CPU cache flush/invalidate latency through Zynq board characterization.
Cache Interface:
- Models Intel HARP-like and IBM CAPI-like platforms.
- Uses gem5's cache model and implements Aladdin's TLB for address translation.
CPU-Accelerator Interface:
- Invokes accelerators through the ioctl system call.
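A kernel written for the DMA interface explicitly stages data between host memory and the accelerator's scratchpad. A minimal sketch, assuming dmaLoad/dmaStore take (dst, src, size) arguments as in gem5-Aladdin's headers; for a native build this sketch models them as plain memcpy so the code also runs on the host:

```c
#include <string.h>

/* Hedged assumption: outside a gem5-Aladdin build, treat the DMA
 * primitives as functionally equivalent memcpy calls so the kernel
 * can be compiled and tested natively. */
#ifndef GEM5
#define dmaLoad(dst, src, size) memcpy((dst), (src), (size))
#define dmaStore(dst, src, size) memcpy((dst), (src), (size))
#endif

#define N 64

void vector_add(int* host_a, int* host_b, int* host_c) {
  /* Local arrays stand in for the accelerator's scratchpad. */
  int a[N], b[N], c[N];
  dmaLoad(a, host_a, N * sizeof(int));   /* DMA: host -> scratchpad */
  dmaLoad(b, host_b, N * sizeof(int));
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];                  /* compute out of scratchpad */
  dmaStore(host_c, c, N * sizeof(int));  /* DMA: scratchpad -> host */
}
```

These dmaLoad/dmaStore calls are what appear in Aladdin's trace, which is how the simulator knows where transfers (and the associated flush/invalidate costs) occur.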

Validation
[Flow: the application's accelerated kernel is compiled with Vivado HLS to Verilog and run on the FPGA fabric of a Xilinx Zynq SoC (ARM core plus DMA IP block); measured flush latency, DMA latency, and accelerator latency are compared against simulation.]

Validation
[Chart of validation results against the Zynq platform.]

Designed-in-Isolation vs. Co-Designed
[Diagram: four configurations. (1) Isolated Design: accelerator with scratchpad only. (2) DMA Co-Design: scratchpad accelerator plus a multi-channel DMA engine on the system bus, sharing the memory controller and DRAM with a CPU. (3) Cache Co-Design: accelerator with a cache on the system bus. (4) Cache Co-Design with a wider bus.]

Designed-in-Isolation vs. Co-Designed
Four designs are compared on the number of lanes and the size and bandwidth of local SRAM: (1) Isolated Design, (2) DMA Co-Design, (3) Cache Co-Design, (4) Cache Co-Design with a wider bus.

Designed-in-Isolation vs. Co-Designed

  Design                           Lanes   Local SRAM   SRAM Ports
  Isolated Design                  32      45KB         16
  DMA Co-Design                    4       45KB         4
  Cache Co-Design                  8       16KB         4
  Cache Co-Design w/ a wider bus   32      16KB         4

Designed-in-Isolation vs. Co-Designed
[Results chart for the four designs.]

EDP Improvement
[Chart of energy-delay product (EDP) improvement from co-design.]

Also in the paper…
DMA Optimizations: pipelined DMA; DMA-triggered computation.
Cache Design Space Exploration: latency vs. bandwidth; impact of datapath parallelism.
DMA vs. Cache Comparisons: Pareto-frontier designs.

Conclusions
- Architects must take a holistic view of accelerator design: accelerators designed in isolation tend to overprovision hardware resources.
- gem5-Aladdin enables co-design of accelerators and SoC interfaces.
- Download gem5-Aladdin: http://vlsiarch.eecs.harvard.edu/gem5-aladdin/