Integration for Heterogeneous SoC Modeling
Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks
Harvard University
4/11/2018
Please do not distribute
GYW

Today’s Accelerator-CPU Integration
Simple interface to accelerators: DMA.
- Easy to integrate lots of IP.
- Hard to program and share data.
[Diagram labels: Core … Core, Acc #1, Acc #n, L1 $, SPAD, L2 $, On-Chip System Bus, DMA, DRAM]

Typical DMA Flow
1. Flush and invalidate input data from CPU caches.
2. Invalidate a region of memory to be used for receiving accelerator output.
3. Program a buffer descriptor describing the transfer (start, length, source, destination). When data is large, program multiple descriptors.
4. Initiate accelerator.
5. Initiate data transfer.
6. Wait for accelerator to complete.
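The descriptor step of this flow can be sketched in C. This is only an illustration under stated assumptions, not any particular DMA controller's interface: the `dma_descriptor` layout, the `build_descriptors` helper, and the per-descriptor size limit `max_chunk` are all hypothetical names introduced here. It shows how one large transfer might be split across multiple descriptors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer descriptor: one contiguous chunk of a transfer. */
struct dma_descriptor {
    uintptr_t src;     /* source address of this chunk */
    uintptr_t dst;     /* destination address of this chunk */
    size_t    length;  /* number of bytes to move for this chunk */
};

/*
 * Split a transfer of `total` bytes into descriptors of at most
 * `max_chunk` bytes each (an assumed hardware limit). Fills at most
 * `capacity` entries of `out` and returns how many descriptors the
 * transfer needs in total, so the caller can detect a too-small array.
 */
size_t build_descriptors(uintptr_t src, uintptr_t dst, size_t total,
                         size_t max_chunk, struct dma_descriptor *out,
                         size_t capacity)
{
    size_t count = 0;
    size_t offset = 0;

    if (max_chunk == 0)
        return 0;

    while (offset < total) {
        size_t remaining = total - offset;
        size_t len = remaining < max_chunk ? remaining : max_chunk;

        if (count < capacity) {
            out[count].src = src + offset;
            out[count].dst = dst + offset;
            out[count].length = len;
        }
        count++;
        offset += len;
    }
    return count;
}
```

With a 4096-byte limit, a 10000-byte transfer needs three descriptors: two full chunks and a final chunk of 1808 bytes.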

DMA can be very expensive
[Figure: 16-way parallel md-knn accelerator. Annotation: "Only 20% of total time!"]

Co-Design vs. Isolated Design No need to build such an aggressively parallel design!

gem5-Aladdin: An SoC Simulator

Features End-to-end simulation of accelerated workloads. Models hardware-managed caches and DMA + scratchpad memory systems. Supports multiple accelerators. Enables system-level studies of accelerator-centric platforms. Xenon: A powerful design sweep system. Highly configurable and extensible.

DMA Engine
Extend the existing DMA engine in gem5 to accelerators.
Special dmaLoad and dmaStore functions:
- Insert them into the accelerated kernel.
- The trace will capture them.
- gem5-Aladdin will handle them.
- Data is sent back and forth as required.
Analytical model for cache flush and invalidation latency:
- Flush throughput is 1 cache line per 56 cycles (84 ns), plus 150 cycles of overhead.
- Invalidate throughput is 1 cache line per 47 cycles (70 ns), plus 400 cycles of overhead.
- All cycles are CPU clock cycles at 666 MHz.
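The analytical model above can be written out as a small calculation. This sketch reads the slide as "fixed overhead plus a per-line cost", which is an interpretation rather than the simulator's actual code. The function names are hypothetical, and the cache line size is left as a parameter because the slide does not state it.

```c
#include <stddef.h>

/* Constants quoted on the slide (CPU clock cycles at 666 MHz). */
#define FLUSH_CYCLES_PER_LINE       56
#define FLUSH_OVERHEAD_CYCLES       150
#define INVALIDATE_CYCLES_PER_LINE  47
#define INVALIDATE_OVERHEAD_CYCLES  400
#define CPU_CLOCK_MHZ               666

/* Number of cache lines covered by `bytes`, rounding up. Assumes the
 * region starts on a line boundary and that line_bytes is nonzero. */
static size_t lines_for(size_t bytes, size_t line_bytes)
{
    return (bytes + line_bytes - 1) / line_bytes;
}

/* Estimated cycles to flush `bytes` of data from the CPU caches. */
unsigned long flush_cycles(size_t bytes, size_t line_bytes)
{
    return (unsigned long)lines_for(bytes, line_bytes)
               * FLUSH_CYCLES_PER_LINE
           + FLUSH_OVERHEAD_CYCLES;
}

/* Estimated cycles to invalidate `bytes` of data in the CPU caches. */
unsigned long invalidate_cycles(size_t bytes, size_t line_bytes)
{
    return (unsigned long)lines_for(bytes, line_bytes)
               * INVALIDATE_CYCLES_PER_LINE
           + INVALIDATE_OVERHEAD_CYCLES;
}

/* Convert CPU cycles to nanoseconds: ns = cycles * 1000 / MHz. */
double cycles_to_ns(unsigned long cycles)
{
    return (double)cycles * 1000.0 / CPU_CLOCK_MHZ;
}
```

For example, assuming 64-byte cache lines, a 4096-byte buffer covers 64 lines, so the model gives 64 × 56 + 150 = 3734 cycles to flush and 64 × 47 + 400 = 3408 cycles to invalidate.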

DMA Engine
    /* Code representing the accelerator */
    void fft1D_512(TYPE work_x[512], TYPE work_y[512]) {
        int tid, hi, lo, stride;
        /* more setup */
        /* Load both input arrays over DMA before computing. */
        dmaLoad(&work_x[0], 0, 512 * sizeof(TYPE));
        dmaLoad(&work_y[0], 0, 512 * sizeof(TYPE));
        /* Run FFT here ... */
        /* Store both arrays back over DMA once the FFT is done. */
        dmaStore(&work_x[0], 0, 512 * sizeof(TYPE));
        dmaStore(&work_y[0], 0, 512 * sizeof(TYPE));
    }
Note: this differs from how some platforms do DMA. Here the accelerator initiates the transfer, whereas on most platforms the CPU does. The difference does not really matter for this model: either way the cost of DMA is paid, and it only amounts to a small bit of power on the CPU side.

Caches and Virtual Memory
Gaining traction on multiple platforms:
- Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
- IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
System vendors provide a Host Service Layer with virtual memory and cache coherence support. The Host Service Layer communicates with CPUs through an agent. The Host Service Layer might contain a cache and TLB. The accelerator agent would snoop the system buses on behalf of the accelerator and service cache misses and TLB misses.
[Diagram labels: Processors, FPGA, QPI/PCIe, Core … Core, L1 $, L2 $, Accelerator, Acc Agent, Host Service Layer]

Caches and Virtual Memory Accelerator caches are connected directly to the system bus. Support for multi-level cache hierarchies. Hybrid memory system: can use both caches and scratchpads. MOESI coherence protocol. Special Aladdin TLB model that maps the trace address space to the simulated address space.

Two ways to run gem5-Aladdin
Standalone: Aladdin + gem5 memory system models. No CPUs in the system. Easily test accelerator and memory system designs.
With-CPU: Write a user program to invoke one or more accelerators. Evaluate end-to-end workload performance.

Validation
Implemented accelerators in Vivado HLS. Designed complete system in Vivado Design Suite 2015.1.

Case study: reducing DMA overheads

Reducing DMA Overhead

DMA Optimization Results
[Figures: overlap of flush and data transfer; overlap of data transfer and compute]
md-knn is able to completely overlap computation with communication!

CPU – Accelerator Cosimulation CPU can invoke an attached accelerator. We use the ioctl system call. Communicate status through shared memory. Spin wait for accelerator, or do something else (e.g. start another accelerator).

Code example
    /* Code running on the CPU. */
    void run_benchmark(TYPE work_x[512], TYPE work_y[512]) {
        /* Establish a mapping from simulated to trace
         * address space. The array parameters decay to pointers,
         * so sizeof(work_x) would give the pointer size; pass the
         * array length in bytes explicitly. */
        mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_x",
                              work_x, 512 * sizeof(TYPE));
        mapArrayToAccelerator(MACHSUITE_FFT_TRANSPOSE, "work_y",
                              work_y, 512 * sizeof(TYPE));
        // Start the accelerator and spin until it finishes.
        invokeAcceleratorAndBlock(MACHSUITE_FFT_TRANSPOSE);
    }
Arguments to mapArrayToAccelerator:
- First argument: the ioctl request code.
- Second argument: the array name to associate with the addresses of memory accesses in the trace.
- Third and fourth arguments: the starting address and length of one memory region that the accelerator can access.
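One way to picture what such a mapping enables is a lookup that turns a trace address into a simulated address by array name and offset. The sketch below is purely illustrative and is not the gem5-Aladdin implementation: the `array_mapping` struct and `trace_to_sim` function are hypothetical, and the addresses in the usage note are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one mapped array. */
struct array_mapping {
    const char *name;       /* array name as it appears in the trace */
    uintptr_t   trace_base; /* base address recorded in the trace */
    uintptr_t   sim_base;   /* base address in the simulated address space */
    size_t      size;       /* length of the region in bytes */
};

/*
 * Translate a trace address belonging to array `name` into the simulated
 * address space. Returns 1 and stores the result in *sim_addr on success,
 * or 0 if the array is unknown or the address lies outside its region.
 */
int trace_to_sim(const struct array_mapping *maps, size_t n,
                 const char *name, uintptr_t trace_addr,
                 uintptr_t *sim_addr)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(maps[i].name, name) != 0)
            continue;
        if (trace_addr < maps[i].trace_base)
            return 0;
        uintptr_t offset = trace_addr - maps[i].trace_base;
        if (offset >= maps[i].size)
            return 0;
        *sim_addr = maps[i].sim_base + offset;
        return 1;
    }
    return 0;
}
```

For instance, if "work_y" were recorded at trace base 0x9000 and mapped to simulated base 0x200000, the trace address 0x9010 would translate to 0x200010, while an address past the end of the region would be rejected.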

One accelerator, multiple calls
Call an accelerated function in a loop with different data each time. Build the trace as usual; the trace will contain all iterations of this loop. Aladdin identifies call and ret instructions to mark the boundaries of an invocation. Aladdin reads only the part of the trace belonging to the current invocation, then continues as usual. On the next iteration, Aladdin resumes reading the trace at the last position.
[Diagram: iterations i=0, i=1, i=2, each showing CPU code and an ACCEL region bounded by call and ret]
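The resume-at-last-position behavior can be illustrated with a toy trace reader. This is a sketch under simplifying assumptions, not Aladdin's trace parser: the `trace_op` enum and `read_invocation` function are hypothetical, the trace is reduced to three kinds of entry, and nested calls are not handled.

```c
#include <stddef.h>

/* Hypothetical trace entry kinds; a real trace carries far more detail. */
enum trace_op { OP_CALL, OP_RET, OP_OTHER };

/*
 * Consume one invocation from `trace`, starting at *pos. Skips ahead to
 * the next call, then reads up to the next ret. Returns the number of
 * entries strictly between call and ret, or -1 if no complete invocation
 * remains. On success *pos is left just past the ret, so the following
 * call to this function resumes where this one stopped.
 */
long read_invocation(const enum trace_op *trace, size_t len, size_t *pos)
{
    size_t i = *pos;

    while (i < len && trace[i] != OP_CALL)
        i++;
    if (i == len)
        return -1;

    size_t body_start = ++i;   /* first entry after the call */
    while (i < len && trace[i] != OP_RET)
        i++;
    if (i == len)
        return -1;

    *pos = i + 1;              /* resume point for the next invocation */
    return (long)(i - body_start);
}
```

Calling `read_invocation` once per loop iteration with the same `pos` variable walks through the invocations in order, one call/ret pair at a time.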

Multiple accelerators
Build the trace as usual. Then either:
1. Divide it up into separate traces for each kernel. In the user code, call invokeAccelerator() with a different request code for each accelerator. This makes it easier to distinguish the output of different accelerators.
2. Leave it as a single trace. invokeAccelerator() has the same request code each time, even though a different workload is modeled.

How can I use gem5-Aladdin? Investigate optimizations to the DMA flow. Study cache-based accelerators. Study impact of system-level effects on accelerator design. Multi-accelerator systems. Near-data processing. All these will require design sweeps!

Xenon: Design Sweep System A small declarative command language for generating design sweep configurations. Implemented as a Python embedded DSL. Highly extensible. Not gem5-Aladdin specific. Not limited to sweeping parameters on benchmarks. Why “Xenon”?

1,000 ft view of Xenon
Xenon operates on Python objects and attributes.
1. Define a data model.
2. Instantiate the data model.
3. Execute Xenon commands over the data.

Xenon: Data Model
[Diagram labels: md-knn, md_kernel, loop_i, loop_j, force_x, force_y, force_z; attributes: cycle_time, pipelining, partition_type, partition_factor, memory_type, unrolling]

Xenon: Commands
    set unrolling 4
    set partition_type "cyclic"
    set unrolling for md_knn.* 8
    set partition_type for md_knn.force_x "block"
    sweep cycle_time from 1 to 5
    sweep partition_factor from 1 to 8 expstep 2
    set partition_factor for md_knn.force_x 8
    generate configs
    generate trace
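To make the two sweep commands concrete, the sketch below expands them into explicit value lists. It is not Xenon code (Xenon is a Python embedded DSL). It rests on two assumptions about the command semantics: a sweep with no step advances by 1, and `expstep 2` multiplies the value by 2 at each step. The function names are hypothetical.

```c
#include <stddef.h>

/*
 * Expand "sweep X from lo to hi" into lo, lo+step, ... while <= hi.
 * A step of 1 for sweeps that name no step is an assumption.
 * Fills at most `capacity` entries and returns the total value count.
 */
size_t linear_sweep(int lo, int hi, int step, int *out, size_t capacity)
{
    size_t n = 0;

    if (step <= 0)
        return 0;
    for (long long v = lo; v <= hi; v += step) {
        if (n < capacity)
            out[n] = (int)v;
        n++;
    }
    return n;
}

/*
 * Expand "sweep X from lo to hi expstep f" into lo, lo*f, lo*f*f, ...
 * while <= hi, assuming expstep means a multiplicative step.
 */
size_t exp_sweep(int lo, int hi, int factor, int *out, size_t capacity)
{
    size_t n = 0;

    if (lo <= 0 || factor <= 1)
        return 0;
    for (long long v = lo; v <= hi; v *= factor) {
        if (n < capacity)
            out[n] = (int)v;
        n++;
    }
    return n;
}
```

Under these assumptions, `sweep cycle_time from 1 to 5` yields the five values 1 through 5, and `sweep partition_factor from 1 to 8 expstep 2` yields 1, 2, 4, 8. If swept parameters are combined as a cross product (also an assumption), that gives 5 × 4 = 20 combinations before the per-array `set ... for` overrides are applied.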

Xenon: Generation Procedure
1. Read sweep configuration file.
2. Execute sweep commands.
3. Generate all configurations.
4. Export configurations to JSON.
5. Backend: read JSON, rewrite into desired format.
6. Backend: generate any additional outputs.

Xenon: Execute
Every configuration is written to a JSON file.
    "Benchmark(\"md-knn\")": {
      "Array(\"NL\")": {
        "memory_type": "cache",
        "name": "NL",
        "partition_factor": 1,
        "partition_type": "cyclic",
        "size": 4096,
        "type": "Array",
        "word_length": 8
      },
      "Array(\"force_x\")": {
        "name": "force_x",
        "size": 256,
        ...
      },
      "Array(\"force_y\")": {
        "name": "force_y",
        ...
      }
      ...
A backend is then invoked to load this JSON object and write application-specific config files.

gem5-Aladdin
System effects have a significant impact on accelerator performance and design. gem5-Aladdin enables the study of end-to-end accelerated workloads, including data movement, cache coherency, and shared resource contention. Download gem5-Aladdin at: http://vlsiarch.eecs.harvard.edu/gem5-aladdin

Demos

Demo: DMA
Exercise: change the system bus width and see the effect on accelerator performance.
1. Open up your VM.
2. Go to: ~gem5-aladdin/sweeps/tutorial/dma/stencil-stencil2d/0
3. Examine these files:
   - stencil-stencil2d.cfg
   - ../inputs/dynamic_trace.gz
   - gem5.cfg
   - run.sh

Demo: DMA
1. Run the accelerator with DMA simulation.
2. Change the system bus width to 32 bits: set xbar_width=4 in run.sh.
3. Run again.
4. Compare results.
At 64 bits, the cycle count is 37058. At 32 bits, it is 45246.

Demo: Caches
Exercise: see the effect of cache size on accelerator performance.
1. Go to: ~gem5-aladdin/sweeps/tutorial/cache/stencil-stencil2d/0
2. Examine these files:
   - ../inputs/dynamic_trace.gz
   - stencil-stencil2d.cfg
   - gem5.cfg

Demo: Caches
1. Run the accelerator with cache simulation.
2. Change the cache size to 1kB: set cache_size = 1kB in gem5.cfg.
3. Run again.
4. Compare results.
5. Play with some other parameters (associativity, line size, etc.).
Results: cache size = 1kB gives 92225; cache size = 4kB gives 77738.
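The numbers quoted in the two demos can be compared with a one-line calculation. The helper below is a hypothetical convenience, not part of the tutorial scripts, and it assumes the cache demo figures are cycle counts like those in the DMA demo.

```c
/*
 * Extra cycles of `measured` relative to `baseline`, expressed in tenths
 * of a percent. Uses integer arithmetic and rounds toward zero.
 * `baseline` must be nonzero.
 */
long long extra_cycles_permille(long long baseline, long long measured)
{
    return (measured - baseline) * 1000 / baseline;
}
```

Taking the 64-bit bus as the baseline, `extra_cycles_permille(37058, 45246)` gives 220, so halving the bus width costs about 22% more cycles. Taking the 4kB cache as the baseline, `extra_cycles_permille(77738, 92225)` gives 186, so the 1kB cache costs about 18.6% more.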

Demo: disparity You can just watch for this one. If you want to follow along: ~/gem5-aladdin/sweeps/tutorial/cortexsuite_sweep/0 This is a multi-kernel, CPU + accelerator cosimulation.

Tutorial References
Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin,” MICRO, 2016.
Y.S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, D. Brooks, “Toward Cache-Friendly Hardware Accelerators,” SCAW, 2015.
Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS, 2013.
B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED, 2013.
Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA, 2014.
B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC, 2014.