
1 ECE 587 Hardware/Software Co-Design, Lecture 26/27: CUDA to FPGA Flow. Professor Jia Wang, Department of Electrical and Computer Engineering, Illinois Institute of Technology. April 18/20, 2016.

2 CUDA to FPGA Flow
Alexandros Papakonstantinou et al., "Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs," ACM TECS, 13(2):25, Sep. 2013.

3 Quick Summary
- Multicore heterogeneous multiprocessors have become popular: they combine processors with different compute characteristics to boost performance per watt.
- New programming models for parallel processing have risen alongside them, e.g., GPUs for compute-intensive kernels and CUDA programming.
- How can these investments be leveraged for other kinds of accelerators?
- FCUDA: use the CUDA programming model for FPGA design.
- Map the coarse- and fine-grained parallelism exposed in CUDA onto the FPGA.
- Source-to-source compilation transforms SIMT CUDA code into task-level parallel C code.
- AutoPilot (derived from xPilot) transforms the C code into RTL for the FPGA.

4 Introduction

5 History of Parallel Processing
- Parallel processing was once used mainly in supercomputing servers and clusters; it is widely available now.
- The power wall makes it preferable to spend additional transistors on more cores that support concurrency across cores, rather than on more complicated cores that support instruction-level parallelism.
- Massive parallelism on regular structures, e.g., Cell-BE [IBM 2006], TILE [TILERA 2012], GeForce [NVIDIA 2012b].

6 Heterogeneous Accelerators
- These massively parallel devices have very diverse characteristics, making each optimal for different types of applications and different usage scenarios.
- IBM Cell: an MPSoC with heterogeneous cores that serves either as a stand-alone multiprocessor or as a multicore accelerator.
- GPUs: hundreds of processing cores clustered into Streaming Multiprocessors (SMs), which require control by a host processor.
- Heterogeneity is the key for systems to utilize accelerators.
- Reconfigurable FPGAs can offer flexible, power-efficient, application-specific high-performance acceleration.

7 Challenges
- One has to program these devices to gain the performance and power advantages. NOT EASY!
- The cost of learning parallel programming beyond sequential (object-oriented) programming.
- The need to morph applications into the model supported by the device, e.g., DirectX/OpenGL for GPUs.
- The need to understand and work with the underlying hardware architecture, e.g., RTL design for FPGAs.
- Recent advances: HLS, CUDA/OpenCL.
- FCUDA as a solution: adopt CUDA as a common programming interface for GPUs/CPUs/FPGAs, allowing developers to program FPGAs efficiently for massively parallel application kernels.

8 FCUDA Flow (figure)

9 FCUDA SSTO (Source-to-Source Transformation and Optimization)
- Data communication and compute optimizations: analysis of the kernel dataflow, followed by reorganization of data communication and computation. These enable efficient mapping of the kernel's computation and data communication onto the FPGA hardware.
- Parallelism mapping transformations: expose CUDA parallelism in AutoPilot-C so that HLS can generate parallel RTL Processing Engines (PEs).

10 Why CUDA?
- CUDA provides a C-styled API for expressing massive parallelism in a very concise fashion; it is much easier to learn and use than writing AutoPilot-C directly.
- CUDA can serve as a common programming interface for heterogeneous compute clusters with both GPUs and FPGAs, simplifying application development: there is no need to port applications just to evaluate alternatives.
- The wide adoption and popularity of CUDA make a large body of existing applications available for FPGA acceleration.

11 Background

12 The FPGA Platform
- The capacity of FPGAs has been increasing dramatically: 28nm FPGAs host PLLs, ADCs, PCIe interfaces, general-purpose processors, and DSPs, along with millions of reconfigurable logic cells and thousands of distributed memories.
- Hard IP modules offer compute and data-communication efficiency.
- Reconfigurability enables one to leverage different types of application-specific parallelism: coarse- and fine-grained, data- and task-level, and pipelining.
- Multi-FPGA systems can further facilitate massive parallelism. The HC-1 Application-Specific Instruction Processor (ASIP) [Convey 2011] combines a multicore CPU with multi-FPGA-based custom instruction accelerators. The Novo-G supercomputer [CHREC 2012] hosts 192 reconfigurable devices and consumes almost three orders of magnitude less power than the Opteron-based Jaguar and Cell-based Roadrunner supercomputers at comparable performance on bioinformatics applications.

13 FPGA Programmability
- HLS tools let developers write sequential programs instead of RTL. However, extracting parallelism from sequential programs is restricted at granularities coarser than loop iterations.
- Possible solutions from HLS frameworks: new parallel programming models, or language extensions for coarse-grained parallelism annotation; but developers need to learn and use them.
- There are efforts to adopt OpenCL for FPGA accelerators, but how to program the FPGAs into those accelerators remains unsolved.
- FCUDA uses CUDA as a unified interface to program both GPUs and FPGAs.

14 CUDA Overview
- Threadblocks map to SMs and execute independently; synchronization between threadblocks is possible in recent versions of CUDA.
- Threads in a threadblock map to the SPs in an SM, allow barrier-like synchronization, and execute in groups called warps that share the same control flow.
- Memory hierarchy: registers (per SP), shared memory (per SM), global memory, and constant/texture memory (read-only).
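A minimal CUDA kernel illustrating the hierarchy above; the kernel, names, and tile size are illustrative, not from the slides:

```
// Illustrative CUDA kernel: each threadblock stages a tile of the
// input in shared memory, synchronizes, then each thread computes one
// output element.
#define TILE 256

__global__ void scale_shift(const float *in, float *out,
                            float a, float b, int n) {
    __shared__ float tile[TILE];            // per-threadblock shared memory (on the SM)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (gid < n)
        tile[threadIdx.x] = in[gid];        // one load per thread
    __syncthreads();                        // barrier across the threadblock
    if (gid < n)
        out[gid] = a * tile[threadIdx.x] + b;  // scalars live in per-SP registers
}
```

A launch such as scale_shift<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, 2.0f, 1.0f, n) creates the grid of threadblocks that FCUDA later maps onto PEs.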

15 AutoPilot-C Overview
- C/C++ to RTL, with a frontend based on LLVM; not all language features are supported.
- Procedures map to RTL modules, guided by annotations.
- On-chip storage is allocated statically: scalar variables map to FPGA slice registers; constant-size arrays/structures and aggregated nonscalar variables map to FPGA BRAMs.
- Off-chip storage is inferred from annotated pointers; developers assume full control of off-chip accesses.
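A rough sketch of what a procedure might look like under these storage rules; the function name is hypothetical, and the actual AutoPilot annotation syntax is not shown on the slide, so it is omitted here:

```
// Hypothetical AutoPilot-C style procedure (names are placeholders;
// the real tool relies on annotations not reproduced here).
void pe_task(const float *ddr_in, float *ddr_out) {  // off-chip: inferred from pointers
    static float buf[256];          // constant-size array -> on-chip BRAM
    float acc = 0.0f;               // scalar -> FPGA slice register
    for (int i = 0; i < 256; i++)   // explicit, developer-controlled off-chip reads
        buf[i] = ddr_in[i];
    for (int i = 0; i < 256; i++)
        acc += buf[i];
    *ddr_out = acc;                 // explicit off-chip write
}
```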

16 Programming Model Translation
- By utilizing CUDA, FCUDA provides efficient kernel portability across GPUs and FPGAs.
- CUDA is a good model for programming platforms other than GPUs: CUDA C provides a higher level of abstraction and a lower learning curve than AutoPilot-C.
- FCUDA uses source-to-source transformations instead of IR translation to exploit different levels of coarse-grained parallelism.
- The memory hierarchy exposed in CUDA C fits well with the memory view within hardware synthesis flows.

17 FCUDA Framework

18 CUDA C to AutoPilot-C Translation
Objectives:
- Convert the implicit workload hierarchies (threads and threadblocks) of the CUDA programming model into explicit AutoPilot-C work items.
- Expose the coarse-grained parallelism and synchronization restrictions of the kernel in the AutoPilot-C programming model.
- Generate AutoPilot-C code that can be converted into high-performance RTL implementations: map the hierarchy of thread groups in CUDA onto the spatial parallelism of the reconfigurable FPGA, and apply code-restructuring optimizations (e.g., kernel decomposition into compute and data-transfer tasks) that facilitate better performance on the reconfigurable architecture.

19 Example CUDA Code (figure)
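The slide's code figure is not preserved in this transcript; as a stand-in, here is a simple CUDA matrix-multiply kernel of the kind FCUDA takes as input (untiled, names illustrative):

```
// Stand-in for the missing figure: a simple CUDA matrix-multiply
// kernel; each thread produces one element of C.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```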

20 Kernel Restructuring (figure)
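The restructuring figure is likewise missing; conceptually, FCUDA makes the implicit CUDA thread index explicit as loops inside a per-threadblock C procedure. A sketch of that idea on the matmul kernel above (bounds and names illustrative, not FCUDA's actual output):

```
// Thread-loop restructuring sketch: the implicit thread index becomes
// an explicit loop nest; (bx, by) identify the threadblock and
// (bdx, bdy) its dimensions.
void matmul_block(const float *A, const float *B, float *C,
                  int N, int bx, int by, int bdx, int bdy) {
    for (int ty = 0; ty < bdy; ty++)        // explicit "thread" loops
        for (int tx = 0; tx < bdx; tx++) {
            int row = by * bdy + ty;
            int col = bx * bdx + tx;
            if (row < N && col < N) {
                float sum = 0.0f;
                for (int k = 0; k < N; k++)
                    sum += A[row * N + k] * B[k * N + col];
                C[row * N + col] = sum;
            }
        }
}
```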

21 Task Procedures (per Threadblock) (figure)
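With the figure unavailable, here is a rough sketch of the compute/data-transfer split this slide illustrates (tile size and names illustrative, not FCUDA's generated code): off-chip traffic is hoisted into transfer tasks that copy between DDR and BRAM buffers, and compute tasks touch only BRAM.

```
#include <string.h>

void fetch_task(const float *ddr, float bram[256]) {
    memcpy(bram, ddr, 256 * sizeof(float));   // burst read into BRAM
}
void write_task(float *ddr, const float bram[256]) {
    memcpy(ddr, bram, 256 * sizeof(float));   // burst write from BRAM
}
void compute_task(const float a[256], const float b[256], float c[256]) {
    for (int tx = 0; tx < 256; tx++)          // explicit thread loop
        c[tx] = a[tx] + b[tx];                // BRAM-only accesses
}
```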

22 Some Translation Details
- Coarse-grained parallelism in AutoPilot-C is exposed via loop unroll-and-jam transformations (see the sketch below).
- CUDA threadblocks map to PEs on the FPGA. Threadblocks execute independently in CUDA, so each PE is statically scheduled to execute a disjoint subset of threadblocks.
- Threads within a threadblock are scheduled to execute in warps whose size equals the thread-loop unroll degree on the PE, offering opportunities for resource sharing.
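A sketch of unroll-and-jam on an explicit 2D thread-loop nest (factor 2 on the outer loop; sizes illustrative):

```
// Before: for (ty) for (tx) c[ty][tx] = a[ty][tx] + b[ty][tx];
// After: the ty loop is unrolled by 2 and the copies are jammed into
// one inner-loop body, giving HLS two independent "thread" operations
// per iteration to schedule in parallel or share on one unit.
void thread_loops_unjam(float a[16][16], float b[16][16], float c[16][16]) {
    for (int ty = 0; ty < 16; ty += 2)
        for (int tx = 0; tx < 16; tx++) {
            c[ty][tx]     = a[ty][tx]     + b[ty][tx];
            c[ty + 1][tx] = a[ty + 1][tx] + b[ty + 1][tx];
        }
}
```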

23 Some Translation Details (Cont.)
- FPGA BRAMs are better suited for use as scratchpad memories than as caches: the FPGA lacks the complex hardware for dynamic threadblock context switching and for dynamic coalescing of concurrent off-chip memory accesses.
- The FCUDA SSTO engine instead implements static coalescing: it aggregates off-chip accesses into burst block transfers after forming and combining data-transfer tasks (sketched below).
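A sketch of the static-coalescing idea (illustrative; HLS tools commonly infer burst transfers from block copies such as memcpy):

```
#include <string.h>

// Instead of each thread issuing its own off-chip read, i.e.
//     for (int tx = 0; tx < 256; tx++) local_bram[tx] = ddr[base + tx];
// the accesses are aggregated at compile time into one contiguous
// burst block transfer.
void fetch_burst(const float *ddr, int base, float local_bram[256]) {
    memcpy(local_bram, ddr + base, 256 * sizeof(float));  // single burst
}
```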

24 Overlapping Computation and Data Transfer (figure)
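The figure is not preserved; a rough double-buffering ("ping-pong") sketch of the idea follows. fetch_burst is the coalescing sketch above and compute_tile is a hypothetical per-tile compute task; in the synthesized design the two tasks run concurrently, and the sequential loop here only shows the buffer rotation.

```
void fetch_burst(const float *ddr, int base, float local_bram[256]);  // from the sketch above
void compute_tile(float *ddr_out, const float in_bram[256]);          // hypothetical compute task

void process_all(const float *ddr_in, float *ddr_out, int num_tiles) {
    float buf[2][256];                      // ping-pong BRAM buffers
    fetch_burst(ddr_in, 0, buf[0]);         // prologue: fill the first buffer
    for (int i = 0; i < num_tiles; i++) {
        int cur = i & 1;                    // buffer holding the current tile
        if (i + 1 < num_tiles)
            fetch_burst(ddr_in, (i + 1) * 256, buf[cur ^ 1]);  // prefetch next tile
        compute_tile(ddr_out + i * 256, buf[cur]);             // compute current tile
    }
}
```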

25 Memory Interface (figure)

26 FCUDA Transformation and Optimization Algorithms

27 Translation Overview (figure)

28 Translation Overview (Cont.) (figure)

29 CUDA Memory Space Mapping
- Constant memory: in CUDA, reads are faster than global memory thanks to SM-private caches. FCUDA maps constant memory to a BRAM per PE and uses prefetching before kernel execution to improve access latency (sketched below).
- Global memory: the CUDA developer is responsible for organizing global memory accesses in coalesced ways to utilize the full off-chip memory bandwidth. FCUDA decouples global memory accesses from computation by introducing new local variables.
- Registers: shared between threads, or vectorized and implemented in BRAM.
- Shared memory: threadblock-private; maps to BRAM.
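A sketch of the constant-memory mapping (sizes and names illustrative):

```
#include <string.h>

// Each PE holds its own BRAM copy of the constant data, prefetched
// once before the kernel's thread loops run.
void pe_run(const float *const_ddr, const float *in_bram, float *out_bram) {
    static float const_bram[64];                        // per-PE BRAM copy
    memcpy(const_bram, const_ddr, sizeof(const_bram));  // one-time prefetch
    for (int tx = 0; tx < 256; tx++)                    // thread loop
        out_bram[tx] = in_bram[tx] * const_bram[tx % 64];
}
```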

30 Example Transformation/Optimization (figure)

31 Example Transformation/Optimization (Cont.) (figure)

32 Example Transformation/Optimization (Cont.) (figure)

33 Experimental Results

34 CUDA Kernels for Experiments (figure)

35 Parallelism Extraction Impact on Performance
FPGA computing latency depends on:
- Concurrency, e.g., PE count
- Cycles, e.g., functional unit allocation
- Frequency, e.g., interconnects
These are controlled by PE count, loop unrolling, array partitioning, and task synchronization. PE clustering is set to 9 for all experiments. A sketch of how unrolling and partitioning interact follows.
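A sketch of how two of these knobs interact (illustrative, not FCUDA's actual output): with the array split into two BRAM banks, the two unrolled thread copies can fetch their operands in the same cycle instead of serializing on a single BRAM port.

```
// a[256] cyclically partitioned into two banks (a_even, a_odd) so the
// thread loop, unrolled by 2, gets two parallel memory ports.
void unrolled_pe(const float a_even[128], const float a_odd[128], float c[256]) {
    for (int tx = 0; tx < 256; tx += 2) {   // thread-loop unroll degree 2
        c[tx]     = a_even[tx / 2] * 2.0f;  // bank 0
        c[tx + 1] = a_odd[tx / 2]  * 2.0f;  // bank 1
    }
}
```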

36 Threadblock- and Thread-Level Parallelism (figure)

37 Compute and Data-Transfer Task Parallelism (figure)
Off-chip memory bandwidth: BW1 << BW2 << BW3.

38 FPGA vs. GPU
- GPU: G92, running at 1500MHz, 128 SPs, 64GB/s peak off-chip bandwidth.
- FPGA: Xilinx Virtex-5 SX240T, running at 100~200MHz, 1056 DSPs, 1032 BRAMs (~2MB total).
- High off-chip memory bandwidth is extremely important.

