
1 ECE 587 Hardware/Software Co-Design Lecture 21/22 Hardware Acceleration
Professor Jia Wang Department of Electrical and Computer Engineering Illinois Institute of Technology April 9 and April 11, 2019 ECE 587 Hardware/Software Co-Design

2 Hardware Acceleration
Provide (much) better performance, performance per cost, and/or performance per power/energy than general-purpose processors. On specific applications: data analytics, deep learning, bioinformatics, etc. In specific environments: cell phones, cloud, data centers, etc. (Much) lower NRE cost and shorter time-to-market than ASIC designs. Use commercial off-the-shelf hardware platforms. Provide flexibility in functionality via software. What hardware platforms are available? What language(s) should designers use? ECE 587 Hardware/Software Co-Design

3 LLVM Chris Lattner, "LLVM," Chapter 11 of The Architecture of Open Source Applications ECE 587 Hardware/Software Co-Design

4 The LLVM Compiler Infrastructure
Started as a project (Low Level Virtual Machine) to modernize open source compiler development. At the time (early 2000s), most open source compilers like GCC had a monolithic architecture, making it almost impossible to reuse parts of their code, a huge obstacle for anyone wanting to try new compiler ideas with real-world impact. Now an umbrella project for a set of low-level compiler toolchain components, notably Clang. Also closely related to new developments in programming languages such as OpenCL and Swift. ECE 587 Hardware/Software Co-Design

5 Classical Three-Phase Compiler Design
Frontend: parse source code to build an Abstract Syntax Tree (AST), then lower it into an Intermediate Representation (IR). Optimizer: perform language- and target-independent transformations. Backend: generate binary code for a particular target. ECE 587 Hardware/Software Co-Design
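As a concrete walk-through, the three phases correspond to separate LLVM tools. The command names below are real LLVM tools; the file names and the function are illustrative only:

```cuda
// The three phases as separate LLVM tools (hypothetical file names):
//   clang -S -emit-llvm pipeline.c -o pipeline.ll   // frontend: C -> LLVM IR
//   opt -O2 -S pipeline.ll -o pipeline.opt.ll       // optimizer: IR -> IR
//   llc pipeline.opt.ll -o pipeline.s               // backend: IR -> assembly
//
// Contents of the illustrative pipeline.c; the frontend parses it into an
// AST and lowers it to LLVM IR before any optimization runs.
int scale_add(int a, int b) {
    return a * 2 + b;  // the optimizer may strength-reduce a * 2 to a << 1
}
```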

6 Architecture Implications
Sharing of code makes it easier for people to contribute: no need to implement everything by themselves upfront. ECE 587 Hardware/Software Co-Design

7 The Reality (before LLVM)
Observations: many language implementations don't share code; some retarget to multiple processors in very language-specific ways; and there are needs to build compilers for specific application domains. Success stories: Java and .NET virtual machines, and translation to C code, which offer good retargetability only if the programming model matches; and GCC, though it is extremely difficult to reuse its code because the frontend and backend are tightly coupled. ECE 587 Hardware/Software Co-Design

8 LLVM IR ECE 587 Hardware/Software Co-Design

9 LLVM IR (Cont.) ECE 587 Hardware/Software Co-Design

10 LLVM IR Optimization ECE 587 Hardware/Software Co-Design

11 Compiler Design with LLVM
ECE 587 Hardware/Software Co-Design

12 LLVM IR as the Only Interface between Phases
Both in textual form for easy exchange and as a data structure for easy algorithmic manipulation. Frontend and backend developers can work independently, without needing to know much about the other side. As LLVM IR is very similar to a CDFG, one may build HW/SW co-design tools for popular languages without deep knowledge of compiler frontend design, as long as one is familiar with CDFG/LLVM IR. This is not the case for many other compilers like GCC. ECE 587 Hardware/Software Co-Design

13 LLVM IR as I/O of Optimization Passes
Each optimization pass reads LLVM IR in, performs certain transformations, then emits LLVM IR as output. Hopefully the output LLVM IR executes faster than the input. Optimizing LLVM IR then amounts to choosing a set of passes to apply sequentially. The scheme is easily extended and can be application- and target-specific. ECE 587 Hardware/Software Co-Design
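For illustration, here is a minimal sketch of what a short pass pipeline does. The opt invocation uses real pass names (mem2reg, instcombine, dce) with the new-style pass manager; the file names and the function are hypothetical:

```cuda
// Hypothetical invocation:
//   opt -passes=mem2reg,instcombine,dce -S kernel.ll -o kernel.opt.ll
//
// A C-level view of what that pipeline does to the IR of this function:
int redundant(int x) {
    int t = x + 0;       // instcombine folds x + 0 into x
    int unused = t * 7;  // dce deletes this dead computation
    return t;            // the function effectively reduces to: return x;
}
```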

14 LLVM IR Optimization Example
ECE 587 Hardware/Software Co-Design

15 LLVM Target Code Generation
Similar to optimization, code generation is also divided into passes to promote code sharing: instruction selection, register allocation, scheduling, code layout optimization, and assembly emission. All can be replaced or customized for flexibility. ECE 587 Hardware/Software Co-Design

16 Interesting Capabilities
Optimization across the boundary of languages. ECE 587 Hardware/Software Co-Design

17 Interesting Capabilities (Cont.)
Code generation after target is known. ECE 587 Hardware/Software Co-Design

18 xPilot Chen et al., xPilot: A Platform-Based Behavioral Synthesis System, SRC Techcon Conference 2005 ECE 587 Hardware/Software Co-Design

19 HLS Trend Why did previous HLS tools fail (commercially)?
Design complexity was still manageable at the RT level in the 1990s. There was no dependable RTL-to-GDSII flow due to the timing closure problem. HLS tools then were often inferior to manual designs. Advantages of HLS tools: better complexity management (300K lines of code for a typical RTL design vs. 40K lines of behavioral description), shorter verification/simulation cycles, rapid system exploration, and higher quality of results. ECE 587 Hardware/Software Co-Design

20 xPilot System Overview
Provide platform-based behavioral synthesis technologies to optimize logic, interconnects, performance, and power simultaneously. Features: applicable to a wide range of application domains; amenable to a rich set of synthesis constraints; platform-based synthesis and optimization; extensible to consider physical information. ECE 587 Hardware/Software Co-Design

21 xPilot Frontend
Supports SystemC and C: the LLVM GCC frontend first compiles SystemC/C into IR, and high-level constructs, e.g. processes, ports, and channels, are then recovered from the IR. Performs platform characterization: characterizes the delay, area, and power of each type of available resource under different input/output count and bit-width configurations. ECE 587 Hardware/Software Co-Design

22 Synthesis Engine
Uses a linear-programming-based scheduling algorithm to support a variety of optimization techniques for both data-flow-intensive and control-intensive applications. Performs simultaneous functional-unit and register binding, allowing the impact of interconnects to be considered and the design space to be explored based on realistic platform-based measurements. ECE 587 Hardware/Software Co-Design
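As a flavor of such a formulation, here is a simplified difference-constraint scheduling LP. This is a sketch only, not xPilot's actual formulation: each operation v gets a schedule variable s_v, and dependences become linear constraints; d_u and k are assumed latencies and timing bounds.

```latex
% Simplified LP scheduling sketch: s_v = control step of operation v,
% d_u = latency of operation u, k = an assumed timing bound.
\begin{align*}
\text{minimize}\quad   & s_{\mathrm{sink}} \\
\text{subject to}\quad & s_v - s_u \ge d_u && \text{for each data dependence } u \to v \\
                       & s_v - s_u \le k   && \text{for each latency constraint between } u, v \\
                       & s_v \ge 0         && \text{for every operation } v
\end{align*}
```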

23 Experimental Results ECE 587 Hardware/Software Co-Design

24 CUDA to FPGA Flow Alexandros Papakonstantinou et al., Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs, ACM TECS, 13(2):25, Sep. 2013 ECE 587 Hardware/Software Co-Design

25 Quick Summary
Heterogeneous multicore multiprocessors have become popular: they combine processors of different compute characteristics to boost performance per watt. New programming models for parallel processing have risen with them, e.g. GPUs for compute-intensive kernels and CUDA programming. How can these investments be leveraged for other kinds of accelerators? FCUDA: use the CUDA programming model for FPGA design. Map the coarse- and fine-grained parallelism exposed in CUDA onto the FPGA. Source-to-source compilation transforms SIMT CUDA code into task-level parallel C code. AutoPilot (derived from xPilot) then transforms the C code into RTL for the FPGA. ECE 587 Hardware/Software Co-Design

26 History of Parallel Processing
Parallel processing was used mainly in supercomputing servers and clusters. It is now widely available, and it allows the growing transistor budget to be utilized despite the power wall: more cores supporting concurrency are preferred over more complicated cores exploiting instruction-level parallelism. Massive parallelism on regular structures, e.g. Cell-BE [IBM 2006], TILE [TILERA 2012], GeForce [NVIDIA 2012b]. ECE 587 Hardware/Software Co-Design

27 Heterogeneous Accelerators
These massively parallel devices have very diverse characteristics, making them optimal for different types of applications and different usage scenarios. IBM Cell: an MPSoC with heterogeneous cores that serves either as a stand-alone multiprocessor or as a multicore accelerator. GPUs: consist of hundreds of processing cores clustered into Streaming Multiprocessors (SMs), which require control by a host processor. Heterogeneity is the key for systems to utilize accelerators. Reconfigurable FPGAs can offer flexible and power-efficient application-specific high-performance acceleration. ECE 587 Hardware/Software Co-Design

28 Challenges
One has to program those devices to gain performance and power advantages. NOT EASY! The cost of learning parallel programming beyond sequential (OO) programs. The need to morph applications into the model supported by the device, e.g. DirectX/OpenGL for GPUs. The need to understand and work with the underlying hardware architecture, e.g. RTL design for FPGAs. Recent advances: HLS, CUDA/OpenCL. FCUDA as a solution: adopt CUDA as a common programming interface for GPU/CPU/FPGA, allowing FPGAs to be programmed efficiently for massively parallel application kernels. ECE 587 Hardware/Software Co-Design

29 FCUDA Flow ECE 587 Hardware/Software Co-Design

30 FCUDA SSTO Source-to-Source Transformation and Optimization
Data communication and compute optimizations: analysis of the kernel dataflow, followed by data communication and computation reorganization, enables efficient mapping of the kernel computation and data communication onto the FPGA hardware. Parallelism mapping transformations: expose CUDA parallelism in AutoPilot-C so that HLS can generate parallel RTL Processing Engines (PEs). ECE 587 Hardware/Software Co-Design

31 Why CUDA?
CUDA provides a C-styled API for expressing massive parallelism in a very concise fashion, much easier to learn and use than AutoPilot-C directly. CUDA as a common programming interface for heterogeneous compute clusters with both GPUs and FPGAs simplifies application development: there is no need to port applications just to evaluate alternatives. The wide adoption and popularity of CUDA render a large body of existing applications available for FPGA acceleration. ECE 587 Hardware/Software Co-Design

32 The FPGA Platform Overview
The capacity of FPGAs has been increasing dramatically: 28nm FPGAs host PLLs, ADCs, PCIe interfaces, general-purpose processors, and DSPs, along with millions of reconfigurable logic cells and thousands of distributed memories. Hard IP modules offer compute and data-communication efficiency, while reconfigurability enables one to leverage different types of application-specific parallelism (coarse- and fine-grained, data- and task-level, pipelining). Multi-FPGA systems can further facilitate massive parallelism. The HC-1 Application-Specific Instruction Processor (ASIP) [Convey 2011] combines a multicore CPU with multi-FPGA-based custom instruction accelerators. The Novo-G supercomputer [CHREC 2012] hosts 192 reconfigurable devices and consumes almost three orders of magnitude less power than the Opteron-based Jaguar and Cell-based Roadrunner supercomputers at comparable performance on bioinformatics applications. ECE 587 Hardware/Software Co-Design

33 FPGA Programmability
HLS tools allow the use of sequential programs instead of RTL. However, parallelism extraction from sequential programs is restricted at granularities coarser than loop iterations. Possible solutions from HLS frameworks: introduction of new parallel programming models, or language extensions for coarse-grained parallelism annotation; but developers need to learn and use them. There are efforts to adopt OpenCL for FPGA accelerators, but how to program FPGAs into those accelerators remains unsolved. FCUDA uses CUDA as a unified interface to program both GPUs and FPGAs. ECE 587 Hardware/Software Co-Design

34 CUDA Overview
Threadblocks map to SMs: they execute independently, and synchronization between threadblocks is possible in recent versions of CUDA. Threads in a threadblock map to SPs in an SM: they allow barrier-like synchronizations and execute in groups called warps that share the same control flow. Memory hierarchy: registers (per SP), shared memory (per SM), global memory, and read-only constant/texture memory. ECE 587 Hardware/Software Co-Design
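To make the hierarchy concrete, here is a minimal CUDA kernel (hypothetical, not from the slides) that uses one threadblock-private shared buffer and the barrier; it assumes a power-of-two block size of 256:

```cuda
#include <cuda_runtime.h>

// Each threadblock reduces its 256-element slice of `in` in shared memory.
// Blocks map to SMs, threads map to SPs, and __syncthreads() is the
// barrier-like synchronization available within a threadblock.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[256];              // shared memory: threadblock-private
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;  // global index of this thread
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // barrier across the threadblock
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0]; // one partial sum per threadblock
}
```

A launch such as block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n) then assigns threadblocks to SMs for independent execution (d_in and d_out are hypothetical device pointers).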

35 AutoPilot-C Overview
C/C++ (with annotations) to RTL, with a frontend based on LLVM; not all language features are supported. Procedures map to RTL modules. Storage: on-chip storage is allocated statically, with scalar variables mapping to FPGA slice registers and constant-size arrays/structures and aggregated nonscalar variables mapping to FPGA BRAMs; off-chip storage is inferred from pointers with annotations, and developers assume full control. ECE 587 Hardware/Software Co-Design
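The storage rules above can be illustrated on a small C procedure (illustrative only; AutoPilot's annotation syntax is not reproduced here, so the mapping is described in comments):

```cuda
// Illustrative only: how the storage rules above apply when this small C
// procedure is synthesized into a single RTL module.
void fir4(const int coeff[4], int sample, int *result) {
    static int taps[4];         // constant-size array -> FPGA BRAM
    int acc = 0;                // scalar -> FPGA slice registers
    for (int i = 3; i > 0; i--) // shift in the newest sample
        taps[i] = taps[i - 1];
    taps[0] = sample;
    for (int i = 0; i < 4; i++) // multiply-accumulate over the taps
        acc += taps[i] * coeff[i];
    *result = acc;              // pointer argument -> module output port
}
```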

36 Programming Model Translation
By utilizing CUDA, FCUDA provides efficient kernel portability across GPUs and FPGAs. CUDA is a good model for programming platforms other than GPUs: CUDA C provides higher abstraction and a gentler learning curve than AutoPilot-C. FCUDA uses source-to-source transformations instead of IR translation to exploit different levels of coarse-grained parallelism. The memory hierarchy available in CUDA C fits well with the memory view within hardware synthesis flows. ECE 587 Hardware/Software Co-Design

37 FCUDA: CUDA C to Autopilot-C Translation
Objective: convert the implicit workload hierarchies (threads and threadblocks) of the CUDA programming model into explicit AutoPilot-C work items. Expose the coarse-grained parallelism and synchronization restrictions of the kernel in the AutoPilot-C programming model. Generate AutoPilot-C code that can be converted into high-performance RTL implementations. Map the hierarchy of thread groups in CUDA onto the spatial parallelism of the reconfigurable FPGA. Implement code-restructuring optimizations, e.g. kernel decomposition into compute and data-transfer tasks, to facilitate better performance on the reconfigurable architecture. ECE 587 Hardware/Software Co-Design

38 Example CUDA Code ECE 587 Hardware/Software Co-Design
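The original slide shows a figure; as a stand-in, a representative kernel of the kind FCUDA targets might look like the following (a hypothetical matrix-vector product, reused in the translation sketch after slide 41):

```cuda
// Hypothetical stand-in for the kernel in the figure: y = A * x, with each
// thread computing one row and each threadblock covering a tile of rows.
__global__ void matvec(const float *A, const float *x, float *y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += A[row * n + j] * x[j];
        y[row] = acc;
    }
}
```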

39 Kernel Restructuring ECE 587 Hardware/Software Co-Design

40 Task Procedures (per Threadblock)
ECE 587 Hardware/Software Co-Design

41 Some Translation Details
Coarse-grained parallelism in AutoPilot-C is exposed via loop unroll-and-jam transformations. CUDA threadblocks map to PEs on the FPGA: since threadblocks execute independently in CUDA, each PE is statically scheduled to execute a disjoint subset of threadblocks. Threads within a threadblock are scheduled to execute in warps of size equal to the thread-loop unroll degree on the PE, offering opportunities for resource sharing. ECE 587 Hardware/Software Co-Design
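A sketch of the kind of AutoPilot-C such a translation could produce for the matvec kernel above (names and structure are illustrative, not FCUDA's actual output):

```cuda
// Illustrative FCUDA-style translation of the matvec kernel above: the
// implicit CUDA thread index becomes an explicit "thread loop" that HLS can
// unroll, and each PE sequentially executes its statically assigned
// disjoint subset of threadblocks.
void matvec_pe(const float *A, const float *x, float *y, int n,
               int blk_begin, int blk_end, int blk_dim) {
    for (int blk = blk_begin; blk < blk_end; blk++) {  // threadblocks on this PE
        for (int tid = 0; tid < blk_dim; tid++) {      // thread loop, unrolled by HLS
            int row = blk * blk_dim + tid;
            if (row < n) {
                float acc = 0.0f;
                for (int j = 0; j < n; j++)
                    acc += A[row * n + j] * x[j];
                y[row] = acc;
            }
        }
    }
}
```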

42 Some Translation Details (Cont.)
FPGA BRAMs are better suited for use as scratchpad memories than as caches: the FPGA lacks the complex hardware for dynamic threadblock context switching as well as for dynamic coalescing of concurrent off-chip memory accesses. The FCUDA SSTO engine instead implements static coalescing, aggregating all off-chip accesses into burst block transfers after forming and combining data-transfer tasks. ECE 587 Hardware/Software Co-Design
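A minimal sketch of such a decoupled data-transfer task, assuming the HLS tool infers a burst from a contiguous copy loop (names are hypothetical):

```cuda
// Sketch of a decoupled data-transfer task: rather than per-thread global
// memory reads, one contiguous loop copies a whole tile into an on-chip
// BRAM buffer, which HLS can turn into a burst block transfer.
void fetch_tile(const float *global_mem, float local_buf[256],
                int offset, int len) {
    for (int i = 0; i < len && i < 256; i++)  // contiguous loop -> burst
        local_buf[i] = global_mem[offset + i];
}
```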

43 Overlapping Computation and Data Transfer
ECE 587 Hardware/Software Co-Design

44 Memory Interface ECE 587 Hardware/Software Co-Design

45 FCUDA Algorithms ECE 587 Hardware/Software Co-Design

46 Binding/Mapping Example
ECE 587 Hardware/Software Co-Design

47 CUDA Memory Space Mapping
Constant memory: faster read access in CUDA than global memory, via SM-private caches. FCUDA maps constant memory to a BRAM per PE and uses prefetching before kernel execution to improve access latency. Global memory: the CUDA developer is responsible for organizing global memory accesses in coalesced ways to utilize the full off-chip memory bandwidth; FCUDA decouples global memory access from computation by introducing new local variables. Registers: shared between threads, or vectorized and implemented in BRAM. Shared memory: threadblock-private, maps to BRAM. ECE 587 Hardware/Software Co-Design
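The mapping rules can be summarized on one small hypothetical kernel, with the FPGA-side resource noted in comments; the mappings follow the bullet points above, not actual FCUDA output:

```cuda
// Hypothetical kernel annotated with the FPGA-side mapping of each CUDA
// memory space, following the rules above.
__constant__ float filter[16];        // constant memory -> per-PE BRAM, prefetched

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];       // shared memory -> threadblock-private BRAM
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;      // global memory -> decoupled burst task
    __syncthreads();
    float r = tile[tid] * filter[tid % 16];  // r: register -> slice registers
    if (i < n) out[i] = r;
}
```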

48 Example Transformation/Optimization
ECE 587 Hardware/Software Co-Design

49 Example Transformation/Optimization (Cont.)
ECE 587 Hardware/Software Co-Design

50 Example Transformation/Optimization (Cont.)
ECE 587 Hardware/Software Co-Design

51 CUDA Kernels for Experiments
ECE 587 Hardware/Software Co-Design

52 Parallelism Extraction Impact on Performance
FPGA computing latency depends on: concurrency (e.g., PE count), cycle count (e.g., functional unit allocation), and frequency (e.g., interconnects). These are controlled by: PE count, loop unrolling, array partitioning, and task synchronization. PE clustering was set to 9 for all experiments. ECE 587 Hardware/Software Co-Design
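For intuition, this is the effect of a thread-loop unroll degree of 2 in the generated C (illustrative only; assumes an even blk_dim and hypothetical names):

```cuda
// Illustrative only: unroll-and-jam with degree 2 exposes two copies of the
// thread body, letting HLS allocate two parallel datapaths and partition
// the arrays to feed them.
void saxpy_pe(const float *x, float *y, float a, int blk_dim) {
    for (int tid = 0; tid < blk_dim; tid += 2) {  // unroll degree 2
        y[tid]     = a * x[tid]     + y[tid];     // thread tid
        y[tid + 1] = a * x[tid + 1] + y[tid + 1]; // thread tid + 1, concurrent
    }
}
```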

53 Threadblock- and Thread-Level Parallelism
ECE 587 Hardware/Software Co-Design

54 Compute and Data-Transfer Task Parallelism
Off-chip memory bandwidth: BW1 << BW2 << BW3 ECE 587 Hardware/Software Co-Design

55 FPGA vs. GPU
GPU (NVIDIA G92): runs at 1500 MHz, 128 SPs, 64 GB/s peak off-chip bandwidth. FPGA (Xilinx Virtex-5 SX240T): runs at 100~200 MHz, 1056 DSPs, 1032 BRAMs (~2 MB total). High off-chip memory bandwidth is extremely important. ECE 587 Hardware/Software Co-Design

