ECE 587 Hardware/Software Co-Design, Lecture 21/22: Hardware Acceleration. Professor Jia Wang, Department of Electrical and Computer Engineering, Illinois Institute of Technology. April 9 and April 11, 2019.

Hardware Acceleration. Provide (much) better performance, performance per cost, and/or performance per power/energy than general-purpose processors. On specific applications: data analytics, deep learning, bioinformatics, etc. In specific environments: cell phone, cloud, data center, etc. (Much) lower NRE cost and shorter time-to-market than ASIC designs. Use commercial off-the-shelf hardware platforms. Provide flexibility in functionality via software. What hardware platforms are available? What language(s) should designers use?

LLVM. Chris Lattner, "LLVM," Chapter 11 in The Architecture of Open Source Applications.

The LLVM Compiler Infrastructure. Started as a project (Low Level Virtual Machine) to modernize open source compiler development. At the time (early 2000s), most open source compilers such as GCC had monolithic architectures, making it almost impossible to reuse parts of their code: a huge obstacle for anyone wanting to try new compiler ideas with real-world impact. Now an umbrella project for a set of low-level compiler toolchain components, notably Clang. Also closely tied to the development of newer programming languages such as OpenCL and Swift.

Classical Three-Phase Compiler Design. Frontend: parse source code to build an Abstract Syntax Tree (AST), then lower it into an Intermediate Representation (IR). Optimizer: perform language- and target-independent transformations. Backend: generate binary code suited to a particular target.

Architecture Implications. Sharing of code makes it easier for people to contribute: no need to implement everything by themselves upfront.

The Reality (before LLVM). Observations: many language implementations don't share code; some language implementations retarget to multiple processors in very language-specific ways; and there are needs to build compilers for specific application domains. Success stories: Java and .NET virtual machines, and translation to C code, which offer good retargetability but only when the programming model matches; GCC, though it is extremely difficult to reuse its code because frontend and backend are tightly coupled.

LLVM IR
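The slide's IR listing is an image and is not reproduced in this transcript. As a minimal illustrative sketch (not the slide's original example), a function adding two 32-bit integers looks like this in textual LLVM IR:

    define i32 @add(i32 %a, i32 %b) {
    entry:
      %sum = add i32 %a, %b   ; typed, SSA-form instruction
      ret i32 %sum
    }

The same IR also exists as an in-memory data structure and as a dense on-disk bitcode format.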

LLVM IR (Cont.)

LLVM IR Optimization

Compiler Design with LLVM

LLVM IR as the Only Interface between Phases. Available both in textual form for easy exchange and as a data structure for easy algorithmic manipulation. Frontend and backend developers can work independently, without needing to know much about the other group's work. Because LLVM IR is very similar to a CDFG, one may build HW/SW co-design tools for popular languages without deep knowledge of compiler frontend design, as long as one is familiar with CDFG/LLVM IR. This is not the case for many other compilers such as GCC.

LLVM IR as I/O of Optimization Passes. Each optimization pass reads LLVM IR, performs certain transformations, and then emits LLVM IR as output; hopefully the output IR executes faster than the input. Optimizing LLVM IR then amounts to choosing a set of passes to apply sequentially. Can be easily extended. Can be application- and target-specific.
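As a usage sketch, the opt tool applies a chosen pass sequence to textual IR (flag syntax shown for newer LLVM releases; older releases spell the passes as individual flags):

    $ opt -S -passes='mem2reg,instcombine,gvn' input.ll -o output.ll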

LLVM IR Optimization Example
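The slide's example is an image and is not reproduced here. As a minimal before/after sketch of the kind of peephole simplification the InstCombine pass performs (a trivial assumed input, not the slide's actual example):

    ; before
    define i32 @f(i32 %x) {
      %tmp = mul i32 %x, %x
      %res = add i32 %tmp, 0   ; adding zero is redundant
      ret i32 %res
    }

    ; after instcombine: the add of zero folds away
    define i32 @f(i32 %x) {
      %tmp = mul i32 %x, %x
      ret i32 %tmp
    }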

LLVM Target Code Generation. Similar to optimization, code generation is also divided into passes to promote code sharing: instruction selection, register allocation, scheduling, code layout optimization, and assembly emission. All can be replaced or customized for flexibility.
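One common way to drive all three phases from the command line (a sketch; exact flag spellings have varied across LLVM releases):

    $ clang -S -emit-llvm kernel.c -o kernel.ll          # frontend: C to IR
    $ opt -S -passes='default<O2>' kernel.ll -o opt.ll   # optimizer: IR to IR
    $ llc opt.ll -o kernel.s                             # backend: IR to assembly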

Interesting Capabilities. Optimization across language boundaries.

Interesting Capabilities (Cont.). Code generation deferred until the target is known.

xPilot. Chen et al., "xPilot: A Platform-Based Behavioral Synthesis System," SRC TECHCON Conference, 2005.

HLS Trend. Why did previous HLS tools fail (commercially)? Design complexity was still manageable at the RT level in the 1990s; there was no dependable RTL-to-GDSII flow due to the timing closure problem; and HLS tools of the time were often inferior to manual designs. Advantages of HLS tools: better complexity management (about 300K lines of code for a typical RTL design vs. 40K lines of behavioral description), shorter verification/simulation cycles, rapid system exploration, and higher quality of results.

xPilot System Overview. Provide platform-based behavioral synthesis technologies to optimize logic, interconnects, performance, and power simultaneously. Features: applicable to a wide range of application domains; amenable to a rich set of synthesis constraints; platform-based synthesis and optimization; extensible to consider physical information.

xPilot Frontend. Supports SystemC and C. The LLVM GCC front-end first compiles SystemC/C into IR; high-level constructs, e.g., processes, ports, and channels, are then recovered from the IR. Performs platform characterization: characterize the delay, area, and power of each type of available resource under different input/output counts and bit-width configurations.

Synthesis Engine. Uses a linear-programming-based scheduling algorithm to support a variety of optimization techniques for both data-flow-intensive and control-intensive applications. Performs simultaneous functional-unit and register binding, allowing the impact of interconnects to be considered and the design space to be explored based on realistic platform-based measurements.

Experimental Results

CUDA to FPGA Flow. Alexandros Papakonstantinou et al., "Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs," ACM TECS, 13(2):25, Sep. 2013.

Quick Summary. Multicore heterogeneous multiprocessors have become popular: they combine processors of different compute characteristics to boost performance per watt. This has driven the rise of new programming models for parallel processing, e.g., GPUs for compute-intensive kernels and CUDA programming. How can these investments be leveraged for other kinds of accelerators? FCUDA: use the CUDA programming model for FPGA design. Map the coarse- and fine-grained parallelism exposed in CUDA onto the FPGA. Source-to-source compilation transforms SIMT CUDA code into task-level parallel C code; AutoPilot (derived from xPilot) then transforms the C code into RTL for the FPGA.

History of Parallel Processing. Parallel processing was once used mainly in supercomputing servers and clusters; it has now become widely available. It allows utilizing more transistors under the power wall: more cores supporting concurrency among cores are preferred over more complicated cores supporting instruction-level parallelism. Massive parallelism on regular structures, e.g., Cell-BE [IBM 2006], TILE [TILERA 2012], GeForce [NVIDIA 2012b].

Heterogeneous Accelerators. These massively parallel devices have very diverse characteristics, making each optimal for different types of applications and different usage scenarios. IBM Cell: an MPSoC with heterogeneous cores that serves either as a stand-alone multiprocessor or as a multicore accelerator. GPUs: consist of hundreds of processing cores clustered into Streaming Multiprocessors (SMs), and require control by a host processor. Heterogeneity is the key for systems to utilize accelerators. Reconfigurable FPGAs can offer flexible and power-efficient application-specific high-performance acceleration.

Challenges. One has to program those devices to gain performance and power advantages: NOT EASY! The cost of learning parallel programming beyond sequential (OO) programs; the need to morph applications into the model supported by the device, e.g., DirectX/OpenGL for GPUs; the need to understand and work with the underlying hardware architecture, e.g., RTL design for FPGAs. Recent advances: HLS, CUDA/OpenCL. FCUDA as a solution: adopt CUDA as a common programming interface for GPU/CPU/FPGA, allowing FPGAs to be programmed efficiently for massively parallel application kernels.

FCUDA Flow

FCUDA SSTO: Source-to-Source Transformation and Optimization. Data communication and compute optimizations: analysis of the kernel dataflow, followed by reorganization of data communication and computation, enabling efficient mapping of the kernel computation and data communication onto the FPGA hardware. Parallelism mapping transformations: expose CUDA parallelism in AutoPilot-C so that HLS generates parallel RTL Processing Engines.

Why CUDA? CUDA provides a C-styled API for expressing massive parallelism in a very concise fashion, much easier to learn and use than doing so in AutoPilot-C directly. CUDA serves as a common programming interface for heterogeneous compute clusters with both GPUs and FPGAs, simplifying application development: no need to port applications just to evaluate alternatives. The wide adoption and popularity of CUDA make a large body of existing applications available for FPGA acceleration.

The FPGA Platform Overview. The capacity of FPGAs has been increasing dramatically: 28nm FPGAs host PLLs, ADCs, PCIe interfaces, general-purpose processors, and DSPs, along with millions of reconfigurable logic cells and thousands of distributed memories. Hard IP modules offer compute and data-communication efficiency. Reconfigurability enables one to leverage different types of application-specific parallelism (coarse- and fine-grained, data- and task-level, pipelining). Multi-FPGA systems can further facilitate massive parallelism: the HC-1 Application-Specific Instruction Processor (ASIP) [Convey 2011] combines a multicore CPU with multi-FPGA-based custom instruction accelerators, and the Novo-G supercomputer [CHREC 2012] hosts 192 reconfigurable devices. Novo-G consumes almost three orders of magnitude less power than the Opteron-based Jaguar and Cell-based Roadrunner supercomputers at comparable performance on bioinformatics applications.

FPGA Programmability. HLS tools allow the use of sequential programs instead of RTL; however, parallelism extraction from sequential programs is limited at granularities coarser than loop iterations. Possible solutions from HLS frameworks: introduction of new parallel programming models, or language extensions for coarse-grained parallelism annotation, but developers need to learn and use them. There are efforts to adopt OpenCL for FPGA accelerators, but how to program FPGAs into those accelerators remains unsolved. FCUDA uses CUDA as a unified interface to program both GPUs and FPGAs.

CUDA Overview. Threadblocks map to SMs: they execute independently, and synchronization between them is possible in recent versions of CUDA. Threads in a threadblock map to SPs in an SM: they allow barrier-like synchronization and execute in groups called warps that share the same control flow. Memory hierarchy: registers (per SP), shared memory (per SM), global memory, and constant/texture memory (read-only).
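A minimal CUDA sketch of this hierarchy (a hypothetical vec_add kernel for illustration, not taken from the lecture):

    // Each thread handles one element. Threads are grouped into
    // threadblocks (mapped to SMs); threads within a block run on SPs.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Launch with 256 threads per threadblock:
    // vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);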

AutoPilot-C Overview. C/C++ to RTL: the frontend is based on LLVM; not all language features are supported; procedures map to RTL modules. Storage: on-chip storage is allocated statically, with scalar variables mapped to FPGA slice registers and constant-size arrays/structures and aggregated nonscalar variables mapped to FPGA BRAMs; off-chip storage is inferred from pointers with annotations, and developers assume full control of it.

Programming Model Translation. By utilizing CUDA, FCUDA provides efficient kernel portability across GPUs and FPGAs. CUDA is a good model for programming platforms other than GPUs: CUDA C provides higher abstraction and a lower learning curve than AutoPilot-C. FCUDA uses source-to-source transformations instead of IR translation to exploit different levels of coarse-grained parallelism. The memory hierarchy exposed by CUDA C fits well with the memory view within hardware synthesis flows.

FCUDA: CUDA C to AutoPilot-C Translation. Objectives: convert the implicit workload hierarchies (threads and threadblocks) of the CUDA programming model into explicit AutoPilot-C work items; expose the coarse-grained parallelism and synchronization restrictions of the kernel in the AutoPilot-C programming model; generate AutoPilot-C code that can be converted into high-performance RTL implementations; map the hierarchy of thread groups in CUDA onto the spatial parallelism of the reconfigurable FPGA; and implement code-restructuring optimizations, e.g., kernel decomposition into compute and data-transfer tasks, to achieve better performance on the reconfigurable architecture.

Example CUDA Code

Kernel Restructuring

Task Procedures (per Threadblock)

Some Translation Details. Coarse-grained parallelism in AutoPilot-C is exposed via loop unroll-and-jam transformations. CUDA threadblocks map to PEs on the FPGA: threadblocks execute independently in CUDA, so each PE is statically scheduled to execute a disjoint subset of threadblocks. Threads within a threadblock are scheduled to execute in warps whose size equals the thread-loop unroll degree on the PE, offering opportunities for resource sharing.
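A sketch of what this translation might produce for the hypothetical vec_add kernel above (illustrative only, following the paper's thread-loop idea; FCUDA's actual generated code differs in detail):

    #define BDIM 256  /* threads per threadblock */
    #define UNROLL 4  /* thread-loop unroll degree ("warp" size on the PE) */

    /* One PE executes one threadblock at a time: the implicit CUDA threads
     * become an explicit loop nest, and unroll-and-jam of the inner loop
     * exposes thread-level parallelism to the HLS scheduler. */
    void vec_add_block(float a[BDIM], float b[BDIM], float c[BDIM]) {
        for (int tIdx = 0; tIdx < BDIM; tIdx += UNROLL) {
            for (int u = 0; u < UNROLL; u++) {  /* unrolled by HLS */
                c[tIdx + u] = a[tIdx + u] + b[tIdx + u];
            }
        }
    }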

Some Translation Details (Cont.). FPGA BRAMs are better suited for use as scratchpad memories than as caches: FPGAs lack the complex hardware for dynamic threadblock context switching and for dynamic coalescing of concurrent off-chip memory accesses. The FCUDA SSTO engine instead implements static coalescing, aggregating off-chip accesses into burst block transfers after formulating and combining data-transfer tasks.
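A sketch of a decoupled data-transfer task (a hypothetical helper; a memcpy over a contiguous range is the usual idiom that HLS tools turn into a single burst transfer):

    #include <string.h>

    #define BDIM 256

    /* Fetch one threadblock's slice of off-chip data into on-chip BRAM
     * as one burst, replacing per-thread scattered global accesses. */
    void fetch_block(const float *gmem, float local_buf[BDIM], int blk) {
        memcpy(local_buf, &gmem[blk * BDIM], BDIM * sizeof(float));
    }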

Overlapping Computation and Data Transfer
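The slide's figure is not reproduced here. A double-buffering sketch of the idea, reusing the hypothetical fetch_block task above together with an assumed compute_block task (gmem and NBLK are also assumptions):

    /* Ping-pong buffers: while block i is computed out of one BRAM buffer,
     * the data-transfer task prefetches block i+1 into the other buffer.
     * The two calls touch disjoint buffers, so an HLS scheduler can
     * execute them concurrently. */
    float buf0[BDIM], buf1[BDIM];
    float *bufs[2] = { buf0, buf1 };
    fetch_block(gmem, bufs[0], 0);                       /* prime pipeline */
    for (int i = 0; i < NBLK; i++) {
        if (i + 1 < NBLK)
            fetch_block(gmem, bufs[(i + 1) & 1], i + 1); /* data transfer */
        compute_block(bufs[i & 1]);                      /* computation   */
    }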

Memory Interface

FCUDA Algorithms

Binding/Mapping Example

CUDA Memory Space Mapping. Constant memory: read access is faster than global memory in CUDA thanks to SM-private caches; FCUDA maps constant memory to a BRAM per PE and uses prefetching before kernel execution to improve access latency. Global memory: the CUDA developer is responsible for organizing global memory accesses in coalesced ways to utilize the full off-chip memory bandwidth; FCUDA decouples global memory access from computation by introducing new local variables. Registers: shared between threads, or vectorized and implemented in BRAM. Shared memory: threadblock-private; maps to BRAM.

Example Transformation/Optimization

Example Transformation/Optimization (Cont.)

Example Transformation/Optimization (Cont.)

CUDA Kernels for Experiments

Parallelism Extraction Impact on Performance. FPGA computing latency depends on concurrency (e.g., PE count), cycle count (e.g., functional unit allocation), and frequency (e.g., interconnects). These are controlled by PE count, loop unrolling, array partitioning, task synchronization, and PE clustering (set to 9 for all experiments).

Threadblock- and Thread-Level Parallelism

Compute and Data-Transfer Task Parallelism. Off-chip memory bandwidth configurations: BW1 << BW2 << BW3.

FPGA vs. GPU. GPU: NVIDIA G92, running at 1500 MHz, 128 SPs, 64 GB/s peak off-chip bandwidth. FPGA: Xilinx Virtex-5 SX240T, running at 100~200 MHz, 1056 DSPs, 1032 BRAMs (~2 MB total). High off-chip memory bandwidth is extremely important.