ECE 587 Hardware/Software Co-Design
Lecture 26/27: CUDA to FPGA Flow
Professor Jia Wang
Department of Electrical and Computer Engineering, Illinois Institute of Technology
April 18/20, 2016

CUDA to FPGA Flow

Alexandros Papakonstantinou et al., "Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs," ACM TECS, 13(2):25, Sep. 2013.

Quick Summary

Multicore heterogeneous multiprocessors have become popular: they combine processors with different compute characteristics to boost performance per watt.
This has driven new programming models for parallel processing, e.g., GPUs for compute-intensive kernels and the CUDA programming model.
How can these investments be leveraged for other kinds of accelerators?
FCUDA: use the CUDA programming model for FPGA design.
- Map the coarse- and fine-grained parallelism exposed in CUDA onto the FPGA.
- Source-to-source compilation transforms SIMT CUDA code into task-level parallel C code (a minimal sketch follows).
- AutoPilot (derived from xPilot) transforms the C code into RTL for the FPGA.
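
To make the SIMT-to-C idea concrete, here is a minimal sketch (illustrative, not the tool's actual output): the implicit CUDA thread index becomes an explicit loop executed per threadblock.

    // CUDA kernel: threads are implicit (illustrative example; assumes
    // the array length is a multiple of blockDim.x).
    __global__ void saxpy(float *y, const float *x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        y[i] += a * x[i];
    }

    // FCUDA-style task-parallel C: one call per threadblock, with an
    // explicit thread loop (assumes blockDim.x == 256).
    void saxpy_block(float *y, const float *x, float a, int blkIdx) {
        for (int tid = 0; tid < 256; tid++)   // explicit thread loop
            y[blkIdx * 256 + tid] += a * x[blkIdx * 256 + tid];
    }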

Introduction

History of Parallel Processing

Parallel processing was once used mainly in supercomputing servers and clusters, but it is widely available now.
More transistors can be utilized under the power wall: more cores supporting concurrency across cores are preferred over more complicated cores supporting instruction-level parallelism.
Massive parallelism on regular structures, e.g., Cell-BE [IBM 2006], TILE [TILERA 2012], GeForce [NVIDIA 2012b].

Heterogeneous Accelerators

These massively parallel devices have very diverse characteristics, making each optimal for different types of applications and usage scenarios.
- IBM Cell: an MPSoC with heterogeneous cores that serves either as a stand-alone multiprocessor or as a multicore accelerator.
- GPUs: hundreds of processing cores clustered into Streaming Multiprocessors (SMs), requiring control by a host processor.
Heterogeneity is the key for systems to utilize accelerators.
Reconfigurable FPGAs can offer flexible and power-efficient application-specific high-performance acceleration.

Challenges

One has to program those devices to gain the performance and power advantages. NOT EASY!
- The cost of learning parallel programming beyond sequential (object-oriented) programming.
- The need to morph applications into the model supported by the devices, e.g., DirectX/OpenGL for GPUs.
- The need to understand and work with the underlying hardware architecture, e.g., RTL design for FPGAs.
Recent advances: HLS, CUDA/OpenCL.
FCUDA as a solution:
- Adopt CUDA as a common programming interface for GPUs/CPUs/FPGAs.
- Allow FPGAs to be programmed efficiently for massively parallel application kernels.

FCUDA Flow

(Figure-only slide; flow diagram not captured in the transcript.)

FCUDA SSTO

Source-to-Source Transformation and Optimization (SSTO).
Data communication and compute optimizations:
- Analysis of the kernel dataflow, followed by data communication and computation reorganization.
- Enables efficient mapping of the kernel computation and data communication onto the FPGA hardware.
Parallelism mapping transformations:
- Expose CUDA parallelism in AutoPilot-C so that HLS can generate parallel RTL Processing Engines (PEs).

Why CUDA?

CUDA provides a C-style API for expressing massive parallelism in a very concise fashion.
- Much easier to learn and use than writing AutoPilot-C directly.
CUDA as a common programming interface for heterogeneous compute clusters with both GPUs and FPGAs:
- Simplifies application development.
- No need to port applications to evaluate alternatives.
The wide adoption and popularity of CUDA render a large body of existing applications available for FPGA acceleration.

Background

The FPGA Platform

The capacity of FPGAs has been increasing dramatically: 28nm FPGAs host PLLs, ADCs, PCIe interfaces, general-purpose processors, and DSPs, along with millions of reconfigurable logic cells and thousands of distributed memories.
- Hard IP modules offer compute and data communication efficiency.
- Reconfigurability enables one to leverage different types of application-specific parallelism (coarse- and fine-grained, data- and task-level, pipelining).
Multi-FPGA systems can further facilitate massive parallelism.
- The HC-1 Application-Specific Instruction Processor (ASIP) [Convey 2011] combines a multicore CPU with multi-FPGA-based custom instruction accelerators.
- The Novo-G supercomputer [CHREC 2012] hosts 192 reconfigurable devices; it consumes almost three orders of magnitude less power than the Opteron-based Jaguar and Cell-based Roadrunner supercomputers at comparable performance for bioinformatics-related applications.

FPGA Programmability

HLS tools allow designers to use sequential programs instead of RTL. However, HLS struggles to extract parallelism at granularities coarser than loop iterations.
Possible solutions from HLS frameworks:
- Introduce new parallel programming models.
- Add language extensions for coarse-grained parallelism annotation.
- But developers need to learn and use them.
There are efforts to adopt OpenCL for FPGA accelerators, but how to program the FPGAs into those accelerators remains unsolved.
FCUDA uses CUDA as a unified interface to program both GPUs and FPGAs.

CUDA Overview

Threadblocks map to SMs:
- Execute independently.
- Synchronization between threadblocks is possible in recent versions of CUDA.
Threads in a threadblock map to SPs in an SM:
- Allow barrier-like synchronization.
- Execute in groups called warps that share the same control flow.
Memory hierarchy: registers (per SP), shared memory (per SM), global memory, constant/texture memory (read-only).
A small kernel illustrating these concepts follows.
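
The following minimal CUDA kernel (illustrative; not from the original slides) exercises threadblocks, per-thread registers, shared memory, and barrier synchronization. Each threadblock reduces its tile of the input to one partial sum; it assumes a launch with 256 threads per block.

    // Per-block reduction: blockDim.x must be 256 to match buf[].
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float buf[256];       // shared memory, per threadblock
        int tid = threadIdx.x;           // registers are per thread
        int gid = blockIdx.x * blockDim.x + tid;
        buf[tid] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();                 // barrier within the threadblock
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];    // one partial sum per threadblock
    }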

AutoPilot-C Overview

C/C++ to RTL:
- Frontend based on LLVM; not all language features are supported.
- With annotations, procedures map to RTL modules.
Storage:
- On-chip: allocated statically; scalar variables map to FPGA slice registers; constant-size arrays/structures and aggregated nonscalar variables map to FPGA BRAMs.
- Off-chip: inferred from pointers with annotations; developers assume full control.
A sketch of these mapping rules follows.
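
A hedged sketch of what a synthesizable AutoPilot-C procedure might look like under these storage rules (the names are hypothetical and annotations are omitted; actual AutoPilot code uses tool-specific pragmas):

    // The procedure becomes an RTL module. The constant-size local
    // array maps to a BRAM, the scalars map to slice registers, and
    // off-chip memory is reached through the pointer argument.
    void vec_scale(float *mem, int base, float alpha) {
        float buf[256];                  // constant-size array -> BRAM
        for (int i = 0; i < 256; i++)    // i, base, alpha -> registers
            buf[i] = mem[base + i];      // off-chip reads via pointer
        for (int i = 0; i < 256; i++)
            buf[i] *= alpha;
        for (int i = 0; i < 256; i++)
            mem[base + i] = buf[i];      // off-chip writes via pointer
    }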

Programming Model Translation

By utilizing CUDA, FCUDA provides efficient kernel portability across GPUs and FPGAs.
- CUDA is a good model for programming platforms other than GPUs.
- CUDA C provides higher abstraction and incurs a lower learning curve than AutoPilot-C.
FCUDA uses source-to-source transformations instead of IR translation to exploit different levels of coarse-grained parallelism.
The memory hierarchy exposed by CUDA C fits well with the memory view within hardware synthesis flows.

FCUDA Framework

CUDA C to AutoPilot-C Translation

Objectives:
- Convert the implicit workload hierarchies (threads and threadblocks) of the CUDA programming model into explicit AutoPilot-C work items.
- Expose the coarse-grained parallelism and synchronization restrictions of the kernel in the AutoPilot-C programming model.
- Generate AutoPilot-C code that can be converted into high-performance RTL implementations.
- Map the hierarchy of thread groups in CUDA onto the spatial parallelism of the reconfigurable FPGA.
- Implement code-restructuring optimizations to facilitate better performance on the reconfigurable architecture, e.g., kernel decomposition into compute and data-transfer tasks.

Example CUDA Code

(Figure-only slide; code listing not captured in the transcript.)

Kernel Restructuring

(Figure-only slide; not captured in the transcript.)

Task Procedures (per Threadblock)

(Figure-only slide; not captured in the transcript. A hedged sketch of the decomposition follows.)
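
Since the slide figures are not captured, here is a hedged sketch of the kind of restructuring these slides illustrate: the kernel is decomposed into data-transfer and compute task procedures invoked per threadblock, with the implicit CUDA threads turned into an explicit loop (all names are hypothetical, not the tool's exact output).

    #include <string.h>

    #define BLOCKDIM 256

    // Data-transfer task: burst-copy this threadblock's tile on-chip.
    void fetch(float *gmem, float lbuf[BLOCKDIM], int blkIdx) {
        memcpy(lbuf, &gmem[blkIdx * BLOCKDIM], BLOCKDIM * sizeof(float));
    }

    // Compute task: the implicit CUDA threads become an explicit loop.
    void compute(float lbuf[BLOCKDIM], float a) {
        for (int tid = 0; tid < BLOCKDIM; tid++)   // thread loop
            lbuf[tid] *= a;
    }

    // Data-transfer task: burst-copy the tile back off-chip.
    void write_back(float *gmem, float lbuf[BLOCKDIM], int blkIdx) {
        memcpy(&gmem[blkIdx * BLOCKDIM], lbuf, BLOCKDIM * sizeof(float));
    }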

Some Translation Details

Coarse-grained parallelism in AutoPilot-C is exposed via loop unroll-and-jam transformations (see the sketch below).
CUDA threadblocks map to PEs on the FPGA:
- Threadblocks execute independently in CUDA, so each PE is statically scheduled to execute a disjoint subset of threadblocks.
Threads within a threadblock are scheduled to execute in warps whose size equals the thread-loop unroll degree on the PE:
- This offers opportunities for resource sharing.
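
A hedged illustration of unroll-and-jam on the thread loop from the earlier sketch (degree 2, written out manually; in practice the tool applies the transformation):

    // Two "warp lanes" per iteration after unroll-and-jam; the lanes
    // proceed in lockstep and can share functional units.
    void compute_unrolled(float lbuf[256], float a) {
        for (int tid = 0; tid < 256; tid += 2) {
            lbuf[tid]     *= a;    // lane 0
            lbuf[tid + 1] *= a;    // lane 1 (jammed copy)
        }
    }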

Some Translation Details (Cont.)

FPGA BRAMs are better suited for use as scratchpad memories than as caches.
FPGAs lack the complex hardware for dynamic threadblock context switching, as well as for dynamic coalescing of concurrent off-chip memory accesses.
The FCUDA SSTO engine therefore implements static coalescing: it aggregates all the off-chip accesses into burst block transfers after formulating and combining data-transfer tasks.

Overlapping Computation and Data Transfer

(Figure-only slide; not captured in the transcript. A hedged sketch of the overlap follows.)
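
Reusing the task procedures sketched earlier, a ping-pong (double-buffering) schedule is one way such overlap is typically organized (an illustrative assumption; in the synthesized design the fetch of tile blk+1 runs concurrently with the compute on tile blk, which this sequential C sketch can only indicate):

    // Hypothetical per-PE driver; gmem, a, numBlocks are its inputs.
    void pe_run(float *gmem, float a, int numBlocks) {
        float bufA[256], bufB[256];
        fetch(gmem, bufA, 0);                    // prologue: first tile
        for (int blk = 0; blk < numBlocks; blk++) {
            float *cur  = (blk % 2 == 0) ? bufA : bufB;
            float *next = (blk % 2 == 0) ? bufB : bufA;
            if (blk + 1 < numBlocks)
                fetch(gmem, next, blk + 1);      // transfer, overlapped
            compute(cur, a);                     // compute task
            write_back(gmem, cur, blk);          // write-back
        }
    }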

Memory Interface

(Figure-only slide; not captured in the transcript.)

FCUDA Transformation and Optimization Algorithms

Translation Overview

(Figure-only slide; not captured in the transcript.)

Translation Overview (Cont.)

(Figure-only slide; not captured in the transcript.)

CUDA Memory Space Mapping

Constant memory:
- Faster read access in CUDA than global memory, via SM-private caches.
- FCUDA maps constant memory to a BRAM per PE and uses prefetching before kernel execution to improve access latency.
Global memory:
- The CUDA developer is responsible for organizing global memory accesses in coalesced ways to utilize the full off-chip memory bandwidth.
- FCUDA decouples global memory access from computation by introducing new local variables.
Registers: shared between threads, or vectorized and implemented in BRAM.
Shared memory: threadblock-private; maps to BRAM.
A sketch of two of these mappings follows.
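
A hedged sketch of two of these mappings (hypothetical names; lbuf and a are as in the earlier sketches, const_gmem stands in for data held in CUDA constant memory):

    #include <string.h>

    void memory_mapping_example(float lbuf[256], float a,
                                const float *const_gmem) {
        // A per-thread CUDA register (float r = lbuf[tid] * a;) is
        // vectorized into one element per thread and mapped to BRAM:
        float r[256];
        for (int tid = 0; tid < 256; tid++)
            r[tid] = lbuf[tid] * a;

        // Constant memory: prefetched once into a per-PE BRAM copy
        // before the compute tasks execute:
        float cbuf[64];
        memcpy(cbuf, const_gmem, 64 * sizeof(float));

        lbuf[0] = r[0] + cbuf[0];  // keep the sketch's results live
    }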

Example Transformation/Optimization

(Figure-only slide; not captured in the transcript.)

Example Transformation/Optimization (Cont.)

(Figure-only slide; not captured in the transcript.)

Example Transformation/Optimization (Cont.)

(Figure-only slide; not captured in the transcript.)

Experimental Results

CUDA Kernels for Experiments

(Figure-only slide; kernel table not captured in the transcript.)

Parallelism Extraction Impact on Performance

FPGA computing latency depends on:
- Concurrency, e.g., PE count.
- Cycles, e.g., functional unit allocation.
- Frequency, e.g., interconnects.
Controlled by:
- PE count
- Loop unrolling
- Array partitioning (illustrated below)
- Task synchronization
- PE clustering (set to 9 for all experiments)
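
A hedged illustration of why array partitioning matters for parallelism extraction: a single BRAM exposes only a couple of ports, so unrolled thread-loop lanes sharing one array serialize on memory; partitioning the array across banks gives each lane its own port. HLS tools apply this via directives; the manual form below (hypothetical names) just shows the effect.

    // Four cyclic banks instead of one 256-element array: the four
    // lanes of a degree-4 unrolled thread loop can now access memory
    // in the same cycle.
    void scale_partitioned(float bank0[64], float bank1[64],
                           float bank2[64], float bank3[64], float a) {
        for (int i = 0; i < 64; i++) {
            bank0[i] *= a;   // lane 0
            bank1[i] *= a;   // lane 1
            bank2[i] *= a;   // lane 2
            bank3[i] *= a;   // lane 3
        }
    }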

Threadblock- and Thread-Level Parallelism

(Figure-only slide; results chart not captured in the transcript.)

Compute and Data-Transfer Task Parallelism

(Results chart not captured in the transcript.)
Off-chip memory bandwidth configurations: BW1 << BW2 << BW3.

FPGA vs. GPU

GPU: NVIDIA G92
- Runs at 1500 MHz
- 128 SPs
- 64 GB/s peak off-chip bandwidth
FPGA: Xilinx Virtex-5 SX240T
- Runs at 100~200 MHz
- 1056 DSPs
- 1032 BRAMs (~2 MB total)
High off-chip memory bandwidth is extremely important.