OpenCL High-Level Synthesis for Mainstream FPGA Acceleration

James Coole, PhD student, University of Florida
Dr. Greg Stitt, Associate Professor of ECE, University of Florida
SHAW Workshop
This work is supported by National Science Foundation grant CNS-1149285 and the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422.

Introduction: Productivity Bottlenecks
Numerous studies have shown performance, energy, and power advantages of FPGAs, but FPGA usage is still limited to niche areas. Goal: enable FPGA usage by designers currently targeting GPUs and multi-cores. Problem: 10x worse productivity, which means higher NRE costs than a processor or GPU, increased time-to-market, and niche usage with higher device costs.
Productivity bottlenecks: Register-transfer-level (RTL) design requires specialized languages, cycle-by-cycle behavior, and digital design expertise, and it is time consuming and error prone. Low-level debugging requires cycle-by-cycle analysis of waveforms with 100s of signals.

Introduction: Potential Solution
Potential solution: high-level synthesis (HLS), which compiles an FPGA app from mainstream high-level code (e.g. OpenCL) and automatically creates the RTL circuit. There have been significant recent achievements for OpenCL HLS, but it is still not appropriate for mainstream usage. Main problem: long compile times of hours, days, even weeks. This is a huge productivity bottleneck that prevents mainstream methodologies and prevents OpenCL’s runtime compilation. What is needed is high-level synthesis that takes a similar amount of time as software compilation.

Introduction: Intermediate Fabrics
Solution: intermediate fabrics (IFs), virtual reconfigurable architectures that sit between the application and the FPGA. They hide low-level FPGA details. They are similar to coarse-grained reconfigurable arrays (CGRAs) but are implemented on COTS FPGAs, which gives cost and flexibility advantages. The abstraction provides near-instant FPGA compilation, > 1000x faster than commercial tools, and integrates with OpenCL HLS to enable transparent FPGA usage.
Main contribution: enables mainstream FPGA usage with a near-identical tool flow that is > 1000x faster than FPGA vendor tools.

Intermediate Fabric (IF) Overview
Traditional FPGA tool flow: App, Synthesis, Place & Route (PAR), Bitfile, physical device (FPGA). Lengthy compilation: > 10k lookup tables (LUTs). FPGA specific: limited portability, not portable.
Intermediate fabric tool flow: App, Synthesis, Place & Route onto a virtual device implemented on the physical device(s). App portability: always targets the IF regardless of the underlying FPGA. Fast compilation: several coarse-grained resources. Fast partial reconfiguration: even on devices without support.
Main research challenge: minimizing overhead.

OpenCL-IF High-Level Synthesis
Intermediate fabrics could be integrated with any HLS tool; we created our own tool, OpenCL-IF. OpenCL-IF compiles code onto reconfiguration contexts. Definition: a reconfiguration context is a virtual architecture implemented atop the FPGA. Contexts are implemented here using intermediate fabrics, but other possibilities exist. Main research challenge: how to create intermediate fabrics/contexts for a given application or domain? Fast compilation assumes a context already exists. Without an appropriate context, the tool must use slow FPGA compilation.
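The hit/miss distinction can be sketched as a resource check. This is a minimal illustration only, under the assumption that a kernel and a context are each summarized as counts of operator types; the names `fits`, `compile_kernel`, `ctx0`, and the op labels are hypothetical and not part of OpenCL-IF.

```python
from collections import Counter

def fits(kernel_ops, context_ops):
    """True if the context offers at least as many units of every op type the kernel needs."""
    return all(context_ops.get(op, 0) >= n for op, n in kernel_ops.items())

def compile_kernel(kernel_ops, contexts):
    """Return ("hit", name) for the first context that fits, else ("miss", None).

    A hit stands for the fast path (compile onto an existing context);
    a miss stands for the slow path (a new context needs FPGA compilation).
    """
    for name, ctx in contexts.items():
        if fits(kernel_ops, ctx):
            return ("hit", name)
    return ("miss", None)

# Hypothetical example data: one existing context and two kernels.
contexts = {"ctx0": Counter({"add": 4, "mul": 2})}
sobel = Counter({"add": 3, "mul": 2})
fft = Counter({"add": 2, "fmul": 4})

print(compile_kernel(sobel, contexts))  # ('hit', 'ctx0')
print(compile_kernel(fft, contexts))    # ('miss', None)
```

A real tool would also have to place and route the kernel on the fabric, so a resource check like this is a necessary condition for a hit rather than a sufficient one.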

OpenCL-IF Overview: Context Hit

OpenCL-IF Overview: Context Miss

OpenCL-IF Overview: Context Generation

OpenCL-IF Overview: Repeated Misses

Context Design Heuristic for IFs
Use a clustering heuristic based on k-means to group kernels by functional similarity. Connections between functional units can be ignored due to the IF's routing flexibility. Clustering encourages op sharing within each group and merges ops used by different kernels in the group. Ops of the same type are merged if their “generics” can be configured (e.g. ALU) or promoted (e.g. width). k (the # of contexts) provides a tuning parameter for tradeoffs based on designer intent. Larger k -> smaller, specialized contexts. This can help fit: 60% decrease in context size going from a single context -> 5 contexts in the case study. The savings can be used to preemptively increase flexibility by growing each context. 144x faster reconfiguration vs. the device (and KB vs. MB bitfiles).
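The grouping idea can be sketched with plain k-means over per-kernel op-count vectors. This is a simplified illustration, not the authors' heuristic: it assumes each kernel is described only by operator counts, seeds the centroids with the first k kernels, and merges a group by taking the per-op maximum. Configuring or promoting generics is not modeled, and all kernel names and counts are made up.

```python
def kmeans(vectors, k, iters=20):
    """Plain k-means; centroids are seeded with the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: centroid becomes the mean of its members (unchanged if empty).
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def build_contexts(kernels, k):
    """Cluster kernels, then merge each cluster into one context (per-op maximum)."""
    names = sorted(kernels)
    ops = sorted({op for kernel_ops in kernels.values() for op in kernel_ops})
    vectors = [[kernels[n].get(op, 0) for op in ops] for n in names]
    assign = kmeans(vectors, k)
    merged = {}
    for vec, c in zip(vectors, assign):
        current = merged.get(c, [0] * len(ops))
        merged[c] = [max(m, v) for m, v in zip(current, vec)]
    contexts = {c: dict(zip(ops, counts)) for c, counts in merged.items()}
    return contexts, dict(zip(names, assign))

# Hypothetical kernels: two fixed-point and two floating-point.
kernels = {
    "blur_fxd": {"add": 8, "mul": 9},
    "sobel_fxd": {"add": 6, "mul": 6},
    "norm_flt": {"fadd": 4, "fmul": 4, "fdiv": 1},
    "scale_flt": {"fadd": 2, "fmul": 3},
}

single, _ = build_contexts(kernels, 1)
split, groups = build_contexts(kernels, 2)
print(sum(single[0].values()))                      # 26
print([sum(ctx.values()) for ctx in split.values()])  # [17, 9]
```

With these made-up counts, one context needs 26 units while two contexts need 17 and 9, which shows the "larger k, smaller contexts" tradeoff in miniature.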

OpenCL-IF Case Study
Evaluated a computer vision system with 10 fixed-/floating-point OpenCL kernels. Compared OpenCL-IF compile times and area/performance against VHDL.
Compile time: on the workstation, the system compiles in ~3s total vs. 7.4h direct: 8700x speedup. 4x faster for floating point (FLT) vs. fixed point (FXD) due to more device resources being hidden by IF cores. ~0.15s per-kernel compile times show that runtime compilation is possible.
Area: 1.8x system area overhead, 1.3x-15x per context vs. separate accelerators. Overhead is amortized over multiple kernels by using the IF’s rapid configurability. Overhead decreases with new kernels! It is lower for FLT vs. FXD because of larger ops.
Vendor flow: Xilinx ISE 14.4 using reduced effort for faster compilation at the expense of circuit quality, targeting XC6VCX130T-1FF1154. Times measured on a quad-core 2.66 GHz Intel Xeon W3520 workstation with 12GB RAM running CentOS 6.4 x86_64.
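As a quick check of the compile-time figures, the quoted speedup and direct compile time imply the OpenCL-IF total. The variable names below are illustrative; the two input numbers are the ones on the slide.

```python
direct_s = 7.4 * 3600    # 7.4 h for the direct vendor flow, in seconds
speedup = 8700           # speedup quoted on the slide

# Total OpenCL-IF compile time implied by the two figures above.
if_total_s = direct_s / speedup
print(round(if_total_s, 2))  # 3.06
```

The implied total of about 3.06 s is consistent with the "~3s total" figure.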

OpenCL-IF Case Study
The same system was evaluated using OpenCL-IF on an ARM embedded platform: single-core 1GHz Cortex A8, same Virtex 6 FPGA (using the same contexts), same program source and toolchain. The system compiles in 20.7s total, still achieving a 1470x speedup over workstation vendor synthesis. ~1s per-kernel compile times show that runtime compilation is also possible on embedded devices. This enables FPGA acceleration of OpenCL programs that are portable across devices and have dynamic workloads in embedded devices. Embedded devices can’t generate new contexts themselves, but can request them from context servers.

Conclusions and Future Work
OpenCL-IF provides an FPGA tool flow that is nearly identical to that of GPUs and multicores. It enables near-instant (< 1s) FPGA compilation, > 1000x faster than device-vendor tools. Performance overhead is modest. Area overhead can be significant for some use cases and is a significant focus of ongoing work.
Future work: novel interconnect architectures to reduce area overhead; high-level synthesis optimizations enabled by fast compilation; partial reconfiguration of fabric resources.

References
Coole, J., and Stitt, G. Fast and flexible high-level synthesis from OpenCL using reconfiguration contexts. IEEE Micro: Special Issue on Reconfigurable Computing (to appear).
Coole, J., and Stitt, G. Intermediate fabrics: Virtual architectures for circuit portability and fast placement and routing. CODES/ISSS ’10, pp. 13–22.
Landy, A., and Stitt, G. A low-overhead interconnect architecture for virtual reconfigurable fabrics. CASES ’12, pp. 111–120.
Stitt, G., and Coole, J. Intermediate fabrics: Virtual architectures for near-instant FPGA compilation. Embedded Systems Letters, IEEE 3, 3 (Sept. 2011), 81–84.
Hao, L., and Stitt, G. Virtual finite-state-machine architectures for fast compilation and portability. ASAP ’13, pp. 91–94.

Envisioned Use Cases
Improve developer productivity: Development typically involves multiple edits and in-board testing, requiring lengthy compilation for even minor changes. OpenCL-IF makes development more similar to GPUs and CPUs; the difference is the occasional creation of new contexts. Large changes or an accumulation of small changes result in temporary misses for the affected kernels. This reduces total compilation time across development.
Increased portability and dynamic optimizations: Runtime compilation allows application source to be portable between FPGAs and technologies. The portable toolchain is insulated from FPGA details. Optimizations can be based on values known only at runtime.
Context servers: Because the need for new contexts is likely to be bursty, it makes sense to share context generation. This lets systems incapable of FPGA PAR handle misses. Caching at the server might help decrease the global miss rate.

Memory Optimizations
Memory bandwidth is often the bottleneck in FPGA applications. Specialized buffers can improve parallelism by > 10x, e.g. sliding-window buffers [Fowers FPGA 2012]. The tool implements efficient buffer streaming by inferring 1D/2D sliding-window buffers based on the kernel’s use of memory. Many kernels keep their memory accesses to some set of constant offsets relative to their workgroup id, which makes access patterns easier to identify. The tool schedules work items in sequence to ensure the pattern and creates pipelined implementations in this case, with all control/memory interfacing external to the IF. Similar analysis is used to convert const-indexed __const memory to runtime-loaded constants.
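The benefit of a sliding-window buffer can be illustrated in software. The sketch below is not the tool's implementation. It assumes a row-major pixel stream and a 2D window, and models the buffer as a fixed-depth queue so that each pixel is read from "memory" exactly once while every window is served from the buffer. The function name and test image are hypothetical.

```python
from collections import deque

def sliding_windows(image, wh, ww):
    """Stream pixels row-major once and yield every wh x ww window from a small buffer."""
    width = len(image[0])
    depth = (wh - 1) * width + ww   # number of pixels that must stay buffered
    buf = deque(maxlen=depth)
    for r, row in enumerate(image):
        for c, px in enumerate(row):
            buf.append(px)          # the only read of this pixel from "memory"
            if r >= wh - 1 and c >= ww - 1:
                last = len(buf) - 1  # buf[last] is pixel (r, c)
                # Rows of the window sit `width` entries apart in the buffer.
                yield [
                    [buf[last - (wh - 1 - i) * width - (ww - 1 - j)] for j in range(ww)]
                    for i in range(wh)
                ]

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
windows = list(sliding_windows(image, 2, 2))
print(len(windows))   # 6
print(windows[0])     # [[1, 2], [5, 6]]
print(windows[-1])    # [[7, 8], [11, 12]]
```

In hardware, all elements of a window would be available in the same cycle, which is where the parallelism gain comes from. The Python version only shows the access pattern and buffer depth.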

Intermediate Fabric (IF) Architecture
Island-style layout: The fabric can implement any architecture; the current focus is on an island-style layout with switch boxes, connection boxes, and tracks. Computational units (CUs) are app-specialized: FFTs, floating-point resources, filters, etc. Track widths are specialized as well.
“Soft” RTL track implementation: For an n-bit virtual track with m sources, the circuit uses an m:1, n-bit mux. There are many tracks in an IF, which makes them the largest source of overhead.
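A behavioral model of one soft track may help make the mux description concrete. This is an illustrative sketch, not RTL from the authors. The function names are hypothetical, and `select_bits` simply counts the select bits a binary-encoded m:1 mux would need.

```python
def track_value(sources, select, n_bits):
    """Model one virtual track: an m:1 mux over `sources`, truncated to n_bits."""
    if not 0 <= select < len(sources):
        raise ValueError("select must index one of the m sources")
    return sources[select] & ((1 << n_bits) - 1)

def select_bits(m):
    """Configuration bits needed to select among m sources (binary encoding assumed)."""
    return (m - 1).bit_length()

sources = [0x1234, 0xABCD, 0xFFFF0]          # m = 3 hypothetical drivers
print(hex(track_value(sources, 1, 16)))      # 0xabcd
print(hex(track_value(sources, 2, 16)))      # 0xfff0 (truncated to 16 bits)
print(select_bits(len(sources)))             # 2
```

Because every track carries its own mux and select configuration, the cost grows with both the number of sources m and the width n.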

Intermediate Fabric (IF) Architecture, Cont.
“Soft” RTL switch box: Switch boxes are implemented similarly, with a mux defining every connection. This supports any topology, specialized to application requirements. Optional registers on outputs eliminate combinational loops and minimize delays across muxes.
Pipelined interconnect can require complicated routing that ensures routing paths have the same # of hops. For pipelined circuits, this is avoided by using realignment registers, which lengthen the shorter path by adding pipeline stages. This enables the use of traditional place & route algorithms.
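The realignment idea reduces to padding the shorter paths. The sketch below assumes each routed input path is summarized by its hop count (one register per hop); the function and input names are hypothetical.

```python
def realignment_registers(input_hops):
    """Given routed hop counts for each input of one unit, return the number of
    extra pipeline registers each input needs so that all inputs arrive together."""
    longest = max(input_hops.values())
    return {name: longest - hops for name, hops in input_hops.items()}

# Hypothetical routing result: input "a" took 5 hops, input "b" took 2.
print(realignment_registers({"a": 5, "b": 2}))  # {'a': 0, 'b': 3}
```

Since the delay mismatch is corrected after routing, the router itself does not have to balance hop counts.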

Intermediate Fabric (IF) Tool Flow
App design flow: choose an appropriate fabric. 1) Synthesize a custom fabric (+ low area overhead, - requires one FPGA PAR), or 2) select a fabric from a library (+ fabric instantly available, - possibly no appropriate IF).
IF creation flow (1 time only): implement the IF on the FPGA. 1) Soft resources implement the virtual fabric as RTL code (+ portable, flexible, - more overhead), or 2) hard resources directly use physical routing resources (+ less overhead, - less portable, flexible).