James Coole PhD student, University of Florida Aaron Landy Greg Stitt

Presentation transcript:

An Overlay-Based Design Approach for Mainstream Reconfigurable Computing
James Coole, PhD student, University of Florida
Aaron Landy
Greg Stitt, Associate Professor of ECE, University of Florida
Catapult Workshop
This work is supported by National Science Foundation grant CNS-1149285 and the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422.

Introduction
- Goal: enable FPGA usage by designers currently targeting GPUs and multi-cores
- Problem: order-of-magnitude worse productivity
- Productivity bottlenecks
  - Register-transfer-level (RTL) design: solved by high-level synthesis (HLS)
  - Long compile times (hours to days): prevents mainstream methodologies
- Solution: virtual architectures atop FPGAs (i.e., overlays)
  - Provides near-instant place and route, >1000x faster than FPGA vendor tools
  - Integrates with HLS for rapid compilation
  - Partial reconfiguration swaps in needed resources
  - Enables transparent FPGA usage
[Figure: without overlays, an OpenCL kernel (__kernel void convolve(float *s) {…}) passes through HLS and FPGA place and route, with long compile times, onto the FPGA. With overlays, OpenCL kernels (convolve, kernelA, kernelB) pass through HLS and overlay place and route, >1000x faster than FPGA vendor tools, onto an overlay running on the FPGA.]

CLIF High-Level Synthesis
- Overlays could be integrated with any HLS tool
- Created our own tool: CLIF (OpenCL-IF)
  - IF = intermediate fabric
- CLIF compiles code onto reconfiguration contexts
  - Definition: overlays that can be dynamically changed and swapped in and out of the FPGA
  - Contexts were originally implemented using intermediate fabrics
  - We now use supernet contexts

CLIF Overview: Context Hit

CLIF Overview: Context Miss

CLIF Overview: Repeated Executions
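The three overview slides above are figure-only, so the following is a minimal Python sketch of one plausible reading of context hit, context miss, and repeated execution. The `Context` and `ContextManager` classes, the op names, and the "first context that fits" policy are all hypothetical and are not taken from CLIF itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical reconfiguration context: a name plus the op counts it provides."""
    name: str
    ops: tuple  # e.g. (("add", 4), ("mul", 2))

    def supports(self, kernel_ops):
        # A kernel fits if the context offers at least as many ops of each type.
        have = dict(self.ops)
        return all(have.get(op, 0) >= n for op, n in kernel_ops.items())


class ContextManager:
    """Assumed runtime policy: reuse the loaded context when possible."""

    def __init__(self, contexts):
        self.contexts = list(contexts)
        self.loaded = None          # context currently on the FPGA
        self.reconfigurations = 0   # number of context swaps so far
        self.compiled = {}          # kernel name -> context it was mapped onto

    def run(self, kernel_name, kernel_ops):
        # Repeated execution: the kernel is already mapped onto the loaded context.
        if self.loaded is not None and self.compiled.get(kernel_name) == self.loaded.name:
            return "cached"
        if self.loaded is not None and self.loaded.supports(kernel_ops):
            # Context hit: only the fast overlay place and route would be needed.
            outcome = "hit"
        else:
            # Context miss: swap in the first context that fits (assumed policy).
            candidate = next((c for c in self.contexts if c.supports(kernel_ops)), None)
            if candidate is None:
                raise LookupError(f"no context supports {kernel_name}")
            self.loaded = candidate
            self.reconfigurations += 1
            outcome = "miss"
        self.compiled[kernel_name] = self.loaded.name
        return outcome


fixed = Context("fixed", (("add", 4), ("mul", 2)))
floating = Context("float", (("fadd", 2), ("fmul", 2)))
manager = ContextManager([fixed, floating])
```

In this sketch a miss is the only path that reconfigures the device, which is the cost that the rapid context swapping described in the slides is meant to keep small.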

Results Summary
- Evaluated a system of 20 fixed-point and floating-point kernels
- Up to 13,000x faster compiles
  - 0.15 s per kernel
  - 2.5 s vs. 3.6 hours for the whole system
  - Enables runtime compilation
- 16.2% average clock overhead
- 60% less area than combined RTL implementations for each kernel
  - Original IFs had significant area overhead
  - New supernets are up to 8.9x smaller than minimum-sized IFs
  - Rapid reconfiguration hides overhead of individual kernels
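As a quick check on the whole-system numbers above, the ratio of 3.6 hours to 2.5 s can be worked out directly. The helper name `speedup` is illustrative only; the "up to 13,000x" figure is presumably a best case for individual kernels, which this calculation does not reproduce.

```python
def speedup(baseline_seconds, overlay_seconds):
    """Ratio of baseline compile time to overlay compile time."""
    return baseline_seconds / overlay_seconds


vendor_seconds = 3.6 * 3600   # 3.6 hours of vendor place and route, in seconds
overlay_seconds = 2.5         # whole-system overlay compile time from the slide

system_speedup = speedup(vendor_seconds, overlay_seconds)
print(f"whole-system speedup: about {system_speedup:,.0f}x")
```

The whole-system ratio comes out to roughly 5,184x, which is consistent with "up to 13,000x" being an upper bound rather than an average.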

Research Challenges
- How to minimize area/performance overhead?
  - CODES/ISSS 2010, ESL 2011, CASES 2012, ASAP 2013
- How to create reconfiguration contexts for a given application or set of applications?
  - IEEE Micro 2014, SHAW 2014, FCCM 2015, TECS (to appear)
- How to integrate with OpenCL high-level synthesis?
  - IEEE Micro 2014, SHAW 2014, FCCM 2015
- How to customize the overlay for different FPGA fabrics?
  - CASES 2012, ASAP 2013, TECS (to appear)

Catapult Collaboration
- Just-in-time FPGA compilation
- Dynamic circuit optimizations based on changing workloads, inputs, etc.
- Portability of code across multiple FPGA types
- Virtualization
- Rapid, transparent partial reconfiguration
- Improve productivity via virtual, application-specialized resources and interfaces
- Security
- Application case studies

References
- Coole, J., and Stitt, G. Adjustable-cost overlays for runtime compilation. FCCM 2015.
- Coole, J., and Stitt, G. Fast and flexible high-level synthesis from OpenCL using reconfiguration contexts. IEEE Micro: Special Issue on Reconfigurable Computing (Jan. 2014).
- Coole, J., and Stitt, G. OpenCL high-level synthesis for mainstream FPGA acceleration. Workshop on SoCs, Heterogeneous Architectures and Workloads (SHAW), 2014.
- Hao, L., and Stitt, G. Virtual finite-state-machine architectures for fast compilation and portability. ASAP '13, pp. 91–94.
- Landy, A., and Stitt, G. A low-overhead interconnect architecture for virtual reconfigurable fabrics. CASES '12, pp. 111–120.
- Stitt, G., and Coole, J. Intermediate fabrics: Virtual architectures for near-instant FPGA compilation. IEEE Embedded Systems Letters 3, 3 (Sept. 2011), 81–84.
- Coole, J., and Stitt, G. Intermediate fabrics: Virtual architectures for circuit portability and fast placement and routing. CODES/ISSS '10, pp. 13–22.

Appendix

Intermediate Fabric (IF) Overview
[Figure: traditional FPGA tool flow compared with the intermediate fabric tool flow]
- Traditional FPGA tool flow: app, synthesis, place and route (PAR), bitfile, FPGA (physical device)
  - > 10k lookup tables (LUTs): lengthy compilation
  - Figure labels: "FPGA specific: limited portability" and "FPGA specific: not portable"
- Intermediate fabric tool flow: app, synthesis, place and route onto the IF (virtual device), which runs on the physical device(s)
  - Fast compilation: several coarse-grained resources
  - App portability: always targets the IF regardless of underlying FPGA
  - Fast partial reconfiguration: even on devices without support

CLIF Overview: Context Generation

Context Design Heuristic for IFs
- Use a clustering heuristic based on k-means to group kernels by functional similarity
  - We can ignore connections between functional units due to IF routing flexibility
  - Encourages op sharing within each group and merges ops used by multiple kernels in the group
  - Merges ops of the same type if "generics" can be configured (e.g., ALU) or promoted (e.g., width)
- k (the number of contexts) provides a tuning parameter for tradeoffs based on designer intent
  - Larger k gives smaller, more specialized contexts
  - Can help fit: 60% decrease in context size going from a single context to 5 contexts in the case study
  - Can use the savings to preemptively increase flexibility by growing each context
- 144x faster reconfiguration vs. the full device (and KB vs. MB bitfiles)
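The clustering step above can be sketched as follows. This is a minimal illustration, assuming each kernel is summarized by its op counts and that ops are shared within a context by taking the per-type maximum; the kernel names, op counts, and the naive "first k kernels" centroid initialization are hypothetical and not taken from the slides.

```python
def kmeans(vectors, k, iters=20):
    """Plain k-means; returns a list of index groups (one per cluster)."""
    dim = len(vectors[0])
    centroids = [list(v) for v in vectors[:k]]  # naive deterministic init
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            groups[nearest].append(idx)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        updated = [
            [sum(vectors[i][d] for i in g) / len(g) for d in range(dim)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return groups


def merge_context(kernel_ops_list):
    """Assumed merge rule: a context needs the max count of each op type in its group."""
    merged = {}
    for ops in kernel_ops_list:
        for op, n in ops.items():
            merged[op] = max(merged.get(op, 0), n)
    return merged


def design_contexts(kernels, k):
    """Cluster kernels by op-count vectors and merge each group into one context."""
    names = list(kernels)
    op_types = sorted({op for ops in kernels.values() for op in ops})
    vectors = [[kernels[n].get(t, 0) for t in op_types] for n in names]
    return [
        ({names[i] for i in g}, merge_context([kernels[names[i]] for i in g]))
        for g in kmeans(vectors, k)
        if g
    ]


# Hypothetical kernels described only by functional-unit counts.
KERNELS = {
    "fir": {"add": 4, "mul": 4},
    "fft_float": {"fadd": 6, "fmul": 4},
    "sobel": {"add": 6, "mul": 2},
    "nbody": {"fadd": 3, "fmul": 5},
}

for members, ops in design_contexts(KERNELS, k=2):
    print(sorted(members), ops)
```

With this toy data, a single context (k = 1) needs 21 functional units, while k = 2 yields two contexts of 10 and 11 units, which illustrates the "larger k gives smaller contexts" tradeoff in miniature.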

Supernets: Contexts w/ Tailored Interconnect
- Ideally, contexts could also tailor their interconnect for the application at hand
- What if we included, at a minimum, interconnect to support the communication requirements of the source set, using general-purpose routing as a fallback?
  - We can reduce the general-purpose network's capacity (or eliminate it entirely) to minimize overhead
  - Goal: "only pay for what you want"
- Supernet context family: includes merged ops and interconnect from the source set and an optional secondary network
  - Offline: the merger combines source-set datapaths so as to minimize added nets (and nodes), creating a supernet datapath
  - Runtime: the mapper attempts to match nets from the incoming datapath to the supernet, and handles anything else on the secondary network
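The offline merge and runtime mapping steps above can be sketched roughly as follows. This is a simplified illustration, assuming nets are (source, sink) pairs and that op names already line up between datapaths, so the merge reduces to a set union; the real merger's node matching and net minimization are not modeled, and `merge_supernet`, `map_datapath`, and `fallback_capacity` are hypothetical names.

```python
def merge_supernet(datapaths):
    """Offline step (simplified): union the nets of all source-set datapaths."""
    supernet = set()
    for nets in datapaths:
        supernet |= set(nets)
    return supernet


def map_datapath(nets, supernet, fallback_capacity):
    """Runtime step: use dedicated supernet nets first, then the secondary network.

    Returns (direct, fallback) lists of nets. Raises ValueError if the
    secondary network cannot carry the nets the supernet does not provide.
    """
    direct = [net for net in nets if net in supernet]
    fallback = [net for net in nets if net not in supernet]
    if len(fallback) > fallback_capacity:
        raise ValueError(
            f"{len(fallback)} nets need the secondary network, capacity is {fallback_capacity}"
        )
    return direct, fallback


# Hypothetical source set of two datapaths, each a list of (source, sink) nets.
kernel_a = [("mul0", "add0"), ("add0", "out")]
kernel_b = [("mul0", "add0"), ("mul1", "add0")]
supernet = merge_supernet([kernel_a, kernel_b])

# A new datapath arriving at runtime: one net matches, one does not.
incoming = [("mul0", "add0"), ("mul1", "out")]
direct, fallback = map_datapath(incoming, supernet, fallback_capacity=1)
```

Setting `fallback_capacity` to 0 models a context with the secondary network eliminated entirely: datapaths from the source set still map, but anything needing an extra net is rejected.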