High Efficiency Computing with OpenCL and FPGAs
Fernando Martinez, June 2014

FPGA Refresher

Field Programmable Computation Fabric
[Block diagram: the programmable fabric surrounded by hard blocks – Processor System, Select IOs, DDR (1866 MHz), Clock Management Tiles (CMT), 36 Kb Block RAM, 18x18 integer multiply-add DSP blocks, XADC, secure configuration, PCIe Gen3 x8, Interlaken, and 2.5 – 30 Gb/s multi-gigabit transceivers.]

Zynq-7020 SoC Device
Processor System (PS)
–ARM Cortex-A9 MPCore
–Standard peripherals
–32-bit DDR3 / LPDDR2 controller
–54 multi-use IOs
–73 DDR IOs
Programmable Logic (PL)
–85K logic cells
–106K flip-flops
–4.9 Mb Block RAM
–220 DSP blocks
–Dual 12-bit ADC
–Secure configuration engine
–4 Clock Management Tiles
–200 Select IO (1.2 – 3.3 V)

What are FPGAs Good For
Application | FPGA Feature | Key Benefit
Routers / Switches | Flexible I/O | Multi-standard support; terabit/sec bandwidth
Protocol Conversion, High Frequency Trading | I/O connected directly into compute fabric | Nanosecond latency from pin to compute block
Digital Signal Processing (Integer / Fixed Point) | Dedicated MAC blocks and FPGA fabric | 25 tera int ops at 250 MHz
Encryption / Compression | 6-LUT fabric | 3M 1-bit compares/sec per MHz

FPGA Power Advantage
Typical Peak Power Consumption
Device | Power
Zynq SoC | 1 W
High End Virtex FPGA | 30 W
4-Core Server CPU | 45 – 145 W
High End GPU | 200 W

What Holds Back the FPGA

FPGA Design: Example Description
Software application to run on
–ZC702 as a standalone system
–x86 + FPGA PCIe card
Application has 1 accelerator
Accelerator does not require direct access to FPGA I/O

FPGA Design Walkthrough
[Diagram, used through Steps 1 – 9: a Zynq platform (APU with dual ARM Cortex-A9, DDR memory controller, AMBA switches, and M_AXI4_GP / S_AXI4_HP/ACP ports into the Programmable Logic) alongside a PC platform (x86 host plus a PCIe card carrying a V7-690t FPGA with DMA, AXI Interconnect, MIG, and DDR memory).]
Step 1: Create logic view of the platform, find where accelerators can run

Step 2: Load binaries for the accelerators (FPGA bitstreams) into memory

Step 3: Allocate memories to be used in the application (buffers A, B, C on the host side; A', B', C' on the device side)

Step 4: Define the accelerator function(s), set parameters (check for the accelerator on each platform)

Step 5: Create process to monitor the accelerator (command queue Q)

Step 6: Transfer memory from processor to FPGA. On the shared memory architecture (Zynq), nothing happens in this step.

Step 7: Dispatch work to the accelerator (the matrix_mult kernel)

Step 8: Wait for results

Step 9: Get results back. On the shared memory architecture (Zynq), nothing happens in this step.

FPGA Design: Example Summary
One application executing on two platforms
Memory buffer allocation
–Shared Memory (Zynq) vs. Separate Memories (x86 + PCIe)
Data transfer
–Pass by reference (Zynq) vs. pass by value (x86 + PCIe)
–AXI interconnect vs. AXI + PCIe interconnect
Loading and launching accelerators
Result collection
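To make the buffer-allocation contrast concrete, here is a minimal sketch using the standard OpenCL C API (buffer names, sizes, and the host_ptr variable are illustrative, not from the slides). On a shared-memory platform such as Zynq, CL_MEM_USE_HOST_PTR lets the runtime hand the accelerator a reference to host memory; on an x86 + PCIe platform, a device buffer plus clEnqueueWriteBuffer copies the data across the link:

    /* Shared memory (Zynq): the accelerator can reference host memory */
    cl_mem a_shared = clCreateBuffer(context,
        CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, nbytes, host_ptr, &err);

    /* Separate memories (x86 + PCIe): allocate on the device, then copy */
    cl_mem a_copy = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                   nbytes, NULL, &err);
    err = clEnqueueWriteBuffer(queue, a_copy, CL_TRUE, 0, nbytes,
                               host_ptr, 0, NULL, NULL);

In both cases the kernel sees an ordinary __global pointer; the runtime hides the transport.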

Vision for FPGA Programming

How a Software Programmer Wants to Use an FPGA
Familiar tools using standard programming languages with no restrictions
Ability to target multiple boards without being a hardware expert

Software Approach
Processor code:

    Context context(CL_DEVICE_TYPE_ACCELERATOR);
    vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    Program program(context, devices, source);
    program.build(devices);

    Buffer A(context, CL_MEM_READ_ONLY, sizeof(matrix_A));
    Buffer B(context, CL_MEM_READ_ONLY, sizeof(matrix_A));
    Buffer C(context, CL_MEM_WRITE_ONLY, sizeof(matrix_A));

    Kernel kernel(program, "matrix_mult");
    kernel.setArg(0, A);
    kernel.setArg(1, B);
    kernel.setArg(2, C);

    CommandQueue Q(context, devices[0]);
    Q.enqueueWriteBuffer(A, CL_TRUE, 0, sizeof(matrix_A), matrix_A);
    Q.enqueueWriteBuffer(B, CL_TRUE, 0, sizeof(matrix_A), matrix_B);
    Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(16, 16),
                           NDRange(16, 16), NULL, &finished);
    finished.wait();
    Q.enqueueReadBuffer(C, CL_TRUE, 0, sizeof(matrix_A), matrix_C);

Accelerator (OpenCL kernel):

    __kernel __attribute__ ((reqd_work_group_size(16, 16, 1)))
    void matrix_mult(__global int* a, __global int* b, __global int* output)
    {
        int r = get_global_id(0);
        int c = get_global_id(1);
        int rank = get_global_size(0);
        int running = 0;
        for (int index = 0; index < rank; index++) {
            int aIndex = r*rank + index;
            int bIndex = index*rank + c;
            running += a[aIndex] * b[bIndex];
        }
        output[r*rank + c] = running;
    }
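(The processor code above uses the OpenCL C++ wrapper API, while the accelerator is standard OpenCL C. The reqd_work_group_size(16, 16, 1) attribute fixes the work-group geometry at compile time, which lets an FPGA compiler size the generated hardware for exactly that shape.)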

Software Approach: Steps in Code
The nine steps of the FPGA design walkthrough map directly onto the host code above:
Step 1: Create logic view of the platform, find compute devices (Context, getInfo)
Step 2: Load binaries for accelerators into memory (Program, build)
Step 3: Allocate memories to be used in the application (Buffer A, B, C)
Step 4: Define the accelerator function(s) and set parameters (Kernel, setArg)
Step 5: Create process to monitor the accelerator (CommandQueue)
Step 6: Transfer memory from processor to FPGA (enqueueWriteBuffer)
Step 7: Dispatch work to the accelerator (enqueueNDRangeKernel)
Step 8: Wait for results (wait on the completion event)
Step 9: Get results back (enqueueReadBuffer)

Software Approach: Summary
One processor program, multiple platforms
Accelerator described as software, runs on the FPGA
Nuance of shared vs. distributed memory hidden by APIs
FPGA hardware platforms captured in an industry standard API

OpenCL

OpenCL
Industry standard for the development of cross-platform, vendor-agnostic, parallel programs
Provides a standard software API across hardware vendors
Enables cross-platform functional portability of an application without coding changes
A program written once can be deployed on multiple hardware systems; application performance is driven by silicon and compiler technology.

OpenCL Platform
The platform defines the hardware on which the user application is executed
A minimum platform has 1 host and 1 compute device
–Compute device can be CPU, DSP, FPGA, GPU
[Diagram: a host connected by an interconnect to compute devices of type A and type B.]
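As a minimal sketch of how a host discovers this platform structure through the standard C API (error handling omitted, variable names illustrative):

    #include <CL/cl.h>

    cl_platform_id platform;
    cl_device_id device;

    /* Pick the first platform, then the first accelerator-class device */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

FPGAs typically enumerate as CL_DEVICE_TYPE_ACCELERATOR, CPUs as CL_DEVICE_TYPE_CPU, and GPUs as CL_DEVICE_TYPE_GPU.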

OpenCL: Basic Terminology
Host Program: The application main() function. This code runs on a processor with the sole purpose of coordinating data transfer and launching compute units.
Kernel Code: The compute-intensive part of an application; the part of the program which benefits from parallelization and can be accelerated.
Runtime: Vendor-specific implementation of the standard OpenCL API functions, which handles the details of interacting with the platform.
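For instance, kernel code is just a C function marked __kernel, with work items distinguished by their global IDs; a minimal illustrative example (not from the slides):

    __kernel void vadd(__global const int* a,
                       __global const int* b,
                       __global int* c)
    {
        int i = get_global_id(0);   /* each work item handles one element */
        c[i] = a[i] + b[i];
    }

The host program loads this kernel, sets its three buffer arguments, and enqueues one work item per element; the runtime takes care of the rest.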

OpenCL Kernel Synthesis in Vivado HLS

Kernel Compilation Design Flow
[Flow diagram: kernels.cl (OpenCL language) enters the compiler; the software kernel targets the ARM, while the hardware kernel goes through Vivado HLS, Vivado IP Integrator, and Vivado to the fabric, producing the kernel binary.]

High Level Synthesis Core Technology
Code:

    void foo(int * a) {
        int i, input, temp1;
        int temp2, output;
        for (i = 0; i < 100; i++) {
            input = a[i];
            input++;
            temp1 = input * 4;
            temp2 = input * 100;
            output = temp1 + temp2;
            a[i] = output;
        }
    }

Control Data Flow Graph formation: the loop body becomes a graph of operations (MEMREAD, ++, two multiplies, +, MEMWRITE).
Constraints:
–Pipeline scheduler
–Loop Initiation Interval = 1
–Unbounded resources
–Unbounded latency

Module Selection
Define the available operators for each operation: 1-cycle pipelined, 4-cycle pipelined, or blocking operators.
[Diagram: the CDFG operations (MEMREAD, ++, two multiplies, +, MEMWRITE) matched against the operator library.]

Scheduling and Binding
Bind operations to operators and schedule the pipeline: parallel operations execute in the same stage, and shift registers carry values between stages.
[Diagram: the operations scheduled across pipeline stages 0 through 8.]

Kernel Compilation: Loop Pipelining
Xilinx OpenCL extension: execute iterations of a loop in parallel as a pipeline.

    kernel void foo(...) {
        ...
        __attribute__((xcl_pipeline_loop))
        for (int i = 0; i < 3; i++) {
            int idx = get_global_id(0)*3 + i;
            op_Read(idx);
            op_Compute(idx);
            op_Write(idx);
        }
        ...
    }

[Timing diagram: executed sequentially, the three loop iterations take 9 cycles; pipelined, the loop finishes in 5 cycles.]
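The cycle counts follow from the usual pipeline formula: with pipeline depth D, initiation interval II, and N iterations, execution time is approximately D + (N - 1) x II cycles, versus N x D cycles sequentially. Here D = 3 (read, compute, write), II = 1, and N = 3: sequential = 3 x 3 = 9 cycles; pipelined = 3 + 2 = 5 cycles, matching the diagram.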

OpenCL Kernel Execution Model on Zynq

OpenCL Runtime
[Diagram: the Zynq Processing System (dual ARM Cortex-A9 APU, DDR memory controller, AMBA switches) connected to the Programmable Logic through M_AXI4_GP and S_AXI4_HP/ACP.]
The host code drives four steps: 1. Initialize Runtime, 2. Allocate Buffers, 3. Configure Device, 4. Run Accelerator.
Host code:

    context = clCreateContextFromType(…);
    clGetDeviceIDs(…, &device_id, …);
    queue = clCreateCommandQueue(context, device_id, …);

    buf0 = clCreateBuffer(context, CL_MEM_READ_ONLY, …);
    buf1 = clCreateBuffer(context, CL_MEM_READ_WRITE, …);

    program = clCreateProgramWithBinary(…);
    clBuildProgram(program, …);
    kernel = clCreateKernel(program, "mykernel", …);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf0);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf1);
    clEnqueueWriteBuffer(queue, buf0, …);
    clEnqueueNDRangeKernel(queue, kernel, …);
    clEnqueueReadBuffer(queue, buf1, …);

The compiled OpenCL kernels, the FPGA bitstream, and system configuration info are packaged together in an .XCLBIN file.

OpenCL Runtime: Step by Step
1. Initialize Runtime: clCreateContextFromType, clGetDeviceIDs, and clCreateCommandQueue bring up the vendor runtime on the ARM.
2. Allocate Buffers: clCreateBuffer places buf0 and buf1 in DDR memory.
3. Configure Device: clCreateProgramWithBinary and clBuildProgram load the .XCLBIN, and the runtime configures the OpenCL accelerator into the Programmable Logic.
4. Run Accelerator: clSetKernelArg, clEnqueueWriteBuffer, clEnqueueNDRangeKernel, and clEnqueueReadBuffer move data and execute the kernel on the accelerator.
Device configuration, memory management, and execution are all handled by the OpenCL runtime.

Decryption Example with OpenCL Kernels Running on a Zynq Device

Zynq AES Decrypt
[Diagram: AES round structure (drawn for encrypt): a 160-byte key and 16-byte blocks; each 16-byte input block flows through Round 0 – Round 9 to a 16-byte output block.]

AES Decrypt Round
Each round transforms a 16-byte block:
–SBOX substitution using 256-byte lookup tables
–shiftrows / rotatecols
–add roundKey: bytewise XORs with the 16-byte round key

AES Kernel
Pipelined with respect to work items; OpenCL vector types (uchar16) carry whole 16-byte blocks.

    __kernel __attribute__ ((reqd_work_group_size(LOCALSIZE, 1, 1)))
    void AESDecrypt(__global uchar16 *output,
                    __global uchar16 *input,
                    __global uchar16 *roundKey)
    {
        __attribute__((xcl_pipeline_workitems)) {
            __private unsigned int localIndex = get_local_id(0);
            __private unsigned int globalIndex = get_global_id(0);
            __private uchar16 block0, block1;

            /* data read */
            block0 = input[globalIndex];
            /* addRoundKey */
            block0 ^= roundKey[ROUNDS];
            /* AES rounds */
            DecryptRound(9, &block0, roundKey)
            DecryptRound(8, &block0, roundKey)
            DecryptRound(7, &block0, roundKey)
            DecryptRound(6, &block0, roundKey)
            DecryptRound(5, &block0, roundKey)
            DecryptRound(4, &block0, roundKey)
            DecryptRound(3, &block0, roundKey)
            DecryptRound(2, &block0, roundKey)
            DecryptRound(1, &block0, roundKey)
            /* shiftRowsInv */
            block0 = shiftRowsInv16(block0);
            /* subBytes */
            block0 = box_constant16c(RSBOX(0,0), RSBOX(0,1), RSBOX(0,2),
                                     RSBOX(0,3), RSBOX(0,4), RSBOX(0,5),
                                     RSBOX(0,6), RSBOX(0,7), block0);
            /* data write */
            output[globalIndex] = block0 ^ roundKey[0];
        }
    }
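On the host side, this work-item form is launched with one work item per 16-byte block. A hedged sketch using the C API (nblocks and the buffer handles are illustrative, not from the slides):

    size_t global = nblocks;      /* one work item per 16-byte AES block */
    size_t local  = LOCALSIZE;    /* must match reqd_work_group_size */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &out_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &key_buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);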

Loop Pipeline Form
Set workgroup size = 1 and pipeline a loop over all blocks instead of pipelining across work items:

    __kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
    void AESDecrypt(__global uchar16 *output,
                    __global uchar16 *input,
                    __global uchar16 *roundKey,
                    int blocks)
    {
        unsigned int globalIndex;
        __attribute__((xcl_pipeline_loop))
        for (globalIndex = 0; globalIndex < blocks; globalIndex++) {
            __private uchar16 block0, block1;
            block0 = input[globalIndex];
            /* addRoundKey */
            block0 ^= roundKey[ROUNDS];
            DecryptRound(9, &block0, roundKey)
            DecryptRound(8, &block0, roundKey)
            DecryptRound(7, &block0, roundKey)
            DecryptRound(6, &block0, roundKey)
            DecryptRound(5, &block0, roundKey)
            DecryptRound(4, &block0, roundKey)
            DecryptRound(3, &block0, roundKey)
            DecryptRound(2, &block0, roundKey)
            DecryptRound(1, &block0, roundKey)
            /* shiftRowsInv */
            block0 = shiftRowsInv16(block0);
            /* subBytes */
            block0 = box_constant16c(RSBOX(0,0), RSBOX(0,1), RSBOX(0,2),
                                     RSBOX(0,3), RSBOX(0,4), RSBOX(0,5),
                                     RSBOX(0,6), RSBOX(0,7), block0);
            output[globalIndex] = block0 ^ roundKey[0];
        }
    }
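Since this form processes every block inside the kernel, the corresponding launch is a single task rather than an NDRange; a sketch assuming an OpenCL 1.x runtime (variable names illustrative):

    int nblocks = nbytes / 16;                    /* 16-byte AES blocks */
    clSetKernelArg(kernel, 3, sizeof(int), &nblocks);
    clEnqueueTask(queue, kernel, 0, NULL, NULL);  /* one work item total */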

Zynq System Architecture
[Diagram: the AES compute unit with its local memory sits in the Zynq PL, controlled over AXI-Lite via GP0/GP1 and reading/writing global memory in DDR through HP0/ACP over a 100 MHz, 64-bit AXI MM interface; host memory also resides in DDR.]

Zynq Result
–100 MHz, Initiation Interval (II) = 2: a 16-byte block crosses the 64-bit bidirectional AXI MM interface in two transfers, therefore 2 cycles / block
–80 BRAMs: 10 rounds x 16 SBOX ROMs = 160 SBOX ROMs, mapped to 80 BRAMs with both ports used

Performance | Speedup | Power (W) | Power Efficiency (Speedup/W)
Embedded CPU Dual Core (800 MHz) | 1 | 0.5 | –
Server CPU Quad Core (3.2 GHz) | 82 | 60 | –
GPU Low End (8 Work Groups) | 27 | 25 | –
Zynq PS + PL | – | – | –
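From these figures the raw kernel throughput follows directly: 100 MHz / 2 cycles per block = 50 M blocks/s, and at 16 bytes per block that is roughly 800 MB/s of decrypted data (an upper bound, assuming the AXI interface keeps the pipeline full).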

Thank You