High Efficiency Computing with OpenCL and FPGAs Fernando Martinez June 2014.

XILINX CONFIDENTIAL. FPGA Refresher

Field Programmable Computation Fabric
– Processor System
– Select IOs
– DDR 1866 MHz
– CMT
– BRAM 36 Kb
– DSP 18x18 Integer MultAdd
– XADC
– Secure Configuration
– PCIe Gen3 x8
– Interlaken
– Multi Gigabit Transceivers 2.5 – 30 Gb/s

Zynq-7020 SoC Device
Processor System (PS)
– ARM Cortex-A9 MPCore
– Standard Peripherals
– 32-bit DDR3 / LPDDR2 controller
– 54 Multi-Use IOs
– 73 DDR IOs
Programmable Logic (PL)
– 85 K Logic Cells
– 106 K FFs
– 140 32-Kb Block RAM
– 220 DSP Blocks
– Dual 12-bit ADC
– Secure configuration engine
– 4 Clock Management Tiles
– 200 Select IO (1.2-3.3 V)

What are FPGAs Good For
Application | FPGA Feature | Key Benefit
Routers / Switches | Flexible I/O | Multi standard support, Terabit / sec bandwidth
Protocol Conversion, High Frequency Trading | I/O connected directly into compute fabric | Nanosecond latency from pin to compute block
Digital Signal Processing (Integer / Fixed Point) | Dedicated MAC blocks and FPGA fabric | 25 Tera Int Ops @ 250 MHz
Encryption / Compression | 6-LUT fabric | 3M 1-bit compares / sec @ 250 MHz

FPGA Power Advantage
Typical Peak Power Consumption
Device | Power
Zynq SoC | 1 W
High End Virtex FPGA | 30 W
4-Core Server CPU | 45 – 145 W
High End GPU | 200 W

What Holds Back the FPGA

FPGA Design: Example Description
Software application to run on
– ZC702 as a standalone system
– x86 + FPGA PCIe card
Application has 1 accelerator
Accelerator does not require direct access to FPGA I/O

FPGA Design
[Diagram labels, Zynq platform: APU Dual ARM Cortex-A9; DDR Memory Controller; AMBA Switches; Programmable Logic; M_AXI4_GP; S_AXI4_HP/ACP; Memory]
[Diagram labels, PCIe platform: PC x86; Memory; DMA; PCIe Card with V7-690t; AXI Interconnect; MIG; Memory]
Step 1: Create logic view of the platform, find where accelerators can run
Step 2: Load binaries for accelerator into memory (FPGA Bitstream)
Step 3: Allocate memories to be used in the application (buffers A, B, C and A’, B’, C’)
Step 4: Define the accelerator function(s), set parameters (check for accelerator)
Step 5: Create process to monitor the accelerator (Q)
Step 6: Transfer memory from processor to FPGA (Shared Memory Architecture: nothing happens in this step)
Step 7: Dispatch work to the accelerator (matrix_mult Kernel)
Step 8: Wait for results
Step 9: Get results back (Shared Memory Architecture: nothing happens in this step)

FPGA Design: Example Summary
One application executing on two platforms
Memory buffer allocation
– Shared Memory (Zynq) vs. Separate Memories (x86 + PCIe)
Data transfer
– Pass by reference (Zynq) vs. Pass by value (x86 + PCIe)
– AXI interconnect vs. AXI + PCIe interconnect
Loading and launching accelerators
Result collection
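The difference between the two platforms can be sketched in plain C, with no FPGA or OpenCL involved. Everything named here is hypothetical: accel_scale is a stand-in "accelerator" that doubles each element, and device_mem is an ordinary array standing in for memory on the PCIe card. With shared memory the accelerator works on the host buffer through a pointer, so steps 6 and 9 do nothing; with separate memories the same steps become explicit copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4

/* Hypothetical stand-in for an accelerator: doubles each element in place. */
static void accel_scale(int *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2;
}

/* Shared memory (Zynq-style): pass by reference, no transfer steps. */
static void run_shared(int *host_buf, size_t n)
{
    accel_scale(host_buf, n);   /* accelerator reads and writes host memory */
}

/* Separate memories (x86 + PCIe-style): pass by value via explicit copies. */
static void run_separate(int *host_buf, size_t n)
{
    int device_mem[N];          /* stands in for memory on the card */
    assert(n <= N);
    memcpy(device_mem, host_buf, n * sizeof(int));  /* step 6: host to device */
    accel_scale(device_mem, n);                     /* step 7: run accelerator */
    memcpy(host_buf, device_mem, n * sizeof(int));  /* step 9: device to host */
}
```

Both functions leave the same result in the host buffer; only the data movement around the accelerator call differs.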

Vision for FPGA Programming

How a Software Programmer Wants to Use an FPGA
Familiar tools using standard programming languages with no restrictions
Ability to target multiple boards without being a hardware expert

Software Approach

Processor code:

    // Fragment. Assumes the OpenCL C++ bindings with the cl and std namespaces
    // in scope, and source, matrix_A, matrix_B, matrix_C defined elsewhere.
    Context context(CL_DEVICE_TYPE_ACCELERATOR);
    vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    Program program(context, devices, source);
    program.build(devices);
    Buffer A(context, CL_MEM_READ_ONLY, sizeof(matrix_A));
    Buffer B(context, CL_MEM_READ_ONLY, sizeof(matrix_A));
    Buffer C(context, CL_MEM_WRITE_ONLY, sizeof(matrix_A));
    Kernel kernel(program, "matrix_mult");
    kernel.setArg(0, A);
    kernel.setArg(1, B);
    kernel.setArg(2, C);
    CommandQueue Q(context, devices[0]);
    Event finished;
    Q.enqueueWriteBuffer(A, CL_TRUE, 0, sizeof(matrix_A), matrix_A);
    Q.enqueueWriteBuffer(B, CL_TRUE, 0, sizeof(matrix_A), matrix_B);
    // 2-D range: 16 x 16 work items in a single 16 x 16 work group
    Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(16, 16), NDRange(16, 16), NULL, &finished);
    finished.wait();
    Q.enqueueReadBuffer(C, CL_TRUE, 0, sizeof(matrix_A), matrix_C);

Accelerator code:

    __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
    void matrix_mult(__global int* a, __global int* b, __global int* output)
    {
        int r = get_global_id(0);        // row handled by this work item
        int c = get_global_id(1);        // column handled by this work item
        int rank = get_global_size(0);   // matrix dimension
        int running = 0;
        for (int index = 0; index < rank; index++) {
            int aIndex = r * rank + index;
            int bIndex = index * rank + c;
            running += a[aIndex] * b[bIndex];
        }
        output[r * rank + c] = running;
    }

Software Approach: the steps in the processor code
Step 1: Create logic view of the platform, find compute devices (Context context, devices)
Step 2: Load binaries for accelerators into memory (Program program, program.build)
Step 3: Allocate memories to be used in the application (Buffer A, B, C)
Step 4: Define the accelerator function(s) and set parameters (Kernel kernel, kernel.setArg)
Step 5: Create process to monitor the accelerator (CommandQueue Q)
Step 6: Transfer memory from processor to FPGA (Q.enqueueWriteBuffer)
Step 7: Dispatch work to the accelerator (Q.enqueueNDRangeKernel)
Step 8: Wait for results (wait on the finished event)
Step 9: Get results back (Q.enqueueReadBuffer)
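One way to check the matrix multiply kernel's index arithmetic without an FPGA is to emulate the 2-D work range on a CPU. The sketch below is plain C, not OpenCL: work_item repeats the kernel body with the global ids passed in as arguments, and emulate_ndrange (a hypothetical helper) plays the role of the runtime by calling it once per (r, c) pair. RANK is fixed at 16 on the assumption that the 16, 16 arguments in the processor code describe a 16 x 16 range.

```c
#include <assert.h>

#define RANK 16   /* assumed matrix dimension, from the 16 x 16 work size */

/* Kernel body for one work item.
 * r and c replace get_global_id(0) and get_global_id(1);
 * rank replaces get_global_size(0). */
static void work_item(const int *a, const int *b, int *output,
                      int r, int c, int rank)
{
    assert(rank > 0 && r >= 0 && r < rank && c >= 0 && c < rank);
    int running = 0;
    for (int index = 0; index < rank; index++)
        running += a[r * rank + index] * b[index * rank + c];
    output[r * rank + c] = running;
}

/* Hypothetical helper: visits every global id, as the runtime would. */
static void emulate_ndrange(const int *a, const int *b, int *output)
{
    for (int r = 0; r < RANK; r++)
        for (int c = 0; c < RANK; c++)
            work_item(a, b, output, r, c, RANK);
}
```

On the FPGA the work items need not run in this order or one at a time; the loop nest only reproduces the result, not the execution schedule.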

Software Approach: Summary
One processor program, multiple platforms
Accelerator described as software runs on the FPGA
Nuance of shared vs. distributed memory hidden by APIs
FPGA hardware platforms captured in industry standard API.

OpenCL

OpenCL
Industry standard for the development of cross-platform, vendor-agnostic, parallel programs
Provides a standard software API across hardware vendors
Enables cross-platform functional portability of an application without coding changes
Program written once can be deployed on multiple hardware systems. Application performance driven by silicon and compiler technology.

OpenCL Platform
Platform defines the hardware on which the user application is executed
Minimum platform has 1 host and 1 compute device
– Compute device can be CPU, DSP, FPGA, GPU
[Diagram labels: Platform; Host; Interconnect; Compute Device Type A; Compute Device Type B]

OpenCL: Basic Terminology
Host Program: The application main() function. This code runs on a processor with the sole purpose of coordinating data transfer and launching compute units.
Kernel Code: The compute-intensive part of an application. The part of the program which benefits from parallelization and can be accelerated.
Runtime: Vendor-specific implementation of the standard OpenCL API functions, which handles details of interacting with the platform.

OpenCL Kernel Synthesis in Vivado HLS

Kernel Compilation Design Flow
[Diagram labels: Hardware; kernels.cl; OpenCL language; Compiler; Vivado HLS; Kernel_binary; Vivado; ARM; Fabric; Software Kernel; Vivado IPI; Kernel]

High Level Synthesis Core Technology
Code:

    void foo(int *a) {
        int i, input, temp1;
        int temp2, output;
        for (i = 0; i < 100; i++) {
            input = a[i];             /* memory read */
            input++;
            temp1 = input * 4;
            temp2 = input * 100;
            output = temp1 + temp2;
            a[i] = output;            /* memory write */
        }
    }

Control Data Flow Graph Formation
[Diagram labels: MEMREAD; ++; ++; X; X; 4; 100; +; MEMWRITE]
Constraints:
– Pipeline scheduler
– Loop Initiation Interval = 1
– Unbounded Resources
– Unbounded Latency
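As a quick functional check of the example, the loop body reduces to a[i] = (a[i] + 1) * 104, because input * 4 + input * 100 = input * 104. The sketch below keeps the increment, the two multiplies, and the add as separate statements, matching the operations listed for the data flow graph, and adds a hypothetical reference function foo_ref for comparison. Whether an HLS tool would merge the two multiplies is not stated in the slides.

```c
#include <assert.h>
#include <stddef.h>

#define LEN 100   /* loop trip count from the example */

/* The example loop: one read, one increment, two multiplies,
 * one add, and one write per iteration. */
static void foo(int *a)
{
    assert(a != NULL);
    for (int i = 0; i < LEN; i++) {
        int input = a[i];          /* MEMREAD */
        input++;                   /* ++ */
        int temp1 = input * 4;     /* X with constant 4 */
        int temp2 = input * 100;   /* X with constant 100 */
        a[i] = temp1 + temp2;      /* + followed by MEMWRITE */
    }
}

/* Hypothetical reference: the same result in closed form. */
static int foo_ref(int x)
{
    return (x + 1) * 104;
}
```

Running foo over an array and comparing each element with foo_ref of its original value confirms the two forms agree.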

Module Selection
Define available operators
[Diagram labels: Operations: MEMREAD; ++; ++; X; X; 4; 100; +; MEMWRITE. Operators: 1 Cycle Pipeline; 4 Cycle Pipeline; Blocking Operators]

Scheduling and Binding
Bind Operations to Operators, Schedule Pipeline + Parallel Operations
[Diagram labels: MEMREAD; ++; ++; X; X; 4; 100; +; MEMWRITE; Shift Reg; Stage 0 through Stage 8]

Kernel Compilation: Loop Pipelining

    kernel void foo(...) {
        ...
        __attribute__((xcl_pipeline_loop))
        for (int i = 0; i < 3; i++) {
            int idx = get_global_id(0) * 3 + i;
            op_Read(idx);
            op_Compute(idx);
            op_Write(idx);
        }
        ...
    }

Xilinx OpenCL extension
Execute iterations of loop in parallel as a pipeline
[Diagram: execution time of the loop, 9 cycles without pipelining vs. 5 cycles with pipelining]
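The 9-cycle and 5-cycle figures are consistent with a simple model in which each of the three operations (read, compute, write) takes one cycle and a new loop iteration can start every cycle. That model is an assumption made here for illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Cycles when each iteration runs to completion before the next starts. */
static int sequential_cycles(int iterations, int stages)
{
    assert(iterations > 0 && stages > 0);
    return iterations * stages;
}

/* Cycles when a new iteration enters the pipeline every ii cycles:
 * the first iteration takes `stages` cycles, each later one adds ii. */
static int pipelined_cycles(int iterations, int stages, int ii)
{
    assert(iterations > 0 && stages > 0 && ii > 0);
    return stages + (iterations - 1) * ii;
}
```

With 3 iterations and 3 one-cycle stages, sequential_cycles(3, 3) gives 9 and pipelined_cycles(3, 3, 1) gives 5.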

OpenCL Kernel Execution Model on Zynq

XILINX CONFIDENTIAL. Processing System DDR Memory Controller AMBA ® Switches APU Dual ARM Cortex-A9 AMBA ® Switches Programmable Logic M_AXI4_GP S_AXI4_HP/ACP OpenCL Runtime Page 43 Memory 1.Initialize Runtime 2.Allocate Buffers 3.Configure Device 4.Run Accelerator context = clCreateContextFromType(…); clGetDeviceIDs(…, &device_id, …); queue = clCreateCommandQueue(context, device_id, …); buf0 = clCreateBuffer(context, CL_MEM_READ_ONLY, …); buf1 = clCreateBuffer(context, CL_MEM_READ_WRITE, …); program = clCreateProgramWithBinary(…); clBuildProgram(program, …); kernel = clCreateKernel(program, “mykernel", …); clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf0); clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf1); clEnqueueWriteBuffer(queue, buf0, …); clEnqueueNDRangeKernel(queue, kernel, …); clEnqueueReadBuffer(queue, buf1, …); Host Code 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 
10101010 FPGA Bitstream System Config. Info Compiled OpenCL Kernels

XILINX CONFIDENTIAL. Processing System DDR Memory Controller AMBA ® Switches APU Dual ARM Cortex-A9 AMBA ® Switches Programmable Logic M_AXI4_GP S_AXI4_HP/ACP OpenCL Runtime Page 44 OpenCL Runtime OpenCL Runtime context = clCreateContextFromType(…); clGetDeviceIDs(…, &device_id, …); queue = clCreateCommandQueue(context, device_id, …); buf0 = clCreateBuffer(context, CL_MEM_READ_ONLY, …); buf1 = clCreateBuffer(context, CL_MEM_READ_WRITE, …); program = clCreateProgramWithBinary(…); clBuildProgram(program, …); kernel = clCreateKernel(program, “mykernel", …); clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf0); clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf1); clEnqueueWriteBuffer(queue, buf0, …); clEnqueueNDRangeKernel(queue, kernel, …); clEnqueueReadBuffer(queue, buf1, …); Host Code Memory 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 FPGA Bitstream System Config. 
Info Compiled OpenCL Kernels.XCLBIN 1 1.Initialize Runtime 2.Allocate Buffers 3.Configure Device 4.Run Accelerator 1

XILINX CONFIDENTIAL. Processing System DDR Memory Controller AMBA ® Switches APU Dual ARM Cortex-A9 AMBA ® Switches Programmable Logic M_AXI4_GP S_AXI4_HP/ACP OpenCL Runtime Page 45 OpenCL Runtime OpenCL Runtime context = clCreateContextFromType(…); clGetDeviceIDs(…, &device_id, …); queue = clCreateCommandQueue(context, device_id, …); buf0 = clCreateBuffer(context, CL_MEM_READ_ONLY, …); buf1 = clCreateBuffer(context, CL_MEM_READ_WRITE, …); program = clCreateProgramWithBinary(…); clBuildProgram(program, …); kernel = clCreateKernel(program, “mykernel", …); clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf0); clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf1); clEnqueueWriteBuffer(queue, buf0, …); clEnqueueNDRangeKernel(queue, kernel, …); clEnqueueReadBuffer(queue, buf1, …); 2 Host Code Memory buf0buf1 2 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 FPGA Bitstream System Config. 
Info Compiled OpenCL Kernels.XCLBIN 1.Initialize Runtime 2.Allocate Buffers 3.Configure Device 4.Run Accelerator

XILINX CONFIDENTIAL. Processing System DDR Memory Controller AMBA ® Switches APU Dual ARM Cortex-A9 AMBA ® Switches Programmable Logic M_AXI4_GP S_AXI4_HP/ACP OpenCL Runtime Page 46 buf0buf1 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 10110111 01011001 01101011 00111010 10001001 01100100 11100100 10100111 00100001 11100100 01111100 01111000 11111000 01001100 11110001 11011011 … 11010111 01011111 00111011 11010010 11011011 00110000 01000000 10100100 11011101 01100101 11100101 10000111 11100000 11110001 00011100 10101010 FPGA Bitstream System Config. 
Info OpenCL Runtime OpenCL Runtime context = clCreateContextFromType(…); clGetDeviceIDs(…, &device_id, …); queue = clCreateCommandQueue(context, device_id, …); buf0 = clCreateBuffer(context, CL_MEM_READ_ONLY, …); buf1 = clCreateBuffer(context, CL_MEM_READ_WRITE, …); program = clCreateProgramWithBinary(…); clBuildProgram(program, …); kernel = clCreateKernel(program, “mykernel", …); clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf0); clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf1); clEnqueueWriteBuffer(queue, buf0, …); clEnqueueNDRangeKernel(queue, kernel, …); clEnqueueReadBuffer(queue, buf1, …); 3 Compiled OpenCL Kernels.XCLBIN Host Code Memory OpenCL Accelerator 3 1.Initialize Runtime 2.Allocate Buffers 3.Configure Device 4.Run Accelerator

XILINX CONFIDENTIAL. OpenCL Runtime. Highlighted step: 4. Run Accelerator
Steps: 1. Initialize Runtime 2. Allocate Buffers 3. Configure Device 4. Run Accelerator
Diagram labels: Processing System, DDR Memory Controller, AMBA® Switches, APU Dual ARM Cortex-A9, Programmable Logic, M_AXI4_GP, S_AXI4_HP/ACP, buf0, buf1, Host Code, Memory, OpenCL Accelerator, Compiled OpenCL Kernels .XCLBIN (FPGA Bitstream, System Config. Info)
The host code listing is repeated unchanged on this slide.

XILINX CONFIDENTIAL. OpenCL Runtime. Highlighted step: 4. Run Accelerator
Summary caption: Device Configuration, Memory Management, Execution
The host code listing, the four steps, and the diagram are repeated unchanged from the previous slide.
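The host code hands a precompiled binary to clCreateProgramWithBinary, which takes the binary as an in-memory byte array plus its length, so the .XCLBIN file has to be loaded into host memory first. A minimal sketch of that loading step in plain C follows. The helper name read_binary_file is hypothetical (it is not part of OpenCL or of any Xilinx runtime), and error handling is reduced to returning NULL.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not an OpenCL API: read a whole file into a
 * malloc'd buffer. Returns NULL on failure; the caller frees the buffer. */
unsigned char *read_binary_file(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    unsigned char *data = NULL;
    long len = -1;

    /* Find the file length by seeking to the end, then rewind. */
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) >= 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        data = malloc(len > 0 ? (size_t)len : 1);
        if (data != NULL && fread(data, 1, (size_t)len, f) != (size_t)len) {
            free(data);   /* short read: treat as failure */
            data = NULL;
        }
    }
    fclose(f);

    if (data != NULL)
        *size_out = (size_t)len;
    return data;
}
```

The returned pointer and size are the kind of values a host program would pass as the binaries and lengths arguments of clCreateProgramWithBinary.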

XILINX CONFIDENTIAL. Decryption Example with OpenCL Kernels Running on a Zynq Device

XILINX CONFIDENTIAL. Zynq AES Decrypt. Diagram labels: Input (16 byte block); 160 byte key; AES Encrypt; Round 0 through Round 9; Output (16 byte block)

XILINX CONFIDENTIAL. AES Decrypt Round. Diagram labels: 16 byte block; SBOX (256 byte lookup tables); shiftrows; SBOX; 16 byte round key; bytewise XORs; 16 byte block; add roundKey; rotatecols
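The round diagram names two building blocks that are easy to model in software: the SBOX stage, a 256 byte lookup table, and the add roundKey stage, a set of bytewise XORs. The sketch below builds the standard AES S-box and its inverse in plain C and applies a round key. It is a host-side reference model, not the deck's RSBOX macros or box_constant16c; the function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by s bits (0 < s < 8). */
static uint8_t rotl8(uint8_t x, unsigned s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Fill sbox[] with the standard AES S-box and rsbox[] with its inverse. */
void build_aes_sboxes(uint8_t sbox[256], uint8_t rsbox[256])
{
    uint8_t p = 1, q = 1;
    do {
        /* p walks every nonzero field element: multiply by 3 in GF(2^8). */
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));
        /* q stays the multiplicative inverse of p: divide by 3. */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (q & 0x80) ? 0x09 : 0;
        /* Affine transformation of the inverse, then XOR with 0x63. */
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63; /* zero has no inverse and is defined separately */

    /* The decrypt-side table simply inverts the mapping. */
    for (int i = 0; i < 256; i++)
        rsbox[sbox[i]] = (uint8_t)i;
}

/* add roundKey: bytewise XOR of a 16 byte block with a 16 byte round key. */
void add_round_key(uint8_t block[16], const uint8_t round_key[16])
{
    for (int i = 0; i < 16; i++)
        block[i] ^= round_key[i];
}
```

Because XOR is its own inverse, applying add_round_key twice with the same key restores the block, which is why the same stage serves both encryption and decryption.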

XILINX CONFIDENTIAL. AES Kernel
Callouts: OpenCL vector types; Pipeline with respect to Work Items; Data read and write; AES Rounds
__kernel __attribute__ ((reqd_work_group_size(LOCALSIZE,1,1)))
void AESDecrypt(__global uchar16 *output, __global uchar16 *input, __global uchar16 *roundKey)
{
  __attribute__((xcl_pipeline_workitems)) {
    __private unsigned int localIndex=get_local_id(0);
    __private unsigned int globalIndex=get_global_id(0);
    __private uchar16 block0,block1;
    // Data read: one 16 byte block per work item
    block0 = input[globalIndex];
    //addRoundKey
    block0 ^= roundKey[ROUNDS];
    DecryptRound(9,&block0,roundKey);
    DecryptRound(8,&block0,roundKey);
    DecryptRound(7,&block0,roundKey);
    DecryptRound(6,&block0,roundKey);
    DecryptRound(5,&block0,roundKey);
    DecryptRound(4,&block0,roundKey);
    DecryptRound(3,&block0,roundKey);
    DecryptRound(2,&block0,roundKey);
    DecryptRound(1,&block0,roundKey);
    //shiftRowsInv
    block0 = shiftRowsInv16(block0);
    //subBytes
    block0 = box_constant16c(RSBOX(0,0),RSBOX(0,1),RSBOX(0,2), RSBOX(0,3),RSBOX(0,4),RSBOX(0,5), RSBOX(0,6),RSBOX(0,7), block0);
    // Data write: final addRoundKey with round key 0
    output[globalIndex] = block0 ^ roundKey[0];
  }
}
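The kernel calls shiftRowsInv16, whose source is not shown on the slide. As a reference for what an inverse ShiftRows step does, here is a minimal C sketch. It assumes the standard AES column-major state layout (byte index = row + 4 * column); the byte order inside the deck's uchar16 may differ, and the function names below are illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* Inverse ShiftRows: rotate row r of the 4x4 state right by r positions.
 * Assumes column-major layout: block[row + 4 * col]. */
void shift_rows_inv(uint8_t block[16])
{
    uint8_t in[16];
    memcpy(in, block, 16);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            block[r + 4 * c] = in[r + 4 * ((c - r + 4) % 4)];
}

/* Forward ShiftRows (rotate row r left by r), included so the
 * inverse can be checked by a round trip. */
void shift_rows(uint8_t block[16])
{
    uint8_t in[16];
    memcpy(in, block, 16);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            block[r + 4 * c] = in[r + 4 * ((c + r) % 4)];
}
```

In hardware this step is pure wiring, a fixed permutation of the 16 bytes, so it costs no logic in the pipeline.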

XILINX CONFIDENTIAL. Loop Pipeline Form
Callouts: Set Workgroup size = 1; Form Loop pipeline
__kernel __attribute__ ((reqd_work_group_size(1,1,1)))
void AESDecrypt(__global uchar16 *output, __global uchar16 *input, __global uchar16 *roundKey, int blocks)
{
  unsigned int globalIndex;
  __attribute__((xcl_pipeline_loop))
  for(globalIndex=0;globalIndex<blocks;globalIndex++){
    // globalIndex is the loop counter here; it must not be redeclared from get_global_id(0) inside the loop
    __private unsigned int localIndex=get_local_id(0);
    __private uchar16 block0,block1;
    block0 = input[globalIndex];
    //addRoundKey
    block0 ^= roundKey[ROUNDS];
    DecryptRound(9,&block0,roundKey);
    DecryptRound(8,&block0,roundKey);
    DecryptRound(7,&block0,roundKey);
    DecryptRound(6,&block0,roundKey);
    DecryptRound(5,&block0,roundKey);
    DecryptRound(4,&block0,roundKey);
    DecryptRound(3,&block0,roundKey);
    DecryptRound(2,&block0,roundKey);
    DecryptRound(1,&block0,roundKey);
    //shiftRowsInv
    block0 = shiftRowsInv16(block0);
    //subBytes
    block0 = box_constant16c(RSBOX(0,0),RSBOX(0,1),RSBOX(0,2), RSBOX(0,3),RSBOX(0,4),RSBOX(0,5), RSBOX(0,6),RSBOX(0,7), block0);
    output[globalIndex] = block0 ^ roundKey[0];
  }
}
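The two kernel forms express the same computation: in the work-item form the runtime invokes the body once per global ID, while in the loop form a single work item iterates over all blocks. The plain C model below illustrates that equivalence. It is a sketch under stated assumptions: block16 stands in for uchar16, process_block applies only an addRoundKey-style XOR in place of the full AES rounds, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the OpenCL uchar16 vector type. */
typedef struct { uint8_t b[16]; } block16;

/* Placeholder for the per-block work. The real kernel runs the AES
 * rounds here; this model only XORs with one round key. */
static block16 process_block(block16 in, const block16 *key)
{
    for (int i = 0; i < 16; i++)
        in.b[i] ^= key->b[i];
    return in;
}

/* Work-item form: the body handles exactly one block, selected by its
 * global ID. */
static void work_item_body(block16 *out, const block16 *in,
                           const block16 *key, size_t global_id)
{
    out[global_id] = process_block(in[global_id], key);
}

/* Model of the runtime launching global_size work items. */
void ndrange_model(block16 *out, const block16 *in,
                   const block16 *key, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; gid++)
        work_item_body(out, in, key, gid);
}

/* Loop form: one work item, with the block count passed as an argument. */
void loop_form(block16 *out, const block16 *in,
               const block16 *key, int blocks)
{
    for (int i = 0; i < blocks; i++)
        out[i] = process_block(in[i], key);
}
```

With the loop form the host launches a single work item and passes the block count as a kernel argument, instead of sizing the NDRange to the number of blocks.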

XILINX CONFIDENTIAL. Zynq System Architecture. Diagram labels: DDR; HP0; ACP; AES Compute Unit 1; Local memory; Host Memory; Global Memory; GP1; GP0; Zynq PL; Zynq PS; AXI MM 100 MHz 64-bit; AXI MM; AXI LITE

XILINX CONFIDENTIAL. Zynq Result
–100 MHz
–II = 2. 64-bit bidirectional AXI MM interface, therefore 2 cycles / block
–80 BRAMs. 10 rounds, 160 Sbox ROMs, mapped to 80 BRAMs with both ports used

Performance
Platform                            Speedup   Power (W)    Power Efficiency (Speedup/W)
Embedded CPU Dual Core (800 MHz)    1         0.5 – 1.0    1 – 2
Server CPU Quad Core (3.2 GHz)      82        60 – 80      1.2
GPU Low End (8 Work Groups)         27        25 – 27      1.05
Zynq PS + PL                        48        1.5          32
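The figures on this slide can be cross-checked with a few lines of arithmetic. A 16 byte (128 bit) block crossing a 64-bit interface needs 2 cycles, which matches II = 2, and the last column is the speedup divided by the power. The C sketch below encodes those relations; the function names are assumptions, and the peak rate it yields (100 MHz / 2 cycles x 16 bytes = 800 MB/s) is an upper bound derived from the slide's numbers, not a measurement reported in the deck.

```c
/* Cycles needed to move one block across the interface:
 * ceil(block_bits / bus_bits). */
unsigned cycles_per_block(unsigned block_bits, unsigned bus_bits)
{
    return (block_bits + bus_bits - 1) / bus_bits;
}

/* Peak bytes per second for a pipeline that accepts one block every
 * ii_cycles clock cycles. */
double peak_bytes_per_second(double clock_hz, unsigned ii_cycles,
                             unsigned block_bytes)
{
    return clock_hz / ii_cycles * block_bytes;
}

/* Power efficiency as defined in the table: speedup per watt. */
double power_efficiency(double speedup, double watts)
{
    return speedup / watts;
}
```

For the Zynq row, power_efficiency(48, 1.5) gives 32, matching the table; the server CPU row gives between 82 / 80 and 82 / 60, consistent with the quoted 1.2.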

XILINX CONFIDENTIAL. Thank You
