High Efficiency Computing with OpenCL and FPGAs
Fernando Martinez
June 2014
XILINX CONFIDENTIAL. FPGA Refresher
Field Programmable Computation Fabric
  [Figure: FPGA fabric and its hard blocks: Processor System, Select IOs, DDR 1866 MHz, CMT, BRAM 36kB, DSP (18x18 integer multiply-add), XADC, Secure Configuration, PCIe Gen3 x8, Interlaken, multi-gigabit transceivers at 2.5 – 30 Gb/s]
Zynq-7020 SoC Device
  Processor System (PS)
    – ARM Cortex-A9 MPCore
    – Standard peripherals
    – 32-bit DDR3 / LPDDR2 controller
    – 54 Multi-Use IOs
    – 73 DDR IOs
  Programmable Logic (PL)
    – 85K Logic Cells
    – 106K FFs
    – Kb Block RAM
    – 220 DSP Blocks
    – Dual 12-bit ADC
    – Secure configuration engine
    – 4 Clock Management Tiles
    – 200 Select IO ( V)
What are FPGAs Good For
  Application | FPGA Feature | Key Benefit
  Routers / Switches | Flexible I/O | Multi standard support, Terabit / sec bandwidth
  Protocol Conversion, High Frequency Trading | I/O connected directly into compute fabric | Nanosecond latency from pin to compute block
  Digital Signal Processing (Integer / Fixed Point) | Dedicated MAC blocks and FPGA fabric | 25 Tera Int 250 MHz
  Encryption / Compression | 6-LUT fabric | 3M 1-bit compares / sec MHz
FPGA Power Advantage
  Typical Peak Power Consumption
  Device | Power
  Zynq SoC | 1 W
  High End Virtex FPGA | 30 W
  4-Core Server CPU | 45 – 145 W
  High End GPU | 200 W
What Holds Back the FPGA
FPGA Design: Example Description
  Software application to run on
    – ZC702 as a standalone system
    – x86 + FPGA PCIe card
  Application has 1 accelerator
  Accelerator does not require direct access to FPGA I/O
FPGA Design
  [Figure, repeated for every step: the two target platforms. Zynq: DDR memory controller, AMBA switches, APU (dual ARM Cortex-A9), programmable logic, M_AXI4_GP and S_AXI4_HP/ACP ports, memory. PC: x86 and memory, PCIe card with V7-690t, DMA, AXI interconnect, MIG, memory.]
  Step 1: Create logic view of the platform, find where accelerators can run
  Step 2: Load binaries for accelerator into memory (FPGA bitstream)
  Step 3: Allocate memories to be used in the application (buffers A, B, C; copies A', B', C' on the separate-memory platform)
  Step 4: Define the accelerator function(s), set parameters (check for accelerator)
  Step 5: Create process to monitor the accelerator (queue Q)
  Step 6: Transfer memory from processor to FPGA (shared memory architecture: nothing happens in this step)
  Step 7: Dispatch work to the accelerator (matrix_mult kernel)
  Step 8: Wait for results
  Step 9: Get results back (shared memory architecture: nothing happens in this step)
FPGA Design: Example Summary
  One application executing on two platforms
  Memory buffer allocation
    – Shared Memory (Zynq) vs. Separate Memories (x86 + PCIe)
  Data transfer
    – Pass by reference (Zynq) vs. Pass by value (x86 + PCIe)
    – AXI interconnect vs. AXI + PCIe interconnect
  Loading and launching accelerators
  Result collection
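The contrast between the two platforms can be sketched in plain C. This is a hypothetical model, not Xilinx or OpenCL API: the names platform_kind, device_buffer, buffer_write, buffer_read and buffer_release are invented for illustration. On the shared-memory platform a transfer only records the host pointer (pass by reference); on the separate-memory platform it copies the data (pass by value).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the two platforms in the example. */
typedef enum { PLATFORM_SHARED_MEMORY, PLATFORM_SEPARATE_MEMORY } platform_kind;

typedef struct {
    platform_kind kind;
    void *device_ptr;  /* memory the accelerator reads and writes */
    int owns_storage;  /* 1 if device_ptr was allocated by buffer_write */
    size_t size;
} device_buffer;

/* Step 6: make host data visible to the accelerator. Returns 0 on success. */
static int buffer_write(device_buffer *buf, platform_kind kind, void *host, size_t size)
{
    assert(buf != NULL && host != NULL && size > 0);
    buf->kind = kind;
    buf->size = size;
    if (kind == PLATFORM_SHARED_MEMORY) {
        buf->device_ptr = host;  /* pass by reference: nothing is copied */
        buf->owns_storage = 0;
        return 0;
    }
    buf->device_ptr = malloc(size);  /* pass by value: copy into device memory */
    if (buf->device_ptr == NULL)
        return -1;
    memcpy(buf->device_ptr, host, size);
    buf->owns_storage = 1;
    return 0;
}

/* Step 9: bring results back to the host. */
static void buffer_read(const device_buffer *buf, void *host)
{
    assert(buf != NULL && host != NULL);
    if (buf->kind == PLATFORM_SEPARATE_MEMORY)
        memcpy(host, buf->device_ptr, buf->size);
    /* Shared memory: the results are already in host memory. */
}

/* Frees the device copy, if one was made. */
static void buffer_release(device_buffer *buf)
{
    if (buf->owns_storage)
        free(buf->device_ptr);
    buf->device_ptr = NULL;
    buf->owns_storage = 0;
}
```

The application calls the same three functions on both platforms; only the platform_kind value differs, which is the property the example is after.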
Vision for FPGA Programming
How a Software Programmer Wants to Use an FPGA
  Familiar tools using standard programming languages with no restrictions
  Ability to target multiple boards without being a hardware expert
Software Approach
  Processor code:
    Context context(CL_DEVICE_TYPE_ACCELERATOR);
    vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    Program program(context, devices, source);
    program.build(devices);
    Buffer A(context, CL_MEM_READ_ONLY, sizeof(matrix_A));
    Buffer B(context, CL_MEM_READ_ONLY, sizeof(matrix_A));
    Buffer C(context, CL_MEM_WRITE_ONLY, sizeof(matrix_A));
    Kernel kernel(program, "matrix_mult");
    kernel.setArg(0, A);
    kernel.setArg(1, B);
    kernel.setArg(2, C);
    CommandQueue Q(context, devices[0]);
    Event finished;
    Q.enqueueWriteBuffer(A, CL_TRUE, 0, sizeof(matrix_A), matrix_A);
    Q.enqueueWriteBuffer(B, CL_TRUE, 0, sizeof(matrix_A), matrix_B);
    // 2-D range: 16 x 16 global size, 16 x 16 work-group size
    Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(16, 16), NDRange(16, 16), NULL, &finished);
    finished.wait();
    Q.enqueueReadBuffer(C, CL_TRUE, 0, sizeof(matrix_A), matrix_C);
  Accelerator code:
    __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
    void matrix_mult(__global int* a, __global int* b, __global int* output)
    {
        int r = get_global_id(0);       // row handled by this work-item
        int c = get_global_id(1);       // column handled by this work-item
        int rank = get_global_size(0);  // matrices are rank x rank
        int running = 0;
        for (int index = 0; index < rank; index++) {
            int aIndex = r*rank + index;
            int bIndex = index*rank + c;
            running += a[aIndex] * b[bIndex];
        }
        output[r*rank + c] = running;
    }
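The kernel's indexing can be checked on an ordinary CPU by emulating the NDRange in plain C. A minimal sketch, assuming a square rank x rank problem: work_item and run_ndrange are hypothetical names, the explicit r, c and rank arguments stand in for get_global_id(0), get_global_id(1) and get_global_size(0), and the loops run the work-items one after another rather than in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the kernel for the single work-item at global id (r, c). */
static void work_item(const int *a, const int *b, int *output,
                      size_t r, size_t c, size_t rank)
{
    int running = 0;
    for (size_t index = 0; index < rank; index++)
        running += a[r * rank + index] * b[index * rank + c];
    output[r * rank + c] = running;
}

/* Emulates a rank x rank 2-D NDRange by visiting every global id in turn. */
static void run_ndrange(const int *a, const int *b, int *output, size_t rank)
{
    assert(a != NULL && b != NULL && output != NULL);
    for (size_t r = 0; r < rank; r++)
        for (size_t c = 0; c < rank; c++)
            work_item(a, b, output, r, c, rank);
}
```

Because each work-item writes a distinct output element and reads only the inputs, the order in which the emulation visits the ids does not affect the result.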
Software Approach: the same processor code, walked through step by step
  Step 1: Create logic view of the platform, find compute devices (Context and device list)
  Step 2: Load binaries for accelerators into memory (Program, build)
  Step 3: Allocate memories to be used in the application (Buffer A, B, C)
  Step 4: Define the accelerator function(s) and set parameters (Kernel, setArg)
  Step 5: Create process to monitor the accelerator (CommandQueue Q)
  Step 6: Transfer memory from processor to FPGA (enqueueWriteBuffer)
  Step 7: Dispatch work to the accelerator (enqueueNDRangeKernel)
  Step 8: Wait for results (wait on the finished event)
  Step 9: Get results back (enqueueReadBuffer)
Software Approach: Summary
  One processor program, multiple platforms
  Accelerator described as software runs on the FPGA
  Nuance of shared vs. distributed memory hidden by APIs
  FPGA hardware platforms captured in industry standard API.
OpenCL
  Industry standard for the development of cross-platform, vendor-agnostic, parallel programs
  Provides a standard software API across hardware vendors
  Enables cross-platform functional portability of an application without coding changes
  An OpenCL program written once can be deployed on multiple hardware systems.
  Application performance is driven by silicon and compiler technology.
OpenCL Platform
  Platform defines the hardware on which the user application is executed
  Minimum platform has 1 host and 1 compute device
    – Compute device can be CPU, DSP, FPGA, GPU
  [Figure: a platform in which the host connects through an interconnect to compute devices of type A and type B]
OpenCL: Basic Terminology
  Host Program: The application main() function. This code runs on a processor with the sole purpose of coordinating data transfer and launching compute units.
  Kernel Code: The compute-intensive part of an application. The part of the program which benefits from parallelization and can be accelerated.
  Runtime: Vendor-specific implementation of the standard OpenCL API functions, which handles details of interacting with the platform.
OpenCL Kernel Synthesis in Vivado HLS
Kernel Compilation Design Flow
  [Figure: kernels.cl (OpenCL language) enters the compiler (Vivado HLS) and yields Kernel_binary; other labels: Hardware Kernel, Software Kernel, Vivado IPI, Vivado, ARM, Fabric]
High Level Synthesis Core Technology
  Code:
    void foo(int *a) {
        int i, input, temp1;
        int temp2, output;
        for (i = 0; i < 100; i++) {
            input = a[i];            /* MEMREAD */
            input++;
            temp1 = input * 4;
            temp2 = input * 100;
            output = temp1 + temp2;
            a[i] = output;           /* MEMWRITE */
        }
    }
  Constraints:
    – Pipeline scheduler
    – Loop Initiation Interval = 1
    – Unbounded Resources
    – Unbounded Latency
  Control Data Flow Graph Formation
  [Figure: control data flow graph with nodes MEMREAD, ++, MEMWRITE, ++, X, X]
Module Selection
  Define available operators
  [Figure: the operations MEMREAD, ++, MEMWRITE, ++, X, X matched against the available operators: 1 cycle pipeline, 4 cycle pipeline, blocking operators]
Scheduling and Binding
  Bind operations to operators, schedule
  Pipeline + parallel operations
  [Figure: the operations MEMREAD, ++, MEMWRITE, ++, X, X and a shift register placed on pipeline stages 0 to 8]
Kernel Compilation: Loop Pipelining
    kernel void foo(...) {
        ...
        __attribute__((xcl_pipeline_loop))
        for (int i = 0; i < 3; i++) {
            int idx = get_global_id(0)*3 + i;
            op_Read(idx);
            op_Compute(idx);
            op_Write(idx);
        }
        ...
    }
  Xilinx OpenCL extension
  Execute iterations of loop in parallel as a pipeline
  [Figure: loop iterations over time; execution time of loop is 9 cycles, 5 cycles when pipelined]
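The 9-cycle and 5-cycle figures follow from a simple count. A minimal sketch, assuming each of the three operations (read, compute, write) takes one cycle, so an iteration is 3 cycles deep, and assuming the pipelined loop starts a new iteration every cycle (initiation interval of 1). The function names are hypothetical.

```c
#include <assert.h>

/* Cycles when iterations run back to back, each taking `depth` cycles. */
static unsigned sequential_cycles(unsigned iterations, unsigned depth)
{
    return iterations * depth;
}

/* Cycles when the loop is pipelined with initiation interval `ii`:
   the first iteration takes `depth` cycles and every later iteration
   completes `ii` cycles after the one before it. */
static unsigned pipelined_cycles(unsigned iterations, unsigned depth, unsigned ii)
{
    assert(depth >= 1 && ii >= 1);
    if (iterations == 0)
        return 0;
    return depth + (iterations - 1) * ii;
}
```

For the loop on the slide, sequential_cycles(3, 3) gives 9 and pipelined_cycles(3, 3, 1) gives 5. The saving grows with the trip count, since each extra iteration adds ii cycles instead of depth cycles.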
OpenCL Kernel Execution Model on Zynq
OpenCL Runtime
  [Figure, repeated for every step: Zynq processing system (DDR memory controller, AMBA switches, APU with dual ARM Cortex-A9) and programmable logic, connected through M_AXI4_GP and S_AXI4_HP/ACP. Memory holds the host code, the OpenCL runtime, buf0, buf1 and the .XCLBIN. The OpenCL accelerator appears in the programmable logic once the device is configured.]
  .XCLBIN contents: FPGA Bitstream, System Config. Info, Compiled OpenCL Kernels
  Host Code:
    context = clCreateContextFromType(…);
    clGetDeviceIDs(…, &device_id, …);
    queue = clCreateCommandQueue(context, device_id, …);
    buf0 = clCreateBuffer(context, CL_MEM_READ_ONLY, …);
    buf1 = clCreateBuffer(context, CL_MEM_READ_WRITE, …);
    program = clCreateProgramWithBinary(…);
    clBuildProgram(program, …);
    kernel = clCreateKernel(program, "mykernel", …);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf0);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf1);
    clEnqueueWriteBuffer(queue, buf0, …);
    clEnqueueNDRangeKernel(queue, kernel, …);
    clEnqueueReadBuffer(queue, buf1, …);
  Runtime steps:
    1. Initialize Runtime
    2. Allocate Buffers
    3. Configure Device
    4. Run Accelerator
  The OpenCL runtime handles device configuration, memory management, and execution.
Decryption Example with OpenCL Kernels Running on a Zynq Device
Zynq AES Decrypt
  [Figure: 160 byte key; 16 byte input block; rounds 0 to 9 (box labeled AES Encrypt); 16 byte output block]
AES Decrypt Round
  [Figure: one round on a 16 byte block. Labeled blocks: shiftrows; SBOX (256-byte lookup tables); add roundKey (bytewise XORs with the 16 byte round key); rotatecols. Output: 16 byte block]
AES Kernel
  Callouts: OpenCL vector types; pipeline with respect to work items; data read and write; AES rounds
    __kernel __attribute__((reqd_work_group_size(LOCALSIZE,1,1)))
    void AESDecrypt(__global uchar16 *output, __global uchar16 *input, __global uchar16 *roundKey)
    {
        __attribute__((xcl_pipeline_workitems)) {
            __private unsigned int localIndex = get_local_id(0);
            __private unsigned int globalIndex = get_global_id(0);
            __private uchar16 block0, block1;
            // data read: one 16-byte block per work-item
            block0 = input[globalIndex];
            // addRoundKey
            block0 ^= roundKey[ROUNDS];
            DecryptRound(9,&block0,roundKey);
            DecryptRound(8,&block0,roundKey);
            DecryptRound(7,&block0,roundKey);
            DecryptRound(6,&block0,roundKey);
            DecryptRound(5,&block0,roundKey);
            DecryptRound(4,&block0,roundKey);
            DecryptRound(3,&block0,roundKey);
            DecryptRound(2,&block0,roundKey);
            DecryptRound(1,&block0,roundKey);
            // shiftRowsInv
            block0 = shiftRowsInv16(block0);
            // subBytes
            block0 = box_constant16c(RSBOX(0,0),RSBOX(0,1),RSBOX(0,2),
                                     RSBOX(0,3),RSBOX(0,4),RSBOX(0,5),
                                     RSBOX(0,6),RSBOX(0,7), block0);
            // data write
            output[globalIndex] = block0 ^ roundKey[0];
        }
    }
Loop Pipeline Form
  Set workgroup size = 1
  Form loop pipeline
    __kernel __attribute__((reqd_work_group_size(1,1,1)))
    void AESDecrypt(__global uchar16 *output, __global uchar16 *input, __global uchar16 *roundKey, int blocks)
    {
        // the loop counter takes the place of get_global_id(0)
        unsigned int globalIndex;
        __attribute__((xcl_pipeline_loop))
        for (globalIndex = 0; globalIndex < blocks; globalIndex++) {
            __private unsigned int localIndex = get_local_id(0);
            __private uchar16 block0, block1;
            block0 = input[globalIndex];
            // addRoundKey
            block0 ^= roundKey[ROUNDS];
            DecryptRound(9,&block0,roundKey);
            DecryptRound(8,&block0,roundKey);
            DecryptRound(7,&block0,roundKey);
            DecryptRound(6,&block0,roundKey);
            DecryptRound(5,&block0,roundKey);
            DecryptRound(4,&block0,roundKey);
            DecryptRound(3,&block0,roundKey);
            DecryptRound(2,&block0,roundKey);
            DecryptRound(1,&block0,roundKey);
            // shiftRowsInv
            block0 = shiftRowsInv16(block0);
            // subBytes
            block0 = box_constant16c(RSBOX(0,0),RSBOX(0,1),RSBOX(0,2),
                                     RSBOX(0,3),RSBOX(0,4),RSBOX(0,5),
                                     RSBOX(0,6),RSBOX(0,7), block0);
            output[globalIndex] = block0 ^ roundKey[0];
        }
    }
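The relation between the work-item form and the loop form can be emulated in plain C. A minimal sketch with hypothetical names throughout: process_block is a one-byte XOR standing in for the AES rounds, and launch_ndrange stands in for the runtime launching one work-item per block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the per-block work (the real kernel runs the AES rounds). */
static uint8_t process_block(uint8_t in, uint8_t key)
{
    return (uint8_t)(in ^ key);
}

/* Work-item form: the body handles exactly one block, chosen by its global id. */
static void kernel_work_item(uint8_t *out, const uint8_t *in, uint8_t key, size_t global_id)
{
    out[global_id] = process_block(in[global_id], key);
}

/* Emulates the runtime launching one work-item per block. */
static void launch_ndrange(uint8_t *out, const uint8_t *in, uint8_t key, size_t global_size)
{
    assert(out != NULL && in != NULL);
    for (size_t id = 0; id < global_size; id++)
        kernel_work_item(out, in, key, id);
}

/* Loop form: a single work-item walks all blocks; `blocks` replaces the global size. */
static void kernel_single_work_item(uint8_t *out, const uint8_t *in, uint8_t key, size_t blocks)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < blocks; i++)
        out[i] = process_block(in[i], key);
}
```

Both forms produce the same output. The difference is where the iteration lives: in the runtime's NDRange for the first form, and inside the kernel, where the compiler can pipeline it, for the second.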
Zynq System Architecture
  [Figure: Zynq PS with DDR holding host memory and global memory; Zynq PL with AES Compute Unit 1 and its local memory; ports HP0, ACP, GP0, GP1; links labeled AXI MM 100 MHz 64-bit, AXI MM, AXI LITE]
Zynq Result
    – 100 MHz
    – II = 2. 64-bit bidirectional AXI MM interface, therefore 2 cycles / block
    – 80 BRAMs. 10 rounds, 160 Sbox ROMs, mapped to 80 BRAMs with both ports used
  Performance (columns: Speedup, Power (W), Power Efficiency (Speedup/W))
    Embedded CPU Dual Core (800 MHz) 10.5 –
    Server CPU Quad Core (3.2 GHz) 8260 –
    GPU Low End (8 Work Groups) 2725 –
    Zynq PS + PL
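The initiation interval of 2 follows from the interface width, and the block rate follows from the clock. A minimal sketch of that arithmetic, assuming the interface moves one 64-bit word per cycle and ignoring memory stalls; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles needed to move one block over the memory interface, rounded up. */
static unsigned cycles_per_block(unsigned block_bytes, unsigned bus_bits)
{
    unsigned bus_bytes;
    assert(bus_bits >= 8 && bus_bits % 8 == 0);
    bus_bytes = bus_bits / 8;
    return (block_bytes + bus_bytes - 1) / bus_bytes;
}

/* Blocks completed per second when a new block starts every `ii` cycles. */
static unsigned long blocks_per_second(unsigned long clock_hz, unsigned ii)
{
    assert(ii >= 1);
    return clock_hz / ii;
}
```

With the slide's figures, a 16 byte block over a 64-bit interface needs 2 cycles, and at 100 MHz that is 50 million blocks per second, or 800 million bytes per second, as an upper bound under the stated assumptions. The BRAM count is the same kind of division: 160 S-box ROMs at two per dual-port BRAM gives 80.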
Thank You