Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on AWS F1 FPGA
  Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu
  Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China
  Computer Science Department, University of California, Los Angeles, CA, USA
  Falcon Computing Solutions, Inc, Los Angeles, CA, USA
Falcon Computing Solutions
  An early-stage company focused on FPGA-based acceleration solutions, with offices in Santa Clara, Los Angeles, and Beijing
  Vision: provide seamless acceleration solutions that deliver high performance and energy efficiency for compute-intensive applications on-premises or in the cloud
  Leveraging years of research under co-founder Dr. Jason Cong, Chancellor's Professor and Director of the Center for Domain-Specific Computing at UCLA
  Raised more than $10M in venture funding in the past 2 years
  Executive team from Intel, Altera, Xilinx, Synopsys, Magma
  >30 years in FPGA industry, >30 years of university research
DNN Design Challenges on FPGAs
  High-performance, high-throughput DNN architecture
  Maximum resource utilization and frequency on SSI (stacked silicon interconnect) devices
  Irregularities in different DNN layers
  Design portability across different DNN models and FPGA devices
Systolic Array Architecture
Stacked Systolic Array Architecture
Two-Phase Design Space Exploration
  Example layer (loop nest):
    for (i = 0; i < 128; i++)
      for (o = 0; o < 192; o++)
        for (c = 0; c < 13; c++)
          for (r = 0; r < 13; r++)
            for (p = 0; p < 3; p++)
              for (q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];
  Determine the single systolic array structure
  Determine the number of systolic arrays
  Stream buffer management
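The step "determine the single systolic array structure" can be illustrated with a small search over candidate array shapes. This is only a sketch under stated assumptions, not the authors' algorithm: the `Shape` type, the mapping of array rows to loop `o`, columns to loop `c`, and SIMD lanes to loop `i`, and the cost model (one MAC per lane per cycle, loops padded up to whole tiles) are all hypothetical choices made for illustration.

```c
#include <assert.h>

/* Loop bounds of the example layer shown on the slide. */
enum { I = 128, O = 192, C = 13, R = 13, P = 3, Q = 3 };

/* Hypothetical array shape: `rows` tiles o, `cols` tiles c, `simd` tiles i. */
typedef struct { int rows, cols, simd; } Shape;

/* Ceiling division for positive integers. */
static int ceil_div(int a, int b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

/* Cycles for the layer, assuming one MAC per lane per cycle and
   loops padded up to whole tiles (an illustrative cost model). */
static long long layer_cycles(Shape s) {
    return (long long)ceil_div(O, s.rows) * ceil_div(C, s.cols)
         * ceil_div(I, s.simd) * R * P * Q;
}

/* Fraction of MAC slots that do useful work; 1.0 means no padding waste. */
static double efficiency(Shape s) {
    long long useful = (long long)I * O * C * R * P * Q;
    long long slots = layer_cycles(s) * s.rows * s.cols * s.simd;
    return (double)useful / (double)slots;
}

/* Exhaustive search for the shape with the fewest cycles under a MAC budget. */
static Shape best_shape(int mac_budget) {
    Shape best = {1, 1, 1};
    assert(mac_budget >= 1);
    for (int rows = 1; rows <= O; rows++)
        for (int cols = 1; cols <= C && rows * cols <= mac_budget; cols++)
            for (int simd = 1; simd <= I && rows * cols * simd <= mac_budget; simd++) {
                Shape s = {rows, cols, simd};
                if (layer_cycles(s) < layer_cycles(best)) best = s;
            }
    return best;
}
```

Under this model a 16 x 13 x 8 shape divides the loop bounds evenly and reaches an efficiency of 1.0, while a 16 x 16 x 8 shape pads the 13-wide `c` loop up to 16 and drops to 0.8125. This is one way to picture the layer irregularity the slides refer to.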
Programming Model
    for (i = 0; i < 128; i++)
      for (o = 0; o < 192; o++)
        for (c = 0; c < 13; c++)
          for (r = 0; r < 13; r++)
            for (p = 0; p < 3; p++)
              for (q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];
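For reference, the loop nest can be wrapped into a compilable C function. The sketch below assumes `float` data and array shapes inferred from the loop bounds (the input needs a 15 x 15 spatial extent because `r+p` and `c+q` reach 14). The function name `conv_layer` and the parameter layout are hypothetical, and the slides do not show any annotations the real flow may require.

```c
#include <assert.h>
#include <stddef.h>

/* Loop bounds of the example layer shown on the slide. */
enum { I = 128, O = 192, C = 13, R = 13, P = 3, Q = 3 };

/* Reference convolution for the slide's loop nest.
   Assumed shapes: in[I][R+P-1][C+Q-1], w[O][I][P][Q], out[O][R][C].
   The function accumulates into `out`, exactly as the slide does,
   so the caller must zero-initialize `out` beforehand. */
static void conv_layer(float out[O][R][C],
                       float w[O][I][P][Q],
                       float in[I][R + P - 1][C + Q - 1]) {
    assert(out != NULL && w != NULL && in != NULL);
    for (int i = 0; i < I; i++)                  /* input feature maps  */
        for (int o = 0; o < O; o++)              /* output feature maps */
            for (int c = 0; c < C; c++)          /* output columns      */
                for (int r = 0; r < R; r++)      /* output rows         */
                    for (int p = 0; p < P; p++)          /* kernel rows    */
                        for (int q = 0; q < Q; q++)      /* kernel columns */
                            out[o][r][c] += w[o][i][p][q] * in[i][r + p][c + q];
}
```

With all inputs and weights set to 1.0, every output element accumulates 128 x 3 x 3 = 1152 products, which gives a quick sanity check for any transformed version of the kernel.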
Merlin Compiler
  Pure C/C++ based flow enabling SW programmers to develop FPGA-accelerated applications
  Highly integrated flow with automatic optimization, greatly improving productivity
  Advanced code transformation delivering highest QoR without FPGA expertise
  [Figure: compilation flow. Labels: Kernel Code to Accelerate, C/C++, Merlin Compiler, Merlin Optimization Library, GCC, CPU, FPGA Binary, FPGA]
Experiment Result
  Baseline: single-layer implementation on AWS F1
  Irregularity: Stacked Systolic Array without floorplanning, higher computation efficiency
  Frequency: Stacked Systolic Array with floorplanning, higher clock frequency
Summary
  A low-latency DNN accelerator design based on stacked systolic arrays, achieving 2 TOPS on AWS F1 FPGA
  An automated resource partitioning algorithm between systolic arrays and FPGA dies for multiple DNN layers, achieving 90% resource utilization and 240 MHz
  An end-to-end automation flow from high-level C code to FPGA-accelerated DNNs in datacenters. We implement a push-button automation
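To relate the two headline numbers above, a back-of-the-envelope calculation can estimate how many multiply-accumulate units must be busy each cycle to sustain a given throughput at a given clock. The sketch assumes the common convention that one MAC counts as two operations; the slides do not state how operations are counted, and the helper name `macs_per_cycle` is hypothetical.

```c
#include <assert.h>

/* MAC units that must be busy every cycle to sustain `ops_per_s`,
   assuming each MAC counts as two operations (multiply and add). */
static double macs_per_cycle(double ops_per_s, double freq_hz) {
    assert(ops_per_s > 0.0 && freq_hz > 0.0);
    return ops_per_s / (2.0 * freq_hz);
}
```

With the slide's figures, `macs_per_cycle(2e12, 240e6)` comes to roughly 4,167 MACs per cycle under that counting convention.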
THANK YOU!