Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu


1 Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on AWS F1 FPGA
Xuechao Wei, Peng Zhang, Cody Hao Yu, and Jim Wu
Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China
Computer Science Department, University of California, Los Angeles, CA, USA
Falcon Computing Solutions, Inc., Los Angeles, CA, USA

2 Falcon Computing Solutions
An early-stage company focused on FPGA-based acceleration solutions, with offices in Santa Clara, Los Angeles, and Beijing
Vision: provide seamless acceleration solutions that deliver high performance and energy efficiency for compute-intensive applications, on-premises or in the cloud
Leveraging years of research under co-founder Dr. Jason Cong, Chancellor's Professor and Director of the Center for Domain-Specific Computing at UCLA
Have raised more than $10M in venture funding in the past 2 years
Executive team from Intel, Altera, Xilinx, Synopsys, and Magma: >30 years in the FPGA industry, >30 years of university research

3 DNN Design Challenges on FPGAs
High performance and throughput DNN architecture
Maximum resource utilization and frequency on SSI devices
Irregularities in different DNN layers
Design portability across different DNN models and FPGA devices

4 Systolic Array Architecture

5 Stacked Systolic Array Architecture

6 Two-Phase Design Space Exploration
    for(i = 0; i < 128; i++)
      for(o = 0; o < 192; o++)
        for(c = 0; c < 13; c++)
          for(r = 0; r < 13; r++)
            for(p = 0; p < 3; p++)
              for(q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];

Determine the single systolic array structure
Determine the number of systolic arrays
Stream buffer management
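The first step listed above, choosing the shape of a single systolic array, can be illustrated with a brute-force search. This is only a sketch under stated assumptions and not the authors' exploration algorithm. It assumes the `o` loop is mapped to PE rows, the `c` loop to PE columns, and the `i` loop to a SIMD vector inside each PE, that the `r`, `p`, and `q` loops run sequentially, and that every lane performs one multiply-accumulate per cycle. Memory bandwidth and pipeline fill are ignored. The names `Shape`, `layer_cycles`, `pick_shape`, and `mac_budget` are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of one systolic array candidate. */
typedef struct {
    int rows;    /* PE rows, assumed to tile the o loop (192)      */
    int cols;    /* PE columns, assumed to tile the c loop (13)    */
    int vec;     /* SIMD lanes per PE, assumed to tile i (128)     */
    long cycles; /* estimated cycles for the layer, -1 if invalid  */
} Shape;

static long ceil_div(long a, long b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Cycle estimate for the slide's layer: number of tiles over (o, c, i)
   times the sequential r, p, q iterations (13 * 3 * 3). */
static long layer_cycles(int rows, int cols, int vec)
{
    long tiles = ceil_div(192, rows) * ceil_div(13, cols) * ceil_div(128, vec);
    return tiles * 13 * 3 * 3;
}

/* Exhaustively try every shape that fits in mac_budget MAC units and
   keep the one with the fewest estimated cycles. */
static Shape pick_shape(int mac_budget)
{
    Shape best = {0, 0, 0, -1};
    for (int rows = 1; rows <= 192; rows++)
        for (int cols = 1; cols <= 13; cols++)
            for (int vec = 1; vec <= 128; vec++) {
                long macs = (long)rows * cols * vec;
                if (macs > mac_budget)
                    continue; /* does not fit in the assumed budget */
                long cyc = layer_cycles(rows, cols, vec);
                if (best.cycles < 0 || cyc < best.cycles) {
                    best.rows = rows;
                    best.cols = cols;
                    best.vec = vec;
                    best.cycles = cyc;
                }
            }
    return best;
}
```

Calling `pick_shape(1024)` returns the lowest-cycle shape that uses at most 1024 multiply-accumulate units under this simplified model. A shape whose dimensions do not divide the loop bounds wastes lanes in the last tile, which is why the estimate uses ceiling division.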

7 Programming Model

    for(i = 0; i < 128; i++)
      for(o = 0; o < 192; o++)
        for(c = 0; c < 13; c++)
          for(r = 0; r < 13; r++)
            for(p = 0; p < 3; p++)
              for(q = 0; q < 3; q++)
                out[o][r][c] += w[o][i][p][q] * in[i][r+p][c+q];
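The loop nest on this slide can be made compilable as a plain C reference, sketched below. The array shapes are an assumption inferred from the index expressions: `in[i][r+p][c+q]` implies 15 x 15 input feature maps for a 13 x 13 output with a 3 x 3 kernel. The helper names `conv_fill`, `conv_layer`, and `conv_out_at` are hypothetical, and the slide's accelerator-specific annotations, which the transcript does not preserve, are not reproduced.

```c
#include <assert.h>

/* Loop bounds taken from the slide's loop nest. */
enum { NI = 128, NO = 192, NR = 13, NC = 13, NP = 3, NQ = 3 };

/* Assumed shapes: the input needs (NR+NP-1) x (NC+NQ-1) elements per map. */
static float in[NI][NR + NP - 1][NC + NQ - 1];
static float w[NO][NI][NP][NQ];
static float out[NO][NR][NC];

/* Set every input to in_val, every weight to w_val, and clear the output. */
static void conv_fill(float in_val, float w_val)
{
    for (int i = 0; i < NI; i++)
        for (int y = 0; y < NR + NP - 1; y++)
            for (int x = 0; x < NC + NQ - 1; x++)
                in[i][y][x] = in_val;
    for (int o = 0; o < NO; o++)
        for (int i = 0; i < NI; i++)
            for (int p = 0; p < NP; p++)
                for (int q = 0; q < NQ; q++)
                    w[o][i][p][q] = w_val;
    for (int o = 0; o < NO; o++)
        for (int r = 0; r < NR; r++)
            for (int c = 0; c < NC; c++)
                out[o][r][c] = 0.0f;
}

/* Reference convolution layer, same loop order as the slide. */
static void conv_layer(void)
{
    for (int i = 0; i < NI; i++)
        for (int o = 0; o < NO; o++)
            for (int c = 0; c < NC; c++)
                for (int r = 0; r < NR; r++)
                    for (int p = 0; p < NP; p++)
                        for (int q = 0; q < NQ; q++)
                            out[o][r][c] += w[o][i][p][q] * in[i][r + p][c + q];
}

/* Bounds-checked read of one output element. */
static float conv_out_at(int o, int r, int c)
{
    assert(o >= 0 && o < NO && r >= 0 && r < NR && c >= 0 && c < NC);
    return out[o][r][c];
}
```

With all inputs and weights set to 1 via `conv_fill(1.0f, 1.0f)`, each output element accumulates 128 x 3 x 3 = 1152 products, which gives a quick sanity check for any accelerated version of the same layer.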

8 Merlin Compiler
Pure C/C++ based flow enabling SW programmers to develop FPGA accelerated applications
Highly integrated flow with automatic optimization, greatly improving productivity
Advanced code transformation delivering highest QoR without FPGA expertise
[Figure: Merlin Compiler flow. Labels: C/C++, Kernel Code to Accelerate, GCC, CPU, Merlin Compiler, Merlin Optimization Library, FPGA Binary, FPGA]

9 Experiment Result
Baseline: single layer implementation on AWS F1
Irregularity: stacked systolic array without floorplanning gives higher computation efficiency
Frequency: stacked systolic array with floorplanning gives higher clock frequency

10 Summary
A low latency DNN accelerator design based on stacked systolic arrays, achieving 2 TOPS on AWS F1 FPGA
An automated resource partitioning algorithm between systolic arrays and FPGA dies for multiple DNN layers, achieving 90% resource utilization and 240 MHz
An end-to-end automation flow from high-level C code to FPGA accelerated DNNs in datacenters. We implement a push-button automation
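To put the summary figures in perspective, the short sketch below relates throughput, clock frequency, and the size of the example convolution layer. It assumes that one multiply-accumulate counts as two operations and that the quoted 2 TOPS is sustained at the quoted 240 MHz. The slides do not state either point, so the results are rough estimates only. The function names are hypothetical.

```c
#include <assert.h>

/* Operations the design must complete each clock cycle to reach a
   given throughput at a given frequency. */
static double ops_per_cycle(double ops_per_sec, double freq_hz)
{
    assert(freq_hz > 0.0);
    return ops_per_sec / freq_hz;
}

/* Multiply-accumulates in the example layer: the product of all six
   loop bounds from the loop nest (i, o, c, r, p, q). */
static long layer_macs(void)
{
    return 128L * 192 * 13 * 13 * 3 * 3;
}

/* Ideal layer time in microseconds, assuming 2 operations per MAC and
   perfect utilization of the given throughput. */
static double layer_time_us(double ops_per_sec)
{
    assert(ops_per_sec > 0.0);
    return 2.0 * (double)layer_macs() / ops_per_sec * 1e6;
}
```

Under these assumptions, `ops_per_cycle(2e12, 240e6)` is about 8,333 operations per cycle, or roughly 4,167 multiply-accumulates per cycle. The example layer has 37,380,096 multiply-accumulates, so `layer_time_us(2e12)` comes to about 37.4 microseconds at ideal utilization.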

11 THANK YOU!

