FPGAs in AWS and First Use Cases, Kees Vissers
FPGA technology over time
LUTs: basic bit-oriented logic (4-input, 6-input lookup table)
Columns: basic bit-oriented logic + word-oriented multiply-accumulate + word-oriented memory
Die-stacked slices: basic bit-oriented logic + word-oriented multiply-accumulate + word-oriented memory + system integration, e.g. PCIe, DDR
[Figure: FPGA fabric layouts; labels: Logic, I/O, Memory/DSP]
Your program becomes a configuration that sets table values and switches via synthesis and place-and-route tools.
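To make "a configuration that sets table values" concrete, here is a minimal software sketch of a single 4-input lookup table. The names `Lut4` and `configure` are made up for illustration and are not part of any Xilinx tool. A real device also configures the routing switches between LUTs, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstdint>

// A 4-input LUT modelled as a 16-entry truth table held in one 16-bit word.
// Bit i of `table` is the output for the input combination with binary value i.
struct Lut4 {
    std::uint16_t table;

    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned index = (a ? 1u : 0u) | (b ? 2u : 0u) | (c ? 4u : 0u) | (d ? 8u : 0u);
        assert(index < 16u);  // four inputs can only select one of 16 entries
        return ((table >> index) & 1u) != 0;
    }
};

// "Synthesis" in miniature: fill the table by evaluating the desired
// boolean function on all 16 input combinations.
template <typename F>
Lut4 configure(F f) {
    std::uint16_t table = 0;
    for (unsigned i = 0; i < 16; ++i) {
        bool a = i & 1u, b = i & 2u, c = i & 4u, d = i & 8u;
        if (f(a, b, c, d)) {
            table = static_cast<std::uint16_t>(table | (1u << i));
        }
    }
    return Lut4{table};
}
```

Calling `configure` with a lambda such as `[](bool a, bool b, bool c, bool d) { return a && b && c && d; }` yields a table with only the top bit set. After that, evaluating the function is a single table read, whatever the logic was.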
The FPGA in the Amazon F1 Instance: VU9P (16nm)
More than 1 million 6-input LUTs
Lots of on-chip fine-grain memory, total in the range of 42 Mbyte
Lots of 'DSP' elements (multiply-accumulate), total 6840
What can you as a programmer do with this: RTL (Verilog, VHDL) or program (C/C++, OpenCL)
A typical program synthesizes to ~250 MHz - 500 MHz or more, with 10,000 - 100,000 operations running concurrently.
Typical utilization of all these resources is in the 60-90% range; some are needed for the 'shell'.
How to achieve this in actual designs?
FPGA programming: dataflow and memory model
SW programmability: host code with accelerator code, OpenCL, C/C++
High-Level Synthesis (HLS): C/C++/OpenCL with Vivado IPI
Traditional HW design: Verilog or VHDL
[Figure: example video pipeline; blocks: Ethernet IP, video decode (C++), video process, video encode, HDMI]
Some concepts of HLS for a programmer
C code describes this: y = a*x + b + c;
Vivado HLS solution: optimally crafted RTL for DSP blocks; the expression fits into a DSP48.
Registers are allocated by HLS at all the "right" places.
Example: fully unrolled loop (parallel execution and more resources)
    void foo (...) { ...
      add: for (i=0;i<=3;i++) {
        b = a[i] + b;
[Figure: dataflow graphs for y = a*x + b + c and for the unrolled add loop over a[0]..a[3]]
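The two fragments on this slide can be written out as complete functions. This is a sketch only: the signatures, the `int` types and the names `mac`, `foo_unrolled` and `check_unroll_equivalence` are assumptions, since the slide elides the parameters of `foo`. `#pragma HLS UNROLL` is the Vivado HLS unroll directive. An ordinary C++ compiler ignores it, possibly with a warning, so the code also runs as plain software.

```cpp
#include <cassert>

// Multiply-add from the slide; per the slide, HLS can map this onto a DSP48.
int mac(int a, int x, int b, int c) {
    return a * x + b + c;
}

// Accumulate four array elements into b, following the slide's `add` loop.
int foo(const int a[4], int b) {
add:
    for (int i = 0; i <= 3; i++) {
#pragma HLS UNROLL
        b = a[i] + b;
    }
    return b;
}

// What full unrolling amounts to: four adders and no loop control logic.
int foo_unrolled(const int a[4], int b) {
    return a[3] + (a[2] + (a[1] + (a[0] + b)));
}

// Sanity check that the loop and its unrolled form compute the same value.
void check_unroll_equivalence() {
    const int a[4] = {1, 2, 3, 4};
    assert(foo(a, 10) == foo_unrolled(a, 10));
}
```

The unrolled form trades resources for time: the loop can share a single adder across four steps, while the unrolled version instantiates four adders that the tool can schedule with no loop overhead.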
Software flow with SDAccel Design Flow on AWS F1
CPU and FPGA comparison: great potential for speedup

                                        | CPU (Xeon series)                          | FPGA (Virtex UltraScale)
Number of elements per chip             | 2-28 processor cores                       | 1 million LUTs and 10,000 DSPs
Number of operations per clock per chip | 1-16 (vector) * 2-28 = 2-448               | 10,000 - 100,000
  (perfect memory model)                |                                            |
Clock frequency                         | 2 - 4 GHz                                  | 250 - 500 MHz
Max performance (peak)                  | 0.004 - 1.8 Tops                           | 2.5 - 50 Tops
Power consumption                       | ~100-300 W                                 | ~30-100 W
Operations                              | 32-bit, 64-bit integer and float           | 1, 2, 3, 4, up to 16, 32, 64-bit integer; floating point possible
Typical ratio compared to peak          | 30-90%                                     | 30-70%
Programming languages and models        | Python/Java/C/C++/OpenMP/OpenCL frameworks | RTL, C/C++/OpenCL, exploiting parallel opportunities
Typical compile/link time               | Seconds - minutes                          | Hours (synthesis, P+R)

Speedup range in practice: 10 - 100
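The peak figures in the comparison follow from multiplying operations per clock by clock frequency. A small sketch of that arithmetic is given here; the function names `peak_tops` and `sustained_tops` are made up for illustration.

```cpp
#include <cassert>

// Peak throughput in Tops (10^12 operations per second):
// operations per clock cycle times clock frequency in Hz.
double peak_tops(double ops_per_clock, double clock_hz) {
    assert(ops_per_clock >= 0.0 && clock_hz >= 0.0);
    return ops_per_clock * clock_hz / 1e12;
}

// Throughput after applying the "typical ratio compared to peak" (0.0 to 1.0).
double sustained_tops(double peak, double ratio) {
    assert(ratio >= 0.0 && ratio <= 1.0);
    return peak * ratio;
}
```

With the slide's numbers, `peak_tops(2, 2e9)` and `peak_tops(448, 4e9)` give the CPU range of about 0.004 to 1.8 Tops. `peak_tops(1e4, 250e6)` and `peak_tops(1e5, 500e6)` give the FPGA range of 2.5 to 50 Tops. These are peaks under a perfect memory model, so real designs land lower by the typical ratio.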
FPGAs are good at:
All bit-widths, e.g.
    Video processing with 8bit, 10bit, 12bit and more
    Security and compression algorithms (e.g. 160bit)
    Machine Learning using reduced precision with specialized bit-widths (8, 4, 2bit, small floating point)
    Signal processing: 8bit, 16, 18bit, 24bit, 32bit integer multiply-accumulate and floating point
Streaming dataflow-oriented compute, e.g.
    Video encode and decode
    Streaming network functions
    Hash functions and query functions
    Machine Learning
Specialized dedicated processor-style architectures
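To illustrate the bit-width point: on a CPU, a 10-bit video sample has to live in a 16-bit or wider type and be masked or clamped by hand, whereas on an FPGA the datapath can be exactly 10 bits wide. The sketch below shows the CPU-side emulation in plain C++; the names `wrap` and `add_sat10` are made up for illustration. In Vivado HLS one would more likely reach for its arbitrary-precision integer types such as `ap_uint<10>`, which are not used here so that the example compiles anywhere.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low `Bits` bits of v, emulating wrap-around of an
// N-bit unsigned value inside a 32-bit word.
template <unsigned Bits>
std::uint32_t wrap(std::uint32_t v) {
    static_assert(Bits >= 1 && Bits < 32, "Bits must be in 1..31");
    return v & ((1u << Bits) - 1u);
}

// Saturating add of two 10-bit video samples: clamp at 1023 instead of wrapping.
std::uint16_t add_sat10(std::uint16_t p, std::uint16_t q) {
    const std::uint32_t max10 = (1u << 10) - 1u;  // 1023
    assert(p <= max10 && q <= max10);             // inputs must already be 10-bit
    const std::uint32_t sum = std::uint32_t{p} + q;
    return static_cast<std::uint16_t>(sum > max10 ? max10 : sum);
}
```

On the CPU the unused upper bits still occupy registers and memory bandwidth. In FPGA logic they simply do not exist, which is where the resource savings for 8, 10 or 12-bit pipelines come from.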
Conclusion
AWS opens new opportunities to leverage FPGAs.
There is a potential benefit with FPGAs for a number of applications.
Programming requires some additional effort.
You can do it!