Published by Francine Baldwin. Modified over 6 years ago.
1
FPGAs in AWS and First Use Cases, Kees Vissers
2
FPGA technology over time
Figure: FPGA fabric evolving from LUTs to columns to die-stacked slices (labels: Logic, I/O, Memory/DSP)
LUTs: basic bit-oriented logic; 4-input and 6-input lookup tables
Columns: basic bit-oriented logic + word-oriented multiply-accumulate + word-oriented memory
Die-stacked slices: basic bit-oriented logic + word-oriented multiply-accumulate + word-oriented memory + system integration, e.g. PCIe, DDR
Your program becomes a configuration that sets table values and switches via synthesis and place-and-route tools.
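To make "a configuration that sets table values" concrete, here is a minimal software model of a 6-input lookup table. The `Lut6` struct, the `configure` helper and the bit layout are assumptions made for illustration; they are not how any vendor tool stores its configuration.

```cpp
#include <cassert>
#include <cstdint>

// A 6-input LUT holds one output bit for each of the 2^6 = 64 input
// combinations, so its whole "program" is a 64-bit table.
struct Lut6 {
    std::uint64_t table = 0;  // configuration bits (hypothetical layout)

    // Look up the output for a 6-bit input pattern (bit i = input i).
    bool eval(unsigned inputs) const {
        assert(inputs < 64);
        return ((table >> inputs) & 1u) != 0;
    }
};

// "Synthesis" in miniature: evaluate a Boolean function on every input
// pattern and record each result as a table value.
template <typename F>
Lut6 configure(F f) {
    Lut6 lut;
    for (unsigned in = 0; in < 64; ++in) {
        if (f(in)) lut.table |= (std::uint64_t{1} << in);
    }
    return lut;
}

// Example function: odd parity of the six inputs.
inline bool parity6(unsigned in) {
    bool p = false;
    for (unsigned i = 0; i < 6; ++i) {
        if ((in >> i) & 1u) p = !p;
    }
    return p;
}
```

Calling `configure(parity6)` yields a table that answers any 6-input parity query with a single lookup. Any other 6-input Boolean function costs exactly the same: only the 64 table bits change.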
3
The FPGA in the Amazon F1 instance: VU9P (16 nm)
More than 1 million 6-input LUTs
Lots of on-chip fine-grain memory, in the range of 42 Mbyte in total
Lots of ‘DSP’ elements (multiply-accumulate), 6840 in total
What can you as a programmer do with this: RTL (Verilog, VHDL) or program (C/C++, OpenCL)
A typical program synthesizes to ~250 MHz - 500 MHz or more, with 10,000 to 100,000 operations running concurrently
Typical utilization of all these resources is in the 60-90% range; some are needed for the ‘shell’
How to achieve this in actual designs?
4
FPGA programming: dataflow and memory model
SW programmability: host code with accelerator code, OpenCL, C/C++
High-Level Synthesis (HLS): C/C++/OpenCL with Vivado IPI
Traditional HW design: Verilog or VHDL
Figure (example dataflow): Ethernet IP, video decode, C++ video process, video encode, HDMI
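The dataflow model can be sketched in plain C++ as a chain of stages connected by FIFOs. Everything here is a stand-in: `Stream` is a `std::queue` rather than a hardware stream, and the three stage bodies are toy operations, not real video codecs. In software the stages run one after another; an HLS tool given a dataflow directive (Vivado HLS has `#pragma HLS DATAFLOW`) would aim to run them concurrently, each starting as soon as data arrives.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

using Stream = std::queue<std::uint8_t>;  // stand-in for a hardware FIFO

// Toy "decode" stage: undo a fixed XOR scramble (illustrative only).
void decode(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(static_cast<std::uint8_t>(in.front() ^ 0x55));
        in.pop();
    }
}

// Toy "process" stage: invert each 8-bit sample.
void process(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(static_cast<std::uint8_t>(255 - in.front()));
        in.pop();
    }
}

// Toy "encode" stage: apply the XOR scramble again.
void encode(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(static_cast<std::uint8_t>(in.front() ^ 0x55));
        in.pop();
    }
}

// Top level: decode -> process -> encode, linked by intermediate FIFOs.
void video_pipeline(Stream& in, Stream& out) {
    assert(out.empty());
    Stream decoded, processed;
    decode(in, decoded);
    process(decoded, processed);
    encode(processed, out);
}
```

Each stage touches only its own input and output stream. That property is what lets a hardware implementation overlap the stages.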
5
Some concepts of HLS for a programmer
C code describes this: y = a*x + b + c;
Vivado HLS solution: optimally crafted RTL for DSP blocks
Figure: multiply of a and x followed by adds of b and c producing y; fits into a DSP48
Registers are allocated by HLS at all the “right” places

    void foo (...) {
      ...
      add: for (i=0;i<=3;i++) {
        b = a[i] + b;
      ...

Figure: the unrolled loop as a chain of adders over a[0], a[1], a[2], a[3] and b
Example: fully unrolled loop (parallel execution and more resources)
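The two HLS ideas on this slide can be written as ordinary C++ that also compiles on a CPU. The function names `mac`, `sum4` and `sum4_unrolled` are hypothetical completions of the slide's fragments. `#pragma HLS UNROLL` is a Vivado HLS directive, and a regular compiler skips it (possibly with a warning).

```cpp
#include <cassert>

// y = a*x + b + c: one multiply followed by adds, the shape of
// computation the slide says fits into a DSP48 block.
int mac(int a, int x, int b, int c) {
    return a * x + b + c;
}

// The slide's "add" loop, completed with hypothetical names.
// The trip count is fixed at 4, so an HLS tool can unroll it fully.
int sum4(const int a[4], int b) {
    assert(a != nullptr);
add:
    for (int i = 0; i <= 3; i++) {
#pragma HLS UNROLL
        b = a[i] + b;
    }
    return b;
}

// What full unrolling amounts to: four adders and no loop control.
int sum4_unrolled(const int a[4], int b) {
    assert(a != nullptr);
    return a[3] + (a[2] + (a[1] + (a[0] + b)));
}
```

Both versions compute the same value. The unrolled form spends four adders instead of reusing one, which is the "more resources" trade-off the slide mentions.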
6
Software flow with SDAccel Design Flow on AWS F1
7
CPU and FPGA comparison: great potential for speedup.
Metric | CPU (Xeon series) | FPGA (Virtex UltraScale)
Number of elements per chip | 2-28 processor cores | 1 million LUTs and 10,000 DSPs
Number of operations per clock per chip (perfect memory model) | 1-16 (vector) * 2-28 = 2-448 | 10,000 - 100,000
Clock frequency | 2 - 4 GHz | 250 - 500 MHz
Max performance (peak) | 0.004 - 1.8 Tops | 2.5 - 50 Tops
Power consumption | ~ W | ~30-100 W
Operations | 32-bit, 64-bit integer and float | 1, 2, 3, 4, up to 16, 32, 64 bit integer; floating point possible
Typical ratio compared to peak | 30-90% | 30-70%
Programming languages and models | Python/Java/C/C++/OpenMP/OpenCL frameworks | RTL, C/C++/OpenCL, exploiting parallel opportunities
Typical compile/link time | Seconds - minutes | Hours (synthesis, P+R)
Speedup range in practice | |
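The peak-performance figures in the table above follow from multiplying operations per clock by clock frequency. The sketch below reproduces that arithmetic; the `Range` struct and the function names are assumptions made for the example, and the inputs are the table's own numbers.

```cpp
#include <cassert>

// Peak throughput in tera-operations per second (Tops):
// operations per clock * clock frequency in GHz / 1000.
double peak_tops(double ops_per_clock, double clock_ghz) {
    assert(ops_per_clock >= 0.0 && clock_ghz >= 0.0);
    return ops_per_clock * clock_ghz / 1000.0;
}

struct Range {
    double low;
    double high;
};

// Pair the low ends and the high ends of both ranges.
Range peak_range(Range ops_per_clock, Range clock_ghz) {
    return {peak_tops(ops_per_clock.low, clock_ghz.low),
            peak_tops(ops_per_clock.high, clock_ghz.high)};
}

// CPU: 2-448 operations per clock at 2-4 GHz.
const Range cpu_peak = peak_range({2.0, 448.0}, {2.0, 4.0});
// FPGA: 10,000-100,000 operations per clock at 250-500 MHz.
const Range fpga_peak = peak_range({10000.0, 100000.0}, {0.25, 0.5});
```

The CPU range works out to 0.004 - 1.792 Tops, which the table rounds to 1.8, and the FPGA range to 2.5 - 50 Tops. The FPGA's lower clock is more than offset by the far larger number of operations per clock.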
8
FPGAs are good at:
All bit-widths, e.g.
  Video processing with 8-bit, 10-bit, 12-bit and more
  Security and compression algorithms (e.g. 160-bit)
  Machine learning using reduced precision with specialized bit-widths (8, 4, 2 bit, small floating point)
  Signal processing: 8-bit, 16-bit, 18-bit, 24-bit, 32-bit integer multiply-accumulate and floating point
Streaming dataflow-oriented compute, e.g.
  Video encode and decode
  Streaming network functions
  Hash functions and query functions
  Machine learning
Specialized dedicated processor-style architectures
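As a rough illustration of non-standard bit-widths, the sketch below emulates a 10-bit video sample and a 4-bit weight dot product on a CPU using masks and range checks. Both helper names are hypothetical. On an FPGA the datapath itself would be built that narrow (Vivado HLS offers `ap_int<N>` and `ap_uint<N>` types for this), whereas a CPU still computes in 32-bit or 64-bit registers.

```cpp
#include <cassert>
#include <cstdint>

// Clamp a value to the range of an unsigned integer of `Bits` bits,
// e.g. 0..1023 for a 10-bit video sample.
template <unsigned Bits>
std::uint32_t saturate_unsigned(std::int64_t v) {
    static_assert(Bits >= 1 && Bits <= 32, "width must be 1..32 bits");
    const std::int64_t max = (std::int64_t{1} << Bits) - 1;
    if (v < 0) return 0;
    if (v > max) return static_cast<std::uint32_t>(max);
    return static_cast<std::uint32_t>(v);
}

// Dot product of signed 4-bit weights (-8..7) with unsigned 8-bit
// activations, accumulated in a wider 32-bit register.
std::int32_t dot_int4(const std::int8_t* w, const std::uint8_t* x, int n) {
    std::int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        assert(w[i] >= -8 && w[i] <= 7);  // weight must fit in 4 bits
        acc += static_cast<std::int32_t>(w[i]) * x[i];
    }
    return acc;
}
```

For example, `saturate_unsigned<10>(1100)` clamps a brightened pixel to 1023 rather than letting it wrap around.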
9
Conclusion
AWS opens new opportunities to leverage FPGAs
There is a potential benefit with FPGAs for a number of applications
Programming requires some additional effort
You can do it!