FPGAs in AWS and First Use Cases, Kees Vissers


FPGA technology over time

[Figure: device floorplans over time; labels: Logic, I/O, Memory/DSP]

LUTs: basic bit-oriented logic, built from 4-input and 6-input lookup tables.
Columns: basic bit-oriented logic + word-oriented multiply-accumulate + word-oriented memory.
Die-stacked slices: basic bit-oriented logic + word-oriented multiply-accumulate + word-oriented memory + system integration, e.g. PCIe, DDR.

Your program becomes a configuration that sets table values and switches via synthesis and place-and-route tools.
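To make "a configuration that sets table values" concrete, here is a minimal software sketch of a 6-input lookup table. It assumes only the general fact that a k-input LUT stores 2^k truth-table bits; the names `lut6`, `make_lut6_config`, `parity6` and `and6` are hypothetical and are not part of any vendor tool.

```cpp
#include <cassert>
#include <cstdint>

// Evaluate a 6-input LUT. `config` holds the 64 truth-table bits (one per
// input combination); `inputs` packs the six input bits into bits 0..5.
inline bool lut6(std::uint64_t config, unsigned inputs) {
    assert(inputs < 64u);
    return ((config >> inputs) & 1u) != 0;
}

// Build the configuration word for an arbitrary 6-input Boolean function.
// This loosely mirrors what synthesis does when it maps logic onto a LUT.
template <typename F>
std::uint64_t make_lut6_config(F f) {
    std::uint64_t config = 0;
    for (unsigned in = 0; in < 64u; ++in) {
        if (f(in)) config |= (std::uint64_t{1} << in);
    }
    return config;
}

// Example function: odd parity (XOR) of the six inputs.
inline bool parity6(unsigned in) {
    bool p = false;
    for (unsigned bit = 0; bit < 6u; ++bit) {
        if ((in >> bit) & 1u) p = !p;
    }
    return p;
}

// Example function: AND of all six inputs.
inline bool and6(unsigned in) {
    return (in & 0x3Fu) == 0x3Fu;
}
```

Calling `lut6(make_lut6_config(parity6), inputs)` then evaluates the parity function by table lookup alone. Changing the function means changing only the 64-bit word, not the evaluation logic.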

The FPGA in the Amazon F1 Instance: VU9P (16nm)

More than 1 million 6-input LUTs
Lots of on-chip fine-grain memory, in total in the range of 42 Mbyte
Lots of ‘DSP’ elements (multiply-accumulate), 6840 in total

What can you as a programmer do with this?
RTL (Verilog, VHDL) or program (C/C++, OpenCL)
A typical program synthesizes to ~250 MHz - 500 MHz or more, with 10,000 - 100,000 operations executing concurrently.
Typical utilization of all these resources is in the 60-90% range; some are needed for the ‘shell’.
How to achieve this in actual designs?

FPGA programming: dataflow and memory model

SW programmability: host code with accelerator code, OpenCL, C/C++
High-Level Synthesis (HLS): C/C++/OpenCL with Vivado IPI
Traditional HW design: Verilog or VHDL

[Figure: block diagrams of video pipelines; labels: Ethernet IP, video decode (C++), video process, video encode, HDMI]
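The dataflow style named on this slide can be sketched in plain C++ as stages connected by FIFOs. This is only an illustration: the stage bodies are hypothetical placeholders (decode and encode just forward data, process inverts each pixel), and `std::queue` stands in for whatever stream type an HLS tool provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

using Pixel = std::uint8_t;
using Fifo = std::queue<Pixel>;

// Placeholder for "video decode": forwards every element unchanged.
inline void decode_stage(Fifo& in, Fifo& out) {
    while (!in.empty()) { out.push(in.front()); in.pop(); }
}

// Placeholder for "video process": inverts each 8-bit pixel.
inline void process_stage(Fifo& in, Fifo& out) {
    while (!in.empty()) {
        out.push(static_cast<Pixel>(255 - in.front()));
        in.pop();
    }
}

// Placeholder for "video encode": forwards every element unchanged.
inline void encode_stage(Fifo& in, Fifo& out) {
    while (!in.empty()) { out.push(in.front()); in.pop(); }
}

// Each stage only reads its input FIFO and writes its output FIFO, so
// there is no shared state between stages.
inline Fifo run_pipeline(Fifo input) {
    const std::size_t n = input.size();
    Fifo decoded, processed, encoded;
    decode_stage(input, decoded);
    process_stage(decoded, processed);
    encode_stage(processed, encoded);
    assert(encoded.size() == n);  // one pixel out per pixel in
    return encoded;
}
```

In this software model the stages run one after another. Because they communicate only through FIFOs, the same structure is the kind that hardware can run as concurrently active blocks, which is the point of a streaming dataflow design.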

Some concepts of HLS for a programmer

C code describes this:
y = a*x + b + c;
Vivado HLS solution: optimally crafted RTL for DSP blocks. The expression fits into a DSP48, and registers are allocated by HLS at all the “right” places.

Example: fully unrolled loop (parallel execution and more resources).
void foo (...) {
  ...
  add: for (i=0;i<=3;i++) {
    b = a[i] + b;
    ...

[Figure: after unrolling, the loop becomes separate adders combining a[0], a[1], a[2], a[3] and b]
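The two slide fragments can be written out as ordinary C++ to show what unrolling changes. This is a sketch under assumptions: the function names and the 32-bit integer types are hypothetical, and the unroll directive is mentioned only in a comment so the code compiles with a standard compiler.

```cpp
#include <cassert>
#include <cstdint>

// y = a*x + b + c: one multiply followed by additions, the kind of
// expression the slide describes as fitting into a DSP48.
inline std::int32_t mul_add(std::int32_t a, std::int32_t x,
                            std::int32_t b, std::int32_t c) {
    return a * x + b + c;
}

// Rolled form, as on the slide: four iterations sharing one adder in time.
// With Vivado HLS, an unroll directive (for example "#pragma HLS UNROLL"
// inside the loop body) would ask the tool to replicate the body instead.
inline std::int32_t accumulate_rolled(const std::int32_t a[4], std::int32_t b) {
    for (int i = 0; i <= 3; ++i) {
        b = a[i] + b;
    }
    return b;
}

// Hand-unrolled form: every addition is written out, so four adders exist
// side by side (more resources, no loop control).
inline std::int32_t accumulate_unrolled(const std::int32_t a[4], std::int32_t b) {
    return a[3] + (a[2] + (a[1] + (a[0] + b)));
}

// Both forms must compute the same value for the same inputs.
inline void check_equivalence(const std::int32_t a[4], std::int32_t b) {
    assert(accumulate_rolled(a, b) == accumulate_unrolled(a, b));
}
```

The unrolled version keeps the same order of additions as the loop. A tool is free to rebalance such a chain into a tree when the arithmetic allows it, which is not shown here.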

Software flow with SDAccel Design Flow on AWS F1

CPU and FPGA comparison: great potential for speedup

| | CPU (Xeon series) | FPGA (Virtex UltraScale) |
|---|---|---|
| Number of elements per chip | 2-28 processor cores | 1 million LUTs and 10,000 DSPs |
| Number of operations per clock per chip (perfect memory model) | 1-16 (vector) * 2-28 = 2-448 | 10,000 - 100,000 |
| Clock frequency | 2 - 4 GHz | 250 - 500 MHz |
| Max performance (peak) | 0.004 - 1.8 Tops | 2.5 - 50 Tops |
| Power consumption | ~100-300 W | ~30-100 W |
| Operations | 32-bit, 64-bit integer and float | 1, 2, 3, 4, up to 16, 32, 64-bit integer; floating point possible |
| Typical ratio compared to peak | 30-90% | 30-70% |
| Programming languages and models | Python/Java/C/C++/OpenMP/OpenCL frameworks | RTL, C/C++/OpenCL, exploiting parallel opportunities |
| Typical compile/link time | Seconds - minutes | Hours (synthesis, P+R) |

Speedup range in practice: 10 - 100
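The peak-performance figures in this comparison follow from multiplying operations per clock by clock frequency. A minimal sketch of that arithmetic, with `peak_tops` as a hypothetical helper name:

```cpp
#include <cassert>

// Peak throughput in Tops (10^12 operations per second):
// operations per clock * clock frequency in Hz / 10^12.
inline double peak_tops(double ops_per_clock, double clock_hz) {
    assert(ops_per_clock >= 0.0 && clock_hz >= 0.0);
    return ops_per_clock * clock_hz / 1e12;
}
```

With the slide's numbers, the CPU range is `peak_tops(2, 2e9)`, about 0.004 Tops, up to `peak_tops(448, 4e9)`, about 1.79 Tops (shown as 1.8). The FPGA range is `peak_tops(10000, 250e6)`, about 2.5 Tops, up to `peak_tops(100000, 500e6)`, about 50 Tops. The FPGA's advantage comes from the number of concurrent operations, since its clock is several times slower.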

FPGAs are good at:

All bit-widths, e.g.
  video processing with 8-bit, 10-bit, 12-bit and more
  security and compression algorithms (e.g. 160-bit)
  machine learning using reduced precision with specialized bit-widths (8, 4, 2-bit, small floating point)
  signal processing with 8-bit, 16-bit, 18-bit, 24-bit, 32-bit integer multiply-accumulate and floating point
Streaming dataflow-oriented compute, e.g.
  video encode and decode
  streaming network functions
  hash functions and query functions
  machine learning
Specialized dedicated processor-style architectures
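As an illustration of the reduced-precision arithmetic listed above, the sketch below emulates narrow signed integers in standard C++. The helper names are hypothetical; an HLS flow would normally use the tool's own arbitrary-precision integer types, so that the hardware is only as wide as the data requires.

```cpp
#include <cassert>
#include <cstdint>

// Clamp `value` to the range of a signed integer that is `bits` wide,
// e.g. bits = 4 gives the range [-8, 7].
inline std::int32_t saturate_signed(std::int32_t value, int bits) {
    assert(bits >= 2 && bits <= 31);
    const std::int32_t max_v = (std::int32_t{1} << (bits - 1)) - 1;
    const std::int32_t min_v = -max_v - 1;
    if (value > max_v) return max_v;
    if (value < min_v) return min_v;
    return value;
}

// Multiply-accumulate over 8-bit inputs with a 32-bit accumulator, the
// basic operation behind reduced-precision machine learning and DSP.
inline std::int32_t dot_int8(const std::int8_t* x, const std::int8_t* w, int n) {
    assert(n >= 0);
    std::int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += static_cast<std::int32_t>(x[i]) * static_cast<std::int32_t>(w[i]);
    }
    return acc;
}
```

On a CPU the 4-bit case still occupies a full register. On an FPGA, a narrower data type translates directly into fewer LUTs and DSP elements per operation, so more operations fit on the chip.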

Conclusion

AWS opens new opportunities to leverage FPGAs.
There is a potential benefit with FPGAs for a number of applications.
Programming requires some additional effort.
You can do it!