CS295: Modern Systems What Are FPGAs and Why Should You Care

Presentation transcript:

CS295: Modern Systems
What Are FPGAs and Why Should You Care
Sang-Woo Jun
Spring 2019

What Are FPGAs
- Field-Programmable Gate Array (FPGA)
- Can be configured to act like any circuit (more on this later)
- Can do many things, but we focus on computation acceleration

FPGAs Come In Many Forms
- PCIe-attached
- In-storage
- CPU-integrated
- In-network

How Is It Different From CPU/GPUs
- GPU: the other major accelerator
- CPU/GPU hardware is fixed
  - “General purpose”
  - We write programs (sequences of instructions) for them
- FPGA hardware is not fixed
  - “Special purpose”
  - Hardware can be whatever we want
  - Will our hardware require/support software? Maybe!
- Optimized hardware is very efficient
  - GPU-level performance**
  - 10x power efficiency (300 W vs 30 W)

Analogy
- CPU/GPU comes with fixed circuits
- FPGA gives you a big bag of components to build whatever
  - Could be a CPU/GPU!
- [Images: “The Z-Berry”; “Experimental Investigations on Radiation Characteristics of IC Chips”; benryves.com “Z80 Computer”; Shadi Soundation: Homebrew 4 bit CPU]

Fine-Grained Parallelism of Special-Purpose Circuits
- Example: calculating gravitational force

$$\frac{G \times m_1 \times m_2}{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

- 8 instructions on a CPU → 8 cycles**
- Far fewer cycles on a special-purpose circuit
  - 4 cycles with basic operations
    - Cycle 1: A = G × m1, C = x1 - x2, E = y1 - y2
    - Cycle 2: B = A × m2, D = C^2, F = E^2
    - Cycle 3: H = D + F
    - Cycle 4: Ret = B / H
  - 3 cycles with compound operations
    - Cycle 1: A = G × m1 × m2, B = (x1 - x2)^2, C = (y1 - y2)^2
    - Cycle 2: D = B + C
    - Cycle 3: Ret = A / D
  - 1 cycle with even further compound operations
    - Ret = (G × m1 × m2) / ((x1 - x2)^2 + (y1 - y2)^2)
  - Compound operations may slow down the clock
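The cycle counts on this slide can be reproduced with a small scheduling sketch. The Python below is only an illustration, not part of the original deck: it writes the eight basic operations as a dataflow graph (the node names `a` through `h` and the helper `asap_schedule` are made up here) and assigns each operation the earliest cycle in which its operands are available. It assumes every basic operation takes exactly one cycle and that unlimited operations may run in parallel.

```python
# Dataflow graph for G*m1*m2 / ((x1-x2)^2 + (y1-y2)^2), one basic operation per node.
# Each entry maps a result name to the names it reads; primary inputs are not listed.
OPS = {
    "a": ("G", "m1"),     # a = G * m1
    "b": ("a", "m2"),     # b = a * m2
    "c": ("x1", "x2"),    # c = x1 - x2
    "d": ("c", "c"),      # d = c * c
    "e": ("y1", "y2"),    # e = y1 - y2
    "f": ("e", "e"),      # f = e * e
    "h": ("d", "f"),      # h = d + f
    "ret": ("b", "h"),    # ret = b / h
}


def asap_schedule(ops):
    """Assign each operation the earliest cycle in which all its operands are ready."""
    cycle = {}

    def ready(name):
        if name not in ops:          # primary input, available before cycle 1
            return 0
        if name not in cycle:
            cycle[name] = 1 + max(ready(src) for src in ops[name])
        return cycle[name]

    for name in ops:
        ready(name)
    return cycle


schedule = asap_schedule(OPS)
sequential_cycles = len(OPS)               # one instruction per cycle, one after another
parallel_cycles = max(schedule.values())   # depth of the dataflow graph

print(sequential_cycles, parallel_cycles)  # prints: 8 4
```

The depth of the graph, not the number of operations, bounds the cycle count of the special-purpose circuit. Merging nodes into compound operations shortens the depth further, at the cost of more logic per cycle.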

Coarse-Grained Parallelism of Special-Purpose Circuits
- The typical unit of parallelism for general-purpose units is threads ~= cores
- Special-purpose processing units can also be replicated for parallelism
  - Large, complex processing units: few can fit on a chip
  - Small, simple processing units: many can fit on a chip
- Only generates hardware useful for the application
  - Instruction? Decoding? Cache? Coherence?

How Is It Different From ASICs
- ASIC (Application-Specific Integrated Circuit)
  - Special chip purpose-built for an application
  - E.g., ASIC bitcoin miner, Intel neural network accelerator
  - Function cannot be changed once expensively built
- (+) FPGAs can be field-programmed
  - Function can be changed completely whenever
  - FPGA fabric emulates custom circuits
- (-) Emulated circuits are not as efficient as bare-metal
  - ASICs: ~10x performance (larger circuits, faster clock)
  - ASICs: ~10x power efficiency

Basic FPGA Architecture
- Programmable “Configurable logic block (CLB)”
  - ~6-input look-up table (LUT)
  - Flip-flop (FF) / latch for sequential circuit construction
- Programmable interconnect
- I/O block
- Ex) 2-LUT for “AND”

  Input 1   Input 2   Output
  0         0         0
  0         1         0
  1         0         0
  1         1         1
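A look-up table can be modeled in a few lines of software. The sketch below is only an illustration (the `LUT` class and its argument names are invented here, not taken from any vendor tool): a k-input LUT is treated as a small memory of 2^k configuration bits addressed by its inputs, so "programming" it means choosing those bits.

```python
class LUT:
    """A k-input look-up table: 2**k configuration bits, one per input combination."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        # Treat the inputs as a binary address, first input as the most significant bit.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


# "Programming" the LUT means choosing its configuration bits.
and2 = LUT(2, [0, 0, 0, 1])   # output is 1 only at address 0b11
xor2 = LUT(2, [0, 1, 1, 0])   # same structure, different configuration

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and2(a, b), xor2(a, b))
```

The same structure implements AND or XOR depending only on the stored bits. A 6-input LUT would hold 64 such bits.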

Basic FPGA Architecture – DSP Blocks
- CLBs act as gates: many are needed to implement high-level logic
- Arithmetic operations are provided as efficient ALU blocks
  - “Digital Signal Processing (DSP) blocks”
  - Each block provides an adder + multiplier

Basic FPGA Architecture – Block RAM
- CLBs can act as flip-flops (~1 bit/block): tiny!
- Some on-chip SRAM is provided as blocks
  - ~18/36 Kbit/block, MBs per chip
  - Massively parallel access to data → multi-TB/s bandwidth
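A back-of-the-envelope calculation shows how the multi-TB/s figure can arise. The Python below is a rough sketch: the block count, port width, and clock frequency are illustrative assumptions chosen for round numbers, not specifications of any particular chip. Only the 36 Kbit block size comes from the slide.

```python
# All three numbers below are illustrative assumptions, not figures for a specific chip.
NUM_BLOCKS = 2000          # assumed number of block RAMs on the chip
PORT_WIDTH_BITS = 72       # assumed width of one read port per block
CLOCK_HZ = 300e6           # assumed clock frequency

BLOCK_SIZE_BITS = 36 * 1024   # 36 Kbit per block, as on the slide


def aggregate_bandwidth_bytes_per_s(num_blocks, port_width_bits, clock_hz):
    """Bandwidth if every block delivers one word per cycle, all in parallel."""
    bits_per_cycle = num_blocks * port_width_bits
    return bits_per_cycle / 8 * clock_hz


capacity_bytes = NUM_BLOCKS * BLOCK_SIZE_BITS // 8
bw = aggregate_bandwidth_bytes_per_s(NUM_BLOCKS, PORT_WIDTH_BITS, CLOCK_HZ)

print(f"{capacity_bytes / 1e6:.1f} MB on chip")   # prints: 9.2 MB on chip
print(f"{bw / 1e12:.1f} TB/s aggregate")          # prints: 5.4 TB/s aggregate
```

The bandwidth comes from reading thousands of small memories at once, not from any single fast port.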

Basic FPGA Architecture – Hard Cores
- Some functions are provided as efficient, non-configurable “hard cores”
  - Multi-core ARM cores (“Zynq” series)
  - Multi-Gigabit Transceivers
  - PCIe/Ethernet PHY
  - Memory controllers
  - …

Example Accelerator Card Architecture
- “FPGA Mezzanine Card” (FMC) expansion
  - Network ports, memory, storage, PCIe, …
- General-purpose I/O pins
- Multi-Gigabit Transceivers
- [Diagram labels: FPGA, FMC, DRAM, DRAM, 1GbE, 40GbE, PCIe]

Example Accelerator Card (VCU108)

Programming FPGAs
- Languages and tools overlap with ASIC/VLSI design
- FPGA design for acceleration is typically done with either:
  - Hardware Description Languages (HDL): Register-Transfer Level (RTL) languages
  - High-Level Synthesis: compiler translates software programming languages to RTL
- RTL models a circuit using:
  - Registers (state), and
  - Combinational logic (computation)
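The register/combinational split can be imitated in software. The following Python sketch is an illustration, not RTL: `next_state` plays the role of combinational logic (a pure function of the current register values and inputs), and `simulate` plays the role of the clock, replacing the register contents once per cycle. The 8-bit accumulator and its `reset`/`enable`/`data` inputs are a made-up example circuit.

```python
def next_state(state, inputs):
    """Combinational logic: a pure function of current registers and inputs."""
    acc = state["acc"]
    if inputs["reset"]:
        return {"acc": 0}
    if inputs["enable"]:
        return {"acc": (acc + inputs["data"]) & 0xFF}   # 8-bit register wraps around
    return {"acc": acc}


def simulate(initial_state, input_trace):
    """Registers: state is only replaced at each clock edge, once per trace entry."""
    state = dict(initial_state)
    history = [state["acc"]]
    for inputs in input_trace:
        state = next_state(state, inputs)   # clock edge: registers latch new values
        history.append(state["acc"])
    return history


trace = [
    {"reset": 1, "enable": 0, "data": 0},
    {"reset": 0, "enable": 1, "data": 200},
    {"reset": 0, "enable": 1, "data": 100},
    {"reset": 0, "enable": 0, "data": 7},
]
history = simulate({"acc": 5}, trace)
print(history)   # prints: [5, 0, 200, 44, 44]
```

The third step shows the 8-bit wrap-around (200 + 100 = 300, stored as 44), and the last step shows the register holding its value when `enable` is low.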

Hardware Description Language
- Software programming languages: describe process
- Hardware description languages: describe structure

Software (C++): instructions for the CPU; the queues exist in memory

    std::queue<float> input_queue;
    std::queue<float> output_queue;
    float factor;
    while (true) {
        if (!input_queue.empty()) {
            float ret = input_queue.front() * factor;
            output_queue.push(ret);
            input_queue.pop();
        }
    }

Hardware (Bluespec): creates circuits; the FIFOs exist on chip

    FIFO#(Float) input_queue <- mkFIFO;
    FIFO#(Float) output_queue <- mkFIFO;
    Reg#(Float) factor <- mkReg;
    FloatMultIfc mult <- mkFloatMult;
    rule in;
        mult.enq(factor, input_queue.first);
        input_queue.deq;
    endrule
    rule out;
        let ret <- mult.result;
        output_queue.enq(ret);
    endrule
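To make the process/structure contrast concrete, the two hardware rules can be stepped cycle by cycle in software. The Python below is a loose model and not Bluespec semantics in full: it assumes the multiplier has a latency of exactly one cycle and that the output queue never fills up, and the function `run_circuit` and its bookkeeping are invented for this illustration.

```python
from collections import deque


def run_circuit(inputs, factor, cycles):
    """Cycle-by-cycle model of the two rules with a one-stage multiplier."""
    input_queue = deque(inputs)
    output_queue = deque()
    mult_stage = None            # pipeline register inside the multiplier
    fired = []                   # which rules fired in each cycle: (in, out)

    for _ in range(cycles):
        fire_out = mult_stage is not None      # rule "out" needs a finished result
        fire_in = len(input_queue) > 0         # rule "in" needs a waiting input
        # Both rules read the state from the start of the cycle, so they can
        # fire together, unlike the statements of the software loop.
        result = mult_stage
        next_stage = factor * input_queue.popleft() if fire_in else None
        if fire_out:
            output_queue.append(result)
        mult_stage = next_stage
        fired.append((fire_in, fire_out))
    return list(output_queue), fired


outputs, fired = run_circuit([1.0, 2.0, 3.0], factor=2.0, cycles=4)
print(outputs)   # prints: [2.0, 4.0, 6.0]
print(fired)
```

In the second and third cycles both rules fire at once: one value enters the multiplier while the previous result leaves it. The software loop handles one element at a time, whereas the circuit's pieces all exist and operate simultaneously.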

Major Hardware Description Languages
- Verilog: most widely used in industry
  - Relatively low-level language supported by everyone
- Chisel: compiles to Verilog
  - Relatively high-level language from Berkeley
  - Embedded in the Scala programming language
  - Prominently used in RISC-V development (Rocket core, etc.)
- Bluespec: compiles to Verilog
  - Relatively high-level language from MIT
  - Supports types, interfaces, etc.
  - Also active in RISC-V development (Piccolo, etc.)

High-Level Synthesis
- Compiler translates software programming languages to RTL
- High-Level Synthesis compilers from Xilinx, Altera/Intel
  - Compile C/C++, annotated with #pragmas, into RTL
  - Theory/history behind it is a complex can of worms we won’t go into
  - Personal experience: needs to be HEAVILY annotated to get performance
  - Anecdote: naïve RISC-V in Vivado HLS achieves IPC of 0.0002 [1], 0.04 after optimizations [2]
- OpenCL
  - Inherently parallel language, more efficiently translated to hardware
  - Stable software interface

[1] http://msyksphinz.hatenablog.com/entry/2019/02/20/040000
[2] http://msyksphinz.hatenablog.com/entry/2019/02/27/040000

FPGA Compilation Toolchain
- Flow: high-level HDL code → language compiler → Verilog/VHDL → synthesize → netlist → map/place/route → bitfile
- Language compiler: high-level language vendor tool
- Synthesis and map/place/route: FPGA vendor toolchain (few open source)
- Simulation: functional simulation, cycle-level simulation
- Constraint file: “Which transceiver instance should top_transceiver_01 map to?” And so, so much more…

Programming/Using an FPGA Accelerator
- Bitfile is programmed to the FPGA over the “JTAG” interface
  - Typically used over a USB cable
  - Supports FPGA programming, limited debugging access, etc.
- A PCIe-attached FPGA accelerator card is typically used similarly to GPUs
  - Program FPGA, execute software
  - Software copies data to the FPGA board and notifies the FPGA → FPGA logic performs computations → software copies data back from the FPGA
- FPGA flexibility gives immense freedom of usage patterns
  - Streaming, coherent memory, …
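The GPU-like usage pattern (copy in, notify, compute, copy out) can be sketched with a mock device. Everything in the Python below is hypothetical: `MockFpgaCard`, its method names, and the bitfile name `scale.bit` are stand-ins invented for illustration and do not correspond to any real driver or vendor API. The "FPGA logic" is simulated by an ordinary list comprehension.

```python
class MockFpgaCard:
    """Hypothetical stand-in for a PCIe-attached card; not a real driver API."""

    def __init__(self):
        self.bitfile = None
        self.board_memory = {}

    def program(self, bitfile_name):
        self.bitfile = bitfile_name              # a real card would load a bitfile here

    def copy_to_board(self, name, data):
        self.board_memory[name] = list(data)     # host -> board copy

    def start(self, src, dst, factor):
        if self.bitfile is None:
            raise RuntimeError("FPGA must be programmed before use")
        # Stand-in for the FPGA logic: scale every element by `factor`.
        self.board_memory[dst] = [x * factor for x in self.board_memory[src]]

    def copy_from_board(self, name):
        return list(self.board_memory[name])     # board -> host copy


def offload_scale(card, data, factor):
    card.copy_to_board("in", data)      # 1. software copies data to the board
    card.start("in", "out", factor)     # 2. software notifies the FPGA, which computes
    return card.copy_from_board("out")  # 3. software copies results back


card = MockFpgaCard()
card.program("scale.bit")               # hypothetical bitfile name
result = offload_scale(card, [1.0, 2.0, 3.0], factor=2.0)
print(result)   # prints: [2.0, 4.0, 6.0]
```

This copy-compute-copy pattern is only one option. A streaming design would instead feed data through the FPGA continuously without staging it in board memory first.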

Partial Reconfiguration
- Parts of the FPGA can be swapped out dynamically without turning off the FPGA
  - Physical area is drawn on chip
  - Toolchain support for isolation
- Used in Amazon F1, etc.
- [Diagram labels: FPGA, sub-components]

FPGAs In The Cloud
- Amazon EC2 F1 instance (1 – 4 FPGAs)
- Microsoft Azure, etc.