Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Greg Stitt
Department of Electrical and Computer Engineering, University of Florida
Introduction
Improved performance enables new applications.
Past decade: MP3 players, portable game consoles, cell phones, etc.
Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.
Introduction
FPGAs (Field Programmable Gate Arrays) implement custom circuits.
FPGAs are capable of large performance improvements: speedups of 10x, 100x, even 1000x for scientific and embedded applications [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
But FPGAs are not mainstream.
Warp processing goal: bring FPGAs into the mainstream by making FPGAs “invisible”.
Figure: performance of uP vs. FPGA.
Introduction – Hardware/Software Partitioning
C code for FIR filter:
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ..
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]
    ..
Software only: the compiler targets the processor, which takes ~1000 cycles.
Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94].
The designer creates custom hardware using a hardware description language (HDL).
Processor + FPGA: hardware for the loop takes ~10 cycles.
Speedup = 1000 cycles / 10 cycles = 100x
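The multiply-accumulate loop and the speedup arithmetic on this slide can be written out as a minimal C sketch. The function names `fir_mac` and `speedup` and the use of `int` data are assumptions made for illustration; the slide shows only the loop body and the cycle counts.

```c
#include <stddef.h>

/* The loop body from the slide: y[i] += c[i] * x[i].
 * The element type (int) is an assumption; the slide does not give one. */
void fir_mac(int *y, const int *c, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += c[i] * x[i];
    }
}

/* Speedup = software cycles / hardware cycles.
 * With the slide's figures: 1000 / 10 = 100. */
double speedup(double sw_cycles, double hw_cycles)
{
    return sw_cycles / hw_cycles;
}
```

Each iteration is independent of the others, which is the kind of region a partitioner would consider moving to hardware, where the multiplications can run in parallel.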
Introduction – High-level Synthesis
Problem: describing a circuit using HDL is time consuming and difficult.
Solution: high-level synthesis, which creates a circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92].
Allows developers to use a higher-level specification.
Potentially enables synthesis for software developers.
Figure: tool flow with High-level Code, Compiler, Decompilation, Hw/Sw Partitioning, High-level Synthesis, Bitstream, Libraries/Object Code, Linker, Updated Binary, uP, FPGA.
Introduction – High-level Synthesis
Example input to high-level synthesis:
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]
Problems with High-Level Synthesis
Problem: high-level synthesis is unattractive to software developers.
Requires a specialized language: SystemC, NapaC, HandelC, …
Requires a specialized compiler: Spark, ROCCC, CatapultC, …
Limited commercial success: software developers are reluctant to change tools.
Figure: non-standard software tool flow (Specialized Language, Specialized Compiler, Decompilation, Synthesis, Bitstream, Libraries/Object Code, Linker, Updated Binary, uP, FPGA).
Warp Processing – “Invisible” Synthesis
Solution: make synthesis “invisible”.
Two requirements:
Standard software tool flow: perform compilation before synthesis.
Hide the synthesis tool: move synthesis on chip.
Similar to dynamic binary translation [Transmeta], but translating to hardware.
Figure: standard software tool flow (High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary), with Decompilation and Synthesis moved after compilation.
Warp Processing – “Invisible” Synthesis
A warp processor looks like a standard uP but invisibly synthesizes hardware.
Warp Processing – “Invisible” Synthesis
Advantages:
Supports all languages, compilers, and IDEs (e.g., C, C++, Java, Matlab; gcc, g++, javac, keil).
Supports synthesis of assembly code.
Supports synthesis of library code.
Also enables dynamic optimizations.
Warp Processing Background: Basic Idea
Figure: µP, I Mem, D$, FPGA, Profiler, On-chip CAD.
Step 1: Initially, the software binary is loaded into instruction memory.
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Warp Processing Background: Basic Idea
Step 2: The microprocessor executes the instructions in the software binary.
Warp Processing Background: Basic Idea
Step 3: The profiler monitors instructions and detects critical regions in the binary.
Figure: the profiler observes the add and beq instructions and reports a critical loop.
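One plausible way to detect a critical loop is to count how often each backward branch is taken and report the most frequent branch target. The slide does not say how the profiler works internally, so the sketch below is an assumption, and every name in it (`profiler`, `profiler_observe`, `profiler_hottest`, `PROFILE_SLOTS`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROFILE_SLOTS 16  /* hypothetical table size */

typedef struct {
    uint32_t target;  /* address of the loop head (branch target) */
    uint32_t count;   /* times a backward branch to this target was taken */
    int      used;    /* nonzero once the slot holds an entry */
} loop_entry;

typedef struct {
    loop_entry slots[PROFILE_SLOTS];
} profiler;

void profiler_init(profiler *p)
{
    memset(p, 0, sizeof *p);
}

/* Record one taken branch. Only branches whose target is at or before
 * the branch address are counted, since those suggest a loop. */
void profiler_observe(profiler *p, uint32_t pc, uint32_t target)
{
    if (target > pc) {
        return;  /* forward branch: ignore */
    }
    loop_entry *free_slot = NULL;
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        loop_entry *e = &p->slots[i];
        if (e->used && e->target == target) {
            e->count++;
            return;
        }
        if (!e->used && free_slot == NULL) {
            free_slot = e;
        }
    }
    if (free_slot != NULL) {
        free_slot->used = 1;
        free_slot->target = target;
        free_slot->count = 1;
    }
    /* Table full: the sample is dropped. */
}

/* Report the most frequently taken loop head.
 * Returns 1 and fills *target and *count, or returns 0 if nothing was recorded. */
int profiler_hottest(const profiler *p, uint32_t *target, uint32_t *count)
{
    const loop_entry *best = NULL;
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        const loop_entry *e = &p->slots[i];
        if (e->used && (best == NULL || e->count > best->count)) {
            best = e;
        }
    }
    if (best == NULL) {
        return 0;
    }
    *target = best->target;
    *count = best->count;
    return 1;
}
```

A small fixed-size table keeps the bookkeeping cheap, which matters if the profiler has to run alongside the program without slowing it down.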
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in the critical region.
Warp Processing Background: Basic Idea
Step 5: On-chip CAD (labeled Dynamic Part. Module (DPM) in the figure) converts the critical region into a control data flow graph (CDFG).
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
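Read as high-level code, the CDFG on this slide is a loop that sums ten array elements. The C sketch below is one possible rendering, not the tool's actual output. It assumes byte-addressed memory, so the shift by 1 implies 2-byte elements; the signed 16-bit element type and the function name `sum10` are also assumptions.

```c
#include <stdint.h>

/* Register mapping taken from the slide's CDFG:
 *   reg2 -> base address of the array (a)
 *   reg3 -> loop index (i)
 *   reg4 -> running sum (sum) */
int32_t sum10(const int16_t *a)
{
    int32_t sum = 0;  /* reg4 := 0 */
    int i = 0;        /* reg3 := 0 */
    do {
        sum += a[i];  /* reg4 := reg4 + mem[reg2 + (reg3 << 1)] */
        i++;          /* reg3 := reg3 + 1 */
    } while (i < 10); /* if (reg3 < 10) goto loop */
    return sum;       /* ret reg4 */
}
```

The do-while form mirrors the branch at the bottom of the loop. Recovering the array access `a[i]` from the shift-and-add address arithmetic is the kind of information that makes the later synthesis step effective.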
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps the circuit onto the FPGA.
Figure: FPGA fabric (CLB, SM) holding the adders (+) of the circuit.
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more.
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
Figure: software-only vs. “warped” execution.
Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations.
Develop extremely efficient on-chip CAD tools.
Requires efficient synthesis.
Requires a specialized FPGA and physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona.
Figure: on-chip CAD flow with Binary, HW Synthesis, Logic Optimization, Technology Mapping, Placement & Routing, Updated Binary.
JIT FPGA compilation: 46x improvement, 30% perf. penalty.
Warp Processing: Initial Results - Embedded Applications
Average speedup of 6.3x, achieved completely transparently.
Also, energy savings of 66%.
Thread Warping - Overview
Trend: multi-core platforms and multi-threaded apps.
Example application:
    for (i = 0; i < 10; i++) {
        thread_create( f, i );
    }
The OS schedules threads onto available µPs; remaining threads are added to a queue.
The OS invokes on-chip CAD tools to create accelerators for f().
The OS schedules threads onto accelerators (possibly dozens), in addition to µPs.
Thread warping: use one core to create accelerators for waiting threads.
Very large speedups are possible: parallelism at the bit, arithmetic, and now thread level too.
Figure: µPs, OS, FPGA, On-chip CAD, Acc. Lib, Compiler, Binary.
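The scheduling steps on this slide can be sketched as a simple placement function. It is an illustration only: the slide does not describe the OS scheduler's data structures, so the names `placement` and `schedule_threads` and the first-come placement order are assumptions.

```c
#include <stddef.h>

typedef enum { RUN_ON_UP, RUN_ON_ACCEL, WAITING } placement;

/* Place n_threads threads: first onto the available uPs, then onto any
 * accelerators that on-chip CAD has already created; the rest stay queued.
 * Returns the number of threads still waiting. */
size_t schedule_threads(placement *out, size_t n_threads,
                        size_t n_ups, size_t n_accels)
{
    size_t waiting = 0;
    for (size_t t = 0; t < n_threads; t++) {
        if (t < n_ups) {
            out[t] = RUN_ON_UP;
        } else if (t - n_ups < n_accels) {
            out[t] = RUN_ON_ACCEL;
        } else {
            out[t] = WAITING;
            waiting++;
        }
    }
    return waiting;
}
```

Calling it for the slide's ten threads with two µPs and no accelerators leaves eight threads queued. Calling it again once eight accelerators for f() exist leaves none, which corresponds to the point where the queued threads move onto the FPGA.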
Speedup from Thread Warping
Average 130x speedup.
But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 uPs (FPGA size = ~36 ARM11s).
11x faster than a 64-core system.
Simulation is pessimistic; actual results are likely better.
Dynamic enables Custom Communication
NoC (Network on a Chip) provides communication between multiple cores.
Problem: the best topology is application dependent.
Figure: bus and mesh topologies for App1 and App2.
Dynamic enables Custom Communication
Warp processing can dynamically choose the topology.
Figure: µPs connected through the FPGA, with bus and mesh topologies for App1 and App2.
Summary
Warp processors: achieve the performance advantages of FPGAs without any extra effort.
“Invisible” synthesis: allows designers to use existing tools/languages and enables dynamic hardware optimization.
Thread warping: dynamic synthesis of thread accelerators for multi-cores.
Custom communication: warp processing can adapt the communication topology to the needs of an application or a particular workload.
References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending,
Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp
Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp
Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES)
Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC),
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, and B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp
Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp