Portability for FPGA Applications—Warp Processing and SystemC Bytecode Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ.

Presentation transcript:

Portability for FPGA Applications: Warp Processing and SystemC Bytecode
Frank Vahid, Dept. of CS&E, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Contributing Ph.D. Students: Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona), Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville), Scotty Sirowy (current), David Sheldon (current), Chen Huang (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.

Frank Vahid, UC Riverside 2/52 Portable Applications on PCs: one x86 binary runs on multiple platforms (Pentium, Atom, Opteron, dual-core parts). How? Why?

Frank Vahid, UC Riverside 3/52 Portable Applications on PCs: a standard software binary supports an “ecosystem” of applications, tools, and architectures. Dynamic software binary translation preserves that portability: the x86 binary runs directly on an x86 µP, or SW binary translation turns it into a VLIW binary for a VLIW µP.

Frank Vahid, UC Riverside 4/52 Meanwhile, Circuits on FPGAs Show Large Speedups Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, RAW, …

Frank Vahid, UC Riverside 5/52 FPGAs Entering Computing Mainstream Xilinx Virtex II Pro. Source: Xilinx SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs) AMD Opteron Intel QuickAssist Cray, SGI Mitrionics IBM Cell (research) Xilinx, Altera

Frank Vahid, UC Riverside 6/52 Circuits on FPGAs are Software Binaries. Microprocessor binaries (instructions) are bits loaded into program memory; FPGA “binaries” (circuits), aka "bitstreams", are bits loaded into LUTs and SMs. Both are "software", not hardware (Sep 2007 IEEE Computer).

Frank Vahid, UC Riverside 7/52 “Portable Applications” + “FPGAs”: keep the standard software binary and its “ecosystem” of applications, tools, and architectures. Just as SW binary translation turns an x86 binary into a VLIW binary, “Warp Processing” dynamically translates the x86 binary into an FPGA binary for an FPGA next to the x86 µP.

Frank Vahid, UC Riverside 8/52 Warp Processing, step 1: Initially, software binary loaded into instruction memory.
Platform: µP, I Mem, D$, Profiler, On-chip CAD, FPGA.
Software Binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Frank Vahid, UC Riverside 9/52 Warp Processing, step 2: Microprocessor executes instructions in software binary.

Frank Vahid, UC Riverside 10/52 Warp Processing, step 3: Profiler monitors instructions and detects critical regions in binary. Watching the add and beq instructions of the loop, the profiler reports: Critical Loop Detected.
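A minimal sketch of how such a profiler could flag a critical loop, assuming it simply counts taken backward branches per target address. The table size, type names, and function names are hypothetical and are not taken from the slides; the actual on-chip profiler may work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_ENTRIES 16  /* hypothetical table size */

typedef struct {
    uint32_t target;  /* address of the first instruction of the loop */
    uint32_t count;   /* taken backward branches seen for this target */
    int      used;    /* nonzero once the slot holds a loop */
} ProfileEntry;

typedef struct {
    ProfileEntry entries[PROFILE_ENTRIES];
} Profiler;

void profiler_init(Profiler *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < PROFILE_ENTRIES; i++) {
        p->entries[i].target = 0;
        p->entries[i].count = 0;
        p->entries[i].used = 0;
    }
}

/* Record one taken branch at address pc that jumps to target.
   Only backward branches are counted, since those close loops. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target > pc) {
        return;  /* forward branch: not a loop */
    }
    ProfileEntry *free_slot = NULL;
    for (size_t i = 0; i < PROFILE_ENTRIES; i++) {
        ProfileEntry *e = &p->entries[i];
        if (e->used && e->target == target) {
            e->count++;
            return;
        }
        if (!e->used && free_slot == NULL) {
            free_slot = e;
        }
    }
    if (free_slot != NULL) {
        free_slot->used = 1;
        free_slot->target = target;
        free_slot->count = 1;
    }
    /* Table full: this sketch simply drops the new loop. */
}

/* Store the target of the most frequently taken loop in *target_out.
   Returns 1 if any loop was seen, 0 otherwise. */
int profiler_hottest(const Profiler *p, uint32_t *target_out)
{
    assert(p != NULL && target_out != NULL);
    const ProfileEntry *best = NULL;
    for (size_t i = 0; i < PROFILE_ENTRIES; i++) {
        const ProfileEntry *e = &p->entries[i];
        if (e->used && (best == NULL || e->count > best->count)) {
            best = e;
        }
    }
    if (best == NULL) {
        return 0;
    }
    *target_out = best->target;
    return 1;
}
```

Feeding the sketch nine taken branches back to the loop label and one from an unrelated loop makes `profiler_hottest` report the loop label as the critical region.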

Frank Vahid, UC Riverside 11/52 Warp Processing, step 4: On-chip CAD reads in critical region.

Frank Vahid, UC Riverside 12/52 Warp Processing, step 5: On-chip CAD (Dynamic Part. Module, DPM) decompiles critical region into control data flow graph (CDFG). Recover loops, arrays, subroutines, etc.; needed to synthesize good circuits.
CDFG:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Frank Vahid, UC Riverside 13/52 Warp Processing, step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit.

Frank Vahid, UC Riverside 14/52 Warp Processing, step 7: On-chip CAD maps circuit onto FPGA (CLBs and SMs).

Frank Vahid, UC Riverside 15/52 Warp Processing, step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more. >10x speedups for some apps, comparing software-only with “warped” execution. (“Warp speed, Scotty”)
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4

Frank Vahid, UC Riverside 16/52 Warp Processing Challenges: Can we decompile binaries sufficiently for synthesis? Can we just-in-time (JIT) compile to FPGAs?
On-chip CAD components: profiling & partitioning, decompilation (binary to CDFG), JIT FPGA compilation (CDFG to FPGA binary), and binary updater (producing the updated microprocessor binary). Platform: µP, I$, D$, FPGA, Profiler, On-chip CAD.

Frank Vahid, UC Riverside 17/52 Decompilation
Recover high-level information from binary: branches, loops, arrays, subroutines, …
Adapted previous methods for processor-processor translation (UQBT).
Developed new synthesis-oriented methods (e.g., “reroll” loops, strength “promotion”).

Original C Code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding Assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/Data Flow Graph Creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Data Flow Analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Function Recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

Control Structure Recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

Array Recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

The original C code and the array-recovered code are almost identical representations.
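The claim that the recovered form matches the source can be checked with a small sketch that runs both on the same data. The names `f_source` and `f_recovered` and the byte-array model of `mem` are assumptions made for illustration; the sketch also assumes a 16-bit `short`, which is what the shift by 1 in the address computation implies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Source-level form from the slide, with the accumulator initialized. */
long f_source(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}

/* Register-level form after data flow analysis: reg2 is the byte address
   of the array, reg3 the loop index, reg4 the accumulator. Memory is
   modeled as a plain byte array (an assumption of this sketch). */
long f_recovered(const unsigned char *mem, long reg2)
{
    assert(mem != NULL);
    assert(sizeof(short) == 2);  /* the << 1 scaling assumes 2-byte elements */
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        short value;
        /* mem[reg2 + (reg3 << 1)]: load one 16-bit element */
        memcpy(&value, mem + reg2 + (reg3 << 1), sizeof value);
        reg4 += value;
    }
    return reg4;
}
```

Copying an array into the byte array at some offset and passing that offset as `reg2` gives the same sum from both functions, which is the sense in which array recovery only has to rediscover that `reg2 + (reg3 << 1)` is indexing a `short` array.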

Frank Vahid, UC Riverside 18/52 Decompilation Results vs. C Synthesis from decompiled binary is competitive with synthesis from C

Frank Vahid, UC Riverside 19/52 Decompilation Results on Optimized H.264 In-depth Study with Freescale Again, competitive with synthesis from C

Frank Vahid, UC Riverside 20/52 Decompilation Effective Even with Compiler Optimizations. Do compiler optimizations hurt decompilation? (Surprisingly) found optimized code synthesizes to even better circuits. Chart: average speedup of 10 examples when decompiled binary is partitioned and synthesized to FPGA.

Frank Vahid, UC Riverside 21/52 Decompilation Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis Stitt et al ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07 Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)

Frank Vahid, UC Riverside 22/52 Warp Processing Challenges: Can we decompile binaries sufficiently for synthesis? Can we just-in-time (JIT) compile to FPGAs?
On-chip CAD components: profiling & partitioning, decompilation (binary to CDFG), JIT FPGA compilation (CDFG to FPGA binary), and binary updater (producing the updated microprocessor binary). Platform: µP, I$, D$, FPGA, Profiler, On-chip CAD.

Frank Vahid, UC Riverside 23/52 Challenge: JIT Compile to FPGA
Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.:
Logic synthesis: run single expand phase (the diagram shows expand, reduce, and irredundant steps over the on-set, dc-set, and off-set)
Technology mapping: bottom-up graph clustering heuristic
Placement: place critical path first, then adjacent items
Routing: use resource graph that matches switch matrix / channel structure
Comparison over logic synthesis, tech. map., placement, and routing (drawn to scale):
Commercial tool: 9.1 s, 60 MB
Ultra-lean Riverside JIT FPGA tools: 0.2 s, 3.6 MB
Ultra-lean Riverside JIT FPGA tools on a 75 MHz ARM7: 1.4 s, 3.6 MB
Penalty: 1.3-2x in performance & size (even more might be acceptable)

Frank Vahid, UC Riverside 24/52 JIT Compile to FPGA Summary: Ultra-lean JIT FPGA compiler: 40x speedup, 20x less memory, 1.3x-2x circuit penalty. Lysecky et al, DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06. Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)

Frank Vahid, UC Riverside 25/52 Warp Processing Results. Chart: performance speedup (most frequent kernel only), where 1 = ARM-only execution. Average kernel speedup of 41. Overall application speedup average is 7.4 vs. 200 MHz ARM.
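The gap between kernel speedup and overall application speedup is what Amdahl's law predicts when only part of the run time is accelerated. The sketch below is a generic illustration, not the method used to produce the slide's averages; the function name and the example fraction are hypothetical.

```c
#include <assert.h>

/* Overall speedup when a fraction of the software-only run time
   (kernel_fraction, between 0 and 1) is accelerated by kernel_speedup
   and the rest runs at the original speed (Amdahl's law). */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For example, if a kernel took 90% of the original run time (a made-up figure) and were sped up 41x, the whole application would speed up by 41/5 = 8.2x, because the remaining 10% still runs in software.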

Frank Vahid, UC Riverside 26/52 Warping Thread-Based Applications
Multi-core platforms lead to multi-threaded apps.
    for (i = 0; i < 10; i++) {
      thread_create( f, i );
    }
The compiler produces a binary in which f() is the thread function. Platform: several µPs, OS, on-chip CAD, Acc. Lib, FPGA.
OS schedules threads onto available µPs. Remaining threads added to queue.
OS invokes on-chip CAD tools to create accelerators for f().
OS schedules threads onto accelerators (possibly dozens), in addition to µPs.
Thread warping: use one core to create accelerator for waiting threads.
Very large speedups possible: parallelism at bit, arithmetic, and now thread level too.

Frank Vahid, UC Riverside 27/52 Memory Access Synchronization (MAS)
Must deal with widely known memory bottleneck problem: FPGAs great, but often can’t get data to them fast enough. Data for dozens of threads can create bottleneck between RAM/DMA and the FPGA.
    for (i = 0; i < 10; i++) {
      thread_create( thread_function, a, i );
    }

    void f( int a[], int val ) {
      int result = 0;
      for (int i = 0; i < 10; i++) {
        result += a[i] * val;
      }
      ....
    }
Threaded programs exhibit unique feature: multiple threads often access same or overlapping data (here, the same array).
Solution: fetch data once, broadcast to multiple threads (MAS).

Frank Vahid, UC Riverside 28/52 Memory Access Synchronization (MAS)
Detect overlapping memory regions: “windows”. Each thread accesses different addresses, but addresses may overlap.
    for (i = 0; i < 100; i++) {
      thread_create( thread_function, a, i );
    }

    void f( int a[], int i ) {
      int result = 0;
      result += a[i]+a[i+1]+a[i+2]+a[i+3];
      ....
    }
Windows: A[0-3], A[1-4], …, A[6-9], … Data A[0-103] streamed from RAM via DMA to “smart buffer”; buffer delivers window to each thread.
Synthesis creates active “smart buffer” [Guo/Najjar FPGA04]: actively fetches data, stores the reused data, delivers windows to threads. Active rather than passive component; designed for specific threads.
W/O smart buffer: 400 memory accesses. With smart buffer: 104 memory accesses.
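The access counts can be reproduced with a short sketch, assuming thread i reads the window a[i] through a[i+window-1] and the smart buffer fetches each distinct element from RAM exactly once. The function names are hypothetical.

```c
#include <assert.h>

/* RAM accesses when every thread fetches its whole window itself. */
long accesses_without_buffer(long threads, long window)
{
    assert(threads >= 0 && window >= 0);
    return threads * window;
}

/* RAM accesses when each distinct element is fetched once and then
   delivered to the threads from the buffer. Thread i reads
   a[i] .. a[i + window - 1], so the windows together cover
   threads + window - 1 consecutive elements. */
long accesses_with_buffer(long threads, long window)
{
    assert(threads >= 0 && window >= 0);
    if (threads == 0 || window == 0) {
        return 0;
    }
    return threads + window - 1;
}
```

With 100 threads and a window of 4, the unbuffered count is 400, matching the slide. Under the loop bounds shown (i from 0 to 99) the buffered count comes out as 103 distinct elements; the slide's figure of 104 corresponds to streaming the range A[0-103], which is what 101 windows would cover. Either way the reduction is roughly 4x.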

Frank Vahid, UC Riverside 29/52 Speedups from Thread Warping. Chose benchmarks with extensive parallelism. Four core (ARM MHz) base system; Virtex IV FPGA at circuit-specific clock frequency (~ MHz). Average 130x speedup. Still 20x faster than 32-core system (and 11x faster than 64-core). Simulation pessimistic, actual results likely better. FPGA more flexible. But, FPGA uses additional area: our FPGA size = ~36 ARM11s.

Frank Vahid, UC Riverside 30/52 Warp Scenarios. Warping takes time (seconds, minutes, or more); when is it useful?
Long-running applications: scientific computing, etc. On-chip CAD runs while the µP executes, and the FPGA then gives a single-execution speedup.
Recurring applications (save and reuse FPGA configurations): common in embedded systems. The first execution runs on the µP while on-chip CAD works; later executions use the FPGA. Might view as (long) boot phase.
For networked/docked devices, CAD can occur on server (ongoing work).

Frank Vahid, UC Riverside 31/52 Why Dynamic? Static good, but hiding FPGA opens technique to all sw platforms: standard languages/tools/binaries, and so the whole “ecosystem” of applications, tools, and architectures.
Static compiling to FPGAs: specialized language, specialized compiler, binary plus netlist, running on µP and FPGA.
Dynamic compiling to FPGAs: any language, any compiler, binary only, running on µP with on-chip CAD and FPGA.

Frank Vahid, UC Riverside 32/52 Synthesis-Friendly Applications Coding style impacts synthesis results

Frank Vahid, UC Riverside 33/52 Synthesis-Friendly Application Coding Guidelines
Conversion to Constants (CC)
Conversion to Fixed Point (CF)
Conversion to Explicit Data Flow (CEDF)
Conversion to Explicit Memory Accesses (CEMA)
Function Specialization (FS)
Constant Input Enumeration (CIE)
Loop Rerolling (LR)
Conversion to Explicit Control Flow (CECF)
Algorithmic Specialization (AS)
Pass-By-Value Return (PVR)

Frank Vahid, UC Riverside 34/52 Conversion to Explicit Control Flow (CECF)
Problem: Function pointers may prevent static control flow analysis.
Guideline: Don’t use function pointers. Replace with if-else, static calls. Makes possible targets explicit.

Before: synthesis unlikely to determine possible targets of function pointer, so the hardware that produces a[i] is unknown.
    void f( int (*fp) (int) ) {
      .....
      for (i=0; i < 10; i++) {
        a[i] = fp(i);
      }
    }

After: the synthesized circuit computes f1(i), f2(i), and f3(i) and selects a[i] through a 3x1 multiplexer controlled by fp.
    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      .....
      for (i=0; i < 10; i++) {
        if (fp == FUNC1)
          a[i] = f1(i);
        else if (fp == FUNC2)
          a[i] = f2(i);
        else
          a[i] = f3(i);
      }
    }

Frank Vahid, UC Riverside 35/52 Speedups from Synthesis-Friendly Coding Guidelines 10 guidelines For ~1,000 line benchmark: 5-6 changes typical, tens of minutes each Simple guidelines increased speedup to 6.5x

Frank Vahid, UC Riverside 36/52 Speedups from Synthesis-Friendly Coding Guidelines Original C code (Powerstone, Mediabench) Original average speedups with FPGA: 2.6x (excludes brev) Refined C code with guidelines Average speedup: 8.4x (excludes brev) Guidelines led to 3.5x improvement of speedup

Frank Vahid, UC Riverside 37/52 “Spatial” Algorithms for FPGAs. As FPGAs become more common, app writers may expect FPGA presence. Example: count patterns. Sequential algorithm: hash table, 10s of cycles per pattern. Spatial algorithm (for FPGA): pipelined stages, Level 1 through Level m, each with its own logic, stored pattern, and count; the current pattern flows through the stages. Spatial algorithm: essence is the connectivity of components, not the sequencing of instructions.

Frank Vahid, UC Riverside 38/52 Spatial Algorithms for FPGAs. Spatial algorithm 2: pipelined binary tree. The current pattern passes through Level 1, Level 2, Level 3, ..., Level n. Each level has its own logic and its own memory, holding 1, 2, 4, ..., 2^n patterns, with matching count memories of 1, 2, 4, ..., 2^n counts.

Frank Vahid, UC Riverside 39/52 Example: possible patterns pre-stored in binary search tree circuit. Stages 1 to 4 of the example map onto Levels 1, 2, 3, ..., n, with memories of 1, 2, 4, ..., 2^n patterns; the current pattern enters at Stage 1.

Frank Vahid, UC Riverside 40/52 to 43/52 Example (continued): further frames of the same pipelined binary tree diagram (Stages 1 to 4, Levels 1 to n, current pattern).
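The per-level decision made by the pipelined binary tree can be sketched in software, assuming the patterns are stored as a binary search tree in level order so that level k owns a contiguous block of the array (standing in for that level's memory). The three-level size, type name, and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LEVELS 3                        /* hypothetical tree depth */
#define NODES  ((1u << LEVELS) - 1u)    /* 1 + 2 + 4 stored patterns */

typedef struct {
    unsigned pattern[NODES];  /* binary search tree in level order */
    unsigned count[NODES];    /* one counter per stored pattern */
} PatternTree;

/* Walk one tree level per loop iteration, as one pipeline stage would.
   Returns 1 and increments the counter if the pattern is stored,
   0 if the search falls off the last level. */
int count_pattern(PatternTree *t, unsigned current)
{
    assert(t != NULL);
    size_t node = 0;
    for (int level = 0; level < LEVELS; level++) {
        if (current == t->pattern[node]) {
            t->count[node]++;
            return 1;
        }
        /* Smaller patterns go to the left child, larger to the right. */
        node = 2 * node + (current < t->pattern[node] ? 1 : 2);
    }
    return 0;
}
```

In the circuit each level is a separate stage with its own memory, so several patterns can be in flight at different levels at once; the C loop only models the sequence of comparisons a single pattern sees. That difference is the point of the spatial formulation: the structure of the tree becomes the connectivity of the stages.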

Frank Vahid, UC Riverside 44/52 Study of Spatial Algorithms in FCCM
FCCM papers describing fast application on FPGA. Examined 35 in depth (every other one): 6 used device-specific features; 9 represented expected synthesized circuit from the obvious sequential algorithm; 20 were spatially-oriented applications, e.g., earlier pipelined binary tree.
Year | Application | Type
2001 | 3D Vec. Normalization | Spatial
2001 | Efficient CAM Automated Sensor | Temporal
2001 | Regular Expression | Spatial
2002 | Hyperspectral Image | Spatial
2002 | Machine Vision | Spatial
2002 | RC4 | Temporal
2002 | Set Covering | Spatial
2002 | Template Matching | Spatial
2002 | Triangle Mesh | Spatial
2003 | Congruential Sieves | Temporal
2003 | Content Scanning | Temporal
2003 | F.P and Square Root | Spatial
2003 | Gaussian Noise | Spatial
2003 | TRNG D FDTD Method | Spatial
2004 | Deep Packet Filter Online Floating Point Molecular Dynamics | Spatial
2004 | Pattern Matching | Spatial
2004 | Seismic Migration | Spatial
2004 | Software Deceleration V.M Window Data Mining | Spatial
2005 | Cell Automata | Temporal
2005 | Particle Graphics | Spatial
2005 | Radiosity | Temporal
2005 | Transient Waves | Spatial
2005 | Road Traffic | Temporal
2006 | All Pairs Shortest Path | Spatial
2006 | Apriori Data Mining | Spatial
2006 | Molecular Dynamics | Spatial
2006 | Gaussian Elimination | Spatial
2006 | Radiation Dose | Temporal
2006 | Random Variates | Spatial

Frank Vahid, UC Riverside 45/52 Portable Spatial Applications? Current portable microprocessor binaries are sequential, with extensions for threads, processes, ... How to support spatial constructs? SystemC: ports, connections, timing model. Adds libraries and macros, still standard C++. Sequential and spatial constructs. Compiling links in the simulation kernel, giving a self-executing simulation. Intended for SoC simulation.

Frank Vahid, UC Riverside 46/52

Frank Vahid, UC Riverside 47/52 Bytecode: modern portability approach (Java, C#). A compiler produces bytecode, and a VM on each platform (Pentium, Atom, Opteron) runs it. Virtual Machine (VM): program that executes bytecode. May JIT compile to native architecture.
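A minimal sketch of what "a program that executes bytecode" means, using a tiny stack machine. The opcodes, stack size, and function name are invented for illustration and have nothing to do with the actual Java, C#, or SystemC bytecode formats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 32

/* Interpret the bytecode and return the value on top of the stack when
   OP_HALT is reached or the code runs out. OP_PUSH is followed by its
   operand in the next code slot. */
int vm_run(const int *code, size_t length)
{
    assert(code != NULL);
    int stack[STACK_MAX];
    size_t sp = 0;  /* number of values on the stack */
    size_t pc = 0;  /* index of the next instruction */
    while (pc < length) {
        switch (code[pc++]) {
        case OP_PUSH:
            assert(pc < length && sp < STACK_MAX);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            stack[sp - 2] = stack[sp - 2] + stack[sp - 1];
            sp--;
            break;
        case OP_MUL:
            assert(sp >= 2);
            stack[sp - 2] = stack[sp - 2] * stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
    assert(sp >= 1);
    return stack[sp - 1];
}
```

The same bytecode array runs unchanged wherever `vm_run` can be compiled, which is the portability argument of the slide; a JIT-compiling VM would translate hot bytecode sequences to native code instead of interpreting them one by one.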

Frank Vahid, UC Riverside 48/52 SystemC Bytecode? A compiler turns SystemC into SystemC bytecode; VMs then run that bytecode on a Pentium, on an FPGA, or on an Opteron + FPGA.

Frank Vahid, UC Riverside 49/52 SystemC Bytecode Compiler
Compiler flow: SystemC, Pinapa Front End (ELAB and AST, then Link), Bytecode Back End (Code Generation, Register Allocation), SystemC bytecode.
    class EDGE_DETECTOR : public sc_module {
      //signal declarations
      …
      EDGE_DETECTOR() {
        SC_METHOD(mainComp);
        sensitive << dataReady;
        SC_METHOD(getPixel);
        sensitive << clock.pos();
      }
      void getPixel() {
        …
        dataReady.write(1);
      }
      void mainComp() {
        int i, j;
        for (i = 0; i < 3; i++) {
          for (j = 0; j < 3; j++) {
            sumX = sumX + mem.read()*GX[i][j];
          }
        }
        …
        edge.write(sumX + sumY);
      }
    };

Frank Vahid, UC Riverside 50/52 SystemC Bytecode Emulator. The emulator is built on an FPGA: main processor, instruction memory, input memory, output memory, read signal memory, write signal memory, UART, buttons, LEDs, USB interface, and Accelerators 1 to 3. Bytecode uploadable via USB drive. Accelerators speed up emulation. “Warping” also possible: JIT compile SystemC bytecode portions to circuits on FPGA (µP, I$, D$, FPGA, Profiler, On-chip CAD).

Frank Vahid, UC Riverside 51/52 Dynamic Enables Expandable Logic Concept. Expandable RAM: system detects RAM during start, improves performance invisibly. Expandable Logic: warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware. Platform: µP, cache, RAM, DMA, profiler, warp tools, FPGA.

Frank Vahid, UC Riverside 52/52 Summary
FPGAs entering mainstream.
Portability of applications is important.
Dynamic binary translation to FPGAs (warp processing): shown feasible; extensive future work.
Trends towards FPGA ubiquity.
Microprocessor binaries need extensions for spatial constructs. One approach: SystemC bytecode and virtual machine; can also be warped for circuit-speed.