1
Portability for FPGA Applications: Warp Processing and SystemC Bytecode
Frank Vahid
Dept. of CS&E, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Contributing Ph.D. students:
  Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
  Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
  Scotty Sirowy (current)
  David Sheldon (current)
  Chen Huang (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.
2
Frank Vahid, UC Riverside 2/64
Portable Applications on PCs
One binary (x86 binary), multiple platforms (Pentium, Atom, Opteron, Dual Core)
How? Why?
3
Portable Applications on PCs
Standard software binary
Dynamic software binary translation
“Ecosystem”: Applications, Tools, Architectures
Diagram labels: x86 Binary, SW binary translation, VLIW Binary, x86 µP, VLIW µP
4
Meanwhile, Circuits on FPGAs Show Large Speedups
Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, RAW, …
5
FPGAs Entering Computing Mainstream
Xilinx Virtex II Pro (source: Xilinx)
SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)
AMD Opteron
Intel QuickAssist
Cray, SGI
Mitrionics
IBM Cell (research)
Xilinx, Altera
6
Circuits on FPGAs are Software
Microprocessor Binaries (Instructions): 001010010 … bits loaded into program memory of the processor
FPGA “Binaries” (Circuits), aka "bitstream": 01110100... bits loaded into LUTs and SMs of the FPGA
Both are "software", not "hardware" (Sep 2007 IEEE Computer)
7
“Portable Applications” + “FPGAs”
Standard software binary
Dynamic translation
“Ecosystem”: Applications, Tools, Architectures
Diagram labels (PC case): x86 Binary, SW binary translation, VLIW Binary, x86 µP, VLIW µP
Diagram labels (FPGA case, “Warp Processing”): x86 Binary, SW binary translation, FPGA binary, x86 µP, FPGA
8
Warp Processing
Diagram: µP, I Mem, D$, FPGA, Profiler, On-chip CAD
Step 1: Initially, software binary loaded into instruction memory
Software Binary:
  Mov reg3, 0
  Mov reg4, 0
loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
9
Warp Processing
Step 2: Microprocessor executes instructions in software binary
10
Warp Processing
Step 3: Profiler monitors instructions and detects critical regions in binary
Profiler sees add, beq, ...: Critical Loop Detected
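The slide does not say how the profiler recognizes a critical loop. As a minimal sketch, assuming it counts taken backward branches per target address and flags the hottest target once it passes a threshold, the idea could look like this in C. The type LoopProfiler, the function names, and the table size are all hypothetical.

```c
#include <stddef.h>

#define MAX_TARGETS 16

/* Hypothetical profiler state: one counter per backward-branch target. */
typedef struct {
    unsigned target[MAX_TARGETS];
    unsigned long count[MAX_TARGETS];
    size_t used;
} LoopProfiler;

void profiler_init(LoopProfiler *p) { p->used = 0; }

/* Record one taken branch from address pc to address target.
 * Only backward branches (target <= pc) are treated as loop candidates. */
void profiler_observe(LoopProfiler *p, unsigned pc, unsigned target) {
    if (target > pc) return;                 /* forward branch: ignore */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) { p->count[i]++; return; }
    }
    if (p->used < MAX_TARGETS) {             /* new loop candidate */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* If some target was seen at least threshold times, store the most
 * frequent one in *out and return 1; otherwise return 0. */
int profiler_critical_loop(const LoopProfiler *p, unsigned long threshold,
                           unsigned *out) {
    size_t best = 0;
    int found = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] >= threshold &&
            (!found || p->count[i] > p->count[best])) {
            best = i;
            found = 1;
        }
    }
    if (found) *out = p->target[best];
    return found;
}
```

For the loop on the slide, the branch at the end of the loop body would be reported on every iteration, so its target (the loop label) would quickly become the hottest entry.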
11
Warp Processing
Step 4: On-chip CAD reads in critical region
12
Warp Processing
Step 5: On-chip CAD (Dynamic Part. Module, DPM) decompiles critical region into control data flow graph (CDFG)
  reg3 := 0
  reg4 := 0
loop:
  reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
Recover loops, arrays, subroutines, etc.: needed to synthesize good circuits
13
Warp Processing
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
Diagram: tree of adders (+ + + ...)
14
Warp Processing
Step 7: On-chip CAD maps circuit onto FPGA
Diagram: adders placed onto the FPGA's CLBs and SMs
15
Warp Processing
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated binary:
  Mov reg3, 0
  Mov reg4, 0
loop:
  // instructions that interact with FPGA
  Ret reg4
Chart: Software-only vs. “Warped”
>10x speedups for some apps
Warp speed, Scotty
16
Warp Processing Challenges
Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?
Diagram: µP, I$, D$, FPGA, Profiler, On-chip CAD
On-chip CAD flow (diagram labels): Binary, Profiling & partitioning, Decompilation, CDFG, JIT FPGA compilation, FPGA binary, Binary Updater, Microp Binary
17
Decompilation
Recover high-level information from binary: branches, loops, arrays, subroutines, …
Adapted previous methods for processor-processor translation (UQBT)
Developed new synthesis-oriented methods (e.g., “reroll” loops, strength “promotion”)

Original C Code:
  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding Assembly:
  Mov reg3, 0
  Mov reg4, 0
loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4

Control/Data Flow Graph Creation:
  reg3 := 0
  reg4 := 0
loop:
  reg1 := reg3 << 1
  reg5 := reg2 + reg1
  reg6 := mem[reg5 + 0]
  reg4 := reg4 + reg6
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4

Data Flow Analysis:
  reg3 := 0
  reg4 := 0
loop:
  reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4

Function Recovery:
  long f( long reg2 ) {
    int reg3 = 0;
    int reg4 = 0;
  loop:
    reg4 = reg4 + mem[reg2 + (reg3 << 1)];
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
  }

Control Structure Recovery:
  long f( long reg2 ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += mem[reg2 + (reg3 << 1)];
    }
    return reg4;
  }

Array Recovery:
  long f( short array[10] ) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
      reg4 += array[reg3];
    }
    return reg4;
  }

Almost Identical Representations (original C code vs. array recovery result)
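The array recovery step rests on the observation that a load from reg2 + (reg3 << 1) is an access to element reg3 of an array of 2-byte values. The following sketch checks that equivalence in plain C. It assumes a 16-bit short, and the function names sum_byte_addressed and sum_array are illustrative, not taken from the talk.

```c
#include <stddef.h>
#include <string.h>

/* Byte-addressed form, as in the decompiled loop:
 * reg4 += mem[reg2 + (reg3 << 1)], with a 16-bit load at each address. */
long sum_byte_addressed(const unsigned char *mem, size_t reg2) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        short v;
        memcpy(&v, mem + reg2 + ((size_t)reg3 << 1), sizeof v);
        reg4 += v;
    }
    return reg4;
}

/* Array form, as produced by array recovery. */
long sum_array(const short array[10]) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Calling both functions on the same ten shorts (passing the array's bytes and offset 0 to the first one) should give the same sum, which is the sense in which the two representations are equivalent for synthesis.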
18
Decompilation Results vs. C
Synthesis from decompiled binary is competitive with synthesis from C
19
Decompilation Results on Optimized H.264
In-depth study with Freescale
Again, competitive with synthesis from C
20
Decompilation Effective Even with Compiler Optimizations
Do compiler optimizations hurt decompilation?
(Surprisingly) found optimized code synthesizes to even better circuits
Chart: Average Speedup of 10 Examples (speedup when decompiled binary is partitioned and synthesized to FPGA)
21
Decompilation
Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis
Stitt et al., ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)
22
Warp Processing Challenges
Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?
Diagram: µP, I$, D$, FPGA, Profiler, On-chip CAD
On-chip CAD flow (diagram labels): Binary, Profiling & partitioning, Decompilation, CDFG, JIT FPGA compilation, FPGA binary, Binary Updater, Microp Binary
23
Challenge: JIT Compile to FPGA
Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.:
  Logic synthesis: run single expand phase (diagram: Expand, Reduce, Irredundant over on-set, off-set, dc-set)
  Technology mapping: bottom-up graph clustering heuristic
  Placement: place critical path first, then adjacent items
  Routing: use resource graph that matches switch matrix / channel structure
Tool flow: Logic synthesis, Tech. map., Placement, Routing (drawn to scale)
  Commercial tool: 9.1 s, 60 MB
  Ultra-lean Riverside JIT FPGA tools: 0.2 s, 3.6 MB
  Ultra-lean Riverside JIT FPGA tools on a 75 MHz ARM7: 1.4 s, 3.6 MB
Penalty: 1.3-2x in performance & size (even more might be acceptable)
24
JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler: 40x speedup, 20x less memory, 1.3x-2x circuit penalty
Lysecky et al., DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
25
Warp Processing Results
Chart: Performance Speedup (Most Frequent Kernel Only), where 1 = ARM-only execution
Average kernel speedup of 41
Overall application speedup average is 7.4
vs. 200 MHz ARM
Diagram: µP, I$, D$, FPGA, Profiler, On-chip CAD
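The gap between kernel speedup and overall application speedup is what Amdahl's law predicts: only the fraction of run time spent in the kernel is accelerated. The sketch below states the law as a C function. The slide does not report the kernel's share of run time, so any fraction passed in is an assumption, and averages over benchmarks do not compose this way, so this is not a reconstruction of the 7.4 figure.

```c
/* Amdahl's law: overall speedup when a fraction f of the original run
 * time (0 <= f <= 1) is accelerated by a factor s and the rest is not. */
double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

As an illustration only: if a kernel accounted for 90% of run time and ran 41 times faster, overall_speedup(0.9, 41.0) gives 41/5 = 8.2, far below the kernel's own speedup.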
26
Warping Thread-Based Applications
Multi-core platforms, multi-threaded apps
  for (i = 0; i < 10; i++) {
    thread_create( f, i );
  }
Diagram: Compiler, Binary, OS, µPs, FPGA, On-chip CAD, Acc. Lib, f()
OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f()
OS schedules threads onto accelerators (possibly dozens), in addition to µPs
Thread warping: use one core to create accelerator for waiting threads
Very large speedups possible: parallelism at bit, arithmetic, and now thread level too
27
Memory Access Synchronization (MAS)
Must deal with widely known memory bottleneck problem
FPGAs great, but often can’t get data to them fast enough
  void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++) {
      result += a[i] * val;
    }
    ....
  }
  for (i = 0; i < 10; i++) {
    thread_create( thread_function, a, i );
  }
Same array: data for dozens of threads can create bottleneck (diagram: RAM, DMA, FPGA, a(), b())
Threaded programs exhibit unique feature: multiple threads often access same or overlapping data
Solution: fetch data once, broadcast to multiple threads (MAS)
28
Memory Access Synchronization (MAS)
Detect overlapping memory regions: “windows”
  void f( int a[], int i ) {
    int result;
    result += a[i]+a[i+1]+a[i+2]+a[i+3];
    ....
  }
  for (i = 0; i < 100; i++) {
    thread_create( thread_function, a, i );
  }
Each thread accesses different addresses, but addresses may overlap
Diagram: RAM holds A[0-103]; DMA streams data to “smart buffer”; buffer delivers a window (A[0-3], A[1-4], ..., A[6-9], ...) to each thread f() and raises its enable
Synthesis creates active “smart buffer” [Guo/Najjar FPGA04]
  Actively fetches data, stores the reused data, delivers windows to threads
  Active rather than passive component; designed for specific threads
W/O smart buffer: 400 memory accesses
With smart buffer: 104 memory accesses
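The access counts on this slide can be reproduced with a small software model. This is a sketch under stated assumptions: the array is taken to hold 104 elements (matching the A[0-103] label), the buffer is modeled as a circular buffer as wide as the window, and the function names and MAX_WINDOW limit are hypothetical. The real smart buffer is a synthesized circuit, not C code.

```c
#include <stddef.h>

#define MAX_WINDOW 16

/* Without a smart buffer: thread i reads a[i..i+window-1] straight from
 * memory. Returns the number of memory accesses. */
long naive_window_sums(const int *a, size_t n_threads, size_t window,
                       long *out) {
    long accesses = 0;
    for (size_t i = 0; i < n_threads; i++) {
        out[i] = 0;
        for (size_t k = 0; k < window; k++) {
            out[i] += a[i + k];
            accesses++;
        }
    }
    return accesses;
}

/* With a smart buffer: stream a[0..n_elems-1] once through a circular
 * buffer and hand each thread its window as soon as it is complete.
 * Returns the number of memory accesses, or -1 for an unsupported window. */
long buffered_window_sums(const int *a, size_t n_elems, size_t n_threads,
                          size_t window, long *out) {
    int buf[MAX_WINDOW];
    long accesses = 0;
    if (window == 0 || window > MAX_WINDOW) return -1;
    for (size_t k = 0; k < n_elems; k++) {
        buf[k % window] = a[k];              /* the only fetch of a[k] */
        accesses++;
        if (k + 1 >= window) {
            size_t thread = k + 1 - window;  /* window a[thread..k] is ready */
            if (thread < n_threads) {
                long sum = 0;
                for (size_t j = 0; j < window; j++) sum += buf[j];
                out[thread] = sum;
            }
        }
    }
    return accesses;
}
```

With 100 threads and a window of 4 over a 104-element array, the first function performs 400 accesses and the second 104, while both deliver the same per-thread sums.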
29
Speedups from Thread Warping
Chose benchmarks with extensive parallelism
Four-core (ARM11, 400 MHz) base system
Virtex IV FPGA at circuit-specific clock frequency (~100-300 MHz)
Average 130x speedup
Still 20x faster than 32-core system (and 11x faster than 64-core)
Simulation pessimistic, actual results likely better
FPGA more flexible
But FPGA uses additional area: our FPGA size = ~36 ARM11s
30
Warp Scenarios
Warping takes time (seconds, minutes, or more). When is it useful?
Long-running applications
  Scientific computing, etc.
  Single-execution speedup
Recurring applications (save and reuse FPGA configurations)
  Common in embedded systems
  Might view as (long) boot phase
For networked/docked devices, CAD can occur on server (ongoing work)
Diagram: timelines for long-running and recurring applications (µP, µP 1st execution, on-chip CAD, FPGA, speedup)
31
Why Dynamic?
Static good, but hiding FPGA opens technique to all sw platforms
Standard languages/tools/binaries
Static Compiling to FPGAs: Specialized Language, Specialized Compiler, Binary + Netlist, µP + FPGA
Dynamic Compiling to FPGAs: Any Language, Any Compiler, Binary, µP + FPGA with On-chip CAD
“Ecosystem”: Applications, Tools, Architectures
32
Synthesis-Friendly Applications
Coding style impacts synthesis results
33
Synthesis-Friendly Application Coding Guidelines
Coding Guidelines:
  Conversion to Constants (CC)
  Conversion to Fixed Point (CF)
  Conversion to Explicit Data Flow (CEDF)
  Conversion to Explicit Memory Accesses (CEMA)
  Function Specialization (FS)
  Constant Input Enumeration (CIE)
  Loop Rerolling (LR)
  Conversion to Explicit Control Flow (CECF)
  Algorithmic Specialization (AS)
  Pass-By-Value Return (PVR)
34
Conversion to Explicit Control Flow (CECF)
Problem: Function pointers may prevent static control flow analysis
  Synthesis unlikely to determine possible targets of function pointer
Guideline: Don’t use function pointers. Replace with if-else, static calls
  Makes possible targets explicit
Before:
  void f( int (*fp) (int) ) {
    .....
    for (i=0; i < 10; i++) {
      a[i] = fp(i);
    }
  }
  Synthesized hardware: unknown circuit (?) drives a[i]
After:
  enum Target { FUNC1, FUNC2, FUNC3 };
  void f( enum Target fp ) {
    .....
    for (i=0; i < 10; i++) {
      if (fp == FUNC1) a[i] = f1(i);
      else if (fp == FUNC2) a[i] = f2(i);
      else a[i] = f3(i);
    }
  }
  Synthesized circuit: f1(i), f2(i), f3(i) feed a 3x1 mux selected by fp, which drives a[i]
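The slide's before and after fragments can be completed into a small compilable example. The bodies of f1, f2, and f3 below are placeholders invented for illustration, and the function names fill_with_pointer and fill_with_enum are hypothetical. The point is only that the enum version names every possible callee at the call site.

```c
/* Placeholder targets standing in for f1, f2, f3 on the slide. */
int f1(int i) { return i + 1; }
int f2(int i) { return i * 2; }
int f3(int i) { return i * i; }

enum Target { FUNC1, FUNC2, FUNC3 };

/* Before: the callee is known only at run time, so a synthesis tool
 * would have to work out every function fp might point to. */
void fill_with_pointer(int a[10], int (*fp)(int)) {
    for (int i = 0; i < 10; i++) {
        a[i] = fp(i);
    }
}

/* After: the three possible callees are explicit, which maps naturally
 * onto three function circuits feeding a 3-to-1 multiplexer. */
void fill_with_enum(int a[10], enum Target fp) {
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1) a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else a[i] = f3(i);
    }
}
```

Calling fill_with_pointer(a, f2) and fill_with_enum(b, FUNC2) fills both arrays with the same values, so the rewrite changes how the choice is expressed, not what is computed.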
35
Speedups from Synthesis-Friendly Coding Guidelines
10 guidelines
For ~1,000 line benchmark: 5-6 changes typical, tens of minutes each
Simple guidelines increased speedup to 6.5x
36
Speedups from Synthesis-Friendly Coding Guidelines
Original C code (Powerstone, Mediabench)
  Original average speedup with FPGA: 2.6x (excludes brev)
Refined C code with guidelines
  Average speedup: 8.4x (excludes brev)
Guidelines led to 3.5x improvement of speedup
37
“Spatial” Algorithms for FPGAs
Example: Count patterns
Sequential algorithm
  Hash table
  10s of cycles per pattern
  int patterns[1000];
  int counts[1000];
  while (1) {
    WaitForPattern();
    CurrPattern = X;
    hash = HashFct(CurrPattern);
    item = Find(patterns, CurrPattern, hash);
    if (item) {
      counts[item]++;
    }
  }
Spatial algorithm
  Pipelined stages
  Diagram: CurrPattern travels on a bus through Level 1, Level 2, ..., Level m, each with logic, a pattern, and a count
Essence is the connectivity of components, not the sequencing of instructions
38
Spatial Algorithms for FPGAs
Spatial algorithm 2: pipelined binary tree
Current pattern passes through Level 1, Level 2, Level 3, ..., Level n, each with logic and a memory
Memory sizes by level: 1 pattern, 2 patterns, 4 patterns, ..., 2^n patterns, each with a count
39
Example (slides 39-43 are animation frames of the same diagram)
Possible patterns pre-stored in binary search tree circuit
Diagram: Stage 1 to Stage 4, with Level 1 to Level n logic and memories holding 1, 2, 4, ..., 2^n patterns
Frames: the patterns 73, 48, 23, 75, 11 enter the pipeline one after another and advance through the stages; counts of 1 appear in the later frames
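A cycle-by-cycle software model can make the pipelined binary tree concrete. This is a sketch, not the circuit from the talk: it assumes the search tree is stored in level order, that each level is one pipeline stage with an input register, and that a new pattern enters every cycle. The names PatternTree, StageReg, and count_patterns are hypothetical, and the tree is limited to three levels for brevity.

```c
#include <stddef.h>

#define LEVELS 3
#define NODES ((1u << LEVELS) - 1)   /* 1 + 2 + 4 stored patterns */

/* Binary search tree in level order: the children of node k are
 * nodes 2k+1 and 2k+2, so each level occupies a contiguous range. */
typedef struct {
    int pattern[NODES];
    unsigned count[NODES];
} PatternTree;

/* Input register of one pipeline stage. */
typedef struct {
    int valid;
    int pattern;
    size_t node;
} StageReg;

/* Feed n patterns into the pipeline, one per cycle, then drain it.
 * A pattern found at some level increments that node's count. */
void count_patterns(PatternTree *t, const int *in, size_t n) {
    StageReg reg[LEVELS + 1] = {{0, 0, 0}};
    for (size_t cycle = 0; cycle < n + LEVELS; cycle++) {
        /* Evaluate the last stage first so each stage consumes its
         * input register before the previous stage overwrites it. */
        for (int lvl = LEVELS - 1; lvl >= 0; lvl--) {
            StageReg cur = reg[lvl];
            if (lvl == 0) {                  /* stage 0 reads the stream */
                cur.valid = (cycle < n);
                cur.pattern = cur.valid ? in[cycle] : 0;
                cur.node = 0;
            }
            reg[lvl + 1].valid = 0;
            if (!cur.valid) continue;
            if (cur.pattern == t->pattern[cur.node]) {
                t->count[cur.node]++;        /* match: stop here */
            } else {                         /* pass on to a child */
                reg[lvl + 1].valid = 1;
                reg[lvl + 1].pattern = cur.pattern;
                reg[lvl + 1].node = 2 * cur.node +
                    (cur.pattern < t->pattern[cur.node] ? 1 : 2);
            }
        }
    }
}
```

Because every stage works on a different pattern in the same cycle, the model accepts one pattern per cycle regardless of tree depth, which is the property the slides contrast with the tens of cycles per pattern of the sequential hash-table version.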
44
Study of Spatial Algorithms in FCCM
FCCM 2001-2006
  70 papers describing fast application on FPGA
  Examined 35 in depth (every other one)
    6 used device-specific features
    9 represented expected synthesized circuit from the obvious sequential algorithm
    20 were spatially-oriented applications, e.g., earlier pipelined binary tree
Year  Application               Type
2001  3D Vec. Normalization     Spatial
2001  Efficient CAM             --
2001  Automated Sensor          Temporal
2001  Regular Expression        Spatial
2002  Hyperspectral Image       Spatial
2002  Machine Vision            Spatial
2002  RC4                       Temporal
2002  Set Covering              Spatial
2002  Template Matching         Spatial
2002  Triangle Mesh             Spatial
2003  Congruential Sieves       Temporal
2003  Content Scanning          Temporal
2003  F.P and Square Root       Spatial
2003  Gaussian Noise            Spatial
2003  TRNG                      --
2004  3D FDTD Method            Spatial
2004  Deep Packet Filter        --
2004  Online Floating Point     --
2004  Molecular Dynamics        Spatial
2004  Pattern Matching          Spatial
2004  Seismic Migration         Spatial
2004  Software Deceleration     --
2004  V.M Window                --
2005  Data Mining               Spatial
2005  Cell Automata             Temporal
2005  Particle Graphics         Spatial
2005  Radiosity                 Temporal
2005  Transient Waves           Spatial
2005  Road Traffic              Temporal
2006  All Pairs Shortest Path   Spatial
2006  Apriori Data Mining       Spatial
2006  Molecular Dynamics        Spatial
2006  Gaussian Elimination      Spatial
2006  Radiation Dose            Temporal
2006  Random Variates           Spatial
45
Portable Spatial Applications?
Current portable microprocessor binaries: sequential
  Extensions for threads, processes, ...
How to support spatial constructs?
  Ports, connections, timing model, ...
SystemC (www.systemc.org)
  Adds libraries and macros, still standard C++
  Sequential and spatial constructs
  Compiling links in the simulation kernel
  Self-executing simulation
  Intended for SoC simulation
46
Bytecode
Modern portability approach: Java, C#
Diagram: Compiler produces bytecode; a VM runs it on Pentium, Atom, Opteron
Virtual Machine (VM): program that executes bytecode
May JIT compile to native architecture
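To illustrate what "program that executes bytecode" means, here is a minimal interpreter for an invented register bytecode. The opcodes and the Instr layout are hypothetical and are not Java bytecode, C# bytecode, or UCR's SystemC bytecode. A real VM would add control flow, memory, and possibly JIT compilation.

```c
#include <stddef.h>

/* Invented opcodes for illustration only. */
enum Op { OP_LOADI, OP_ADD, OP_MUL, OP_HALT };

typedef struct {
    enum Op op;
    int dst;   /* destination register index */
    int a;     /* immediate (LOADI) or first source register */
    int b;     /* second source register */
} Instr;

/* Interpret the program until HALT or the end of the code array.
 * Returns the value left in register 0. Register indices are masked
 * to the 8 available registers. */
int vm_run(const Instr *code, size_t n) {
    int reg[8] = {0};
    for (size_t pc = 0; pc < n; pc++) {
        const Instr *in = &code[pc];
        switch (in->op) {
        case OP_LOADI:
            reg[in->dst & 7] = in->a;
            break;
        case OP_ADD:
            reg[in->dst & 7] = reg[in->a & 7] + reg[in->b & 7];
            break;
        case OP_MUL:
            reg[in->dst & 7] = reg[in->a & 7] * reg[in->b & 7];
            break;
        case OP_HALT:
            return reg[0];
        }
    }
    return reg[0];
}
```

The same Instr array runs unchanged wherever vm_run can be compiled, which is the portability argument of the slide: the bytecode is the portable binary and the VM is the per-platform part.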
47
SystemC Bytecode?
Diagram: SystemC source goes through a Compiler to SystemC bytecode, which a VM executes on Pentium, FPGA, or Opteron + FPGA
48
UCR SystemC Bytecode and Compiler
SystemC source:
  class EDGE_DETECTOR : public sc_module {
    //signal declarations
    …
    EDGE_DETECTOR() {
      SC_METHOD(mainComp);
      sensitive << dataReady;
      SC_METHOD(getPixel);
      sensitive << clock.pos();
    }
    void getPixel() {
      …
      dataReady.write(1);
    }
    void mainComp() {
      int i, j;
      for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
          sumX = sumX + mem.read()*GX[i][j];
        }
      }
      …
      edge.write(sumX + sumY);
    }
  };
UCR’s SystemC-to-bytecode compiler translates this to UCR’s SystemC bytecode:
  --header
  signal clock : 1
  signal reset : 1
  signal memory_in : 32
  signal fb_data : 32
  signal leds : 4
  process(clock)
    READ $1 memory_in
    ADD $2 $0 3
    ADD $3 $2 $1
    WRITE $3 s1
    ADDI $1 $0 1
    WRITE $1 dataReady
  END
  process(dataReady)
    READ $5 val6
    SW $5 24($0)
    READ $5 val7
    …
    ADDI $10 $0 0
    ADDI $7 $0 0
    ADDI $13 $0 8
    …
  END
Spatial constructs: signal declarations and processes
MIPS-like sequential instructions inside each process
49
SystemC Bytecode for FPGAs: Demo
50
SystemC Bytecode Emulator
Emulator on FPGA: Main Processor, Instruction Memory, Input Memory, Output Memory, Read Signal Memory, Write Signal Memory, UART, Buttons, LEDs, USB Interface
SystemC bytecode uploadable via USB drive
Accelerators speed up emulation
51
SystemC Bytecode Accelerators
Emulator on FPGA as before, plus Accelerator 1, Accelerator 2, Accelerator 3
Accelerator: RISC Datapath, Register File, Local Mem, Bus, start, load logic
Implementation
  MIPS-like multicycle RISC datapath
  100 MHz clock
  ~33 million instr/sec
  Communicates with core emulator via memory-mapped registers
  Area: ~5000 slices
  # of accelerators limited to # of masters allowed on bus
  ~1200 lines of VHDL
52
Dynamic SystemC Accelerator Management
Only a limited number of SystemC accelerators can fit on an FPGA fabric
Dynamically map processes to accelerators based on process usage
Involves online algorithms
Image Filter Example
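The slide does not name the online algorithm used. As a simple stand-in, the sketch below greedily gives the available accelerators to the processes with the highest usage counts; a real manager would also weigh the cost of remapping. The function name map_processes and its parameters are hypothetical.

```c
#include <stddef.h>

/* Assign up to n_acc accelerators to the most heavily used processes.
 * usage[i] is the observed usage count of process i. On return,
 * on_accel[i] is 1 if process i runs on an accelerator and 0 if it
 * stays on the main emulator processor. */
void map_processes(const unsigned long *usage, size_t n_proc, size_t n_acc,
                   int *on_accel) {
    for (size_t i = 0; i < n_proc; i++) on_accel[i] = 0;
    for (size_t slot = 0; slot < n_acc && slot < n_proc; slot++) {
        size_t best = n_proc;                /* n_proc means "none yet" */
        for (size_t i = 0; i < n_proc; i++) {
            if (on_accel[i]) continue;
            if (best == n_proc || usage[i] > usage[best]) best = i;
        }
        on_accel[best] = 1;
    }
}
```

Re-running map_processes whenever the usage counts change would approximate the dynamic mapping described on the slide.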
53
Just-in-Time Synthesis
Send SystemC bytecode to synthesis server
FPGA-specific bitstream
Dynamically reconfigure some or all of the FPGA
Possible to even perform synthesis on-chip: “warp processing” (previous UCR work)
54
Spatial Algorithms for FPGAs
Even better spatial algorithm for pattern counting: pipelined binary tree
CurrPattern passes through Level 1, Level 2, Level 3, ..., Level n, each with logic and a memory
Memory sizes by level: 1 pattern, 2 patterns, 4 patterns, ..., 2^n patterns, each with a count
55
Study of Spatial Algorithms in FCCM (Sirowy FPGA’2008)
Same table and findings as slide 44: FCCM 2001-2006, 70 papers describing fast application on FPGA, 35 examined in depth (every other one); 6 used device-specific features, 9 represented expected synthesized circuit from the obvious sequential algorithm, 20 were spatially-oriented applications akin to earlier pipelined binary tree
57
Transmuting Coprocessors: Demo
58
FPGA is a Size-Limited Coprocessing Resource
FPGA implements coprocessors
App executions change. Must decide which coprocessors should be FPGA-resident at a given time: transmuting coprocessors
Flow: upload app profile info; select coproc. set and generate new FPGA bitstream; send back new bitstream and re-program FPGA; speedup with previous apps
59
Transmuting Coprocessor Demo
Three image filters, each in small and large (S/L) versions:
  Blur filter (S/L): blur the image
  Sobel filter (S/L): find the edges of the image
  Emboss filter (S/L): emboss the image
Platform: Virtex 2P (XC2VP30): PPC + coprocessors
PPC frequency: 100 MHz
Coproc. frequency: 50 MHz
Size (slices)   Small   Large
Blur               30     120
Sobel             228     912
Emboss             81     324
Speedup           30x    120x
60
Demo architecture
Diagram labels: PPC, Peripherals, Instruction BRAM, PLB, UART, Push button, EDK; Interface to external, Display BRAM, Image BRAM, Coproc, VGA control, VGA display, ISE
Image (128*128 pixels and 24-bit color): 24 BRAMs
Soft version: Read (Image BRAM), Execution (PPC), Write (Display BRAM)
Coprocessor version: Read (Image BRAM), Execution (Coproc), Write (Display BRAM)
Dock: send the profile information through UART.
61
Coprocessor configurations
  Microprocessor only
  Small blur + small sobel
  Small blur + small emboss
  Small sobel + small emboss
  Large blur
  Large sobel
  Large emboss
Choose the configuration according to app profile info.
Diagram: Virtex2P with PPC, peripherals, memory, and a coprocessor region holding one of the configurations above
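One way to choose among these seven configurations is to estimate total filter time for the profiled workload and pick the minimum. The sketch below does that. It is not the demo's actual heuristic: it assumes every filter takes the per-run times reported elsewhere in the deck for 128 frames (30 s on the PPC, 1 s on a small coprocessor, 0.25 s on a large one), ignores reprogramming time, and uses hypothetical names throughout.

```c
#include <stddef.h>

enum Impl { SW = 0, SMALL = 1, LARGE = 2 };
enum { BLUR, SOBEL, EMBOSS, N_FILTERS };

/* The seven configurations from the slide: for each filter, whether it
 * runs in software, on a small coprocessor, or on a large coprocessor. */
static const enum Impl CONFIGS[7][N_FILTERS] = {
    {SW,    SW,    SW},      /* microprocessor only */
    {SMALL, SMALL, SW},      /* small blur + small sobel */
    {SMALL, SW,    SMALL},   /* small blur + small emboss */
    {SW,    SMALL, SMALL},   /* small sobel + small emboss */
    {LARGE, SW,    SW},      /* large blur */
    {SW,    LARGE, SW},      /* large sobel */
    {SW,    SW,    LARGE},   /* large emboss */
};

/* Assumed seconds per 128-frame run, indexed by enum Impl. */
static const double SECONDS[3] = {30.0, 1.0, 0.25};

/* Estimated total time of configuration c for the profiled run counts. */
double config_time(size_t c, const unsigned runs[N_FILTERS]) {
    double total = 0.0;
    for (size_t f = 0; f < N_FILTERS; f++) {
        total += runs[f] * SECONDS[CONFIGS[c][f]];
    }
    return total;
}

/* Index (0..6) of the configuration with the lowest estimated time;
 * ties go to the earlier configuration in the list. */
size_t best_config(const unsigned runs[N_FILTERS]) {
    size_t best = 0;
    for (size_t c = 1; c < 7; c++) {
        if (config_time(c, runs) < config_time(best, runs)) best = c;
    }
    return best;
}
```

Under these assumptions, a profile dominated by blur selects the large blur coprocessor, while a profile split between blur and sobel selects the two small coprocessors, since a filter left in software dominates the total.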
62
Video demo program flow
Flow: Execution; Read profile info from UART; Update profile information; Dock; Select new program file; Reprogram FPGA
Different objectives and different heuristics.
Time information:
  Dock + CP selection                  0.001 s
  Start IMPACT + FPGA reprogramming    12 s
  Filter, PPC only (128 frames)        30 s
  Filter, CP small (128 frames)        1 s
  Filter, CP large (128 frames)        0.25 s
63
Dynamic Enables Expandable Logic Concept
Expandable RAM: system detects RAM during start, improves performance invisibly
Expandable Logic: warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
Diagram labels: µP, Cache, RAM, Expandable RAM, DMA, FPGA, Expandable Logic, Profiler, Warp Tools, Performance
64
Summary
FPGAs entering mainstream
  Portability of applications is important
Dynamic binary translation to FPGAs: warp processing
  Shown feasible; extensive future work
  Trends towards FPGA ubiquity
Microprocessor binaries need extensions for spatial constructs
  One approach: SystemC bytecode and virtual machine
  Can also be warped for circuit speed
http://www.cs.ucr.edu/~vahid/pubs