Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.


1 Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida

2 2/55 Introduction
Improved performance enables new applications
  Past decade: MP3 players, portable game consoles, cell phones, etc.
  Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

3 3/55 Introduction
FPGAs (Field Programmable Gate Arrays) implement custom circuits
FPGAs are capable of large performance improvements
  10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
But FPGAs are not mainstream
Warp processing goal: bring FPGAs into the mainstream
  Make FPGAs “invisible”
[Figure: performance of uP vs. FPGA]

4 4/55 Introduction – Hardware/Software Partitioning
C code for FIR filter:
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ..
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]
    ..
Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94]
Compiler targets the processor: loop takes ~1000 cycles
Designer creates custom hardware using a hardware description language (HDL): hardware for the loop takes ~10 cycles
Speedup = 1000 cycles / 10 cycles = 100x
[Figure: processor next to an FPGA holding a circuit of multipliers (*) feeding adders (+)]
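The speedup arithmetic on this slide can be checked with a small C sketch. The function names mac_loop and speedup are illustrative, not from the talk, and the cycle counts are simply the slide's approximate figures passed in as parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulate loop in the shape shown on the slide:
   y[i] += c[i] * x[i] for i in [0, n). */
void mac_loop(int *y, const int *c, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += c[i] * x[i];
    }
}

/* Speedup expressed as a ratio of cycle counts (software / hardware). */
double speedup(double sw_cycles, double hw_cycles)
{
    assert(hw_cycles > 0.0);
    return sw_cycles / hw_cycles;
}
```

With the slide's numbers, speedup(1000.0, 10.0) evaluates to 100.0, matching the 100x stated above.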

5 5/55 Introduction – High-level Synthesis
Problem: Describing a circuit using an HDL is time consuming and difficult
Solution: High-level synthesis
  Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  Allows developers to use a higher-level specification
  Potentially enables synthesis for software developers
[Figure: tool flow with high-level code, compiler, hw/sw partitioning, high-level synthesis, libraries/object code, linker, updated binary (software, uP), and bitstream (hardware, FPGA)]


7 7/55 Introduction – High-level Synthesis
Problem: Describing a circuit using an HDL is time consuming and difficult
Solution: High-level synthesis
  Create circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  Allows developers to use a higher-level specification
  Potentially enables synthesis for software developers
Example input to high-level synthesis:
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]
[Figure: resulting circuit of multipliers (*) feeding adders (+)]

8 8/55 Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication

9 9/55 Problems with High-Level Synthesis
Problem: High-level synthesis is unattractive to software developers
  Requires specialized language: SystemC, NapaC, HandelC, …
  Requires specialized compiler: Spark, ROCCC, CatapultC, …
  Limited commercial success
  Software developers reluctant to change tools
[Figure: non-standard software tool flow, with a specialized language and a specialized compiler feeding synthesis]

10 10/55 Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 requirements
  Standard software tool flow: perform compilation before synthesis
  Hide synthesis tool: move synthesis on chip
    Similar to dynamic binary translation [Transmeta], but translate to hw
[Figure: standard software tool flow. High-level code is compiled and linked with libraries/object code into a software binary; decompilation and synthesis then produce the bitstream (hardware, FPGA) and the updated binary (software, uP)]

11 11/55 Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 requirements
  Standard software tool flow: perform compilation before synthesis
  Hide synthesis tool: move synthesis on chip
    Similar to dynamic binary translation [Transmeta], but translate to hw
Warp processor looks like a standard uP but invisibly synthesizes hardware

12 12/55 Warp Processing – “Invisible” Synthesis
Advantages
  Supports all languages, compilers, IDEs (C, C++, Java, Matlab; gcc, g++, javac, keil)
  Supports synthesis of assembly code
  Supports synthesis of library code
  Also enables dynamic optimizations
Warp processor looks like a standard uP but invisibly synthesizes hardware

13 13/55 Warp Processing Background: Basic Idea
[Figure: warp processor with µP, I Mem, D$, FPGA, profiler, and on-chip CAD]
Step 1: Initially, software binary loaded into instruction memory
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

14 14/55 Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary

15 15/55 Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: profiler observes the add and beq instructions of the loop; critical loop detected]
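The slides do not describe how the profiler is built. As a hedged illustration only, one common way to find hot loops is to count taken backward branches per target address. Every name below (profile_t, profile_branch, hottest_loop, PROFILE_SLOTS) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16  /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILE_SLOTS]; /* branch target addresses seen so far */
    uint32_t count[PROFILE_SLOTS];  /* taken count for each target */
    size_t used;                    /* number of occupied slots */
} profile_t;

/* Record one taken branch. Only backward branches (target <= pc) are
   counted, since they usually close a loop. */
void profile_branch(profile_t *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target > pc) {
        return; /* forward branch: ignored */
    }
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Target address of the most frequently taken backward branch,
   or 0 if nothing has been recorded. */
uint32_t hottest_loop(const profile_t *p)
{
    uint32_t best = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

Feeding it the slide's loop, where Beq jumps back to the loop label on each iteration, would make that label's address the reported critical region.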

16 16/55 Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region

17 17/55 Warp Processing Background: Basic Idea
Step 5: On-chip CAD (Dynamic Part. Module, DPM) converts critical region into control data flow graph (CDFG)
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

18 18/55 Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: CDFG next to the synthesized circuit, a tree of adders (+)]

19 19/55 Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: adder circuit placed onto the FPGA's CLB and SM blocks]

20 20/55 Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: software-only vs. “warped” execution]

21 21/55 Expandable Logic
Expandable RAM: system detects RAM during start, improves performance invisibly
Expandable Logic: warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
[Figure: µP with cache, profiler, warp tools, DMA, RAM, and FPGA; performance of uP vs. expandable logic]

22 22/55 Expandable Logic
Allows for customization of platforms
  User can select FPGAs based on the applications used
[Figure: portable gaming application; unacceptable performance]

23 23/55 Expandable Logic
Allows for customization of platforms
  User can select FPGAs based on the applications used
  User can customize FPGAs to the desired amount of performance
Performance improvement is invisible: doesn’t require new binary from the developer
[Figure: portable gaming application; performance]

24 24/55 Expandable Logic
Allows for customization of platforms
  User can select FPGAs based on the applications used
  Platform designer doesn’t have to decide on a fixed amount of FPGA
  User doesn’t have to pay for FPGA that isn’t needed
[Figure: web browser application on a no-FPGA platform; acceptable performance]

25 25/55 Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations
Develop extremely efficient on-chip CAD tools
  Requires efficient synthesis
  Requires specialized FPGA, physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona
[Figure: on-chip CAD flow from binary to updated binary and HW: binary synthesis, then JIT FPGA compilation (logic optimization, technology mapping, placement & routing)]

26 26/55 Warp Processing Background: On-Chip CAD
CAD stages: RT synthesis, logic optimization, technology mapping, placement, routing
Xilinx ISE: 9.1 s, 60 MB
On-chip CAD: 0.2 s, 3.6 MB
  46x improvement
  30% perf. penalty
  On a 75 MHz ARM7: only 1.4 s
[Figure label: “Manually performed”]

27 27/55 Warp Processing: Initial Results - Embedded Applications
Average speedup of 6.3x
  Achieved completely transparently
Also, energy savings of 66%

28 28/55 Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication

29 29/55 Binary Synthesis
Warp processors perform synthesis from software binary: “binary synthesis”
Problem: No high-level information
  Synthesis needs high-level constructs
  > 10x slowdown
Can we recover high-level information for synthesis?
  Make binary synthesis (and warp processing) competitive with high-level synthesis
High-level code:
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ..
Compiler output, with no high-level constructs (arrays, loops, etc.):
    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5
[Figure: binary synthesis targeting processor and FPGA; label “Hardware can be > 10x to 100x”]

30 30/55 Decompilation
We realized decompilation recovers high-level information
  But generally used for binary translation or source-code recovery
  May not be suitable for synthesis
We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]
  DisC, dcc, Boomerang, Mocha, SourceAgain
  Determined relevant techniques
  Adapted existing techniques for synthesis

31 31/55 Decompilation – Control/Data Flow Graph Recovery
Recovery of control/data flow graph (CDFG)
  Format used by synthesis
  Difficult because of indirect jumps: cannot statically analyze control flow
  But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]
Original C code:
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
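The slide does not say how basic blocks are formed. The sketch below uses the textbook leader rule (first instruction, every branch target, every instruction after a branch) as an assumption, handles only direct branches, and uses hypothetical names (insn_t, mark_leaders). Indirect jumps, which the slide names as the hard case, are not covered.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int is_branch; /* nonzero for a direct branch */
    int offset;    /* branch offset in instructions, relative to this one */
} insn_t;

/* Mark basic-block leaders in leader[0..n) and return the block count. */
size_t mark_leaders(const insn_t *code, size_t n, int *leader)
{
    size_t blocks = 0;
    for (size_t i = 0; i < n; i++) {
        leader[i] = 0;
    }
    if (n > 0) {
        leader[0] = 1; /* the first instruction starts a block */
    }
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch) {
            continue;
        }
        long target = (long)i + code[i].offset;
        assert(target >= 0 && (size_t)target < n);
        leader[target] = 1;     /* branch target starts a block */
        if (i + 1 < n) {
            leader[i + 1] = 1;  /* fall-through after a branch starts a block */
        }
    }
    for (size_t i = 0; i < n; i++) {
        if (leader[i]) {
            blocks++;
        }
    }
    return blocks;
}
```

For the nine instructions above, with Beq at index 7 and offset -5, the leaders are indices 0, 2, and 8, which gives the three blocks of the CDFG: initialization, loop body, and return.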

32 32/55 Decompilation – Data Flow Analysis
Original purpose: remove temporary registers
Area overhead: 130%
Need new techniques for binary synthesis
After data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
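To illustrate what removing temporary registers preserves, the sketch below writes the loop twice in C: once instruction by instruction with the temporaries reg1, reg5, and reg6, and once with the folded expression. The byte-addressed memory model and the helper load16 are assumptions for illustration, and the code assumes a 2-byte short, matching the shift by 1.

```c
#include <assert.h>
#include <string.h>

/* Load a 16-bit value from a byte-addressed toy memory. */
static long load16(const unsigned char *mem, long addr)
{
    short v;
    assert(addr >= 0);
    memcpy(&v, mem + addr, sizeof v);
    return v;
}

/* One statement per assembly instruction, temporaries kept. */
long sum_with_temporaries(const unsigned char *mem, long reg2)
{
    long reg1, reg3 = 0, reg4 = 0, reg5, reg6;
    do {
        reg1 = reg3 << 1;         /* Shl reg1, reg3, 1    */
        reg5 = reg2 + reg1;       /* Add reg5, reg2, reg1 */
        reg6 = load16(mem, reg5); /* Ld  reg6, 0(reg5)    */
        reg4 = reg4 + reg6;       /* Add reg4, reg4, reg6 */
        reg3 = reg3 + 1;          /* Add reg3, reg3, 1    */
    } while (reg3 < 10);
    return reg4;
}

/* Same loop after the temporaries are folded into one expression. */
long sum_folded(const unsigned char *mem, long reg2)
{
    long reg3 = 0, reg4 = 0;
    do {
        reg4 = reg4 + load16(mem, reg2 + (reg3 << 1));
        reg3 = reg3 + 1;
    } while (reg3 < 10);
    return reg4;
}
```

Both functions return the same sum for any memory contents; only the number of named intermediate values differs.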

33 33/55 Decompilation – Data Flow Analysis
Strength reduction: compare-with-zero instructions
    Sub reg3, reg4, reg5
    Bz reg3, -5
  Direct DFG: reg4 and reg5 feed a Sub that produces reg3; reg3 = 0 decides the branch. The subtractor is not needed and wastes area.
  Optimized DFG: reg4 = reg5 decides the branch
Operator size reduction
    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5
  Direct DFG: 32-bit reg4 + 32-bit reg5 gives 32-bit reg3
  Optimized DFG: 8-bit reg4 (load byte) + 5-bit reg5 (constant 16) gives 8-bit reg3; only an 8-bit adder is needed
Area overhead reduced to 10%
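The operand widths on this slide can be reproduced with a short sketch. The functions bits_needed and adder_width are hypothetical, and choosing the adder width as the wider operand (ignoring carry-out) simply follows the slide's 8-bit result rather than any stated rule.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent an unsigned value (at least 1). */
unsigned bits_needed(uint32_t v)
{
    unsigned bits = 1;
    while (v >>= 1) {
        bits++;
    }
    return bits;
}

/* Adder width taken as the wider of the two operands, carry-out ignored. */
unsigned adder_width(unsigned a_bits, unsigned b_bits)
{
    assert(a_bits > 0 && b_bits > 0);
    return a_bits > b_bits ? a_bits : b_bits;
}
```

The constant 16 needs 5 bits and a loaded byte needs 8, so adder_width(8, bits_needed(16)) gives the slide's 8-bit adder instead of a 32-bit one.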

34 34/55 Decompilation – Function Recovery
Recover parameters and return values
  Def-use analysis of prologue/epilogue
  100% success rate
After function recovery:
    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
    loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }

35 35/55 Decompilation – Control Structure Recovery
Recover loops, if statements
  Uses interval analysis techniques [Cifuentes 94]
  100% success rate
After control structure recovery:
    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }

36 36/55 Decompilation – Array Recovery
Detect linear memory patterns and row-major ordering calculations
  ~95% success rate [Stitt, Guo, Najjar, Vahid 05] [Cifuentes 00]
After array recovery:
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
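A linear memory pattern means the accessed addresses have the form base + i * stride. The cited work presumably finds this by analyzing address expressions; the sketch below instead checks a list of observed addresses, which is a simplification made here for illustration. The function name linear_stride is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 and stores the stride if addrs[0..n) has the form
   base + i * stride with a nonzero stride; returns 0 otherwise. */
int linear_stride(const long *addrs, size_t n, long *stride)
{
    assert(addrs != NULL && stride != NULL);
    if (n < 2) {
        return 0;
    }
    long s = addrs[1] - addrs[0];
    if (s == 0) {
        return 0;
    }
    for (size_t i = 2; i < n; i++) {
        if (addrs[i] - addrs[i - 1] != s) {
            return 0;
        }
    }
    *stride = s;
    return 1;
}
```

For the loop above, the addresses reg2 + (reg3 << 1) advance by 2 per iteration, which is consistent with an array of 2-byte elements such as short array[10].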

37 37/55 Comparison of Decompiled Code and Original Code
Decompiled code almost identical to original code
  Only difference is variable names
Binary synthesis is competitive with high-level synthesis
Original C code:
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Decompiled code:
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
Almost identical representations

38 38/55 Binary Synthesis Tool Flow
Initially, high-level source is compiled and linked (with libraries/object code) to form a binary
Binary synthesis tool: ~30,000 lines of C code
Stages shown in the figure:
  Profiling
  Decompilation: recovers high-level information needed for synthesis
  Hw/sw estimation and hw/sw partitioning
  Synthesis: produces hardware netlists and the bitstream (FPGA)
  Binary updater: modifies binary to use synthesized hardware, producing the updated binary (uP)

39 39/55 Binary Synthesis is Competitive with High-Level Synthesis
Binary synthesis competitive with high-level synthesis
  Binary speedup: 8x, high-level speedup: 8.2x
  High-level synthesis only 2.5% better (small difference in speedup)
Commercial products beginning to appear
  Critical Blue, Binachip

40 40/55 Binary Synthesis with Software Compiler Optimizations
But the binaries so far were generated with few optimizations
  Optimizations for software may hurt hardware
  Need new decompilation techniques
[Figure: C code goes through the SW compiler to an optimized binary, then through binary synthesis to uP and FPGA]
Binary is optimized for software
Hardware synthesized from optimized binary may be inefficient

41 41/55 Loop Rerolling
Problem: Loop unrolling may cause inefficient hardware
  Longer synthesis times
    Super-linear heuristics
    Unrolling 100 times => synthesis time is 100^2 times longer
  Larger area requirements
  Unrolling by compiler unlikely to match unrolling by synthesis
  Loop structure needed for advanced synthesis techniques
Solution: We introduce loop rerolling to undo loop unrolling
Non-unrolled loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
Unrolled loop: the five-instruction body (Shl, Add, Ld, Add, Add) repeated once per iteration, without the branch
[Figure: synthesis execution times]

42 42/55 Loop Rerolling – Identifying Unrolled Loops
Idea: identify consecutively repeating instruction sequences
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;
Unrolled loop:
    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;
Binary, with each instruction mapped to a character:
    Add r3, r3, 1   => B
    Ld r0, b(0)     => A
    Add r1, r0, 1   => B
    St a(0), r1     => C
    Ld r0, b(1)     => A
    Add r1, r0, 1   => B
    St a(1), r1     => C
    Mov r4, r3      => D
String representation: BABCABCD
Find consecutive repeating substrings with a suffix tree [Ukkonen 95]: adjacent nodes with the same substring
Unrolled loop found: 2 unrolled iterations, each iteration = abc (Ld, Add, St)
[Figure: suffix tree of the string]
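The slide finds the repeats with a suffix tree [Ukkonen 95]. The sketch below swaps in a plain brute-force scan, which is much slower on long strings but finds the same kind of consecutive repeat in the slide's example. The names repeat_t and find_tandem_repeat are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t start;  /* index of the first repeated unit */
    size_t period; /* length of one unit */
    size_t count;  /* consecutive copies of the unit (>= 2), or 0 if none */
} repeat_t;

/* Brute-force search for the consecutive repeat covering the most
   characters. Ties keep the earliest candidate found. */
repeat_t find_tandem_repeat(const char *s)
{
    repeat_t best = {0, 0, 0};
    assert(s != NULL);
    size_t n = strlen(s);
    for (size_t start = 0; start < n; start++) {
        for (size_t period = 1; start + 2 * period <= n; period++) {
            size_t count = 1;
            /* Extend while the next unit equals the first unit. */
            while (start + (count + 1) * period <= n &&
                   memcmp(s + start, s + start + count * period, period) == 0) {
                count++;
            }
            if (count >= 2 && count * period > best.count * best.period) {
                best.start = start;
                best.period = period;
                best.count = count;
            }
        }
    }
    return best;
}
```

On the string BABCABCD this reports a unit of length 3 starting at index 1, repeated twice, that is, the two unrolled Ld, Add, St iterations.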

43 43/55 Loop Rerolling
Unrolled loop identification:
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
1) Determine relationship of constants
2) Replace constants with induction variable expression:
    Add r3, r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3
3) Rerolled, decompiled code:
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
        array1[i]=array2[i]+1;
    reg4=reg3;
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;
Average speedup of 1.6x

44 44/55 Strength Promotion
Problem: Strength reduction may cause inefficient hardware
However, some of the strength reduction was beneficial
Strength promotion lets synthesis decide on strength reduction, not the software compiler
  Identify strength-reduced subgraphs
  Replace with multiplication
  Synthesis reapplies strength reduction to get optimal DFG
Average speedup of 1.5x
[Figure: DFG summing A[i] with shift-and-add subgraphs on B[i], B[i+1], B[i+2], B[i+3] (shift amounts 3, 4, 5, 6, each paired with a shift by 1); the subgraphs are replaced one at a time by multiplications by 10, 18, 34, and 66]
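Reading the figure's labels as shift amounts (an interpretation of the garbled diagram, not something the slide states in words), each strength-reduced subgraph computes (x << a) + (x << 1), which equals a multiplication by 2^a + 2. A minimal sketch with hypothetical function names:

```c
#include <assert.h>

/* Constant multiplier equivalent to (x << sh_a) + (x << sh_b). */
unsigned long promoted_multiplier(unsigned sh_a, unsigned sh_b)
{
    assert(sh_a < 31 && sh_b < 31);
    return (1UL << sh_a) + (1UL << sh_b);
}

/* The strength-reduced form a software compiler might emit. */
unsigned long shift_add(unsigned long x, unsigned sh_a, unsigned sh_b)
{
    assert(sh_a < 31 && sh_b < 31);
    return (x << sh_a) + (x << sh_b);
}
```

The shift pairs (3, 1), (4, 1), (5, 1), and (6, 1) give the multipliers 10, 18, 34, and 66 that appear in the figure.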

45 45/55 Multiple ISA/Optimization Results
What about aggressive software compiler optimizations?
  May obscure binary, making decompilation impossible
What about different instruction sets?
  Side effects may degrade hardware performance
Results
  Speedups similar on MIPS for –O1 and –O3 optimizations
  Speedups similar on ARM for –O1 and –O3 optimizations
  Speedups similar between ARM and MIPS
    Complex instructions of ARM didn’t hurt synthesis
  MicroBlaze speedups much larger
    MicroBlaze is a slower microprocessor
    -O3 optimizations were very beneficial to hardware
[Figure: speedup per ISA and optimization level]

46 46/55 High-level vs. Binary Synthesis: Proprietary H.264 Decoder
High-level synthesis vs. binary synthesis
  Collaboration with Freescale Semiconductor
H.264 decoder
  MPEG-4 Part 10 Advanced Video Coding (AVC)
  3x smaller than MPEG-2
  Better quality
[Figure: MPEG2 vs. H.264]

47 47/55 High-level vs. Binary Synthesis: Proprietary H.264 Decoder
Binary synthesis was competitive with high-level synthesis
  High-level speedup: 6.56x
  Binary speedup: 6.55x

48 48/55 Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-Threaded Warp Processing
  Custom Communication

49 49/55 Thread Warping - Overview
Architectural trend: include more cores on chip
Result: more multi-threaded applications
Function a( ):
    for (i=0; i < 10; i++)
        createThread( b );
OS can only schedule 2 threads
Remaining 8 threads placed in thread queue
Warp tools create custom accelerators for b( )
OS schedules 4 threads to custom accelerators
3x more thread parallelism
[Figure: two µPs running a( ) and b( ), OS thread queue, profiler, warp tools, and warp FPGA holding accelerators for b( )]
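The thread counts on this slide can be put into a toy scheduling model. It assumes all threads take equally long and that a thread occupies one execution slot (a µP or an accelerator) until it finishes; neither assumption comes from the talk, and rounds_needed is a hypothetical name.

```c
#include <assert.h>

/* Rounds needed to run `threads` equal-length threads when `slots`
   of them can execute at the same time (ceiling division). */
unsigned rounds_needed(unsigned threads, unsigned slots)
{
    assert(slots > 0);
    return (threads + slots - 1) / slots;
}
```

With 2 µPs alone, the 10 threads need 5 rounds. With 2 µPs plus 4 accelerators there are 6 slots, three times as many, and 2 rounds suffice.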

50 50/55 Thread Warping - Overview
Function a( ):
    for (i=0; i < 10; i++)
        createThread( b );
Profiler detects performance critical loop in b( )
Warp tools create larger/faster accelerators
Potentially > 100x speedup

51 51/55 Thread Warping - Results
Comparison of thread warping (TW) and multi-core
  Simulated multi-cores ranging from 4 to 64
  Thread warping: 4 cores + FPGA
Thread warping 120x faster than 4-uP (ARM) system

52 52/55 Warp Processing – Custom Communication
NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
[Figure: performance of bus vs. mesh topologies for App1 and App2]

53 53/55 Warp Processing – Custom Communication
NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
Warp processing can dynamically choose topology: 2x to 100x improvement
Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”
[Figure: performance of bus vs. mesh topologies for App1 and App2; µPs connected through FPGA]

54 54/55 Summary
Warp processing: developer is unaware of FPGA/synthesis
  Any language, any compiler, standard binary
Binary synthesis: decompilation makes it possible
JIT FPGA compilation
Expandable logic
Warp processing invisibly achieves > 100x speedups
[Figure: warp processor with uP, I$, D$, FPGA, profiler, and on-chip CAD, turning a binary into an updated binary and HW]

55 55/55 References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
4. Expandable Logic. G. Stitt, F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.
Supported by NSF, SRC, Intel, IBM, Xilinx

