Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis


1 Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Greg Stitt, Department of Electrical and Computer Engineering, University of Florida

2 Introduction
Improved performance enables new applications
  Past decade: MP3 players, portable game consoles, cell phones, etc.
  Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

3 Introduction
FPGAs (Field Programmable Gate Arrays) implement custom circuits
  FPGAs are capable of large performance improvements: 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
  But FPGAs are not mainstream
Warp processing goal: bring FPGAs into the mainstream
  Make FPGAs “invisible”
[Figure: performance of FPGA compared with uP]

4 Introduction – Hardware/Software Partitioning
Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94]
  The compiler maps the C code to the processor
  The designer creates custom hardware for a critical loop using a hardware description language (HDL)
C code for FIR filter:
    for (i=0; i < 16; i++) y[i] += c[i] * x[i];
    ..
    for (i=0; i < 128; i++) y[i] += c[i] * x[i];
    ..
Processor: ~1000 cycles
FPGA (hardware for the loop, built from multipliers and adders): ~10 cycles
Speedup = 1000 cycles / 10 cycles = 100x
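The arithmetic on this slide can be reproduced with a small sketch. The function names mac_loop and speedup are hypothetical and not from the presentation; the loop body is the one shown on the slide, and the cycle counts are the slide's approximate figures.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulate loop from the slide: y[i] += c[i] * x[i]. */
void mac_loop(int *y, const int *c, const int *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] += c[i] * x[i];
    }
}

/* Speedup of a hardware implementation over software,
   given the cycle count of each (hypothetical helper). */
double speedup(double sw_cycles, double hw_cycles) {
    assert(hw_cycles > 0.0);
    return sw_cycles / hw_cycles;
}
```

With the slide's numbers, speedup(1000.0, 10.0) gives the quoted 100x.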

5 Introduction – High-level Synthesis
Problem: Describing a circuit using an HDL is time-consuming and difficult
Solution: High-level synthesis
  Creates a circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  Allows developers to use a higher-level specification
  Potentially enables synthesis for software developers
[Figure: tool flow; box labels: High-level Code, Hw/Sw Partitioning, Compiler, High-level Synthesis, Hardware, Software, Libraries/Object Code, Linker, Bitstream, Updated Binary, uP, FPGA]

7 Introduction – High-level Synthesis
Problem: Describing a circuit using an HDL is time-consuming and difficult
Solution: High-level synthesis
  Creates a circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
  Allows developers to use a higher-level specification
  Potentially enables synthesis for software developers
Example: high-level synthesis turns the following loop into a multiply (*) and add (+) circuit:
    for (i=0; i < 16; i++) y[i] += c[i] * x[i];

8 Outline
Introduction
Warp Processing Overview
Enabling Technology: Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication

9 Problems with High-Level Synthesis
Problem: High-level synthesis is unattractive to software developers
  Requires a specialized language: SystemC, NapaC, HandelC, …
  Requires a specialized compiler: Spark, ROCCC, CatapultC, …
  Non-standard software tool flow
  Limited commercial success: software developers are reluctant to change tools
[Figure: non-standard software tool flow; box labels: Specialized Language, Specialized Compiler, Synthesis, Hardware, Software, Libraries/Object Code, Linker, Bitstream, Updated Binary, uP, FPGA]

10 Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 requirements:
  Standard software tool flow
    Perform compilation before synthesis
  Hide synthesis tool
    Move synthesis on chip
    Similar to dynamic binary translation [Transmeta], but translates to hardware
[Figure: standard software tool flow with compilation moved before synthesis; box labels: High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary, Decompilation, Synthesis, Hardware, Software, Bitstream, Updated Binary, uP, FPGA]

11 Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”
2 requirements:
  Standard software tool flow
    Perform compilation before synthesis
  Hide synthesis tool
    Move synthesis on chip
    Similar to dynamic binary translation [Transmeta], but translates to hardware
Warp processor looks like a standard uP but invisibly synthesizes hardware
[Figure: decompilation and synthesis moved on chip; box labels: High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary, Decompilation, Synthesis, Hardware, Software, Bitstream, Updated Binary, uP, FPGA]

12 Warp Processing – “Invisible” Synthesis
Advantages
  Supports all languages, compilers, IDEs (e.g., C, C++, Java, Matlab; gcc, g++, javac, keil)
  Supports synthesis of assembly code
  Supports synthesis of library code
  Also enables dynamic optimizations
Warp processor looks like a standard uP but invisibly synthesizes hardware
[Figure: same tool flow as the previous slide; box labels: High-level Code, Compiler, Libraries/Object Code, Linker, Software Binary, Decompilation, Synthesis, Hardware, Software, Bitstream, Updated Binary, uP, FPGA]

13 Warp Processing Background: Basic Idea
Step 1: Initially, the software binary is loaded into instruction memory
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
[Figure: warp processor architecture; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD]

14 Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in the software binary (the same binary as on slide 13)
[Figure: warp processor architecture with the µP highlighted; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD]

15 Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in the binary (the same binary as on slide 13)
  The µP repeatedly executes the loop's add and beq instructions
  Critical loop detected
[Figure: warp processor architecture with the Profiler and µP highlighted; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD]
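The slide does not say how the profiler detects a critical loop. One plausible sketch, assuming the profiler counts taken backward branches per target address and treats the most frequent target as the head of the critical loop, is shown here. The table size, type names, and function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16  /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILE_SLOTS];  /* loop head addresses seen so far */
    uint32_t count[PROFILE_SLOTS];   /* taken backward branches per head */
    size_t used;
} Profiler;

void profiler_init(Profiler *p) {
    assert(p != NULL);
    p->used = 0;
}

/* Record one executed branch. Only taken backward branches
   (target <= pc) are counted as loop iterations. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target, int taken) {
    if (!taken || target > pc) {
        return;
    }
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {  /* new loop head; dropped if table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the most frequently targeted loop head and store its count.
   Returns 0 with *count == 0 if no backward branch was observed. */
uint32_t profiler_hottest(const Profiler *p, uint32_t *count) {
    assert(p != NULL && count != NULL);
    uint32_t best_target = 0;
    *count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > *count) {
            *count = p->count[i];
            best_target = p->target[i];
        }
    }
    return best_target;
}
```

A hardware profiler would use a small fixed-size table like this one rather than dynamic memory, which is why entries beyond the table size are simply dropped in the sketch.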

16 Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in the critical region (the loop of the binary on slide 13)
[Figure: warp processor architecture with the On-chip CAD highlighted; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD]

17 Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts the critical region into a control data flow graph (CDFG)
CDFG for the binary on slide 13:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
[Figure: warp processor architecture; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD / Dynamic Part. Module (DPM)]

18 Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes the decompiled CDFG (shown on slide 17) to a custom (parallel) circuit
[Figure: circuit built from adders (+), next to the warp processor architecture; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD / Dynamic Part. Module (DPM)]

19 Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps the circuit onto the FPGA
[Figure: adder circuit placed onto the FPGA fabric (CLB, SM), next to the warp processor architecture; blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD / Dynamic Part. Module (DPM)]

20 Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more
Software-only binary: as on slide 13
“Warped” binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: warp processor architecture with the circuit mapped onto the FPGA (CLB, SM); blocks: Profiler, I Mem, µP, D$, FPGA, On-chip CAD / Dynamic Part. Module (DPM)]

21 Expandable Logic
Expandable RAM: System detects RAM during start, improves performance invisibly
Expandable Logic: Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
[Figure: expandable RAM and expandable logic platforms; blocks: µP, Cache, RAM, Profiler, Warp Tools, DMA, FPGA; performance compared with uP]

22 Expandable Logic
Allows for customization of platforms
  User can select FPGAs based on the applications used
[Figure: application performance chart; labels: Portable Gaming, Unacceptable Performance]

23 Expandable Logic
Allows for customization of platforms
  User can select FPGAs based on the applications used
  User can customize FPGAs to the desired amount of performance
  Performance improvement is invisible: it doesn’t require a new binary from the developer
[Figure: application performance chart; label: Portable Gaming]

24 Expandable Logic
Allows for customization of platforms
  User can select FPGAs based on the applications used
  Platform designer doesn’t have to decide on a fixed amount of FPGA
  User doesn’t have to pay for FPGA that isn’t needed
[Figure: application performance chart; labels: No-FPGA, Web Browser, Acceptable Performance]

25 Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations
Develop extremely efficient on-chip CAD tools
  Requires efficient synthesis
  Requires specialized FPGA, physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona
[Figure: on-chip CAD flow; box labels: Binary, Binary Synthesis, HW, JIT FPGA compilation (Logic Optimization, Technology Mapping, Placement & Routing), Updated Binary; architecture blocks: uP, I$, D$, FPGA, Profiler, On-chip CAD]

26 Warp Processing Background: On-Chip CAD
CAD steps: synthesis, RT synthesis, logic optimization, technology mapping, placement, routing (label: manually performed)
Xilinx ISE: 9.1 s, 60 MB
On-chip CAD: 0.2 s, 3.6 MB
  46x improvement
  30% perf. penalty
On a 75 MHz ARM7: only 1.4 s

27 Warp Processing: Initial Results - Embedded Applications
Average speedup of 6.3x
  Achieved completely transparently
Also, energy savings of 66%

28 Outline
Introduction
Warp Processing Overview
Enabling Technology: Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication

29 Binary Synthesis
Warp processors perform synthesis from the software binary: “binary synthesis”
Problem: No high-level information
  Synthesis needs high-level constructs
  The binary has no high-level constructs (arrays, loops, etc.): > 10x slowdown
Can we recover high-level information for synthesis?
  Make binary synthesis (and warp processing) competitive with high-level synthesis
High-level code:
    for (i=0; i < 128; i++) y[i] += c[i] * x[i];
    ..
Binary produced by the compiler:
    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5
Hardware can be > 10x to 100x faster
[Figure: the compiler produces a binary for the processor; binary synthesis produces hardware for the FPGA]

30 Decompilation
We realized that decompilation recovers high-level information
  But it is generally used for binary translation or source-code recovery
  May not be suitable for synthesis
We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]
  DisC, dcc, Boomerang, Mocha, SourceAgain
  Determined relevant techniques
  Adapted existing techniques for synthesis

31 Decompilation – Control/Data Flow Graph Recovery
Recovery of control/data flow graph (CDFG)
  Format used by synthesis
  Difficult because of indirect jumps: cannot statically analyze control flow
  But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]
Original C code:
    long f( short a[10] ) {
      long accum;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
After control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

32 Decompilation – Data Flow Analysis
Original purpose: remove temporary registers
Area overhead: 130%
Need new techniques for binary synthesis
Original C code and corresponding assembly: as on slide 31
After data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

33 Decompilation – Data Flow Analysis
Strength reduction: compare-with-zero instructions
    Sub reg3, reg4, reg5
    Bz reg3, -5
  Original DFG: reg4 and reg5 feed a Sub producing reg3, which is compared with zero to decide the branch; the subtractor is not needed and wastes area
  Optimized DFG: reg4 = reg5 decides the branch directly
Operator size reduction
    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5
  Original DFG: 32-bit reg4 (load byte) + 32-bit reg5 (constant 16) through a 32-bit adder into 32-bit reg3
  Optimized DFG: 8-bit reg4 + 5-bit reg5 through an 8-bit adder into 8-bit reg3; only an 8-bit adder is needed
Area overhead reduced to 10%
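The operator size reduction on this slide can be illustrated with a small sketch of the width calculation. It assumes, as the slide's example does, that the adder is sized to its wider operand and that the carry-out is ignored; the function names bits_needed and adder_width are hypothetical.

```c
#include <assert.h>

/* Minimum number of bits needed to hold an unsigned value (at least 1). */
unsigned bits_needed(unsigned long value) {
    unsigned bits = 1;
    while (value >>= 1) {
        bits++;
    }
    return bits;
}

/* Width of an adder sized to its wider operand.
   Assumption: carry-out is ignored, matching the slide's 8-bit result. */
unsigned adder_width(unsigned width_a, unsigned width_b) {
    assert(width_a > 0 && width_b > 0);
    return width_a > width_b ? width_a : width_b;
}
```

For the slide's example, the constant 16 needs bits_needed(16) = 5 bits, a loaded byte needs 8 bits, and adder_width(8, 5) gives the 8-bit adder in the optimized DFG.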

34 Decompilation – Function Recovery
Recover parameters and return values
  Def-use analysis of prologue/epilogue
  100% success rate
Original C code and corresponding assembly: as on slide 31
After function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

35 Decompilation – Control Structure Recovery
Recover loops, if statements
  Uses interval analysis techniques [Cifuentes 94]
  100% success rate
Original C code and corresponding assembly: as on slide 31
After control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

36 Decompilation – Array Recovery
Detect linear memory patterns and row-major ordering calculations
  ~95% success rate [Stitt, Guo, Najjar, Vahid 05] [Cifuentes 00]
Original C code and corresponding assembly: as on slide 31
After array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
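A linear memory pattern can be recognized by checking for a constant stride between successive addresses. The sketch below is an illustration under that assumption, not the presentation's actual algorithm; linear_stride is a hypothetical name, and addresses are modeled as plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* Return the constant stride between consecutive addresses,
   or 0 if the pattern is not linear or has fewer than 2 addresses. */
long linear_stride(const long *addr, size_t n) {
    assert(addr != NULL);
    if (n < 2) {
        return 0;
    }
    long stride = addr[1] - addr[0];
    for (size_t i = 2; i < n; i++) {
        if (addr[i] - addr[i - 1] != stride) {
            return 0;
        }
    }
    return stride;
}
```

For the loop on this slide, the addresses reg2 + (reg3 << 1) for reg3 = 0..9 have a stride of 2, which is consistent with an array of 2-byte (short) elements.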

37 Comparison of Decompiled Code and Original Code
Decompiled code is almost identical to the original code
  Only difference is variable names
Binary synthesis is competitive with high-level synthesis
Original C code:
    long f( short a[10] ) {
      long accum;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Decompiled code:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
Almost identical representations

38 Binary Synthesis Tool Flow
Initially, high-level source is compiled and linked to form a binary
Decompilation: recovers high-level information needed for synthesis
Binary updater: modifies the binary to use the synthesized hardware
~30,000 lines of C code
[Figure: binary synthesis tool flow; box labels: High-level Source, Compiler, Libraries/Object Code, Binary, Profiling, Decompilation, Hw/Sw Estimation, Hw/Sw Partitioning, Hardware, Software, Synthesis, Hardware Netlists, Binary Updater, Bitstream, Updated Binary, uP, FPGA]

39 Binary Synthesis is Competitive with High-Level Synthesis
Binary synthesis is competitive with high-level synthesis
  Small difference in speedup
  Binary speedup: 8x, high-level speedup: 8.2x
  High-level synthesis only 2.5% better
Commercial products beginning to appear
  Critical Blue, Binachip

40 Binary Synthesis with Software Compiler Optimizations
But binaries were generated with few optimizations
  Optimizations for software may hurt hardware
  Hardware synthesized from an optimized binary may be inefficient
  Need new decompilation techniques
[Figure: C code goes through the SW compiler to an optimized binary (optimized for software), then through binary synthesis to the uP and FPGA]

41 Loop Rerolling
Problem: Loop unrolling may cause inefficient hardware
  Longer synthesis times
    Super-linear heuristics
    Unrolling 100 times => synthesis time is 100^2 times longer
  Larger area requirements
  Unrolling by compiler unlikely to match unrolling by synthesis
  Loop structure needed for advanced synthesis techniques
Non-unrolled loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
Unrolled loop (body repeated, without the branch):
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    ..
[Figure: synthesis execution times]
Solution: We introduce loop rerolling to undo loop unrolling

42 Loop Rerolling – Identifying Unrolled Loops
Idea: Identify consecutively repeating instruction sequences
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++) a[i]=b[i]+1;
    y=x;
Unrolled loop:
    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;
Binary, mapped to a string (one letter per instruction type):
    Add r3, r3, 1  => B
    Ld r0, b(0)    => A
    Add r1, r0, 1  => B
    St a(0), r1    => C
    Ld r0, b(1)    => A
    Add r1, r0, 1  => B
    St a(1), r1    => C
    Mov r4, r3     => D
String representation: BABCABCD
Find consecutive repeating substrings: adjacent nodes with the same substring in the suffix tree [Ukkonen 95]
Unrolled loop found: 2 unrolled iterations, each iteration = abc (Ld, Add, St)
[Figure: suffix tree of the string, with nodes including b, c, d, abc, abcd, abcabcd]
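The search for consecutive repeats can be sketched without a suffix tree. The version below swaps in a naive scan over every start position and substring length, which is much slower than the suffix-tree approach the slide cites but finds the same repeat on this example. The Repeat type and find_unrolled_loop are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t start;    /* index of the first repetition */
    size_t length;   /* instructions per iteration */
    size_t repeats;  /* number of consecutive iterations */
} Repeat;

/* Naive search for the consecutively repeated substring that covers
   the most characters. Returns length 0 if nothing repeats. */
Repeat find_unrolled_loop(const char *s) {
    assert(s != NULL);
    Repeat best = {0, 0, 1};
    size_t n = strlen(s);
    for (size_t len = 1; len * 2 <= n; len++) {
        for (size_t start = 0; start + 2 * len <= n; start++) {
            size_t reps = 1;
            /* Extend while the next block equals the first block. */
            while (start + (reps + 1) * len <= n &&
                   memcmp(s + start, s + start + reps * len, len) == 0) {
                reps++;
            }
            if (reps >= 2 && reps * len > best.repeats * best.length) {
                best.start = start;
                best.length = len;
                best.repeats = reps;
            }
        }
    }
    return best;
}
```

On the slide's string BABCABCD this reports a repeat of length 3 starting at index 1 with 2 repetitions, i.e. the two unrolled iterations ABC.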

43 Loop Rerolling
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++) a[i]=b[i]+1;
    y=x;
Unrolled loop identification:
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
1) Determine relationship of constants
2) Replace constants with induction variable expression
    Add r3, r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3
3) Rerolled, decompiled code
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++) array1[i]=array2[i]+1;
    reg4=reg3;
Average speedup of 1.6x

44 Strength Promotion
Problem: Strength reduction may cause inefficient hardware
  Identify strength-reduced subgraphs
  Replace with multiplication
  However, some of the strength reduction was beneficial
  Synthesis reapplies strength reduction to get optimal DFG
Strength promotion lets synthesis decide on strength reduction, not the software compiler
[Figure: DFG computing A[i] from B[i], B[i+1], B[i+2], B[i+3]; the shift (<<) and add (+) subgraphs with shift amounts 3, 4, 5, 6 and 1 are replaced step by step with multiplications (*) by the constants 10, 18, 34 and 66]
Average speedup of 1.5x
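The replacement step can be illustrated by computing the multiplier that a shift-and-add subgraph stands for. This sketch assumes the subgraph is a sum of the same value shifted by different amounts, as in the figure, where shifts by 3 and 1 correspond to the constant 10. The names promoted_multiplier and shift_add are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Constant multiplier equivalent to the sum of (x << shifts[k]) over all k. */
unsigned long promoted_multiplier(const unsigned *shifts, size_t n) {
    assert(shifts != NULL);
    unsigned long m = 0;
    for (size_t k = 0; k < n; k++) {
        assert(shifts[k] < 32);
        m += 1UL << shifts[k];
    }
    return m;
}

/* Strength-reduced form, as a software compiler might emit it. */
unsigned long shift_add(unsigned long x, const unsigned *shifts, size_t n) {
    assert(shifts != NULL);
    unsigned long sum = 0;
    for (size_t k = 0; k < n; k++) {
        assert(shifts[k] < 32);
        sum += x << shifts[k];
    }
    return sum;
}
```

Shift pairs (3, 1), (4, 1), (5, 1) and (6, 1) yield the multipliers 10, 18, 34 and 66 that appear in the figure; once the multiplication is explicit, synthesis is free to choose its own implementation.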

45 Multiple ISA/Optimization Results
What about aggressive software compiler optimizations?
  May obscure binary, making decompilation impossible
What about different instruction sets?
  Side effects may degrade hardware performance
Results:
  Speedups similar between ARM and MIPS
    Complex instructions of ARM didn’t hurt synthesis
  Speedups similar on ARM for -O1 and -O3 optimizations
  Speedups similar on MIPS for -O1 and -O3 optimizations
  MicroBlaze speedups much larger
    MicroBlaze is a slower microprocessor
    -O3 optimizations were very beneficial to hardware
[Figure: speedup chart]

46 High-level vs. Binary Synthesis: Proprietary H.264 Decoder
High-level synthesis vs. binary synthesis
  Collaboration with Freescale Semiconductor
H.264 decoder
  MPEG-4 Part 10 Advanced Video Coding (AVC)
  3x smaller than MPEG-2
  Better quality
[Figure: MPEG2 compared with H.264]

47 High-level vs. Binary Synthesis: Proprietary H.264 Decoder
Binary synthesis was competitive with high-level synthesis
  High-level speedup: 6.56x
  Binary speedup: 6.55x

48 Outline
Introduction
Warp Processing Overview
Enabling Technology: Binary Synthesis
  Key techniques for synthesis from binaries
  Decompilation
Current and Future Directions
  Multi-threaded Warp Processing
  Custom Communication

49 Thread Warping - Overview
Architectural trend: include more cores on chip
Result: more multi-threaded applications
Example: function a( ) creates 10 threads of b( ):
    for (i=0; i < 10; i++) createThread( b );
  OS can only schedule 2 threads on the 2 µPs
  Remaining 8 threads are placed in the thread queue
  Warp tools create custom accelerators for b( ) on the warp FPGA
  OS schedules 4 threads to the custom accelerators
3x more thread parallelism
[Figure: blocks: µP, µP, Profiler, Warp Tools, Warp FPGA, OS, Thread Queue]

50 Thread Warping - Overview
Profiler detects performance-critical loop in b( )
Warp tools create larger/faster accelerators
Potentially > 100x speedup
Example code, in function a( ):
    for (i=0; i < 10; i++) createThread( b );
[Figure: blocks: µP, µP, Profiler, Warp Tools, Warp FPGA, OS]

51 Thread Warping - Results
Comparison of thread warping (TW) and multi-core
  Simulated multi-cores ranging from 4 to 64
  Thread warping: 4 cores + FPGA
Thread warping 120x faster than 4-uP (ARM) system

52 Warp Processing – Custom Communication
NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
[Figure: performance of App1 and App2 on bus and mesh topologies connecting µPs]

53 Warp Processing – Custom Communication
NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
Warp processing can dynamically choose topology: 2x to 100x improvement
Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”
[Figure: performance of App1 and App2 on bus and mesh topologies, with FPGAs placed between the µPs]

54 Summary
Warp processing
  Any language, any compiler, standard binary
  Developer is unaware of FPGA/synthesis
  Binary synthesis and JIT FPGA compilation, made possible by decompilation
  Expandable logic
Warp processing invisibly achieves > 100x speedups
[Figure: box labels: Binary, Decompilation, Binary Synthesis, HW, JIT FPGA Compilation, Updated Binary; architecture blocks: uP, I$, D$, FPGA, Profiler, On-chip CAD; performance compared with uP]

55 References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp
Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp
Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp
Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp
Supported by NSF, SRC, Intel, IBM, Xilinx

