WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.


1 WARP PROCESSORS. ROMAN LYSECKY, GREG STITT, FRANK VAHID. Presented by: Xin Guan, Mar. 17, 2010

2 Outline: FPGA Co-processing, Warp Processing Concepts, Warp FPGA, On-chip CAD, Experimental Results

3 Outline: FPGA Co-processing, Warp Processing Concepts, Warp FPGA, On-chip CAD, Experimental Results

4 Implementing Critical Regions on FPGA
   Special compiler: compile-time HW/SW partitioning
   Large speedups and power savings: 10-100X overall speedups
   [Figure: Traditional Architecture (Application, Proc., FPGA)]

5 Bit-level Operations: High Speedup
   C Code for Bit Reversal:
       x = (x >>16) | (x <<16);
       x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
       x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
       x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
       x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
   Binary (after Compilation):
       sll $v1[3],$v0[2],0x10
       srl $v0[2],$v0[2],0x10
       or  $v0[2],$v1[3],$v0[2]
       srl $v1[3],$v0[2],0x8
       and $v1[3],$v1[3],$t5[13]
       sll $v0[2],$v0[2],0x8
       and $v0[2],$v0[2],$t4[12]
       or  $v0[2],$v1[3],$v0[2]
       srl $v1[3],$v0[2],0x4
       and $v1[3],$v1[3],$t3[11]
       sll $v0[2],$v0[2],0x4
       and $v0[2],$v0[2],$t2[10]
       ...
   Requires between 32 and 128 cycles
   Circuit for Bit Reversal:
   [Figure: Original X Value wired to Bit Reversed X Value]
   Requires only 1 cycle (speedup of 32x to 128x) for same clock
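The contrast on this slide can be sketched in C. The first function wraps the five swap steps from the slide. The second is an illustrative model (not from the slides) of the circuit view, where output bit i is just input bit 31 - i, so the hardware needs wires only and no logic. The function names are hypothetical, and a 32-bit word is assumed.

```c
#include <stdint.h>

/* Software version: the five swap steps from the slide, applied to a
   32-bit word. Each step costs several shift/mask/or instructions. */
uint32_t reverse_bits_sw(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Model of the circuit: output bit i is input bit (31 - i).
   In hardware this is 32 crossed wires; the loop only simulates
   that wiring and says nothing about hardware timing. */
uint32_t reverse_bits_wires(uint32_t x)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bit = (x >> (31 - i)) & 1u;
        out |= bit << i;
    }
    return out;
}
```

Both functions compute the same permutation of bits; only the software version has to pay for it in instructions.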

6 Parallelizable Code: High Speedup
   C Code for FIR Filter:
       for (i=0; i < 128; i++)
           y += c[i] * x[i];
       ...
   1000’s of instructions, several thousand cycles
   Circuit for FIR Filter:
   [Figure: row of multipliers (*) feeding a tree of adders (+)]
   ~ 7 cycles; speedup > 100x for same clock
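One plausible reading of the "~ 7 cycles" figure is that the 128 products are formed in parallel and then summed by a binary adder tree, which has log2(128) = 7 levels. The slide does not state this, so treat it as an assumption. The sketch below compares the serial loop with a software simulation of such a tree; the function names and the use of integer samples are choices made for this illustration.

```c
#include <stdint.h>

#define TAPS 128  /* filter length from the slide; must be a power of two here */

/* Software view: one multiply-accumulate per loop iteration. */
int64_t fir_serial(const int32_t c[TAPS], const int32_t x[TAPS])
{
    int64_t y = 0;
    for (int i = 0; i < TAPS; i++)
        y += (int64_t)c[i] * x[i];
    return y;
}

/* Hardware view (simulated): all products at once, then pairwise
   additions level by level. *levels receives the number of adder
   levels, which is the depth of the tree. */
int64_t fir_tree(const int32_t c[TAPS], const int32_t x[TAPS], int *levels)
{
    int64_t p[TAPS];
    for (int i = 0; i < TAPS; i++)      /* one multiplier per tap */
        p[i] = (int64_t)c[i] * x[i];

    int n = TAPS;
    int depth = 0;
    while (n > 1) {                     /* one adder level per pass */
        for (int i = 0; i < n / 2; i++)
            p[i] = p[2 * i] + p[2 * i + 1];
        n /= 2;
        depth++;
    }
    *levels = depth;
    return p[0];
}
```

For 128 taps the tree is 7 levels deep, while the serial loop performs 128 dependent accumulations.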

7 Outline: FPGA Co-processing, Warp Processing Concepts, Warp FPGA, On-chip CAD, Experimental Results

8 Warp Processing Concepts
   Dynamic HW/SW Partitioning
   Transparent Optimizations
   A Warp Processor dynamically detects the binary’s critical regions, re-implements those regions as a custom hardware circuit in the FPGA, and replaces each software region with a call to the new hardware implementation of that region.

9 Warp Architecture
   [Figure: µP, I$, D$, Warp FPGA, Profiler, On-chip CAD Module]
   1. Initially execute the application in software only
   2. Profile the application to determine critical regions
   3. Implement critical regions in a hardware configuration
   4. Program the configurable logic and update the software binary
   5. The partitioned application executes faster and with lower energy consumption
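Step 2 can be illustrated with a small sketch. The slides do not say how the profiler finds critical regions; the code below assumes a common heuristic, counting taken backward branches, since a backward branch usually closes a loop. The table size, type names, and function names are all hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define SLOTS 16  /* hypothetical table size */

typedef struct {
    uint32_t target;  /* address a backward branch jumped to (loop head) */
    uint32_t count;   /* number of times that branch was taken */
} LoopEntry;

typedef struct {
    LoopEntry slot[SLOTS];
    size_t used;
} Profiler;

/* Record one taken branch from address pc to address target. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target)
{
    if (target >= pc)
        return;                          /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].target == target) {
            p->slot[i].count++;
            return;
        }
    }
    if (p->used < SLOTS) {               /* new loop head */
        p->slot[p->used].target = target;
        p->slot[p->used].count = 1;
        p->used++;
    }
    /* Table full: this simple sketch drops the sample. */
}

/* Return the most frequently executed loop head, or 0 if none seen. */
uint32_t profiler_hottest(const Profiler *p)
{
    uint32_t best = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].count > best_count) {
            best_count = p->slot[i].count;
            best = p->slot[i].target;
        }
    }
    return best;
}
```

The address returned by profiler_hottest would be the candidate critical region handed to the on-chip CAD step.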

10 Warp Architecture
   [Figure: µP, I$, D$, Warp FPGA, Profiler, On-chip CAD Module]
   Execution Model: the main processor and the W-FPGA work in mutually exclusive mode.
   Benefits: no cache coherence or consistency problems

11 Outline: FPGA Co-processing, Warp Processing Concepts, Warp FPGA, On-chip CAD, Experimental Results

12 Warp-Oriented FPGA: Configurable Logic Fabric
   CLB (Configurable Logic Block)
   SM (Switch Matrix)
   One CLB connected to one SM
   Short channel, long channel

13 Warp-Oriented FPGA: CLB
   Two 3-input, 2-output LUTs
   Carry chain routing
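A 3-input, 2-output LUT has 8 rows of 2 bits, so 16 configuration bits. As an illustration of why that shape pairs well with a carry chain, the sketch below programs one such LUT as a full adder (inputs a, b, carry-in; outputs sum and carry-out) and chains 32 evaluations into a ripple adder. The slide does not describe this use, and the bit layout of the configuration word is an assumption made for this example.

```c
#include <stdint.h>

/* Configuration of one 3-input, 2-output LUT.
   Assumed layout: row r = (cin << 2) | (b << 1) | a,
   with out0 at bit 2r and out1 at bit 2r + 1. */
typedef uint16_t Lut3x2;

/* Build the truth table of a full adder: out0 = sum, out1 = carry-out. */
Lut3x2 lut_program_full_adder(void)
{
    Lut3x2 cfg = 0;
    for (unsigned r = 0; r < 8; r++) {
        unsigned a = r & 1u, b = (r >> 1) & 1u, cin = (r >> 2) & 1u;
        unsigned total = a + b + cin;
        unsigned sum = total & 1u, cout = total >> 1;
        cfg |= (Lut3x2)(((cout << 1) | sum) << (2 * r));
    }
    return cfg;
}

/* Look up both outputs for one input combination (2-bit result). */
unsigned lut_eval(Lut3x2 cfg, unsigned a, unsigned b, unsigned cin)
{
    unsigned r = (a & 1u) | ((b & 1u) << 1) | ((cin & 1u) << 2);
    return (cfg >> (2 * r)) & 3u;
}

/* 32-bit ripple adder: one LUT evaluation per bit, with the carry
   output of each bit feeding the next, as a carry chain would. */
uint32_t lut_ripple_add(uint32_t x, uint32_t y)
{
    Lut3x2 fa = lut_program_full_adder();
    uint32_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 32; i++) {
        unsigned out = lut_eval(fa, (x >> i) & 1u, (y >> i) & 1u, carry);
        result |= (uint32_t)(out & 1u) << i;
        carry = out >> 1;
    }
    return result;
}
```

With single-output LUTs the same full adder would need two tables per bit, one for the sum and one for the carry.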

14 Warp-Oriented FPGA: Switch Matrix
   Single channel routing
   Long and short, long and long, short and short

15 Overall W-FPGA Architecture
   CLF (Configurable Logic Fabric)
   Data Address Generator & Loop Control Hardware
   Registers
   32-bit MAC (Multiply and Accumulate)
   Custom circuits are added to aid computation in the FPGA.
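A rough software model of how a data address generator and a MAC unit could cooperate on a loop such as a dot product is sketched below. The slide names these blocks but gives no interface, so every type, field, and function here is hypothetical. Addresses are word indices into a flat memory array, and the accumulator is widened to 64 bits in this model to avoid overflow.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical address generator: walks base, base + stride, ...
   for a fixed number of loop iterations. */
typedef struct {
    uint32_t base, stride, count, index;
} AddrGen;

/* Produce the next address; returns false once the loop is done. */
bool addrgen_next(AddrGen *g, uint32_t *addr)
{
    if (g->index >= g->count)
        return false;
    *addr = g->base + g->index * g->stride;
    g->index++;
    return true;
}

/* Hypothetical multiply-accumulate unit with 32-bit operands. */
typedef struct {
    int64_t acc;  /* widened accumulator, a modelling choice */
} Mac;

void mac_step(Mac *m, int32_t a, int32_t b)
{
    m->acc += (int64_t)a * b;
}

/* Dot product of two n-element arrays stored in mem, with the loop
   control and addressing done by the generators, not the caller. */
int64_t dot_with_dadg(const int32_t *mem, uint32_t c_base,
                      uint32_t x_base, uint32_t n)
{
    AddrGen gc = { c_base, 1, n, 0 };
    AddrGen gx = { x_base, 1, n, 0 };
    Mac m = { 0 };
    uint32_t ac, ax;
    while (addrgen_next(&gc, &ac) && addrgen_next(&gx, &ax))
        mac_step(&m, mem[ac], mem[ax]);
    return m.acc;
}
```

The point of the model is that address arithmetic and loop counting sit in dedicated blocks, which leaves the configurable logic free for the loop body itself.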

16 Outline: FPGA Co-processing, Warp Processing Concepts, Warp FPGA, On-chip CAD, Experimental Results

17 On-Chip CAD
   [Figure: on-chip CAD flow. Stages: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Updater. Artifacts: Binary, Std. HW Binary, HW Bitstream, Updated Binary]
   Using the profiler to locate the critical region in the application
   Binary -> Assembly -> control and data flow graph
   Control and data flow graph -> netlist of basic logic functions, like NOR, NAND, XOR …

18 On-Chip CAD
   [Figure: on-chip CAD flow. Stages: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Updater. Artifacts: Binary, Std. HW Binary, HW Bitstream, Updated Binary]
   JIT FPGA Compilation: Logic Synthesis, Tech. Mapping/Packing, Placement, Routing

19 On-Chip CAD
   Lean algorithm: 34720 lines of C code, 327 KB of instruction cache (general algorithm: hundreds of thousands of lines)
   Fast execution: 1.2 s on a 40 MHz ARM7 microprocessor (general algorithm: minutes to hours)
   Small cache required: 3.6 MB of data cache (general algorithm: exceeding 50 MB)
   Alternative implementation: software task on the main processor
   [Figure: µP, I$, D$, Warp FPGA, Profiler, On-chip CAD Module with its own I$ and D$]

20 Outline: FPGA Co-processing, Warp Processing Concepts, Warp FPGA, On-chip CAD, Experimental Results

21 Experimental Results
   Comparing the Warp Processor implementation with a traditional HW/SW partitioning implementation (special compiler and Xilinx FPGA) on embedded benchmarks.
   Speedups for the same critical region chosen by the profiler in the Warp Processor
   Energy consumption for the same critical region chosen by the profiler in the Warp Processor

22 Experimental Results
   The better speedup may result from the custom circuits included in the W-FPGA architecture

23 Experimental Results

24 Warp Processor
   Achieved comparable performance and energy consumption
   Without a special compiler
   Totally transparent: separates high-level code from the FPGA architecture
   Applicable to any software program

25 Experimental Results
   Achieves the lowest energy consumption compared with implementations on various processors

26 Conclusion
   Feasibility
   Improved lean algorithms for partitioning, synthesis, placement and routing
   Improved Warp FPGA architecture

27 Questions?

