WARP PROCESSORS
Roman Lysecky, Greg Stitt, Frank Vahid
Presented by: Xin Guan, Mar. 17, 2010

Outline
FPGA Co-processing
Warp Processing Concepts
Warp FPGA
On-chip CAD
Experimental Results

Implementing Critical Regions on FPGA
Special compiler: HW/SW partitioning at compile time.
Large speedups and power savings: 10-100X overall speedups.
[Figure: Traditional Architecture (Application, Proc., FPGA)]

Bit-level Operations: High Speedup

C code for bit reversal:

    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Binary compilation:

    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...

Software: requires between 32 and 128 cycles.
Circuit for bit reversal (wires from the original X value to the bit-reversed X value): requires only 1 cycle, a speedup of 32x to 128x for the same clock.
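To make the slide's fragment easier to try out, here is a minimal sketch that wraps the five shift/mask steps in a function and cross-checks them against a one-bit-per-iteration loop. The function names, the use of uint32_t, and the self-check routine are assumptions for illustration; the slide shows only the five assignment statements.

```c
#include <assert.h>
#include <stdint.h>

/* The five swap steps from the slide, wrapped in a function.
 * Assumes x is a 32-bit unsigned value (the slide does not state the type). */
uint32_t reverse_bits(uint32_t x)
{
    x = (x >> 16) | (x << 16);                                /* swap halves   */
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);  /* swap bytes    */
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);  /* swap nibbles  */
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);  /* swap bit pairs */
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);  /* swap adjacent bits */
    return x;
}

/* Hypothetical reference version: moves one bit per loop iteration. */
uint32_t reverse_bits_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Hypothetical self-check: both versions must agree on a few inputs. */
void reverse_bits_self_check(void)
{
    const uint32_t samples[] = { 0u, 1u, 0x80000000u, 0x12345678u, 0xffffffffu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(reverse_bits(samples[i]) == reverse_bits_naive(samples[i]));
}
```

In hardware the same permutation is pure wiring, which is why the slide can claim a single cycle: no shifter, mask, or OR gate is needed once each input bit is routed to its mirrored output position.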

Parallelizable Code: High Speedup

C code for FIR filter:

    for (i=0; i < 128; i++)
        y += c[i] * x[i];
    ...

Software: 1000's of instructions, several thousand cycles.
Circuit for FIR filter (a row of multipliers feeding a tree of adders): ~7 cycles.
Speedup > 100x for same clock.
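One plausible reading of the "~7 cycles" figure is that the 128 products are summed by a binary adder tree, which has log2(128) = 7 levels. The sketch below mimics that structure in software and counts the levels. The function names, the 16-bit sample type, and the 64-bit accumulator are assumptions chosen for illustration, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS 128   /* loop bound from the slide */

/* Sequential form, as in the slide's loop. */
int64_t fir_sequential(const int16_t c[TAPS], const int16_t x[TAPS])
{
    int64_t y = 0;
    for (size_t i = 0; i < TAPS; i++)
        y += (int64_t)c[i] * x[i];
    return y;
}

/* Tree form: compute every product, then add neighbours pairwise until one
 * value remains. In a circuit, each pass of the outer loop is one level of
 * adders working in parallel. *levels (optional) receives the level count. */
int64_t fir_adder_tree(const int16_t c[TAPS], const int16_t x[TAPS], int *levels)
{
    int64_t partial[TAPS];
    int depth = 0;

    assert((TAPS & (TAPS - 1)) == 0);   /* pairwise halving needs a power of two */

    for (size_t i = 0; i < TAPS; i++)   /* all multipliers, conceptually at once */
        partial[i] = (int64_t)c[i] * x[i];

    for (size_t width = TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++)
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        depth++;
    }
    if (levels)
        *levels = depth;
    return partial[0];
}
```

Calling fir_adder_tree with a non-null levels pointer reports 7 for 128 taps, and the result matches fir_sequential because integer addition is associative.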

Warp Processing Concepts
Dynamic HW/SW partitioning.
Transparent optimizations.
A warp processor dynamically detects the binary's critical regions, re-implements those regions as a custom hardware circuit in the FPGA, and replaces each software region with a call to its new hardware implementation.

Warp Architecture
[Figure: µP with I$ and D$, Profiler, Warp FPGA, On-chip CAD Module]
1. Initially execute the application in software only.
2. Profile the application to determine critical regions.
3. Implement the critical region as a hardware configuration.
4. Program the configurable logic and update the software binary.
5. The partitioned application executes faster and with lower energy consumption.
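Step 2 can be illustrated with a small software model. The slides do not say how the profiler finds critical regions; the sketch below assumes it counts taken backward branches (loop back-edges) in a small table and reports the most frequent loop head. The table size, type names, and function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */

typedef struct {
    uint32_t target;   /* address a backward branch jumps to (loop head) */
    uint32_t count;    /* number of times that branch was taken */
} profile_entry;

typedef struct {
    profile_entry slot[PROFILE_SLOTS];
    size_t used;
} profiler;

void profiler_init(profiler *p)
{
    assert(p != NULL);
    p->used = 0;
}

/* Record one taken branch. Only backward branches count as loop back-edges. */
void profiler_observe(profiler *p, uint32_t branch_pc, uint32_t target_pc)
{
    assert(p != NULL);
    if (target_pc >= branch_pc)
        return;                          /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].target == target_pc) {
            p->slot[i].count++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {       /* new loop head */
        p->slot[p->used].target = target_pc;
        p->slot[p->used].count = 1;
        p->used++;
    }
    /* Table full: this sketch simply ignores further loop heads. */
}

/* Return the loop head with the highest count, or 0 if none was recorded. */
uint32_t profiler_critical_region(const profiler *p)
{
    uint32_t best_target = 0, best_count = 0;
    assert(p != NULL);
    for (size_t i = 0; i < p->used; i++) {
        if (p->slot[i].count > best_count) {
            best_count = p->slot[i].count;
            best_target = p->slot[i].target;
        }
    }
    return best_target;
}
```

The address returned by profiler_critical_region would be the starting point for steps 3 and 4: the on-chip CAD tools work on the loop at that address and the binary updater patches in a call to the hardware.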

Warp Architecture: Execution Model
[Figure: µP with I$ and D$, Profiler, Warp FPGA, On-chip CAD Module]
The main processor and the W-FPGA work in a mutually exclusive mode: only one of them executes at a time.
Benefit: no cache coherence or consistency problems.

Warp-Oriented FPGA: Configurable Logic Fabric
CLB (Configurable Logic Block) and SM (Switch Matrix).
Each CLB is connected to one SM.
Routing uses short channels and long channels.

Warp-Oriented FPGA: CLB
Two 3-input, 2-output LUTs.
Carry chain routing.
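To show why a 3-input, 2-output LUT pairs naturally with a carry chain, the sketch below models one LUT as two 8-entry truth tables and configures it as a full adder (inputs a, b, carry-in; outputs sum, carry-out). Mapping a full adder onto the LUT is an illustrative assumption, as are the type and function names; the slide states only the LUT shape and the presence of carry chain routing.

```c
#include <assert.h>
#include <stdint.h>

/* A 3-input, 2-output LUT: bit i of each table is the output for input
 * pattern i, where i = a | (b << 1) | (c << 2). */
typedef struct {
    uint8_t out0;
    uint8_t out1;
} lut3x2;

/* Hypothetical configuration as a full adder:
 * out0 = a XOR b XOR c (sum), out1 = majority(a, b, c) (carry-out). */
static const lut3x2 FULL_ADDER = { 0x96, 0xE8 };

void lut_eval(const lut3x2 *lut, unsigned a, unsigned b, unsigned c,
              unsigned *o0, unsigned *o1)
{
    unsigned index = (a & 1u) | ((b & 1u) << 1) | ((c & 1u) << 2);
    *o0 = (lut->out0 >> index) & 1u;
    *o1 = (lut->out1 >> index) & 1u;
}

/* Ripple-carry addition: one LUT per bit, with each carry-out feeding the
 * next LUT's third input, as a carry chain would. Result is modulo 2^width. */
uint32_t lut_add(uint32_t x, uint32_t y, unsigned width)
{
    uint32_t sum = 0;
    unsigned carry = 0;

    assert(width >= 1 && width <= 32);
    for (unsigned bit = 0; bit < width; bit++) {
        unsigned s, cout;
        lut_eval(&FULL_ADDER, (x >> bit) & 1u, (y >> bit) & 1u, carry, &s, &cout);
        sum |= (uint32_t)s << bit;
        carry = cout;
    }
    return sum;
}
```

With this configuration each bit of an adder costs one LUT, and the dedicated carry path keeps the bit-to-bit carry off the general routing channels.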

Warp-Oriented FPGA: Switch Matrix
Single channel routing.
Connections: long and short, long and long, short and short.

Overall W-FPGA Architecture
Components: CLF (Configurable Logic Fabric), Data Address Generator & Loop Control Hardware, Registers, 32-bit MAC (Multiply and Accumulate).
Custom circuits are added to aid computation in the FPGA.

On-Chip CAD
[Figure: on-chip CAD flow. Stages: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Updater. Inputs and outputs: Binary, Std. HW Binary, HW Bitstream, Updated Binary]
The profiler is used to locate the critical region in the application.
Decompilation: binary -> assembly -> control and data flow graph.
Synthesis: control and data flow graph -> netlist of basic logic functions such as NOR, NAND, XOR, ...

On-Chip CAD: JIT FPGA Compilation
[Figure: on-chip CAD flow, as on the previous slide]
JIT FPGA compilation stages: Logic Synthesis, Tech. Mapping/Packing, Placement, Routing.

On-Chip CAD
Lean algorithm: 34720 lines of C code, 327 KB of instruction cache. General algorithm: hundreds of thousands of lines.
Fast execution: 1.2 s on a 40 MHz ARM7 microprocessor. General algorithm: minutes to hours.
Small cache required: 3.6 MB of data cache. General algorithm: exceeding 50 MB.
Alternative implementation: run the CAD tools as a software task on the main processor.
[Figure: µP with I$ and D$, Profiler, Warp FPGA, On-chip CAD Module with its own I$ and D$]

Experimental Results
Comparison of the warp processor implementation with a traditional HW/SW partitioning implementation (special compiler and Xilinx FPGA) on embedded benchmarks.
Speedups for the same critical region chosen by the profiler in the warp processor.
Energy consumption for the same critical region chosen by the profiler in the warp processor.

Experimental Results
The better speedup may be a result of the custom circuits included in the W-FPGA architecture.

Experimental Results

Warp Processor
Achieved comparable performance and energy consumption.
Without a special compiler.
Totally transparent.
Separates high-level code from the FPGA architecture.
Applicable to any software program.

Experimental Results
Achieves the lowest energy consumption compared with implementations on various processors.

Conclusion
Feasibility.
Improved lean algorithms for partitioning, synthesis, placement and routing.
Improved Warp FPGA architecture.

Questions?
