Published by Amie McDonald. Modified over 9 years ago.
1
WARP PROCESSORS
Roman Lysecky, Greg Stitt, Frank Vahid
Presented by: Xin Guan, Mar. 17, 2010
2
OUTLINE
FPGA Co-processing
Warp Processing Concepts
Warp FPGA
On-chip CAD
Experimental Results
4
Implementing the Critical Region on an FPGA
Special compiler: compile-time HW/SW partitioning
Large speedups and power savings: 10-100x overall speedups
[Figure: traditional architecture, with the application split between a processor and an FPGA]
5
Bit-level Operations: High Speedup
C code for bit reversal:
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Binary after compilation:
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
Software: requires between 32 and 128 cycles.
[Figure: circuit for bit reversal, wiring each bit of the original X value to the mirrored position of the bit-reversed X value]
Circuit: requires only 1 cycle, a speedup of 32x to 128x at the same clock.
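To make the software cost concrete, the sketch below compares a naive bit-at-a-time loop with the five-step swap sequence from the slide. The function names (`reverse_bits_loop`, `reverse_bits_swap`, `check_reversal`) are illustrative and do not come from the presentation; the code assumes 32-bit unsigned values.

```c
#include <assert.h>
#include <stdint.h>

/* Naive reference: move one bit per iteration (32 iterations). */
uint32_t reverse_bits_loop(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);  /* append the lowest bit of x to r */
        x >>= 1;
    }
    return r;
}

/* Swap sequence from the slide: halves, bytes, nibbles, bit pairs, bits. */
uint32_t reverse_bits_swap(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Self-check helper: both versions must agree on any input. */
void check_reversal(uint32_t x)
{
    assert(reverse_bits_loop(x) == reverse_bits_swap(x));
}
```

In hardware neither version needs any logic: bit reversal is a fixed permutation of wires, which is why the circuit finishes in a single cycle.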
6
Parallelizable Code: High Speedup
C code for FIR filter:
    for (i = 0; i < 128; i++)
        y += c[i] * x[i];
Software: 1000's of instructions, several thousand cycles.
[Figure: circuit for FIR filter, a row of parallel multipliers feeding a tree of adders]
Circuit: ~7 cycles, a speedup of > 100x at the same clock.
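One plausible reading of the "~7 cycles" figure is that 128 products reduced by a pairwise adder tree need log2(128) = 7 adder levels. The sketch below models that in software under this assumption; `fir_sequential`, `fir_adder_tree`, and `TAPS` are hypothetical names, and the level count is only a stand-in for circuit latency.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 128

/* Software view: one multiply-accumulate per loop iteration. */
int64_t fir_sequential(const int32_t c[TAPS], const int32_t x[TAPS])
{
    int64_t y = 0;
    for (int i = 0; i < TAPS; i++)
        y += (int64_t)c[i] * x[i];
    return y;
}

/* Hardware view: all products at once, then a pairwise adder tree.
 * Returns the sum and stores the number of adder levels in *levels. */
int64_t fir_adder_tree(const int32_t c[TAPS], const int32_t x[TAPS], int *levels)
{
    int64_t partial[TAPS];
    assert(levels != 0);

    for (int i = 0; i < TAPS; i++)
        partial[i] = (int64_t)c[i] * x[i];   /* 128 independent multiplies */

    *levels = 0;
    for (int width = TAPS; width > 1; width /= 2) {
        /* One tree level: every adjacent pair is summed in parallel. */
        for (int i = 0; i < width / 2; i++)
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        (*levels)++;
    }
    return partial[0];
}
```

Both functions return the same sum; the difference is that the tree version exposes 7 dependent steps instead of 128.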
8
Warp Processing Concepts
Dynamic HW/SW partitioning
Transparent optimizations
A warp processor dynamically detects the binary's critical regions, re-implements those regions as a custom hardware circuit in the FPGA, and replaces each software region with a call to its new hardware implementation.
9
Warp Architecture
[Figure: µP with I$ and D$, Warp FPGA, Profiler, On-chip CAD Module]
1. Initially execute the application in software only.
2. Profile the application to determine critical regions.
3. Implement the critical region as a hardware configuration.
4. Program the configurable logic and update the software binary.
5. The partitioned application executes faster and with lower energy consumption.
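Step 2 can be sketched as follows. The slide does not say how the profiler identifies critical regions; this example assumes, purely for illustration, that hot loops are found by counting taken backward branches. The type `profiler_t`, the table size, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILE_SLOTS];  /* loop-head address (branch target) */
    uint32_t count[PROFILE_SLOTS];   /* times the backward branch was taken */
    int used;                        /* number of occupied slots */
} profiler_t;

void profiler_init(profiler_t *p)
{
    p->used = 0;
}

/* Record a taken branch; only backward branches (loops) are counted. */
void profiler_record(profiler_t *p, uint32_t pc, uint32_t target)
{
    assert(p != 0);
    if (target >= pc)
        return;                       /* forward branch: not a loop */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {    /* table full: new entries are dropped */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the loop head with the highest count, or 0 if none was seen. */
uint32_t profiler_hottest(const profiler_t *p)
{
    uint32_t best = 0, best_count = 0;
    for (int i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

The address returned by `profiler_hottest` would be the candidate region handed to the on-chip CAD module in step 3.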
10
Warp Architecture
[Figure: µP with I$ and D$, Warp FPGA, Profiler, On-chip CAD Module]
Execution model: the main processor and the W-FPGA work in a mutually exclusive mode, so only one of them is active at a time.
Benefit: no cache coherence or consistency problems.
12
Warp-Oriented FPGA: Configurable Logic Fabric
CLB (Configurable Logic Block)
SM (Switch Matrix)
Each CLB is connected to one SM
Short channels and long channels
13
Warp-Oriented FPGA: CLB
Two 3-input, 2-output LUTs
Carry chain routing
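A 3-input, 2-output LUT can be modeled as two 8-entry truth tables that share the same inputs. The sketch below is a generic model, not the actual W-FPGA configuration format; the type `lut3x2_t`, the bit ordering, and the `FULL_ADDER` example are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One 3-input, 2-output LUT: an 8-entry truth table per output.
 * Bit i of each table is the output for input pattern i. */
typedef struct {
    uint8_t out0;
    uint8_t out1;
} lut3x2_t;

/* Evaluate output `which` (0 or 1) for inputs a, b, c (each 0 or 1). */
unsigned lut_eval(const lut3x2_t *lut, int which,
                  unsigned a, unsigned b, unsigned c)
{
    assert(which == 0 || which == 1);
    unsigned index = ((a & 1u) << 2) | ((b & 1u) << 1) | (c & 1u);
    uint8_t table = which ? lut->out1 : lut->out0;
    return (table >> index) & 1u;
}

/* Example configuration: a full adder with inputs (a, b, carry-in).
 * out0 = sum = a ^ b ^ c, out1 = carry-out = majority(a, b, c). */
const lut3x2_t FULL_ADDER = { 0x96, 0xE8 };
```

A single such LUT therefore holds one full-adder bit, which suggests why a carry chain between neighbouring LUTs is useful for arithmetic.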
14
Warp-Oriented FPGA: Switch Matrix
Single-channel routing
Connections: long and short, long and long, short and short
15
Overall W-FPGA Architecture
CLF (Configurable Logic Fabric)
Data Address Generator & Loop Control Hardware
Registers
32-bit MAC (Multiply and Accumulate)
Custom circuits are added to aid computation in the FPGA.
17
On-Chip CAD
[Figure: on-chip CAD flow. Stages: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Updater. Data: Binary, Std. HW Binary, HW Bitstream, Updated Binary]
The profiler is used to locate the critical region in the application.
Binary -> assembly -> control and data flow graph
Control and data flow graph -> netlist of basic logic functions, such as NOR, NAND, XOR …
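The first arrow, binary to assembly, amounts to instruction decoding. The sketch below assumes a MIPS-style ISA like the assembly shown on the bit-reversal slide and uses the standard MIPS32 R-type field layout; it does not reproduce the warp processor's actual decompiler, and `rtype_t`, `decode_rtype`, and `rtype_mnemonic` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned opcode, rs, rt, rd, shamt, funct;
} rtype_t;

/* Split a 32-bit MIPS R-type word into its fields. */
rtype_t decode_rtype(uint32_t word)
{
    rtype_t f;
    f.opcode = (word >> 26) & 0x3fu;
    f.rs     = (word >> 21) & 0x1fu;
    f.rt     = (word >> 16) & 0x1fu;
    f.rd     = (word >> 11) & 0x1fu;
    f.shamt  = (word >> 6)  & 0x1fu;
    f.funct  = word & 0x3fu;
    return f;
}

/* Map the funct field to a mnemonic, covering only the four
 * operations that appear in the bit-reversal example. */
const char *rtype_mnemonic(const rtype_t *f)
{
    assert(f != 0);
    if (f->opcode != 0)
        return "?";                /* not an R-type instruction */
    switch (f->funct) {
    case 0x00: return "sll";
    case 0x02: return "srl";
    case 0x24: return "and";
    case 0x25: return "or";
    default:   return "?";
    }
}
```

For example, the word 0x00021C00 decodes to `sll` with rd = 3, rt = 2, and shamt = 16, matching the first instruction of the bit-reversal binary. A decompiler would go on to build control and data flow graphs from such decoded instructions.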
18
On-Chip CAD
[Figure: on-chip CAD flow. Stages: Decompilation, Partitioning, RT Synthesis, JIT FPGA Compilation, Binary Updater. Data: Binary, Std. HW Binary, HW Bitstream, Updated Binary]
JIT FPGA compilation steps: Logic Synthesis, Tech. Mapping/Packing, Placement, Routing
19
On-Chip CAD
Lean algorithm: 34,720 lines of C code, 327 KB of instruction cache (general algorithm: hundreds of thousands of lines)
Fast execution: 1.2 s on a 40 MHz ARM7 microprocessor (general algorithm: minutes to hours)
Small cache required: 3.6 MB of data cache (general algorithm: exceeding 50 MB)
Alternative implementation: run the CAD tools as a software task on the main processor
[Figure: µP with I$ and D$, Warp FPGA, Profiler, On-chip CAD Module with its own I$ and D$]
21
Experimental Results
Comparison of the warp processor implementation with a traditional HW/SW partitioning implementation (special compiler and Xilinx FPGA) on embedded benchmarks.
Speedups for the same critical region chosen by the profiler in the warp processor
Energy consumption for the same critical region chosen by the profiler in the warp processor
22
Experimental Results
The better speedup may result from the custom circuits included in the W-FPGA architecture.
23
Experimental Results
24
Warp Processor
Achieved comparable performance and energy consumption without a special compiler
Totally transparent: high-level code is kept separate from the FPGA architecture
Applicable to any software program
25
Experimental Results
Achieves the lowest energy consumption compared with implementations on various processors
26
Conclusion
Feasibility
Improved lean algorithm for partitioning, synthesis, placement and routing
Improved Warp FPGA architecture
27
Questions?