WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.

Slides:



Advertisements
Similar presentations
Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.
Advertisements

Digital Design Copyright © 2006 Frank Vahid 1 FPGA Internals: Lookup Tables (LUTs) Basic idea: Memory can implement combinational logic –e.g., 2-address.
The Warp Processor Dynamic SW/HW Partitioning David Mirabito A presentation based on the published works of Dr. Frank Vahid - Principal Investigator Dr.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
1 HW/SW Partitioning Embedded Systems Design. 2 Hardware/Software Codesign “Exploration of the system design space formed by combinations of hardware.
A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering.
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid* Department.
Dynamic FPGA Routing for Just-in-Time Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer Science and Engineering.
The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems Frank Vahid Professor Department of Computer Science and Engineering.
Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.
Warp Processors Towards Separating Function and Architecture Frank Vahid Professor Department of Computer Science and Engineering University of California,
Warp Processor: A Dynamically Reconfigurable Coprocessor Frank Vahid Professor Department of Computer Science and Engineering University of California,
Warp Processors (a.k.a. Self-Improving Configurable IC Platforms) Frank Vahid (Task Leader) Department of Computer Science and Engineering University of.
Configurable System-on-Chip: Xilinx EDK
Warp Processing – Towards FPGA Ubiquity Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate.
Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits Frank Vahid Professor Department of Computer Science and Engineering University.
Warp Processors Towards Separating Function and Architecture Frank Vahid Professor Department of Computer Science and Engineering University of California,
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators Greg Stitt Dept. of ECE University of Florida This research was supported in part.
Self-Improving Computer Chips – Warp Processing Contributing Ph.D. Students Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona Greg Stitt (Ph.D.
1 Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems Greg Stitt, Frank Vahid, Shawn Nematbakhsh University.
Frank Vahid, UC Riverside 1 Self-Improving Configurable IC Platforms Frank Vahid Associate Professor Dept. of Computer Science and Engineering University.
Dynamic Hardware/Software Partitioning: A First Approach Greg Stitt, Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University.
Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.
Dynamic Hardware/Software Partitioning: A First Approach Authors -Greg Stitt, Roman Lysecky, Frank Vahid Presented By : Aditya Kanawade Guru Sharan 1.
GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.
Hardware-Software Partitioning. EEL6935 / 52 Hardware Software Definition Definition: Given an application, hw/sw partitioning maps each region of the.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Reconfigurable Hardware in Wearable Computing Nodes Christian Plessl 1 Rolf Enzler 2 Herbert Walder 1 Jan Beutel 1 Marco Platzner 1 Lothar Thiele 1 1 Computer.
An automatic tool flow for the combined implementation of multi-mode circuits Brahim Al Farisi, Karel Bruneel, João Cardoso, Dirk Stroobandt.
A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside.
Systolic Architectures: Why is RC fast? Greg Stitt ECE Department University of Florida.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.
HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.
J. Christiansen, CERN - EP/MIC
FPGA Implementations for Volterra DFEs
VHDL Project Specification Naser Mohammadzadeh. Schedule  due date: Tir 18 th 2.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR FPGA Fabric n Elements of an FPGA fabric –Logic element –Placement –Wiring –I/O.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.
A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki.
Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,
FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.
Exploiting Parallelism
Codesigned On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also.
NISC set computer no-instruction
Roman Lysecky Department of Electrical and Computer Engineering University of Arizona Dynamic.
On-Chip Logic Minimization Roman Lysecky & Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the.
Non-Intrusive Dynamic Application Profiling for Detailed Loop Execution Characterization Ajay Nair, Roman Lysecky Department of Electrical and Computer.
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
1 Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering.
A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation Roman Lysecky a, Frank Vahid a*, Sheldon X.-D. Tan b a Department of Computer.
Dynamic and On-Line Design Space Exploration for Reconfigurable Architecture Fakhreddine Ghaffari, Michael Auguin, Mohamed Abid Nice Sophia Antipolis University.
Intermediate Fabrics: Virtual FPGA Architectures for Circuit Portability and Fast Placement and Routing on FPGAs James Coole PhD student, University of.
Embedded Systems Design
FPGAs in AWS and First Use Cases, Kees Vissers
Introduction to Reconfigurable Computing
Greg Stitt ECE Department University of Florida
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Dynamically Reconfigurable Architectures: An Overview
Autonomously Adaptive Computing: Coping with Scalability, Reliability, and Dynamism in Future Generations of Computing Roman Lysecky Department of Electrical.
HIGH LEVEL SYNTHESIS.
Dynamic FPGA Routing for Just-in-Time Compilation
Dynamic Hardware/Software Partitioning: A First Approach
Warp Processor: A Dynamically Reconfigurable Coprocessor
Presentation transcript:

WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010

O UTLINE FPGA Co-processing Warp Processing Concepts Warp FPGA On-chip CAD Experiments Results 2

O UTLINE FPGA Co-processing Warp Processing Concepts Warp FPGA On-chip CAD Experiments Results 3

Implementing Critical Region on FPGA Special Compiler – compiling time HW/SW partitioning Large Speedups and power savings – X overall speedups Traditional Architecture FPGA Proc. Application

Bit-level Operations High Speedup x = (x >>16) | (x <<16); x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00); x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0); x = ((x >> 2) & 0x ) | ((x << 2) & 0xcccccccc); x = ((x >> 1) & 0x ) | ((x << 1) & 0xaaaaaaaa); C Code for Bit Reversal sll $v1[3],$v0[2],0x10 srl $v0[2],$v0[2],0x10 or $v0[2],$v1[3],$v0[2] srl $v1[3],$v0[2],0x8 and $v1[3],$v1[3],$t5[13] sll $v0[2],$v0[2],0x8 and $v0[2],$v0[2],$t4[12] or $v0[2],$v1[3],$v0[2] srl $v1[3],$v0[2],0x4 and $v1[3],$v1[3],$t3[11] sll $v0[2],$v0[2],0x4 and $v0[2],$v0[2],$t2[10]... Binary Compilation Circuit for Bit Reversal Bit Reversed X Value Original X Value Requires only 1 cycle (speedup of 32x to 128x) for same clock Requires between 32 and 128 cycles

Parallelizable code High Speedup for (i=0; i < 128; i++) y += c[i] * x[i].. ************ C Code for FIR Filter 1000’s of instructions Several thousand cycles Circuit for FIR Filter ~ 7 cycles Speedup > 100x for same clock

O UTLINE FPGA Co-processing Warp Processing Concepts Warp FPGA On-chip CAD Experiments Results 7

Warp Processing Concepts Dynamic HW/SW Partitioning Transparent Optimizations A Warp Processor dynamically detects the binary’s critical region, re-implements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region.

Warp Architecture µP I$ D$ Warp FPGA Profiler On-chip CAD Module 1. Initially execute application in software only 2. Profile application to determine critical regions 3.Implement critical region in Hardware Configuration 4.Program Configurable Logic and update software binary 5. Partitioned application executes faster and with lower energy consumption.

Warp Architecture µP I$ D$ Warp FPGA Profiler On-chip CAD Module Execution Model Main processor and W-FPGA working in an exclusive mode. Benefits No cache coherence and consistency problem

O UTLINE FPGA Co-processing Warp Processing Concepts Warp FPGA On-chip CAD Experiments Results 11

Warp-Oriented FPGA Configurable Logic Fabric CLB (Configurable Logic Block) SM (Switch Matrix) One CLB connected to one SM Short channel Long channel

Warp-Oriented FPGA CLB 2 3-input 2-output LUT Carry chain routing

Warp-Oriented FPGA Switch Matrix Single channel routing Long and short, long and long, short and short

Overall W-FPGA Architecture CLF Data Address Generator & Loop Control Hardware Registers 32-bit MAC (Multiply and Accumulate) Add custom circuits to aid the computing in FPGA.

O UTLINE FPGA Co-processing Warp Processing Concepts Warp FPGA On-chip CAD Experiments Results 16

On-Chip CAD Binary Decompilation Binary HW Bitstream RT Synthesis Partitioning Binary Updater Binary Updated Binary Binary Std. HW Binary JIT FPGA Compilation Using Profiler to locate the critical region in application Binary -> Assembly -> control and data flow graph Control and Data flow graph -> net-list of basic logic functions, like Nor, Nand, Xor …

On-Chip CAD Binary Decompilation Binary HW Bitstream RT Synthesis Partitioning Binary Updater Binary Updated Binary Binary Std. HW Binary JIT FPGA Compilation Tech. Mapping/Packing Placement Logic Synthesis Routing

On-Chip CAD Lean algorithm lines of C code, 327 KB of instruction cache General algorithm – hundreds of thousands of lines Fast execution 1.2s on 40 MHz ARM 7 microprocessor General algorithm – minutes to hours Small cache required 3.6 MB of data cache General algorithm – exceeding 50 MB Alternative implementation Software task on Main Processor µP I$ D$ Warp FPGA Profiler On-chip CAD Module I$ D$

O UTLINE FPGA Co-processing Warp Processing Concepts Warp FPGA On-chip CAD Experiments Results 20

Experimental Results Comparing Warp Processor Implementation and Traditional HW/SW Partition Implementation (Special Complier, and Xilinx FPGA) on embedded benchmarks. Speedups for the same critical region chosen by Profiler in Warp Processor Energy Consumption for the same critical region chosen by Profiler in Warp Processor

Experimental Results Better Speedup may be the results of custom circuits included in W-FPGA Architecture

Experimental Results

Warp Processor Achieved comparable performance and energy consumption Without special complier Totally transparent. Separate high-level code and FPGA architecture Applicable to any software programs

Experimental Results Achieve the lowest energy consumption compared to various processors’ implementation

Conclusion Feasibility Improved lean algorithm for partitioning, synthesis, placement and routing Improved Warp FPGA architecture

Questions ?