The Warp Processor Dynamic SW/HW Partitioning David Mirabito A presentation based on the published works of Dr. Frank Vahid - Principal Investigator Dr.



The Warp Processor: Dynamic SW/HW Partitioning
David Mirabito
A presentation based on the published works of Dr. Frank Vahid (Principal Investigator), Dr. Sheldon Tan (Co-Principal Investigator), Dr. Walid Najjar (Co-Principal Investigator) et al., and supporting papers.

So Far for Us
● Ben showed us how instruction collapsing works and its potential benefits.
– Instead of implementing them as LUT accesses, Warp reconfigures the fabric.
● Kynan showed how Chimaera uses pre-compiled code/bitstreams to increase performance.
– Warp dynamically generates the bitstream and modifies code on the fly.
Warp combines the best of both worlds!

Warp Overview:
P: any standard microprocessor.
Profiler: watches the instruction fetch (IF) address to determine critical regions.
Instruction/data memory or cache.
DPM (Dynamic Partitioning Module): synthesizes HW for the critical region and programs the WCLA. Also updates the code binary.
WCLA (Warp Configurable Logic Architecture): accessed through memory-mapped registers.

Warp Overview:
● Warp is the name for the family of processors, not an individual implementation.
● The website describes a 100 MHz ARM7 as the main processor, and another for the DPM.
– Quoted average speedup of 7.4 and energy reduction of 38-94%.
● Can also apply to other ARMs, MIPS, etc. x86, anyone?

On-Chip Profiler
● With current SoCs, snooping address lines is no longer an option.
● Requirements: non-intrusion, low power, and small area.
● Monitors sbb's (short backwards branches) to detect loop iterations.
● Cache indexed by sbb address. On a hit, a 16-bit counter is incremented. On saturation, all counters are shifted right by one (>> 1) to maintain correct ratios.
● 90-95% accuracy; power up by 2.4% and area up by 10.5% (or an extra mm²).
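The counter scheme on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the profiler hardware: the class name `LoopProfiler`, the unbounded dictionary standing in for the cache, and the `hottest` helper are all assumptions made for the example.

```python
class LoopProfiler:
    """Toy model of an sbb-indexed counter cache (hypothetical names and sizes)."""

    def __init__(self, counter_bits=16):
        self.max_count = (1 << counter_bits) - 1
        # Assumption: a dict stands in for the cache, so there is no eviction.
        self.counts = {}  # sbb address -> saturating counter

    def record_sbb(self, address):
        """Call once for every taken short backwards branch at `address`."""
        count = self.counts.get(address, 0) + 1
        self.counts[address] = count
        if count >= self.max_count:
            # A counter saturated: shift every counter right by one so the
            # relative frequencies of the loops are preserved.
            for addr in self.counts:
                self.counts[addr] >>= 1

    def hottest(self):
        """Return the sbb address with the highest count, or None if empty."""
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)
```

With a deliberately small 4-bit counter, six hits on one loop and fifteen on another leave counts of 3 and 7 after the saturation shift, so the second loop is still reported as the critical region.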

On-Chip Profiler
● Much of the power consumption is from the cache lookup/write.
● Common cache power techniques reduce the power overhead to 1.5%.
● Can decrease this further by coalescing: keep a count while the same sbb is repeatedly seen, and only update the cache when a new sbb is seen.
● Now 0.53% power overhead, 11% area.
● For a 5% drop in accuracy, we can process only every nth sbb. This sampling drops power consumption to 0.02% above normal for n = 50.
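A minimal sketch of the two power-saving ideas, coalescing and sampling, applied to a list of sbb addresses. The function names `coalesce` and `sample` are hypothetical, and a Python list stands in for the stream of branches the hardware would observe.

```python
def coalesce(sbb_stream):
    """Collapse runs of the same sbb address into (address, run_length) pairs.

    Each pair corresponds to one cache update instead of run_length updates.
    """
    runs = []
    for address in sbb_stream:
        if runs and runs[-1][0] == address:
            runs[-1][1] += 1          # same sbb again: only bump the local count
        else:
            runs.append([address, 1])  # new sbb: start a new run
    return [(address, length) for address, length in runs]


def sample(sbb_stream, n):
    """Keep only every n-th sbb of the stream (the 1st kept item is number n)."""
    return sbb_stream[n - 1::n]
```

For a stream of ten branches made of three runs, `coalesce` yields three pairs, so the modelled cache is written three times rather than ten.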

DPM
● Partitioning
– Taken from profiler.
● Decompilation
● DMA Configuration
● RT (Register Transfer) Synthesis
● JIT FPGA Compilation
– Logic Synthesis
– Technology Mapping
– Placement
– Routing
Now have a HW description of the code that is more appropriate for further synthesis.

Decompilation
● Previous binary decompilers were poor.
● Extra optimizations can be made if we have high-level constructs available:
– Smart buffers
– Loop unrolling
– Removing redundant operations due to ISA overhead.
● So we decompile:
– Loops are analysed for linear memory strides to determine array accesses. Data that overlaps between iterations is placed in smart buffers rather than main memory.
– Loop bounds are found for unrolling.
– Sizes of data types are tracked, so the minimum number of bits is used for ALU operations.
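Two of the analyses named above, finding a linear memory stride and tracking the minimum operand width, can be illustrated on plain Python values. This is a sketch under stated assumptions: the real decompiler works on the binary itself, whereas here a list of addresses stands in for a loop's memory accesses, and the helper names `detect_stride` and `min_bits` are invented for the example.

```python
def detect_stride(addresses):
    """Return the constant stride of an address sequence, or None if not linear."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    for prev, cur in zip(addresses[1:], addresses[2:]):
        if cur - prev != stride:
            return None  # the access pattern is not a simple array walk
    return stride


def min_bits(max_value):
    """Minimum number of bits needed to hold an unsigned value up to max_value."""
    return max(1, max_value.bit_length())
```

A loop touching 0x1000, 0x1004, 0x1008 has a stride of 4, which suggests a walk over an array of 4-byte elements; a value known never to exceed 255 needs only an 8-bit datapath.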

DMA Configuration
● Uses the memory access patterns of the decompiled code to configure the DMA controller.
● Initially, all data is fetched before the loop begins.
– Then one block can be fetched/written per cycle, the same rate at which the collapsed loop needs it.
● Finally, after the loop, one more DMA transfer is scheduled to write back the final data.

RT Synthesis
● The high-level data is then fed into a standard netlist generator.
● Netlists generated this way were found to be 53% faster than other forms of binary synthesis.
● When compared to synthesis from source, decompilation resulted in identical performance with a 10% increase in area.
● The area increase is due to the decompiler's inability to remove from the datapath some temporary registers found in long expressions.

JIT FPGA Compilation
● Logic Synthesis
– Analyses the netlist at the gate level to minimise the number of gates required.
– Uses the Riverside On-Chip Minimizer, an algorithm optimised for fast execution in a low-memory environment.
● Technology Mapping
– Maps the gate-level netlist onto the configurable fabric.
– Uses a standard algorithm.

JIT FPGA Compilation
Difficulties:
● Placement and routing: the most expensive part of the synthesis process.
● Requires 14.8 sec and 12M RAM, which is not good for an embedded system.
Solutions:
● Developed the Riverside On-Chip Router (ROCR).
– Commercial FPGAs are overly complex for implementing only a collection of instructions.
– Designed a simple fabric that is easy to route for.
– Another part of the Riverside On-Chip Partitioning Tools (ROCPART) suite.

JIT FPGA Compilation
Results:
● Final design has 67x67 CLBs and a channel width of 30.
– Simpler, so easier to place and route.
● Suitable for on-chip!

Code Update
● The DPM also updates the code memory.
– Replaces one instruction with a branch to a specialised HW routine.
– This routine starts the WCLA and places the processor in a sleep state.
– Upon the WCLA's completion, an interrupt wakes the processor, which then jumps back to the end of the loop in the original code.
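The patching step can be pictured with a toy code memory. Everything here is a stand-in chosen for illustration: the list of assembly-like strings, the `patch_loop` helper, the instruction index, and the routine address 0x8000 are assumptions, not details from the slides.

```python
def patch_loop(code, patch_index, hw_routine_addr):
    """Return a copy of `code` with one instruction replaced by a branch.

    `code` is a list of instruction strings (a stand-in for code memory),
    `patch_index` is the position of the instruction to overwrite, and
    `hw_routine_addr` is the (hypothetical) address of the HW start-up routine.
    """
    patched = list(code)  # leave the original binary untouched
    patched[patch_index] = f"b {hw_routine_addr:#x}"
    return patched


original = ["mov r0, #0", "ldr r1, [r2]", "add r0, r0, r1", "bne loop"]
patched = patch_loop(original, 1, 0x8000)
```

Only one slot of the code memory changes, which is why the update can be applied to a running system without relinking the rest of the binary.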

WCLA
● DADG: Data Address Generator.
– Handles all memory accesses to/from the WCLA.
– Generates addresses for 3 arrays.
– Delivers data to Reg{0,1,2}.
● LCH: Loop Control Hardware.
– Enables zero-overhead looping.
– Needs a preset loop upper bound, but allows early breaks depending on some configurable result.
● Regs: input registers to the configurable logic fabric. Wired to the MAC, but also accessible directly from the fabric.

WCLA
● MAC: Multiply-Accumulate unit.
– Most inner loops require some addition/multiplication. To save logic area and ease routing, a dedicated MAC is included.
● SCLF: Simple Configurable Logic Fabric.
– As described earlier.
– Designed for simple bitstream generation by on-chip devices with limited time and memory resources.
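How the WCLA components cooperate on a loop can be sketched as a behavioural model. This is an assumption-laden illustration rather than the actual hardware: the function `wcla_mac_loop`, the flat `memory` list, and the `stop_when` callback (standing in for the configurable early-break condition) are all invented for the example, and only two of the three array streams are used.

```python
def wcla_mac_loop(memory, base_a, base_b, bound, stride=1, stop_when=None):
    """Behavioural sketch of a multiply-accumulate loop on the WCLA.

    Returns (accumulator, iterations_executed).
    """
    acc = 0
    iterations = 0
    for i in range(bound):                    # LCH: preset upper bound
        reg0 = memory[base_a + i * stride]    # DADG delivers array A to Reg0
        reg1 = memory[base_b + i * stride]    # DADG delivers array B to Reg1
        acc += reg0 * reg1                    # dedicated MAC
        iterations += 1
        if stop_when is not None and stop_when(acc):
            break                             # LCH: early break on a result
    return acc, iterations


memory = [1, 2, 3, 4, 5, 6, 7, 8]
full = wcla_mac_loop(memory, base_a=0, base_b=4, bound=4)
early = wcla_mac_loop(memory, 0, 4, 4, stop_when=lambda acc: acc > 10)
```

Here `full` computes the dot product of the two halves of `memory` over all four iterations, while `early` stops as soon as the running sum exceeds 10.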

Results – MIPS 60MHz
Details of the tools running on the co-processor.
Weight of the critical regions in tests.
Results using the Warp architecture.
The whole process is automated, except binary modification, which is currently done by hand.

Results – ARM 100MHz
The speedup of the replaced kernel. The notably high ones are due to the reimplementation being only bit shifting via wires, or memory accesses being replaced by a single block DMA transfer.
Overall speedup across the entire execution of the benchmark. Of note are the minimum speedup of 2.2 and the average of 7.4. This is accompanied by a 38-94% saving in energy consumption.
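Kernel speedup and overall speedup are linked by Amdahl's law: the overall gain depends on how much of the run time the critical region accounts for. The sketch below uses made-up inputs purely for illustration; the function name `overall_speedup` and the example numbers are assumptions and are not taken from the benchmark results.

```python
def overall_speedup(kernel_weight, kernel_speedup):
    """Amdahl's law.

    kernel_weight:  fraction of original run time spent in the critical region (0..1)
    kernel_speedup: how many times faster that region runs in hardware
    """
    if not 0.0 <= kernel_weight <= 1.0:
        raise ValueError("kernel_weight must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_weight) + kernel_weight / kernel_speedup)


# Hypothetical example: a kernel that is 90% of run time, made 10x faster.
example = overall_speedup(0.9, 10)
```

In this hypothetical case the program as a whole runs only about 5.3 times faster, because the remaining 10% of the time is still spent in software.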

FPGAs All The Way
● The developers of Warp also investigated putting the entire system on an FPGA.
● Soft-core MicroBlazes on a Spartan-3
– One is the system processor.
– The other runs the DPM (currently).
● Ideally, the WCLA would directly utilise the underlying reconfigurable fabric.
– Currently simulating the WCLA.
– Looking into implementing it on top.
– Ideally, use the Spartan's own configurable fabric.

Implementation
What they did:
● Modified the MicroBlaze to include the profiler.
● Simulated execution on Xilinx apps to obtain a program trace.
● Used the trace to simulate the behaviour of the profiler; found a single critical region.
● Used ROCPART to generate HW circuits.
● Measured these in a VHDL model of the WCLA, combined with traces to obtain final performance/energy measurements.

Results:
What they found:
● The Warp architecture more than compensated for the traditional speed/power issues normally found in FPGA solutions.
● Still maintains flexibility.
● Gives custom-built systems a run for their money.

Resources
● Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware, Ann Gordon-Ross and Frank Vahid.
● Dynamic FPGA Routing for Just-in-Time FPGA Compilation, Roman Lysecky, Frank Vahid and Sheldon X.-D. Tan.
● Dynamic Hardware/Software Partitioning: A First Approach, Greg Stitt, Roman Lysecky and Frank Vahid.
● A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning, Roman Lysecky and Frank Vahid.
● A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning, Roman Lysecky and Frank Vahid.
● Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure, Greg Stitt, Zhi Guo, Frank Vahid and Walid Najjar.