Hardware-Software Partitioning

EEL6935 / 52 Hardware/Software Partitioning: Definition
Given an application, HW/SW partitioning maps each region of the application onto
▫ either hardware (custom circuits)
▫ or software (microprocessors),
▫ but not both.
A partition is a mapping of each region to either HW or SW.
The mapping is chosen to meet certain design goals subject to constraints.

Design Constraints & Goals: Space (Area), Power, Performance, Yield, Schedule, Cost

You Cannot Get Away with Everything!

Challenges

Application with Multiple Hardware/Software Options
[Figure: an application made of three processes, FIR(), ACCUM() and SEARCH(). Software times of 50s, 30s and 20s are shown, and each process has several hardware implementation options that trade area against execution time (values from 5s to 25s). The chosen options must fit within an area budget.]
Possible solutions:
▫ Use the fastest implementations
▫ Use the smallest implementations
▫ Consider all "middle" implementations
Performance of the candidate partitions shown in the figure: 55s, 50s and 45s (best partition).
Acknowledgement: Modified from G. Stitt's slides in EEL5721
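The selection problem on this slide can be sketched as a small exhaustive search. This is an illustrative sketch only: the area units and per-option times below are hypothetical (the figure's exact values are not recoverable from the transcript), the option labels are made up, and the processes are assumed to run one after another so that their times add.

```python
from itertools import product

# Hypothetical options per process: (label, area_units, exec_time_s).
# The "sw" option keeps the process in software and uses no hardware area.
OPTIONS = {
    "FIR":    [("sw", 0, 50), ("hw_small", 2, 25), ("hw_fast", 5, 10)],
    "ACCUM":  [("sw", 0, 30), ("hw_small", 1, 15), ("hw_fast", 4, 5)],
    "SEARCH": [("sw", 0, 20), ("hw_small", 2, 12), ("hw_fast", 3, 8)],
}


def best_partition(options, area_budget):
    """Try every combination of options and keep the fastest one that fits."""
    names = list(options)
    best = None
    for choice in product(*(options[n] for n in names)):
        area = sum(c[1] for c in choice)
        if area > area_budget:
            continue  # violates the area constraint
        time = sum(c[2] for c in choice)  # assumes sequential execution
        if best is None or time < best[0]:
            best = (time, area, {n: c[0] for n, c in zip(names, choice)})
    return best


if __name__ == "__main__":
    print(best_partition(OPTIONS, area_budget=6))
```

With these made-up numbers and a budget of 6 area units, neither "all fastest" nor "all smallest" wins; a mixed choice does, which is the point the slide makes. Exhaustive search grows exponentially with the number of regions, so it only works for toy cases like this one.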

Mathematical Modeling to Arrive at the Optimum HW/SW Partition

Granularity

Dynamic Hardware-Software Partitioning: A First Approach. Greg Stitt, Roman Lysecky, Frank Vahid, University of California, Riverside. DAC 2003, June 2-6, 2003, Anaheim, California, USA

Dynamic Hardware-Software Partitioning
Dynamically
▫ identify and re-implement
  ▪ critical software kernels, loops, etc. in configurable fabric
  ▪ in order to achieve better performance, lower energy, or meet other design goals

Multiple Applications: An Illustration

Application Usage Profile: An Illustration
[Figure: usage profiles of three users (Mr. Jazz, Mr. Luigi, Mr. MTB) across Music, Games, GPS, User Data Access, SMS and Calls]
Different users have different usage profiles.
While designing a product, a usage profile has to be assumed in order to give the best user experience. However:
▫ The usage profile (application usage) may be user/code dependent
  ▪ E.g. MP3, camera, video playback, calls, etc.
▫ The usage profile may change over time
▫ A generic product that assumes a certain profile
  ▪ is optimal for the "assumed" profile but sub-optimal in area or performance for other usage profiles
Profiling in real time is key
▫ The usage profile may identify critical kernels
  ▪ Critical components may be pushed to the configurable area
  ▪ to boost performance and reduce energy

Dynamic HW/SW Partitioner Requirements
1. Detect critical code regions
2. Decompile and synthesize them to hardware
3. Place and route the hardware onto on-chip configurable logic
4. Update the binary to communicate with the logic
Wait! Did you say on-chip PnR? You've got to be kidding, right?
All of the above must be done with very lean algorithms that can be implemented on-chip.

Binary-Level Partitioning and Its Advantage
Partitioning at the binary level
▫ Offline or online
▫ Steps
  1. Identify critical code sections (loop-heavy sections)
  2. Consider assembly code and object code as HW candidates
  3. Push these to configurable hardware
Advantage
▫ Works with any
  ▪ software compiler
  ▪ high-level language
The paper uses the binary-level partitioning approach: critical loops are identified and implemented in the configurable logic.

Why Binary-Level Partitioning Instead of Higher-Level Optimizations?
Dynamic partitioning
▫ Needs to run on a small on-chip partitioning system
▫ Needs to be lean enough to perform place and route, etc. on-chip
▫ Higher-level partitioning methodologies may be good for offline analysis, but are very difficult to implement on-chip because of the compute constraints.

HW/SW Partitioning of Software Binary
[Figure] Acknowledgement: Figure taken from G. Stitt, F. Vahid, "HW/SW Partitioning of Software Binaries", ICCAD, Nov 2002

System Architecture (Top)
▫ Microprocessor and memory for the normal software application
▫ On-chip configurable module
  1. Detects the most frequently executed software region
  2. Re-implements that region in the configurable logic
Architecture based on the Triscend A7 (60 MHz)

System Architecture (Sub-Blocks)
▫ Direct memory access (DMA) controller to access memory
▫ Input/Output
▫ Decompiles and synthesizes selected binary regions for HW implementation
▫ Detects the most frequently executed application-software loops
▫ 32-bit input/output register
Partitioning co-processor overhead:
i. Not much: very lean compared to the main processor
ii. A platform with multiple main processors may share a single partitioning co-processor, reducing the overhead further

Simplified Configurable Logic Fabric
Mapping, placing and routing a design onto a general configurable logic fabric is time consuming, so a simplified fabric was designed that supports only inner-loop implementation.

Architecture Limitations
No sequential logic support in the configurable logic (in the platform chosen)
▫ Constraint:
  ▪ Loops to be implemented must have a body that can be implemented in a single cycle
The number of loop iterations must be determined before the loop executes, in order to specify the DMA block request size.
▫ The number of iterations may be determined:
  ▪ Statically, in the case of constant bounds
  ▪ Dynamically, which requires extra instructions to configure the size of the DMA block request before HW execution starts
United States Patent 5,440,245: Galbraith, et al., August 8, 1995, "Logic module with configurable combinational and sequential blocks"

CLF Architecture
[Figure: configurable logic fabric (CLF)]
▫ Either-side connectability (only at bottom)
▫ 4 channels: given channel to given channel

Tool Flow: Loop Profiler
1. Detects critical SW regions that should be implemented in HW
2. Is non-intrusive
3. Monitors instruction addresses on the memory bus
4. Increments the branch frequency in the cache for a given backward branch
5. Uses a small cache with a dozen entries
   ▫ Needed to save area and power
Reference: Ann Gordon-Ross et al., "Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware", IEEE Transactions on Computers, vol. 54, no. 10, October 2005
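A software model of the profiler's behavior might look like the sketch below. It is not the hardware design from the referenced paper: here a taken backward branch is inferred whenever the observed address decreases (so returns and backward calls would also be counted), and the "evict the least frequent entry" policy is an assumption chosen for illustration.

```python
class LoopProfiler:
    """Toy model of a small backward-branch frequency cache."""

    def __init__(self, entries=12):
        self.entries = entries   # "a dozen entries" per the slide
        self.counts = {}         # loop head address -> frequency
        self.prev = None         # previously observed instruction address

    def observe(self, addr):
        """Feed one instruction address seen on the memory bus."""
        if self.prev is not None and addr < self.prev:
            # Address went backwards: treat addr as the head of a loop.
            if addr not in self.counts and len(self.counts) >= self.entries:
                # Assumed replacement policy: drop the coldest entry.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[addr] = self.counts.get(addr, 0) + 1
        self.prev = addr

    def hottest(self):
        """Return the most frequent loop head, or None if nothing was seen."""
        return max(self.counts, key=self.counts.get) if self.counts else None


if __name__ == "__main__":
    profiler = LoopProfiler()
    trace = [0x100, 0x104, 0x108] * 5 + [0x10C] + [0x200, 0x204] * 2
    for address in trace:
        profiler.observe(address)
    print(hex(profiler.hottest()), profiler.counts)
```

The model only watches addresses and never touches the program, which mirrors the non-intrusive property listed above.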

Decompilation
Converts software loops into a higher-level abstraction more suitable for synthesis
▫ Step 1: Converts each assembly instruction to a register transfer
▫ Step 2: Using the register transfers, builds:
  ▪ a CFG (control flow graph) for the software region
  ▪ a DFG (data flow graph) by parsing the register transfers
▫ Step 3: Applies compiler optimizations to remove overhead due to the assembly code and instruction set
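Step 2 can be illustrated with a minimal data-flow-graph builder for one straight-line loop body. The `(dest, op, sources)` tuple format and the register names are hypothetical, introduced only for this sketch; the paper's internal representation is not given in the slides.

```python
def build_dfg(transfers):
    """Build data-flow edges from a list of register transfers.

    Each transfer is (dest_register, operation, sources); sources that are
    not strings are treated as immediates and ignored.
    Returns (edges, live_ins) where an edge is (producer, consumer, register).
    """
    last_def = {}    # register -> index of the transfer that last wrote it
    edges = []
    live_ins = set() # registers read before being written in this region
    for i, (dest, _op, sources) in enumerate(transfers):
        for src in sources:
            if not isinstance(src, str):
                continue  # immediate operand, no data dependency
            if src in last_def:
                edges.append((last_def[src], i, src))
            else:
                live_ins.add(src)
        last_def[dest] = i
    return edges, live_ins


if __name__ == "__main__":
    body = [
        ("r1", "load", ["r0"]),       # r1 = mem[r0]
        ("r2", "shl", ["r1", 2]),     # r2 = r1 << 2
        ("r3", "add", ["r2", "r1"]),  # r3 = r2 + r1
        ("r0", "add", ["r0", 4]),     # r0 = r0 + 4 (address update)
    ]
    print(build_dfg(body))
```

The last transfer is the kind of address update that a later tool can recognize and hand off to the DMA controller rather than synthesize.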

DMA Configuration Tool
Function: maps the memory accesses of the decompiled loop onto the DMA architecture
▫ Involves detection of
  ▪ reads/writes
  ▪ increment and decrement address updates
  ▪ single and block request modes
Removes the following from the decompiled loop
▫ Loop counters and exit conditions
▫ Address calculations, since only sequential locations are accessed
DMA operation:
▫ The DMA transfers the data needed before the loop starts
▫ After HW initialization, the HW starts a block request that fetches 1 memory location per cycle in case of a read or write

Register Transfer Synthesis
Converts each output bit into a Boolean expression
▫ By traversing the dataflow graphs of the software region
Limitation:
▫ Only loop bodies that execute in a single cycle are supported
▫ Multi-cycle loop bodies would need behavioral synthesis to schedule the loop operations

Logic Synthesis → Tech Mapping → P&R
Converts Boolean equations into a netlist
▫ The Boolean equations are transformed into a DAG (directed acyclic graph) of the Boolean logic network
  ▪ Internal nodes of the DAG correspond to simple logic gates (AND, OR, INV, XOR)
Logic minimization
▫ Lightweight, suited for on-chip execution
  ▪ Applied at each node, starting with the input nodes, while traversing the network
  ▪ Uses a single expand phase to achieve good optimization
Tech mapping
▫ Traverses the DAG starting from the output nodes
  ▪ Combines nodes that can form a 3-input, 1-output LUT
  ▪ Further combines nodes (where possible) to form 3-input, 2-output LUTs
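The first tech-mapping pass (growing 3-input, 1-output LUTs from the outputs) can be approximated by the greedy sketch below. It is a simplification, not the paper's algorithm: gates are assumed to have at most two fan-ins, the netlist format and names are hypothetical, and the second pass that pairs LUTs into 3-input, 2-output LUTs is left out.

```python
def map_to_luts(gates, outputs, k=3):
    """Greedily cover a gate DAG with LUTs of at most k inputs.

    gates maps a gate name to its fan-in names; names not in gates are
    primary inputs. Returns {lut_root: list_of_lut_inputs}.
    """
    luts = {}
    worklist = list(outputs)
    while worklist:
        root = worklist.pop()
        if root in luts or root not in gates:
            continue  # already mapped, or a primary input
        leaves = list(gates[root])
        absorbed = True
        while absorbed:
            absorbed = False
            for leaf in leaves:
                if leaf not in gates:
                    continue  # primary inputs cannot be absorbed
                # Replace the gate by its own fan-ins if the LUT still fits.
                merged = [x for x in leaves if x != leaf]
                merged += [f for f in gates[leaf] if f not in merged]
                if len(merged) <= k:
                    leaves = merged
                    absorbed = True
                    break
        luts[root] = leaves
        # Gates left as LUT inputs become roots of their own LUTs.
        worklist.extend(x for x in leaves if x in gates)
    return luts


if __name__ == "__main__":
    netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "d"]}
    print(map_to_luts(netlist, outputs=["g3"]))
```

In the example, three two-input gates collapse into two LUTs: one computing g3 from d, c and g1, and one computing g1 from a and b.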

LUT Placement Steps
Step 1: Determine the relative placement of the LUTs to one another
▫ by determining the critical path and placing it on a horizontal row
Step 2: Place the remaining unplaced nodes according to their dependency (input or output) with respect to the placed nodes
▫ Place above if they are inputs to placed nodes
▫ Place below if they take outputs from placed nodes
Step 3: Place in the configurable logic
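Steps 1 and 2 can be sketched as follows, under several assumptions made for illustration: the critical path is taken to be the longest chain of LUTs (unit delay per LUT), row 0 is the horizontal row holding that path, negative rows mean "above" and positive rows "below". Column assignment and the final mapping onto the physical fabric (Step 3) are omitted.

```python
def place_luts(edges):
    """Assign relative rows to LUTs given (source, destination) dependencies."""
    nodes = sorted({n for edge in edges for n in edge})
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    memo = {}

    def longest_from(n):
        # Longest chain of LUTs starting at n (memoized depth-first search).
        if n not in memo:
            best = [n]
            for m in succ[n]:
                candidate = [n] + longest_from(m)
                if len(candidate) > len(best):
                    best = candidate
            memo[n] = best
        return memo[n]

    critical = max((longest_from(n) for n in nodes), key=len)
    row = {n: 0 for n in critical}  # Step 1: critical path on one row

    pending = [n for n in nodes if n not in row]
    progressed = True
    while pending and progressed:   # Step 2: place the rest relative to it
        progressed = False
        for n in list(pending):
            drives = [m for m in succ[n] if m in row]     # n feeds placed LUTs
            driven_by = [m for m in pred[n] if m in row]  # n is fed by placed LUTs
            if drives:
                row[n] = min(row[m] for m in drives) - 1  # place above
            elif driven_by:
                row[n] = max(row[m] for m in driven_by) + 1  # place below
            else:
                continue
            pending.remove(n)
            progressed = True
    return critical, row


if __name__ == "__main__":
    deps = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "c"), ("b", "y")]
    print(place_luts(deps))
```

In the example, a-b-c-d forms the critical path on row 0, x (an input to c) lands one row above, and y (fed by b) lands one row below.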

Routing
Simple greedy algorithm
▫ Routes wires in the most direct fashion
  ▪ Routes the wires between input nodes and LUTs
  ▪ Routes the wires from LUTs to outputs
  ▪ Routes the wires connecting LUTs together
Routing decisions are made at the switch matrices within the configurable logic fabric

Bitfile Creation
Combines
▫ the placed and routed hardware description with the DMA configuration information into a single bitfile
The bitfile can be used to initialize the configurable logic

Binary Modification
Updates the software binary to use the HW for loops
Replaces the original software instructions for the loop with a jump to HW initialization code
▫ The initialization code sends the HW enable signal through a memory-mapped register
▫ The code then triggers a microprocessor power-down
▫ Upon finishing, the HW asserts a completion signal, causing a software interrupt
  ▪ The software interrupt wakes the microprocessor
▫ A jump instruction at the end of the HW initialization code returns to the end of the original software loop

Tool: Performance and Area Overhead
Typical tools for decompilation, synthesis, and place and route need huge LSF machines. The designed tools are very lightweight and geared towards the partitioning co-processor.
▫ Data size: memory required for the tool's execution
▫ Time: execution time of each tool, assuming a 60 MHz clock and 1.5 cycles/instruction

Results
Definitions:
▫ Loop Time Perc: percentage of total software time spent in the implemented loops
▫ Loop Size Perc: percentage of the total instructions that the loops account for
▫ Ideal Speedup: speedup assuming the HW-implemented loops execute in zero time
▫ SW Loop Time: time required by the loop when executed completely in software
▫ HW Loop Time: time required by the loop when implemented in HW
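The ideal-speedup definition above is Amdahl's law with the hardware loop time set to zero. As a worked form, with symbols introduced here only for illustration (they do not appear in the slides): let T_sw be the total software time, T_loop the software time spent in the implemented loops, T_hw the time of those loops in hardware, and p = T_loop / T_sw the loop time fraction. Communication and setup overheads are ignored.

```latex
S_{\text{actual}} = \frac{T_{\text{sw}}}{T_{\text{sw}} - T_{\text{loop}} + T_{\text{hw}}},
\qquad
S_{\text{ideal}} = \lim_{T_{\text{hw}} \to 0} S_{\text{actual}} = \frac{1}{1 - p},
\qquad
p = \frac{T_{\text{loop}}}{T_{\text{sw}}}
```

For instance, a loop time percentage of 80% gives p = 0.8 and an ideal speedup of 1 / (1 - 0.8) = 5.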

Conclusion
Dynamic HW/SW partitioning offers advantages over the traditional approach:
▫ Transparent, i.e. the benefits of partitioning are obtained even with regular software flows
▫ Can adapt to the actual usage profile
▫ Up to 2.6X average speedup

Areas of Improvement & Future Work
The power required by the partitioning module and the running HW is specified as 10-20% of total power
▫ Power data for individual modules is not presented
Realistic loops have sequential logic and may not always be single-cycle
▫ Extend the implementation to a CLF that supports sequential logic
▫ Extend it to include multi-cycle loops
The applications seem too biased, especially "url", with 80% loop time and just 0.1% loop area overhead
Place and route and synthesis would have been difficult to do on a single partitioning chip:
▫ Today, as of 2013, it should be possible to interface the modules with cloud computing. I would rather run a complex algorithm on a cloud network to get the best-suited partition profile than try small tricks with lean co-processors
  ▪ This would be application dependent

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning Roman Lysecky, Frank Vahid, University of California, Riverside Design, Automation and Test in Europe Conference and Exhibition (DATE’05)

Motivation (1/2)
Hard processors
▫ Pros: performance
▫ Cons: flexibility
Soft processors
▫ Pros: flexibility
▫ Cons: degraded performance and energy consumption
Can we leverage the benefits of both using warp processing?

Motivation (2/2)
Warp processing: a technique for
▫ optimizing a software application
  ▪ by dynamically and transparently re-implementing "critical software kernels" as custom circuits in on-chip configurable logic
Study a MicroBlaze-based warp processing system to
▫ eliminate the performance and energy overhead of a soft processor compared to a hard processor

FPGA Single-Chip Systems: Hard-Core vs. Soft-Core
Hard-core
▫ Excellent packaging and communication with the FPGA
▫ Lower power and higher performance than soft-core
▫ E.g. Triscend, Atmel, Altera's Excalibur, Virtex* with PowerPCs
Soft-core
▫ Lower part cost
▫ Extreme flexibility during the design process
  ▪ Adding custom instructions or including/excluding particular datapath coprocessors
  ▪ Quickly integrating the processor within an FPGA
  ▪ Varying the number of processors as needed
▫ E.g. Nios, Nios II …, PicoBlaze, MicroBlaze
Use hardware/software partitioning techniques to alleviate the power and performance overhead of soft processors

MicroBlaze Soft Processor Core
▫ MicroBlaze: 32-bit soft core by Xilinx
▫ LMB: Local Memory Bus
▫ BRAM: Block RAM (user-defined size)
▫ OPB: On-Chip Peripheral Bus
[Figure: Xilinx Platform Studio tool flow. Labels: specify system architecture and configure MicroBlaze; Xilinx Platform Studio tools; synthesize design; bitstream; software libraries; application; compile; final system bitstream]

Key Features of MicroBlaze
User-configurable options
▫ Tailor the processor's functionality to the design's needs
▫ Configurable instruction and data caches
▫ Incorporate additional hardware:
  ▪ Hardware multiplier (mul instructions)
  ▪ Hardware divider (div instructions)
  ▪ Barrel shifter (bs and bsi instructions)
  ▪ Hardware bit manipulations and absolute plus

Peripheral Hardware Present
Xilinx LogiCORE IP Floating-Point Operator v5.0 (Mar '11)
Available for
▫ Kintex™-7, Virtex®-7, Virtex-6, Virtex-5, Virtex-4, Spartan®-6, Spartan-3/XA, Spartan-3E/XA, Spartan-3A/AN/3A DSP/XA FPGAs
Supported operators:
▫ multiply
▫ add/subtract
▫ divide
▫ square-root
▫ comparison
▫ conversion from floating-point to fixed-point
▫ conversion from fixed-point to floating-point
▫ conversion between floating-point types
▫ Parameterized fraction and exponent word lengths

Applications Analyzed
brev (Powerstone benchmark suite)
▫ The critical kernel performs bit reversal, relying heavily on shift operations
  ▪ Software-only implementation (without multiplier or barrel shifter)
    ▪ An n-bit shift is performed using n successive add operations
  ▪ Configurable hardware implementation
    ▪ 2.1X speedup
"matmul"
▫ Critical region: matrix multiplication
  ▪ Hardware multiplier provides 1.3X speedup
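To make the cost of the software-only shift concrete, the sketch below performs an n-bit left shift as n successive additions (x + x doubles the value, which is a shift by one bit). The 32-bit mask models MicroBlaze's register width; the function name is made up for this example.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register


def shift_left_by_adds(x, n):
    """Shift x left by n bits using only additions, one per bit position."""
    for _ in range(n):
        x = (x + x) & MASK32  # doubling is a 1-bit left shift
    return x


if __name__ == "__main__":
    print(shift_left_by_adds(1, 5))           # same value as 1 << 5
    print(hex(shift_left_by_adds(0x80000001, 1)))
```

A barrel shifter, or a custom circuit in the configurable logic, does the same work in a single operation regardless of n, which is why shift-heavy kernels such as brev benefit.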

MicroBlaze-Based Warp Processor
▫ Identify critical kernels at execution time
▫ Implement the critical kernels in the WCLA as custom HW
WCLA: Warp Configurable Logic Architecture

Warp Configurable Logic Architecture for Dynamic HW/SW Partitioning
▫ DADG: Data Address Generator
  ▪ Used for all memory accesses to/from the configurable logic
▫ LCH: Loop Control Hardware
  ▪ Handles loops and controls execution
▫ Reg0, Reg1, Reg2:
  1. Inputs to the CLF and/or the MAC (as per the mapping)
  2. Outputs from the configurable logic are stored in the registers

MicroBlaze Multi-Processor Warp Processing System
Multiple soft cores may be incorporated within a single FPGA
▫ Limited only by the FPGA size
A multi-processor warp processing system may share a common DPM (dynamic partitioning module) and WCLA, and HW/SW partitioning may be done in a round-robin manner
▫ No overhead due to additional DPMs
Partitioning tools may be implemented as software tasks running on one of the cores
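The round-robin sharing of a single partitioning module could be modelled as below. The core names, kernel names and the "one kernel per visit" rule are assumptions for illustration; the slides do not specify the scheduling details.

```python
from collections import deque


def round_robin_partition(pending):
    """Return the order in which one shared DPM would service the cores.

    pending maps a core name to the list of critical kernels detected on it.
    The DPM visits cores in turn and partitions one kernel per visit,
    skipping cores that have nothing pending.
    """
    queues = {core: deque(kernels) for core, kernels in pending.items()}
    order = []
    while any(queues.values()):
        for core, queue in queues.items():
            if queue:
                order.append((core, queue.popleft()))
    return order


if __name__ == "__main__":
    detected = {"mb0": ["k0", "k1"], "mb1": ["k2"], "mb2": []}
    print(round_robin_partition(detected))
```

No core waits behind another core's entire backlog, at the cost of each core seeing its kernels moved to hardware later than it would with a dedicated DPM.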

Experimental Setup
Execution time and power are studied
Embedded systems applications are chosen from the Powerstone and EEMBC benchmark suites
MicroBlaze processor core implemented on a Spartan-3 FPGA
▫ Barrel shifter and multiplier configured in hardware
▫ Note: the MicroBlaze maximum frequency is 85 MHz; however, FPGA circuits may run at up to 250 MHz

Profiling Simulation
[Figure: software applications 1..n run on MicroBlaze; the Xilinx Microprocessor Debug Engine produces instruction traces 1..n; the traces are used to simulate the on-chip profiler's behavior and find the critical regions]

Energy Equations

Performance / Power Simulation
[Figure: simulation flow. Labels: critical regions; VHDL; Synopsys Design Compiler; UMC 0.18um library; synthesis; execution traces of critical regions; execute HW circuits (VHDL model for WCLA) for each partitioned critical region; determine final application performance; Xilinx XPower; MicroBlaze and system components (excluding WCLA); dynamic power; static power; configurable HW power; MicroBlaze power]

Results
ARM execution determined using SimpleScalar

Conclusion
Warp processors (with a soft core), by pushing critical software kernels to the configurable logic, can provide
▫ The flexibility of a soft core
  ▪ Due to the soft-core implementation
▫ The competitiveness of a hard-core processor (such as ARM)
  ▪ Performance on the order of the hard core
    ▪ By leveraging specially configured HW
    ▪ 5.8X (average) improvement (with MicroBlaze)
  ▪ Eliminates the energy overhead
    ▪ By faster execution due to dedicated hardware and by trimming the soft processor down to fit the design needs
    ▪ Average energy reduction ~57%
▫ Opens avenues for soft-core processors that previously would not have been feasible due to energy/performance

Areas of Improvement & Future Work
Real processing systems do not execute just a single application at a time
▫ For realistic data, multiple applications should be run simultaneously
▫ Explore parallel processing architectures further
Power estimation data
▫ Estimation is good
  ▪ It would be good to see real data as well
The online profiler has a dozen entries
▫ The number of entries should be configurable to avoid local maxima
Instead of a simplified configurable logic fabric, how about using the underlying FPGA physical fabric?
An algorithm for determining the re-partitioning time interval should be worked out

Questions

Design Goals