MAPLD 2005
Demand and Penalty-Based Resource Allocation for Reconfigurable Systems with Runtime Partitioning
John Ardini

Motivation
For a given HW architecture, including reconfigurable components
  – Optimize performance in consideration of long reconfiguration times and current demands for processing
Application in systems with unknown runtime processing demands
  – Cognitive systems
  – Multisensor systems
  – Systems with unknown data lengths
Take advantage of the ability to express hardware implementations in a high-level language (C) common to processors and programmable devices

Related Work
Li, Compton, Hauck [00], based on Young [94]
  – “Credit” for RC unit is proportional to size of the unit
  – Penalty algorithm for defragmentation
  – Scoring approach here, but “credit” is proportional to amount of “acceleration” achieved, with decision threshold based on size
Vuletić, Pozzi, Ienne [04]
  – HW/SW abstraction layer proposed for RC transparent programming model

Goals
Examine possible RTL generators allowing one set of source code for an algorithm
  – Binds to processor or programmable device (FPGA)
  – Minimal changes (I/O only) required to source
  – However, scheduling approach is not dependent on the capability of C-to-RTL generators
Show easy creation of processor and FPGA implementations of logic
Assume task scheduling is unknown at build time and is based on service requests
Allow each task to support SW-only and hardware-accelerated versions
Define simple logic to make “best” use of hardware resources, assign ownership dynamically
Show benefit of RC via DMA in algorithms that can be bound to HW or SW
Define API for application threads
Demonstrate concept in real hardware

Experimental Environment
Worker thread, coproc DMA model set up in Windows using VC++ multithreaded app
Coprocessor is FPGA on PCI Alpha-Data card
Implemented algorithm execution with/without coproc
Used DMA to help hide overhead of reconfiguration: SW-only threads can execute during configuration
Service requests initiated by adjustable timers to exercise RC logic
Event logging for analysis
[Figure: block diagram; labels: Mgr thread, Worker thread 1, Worker thread 2, coproc, worker thread registration, service request, dataset, savings, DMA config]

Hardware Environment
Alpha-Data Virtex-II Pro card on PCI bus
Simple bus wrapper gets coprocessor IP onto Alpha-Data local bus
PC chosen for easy development and focus on unique logic
[Figure: block diagram; labels: FPGA, wrapper, IP, local bus to PCI bridge, PC]

RTL Generator
ImpulseC chosen for this study
  – ANSI C-like
  – Simple modifications to algorithm to compile for processor
      – Data I/O path
      – Word types as simple #defines
  – High level of abstraction
      – Small learning curve
      – Give up low-level control of registers/signals
      – Some control over max gate delay using #pragma
  – Desktop simulation for fast algorithm debug

Software
Manager and application in VC++
  – Easily implemented in C as well
For demo, Windows “worker thread” model used, but other static thread + messaging methods could be used as well

Test Algorithms
Two tasks implemented
  – FIR
  – FFT
HW implementation flow
  – Code in C
  – ImpulseC RTL generator
  – Synplify
  – Xilinx implementation tools
SW flow
  – Change I/O in HW algorithm to use shared memory buffer
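
As an illustration of the SW flow, the sketch below shows a FIR kernel whose I/O goes through a shared memory buffer. It is only a minimal example under stated assumptions: the names Word, SharedBuffer, and fir_sw are hypothetical, no ImpulseC calls are shown, and the deck does not give the actual filter code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: the word type sits behind one alias (a #define in plain C),
// so the same algorithm body can be retargeted.
using Word = std::int32_t;

// Hypothetical shared memory buffer used as the processor-side I/O path.
struct SharedBuffer {
    const Word* in;   // input samples
    Word* out;        // output samples
    std::size_t n;    // number of samples
};

// Direct-form FIR: out[i] = sum over k of taps[k] * in[i - k].
// Samples before the start of the buffer are treated as zero.
inline void fir_sw(const SharedBuffer& io, const std::vector<Word>& taps) {
    for (std::size_t i = 0; i < io.n; ++i) {
        Word acc = 0;
        for (std::size_t k = 0; k < taps.size() && k <= i; ++k) {
            acc += taps[k] * io.in[i - k];
        }
        io.out[i] = acc;
    }
}
```

In a hardware build, only the read and write of samples would change; the inner multiply-accumulate loop would stay the same.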

IP Development Outline
Write task coprocessor for HW using ImpulseC
Modify I/O for processor implementation
Quantify savings in clock cycles for HW-accelerated version
Wrap both implementations into “worker thread” that will use one of the implementations based on coprocessor ownership
Need to check coprocessor ownership on thread start
Worker thread registration not considered here
  – Could be defined on power up or
  – Dynamically registered
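
The wrapping step in the outline above might look like the following sketch. All names (TaskImpl, RunReport, service_request) are hypothetical, and the assumption that the same savings figure is reported whether or not the coprocessor was used follows the later slide on reporting savings "achieved, or could have been achieved".

```cpp
#include <cstdint>
#include <functional>

// Hypothetical bundle of the two builds of one algorithm.
struct TaskImpl {
    std::function<void()> run_sw;   // processor implementation
    std::function<void()> run_hw;   // coprocessor-backed implementation
    std::uint32_t savings_cycles;   // quantified HW advantage per request
};

// What the worker reports back to the manager on completion.
struct RunReport {
    bool used_hw;           // which implementation actually ran
    std::uint32_t savings;  // achieved, or achievable, savings in cycles
};

// Check ownership at thread start, run one implementation, report savings.
inline RunReport service_request(const TaskImpl& task, bool owns_coproc) {
    if (owns_coproc) {
        task.run_hw();
    } else {
        task.run_sw();
    }
    return RunReport{owns_coproc, task.savings_cycles};
}
```

A worker thread would call service_request once per service request and forward the returned report to the manager.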

Worker Thread Control Block
One instantiated per worker thread
Contains information about the coprocessor bit stream
Points to the HW resource it currently owns
  – Would be used in multiple coprocessor systems for faster manager logic
Contains base address of its coprocessor
  – Maintained by the manager and is used as a semaphore for coprocessor use
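
One possible layout for such a control block is sketched below. The field names and types are assumptions, as is the convention that a zero base address means "no access"; the deck only states which kinds of information the block holds.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

struct RcResource;  // HW resource control block, defined elsewhere

// Hypothetical worker thread control block, one per worker thread.
struct WorkerTcb {
    int thread_id = 0;
    std::string bitstream_name;         // information about the bit stream
    std::uint32_t bitstream_bytes = 0;  // size of the bit stream
    RcResource* owned = nullptr;        // HW resource currently owned, if any
    // Written by the manager only. Assumption: 0 means the worker may not
    // touch the coprocessor, so the field doubles as a semaphore.
    std::atomic<std::uintptr_t> coproc_base{0};
};

// Worker-side check before using the coprocessor.
inline bool may_use_coproc(const WorkerTcb& w) {
    return w.coproc_base.load() != 0;
}
```

Using an atomic for the base address lets the manager grant or revoke access without a separate lock, which is one way to read "used as a semaphore".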

RC Thread Control Block
Control block for HW resource
Holds information about the resource, e.g. the ID of the resource
Member function to kick off bit stream load process via DMA
  – Target thread can continue to run SW only until configuration is complete
Member function to gain coprocessor access on behalf of a worker thread based on ownership and state (is it done loading the bit stream?)
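
The two member functions described above could be sketched as follows. The class name, the three-state model, and the method names are assumptions, and the DMA transfer itself is left out; only the ownership and state check is shown.

```cpp
#include <atomic>

// Hypothetical load states for the HW resource.
enum class RcState { Empty, Loading, Ready };

class RcResource {
public:
    explicit RcResource(int id) : id_(id) {}

    int id() const { return id_; }

    // Start loading a bit stream for new_owner. A real system would start
    // the DMA transfer here; the owner keeps running SW only meanwhile.
    void begin_load(int new_owner) {
        owner_.store(new_owner);
        state_.store(RcState::Loading);
    }

    // Called once the DMA done event has been seen.
    void load_done() { state_.store(RcState::Ready); }

    // Access is granted only to the owner, and only after loading is done.
    bool try_acquire(int worker) const {
        return owner_.load() == worker && state_.load() == RcState::Ready;
    }

private:
    int id_;
    std::atomic<int> owner_{-1};  // -1 means no owner
    std::atomic<RcState> state_{RcState::Empty};
};
```

A worker that fails try_acquire simply falls back to its software implementation for that request.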

Coprocessor Ownership
All service requests pass through the thread manager
Manager uses “scoring” logic
Upon completion, worker threads report “savings” that were achieved, or could have been achieved, using a coprocessor
Manager increments score for that thread
Highest scoring threads receive a coprocessor
Reassignment not done until a threshold is passed
  – Set based on relative time penalty of performing a reconfiguration, e.g. do reconfig when score delta exceeds 10x the reconfiguration time
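
To make the threshold rule concrete, here is a minimal sketch. The function names are hypothetical, and it assumes scores and reconfiguration time are expressed in the same unit (for example clock cycles); the 10x multiple is the example given on the slide.

```cpp
#include <cstdint>

// Threshold as a multiple of the reconfiguration time (slide example: 10x).
constexpr std::int64_t rc_threshold(std::int64_t reconfig_time,
                                    std::int64_t multiple = 10) {
    return reconfig_time * multiple;
}

// Manager increments a thread's score by the savings it reported.
inline void report_savings(std::int64_t& score, std::int64_t savings) {
    score += savings;
}

// Reassign only when the score delta exceeds the threshold.
constexpr bool should_reassign(std::int64_t challenger_score,
                               std::int64_t owner_score,
                               std::int64_t threshold) {
    return challenger_score - owner_score > threshold;
}
```

With a reconfiguration time of 50 units, the threshold is 500, so a challenger at 600 does not displace an owner at 100, while a challenger at 601 does.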

Scoring Logic
Need to bound scores
  – Bound should be greater than RC threshold
  – 2x RC threshold used in these tests
Need to maintain “relative” performance of competing tasks, i.e. can’t have most scores saturating
Therefore, when updating scores at thread completion, subtract the current lowest score from the scores of all registered threads
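
The bounded update with simple lowest-score subtraction can be sketched as below. The function name is hypothetical, and the order of operations (add savings, saturate at the bound, then subtract the lowest score) is an assumption, since the slide does not specify it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Update at thread completion. Requires a non-empty score list and a
// valid reporter index.
inline void update_scores_simple(std::vector<std::int64_t>& scores,
                                 std::size_t reporter,
                                 std::int64_t savings,
                                 std::int64_t rc_threshold) {
    const std::int64_t bound = 2 * rc_threshold;  // 2x threshold, as in the tests
    scores[reporter] = std::min(scores[reporter] + savings, bound);

    // Keep scores relative: remove the common floor from every thread.
    const std::int64_t lowest = *std::min_element(scores.begin(), scores.end());
    for (std::int64_t& s : scores) {
        s -= lowest;
    }
}
```

For scores {100, 300}, a report of 50 from the first thread gives {150, 300}, and subtracting the lowest score leaves {0, 150}.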

Scoring Details
Simple subtraction of lowest score is not enough
  – One inactive thread would allow “integrator windup” on the remaining threads
      – Slow response when the inactive thread comes back online
      – Saturation logic would prevent the selection of coprocessor owners, i.e. they would all “collect” at the top of the score list
  – Prevents initial accumulation of scores
Therefore, subtract score x from each task, where
  – x is the lowest nonzero score for all tasks other than the top scoring m threads, where m is the number of available coprocessors
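
One reading of the refined rule is sketched below. The function names are hypothetical, and two details are assumptions not stated on the slide: scores are clamped at zero after subtraction, and x is taken as zero when no nonzero score exists outside the top m.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// x = lowest nonzero score among tasks outside the top m (0 if none).
inline std::int64_t normalization_offset(const std::vector<std::int64_t>& scores,
                                         std::size_t m) {
    std::vector<std::int64_t> sorted(scores);
    std::sort(sorted.begin(), sorted.end(), std::greater<std::int64_t>());
    std::int64_t x = 0;
    // Descending order: the last positive entry seen is the lowest nonzero one.
    for (std::size_t i = m; i < sorted.size(); ++i) {
        if (sorted[i] > 0) {
            x = sorted[i];
        }
    }
    return x;
}

// Subtract x from every task, clamping at zero (assumption).
inline void normalize_scores(std::vector<std::int64_t>& scores, std::size_t m) {
    const std::int64_t x = normalization_offset(scores, m);
    for (std::int64_t& s : scores) {
        s = std::max<std::int64_t>(0, s - x);
    }
}
```

With one coprocessor and scores {0, 400, 900}, the inactive thread at 0 is ignored, x is 400, and the result is {0, 0, 500}, so the active threads no longer wind up toward the bound.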

Coproc Assignment
Get highest scoring non-owner in top m tasks
Compare score to lowest ranking owner
If diff is greater than threshold, RC
  – If current owner is using the resource, skip RC
      – If RC is still the right decision after current owner finishes, RC will happen at that time
More logic could be used to continue comparing against current coproc owners
[Figure: ranked task scores t3, t2, *t1, *t4, t5 (* = current owner); top m tasks are eligible for coprocessor ownership; lower ranking tasks will run in SW; decision: Δ > thresh?]
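
The assignment steps above can be sketched as a single decision function. The Task and Decision types and the function name are hypothetical. The handling of a free coprocessor (fewer owners than coprocessors, compared against a score of zero) is an assumption; the slide only describes the case where an owner exists.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Task {
    int id;
    std::int64_t score;
    bool owner;  // currently owns a coprocessor
    bool busy;   // owner is using the coprocessor right now
};

struct Decision {
    int new_owner = -1;  // -1: no reconfiguration now
    int old_owner = -1;  // -1: a free coprocessor is used
};

inline Decision decide(std::vector<Task> tasks, std::size_t m,
                       std::int64_t threshold) {
    std::stable_sort(tasks.begin(), tasks.end(),
                     [](const Task& a, const Task& b) { return a.score > b.score; });

    const Task* challenger = nullptr;  // highest scoring non-owner in top m
    for (std::size_t i = 0; i < tasks.size() && i < m; ++i) {
        if (!tasks[i].owner) { challenger = &tasks[i]; break; }
    }
    const Task* victim = nullptr;      // lowest ranking owner
    std::size_t owners = 0;
    for (const Task& t : tasks) {
        if (t.owner) { victim = &t; ++owners; }
    }

    Decision d;
    if (challenger == nullptr) return d;
    if (owners < m) victim = nullptr;  // assumption: a coprocessor is free
    const std::int64_t base = victim ? victim->score : 0;
    if (challenger->score - base <= threshold) return d;
    if (victim && victim->busy) return d;  // skip; re-evaluate when owner finishes
    d.new_owner = challenger->id;
    d.old_owner = victim ? victim->id : -1;
    return d;
}
```

The manager would call decide after each score update; a skipped reconfiguration is picked up naturally on the next call if it is still the right decision.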

Reconfiguration Thread
Created by manager
Kicks off DMA process
Waits for done event
Sends reconfiguration complete message back to manager
Manager can then give access to the worker thread owner

Test Configuration
Single HW resource available
Two competing threads, FFT and FIR processing
Fixed HW block sizes
Fixed data set sizes = fixed savings
Adjust for mismatch in microprocessor vs. FPGA clock rates
Service request rates for each thread adjustable to exercise RC logic

Results
[Figure: results plot; annotations: RC event, no owner, Thread 1 owns, Thread 2 owns, RC threshold, hysteresis, score saturation]
Service request rates for two threads vary with time

Reconfiguration Detail
[Figure: plot annotated with the RC DMA period]

RC DMA with Higher Demand Rate
[Figure: plot annotated with the RC DMA period]

Conclusions
Coprocessor ownership given based on best sustained use of the resource
Provides hysteresis to prevent frequent reconfigurations
Low-overhead RC decision logic
Hardware and software implementations allow DMA to hide reconfiguration overhead
IP description in C allows it to be created once, compiled for microprocessor and FPGA targets