1 Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware
Ann Gordon-Ross and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the U.S. National Science Foundation and a U.S. Dept. of Education GAANN Fellowship.
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2003.
2 System Optimizations - Static
- Specialization of a system for a particular application or suite of applications to improve power consumption and/or performance
- Static optimizations are performed at design time by the designer
- There are many static optimization approaches:
  - Critical regions can be partitioned to configurable logic
  - Constant propagation and code specialization for statically determined invariant variables
  - Critical regions can be locked to a specialized cache
  - Many more…
3 System Optimizations - Static
[Figure: design-time optimization flow. Labels: modeled input stimulus, power consumption, design time]
4 Static Optimization Drawbacks
- Designer must perform optimizations
- May disrupt standard software tool flows
- Runtime environment may present optimization opportunities that are not evident during design-time simulation
- Simulation is usually used
  - Framework: may take months to set up a testing environment with realistic input stimuli
  - Exploration time: may take weeks to run one of hundreds of possible configurations
5 System Optimizations - Dynamic
- Dynamic software optimizations are becoming increasingly popular for improving software performance and power
- Dynamic optimizations are performed in-system during runtime
[Figure: Profiling and performing dynamic optimizations. Labels: designer, modify system to use dynamic optimizations, end product, runtime input, system startup, optimized system, change in stimulus, re-optimization of the system, power consumption, execution time]
6 System Optimizations - Dynamic
- There are many dynamic optimization approaches:
  - Dynamo performs dynamic software optimizations on the most frequently executed regions of code
  - Frequently executed regions of code can be remapped to non-interfering cache locations
  - Dynamic binary translation methods store translation results of frequently executed regions of code for quick look-up
  - Value profiling can determine runtime invariant variables for constant propagation and/or code specialization
  - Many others…
7 Dynamic Optimizations - Effectiveness
- For dynamic optimizations to be most effective, they are typically applied to the most frequently executed regions of code
- For a large selection of the MediaBench benchmark suite, we observed that 90% of the execution time was spent in approximately 10% of the code
- Profiling is used to determine the critical regions of code
8 Previous Profiling Methods
- Desktop
  - Desktop-targeted profiling methods: instrumentation and sampling
  - These methods are unsuitable for embedded systems because they disrupt run-time behavior
- Embedded
  - Early methods used logic analyzers
    - Not possible for today's systems-on-a-chip (SOCs)
  - JTAG standard allows internal registers to be read
    - Typically used for testing and debugging
    - Interrupts processor to write internal information to external pins
9 Profiling Methodology Goal The goal of our profiling approach is to design a profiling tool suitable for embedded systems to determine the most critical regions of code
10 Critical Region Detection - Operational Requirements
- Non-intrusion
  - Important for real-time systems
  - Minimizes the impact on current tool chains, i.e., no special compilers or binary modification tools
- Low power
  - Battery-operated systems
  - Systems with limited cooling
- Small area
  - Less significant due to the large transistor capacities of current and future chips
- Accuracy
  - Exact results are not required for the information to be useful; reasonable accuracy is acceptable
11 Frequent Loop Detection
- We analyzed the critical regions for various Powerstone and MediaBench benchmarks
- We translate the problem of finding the critical regions to finding the frequently executed loops
- A short backwards branch (sbb) instruction is typically the last instruction of a loop
[Chart: all critical regions. 85% small inner loops; 15% subroutines with no inner loops]
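A minimal sketch of how sbb executions might be picked out of a branch trace. The trace format (pairs of branch address and target for taken branches), the function names, and the 256-byte distance threshold are all illustrative assumptions; the slides do not define how "short" is measured.

```python
from collections import Counter

# Hypothetical threshold in bytes; the slides do not specify what "short" means.
MAX_SBB_DISTANCE = 256

def is_sbb(pc, target, max_distance=MAX_SBB_DISTANCE):
    """Return True if a taken branch at pc jumps backwards by a short distance."""
    return target < pc and (pc - target) <= max_distance

def count_sbbs(trace):
    """Count sbb executions per branch address.

    trace: iterable of (pc, target) pairs, one per taken branch (assumed format).
    """
    counts = Counter()
    for pc, target in trace:
        if is_sbb(pc, target):
            counts[pc] += 1
    return counts

# Five iterations of a small loop, one forward branch, one long backward jump.
demo_trace = [(0x120, 0x100)] * 5 + [(0x200, 0x400)] + [(0x900, 0x100)]
loop_counts = count_sbbs(demo_trace)
```

Only the branch at 0x120 is counted here: the forward branch is not backwards, and the jump from 0x900 exceeds the assumed distance threshold.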
12 Percentage of Execution Time for Frequent Loops
In addition to detection of frequent loops, we also want to know the loops' percentage contribution to total execution time.

          Application X   Application Y
Loop A    10%             32%
Loop B    10%             33%
Loop C    80%             35%

- Optimization of most frequent loop only (Loop C) = 80% for X and 35% for Y
- Optimization of all critical loops = 80% for X (Loop C) and 100% for Y (Loops A, B, and C)
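The arithmetic on this slide can be sketched as follows. The loop percentages come from the slide; the 30% cutoff used to decide which loops count as "critical" is an assumption chosen only so that the example reproduces the slide's numbers.

```python
# Loop shares of total execution time, in percent (from the slide).
app_x = {"A": 10, "B": 10, "C": 80}
app_y = {"A": 32, "B": 33, "C": 35}

def top_loop_coverage(loops):
    """Execution time covered by optimizing only the most frequent loop."""
    return max(loops.values())

def critical_coverage(loops, threshold=30):
    """Execution time covered by optimizing every loop at or above threshold.

    The threshold is a hypothetical cutoff, not a value given in the slides.
    """
    return sum(share for share in loops.values() if share >= threshold)

x_top, y_top = top_loop_coverage(app_x), top_loop_coverage(app_y)
x_all, y_all = critical_coverage(app_x), critical_coverage(app_y)
```

Knowing each loop's share is what separates the two applications: for X a single loop is enough, while for Y all three loops must be optimized to reach comparable coverage.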
13 Frequent Loop Detection - Cache Based Architecture
[Figure: block diagram. Blocks: microprocessor, frequent loop cache controller, frequent loop cache, incrementer (++), connection to L1 memory. Signals: rd/wr, addr, sbb, data, saturation]
14 Cache Operation
[Figure: an sbb trace updating cache entries (address, frequency) until a conflict occurs]
15 Cache Operation - Conflict Resolution
- Most conflicts are resolved by associativity; an LRU replacement policy handles the remaining conflicts
- Remaining conflicts may cause frequent loops to be replaced constantly in the cache (thrashing)
- Our experiments did not suffer from this contention, but a victim buffer may be added if necessary
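A rough software model of a set-associative frequency cache with LRU replacement, given as a sketch rather than the hardware design. The class name, the set-index function (word address modulo the number of sets), and the choice to start a new entry at frequency 1 are assumptions not stated in the slides.

```python
from collections import OrderedDict

class FrequentLoopCacheModel:
    """Toy model: each set maps sbb address -> frequency, ordered from LRU to MRU."""

    def __init__(self, entries=32, ways=2):
        self.ways = ways
        self.num_sets = entries // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def update(self, sbb_addr):
        # Assumed indexing: drop the 2 byte-offset bits, then take modulo num_sets.
        cache_set = self.sets[(sbb_addr >> 2) % self.num_sets]
        if sbb_addr in cache_set:
            cache_set[sbb_addr] += 1
            cache_set.move_to_end(sbb_addr)      # mark as most recently used
        else:
            if len(cache_set) == self.ways:
                cache_set.popitem(last=False)    # conflict: evict the LRU entry
            cache_set[sbb_addr] = 1

    def frequencies(self):
        return {addr: f for s in self.sets for addr, f in s.items()}

# Three addresses that map to the same set of a tiny 4-entry, 2-way cache.
demo_cache = FrequentLoopCacheModel(entries=4, ways=2)
for addr in [0x00, 0x08, 0x00, 0x10]:
    demo_cache.update(addr)
demo_freqs = demo_cache.frequencies()
```

In this run the third distinct address evicts 0x08, the least recently used entry, while the more recently touched 0x00 keeps its count.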
16 Cache Operation - Frequency Width
- Our goal is to find the smallest possible cache needed to determine the frequent loops
- We keep the cache small by allowing the frequency field width to be varied
- If the frequency field is too small, saturations can occur and frequency information may be lost
17 Cache Operation - Frequency Counter Saturation
[Figure: an sbb trace updating cache entries (address, 8-bit frequency) until a counter saturates]
- On saturation, all frequencies are divided by 2 with a shift right (built as a special feature of the cache and activated by asserting the saturation signal to the cache)
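The saturation behavior can be sketched in software as below. The function name and the exact ordering (halve every counter first, then apply the pending increment) are assumptions; the slides only state that all frequencies are shifted right when the saturation signal is asserted.

```python
def increment_with_saturation(freqs, addr, width_bits=8):
    """Increment freqs[addr]; if its counter is already at the maximum value
    for the given field width, first halve every counter with a right shift."""
    max_value = (1 << width_bits) - 1
    if freqs.get(addr, 0) == max_value:
        for a in freqs:
            freqs[a] >>= 1          # divide all frequencies by 2
    freqs[addr] = freqs.get(addr, 0) + 1

# An 8-bit counter at its maximum (255) receives one more update.
demo_counters = {0x100: 255, 0x200: 100}
increment_with_saturation(demo_counters, 0x100)
```

Because every entry is halved together, the relative frequencies are roughly preserved (255:100 before, 128:50 after) even though absolute counts are lost.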
18 Experimental Setup
- We ran extensive experiments to determine the best frequent loop cache configuration
- Cache configurations simulated:
    Cache sizes:                      16, 32, and 64 entries
    Cache associativities:            1, 2, 4, and 8-way
    Frequency counter field widths:   4 to 32 bits
    Sizes x associativities x widths = 336 configurations
- To determine the accuracy of each cache configuration, we wrote a trace simulator for the cache architecture in C++
19 Experimental Setup
- Benchmarks
  - Selected Powerstone benchmarks running on a 32-bit MIPS instruction set simulator
  - Selected MediaBench benchmarks running on SimpleScalar
- Power consumption
  - UMC 0.18-micron CMOS technology running at 250 MHz at 1.8 V
  - Cache memory power consumption obtained using the Artisan memory compiler
  - Additional logic and functionality modeled in synthesizable VHDL using Synopsys Design Compiler
20 Accuracy - Sum of Differences We computed the average difference between the actual loop execution time percentage and the computed loop execution time percentage for the ten most frequently executed loops
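One way to compute such a metric, as a sketch under assumptions: the use of absolute differences, the treatment of loops missing from the estimate as 0%, and the function name are not taken from the slides.

```python
def average_difference(actual, estimated, top_n=10):
    """Average absolute difference, in percentage points, between actual and
    estimated execution-time shares over the top_n most frequent actual loops."""
    top_loops = sorted(actual, key=actual.get, reverse=True)[:top_n]
    total = sum(abs(actual[loop] - estimated.get(loop, 0.0)) for loop in top_loops)
    return total / len(top_loops)

# Hypothetical shares for three loops (percent of execution time).
demo_actual = {"A": 50.0, "B": 30.0, "C": 20.0}
demo_estimated = {"A": 47.0, "B": 33.0, "C": 20.0}
demo_error = average_difference(demo_actual, demo_estimated)
```

Here the differences are 3, 3, and 0 percentage points, giving an average of 2.0; a lower value means the cache's estimate tracks the real profile more closely.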
21 Results - Sum of Differences Sum of differences results averaged over all Powerstone benchmarks
22 Results - Best Cache Configuration
- We determine the smallest possible cache configuration necessary to give good results
- Overall best cache configuration: 2-way 32-entry cache with a frequency width of 24 bits
  - 95% accuracy for Powerstone
  - 90% accuracy for MediaBench
- No change to system performance
23 Results - Base System Power and Area
- MIPS32 4Kp microprocessor core
  - Embedded processor with a cache
  - Small: area of 1.7 mm²
  - Low power: [value missing] mW
24 Results - Frequent Loop Detector Power Overhead
- For the best cache configuration, the frequent loop detector increases the average power consumption of the total system by 2.4%
- Power consumption of operations
  - 142 mW per cache read and increment
  - 156 mW per cache write
  - 20.7 mW per saturation
- Average frequency of operations
  - Cache updates: 4.25%
  - Saturations: [value missing]%
25 Results - Frequent Loop Detector Area Overhead
- Area overhead
  - Frequent loop cache controller, incrementor, and additional control/steering logic: 1400 gates ([value missing] mm²)
  - Cache including saturation logic: 0.167 mm²
- Resulting area overhead of 10.5% compared to the reported size of the MIPS32 4Kp
- Our numbers are pessimistic, while reported microprocessor areas are likely optimistic
26 Reducing Power Overhead via Frequent Update Coalescing
- Since frequent loops tend to iterate many times, the same entry is updated in the frequent loop cache many times in a row
- Coalesce consecutive sbb executions into one cache update to reduce cache updates
[Figure: sbb trace (1110). Consecutive "increment frequency" updates are grouped into a single "add 4 to frequency" update]
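Coalescing can be sketched in software by grouping runs of identical consecutive sbb addresses. The function names and the unbounded run length are assumptions; a hardware coalescing counter would presumably have a fixed width, which the slides do not specify.

```python
from itertools import groupby

def coalesce(sbb_trace):
    """Collapse each run of identical consecutive sbb addresses into one
    (address, run_length) cache update."""
    return [(addr, sum(1 for _ in run)) for addr, run in groupby(sbb_trace)]

def apply_updates(updates):
    """Apply coalesced updates to a frequency table (address -> count)."""
    freqs = {}
    for addr, amount in updates:
        freqs[addr] = freqs.get(addr, 0) + amount
    return freqs

# Seven sbb executions: a loop iterating 4 times, another once, the first twice more.
demo_sbb_trace = [0x120] * 4 + [0x300] + [0x120] * 2
demo_updates = coalesce(demo_sbb_trace)
demo_table = apply_updates(demo_updates)
```

The seven sbb executions become three cache updates, and the final frequencies match what per-execution increments would have produced.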
27 Reducing Power Overhead via Frequent Update Coalescing
[Figure: coalescing results, with panels for Powerstone and MediaBench]
28 Sampling for Further Reduced Power Overhead
- Instead of tallying every sbb executed, only tally sbbs that occur at a fixed sampling interval
- This method does not require interrupting the processor
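A minimal sketch of interval sampling, assuming the interval is counted in sbb executions (the slides do not say whether it is measured in sbbs, instructions, or cycles). The function name and trace format are illustrative.

```python
def sample_sbbs(sbb_trace, interval):
    """Tally only every interval-th sbb execution in the trace."""
    freqs = {}
    for position, addr in enumerate(sbb_trace, start=1):
        if position % interval == 0:
            freqs[addr] = freqs.get(addr, 0) + 1
    return freqs

# 100 sbb executions: loop 0xA accounts for 80% of them, loop 0xB for 20%.
demo_long_trace = [0xA] * 80 + [0xB] * 20
demo_sampled = sample_sbbs(demo_long_trace, interval=10)
```

With an interval of 10, only 10 of the 100 executions touch the table, yet the sampled counts keep the same 80:20 ratio in this example. Real traces interleave loops less regularly, so some accuracy loss is to be expected.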
29 Sum of Differences for Sampling
[Figure: sum of differences versus sampling interval, with panels for Powerstone and MediaBench]
30 Sum of Differences for Sampling
- When going from a sampling interval of 1 to 50:
  - Average accuracy decreases for Powerstone benchmarks by 5%
  - Average accuracy increases for MediaBench benchmarks by 2%
31 Results for a Sampling Interval of 50
- Coalescing plus sampling reduces the average system power overhead to a mere 0.02%
- Still no change to system performance
- Average frequency of operations
  - Cache updates: 0.03%
  - Coalesces: [value missing]%
  - No saturations
32 Example Use - Warp Processing
- The detector has been successfully incorporated into a novel prototype system-on-a-chip architecture performing what is presently known as warp processing (also being developed at UCR)
33 Warp Processing
[Figure: system-on-a-chip with microprocessor (µP), memory, software (SW), profiler, dynamic partitioning module, and configurable logic]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic and update software binary
5. After some execution, time/energy are warped; the user is unaware of the configurable logic!
34 Conclusions
- We have presented a frequent loop detector that is small, power-efficient, and non-intrusive, and that accurately provides relative frequencies of loops
  - 2-way set-associative 32-entry cache with a 24-bit frequency counter
  - Power overhead of 2.4% compared to a low-power 32-bit embedded processor
  - Power overhead is easily reducible to well below 0.1% using simple coalescing and sampling methods
- Currently being used in the profiling step of the Warp processor at UCR