A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang, Vahid F., Lysecky R. Proceedings of Design, Automation and Test in Europe Conference and Exhibition Volume: 1 Pages: 142 – 147 Feb. 2004
Abstract

Memory accesses can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in pre-fabricated microprocessor platforms. Tuning those caches to a program is, however, still a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is costly in both power and performance. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and by as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.
What's the Problem

- Tuning a configurable cache to an application benefits both power and performance
- How do we obtain the best cache configuration?
  - Increasing the cache size or associativity sometimes improves performance only slightly while increasing energy greatly
- Determining the best cache configuration via simulation
  - Straightforward, but slow, and cannot capture runtime behavior
- It is therefore essential to tune a configurable cache automatically and dynamically as the application executes
Introduction

- Previous work of this team: a highly configurable cache architecture [13], [14]
- Four parameters that designers can configure:
  1) Cache total size: 8, 4 or 2 KB
  2) Associativity: 4, 2 or 1 way for 8 KB; 2 or 1 way for 4 KB; 1 way for 2 KB
  3) Cache line size: 64, 32 or 16 bytes
  4) Way prediction: ON or OFF
- The proposed dynamic cache tuning method
  - Implements a cache tuning heuristic in on-chip hardware
  - Does not exhaustively try all possible cache configurations
  - Dynamically tunes the cache to an executing program
  - Automates the process of finding the best cache configuration
  - Matters more as the configuration space grows larger
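The four parameters above define a small configuration space that can be enumerated directly. The sketch below is illustrative only: the names `WAYS_BY_SIZE` and `valid_configs` are hypothetical, and it assumes way prediction is only meaningful for set-associative configurations (more than one way). Under that assumption the count comes to 27, which is the number of configurations the results slide compares against.

```python
from itertools import product

SIZES_KB = (2, 4, 8)
LINE_BYTES = (16, 32, 64)
# Allowed associativities for each cache size, as listed on the slide.
WAYS_BY_SIZE = {8: (1, 2, 4), 4: (1, 2), 2: (1,)}


def valid_configs():
    """Return (size_kb, ways, line_bytes, way_prediction) tuples."""
    configs = []
    for size in SIZES_KB:
        for ways, line in product(WAYS_BY_SIZE[size], LINE_BYTES):
            configs.append((size, ways, line, False))
            # Assumption: way prediction only applies when ways > 1.
            if ways > 1:
                configs.append((size, ways, line, True))
    return configs


configs = valid_configs()
print(len(configs))
```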
Energy Evaluation

- Equations for total memory-access energy consumption
  - E_hit: cache hit energy per cache access (related to cache size and associativity)
  - E_miss: cache miss energy (related to cache line size)
  - E_static_per_cycle: static energy dissipation per cycle (related to cache size)
- Equation for the heuristic cache tuner energy consumption
  - Time_total: the total time used to finish one cache configuration search
  - NumSearch: the number of cache configurations searched
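As a rough sketch of how these terms might combine, the code below assumes the total is a sum of three products, which fits the later remark that the tuner performs three multiplications. Whether E_hit is multiplied by the number of hits or by all accesses is an assumption here, as are the function and argument names. The tuner formula follows the arithmetic on the area-and-power slide (power × time per search × number of searches).

```python
def memory_access_energy(hits, misses, cycles,
                         e_hit, e_miss, e_static_per_cycle):
    """Total memory-access energy: dynamic part plus static part."""
    # Assumption: E_hit is charged per hit, E_miss per miss.
    dynamic = hits * e_hit + misses * e_miss
    static = cycles * e_static_per_cycle
    return dynamic + static


def tuner_energy(p_tuner_watts, time_total_s, num_search):
    """Energy spent by the tuner over num_search configuration searches."""
    return p_tuner_watts * time_total_s * num_search


# Hypothetical counts and per-event energies, chosen only for illustration.
example = memory_access_energy(hits=10, misses=2, cycles=100,
                               e_hit=1.0, e_miss=5.0, e_static_per_cycle=0.5)
print(example)
```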
Problem Overview

- A naive tuning approach exhaustively tries all possible cache configurations
- Two main drawbacks
  - Involves too many configurations
  - Requires too many cache flushes: searching in an arbitrary order may require flushing the cache
- Goal: develop a self-tuning heuristic that
  - Minimizes the number of cache configurations examined
  - Minimizes cache flushing
  - Still finds a near-optimal cache configuration
- The tuner
  1) Tunes dynamically during execution
  2) Can be enabled and disabled by software
Heuristic Development Through Analysis

- Figure: energy dissipation for benchmark parser at cache sizes from 1 KB to 1 MB
  - At small cache sizes, the energy dissipation of off-chip memory decreases rapidly as the cache grows: cache performance increases and total energy decreases
  - Beyond a tradeoff point, a larger cache improves performance only slightly but increases energy significantly
- This tradeoff point differs from application to application, and exists not only for cache size but also for associativity and line size
- Therefore, the goal of the search heuristic is to find the configuration at this tradeoff point
Determine the Impact of Each Parameter

- The parameter with the greatest impact should be configured first
- Varying cache size has the biggest impact on miss rate and energy
- Varying line size causes little energy variation for the I-cache but more variation for the D-cache
- Varying associativity has the smallest impact on energy consumption
- Figures: energy for different line sizes; energy for different associativities
- Conclusion: develop a search heuristic that finds the best cache size first, then the best line size, and finally the best associativity
Minimizing Cache Flushing

- The order in which the values of each parameter are varied matters
  - One order may require flushing, a different order may not
- Cache flush analysis when changing cache size (figure: 8-byte memory)
  - Increasing the cache size is preferable to decreasing it
  - Decreasing the cache size may turn an original hit into a miss
    - Ex.: addresses 000 (index=00) and 110 (index=10) are misses after shutdown
    - For the D-cache, data in the shut-down ways must be written back if it is dirty
  - Increasing the cache size does not require flushing
    - Ex.: addresses 100 (index=0) and 010 (index=0)
    - No write-back is needed, so flushing is avoided
Minimizing Cache Flushing (continued)

- Cache flush analysis when changing associativity
  - Increasing the associativity is preferable to decreasing it
  - Decreasing the associativity may turn a hit into a miss
    - Ex.: addresses 000 (index=0) and 100 (index=0)
  - Increasing the associativity causes no extra misses
    - Ex.: addresses 000 (index=00) and 010 (index=10)
    - Both are still hits after the associativity is increased
Search Heuristic for Determining the Best Cache Configuration

- Inputs to the heuristic
  - Cache size: C[i], 1 ≤ i ≤ n; n=3 in our configurable cache
    - C[1]=2 KB, C[2]=4 KB, C[3]=8 KB
  - Line size: L[j], 1 ≤ j ≤ p; p=3 in our configurable cache
    - L[1]=16 bytes, L[2]=32 bytes, L[3]=64 bytes
  - Associativity: A[k], 1 ≤ k ≤ m; m=3 in our configurable cache
    - A[1]=1 way, A[2]=2 way, A[3]=4 way
  - Way prediction: W[1]=OFF, W[2]=ON
- Search order
  - First: cache size, increased as long as increasing it results in a decrease in total energy
  - Then: line size
  - And then: associativity
  - Finally: way prediction
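The search order above can be sketched as a greedy climb over one parameter at a time. This is a minimal illustration, not the authors' hardware: the starting point (2 KB, 16-byte lines, 1 way, way prediction off), the `measure` callback, and the `toy_energy` numbers are all assumptions made for the example. The associativity limit per cache size comes from the parameter list on the Introduction slide.

```python
from collections import namedtuple

Config = namedtuple("Config", "size_kb line_bytes ways way_pred")

# Largest associativity allowed for each cache size (from the Introduction slide).
MAX_WAYS = {2: 1, 4: 2, 8: 4}


def tune(measure):
    """Greedy search; returns (best config, number of configs measured)."""
    tested = {}

    def energy(cfg):
        if cfg not in tested:
            tested[cfg] = measure(cfg)
        return tested[cfg]

    def climb(cfg, field, values):
        # Move to the next larger value only while total energy decreases.
        for value in values:
            candidate = cfg._replace(**{field: value})
            if energy(candidate) >= energy(cfg):
                break
            cfg = candidate
        return cfg

    # Assumed starting point: the smallest value of every parameter.
    cfg = Config(size_kb=2, line_bytes=16, ways=1, way_pred=False)
    cfg = climb(cfg, "size_kb", [4, 8])
    cfg = climb(cfg, "line_bytes", [32, 64])
    cfg = climb(cfg, "ways", [w for w in (2, 4) if w <= MAX_WAYS[cfg.size_kb]])
    if cfg.ways > 1:
        cfg = climb(cfg, "way_pred", [True])
    return cfg, len(tested)


def toy_energy(cfg):
    """Hypothetical energy numbers, chosen only to exercise the search."""
    size_cost = {2: 10.0, 4: 6.0, 8: 7.0}[cfg.size_kb]
    line_cost = {16: 3.0, 32: 2.0, 64: 2.5}[cfg.line_bytes]
    ways_cost = {1: 1.0, 2: 0.5, 4: 0.8}[cfg.ways]
    pred_cost = -0.2 if cfg.way_pred else 0.0
    return size_cost + line_cost + ways_cost + pred_cost


best, num_tested = tune(toy_energy)
print(best, num_tested)
```

Because every step moves toward larger sizes and higher associativity, a search of this shape stays in the direction that the flush analysis identifies as not requiring a flush.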
The Efficiency of the Search Heuristic

- Suppose there are n configurable parameters, and each parameter has m values
  - Total of m^n different combinations
  - Our heuristic searches at most m*n combinations
- Ex.: 10 configurable parameters, each with 10 values
  - Brute-force search: 10^10 combinations
  - Our search heuristic: 100 combinations instead
- Thus, using our search heuristic
  - Minimizes the number of cache configurations examined
  - Avoids most of the cache flushing
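The two counts in the example can be checked with a couple of lines. The function names are hypothetical; the formulas are the ones stated on the slide.

```python
def brute_force_count(m, n):
    """Every combination of n parameters with m values each."""
    return m ** n


def heuristic_bound(m, n):
    """Upper bound when each parameter is searched in turn."""
    return m * n


print(brute_force_count(10, 10), heuristic_bound(10, 10))
```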
Implementing the Heuristic in Hardware

- A hardware-based approach is preferable to software
  - A software approach would change not only the runtime behavior of the application but also the cache behavior
- FSMD of the cache tuner (figure); the datapath holds
  - Runtime information
  - Application-independent information
    - E_hit: six values, for 8 KB 4-way, 2-way and 1-way; 4 KB 2-way and 1-way; 2 KB 1-way
    - E_miss: three values, for line sizes of 16, 32 and 64 bytes
    - E_static_per_cycle: three values, for cache sizes of 8 KB, 4 KB and 2 KB
  - The result of the energy calculation
  - The lowest energy of the configurations tested
  - Configure register (7 bits wide), used to configure the cache: 2 bits for cache size, 2 bits for line size, 2 bits for associativity and 1 bit for way prediction
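A 7-bit configure register with those field widths could be packed as sketched below. The slide gives only the widths, so the field order (size in the top bits, way prediction in bit 0) and the value-to-code encodings are assumptions, and all names are hypothetical.

```python
# Assumed 2-bit encodings for each field; the slide specifies widths only.
SIZE_CODES = {2: 0, 4: 1, 8: 2}
LINE_CODES = {16: 0, 32: 1, 64: 2}
WAYS_CODES = {1: 0, 2: 1, 4: 2}

SIZE_VALUES = {code: kb for kb, code in SIZE_CODES.items()}
LINE_VALUES = {code: b for b, code in LINE_CODES.items()}
WAYS_VALUES = {code: w for w, code in WAYS_CODES.items()}


def pack_config(size_kb, line_bytes, ways, way_pred):
    """Pack the four parameters into a 7-bit integer (assumed layout)."""
    return ((SIZE_CODES[size_kb] << 5)
            | (LINE_CODES[line_bytes] << 3)
            | (WAYS_CODES[ways] << 1)
            | int(way_pred))


def unpack_config(reg):
    """Inverse of pack_config."""
    return (SIZE_VALUES[(reg >> 5) & 0b11],
            LINE_VALUES[(reg >> 3) & 0b11],
            WAYS_VALUES[(reg >> 1) & 0b11],
            bool(reg & 0b1))


reg = pack_config(8, 64, 4, True)
print(format(reg, "07b"), unpack_config(reg))
```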
Implementing the Heuristic in Hardware (continued)

- FSM of the cache tuner: composed of three smaller state machines
  - PSM: tunes each cache parameter in turn (determines the best cache size, then line size, associativity and way prediction)
  - VSM: determines the energy for the possible values of each parameter (e.g. 2 KB, 4 KB, 8 KB)
  - CSM: controls the calculation of energy
- Ex.: if the current state of the PSM is P1, state V1 of the VSM determines the energy of the 2 KB cache, V2 that of the 4 KB cache, and V3 that of the 8 KB cache
- Why do we need the CSM?
  - Because the energy calculation has three multiplications but the datapath has only one multiplier
  - The CSM uses four states to compute the energy
- PSM states depend on the VSM, and VSM states depend on the CSM
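One way to picture the CSM is as a four-step sequence that reuses a single multiplier, once per step. The slide states only that there are three multiplications, one multiplier and four states; the assignment of work to states C1 to C4 below, and the function name, are assumptions for illustration.

```python
def csm_compute_energy(hits, misses, cycles,
                       e_hit, e_miss, e_static_per_cycle):
    """Accumulate total energy with at most one multiplication per state."""
    total = 0
    trace = []
    state = "C1"
    while state is not None:
        trace.append(state)
        if state == "C1":
            total = hits * e_hit
            state = "C2"
        elif state == "C2":
            total += misses * e_miss
            state = "C3"
        elif state == "C3":
            total += cycles * e_static_per_cycle
            state = "C4"
        else:
            # C4 (assumed): result is ready and control returns to the VSM.
            state = None
    return total, trace


energy, visited = csm_compute_energy(10, 2, 100, 3, 5, 1)
print(energy, visited)
```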
Results of the Search Heuristic

- Searches 5.8 configurations on average, compared with 27 configurations
- Finds the optimal configuration in nearly all cases, except
  - The D-cache configuration of pjpeg
  - The D-cache configuration of mpeg2
The Reason for the Inaccuracy

- A larger cache consumes more dynamic and static energy
  - A larger cache is preferable only if the reduction in E_off_chip_mem outweighs the energy increase due to the larger cache
- For mpeg2 with an 8 KB cache, the reduction in E_off_chip_mem is not large enough to overcome the energy added by the larger cache
  - Therefore, the heuristic selects a cache size of 4 KB
- When associativity is considered (increased from 1 way to 2 way), the miss rate of the 8 KB cache is significantly reduced
- The heuristic does not choose the optimal configuration because, when it is determining the best cache size, it does not predict what will happen when associativity is increased
Area and Power of the Tuning Hardware

- The area of the cache tuner is about 4000 gates, or … mm² in 0.18 µm technology
  - An increase in area of just 3% over a MIPS 4Kp with cache
- The power consumption of the cache tuner is 2.69 mW at 200 MHz
  - Only 0.5% of the power consumed by a MIPS processor
- The average energy consumption of the cache tuner
  - Uses 164 cycles to finish one cache configuration
  - The average number of configurations searched is 5.4
  - E_tuner = 2.69 mW * (164 / 200 MHz) * 5.4 = 11.9 nJ
  - The average energy dissipation of the benchmarks is 2.34 J, so the tuner energy is negligible
- Impact of avoiding flushing by careful ordering of the search
  - When the cache size is configured in the order of 8 KB down to 2 KB, the average energy consumption due to writing back dirty data is 5.38 mJ
  - Thus, if we searched the possible cache sizes from largest to smallest, the energy due to cache flushes would be 480,000 times that of the cache tuner
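The tuner energy figure can be reproduced from the numbers on this slide. The snippet below is a plain arithmetic check with hypothetical variable names. It also computes the ratio of the write-back energy to the tuner energy, which comes out in the hundreds of thousands, the same order of magnitude as the figure quoted above.

```python
P_TUNER_W = 2.69e-3          # tuner power: 2.69 mW
CLOCK_HZ = 200e6             # 200 MHz
CYCLES_PER_CONFIG = 164      # cycles to evaluate one configuration
AVG_CONFIGS = 5.4            # average number of configurations searched
FLUSH_WRITEBACK_J = 5.38e-3  # average write-back energy: 5.38 mJ

time_per_config_s = CYCLES_PER_CONFIG / CLOCK_HZ
tuner_energy_j = P_TUNER_W * time_per_config_s * AVG_CONFIGS
flush_to_tuner_ratio = FLUSH_WRITEBACK_J / tuner_energy_j

print(f"{tuner_energy_j * 1e9:.1f} nJ")
print(f"ratio ~ {flush_to_tuner_ratio:.0f}")
```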
Conclusions

- Proposed a self-tuning on-chip CAD method that finds the best configuration automatically
  - Relieves designers of the burden of determining the best configuration
  - Increases the usefulness and acceptance of a configurable cache
- Our cache tuning heuristic
  - Minimizes the number of configurations examined
  - Minimizes the need for cache flushing
  - Reduces memory-access energy by 40% on average, compared to a standard cache