Mathew Paul and Peter Petrov
Proceedings of the IEEE Symposium on Application Specific Processors (SASP '09), July 2009
Abstract

The abundance of wireless connectivity and increasing workload complexity have further underlined the importance of energy efficiency for modern embedded applications. The cache memory is a major contributor to system power consumption, and as such is a primary target for energy-reduction techniques. Recent advances in configurable cache architectures have enabled an entirely new set of approaches for application-driven, energy- and cost-efficient cache resource utilization. We propose a run-time, cross-layer specialization methodology that leverages configurable cache architectures to achieve an energy- and performance-conscious adaptive mapping of instruction cache resources to tasks in dynamic multitasking workloads.
Abstract (cont.)

Sizable leakage and dynamic power reductions are achieved with only a negligible and system-controlled performance impact. The methodology assumes no prior information about the dynamics or structure of the workload. Because the proposed dynamic cache partitioning alleviates the detrimental effects of cache interference, performance stays very close to the baseline case, while 50%-70% reductions in dynamic and leakage power are achieved for the on-chip instruction cache.
What's the Problem

- The cache memory is a major contributor to total dynamic and leakage power
  - Caches occupy up to 50% of die area and 80% of the transistor budget
- How can the configurable cache be customized dynamically to give each task only its required cache volume?
- Goal: reduce power consumption with limited performance degradation
[Figure: normal cache vs. energy-efficient cache for Task0 — performance does not improve noticeably beyond half the cache, so the remaining half can stay idle]
The Proposed Methodology for Dynamic Cache Customization

- Partition the instruction cache and adapt its utilization at run time
  - Cache partitioning: eliminates cache interference
  - Configurable cache: only the required subsection of the cache is active
[Figure: dynamic multitasking workload with Task0, Task1, Task2, and idle periods; only one task is active at a time, e.g. Task2 from t2 to t3]
Functional Overview

- Based on the cache partition formation (initial partition) policy
  - Driven by the cache requirements of each task (detailed later)
  - Example: Task0: 2K 2-way, Task1: 8K 4-way, Task2: 4K 2-way
- Each task is mapped to a subsection of the 16K 4-way baseline cache equal to its required cache size
[Figure: during Task2's execution only its section is active; the rest of the cache is in low-power drowsy mode]
Functional Overview (cont.)

- However, overlapping cache partitions are sometimes inevitable
  - Some tasks may require larger cache partitions than the exclusive space left
  - Overlap reintroduces cache interference, which can push performance beyond the required miss-rate bound
- Such cases are handled through dynamic partition update
  - Overlapped partitions are updated (enlarged) dynamically when performance degrades
[Figure: ideal case — Task0/Task1/Task2 map to exclusive partitions; initial partition — partitions overlap; dynamic partition update — the overlapped partition is enlarged]
Dynamic Cache Customization

Mechanisms required for efficient cache utilization with minimal interference:
- Initial partition formation
  - Identify each task's cache requirement at compile time
  - Uses cache miss statistics local to each task
- Initial partition assignment
  - Assign the initial partition to a task at run time
  - Set the Cache Way Select Register (CWSR) and the mask register to vary the number of active sets
- Dynamic partition update policy
  - Fine-tune the partition size when performance degrades
  - Ensures the miss rate remains within the threshold bounds
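As a rough illustration of how the two registers shape a partition — the register names follow the slide, but the bit layout and the 256-set geometry are assumptions made purely for illustration:

```python
# Illustrative sketch of partition control in a configurable cache.
# The CWSR enables a subset of the ways; the mask register halves the
# number of active sets per mask bit. Bit layout and set count are
# assumed here, not taken from the paper.

def config_registers(active_ways, mask_bits, total_sets=256):
    cwsr = (1 << active_ways) - 1          # enable the lowest `active_ways` ways
    active_sets = total_sets >> mask_bits  # each mask bit halves the sets
    return cwsr, active_sets

# e.g. a 2-way partition over half the sets of a 256-set cache
print(config_registers(2, 1))  # -> (3, 128)
```

Together, the way-enable bits and the set mask carve out a rectangular subsection of the cache; everything outside it can be held in drowsy mode.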
Part 1: Initial Partition Formation

- Identify the cache requirement and determine the initial partition size for each task
- Aim: reduce energy while keeping performance close to the baseline case, i.e., BASE(Ti)
  - BASE(Ti): actual miss rate of task Ti when sharing the baseline cache (with interference) — task-specific and not available at compile time
  - IND_BASE(Ti): miss rate of task Ti when it uses the baseline cache in isolation — used instead
- A Threshold is then defined to account for the cache interference
  - Hence, the miss-rate bound for a task is IND_BASE(Ti) + Threshold
- The starting cache configuration Pi,j is picked such that
  - MISS(Pi,j) ≤ IND_BASE(Ti) + Threshold
[Figure: tasks task0-task4 sharing the baseline cache]
Part 1: Initial Partition Formation — Example

- MCS (Miss-rate / Cache Space) table: cache miss statistics for each cache configuration
  - Obtained through profiling
- Find the minimal cache configuration satisfying MISS(Pi,j) ≤ IND_BASE(Ti) + Threshold, with Threshold = 0.1%
  - Task0 (G721): IND_BASE = 0%, starting configuration 8K 2-way
  - Task1 (LAME): IND_BASE = 0.15%, starting configuration 4K 4-way
  - Task2 (GSM): IND_BASE = 0.17%, starting configuration 8K 2-way
[Table: MCS table of miss rate vs. cache size (512B-8K) and number of ways for each task]
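The selection rule can be sketched as follows; the MCS miss-rate numbers below are made up for illustration, not the paper's profiled values:

```python
# Sketch of initial partition formation. The miss rates in this toy
# MCS table are invented; only the selection rule follows the slides.

# MCS table: (size_bytes, ways) -> profiled miss rate (%)
MCS = {
    "G721": {(2048, 2): 0.90, (4096, 2): 0.45, (8192, 2): 0.05},
    "LAME": {(2048, 2): 1.20, (4096, 4): 0.14, (8192, 4): 0.10},
}
IND_BASE = {"G721": 0.00, "LAME": 0.15}  # isolated baseline miss rates (%)
THRESHOLD = 0.1                          # allowed miss-rate impact (%)

def initial_partition(task):
    """Pick the smallest configuration whose profiled miss rate
    stays within IND_BASE(Ti) + Threshold."""
    bound = IND_BASE[task] + THRESHOLD
    feasible = [cfg for cfg, miss in MCS[task].items() if miss <= bound]
    # smallest total size first, then fewest ways
    return min(feasible, key=lambda cfg: (cfg[0], cfg[1]))

print(initial_partition("G721"))  # -> (8192, 2)
print(initial_partition("LAME"))  # -> (4096, 4)
```

Because the bound uses IND_BASE rather than the unavailable BASE, the Threshold is the only knob absorbing whatever interference the run-time partitioning fails to eliminate.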
Part 2: Initial Partition Assignment

- Assign the initial partition to a task at run time
  - Set the control (CWSR) and mask registers of the configurable cache
- Partitions are assigned exclusively of each other whenever possible
  - Not always possible: the total cache requirement of G721, LAME, and GSM is 20K, but only 16K is available
- Example timeline:
  - At t0: allocate 8K 2-way to G721
  - At t1: allocate 4K 4-way to LAME (cannot be exclusive of G721, so overlap is allowed)
  - At t2: allocate 8K 2-way to GSM (with a small portion of it also being used by LAME)
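A much-simplified sketch of this policy, ignoring way geometry and treating the 16K cache as sixteen 1KB sections; the names and the choice of which resident task gets overlapped are assumptions:

```python
# Simplified sketch of initial partition assignment: prefer free
# sections, and overlap the most recently allocated sections only
# when the free space is insufficient. The 1KB-section model is an
# illustration, not the paper's way/set mechanism.

CACHE_KB = 16

def assign(requested_kb, allocations):
    """allocations: task -> set of 1KB section indices already in use."""
    used = set().union(*allocations.values()) if allocations else set()
    free = [s for s in range(CACHE_KB) if s not in used]
    if len(free) >= requested_kb:
        return set(free[:requested_kb])              # exclusive partition
    shortfall = requested_kb - len(free)             # must overlap
    return set(free) | set(sorted(used, reverse=True)[:shortfall])

allocs = {}
allocs["G721"] = assign(8, allocs)   # t0: 8K
allocs["LAME"] = assign(4, allocs)   # t1: 4K
allocs["GSM"]  = assign(8, allocs)   # t2: only 4K free, 4K overlaps LAME
print(len(allocs["GSM"] & allocs["LAME"]))  # -> 4
```

In the real hardware the overlap pattern is dictated by which ways and set ranges the CWSR and mask register can express, not by a free-list like this one.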
Part 3: Dynamic Partition Update

- Overlapping partitions cannot always be prevented
  - Interference may push miss rates beyond the bound
- Trigger: a hardware miss counter inside the CPU detects when the observed miss rate exceeds IND_BASE(Ti) + Threshold
- Action: partition rescaling — enlarge the partition until the miss rate falls back under the miss-rate bound
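The trigger condition amounts to a periodic comparison of the sampled miss counter against the per-task bound; the function name and sampling interface below are assumptions:

```python
# Sketch of the dynamic-update trigger: compare the observed miss rate
# against the per-task bound IND_BASE(Ti) + Threshold (all in percent).

def needs_rescaling(misses, accesses, ind_base, threshold=0.1):
    miss_rate = 100.0 * misses / accesses
    return miss_rate > ind_base + threshold

# LAME: IND_BASE = 0.15%, so the bound is 0.25%
print(needs_rescaling(300, 100_000, ind_base=0.15))  # 0.30% -> True
print(needs_rescaling(200, 100_000, ind_base=0.15))  # 0.20% -> False
```

Only when this check fires does the system pay the cost of reconfiguring the partition, so tasks that stay within their bound keep their small, low-power configuration.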
Part 3: Dynamic Partition Update — Example

- Partition rescaling trades off power savings for meeting the performance requirement
- For LAME, the miss-rate bound IND_BASE(T1) + Threshold = 0.25% is exceeded in the overlapped region
  - The next configuration after 4K 4-way with a miss rate below 0.25% is 6K 3-way
- GSM is then rescaled to 12K 3-way due to increased overlap with the rescaled LAME partition
Part 3: Dynamic Partition Update — Example: Reshuffling

- Partition reshuffling: applied when a task leaves on completing execution
  - Its cache resources are freed up and become available to the currently executing tasks
  - A previously rescaled partition is considered for reshuffling: the task's starting configuration can now be allocated completely, without overlap
- Example: at t4, both G721 and GSM complete and only LAME is left executing
  - LAME is reshuffled back to its starting configuration (reverting to the smaller partition reduces power)
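Reshuffling on task completion can be sketched like this, reusing the simplified 1KB-section model from earlier (structure names are assumptions):

```python
# Sketch of partition reshuffling: when completed tasks have freed
# enough cache, revert a previously rescaled task to its smaller
# starting configuration (less active cache means lower power).

CACHE_KB = 16

def reshuffle(task, allocations, starting_kb):
    """Revert `task` to its starting size if it now fits without overlap."""
    others = set().union(*[v for t, v in allocations.items() if t != task])
    free = [s for s in range(CACHE_KB) if s not in others]
    if len(free) >= starting_kb[task]:
        allocations[task] = set(free[:starting_kb[task]])

# t4: G721 and GSM have completed; LAME had been rescaled to 6K
allocs = {"LAME": set(range(6))}
reshuffle("LAME", allocs, {"LAME": 4})
print(len(allocs["LAME"]))  # -> 4 (back to the 4K starting configuration)
```

Note the asymmetry with rescaling: rescaling grows a partition to protect performance, while reshuffling shrinks it back to reclaim the power savings once interference is gone.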
Experiment Setup

- Cache configurations found in high-end embedded processors (Intel XScale and ARM9): 16K 4-way and 32K 4-way
- Scheduling policy to model multitasking: round-robin with a context-switch frequency of 33K instructions
- Miss-rate impact threshold: 0.1%
- Two categories of benchmarks evaluated:
  - Static benchmarks: all tasks start and finish at the same time
  - Dynamic benchmarks: tasks enter and leave the workload at different times
[Figure: structure of the dynamic benchmarks]
Miss-Rate Impact: Miss-Rate Increase Compared to the Baseline Cache

- Partitioning: initial partition assignment only
- Rescaling: partitioning + rescaling
- Reshuffling: partitioning + rescaling + reshuffling
- For some configurations, rescaling and reshuffling are omitted
  - The miss-rate impact is already within the threshold after the initial assignment
- After rescaling, the miss-rate impact is reduced
[Chart: miss-rate impact per benchmark; lower is better]
Individual Task Miss-Rates for the 16K Cache (BM_3)

- GSM is subjected to rescaling
  - Its miss-rate bound of 0.27% is exceeded due to interference in the overlapped region
- Partition reshuffling maximizes the power reduction
- Power reduction is achieved while keeping the miss-rate impact below the threshold
- Some tasks even achieve lower miss rates than in the baseline case, i.e., improved performance
[Chart: per-task miss rates; lower is better]
[Figure: five tasks (task0-task4) thrashing in a shared cache]