1
A One-Shot Configurable-Cache Tuner for Improved Energy and Performance
Ann Gordon-Ross 1, Pablo Viana 2, Frank Vahid 1, Walid Najjar 1, and Edna Barros 3
1 Dept of Computer Science & Engineering - University of California, Riverside, USA
2 Campus Arapiraca - Federal University of Alagoas, Brazil
3 Centro de Informática - Federal University of Pernambuco, Brazil
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation
2
Ann Gordon-Ross, Univ of Ca, Riverside
Introduction
Memory access: 50% of an embedded processor's system power
Caches are power hungry
– ARM920T (Segars 01)
– M*CORE (Lee/Moyer/Arends 99)
Thus, caches are a good candidate for optimization
[Figure: processor with L1 instruction cache, L1 data cache, and main memory; chart label: 53%]
3
Introduction
Different applications have vastly different cache requirements
– Total size, line size, and associativity
Cache parameters that don't match an application's behavior can waste over 60% of energy (Gordon-Ross 05)
Cache tuning is the process of determining the appropriate cache parameters for an application
[Figure: example configurations: 4 KB with 16-byte lines, 2-way; 2 KB with 32-byte lines, direct-mapped; 8 KB with 64-byte lines, 4-way]
4
Runtime Cache Tuning
The best cache configuration can be determined by searching the design space during runtime
Runtime cache tuning is transparent to the designer and end user, but incurs runtime overhead in terms of energy and performance
[Figure: an application is downloaded to a system with a tunable cache (TC) and tuning hardware; energy plot labels: executing in base configuration, cache tuning]
5
Contribution
We introduce specialized hardware for non-intrusive runtime cache evaluation
– Temporary energy overhead and no performance overhead
Single-pass multi-cache evaluation (SPCE)
– Special hardware simultaneously evaluates all cache configurations
– Enables switching to the best configuration in one shot
[Figure: downloaded application, tunable cache (TC), and SPCE; energy plot labels: executing in base configuration; SPCE causes an increase in energy but no performance overhead; switch to best config in "one-shot"]
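Once hit counts for every configuration are available, the "one-shot" switch reduces to a single selection step. The following is a minimal sketch of that step only. The function names, the configuration encoding as (number of lines, words per line), the hit counts, and every constant in `toy_energy` are hypothetical placeholders for illustration; they are not the energy model or measurements from this work.

```python
def pick_best_config(hit_counts, total_accesses, energy_of):
    """Return (config, energy) with the lowest estimated energy.

    hit_counts: {config: hits}, e.g. gathered in one pass over the trace.
    energy_of:  caller-supplied model, energy_of(config, hits, misses).
    """
    best, best_energy = None, float("inf")
    for config, hits in hit_counts.items():
        energy = energy_of(config, hits, total_accesses - hits)
        if energy < best_energy:
            best, best_energy = config, energy
    return best, best_energy


def toy_energy(config, hits, misses):
    # HYPOTHETICAL model: per-access cost grows with cache size,
    # and each miss pays a fixed penalty. All constants are made up.
    num_lines, words_per_line = config
    per_access = 1.0 + 0.1 * num_lines * words_per_line
    miss_penalty = 20.0
    return (hits + misses) * per_access + misses * miss_penalty


# Illustrative hit counts for a seven-access toy trace.
hit_counts = {(2, 1): 1, (3, 1): 4, (1, 2): 3, (2, 2): 5, (1, 4): 6}
best, energy = pick_best_config(hit_counts, 7, toy_energy)
print(best)
```

Passing the energy model in as a function keeps the selection logic separate from whatever model a real tuner would use.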
6
SPCE Key Points
Contributions compared to previous methods
Evaluates a highly configurable cache
– Previous methods offer little configurability
Little hardware overhead
– Simple data structures
– Elementary operations
7
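SPCE
Monitors the address stream to extract cache hit information for all configurations
Fully-associative cache example (64-bit architecture, 8-byte words)
Address stream: t0 = 0, t1 = 8, t2 = 16, t3 = 0, t4 = 8, t5 = 0, t6 = 16
Table (stored hit info): indexed by b (line size of 2^b words, b = 0 to 2) and d (number of lines, d = 1 to 8), giving 24 different configs
For each line size, each address is reduced to a block address, address / (2^b * 8). When a block address repeats, the number of conflicts (distinct blocks accessed since its previous access) determines the cache sizes that would result in a hit: the access hits in any cache with at least d = conflicts + 1 lines
– Line size 2^0 words: t0 = 0, t1 = 1, t2 = 2, t3 = 0, t4 = 1, t5 = 0, t6 = 2; hits at t3 (d = 3), t4 (d = 3), t5 (d = 2), t6 (d = 3); table entries: d = 2: 1, d = 3: 3
– Line size 2^1 words: t0 = 0, t1 = 0, t2 = 1, t3 = 0, t4 = 0, t5 = 0, t6 = 1; hits at t1 (d = 1), t3 (d = 2), t4 (d = 1), t5 (d = 1), t6 (d = 2); table entries: d = 1: 3, d = 2: 2
– Line size 2^2 words: t0 through t6 all map to block 0; hits at t1 through t6 (d = 1); table entry: d = 1: 6
A cache with 2 lines of 2^1 words per line (32 bytes) will have 5 hits and 7 - 5 = 2 misses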
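The fully-associative example can be reproduced in software with an LRU stack per line size. This is a minimal sketch that assumes LRU replacement and 8-byte words; the function names `spce_table` and `hits_for` are illustrative and do not describe the authors' hardware design.

```python
from collections import defaultdict

WORD_BYTES = 8  # 64-bit architecture, as in the slide example


def spce_table(addresses, line_size_exps=(0, 1, 2)):
    """Build table[b][d]: accesses whose smallest hitting cache has d lines."""
    table = {}
    for b in line_size_exps:
        line_bytes = (2 ** b) * WORD_BYTES
        stack = []  # block addresses, most recently used first
        counts = defaultdict(int)
        for addr in addresses:
            block = addr // line_bytes
            if block in stack:
                # Stack position = number of conflicts; it hits with
                # at least conflicts + 1 lines.
                counts[stack.index(block) + 1] += 1
                stack.remove(block)
            stack.insert(0, block)
        table[b] = dict(counts)
    return table


def hits_for(table, b, num_lines):
    """Hits for a fully-associative cache of num_lines lines of 2**b words."""
    return sum(c for d, c in table[b].items() if d <= num_lines)


stream = [0, 8, 16, 0, 8, 0, 16]
table = spce_table(stream)
print(hits_for(table, b=1, num_lines=2))  # 5 hits, so 7 - 5 = 2 misses
```

A single pass over the trace fills the whole table, so hit counts for any number of lines can be read off afterwards without replaying the trace.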
8
SPCE
SPCE determines hits for other set-associativities by counting the number of unique conflicts in the address trace
Tables (multiple layers): direct-mapped, 2-way, 4-way
Table (stored hit info): indexed by b (line size of 2^b words, b = 0 to 2) and s (number of sets, s = 1 to 8)
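One way to read the conflict-counting idea for set-associative caches is to keep one LRU stack per set and record, for each repeated block, how many distinct blocks mapping to the same set were accessed in between. The sketch below assumes LRU replacement and modulo set indexing; `conflict_table` and `set_assoc_hits` are hypothetical names, and this is a software illustration, not the layered table hardware itself.

```python
from collections import defaultdict


def conflict_table(addresses, line_bytes, set_counts=(1, 2, 4, 8)):
    """Build table[s][c]: repeated accesses with c unique same-set conflicts."""
    table = {}
    for num_sets in set_counts:
        stacks = defaultdict(list)  # one LRU stack per set index
        counts = defaultdict(int)
        for addr in addresses:
            block = addr // line_bytes
            stack = stacks[block % num_sets]  # assumed modulo indexing
            if block in stack:
                counts[stack.index(block)] += 1  # unique conflicts so far
                stack.remove(block)
            stack.insert(0, block)
        table[num_sets] = dict(counts)
    return table


def set_assoc_hits(table, num_sets, ways):
    """An access hits when fewer than `ways` unique blocks conflicted with it."""
    return sum(n for c, n in table[num_sets].items() if c < ways)


stream = [0, 8, 16, 0, 8, 0, 16]
conflicts = conflict_table(stream, line_bytes=8)
print(set_assoc_hits(conflicts, num_sets=2, ways=1))  # direct-mapped, 2 sets
print(set_assoc_hits(conflicts, num_sets=2, ways=2))  # 2-way, 2 sets
```

Each associativity reads the same conflict counts with a different threshold, which is one plausible interpretation of the direct-mapped, 2-way, and 4-way layers.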
9
SPCE - Hardware (stack)
Designed and evaluated in synthesizable VHDL
10
Results - Energy Savings
Energy savings compared to exploring the design space using a state-of-the-art intrusive heuristic (Zhang 03)
– Values less than 1 denote an energy increase
4.6x less energy expended
11
Results - Tuning Speedup
Tuning speedup obtained compared to a state-of-the-art intrusive heuristic
7.7x faster
12
Overheads
Evaluated SPCE compared to the ARM920T
Area: 12% area overhead
– Due in large part to the TCAM stack structure
Power: temporary 2.2x increase in power during the short tuning cycle
– The application need only iterate 4 times for the average power overhead to reduce to 1%
13
Conclusions
SPCE is a specialized hardware structure to evaluate all cache configurations simultaneously
– Enables non-intrusive runtime cache evaluation
– Enables switching to the best cache configuration in one shot
Compared to a state-of-the-art intrusive cache tuning heuristic
– 4.6x less energy expended
– 7.7x speedup in tuning time
12% area overhead compared to the ARM920T
Temporary 2.2x increase in power during the short tuning time
– Only 4 application iterations to recoup power