
1 of 45: Hot Caches, Cool Techniques: Online Tuning of Highly Configurable Caches for Reduced Energy Consumption
Ann Gordon-Ross, Department of Computer Science and Engineering, University of California, Riverside
Frank Vahid, PhD Advisor
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.

2 of 45: Introduction
- Much research is devoted to reducing power consumption in mobile embedded devices:
  - Increased battery life
  - Decreased cooling requirements

3 of 45: Introduction
- The cache hierarchy consumes a large fraction of system power
- Configurable caches can reduce power consumption, but configuring (tuning) the cache is very difficult: many parameters lead to a very large design space
- This talk describes research on quickly tuning highly configurable caches:
  - Efficient heuristics for increasingly complex configurable cache hierarchies
  - A feedback-control system for online cache tuning

4 of 45: Cache Power Consumption
- Memory accesses account for roughly 50% of an embedded processor's system power
- Caches are power hungry: over half of system power (53% in the example shown) in processors such as the ARM920T (Segars '01) and Motorola M*CORE (Lee/Moyer/Arends '99)
- Thus, caches are a good candidate for optimization
[Diagram: Processor, L1 Cache, L2 Cache, Main Memory]

5 of 45: Reducing Cache Energy Consumption
- Research shows that different applications have different cache requirements (Zhang '04)
- Depending on its working set, an application may require different values for cache parameters: total size, line size (block size), and associativity
- Cache parameters that do not match an application's behavior can waste over 40% of cache energy (Balasubramonian '00, Zhang '03)

6 of 45: Excess Cache Energy Consumption
Size:
- Too large (exceeds the working set): excess fetch and static energy
- Too small: excess thrashing energy; stall cycles waiting on the next level of memory translate into excess energy
Line size:
- Too large: excess fetch energy spent on fetched but unused data
- Too small: excess stall energy from extra fetches to the next level of memory; stall cycles translate into excess energy
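The size trade-off above can be sketched with a toy energy model: total cache energy is dynamic access energy plus miss (thrash/stall) energy plus static leakage. All constants below are hypothetical, for illustration only; the real numbers in this work come from simulation.

```python
def cache_energy(accesses, misses, e_access, e_miss, cycles, p_static):
    """Toy cache energy model: dynamic fetch energy + miss penalty energy
    + static (leakage) energy accumulated over all execution cycles."""
    return accesses * e_access + misses * e_miss + cycles * p_static

# A larger cache misses less but costs more per access and leaks more;
# whether it wins depends on the application's working set.
small = cache_energy(accesses=1_000_000, misses=80_000, e_access=0.5,
                     e_miss=20.0, cycles=2_000_000, p_static=0.01)
large = cache_energy(accesses=1_000_000, misses=5_000, e_access=1.2,
                     e_miss=20.0, cycles=1_200_000, p_static=0.04)
```

With these made-up constants the larger cache comes out ahead because the miss energy it avoids outweighs its higher per-access and static costs; different constants flip the outcome, which is exactly why tuning is needed.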

7 of 45: Excess Cache Energy Consumption
Associativity:
- Too high: excess fetch energy per access, spent checking unused ways
- Too low: excess miss energy and decreased performance
Configurable caches allow cache parameter values to be varied, or tuned, specializing the cache to the needs of an application.

8 of 45: Configurable Caches
- Soft cores allow designer-specified cache parameters (ARM, MIPS, Tensilica)
- Flow: processor HDL with a specialized cache is fabricated into a chip with that specialized cache

9 of 45: Configurable Caches
- Even hard processors contain configurable caches (Motorola M*CORE: Malik ISLPED '00; Albonesi MICRO '00; Zhang ISCA '03)
- Specialized software instructions can change cache parameters
- Specialized hardware (tuning hardware plus a tunable cache) enables the cache to be configured at startup or in-system during runtime
- Example, starting from an 8 KB, 4-way base cache built from 2 KB ways:
  - Way concatenation: 8 KB 2-way, or 8 KB direct-mapped
  - Way shutdown: 4 KB 2-way, or 2 KB direct-mapped
  - Configurable line size on a 16-byte physical line size

10 of 45: Cache Tuning
- Configurable caches are relatively new: designers are provided with configurable caches but are not told how to determine the best cache configuration
- Cache tuning is the process of determining appropriate cache parameter values for an application
- Cache tuning is very difficult: hundreds to tens of thousands of different configurations

11 of 45: Cache Tuning Difficulties
Simulation method:
- Simulate the microprocessor, L1 cache, L2 cache, and main memory for each possible cache configuration, then choose the lowest-energy configuration
- Realistic input stimulus is difficult to model
- A few seconds of real execution may take days or weeks to simulate
Prediction method:
- Examine the code to predict a chosen configuration

12 of 45: Cache Tuning Difficulties
Runtime tuning:
- After application download and system startup, tuning hardware explores configurations of the tunable cache
- Exhaustive exploration can unnecessarily extend this high-energy tuning time
- Runtime tuning allows adaptation to new software and new operating environments

13 of 45: Cache Tuning Difficulties
- Heuristic tuning method: instead of exhaustively evaluating the full design space (hundreds to tens of thousands of configurations), a heuristic examines a small subset to find a low-energy configuration; this applies to both simulation-based and runtime-based approaches
- Existing heuristics do not address the complexities of tuning a highly configurable cache consisting of tens of thousands of different configurations

14 of 45: Outline
- Develop an efficient tuning heuristic for a highly configurable two-level cache hierarchy
  - Developed in a simulation-based environment, but applicable to a dynamic tuning environment
  - 62% energy savings on average
- Current research: a feedback-control system for online cache tuning

15 of 45: Challenge for Two-Level Cache Tuning Heuristic Development
- Current methods tune a single-level L1 (size, line size, associativity): tens of configurations
- Our goal is two-level tuning; with roughly 30 configurations per cache (size, line size, associativity for each of the I and D caches at each level):
  - Separate L2 instruction and data caches: 30*30 + 30*30 = 1,800 configurations
  - Unified second level of cache: 30*30*30 = 27,000 configurations

16 of 45: Single Level Tuning Heuristic
- Zhang's configurable cache: 18 configurations per independently tuned L1 cache (I$ and D$, each with a tuner)
- Our extended configurable cache: 216 configurations per cache hierarchy (L1 and L2, with a tuning dependency between levels)
- Impact-ordered heuristics have been shown effective in previous tuning efforts (Zhang '03): tune parameters in order of energy impact, highest impact first, i.e., vary each parameter while holding the others fixed and measure the change
- Impact order for a cache: 1. total size, 2. line size, 3. associativity
- Search parameter values from smallest to largest, to minimize flushing in a dynamic environment
- Tune the instruction cache, then tune the data cache
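The impact-ordered search above can be sketched as a greedy sweep: for each parameter in impact order, try values from smallest to largest and keep a value only while energy keeps dropping. The energy function and parameter value lists below are illustrative stand-ins for the simulator, not the values used in the talk.

```python
def tune_cache(energy_of, sizes, line_sizes, assocs):
    """Impact-ordered greedy tuning (a sketch after Zhang '03): fix the
    parameter order by energy impact (total size, then line size, then
    associativity); for each parameter, sweep values smallest-to-largest
    and stop as soon as energy rises."""
    cfg = {"size": sizes[0], "line": line_sizes[0], "assoc": assocs[0]}
    for param, values in (("size", sizes), ("line", line_sizes), ("assoc", assocs)):
        best = energy_of(cfg)
        for v in values[1:]:
            trial = dict(cfg, **{param: v})
            e = energy_of(trial)
            if e >= best:
                break          # energy rose: keep the previous value, move on
            best, cfg = e, trial
    return cfg

# Hypothetical energy model with a known best point at (8192, 32, 1),
# just to exercise the sweep.
best_cfg = tune_cache(
    lambda c: abs(c["size"] - 8192) / 1024 + abs(c["line"] - 32) / 16 + c["assoc"],
    sizes=[2048, 4096, 8192, 16384], line_sizes=[16, 32, 64], assocs=[1, 2, 4])
```

In a runtime setting, each `energy_of` evaluation corresponds to running the application for an interval in that configuration rather than calling a model.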

17 of 45: First Heuristic: Tune Levels One at a Time
- Tune each cache using the impact-ordered heuristic for one-level cache tuning
- Tune L1 first, then L2
- Initial L2: 64 KB, 4-way, 64-byte line size
- For the best L1 configuration found, tune the L2 cache

18 of 45: Results of First Heuristic
- Base cache configuration: L1 8 KB, 4-way, 32-byte line size; L2 64 KB, 4-way, 64-byte line size
- Energy consumption is normalized to the base cache configuration
- The heuristic achieved only 32% savings versus the 53% possible, and was sometimes worse than the base cache

19 of 45: Interlacing Heuristic
- The first heuristic did not find the optimal configuration in most cases, and was sometimes 200% or 300% worse
- Conclusion: the two levels should not be explored separately
  - Too much interdependence between L1 and L2 cache parameters, which Zhang's method does not address
  - L2 cache performance depends on how much, and what, misses in the L1 cache
- To more fully explore the dependencies between the two levels, we interlaced the exploration of the level-one and level-two caches:
  1. Tune L1 size, 2. tune L2 size, 3. tune L1 line size, 4. tune L2 line size, 5. tune L1 associativity, 6. tune L2 associativity (for the instruction hierarchy; then do the same for the data hierarchy)
- Interlacing performed better than the initial heuristic, but there was still much room for improvement
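The interlaced ordering above can be sketched as the same greedy sweep, but alternating between levels within each parameter. The configuration encoding and the toy energy function are hypothetical; only the (level, parameter) ordering reflects the heuristic described here.

```python
# Interlaced order: alternate cache levels within each parameter,
# still in impact order (size, then line size, then associativity).
INTERLACED_ORDER = [("L1", "size"), ("L2", "size"),
                    ("L1", "line"), ("L2", "line"),
                    ("L1", "assoc"), ("L2", "assoc")]

def tune_interlaced(energy_of, values, cfg):
    """Greedy interlaced tuning (sketch): sweep each (level, parameter)
    pair in INTERLACED_ORDER from its current value upward, keeping a
    new value only while energy keeps dropping."""
    for level, param in INTERLACED_ORDER:
        best = energy_of(cfg)
        for v in values[param]:
            if v <= cfg[level][param]:
                continue  # search smallest-to-largest, past the current value
            trial = {lv: dict(p) for lv, p in cfg.items()}  # deep-ish copy
            trial[level][param] = v
            e = energy_of(trial)
            if e >= best:
                break
            best, cfg = e, trial
    return cfg
```

Because L1 and L2 sizes are settled before any line size is touched, the strong size interdependence between the levels is explored early instead of being frozen by a fully tuned L1.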

20 of 45: Final Heuristic: Interlaced with Local Search
- Some cases were still sub-optimal; we examined them manually
- The cause was a limitation of the configurable cache architecture: certain associativities were not possible for some sizes
- A small local search overcomes this limitation
- Final heuristic: the Two-Level Cache Tuner (TCaT)

21 of 45: TCaT Results
- Energy consumption normalized to the base cache configuration
- 53% energy savings: near optimal

22 of 45: Extending the TCaT: Exploring a Unified Second Level of Cache
- Unified second-level caches are standard in desktop computers and are becoming increasingly popular in embedded microprocessors
- Current cache tuning heuristics do not directly apply because of an added circular dependency: a change in any cache affects the performance of every other cache in the hierarchy

23 of 45: Level Two Cache Configurability
- For maximum configurability, the level-two cache uses Motorola M*CORE-style way management: in contrast to a traditional 4-way unified L2, each way can be designated as an instruction way (I-way), data way (D-way), or unified way (U-way)
- The L2 cache also offers the same line size configurability as the L1 caches
- The design space explodes to 18,000 configurations

24 of 45: Alternating Cache Exploration with Additive Way Tuning (ACE-AWT)
- Tune level-one sizes (I, then D), then the level-two size
- Tune level-one line sizes (I, then D), then the level-two line size
- Tune level-one associativities (I, then D), then the level-two associativity
- The level-two size and associativity steps are difficult because changing size and changing associativity are synonymous in a way-management-style cache

25 of 45: Way Management
[Diagram: in a way-managed cache, increasing L2 size means adding a way, e.g., an 8 KB 1-way cache becomes a 16 KB 2-way cache; decreasing L2 associativity at the same size means merging way designations, e.g., a 24 KB 3-way cache becomes a 16 KB 2-way cache]

26 of 45: ACE-AWT First Phase: L2 Size Exploration
- Start with an empty L2 cache
- Add one way of each type (I-way, D-way, U-way) to the current L2 configuration, producing 3 candidate configurations
- Simulate each candidate and select the minimum-energy one
- If energy decreased, the minimum-energy candidate becomes the current L2 configuration and exploration continues; if energy increased, or the cache has reached its maximum size, this phase is done
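The size-exploration phase above can be sketched as a greedy loop over way counts. Representing an L2 configuration as counts of I-, D-, and U-ways is a simplification of the real configurable cache, and the energy function here is a stand-in for simulation.

```python
def awt_size_phase(energy_of, max_ways=4):
    """ACE-AWT size phase (sketch): start from an empty L2 and repeatedly
    add the single I-, D-, or U-way whose addition lowers energy most,
    stopping when no addition helps or the cache reaches its maximum size."""
    cfg = {"I": 0, "D": 0, "U": 0}
    best = energy_of(cfg)
    while sum(cfg.values()) < max_ways:
        # Add one of each way type, giving 3 candidate configurations.
        candidates = []
        for way in ("I", "D", "U"):
            trial = dict(cfg)
            trial[way] += 1
            candidates.append((energy_of(trial), way, trial))
        e, _, trial = min(candidates)
        if e >= best:
            break              # energy increased: keep the current config
        best, cfg = e, trial   # energy decreased: grow the cache and continue
    return cfg
```

The fine-tuning (associativity) phase works the same way, except that it also tries removing one way of each type, yielding up to 6 candidates per step.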

27 of 45: ACE-AWT Fine-Tuning Phase: Associativity Exploration
- Start with the current cache configuration
- Size and availability permitting, try three way additions and three way removals (one I-way, D-way, and U-way each), producing up to 6 candidate configurations
- Simulate each candidate and select the minimum-energy one
- If energy decreased, continue from the minimum-energy candidate; if energy increased, or there is no new configuration to explore, this phase is done

29 of 45: Results
- The heuristic achieved near-optimal results (where the optimal could be computed)
- 62% energy savings compared to the base cache, while searching only 0.2% of the search space
- Key to these heuristics: combining a proven space-pruning method (impact ordering of parameters) with architecture-specific knowledge yields highly efficient and effective results

30 of 45: Outline
- Developed an efficient tuning heuristic for a highly configurable two-level cache hierarchy: 62% energy savings on average
- Current research: a feedback-control system for online cache tuning

31 of 45: Online Cache Tuning
- Reconfigure the cache dynamically to adapt to different phases of program execution, or to different applications in a multi-application environment
- Phase-tuned configurations reduce energy below a single application-tuned configuration, which in turn beats the base cache

32 of 45: Online Cache Tuning Challenges
- Need a good tuning interval: the time between invocations of the tuning hardware
- It should closely match the phase interval: the length of time the system executes between phase changes
- Tuning interval too short: excess tuning energy
- Tuning interval too long: wasted energy executing in a suboptimal configuration

33 of 45: Previous Online Cache Tuning
- Largely ad hoc: a fixed tuning interval, inspecting counters and adjusting the cache
- Searched very small configuration spaces (around 4 configurations) to limit tuning overhead
- Adjusted tuning thresholds, but did not analyze the chosen tuning interval
- None attempted to tune the tuning interval itself

34 of 45: Periodic System
- Phase interval fixed at 10 million cycles
- Tuning interval too short: 32% energy savings, but severely penalized if the phase interval is not precisely followed
- Tuning interval too long: 28% energy savings; the penalty is acceptable
- Goal: the tuning interval should be 1/2 of the phase interval

35 of 45: Online Algorithms
- Must determine the tuning interval while the system is executing
- Online algorithms process data piecemeal and cannot view the entire dataset
- The online tuner must determine the tuning interval from current and past events, with no knowledge of the future

36 of 45: Feedback Control System
- Components: a plant (the system under control); a set-point (the goal) supplied as reference input r_t; an error detector comparing the sensed output against r_t to produce the measured error; a controller computing the plant input u_t = F(x_t); an actuator manipulating the plant; and a sensor measuring the plant output, subject to disturbances
- Difficulty: set-points are typically fixed values; we want minimization of energy, which makes developing the control system much more difficult

37 of 45: Online Cache Tuner
- Goal: adjust the tuning interval to match the phase interval
- Observe the change in energy due to tuning: compare energy before and after tuning
- If there is a change, the tuning interval is too long: we missed a phase change
- If there is no change, the tuning interval is too short

38 of 45: Online Cache Tuner as a Feedback Control System
- Plant: the microprocessor and its cache
- Set-point: minimize energy
- Sensor: an energy model driven by the miss rate; previous energy is stored to detect phase changes, yielding the measured error %ΔE
- Controller: activates the cache tuner on each tuning interval
- Actuator: the cache tuner

39 of 45: Controller Logic
- Based on an attack/decay online algorithm: increase the tuning interval slowly to avoid overshooting; decrease it quickly to avoid wasted energy
- Draws on fuzzy logic to stabilize the tuning interval: change the tuning interval based on how close or far the system is from being stable
- Implemented as a two-part equation

40 of 45: Controller Logic
- If %ΔE < PoS (the point of stability), the energy change is small: the tuner runs too frequently, so increase the tuning interval (∆TI) by U
- If %ΔE >= PoS, the energy change is large: the tuner runs too infrequently, so decrease the tuning interval by D
- %ΔE is averaged over the last W measurements to eliminate erratic behavior
- U, D, PoS, and W are determined through experimentation
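The two-part update above can be sketched as follows. The specific values of `u`, `d`, `pos`, and the clamping bounds are hypothetical placeholders; the talk determines the real constants experimentally, and multiplicative updates are one plausible reading of "increase by U / decrease by D".

```python
def next_tuning_interval(ti, pct_dE, pos=0.05, u=1.25, d=0.5,
                         ti_min=1_000, ti_max=100_000_000):
    """Attack/decay controller step (sketch). pct_dE is the windowed average
    |%dE| over the last W measurements. A small energy change means tuning
    happens too often, so the interval grows slowly (decay, factor u); a
    large change means a phase was missed, so it shrinks quickly (attack,
    factor d). The result is clamped to sane bounds."""
    if pct_dE < pos:
        ti = ti * u    # stable: lengthen slowly to avoid overshooting
    else:
        ti = ti * d    # unstable: shorten quickly to limit wasted energy
    return max(ti_min, min(ti_max, ti))
```

Called once per tuning invocation, this drives the interval toward oscillating near the stable point rather than converging to a fixed value, matching the tracking behavior on the next slide.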

41 of 45: Tracking Interval Length Over Time
- The tuning interval oscillates near 1/2 of the phase interval

42 of 45: Online Cache Tuner Energy Savings
- 29% energy savings (energy normalized to the base cache), within 8% of optimal
- Similar results were observed for less periodic systems

43 of 45: Conclusions
- Developed a very efficient cache tuning heuristic for a highly configurable cache offering 18,000 different configurations: 62% energy savings in the cache hierarchy while searching only 0.2% of the search space
  - Key: combining an efficient heuristic method with knowledge of architecture features
- Developed a feedback control system for online cache tuning: 29% energy savings on average, within 8% of optimal
  - Key: applying control theory to online cache tuning
- Continuing work targets less periodic (more random) systems

44 of 45: Future Work
- Dynamic optimizations in a multi-core environment:
  - Cache hierarchy tuning, where some levels may be shared
  - Dynamic load distribution
  - Dynamic per-core shutdown or voltage reduction for reduced power consumption
  - Many single-core optimizations apply non-trivially to a multi-core environment
- Dynamic tuning enables energy savings with no extra designer effort: suitable for standard-binary situations, changing operating environments, etc.
- Other multi-core issues:
  - Ease development for multi-core systems: the designer writes an application without multi-core specialization, and it is transparently mapped to the multi-core system
  - Architectural support for debugging, e.g., shared resources

45 of 45: Publications
Journal Papers:
- Frequent Loop Detection Using Non-Intrusive On-Chip Hardware. A. Gordon-Ross, F. Vahid. IEEE Transactions on Computers, special issue (Best of the 2003 MICRO and CASES conferences; Embedded Systems, Microarchitecture, and Compilation Techniques, in Memory of B. Ramakrishna (Bob) Rau), Vol. 54, Issue 10, Oct. 2005, pp. 1203-1215.
- Tiny Instruction Caches for Low Power Embedded Systems. A. Gordon-Ross, S. Cotterell, F. Vahid. ACM Transactions on Embedded Computing Systems, Vol. 2, Issue 4, Nov. 2003, pp. 449-481.
- Exploiting Fixed Programs in Embedded Systems: A Loop Cache Example. A. Gordon-Ross, S. Cotterell, F. Vahid. IEEE Computer Architecture Letters, Vol. 1, Jan. 2002.
Conference Papers:
- A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar, E. Barros. IEEE/ACM DATE, April 2007.
- Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE DAC, July 2006.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. IEEE/ACM ISLPED, August 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. ACM GLSVLSI, April 2005.
- Automatic Tuning of Two-Level Caches to Embedded Applications. A. Gordon-Ross, F. Vahid, N. Dutt. IEEE/ACM DATE, February 2004.
- Frequent Loop Detection Using Non-Intrusive On-Chip Hardware. A. Gordon-Ross, F. Vahid. IEEE/ACM CASES, October 2003.
- Dynamic Loop Caching Meets Preloaded Loop Caching: A Hybrid Approach. A. Gordon-Ross, F. Vahid. IEEE ICCD, September 2002.
- A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power. F. Vahid, A. Gordon-Ross. IEEE/ACM ISLPED, August 2001.

