Applying Control Theory to the Caches of Multiprocessors Department of EECS University of Tennessee, Knoxville Kai Ma
2 Applying Control Theory to the Caches of Multiprocessors The shared L2 cache is one of the most important on-chip shared resources. Largest area and leakage power consumer. One of the dominant factors in performance. Two Papers: Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors; SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors
Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors Department of EECS University of Tennessee, Knoxville Xiaorui Wang, Kai Ma, Yefu Wang
4 Background NUCA (Non-Uniform Cache Architecture) Key idea: Different cache banks have different access latencies.
5 Introduction The power of the cache needs to be constrained. With controlled power, the performance of the caches also needs to be guaranteed. Why control relative latency (the ratio between the average cache access latencies of two threads)? 1. Accelerate critical threads 2. Reduce priority inversion
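The definition of relative latency above can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `relative_latency` and the latency samples are hypothetical and do not come from the paper.

```python
from statistics import mean


def relative_latency(latencies_a, latencies_b):
    """Ratio of the average cache access latencies of two threads."""
    if not latencies_a or not latencies_b:
        raise ValueError("each thread needs at least one latency sample")
    return mean(latencies_a) / mean(latencies_b)


# Hypothetical per-access L2 latencies (cycles) seen in one control period.
thread0 = [12, 14, 16, 14]  # average 14 cycles
thread1 = [20, 22, 21, 21]  # average 21 cycles

rl = relative_latency(thread1, thread0)  # 21 / 14 = 1.5
```

A value of 1.5 here means thread 1 waits on the shared cache 1.5 times as long, on average, as thread 0, which is the quantity the controller is asked to hold at a set point.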
6 System Design [Block diagram: threads 0 to 3 run on cores 0 to 3 and share the L2 cache. Relative latency control loop: latency monitors, relative latency controllers, cache resizing and partitioning modulator. Power control loop: power monitor, power controller. Each thread owns a set of cache banks; the remaining banks are inactive.]
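7 Relative Latency Controller (RLC) PI (Proportional-Integral) controller. Steps: system modeling, controller design, control analysis. [Diagram: feedback loop around the shared L2 caches. The RLC compares the relative latency set point with the measured relative latency RL and outputs a new cache ratio. Example values on the diagram: set point 1.5, measured RL 1.2, error 0.3, increase 0.2. Disturbances: workload variation and total cache size variation.]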
8 Relative Latency Model is the relative latency between and core is the cache size ratio between and core RL model System identification Model orders Parameters Model Orders and Error
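System identification can be illustrated with a small least-squares fit. The slide's model equation did not survive extraction, so the sketch below assumes a first-order difference model, rl(k) = a*rl(k-1) + b*c(k-1), where c is the cache size ratio. That model form, the function name `fit_first_order`, and the synthetic trace are all assumptions made for illustration.

```python
def fit_first_order(rl, ratio):
    """Least-squares fit of the ASSUMED model rl[k] = a*rl[k-1] + b*ratio[k-1]."""
    if len(rl) != len(ratio) or len(rl) < 3:
        raise ValueError("need two equally long traces with at least 3 samples")
    s_yy = s_yu = s_uu = s_yt = s_ut = 0.0
    for k in range(1, len(rl)):
        y, u, target = rl[k - 1], ratio[k - 1], rl[k]
        s_yy += y * y
        s_yu += y * u
        s_uu += u * u
        s_yt += y * target
        s_ut += u * target
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s_yy * s_uu - s_yu * s_yu
    if abs(det) < 1e-12:
        raise ValueError("trace is not informative enough to identify a and b")
    a = (s_yt * s_uu - s_yu * s_ut) / det
    b = (s_yy * s_ut - s_yu * s_yt) / det
    return a, b


# Synthetic, noise-free trace generated from known (hypothetical) parameters.
A_TRUE, B_TRUE = 0.6, 0.5
ratio_trace = [1.0, 1.5, 0.8, 1.2, 2.0, 0.9, 1.1, 1.7]
rl_trace = [1.0]
for k in range(1, len(ratio_trace)):
    rl_trace.append(A_TRUE * rl_trace[k - 1] + B_TRUE * ratio_trace[k - 1])

a_hat, b_hat = fit_first_order(rl_trace, ratio_trace)
```

On real measurements the fit would not be exact, and comparing the residual error across candidate model orders is one way to read the "Model Orders and Error" item on the slide.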
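9 Controller Design PID controller using the proportional and integral terms. Design method: root locus. [Diagram: feedback loop. The error between the relative latency set point and the measured relative latency drives the controller, which outputs a new cache ratio to the shared L2 caches.]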
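The feedback loop on this slide can be sketched as an incremental PI controller driving a toy plant. Everything numeric here is an assumption: the gains `kp` and `ki`, the first-order plant rl(k+1) = 0.6*rl(k) + 0.5*c(k), and the output limits are hypothetical, and the sign of the plant gain depends on how the two ratios are defined. The starting point (measured 1.2, set point 1.5) mirrors the example values shown on the RLC slide.

```python
class PIController:
    """Incremental (velocity-form) PI controller."""

    def __init__(self, kp, ki, set_point, initial_output, lo=0.1, hi=10.0):
        self.kp, self.ki = kp, ki
        self.set_point = set_point
        self.output = initial_output
        self.lo, self.hi = lo, hi
        self.prev_error = 0.0

    def update(self, measured):
        error = self.set_point - measured
        # Proportional term acts on the change in error, integral term on the error.
        self.output += self.kp * (error - self.prev_error) + self.ki * error
        self.output = min(self.hi, max(self.lo, self.output))
        self.prev_error = error
        return self.output


def simulate(steps=40):
    a, b = 0.6, 0.5          # hypothetical plant parameters
    rl, ratio = 1.2, 0.96    # steady state of the toy plant before control starts
    ctrl = PIController(kp=0.4, ki=0.3, set_point=1.5, initial_output=ratio)
    trace = []
    for _ in range(steps):
        ratio = ctrl.update(rl)      # new cache ratio for the next period
        rl = a * rl + b * ratio      # toy plant response
        trace.append(rl)
    return trace


trace = simulate()
```

With these assumed values the measured relative latency settles at the 1.5 set point within a few tens of control periods. The integral term is what removes the steady-state error when the workload or total cache size shifts.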
10 Control Analysis Derive the transfer function of the controller. Derive the transfer function of the system with system model variations. Derive the transfer function of the closed-loop system and compute the poles. The control period of the power control loop is selected to be longer than the settling time of the relative latency control loop. Stability range:
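The pole computation can be sketched for an assumed plant G(z) = g*b/(z - a) and PI controller C(z) = ((Kp + Ki)z - Kp)/(z - 1), where g models a gain variation between the identified model and the real system. The closed-loop characteristic polynomial is then z^2 + (g*b*(Kp + Ki) - (1 + a))z + (a - g*b*Kp). The plant form and all numbers are assumptions, since the slide's own transfer functions and stability range were lost.

```python
import cmath


def closed_loop_poles(a, b, kp, ki, gain=1.0):
    """Roots of z^2 + p*z + q for the assumed first-order plant with PI control."""
    p = gain * b * (kp + ki) - (1.0 + a)
    q = a - gain * b * kp
    disc = cmath.sqrt(p * p - 4.0 * q)
    return (-p + disc) / 2.0, (-p - disc) / 2.0


def is_stable(a, b, kp, ki, gain=1.0):
    """A discrete-time loop is stable when every pole lies inside the unit circle."""
    return all(abs(z) < 1.0 for z in closed_loop_poles(a, b, kp, ki, gain))


# Hypothetical model parameters and gains.
A, B, KP, KI = 0.6, 0.5, 0.4, 0.3

nominal = is_stable(A, B, KP, KI, gain=1.0)
# Scan the gain variation in steps of 0.1 to approximate the stability range.
stable_gains = [g / 10 for g in range(1, 100) if is_stable(A, B, KP, KI, g / 10)]
```

For these assumed numbers the loop stays stable for gain variations up to roughly 5.8 times the nominal model gain, which is the kind of statement a "stability range" makes.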
11 Power Controller is the total cache size in the power control period. is the cache power in the power control period. are application-dependent parameters. System Model: Leakage power is proportional to the cache size. Leakage power accounts for the largest portion of cache power. PI Controller Controller analysis: and
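Because leakage power is proportional to cache size, the power loop can be sketched with a linear power model and a controller that resizes the cache in whole banks. This is an integral-only simplification of the PI controller named on the slide, and the constants `ALPHA` and `BETA`, the budget, and the bank counts are hypothetical.

```python
import math

ALPHA = 0.5  # watts per active bank (hypothetical, application dependent)
BETA = 2.0   # size-independent power in watts (hypothetical)


def cache_power(active_banks):
    """Assumed linear model: power grows in proportion to the active cache size."""
    return ALPHA * active_banks + BETA


def run_power_loop(budget, start_banks, max_banks, ki=1.0, periods=12):
    size = float(start_banks)  # controller's continuous internal state
    banks = start_banks
    history = []
    for _ in range(periods):
        power = cache_power(banks)          # power monitor reading
        history.append((banks, power))
        size += ki * (budget - power)       # integral action on the power error
        size = min(float(max_banks), max(1.0, size))
        banks = math.floor(size)            # round down so the budget is not exceeded
    return history


history = run_power_loop(budget=6.0, start_banks=16, max_banks=16)
```

With these numbers the loop steps from 16 active banks down to 8, where the modeled power meets the 6 W budget. The resulting total bank count is what the relative latency loop would then divide among threads.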
12 Simulation Simulator: SimpleScalar with NUCA cache (Alpha-like core). Power reading: dynamic part from Wattch (with CACTI), leakage part from HotLeakage. Workload: selected workloads from SPEC2000. Actuator: cache bank resizing and partitioning.
13 Single Control Evaluation [Figure panels: RLC set point change; power controller set point change; workload switch (marked "Switch workloads here"); total cache bank count change.]
14 Relative Latency & IPC
15 Coordination Cache access latencies and IPC values of the four threads on the four cores of the CMP. Cache access latencies and IPC values of the two threads on Core 0 and Core 1 for different benchmarks.
16 Conclusions Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors Simultaneously control power and relative latency Achieve desired performance differentiations Theoretically analyze the single loop control and coordinated system stability
SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors Shekhar Srikantaiah, Mahmut Kandemir, *Qian Wang Department of CSE *Department of MNE The Pennsylvania State University
18 Introduction Lack of control over shared on-chip resources. Weakened performance isolation. Lack of Quality of Service (QoS) guarantees. It is challenging to achieve high utilization while guaranteeing QoS. Static/dynamic resource reservations may lead to low resource utilization. Existing heuristic adjustments cannot provide theoretical guarantees such as "settling time" or "stability range".
19 Contribution Two-layer, control-theory-based SHARP (SHAred Resource Partitioning) architecture. Propose an empirical model. Design a customized application controller (Reinforced Oscillation Resistant controller). Study two policies that can be used in SHARP: SD (Service Differentiation) and FSI (Fair Speedup Improvement).
20 System Design
21 Why not PID? Disadvantages of the PID (Proportional-Integral-Derivative) controller: Painstaking to tune the parameters. Hard to integrate into a hierarchical architecture. Sensitive to model variation at run time. Static parameters. Generic controller (not problem-specific). Linear-model-based controller.
22 Application Controller
23 Pre-Actuation Negotiator (PAN) Map an overly demanded cache partition to a feasible partition. Policies: SD (Service Differentiation), FSI (Fair Speedup Improvement)
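The negotiator's job, turning per-application cache way requests that exceed the available ways into a feasible partition, can be sketched in two simple ways. The slide does not spell out the actual SD or FSI rules, so both functions below are hypothetical stand-ins: a priority-ordered grant in the spirit of service differentiation, and a proportional scale-down in the spirit of fairness. Neither is claimed to be the paper's algorithm.

```python
def negotiate_priority(requests, priorities, total_ways):
    """Grant requests in descending priority until the ways run out (stand-in)."""
    grants = [0] * len(requests)
    remaining = total_ways
    order = sorted(range(len(requests)), key=lambda i: priorities[i], reverse=True)
    for i in order:
        grants[i] = min(requests[i], remaining)
        remaining -= grants[i]
    return grants


def negotiate_proportional(requests, total_ways):
    """Scale all requests down by the same factor, using integer ways (stand-in)."""
    demand = sum(requests)
    if demand <= total_ways:
        return list(requests)  # already feasible
    grants = [r * total_ways // demand for r in requests]
    remainders = [r * total_ways % demand for r in requests]
    leftover = total_ways - sum(grants)
    # Hand the ways lost to rounding to the largest remainders first.
    order = sorted(range(len(requests)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        grants[i] += 1
    return grants


# Hypothetical demands from four application controllers on a 16-way cache.
requests = [8, 6, 4, 6]
by_priority = negotiate_priority(requests, priorities=[3, 1, 2, 0], total_ways=16)
by_share = negotiate_proportional(requests, total_ways=16)
```

Here `by_priority` is `[8, 4, 4, 0]` and `by_share` is `[5, 4, 3, 4]`. The priority version can starve a low-priority application entirely, while the proportional version spreads the shortfall, and a practical negotiator would likely also enforce a minimum allocation per application.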
24 SHARP Controller Increase IPC set points when cache ways are underutilized. FSI & SD policies. Proof of guaranteed optimal utilization.
25 Experimental Setup Simulator: Simics (full-system simulator). Operating System: Solaris 10. Configuration: 2 and 8 cores. Workload: 6 mixes of applications selected from SPEC2000.
26 Evaluation (Application Controller) Long run results of PID controller and ROR controller
27 Evaluation (FSI) SHARP vs Baselines
28 Evaluation (SD) Adaptation of IPC with the SD policy using the ROR controllers.
29 Sensitivity & Scalability Sensitivity analysis for different reference points Scalability (8 cores)
30 Conclusion SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors. Propose and design the SHARP control architecture for shared L2 caches. Validate SHARP with different management policies (FSI or SD). Achieve the desired FS and SD specifications.
31 Critiques (1) How should the relative latency set point be chosen? For the purpose of accelerating critical threads, parallel workloads may be more applicable.
32 Critiques (2) No stability proof. Insufficient description of how the parameters of the application controllers are updated.
33 Comparison
Aspect | Relative latency control with the power constraint | SHARP control architecture
Goal | Guarantee NUCA L2 cache relative latency under different power budgets | Improve normal L2 cache utilization while guaranteeing the QoS metrics
Design | Two-layer hierarchical design | Two-layer hierarchical design
Controller | PID | ROR
Coordination & Stability | Yes | No
Actuator | Cache bank resizing and partitioning | Cache way resizing and partitioning
Evaluation | SimpleScalar | Simics
34 Q & A Thank you
35 Backup Slides Start
36 Relative Controller Evaluation (2)
37 Application Controller Evaluation (2)
38 Guaranteed Optimal Utilization Proof are time-varying, application-dependent coefficients
39 System Design