Managing Multi-Configuration Hardware via Dynamic Working Set Analysis. By Ashutosh S. Dhodapkar and James E. Smith. Presented by Kyriakos Yioutanis.

2 Introduction When a program executes, it passes through phases in which its performance characteristics and hardware resource requirements vary. If these phase changes can be detected, dynamic reconfiguration can be invoked to optimize for performance and/or power. Configurable units typically have a fixed number of configurations, e.g. four different cache sizes. The paper proposes several configuration algorithms that can be applied to any multi-configuration unit.

3 Dynamically configurable hardware Examples: configurable caches and TLBs; allocation of memory hierarchy resources; allocation of memory buffer resources; configurable branch predictors; configurable instruction windows; configurable pipelines. A combination of these methods can be used, but the complexity of the optimization problem then increases, especially if the methods interact with one another.

4 Dynamic reconfiguration algorithm

5 Dynamic reconfiguration algorithm (2) Referred to as the Rochester algorithm; used to control a multi-configuration data cache. Interval: 100K instructions. Two states: Stable and Unstable. Tuning and reconfiguration take place in the Unstable state. The threshold used to detect a phase change is adjusted dynamically. Three factors matter: detection efficiency (the ability to detect phase changes); reconfiguration overhead (depends on the amount of state in the unit, from 10 to 1000 cycles, e.g. for a data cache); tuning overhead (the time taken to find the optimal configuration). Each tuning can lead to multiple reconfigurations. Why study working sets? Because phase changes are manifestations of working set changes.

6 Working with working set The working set W(ti,τ), for i = 1, 2, …, is the set of distinct segments {s1, s2, …, sω} touched over the i-th window (interval) of size τ. The working set (w.s.) size is ω, the cardinality of the set. Segments are memory regions of some fixed size (e.g. a page). General model: the phase transition model. A phase is defined as the maximal interval over which the working set remains more or less constant. The program follows a series of steady-state phases with abrupt transitions in between. The focus is on the instruction working set (instruction and data working sets are distinguished). The window size is important for capturing a working set. In this paper, working set elements have cache-line granularity (32-256 bytes), because the configurable units are caches and predictors. Non-overlapping windows are used.
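The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `working_sets`, the 64-byte line size, and measuring the window in memory references (the paper measures intervals in instructions) are all choices made here, not taken from the slides.

```python
from typing import Iterable, List, Set


def working_sets(addresses: Iterable[int], window: int,
                 line_bytes: int = 64) -> List[Set[int]]:
    """Split an address trace into non-overlapping windows of `window`
    references and return the set of distinct line-sized segments in each."""
    sets: List[Set[int]] = []
    current: Set[int] = set()
    count = 0
    for addr in addresses:
        current.add(addr // line_bytes)  # segment id = address without line offset
        count += 1
        if count == window:              # window full: close it and start a new one
            sets.append(current)
            current = set()
            count = 0
    if count:                            # keep a trailing partial window
        sets.append(current)
    return sets


# Hypothetical trace of instruction addresses (assumption, for illustration).
trace = [0, 4, 64, 68, 128, 4096, 4100, 4160, 0, 64]
print([len(w) for w in working_sets(trace, window=5)])  # prints [3, 4]
```

The working set size ω of each window is simply the length of the corresponding set.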

7 Working with working set (2) Main goals: identify the w.s., measure its size, and detect changes in the w.s. To compare two phases with working sets W(ti,τ) and W(tj,τ), a relative working set distance δ is used. A large δ indicates a w.s. change; a small δ indicates no change. If δ = 0 the sets are identical; if δ = 1 they are totally different. Define a threshold δth: there is a w.s. change if δ > δth.
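The formula for δ did not survive on this slide. A definition consistent with the stated endpoints (δ = 0 for identical sets, δ = 1 for disjoint sets) is the size of the symmetric difference divided by the size of the union. The sketch below assumes that definition; the function names and the default threshold of 0.5 are illustrative choices, not values given for δth in the slides.

```python
def relative_distance(a: set, b: set) -> float:
    """Assumed definition: (|A u B| - |A n B|) / |A u B|."""
    union = a | b
    if not union:          # two empty sets are treated as identical
        return 0.0
    return (len(union) - len(a & b)) / len(union)


def working_set_changed(a: set, b: set, delta_th: float = 0.5) -> bool:
    """Report a w.s. change when the distance exceeds the threshold."""
    return relative_distance(a, b) > delta_th


print(relative_distance({1, 2, 3}, {1, 2, 3}))  # prints 0.0
print(relative_distance({1, 2, 3}, {4, 5, 6}))  # prints 1.0
print(relative_distance({1, 2, 3}, {2, 3, 4}))  # prints 0.5
```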

8 Working set signature Complete working sets are difficult to manipulate. The working set signature (w.s.s.) is a lossy-compressed representation of the w.s. The w.s.s. is an n-bit vector formed by mapping w.s. elements into n buckets. A random hash function is used (based on srand and rand). The low-order b address bits are ignored by the hash (w.s. elements have cache-line granularity). Bit-vector sizes range from 32 to 128 bytes. The bit vector is cleared at the beginning of every interval.
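A minimal sketch of signature construction follows. It assumes a 1024-bit vector, 64-byte lines (b = 6), and uses Python's `random.Random` seeded with the line address as a stand-in for a hash built from srand and rand; the exact hash used in the paper is not given on the slide. A Python integer serves as the bit vector.

```python
import random
from typing import Iterable


def signature(addresses: Iterable[int], n_bits: int = 1024,
              line_bits: int = 6) -> int:
    """Return an n-bit working set signature, held in a Python int."""
    sig = 0                                   # vector starts cleared for the interval
    for addr in addresses:
        line = addr >> line_bits              # ignore the low-order b address bits
        # Stand-in for "srand(line); rand() % n" (assumption, not the paper's hash).
        bucket = random.Random(line).randrange(n_bits)
        sig |= 1 << bucket                    # mark the bucket as touched
    return sig


def signature_size(sig: int) -> int:
    """Number of bits set in the signature."""
    return bin(sig).count("1")


# Addresses within the same 64-byte line map to the same bucket.
print(signature([0, 4, 63]) == signature([0]))  # prints True
```

Calling `signature` once per interval mirrors clearing the bit vector at the start of every interval.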

9 Working set signature(2)

10 Working set signature (3) The size of the signature (the number of bits set) is related to the w.s. size. If K random keys are hashed into n buckets and a fraction f of the buckets is filled, the w.s. size can be estimated from f. A 90% filled table corresponds to a w.s. size about 2.5 times larger than the number of filled entries. A relative signature distance Δ is defined as the measure of similarity of two signatures S1 and S2. A threshold value Δth is used to detect phase changes.
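The equations on this slide were lost in extraction. The block below is a reconstruction, not a quotation: the first two lines are the standard occupancy result for hashing K random keys into n buckets, and the third is a signature analogue of the set distance (bits set in the exclusive OR over bits set in the inclusive OR), assumed here because it yields 0 for identical signatures and 1 for signatures with no bits in common.

```latex
\begin{align*}
f &= 1 - \left(1 - \frac{1}{n}\right)^{K} \\
K &= \frac{\ln(1 - f)}{\ln\left(1 - \frac{1}{n}\right)} \\
\Delta(S_1, S_2) &= \frac{\operatorname{ones}(S_1 \oplus S_2)}{\operatorname{ones}(S_1 \lor S_2)}
\end{align*}
```

As a check against the slide's claim: for f = 0.9 and large n, ln(1 - 1/n) ≈ -1/n, so K ≈ 2.30 n, while the number of filled entries is 0.9 n. The ratio is about 2.56, in line with the "about 2.5 times" figure.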

11 3. Methodology Modified version of SimpleScalar with SPEC2000 benchmarks, compiled using base-level optimization. Benchmarks were chosen to have: 1) long- and short-term phases with differing performance; 2) recurring phases (to test w.s. identification); 3) different working sets within a benchmark that lead to similar behavior for certain cache/predictor configurations and completely different behavior for others (to test reconfiguration). 100K instructions per interval, 20,000 intervals (2 billion instructions). Signature vector size is 1024 bits (128 bytes).

12 Signature accuracy Evaluate the accuracy of the w.s.s. distance by comparing it with the full w.s. distance. Relative distances are measured between pairs of consecutive windows.

13 Signature accuracy (2) The Rochester algorithm uses the dynamic count of conditional branches to measure w.s. changes. A relative distance metric is defined for conditional branch counts.

14 Signature accuracy (3) Some correlation, but a high level of dispersion. Several significant working set changes are associated with very small relative branch distances. Define Δth = 0.5, which filters out most of the noise and still detects significant phase changes. Phase change detection is relatively insensitive to Δth, because phase changes tend to be abrupt.

15 Evaluation: managing configurable hardware Signature based algorithm

16 Evaluation: managing configurable hardware (2) Three states: Stable (program w.s. is stable and the configuration is optimal), Unstable (w.s. is in transition), Tuning (w.s. is stable and different configurations are being explored). Similar to the Rochester algorithm, but the signature-based algorithm does not tune while the w.s. is in transition. The instruction cache size can be configured to 2 KB, 8 KB, 32 KB or 128 KB. Parameters for the Rochester algorithm: base_br_noise = 4500, br_dec = 50, br_inc = 1000, base_perf_noise = 450, perf_dec = 5, perf_inc = 100 and threshold = 2%.
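The three states can be pictured as a small state machine driven by the per-interval signature distance. The flowchart from the previous slide did not survive, so the transitions below are one plausible reading of the state descriptions, not the paper's exact algorithm. The names `State`, `next_state` and `tuning_done` are hypothetical.

```python
from enum import Enum


class State(Enum):
    STABLE = "stable"
    UNSTABLE = "unstable"
    TUNING = "tuning"


def next_state(state: State, delta: float, tuning_done: bool,
               delta_th: float = 0.5) -> State:
    """Advance one interval, given the signature distance to the previous one."""
    if delta > delta_th:
        return State.UNSTABLE      # w.s. changed: enter (or stay in) transition
    if state is State.UNSTABLE:
        return State.TUNING        # w.s. has settled: start exploring configurations
    if state is State.TUNING:
        # Stay in TUNING until every configuration has been tried (assumption).
        return State.STABLE if tuning_done else State.TUNING
    return State.STABLE


# Hypothetical per-interval distances and tuning-complete flags.
state = State.STABLE
for delta, done in [(0.1, False), (0.8, False), (0.2, False), (0.1, False), (0.1, True)]:
    state = next_state(state, delta, done)
    print(state.name)
```

Because tuning starts only once the distance has dropped below the threshold, no configuration is evaluated while the working set is still in transition, which is the stated difference from the Rochester algorithm.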

17 Evaluation: managing configurable hardware (3)

18 Evaluation: managing configurable hardware (4)

19 Evaluation: managing configurable hardware (5)

20 Evaluation: managing configurable hardware (6)

21 Evaluation: managing configurable hardware (7) To reduce unnecessary tunings, we extend the signature-based algorithm to wait for 4 stable intervals before tuning. If the state is UNSTABLE for more than 10 intervals and performance is below threshold, the cache size is increased to the maximum. This acts as a backup strategy in cases where the working set does not stabilize, so tuning is never performed.
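The two rules of the extension reduce to a pair of counters. The sketch below assumes the caller tracks how many consecutive intervals have been stable or unstable; the function name, argument names and returned action strings are hypothetical.

```python
def extended_action(stable_run: int, unstable_run: int,
                    perf_below_threshold: bool,
                    wait_intervals: int = 4, max_unstable: int = 10) -> str:
    """Decide what to do this interval under the extended algorithm."""
    # Backup strategy: the working set never settles and performance is poor.
    if unstable_run > max_unstable and perf_below_threshold:
        return "max_cache"
    # Tune only after the working set has been stable for long enough.
    if stable_run >= wait_intervals:
        return "tune"
    return "wait"


print(extended_action(stable_run=2, unstable_run=0, perf_below_threshold=False))   # prints wait
print(extended_action(stable_run=4, unstable_run=0, perf_below_threshold=False))   # prints tune
print(extended_action(stable_run=0, unstable_run=11, perf_below_threshold=True))   # prints max_cache
```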

22 4. Measuring working set sizes The signature size is closely related to the actual working set size. In cases where performance is directly related to the working set size, for example instruction and data caches, the signature size can be used to determine the optimal configuration; there is no need for tuning.

23 Working set size experiments For small working sets the graph of signature size against w.s. size is close to linear; as the w.s. grows it becomes non-linear. Even in the non-linear region, the signature gives reasonably accurate w.s. size estimates up to 3-4x the maximum signature size. So a typical signature size (32-128 bytes) with line-size granularities of 32-128 bytes can estimate w.s. sizes of many tens to hundreds of KB.

24 Evaluation: reconfiguration using signature size The extended signature-based algorithm can be modified to use the signature size for selecting an optimal cache configuration: the smallest one that holds the current working set (plus 10% to allow for some noise). To determine the appropriate size, equation 2 (Section 2) is used. This eliminates the need to tune, and it typically reduces the number of reconfigurations as well.
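A sketch of size-based selection follows. It assumes that equation 2 is the occupancy estimate K = ln(1 - f) / ln(1 - 1/n), that the signature has 1024 bits, and that lines are 64 bytes; the cache sizes are the four instruction cache configurations listed earlier. Function names are hypothetical.

```python
import math

CACHE_SIZES_KB = (2, 8, 32, 128)   # configurations listed on the evaluation slide


def estimate_ws_lines(filled: int, n_bits: int) -> float:
    """Estimate the number of distinct lines from the count of set bits."""
    f = filled / n_bits
    if f >= 1.0:
        return math.inf            # a saturated signature gives no upper bound
    return math.log(1 - f) / math.log(1 - 1 / n_bits)


def pick_cache_kb(filled: int, n_bits: int = 1024,
                  line_bytes: int = 64, margin: float = 0.10) -> int:
    """Smallest configuration holding the estimated working set plus margin."""
    ws_bytes = estimate_ws_lines(filled, n_bits) * line_bytes * (1 + margin)
    for size_kb in CACHE_SIZES_KB:
        if ws_bytes <= size_kb * 1024:
            return size_kb
    return CACHE_SIZES_KB[-1]      # nothing fits: fall back to the largest cache


print(pick_cache_kb(10))    # prints 2
print(pick_cache_kb(100))   # prints 8
print(pick_cache_kb(300))   # prints 32
```

With 100 of 1024 bits set, the estimate is about 105 lines, or roughly 7.2 KB after the 10% margin, so the 8 KB configuration is the smallest that fits.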

25 Identifying recurring phases The same phases often recur multiple times during program execution. Implement an algorithm to save recurring phase information to avoid re-tuning. This will be done by maintaining a phase table in memory. After tuning has determined the optimal configuration for a particular phase, it will be stored in the table. Later, if the phase recurs, the optimal configuration can be reinstated without going through the tuning process.

26 Phase statistics

27 Evaluation: recurring working sets The algorithm (phase table) for exploiting recurring working sets is similar to the basic algorithm. On detecting a phase change, the algorithm first performs a table lookup to see if configuration information for the phase exists in the table. If so, the optimal configuration is reinstated. If not, the algorithm goes into the TUNING state. At the end of tuning, the optimal configuration is committed to the signature table. In addition to the configuration information, the table also keeps track of phase lengths. If, during its last execution, the length was fewer than four intervals (400,000 instructions), then tuning is not performed. This avoids tuning for insignificant phases. Four intervals are chosen because the tuning process takes a maximum of four intervals.
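The lookup-then-tune flow above can be sketched as a small table keyed by signature. Several details are assumptions made for illustration: signatures are matched by a relative signature distance computed as bits set in the XOR over bits set in the OR, a match means a distance of at most 0.5, and a phase seen for the first time is always tuned. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


def signature_distance(s1: int, s2: int) -> float:
    """Assumed distance: ones(s1 XOR s2) / ones(s1 OR s2)."""
    union = bin(s1 | s2).count("1")
    return bin(s1 ^ s2).count("1") / union if union else 0.0


@dataclass
class PhaseEntry:
    signature: int
    config: Optional[int] = None   # tuned configuration, e.g. cache size in KB
    last_length: int = 0           # intervals the phase lasted on its last run


class PhaseTable:
    def __init__(self, delta_th: float = 0.5, min_length: int = 4) -> None:
        self.delta_th = delta_th
        self.min_length = min_length
        self.entries: List[PhaseEntry] = []

    def lookup(self, sig: int) -> Optional[PhaseEntry]:
        """Return the first stored phase whose signature is close enough."""
        for entry in self.entries:
            if signature_distance(entry.signature, sig) <= self.delta_th:
                return entry
        return None

    def on_phase_change(self, sig: int) -> str:
        """Return 'reinstate', 'tune' or 'skip' for the newly detected phase."""
        entry = self.lookup(sig)
        if entry is None:
            self.entries.append(PhaseEntry(sig))
            return "tune"                      # unseen phase: explore configurations
        if entry.config is not None:
            return "reinstate"                 # reuse the stored optimal configuration
        if entry.last_length < self.min_length:
            return "skip"                      # phase was too short to be worth tuning
        return "tune"

    def commit(self, sig: int, config: int) -> None:
        """Store the configuration found at the end of tuning."""
        entry = self.lookup(sig)
        if entry is not None:
            entry.config = config

    def record_length(self, sig: int, length: int) -> None:
        """Remember how many intervals the phase lasted."""
        entry = self.lookup(sig)
        if entry is not None:
            entry.last_length = length
```

A caller would invoke `on_phase_change` when the detector reports a new phase, `commit` once tuning finishes, and `record_length` when the phase ends.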

