Configurational Workload Characterization Hashem H. Najaf-abadi Eric Rotenberg.


1 Configurational Workload Characterization Hashem H. Najaf-abadi Eric Rotenberg

2 Heterogeneity. A Single Core: [Figure labels: Program 1, Program 2, Processor]

3 Heterogeneity. Multiple Cores: [Figure labels: Program 1, Program 2, Processor, Processor]

4 Heterogeneity. Multiple Cores: [Figure labels: Program 1, Program 2, Processor, Processor]

5 Heterogeneity. Heterogeneous Cores: [Figure labels: Program 1, Program 2, Processor]

6 Heterogeneous CMP Design Must determine: 1) Best processor configuration for a group of workloads. 2) Best way to group workloads together.

7 The Challenge: Communal Customization [Figure: a workload space containing workloads A through N, mapped to the best core configurations, Core 1 and Core 2]

8 Existing Approaches Regression models: Enable speedy exploration. Subsetting: Reduce workloads to a representative subset based on characteristics.

9 The Argument Subsetting isn’t a valid substitute or facilitator for communal customization. Reason: complex interdependencies between different architectural units.

10 Ties that bind 1) The global clock intertwines the sizing of different architectural units. 2) The burden of compromise in one unit can be passed on to another.

11 Example: The Global Clock [Figure: solid line: delay of the issue queue; dashed line: access delay of the cache. Panels show the cache and issue-queue pipelines at clock periods of 1 ns and 0.66 ns. Annotations: less slack; slack; pipeline too deep; small issue queue; needlessly large cache.]

12 Example: The Global Clock The clock period, issue-queue size, and cache size cannot be optimized independently of each other. [Figure: cache and issue-queue pipelines at clock periods of 1 ns and 0.66 ns]
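
A minimal sketch of the coupling described on slides 11 and 12, assuming a unit's pipeline depth is its access delay divided by the clock period, rounded up. The delays are made-up illustrative numbers, not values from the talk, and `stages_and_slack` is a hypothetical helper.

```python
import math

def stages_and_slack(delay_ns, clock_ns):
    """Return (pipeline stages, unused time in ns) for one unit."""
    # The small epsilon guards against floating-point noise when the
    # delay is an exact multiple of the clock period.
    stages = max(1, math.ceil(delay_ns / clock_ns - 1e-9))
    slack_ns = stages * clock_ns - delay_ns
    return stages, slack_ns

# Illustrative unit delays in ns (assumptions, not measured values).
units = {"issue queue": 0.60, "cache": 1.90}

for clock_ns in (1.0, 0.66):
    for name, delay_ns in units.items():
        stages, slack_ns = stages_and_slack(delay_ns, clock_ns)
        print(f"clock={clock_ns} ns  {name}: {stages} stage(s), "
              f"slack={slack_ns:.2f} ns")
```

Changing the clock period changes the depth and slack of every unit at once, which is why the sizes cannot be tuned one unit at a time.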

13 Ties that bind 1) The global clock intertwines the sizing of different architectural units. 2) The burden of compromise in one unit can be passed on to another.

14 Example: Passing on the Burden A) Working-set size, B) Branch predictability, C) Density of dependence chains, D) Frequency of loads, E) Frequency of conditional branches. * All normalized to a scale of 0~10. [Figure: characteristics of workloads α, β, and γ]

15 Example: Passing on the Burden A) Working-set size, B) Branch predictability, C) Density of dependence chains, D) Frequency of loads, E) Frequency of conditional branches. * All normalized to a scale of 0~10. [Figure: characteristics of workloads α, β, and γ. Customized architectures: core and cache speed, each labeled L or H]

16 Example: Passing on the Burden A) Working-set size, B) Branch predictability, C) Density of dependence chains, D) Frequency of loads, E) Frequency of conditional branches. * All normalized to a scale of 0~10. [Figure: characteristics of workloads α, β, and γ. Customized architectures: core and cache speed, each labeled L or H]

17 A More Accurate Solution Represent workloads by their customized architectural configurations. This allows for direct and accurate evaluation of how well different workloads do on customized configurations. We call this Configurational Workload Characterization.

18 Design Process Overview How not to do it: Important workloads -> select representative workloads based on workload behavior -> Rep. workloads -> search for opt. core combination -> Optimal core combination. How to do it: Important workloads -> customize a core for each workload (configurational characterization) -> Customized architectures -> search for opt. core combination -> Optimal core combination.

19 Pros & Cons - More costly to determine. + Provides a better design solution. + Provides a systematic approach. + Can be performed prior to the design phase that is critical for time-to-market.

20 XP-SCALAR A superscalar design-space exploration framework: www4.ncsu.edu/~hhashem/xpscalar.htm Uses SimpleScalar to perform cycle-accurate simulations. Uses the CACTI model to approximate the access latency of the different units.

21 XP-SCALAR What parameters are varied: Clock period, Processor width, Size of the issue queue, Size of the register-file, Size of the load-store queue, Size of the L1 and L2 caches

22 XP-SCALAR How they are varied: a) Clock period is varied, and architecture parameters are adjusted to make latencies fit within pipeline stages. b) Number of pipeline stages of a unit is varied and its configuration appropriately adjusted.
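
A rough sketch of approach (a) on slide 22, assuming each unit is given the largest size whose access delay still fits in its allotted pipeline stages. `toy_delay_ns` is a made-up stand-in for a CACTI latency estimate, and `largest_fitting_size` is a hypothetical helper; neither is XP-SCALAR code.

```python
import math

def toy_delay_ns(entries):
    # Made-up delay model for illustration only; a real flow would
    # query a latency model such as CACTI instead.
    return 0.2 + 0.1 * math.log2(entries)

def largest_fitting_size(clock_ns, stages, candidate_sizes,
                         delay_fn=toy_delay_ns):
    """Largest size whose delay fits in stages * clock_ns, else None."""
    budget_ns = clock_ns * stages
    fitting = [s for s in candidate_sizes if delay_fn(s) <= budget_ns + 1e-12]
    return max(fitting) if fitting else None

sizes = [16, 32, 64, 128, 256]

for clock_ns, stages in [(1.0, 1), (0.66, 1), (0.66, 2)]:
    size = largest_fitting_size(clock_ns, stages, sizes)
    print(f"clock={clock_ns} ns, stages={stages}: largest size = {size}")
```

Approach (b) corresponds to holding `clock_ns` fixed and sweeping `stages` for one unit.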

23 Determining the Best Cores Execute all benchmarks on each other's customized configurations. From that, determine the best grouping through a complete search.
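
The complete search on slide 23 might be sketched as follows, assuming a cross-performance table in which `perf[b][c]` is the performance of workload `b` on the core customized for workload `c`, and assuming each workload runs on whichever selected core suits it best. The workload names and numbers are invented for illustration and are not results from the talk; `best_core_set` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean, harmonic_mean

def best_core_set(perf, k, aggregate=mean):
    """Exhaustively pick the k customized cores that maximize the
    aggregate performance over all workloads."""
    workloads = list(perf)
    best_score, best_cores = float("-inf"), None
    for cores in combinations(workloads, k):
        # Each workload is assigned to the best core available to it.
        per_workload = [max(perf[b][c] for c in cores) for b in workloads]
        score = aggregate(per_workload)
        if score > best_score:
            best_score, best_cores = score, cores
    return best_cores, best_score

# Invented cross-performance table (rows: workload, columns: core).
perf = {
    "w1": {"w1": 2.0, "w2": 1.0, "w3": 1.5},
    "w2": {"w1": 0.8, "w2": 1.6, "w3": 1.4},
    "w3": {"w1": 1.0, "w2": 1.2, "w3": 1.8},
}

for k in (1, 2):
    for name, agg in (("average", mean), ("harmonic", harmonic_mean)):
        cores, score = best_core_set(perf, k, agg)
        print(f"k={k} {name}: cores={cores} score={score:.3f}")
```

The search cost grows combinatorially with the number of workloads and cores, which is acceptable for a handful of benchmarks but is the main expense of this approach.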

24 Best Core Results

                                                   customized core(s)          avg. IPT  har. IPT
best config for avg. & har. IPT                    gcc                         2.06      1.57
2 best configs for avg. IPT                        parser, twolf               2.27      1.76
2 best configs for har. IPT                        gcc, mcf                    2.12      1.88
3 best configs for avg. IPT                        crafty, parser, twolf       2.35      1.82
3 best configs for har. IPT                        crafty, mcf, twolf          2.27      2.05
4 best configs for avg. & har. IPT                 crafty, mcf, parser, twolf  2.32      2.08
each benchmark on its own customized architecture  -                           2.38      2.12

25 The effect of subsetting Subsetting of a single pair of benchmarks results in the extraction of a totally different set of best cores.

26 Representation [Figure: dendrogram]

27 Conclusions There are interdependencies between architectural units in how they are customized. In the design of a heterogeneous CMP, subsetting can lead to performance degradation.

