Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz November, 2004.

Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz zguz@tx.technion.ac.il November, 2004

2 Outline Motivation Heterogeneous multi-core architecture ToDo list and open questions  Different Objective functions  SMT as building blocks  Phase detection Summary

3 References “Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload Performance” Rakesh Kumar, Dean M. Tullsen, Parthasarath Ranganathan, Norman P.Jouppi, Keith I. Farkas In Proceedings of the 31 st International Symposium on Computer Architecture (ISCA’04), June, 2004 “Single-ISA Heterogeneous Multi-Core Architecture: The Potential for Processor Power Reduction” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the 36 st International Symposium on Microarchitecure, December 2003 “A Multi-Core Approach to Addressing the Energy-Complexity Problem In Microprocessor” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003 “Processor Power Reduction Via Single-ISA Heterogeneous Multi- Core Architecture” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, Computer Architecture Letters, Volume 2, Apr. 2003 Heterogeneous multi-core architecture

4 Diminishing performance return per chip area The infamous power/performance ratio The power wall Chip area is bounded Tomer’s assumptions:  VLSI sad facts of life: Processor’s characteristics

5 Few different generations of Alpha’s cores  All scaled to 0.10 micron EV6+

6 EV8-EV6EV5EV4Processor 8 (OOO)6 (OOO)42 Issue-width 64 KB, 4-way64 KB, 2-way 8 KB, DM I-Cache 64 KB, 4-way64 KB, 2-way 8 KB, DM D-Cache hybrid 2 level (2X EV6 size) hybrid 2 level 2K gshare Branch Pred. 1111 Threads 23624.55.062.87 Area (mm 2 ) 92.8817.809.834.97 Peak-power (Watt) 46.4410.686.883.73 Typical Power (Watt) Processor’s characteristics Few different generations of Alpha’s cores  All scaled to 0.10 micron 4.8x 1.5x

7 Processor’s characteristics ⇒ large number of small processors is better than small number of large processors Processor are expected to supply competing objectives:  High throughput for multi-thread environments  Good single thread performance But what if TLP isn’t large enough ?

8 workloads characteristics Different amount of ILP Different TLP  Among different applications  Among different workloads  Vary with time Legacy code Many applications under-utilize the hardware  Suffer little performance loss when run on a less aggressive processor Great diversity among different applications

9 workloads characteristics Wildly different intra-thread behavior Programs fall into phases, each phase presents different behavior  Variation in resources demands  Memory/computation bound  Branch mispredictions  Cache misses During many phases the processor is under-utilized

10 workloads characteristics Wildly different intra-thread behavior gzip

11 workloads characteristics Wildly different intra-thread behavior gcc

12 Main Idea A multiprocessor composed of asymmetric cores  Better area-efficient coverage of the different workloads demands: Single thread performance (legacy code) Elevated throughput for high TLP Single-ISA heterogeneous Multi-Core

13 Main Idea A multiprocessor composed of asymmetric cores  Better area-efficient coverage of the different workloads demands: Single thread performance (legacy code) Elevated throughput for high TLP Use a smart dynamic task-to-core assignment  Assign each application to the core best suite to meet its performance demands  Exploit the variations in resource demands between different application's phases Single-ISA heterogeneous Multi-Core

14 Main Idea A multiprocessor composed of asymmetric cores  Better area-efficient coverage of the different workloads demands: Single thread performance (legacy code) Elevated throughput for high TLP Use a smart dynamic task-to-core assignment  Assign each application to the core best suite to meet its performance demands  Exploit the variations in resource demands between different application's phases Use of-the-shelf cores  Amortize design and verification effort Single-ISA heterogeneous Multi-Core

15 The Potential of Heterogeneity

16 Architecture Model 3 different multi-core systems were compared:  4 EV6 cores (homogeneous MP)  20 EV5 cores (homogeneous MP)  3 EV6 and 5 EV5 cores (heterogeneous MP) Each core has its own L1 caches All cores share an on chip 4 MB L2 cache Chip area of all 3 configurations is roughly the same Using the correct power model, so is the total power…

17 Scheduling issues OS scheduler is responsible for thread scheduling and assignment  Core-switch at OS timeslice intervals. (10-100msec) The core-switch overhead is piggybacked with OS context switch  Application phase length are typically large, hence suite this timeslices

18 Scheduling issues OS scheduler is responsible for thread scheduling and assignment  Core-switch at OS timeslice intervals. (10-100msec) The core-switch overhead is piggybacked with OS context switch  Application phase length are typically large, hence suite this timeslices Sampling-based :  During the Sampling phase Thread migrate between different cores Statistics is gathered for every allocation  During the Steady Phase: The most beneficial allocation is used

19 Evaluation Metric: weighted speedup  Maximizing average performance gain over all applications The jobs assigned to the EV5 are those that are least affected by its inferior capabilities Scheduling issues Objective function

20 Scheduling issues sample-one: run each thread on each core once sample-avg: run each thread on each core at least twice sample-sched: constrained to choose only assignment that were actually sampled Sampling Strategy

21 Simulation Results Static scheduling

22 Simulation Results Dynamic scheduling, phases with constant length dynamic

23 Scheduling issues individual-event: whenever a thread’s IPC changes by more than 50% global-event: whenever the total change in IPC for all threads exceeds 100% bounded-global-event: the same as the global-event with minimum and maximum thresholds Trigger Mechanism

24 Simulation Results Dynamic scheduling, triggered phases

25 Priorities among different threads  Heterogeneous architecture ideally suite these task  Exploring different objective functions Todo list (open questions)

26 Priorities among different threads.  Heterogeneous architecture ideally suite these task.  Minimize energy consumption  Use performance threshold Exploring different objective functions Todo list (open questions)

27 Priorities among different threads.  Heterogeneous architecture ideally suite these task.  Minimize energy consumption.  Use performance threshold Minimize the energy-delay product  Exploring different objective functions Todo list (open questions)

28 Energy-delay product during applu life-time Todo list (open questions)

29 Great potential for energy saving:  Pervious work, considering only one thread at a time, achieved more than 30% of energy saving  Idle cores can be shut down Objective function may change on the fly according to changing power conditions Todo list (open questions) Exploring different objective functions

30 Using SMT Cores Enlarge flexibility and throughput with only modest area and power penalty Motivation No free lunches: Interaction between threads can no longer be ignored  Thread compete for virtually all processor resources Only sampled assignments can be used Permutations space of potential assignments is huge  Can not sample all the assignments  The sampling space must be pruned  The sampling strategy is much more important

31 Simulation Results (SMT) Heterogeneous system with SMT cores

32 References “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor” Allan Snavely, Dean M. Tullsen, In the Proceedings of the 9 th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), Novemeber, 2000 “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor” Allan Snavely, Dean M. Tullsen, In proceedisng of the 9 th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June, 2002 SOS (Sample, Optimize, Symbios)

33 SOS - (Sample, Optimize, Symbios) A first attempt to take into account threads interaction in SMT symbiosis – the effectiveness with which multiple jobs achieve speedup when coexcecuted on multithreaded machines  Throughput may actually go down Plundered from Uri’s slides.

34 SOS - (Sample, Optimize, Symbios) Using sampling phases to profile execution  Choose combination that maximize overall weighted speedup Which predictor to use ? Extremely architecture dependent  Encapsulation of hardware details from software IPC and Dcache are inconsistent performers

35 The moral of the SMT case When considering more ‘clustered’ architectures, groups of cores may share resources:  L1 caches  Memory hierarchy  FPU, TLB ? SMT and CMP are just two extremes of a viable spectrum Threads interaction significantly complicates our life

36 The moral of the SMT case A scheduler has 2 tasks:  Define a running set – jobs to be executed during the upcoming timeslice  Assign jobs from the running set to the different cores We have overlooked the first task and simplified the second We’ll have to tackle both If only memory is to be shared our life may be easier  Not by that much, though Threads interaction significantly complicates our life

37 References “Phase Tracking and Prediciton” Timothy Sherwood, Suleyman Sair, Brad Calder, Proceedings of the 30th annual international symposium on Computer architecture, IEEE CS Press, 2003, pp.336-349 “Discovering and Exploiting Program Phases” Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder, IEEE Micro : Micro's Top Picks from Computer Architecture Conferences, Nov./Dec. 2003 Phase tracking

38 It is Profitable to accurately identify program phases  React quickly to phase changes  Spare samplings overheads IPC is not necessarily the most appropriate representative Smart phase detection Todo list (open questions)

39 Phases are a direct function of the way program traverse its code during execution  Use basic blocks ratios to identify phases  Architectural independent Basic Block Vector  One dimension array with an index for every basic block in the program  Each element represent the execution frequencies of the basic blocks weighted by instruction count, normalized Phase Tracking Main Idea

40 Phase Tracking Main Idea Basic Block Vector  One dimension array with an index for every basic block in the program  Each element represent the execution frequencies of the basic blocks weighted by instruction count, normalized

41 Phase Tracking Phase capture:

42 Work’s Innovation Using heterogeneous muti-cores to gain superior performance  Previous works targeted only power consumption  First real simulation results General-purpose processors  Previous works concentrated on SoC with known workloads Dynamic task scheduling and task-to-core assignment  Most of the works use static scheduling and a full knowledge of the application characteristic

43 Summary Heterogeneous multi-core architecture can provide significantly higher performance Covers a wide spectrum of workloads Dynamic core assignments policy exploit intra-thread and inter-thread diversity Open issues:  Optimize energy consumption  Thread interactions  Phase detection  A lot more… Any questions ?

44 References of the day “Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload Performance” Rakesh Kumar, Dean M. Tullsen, Parthasarath Ranganathan, Norman P.Jouppi, Keith I. Farkas In Proceedings of the 31 st International Symposium on Computer Architecture (ISCA’04), June, 2004 “Single-ISA Heterogeneous Multi-Core Architecture: The Potential for Processor Power Reduction” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the 36 st International Symposium on Microarchitecure, December 2003 “A Multi-Core Approach to Addressing the Energy-Complexity Problem In Microprocessor” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003 “Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architecture” Rakesh Kumar, Keith I. Farkas, Norman P.Jouppi, Parthasarath Ranganathan, Dean M. Tullsen, Computer Architecture Letters, Volume 2, Apr. 2003 “Phase Tracking and Prediciton” Timothy Sherwood, Suleyman Sair, Brad Calder, Proceedings of the 30th annual international symposium on Computer architecture, IEEE CS Press, 2003, “Discovering and Exploiting Program Phases” Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder, IEEE Micro : Micro's Top Picks from Computer Architecture Conferences, Nov./Dec. 2003 “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor ” Allan Snavely, Dean M. Tullsen, In the Proceedings of the 9 th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), Novemeber, 2000

45 References of the day “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor” Allan Snavely, Dean M. Tullsen, In proceedisng of the 9 th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June, 2002 “Conjoind-core Chip Multiprocessing” Rakesh Kumar, Norman P.Jouppi, Dean M. Tullsen, In Proceedings of the 37 st International Symposium on Microarchitecure, December 2004

46 Backup

47 Sample 2n configuration for n threads workload. Pruning strategies:  pref-EV6 : assumes it is best to run 2 thread on each EV6 before using EV5  pref-EV5 : assumes it’s best to run on EV5 rather than put 2 threads on EV6  pref-nigther : sample random schedule  pref-similar: sampling is biased toward a configuration similar to the current one used. Sampling strategies Using Multithreaded Cores

48 Workload construction 8 benchmarks from spec2000 Thread number vary up to the maximum number of available processor contexts. Various compositions are simulated. Large memory footprint int fp

Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz November, 2004.

Similar presentations

Presentation on theme: "Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz November, 2004."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz November, 2004.

Similar presentations

Presentation on theme: "Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz November, 2004."— Presentation transcript:

Similar presentations

About project

Feedback