Published by Christopher Palmer. Modified over 9 years ago.
1
Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz zguz@tx.technion.ac.il November, 2004
2
Outline
- Motivation
- Heterogeneous multi-core architecture
- To-do list and open questions
- Different objective functions
- SMT as building blocks
- Phase detection
- Summary
3
References: Heterogeneous multi-core architecture
- “Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance”, Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, Keith I. Farkas. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04), June 2004.
- “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. In Proceedings of the 36th International Symposium on Microarchitecture, December 2003.
- “A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003.
- “Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. Computer Architecture Letters, Volume 2, April 2003.
4
Processor’s characteristics
Tomer’s assumptions and VLSI sad facts of life:
- Diminishing performance return per chip area
- The infamous power/performance ratio
- The power wall
- Chip area is bounded
5
A few different generations of Alpha cores, all scaled to 0.10 micron (figure label: EV6+)
6
Processor’s characteristics
A few different generations of Alpha cores, all scaled to 0.10 micron

Processor              EV4           EV5         EV6              EV8-
Issue width            2             4           6 (OOO)          8 (OOO)
I-cache                8 KB, DM      8 KB, DM    64 KB, 2-way     64 KB, 4-way
D-cache                8 KB, DM      8 KB, DM    64 KB, 2-way     64 KB, 4-way
Branch predictor       (not listed)  2K gshare   hybrid 2-level   hybrid 2-level (2X EV6 size)
Threads                1             1           1                1
Area (mm^2)            2.87          5.06        24.5             236
Peak power (W)         4.97          9.83        17.80            92.88
Typical power (W)      3.73          6.88        10.68            46.44

Slide callouts: 4.8x, 1.5x
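The area and power figures above can be turned into ratios with a few lines of Python. This is only a sketch for checking the arithmetic: the dictionary layout and the `ratio` helper are illustrative names, not part of the original slides. The EV6/EV5 area ratio comes out at about 4.8 and the typical-power ratio at about 1.55, which is presumably what the 4.8x and 1.5x callouts refer to.

```python
# Per-core figures as listed on the slide (area in mm^2, power in watts).
CORES = {
    "EV4":  {"area": 2.87,  "peak": 4.97,  "typical": 3.73},
    "EV5":  {"area": 5.06,  "peak": 9.83,  "typical": 6.88},
    "EV6":  {"area": 24.5,  "peak": 17.80, "typical": 10.68},
    "EV8-": {"area": 236.0, "peak": 92.88, "typical": 46.44},
}


def ratio(metric, big, small):
    """Return how many times larger `big` is than `small` for one metric."""
    return CORES[big][metric] / CORES[small][metric]


if __name__ == "__main__":
    for metric in ("area", "peak", "typical"):
        print(f"EV6/EV5 {metric}: {ratio(metric, 'EV6', 'EV5'):.2f}x")
```

The same helper can compare any other pair, for example `ratio("area", "EV8-", "EV6")`.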
7
Processor’s characteristics
⇒ A large number of small processors is better than a small number of large processors
Processors are expected to meet competing objectives:
- High throughput for multi-threaded environments
- Good single-thread performance
But what if TLP isn’t large enough?
8
Workload characteristics: great diversity among different applications
- Different amounts of ILP
- Different TLP
- Among different applications
- Among different workloads
- Varies with time
- Legacy code
- Many applications under-utilize the hardware
- They suffer little performance loss when run on a less aggressive processor
9
Workload characteristics: wildly different intra-thread behavior
- Programs fall into phases; each phase presents different behavior
- Variation in resource demands: memory- or computation-bound, branch mispredictions, cache misses
- During many phases the processor is under-utilized
10
Workload characteristics: wildly different intra-thread behavior (figure: gzip)
11
Workload characteristics: wildly different intra-thread behavior (figure: gcc)
14
Main Idea: Single-ISA heterogeneous multi-core
- A multiprocessor composed of asymmetric cores
  - More area-efficient coverage of the different workload demands: single-thread performance (legacy code) and elevated throughput for high TLP
- Use a smart, dynamic task-to-core assignment
  - Assign each application to the core best suited to meet its performance demands
  - Exploit the variations in resource demands between an application's phases
- Use off-the-shelf cores
  - Amortize design and verification effort
15
The Potential of Heterogeneity
16
Architecture Model
Three multi-core systems were compared:
- 4 EV6 cores (homogeneous MP)
- 20 EV5 cores (homogeneous MP)
- 3 EV6 and 5 EV5 cores (heterogeneous MP)
Each core has its own L1 caches
All cores share an on-chip 4 MB L2 cache
The chip area of all three configurations is roughly the same
Using the correct power model, so is the total power…
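The "roughly the same chip area" claim can be checked against the per-core areas quoted earlier in the deck (EV5: 5.06 mm², EV6: 24.5 mm²). The sketch below is only an arithmetic check: the names `AREA_MM2`, `CONFIGS` and `total_core_area` are illustrative, and the shared L2 cache is left out on the assumption that it is identical in all three systems.

```python
# Core areas in mm^2, taken from the processor table earlier in the deck.
AREA_MM2 = {"EV5": 5.06, "EV6": 24.5}

# The three configurations compared on the slide: core type -> core count.
CONFIGS = {
    "4 x EV6 (homogeneous)": {"EV6": 4},
    "20 x EV5 (homogeneous)": {"EV5": 20},
    "3 x EV6 + 5 x EV5 (heterogeneous)": {"EV6": 3, "EV5": 5},
}


def total_core_area(config):
    """Sum of per-core areas; the shared L2 cache is deliberately excluded."""
    return sum(AREA_MM2[core] * count for core, count in config.items())


if __name__ == "__main__":
    for name, config in CONFIGS.items():
        print(f"{name}: {total_core_area(config):.1f} mm^2")
```

All three totals land between about 98 and 101 mm², which is consistent with the slide's statement.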
18
Scheduling issues
- The OS scheduler is responsible for thread scheduling and assignment
- Core switches happen at OS timeslice intervals (10-100 msec)
- The core-switch overhead is piggybacked on the OS context switch
- Application phases are typically long, and hence suit these timeslices
Sampling-based scheduling:
- During the sampling phase, threads migrate between different cores and statistics are gathered for every allocation
- During the steady phase, the most beneficial allocation is used
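The two-phase scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler from the referenced papers: it samples every thread-to-core assignment exhaustively, and `measure` is a hypothetical callback standing in for whatever statistic the hardware counters report for one allocation. It also assumes there are at least as many cores as threads.

```python
from itertools import permutations


def best_assignment(threads, cores, measure):
    """Sampling phase followed by the choice used in the steady phase.

    threads: list of thread names.
    cores:   list of distinct core identifiers (len(cores) >= len(threads)).
    measure: hypothetical callback; takes a tuple of (thread, core) pairs
             and returns the score observed for that allocation.
    """
    samples = {}
    # Sampling phase: try each allocation once and record its score.
    for chosen_cores in permutations(cores, len(threads)):
        assignment = tuple(zip(threads, chosen_cores))
        samples[assignment] = measure(assignment)
    # Steady phase: run with the most beneficial allocation seen.
    return max(samples, key=samples.get)


if __name__ == "__main__":
    # Made-up IPC numbers, purely for illustration.
    ipc = {("gzip", "EV6"): 2.0, ("gzip", "EV5"): 1.0,
           ("mcf", "EV6"): 0.5, ("mcf", "EV5"): 0.45}
    total_ipc = lambda assignment: sum(ipc[pair] for pair in assignment)
    print(best_assignment(["gzip", "mcf"], ["EV6", "EV5"], total_ipc))
```

Exhaustive sampling grows quickly with the number of threads and cores, which is why restricted sampling strategies matter in practice.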
19
Scheduling issues: Objective function
- Evaluation metric: weighted speedup
- Maximizes the average performance gain over all applications
- The jobs assigned to the EV5 are those that are least affected by its inferior capabilities
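The slide does not spell out the formula, so the sketch below assumes a common definition of weighted speedup: the sum, over all threads, of each thread's IPC on its assigned core divided by its IPC on a baseline. The choice of baseline (here, each thread running alone on an EV6) and all IPC numbers are assumptions made for illustration.

```python
def weighted_speedup(ipc_assigned, ipc_baseline):
    """Sum of per-thread IPC ratios relative to an assumed baseline.

    ipc_assigned: thread -> IPC on the core it was assigned to.
    ipc_baseline: thread -> IPC on the baseline core (assumed: alone on EV6).
    """
    return sum(ipc_assigned[t] / ipc_baseline[t] for t in ipc_assigned)


if __name__ == "__main__":
    baseline = {"gzip": 2.0, "mcf": 0.5}           # made-up IPCs on EV6
    gzip_on_ev6 = {"gzip": 2.0, "mcf": 0.45}       # mcf moved to EV5
    mcf_on_ev6 = {"gzip": 1.0, "mcf": 0.5}         # gzip moved to EV5
    print(weighted_speedup(gzip_on_ev6, baseline))
    print(weighted_speedup(mcf_on_ev6, baseline))
```

With these numbers the first assignment scores higher, because the thread sent to the EV5 is the one that loses the least there.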
20
Scheduling issues: Sampling strategy
- sample-one: run each thread on each core once
- sample-avg: run each thread on each core at least twice
- sample-sched: constrained to choose only assignments that were actually sampled
21
Simulation Results: static scheduling (figure)
22
Simulation Results: dynamic scheduling, phases with constant length (figure; label: dynamic)
23
Scheduling issues: Trigger mechanism
- individual-event: whenever a thread’s IPC changes by more than 50%
- global-event: whenever the total change in IPC for all threads exceeds 100%
- bounded-global-event: the same as global-event, with minimum and maximum thresholds
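The three triggers can be sketched as small predicates. Several details are assumptions not stated on the slide: "change" is taken to be the absolute relative change in IPC, the "total change" is the sum of those changes over all threads, and the minimum and maximum thresholds of bounded-global-event are interpreted as bounds on the number of intervals since the last sampling phase (`min_gap` and `max_gap` are hypothetical parameter names).

```python
def relative_change(old_ipc, new_ipc):
    """Absolute relative change in IPC between two intervals."""
    return abs(new_ipc - old_ipc) / old_ipc


def individual_event(old, new, threshold=0.5):
    """Fire when any single thread's IPC changes by more than 50%."""
    return any(relative_change(old[t], new[t]) > threshold for t in old)


def global_event(old, new, threshold=1.0):
    """Fire when the summed IPC change over all threads exceeds 100%."""
    return sum(relative_change(old[t], new[t]) for t in old) > threshold


def bounded_global_event(old, new, intervals_since_sample,
                         min_gap, max_gap, threshold=1.0):
    """global_event with assumed lower and upper bounds on resampling."""
    if intervals_since_sample < min_gap:
        return False   # too soon after the previous sampling phase
    if intervals_since_sample >= max_gap:
        return True    # force a resample after a long quiet period
    return global_event(old, new, threshold)
```

Each function takes two dictionaries mapping thread names to IPC values for the previous and the current interval.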
24
Simulation Results: dynamic scheduling, triggered phases (figure)
27
Todo list (open questions): Exploring different objective functions
- Priorities among different threads
  - A heterogeneous architecture is ideally suited to this task
- Minimize energy consumption
  - Use a performance threshold
- Minimize the energy-delay product
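As a rough illustration of the last two objectives, the sketch below picks a core either by energy or by energy-delay product, optionally rejecting cores that exceed a slowdown threshold relative to the fastest candidate. The function names, the `max_slowdown` parameter and the execution times are assumptions made for the example; only the typical-power figures (10.68 W for EV6, 6.88 W for EV5) come from the processor table.

```python
def energy_delay_product(power_watts, exec_time_s):
    """Energy (power x time) multiplied by delay (time)."""
    return power_watts * exec_time_s * exec_time_s


def pick_core(candidates, objective="edp", max_slowdown=None):
    """Choose the core minimizing energy or energy-delay product.

    candidates:   core -> (typical power in W, execution time in s).
    objective:    "energy" or "edp".
    max_slowdown: optional performance threshold, e.g. 1.1 allows at most
                  10% longer execution time than the fastest candidate.
    """
    fastest = min(time for _, time in candidates.values())

    def cost(core):
        power, time = candidates[core]
        if objective == "energy":
            return power * time
        return energy_delay_product(power, time)

    eligible = [core for core, (_, time) in candidates.items()
                if max_slowdown is None or time <= fastest * max_slowdown]
    return min(eligible, key=cost)


if __name__ == "__main__":
    # Execution times are made up; powers are the slide's typical-power values.
    candidates = {"EV6": (10.68, 1.0), "EV5": (6.88, 1.2)}
    print(pick_core(candidates, "energy"))
    print(pick_core(candidates, "edp"))
    print(pick_core(candidates, "energy", max_slowdown=1.1))
```

With these made-up times the EV5 wins on both energy and energy-delay product, but a 10% performance threshold rules it out and the EV6 is chosen instead.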
28
Todo list (open questions)
Energy-delay product during applu's lifetime (figure)
29
Todo list (open questions): Exploring different objective functions
Great potential for energy saving:
- Previous work, considering only one thread at a time, achieved energy savings of more than 30%
- Idle cores can be shut down
- The objective function may change on the fly according to changing power conditions
30
Using SMT Cores
Motivation:
- Increase flexibility and throughput with only a modest area and power penalty
No free lunches:
- Interaction between threads can no longer be ignored
- Threads compete for virtually all processor resources
- Only sampled assignments can be used
- The permutation space of potential assignments is huge
- Not all assignments can be sampled
- The sampling space must be pruned
- The sampling strategy is much more important
31
Simulation Results (SMT): heterogeneous system with SMT cores (figure)
32
References: SOS (Sample, Optimize, Symbios)
- “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), November 2000.
- “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June 2002.
33
SOS (Sample, Optimize, Symbios)
- A first attempt to take thread interaction in SMT into account
- Symbiosis: the effectiveness with which multiple jobs achieve speedup when coexecuted on multithreaded machines
- Throughput may actually go down
Plundered from Uri’s slides.
34
SOS (Sample, Optimize, Symbios)
- Use sampling phases to profile execution
- Choose the combination that maximizes overall weighted speedup
- Which predictor to use?
- Extremely architecture dependent
- Encapsulation of hardware details from software
- IPC and D-cache are inconsistent performers
35
The moral of the SMT case: thread interaction significantly complicates our life
- When considering more ‘clustered’ architectures, groups of cores may share resources: L1 caches, memory hierarchy, FPU, TLB?
- SMT and CMP are just two extremes of a viable spectrum
36
The moral of the SMT case: thread interaction significantly complicates our life
A scheduler has two tasks:
- Define a running set: the jobs to be executed during the upcoming timeslice
- Assign jobs from the running set to the different cores
We have overlooked the first task and simplified the second
- We’ll have to tackle both
If only memory is to be shared, our life may be easier
- Not by that much, though
37
References: Phase tracking
- “Phase Tracking and Prediction”, Timothy Sherwood, Suleyman Sair, Brad Calder. Proceedings of the 30th Annual International Symposium on Computer Architecture, IEEE CS Press, 2003, pp. 336-349.
- “Discovering and Exploiting Program Phases”, Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder. IEEE Micro: Micro's Top Picks from Computer Architecture Conferences, Nov./Dec. 2003.
38
Todo list (open questions): Smart phase detection
It is profitable to accurately identify program phases:
- React quickly to phase changes
- Spare sampling overheads
IPC is not necessarily the most appropriate indicator
39
Phase Tracking: Main idea
- Phases are a direct function of the way a program traverses its code during execution
- Use basic-block ratios to identify phases
- Architecture independent
Basic Block Vector:
- A one-dimensional array with an index for every basic block in the program
- Each element represents the execution frequency of a basic block, weighted by its instruction count and normalized
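A basic block vector as described above can be sketched in a few lines. The code below stores the vector sparsely as a dictionary rather than a dense array, assumes a non-empty trace of block identifiers for one execution interval, and compares two vectors with the Manhattan distance. That distance is the measure I recall from the phase-tracking work cited in this deck, but the slide itself does not name one, so treat it as an assumption. All function and parameter names are illustrative.

```python
from collections import Counter


def basic_block_vector(executed_blocks, block_sizes):
    """Build a normalized basic block vector for one execution interval.

    executed_blocks: non-empty iterable of block ids, in execution order.
    block_sizes:     block id -> number of instructions in that block.
    """
    counts = Counter(executed_blocks)
    # Weight each block's execution count by its instruction count.
    weighted = {block: n * block_sizes[block] for block, n in counts.items()}
    total = sum(weighted.values())
    # Normalize so that the elements sum to 1.
    return {block: w / total for block, w in weighted.items()}


def manhattan_distance(bbv_a, bbv_b):
    """Sum of element-wise absolute differences (missing blocks count as 0)."""
    blocks = set(bbv_a) | set(bbv_b)
    return sum(abs(bbv_a.get(b, 0.0) - bbv_b.get(b, 0.0)) for b in blocks)


if __name__ == "__main__":
    sizes = {"A": 2, "B": 4}
    first = basic_block_vector(["A", "A", "B"], sizes)
    second = basic_block_vector(["B", "B"], sizes)
    print(first, second, manhattan_distance(first, second))
```

Two intervals whose vectors are close under this distance would be treated as belonging to the same phase; the threshold for "close" is a tuning choice not covered here.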
41
Phase Tracking: phase capture (figure)
42
Work’s Innovation
- Using heterogeneous multi-cores to gain superior performance
  - Previous works targeted only power consumption
  - First real simulation results
- General-purpose processors
  - Previous works concentrated on SoCs with known workloads
- Dynamic task scheduling and task-to-core assignment
  - Most works use static scheduling and full knowledge of the application characteristics
43
Summary
- A heterogeneous multi-core architecture can provide significantly higher performance
- It covers a wide spectrum of workloads
- A dynamic core-assignment policy exploits intra-thread and inter-thread diversity
Open issues:
- Optimizing energy consumption
- Thread interactions
- Phase detection
- A lot more…
Any questions?
45
References of the day
- “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June 2002.
- “Conjoined-core Chip Multiprocessing”, Rakesh Kumar, Norman P. Jouppi, Dean M. Tullsen. In Proceedings of the 37th International Symposium on Microarchitecture, December 2004.
46
Backup
47
Using Multithreaded Cores: Sampling strategies
Sample 2n configurations for an n-thread workload.
Pruning strategies:
- pref-EV6: assumes it is best to run 2 threads on each EV6 before using an EV5
- pref-EV5: assumes it is best to run on an EV5 rather than put 2 threads on an EV6
- pref-neither: sample a random schedule
- pref-similar: sampling is biased toward configurations similar to the one currently in use
48
Workload construction
- 8 benchmarks from SPEC2000
- The number of threads varies up to the maximum number of available processor contexts
- Various compositions are simulated
(figure labels: large memory footprint, int, fp)