Single-ISA Heterogeneous Multi-Core Architecture Zvika Guz November, 2004.


2 Outline
- Motivation
- Heterogeneous multi-core architecture
- To-do list and open questions
  - Different objective functions
  - SMT as building blocks
  - Phase detection
- Summary

3 References: Heterogeneous multi-core architecture
- “Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance”, Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, Keith I. Farkas. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04), June 2004
- “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. In Proceedings of the 36th International Symposium on Microarchitecture, December 2003
- “A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003
- “Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. Computer Architecture Letters, Volume 2, Apr

4 Processor characteristics
- Diminishing performance return per chip area
- The infamous power/performance ratio
- The power wall
- Chip area is bounded
- Tomer’s assumptions:
- VLSI sad facts of life:

5 A few different generations of Alpha cores, all scaled to 0.10 micron. EV6+

6 Processor characteristics
A few different generations of Alpha cores, all scaled to 0.10 micron

Processor     | EV4      | EV5       | EV6            | EV8-
Issue width   | 2        | 4         | 6 (OOO)        | 8 (OOO)
I-Cache       | 8 KB, DM | 8 KB, DM  | 64 KB, 2-way   | 64 KB, 4-way
D-Cache       | 8 KB, DM | 8 KB, DM  | 64 KB, 2-way   | 64 KB, 4-way
Branch pred.  |          | 2K gshare | hybrid 2-level | hybrid 2-level (2X EV6 size)

Rows whose values are not preserved in the transcript: Threads, Area (mm^2), Peak power (Watt), Typical power (Watt)
Annotations on the slide: 4.8x, 1.5x

7 Processor characteristics
- ⇒ A large number of small processors is better than a small number of large processors
- Processors are expected to meet competing objectives:
  - High throughput in multithreaded environments
  - Good single-thread performance
- But what if TLP isn’t large enough?

8 Workload characteristics: great diversity among different applications
- Different amounts of ILP
- Different TLP
  - Among different applications
  - Among different workloads
  - Varies with time
- Legacy code
- Many applications under-utilize the hardware
  - They suffer little performance loss when run on a less aggressive processor

9 Workload characteristics: wildly different intra-thread behavior
- Programs fall into phases; each phase exhibits different behavior
  - Variation in resource demands
  - Memory bound vs. computation bound
  - Branch mispredictions
  - Cache misses
- During many phases the processor is under-utilized

10 Workload characteristics: wildly different intra-thread behavior (figure: gzip)

11 Workload characteristics: wildly different intra-thread behavior (figure: gcc)

14 Main Idea: Single-ISA heterogeneous multi-core
- A multiprocessor composed of asymmetric cores
  - Better area-efficient coverage of the different workload demands:
    - Single-thread performance (legacy code)
    - Elevated throughput for high TLP
- Use a smart dynamic task-to-core assignment
  - Assign each application to the core best suited to meet its performance demands
  - Exploit the variations in resource demands between an application's different phases
- Use off-the-shelf cores
  - Amortize the design and verification effort

15 The Potential of Heterogeneity

16 Architecture Model
- Three multi-core systems were compared:
  - 4 EV6 cores (homogeneous MP)
  - 20 EV5 cores (homogeneous MP)
  - 3 EV6 and 5 EV5 cores (heterogeneous MP)
- Each core has its own L1 caches
- All cores share an on-chip 4 MB L2 cache
- The chip area of all three configurations is roughly the same
- Using the correct power model, so is the total power…
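The equal-area claim can be checked with a little arithmetic. The sketch below assumes one EV6 occupies roughly the area of five EV5 cores; that ratio is inferred from the first two configurations (4 EV6 vs. 20 EV5) and is not stated on the slide. The function name and the "EV5 equivalents" unit are illustrative only.

```python
# Assumption: area is measured in "EV5 equivalents", and the 5:1 ratio
# is inferred from the slide's claim that 4 EV6 cores ~ 20 EV5 cores.
EV5_AREA = 1.0
EV6_AREA = 5.0


def chip_area(num_ev6, num_ev5):
    """Total core area of a configuration, in EV5 equivalents."""
    return num_ev6 * EV6_AREA + num_ev5 * EV5_AREA


# The three configurations compared on the slide: (EV6 count, EV5 count).
configurations = {
    "4 EV6 (homogeneous)": (4, 0),
    "20 EV5 (homogeneous)": (0, 20),
    "3 EV6 + 5 EV5 (heterogeneous)": (3, 5),
}

areas = {name: chip_area(*counts) for name, counts in configurations.items()}

for name, area in areas.items():
    print(f"{name}: {area} EV5 equivalents")
```

Under this assumed ratio all three configurations come out to the same core area, which is consistent with the slide's "roughly the same" statement.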

18 Scheduling issues
- The OS scheduler is responsible for thread scheduling and assignment
  - Core switches happen at OS timeslice intervals (10-100 ms)
  - The core-switch overhead is piggybacked on the OS context switch
  - Application phases are typically long, so they suit these timeslices
- Sampling-based:
  - During the sampling phase:
    - Threads migrate between different cores
    - Statistics are gathered for every allocation
  - During the steady phase:
    - The most beneficial allocation is used
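The sampling/steady split can be sketched as a small search loop. This is a minimal illustration, not the scheduler from the referenced papers: `measure` is a hypothetical callback standing in for "run this allocation for a while and gather statistics", and the IPC table in the usage example is made up.

```python
from itertools import permutations


def sample_and_choose(threads, cores, measure, max_samples=None):
    """Sampling phase: try thread-to-core allocations and keep the best one.

    `measure(allocation)` is a hypothetical callback that returns a score
    (higher is better) for a {thread: core} allocation.
    The returned allocation is the one to use during the steady phase.
    """
    best_allocation, best_score = None, float("-inf")
    for i, chosen in enumerate(permutations(cores, len(threads))):
        if max_samples is not None and i >= max_samples:
            break  # sampling budget exhausted
        allocation = dict(zip(threads, chosen))
        score = measure(allocation)
        if score > best_score:
            best_allocation, best_score = allocation, score
    return best_allocation, best_score


# Made-up per-thread IPC on each core type, for illustration only.
ipc_table = {
    ("a", "EV6"): 2.0, ("a", "EV5"): 1.0,
    ("b", "EV6"): 1.1, ("b", "EV5"): 1.0,
}


def total_ipc(allocation):
    return sum(ipc_table[(t, c)] for t, c in allocation.items())


steady_allocation, score = sample_and_choose(["a", "b"], ["EV6", "EV5"], total_ipc)
print(steady_allocation, score)
```

Thread "b" loses little on the smaller core, so the sketch keeps "a" on the EV6, which mirrors the intuition that the big core should go to the thread that benefits most.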

19 Scheduling issues: objective function
- Evaluation metric: weighted speedup
  - Maximizes the average performance gain over all applications
- The jobs assigned to the EV5 cores are those least affected by its inferior capabilities
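The slide names the metric without giving a formula. A common definition, used here as an assumption, sums each thread's IPC in the evaluated allocation divided by its IPC when running alone on a baseline core. The dictionary names and the numbers below are illustrative.

```python
def weighted_speedup(ipc_in_allocation, ipc_baseline):
    """Sum over threads of (IPC in this allocation) / (IPC alone on a baseline).

    Assumption: this is the usual weighted-speedup definition; the slide
    itself does not spell the formula out.
    """
    return sum(
        ipc_in_allocation[thread] / ipc_baseline[thread]
        for thread in ipc_in_allocation
    )


# Illustrative numbers: each thread runs at half of its baseline IPC.
example = weighted_speedup({"a": 1.0, "b": 0.5}, {"a": 2.0, "b": 1.0})
print(example)
```

Because each term is normalized by the thread's own baseline, a thread with naturally low IPC counts as much as a high-IPC one, so the metric does not reward starving slow threads.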

20 Scheduling issues: sampling strategy
- sample-one: run each thread on each core once
- sample-avg: run each thread on each core at least twice
- sample-sched: constrained to choose only assignments that were actually sampled

21 Simulation Results Static scheduling

22 Simulation Results Dynamic scheduling, phases with constant length dynamic

23 Scheduling issues: trigger mechanism
- individual-event: whenever a thread’s IPC changes by more than 50%
- global-event: whenever the total change in IPC for all threads exceeds 100%
- bounded-global-event: the same as global-event, with minimum and maximum thresholds
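The three triggers can be written as simple predicates over per-thread IPC measurements. This sketch rests on two assumptions that the slide does not confirm: "total change" is read as the sum of per-thread relative changes, and the minimum and maximum thresholds are read as bounds on the time elapsed since the last sampling phase. All names and parameters are hypothetical, and IPC values are assumed to be positive.

```python
def relative_changes(prev_ipc, cur_ipc):
    """Per-thread relative IPC change, e.g. 0.5 means a 50% change."""
    return {t: abs(cur_ipc[t] - prev_ipc[t]) / prev_ipc[t] for t in prev_ipc}


def individual_event(prev_ipc, cur_ipc, threshold=0.5):
    """Trigger when any single thread's IPC changes by more than 50%."""
    return any(c > threshold for c in relative_changes(prev_ipc, cur_ipc).values())


def global_event(prev_ipc, cur_ipc, threshold=1.0):
    """Trigger when the summed change over all threads exceeds 100%."""
    return sum(relative_changes(prev_ipc, cur_ipc).values()) > threshold


def bounded_global_event(prev_ipc, cur_ipc, elapsed, min_gap, max_gap, threshold=1.0):
    """global_event with bounds on the time since the last sampling phase.

    Assumption: never resample before `min_gap` intervals have elapsed,
    and always resample once `max_gap` intervals have elapsed.
    """
    if elapsed < min_gap:
        return False
    if elapsed >= max_gap:
        return True
    return global_event(prev_ipc, cur_ipc, threshold)
```

For example, with a previous IPC of 1.0 on both threads, a jump of one thread to 1.75 fires `individual_event` but not `global_event`, since the summed change is only 75%.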

24 Simulation Results Dynamic scheduling, triggered phases

27 To-do list (open questions): exploring different objective functions
- Priorities among different threads
  - A heterogeneous architecture is ideally suited to this task
- Minimize energy consumption
  - Use a performance threshold
- Minimize the energy-delay product
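The last two objectives can be compared on sampled data. The sketch below assumes each core has been sampled for one thread and yields an (energy, delay) pair; the numbers, names, and the delay-based form of the performance threshold are made up for illustration.

```python
def energy_delay_product(energy_j, delay_s):
    """Energy-delay product: lower is better."""
    return energy_j * delay_s


def pick_min_edp(samples):
    """Core with the smallest energy-delay product.

    `samples` maps a core name to a hypothetical (energy_j, delay_s) pair.
    """
    return min(samples, key=lambda core: energy_delay_product(*samples[core]))


def pick_min_energy(samples, max_delay_s):
    """Lowest-energy core whose delay stays within the performance threshold.

    Returns None when no core meets the threshold.
    """
    eligible = {core: e for core, (e, d) in samples.items() if d <= max_delay_s}
    return min(eligible, key=eligible.get) if eligible else None


# Made-up samples for one thread: the big core is faster but costs more energy.
samples = {"EV6": (10.0, 1.0), "EV5": (3.0, 2.0)}
print(pick_min_edp(samples))
print(pick_min_energy(samples, max_delay_s=1.5))
```

With these numbers the EV5 wins on energy-delay product, but a tight performance threshold forces the thread back onto the EV6, which shows how the choice of objective function changes the assignment.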

28 To-do list (open questions): energy-delay product during applu's lifetime

29 To-do list (open questions): exploring different objective functions
- Great potential for energy saving:
  - Previous work, considering only one thread at a time, achieved energy savings of more than 30%
  - Idle cores can be shut down
- The objective function may change on the fly according to changing power conditions

30 Using SMT Cores
- Motivation: greater flexibility and throughput for only a modest area and power penalty
- No free lunch: interaction between threads can no longer be ignored
  - Threads compete for virtually all processor resources
  - Only sampled assignments can be used
- The permutation space of potential assignments is huge
  - Not all assignments can be sampled
  - The sampling space must be pruned
  - The sampling strategy becomes much more important
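To get a feel for how quickly the assignment space grows, the sketch below counts the ways to place distinct threads on distinct hardware contexts. It is an upper bound under a simplifying assumption: every context is treated as distinguishable, so allocations that differ only by swapping identical cores are counted separately. The context count of 8 is hypothetical.

```python
import math


def count_assignments(num_threads, num_contexts):
    """Ways to place distinct threads on distinct hardware contexts.

    Assumption: every context is distinguishable, so this overcounts
    when several cores are identical.
    """
    if num_threads > num_contexts:
        return 0  # not every thread can get its own context
    return math.perm(num_contexts, num_threads)


# Hypothetical machine with 8 hardware contexts.
for n in (2, 4, 8):
    print(n, "threads:", count_assignments(n, 8))
```

Even at this small scale the count reaches tens of thousands, and each sample costs at least one timeslice, which is why pruning the sampling space matters.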

31 Simulation Results (SMT) Heterogeneous system with SMT cores

32 References: SOS (Sample, Optimize, Symbios)
- “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), November 2000
- “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June 2002

33 SOS (Sample, Optimize, Symbios)
- A first attempt to take thread interaction in SMT into account
- Symbiosis: the effectiveness with which multiple jobs achieve speedup when coexecuted on multithreaded machines
  - Throughput may actually go down
- Plundered from Uri’s slides.

34 SOS (Sample, Optimize, Symbios)
- Uses sampling phases to profile execution
  - Chooses the combination that maximizes overall weighted speedup
- Which predictor to use?
- Extremely architecture dependent
  - Encapsulation of hardware details from software
- IPC and Dcache are inconsistent performers

35 The moral of the SMT case: thread interaction significantly complicates our life
- When considering more ‘clustered’ architectures, groups of cores may share resources:
  - L1 caches
  - Memory hierarchy
  - FPU, TLB?
- SMT and CMP are just two extremes of a viable spectrum

36 The moral of the SMT case: thread interaction significantly complicates our life
- A scheduler has 2 tasks:
  - Define a running set: the jobs to be executed during the upcoming timeslice
  - Assign jobs from the running set to the different cores
- We have overlooked the first task and simplified the second
- We’ll have to tackle both
- If only memory is to be shared, our life may be easier
  - Not by that much, though

37 References: Phase tracking
- “Phase Tracking and Prediction”, Timothy Sherwood, Suleyman Sair, Brad Calder. Proceedings of the 30th Annual International Symposium on Computer Architecture, IEEE CS Press, 2003, pp
- “Discovering and Exploiting Program Phases”, Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder. IEEE Micro: Micro's Top Picks from Computer Architecture Conferences, Nov./Dec

38 To-do list (open questions): smart phase detection
- It is profitable to identify program phases accurately
  - React quickly to phase changes
  - Spare sampling overheads
- IPC is not necessarily the most appropriate representative

39 Phase Tracking: main idea
- Phases are a direct function of the way a program traverses its code during execution
  - Use basic-block ratios to identify phases
  - Architecture independent
- Basic Block Vector
  - A one-dimensional array with an entry for every basic block in the program
  - Each element represents the execution frequency of its basic block, weighted by instruction count and normalized
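The Basic Block Vector description translates directly into a few lines of code. The sketch below builds a normalized vector for one execution interval and compares two intervals with the Manhattan distance; the choice of distance, the dictionary-based representation, and all names are assumptions made for illustration.

```python
def basic_block_vector(exec_counts, block_sizes):
    """Normalized Basic Block Vector for one execution interval.

    exec_counts: {block_id: times the block executed in the interval}
    block_sizes: {block_id: number of instructions in the block}
    Each entry is the block's execution count weighted by its instruction
    count, divided by the total so that the entries sum to 1.
    """
    weighted = {b: exec_counts.get(b, 0) * size for b, size in block_sizes.items()}
    total = sum(weighted.values())
    return {b: (w / total if total else 0.0) for b, w in weighted.items()}


def manhattan_distance(bbv_a, bbv_b):
    """Sum of absolute per-block differences; 0 means identical behavior."""
    blocks = set(bbv_a) | set(bbv_b)
    return sum(abs(bbv_a.get(b, 0.0) - bbv_b.get(b, 0.0)) for b in blocks)


# Two made-up intervals of a program with basic blocks A (2 instrs) and B (4 instrs).
sizes = {"A": 2, "B": 4}
interval_1 = basic_block_vector({"A": 4, "B": 0}, sizes)
interval_2 = basic_block_vector({"A": 0, "B": 2}, sizes)
print(manhattan_distance(interval_1, interval_2))
```

Intervals that execute the same code in the same proportions have a distance near 0, while intervals that share no code reach the maximum of 2, so a threshold on this distance can flag a phase change without any architecture-specific counters.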

41 Phase Tracking Phase capture:

42 The Work’s Innovation
- Using heterogeneous multi-cores to gain superior performance
  - Previous works targeted only power consumption
  - First real simulation results
- General-purpose processors
  - Previous works concentrated on SoCs with known workloads
- Dynamic task scheduling and task-to-core assignment
  - Most other works use static scheduling and full knowledge of the application characteristics

43 Summary
- A heterogeneous multi-core architecture can provide significantly higher performance
- It covers a wide spectrum of workloads
- A dynamic core-assignment policy exploits intra-thread and inter-thread diversity
- Open issues:
  - Optimizing energy consumption
  - Thread interactions
  - Phase detection
  - A lot more…
- Any questions?

44 References of the day
- “Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance”, Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, Keith I. Farkas. In Proceedings of the 31st International Symposium on Computer Architecture (ISCA’04), June 2004
- “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. In Proceedings of the 36th International Symposium on Microarchitecture, December 2003
- “A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. In Proceedings of the Workshop on Complexity-Effective Design (WCED), June 2003
- “Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures”, Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen. Computer Architecture Letters, Volume 2, Apr
- “Phase Tracking and Prediction”, Timothy Sherwood, Suleyman Sair, Brad Calder. Proceedings of the 30th Annual International Symposium on Computer Architecture, IEEE CS Press, 2003
- “Discovering and Exploiting Program Phases”, Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder. IEEE Micro: Micro's Top Picks from Computer Architecture Conferences, Nov./Dec
- “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), November 2000

45 References of the day
- “Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor”, Allan Snavely, Dean M. Tullsen. In Proceedings of the 9th International Conference on Measurement and Modeling of Computer Systems (Sigmetrics 02), June 2002
- “Conjoined-core Chip Multiprocessing”, Rakesh Kumar, Norman P. Jouppi, Dean M. Tullsen. In Proceedings of the 37th International Symposium on Microarchitecture, December 2004

46 Backup

47 Using Multithreaded Cores: sampling strategies
- Sample 2n configurations for an n-thread workload
- Pruning strategies:
  - pref-EV6: assumes it is best to run 2 threads on each EV6 before using an EV5
  - pref-EV5: assumes it is best to run on an EV5 rather than put 2 threads on an EV6
  - pref-neither: samples random schedules
  - pref-similar: sampling is biased toward configurations similar to the one currently in use

48 Workload construction
- 8 benchmarks from SPEC2000
- The number of threads varies up to the maximum number of available processor contexts
- Various compositions are simulated
- Benchmark categories shown on the slide: large memory footprint, int, fp