Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP of SMT Processors Authors: Matthew DeVuyst, Rakesh Kumar, and Dean M. Tullsen.


Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP of SMT Processors Authors: Matthew DeVuyst, Rakesh Kumar, and Dean M. Tullsen Source: 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS '06)

Outline Introduction Architecture Scheduling Policies –Sampling-based Policies –Electron Policies Experimental Methodology Analysis and Results Conclusions


Introduction Given a set of applications and a set of multithreaded cores, the space of possible schedules of threads to cores can be enormous – which threads, and how many, are co-scheduled on each core. (Figure: seven applications assigned unevenly across four cores.)

Introduction (cont.) Conventional multiprocessor schedulers, applied to this architecture, will always seek to balance the number of threads on each core. This research demonstrates that such an approach eliminates one of the big advantages of this architecture – the ability to use unbalanced schedules to allocate the right amount of execution resources to each thread.

Introduction (cont.) (Figure: three applications on two cores. Spreading the threads across both cores gives high marginal performance at high marginal power; clustering them on one core gives lower marginal performance but less marginal power, and lets the idle core be power gated.)

Introduction (cont.) This paper examines system-level thread scheduling policies for a chip multithreaded architecture. Particular attention is paid to enabling performance and energy efficiency through unbalanced schedules. These schedules give the system the ability to –Cluster threads that have low execution demands –Amortize the power cost of using a core


Architecture The architecture is a chip multiprocessor consisting of –4 homogeneous simultaneous multithreaded cores –4 contexts per core, for a total of 16 contexts in the system –Each SMT core is out-of-order –Implemented in a 0.1μm process –Shared L2 and L3 caches –Each core has its own L1 data cache and L1 instruction cache –The implementation is assumed to run at 2.1 GHz, and latencies are determined accordingly (Figure: 4 cores × 4 contexts = 16 hardware contexts.)

Architecture (cont.) When a thread migrates to another core –The dirty data in the L1 data cache of the core it was previously running on is written both to the L1 cache of the core on which the thread now runs and to the shared L2 cache. In this work, we assume that unused cores are completely powered down, rather than left idle. –Unused cores suffer no static leakage or dynamic switching power –It takes 30μs to turn a core on or off, which includes enough time for dirty data to be flushed from the caches

Architecture (cont.)


Scheduling Policies We assume an operating-system-level thread scheduler that makes global (CMP-wide) scheduling decisions. The scheduler typically makes scheduling decisions based on sampled data of processor power and performance. –Processor power and performance can be estimated by reading counter registers already found in most modern processors –New samples are collected, and new scheduling decisions can be made, as often as every operating system timer interrupt or whenever the mix of jobs changes

Scheduling Policies (cont.) Assume all jobs have equal priority. The scheduling policies we evaluate come in two broad categories: –Sampling-based policies –Electron policies


Sampling-based policies The sampling-based policies work in a series of alternating temporal phases: –A sampling phase –A steady phase During the sampling phase, a number of different schedules are tried; at the end of the sampling phase the sample schedule with the best metric value is chosen as the schedule to be in effect throughout the steady phase.
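The alternating sampling/steady structure described above can be sketched as follows. Here `measure_edp` stands in for the hardware power/performance counters, and all names and parameters are illustrative assumptions, not the paper's actual implementation:

```python
import random

def sampling_phase(threads, num_cores, measure_edp, num_samples=8,
                   rng=random):
    """Try a handful of candidate schedules and return the one with the
    best (lowest) metric; the caller then keeps it for the entire
    steady phase until the next sampling phase begins."""
    # A schedule maps each thread to a core; cores that receive no
    # threads can be power gated.
    candidates = [tuple(rng.randrange(num_cores) for _ in threads)
                  for _ in range(num_samples)]
    # Sampling phase: evaluate each candidate briefly; pick the winner.
    return min(candidates, key=measure_edp)
```

A real implementation would run each candidate for a short interval and read counters, rather than calling a metric function directly; the sketch only shows the try-several-then-commit structure.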

Sampling-based policies (cont.) Consider the following sampling-based scheduling policies –Balanced Symbiosis –Symbiosis –Balanced Random –Prefer Last – Numbers –Prefer Last – Swap –Prefer Last – Move All these policies are evaluated assuming that any unused cores are power gated.

Sampling-based policies (cont.) We consider four system-level objective functions in this paper –Performance: the metric for performance is weighted speedup. –Power: the sum of the power of all the cores plus the power of the shared structures. –Energy –Energy-Delay Product (EDP)
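The slide names the metrics only briefly; for concreteness, these are the standard textbook definitions of weighted speedup and energy-delay product (a sketch of common usage, not necessarily the paper's exact formulation):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Weighted speedup: each thread's IPC under the co-schedule,
    normalized by its IPC when running alone, summed over threads.
    Higher is better; a value near len(threads) means little slowdown."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def energy_delay_product(power_watts, seconds):
    """EDP = energy * delay = (P * t) * t; lower is better."""
    return power_watts * seconds ** 2
```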


Electron Policies The policies in the previous section rely on sampling for intelligent scheduling. –However, such policies become less effective as the search space expands. Rather than sampling, the electron policies rely on more explicit evidence that a particular core is over-scheduled, under-scheduled, or just poorly scheduled.

Electron Policies (cont.) Cores will attract threads that fill a void and repel threads when contention is high. Threads will move around each interval to create a better fit. The schedule naturally adapts as threads enter new phases of execution.

Electron Policies (cont.) Consider the following electron policies customized for each metric –Electron – Performance –Electron – Energy –Electron – EDP For the electron policies, the duration of a period is moderately long. If a new schedule results in a core being left idle, that core is powered down immediately.

Electron Policies (cont.) If the new schedule calls for cores to be either powered on or powered off, the scheduler does not sample any performance or power counters during the transition. Note that electron schedulers run the risk of oscillating between two schedules. –To help avoid such situations, history information is incorporated into the scheduling policy decisions. –The electron scheduling policies remain able to adapt to workload phase changes.
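One way to picture an electron-style move with an oscillation guard is the sketch below; the move rule, the two-move history window, and all names are assumptions for illustration, not the paper's algorithm:

```python
def electron_step(load, metric, history, contexts=4):
    """One illustrative 'electron' move: the most contended core repels
    a thread and the least contended core attracts it, unless that
    would just undo a recent move (the history check guards against
    oscillating between two schedules)."""
    src = max(load, key=metric)   # over-scheduled core repels a thread
    dst = min(load, key=metric)   # under-scheduled core attracts one
    if src == dst or load[src] == 0 or load[dst] >= contexts:
        return None               # nothing useful to move
    if (dst, src) in history[-2:]:
        return None               # would reverse a recent move: skip
    load[src] -= 1
    load[dst] += 1
    history.append((src, dst))
    return (src, dst)
```

Using per-core thread count as the contention metric, a 3-vs-1 load on two cores is evened out in one step, after which the guard and the tie between cores prevent further churn.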


Experimental Methodology Our thread scheduler is assumed to be a part of the operating system. New samples are collected and new scheduling decisions can be made at every operating system time-slice interval. We have assumed an operating system time-slice of a quarter million cycles. This time-slice is artificially short to keep our simulations from taking too long.
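Assuming the 2.1 GHz clock from the Architecture slide, a quick calculation shows why a quarter-million-cycle time-slice is artificially short compared with the millisecond-scale slices of a real OS:

```python
# Convert the simulated time-slice from cycles to wall-clock time,
# assuming the 2.1 GHz clock stated earlier in the deck.
CLOCK_HZ = 2.1e9
TIME_SLICE_CYCLES = 250_000

slice_us = TIME_SLICE_CYCLES / CLOCK_HZ * 1e6
print(f"time-slice ~= {slice_us:.0f} us")  # ~119 us vs. 1-10 ms in a real OS
```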

Experimental Methodology (cont.) Twelve benchmarks from the SPEC 2000 benchmark suite were chosen to construct workloads for our evaluation. –Each benchmark was fast-forwarded for 2 billion instructions before detailed simulation.

Experimental Methodology (cont.) All our simulations are done using a chip-multithreaded multiprocessor derivative of SMTSIM. Appropriate modifications were made to simulate the effect of OS-level scheduling as well as the availability of hardware counters. To gather power statistics, we integrated a modified version of Wattch into our simulator.


Analysis and Results The marginal performance achieved by using an additional core in a CMP of SMTs is typically higher than the marginal performance improvement from using an additional SMT context on the same core. On the other hand, the converse is true for energy: energy efficiency increases with the number of contexts in operation, so we tend to aggregate threads when scheduling for energy. (Figure: eight cores illustrating the marginal cost of activating an additional core.)

Analysis and Results (cont.) Scheduling for a CMP of SMTs such that both power and performance are optimized, then, requires a careful balance between these two competing objectives.

Analysis and Results (cont.) We perform all our evaluations in this section using the energy-delay product (EDP) metric. EDP recognizes the importance of both power and performance and is used widely as an important objective function for desktop as well as server processors.

Unbalanced Scheduling Static Ideal –The best static schedule when unbalanced schedules are allowed; compared against the best static scheduling policies that are constrained to be balanced. Static Balanced –Ensures that each core runs the same number of threads and, hence, is similar to traditional load-balancing schedulers. Cluster Balanced –Keeps on only as many cores as are necessary to run the given number of threads, power-gating the rest; among the active cores, each core runs the same number of threads.
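To see how many distinct per-core thread distributions these policies choose among, one can enumerate them for the 4-core, 4-context machine; this enumeration is an illustrative sketch, not the paper's search code:

```python
from itertools import combinations_with_replacement

def thread_distributions(num_threads, num_cores=4, contexts=4):
    """All ways to split num_threads across identical SMT cores,
    as per-core thread counts sorted largest first (cores with a
    count of 0 are power gated).  Thread identity is ignored, so
    this counts shapes of schedules, balanced and unbalanced."""
    out = set()
    for parts in combinations_with_replacement(range(contexts + 1),
                                               num_cores):
        if sum(parts) == num_threads:
            out.add(tuple(sorted(parts, reverse=True)))
    return sorted(out, reverse=True)
```

For 6 threads there are 7 distinct shapes; a balanced scheduler considers only one or two of them, while Static Ideal may pick any.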

Unbalanced Scheduling (cont.) 12% 6.6%

Unbalanced Scheduling (cont.) 62.5% 75%

Unbalanced Scheduling (cont.) The advantages due to unbalanced scheduling depend on the characteristics of the workload. –Benchmarks gcc and gzip are averse to running with other applications due to high ICache working-set sizes and high core utilization, respectively. –Balanced scheduling policies force these applications to be co-scheduled with some other thread on the same core, resulting in a significant performance hit. –However, Static Ideal allows these applications to be on a core by themselves, resulting in high overall efficiency.

Exploring the Search Space through Directed Sampling (Chart slide; visible data labels: 2.3% and 7.8%.)

Exploring the Search Space through Directed Sampling (cont.) 91% 59% 89%90%

Exploring the Search Space through Directed Sampling (cont.) 1.9% 5.9% 2.3% 7.8%

Non-sampling Strategies (Chart slide; visible data labels: 2X and 2.4X.)

Non-sampling Strategies (cont.) 83% 99%100%

Conclusions A traditional multiprocessor scheduler will not identify the best schedules. A good scheduler must be able to explore both balanced and unbalanced schedules. This paper proposes several operating-system-level thread scheduling policies for which both performance and energy are first-class concerns, and which adapt dynamically to current program behavior. We show gains, versus a random scheduler that always uses balanced schedules, of 6-11% in energy-delay product.