4. Workload directed adaptive SMP multicores

Slides:



Advertisements
Similar presentations
Operating Systems: Internals and Design Principles
Advertisements

Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Pree Thiengburanathum Advanced computer architecture Oct 24,
Distributed Systems CS
Energy-Efficient System Virtualization for Mobile and Embedded Systems Final Review 2014/01/21.
Institute of Networking and Multimedia, National Taiwan University, Jun-14, 2014.
Project Overview 2014/05/05 1. Current Project “Research on Embedded Hypervisor Scheduler Techniques” ◦ Design an energy-efficient scheduling mechanism.
ECE 510 Brendan Crowley Paper Review October 31, 2006.
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
Low Power Processor --- For Mobile Devices Ziwei Zheng Wenjia Ouyang.
Adaptive Video Coding to Reduce Energy on General Purpose Processors Daniel Grobe Sachs, Sarita Adve, Douglas L. Jones University of Illinois at Urbana-Champaign.
CS 423 – Operating Systems Design Lecture 22 – Power Management Klara Nahrstedt and Raoul Rivas Spring 2013 CS Spring 2013.
Microprocessors SUBTITLE Team 3: David Meadows David Foster Sichao Ni Khareem Gordon.
Authors: Tong Li, Dan Baumberger, David A. Koufaty, and Scott Hahn [Systems Technology Lab, Intel Corporation] Source: 2007 ACM/IEEE conference on Supercomputing.
CPU Scheduling - Multicore. Reading Silberschatz et al: Chapter 5.5.
Computer System Architectures Computer System Software
Chapter 4 Threads, SMP, and Microkernels Dave Bremer Otago Polytechnic, N.Z. ©2008, Prentice Hall Operating Systems: Internals and Design Principles, 6/E.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
Operating Systems Lecture 09: Threads (Chapter 4)
Silberschatz, Galvin and Gagne  2002 Modified for CSCI 399, Royden, Operating System Concepts Operating Systems Lecture 1 Introduction Read:
Operating System 4 THREADS, SMP AND MICROKERNELS
KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor Christoffer Dall Department of Computer Science Columbia University
I-SPAN’05 December 07, Process Scheduling for the Parallel Desktop Designing Parallel Operating Systems using Modern Interconnects Process Scheduling.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor Christoffer Dall Department of Computer Science Columbia University
Critical Power Slope Understanding the Runtime Effects of Frequency Scaling Akihiko Miyoshi, Charles Lefurgy, Eric Van Hensbergen Ram Rajamony Raj Rajkumar.
Processes and Threads Processes have two characteristics: – Resource ownership - process includes a virtual address space to hold the process image – Scheduling/execution.
An Energy-Efficient Hypervisor Scheduler for Asymmetric Multi- core 1 Ching-Chi Lin Institute of Information Science, Academia Sinica Department of Computer.
Mobile Middleware for Energy-Awareness Wei Li
1 Process Scheduling in Multiprocessor and Multithreaded Systems Matt Davis CS5354/7/2003.
Operating Systems CSE 411 Multi-processor Operating Systems Multi-processor Operating Systems Dec Lecture 30 Instructor: Bhuvan Urgaonkar.
Embedded System Lab. 김해천 The TURBO Diaries: Application-controlled Frequency Scaling Explained.
Scott Ferguson Section 1
Dynamic Voltage Frequency Scaling for Multi-tasking Systems Using Online Learning Gaurav DhimanTajana Simunic Rosing Department of Computer Science and.
ARM 2007 Chapter 15 The Future of the Architecture by John Rayfield Optimization Technique in Embedded System (ARM)
Operating System 4 THREADS, SMP AND MICROKERNELS.
CS 546: Intelligent Embedded Systems Gaurav S. Sukhatme Robotic Embedded Systems Lab Center for Robotics and Embedded Systems Computer Science Department.
Progress Report 2013/11/07. Outline Further studies about heterogeneous multiprocessing other than ARM Cache miss issue Discussion on task scheduling.
Operating Systems: Internals and Design Principles
Dezső Sima big.LITTLE technology December 2015 Vers. 2.1.
Research on Embedded Hypervisor Scheduler Techniques 2014/10/02 1.
An Integrated GPU Power and Performance Model (ISCA’10, June 19–23, 2010, Saint-Malo, France. International Symposium on Computer Architecture)
Shouqing Hao Institute of Computing Technology, Chinese Academy of Sciences Processes Scheduling on Heterogeneous Multi-core Architecture.
Lecture 27 Multiprocessor Scheduling. Last lecture: VMM Two old problems: CPU virtualization and memory virtualization I/O virtualization Today Issues.
E-MOS: Efficient Energy Management Policies in Operating Systems
SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.
Background Computer System Architectures Computer System Software.
Processor Level Parallelism 2. How We Got Here Developments in PC CPUs.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Dezső Sima big.LITTLE technology December 2015 Vers. 2.1.
Rakesh Kumar Keith Farkas Norman P Jouppi,Partha Ranganathan,Dean M.Tullsen University of California, San Diego MICRO 2003 Speaker : Chun-Chung Chen Single-ISA.
Matthew Locke November 2007 A Linux Power Management Architecture.
Big.LITTLE technology Dezső Sima Vers. 2.0 December 2016.
Multiprocessing.
Progress Report 2014/05/23.
Microarchitecture.
Adaptive Cache Partitioning on a Composite Core
Green cloud computing 2 Cs 595 Lecture 15.
CS5102 High Performance Computer Systems Thread-Level Parallelism
The Multikernel: A New OS Architecture for Scalable Multicore Systems
April 6, 2001 Gary Kimura Lecture #6 April 6, 2001
Ching-Chi Lin Institute of Information Science, Academia Sinica
Assignment 4 – (a) Consider a symmetric MP with two processors and a cache invalidate write-back cache. Each block corresponds to two words in memory.
Liquid computing – the rVEX approach
Evaluating Asymmetric Multiprocessing for Mobile Applications
Symmetric Multiprocessing (SMP)
Chapter 2: The Linux System Part 3
Operating System 4 THREADS, SMP AND MICROKERNELS
Odroid XU4.
Progress Report 2014/04/23.
Research on Embedded Hypervisor Scheduler Techniques
Presentation transcript:

4. Workload directed adaptive SMP multicores Dezső Sima December 2013 (v1.1, Last updated 10/12/2013) © Dezső Sima 2013

4.1 Introduction to workload directed adaptive SMP multicores

4.1 Introduction to workload directed adaptive SMP multicores (1) Interpretation of the terms symmetric multiprocessing/multiprocessors (SMP) In a symmetric multiprocessor all processors access main memory in the same way, as indicated below. Processor 1 2 n Figure: Example of an SMP multiprocessor (Based on [1])

4.1 Introduction to workload directed adaptive SMP multicores (2) Interpretation of the term symmetric multicore processor In a symmetric multicore processor all cores access main memory in the same way, as indicated below. Core 1 2 n Figure: Example of a symmetric multicore processor (Based on [1])

4.1 Introduction to workload directed adaptive SMP multicores (3) Note Unfortunately, in the literature usually, both symmetrical multiprocessors and symmetrical multicores are designated in the same way, simply by the term SMP. Despite the fact that this may lead to confusion in the terminology, in this section subsequently we will designate symmetrical multicore processors as SMP multicores or SMPs.

4.1 Introduction to workload directed adaptive SMP multicores (4) Workload directed adaptive SMPs Asynchronous adaptive SMPs (aSMPs) Synchronous adaptive SMPs aSMP enables each core to run at different voltage and frequency This results in lower power by scaling down voltage and frequency of each core to the actual load It restricts each core to run at the same voltage and frequency Voltage and frequency of the core cluster is determined by the highest load in the cluster Cores running low intensity applications waste power running at higher voltage and frequency than needed. Examples Qualcomm Snapdragon S4 (2012) ARM’s big.LITTLE technology (2011) Nvidia’s vSMP technology (2011) Based on [2]

4.1 Introduction to workload directed adaptive SMP multicores (5) Synchronous adaptive SMPs Synchronous adaptive SMP in the 1 + n configuration Synchronous adaptive SMP in the n + n configuration It is Nvidia’s solution, called Variable SMP E.g. ARM’s solution, called big.LITTLE processing Using two clusters of cores a cluster of a single low power core and a cluster of high performance cores. Using two clusters of cores a cluster of low power (LITTLE) cores and a cluster of high performance (big) cores. E.g. Cluster of high performance cores Cluster of big cores Cluster of LITTLE cores CPU0 CPU1 CPU0 CPU1 “Cluster” of a single low power core CPU0 CPU1 CPU2 CPU3 CPU2 CPU3 CPU0 CPU2 CPU3 Cache coherent interconnect Cache coherent interconnect To memory To memory

4.1 Introduction to workload directed adaptive SMP multicores (6) Example for performance and energy efficiency of high performance (Cortex-A15) and low power (Cortex-A7) cores [3]

4.2 Principle of Nvidia’s variable SMP

4.2 Principle of Nvidia’s variable SMP (1) Nvidia’s variable SMP is in fact synchronous adaptive SMP in the 1 + n configuration. It includes two clusters of cores, as shown below: a “cluster” of a single low power core and a cluster of high performance cores. E.g. Cluster of high performance cores CPU0 CPU1 “Cluster” of a single low power core CPU2 CPU3 CPU0 Cache coherent interconnect Figure: Example layout of Nvidia’s variable SMP

4.2 Principle of Nvidia’s variable SMP (2) Usage models of synchronous adaptive SMPs in the 1 + n configuration Usage models of synchronous adaptive SMPs in the 1 + n configuration Exclusive use of the clusters Inclusive use of the clusters E.g. Nvidia’s Variable SMP in Tegra 3 (2011) and Tegra 4 (2013) It is not implemented yet.

4.2 Principle of Nvidia’s variable SMP (3) The low power core Optimized for low power consumption by using transistors that require low power to operate, as shown below [4].

4.2 Principle of Nvidia’s variable SMP (4) Power-Performance curve of Nvidia’s vSMP [4]

4.3 Principle of ARM’s big.LITTLE technology

4.3 Principle of ARM’s big.LITTLE technology (1) ARM’s big.LITTLE technology is in fact synchronous adaptive SMP in the n + n configuration. It includes two clusters of cores, as shown below: a cluster of a low power cores, termed as the LITTLE cores and a cluster of high performance cores, termed as the big cores. CPU0 CPU1 CPU2 CPU3 Cache coherent interconnect Cluster of LITTLE cores big cores To memory Figure: Example layout of ARM’s big.LITTLE technology

4.3 Principle of ARM’s big.LITTLE technology (2) Usage models of synchronous adaptive SMPs in the n + n configuration Usage models of synchronous adaptive SMPs in the n+n configuration Exclusive/inclusive use of the clusters The cluster migration model

4.3 Principle of ARM’s big.LITTLE technology (3) Exclusive/inclusive use of the clusters Exclusive/inclusive use of the clusters Exclusive use of the clusters Inclusive use of the clusters Clusters are used exclusively, i.e. at a time one of the clusters is in use as shown below for the cluster migration model (to be discussed later) Clusters are used inclusively, i.e. at a time both clusters can be used partly or entirely Low load High load Low load High load Cluster of big cores Cluster of big cores Cluster of big cores Cluster of big cores Cluster of LITTLE cores Cluster of LITTLE cores CPU0 CPU1 CPU0 CPU1 Cluster of LITTLE cores CPU0 CPU1 Cluster of LITTLE cores CPU0 CPU1 CPU0 CPU1 CPU0 CPU1 CPU0 CPU1 CPU0 CPU1 CPU2 CPU3 CPU2 CPU3 CPU2 CPU3 CPU2 CPU3 CPU2 CPU3 CPU2 CPU3 CPU2 CPU3 CPU2 CPU3 Cache coherent interconnect Cache coherent interconnect Cache coherent interconnect Cache coherent interconnect

4.3 Principle of ARM’s big.LITTLE technology (2) Usage models of synchronous adaptive SMPs in the n + n configuration Usage models of synchronous adaptive SMPs in the n+n configuration Exclusive/inclusive use of the clusters The cluster migration model

4.3 Principle of ARM’s big.LITTLE technology (4) The cluster migration model [5] The cluster migration model Exclusive use of the clusters Inclusive use of the clusters Cluster migration Core migration Core migration big.LITTLE processing with cluster migration big.LITTLE processing with core migration big.LITTLE MP

4.3 Principle of ARM’s big.LITTLE technology (5) Big.LITTLE processing with cluster migration [5] There are two core clusters, the LITTLE core cluster and the big core cluster. Tasks run on either the LITTLE or the big core cluster, so only one core cluster is active at any time (except a short interval during a cluster switch). Low workloads, such as background synch tasks, audio or video playback run typically on the LITTLE core cluster. If the workload becomes higher than the max performance of the LITTLE core cluster the workload will be migrated to the big core cluster and vice versa.

4.3 Principle of ARM’s big.LITTLE technology (6) Cluster switches [6] Cluster selection is driven by OS power management. OS (e.g. the Linux cpufreq routine) samples the load for all cores in the cluster and selects an operating point for the cluster. It switches clusters at terminal points of the current clusters DVFS curve, as illustrated in the next Figure.

4.3 Principle of ARM’s big.LITTLE technology (7) Power/performance curve during cluster switching [7] DVFS operating points (Low power core) (High performance core) A switch from the low power cluster to the high performance cluster is an extension of the DVFS strategy. A cluster switch lasts about 30 kcycles.

4.3 Principle of ARM’s big.LITTLE technology (8) Big.LITTLE processing with core migration [5], [8] There are two core clusters, the LITTLE core cluster and the big core cluster. Cores are grouped into pairs of one big core and one LITTLE core. The LITTLE and the big core of a group are used exclusively. Each LITTLE core can switch to its big counterpart if it meets a higher load than its max. performance and vice versa. Each core switch is independent from the others.

4.3 Principle of ARM’s big.LITTLE technology (9) Core switches [6] Core selection in any core pair is performed by OS power management. The DVFS algorithm monitors the core load. When a LITTLE core cannot service the actual load, a switch to its big counterpart is initiated and the LITTLE core is turned off and vice versa.

4.3 Principle of ARM’s big.LITTLE technology (10) big.LITTLE MP processing with core migration [8],[5] The OS scheduler has all cores of both clusters at its disposal and can activate all cores at any time. Tasks can run or be moved between the LITTLE cores and the big cores as decided by the scheduler. big.LITTLE MP termed also as Heterogeneous Multiprocessing (HMP).

4.3 Principle of ARM’s big.LITTLE technology (11) Use of the big.LITTLE technology in recent mobile processors big.LITTLE tecnology Exclusive use of the clusters Inclusive use of the clusters Cluster migration Core migration Core migration big.LITTLE processing with cluster migration big.LITTLE processing with core migration big.LITTLE MP (Heterogeneous Multiprocessing) Described first in ARM’s White Paper (2011) [3] Described first in ARM’s White Paper (2012) [9] Described first in ARM’s White Paper (2011) [3] Used in Samsung Exynos 5 Octa 5410 (2013) (4 + 4 cores) Samsung HMP on Exynos 5 Octa 5420 (2013) (4 + 4 cores) Mediatek MT 8135 (2013) (2 + 2 cores) Renesas MP 6530 (2013) (2 + 2 cores)

References (1) [1]: Wikipedia, File:SMP - Symmetric Multiprocessor System.svg, http://en.wikipedia.org/wiki/File:SMP_-_Symmetric_Multiprocessor_System.svg [2]: Sag A., Qualcomm Snapdragon S4 Benchmarking Day, BSN, July 25 2012, http://www.brightsideofnews.com/news/2012/7/25/qualcomm-snapdragon-s4- benchmarking-day.aspx [3]: Greenhalgh P., Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7, White Paper, Sept. 2011, http://www.arm.com/files/downloads/big.LITTLE_Final.pdf [4]: Variable SMP – A Multi-Core CPU Architecture for Low Power and High Performance, Nvidia, Whitepaper, 2011, http://www.nvidia.com/content/PDF/tegra_white_papers/tegra- whitepaper-0911b.pdf [5]: Klug B., Samsung Announces big.LITTLE MP Support in Exynos 5420, AnandTech, Sept. 11 2013, http://www.anandtech.com/show/7313/samsung-announces-biglittle- mp-support-in-exynos-5420 [6]: Gupta A., Implications of Per CPU switching in a big.LITTLE system, ARM [7]: Gálffy Cs., Érkezik a valóban nyolcmagos Samsung Exynos 5, HWSW, Sept. 10 2013, http://www.hwsw.hu/hirek/50915/samsung-exynos-5-octa-arm-big-little-hmp-cortex.html [8]: MediaTek Enables ARM big.LITTLE Heterogeneous Multi-Processing Technology in Mobile SoCs, http://www.mediatek.com/_en/Event/201307_TrueOctaCore/MediaTekEnablesARM bigLITTLEHMPTechnology.pdf [9]: Jeff B., Advances in big.LITTLE Technology for Power and Energy Savings, White Paper, Sept. 2012, http://www.arm.com/files/pdf/Advances_in_big.LITTLE_Technology_for_ Power_and_Energy_Savings.pdf