Task Orchestration: Scheduling and Mapping on Multicore Systems
This module was created with support from NSF under grant # DUE 1141022. Module developed Spring 2013 by Apan Qasem. Course TBD, Lecture TBD, Term TBD.

Outline
- Scheduling for parallel systems
- Load balancing
- Thread affinity
- Resource sharing
- Hardware threads (SMT)
- Multicore architecture
- Significance of the multicore paradigm shift

Increase in Transistor Count
[Figure: transistor counts over time, illustrating Moore's Law]

Increase in Performance
[Figure: processor performance evolution; labels include 16 MHz, 25 MHz, full-speed Level 2 cache, instruction pipeline, longer issue pipeline, double-speed arithmetic. Image adapted from Scientific American, Feb 2005, "A Split at the Core".]

Increase in Power Density
[Figure: power density trends across processor generations]

The Power Wall
- Riding Moore's Law with ever-faster single cores results in too much heat dissipation and power consumption
- Moore's law still holds, but exploiting it that way no longer seems economically viable
- The multicore paradigm: put multiple simplified (and slower) processing cores on the same chip area
  - Fewer transistors per cm² implies less heat!

Why is Multicore Such a Big Deal?
[Figure: the same processor-evolution image as before (16 MHz, 25 MHz, full-speed Level 2 cache, instruction pipeline, longer issue pipeline, double-speed arithmetic), now annotated "more responsibility on software"]

Why is Multicore Such a Big Deal?
- Parallelism is mainstream
  - "In future, all software will be parallel" - Andrew Chien, Intel CTO (many concur)
- Parallelism is no longer a matter of interest just for the HPC people
  - Need to find more programs to parallelize
  - Need to find more parallelism in existing parallel applications
  - Need to consider parallelism when writing any new program

Why is Multicore Such a Big Deal?
- Parallelism is ubiquitous

OS Role in the Multicore Era
- There are several considerations for multicore operating systems
  - Scalability
  - Sharing and contention of resources
  - Non-uniform communication latency
- The biggest challenge is in scheduling threads across the system

Scheduling Goals
- Most of the goals don't change when scheduling for multicore or multiprocessor systems
- We still care about
  - CPU utilization
  - Throughput
  - Turnaround time
  - Wait time
- Fairness definitions may become more complex

Scheduling Goals for Multicore Systems
- Multicore systems give rise to a new set of goals for the OS scheduler
  - Load balancing
  - Resource sharing
  - Energy usage and power consumption

Single CPU Scheduler Overview
[Diagram: a long-term queue feeding a single ready queue; the CPU executes from the ready queue, with blocked processes parked on an I/O queue]

Multicore Scheduler Overview I
[Diagram: cores 0-3 all served by a common ready queue and common I/O queue, fed by a long-term queue]
- Cores sharing the same short-term queues

Multicore Scheduler Overview II
[Diagram: a long-term queue feeding four per-core queues; each of cores 0-3 has its own ready queue and I/O queue]
- Cores with individual ready queues
- No special handling required for many of the single-CPU scheduling algorithms

Load Balancing
- Goal is to distribute tasks across all cores on the system such that the CPU and other resources are utilized evenly
- A load-balanced system will generally lead to higher throughput
[Figure: Scenario 1, an unbalanced (bad) task distribution, vs. Scenario 2, a balanced (good) one]

The OS as a Load Balancer
Main strategy:
- Identify a metric for balanced load
  - e.g., average number of processes waiting in ready queues (aka load average)
- Track the load-balance metric
  - probe ready queues
  - uptime and who utilities (see the sketch below)
- Migrate threads if a core exhibits a high average load
  - If the load averages for cores 0-3 are 2, 3, 1, and 17, then move threads from core 3 to core 2
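As a concrete illustration (our addition, not from the slides), the load average that uptime reports can be read directly from /proc/loadavg on Linux. Note that /proc/loadavg is system-wide; a per-core balancer would need per-run-queue statistics, e.g. from /proc/stat or the scheduler's own counters.

    /* loadavg.c - print the 1-, 5-, and 15-minute load averages
     * as reported by the Linux kernel in /proc/loadavg. */
    #include <stdio.h>

    int main(void)
    {
        double m1, m5, m15;
        FILE *f = fopen("/proc/loadavg", "r");

        if (!f) {
            perror("/proc/loadavg");
            return 1;
        }
        if (fscanf(f, "%lf %lf %lf", &m1, &m5, &m15) != 3) {
            fprintf(stderr, "unexpected /proc/loadavg format\n");
            return 1;
        }
        fclose(f);
        printf("load average: %.2f (1m) %.2f (5m) %.2f (15m)\n", m1, m5, m15);
        return 0;
    }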

Thread Migration
- The process of moving a thread from one ready queue to another is known as thread migration
- Operating systems implement two types of thread migration mechanisms
  - Push migration
    - Run a separate process that migrates threads from one core to another
    - May also be integrated within the kernel
  - Pull migration (aka work stealing)
    - Each core fetches threads from other cores (see the sketch below)
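To make pull migration concrete, here is a toy user-space sketch (ours, and far simpler than a real kernel, which steals from per-CPU run queues with careful lock ordering, batching, and locality heuristics). Each "core" is a pthread with its own counter of queued tasks; an idle core takes tasks from a busy peer's queue.

    /* pullmig.c - toy illustration of pull migration (work stealing).
     * Compile with: gcc -pthread pullmig.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NCORES 4

    static struct {
        pthread_mutex_t lock;
        int tasks;                     /* length of this core's ready queue */
    } rq[NCORES];

    /* Take one task from this core's own queue; 1 on success, 0 if empty. */
    static int pop_local(int core)
    {
        int got = 0;
        pthread_mutex_lock(&rq[core].lock);
        if (rq[core].tasks > 0) { rq[core].tasks--; got = 1; }
        pthread_mutex_unlock(&rq[core].lock);
        return got;
    }

    /* Pull migration: scan peers and steal one task from the first busy one. */
    static int steal(int thief)
    {
        for (int victim = 0; victim < NCORES; victim++) {
            if (victim == thief)
                continue;
            pthread_mutex_lock(&rq[victim].lock);
            if (rq[victim].tasks > 0) {
                rq[victim].tasks--;
                pthread_mutex_unlock(&rq[victim].lock);
                return 1;
            }
            pthread_mutex_unlock(&rq[victim].lock);
        }
        return 0;
    }

    static void *worker(void *arg)
    {
        int core = (int)(long)arg;
        int ran = 0;
        while (pop_local(core) || steal(core))
            ran++;                     /* "run" the task */
        printf("core %d ran %d tasks\n", core, ran);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NCORES];

        /* Deliberately unbalanced start: every task queued on core 3. */
        for (int i = 0; i < NCORES; i++) {
            pthread_mutex_init(&rq[i].lock, NULL);
            rq[i].tasks = (i == 3) ? 16 : 0;
        }
        for (long i = 0; i < NCORES; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NCORES; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }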

Complex Load Balancing Issues
- Assume only one thread executing on each core
- Does it help to load balance this workload? Will it make a difference in performance?
- Thread migration or load balancing will only help overall performance if t1 runs faster when moved to core 0, core 2, or core 3
[Diagram: an unevenly loaded four-core system, with thread t1 on core 1]

Load Balancing for Power
- This unbalanced workload can have huge implications for power consumption
- Power consumption is tied to how fast a processor is running: P = c·v²·f, and P ∝ t (a worked example follows)
  - P = power, f = frequency, c = static power dissipation, v = voltage, t = temperature
- One core running at a higher frequency than the others may result in more heat dissipation and overall increased power consumption
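Worked example (our addition, assuming voltage scales proportionally with frequency, as is typical under dynamic voltage and frequency scaling): with P = c·v²·f, running a core at 80% of its nominal voltage and frequency gives 0.8² × 0.8 ≈ 0.51, i.e. roughly half the dynamic power for only a 20% drop in clock speed.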

Load Balancing for Power
- Operating systems are incorporating power metrics into thread migration and load balancing decisions
- Achieve power balance by eliminating hotspots
- Can also try to change the frequency
  - AMD PowerNow!, Intel SpeedStep
- Can utilize hardware performance counters
  - core utilization
  - core temperature
- Linux implements this type of scheduling
  - sched_mc_power_savings (see the note below)
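Usage note (our addition, not from the slides): on Linux kernels of that era that exposed this policy (it was removed around kernel 3.5), the tunable lived in sysfs and could be enabled with

    echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

which biases the scheduler toward packing work onto fewer packages so that idle packages can enter deeper sleep states.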

Load Balancing Trade-offs
- Thread migration may involve multiple context switches on more than one core
  - Context switches can be very expensive
- Potential gains from load balancing must be weighed against the increased cost of context switches
- Load balancing policy may conflict with CFS scheduling
  - a balanced load does not imply fair sharing of resources

Resource Sharing
- Current multicore and SMT systems share resources at various levels
- The OS needs to be aware of resource utilization in making scheduling decisions

Thread Affinity
- The tendency of a process to run on a given core for as long as possible without being migrated to a different core
  - aka CPU affinity and processor affinity
- The operating system uses the notion of thread affinity to perform resource-aware scheduling on multicore systems
  - If a thread has affinity for a specific core (or core group), then priority should be given to mapping and scheduling the thread to that specific core (or group)
- Current approach is to use thread affinity to address skewed workloads
  - Start with the default (all cores)
  - Set affinity of task i to core j if core j is deemed underutilized

Adjusting Thread Affinity
- The Linux kernel maintains the task_struct data structure for every process in the system
  - Affinity information is stored as a bitmask in the cpus_allowed field
- Can modify or retrieve the affinity of a thread from user space (see the example below)
  - sched_setaffinity(), sched_getaffinity()
  - pthread_setaffinity_np(), pthread_getaffinity_np()
  - taskset demo
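A minimal, self-contained sketch of the first pair of calls (our addition, not from the slides): restrict the calling process to cores 0 and 2, then read the mask back. It assumes a machine with at least three online cores; error handling is kept to perror() for brevity.

    /* setaff.c - pin the calling process to cores 0 and 2 (Linux-specific). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(2, &mask);
        /* pid 0 means "the calling thread" */
        if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
            perror("sched_setaffinity");
            return 1;
        }

        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) == -1) {
            perror("sched_getaffinity");
            return 1;
        }
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf("allowed on cpu %d\n", cpu);
        return 0;
    }

The same effect is available from the shell with taskset, e.g. taskset -c 0,2 ./a.out to launch a program pinned to cores 0 and 2, or taskset -p <pid> to inspect the mask of a running process.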

Thread Affinity in the Linux Kernel
An excerpt showing how the kernel consults p->cpus_allowed when choosing a destination CPU for a task (the closing brace lost in the slide has been restored):

    /* Look for allowed, online CPU in same node. */
    for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
        if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
            return dest_cpu;

    /* Any allowed, online CPU? */
    dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
    if (dest_cpu < nr_cpu_ids)
        return dest_cpu;

    /* No more Mr. Nice Guy. */
    if (unlikely(dest_cpu >= nr_cpu_ids)) {
        dest_cpu = cpuset_cpus_allowed_fallback(p);
        /*
         * Don't tell them about moving exiting tasks or
         * kernel threads (both mm NULL), since they never
         * leave kernel.
         */
        if (p->mm && printk_ratelimit()) {
            printk(KERN_INFO "process %d (%s) no "
                   "longer affine to cpu%d\n",
                   task_pid_nr(p), p->comm, cpu);
        }
    }

Linux kernel, sched.c

Affinity Based Scheduling
- Affinity based scheduling can be performed under different criteria, using different heuristics
- Running example: assume 4 cores, with one L2 shared between core 0 and core 1, another L2 shared between core 2 and core 3, and an L3 shared among all cores
[Diagram: cores 0-3, each with a private L1; two shared L2 caches (cores 0-1 and cores 2-3); a single shared L3]

Affinity Based Scheduling
- Scheduling for data locality in L2
  - A thread t_i that shares data with t_j should be placed in the same affinity group
  - Can lead to unbalanced ready queues but improved memory performance
[Diagram: producer P and consumer C sharing data in L2, with reuse of that data over time]

Affinity Based Scheduling
[Diagram: two schedules of a producer-consumer program on the 4-core machine. Poor schedule: producer P and consumer C placed on cores that do not share an L2. Good schedule: P and C placed on a pair of cores that share an L2.]
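A sketch (ours, not from the slides) of how an application could request the "good" schedule itself, using the pthread affinity call listed earlier. It assumes, as in the running example, that cores 0 and 1 share an L2; real code should discover the cache topology first, e.g. via hwloc or /sys/devices/system/cpu/cpu*/cache.

    /* pcpin.c - pin a producer and a consumer onto two cores
     * assumed to share an L2 cache. Compile with: gcc -pthread pcpin.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    static void *producer(void *arg) { /* produce into a shared buffer */ return NULL; }
    static void *consumer(void *arg) { /* consume from the shared buffer */ return NULL; }

    /* Set thread t's affinity mask to the single core given. */
    static void pin(pthread_t t, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
            fprintf(stderr, "failed to pin thread to core %d\n", core);
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pin(p, 0);   /* cores 0 and 1 share an L2 in the running example */
        pin(c, 1);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }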

Affinity Based Scheduling
- Scheduling for better cache utilization
  - Threads t_i and t_j each utilize only 10% of the cache; t_p and t_q each demand 80% of the cache
  - Schedule t_i and t_p on core 0 and core 1, and t_j and t_q on core 2 and core 3
  - May lead to loss of locality
[Diagram: t_i and t_p over the first shared L2, t_j and t_q over the second]

Affinity Based Scheduling
- Scheduling for better power management
  - t_i and t_j are CPU-bound while t_p and t_q are memory-bound
  - Schedule t_i and t_p on core 0, and t_j and t_q on core 2
[Diagram: t_i + t_p on core 0 and t_j + t_q on core 2, leaving cores 1 and 3 idle]

Gang Scheduling
- Two-step scheduling process
  - Identify a set (or gang) of threads and adjust their affinity so they execute in a specific core group
    - Gang formation can be done based on resource utilization and sharing
  - Suspend the threads in a gang to let one job have dedicated access to the resources for a configured period of time
- Traditionally used for MPI programs running on high-performance clusters
- Becoming mainstream for multicore architectures
  - Patches are available that integrate gang scheduling with CFS in Linux