Task Scheduling for Multicore CPUs and NUMA Systems

Martin Kruliš (v1.2), 6. 10. 2017

Parallel Multiprocessing
Logical structure of processes and threads:
- Fibers (user threads)
- Processes (full threads)
- Kernel threads
- CPU cores

Thread/Process Scheduling in OS
Operating systems must balance multiple requirements:
- Fairness (regarding multiple processes)
- Throughput (maximizing CPU utilization)
- Latency (minimizing response time)
- Efficiency (minimizing overhead)
- Additional constraints (I/O-bound operations)
Threads are scheduled onto the available cores preemptively (a thread can be removed from a core at any time). An optimal solution does not exist; a compromise between the requirements is established.

Parallel User Applications
- More "cooperative" parallel processing: the whole application shares the same top-level objective.
- Typically aspires to reduce processing time/latency.

Fork/Join Model
- One of the simplest and most often employed models.
- Easily achieved without special libraries in many languages.
- Often employed in the wrong way: clumsy code, high overhead of creating threads, ...
- Suitable mostly for large-task parallelism; data parallelism is more important nowadays.
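The fork/join pattern above can be sketched in Python using the standard thread pool (an illustrative sketch; a real application would tune the worker count and chunking to the hardware):

```python
# Minimal fork/join sketch: the caller forks one task per data chunk,
# then joins (waits for) all of them and combines the partial results.
from concurrent.futures import ThreadPoolExecutor

def fork_join_sum(data, workers=4):
    chunk = (len(data) + workers - 1) // workers
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(sum, p) for p in parts]   # fork
        return sum(f.result() for f in futures)          # join

print(fork_join_sum(list(range(100))))  # 4950
```

Note that the pool hides the main overhead the slide warns about: workers are created once and reused, instead of spawning fresh threads for every fork.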

Task-Based Decomposition
- An abstraction for programming.

Task
- One piece of work to be performed (code + data).
- Computational in nature, executed non-preemptively.
- Typically represented by an object (functor).
- Much more lightweight than a thread; ideal size ~10-100 thousand instructions.

Decomposition
- Both task-parallel and data-parallel problems can be easily decomposed into small tasks.
- Requires some implicit synchronization mechanisms.
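A task as "code + data in one object" can be sketched as a tiny functor class (illustrative only; real runtimes such as TBB add scheduling state to this):

```python
class Task:
    """One piece of work: the code (a callable) plus its data (arguments)."""
    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

    def __call__(self):
        # Runs non-preemptively to completion on whichever worker picks it up.
        return self.fn(*self.args)

t = Task(lambda a, b: a + b, 2, 3)
print(t())  # 5
```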

Task-Based Programming Issues
- Task spawning
- Scheduling (i.e., load balancing)
- (Implicit) synchronization

Example: Intel Threading Building Blocks
- Tasks can be spawned by other tasks, which simplifies nesting and load balancing.
- Each task has a pointer to a successor and a refcount (how many predecessors are pointing to it).
- A task decrements its successor's refcount on completion.
- Tasks with refcount == 0 can be executed.
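The refcount mechanism can be sketched as a small model (illustrative Python, not TBB's actual API; `RefTask`, `add_dependency`, and `run_ready` are made-up names):

```python
class RefTask:
    def __init__(self, fn, name=""):
        self.fn = fn
        self.name = name
        self.successor = None   # the task that waits for this one
        self.refcount = 0       # predecessors that have not finished yet

def add_dependency(pred, succ):
    """Make `succ` wait for `pred` to complete."""
    pred.successor = succ
    succ.refcount += 1

def run_ready(tasks):
    """Execute tasks whose refcount is 0; completing a task may unblock
    its successor by decrementing the successor's refcount."""
    order = []
    ready = [t for t in tasks if t.refcount == 0]
    while ready:
        t = ready.pop()
        t.fn()
        order.append(t.name)
        s = t.successor
        if s is not None:
            s.refcount -= 1
            if s.refcount == 0:
                ready.append(s)
    return order
```

With two children feeding one parent, the parent is always executed last, exactly when its refcount drops to zero.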

Task Spawning: Blocking Pattern
- Parent spawns children and waits for them.
- The wait is a special blocking call that invokes the scheduler.
- Parent has refcount > # of children (the extra reference keeps the parent alive while it waits).

Task Spawning: Continuation Passing
- Parent creates a continuation task and the children, then returns.
- Children start immediately (refcount == 0).
- Continuation has refcount == # of children.
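Continuation passing can be sketched with a shared counter: the children start immediately, and the last child to finish decrements the counter to zero and runs the continuation (an illustrative sketch; `parallel_then` is a made-up helper, not a library function):

```python
import threading

def parallel_then(child_fns, continuation):
    """Run children; the continuation fires when the last child finishes.
    No parent thread blocks waiting in between."""
    remaining = [len(child_fns)]          # plays the role of the refcount
    lock = threading.Lock()
    results = [None] * len(child_fns)

    def wrap(i, fn):
        results[i] = fn()
        with lock:
            remaining[0] -= 1
            last = remaining[0] == 0
        if last:
            continuation(results)         # runs on the last child's thread

    threads = [threading.Thread(target=wrap, args=(i, fn))
               for i, fn in enumerate(child_fns)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Unlike the blocking pattern, no worker is parked inside a wait call; the dependency is encoded purely in the counter.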

Parallel Algorithm Templates: Parallel-Reduce Decomposition
[Diagram: a range that is too large is split recursively. Reduce <0,100) splits into Reduce <0,50) and Reduce <50,100); those are still too large and split into Reduce <0,25), <25,50), <50,75), <75,100). Each Reduce is followed by a Finalize step that combines the subresults.]
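The recursive decomposition in the diagram can be sketched as follows (a sequential sketch of the splitting logic; a real scheduler would run the two halves as independent tasks in parallel):

```python
def parallel_reduce(lo, hi, leaf, combine, grain=25):
    """Split a too-large range in half recursively; ranges at or below
    the grain size are reduced directly, and `combine` plays the role
    of the Finalize step joining the subresults."""
    if hi - lo <= grain:
        return leaf(lo, hi)
    mid = (lo + hi) // 2
    left = parallel_reduce(lo, mid, leaf, combine, grain)
    right = parallel_reduce(mid, hi, leaf, combine, grain)
    return combine(left, right)

# Sum of <0,100) via the tree from the diagram:
print(parallel_reduce(0, 100, lambda lo, hi: sum(range(lo, hi)),
                      lambda a, b: a + b))  # 4950
```

The grain size controls the trade-off the earlier slides mention: smaller grains give more tasks (better load balancing), larger grains give less scheduling overhead.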

Task Scheduling

Thread Pool
- Workers process the available tasks (e.g., tasks with refcount == 0 in TBB).
- # of workers ~ # of available CPU cores.
- Various scheduling strategies (which task goes first).

Oversubscription
- Creating many more tasks than available workers.
- Provides an opportunity for load balancing, even when the length of a task is data-driven (and thus unpredictable).
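A thread pool with oversubscription can be sketched with a shared queue: far more tasks than workers, so fast workers naturally pick up more work (illustrative sketch; `run_pool` is a made-up helper):

```python
import queue
import threading

def run_pool(tasks, workers=4):
    """Workers repeatedly pop tasks from a shared queue until it is empty.
    With many small tasks per worker, uneven task lengths balance out."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return                    # no work left for this worker
            r = t()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# 100 tasks shared among 4 workers:
print(sum(run_pool([lambda i=i: i for i in range(100)])))  # 4950
```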

Task Scheduling Strategies

Static Scheduling
- Used when the number and length of tasks are predictable.
- Tasks are assigned to the threads at the beginning.
- Virtually no scheduling overhead (after the assignment).

Dynamic Scheduling
- Tasks can be reassigned to other workers.
- Tasks are assigned as the workers become available.

Other Strategies
- Scheduling in phases: tasks are assigned statically; once some workers become idle, an overall reassignment is performed.
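Static scheduling fixes the assignment up front, e.g. by round-robin striping (a minimal sketch; real static schedulers may also weight tasks by predicted length):

```python
def static_assign(tasks, workers):
    """Round-robin static assignment: worker w gets tasks w, w+workers, ...
    Decided once, before execution starts -- no runtime scheduling cost,
    but no way to react if one worker's share turns out slower."""
    return [tasks[w::workers] for w in range(workers)]

print(static_assign(list(range(10)), 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```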

Dynamic Task Scheduling: Scheduling Algorithms
Many different approaches are suitable for different specific scenarios:
- Global task queue: threads atomically pop tasks (or push tasks); the queue may become a bottleneck.
- Private task queues per thread: each thread processes/spawns its own tasks; what should a thread do when its queue is empty?
- Combined solutions: local and shared queues.

CPU Architecture: Modern Multicore CPUs (diagram slide)

NUMA: Non-Uniform Memory Access
- First-touch physical memory allocation: a physical page is placed on the NUMA node of the core that first touches it.

Cache Coherency: Memory Coherency Problem
- All cores must perceive the same data.

MESI Protocol
- Implemented at the cache level.
- Each cache line has a special flag: Modified, Exclusive, Shared, or Invalid.
- Memory bus snooping + update rules.
- MOESI protocol: similar, adds an "Owned" state.
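The per-line state transitions can be sketched as a small table, as seen by one cache (a simplified model of the usual MESI rules; it ignores write-backs, bus arbitration, and the event names are illustrative):

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def next_state(state, event):
    """MESI transitions for one cache line in one cache (simplified)."""
    table = {
        (I, "local_read_other_copy"): S,   # another cache holds the line
        (I, "local_read_no_copy"): E,      # we get the only copy
        (I, "local_write"): M,
        (E, "local_write"): M,             # silent upgrade, no bus traffic
        (E, "remote_read"): S,
        (S, "local_write"): M,             # invalidates the other copies
        (S, "remote_write"): I,
        (M, "remote_read"): S,             # data written back, then shared
        (M, "remote_write"): I,
    }
    return table.get((state, event), state)

print(next_state(E, "local_write"))   # Modified
print(next_state(S, "remote_write"))  # Invalid
```

The Exclusive state is the key optimization: a core that knows it holds the only copy can write without notifying anyone on the bus.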

Cache Coherency: MESI Protocol (state diagram slide)

TBB Task Scheduler: Intel Threading Building Blocks Scheduler
- Thread pool with private task queues.
- The local thread gets/inserts tasks from/to the bottom of its own queue.
- Other threads steal tasks from the top of the queue.
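The per-thread double-ended queue can be sketched as follows (an illustrative model of the idea; TBB's real deque is lock-free, not mutex-based):

```python
import threading
from collections import deque

class WorkerQueue:
    """Owner pushes/pops at the bottom (LIFO: newest, cache-hot tasks);
    thieves steal from the top (FIFO: oldest, typically largest subtrees)."""
    def __init__(self):
        self.dq = deque()
        self.lock = threading.Lock()

    def push(self, task):             # owner: spawn at the bottom
        with self.lock:
            self.dq.append(task)

    def pop(self):                    # owner: take the most recent task
        with self.lock:
            return self.dq.pop() if self.dq else None

    def steal(self):                  # thief: take the oldest task
        with self.lock:
            return self.dq.popleft() if self.dq else None
```

The two ends rarely contend: the owner works at the bottom while occasional thieves touch only the top.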

TBB Task Scheduler: Task Dependency Tree
- Stack-like (LIFO) local processing leads to DFS expansion of the task tree within one thread, which reduces memory consumption and improves caching.
- Queue-like (FIFO) stealing leads to BFS expansion of the tree.

Locality-Aware Scheduling: Challenges
- Maintaining NUMA locality.
- Efficient cache utilization vs. thread affinity.
- Avoiding false sharing.

Key ideas
- Separate requests on different NUMA nodes.
- Task scheduling considers cache sharing: related tasks go to cores that are close.
- Minimize the overhead of task stealing.

Locality-Aware Scheduler (Z. Falt)
Key ideas:
- Queues are associated with cores (not threads).
- Threads are bound (by affinity) to a NUMA node.
- Two methods for task spawning: immediate tasks (related/follow-up work) and deferred tasks (unrelated work).
- Task stealing reflects CPU core distance: NUMA distance (number of NUMA hops) and cache distance (level of the shared cache: L1, L2, ...).
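Distance-driven victim selection can be sketched as follows (a hypothetical topology and made-up function names, purely to illustrate the "steal from the closest core" idea; real schedulers read the actual cache/NUMA topology from the OS):

```python
def core_distance(a, b, cores_per_l2=2, cores_per_node=4):
    """Illustrative distance metric: 0 = same core, 1 = shared L2,
    2 = same NUMA node (shared L3), 3 = remote NUMA node."""
    if a == b:
        return 0
    if a // cores_per_l2 == b // cores_per_l2:
        return 1
    if a // cores_per_node == b // cores_per_node:
        return 2
    return 3

def pick_victim(me, candidates, cores_per_l2=2, cores_per_node=4):
    """Steal from the queue of the closest core that has work."""
    return min(candidates,
               key=lambda c: core_distance(me, c, cores_per_l2, cores_per_node))

print(pick_victim(0, [5, 3, 1]))  # 1 (shares the L2 cache with core 0)
```

Preferring nearby victims keeps the stolen task's data in a shared cache and avoids pulling it across NUMA hops.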

Locality-Aware Scheduler (Z. Falt) (diagram slide)

Discussion