Task Scheduling for Multicore CPUs and NUMA Systems
Martin Kruliš (v1.2), 6. 10. 2017
Parallel Multiprocessing
- Logical structure of processes and threads [figure]:
  - Fibers (user threads)
  - Processes (full threads)
  - Kernel threads
  - CPU cores
Scheduling in OS
- Thread/Process Scheduling in OS
  - Operating systems have multiple requirements:
    - Fairness (regarding multiple processes)
    - Throughput (maximizing CPU utilization)
    - Latency (minimizing response time)
    - Efficiency (minimizing overhead)
    - Additional constraints (I/O-bound operations)
  - Threads are scheduled onto the available cores
    - Preemptively (a thread can be removed from a core)
  - An optimal solution does not exist
    - A compromise between the requirements is established
Parallel Programming
- Parallel User Applications
  - More "cooperative" parallel processing
  - The whole application has the same top-level objective
  - Typically aspires to reduce processing time/latency
- Fork/Join Model (see the sketch below)
  - One of the simplest and most often employed models
  - Easily achieved without special libraries in many languages
  - Usually employed in a wrong way (clumsy, large overhead of creating threads, ...)
  - Suitable mostly for large-task parallelism; data parallelism is more important nowadays
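To make the fork/join idea concrete, here is a minimal sketch using plain C++ std::thread and no special library; the summation workload and the way the range is split are illustrative assumptions, not part of the slides.

```cpp
#include <thread>
#include <vector>
#include <numeric>
#include <iostream>

// Fork/join: the main thread forks workers, each processes a disjoint
// part of the data, and the main thread joins them before using the result.
int main() {
    std::vector<int> data(1'000'000, 1);
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;                  // fallback if unknown
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> threads;

    std::size_t chunk = data.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t begin = w * chunk;
        std::size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
        threads.emplace_back([&, w, begin, end] {   // fork
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();               // join

    long long sum = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::cout << "sum = " << sum << "\n";
}
```

Note that this spawns fresh threads for a single computation, which is exactly the overhead the slide warns about; task-based runtimes keep a thread pool alive instead.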
Task-based Parallelism
- Task-based Decomposition
  - An abstraction for programming
- Task (see the sketch below)
  - One piece of work to be performed (code + data)
  - Computational in nature, executed non-preemptively
  - Typically represented by an object (functor)
  - Much more lightweight than a thread
  - Ideal size ~ 10-100 thousand instructions
- Decomposition
  - Both task-parallel and data-parallel problems can easily be decomposed into small tasks
  - Requires some implicit synchronization mechanisms
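A minimal sketch of a task represented as a functor object, bundling the code with the data it operates on; the SumTask name and its fields are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// A task couples the code to run (operator()) with the data it needs.
// It is executed non-preemptively by some worker thread of a scheduler.
struct SumTask {
    const std::vector<int>* data;   // input range (not owned by the task)
    std::size_t begin, end;         // sub-range this task is responsible for
    long long result = 0;           // output slot

    void operator()() {             // the "code" part of the task
        long long s = 0;
        for (std::size_t i = begin; i < end; ++i)
            s += (*data)[i];
        result = s;
    }
};
```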
Task-Based Programming
- Task-Based Programming Issues
  - Task spawning
  - Scheduling (i.e., load balancing)
  - (Implicit) synchronization
- Example: Intel Threading Building Blocks (see the sketch below)
  - Tasks can be spawned by other tasks
    - Simplifies nesting and load balancing
  - Each task has a pointer to its successor and a refcount (the number of tasks that point to it as their successor)
  - A task decrements its successor's refcount on completion
  - Tasks with refcount == 0 can be executed
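A sketch of tasks spawning other tasks using the tbb::task_group interface, where the refcount bookkeeping described above is handled internally by TBB; the recursive sum and the serial cut-off constant are illustrative assumptions.

```cpp
#include <tbb/task_group.h>
#include <numeric>
#include <vector>
#include <iostream>

// Tasks spawn other tasks: each recursion level runs its two halves as
// child tasks and waits for them (implicit synchronization via wait()).
long long parallel_sum(const int* first, const int* last) {
    if (last - first < 10000)                          // small enough: run serially
        return std::accumulate(first, last, 0LL);

    const int* mid = first + (last - first) / 2;
    long long left = 0, right = 0;
    tbb::task_group g;
    g.run([&] { left = parallel_sum(first, mid); });   // spawn a child task
    g.run([&] { right = parallel_sum(mid, last); });   // spawn a child task
    g.wait();                                          // wait for both children
    return left + right;
}

int main() {
    std::vector<int> v(1'000'000, 1);
    std::cout << parallel_sum(v.data(), v.data() + v.size()) << "\n";
}
```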
Task Spawning
- Blocking Pattern (see the sketch below)
  - The parent spawns its children and waits for them
    - A special blocking call that invokes the scheduler
  - The parent's refcount is greater than the # of children (in TBB it is set to # of children + 1; the extra reference accounts for the parent itself while it waits)
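A sketch of the blocking pattern with the classic low-level tbb::task API, modeled after the well-known TBB Fibonacci tutorial example; note that this interface requires a classic TBB release (it was deprecated and later removed in oneTBB), and the cut-off value and names are illustrative.

```cpp
#include <tbb/task.h>   // legacy low-level task API (not available in oneTBB)

long serial_fib(long n) { return n < 2 ? n : serial_fib(n - 1) + serial_fib(n - 2); }

class FibTask : public tbb::task {
    const long n;
    long* const sum;
public:
    FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}

    tbb::task* execute() override {
        if (n < 16) {                       // small problem: compute serially
            *sum = serial_fib(n);
        } else {
            long x = 0, y = 0;
            FibTask& a = *new (allocate_child()) FibTask(n - 1, &x);
            FibTask& b = *new (allocate_child()) FibTask(n - 2, &y);
            set_ref_count(3);               // 2 children + 1 for the blocking wait
            spawn(b);                       // start one child asynchronously
            spawn_and_wait_for_all(a);      // blocking call that invokes the scheduler
            *sum = x + y;                   // both children have finished here
        }
        return nullptr;
    }
};

long parallel_fib(long n) {
    long sum = 0;
    FibTask& root = *new (tbb::task::allocate_root()) FibTask(n, &sum);
    tbb::task::spawn_root_and_wait(root);
    return sum;
}
```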
Task Spawning
- Continuation Passing (see the sketch below)
  - The parent creates a continuation task and the children
  - The children start immediately (refcount == 0)
  - The continuation has refcount == # of children
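The same Fibonacci task rewritten in the continuation-passing style of the classic TBB tutorial (again the legacy tbb::task API, reusing serial_fib from the previous sketch): the parent allocates a continuation, hands its place in the dependency graph over to it, and returns without blocking.

```cpp
// Continuation task: runs only after both children have completed,
// i.e., after its refcount has been decremented from 2 to 0.
class FibContinuation : public tbb::task {
public:
    long* const sum;
    long x = 0, y = 0;                     // result slots filled by the children
    explicit FibContinuation(long* sum_) : sum(sum_) {}
    tbb::task* execute() override { *sum = x + y; return nullptr; }
};

class FibTaskCP : public tbb::task {
    const long n;
    long* const sum;
public:
    FibTaskCP(long n_, long* sum_) : n(n_), sum(sum_) {}

    tbb::task* execute() override {
        if (n < 16) { *sum = serial_fib(n); return nullptr; }
        // The continuation takes over this task's successor; the parent can finish.
        FibContinuation& c = *new (allocate_continuation()) FibContinuation(sum);
        FibTaskCP& a = *new (c.allocate_child()) FibTaskCP(n - 1, &c.x);
        FibTaskCP& b = *new (c.allocate_child()) FibTaskCP(n - 2, &c.y);
        c.set_ref_count(2);                // exactly the number of children
        spawn(b);                          // child b starts immediately (refcount == 0)
        return &a;                         // scheduler bypass: run child a next
    }
};
```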
Parallel Algorithm Templates
- Parallel-reduce Decomposition [figure: a Reduce <0,100) task that is too large is split into Reduce <0,50) and Reduce <50,100); these are split again into Reduce <0,25), <25,50), <50,75), <75,100); each reduce is followed by a Finalize step that combines the partial results] (see the sketch below)
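A sketch of this decomposition with tbb::parallel_reduce, which recursively splits the range into subranges and joins (finalizes) the partial results; the data, range type, and lambdas are illustrative assumptions.

```cpp
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <iostream>

int main() {
    std::vector<double> a(100000, 0.5);

    // The range <0, a.size()) is recursively split into subranges; each
    // subrange is reduced by the first lambda and the partial results
    // are combined ("finalized") by the second lambda.
    double sum = tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, a.size()),
        0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double partial) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                partial += a[i];
            return partial;
        },
        [](double x, double y) { return x + y; });

    std::cout << "sum = " << sum << "\n";
}
```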
Task Scheduling
- Thread Pool
  - Workers are processing the available tasks (e.g., tasks with refcount == 0 in TBB)
  - # of workers ~ # of available CPU cores
  - Various scheduling strategies (which task goes first)
- Oversubscription (see the sketch below)
  - Creating many more tasks than there are workers
  - Provides an opportunity for load balancing
    - Even when the length of a task is data-driven (and thus unpredictable)
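A sketch of oversubscription with a tiny hand-rolled thread pool: the work is cut into many more chunks (tasks) than there are workers, and each worker grabs the next chunk from a shared atomic counter, so uneven chunk lengths balance out automatically; the task count and the simulated workload are illustrative assumptions.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int num_tasks = 1024;                          // many small tasks...
    unsigned hc = std::thread::hardware_concurrency();
    const unsigned num_workers = hc ? hc : 4;            // ...few workers

    std::atomic<int> next_task{0};
    std::atomic<long long> total{0};

    auto worker = [&] {
        for (;;) {
            int t = next_task.fetch_add(1);              // dynamically grab the next task
            if (t >= num_tasks) break;
            long long local = 0;                         // simulated data-driven workload:
            for (int i = 0; i < (t % 7 + 1) * 10000; ++i)  // tasks have unpredictable lengths
                local += i;
            total.fetch_add(local);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    std::printf("total = %lld\n", (long long)total.load());
}
```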
Task Scheduling
- Task Scheduling Strategies
  - Static Scheduling (see the sketch below)
    - When the number and lengths of the tasks are predictable
    - Tasks are assigned to the threads at the beginning
    - Virtually no scheduling overhead (after the assignment)
  - Dynamic Scheduling
    - Tasks can be reassigned to other workers
    - Tasks are assigned as the workers become available
  - Other Strategies
    - Scheduling in phases – tasks are assigned statically; once some workers become idle, an overall reassignment is performed
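For contrast with the dynamic sketch above, a static-scheduling sketch: each worker is assigned a fixed contiguous block of iterations up front, so no scheduling decisions are made at run time; the array size and the uniform per-element work are illustrative assumptions.

```cpp
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> data(n, 1.0f);
    unsigned hc = std::thread::hardware_concurrency();
    const unsigned workers = hc ? hc : 4;

    // Static scheduling: the iteration space is partitioned once, before the
    // threads start; worker w always processes the w-th contiguous block.
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t begin = n * w / workers;
        std::size_t end = n * (w + 1) / workers;
        pool.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = data[i] * 2.0f + 1.0f;   // uniform, predictable work
        });
    }
    for (auto& t : pool) t.join();
    std::printf("done, data[0] = %g\n", data[0]);
}
```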
Dynamic Task Scheduling
- Scheduling Algorithms
  - Many different approaches, suitable for different specific scenarios
  - Global task queue (see the sketch below)
    - Threads atomically pop tasks (or push tasks)
    - The queue may become a bottleneck
  - Private task queues per thread
    - Each thread processes/spawns its own tasks
    - What should a thread do when its queue is empty?
  - Combined solutions
    - Local and shared queues
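A sketch of the simplest variant, a single global task queue protected by a mutex; every push and pop contends on the same lock, which is exactly why the slide calls it a potential bottleneck. The class and the toy tasks are illustrative assumptions, and all tasks are enqueued before the workers start so no termination protocol is needed.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One queue shared by all workers: correctness is easy, but every access
// serializes on the same mutex, which can become a bottleneck.
class GlobalTaskQueue {
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
public:
    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push(std::move(t));
    }
    bool try_pop(std::function<void()>& t) {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.front());
        tasks_.pop();
        return true;
    }
};

int main() {
    GlobalTaskQueue q;
    for (int i = 0; i < 1000; ++i)              // enqueue all tasks up front
        q.push([i] { volatile long x = 0; for (int j = 0; j < i * 100; ++j) x += j; });

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < 4; ++w)
        workers.emplace_back([&q] {
            std::function<void()> task;
            while (q.try_pop(task)) task();     // workers drain the shared queue
        });
    for (auto& t : workers) t.join();
}
```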
CPU Architecture
- Modern Multicore CPUs [figure]
NUMA
- Non-Uniform Memory Access architecture [figure]
- First-touch Physical Memory Allocation [figure] (see the sketch below)
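A sketch of why first-touch allocation matters: the OS typically assigns a physical page to the NUMA node of the thread that first writes to it, so initializing the data with the same threads (and the same partitioning) that will later process it keeps the accesses node-local. The partitioning scheme is an illustrative assumption, and pinning the workers to NUMA nodes (e.g., via numactl, libnuma, or an affinity API) is assumed to happen elsewhere and is not shown.

```cpp
#include <memory>
#include <thread>
#include <vector>
#include <cstddef>

int main() {
    const std::size_t n = std::size_t(1) << 26;
    // new double[n] without an initializer does not write to the elements, so
    // the backing pages are not yet placed on any NUMA node; the first write
    // ("first touch") typically decides their physical location.
    std::unique_ptr<double[]> data(new double[n]);

    unsigned hc = std::thread::hardware_concurrency();
    const unsigned workers = hc ? hc : 4;
    auto begin_of = [&](unsigned w) { return n * w / workers; };
    auto end_of   = [&](unsigned w) { return n * (w + 1) / workers; };

    // Phase 1: parallel first touch, using the same partitioning as phase 2.
    std::vector<std::thread> init;
    for (unsigned w = 0; w < workers; ++w)
        init.emplace_back([&, w] {
            for (std::size_t i = begin_of(w); i < end_of(w); ++i) data[i] = 0.0;
        });
    for (auto& t : init) t.join();

    // Phase 2: the computation reuses the same blocks, so each (pinned) worker
    // mostly accesses memory on its own NUMA node.
    std::vector<std::thread> work;
    for (unsigned w = 0; w < workers; ++w)
        work.emplace_back([&, w] {
            for (std::size_t i = begin_of(w); i < end_of(w); ++i) data[i] += 1.0;
        });
    for (auto& t : work) t.join();
}
```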
Cache Coherency
- Memory Coherency Problem
  - All cores must perceive the same data
- MESI Protocol (see the sketch below)
  - Implemented at the cache level
  - Each cache line has a special flag: Modified, Exclusive, Shared, or Invalid
  - Memory bus snooping + update rules
  - MOESI protocol – similar, adds the "Owned" state
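The consequence of the MESI rules that matters most for task scheduling is cache-line ping-pong: when two cores repeatedly write to the same line, it keeps bouncing between their caches through the Modified/Invalid states. A sketch of this effect via false sharing, comparing two counters packed into one cache line with padded counters; the structure names, iteration counts, and the 64-byte line size are illustrative assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine {                 // both counters share one cache line: every
    std::atomic<long> a{0};         // write by one core invalidates the line in
    std::atomic<long> b{0};         // the other core's cache (M/I ping-pong)
};

struct PaddedLine {                 // counters placed in separate cache lines
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double run(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < 50'000'000; ++i)
                             c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < 50'000'000; ++i)
                             c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
}

int main() {
    SharedLine s;
    PaddedLine p;
    std::printf("false sharing: %.3f s\n", run(s));
    std::printf("padded:        %.3f s\n", run(p));
}
```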
Cache Coherency
- MESI Protocol [figure]
TBB Task Scheduler
- Intel Threading Building Blocks Scheduler (see the sketch below)
  - Thread pool with private task queues
  - The local thread pushes and pops tasks at the bottom of its own queue
  - Other threads steal tasks from the top of the queue
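A sketch of the data structure behind this scheme: a per-worker deque where the owner pushes and pops at the bottom (LIFO) and idle threads steal from the top (FIFO). Real schedulers such as TBB's use lock-free deques; this mutex-based version only illustrates the access pattern, and the class and method names are illustrative assumptions.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Per-worker task deque: the owning thread works on the bottom end,
// thieves take work from the opposite (top) end.
class WorkStealingDeque {
    std::deque<std::function<void()>> tasks_;
    std::mutex m_;
public:
    void push_bottom(std::function<void()> t) {          // owner: spawn a new task
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop_bottom() {  // owner: LIFO (depth-first)
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        auto t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    std::optional<std::function<void()>> steal_top() {   // thief: FIFO (breadth-first)
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        auto t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
};
```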
TBB Task Scheduler
- Task Dependency Tree
  - Stack-like local processing leads to a DFS expansion of the tree within one thread
    - Reduces memory consumption
    - Improves caching
  - Queue-like stealing leads to a BFS expansion of the tree
Locality Aware Scheduling
- Challenges
  - Maintaining NUMA locality
  - Efficient cache utilization vs. thread affinity
  - Avoiding false sharing
- Key ideas
  - Separate requests on different NUMA nodes
  - Task scheduling considers cache sharing
    - Related tasks are placed on cores that are close to each other
  - Minimize the overhead of task stealing
Locality Aware Scheduling
- Locality Aware Scheduler (Z. Falt)
  - Key ideas (see the affinity sketch below)
    - Queues are associated with cores (not threads)
    - Threads are bound (by affinity) to a NUMA node
    - Two methods for task spawning:
      - Immediate task – related/follow-up work
      - Deferred task – unrelated work
    - Task stealing reflects the CPU core distance:
      - NUMA distance – number of NUMA hops
      - Cache distance – level of the shared cache (L1, L2, ...)
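A sketch of the "threads bound by affinity" part on Linux, pinning each worker thread to one CPU core with pthread_setaffinity_np; mapping cores to NUMA nodes (e.g., via libnuma or hwloc) and the per-core task queues are assumed to exist elsewhere and are not shown.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to a single CPU core (Linux-specific).
static void pin_to_core(unsigned core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    unsigned hc = std::thread::hardware_concurrency();
    const unsigned workers = hc ? hc : 4;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([w] {
            pin_to_core(w);     // bind the worker to "its" core
            // ...the worker would now process tasks from the queue associated
            // with this core, keeping its caches and NUMA node warm.
            std::printf("worker %u pinned, now running on core %d\n", w, sched_getcpu());
        });
    }
    for (auto& t : pool) t.join();
}
```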
Locality Aware Scheduling
- Locality Aware Scheduler (Z. Falt) [figure]
Discussion