Task Scheduling for Multicore CPUs and NUMA Systems


1 Task Scheduling for Multicore CPUs and NUMA Systems
Martin Kruliš (v1.2)

2 Parallel Multiprocessing
Logical structure of processes and threads:
- Fibers (user threads)
- Processes (full threads)
- Kernel threads
- CPU cores

3 Scheduling in OS

Thread/Process Scheduling in OS
- Operating systems have multiple requirements:
  - Fairness (regarding multiple processes)
  - Throughput (maximizing CPU utilization)
  - Latency (minimizing response time)
  - Efficiency (minimizing overhead)
  - Additional constraints (I/O-bound operations)
- Threads are scheduled on the available cores preemptively (a thread can be removed from a core)
- An optimal solution does not exist; a compromise between the requirements is established

4 Parallel Programming

Parallel User Applications
- More "cooperative" parallel processing: the whole application has the same top-level objective
- Typically aspires to reduce processing time/latency

Fork/Join Model
- One of the simplest and most often employed models
- Easily achieved without special libraries in many languages (see the sketch below)
- Often employed incorrectly: clumsy code, large overhead of creating threads, ...
- Suitable mostly for large-task parallelism; data parallelism is more important nowadays
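As a minimal illustration of fork/join in plain standard C++ (no special library), the sketch below sums an array by forking one thread per chunk and joining them afterwards; the chunking scheme and worker count are illustrative choices. It also exhibits the overhead problem mentioned above: every call creates and destroys fresh threads.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Fork/join sketch: fork one thread per chunk, join, then combine results.
long parallel_sum(const std::vector<long>& data, unsigned workers) {
    std::vector<std::thread> threads;
    std::vector<long> partial(workers, 0);
    std::size_t chunk = data.size() / workers;
    for (unsigned i = 0; i < workers; ++i) {              // fork
        std::size_t b = i * chunk;
        std::size_t e = (i + 1 == workers) ? data.size() : b + chunk;
        threads.emplace_back([&, i, b, e] {
            partial[i] = std::accumulate(data.begin() + b, data.begin() + e, 0L);
        });
    }
    for (auto& t : threads) t.join();                     // join
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```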

5 Task-based Parallelism
Task-based Decomposition: an abstraction for programming

Task
- One piece of work to be performed (code + data)
- Computational in nature, executed non-preemptively
- Typically represented by an object (functor)
- Much more lightweight than a thread
- Ideal size: on the order of thousands of instructions

Decomposition
- Both task-parallel and data-parallel problems can be easily decomposed into small tasks
- Requires some implicit synchronization mechanisms

6 Task-Based Programming
Task-Based Programming Issues
- Task spawning
- Scheduling (i.e., load balancing)
- (Implicit) synchronization

Example: Intel Threading Building Blocks
- Tasks can be spawned by other tasks, which simplifies nesting and load balancing
- Each task has a pointer to a successor and a refcount (how many tasks point to it as their successor)
- A task decrements its successor's refcount on completion
- Tasks with refcount == 0 can be executed (see the sketch below)
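A minimal sketch of the refcount bookkeeping just described; the Task type and the enqueue callback are illustrative stand-ins, not TBB's actual classes.

```cpp
#include <atomic>
#include <functional>

// Illustrative task: code + data, a successor pointer, and a refcount
// counting how many tasks still point to this one as their successor.
struct Task {
    std::function<void()> work;
    Task* successor = nullptr;
    std::atomic<int> refcount{0};
};

// Run a task, then decrement its successor's refcount; a successor whose
// refcount drops to zero has no pending predecessors and becomes runnable.
void run_and_finish(Task& t, const std::function<void(Task*)>& enqueue) {
    t.work();
    if (t.successor && t.successor->refcount.fetch_sub(1) == 1)
        enqueue(t.successor);   // fetch_sub returned 1, so refcount is now 0
}
```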

7 Task Spawning

Blocking Pattern
- The parent spawns its children and waits for them
- A special blocking call invokes the scheduler
- The parent's refcount is greater than the number of children (in TBB, # of children + 1; the extra one accounts for the wait)
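The blocking pattern maps directly onto the classic low-level tbb::task API (since removed from oneTBB); the sketch below is a condensed version of the well-known Fibonacci example from the TBB documentation.

```cpp
#include <tbb/task.h>   // classic TBB task API (removed in oneTBB 2021)

class FibTask : public tbb::task {
    const long n;
    long* const sum;
public:
    FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}
    tbb::task* execute() override {
        if (n < 2) { *sum = n; return nullptr; }
        long x, y;
        FibTask& a = *new(allocate_child()) FibTask(n - 1, &x);
        FibTask& b = *new(allocate_child()) FibTask(n - 2, &y);
        set_ref_count(3);             // 2 children + 1 for the wait itself
        spawn(b);
        spawn_and_wait_for_all(a);    // blocking call that re-enters the scheduler
        *sum = x + y;                 // both children are done here
        return nullptr;
    }
};
```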

8 Task Spawning

Continuation Passing
- The parent creates a continuation task and the children
- The children start immediately (refcount == 0)
- The continuation's refcount == # of children
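The same Fibonacci example rewritten in continuation-passing style (again the classic tbb::task API): the parent hands its place in the tree to a continuation and returns without waiting.

```cpp
#include <tbb/task.h>   // classic TBB task API, as above

struct FibContinuation : tbb::task {
    long* const sum;
    long x, y;
    explicit FibContinuation(long* sum_) : sum(sum_) {}
    tbb::task* execute() override { *sum = x + y; return nullptr; }
};

struct FibTask : tbb::task {
    const long n;
    long* const sum;
    FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}
    tbb::task* execute() override {
        if (n < 2) { *sum = n; return nullptr; }
        // The continuation takes the parent's place in the dependency tree.
        FibContinuation& c = *new(allocate_continuation()) FibContinuation(sum);
        FibTask& a = *new(c.allocate_child()) FibTask(n - 1, &c.x);
        FibTask& b = *new(c.allocate_child()) FibTask(n - 2, &c.y);
        c.set_ref_count(2);   // exactly # of children; no extra slot for waiting
        spawn(b);
        return &a;            // scheduler bypass: this thread runs 'a' next
    }
};
```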

9 Parallel Algorithm Templates
Parallel-reduce Decomposition

[Diagram: a Reduce <0,100) task that is too large is split into Reduce <0,50) and Reduce <50,100), each followed by a Finalize step; those are still too large and split further into Reduce <0,25), <25,50), <50,75), and <75,100).]
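This recursive range splitting is what tbb::parallel_reduce does for the caller; a minimal sketch (functional form, summing a vector) follows.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <cstddef>
#include <vector>

// The range is split recursively (as in the diagram above); partial sums are
// computed per subrange and then combined, which plays the "Finalize" role.
long sum(const std::vector<long>& v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, v.size()),
        0L,
        [&](const tbb::blocked_range<std::size_t>& r, long acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) acc += v[i];
            return acc;
        },
        [](long a, long b) { return a + b; });   // combine partial results
}
```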

10 Task Scheduling

Thread Pool
- Workers process the available tasks (e.g., tasks with refcount == 0 in TBB)
- # of workers ~ # of available CPU cores
- Various scheduling strategies (which task goes first)

Oversubscription
- Creating many more tasks than there are workers (see the sketch below)
- Provides an opportunity for load balancing, even when the length of a task is data-driven (and thus unpredictable)
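One way to force oversubscription in TBB is a fixed grain size with a simple_partitioner, which keeps splitting until every task is small; the grain size of 256 below is an illustrative choice, not a recommendation.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>

// With simple_partitioner, the range is split all the way down to the grain
// size, yielding roughly n/256 tasks regardless of the number of workers, so
// slow (data-driven) chunks can be balanced across threads.
void process(float* data, std::size_t n) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, n, /*grainsize=*/256),
        [=](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;   // stand-in for real, uneven work
        },
        tbb::simple_partitioner{});
}
```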

11 Task Scheduling

Task Scheduling Strategies

Static Scheduling
- Used when the number and length of the tasks are predictable
- Tasks are assigned to the threads at the beginning
- Virtually no scheduling overhead (after the assignment)

Dynamic Scheduling
- Tasks can be reassigned to other workers
- Tasks are assigned as the workers become available

Other Strategies
- Scheduling in phases: tasks are assigned statically; once some workers become available, an overall reassignment is performed

The contrast between static and dynamic assignment is illustrated below.
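OpenMP expresses both strategies as one-line policies, which makes the contrast easy to see; the loop body is a hypothetical stand-in for work of uneven length.

```cpp
#include <omp.h>

// Hypothetical uneven job: later iterations do more work.
void work(int i) {
    volatile double x = 0;
    for (int k = 0; k < i; ++k) x += k;
}

void run(int n) {
    // Static: iterations are divided among the threads up front;
    // virtually no runtime overhead, but uneven work is not rebalanced.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) work(i);

    // Dynamic: threads grab chunks of 16 iterations as they become free;
    // more overhead, but uneven work is balanced automatically.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i) work(i);
}
```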

12 Dynamic Task Scheduling
Scheduling Algorithms
- Many different approaches, suitable for different specific scenarios

Global task queue
- Threads atomically pop tasks (or push tasks)
- The queue may become a bottleneck (see the sketch below)

Private task queues per thread
- Each thread processes/spawns its own tasks
- What should a thread do when its queue is empty?

Combined solutions
- Local and shared queues
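A minimal sketch of the global-queue variant (illustrative types, not a production queue): every push and pop serializes on one mutex, which is exactly why the shared queue becomes a bottleneck as workers are added.

```cpp
#include <functional>
#include <mutex>
#include <optional>
#include <queue>

// Shared global queue: simple and fair, but all workers contend on one lock.
class GlobalQueue {
    std::mutex m;
    std::queue<std::function<void()>> q;
public:
    void push(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(m);
        q.push(std::move(task));
    }
    std::optional<std::function<void()>> pop() {
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.front());
        q.pop();
        return t;
    }
};
```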

13 CPU Architecture

Modern Multicore CPUs

14 NUMA

Non-Uniform Memory Access

First-touch Physical Memory Allocation
- A physical page is mapped on the NUMA node of the thread that first writes to it
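A common way to exploit first-touch (sketch, assuming an OpenMP build): initialize the array in parallel with the same static partitioning the computation will later use, so each page is first touched, and therefore placed, near the thread that will work on it.

```cpp
#include <cstddef>
#include <cstdlib>
#include <omp.h>

// malloc only reserves virtual memory for a large block; physical pages are
// mapped on the first write, i.e., on the touching thread's NUMA node.
double* numa_friendly_alloc(std::size_t n) {
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;   // first touch: the page is placed near this thread
    return a;
}
```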

15 Cache Coherency

Memory Coherency Problem
- Handled at the cache level
- All cores must perceive the same data

MESI Protocol
- Each cache line has a special flag: Modified, Exclusive, Shared, or Invalid
- Memory bus snooping + update rules
- The MOESI protocol is similar, adding an "Owned" state

16 Cache Coherency

MESI Protocol

17 TBB Task Scheduler

Intel Threading Building Blocks Scheduler
- Thread pool with private task queues
- The local thread gets/inserts tasks from/to the bottom of its queue
- Other threads steal tasks from the top of the queue (see the sketch below)
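A simplified per-thread queue with these access rules (a real scheduler uses a lock-free deque; a mutex keeps the sketch short):

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Owner pushes/pops at the bottom (LIFO, cache-hot recent tasks);
// thieves steal from the top (FIFO, the oldest and usually largest tasks).
class WorkQueue {
    std::mutex m;
    std::deque<std::function<void()>> d;
public:
    void push_local(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m);
        d.push_back(std::move(t));                    // bottom
    }
    std::optional<std::function<void()>> pop_local() {
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.back()); d.pop_back();   // bottom
        return t;
    }
    std::optional<std::function<void()>> steal() {
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return std::nullopt;
        auto t = std::move(d.front()); d.pop_front(); // top
        return t;
    }
};
```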

18 TBB Task Scheduler

Task Dependency Tree
- Stack-like local processing leads to DFS tree expansion within one thread
  - Reduces memory consumption
  - Improves caching
- Queue-like stealing leads to BFS tree expansion

19 Locality Aware Scheduling

Challenges
- Maintaining NUMA locality
- Efficient cache utilization vs. thread affinity
- Avoiding false sharing (see the sketch below)

Key ideas
- Separate requests on different NUMA nodes
- Task scheduling considers cache sharing
- Related tasks run on cores that are close to each other
- Minimize the overhead of task stealing
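For the false-sharing point, the usual remedy is to pad or align per-worker data to cache-line boundaries; a sketch follows (64 bytes is the typical x86 line size; C++17's std::hardware_destructive_interference_size can be used where available).

```cpp
#include <atomic>
#include <cstddef>

// Without the alignment, adjacent counters would share a cache line, and
// every increment by one worker would invalidate that line in all other
// cores' caches even though no data is logically shared.
constexpr std::size_t kCacheLine = 64;   // typical x86 cache-line size

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[16];   // one per worker, each on its own cache line
```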

20 Locality Aware Scheduling

Locality Aware Scheduler (Z. Falt)

Key ideas
- Queues are associated with cores (not threads)
- Threads are bound (by affinity) to a NUMA node
- Two methods for task spawning:
  - Immediate task: related/follow-up work
  - Deferred task: unrelated work
- Task stealing reflects CPU core distance (see the sketch below):
  - NUMA distance: the number of NUMA hops
  - Cache distance: the level of shared cache (L1, L2, ...)
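An illustrative sketch of distance-ordered stealing (not Falt's actual implementation; the Core fields and the two-level distance are simplifying assumptions): an idle worker tries victims that share a cache first, then other cores on its NUMA node, and only then remote nodes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Core { int numa_node; int cache_group; };   // simplified topology

// Smaller distance = cheaper steal: shared cache, then same NUMA node.
int distance(const Core& a, const Core& b) {
    if (a.numa_node != b.numa_node) return 2;   // NUMA hop: most expensive
    if (a.cache_group != b.cache_group) return 1;
    return 0;                                   // shares a cache level
}

// Victim cores for `self`, ordered nearest-first.
std::vector<std::size_t> steal_order(const std::vector<Core>& cores,
                                     std::size_t self) {
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < cores.size(); ++i)
        if (i != self) order.push_back(i);
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) {
            return distance(cores[self], cores[a])
                 < distance(cores[self], cores[b]);
        });
    return order;
}
```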

21 Locality Aware Scheduling
Locality Aware Scheduler (Z. Falt)

22 Discussion

