by Martin Kruliš (v1.1)
Thread Scheduling in OS
◦ Operating systems balance multiple requirements
  Fairness (across multiple processes)
  Throughput (maximizing CPU utilization)
  Latency (minimizing response time)
  Efficiency (minimizing scheduling overhead)
  Additional constraints (I/O-bound operations)
◦ Threads are scheduled on the available cores
  Preemptively (a thread can be removed from a core)
◦ No optimal solution exists
  A compromise between the requirements is established
Task Scheduling in Parallel Applications
◦ A completely different problem
  The tasks share common objective(s)
  Possibly much more information is available about the tasks and their structure (than the OS has about threads)
◦ Task (typical definition)
  A portion of work (code + data)
  Sufficiently small and indivisible
  Typically scheduled non-preemptively
  May have dependencies (one task must finish before another can be executed)
Task Scheduling Issues
◦ Task spawning
  All tasks are created at the beginning, or
  tasks are spawned dynamically by other tasks
◦ Predictability of time complexity
  The number of instructions is fixed, or depends on the data
◦ Blocking operations
  Computing tasks vs. I/O (disk, network, GPU, …) tasks
◦ Optimization issues
  Task dependencies may allow various orderings
  Data produced by one task are consumed by another task
Task Scheduling Strategies
◦ Static scheduling
  Used when the number and length of the tasks are predictable
  Tasks are assigned to threads at the beginning
  Virtually no scheduling overhead (after the assignment)
◦ Dynamic scheduling
  Used when tasks are spawned ad hoc, or their lengths are unpredictable and vary greatly
  Oversubscription – many more tasks than threads
  The task-to-thread assignment may not be determined directly (when the task is created) and it may change over time
Scheduling Algorithms
◦ Many different approaches, suitable for different specific scenarios
◦ Global task queue
  Threads atomically pop (or push) tasks
  The queue may become a bottleneck
◦ Private task queue per thread
  Each thread processes/spawns its own tasks
  What should a thread do when its queue is empty?
◦ Combined solutions
  Local and shared queues
Modern Multicore CPUs
Non-Uniform Memory Architecture (NUMA)
◦ First-touch physical memory allocation
Memory Coherency Problem
◦ All cores must perceive the same data
◦ Implemented at the cache level – the MESI protocol
◦ Each cache line has a special state flag
  Modified
  Exclusive
  Shared
  Invalid
◦ Memory bus snooping + state-update rules
MESI Protocol
Intel Threading Building Blocks Scheduler
◦ Thread pool with a private task queue per thread
◦ The local thread gets/inserts tasks from/to the bottom of its queue
◦ Other threads steal tasks from the top of the queue
Task Dependency Tree
◦ Stack-like (LIFO) local processing leads to DFS expansion of the tree within one thread
  Reduces memory consumption
  Improves caching
◦ Queue-like (FIFO) stealing leads to BFS expansion of the tree
Challenges
◦ Maintaining NUMA locality
◦ Efficient cache utilization vs. thread affinity
◦ Avoiding false sharing
Key ideas
◦ Separate requests on different NUMA nodes
◦ Task scheduling considers cache sharing
  Related tasks run on cores that are close to each other
◦ Minimize the overhead of task stealing
Locality Aware Scheduler (Z. Falt)
◦ Key ideas
  Queues are associated with cores (not threads)
  Threads are bound (by affinity) to NUMA nodes
  Two methods of task spawning
    Immediate task – related/follow-up work
    Deferred task – unrelated work
  Task stealing reflects the distance between CPU cores
    NUMA distance – number of NUMA hops
    Cache distance – level of shared cache (L1, L2, …)
Locality Aware Scheduler (Z. Falt)