Lecture 27: Multiprocessor Scheduling
Last lecture: VMM
- Two old problems revisited: CPU virtualization and memory virtualization
- I/O virtualization
Today
- Issues related to multi-core: scheduling and scalability
The cache coherence problem
Since we have multiple private caches: how do we keep the data consistent across caches?
Each core should perceive memory as a single monolithic array, shared by all the cores.
The cache coherence problem
[Diagram: a multi-core chip with four cores, each with one or more levels of private cache, above a shared main memory. Cores 1 and 2 both cache x=15213; main memory also holds x=15213.]
The cache coherence problem
[Diagram: Core 1 writes x=21660 into its own cache; Core 2 still caches x=15213 and main memory still holds x=15213, assuming write-back caches.]
The cache coherence problem
[Diagram: the starting state again: Cores 1 and 2 both cache x=15213, and main memory holds x=15213.]
The cache coherence problem
[Diagram: Core 1 writes x=21660; with write-through caches, main memory is updated to x=21660, but Core 2 still caches the stale x=15213.]
Solutions for cache coherence
Many solution algorithms and coherence protocols exist.
A simple solution: an invalidation protocol with bus snooping.
Inter-core bus
[Diagram: the four cores and their caches are connected to main memory by a shared inter-core bus on the multi-core chip.]
Invalidation protocol with snooping
- Invalidation: if a core writes to a data item, all copies of that item in other caches are invalidated.
- Snooping: all cores continuously "snoop" (monitor) the bus connecting the cores.
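The protocol can be sketched as a toy simulation. This is an illustrative model only, not real hardware: the `Core` and `Bus` classes are hypothetical, caches are plain dicts, and the caches are assumed write-through.

```python
# Toy snooping-invalidation model with write-through caches on a shared bus.
class Bus:
    def __init__(self, memory):
        self.memory = memory      # main memory, shared by all cores
        self.cores = []           # every core attached to the bus

class Core:
    def __init__(self, bus):
        self.cache = {}           # this core's private cache
        self.bus = bus
        bus.cores.append(self)

    def read(self, addr):
        if addr not in self.cache:                  # miss: fetch from memory
            self.cache[addr] = self.bus.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.bus.memory[addr] = value               # write-through to memory
        for other in self.bus.cores:                # snoop: invalidate others
            if other is not self:
                other.cache.pop(addr, None)

bus = Bus({"x": 15213})
c1, c2 = Core(bus), Core(bus)
c1.read("x"); c2.read("x")    # both caches now hold x=15213
c1.write("x", 21660)          # c2's copy is invalidated on the bus
print(c2.read("x"))           # 21660 -- the miss refetches the new value
```

Note how Core 2 never sees a stale value: its copy is gone, so the next read misses and pulls the fresh value from memory.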
The cache coherence problem
[Diagram: the starting state for the invalidation example: Cores 1 and 2 both cache x=15213, and main memory holds x=15213.]
The cache coherence problem
[Diagram: Core 1 writes x=21660 and sends an invalidation request on the bus; Core 2's copy is INVALIDATED, and main memory is updated to x=21660, assuming write-through caches.]
The cache coherence problem
[Diagram: Core 2 re-reads x and fetches the new value: both cores now cache x=21660 and main memory holds x=21660, assuming write-through caches.]
Alternative to the invalidation protocol: the update protocol
[Diagram: Core 1 writes x=21660 and broadcasts the updated value on the bus; main memory holds x=21660, but Core 2 still caches x=15213, assuming write-through caches.]
Alternative to the invalidation protocol: the update protocol
[Diagram: after the broadcast, Core 2's copy is updated in place: both cores cache x=21660 and main memory holds x=21660, assuming write-through caches.]
Invalidation vs. update
Multiple writes to the same location:
- invalidation: bus traffic only on the first write
- update: every write must be broadcast (including the new value)
Invalidation generally performs better: it generates less bus traffic.
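The traffic difference can be made concrete with a deliberately simplified model: one core repeatedly writes a line that others have cached, and no other core re-reads it in between. `bus_messages` is a hypothetical helper, not part of any real protocol.

```python
# Simplified bus-traffic model for repeated writes to one cached line.
def bus_messages(num_writes, protocol):
    if protocol == "invalidate":
        # Only the first write broadcasts; other copies are then gone,
        # so later writes hit locally with no bus traffic.
        return 1 if num_writes > 0 else 0
    if protocol == "update":
        # Every write broadcasts the new value to all sharers.
        return num_writes
    raise ValueError(protocol)

print(bus_messages(100, "invalidate"))  # 1
print(bus_messages(100, "update"))      # 100
```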
Programmers still need to worry about concurrency
- Mutexes
- Condition variables
- Lock-free data structures
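As a minimal example of the first item, a mutex protecting a shared counter (in Python, `threading.Lock` plays the role of the mutex; without it, concurrent `counter += 1` updates can be lost):

```python
import threading

counter = 0
lock = threading.Lock()          # mutex protecting `counter`

def increment(n):
    global counter
    for _ in range(n):
        with lock:               # only one thread mutates at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000
```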
Single-Queue Multiprocessor Scheduling (SQMS)
- Reuse the basic framework of single-processor scheduling.
- Put all jobs that need to be scheduled into a single queue.
- If there are two CPUs, pick the best two jobs to run.
Advantage: simple. Disadvantage: does not scale.
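The idea can be sketched in a few lines; `schedule_tick` is a hypothetical round-robin helper, and a real scheduler of course tracks priorities, time slices, and much more state.

```python
from collections import deque

# Toy single-queue multiprocessor scheduler: one shared queue, N CPUs.
def schedule_tick(run_queue, num_cpus):
    """Pick the next `num_cpus` jobs to run this tick (round-robin)."""
    running = [run_queue.popleft()
               for _ in range(min(num_cpus, len(run_queue)))]
    run_queue.extend(running)   # requeue at the tail for a later tick
    return running

q = deque(["A", "B", "C", "D", "E"])
print(schedule_tick(q, 2))   # ['A', 'B']
print(schedule_tick(q, 2))   # ['C', 'D']
```

Note that nothing ties a job to a particular CPU: across ticks the same job lands wherever a CPU is free, which is exactly the affinity problem the next slides discuss.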
SQMS and Cache Affinity
Cache Affinity
Thread migration is costly:
- the execution pipeline must be restarted
- cached data is lost (the thread restarts with a cold cache)
The OS scheduler therefore tries to avoid migration as much as possible: it tends to keep a thread on the same core.
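On Linux, an application can also manage affinity itself. A sketch using the Linux-specific `os.sched_getaffinity`/`os.sched_setaffinity` calls (pid 0 means the calling process); this will not work on macOS or Windows:

```python
import os

allowed = os.sched_getaffinity(0)   # cores the calling process may run on
one_core = {min(allowed)}
os.sched_setaffinity(0, one_core)   # pin: the scheduler will not migrate us
assert os.sched_getaffinity(0) == one_core
```

Pinning preserves cache affinity at the cost of load-balancing flexibility, so it is usually reserved for latency-sensitive or cache-sensitive workloads.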
Multi-Queue Multiprocessor Scheduling (MQMS)
Advantages: scalable, and preserves cache affinity.
Load imbalance
Per-queue loads can become unbalanced; the fix is migration: moving jobs from busier queues to less busy ones.
Work Stealing
- A (source) queue that is low on jobs will occasionally peek at another (target) queue.
- If the target queue is notably fuller than the source, the source "steals" one or more jobs from the target to help balance load.
- The source cannot look around at other queues too often, or the checking overhead defeats the purpose.
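A minimal sketch of the stealing step, with per-CPU deques. The policy choices here (steal only when the local queue is empty, steal half the victim's jobs, pick the victim at random) are illustrative assumptions, not the ones from the lecture.

```python
import random
from collections import deque

def maybe_steal(queues, me, rng=random):
    """If queue `me` is empty, steal half the jobs from a random victim."""
    if queues[me]:
        return                          # not starved, do nothing
    victim = rng.choice([i for i in range(len(queues)) if i != me])
    n = len(queues[victim]) // 2        # steal half to balance load
    for _ in range(n):
        queues[me].append(queues[victim].pop())  # take from the victim's tail

qs = [deque(), deque(["A", "B", "C", "D"])]
maybe_steal(qs, 0)
print(len(qs[0]), len(qs[1]))   # 2 2
```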
Linux Multiprocessor Schedulers
Both approaches have been successful in practice:
- O(1) scheduler and Completely Fair Scheduler (CFS): multiple queues
- BF Scheduler (BFS): a single queue
An Analysis of Linux Scalability to Many Cores This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale
Amdahl's Law
N: the number of threads of execution
B: the fraction of the algorithm that is strictly serial
The theoretical speedup:
    S(N) = 1 / (B + (1 - B) / N)
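Plugging in numbers shows how the serial fraction caps the speedup. The function below is a direct transcription of the formula; the example fractions are arbitrary.

```python
def amdahl_speedup(n_threads, serial_fraction):
    """Theoretical speedup with N threads when fraction B is strictly serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# With 10% serial work, even unlimited cores cannot beat 1/B = 10x:
print(round(amdahl_speedup(4, 0.10), 2))     # 3.08
print(round(amdahl_speedup(1000, 0.10), 2))  # 9.91
```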
Scalability Issues
- A global lock on a shared data structure: longer lock wait times
- Shared memory locations: overhead caused by the cache-coherence algorithms
- Tasks competing for a limited-size shared hardware cache: increased cache miss rates
- Tasks competing for shared hardware resources (interconnects, DRAM interfaces): more time wasted waiting
- Too few available tasks: less efficiency
How to avoid/fix
These issues can often be avoided (or limited) using popular parallel programming techniques:
- Lock-free algorithms
- Per-core data structures
- Fine-grained locking
- Cache alignment
- Sloppy counters
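The last item can be sketched as follows: each thread keeps a local count and only pushes it to the global counter when the local value reaches a threshold S, so the global lock is taken rarely. This is a minimal, single-purpose sketch of the sloppy-counter idea; the class name and threshold value are illustrative.

```python
import threading

class SloppyCounter:
    """Approximate, scalable counter: per-thread counts, rare global merges."""
    def __init__(self, threshold=1024):
        self.threshold = threshold
        self.global_count = 0
        self.global_lock = threading.Lock()
        self.local = threading.local()      # per-thread local count

    def increment(self):
        n = getattr(self.local, "count", 0) + 1
        if n >= self.threshold:
            with self.global_lock:          # amortized over S increments
                self.global_count += n
            n = 0
        self.local.count = n

    def flush(self):
        """Fold this thread's leftover local count into the global count."""
        with self.global_lock:
            self.global_count += getattr(self.local, "count", 0)
        self.local.count = 0

c = SloppyCounter(threshold=10)
for _ in range(25):
    c.increment()
c.flush()
print(c.global_count)  # 25
```

The trade-off is exactness: between flushes, `global_count` lags the true total by up to S per thread, which is acceptable for statistics but not for, say, reference counts.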
Current bottlenecks
https://www.usenix.org/conference/osdi10/analysis-linux-scalability-many-cores