

Lecture 27: Multiprocessor Scheduling

Last lecture: VMM
- Two old problems: CPU virtualization and memory virtualization
- I/O virtualization
Today
- Issues related to multi-core: scheduling and scalability

The cache coherence problem
Since we have multiple private caches: how do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores.

The cache coherence problem
[Diagram: a multi-core chip; Core 1 and Core 2 each hold x=15213 in their private caches, and main memory also holds x=15213. All copies agree.]

The cache coherence problem
[Diagram: Core 1 writes x=21660 into its private cache; Core 2 still caches x=15213, and main memory still holds x=15213 (assuming write-back caches). The copies now disagree.]

The cache coherence problem
[Diagram: back to the consistent starting state; Core 1 and Core 2 each cache x=15213, matching main memory.]

The cache coherence problem
[Diagram: Core 1 writes x=21660; main memory is updated to x=21660 (assuming write-through caches), but Core 2 still caches the stale value x=15213.]

Solutions for cache coherence
Many coherence algorithms and protocols exist.
A simple solution: an invalidation protocol with bus snooping.

Inter-core bus
[Diagram: on the multi-core chip, each core's private cache and main memory are connected by a shared inter-core bus.]

Invalidation protocol with snooping
Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated.
Snooping: all cores continuously "snoop" (monitor) the bus connecting the cores.
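The two mechanisms can be sketched as a toy simulation. The `Core` and `Bus` classes below are illustrative only (not a real coherence implementation); the sketch assumes write-through caches, matching the diagrams that follow:

```python
# Toy model of an invalidation protocol with bus snooping.
# Each core has a private cache (a dict); a write broadcasts an
# invalidation on the shared bus, and every other core snoops it.

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.cores = []

    def broadcast_invalidate(self, addr, sender):
        for core in self.cores:              # every core snoops the bus
            if core is not sender:
                core.cache.pop(addr, None)   # drop any stale copy

class Core:
    def __init__(self, cid, bus):
        self.cid = cid
        self.cache = {}                      # private cache: addr -> value
        self.bus = bus
        bus.cores.append(self)

    def read(self, addr):
        if addr not in self.cache:           # cache miss: fetch from memory
            self.cache[addr] = self.bus.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, sender=self)
        self.cache[addr] = value
        self.bus.memory[addr] = value        # write-through to memory

bus = Bus(memory={"x": 15213})
c1, c2 = Core(1, bus), Core(2, bus)
c1.read("x"); c2.read("x")   # both caches now hold x=15213
c1.write("x", 21660)         # c2's copy is invalidated
print(c2.read("x"))          # c2 misses and re-fetches 21660
```

This mirrors the diagram sequence below: the write by Core 1 invalidates Core 2's copy, and Core 2's next read misses and fetches the new value.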

The cache coherence problem
[Diagram: the starting state again; Core 1 and Core 2 each cache x=15213, matching main memory.]

The cache coherence problem
[Diagram: Core 1 writes x=21660 and sends an invalidation request on the bus; Core 2's copy of x is INVALIDATED. Main memory is updated to x=21660 (assuming write-through caches).]

The cache coherence problem
[Diagram: Core 2's next read of x misses and fetches the new value; both caches and main memory now hold x=21660 (assuming write-through caches).]

Alternative to the invalidation protocol: update protocol
[Diagram: Core 1 writes x=21660 and broadcasts the updated value on the bus; main memory is updated to x=21660 (assuming write-through caches), while Core 2 still holds x=15213.]

Alternative to the invalidation protocol: update protocol
[Diagram: Core 2 snoops the broadcast and updates its copy in place; both caches and main memory now hold x=21660 (assuming write-through caches).]

Invalidation vs. update
Multiple writes to the same location:
- invalidation: bus traffic only on the first write
- update: must broadcast each write (which includes the new value)
Invalidation generally performs better: it generates less bus traffic.
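The traffic difference is easy to quantify. The two helper functions below are a hypothetical back-of-the-envelope model for k consecutive writes by one core to a location that other cores have cached, with no intervening reads:

```python
# Bus messages for k consecutive writes by one core to the same
# location (no intervening reads by other cores):

def invalidation_messages(k):
    # Only the first write broadcasts an invalidation; after that the
    # writer holds the only valid copy, so later writes are silent.
    return 1 if k > 0 else 0

def update_messages(k):
    # Every write must broadcast the new value to the other caches.
    return k

print(invalidation_messages(10), update_messages(10))   # 1 10
```

Ten writes cost one bus message under invalidation versus ten under update, which is why invalidation usually wins.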

Programmers still need to worry about concurrency
- Mutexes
- Condition variables
- Lock-free data structures
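Coherence keeps individual loads and stores consistent, but a read-modify-write sequence such as `counter += 1` is still a race: two threads can read the same old value. A minimal mutex sketch using Python's `threading.Lock`:

```python
# Without the lock, concurrent "counter += 1" updates can be lost;
# the mutex serializes the read-modify-write sequence.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # mutual exclusion around the update
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # always 400000 with the lock held
```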

Single-Queue Multiprocessor Scheduling (SQMS)
- Reuse the basic framework for single-processor scheduling
- Put all jobs that need to be scheduled into a single queue
- If there are two CPUs, pick the best two jobs to run
Advantage: simple
Disadvantage: does not scale
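The idea can be sketched in a few lines. `sqms_schedule` is an illustrative name, and the sketch ignores time slices and priorities; it only shows each idle CPU pulling the next job off the one shared queue:

```python
# SQMS sketch: one global ready queue; every CPU takes its next job
# from the front of that queue. In a real kernel the queue is
# protected by a lock, which is exactly why SQMS does not scale.
from collections import deque

def sqms_schedule(jobs, num_cpus):
    queue = deque(jobs)                     # the single shared queue
    assignment = {cpu: [] for cpu in range(num_cpus)}
    while queue:
        for cpu in range(num_cpus):         # each CPU grabs the next job
            if not queue:
                break
            assignment[cpu].append(queue.popleft())
    return assignment

print(sqms_schedule(["A", "B", "C", "D", "E"], 2))
# {0: ['A', 'C', 'E'], 1: ['B', 'D']}
```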

SQMS and Cache Affinity

Cache Affinity
Thread migration is costly:
- the execution pipeline must be restarted
- cached data is invalidated
The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core.

SQMS and Cache Affinity

Multi-Queue Multiprocessor Scheduling (MQMS)
- Scalable
- Preserves cache affinity

Load Imbalance
With multiple queues, some queues can run dry while others stay full; the fix is migration: moving jobs across queues to balance the load.

Work Stealing
- A (source) queue that is low on jobs occasionally peeks at another (target) queue
- If the target queue is (notably) fuller than the source queue, the source "steals" one or more jobs from the target to help balance load
- The source cannot look at other queues too often, or the checking overhead defeats the scalability benefit
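The stealing step above can be sketched as follows; the `steal` function and its threshold are illustrative choices, not a real scheduler's policy:

```python
# Work-stealing sketch: if the target queue is notably fuller than the
# source, move half of the surplus over. Queues are plain lists here.

def steal(source, target, threshold=2):
    """Move jobs from target to source if target is notably fuller."""
    gap = len(target) - len(source)
    if gap >= threshold:
        for _ in range(gap // 2):      # steal half the gap
            source.append(target.pop())  # take from the target's tail
    return source, target

q0, q1 = ["A"], ["B", "C", "D", "E", "F"]
steal(q0, q1)
print(q0, q1)   # ['A', 'F', 'E'] ['B', 'C', 'D']
```

Taking from the tail is a common design choice: the jobs at the head of the target queue are the ones most likely to have warm cache state on the target's CPU.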

Linux Multiprocessor Schedulers
Both approaches can be successful:
- O(1) scheduler (multiple queues)
- Completely Fair Scheduler (CFS) (multiple queues)
- BF Scheduler (BFS): uses a single queue

An Analysis of Linux Scalability to Many Cores
This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale.

Amdahl's Law
N: the number of threads of execution
B: the fraction of the algorithm that is strictly serial
The theoretical speedup: S(N) = 1 / (B + (1 - B) / N)
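Plugging numbers into the formula shows why even a small serial fraction dominates at high core counts (the 48-core, 5%-serial figures below are just an illustrative example):

```python
# Amdahl's law: S(N) = 1 / (B + (1 - B) / N), where B is the strictly
# serial fraction and N is the number of threads.

def speedup(n_threads, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

print(speedup(48, 0.05))   # ~14.3x on 48 cores with 5% serial code
print(1 / 0.05)            # upper bound as N -> infinity: 20x
```

So with only 5% serial code, 48 cores deliver about a 14x speedup, and no number of cores can exceed 20x.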

Scalability Issues
- A global lock used for a shared data structure: longer lock wait times
- A shared memory location: overhead caused by the cache coherence algorithms
- Tasks compete for a limited-size shared hardware cache: increased cache miss rates
- Tasks compete for shared hardware resources (interconnects, DRAM interfaces): more time wasted waiting
- Too few available tasks: less efficiency

How to avoid/fix
These issues can often be avoided (or limited) using popular parallel programming techniques:
- Lock-free algorithms
- Per-core data structures
- Fine-grained locking
- Cache alignment
- Sloppy counters
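A sloppy counter combines two of these techniques: per-core data structures and reduced contention on the shared state. The `SloppyCounter` class below is a minimal sketch (class and parameter names are my own); each core increments a local counter and only folds it into the global counter when it reaches a threshold:

```python
# Sloppy counter sketch: per-core counters absorb most increments, so
# the global counter (and its lock) is touched only rarely.
import threading

class SloppyCounter:
    def __init__(self, num_cores, threshold=1024):
        self.global_count = 0
        self.global_lock = threading.Lock()
        self.local = [0] * num_cores
        self.local_locks = [threading.Lock() for _ in range(num_cores)]
        self.threshold = threshold

    def increment(self, core):
        with self.local_locks[core]:         # cheap, per-core lock
            self.local[core] += 1
            if self.local[core] >= self.threshold:
                with self.global_lock:       # rare global update
                    self.global_count += self.local[core]
                self.local[core] = 0

    def read(self):
        with self.global_lock:               # approximate ("sloppy") value
            return self.global_count

c = SloppyCounter(num_cores=4, threshold=10)
for i in range(100):
    c.increment(i % 4)
print(c.read())   # 80: 20 increments still sit in the per-core counters
```

The read is approximate by design; a smaller threshold trades accuracy for more contention on the global lock.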

Current bottlenecks
[Figure from "An Analysis of Linux Scalability to Many Cores" omitted.]