Chapter 8-1: Multiple Processor Systems

Outline:
- Multiple Processor Systems
- Multiprocessor Hardware
  - UMA Multiprocessors
  - NUMA Multiprocessors
  - Multicore Chips
- Multiprocessor Operating Systems
  - Types of Operating Systems
- Multiprocessor Synchronization
- Multiprocessor Scheduling

More Power Problems
- Run the clock faster?
- Electrical signals travel about 20 cm/nsec in copper (a limit rooted in Einstein's special theory of relativity).
- So a signal cannot travel more than 2 cm within one cycle of a 10 GHz clock, or 2 mm at 100 GHz.
- Making computers this small may be possible, but then we have a heat dissipation problem.
- These limits will stand for the time being, so the route to more computing power is multiple, parallel CPUs.
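As a back-of-the-envelope check of those distances (the arithmetic is implied by the slide, not shown on it), the distance a signal can cover in one clock period is:

```latex
d = \frac{v}{f} = \frac{20\,\mathrm{cm/ns}}{10\,\mathrm{GHz}}
  = 20\,\mathrm{cm/ns} \times 0.1\,\mathrm{ns} = 2\,\mathrm{cm},
\qquad
d_{100\,\mathrm{GHz}} = 20\,\mathrm{cm/ns} \times 0.01\,\mathrm{ns} = 2\,\mathrm{mm}.
```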

Multiple Processor Systems
Figure 8-1. (a) A shared-memory multiprocessor. (b) A message-passing multicomputer. (c) A wide area distributed system.

Multiple Processor Systems
- Shared-memory multiprocessors: every CPU has equal access to the entire physical memory.
- Message-passing multicomputers: each CPU has its own memory; the CPUs communicate with each other by messages over the interconnection structure.
- Wide area distributed systems: complete computer systems connected over a network. Communication is again by messages, but with a delay due to the network.

Multiprocessor Hardware
- UMA Multiprocessors
- NUMA Multiprocessors
- Multicore Chips

UMA (Uniform Memory Access) Multiprocessors with Bus-Based Architectures (1)
Figure 8-2. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
- UMA: uniform access to the entire memory, with the same access time for all CPUs.

UMA Multiprocessors with Bus-Based Architectures (2)
- Each CPU has to wait for the bus to be idle to read from or write to memory.
- For 2 or 3 CPUs, bus contention is manageable (Figure 8-2(a)).
- For larger numbers of CPUs, a cache is added to each CPU. Since many reads can be satisfied from the cache, there is less traffic on the bus (Figure 8-2(b)).
- Writes have to be managed!
- Some systems have private as well as shared memories (Figure 8-2(c)).
- Mostly the private memory is used; the shared memory holds the variables shared between CPUs.
- Needs careful programming!

UMA Multiprocessors Using Crossbar Switches (1)
Figure 8-3. (a) An 8 × 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.

UMA Multiprocessors Using Crossbar Switches (2)
- The use of a single bus limits the number of CPUs to about 16 or 32, even with caches.
- A crossbar switch connecting n CPUs to k memories can get past this limit.
- A crosspoint is a small electronic switch.
- Contention for memory is still possible if k < n. Partitioning the memory into n units reduces the contention.

UMA Multiprocessors Using Multistage Switching Networks (1)
Figure 8-4. (a) A 2 × 2 switch with two input lines, A and B, and two output lines, X and Y. (b) A message format.
- Module: the memory unit to use
- Address: an address within that module
- Opcode: READ or WRITE
- Value: the value to be written (for writes)
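A hypothetical C rendering of that message format; the field widths here are illustrative, since the slide does not give them:

```c
#include <stdint.h>

enum opcode { OP_READ, OP_WRITE };

/* One message flowing through the 2 x 2 switches (illustrative layout;
   real field widths depend on the machine). */
struct switch_msg {
    uint8_t  module;   /* Module: which memory unit to use        */
    uint32_t address;  /* Address: an address within that module  */
    uint8_t  opcode;   /* Opcode: READ or WRITE                   */
    uint32_t value;    /* Value: the value to write (writes only) */
};
```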

UMA Multiprocessors Using Multistage Switching Networks (2)
Figure 8-5. An omega switching network.
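The routing rule for an omega network (standard for this topology, though not spelled out on the slide): the switch at stage i examines bit i of the Module field, starting from the most significant bit; 0 selects the upper output and 1 the lower. A minimal sketch:

```c
#include <stdio.h>

/* Trace a message through an omega network with `stages` stages
   (n = 2^stages memory modules): at stage i the switch inspects
   bit i of the module number, most significant bit first. */
void omega_route(unsigned module, int stages)
{
    for (int i = 0; i < stages; i++) {
        int bit = (module >> (stages - 1 - i)) & 1;
        printf("stage %d: bit %d -> %s output\n",
               i, bit, bit ? "lower" : "upper");
    }
}

int main(void)
{
    omega_route(6, 3);   /* module 110 binary: lower, lower, upper */
    return 0;
}
```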

NUMA (Nonuniform Memory Access) Multiprocessors (1)
Characteristics of NUMA machines:
1. There is a single address space visible to all CPUs.
2. Access to remote memory is via LOAD and STORE instructions.
3. Access to remote memory is slower than access to local memory.

NUMA Multiprocessors (2)
Figure 8-6. (a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.

NUMA Multiprocessors (3)
- Assume that each node has one CPU, 16 MB of RAM, and a cache.
- The total memory is 2^32 bytes, divided into 2^26 cache lines (blocks) of 64 bytes each.
- The total memory is allocated among the nodes: 0-16 MB in node 0, 16-32 MB in node 1, and so on.
- Each node has a directory containing an entry for each of its 2^18 (262,144) 64-byte cache lines.
- Each directory entry is 9 bits (a cache presence bit + 8 bits for a node number), so the total directory size per node is 2^18 × 9 = 2,359,296 bits = 294,912 bytes.
- We will assume that a cache line (memory block) is held in the cache of at most one node (a single copy).

NUMA Multiprocessors (4)
- The directory of each node is kept in extremely fast special-purpose hardware, since the directory must be queried on every instruction that references memory (so it is expensive).
- Assume that CPU 20 references the address 0x24000108. This address corresponds to node 36, block 4, offset 8.
- Node 20 sends a request message to node 36 asking whether block 4 is cached or not (it is NOT, per Figure 8-6(c)).
- Node 36 fetches block 4 from its local RAM, sends it back to node 20, and updates the directory entry to indicate that the line is now cached at node 20.
- Now assume that node 20 references line 2 of node 36. This line is cached at node 82 (Figure 8-6(c)), so node 82 passes the line to node 20.
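A small C sketch of that lookup's first step, the address split of Figure 8-6(b), assuming the field widths shown there (8-bit node, 18-bit block, 6-bit offset):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr   = 0x24000108;             /* the example reference */
    uint32_t node   = addr >> 24;             /* top 8 bits            */
    uint32_t block  = (addr >> 6) & 0x3FFFF;  /* middle 18 bits        */
    uint32_t offset = addr & 0x3F;            /* bottom 6 bits         */
    printf("node %u, block %u, offset %u\n", node, block, offset);
    /* prints: node 36, block 4, offset 8 */
    return 0;
}
```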

Multicore Chips (1)
- Moore's Law: the number of transistors that can be placed on a chip increases exponentially, doubling approximately every two years.
- Multicore (dual-core, quad-core, ...) means more than one complete CPU on the same chip.
- The CPUs share the same main memory, but they may have private (AMD) or shared (Intel) cache memory.
- Snooping: special hardware circuitry makes sure that if a word is present in two or more caches and one of the CPUs modifies the word, it is automatically and atomically removed from all caches in order to maintain consistency.

Multiprocessor Operating Systems
- Each CPU has its own operating system
- Master-slave multiprocessors
- Symmetric multiprocessors

Each CPU Has Its Own Operating System (1)
Figure 8-7. Partitioning multiprocessor memory among four CPUs, but sharing a single copy of the operating system code. (The boxes marked Data are the operating system's private data for each CPU.)

Each CPU Has Its Own Operating System (2)
- The CPUs share the OS code.
- System calls are handled by the individual CPUs.
- There is no sharing of processes, since each CPU has OS tables of its own.
- Each CPU schedules its own processes and may sit idle if it has none.
- Since the memory allocation is fixed, pages are not shared among CPUs.
- Since each OS maintains its own cache of disk blocks, there may be inconsistency if blocks are modified by different OSes.
- This model was used in the early days of multiprocessors and is rarely used these days.

Master-Slave Multiprocessors
Figure 8-8. A master-slave multiprocessor model.
- One master OS on one CPU distributes work among the other, slave CPUs.
- The problems of the previous architecture (no page sharing, idle CPUs, buffer cache inconsistency) are solved.
- The master CPU may become overloaded, since it has to cater for all the others.

Symmetric Multiprocessors (1)
Figure 8-9. The SMP multiprocessor model.
- Each CPU independently runs a single, shared copy of the OS.

Symmetric Multiprocessors (2)
- When a system call is made, the CPU on which the call is made traps to the kernel and processes the call.
- This model eliminates the asymmetry of the master-slave configuration:
  - one copy of the OS, executing on different CPUs
  - one set of OS tables
  - no master CPU bottleneck

Symmetric Multiprocessors (3)
- Problem: what happens if two CPUs try to claim the same free page at the same time?
- This is one example out of many.
- Locks (mutexes) are provided to solve such problems, as sketched below.
- The OS is split into several critical regions (each controlled by a different lock) that can be executed concurrently by different CPUs without interfering with each other.
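A minimal sketch of that idea, assuming a pthreads-style mutex and a hypothetical free-page list (this is not actual kernel code):

```c
#include <pthread.h>
#include <stddef.h>

struct page { struct page *next; };

static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct page *free_list = NULL;      /* hypothetical free-page list */

struct page *claim_free_page(void)
{
    pthread_mutex_lock(&free_list_lock);   /* only one CPU gets past here */
    struct page *p = free_list;
    if (p != NULL)
        free_list = p->next;               /* unlink happens under the lock */
    pthread_mutex_unlock(&free_list_lock);
    return p;                              /* NULL: no free pages left */
}
```

With both the test and the unlink performed under the lock, two CPUs can no longer both see the same page at the head of the list.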

Multiprocessor Synchronization (1)
Figure 8-10. The TSL (Test and Set Lock) instruction can fail if the bus cannot be locked. These four steps show a sequence of events where the failure is demonstrated.

Multiprocessor Synchronization (2)
- The synchronization problem of the previous slide can be prevented by:
  1. locking the bus so that the other CPUs cannot access it,
  2. executing the TSL instruction,
  3. unlocking the bus.
- This is preferably done by hardware locking, or in software using spin locks (the CPU executes a tight loop repeatedly testing the lock), as in the sketch below.
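A minimal spin-lock sketch using C11 atomics; atomic_flag_test_and_set is the software counterpart of a TSL instruction executed with the bus locked:

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Atomically set the flag and get its previous value; keep
       spinning until the previous value was clear (lock was free). */
    while (atomic_flag_test_and_set(&lock))
        ;
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock);   /* release: let one of the spinners in */
}
```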

Multiprocessor Scheduling
- Uniprocessor scheduling: the scheduler chooses the thread to run next.
- Multiprocessor scheduling: the scheduler has to choose a thread and a CPU.
- Another complicating factor is how the threads are related:
  - unrelated threads, as in multi-user timesharing environments;
  - related threads, as in the threads of one application working together (such as the parallel compilations started by the make command).

Thread Scheduling Algorithms
- Timesharing: unrelated threads
- Space sharing: related threads
- Gang scheduling: related threads

Timesharing
Figure 8-12. Using a single data structure for scheduling a multiprocessor.
- 16 CPUs are all busy; CPU 4 becomes idle, locks the scheduling queues, and selects thread A. Next, CPU 12 goes idle and chooses thread B.
- The CPUs are timeshared and we get load balancing (no CPU is overloaded while work remains); a sketch follows below.
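A sketch of that single shared scheduling structure, with hypothetical names (real schedulers keep per-priority queues and much more state):

```c
#include <pthread.h>
#include <stddef.h>

struct thread { struct thread *next; };

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static struct thread *ready_queue = NULL;   /* head = highest priority */

/* Called by whichever CPU goes idle. */
struct thread *pick_next(void)
{
    pthread_mutex_lock(&sched_lock);   /* one CPU in the scheduler at a time */
    struct thread *t = ready_queue;
    if (t != NULL)
        ready_queue = t->next;
    pthread_mutex_unlock(&sched_lock);
    return t;                          /* NULL: nothing runnable, stay idle */
}
```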

Space Sharing (1)
Figure 8-13. A set of 32 CPUs split into four partitions, with two CPUs available.
- Partitioning is based on related (grouped) threads.

Space Sharing (2)
- Scheduling multiple related threads at the same time across multiple CPUs is called space sharing.
- Some applications benefit from this approach, such as the parallel compilations started by make.
- Each thread in a group is given its own dedicated CPU and holds on to it until it terminates. If a thread blocks on I/O, it continues to hold the CPU.
- If there are not enough CPUs to start all the threads of a group at the same time, the whole group waits until there are (see the admission sketch below).
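A sketch of that admission rule under the stated assumptions (32 CPUs, one dedicated CPU per thread; locking omitted for brevity):

```c
#define TOTAL_CPUS 32

static int free_cpus = TOTAL_CPUS;

/* Admit a group of n related threads only if n CPUs are free;
   otherwise the whole group keeps waiting. */
int try_admit(int n)
{
    if (n > free_cpus)
        return 0;        /* not enough CPUs: the whole group waits */
    free_cpus -= n;      /* each thread gets its own dedicated CPU */
    return 1;
}

void group_done(int n)
{
    free_cpus += n;      /* the CPUs return to the free pool */
}
```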

Space Sharing (3)
- Space sharing does not use multiprogramming, so there is no context-switching overhead.
- However, time is wasted whenever a thread blocks on I/O or some other event and its CPU sits idle.
- Consequently, people have looked for algorithms that schedule in both time and space together, especially for processes that create multiple threads, which usually need to communicate with one another.

Gang Scheduling (1)
Figure 8-14. Communication between two threads belonging to process A that are running out of phase.
- Consider process A with threads A0 and A1, and process B with threads B0 and B1. A0 and B0 timeshare CPU 0; A1 and B1 timeshare CPU 1.
- A0 sends a message to A1, and A1 sends a reply back, repeatedly.
- A0 sends the message but gets the reply only after a delay of 200 msec, because B's threads run in between.

Gang Scheduling (2)
The three parts of gang scheduling:
1. Groups of related threads are scheduled as a unit, a gang.
2. All members of a gang run simultaneously, on different timeshared CPUs.
3. All gang members start and end their time slices together.

Gang Scheduling (3)
- Time is divided into time slices.
- At the start of a time slice, all CPUs are rescheduled, with a new thread started on each.
- At the start of the next time slice another scheduling decision is made. No scheduling is done in between.
- If a thread blocks, its CPU stays idle until the end of the quantum.
- A simplified sketch of the mechanism follows below.
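A simplified sketch of the mechanism (plain round robin over gangs; the example in Figure 8-15 on the next slide packs several small gangs into one slice, and run_on_cpu is a hypothetical stand-in for the dispatch step):

```c
#include <stdio.h>

#define NCPUS 6

struct gang { int id; int nthreads; };

/* Hypothetical dispatch: hand thread t of gang g to the given CPU. */
static void run_on_cpu(int cpu, const struct gang *g, int t)
{
    printf("CPU %d runs thread %d of gang %d\n", cpu, t, g->id);
}

/* At each time-slice boundary all CPUs are rescheduled together
   with the threads of the next gang; leftover CPUs stay idle. */
void timeslice_tick(const struct gang *gangs, int ngangs, int *cursor)
{
    const struct gang *g = &gangs[*cursor];
    *cursor = (*cursor + 1) % ngangs;      /* round robin over the gangs */
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu < g->nthreads)
            run_on_cpu(cpu, g, cpu);       /* all gang members start together */
        /* else: this CPU idles until the end of the slice */
    }
}
```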

Gang Scheduling (4)
Figure 8-15. Gang scheduling (6 CPUs, 5 processes, and 24 threads in total).
- In gang scheduling, all the threads of a process run together, so if one of them sends a request to another, the receiver gets the message and can reply almost immediately.