Chapter 8-1: Multiple Processor Systems
- Multiple Processor Systems
- Multiprocessor Hardware
  - UMA Multiprocessors
  - NUMA Multiprocessors
  - Multicore Chips
- Multiprocessor Operating Systems
  - Types of Operating Systems
  - Multiprocessor Synchronization
  - Multiprocessor Scheduling
More Power Problems
- The obvious way to get more computing power is to run the clock faster
- Electrical signals travel at about 20 cm/nsec in copper, and Einstein's special theory of relativity says that no signal can travel faster than light
- Within one clock cycle a signal therefore cannot travel more than 2 cm with a 10 GHz clock, or 2 mm with a 100 GHz clock
- Making computers this small may be possible, but then we have a heat dissipation problem
- These limits hold for the time being, so the way to more power is to use multiple CPUs working in parallel
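The distance limits quoted above follow from simple arithmetic: distance per cycle equals signal speed times cycle time. The sketch below reproduces the numbers, taking the 20 cm/nsec figure from the slide as given; the function name `max_distance_cm` is only an illustrative choice.

```python
# Signal speed in copper, as stated on the slide (an approximation).
SIGNAL_SPEED_CM_PER_NS = 20.0


def max_distance_cm(clock_ghz):
    """Distance a signal can cover during one clock cycle, in centimetres."""
    cycle_time_ns = 1.0 / clock_ghz  # a 1 GHz clock has a 1 ns cycle
    return SIGNAL_SPEED_CM_PER_NS * cycle_time_ns


for ghz in (1, 10, 100):
    print(f"{ghz:>4} GHz: at most {max_distance_cm(ghz):.2f} cm per cycle")
```

At 10 GHz the cycle time is 0.1 nsec, which gives the 2 cm bound; at 100 GHz it shrinks to 0.2 cm, that is 2 mm.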
Multiple Processor Systems
Figure 8-1. (a) A shared-memory multiprocessor. (b) A message-passing multicomputer. (c) A wide area distributed system.
Multiple Processor Systems
- Shared-memory multiprocessors: every CPU has equal access to the entire physical memory
- Message-passing multicomputers: each CPU has its own memory. The CPUs communicate with each other by sending messages over the interconnection structure
- Wide area distributed systems: complete computer systems connected over a network. Communication is again by messages, but there is a delay due to the network
Multiprocessor Hardware
- UMA Multiprocessors
- NUMA Multiprocessors
- Multicore Chips
UMA (Uniform Memory Access) Multiprocessors with Bus-Based Architectures (1)
Figure 8-2. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
- UMA: uniform access to the entire memory, with the same access time for all CPUs
UMA Multiprocessors with Bus-Based Architectures (2)
- Each CPU has to wait for the bus to be idle before it can read or write memory
- For 2 or 3 CPUs, bus contention is manageable (Figure 8-2 (a))
- For a larger number of CPUs, a cache is added to each CPU. Since many reads can be satisfied from the cache, there is less traffic on the bus (Figure 8-2 (b))
- Writes have to be managed so that the caches stay consistent!
- Some systems have both private and shared memories (Figure 8-2 (c))
- Mostly the private memory is used; the shared memory is for variables shared between CPUs
- Needs careful programming!
UMA Multiprocessors Using Crossbar Switches (1)
Figure 8-3. (a) An 8 × 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.
UMA Multiprocessors Using Crossbar Switches (2)
- Even with caches, a single bus limits the number of CPUs to about 16 or 32
- A crossbar switch connecting n CPUs to k memories may solve this problem
- A crosspoint is a small electronic switch that is either open or closed
- Contention for memory is still possible if k < n, or whenever two CPUs want the same memory at the same time. Partitioning the memory into n units may reduce the contention
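A minimal sketch of the contention point above, assuming a crossbar in which each memory module can serve one CPU per cycle. The fixed "lowest CPU number wins" arbitration rule and the names `crosspoint_count` and `arbitrate` are illustrative assumptions, not something the slides specify.

```python
def crosspoint_count(n_cpus, k_memories):
    """A full crossbar needs one crosspoint per (CPU, memory) pair."""
    return n_cpus * k_memories


def arbitrate(requests):
    """Decide which requests are served in one cycle.

    requests maps a CPU number to the memory module it wants.
    Returns (granted, waiting): granted maps memory -> CPU whose crosspoint
    is closed; waiting lists the CPUs that lost and must retry.
    Assumption: ties are broken in favour of the lowest CPU number.
    """
    granted = {}
    waiting = []
    for cpu in sorted(requests):
        memory = requests[cpu]
        if memory in granted:
            waiting.append(cpu)    # module already busy: contention
        else:
            granted[memory] = cpu  # close the crosspoint (cpu, memory)
    return granted, waiting


granted, waiting = arbitrate({0: 2, 1: 5, 2: 2})
print("crosspoints in an 8 x 8 crossbar:", crosspoint_count(8, 8))
print("granted:", granted, "waiting:", waiting)
```

CPUs 0 and 1 proceed in parallel because they address different modules; CPU 2 has to wait because module 2 is already in use.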
UMA Multiprocessors Using Multistage Switching Networks (1)
Figure 8-4. (a) A 2 × 2 switch with two input lines, A and B, and two output lines, X and Y. (b) A message format.
- Module: the memory unit to use
- Address: an address within that module
- Opcode: the operation, READ or WRITE
- Value: the value to be written (for a WRITE)
UMA Multiprocessors Using Multistage Switching Networks (2)
Figure 8-5. An omega switching network.
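The slides do not spell out how a message finds its way through the omega network. The sketch below assumes the usual rule for such networks: the lines between stages are wired as a perfect shuffle, and the switch at each stage looks at one bit of the Module field (most significant bit first), sending the message to its upper output for a 0 and to its lower output for a 1. The function name `omega_route` and the 8-line default are illustrative.

```python
def omega_route(src, dest, n_lines=8):
    """Trace a message from CPU `src` to memory module `dest`.

    n_lines must be a power of two; the network then has log2(n_lines) stages.
    Returns one (output_taken, line_number_after_stage) pair per stage.
    """
    stages = n_lines.bit_length() - 1
    mask = n_lines - 1
    line = src
    hops = []
    for stage in range(stages):
        # Perfect-shuffle wiring: rotate the line number left by one bit.
        line = ((line << 1) | (line >> (stages - 1))) & mask
        # The switch inspects one bit of the Module field, MSB first.
        bit = (dest >> (stages - 1 - stage)) & 1
        # Output 0 is the upper line of the switch, output 1 the lower line.
        line = (line & ~1) | bit
        hops.append(("lower" if bit else "upper", line))
    return hops


for output, line in omega_route(src=3, dest=6):
    print(f"take {output} output, now on line {line}")
```

After the last stage the line number equals the module number, whatever the source CPU was, which is why only the Module bits are needed for routing.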
NUMA (Nonuniform Memory Access) Multiprocessors (1)
Characteristics of NUMA machines:
1. There is a single address space visible to all CPUs.
2. Access to remote memory is via LOAD and STORE instructions.
3. Access to remote memory is slower than access to local memory.
NUMA Multiprocessors (2)
Figure 8-6. (a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.
NUMA Multiprocessors (3)
- Let us assume that each node has one CPU, 16 MB of RAM, and a cache
- The total memory is 2^32 bytes, divided into 2^26 cache lines (blocks) of 64 bytes each
- The memory is allocated among the nodes, with 0-16 MB in node 0, 16-32 MB in node 1, and so on
- Each node has a directory containing one entry for each of its 2^18 (262,144) 64-byte cache lines
- Each directory entry is 9 bits (a cache presence bit plus 8 bits for a node number), so the total directory size per node is 2^18 * 9 = 2,359,296 bits = 294,912 bytes
- We will assume that a cache line (memory block) is held in the cache of at most one node (single copy)
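The sizes on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's assumptions (256 nodes, 16 MB per node, 64-byte lines, 32-bit addresses); the variable names are illustrative.

```python
NODES = 256                      # Figure 8-6 (a)
ADDRESS_BITS = 32
LINE_BYTES = 64
NODE_RAM_BYTES = 16 * 2**20      # 16 MB of RAM per node

total_bytes = 2**ADDRESS_BITS
total_lines = total_bytes // LINE_BYTES            # 2^26 cache lines overall
lines_per_node = NODE_RAM_BYTES // LINE_BYTES      # 2^18 lines per node

# One presence bit plus enough bits to name any of the 256 nodes.
entry_bits = 1 + (NODES - 1).bit_length()
directory_bits = lines_per_node * entry_bits
directory_bytes = directory_bits // 8

print("lines per node :", lines_per_node)
print("entry size     :", entry_bits, "bits")
print("directory size :", directory_bits, "bits =", directory_bytes, "bytes")
print(f"overhead       : {100 * directory_bytes / NODE_RAM_BYTES:.2f}% of node RAM")
```

The directory costs under 2% of each node's memory, which is why the scheme is affordable even though every line needs an entry.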
NUMA Multiprocessors (4)
- The directory of each node is kept in extremely fast special-purpose hardware, since the directory must be queried on every instruction that references memory (which makes it expensive)
- Let us assume that CPU 20 references the address 0x... This address corresponds to node 36, block 4, offset 8 (in decimal)
- Node 20 sends a request message to node 36 to find out whether block 4 is cached or not (it is not, according to Figure 8-6 (c))
- Node 36 fetches block 4 from its local RAM, sends it back to node 20, and updates the directory entry to indicate that the line is now cached at node 20
- Now let us assume that node 20 makes a reference to line 2 of node 36. This line is cached at node 82 (Figure 8-6 (c)), so node 82 passes the line to node 20
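The walkthrough above can be sketched in code. The field widths (8-bit node, 18-bit block, 6-bit offset) follow from the 256 nodes, 2^18 lines per node and 64-byte lines stated earlier. The helper names `encode`, `decode` and `handle_request` are hypothetical, and updating the directory after a forwarded line is an assumption that follows from the single-copy rule rather than something the slide states.

```python
NODE_BITS, BLOCK_BITS, OFFSET_BITS = 8, 18, 6


def encode(node, block, offset):
    """Build a 32-bit address from its three fields."""
    return (node << (BLOCK_BITS + OFFSET_BITS)) | (block << OFFSET_BITS) | offset


def decode(address):
    """Split a 32-bit address into (node, block, offset)."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    block = (address >> OFFSET_BITS) & ((1 << BLOCK_BITS) - 1)
    node = address >> (BLOCK_BITS + OFFSET_BITS)
    return node, block, offset


def handle_request(directory, block, requester):
    """Home-node logic for one request.

    directory maps block number -> node currently caching it;
    a missing key means the line is not cached anywhere.
    """
    holder = directory.get(block)
    directory[block] = requester  # single copy: the requester is the new holder
    if holder is None:
        return "home node sends the line from its local RAM"
    return f"node {holder} passes the line to node {requester}"


address = encode(node=36, block=4, offset=8)
print(hex(address), "->", decode(address))
```

For example, starting from a directory at node 36 that records only `{2: 82}`, a request from node 20 for block 4 is served from local RAM, while a request for block 2 is forwarded from node 82.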
Multicore Chips (1)
- Moore's Law: the number of transistors that can be placed on a chip increases exponentially, doubling approximately every two years
- Multicore (dual-core, quad-core) means more than one complete CPU on the same chip
- The CPUs share the same main memory, but they may have private (AMD) or shared (Intel) cache memory
- Snooping: special hardware circuitry makes sure that if a word is present in two or more caches and one of the CPUs modifies the word, it is automatically and atomically removed from all caches in order to maintain consistency
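A toy model of the snooping rule described above, as a sketch only. It assumes write-through to main memory and models the caches as plain dictionaries; real hardware does this with bus-watching circuitry, and the class name `SnoopyBus` is made up for the example.

```python
class SnoopyBus:
    """Per-CPU caches kept consistent by invalidating on every write."""

    def __init__(self, n_cpus):
        self.memory = {}                             # address -> value
        self.caches = [{} for _ in range(n_cpus)]    # one private cache per CPU

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:                     # miss: fetch from memory
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        # `cpu` identifies the writer. Assumption: write-through to memory.
        self.memory[address] = value
        # Snooping: every cache that holds the word drops it, so no CPU
        # can read a stale copy afterwards.
        for cache in self.caches:
            cache.pop(address, None)


bus = SnoopyBus(n_cpus=2)
bus.write(0, 100, 1)
print(bus.read(0, 100), bus.read(1, 100))   # both CPUs now cache the word
bus.write(1, 100, 2)                        # invalidates both cached copies
print(bus.read(0, 100))                     # re-fetched from memory
```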
Multiprocessor Operating Systems
- Each CPU has its own operating system
- Master-slave multiprocessors
- Symmetric multiprocessors
Each CPU Has Its Own Operating System (1)
Figure 8-7. Partitioning multiprocessor memory among four CPUs, but sharing a single copy of the operating system code. (The boxes marked Data are the operating system's private data for each CPU.)
Each CPU Has Its Own Operating System (2)
- The CPUs share the OS code
- System calls are handled by the individual CPU on which they are made
- There is no sharing of processes, since each CPU has its own OS tables
- Each CPU schedules its own processes and may be idle if it has none, even while other CPUs are loaded
- Since memory allocation is fixed, pages are not shared among CPUs
- Since each OS maintains its own cache of disk blocks, there may be inconsistency if the same blocks are modified by different OSs
- This model was used in the early days of multiprocessors and is rarely used these days
Master-Slave Multiprocessors
Figure 8-8. A master-slave multiprocessor model.
- One master OS running on one CPU distributes work among the other (slave) CPUs
- The problems of the previous architecture (no page sharing, idle CPUs, buffer cache inconsistency) are solved
- The master CPU may become overloaded, since it has to cater for all the others
Symmetric Multiprocessors (1)
Figure 8-9. The SMP multiprocessor model.
- There is a single, shared copy of the OS, which each CPU runs independently
Symmetric Multiprocessors (2)
- When a system call is made, the CPU on which the call is made traps to the kernel and processes the call
- This model eliminates the asymmetry of the master-slave configuration
- One copy of the OS, executing on different CPUs
- One set of OS tables
- No master CPU bottleneck
Symmetric Multiprocessors (3)
- Problem: what happens if two CPUs try to claim the same free page?
- This is one example out of many
- Locks (mutexes) are provided to solve these problems
- The OS is split into several critical regions, each controlled by a different lock, that can be executed concurrently by different CPUs without interfering with each other
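The free-page race above can be illustrated with a user-level sketch in which Python threads stand in for CPUs and a `threading.Lock` stands in for the kernel mutex protecting the free list. The class `FreePageList` and the function `simulate` are hypothetical names; a real kernel would use its own lock primitives.

```python
import threading


class FreePageList:
    """A free list whose claim operation is a critical region."""

    def __init__(self, n_pages):
        self._pages = list(range(n_pages))
        self._lock = threading.Lock()    # protects only this one table

    def claim(self):
        with self._lock:                 # only one CPU at a time gets past here
            return self._pages.pop() if self._pages else None


def simulate(n_cpus=4, n_pages=64):
    """Let n_cpus threads drain the free list; return the pages each one got."""
    allocator = FreePageList(n_pages)
    claimed = [[] for _ in range(n_cpus)]

    def cpu(my_pages):
        while True:
            page = allocator.claim()
            if page is None:
                return
            my_pages.append(page)

    workers = [threading.Thread(target=cpu, args=(pages,)) for pages in claimed]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return claimed


pages_per_cpu = simulate()
print([len(pages) for pages in pages_per_cpu])
```

Because the lock covers only the free list, other OS tables guarded by different locks could still be used by other CPUs at the same time, which is the point of splitting the OS into separate critical regions.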
Multiprocessor Synchronization (1)
Figure. The TSL (Test and Set Lock) instruction can fail if the bus cannot be locked. These four steps show a sequence of events where the failure is demonstrated.
Multiprocessor Synchronization (2)
- The synchronization problem on the previous slide can be prevented by:
  1. Locking the bus so that other CPUs cannot access it
  2. Executing the TSL instruction
  3. Unlocking the bus
- This is preferably done by hardware bus locking, or in software using spin locks (the process executes a tight loop testing the lock's status)
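The three steps above can be mimicked in a user-level sketch. Here a `threading.Lock` plays the role of the bus lock that makes the read and the write of TSL indivisible, and a spin lock is built on top of it. `TSLWord`, `SpinLock` and `count_with_spinlock` are illustrative names; on real hardware TSL is a single machine instruction, not a method call.

```python
import threading
import time


class TSLWord:
    """A lock word with an indivisible test-and-set operation."""

    def __init__(self):
        self._value = 0
        self._bus = threading.Lock()     # stands in for the hardware bus lock

    def tsl(self):
        with self._bus:                  # 1. lock the bus
            old = self._value            # 2. read the word and set it to 1
            self._value = 1
        return old                       # 3. bus is unlocked again

    def clear(self):
        with self._bus:
            self._value = 0


class SpinLock:
    def __init__(self):
        self._word = TSLWord()

    def acquire(self):
        while self._word.tsl() != 0:     # tight loop until the old value was 0
            time.sleep(0)                # let other Python threads run

    def release(self):
        self._word.clear()


def count_with_spinlock(n_threads=4, increments=500):
    """Increment a shared counter from several threads under the spin lock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1              # critical region
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter[0]


print(count_with_spinlock())
```

If `tsl` read and wrote the word without holding the bus lock, two threads could both see 0 and both enter the critical region, which is the failure shown on the previous slide.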
Multiprocessor Scheduling
- Uniprocessor scheduling: the scheduler chooses which thread to run next
- Multiprocessor scheduling: the scheduler has to choose both a thread and a CPU to run it on
- Another complicating factor is how the threads are related
- Unrelated threads, as in multi-user timesharing environments
- Related threads, as when the threads of one application work together (for example, the compilations started by a make command)
Thread Scheduling Algorithms
- Time sharing: unrelated threads
- Space sharing: related threads
- Gang scheduling: related threads
Timesharing
Figure. Using a single data structure for scheduling a multiprocessor.
- All 16 CPUs are busy. CPU 4 becomes idle, locks the scheduling queues, and selects thread A. Next, CPU 12 goes idle and chooses thread B
- The CPUs are time-shared and the load is balanced: no CPU is overloaded, because the work is distributed across all of them
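A minimal sketch of the single shared scheduling structure described above: one set of ready queues, protected by one lock, from which any idle CPU takes the next thread. The eight priority levels, the "higher number means higher priority" convention and the class name `SharedRunQueue` are assumptions made for the example.

```python
import threading
from collections import deque


class SharedRunQueue:
    """System-wide ready queues shared by every CPU."""

    def __init__(self, n_priorities=8):
        self._queues = [deque() for _ in range(n_priorities)]
        self._lock = threading.Lock()

    def add(self, thread_name, priority):
        with self._lock:
            self._queues[priority].append(thread_name)

    def pick_next(self):
        """Called by a CPU that has just become idle."""
        with self._lock:                            # lock the scheduling queues
            for queue in reversed(self._queues):    # highest priority first
                if queue:
                    return queue.popleft()
        return None                                 # nothing ready: stay idle


ready = SharedRunQueue()
ready.add("A", priority=7)
ready.add("B", priority=7)
ready.add("C", priority=3)
print("CPU 4 runs", ready.pick_next())    # thread A
print("CPU 12 runs", ready.pick_next())   # thread B
```

Load balancing comes for free, since a CPU can only be idle when the shared queues are empty. The price is contention for the single lock as the number of CPUs grows.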
Space Sharing (1)
Figure. A set of 32 CPUs split into four partitions, with two CPUs available.
- Partitioning is based on related (grouped) threads
Space Sharing (2)
- Scheduling multiple related threads at the same time across multiple CPUs is called space sharing
- Some applications benefit from this approach, such as the compilations started by a make command
- Each thread in a group is given its own dedicated CPU. The thread holds on to the CPU until it terminates. If a thread blocks on I/O, it continues to hold the CPU
- If there are not enough CPUs to start all threads of a group at the same time, the whole group waits until there are
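The all-or-nothing rule above can be sketched as follows. The first-come, first-served queue discipline is an assumption (the slide only says that a group waits until enough CPUs are free), and the class name `SpaceSharingScheduler` and the group names are hypothetical.

```python
class SpaceSharingScheduler:
    """Give every thread of a group its own CPU, or make the whole group wait."""

    def __init__(self, n_cpus):
        self.free = set(range(n_cpus))
        self.waiting = []      # FIFO of (group name, number of threads)
        self.assigned = {}     # group name -> set of dedicated CPUs

    def submit(self, group, n_threads):
        self.waiting.append((group, n_threads))
        self._dispatch()

    def finish(self, group):
        """All threads of the group have terminated: release its partition."""
        self.free |= self.assigned.pop(group)
        self._dispatch()

    def _dispatch(self):
        # Assumption: strict FIFO, so the group at the head of the queue
        # blocks later groups until enough CPUs are free for it.
        while self.waiting and self.waiting[0][1] <= len(self.free):
            group, n_threads = self.waiting.pop(0)
            cpus = set(sorted(self.free)[:n_threads])
            self.free -= cpus
            self.assigned[group] = cpus


sched = SpaceSharingScheduler(n_cpus=8)
sched.submit("make", 6)       # starts at once on 6 CPUs
sched.submit("render", 4)     # only 2 CPUs free: the whole group waits
print(sched.assigned, sched.waiting)
sched.finish("make")          # now "render" gets its 4 CPUs
print(sched.assigned, sched.waiting)
```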
Space Sharing (3)
- Space sharing does not use multiprogramming, so there is no context switching overhead
- Time is wasted when a thread blocks for I/O or for some other event, because its CPU sits idle
- Consequently, people have looked for algorithms that schedule in both time and space together, especially for processes that create multiple threads, which usually need to communicate with one another
Gang Scheduling (1)
Figure. Communication between two threads belonging to process A that are running out of phase.
- Consider process A with threads A0 and A1, and process B with threads B0 and B1. A0 and B0 execute on CPU 0, A1 and B1 on CPU 1
- A0 repeatedly sends a message to A1, and A1 sends a reply back
- A0 sends the message but gets the reply only after a delay of 200 msec, because B's threads run in between
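The 200 msec delay can be reproduced with a small timing model. The sketch assumes a 100 msec quantum, strict alternation of the two threads on each CPU, and zero message latency and processing time; none of these values is given on the slide, they are chosen so that the model matches the 200 msec it quotes. The function names are illustrative.

```python
QUANTUM_MS = 100   # assumed length of one time slice


def next_run_time(phase, t):
    """Earliest time >= t at which a thread is running.

    A thread with phase 0 owns quanta 0, 2, 4, ...; phase 1 owns 1, 3, 5, ...
    """
    quantum = t // QUANTUM_MS
    if quantum % 2 == phase:
        return t                          # already running
    return (quantum + 1) * QUANTUM_MS     # must wait for its next quantum


def round_trip_ms(send_time, sender_phase, receiver_phase):
    """Time from sending a request until the sender sees the reply."""
    handled = next_run_time(receiver_phase, send_time)   # receiver must be running
    seen = next_run_time(sender_phase, handled)          # so must the sender
    return seen - send_time


# Out of phase: A0 runs in even quanta on CPU 0, A1 in odd quanta on CPU 1.
print("out of phase:", round_trip_ms(0, sender_phase=0, receiver_phase=1), "msec")
# In phase: both threads of A run in the same quanta.
print("in phase    :", round_trip_ms(0, sender_phase=0, receiver_phase=0), "msec")
```

When both threads run in the same quanta the round trip costs nothing in this model, which is the motivation for scheduling them as a gang.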
Gang Scheduling (2)
The three parts of gang scheduling:
1. Groups of related threads are scheduled as a unit, a gang.
2. All members of a gang run simultaneously, on different timeshared CPUs.
3. All gang members start and end their time slices together.
Gang Scheduling (3)
- Time is divided into discrete time slices (quanta)
- At the start of a time slice, all CPUs are rescheduled, with a new thread started on each
- At the start of the next time slice another scheduling decision is made. No scheduling is done in between
- If a thread blocks, its CPU stays idle until the end of the quantum
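One way to picture this is a table with one row per time slice and one column per CPU. The sketch below fills such a table with a simple first-fit rule that never splits a gang across slices. Both the rule and the gang sizes are illustrative assumptions and are not taken from the figure on the next slide; `build_gang_table` and `running_in_slice` are hypothetical names.

```python
def build_gang_table(gangs, n_cpus):
    """Pack gangs into time slots.

    gangs maps a process name to its number of threads.
    Returns a list of slots; each slot lists, per CPU, the thread that runs
    there during that slot (None means the CPU is idle).
    """
    slots = []
    for process, n_threads in gangs.items():
        if n_threads > n_cpus:
            raise ValueError(f"gang {process} needs more than {n_cpus} CPUs")
        names = [f"{process}{i}" for i in range(n_threads)]
        for slot in slots:
            if slot.count(None) >= n_threads:        # first slot with room
                start = slot.index(None)             # free CPUs are at the end
                slot[start:start + n_threads] = names
                break
        else:
            slots.append(names + [None] * (n_cpus - n_threads))
    return slots


def running_in_slice(table, time_slice):
    """Slots are run round-robin, one per time slice, on all CPUs at once."""
    return table[time_slice % len(table)]


table = build_gang_table({"A": 6, "B": 3, "C": 3, "D": 5, "E": 4}, n_cpus=6)
for number, slot in enumerate(table):
    print(f"slot {number}:", slot)
```

All threads of a gang sit in the same row, so they start and end their time slices together. A `None` entry is a CPU left idle for that whole slice.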
Gang Scheduling (4)
Figure. Gang scheduling (6 CPUs, 5 processes and 24 threads in total).
- In gang scheduling all threads of a process run together, so that if one of them sends a request to another, the receiver gets the message and can reply almost immediately