Lecture 37: Chapter 7: Multiprocessors
Today's topics:
– Introduction to multiprocessors
– Parallelism in software
– Memory organization
– Cache coherence
Introduction
Goal: connecting multiple computers to get higher performance
– Multiprocessors
– Scalability, availability, power efficiency
Job-level (process-level) parallelism
– High throughput for independent jobs
Parallel processing program
– Single program run on multiple processors
Multicore microprocessors
– Chips with multiple processors (cores)
Hardware and Software
Hardware
– Serial: e.g., Pentium 4
– Parallel: e.g., quad-core Xeon e5345
Software
– Sequential: e.g., matrix multiplication
– Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
– Challenge: making effective use of parallel hardware
Parallel Programming
Parallel software is the problem
Need to get significant performance improvement
– Otherwise, just use a faster uniprocessor, since it's easier!
Difficulties
– Partitioning
– Coordination
– Communication overhead
Amdahl's Law
The sequential part of a program limits the achievable speedup
Example: with 100 processors, can we get a 90× speedup?
– Speedup = 1 / ((1 − F_parallel) + F_parallel/100) = 90
– Solving gives F_parallel ≈ 0.999
– So the sequential part can be at most about 0.1% of the original execution time
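A minimal sketch of this calculation in C, using the Amdahl formula above (amdahl_speedup and the closed-form solve are illustrative, not part of the lecture):

#include <stdio.h>

/* Amdahl's Law: speedup with p processors when a fraction f of the
   original execution time is parallelizable. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* Slide's question: what fraction must be parallel for 90x on 100 CPUs?
       Solving 1 / ((1 - f) + f/p) = S for f gives f = (1 - 1/S) / (1 - 1/p). */
    int p = 100;
    double S = 90.0;
    double f = (1.0 - 1.0 / S) / (1.0 - 1.0 / p);
    printf("required parallel fraction: %.4f\n", f);        /* 0.9989 */
    printf("speedup check: %.1f\n", amdahl_speedup(f, p));  /* 90.0 */
    return 0;
}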
Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
– Speed up from 10 to 100 processors
Single processor: Time = (10 + 100) × t_add
10 processors
– Time = 10 × t_add + 100/10 × t_add = 20 × t_add
– Speedup = 110/20 = 5.5 (55% of potential)
100 processors
– Time = 10 × t_add + 100/100 × t_add = 11 × t_add
– Speedup = 110/11 = 10 (10% of potential)
Scaling Example (cont)
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × t_add
10 processors
– Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
– Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
– Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
– Speedup = 10010/110 = 91 (91% of potential)
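Both slides follow from one simple model. A minimal C sketch, assuming the 10 scalar additions stay serial and the n × n matrix sum divides evenly across p processors (exec_time and speedup are illustrative names, not the lecture's):

#include <stdio.h>

/* Model from the slides: 10 serial scalar adds plus an n*n matrix sum
   that divides evenly across p processors. Times are in units of t_add. */
static double exec_time(int n, int p) {
    return 10.0 + (double)(n * n) / p;
}

static double speedup(int n, int p) {
    return exec_time(n, 1) / exec_time(n, p);
}

int main(void) {
    int sizes[] = {10, 100};
    int procs[] = {10, 100};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("n=%3d, p=%3d: speedup = %5.1f (%4.1f%% of potential)\n",
                   sizes[i], procs[j],
                   speedup(sizes[i], procs[j]),
                   100.0 * speedup(sizes[i], procs[j]) / procs[j]);
    return 0;
}

This reproduces all four numbers above: 5.5 (55%), 10 (10%), 9.9 (99%), and 91 (91%).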
Strong vs Weak Scaling
Strong scaling: problem size fixed
– As in the example above
Weak scaling: problem size proportional to number of processors
– 10 processors, 10 × 10 matrix: Time = 20 × t_add
– 100 processors, 32 × 32 matrix: Time = 10 × t_add + 1024/100 × t_add ≈ 20 × t_add
– Roughly constant performance in this example (see the sketch below)
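The weak-scaling claim can be checked directly; a self-contained sketch under the same assumed model as above:

#include <stdio.h>

/* Weak scaling: grow the matrix with the processor count so the
   work per processor stays roughly constant (10 + n*n/p, in t_add units). */
int main(void) {
    printf("p=10,  n=10: time = %.2f x t_add\n", 10.0 + 10.0 * 10.0 / 10.0);  /* 20.00 */
    printf("p=100, n=32: time = %.2f x t_add\n", 10.0 + 32.0 * 32.0 / 100.0); /* 20.24 */
    return 0;
}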
Memory Organization - I
Centralized shared-memory multiprocessor, or symmetric shared-memory multiprocessor (SMP)
Multiple processors connected to a single centralized memory
– Since all processors see the same memory organization: uniform memory access (UMA)
Shared memory, because all processors can access the entire memory address space
Can the centralized memory become a bandwidth bottleneck?
– Not if the caches are large and there are fewer than about a dozen processors
SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, connected by a shared bus to a single main memory and the I/O system]
Memory Organization - II
For higher scalability, memory is distributed among processors: distributed memory multiprocessors
If one processor can directly address the memory local to another processor, the address space is shared: distributed shared-memory (DSM) multiprocessor
If memories are strictly local, we need messages to communicate data: a cluster of computers, or multicomputer
Non-uniform memory access (NUMA), since local memory has lower latency than remote memory
Distributed Memory Multiprocessors
[Figure: four nodes, each containing a processor with caches plus local memory and I/O, connected by an interconnection network]
SMPs
Centralized main memory and many caches: many copies of the same data
A system is cache coherent if a read returns the most recently written value for that word

                            Value of X in:
Time  Event                 Cache-A  Cache-B  Memory
 0    –                        –        –       1
 1    CPU-A reads X            1        –       1
 2    CPU-B reads X            1        1       1
 3    CPU-A stores 0 in X      0        1       0

After time 3, Cache-B still holds the stale value 1: the caches are not coherent
Cache Coherence
A memory system is coherent if:
– P writes to X; no other processor writes to X; P then reads X and receives the value previously written by P
– P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
– Two writes to the same location by two processors are seen in the same order by all processors (write serialization)
The memory consistency model defines how much "time must elapse" before the effect of a write is seen by other processors
Cache Coherence Protocols
Directory-based: a single location (the directory) keeps track of the sharing status of each block of memory
Snooping: every cache block is accompanied by the sharing status of that block; all cache controllers monitor the shared bus so they can update the sharing status of a block, if necessary
Write-invalidate: a processor gains exclusive access to a block before writing, by invalidating all other copies
Write-update: when a processor writes, it updates all other shared copies of that block
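A minimal sketch of the write-invalidate idea, assuming a simple MSI-style status (Modified / Shared / Invalid) per cached block; BlockState, processor_write, and processor_read are illustrative names, not the lecture's:

#include <stdio.h>

#define NCACHES 4

/* MSI-style sharing status of one block in each cache: a toy
   simplification of write-invalidate snooping. */
typedef enum { INVALID, SHARED, MODIFIED } BlockState;

static BlockState state[NCACHES];

/* A write must first gain exclusive access by broadcasting an invalidate
   on the shared bus; every other cache that snoops it drops its copy. */
static void processor_write(int writer) {
    for (int c = 0; c < NCACHES; c++)
        if (c != writer) state[c] = INVALID;   /* snooped invalidate */
    state[writer] = MODIFIED;                  /* exclusive, dirty copy */
}

/* A read: if some cache holds the block MODIFIED, it supplies the data
   (writing it back), and both copies become SHARED. */
static void processor_read(int reader) {
    for (int c = 0; c < NCACHES; c++)
        if (state[c] == MODIFIED) state[c] = SHARED;
    if (state[reader] == INVALID) state[reader] = SHARED;
}

int main(void) {
    static const char *names[] = { "Invalid", "Shared", "Modified" };
    processor_read(0);   /* CPU-0: Shared   */
    processor_read(1);   /* CPU-1: Shared   */
    processor_write(0);  /* CPU-0: Modified; CPU-1 invalidated */
    for (int c = 0; c < NCACHES; c++)
        printf("cache %d: %s\n", c, names[state[c]]);
    return 0;
}

Replaying the earlier SMP example with this protocol, CPU-A's store at time 3 would invalidate Cache-B's copy of X instead of leaving a stale value 1 behind.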