Lecture 37: Chapter 7: Multiprocessors
Today's topics:
– Introduction to multiprocessors
– Parallelism in software
– Memory organization
– Cache coherence
Introduction
Goal: connecting multiple computers to get higher performance
– Multiprocessors
– Scalability, availability, power efficiency
Job-level (process-level) parallelism
– High throughput for independent jobs
Parallel processing program
– Single program run on multiple processors
Multicore microprocessors
– Chips with multiple processors (cores)
Hardware and Software
Hardware
– Serial: e.g., Pentium 4
– Parallel: e.g., quad-core Xeon e5345
Software
– Sequential: e.g., matrix multiplication
– Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
– Challenge: making effective use of parallel hardware
Parallel Programming
Parallel software is the problem
Need to get significant performance improvement
– Otherwise, just use a faster uniprocessor, since it's easier!
Difficulties
– Partitioning
– Coordination
– Communication overhead
Amdahl's Law
The sequential part of a program limits the achievable speedup
Example: with 100 processors, can we get a 90× speedup?
– Speedup = 1 / ((1 − F_parallel) + F_parallel/100) = 90
– Solving gives F_parallel ≈ 0.999
– So the sequential part can be at most about 0.1% of the original execution time
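A minimal sketch of this calculation in C, using the Amdahl formula above (amdahl_speedup and the closed-form solve are illustrative, not part of the lecture):

#include <stdio.h>

/* Amdahl's Law: speedup with p processors when a fraction f of the
   original execution time is parallelizable. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* Slide's question: what fraction must be parallel for 90x on 100 CPUs?
       Solving 1 / ((1 - f) + f/p) = S for f gives f = (1 - 1/S) / (1 - 1/p). */
    int p = 100;
    double S = 90.0;
    double f = (1.0 - 1.0 / S) / (1.0 - 1.0 / p);
    printf("required parallel fraction: %.4f\n", f);        /* 0.9989 */
    printf("speedup check: %.1f\n", amdahl_speedup(f, p));  /* 90.0 */
    return 0;
}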
Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
– Speed up from 10 to 100 processors
Single processor: Time = (10 + 100) × t_add
10 processors
– Time = 10 × t_add + 100/10 × t_add = 20 × t_add
– Speedup = 110/20 = 5.5 (55% of potential)
100 processors
– Time = 10 × t_add + 100/100 × t_add = 11 × t_add
– Speedup = 110/11 = 10 (10% of potential)
Scaling Example (cont)
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × t_add
10 processors
– Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
– Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
– Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
– Speedup = 10010/110 = 91 (91% of potential)
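Both slides follow from one simple model. A minimal C sketch, assuming the 10 scalar additions stay serial and the n × n matrix sum divides evenly across p processors (exec_time and speedup are illustrative names, not the lecture's):

#include <stdio.h>

/* Model from the slides: 10 serial scalar adds plus an n*n matrix sum
   that divides evenly across p processors. Times are in units of t_add. */
static double exec_time(int n, int p) {
    return 10.0 + (double)(n * n) / p;
}

static double speedup(int n, int p) {
    return exec_time(n, 1) / exec_time(n, p);
}

int main(void) {
    int sizes[] = {10, 100};
    int procs[] = {10, 100};
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("n=%3d, p=%3d: speedup = %5.1f (%4.1f%% of potential)\n",
                   sizes[i], procs[j],
                   speedup(sizes[i], procs[j]),
                   100.0 * speedup(sizes[i], procs[j]) / procs[j]);
    return 0;
}

This reproduces all four numbers above: 5.5 (55%), 10 (10%), 9.9 (99%), and 91 (91%).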
Strong vs Weak Scaling
Strong scaling: problem size fixed
– As in the example above
Weak scaling: problem size proportional to number of processors
– 10 processors, 10 × 10 matrix: Time = 20 × t_add
– 100 processors, 32 × 32 matrix: Time = 10 × t_add + 1024/100 × t_add ≈ 20 × t_add
– Roughly constant performance in this example (see the sketch below)
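The weak-scaling claim can be checked directly; a self-contained sketch under the same assumed model as above:

#include <stdio.h>

/* Weak scaling: grow the matrix with the processor count so the
   work per processor stays roughly constant (10 + n*n/p, in t_add units). */
int main(void) {
    printf("p=10,  n=10: time = %.2f x t_add\n", 10.0 + 10.0 * 10.0 / 10.0);  /* 20.00 */
    printf("p=100, n=32: time = %.2f x t_add\n", 10.0 + 32.0 * 32.0 / 100.0); /* 20.24 */
    return 0;
}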
Memory Organization - I
Centralized shared-memory multiprocessor, or symmetric shared-memory multiprocessor (SMP)
Multiple processors connected to a single centralized memory
– Since all processors see the same memory organization: uniform memory access (UMA)
Shared memory, because all processors can access the entire memory address space
Can the centralized memory become a bandwidth bottleneck?
– Not if the caches are large and there are fewer than about a dozen processors
SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, connected by a shared bus to a single main memory and the I/O system]
Memory Organization - II
For higher scalability, memory is distributed among processors: distributed memory multiprocessors
If one processor can directly address the memory local to another processor, the address space is shared: distributed shared-memory (DSM) multiprocessor
If memories are strictly local, we need messages to communicate data: a cluster of computers, or multicomputer
Non-uniform memory access (NUMA), since local memory has lower latency than remote memory
Distributed Memory Multiprocessors
[Figure: four nodes, each containing a processor with caches plus local memory and I/O, connected by an interconnection network]
SMPs
Centralized main memory and many caches: many copies of the same data
A system is cache coherent if a read returns the most recently written value for that word

                            Value of X in:
Time  Event                 Cache-A  Cache-B  Memory
 0    –                        –        –       1
 1    CPU-A reads X            1        –       1
 2    CPU-B reads X            1        1       1
 3    CPU-A stores 0 in X      0        1       0

After time 3, Cache-B still holds the stale value 1: the caches are not coherent
Cache Coherence
A memory system is coherent if:
– P writes to X; no other processor writes to X; P then reads X and receives the value previously written by P
– P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
– Two writes to the same location by two processors are seen in the same order by all processors (write serialization)
The memory consistency model defines how much "time must elapse" before the effect of a write is seen by other processors
Cache Coherence Protocols
Directory-based: a single location (the directory) keeps track of the sharing status of each block of memory
Snooping: every cache block is accompanied by the sharing status of that block; all cache controllers monitor the shared bus so they can update the sharing status of a block, if necessary
Write-invalidate: a processor gains exclusive access to a block before writing, by invalidating all other copies
Write-update: when a processor writes, it updates all other shared copies of that block
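A minimal sketch of the write-invalidate idea, assuming a simple MSI-style status (Modified / Shared / Invalid) per cached block; BlockState, processor_write, and processor_read are illustrative names, not the lecture's:

#include <stdio.h>

#define NCACHES 4

/* MSI-style sharing status of one block in each cache: a toy
   simplification of write-invalidate snooping. */
typedef enum { INVALID, SHARED, MODIFIED } BlockState;

static BlockState state[NCACHES];

/* A write must first gain exclusive access by broadcasting an invalidate
   on the shared bus; every other cache that snoops it drops its copy. */
static void processor_write(int writer) {
    for (int c = 0; c < NCACHES; c++)
        if (c != writer) state[c] = INVALID;   /* snooped invalidate */
    state[writer] = MODIFIED;                  /* exclusive, dirty copy */
}

/* A read: if some cache holds the block MODIFIED, it supplies the data
   (writing it back), and both copies become SHARED. */
static void processor_read(int reader) {
    for (int c = 0; c < NCACHES; c++)
        if (state[c] == MODIFIED) state[c] = SHARED;
    if (state[reader] == INVALID) state[reader] = SHARED;
}

int main(void) {
    static const char *names[] = { "Invalid", "Shared", "Modified" };
    processor_read(0);   /* CPU-0: Shared   */
    processor_read(1);   /* CPU-1: Shared   */
    processor_write(0);  /* CPU-0: Modified; CPU-1 invalidated */
    for (int c = 0; c < NCACHES; c++)
        printf("cache %d: %s\n", c, names[state[c]]);
    return 0;
}

Replaying the earlier SMP example with this protocol, CPU-A's store at time 3 would invalidate Cache-B's copy of X instead of leaving a stale value 1 behind.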