EE382 Processor Design, Winter 1998/99, Michael Flynn
Chapter 8 Lectures: Multiprocessors, Part I
Processor Issues for MP
Initialization
Interrupts
Virtual Memory
–TLB coherency
Emphasis on physical memory and system interconnect
Physical Memory
–Coherency
–Synchronization
–Consistency
Outline
Partitioning
–Granularity
–Overhead and efficiency
Multi-threaded MP
Shared Bus
–Coherency
–Synchronization
–Consistency
Scalable MP
–Cache directories
–Interconnection networks
–Trends and tradeoffs
Additional References
–Hennessy and Patterson, CAQA, Chapter 8
–Culler, Singh, Gupta, Parallel Computer Architecture: A Hardware/Software Approach, http://HTTP.CS.Berkeley.EDU/~culler/book.alpha/index.html
Representative System
[Block diagram: CPU (pipelines, registers, L1 I-cache, L1 D-cache), L2 cache, chipset, memory, I/O bus(es)]
Shared Memory MP
Shared-Memory
–Consider systems with a single memory address space
–Contrasted with multi-computers: separate memory address spaces, with message passing for communication and synchronization (example: a network of workstations)
Shared Memory MP
Types of shared-memory MP
–multithreaded or shared-resource MP
–shared-bus MP (broadcast protocols)
–scalable MP (networked protocols)
Issues
–partitioning of the application into p parallel tasks
–scheduling of tasks to minimize dependency time Tw
–communication and synchronization
Partitioning
If a uniprocessor executes a program in time T1 with O1 operations, and p parallel processors execute it in time Tp with Op operations, then Op > O1 because of task overhead.
Also Sp = T1/Tp < p, where p is the number of processors in the system; p is also the amount of parallelism (or degree of partitioning) available in the program.
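A small numeric sketch of these relationships (illustrative values only, not from the lecture):

```python
# Speedup of a p-way parallel execution. Partitioning overhead makes
# Op > O1, so Tp exceeds the ideal T1/p and Sp = T1/Tp stays below p.
def speedup(t1, tp):
    return t1 / tp

t1 = 100.0       # uniprocessor time (arbitrary units, assumed)
p = 8
overhead = 0.15  # assumed fractional extra work per processor
tp = (t1 / p) * (1 + overhead)

sp = speedup(t1, tp)
assert sp < p    # Sp < p whenever overhead > 0
```

With zero overhead the same formula gives Sp = p exactly, which is the ideal the slide rules out in practice.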
Granularity
[Plot: speedup Sp vs. grain size. Fine grain is overhead limited; coarse grain is limited by parallelism and load balance.]
Task Scheduling
Static: at compile time
Dynamic: at run time
–load balancing across the system
–clustering of tasks with heavy inter-processor communication
–scheduling with compiler assistance
Overhead
Overhead limits Sp to less than p with p processors
Efficiency = Sp/p = T1/(Tp * p)
Lee's equal-work hypothesis: Sp < p/ln(p)
Task overhead is due to
–communication delays
–context switching
–cold-cache effects
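Lee's bound implies efficiency falls roughly as 1/ln(p) as processors are added; a quick check:

```python
import math

def efficiency(sp, p):
    """Efficiency = Sp/p = T1/(Tp * p)."""
    return sp / p

# Lee's equal-work hypothesis bounds achievable speedup: Sp < p/ln(p),
# so the efficiency at that bound is 1/ln(p), shrinking as p grows.
for p in (4, 16, 64):
    bound = p / math.log(p)
    print(p, round(bound, 2), round(efficiency(bound, p), 2))
```

The table this prints shows the speedup bound growing much more slowly than p, which is the practical argument for coarse-enough granularity.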
Multi-threaded MP
Multiple processors sharing many execution units
–each processor has its own state
–share function units, caches, TLBs, etc.
Types
–time-multiplexed: processors take turns in a fixed rotation so there are no pipeline breaks, etc.
–pipelined processor: switch context on any processor delay (cache miss, etc.)
Optimizes multi-thread throughput, but limits single-thread performance
–See Study 8.1 on p. 537
Processors share the D-cache
Shared-Bus MP
Processors with their own D-caches require a cache-coherency protocol.
The simplest protocols have each processor snoop on writes to memory that occur on the shared bus.
If a write hits a line in the snooper's own cache, that line is either invalidated or updated.
Coherency, Synchronization, and Consistency
Coherency
–Property that a read returns the value of the latest write
–Required for process migration even without sharing
Synchronization
–Instructions that control access to critical sections of data shared by multiple processors
Consistency
–Rules for reordering memory references, which may lead to differences in memory state as observed by different processors
Shared-Bus Cache Coherency Protocols
Write invalidate, simple: 3 states (V, I, D)
Berkeley (write invalidate): 4 states (V, S, D, I)
Illinois (write invalidate): 4 states (M, E, S, I)
Dragon (write update): 5 states (M, E, S, D, I)
Simpler protocols have somewhat more memory-bus traffic.
MESI Protocol
[State-transition diagram not preserved in this transcription]
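As a rough stand-in for the lost diagram, a simplified MESI next-state table; this is a textbook-style sketch, and the event names and the choice of S on a read miss are assumptions (real protocols load E when no other cache holds the line):

```python
# Simplified MESI next-state table: (state, event) -> next state.
# Events: local read/write hits and misses, plus snooped bus traffic.
MESI = {
    ("I", "read_miss"):   "S",  # assumed shared; E if no other cache has it
    ("I", "write_miss"):  "M",  # read-for-ownership, then modify
    ("S", "read_hit"):    "S",
    ("S", "write_hit"):   "M",  # bus transaction invalidates other sharers
    ("S", "snoop_write"): "I",
    ("E", "read_hit"):    "E",
    ("E", "write_hit"):   "M",  # silent upgrade: no bus transaction needed
    ("E", "snoop_read"):  "S",
    ("E", "snoop_write"): "I",
    ("M", "read_hit"):    "M",
    ("M", "write_hit"):   "M",
    ("M", "snoop_read"):  "S",  # supply dirty data, drop to Shared
    ("M", "snoop_write"): "I",  # another cache takes ownership
}

def next_state(state, event):
    """Look up the transition; unknown events leave the state unchanged."""
    return MESI.get((state, event), state)
```

The E state is what distinguishes MESI from simpler 3-state invalidate protocols: a write to an Exclusive line needs no bus transaction, which is one way the 4-state protocols reduce bus traffic.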
Coherence Overhead for Parallel Processing
Results for 4 parallel programs with 16 CPUs and 64KB caches
Coherence traffic is a substantial portion of bus demand
Large blocks can lead to false sharing
Hennessy and Patterson, CAQA, Fig. 8.15
Synchronization Primitives
Communicating sequential processes: each of Process A and Process B executes
–acquire semaphore
–access shared data (read/modify/write)
–release semaphore
Synchronization Primitives
Acquiring the semaphore generally requires an atomic read-modify-write operation on a location
–Ensures that only one process enters the critical section
–Test&Set, Locked-Exchange, Compare&Exchange, Fetch&Add, Load-Locked/Store-Conditional
Looping on a semaphore with Test&Set or a similar instruction is called a spin lock
–Techniques to minimize overhead from spin contention: Test + Test&Set, exponential backoff
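The Test + Test&Set pattern with exponential backoff can be sketched as follows. This is a Python model in which a guard lock stands in for the hardware atomic instruction, so it illustrates the structure rather than real bus behavior:

```python
import random
import threading
import time

class SpinLock:
    """Test + Test&Set spin lock with exponential backoff (model only)."""
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def _test_and_set(self):
        # Atomically read the old value and set the flag.
        with self._guard:
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        backoff = 1e-6
        while True:
            while self._flag:            # "test": spin on the local copy,
                time.sleep(0)            # generating no (model) bus traffic
            if not self._test_and_set(): # "test&set": one atomic attempt
                return
            time.sleep(random.uniform(0, backoff))
            backoff = min(backoff * 2, 1e-3)  # exponential backoff

    def release(self):
        self._flag = False

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1                     # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the outer "test" loop is that waiters spin on a cached copy of the flag and only attempt the bus-invalidating atomic operation once the lock looks free; backoff then spreads out the losers' retries.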
Memory Consistency Problem
Can the tests at L1 and L2 below both succeed?

Process A:           Process B:
A = 0;               B = 0;
...                  ...
A = 1;               B = 1;
L1: if (B==0) ...    L2: if (A==0) ...

Memory Consistency Model
–Rules for when memory references made by a program on one processor may be observed in a different order by a program on another processor
–Memory fence operations explicitly control the ordering of memory references
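Under sequential consistency the answer is no, and enumerating every interleaving that preserves each process's program order confirms it (a small sketch of the example above, with the operations encoded as tuples):

```python
# Each process's operations in program order; sequential consistency
# permits exactly the interleavings that preserve each process's order.
A_OPS = [("write", "A", 1), ("read", "B")]  # A = 1; L1: if (B==0)
B_OPS = [("write", "B", 1), ("read", "A")]  # B = 1; L2: if (A==0)

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

outcomes = set()
for order in interleavings(A_OPS, B_OPS):
    mem, reads = {"A": 0, "B": 0}, {}
    for op in order:
        if op[0] == "write":
            mem[op[1]] = op[2]
        else:
            reads[op[1]] = mem[op[1]]
    outcomes.add((reads["B"], reads["A"]))  # (B as read by A, A as read by B)

# Under sequential consistency, both tests can never succeed together:
assert (0, 0) not in outcomes
```

Whichever process reads first must follow at least its partner's write or its own, so at least one of the two reads observes a 1; the (0, 0) outcome requires the reordering allowed by the weaker models on the next slides.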
Memory Consistency Models (Part I)
Sequential consistency (strong ordering)
–All memory ops execute in some sequential order; the memory ops of each processor appear in program order
Processor consistency (total store ordering)
–Writes are buffered and performed in order
–Reads are performed in order, but can bypass writes
–The processor flushes the store buffer when a synchronization instruction executes
Weak consistency
–Memory references are generally allowed in any order
–Programs enforce ordering, when required for shared data, by executing memory-fence instructions: all memory references of previous instructions complete before the fence, and no memory references of subsequent instructions issue before the fence
–Synchronization instructions act as fences
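Under processor consistency, the outcome forbidden by sequential consistency becomes reachable, because a buffered write is invisible to the other processor while reads proceed past it. A toy store-buffer model of the L1/L2 example (assuming each read checks only its own processor's buffer before memory):

```python
# Toy TSO model: each write goes into the writer's store buffer; a read
# returns its own buffered value if present, else the memory value. If
# both processors read before either buffer drains, both tests see 0.
mem = {"A": 0, "B": 0}
buf_a, buf_b = {}, {}

buf_a["A"] = 1                  # Process A: A = 1 (buffered, not yet visible)
buf_b["B"] = 1                  # Process B: B = 1 (buffered, not yet visible)

r1 = buf_a.get("B", mem["B"])   # Process A: read B, bypassing its own write
r2 = buf_b.get("A", mem["A"])   # Process B: read A

mem.update(buf_a)               # buffers drain to memory later
mem.update(buf_b)

assert (r1, r2) == (0, 0)       # both tests succeed: forbidden under SC
```

This is exactly the case the slide's "flush the store buffer on a synchronization instruction" rule exists to close: a fence before each read would force the writes to drain first.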
Memory Consistency Models (Part II)
Release consistency
–Distinguishes acquire/release of a semaphore before/after access to shared data
–Acquire semaphore: ensure the semaphore is acquired before any reads or writes by subsequent instructions (which may access shared data)
–Release semaphore: ensure any writes by previous instructions (which may access shared data) are visible before the semaphore is released
Hennessy and Patterson, CAQA, Fig. 8.39
Pentium Processor Example
2-level cache hierarchy
–Inclusion enforced
–Snoops on the system bus need interrogate only the L2
Cache policy
–Write-back supported
–Write-through optional, selected per page or line
–Write buffers used
Cache coherence
–MESI at both levels
Memory consistency
–Processor ordering
Issues
–Writes that hit an E line on-chip
–Writes that hit an E or M line while the buffer is occupied
[Block diagram: CPU pipelines, data cache, write buffer, L2 cache, cache write buffer, system bus]
Shared-Bus Performance Models
Null binomial model
–resubmissions don't automatically occur, e.g., multithreaded MP
–see Study 8.1, page 537
Resubmissions model
–requests remain on the bus until serviced
–see pp. 413-415 and the cache example posted on the web
Bus traffic usually limits the number of processors
–a bus optimized for MP supports 10-20, but at high cost for small systems
–a bus that incrementally extends a uniprocessor is limited to 2-4
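For intuition on why bus traffic caps the processor count, here is a standard binomial-style sketch (an assumption on my part, not necessarily the exact model from the cited pages): if each of p processors independently presents a bus request in a cycle with probability rho, the bus sees at least one request with probability 1 - (1 - rho)^p.

```python
def bus_utilization(p, rho):
    """Probability the shared bus is in demand in a given cycle, for p
    processors each requesting independently with probability rho
    (binomial model, no resubmission of rejected requests)."""
    return 1 - (1 - rho) ** p

# Demand saturates quickly even at a modest 10% per-processor request
# rate, which matches the 10-20 processor limit quoted above.
for p in (2, 8, 16, 32):
    print(p, round(bus_utilization(p, 0.1), 3))
```

Once utilization approaches 1, added processors mostly add queueing delay rather than throughput; the resubmissions model captures that queueing effect explicitly.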