Introduction to MIMD Architectures
Sima, Fountain and Kacsuk, Chapter 15
CSE462

David Abramson, 2004 Material from Sima, Fountain and Kacsuk, Addison Wesley

Architectural Concepts
• Distributed Memory MIMD
  – Replicate the processor/memory pairs
  – Connect them via an interconnection network
• Shared Memory MIMD
  – Replicate the processors
  – Replicate the memories
  – Connect them via an interconnection network

Distributed Memory Machine
• Access to the local memory module is much faster than access to a remote one
• Hardware support for remote accesses via
  – Load/Store primitive
  – Message passing layer
• Cache memory for local memory traffic
• Message transfer
  – Memory-memory
  – Cache-cache
[Figure labels: Processor 1 ... Processor p; Memory; Interconnection Network]

Advantages of Distributed Memory
• Local memory traffic suffers less contention than in shared memory
• Highly scalable
• No need for sophisticated synchronization features such as monitors or semaphores; message passing serves a dual purpose
  – Sends the data
  – Provides synchronization
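
The dual purpose of a message can be sketched in a few lines. This is a minimal illustration, not code from the course: two Python threads stand in for two nodes, a `queue.Queue` stands in for the interconnection network, and the names `producer`, `consumer` and `run_exchange` are hypothetical.

```python
import queue
import threading


def producer(channel, value):
    # Sending the message transfers the data and, at the same time,
    # signals to the receiver that the data is ready.
    channel.put(value)


def consumer(channel, results):
    # The blocking receive waits until the message arrives, so no
    # separate semaphore or monitor is needed to order the two steps.
    results.append(channel.get() * 2)


def run_exchange(value):
    channel = queue.Queue(maxsize=1)  # stand-in for the network (assumption)
    results = []
    receiver = threading.Thread(target=consumer, args=(channel, results))
    sender = threading.Thread(target=producer, args=(channel, value))
    receiver.start()  # started first, so it has to wait for the message
    sender.start()
    sender.join()
    receiver.join()
    return results[0]
```

The receiver is started before the sender on purpose: it still produces the right answer because the receive itself does the waiting.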

Problems of Distributed Memory
• Load balancing
• Message passing can lead to synchronization failures, including deadlock
  – BlockingSend -> BlockingReceive
  – BlockingReceive -> BlockingSend
• Intensive data copying of whole structures
• Overheads are high for small messages
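
The BlockingReceive -> BlockingSend case can be reproduced with a small sketch. The assumptions are loud ones: threads stand in for nodes, unbounded `queue.Queue` objects stand in for the network, and a timeout replaces the indefinite wait so that the deadlock is observable instead of hanging the program. The names `node` and `exchange` are hypothetical.

```python
import queue
import threading


def node(name, inbox, outbox, receive_first, log, timeout=0.5):
    """One node that exchanges a single message with its peer."""
    try:
        if receive_first:
            received = inbox.get(timeout=timeout)  # blocking receive
            outbox.put(name)                       # never reached on deadlock
        else:
            outbox.put(name)                       # buffered send returns at once
            received = inbox.get(timeout=timeout)
        log[name] = "got " + received
    except queue.Empty:
        # A real blocking receive would wait forever; the timeout only
        # makes the deadlock visible.
        log[name] = "timed out"


def exchange(a_receives_first, b_receives_first):
    a_to_b, b_to_a = queue.Queue(), queue.Queue()
    log = {}
    nodes = [
        threading.Thread(target=node,
                         args=("A", b_to_a, a_to_b, a_receives_first, log)),
        threading.Thread(target=node,
                         args=("B", a_to_b, b_to_a, b_receives_first, log)),
    ]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()
    return log
```

Calling `exchange(True, True)` leaves both nodes waiting for a message that is never sent, while `exchange(True, False)` completes. The BlockingSend -> BlockingSend variant is not modelled here: it deadlocks only when sends are synchronous (unbuffered), and these queues buffer every send.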

Shared Memory Architecture
• All processors have equal access to shared memory modules
• Local caches reduce
  – Memory traffic
  – Network traffic
  – Memory access time
• IP Synchronisation
  – Indivisible load/store
[Figure labels: Processor 1, Processor 2 ... Processor p; Interconnection Network; Memory Module 1, Memory Module 2 ... Memory Module m]
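
Why indivisible memory operations matter for synchronisation can be shown with a shared counter. This is a sketch under stated assumptions: Python exposes no indivisible memory instruction to user code, so a `threading.Lock` stands in for the hardware guarantee, and `fetch_and_add` is a hypothetical read-modify-write primitive chosen for illustration, not something named on the slide.

```python
import threading


class AtomicCounter:
    """Shared counter whose update is made indivisible by a lock."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()  # stands in for hardware indivisibility

    def fetch_and_add(self, amount=1):
        # Load, add and store happen as one step: no other thread can
        # observe or modify the value in between.
        with self._guard:
            old = self._value
            self._value = old + amount
            return old

    def value(self):
        with self._guard:
            return self._value


def count_in_parallel(workers, increments):
    counter = AtomicCounter()

    def work():
        for _ in range(increments):
            counter.fetch_and_add()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()
```

With the guard in place, `count_in_parallel(4, 1000)` always yields 4000; without it, increments from different threads could interleave and be lost.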

Advantages of Shared Memory
• No need to partition code or data
  – Occurs on the fly
• No need to move data explicitly
• Don’t need new programming languages or compilers

Disadvantages of Shared Memory
• Synchronization is difficult
• Lack of scalability
  – IPC becomes a bottleneck
• Scalability can be addressed by
  – High throughput, low latency network
  – Cache memories (these cause a coherence problem)
  – Distributed shared memory architecture
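
The coherence problem caused by cache memories can be made concrete with a toy model. Everything here is an assumption made for illustration: the class `ToySystem`, the write-through policy, and the write-invalidate fix are not taken from the slides, and real protocols are considerably more involved.

```python
class ToySystem:
    """Shared memory plus one private write-through cache per processor."""

    def __init__(self, processors, invalidate_on_write):
        self.memory = {}
        self.caches = [dict() for _ in range(processors)]
        self.invalidate_on_write = invalidate_on_write

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:  # miss: fetch from shared memory
            cache[address] = self.memory.get(address, 0)
        return cache[address]     # hit: may be a stale copy

    def write(self, cpu, address, value):
        self.caches[cpu][address] = value
        self.memory[address] = value  # write-through to shared memory
        if self.invalidate_on_write:
            # Write-invalidate: drop every other processor's copy.
            for other, cache in enumerate(self.caches):
                if other != cpu:
                    cache.pop(address, None)


def stale_read_demo(invalidate_on_write):
    system = ToySystem(2, invalidate_on_write)
    system.write(0, "x", 1)
    system.read(1, "x")       # processor 1 now caches x = 1
    system.write(0, "x", 2)   # processor 0 updates x
    return system.read(1, "x")
```

Without invalidation processor 1 keeps reading its old copy of `x`; with invalidation its next read misses and fetches the new value from memory.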

Distributed Shared Memory
• Three design choices
  – Non-uniform memory access (NUMA), like Cray T3D
  – Cache coherent non-uniform memory access (CC-NUMA), like Convex SPP, Stanford DASH
  – Cache-only memory access (COMA), like KSR-1

Non-uniform memory access (NUMA)
[Figure: processing elements PE 0, PE 1 ... PE n, each with a processor (P0, P1 ... Pn) and a memory (M0, M1 ... Mn), connected by an Interconnection Network]

Cache coherent non-uniform memory access (CC-NUMA)
[Figure: processing elements PE 0, PE 1 ... PE n, each with a processor (P0, P1 ... Pn), a memory (M0, M1 ... Mn) and a cache (C0, C1 ... Cn), connected by an Interconnection Network]

Cache-only memory access (COMA)
[Figure: processing elements PE 0, PE 1 ... PE n, each with a processor (P0, P1 ... Pn) and a cache (C0, C1 ... Cn), connected by an Interconnection Network]

Classification of MIMD Computers
MIMD computers
  Process-level architectures
    Single address space, shared memory
      Physical shared memory (UMA)
      Virtual distributed shared memory
        NUMA
        CC-NUMA
        COMA
    Multiple address space, distributed memory
  Thread-level architectures
    Single address space, shared memory
      Physical shared memory (UMA)
      Virtual distributed shared memory
        NUMA
        CC-NUMA

Problems of Scalable Computers
• Tolerate and hide the latency of remote loads
  – Worse if the output of one computation relies on another to complete
• Tolerate and hide idling due to synchronization among processors

Tolerating Remote Loads
[Figure: PE 0 (P0, M0) computes Result := A + B. A is held in M1 on PE 1 and B in Mn on PE n, so PE 0 issues Load A into rA and Load B into rB across the Interconnection Network.]

Tolerating Latency
• Cache memory
  – Simply lowers the cost of remote access
  – Introduces the cache coherence problem
• Prefetching
  – Already present, so cost is low
  – Increases network load
• Threads + fast context switching
  – Accept that a remote access will take a long time and cover the overhead with other work
• These solutions don’t solve synchronization issues
  – Latency tolerant algorithms
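
The threads idea can be sketched with the Result := A + B example. This is an illustration under assumptions, not a model of any particular machine: `time.sleep` fakes the remote latency, the dictionary `REMOTE_MEMORY` and the functions `remote_load`, `add_sequential` and `add_overlapped` are hypothetical, and operating system threads are far heavier than the fast hardware context switch the slide has in mind.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contents of the remote memory modules.
REMOTE_MEMORY = {"A": 3, "B": 4}


def remote_load(name, latency=0.05):
    time.sleep(latency)  # stand-in for the network round trip
    return REMOTE_MEMORY[name]


def add_sequential():
    # Each load stalls the processor for the full latency:
    # total wait is roughly two round trips.
    r_a = remote_load("A")
    r_b = remote_load("B")
    return r_a + r_b


def add_overlapped():
    # Both loads are issued before either result is needed, so the
    # two latencies overlap: total wait is roughly one round trip.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(remote_load, "A")
        future_b = pool.submit(remote_load, "B")
        return future_a.result() + future_b.result()
```

Both functions return the same sum; only the time spent waiting differs. The addition itself still cannot start until both operands have arrived, which is the dependence problem that latency hiding alone does not remove.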

Design issues of scalable MIMD
• Processor design
  – Pipelining, parallel instruction issue
  – Atomic data access, prefetching, cache memory, message passing, etc.
• Interconnection network design
  – Scalable, high bandwidth, low latency
• Memory design
  – Shared memory design
  – Cache coherence
• IO subsystem
  – Parallel IO