CSL718 : Multiprocessors 13th April, 2006 Introduction Anshul Kumar, CSE IITD
Parallel Architectures Flynn’s Classification [1966] Architecture Categories SISD SIMD MISD MIMD Anshul Kumar, CSE IITD
MIMD IS C P M IS DS IS C IS P DS Anshul Kumar, CSE IITD
Parallel Architectures Sima’s Classification Parallel architectures PAs Data-parallel architectures Function-parallel Anshul Kumar, CSE IITD
Function Parallel Architectures Instruction level PAs Thread level PAs Process level PAs ILPs Pipelined processors VLIWs Superscalar processors MIMDs Shared Memory MIMD Distributed Memory MIMD Built using general purpose processors Anshul Kumar, CSE IITD
Issues from user’s perspective Specification / Program design explicit parallelism or implicit parallelism + parallelizing compiler Partitioning / mapping to processors Scheduling / mapping to time instants static or dynamic Communication and Synchronization Anshul Kumar, CSE IITD
Parallelizing example for (i=0; i<n; i++) { m = m+3 a[i] = (a[m]+a[m+1]+a[m+2])/3 } Can all iterations be done in parallel? Dependence 1: m = m + 3 Dependence 2: a[1] = (a[3]+a[4]+a[5])/3 a[4] = (a[12]+a[13]+a[14])/3 Anshul Kumar, CSE IITD
Parallelizing example - contd. Eliminate dependence based on induction variable for (i=0; i<n; i++) { m = i*3 a[i] = (a[m]+a[m+1]+a[m+2])/3 } Anshul Kumar, CSE IITD
Parallelizing example - contd. Eliminate forward dependency using double buffer for (i=0; i<n; i++) { m = i*3 aa[i] = (a[m]+a[m+1]+a[m+2])/3 } barrier( ) a[i] = aa[i] Anshul Kumar, CSE IITD
Parallelizing example - contd. Parallelization using dynamic thread creation and scheduling schedule(0) for (i=0; i<n; i++) { wait_till_scheduled(i) m = i*3 a[i] = (a[m]+a[m+1]+a[m+2])/3 if (i0)schedule(3*i) schedule(3*i+1) schedule(3*i+2) } Anshul Kumar, CSE IITD
Grain size and performance Overhead limited load imbalance and parallelism limited Speed up Fine grain Opt grain size Coarse grain Anshul Kumar, CSE IITD
Speed up and efficiency Anshul Kumar, CSE IITD
Amdahl’s Law Sp s 1 .5 Sp=p Sp=1
Generalization Sp p actual Anshul Kumar, CSE IITD
Shared Memory Architecture Anshul Kumar, CSE IITD
Design Space of Shared Memory Architectures Extent of address space sharing Location of memory modules Uniformity of memory access Anshul Kumar, CSE IITD
Address Space P1 P2 P3 P4 Each processor sees an exclusive address space Each processor sees partly exclusive and partly shared address space Each processor sees same shared address space Anshul Kumar, CSE IITD
Location of Memory P M M Centralized P M P Mixed Distributed Interconnection Network Centralized P M Interconnection Network Mixed P M Interconnection Network Distributed Anshul Kumar, CSE IITD
Clustered Architecture M M M M M M M M P P P P P P P P Interconnection Network Interconnection Network M M M M M M Global Interconnection Network M M M Anshul Kumar, CSE IITD
Uniformity of Access UMA (Uniform Memory Access) Uniformity across memory address space Uniformity across processors NUMA (Non-Uniform Memory Access) CC-NUMA (Cache Coherent NUMA) COMA (Cache Only Memory Architecture) UMA : Symmetrical Shared Memory Multiprocessor (SMP) NUMA : Distributed Shared Memory Multiprocessor Anshul Kumar, CSE IITD
Location and Sharing SHARING full partial none UMA centralized mixed NUMA distributed Anshul Kumar, CSE IITD
Shared Memory with Caches Multiple copies of data may exist Problem of cache coherence Cache coherence protocols What action is taken? Which processors/caches communicate? Status of each block? Anshul Kumar, CSE IITD
What action is taken? Invalidate other caches and/or memory send a signal/message immediately, copy information only when unavoidable similar to write back policy Update other caches and/or memory write simultaneously at all places (send modifications immediately) similar to write through policy Anshul Kumar, CSE IITD
Which procs/caches communicate? Snoopy protocol broadcast invalidate or update messages all processors snoop on the bus Directory based protocol maintain directory - list of copies communicate selectively directory - centralized (memory) or distributed (caches) Anshul Kumar, CSE IITD
Status of each cache block? valid/invalid private/shared clean/dirty Simplest protocol (3 states) Invalid, (shared) clean, private dirty Berkeley protocol (4 states) Invalid, (shared) clean, private dirty, shared dirty Illinois, Firefly protocols (4 states) Invalid, shared clean, private clean, private dirty Dragon protocols (5 states) Invalid, shared clean/dirty private clean/dirty Anshul Kumar, CSE IITD
Simplest invalidation protocol Use 3 states : Invalid, shared clean, private dirty invalid clean shared? dirty CPU event BUS event Anshul Kumar, CSE IITD
Simplest invalidation protocol Use 3 states : Invalid, shared clean, private dirty RD miss invalid clean shared? WR RD miss WR miss dirty CPU event BUS event Anshul Kumar, CSE IITD
Simplest invalidation protocol Use 3 states : Invalid, shared clean, private dirty invalid clean shared? WR miss, INV RD miss WR miss, INV dirty CPU event BUS event Anshul Kumar, CSE IITD
Simplest invalidation protocol Use 3 states : Invalid, shared clean, private dirty RD miss invalid clean shared? WR miss, INV RD miss WR miss, INV WR RD miss WR miss dirty CPU event BUS event Anshul Kumar, CSE IITD