1
Computer Architecture: MIMD Parallel Processors (Photo: Iolanthe II racing in Waitemata Harbour)
2
Classification of Parallel Processors Flynn's Taxonomy classifies machines by instruction and data stream: Single Instruction, Single Data (SISD) - sequential processors. Single Instruction, Multiple Data (SIMD) - CM-2 (multiple small processors), vector processors, parts of commercial processors (MMX, AltiVec). Multiple Instruction, Single Data (MISD) - ? Multiple Instruction, Multiple Data (MIMD) - general parallel processors.
3
MIMD Systems Recipe Buy a few high performance commercial PEs DEC Alpha MIPS R10000 UltraSPARC Pentium? Put them together with some memory and peripherals on a common bus Instant parallel processor! How to program it?
4
Programming Model The problem is not unique to MIMD - even sequential machines need one: the von Neumann (stored program) model. Parallel - splitting the workload: Data - distribute data to PEs. Instructions - distribute tasks to PEs. Synchronisation - having divided the data & tasks, how do we synchronise the tasks?
5
Programming Model Shared Memory Model Flavour of the year Generally thought to be simplest to manage All PEs see a common (virtual) address space PEs communicate by writing into the common address space
6
Data Distribution Trivial! All the data sits in the common address space - any PE can access it! Uniform Memory Access (UMA) systems: all PEs access all data with the same access time t_acc. Non-UMA (NUMA) systems: memory is physically distributed, so some PEs are "closer" to some addresses. More later!
7
Synchronisation Read static shared data No problem! Update problem PE 0 writes x PE 1 reads x How to ensure that PE 1 reads the last value written by PE 0 ? Semaphores Lock resources (memory areas or...) while being updated by one PE
8
Synchronisation Semaphore A semaphore is a data structure in memory: a count of waiters (-1 = resource free, >= 0 = resource in use) and a pointer to a list of waiters. Two operations: Wait - proceed immediately if the resource is free (waiter count = -1). Notify - advise the semaphore that you have finished with the resource; decrement the waiter count; the first waiter will be given control.
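The wait/notify structure above can be sketched as follows. This is an illustrative model only (the class and method names are mine, not from any real kernel), and it deliberately does nothing about atomicity - which is exactly the problem the following slides expose.

```python
# Sketch of the semaphore described above, using the slide's convention:
# count == -1 means the resource is free, count >= 0 means it is in use
# and `count` tasks are queued waiting for it.
from collections import deque

class Semaphore:
    def __init__(self):
        self.count = -1          # -1: resource free
        self.waiters = deque()   # queued waiters (the TCB list on the slide)

    def wait(self, task):
        """Return True if `task` may proceed now, else queue it."""
        self.count += 1
        if self.count == 0:      # count was -1: the resource was free
            return True
        self.waiters.append(task)
        return False

    def notify(self):
        """Finished with the resource: decrement the waiter count and
        hand control to the first waiter, if any."""
        self.count -= 1
        if self.waiters:
            return self.waiters.popleft()
        return None
```

Note that `wait` performs a read-modify-write on `count`; the next slides show why that sequence must be atomic.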
9
Semaphores - Implementation Scenario: semaphore free (-1). PE 0: wait.. the resource is free, so PE 0 uses it (sets the count to 0). PE 1: wait.. reads the count (0), starts to increment it.. PE 0: notify.. gets the bus and writes -1. PE 1 (finishing its wait): adds 1 to the 0 it read, writes 1 to the count, and adds PE 1's TCB to the waiter list. Stalemate! Who issues the notify to free the resource?
10
Atomic Operations Problem PE 0 wrote a new value (-1) after PE 1 had read the counter PE 1 increments the value it read (0) and writes it back Solution PE 1 ’s read and update must be atomic No other PE must gain access to counter while PE 1 is updating Usually an architecture will provide Test and set instruction Read a memory location, test it, if it’s 0, write a new value, else do nothing Atomic or indivisible.. No other PE can access the value until the operation is complete
11
Atomic Operations Test & Set Read a memory location, test it, if it’s 0, write a new value, else do nothing Can be used to guard a resource When the location contains 0 - access to the resource is allowed Non-zero value means the resource is locked Semaphore: Simple semaphore (no wait list) Implement directly Waiter “backs off” and tries again (rather than being queued) Complex semaphore (with wait list) Guards the wait counter
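The simple semaphore with no wait list can be sketched directly from test-and-set. This is a behavioural model (names are illustrative); in the model the read of the old value and the write of 1 stand in for one indivisible hardware operation.

```python
# Guarding a resource with test-and-set: a simple semaphore with no
# wait list. A waiter "backs off" and retries rather than being queued.
class TestAndSetLock:
    def __init__(self):
        self.flag = 0            # 0: resource free, non-zero: locked

    def test_and_set(self):
        # Models the atomic instruction: read the old value and write 1
        # as a single indivisible step. No other PE can intervene.
        old = self.flag
        self.flag = 1
        return old

    def try_acquire(self):
        # Succeed only if the old value was 0 (the resource was free);
        # otherwise the caller backs off and tries again later.
        return self.test_and_set() == 0

    def release(self):
        self.flag = 0            # unlock the resource
```

A complex semaphore would use such a lock only to guard its waiter counter, as the slide notes.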
12
Atomic Operations Processor must provide an atomic operation for Multi-tasking or multi-threading on a single PE Multiple processes Interrupts occur at arbitrary points in time including timer interrupts signaling end of time-slice Any process can be interrupted in the middle of a read-modify-write sequence Shared memory multi-processors One PE can lose control of the bus after the read of a read-modify-write Cache? Later!
13
Atomic Operations Variations Provide equivalent capability Sometimes appear in strange guises! Read-modify-write bus transactions Memory location is read, modified and written back as a single, indivisible operation Test and exchange Check register’s value, if 0, exchange with memory Reservation Register (PowerPC) lwarx - load word and reserve indexed stwcx - store word conditional indexed Reservation register stores address of reserved word Reservation and use can be separated by sequence of instructions
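The reservation-register idea can be modelled like this. It is a behavioural sketch of the lwarx/stwcx pairing, not full PowerPC semantics (real hardware can lose a reservation for other reasons too, e.g. a context switch).

```python
class ReservedMemory:
    """Models load-and-reserve / store-conditional: the reservation is
    lost if the reserved word is written between the two operations."""
    def __init__(self):
        self.mem = {}
        self.reservation = None       # address held in the reservation register

    def lwarx(self, addr):
        """Load word and reserve: read the value, record the address."""
        self.reservation = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """A plain store by any PE kills a reservation on that address."""
        if self.reservation == addr:
            self.reservation = None
        self.mem[addr] = value

    def stwcx(self, addr, value):
        """Store word conditional: succeeds only if the reservation
        still holds; otherwise the caller must retry from the lwarx."""
        if self.reservation == addr:
            self.mem[addr] = value
            self.reservation = None
            return True
        return False
```

The point of the slide is that the reservation and its use can be separated by an arbitrary instruction sequence, unlike a single test-and-set instruction.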
14
Barriers In shared memory environment PEs must know when another PE has produced a result Simplest case: barrier for all PEs Must be inserted by programmer Potentially expensive All PEs stall and waste time in the barrier
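A barrier of the kind described can be shown with Python's `threading.Barrier`; the scenario (four "PEs" that must all produce a result before any consumes) is invented for illustration.

```python
import threading

results = []
barrier = threading.Barrier(4)   # all 4 PEs must arrive before any continues
lock = threading.Lock()          # serialises appends to the shared list

def worker(pe_id):
    with lock:
        results.append(("produced", pe_id))   # phase 1: each PE produces a result
    barrier.wait()                            # stall here until every PE has produced
    with lock:
        results.append(("consumed", pe_id))   # phase 2: safe to read others' results

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The cost the slide mentions is visible here: every PE sits idle in `barrier.wait()` until the slowest one arrives.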
15
Cache? What happens to cached locations?
16
Multiple Caches - Inconsistent states Coherence PE A reads location x from memory Copy in cache A PE B reads location x from memory Copy in cache B PE A adds 1 A’s copy now 201 PE B reads location x reads 200 from cache B Caches and memory are now inconsistent or not coherent
19
Cache - Maintaining Coherence Invalidate on write PE A reads location x from memory Copy in cache A PE B reads location x from memory Copy in cache B PE A adds 1 A’s copy now 201 Issues invalidate x Cache B marks x invalid Invalidate is address transaction only
20
Cache - Maintaining Coherence Reading the new value: PE B reads location x. Main memory is wrong also. PE A snoops the read and realises it has the valid copy. PE A issues a retry. PE A writes x back; memory is now correct. PE B reads location x again and gets the latest version.
22
Coherent Cache - Snooping The SIU "snoops" the bus for transactions; addresses are compared with the local cache. On a match: initiate a retry if the local copy is modified, then write the local copy to the bus; invalidate the local copy if another PE is writing; mark the local copy shared if a second PE is reading the same value.
23
Coherent Cache - MESI protocol A cache line has 4 states: Invalid. Modified - the only valid copy; the memory copy is invalid. Exclusive - the only cached copy; the memory copy is valid. Shared - multiple cached copies; the memory copy is valid.
24
MESI State Diagram Note the number of bus transactions needed! WH Write Hit WM Write Miss RH Read Hit RMS Read Miss Shared RME Read Miss Exclusive SHW Snoop Hit Write
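A simplified sketch of the processor-side transitions behind that diagram, using the slide's event labels. This omits the bus-side actions (push-outs, retries) and the snoop-hit-read transition, so it is a teaching aid rather than a complete protocol.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def next_state(state, event):
    """MESI next-state for one cache line, with the slide's event labels:
    WH/WM write hit/miss, RH read hit, RMS/RME read miss shared/exclusive,
    SHW snoop hit on another PE's write."""
    table = {
        (I, "RME"): E,   # read miss, no other cache holds the line
        (I, "RMS"): S,   # read miss, another cache shares the line
        (I, "WM"):  M,   # write miss: fetch the line, then modify it
        (E, "RH"):  E,
        (E, "WH"):  M,   # write hit on an exclusive line: no bus transaction
        (S, "RH"):  S,
        (S, "WH"):  M,   # write hit on a shared line: others must be invalidated
        (M, "RH"):  M,
        (M, "WH"):  M,
        (M, "SHW"): I,   # another PE writes: our (modified) copy is pushed out
        (E, "SHW"): I,
        (S, "SHW"): I,
    }
    return table[(state, event)]
```

Reading the table off the diagram this way makes the slide's point about cost visible: only the transitions out of Shared and Invalid need bus transactions.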
25
Coherent Cache - The Cost Additional cache-coherency transactions are needed: Shared, write hit - other caches must be notified. Modified, other PE reads - a push-out is needed. Modified, other PE writes - a push-out is needed when the writer updates one word of an n-word line. Invalid (modified in another cache), read or write - wait for the push-out.
26
Clusters A bus which is too long becomes slow! e.g. PCI is limited to 10 TTL loads. Lots of processors on the same bus? The bus speed must be limited, giving a low communication rate - better to use a single PE! Clusters: ~8 processors on a bus.
27
Clusters 8 cache coherent (CC) processors on a bus Interconnect network ~100? clusters
28
Clusters Network Interface Unit Detects requests for “remote” memory
29
Clusters Message despatched to remote cluster’s NIU Memory Request Message
30
Clusters - Shared Memory Non-Uniform Memory Access: access time to memory depends on location! (Figure: from the PEs in this cluster, the local memory is much closer than a remote cluster's memory.)
31
Clusters - Shared Memory Non Uniform Memory Access Access time to memory depends on location! Worse! NIU needs to maintain cache coherence across the entire machine
32
Clusters - Maintaining Cache Coherence The NIU (or equivalent) maintains a directory, with entries for all lines from local memory that are cached elsewhere. NIU software (firmware) checks memory requests against the directory, updates the directory, sends invalidate messages to other clusters, and fetches modified (dirty) lines from other clusters. Remote memory access costs 100s of cycles!
Directory (Cluster 2):
Address  Status  Clusters
4340     S       1, 3, 8
5260     E       9
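The directory's bookkeeping can be sketched as below. This is an invented minimal model: it tracks, for each address, the line's state and the set of clusters holding a copy, and returns the invalidate messages a write would generate.

```python
class Directory:
    """Sketch of a cluster NIU's directory: for each local line cached
    elsewhere, record its state (E/S/M) and which clusters hold a copy."""
    def __init__(self):
        self.entries = {}   # address -> (state, set of holding clusters)

    def read(self, addr, cluster):
        """A remote cluster reads a local line."""
        state, holders = self.entries.get(addr, ("I", set()))
        holders = holders | {cluster}
        # a single holder has the line Exclusive; a second reader demotes it
        new_state = "S" if len(holders) > 1 else "E"
        self.entries[addr] = (new_state, holders)

    def write(self, addr, cluster):
        """A remote cluster writes a local line; return the list of
        clusters that must be sent invalidate messages."""
        state, holders = self.entries.get(addr, ("I", set()))
        invalidates = holders - {cluster}
        self.entries[addr] = ("M", {cluster})
        return sorted(invalidates)
```

Replaying the slide's table: clusters 1, 3 and 8 read line 4340 (state S), cluster 9 reads 5260 (state E); a write to 4340 then costs invalidate messages to the other sharers, which is where the 100s-of-cycles figure comes from.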
33
Clusters - “Off the shelf” Commercial clusters Provide page migration Make copy of a remote page on the local PE Programmer remains responsible for coherence Don’t provide hardware support for cache coherence (across network) Fully CC machines may never be available! Software Systems....
34
Shared Memory Systems Software systems, e.g. TreadMarks, provide shared memory on a page basis: software detects references to remote pages and moves a copy to local memory. This reduces shared-memory overhead and provides some of the shared-memory model's convenience without swamping the interconnection network with messages - the message overhead is too high for a single word, so sharing on a word basis is too expensive!
35
Shared Memory Systems - Granularity Granularity Word basis is too expensive!! Sharing data at low granularity Fine grain sharing Access / sharing for individual words Overheads too high Number of messages Message overhead is high for one word Compare Burst access to memory Don’t fetch a single word - Overhead (bus protocol) is too high Amortize cost of access over multiple words
36
Shared Memory Systems - Granularity Coarse-grain systems transfer data from cluster to cluster. The overhead - messages, updating the directory - is amortised over a whole page, giving a lower relative overhead. This applies to thread size also: splitting a program into small threads of control incurs parallel overhead - the cost of setting up and starting each thread, and the cost of synchronising at the end of a set of threads. It can be more efficient to run a single sequential thread!
37
Coarse Grain Systems So far... most experiments suggest that fine-grain systems are impractical. Larger, coarser grains - blocks of data, threads of computation - are needed to reduce overall computation time by using multiple processors. Parallel systems with too fine a grain can run slower than a single processor!
38
Parallel Overhead Ideal: time = 1/n. Add overhead: time is worse than the ideal - in this example there is no point in using more than 4 PEs!
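The shape of that graph can be reproduced with a toy model: ideal time 1/n plus a per-PE overhead term. The overhead figure here is invented purely to make the curve bottom out at 4 PEs, matching the slide's example.

```python
def run_time(n, overhead_per_pe=0.06):
    """Time for a fixed job on n PEs: the ideal 1/n plus a per-PE
    overhead term (synchronisation, communication). The overhead
    value is illustrative, not measured."""
    return 1.0 / n + overhead_per_pe * n

# with this overhead, adding PEs beyond 4 makes the job slower
best_n = min(range(1, 17), key=run_time)
```

The model shows why "give away some parallelism" pays: past `best_n`, each extra PE adds more overhead than it removes computation.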
40
Parallel Overhead Shared memory systems Best results if you Share on large block basis eg page Split program into coarse grain (long running) threads Give away some parallelism to achieve any parallel speedup! Coarse grain Data Computation There’s parallelism at the instruction level too! The instruction issue unit in a sequential processor is trying to exploit it!
41
Clusters - Improving multiple PE performance Bandwidth to memory: cache reduces dependency on the memory-CPU interface - 95% cache hits means only 5% of memory accesses cross the interface. But add a few PEs and a few cache-coherence transactions: even if the interface was coping before, it won't cope in a multiprocessor system! A major bottleneck!
42
Clusters - Improving multiple PE performance Bus protocols add to access time: Request / Grant / Release phases are needed. "Point-to-point" is faster! With a cross-bar switch interface to memory, no PE contends with any other for a common bus. Cross-bar? The name is taken from old telephone exchanges!
43
Clusters - Memory Bandwidth Modern Clusters Use “Point-to-point” X-bar interfaces to memory to get bandwidth! Cache coherence? Now really hard!! How does each cache snoop all transactions?
44
Programming Model Distributed Memory Message passing Alternative to shared memory Each PE has own address space PEs communicate with messages Messages provide synchronisation PE can block or wait for a message
45
Programming Model - Distributed Memory Distributed Memory Systems Hardware is simple! Network can be as simple as ethernet Networks of Workstations model Commodity (cheap!) PEs Commodity Network Standard Ethernet ATM Proprietary Myrinet Achilles (UWA!)
46
Programming Model - Distributed Memory Distributed Memory Systems Software is considered harder Programmer responsible for Distributing data to individual PEs Explicit Thread control Starting, stopping & synchronising At least two commonly available systems Parallel Virtual Machine (PVM) Message Passing Interface (MPI) Built on two operations Send data, destPE, block | don’t block Receive data, srcPE, block | don’t block Blocking ensures synchronisation
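The two operations above can be sketched with one mailbox queue per PE. This is not PVM or MPI - just an invented model of the blocking send/receive semantics they are built on, with PEs played by threads.

```python
import queue
import threading

class MessagePassing:
    """Sketch of message passing: each PE has its own mailbox."""
    def __init__(self, n_pes):
        self.mailbox = [queue.Queue() for _ in range(n_pes)]

    def send(self, data, dest_pe):
        self.mailbox[dest_pe].put(data)      # non-blocking variant

    def receive(self, pe):
        return self.mailbox[pe].get()        # blocks until data arrives

mp = MessagePassing(2)
received = []

def pe0():
    mp.send(sum(range(10)), dest_pe=1)       # PE 0 produces a partial result

def pe1():
    # The blocking receive IS the synchronisation: PE 1 cannot run
    # ahead of PE 0's result.
    received.append(mp.receive(1))

t1 = threading.Thread(target=pe1); t1.start()
t0 = threading.Thread(target=pe0); t0.start()
t0.join(); t1.join()
```

Note that no separate barrier or semaphore is needed; the message itself carries both the data and the synchronisation, as the slide says.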
47
Programming Model - Distributed Memory Distributed Memory Systems Performance generally better (versus shared memory) Shared memory has hidden overheads Grain size poorly chosen eg data doesn’t fit into pages Unnecessary coherence transactions Updating a shared region (each page) before end of computation MP system waits and updates page when computation is complete
48
Programming Model - Distributed Memory Distributed Memory Systems Performance is generally better (versus shared memory). False sharing severely degrades performance and may not be apparent on superficial analysis. (Figure: PE a and PE b access different data on the same memory page, so the whole page ping-pongs between PE a and PE b.)
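The ping-pong effect can be counted with a toy model (the function and page size are invented for illustration): a page must be transferred whenever a PE touches a page currently held by another PE.

```python
def page_transfers(accesses, words_per_page=1024):
    """Count page transfers for a trace of (pe, word_address) accesses,
    assuming each access may dirty the page, so the page must move to
    the accessing PE. A crude model of page-based shared memory."""
    owner = {}          # page number -> PE currently holding the page
    transfers = 0
    for pe, addr in accesses:
        page = addr // words_per_page
        if owner.get(page, pe) != pe:
            transfers += 1               # the page ping-pongs to this PE
        owner[page] = pe
    return transfers
```

With PE a on word 0 and PE b on word 512 of the same page, an interleaved trace transfers the page on almost every access; placing PE b's data on its own page eliminates the transfers entirely.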
49
Distributed Memory - Summary Simpler (almost trivial) hardware. Software: more programmer effort - explicit data distribution, explicit synchronisation. Performance generally better: the programmer knows more about the problem, communicates only when necessary, and the communication grain size can be optimal, giving lower overheads.
50
Data Flow Conventional programming models are control driven: the instruction sequence is precisely specified, and the sequence determines which instruction the CPU will execute next. Execution rule: execute an instruction when its predecessor has completed. s1: r = a*b; s2: s = c*d; s3: y = r + s; - s2 executes when s1 is complete, and s3 executes when s2 is complete.
51
Data Flow Consider the calculation y = a*b + c*d. Represent it by a graph: nodes represent computations, and data flows along the arcs (a and b feed one multiply, c and d feed another, and their results feed an add that produces y). Execution rule: execute an instruction when its data is available - a data-driven rule.
52
Data Flow Dataflow firing rule: an instruction fires (executes) when its data is available. This exposes all possible parallelism: either multiplication can fire as soon as its data arrives, but the addition must wait for both products. Data dependence analysis! Instruction issue units do the same thing: they fire (issue) each instruction when its operands (registers) have been written.
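The firing rule for y = a*b + c*d can be simulated directly. The `Node` class and slot numbering are my own minimal construction: a node stores arriving input tokens and fires the moment its last input arrives, in whatever order the tokens come.

```python
import operator

class Node:
    """One dataflow node: fires (computes) when all inputs have arrived."""
    def __init__(self, op, n_inputs):
        self.op = op
        self.n_inputs = n_inputs
        self.inputs = {}         # slot index -> value received so far
        self.output = None

    def receive(self, slot, value):
        self.inputs[slot] = value
        if len(self.inputs) == self.n_inputs:      # the firing rule
            args = (self.inputs[i] for i in range(self.n_inputs))
            self.output = self.op(*args)

# the graph for y = a*b + c*d
mul1 = Node(operator.mul, 2)
mul2 = Node(operator.mul, 2)
add  = Node(operator.add, 2)

# tokens arrive in arbitrary order; each multiply fires independently
mul1.receive(0, 3); mul2.receive(0, 5)
mul1.receive(1, 4)                       # mul1 fires with both operands
mul2.receive(1, 6)                       # mul2 fires with both operands
add.receive(0, mul1.output)
add.receive(1, mul2.output)              # add fires last, as it must
```

Either multiply could have fired first here; only the add is forced to wait, which is exactly the parallelism the data-driven rule exposes.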