E0-243: Computer Architecture
L2 – Parallel Architecture

Overview
- Parallel architecture
- Cache coherence problem
- Memory consistency

Trends
- Ever-increasing transistor density: multiple processors (multiple cores) on a single chip (CMP)
- Beyond instruction-level parallelism: thread-level parallelism
- Speculative execution
- Speculative multithreaded execution

Recall: Amdahl's Law
- For a program whose fraction x executes sequentially, speedup is limited by 1/x
- Speedup = (execution time on a uniprocessor) / (execution time on N processors)
- Efficiency = (speedup on N processors) / N
A worked form follows.

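In formula form, with a worked instance (the numbers x = 0.1 and N = 16 are illustrative, not from the slides):

```latex
\text{Speedup}(N) = \frac{T_1}{T_N} = \frac{1}{x + \frac{1 - x}{N}},
\qquad
\text{Efficiency}(N) = \frac{\text{Speedup}(N)}{N}
% Example: x = 0.1, N = 16:
% Speedup(16) = 1 / (0.1 + 0.9/16) = 6.4,  Efficiency = 6.4/16 = 0.4
% As N grows, Speedup approaches 1/x = 10.
```
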
Space of Parallel Computing
Programming models — what the programmer uses when coding applications; they specify synchronization and communication:
- Shared address space, e.g., OpenMP
- Message passing, e.g., MPI
Parallel architecture:
- Shared memory: centralized shared memory (UMA) or distributed shared memory (NUMA)
- Distributed memory, a.k.a. message passing, e.g., clusters

Shared Memory Architectures
- Shared, global address space; hence called shared-address-space architectures
- Any processor can directly reference any memory location
- Communication occurs implicitly as a result of loads and stores
- Centralized: latencies to memory are uniform, but uniformly large
- Distributed: Non-Uniform Memory Access (NUMA)

Shared Memory Architecture
[Figure: Centralized shared memory — processors (P) with caches ($) connected through a network to shared memory modules (M). Distributed shared memory — each node pairs a processor and cache with a local memory (M); nodes are connected by a network.]

Distributed Memory Architecture
[Figure: processing nodes, each with a processor (P), cache ($), and private memory (M), connected by a network.]
- Also called a message-passing architecture
- Memory is private to each node
- Processes communicate by messages

Caches and Cache Coherence
Caches play a key role in all cases:
- They reduce average data access time
- They reduce bandwidth demands placed on the shared interconnect
Private processor caches create a problem:
- Copies of a variable can be present in multiple caches
- A write by processor P may not be visible to P', which keeps accessing a stale value from its own cache
- This is the cache coherence problem

Cache Coherence Problem: Example
[Figure: processors P1, P2, P3 with private caches, connected over a bus to memory (u = 5) and I/O devices. Event 1: P1 reads u (gets 5). Event 2: P3 reads u (gets 5). Event 3: P3 writes u = 7. Event 4: P1 reads u. Event 5: P2 reads u.]
- Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on which cache flushes or writes back its copy
A sketch of the same scenario as threads follows.

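A minimal sketch of the slide's scenario as threads (the thread scaffolding and names are mine; in portable C this is deliberately a data race, mimicking what software would observe if the caches were not kept coherent):

```c
#include <stdio.h>
#include <pthread.h>

int u = 5;                      /* shared variable, initially 5 in memory */

void *p3(void *arg) {
    int r = u;                  /* event 2: P3 reads u, caches 5 */
    (void)r;
    u = 7;                      /* event 3: P3 writes u = 7 */
    return NULL;
}

void *p1(void *arg) {
    int r1 = u;                 /* event 1: P1 reads u, caches 5 */
    int r2 = u;                 /* event 4: without coherence, P1 could
                                   still see the stale 5 here */
    printf("P1 saw %d then %d\n", r1, r2);
    return NULL;
}

int main(void) {
    pthread_t t1, t3;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t3, NULL, p3, NULL);
    pthread_join(t1, NULL);
    pthread_join(t3, NULL);
    return 0;
}
```
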
Cache Coherence Problem
- Multiple processors with private caches create a potential data consistency problem: the cache coherence problem
- Processes shouldn't read stale data; intuitively, reading an address should return the last value written to that address
Solutions:
- Hardware: cache coherence mechanisms — invalidation-based vs. update-based, snoopy vs. directory
- Software: compiler-assisted cache coherence

Example: Snoopy Bus Protocols
- Assumption: a shared bus interconnect on which all cache controllers monitor (snoop on) all bus activity
- Only one bus transaction is in progress at a time, so cache controllers can take corrective action and enforce coherence among the caches
- The corrective action is to update or invalidate the local copy of the cache block

Snoopy Invalidate Protocol
[Figure: same setup as the previous example. Event 1: P1 reads u (5). Event 2: P3 reads u (5). Event 3: P3 writes u = 7; the bus transaction invalidates the copy in P1's cache. Event 4: P1 reads u again, misses, and obtains the new value 7.]

Invalidate vs Update
Basic question of program behavior: is a block written by one processor later read by others before it is overwritten?
Invalidate:
- Readers will take a miss
- Multiple writes cause no additional traffic
- Clears out copies that are not used again
Update:
- Avoids misses on later references
- May generate multiple useless updates

MSI Invalidation Protocol
Cache block states (encoded in 2 bits, updated by the protocol):
- I: Invalid
- S: Shared (one or more cached copies)
- M: Modified or Dirty (the only copy)
Processor events:
- PrRd (read)
- PrWr (write)
Bus transactions:
- BusRd: asks for a copy with no intent to modify
- BusRdX: asks for a copy with intent to modify
- Flush: write back (updates main memory)

MSI: State Transition Diagram
[State diagram, shown here as a transition table. Each entry is "next state / bus action"; — means the event is ignored or needs no bus action.]

State | PrRd      | PrWr       | BusRd     | BusRdX
------+-----------+------------+-----------+-----------
I     | S / BusRd | M / BusRdX | —         | —
S     | S / —     | M / BusRdX | S / —     | I / —
M     | M / —     | M / —      | S / Flush | I / Flush

A C sketch of these transitions follows.

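A minimal C sketch of the transitions in the table above (the enum and function names are mine, not part of the protocol specification):

```c
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { PR_RD, PR_WR } proc_event_t;
typedef enum { BUS_RD, BUS_RDX } bus_event_t;

/* Processor-side transitions; *issue_rd / *issue_rdx report the bus
 * transaction the cache controller must generate, if any. */
msi_state_t msi_on_proc(msi_state_t s, proc_event_t e,
                        bool *issue_rd, bool *issue_rdx) {
    *issue_rd = *issue_rdx = false;
    switch (s) {
    case INVALID:
        if (e == PR_RD)  { *issue_rd  = true; return SHARED;   }
        else             { *issue_rdx = true; return MODIFIED; } /* PrWr */
    case SHARED:
        if (e == PR_WR)  { *issue_rdx = true; return MODIFIED; }
        return SHARED;                    /* PrRd: hit, no bus traffic */
    case MODIFIED:
        return MODIFIED;                  /* PrRd/PrWr: hit */
    }
    return s;
}

/* Snoop-side transitions; *flush reports whether the dirty block must
 * be written back (Flush) in response to the observed transaction. */
msi_state_t msi_on_bus(msi_state_t s, bus_event_t e, bool *flush) {
    *flush = false;
    switch (s) {
    case MODIFIED:
        *flush = true;                    /* supply / write back the data */
        return (e == BUS_RD) ? SHARED : INVALID;
    case SHARED:
        return (e == BUS_RD) ? SHARED : INVALID;
    case INVALID:
        return INVALID;
    }
    return s;
}
```
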
MESI (4-State) Invalidation Protocol
Problem with the MSI protocol:
- Reading and then modifying data takes 2 bus transactions even when no one is sharing the block: BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M)
Fix: add an exclusive state, so the write can happen locally without a bus transaction. Memory stays up to date, so the cache is not necessarily the owner.
States:
- Invalid
- Exclusive (exclusive-clean): only this cache has a copy, and it is not modified
- Shared: two or more caches may have copies
- Modified (dirty)

MESI: State Transition Diagram
[State diagram, shown here as a transition table. Each entry is "next state / bus action". BusRd(S) means the shared signal was asserted during the bus read; BusRd(~S) means it was not.]

State | PrRd                         | PrWr       | BusRd     | BusRdX
------+------------------------------+------------+-----------+-----------
I     | E / BusRd(~S), S / BusRd(S)  | M / BusRdX | —         | —
E     | E / —                        | M / —      | S / Flush | I / Flush
S     | S / —                        | M / BusRdX | S / —     | I / —
M     | M / —                        | M / —      | S / Flush | I / Flush

Scalability Issues of the Snoopy Protocol
- Snoopy caches are ideally suited to a bus-based interconnect, but a shared bus saturates beyond a small number of processors (around 8)
- On a non-bus interconnect, coherence messages must be broadcast, which is expensive
- Often only a few processors have a copy of the shared data, so it can be more efficient to maintain a directory of the caches that hold a copy of each cache block

Directory-Based Coherence
- Memory (or a cache) maintains a list — the directory — of the processors that have a copy of each block
- On a write, the memory controller sends an invalidate (or update) signal only to the processors that have a copy
- Memory also knows the current owner (for dirty blocks): the memory controller requests the updated copy from the owner
A sketch of a directory entry and write handling follows.

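A hedged sketch of a directory entry and of write handling (the names, the 64-node size, and the message stubs are all illustrative assumptions, not from the slides):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 64                    /* illustrative system size */

/* One directory entry per memory block: a presence bit per node plus a
 * dirty bit, matching the directory contents shown on the next slide. */
typedef struct {
    uint64_t presence;                  /* bit i set => node i caches the block */
    bool     dirty;                     /* one node holds a modified copy */
} dir_entry_t;

/* Stubs standing in for the interconnect messages. */
static void send_invalidate(int node, uint64_t block) {
    printf("invalidate: node %d, block %llu\n", node, (unsigned long long)block);
}
static void request_writeback(int node, uint64_t block) {
    printf("writeback:  node %d, block %llu\n", node, (unsigned long long)block);
}

/* On a write by `writer`, contact only the nodes that actually hold a copy. */
static void dir_handle_write(dir_entry_t *d, uint64_t block, int writer) {
    for (int n = 0; n < NUM_NODES; n++) {
        if (!(d->presence & (1ULL << n)) || n == writer)
            continue;
        if (d->dirty)
            request_writeback(n, block);   /* fetch the up-to-date copy first */
        send_invalidate(n, block);
    }
    d->presence = 1ULL << writer;          /* writer becomes the sole holder */
    d->dirty = true;
}

int main(void) {
    dir_entry_t e = { .presence = (1ULL << 3) | (1ULL << 7), .dirty = false };
    dir_handle_write(&e, 42, 3);           /* node 3 writes block 42 */
    return 0;
}
```
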
Generic Solution: Directories
[Figure: nodes, each containing a processor (P1), cache, memory, and a communication assist, connected by a scalable interconnection network. Each memory block has a directory entry holding presence bits (one per node) and a dirty bit.]

Memory Consistency Model
- Defines the order in which memory operations will appear to execute, i.e., what value a read can return
- A contract between application software and the system
- Affects both ease of programming and performance

Understanding Program Order: Example
Initially A = B = 0;

Process P1      Process P2          Process P3
A = 1;          while (A == 0);     while (B == 0);
                B = 1;              print A;

What value of A will be printed by process P3? Note the role of program order in ensuring that P3 reads the value of A as 1. The sketch below expresses this with C11 atomics.

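The same three processes in C11, as a sketch (the thread scaffolding is mine): the default seq_cst atomics provide the sequentially consistent behavior the slide assumes, so P3 always prints 1.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>

atomic_int A = 0, B = 0;

void *p1(void *arg) { atomic_store(&A, 1); return NULL; }

void *p2(void *arg) {
    while (atomic_load(&A) == 0) ;       /* spin until A == 1 */
    atomic_store(&B, 1);
    return NULL;
}

void *p3(void *arg) {
    while (atomic_load(&B) == 0) ;       /* spin until B == 1 */
    printf("A = %d\n", atomic_load(&A)); /* must print 1 under SC */
    return NULL;
}

int main(void) {
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t3, NULL, p3, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    return 0;
}
```
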
Example 2
Software implementation of mutual exclusion:

Process P1          Process P2
A = 0;              B = 0;
...                 ...
A = 1;              B = 1;
if (B == 0)         if (A == 0)
  critical section    critical section

Can both P1 and P2 enter the critical section, i.e., can both evaluate the "if" condition as true? A C11 rendering follows.

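A C11 sketch of the flag protocol (scaffolding mine): with the default seq_cst ordering, at least one process must observe the other's flag set, so both can never enter; under weaker orderings the write -> read reordering the slide asks about becomes possible.

```c
#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>

atomic_int A = 0, B = 0;
atomic_int entered = 0;                  /* counts critical-section entries */

void *p1(void *arg) {
    atomic_store(&A, 1);                 /* announce intent... */
    if (atomic_load(&B) == 0)            /* ...then check the other flag */
        atomic_fetch_add(&entered, 1);   /* "critical section" */
    return NULL;
}

void *p2(void *arg) {
    atomic_store(&B, 1);
    if (atomic_load(&A) == 0)
        atomic_fetch_add(&entered, 1);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL);
    /* Under SC semantics, entered is 0 or 1, never 2. */
    printf("entered = %d\n", atomic_load(&entered));
    return 0;
}
```
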
Sequential Consistency: Definition
A system is sequentially consistent if:
- Operations within a processor follow program order
- Operations of all processors appear to execute in some (interleaved) sequential order
- All processors see the same sequential order

Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if
- Operations from different processors executed in some sequential (interleaved) order
- Memory operations of each process occurred in program order
[Figure: processors P1, P2, P3, ..., Pn all connected through a single switch to one shared MEMORY.]

Sequential Consistency: Definition (revisited)
A system is sequentially consistent if:
- Operations within a processor follow program order
- Operations of all processors appear to execute in some (interleaved) sequential order
- All processors see the same sequential order

Initially A = B = 0;

Process P1      Process P2          Process P3
A = 1;          while (A == 0);     while (B == 0);
                B = 1;              print A;

Under SC, can P3 print A as 0?
Initially A = B = 0;

Process P1      Process P2              Process P3
(w1) A = 1;     (r2) while (A == 0);    (r3) while (B == 0);
                (w2) B = 1;             (r3') print A;

Ordering chain: w1 -> r2 (P2 leaves its loop only after reading A = 1), r2 -> w2 (program order), w2 -> r3 (P3 leaves its loop only after reading B = 1), r3 -> r3' (program order). Under SC this chain forces r3' to return 1, so P3 cannot print 0.

Sequential Consistency
SC enforces all memory orders:
- Write -> Read
- Write -> Write
- Read -> Read
- Read -> Write
SC treats all memory operations the same way!

Sequential Consistency: Conditions
- Before a load is allowed to perform w.r.t. any other processor, all previous loads must be globally performed and all previous stores must be performed
- Before a store is allowed to perform w.r.t. any other processor, all previous loads must be globally performed and all previous stores must be performed
In other words, the read -> read, read -> write, write -> read, and write -> write orders are all maintained.

Processor Consistency: Definition
A system is processor consistent if:
- Writes issued by a processor are observed in program order: the read -> read, read -> write, and write -> write orders are enforced, but not write -> read
- Operations of all processors appear to execute in some (interleaved) sequential order
- All processors need not see the same sequential order of writes from different processors
Revisit the two earlier examples under this model: the program-order example (P1/P2/P3 with the while loops) and the software mutex (P1/P2).

Example 2 (revisited)
Process P1          Process P2
A = 0;              B = 0;
...                 ...
A = 1;              B = 1;
if (B == 0)         if (A == 0)
  critical section    critical section

Under processor consistency the write -> read order is not enforced, so each read may bypass the preceding write and both processes can enter the critical section.

Weak Consistency
Distinguishes between ordinary memory operations and synchronization operations (e.g., lock acquire/release).
A system is weakly consistent if:
- Before a load/store is allowed to perform, all previous synchronization accesses must be performed
- Before a synchronization operation is performed, all previous loads/stores must be performed
- Synchronization accesses are sequentially consistent

Weak Consistency
Weak ordering: divide memory operations into data operations and synchronization operations. Synchronization operations act like a fence:
- All data operations before the synch in program order must complete before the synch executes
- All data operations after the synch in program order must wait for the synch to complete
- Synchs are performed in program order
A C11 sketch of this fence behavior follows.

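A hedged C11 analogue of the fence behavior described above (C11 fences are not identical to weak ordering, but they illustrate the idea; all names are mine):

```c
#include <stdatomic.h>

int data0, data1;                /* ordinary (data) operations */
atomic_int sync_flag;            /* stands in for a synchronization op */

void producer(void) {
    data0 = 42;                                   /* data ops before the sync */
    data1 = 43;
    atomic_thread_fence(memory_order_release);    /* prior data ops complete
                                                     before the sync */
    atomic_store_explicit(&sync_flag, 1, memory_order_relaxed);
}

void consumer(void) {
    while (atomic_load_explicit(&sync_flag, memory_order_relaxed) == 0)
        ;                                         /* wait for the sync */
    atomic_thread_fence(memory_order_acquire);    /* later data ops wait
                                                     for the sync */
    int x = data0 + data1;                        /* guaranteed to see 42, 43 */
    (void)x;
}
```
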
Weak Consistency
Weak ordering, implementation of the fence: the processor keeps a counter that is incremented when a data operation is issued and decremented when it completes; the fence stalls until the counter drains to zero.
Example: the PowerPC SYNC instruction.
A rough software analogue of the counter appears below.

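A rough software analogue of that counter, purely for illustration (real hardware tracks outstanding operations in the load/store machinery, not in software):

```c
#include <stdatomic.h>

atomic_int pending_ops;          /* outstanding data operations */

void issue_data_op(void)    { atomic_fetch_add(&pending_ops, 1); }
void complete_data_op(void) { atomic_fetch_sub(&pending_ops, 1); }

/* A fence (e.g., PowerPC SYNC) simply stalls until the counter drains
 * to zero, i.e., all previously issued data ops have completed. */
void fence(void) {
    while (atomic_load(&pending_ops) != 0)
        ;                        /* stall */
}
```
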
An Example
[Figure: a sequence Load, Store, Load, Store. Under sequential consistency, every successive pair is ordered. Under processor consistency, the Store -> Load order is relaxed; the remaining orders are kept.]

Example: Weak Consistency
[Figure: Sync(Acq), then a block of Loads/Stores, then Sync(Rel); followed by another Sync(Acq) ... Sync(Rel) block. Loads/stores are ordered only with respect to the sync operations — there is no ordering among the loads/stores within a block!]

Another Model: Release Consistency
Synchronization accesses are divided into:
- Acquires: operations like lock
- Releases: operations like unlock
Semantics of acquire: the acquire must complete before all memory accesses that follow it.
Semantics of release: all memory operations before the release must complete, but accesses after the release in program order do not have to wait for it; operations that follow the release and need to wait must be protected by an acquire.

Release Consistency
Further distinguishes between lock-acquire and lock-release synchronization operations.
A system is release consistent if:
- Before a load/store is allowed to perform, all previous acquire accesses must be performed
- Before a release synchronization operation is performed, all previous loads/stores must be performed
- Synchronization accesses are processor consistent
A C11 acquire/release spinlock sketch follows.

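A sketch of a spinlock with acquire/release orderings in C11, matching the semantics above (names are mine): the acquire keeps later accesses from moving above the lock, and the release keeps earlier accesses from moving below it.

```c
#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;
int shared_counter;              /* data protected by the lock */

void lock_acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                        /* spin: later accesses cannot float above */
}

void lock_release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
                                 /* earlier accesses must complete first */
}

void increment(void) {
    lock_acquire();
    shared_counter++;            /* safely inside the critical section */
    lock_release();
}
```
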
Example: Release Consistency
[Figure, contrasting the two models. Weak consistency: Sync(Acq), Loads/Stores, Sync(Rel), Sync(Acq), ..., Sync(Rel) — loads/stores are fenced on both sides by every sync. Release consistency: Acquire, Loads/Stores, Release, Acquire, ..., Release — an acquire is treated as a READ/LOAD and a release as a WRITE/STORE, so accesses may move into, but not out of, the region they protect.]

Ordering in Consistency Models
[The slide's table, reconstructed. "X -> Y" means the model enforces that order; R = read, W = write, S_A = synchronization acquire, S_R = synchronization release.]

Model | Orders enforced
------+------------------------------------------------------------------
SC    | R->R, R->W, W->R, W->W
PC    | R->R, R->W, W->W (W->R is relaxed)
WC    | S->R, S->W, R->S, W->S; S_A->S_A, S_A->S_R, S_R->S_R, S_R->S_A
RC    | S_A->R, S_A->W, R->S_R, W->S_R; S_A->S_A, S_A->S_R, S_R->S_R

Reading Material
- S. V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial", WRL Research Report 95/7. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
- K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", ISCA 1990.

Term Project Steps
[Step 0] Choose a team of 2 (you can identify your partner)
[Step 1] Choose an area of interest
[Step 2] Read some (recent) papers and form a hypothesis
[Step 3] Check the literature to see whether it has already been studied; if yes, go back to Step 1
[Step 4] Is the study feasible in 2 months? If not, go back to Step 1
[Step 5] Do some initial study (experimentation)
[Step 6] Analyse, report, relief! :-)

Term Project Expectations
- A non-trivial project; it should have an element of surprise
- Be ambitious, but realistic too!
- You must learn/get something new!
- You can iterate on your ideas with me during the next two weeks
- Look beyond what we have discussed so far in class; try to choose some recent papers/topics!

Term Project Schedule
- Proposal: due by Sept. 20 (revised: Sept. 30)
- First review: by Oct. 18 (revised: Oct. 29)
- Report and demo/presentation: Nov. 29
