Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 12: May 3, 2003 Shared Memory

Caltech CS184 Spring DeHon 2 Today Shared Memory –Model –Bus-based Snooping –Cache Coherence Synchronization –Primitives –Algorithms –Performance

Caltech CS184 Spring DeHon 3 Shared Memory Model Same model as multithreaded uniprocessor –Single, shared, global address space –Multiple threads (PCs) –Run in same address space –Communicate through memory Memory appears identical across threads Communication hidden from users (looks like a memory op)

Caltech CS184 Spring DeHon 4 Synchronization For correctness, have to worry about synchronization –Otherwise, non-deterministic behavior –Threads run asynchronously –Without an additional synchronization discipline, cannot say anything about relative timing

Caltech CS184 Spring DeHon 5 Models Conceptual model: –Processor per thread –Single shared memory Programming Model: –Sequential language –Thread Package –Synchronization primitives Architecture Model: Multithreaded uniprocessor

Caltech CS184 Spring DeHon 6 Conceptual Model [Figure: one processor per thread, all connected to a single shared memory]

Caltech CS184 Spring DeHon 7 Architecture Model Implications Coherent view of memory –Any processor reading at time X will see the same value –All writes eventually take effect in memory Until overwritten –Writes to memory are seen in the same order by all processors Sequentially Consistent Memory View

Caltech CS184 Spring DeHon 8 Sequential Consistency Memory must reflect some valid sequential interleaving of the threads

Caltech CS184 Spring DeHon 9 Sequential Consistency P1: A = 0 A = 1 L1: if (B==0) P2: B = 0 B = 1 L2: if (A==0) Can both conditionals be true?

Caltech CS184 Spring DeHon 10 Sequential Consistency P1: A = 0 A = 1 L1: if (B==0) P2: B = 0 B = 1 L2: if (A==0) Both can be false

Caltech CS184 Spring DeHon 11 Sequential Consistency P1: A = 0 A = 1 L1: if (B==0) P2: B = 0 B = 1 L2: if (A==0) If enter L1, then A must be 1  not enter L2

Caltech CS184 Spring DeHon 12 Sequential Consistency P1: A = 0 A = 1 L1: if (B==0) P2: B = 0 B = 1 L2: if (A==0) If enter L2, then B must be 1  not enter L1

Caltech CS184 Spring DeHon 13 Coherence Alone Coherent view of memory –Any processor reading at time X will see the same value –All writes eventually take effect in memory Until overwritten –Writes to memory are seen in the same order by all processors Coherence alone does not guarantee sequential consistency

Caltech CS184 Spring DeHon 14 Sequential Consistency P1: A = 0 A = 1 L1: if (B==0) P2: B = 0 B = 1 L2: if (A==0) If the writes to the variables (the assignments of A and B) are not forced to become visible, execution could end up inside both.
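
To make this concrete, here is a minimal C sketch of the slide's two-thread example (a sketch assuming pthreads and C11 <stdatomic.h>; the enteredL1/enteredL2 flags and the function names are illustrative additions, not from the slides). With relaxed-ordering atomics both conditionals may be observed true, which is the "end up inside both" case; compiling with the default sequentially consistent atomics instead restores the guarantee argued on the preceding slides.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int A, B;                 /* the slide's shared variables */
    int enteredL1, enteredL2;        /* illustrative: record which branches were taken */

    void *p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);        /* A = 1 */
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)   /* L1: if (B==0) */
            enteredL1 = 1;
        return NULL;
    }

    void *p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);        /* B = 1 */
        if (atomic_load_explicit(&A, memory_order_relaxed) == 0)   /* L2: if (A==0) */
            enteredL2 = 1;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        atomic_store(&A, 0);                                        /* A = 0 */
        atomic_store(&B, 0);                                        /* B = 0 */
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Under sequential consistency at most one of these can be 1;
           with relaxed ordering on real hardware both may be 1. */
        printf("entered L1=%d, entered L2=%d\n", enteredL1, enteredL2);
        return 0;
    }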

Caltech CS184 Spring DeHon 15 Consistency Deals with when a written value must be seen by readers Coherence – w/ respect to the same memory location Consistency – w/ respect to other memory locations …there are also less strict consistency models…

Caltech CS184 Spring DeHon 16 Implementation

Caltech CS184 Spring DeHon 17 Naïve What's wrong with the naïve model? [Figure: processors connected directly to a single shared memory, no caches]

Caltech CS184 Spring DeHon 18 What’s Wrong? Memory bandwidth –1 instruction reference per instruction –0.3 memory references per instruction –333ps cycle –N*5 Gwords/s ? Interconnect Memory access latency
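
Working the slide's numbers: roughly 1 + 0.3 = 1.3 memory references per instruction at one instruction per 333 ps cycle is about 1.3 / 0.333 ns ≈ 4×10⁹ references per second per processor, i.e. on the order of 4–5 Gwords/s each, so N processors together demand roughly N × 5 Gwords/s from a single shared memory and interconnect, and every reference also pays the full memory access latency.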

Caltech CS184 Spring DeHon 19 Optimizing How do we improve?

Caltech CS184 Spring DeHon 20 Naïve Caching What happens when we add caches to the processors? [Figure: each processor P with its own cache $, all sharing one memory]

Caltech CS184 Spring DeHon 21 Naïve Caching Cached answers may be stale and shadow the correct value

Caltech CS184 Spring DeHon 22 How have both? Keep caching –Reduces main memory bandwidth –Reduces access latency …and still satisfy the model

Caltech CS184 Spring DeHon 23 Cache Coherence Make sure everyone sees same values Avoid having stale values in caches At end of write, all cached values should be the same

Caltech CS184 Spring DeHon 24 Idea Make sure everyone sees the new value Broadcast new value to everyone who needs it –Use bus in shared-bus system [Figure: processors with caches broadcasting writes on a shared bus to memory]

Caltech CS184 Spring DeHon 25 Effects Memory traffic is now just: –Cache misses –All writes

Caltech CS184 Spring DeHon 26 Additional Structure? Only necessary to write/broadcast a value if someone else has it cached Can write locally if we know we are the sole owner –Reduces main memory traffic –Reduces write latency

Caltech CS184 Spring DeHon 27 Idea Track usage in cache state "Snoop" on shared bus to detect changes in state [Figure: a RD of address 0300 on the shared bus is snooped by the other caches; one notices it has a copy]

Caltech CS184 Spring DeHon 28 Cache State Data in cache can be in one of several states –Not cached (not present) –Exclusive (not shared) Safe to write to –Shared Must share writes with others Update state with each memory op

Caltech CS184 Spring DeHon 29 Cache Protocol [Culler/Singh/Gupta 5.13] RdX = Read Exclusive Perform Write by: Reading exclusive Writing locally
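
A toy C sketch of the transitions this protocol slide implies (a simplified MSI-style snooping protocol under my own naming, not the exact Culler/Singh/Gupta diagram): a write is performed by first reading exclusive (RdX) and then writing locally, while other caches snoop the bus to keep their copies consistent.

    /* Simplified MSI-style snooping transitions (illustrative names). */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
    typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_op_t;   /* BUS_RDX = read exclusive */

    /* Processor-side access to a line: returns the bus transaction to issue. */
    bus_op_t on_processor_access(line_state_t *s, int is_write) {
        switch (*s) {
        case INVALID:                          /* miss: fetch the line */
            *s = is_write ? MODIFIED : SHARED;
            return is_write ? BUS_RDX : BUS_RD;
        case SHARED:
            if (is_write) {                    /* must gain exclusivity before writing */
                *s = MODIFIED;
                return BUS_RDX;
            }
            return BUS_NONE;                   /* read hit: no bus traffic */
        case MODIFIED:
            return BUS_NONE;                   /* sole owner: read or write locally */
        }
        return BUS_NONE;
    }

    /* Snoop-side: another cache's bus transaction touches a line we hold. */
    void on_snoop(line_state_t *s, bus_op_t op) {
        if (op == BUS_RDX)
            *s = INVALID;                      /* someone else will write: drop our copy */
        else if (op == BUS_RD && *s == MODIFIED)
            *s = SHARED;                       /* supply the data and downgrade */
    }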

Caltech CS184 Spring DeHon 30 Snoopy Cache Organization [Culler/Singh/Gupta 6.4]

Caltech CS184 Spring DeHon 31 Cache States Extra bits in cache –Like valid, dirty

Caltech CS184 Spring DeHon 32 Misses Numbers are cache line sizes [Culler/Singh/Gupta 5.23]

Caltech CS184 Spring DeHon 33 Misses [Culler/Singh/Gupta 5.27]

Caltech CS184 Spring DeHon 34 Synchronization

Caltech CS184 Spring DeHon 35 Problem If correctness requires an ordering between threads, –we have to enforce it This was not a problem in the single-thread case –it does occur with multiple threads on a single processor

Caltech CS184 Spring DeHon 36 Desired Guarantees Precedence –barrier synchronization Everything before barrier completes before anything after begins –producer-consumer Consumer reads value produced by producer Atomic Operation Set Mutual exclusion
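
For the precedence (barrier) guarantee, a minimal sketch using the POSIX barrier API (assuming pthreads; NTHREADS and worker are illustrative names):

    #include <pthread.h>

    #define NTHREADS 4
    pthread_barrier_t barrier;   /* initialize once: pthread_barrier_init(&barrier, NULL, NTHREADS) */

    void *worker(void *arg) {
        (void)arg;
        /* ... everything before the barrier ... */
        pthread_barrier_wait(&barrier);   /* no thread proceeds until all NTHREADS have arrived */
        /* ... anything after the barrier begins only now ... */
        return NULL;
    }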

Caltech CS184 Spring DeHon 37 Read/Write Locks? Try to implement a lock with ordinary reads/writes: if (~A.lock) A.lock=true do stuff A.lock=false

Caltech CS184 Spring DeHon 38 Problem with R/W locks? Consider a context switch between the test (~A.lock true?) and the assignment (A.lock=true): both threads can pass the test before either sets the lock, so both enter the critical section. if (~A.lock) A.lock=true do stuff A.lock=false

Caltech CS184 Spring DeHon 39 Primitive Need Need an indivisible primitive to enable atomic operations

Caltech CS184 Spring DeHon 40 Original Examples Test-and-set –combine the test of A.lock and the set into a single atomic operation –once we have the lock, can guarantee mutual exclusion at a higher level Read-Modify-Write –atomic read…write sequence Exchange

Caltech CS184 Spring DeHon 41 Examples (cont.) Exchange –Exchange true with A.lock –if the value retrieved was false, this process got the lock –if the value retrieved was true, it was already locked (and we didn't change the value); keep trying –key is, only a single exchanger gets back the false value
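
A minimal sketch of the exchange idea in C11, where atomic_exchange plays the role of the hardware exchange primitive (lock_flag stands in for A.lock; names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool lock_flag;                   /* A.lock, initially false */

    void lock_acquire(void) {
        /* Exchange true with the lock; whoever gets false back owns the lock. */
        while (atomic_exchange(&lock_flag, true)) {
            /* got true: already locked (and we didn't change it); keep trying */
        }
    }

    void lock_release(void) {
        atomic_store(&lock_flag, false);
    }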

Caltech CS184 Spring DeHon 42 Implementing... What is required to implement these? –Uniprocessor –Bus-based

Caltech CS184 Spring DeHon 43 Implement: Uniprocessor Prevent interrupt/context switch during the primitive Primitives use a single address –so any page fault happens at the beginning –then OK to complete the computation (defer other faults…) SMT?

Caltech CS184 Spring DeHon 44 Implement: Snoop Bus Need to reserve the bus for the write –write-through: hold the bus between the read and the write Guarantees no operation can intervene –write-back: need an exclusive read and a way to defer other writes until the write completes

Caltech CS184 Spring DeHon 45 Performance Concerns? Locking resources reduce parallelism Bus (network) traffic Processor utilization Latency of operation

Caltech CS184 Spring DeHon 46 Basic Synch. Components Acquisition Waiting Release

Caltech CS184 Spring DeHon 47 Possible Problems Spin wait generates considerable memory traffic Release traffic Bottleneck on resources Invalidation –can’t cache locally… Fairness

Caltech CS184 Spring DeHon 48 Test-and-Set Try: t&s R1, A.lock bnz R1, Try return The simple algorithm generates considerable traffic with p contenders –p try first, 1 wins –for O(1) time, p-1 spin –…then p-2… –total c·(p + (p-1) + (p-2) + …) = c·p(p+1)/2 –O(p²)

Caltech CS184 Spring DeHon 49 Backoff Instead of immediately retrying –wait some time before retrying –reduces contention –may increase latency (what if I'm the only contender and the lock is about to be released?)
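
A hedged sketch of backoff layered on the exchange-lock example above (the delay constants are illustrative, not tuned values):

    #include <stdatomic.h>
    #include <stdbool.h>
    extern atomic_bool lock_flag;            /* from the exchange-lock sketch above */

    /* Retry with capped exponential backoff instead of hammering the lock. */
    void lock_acquire_backoff(void) {
        unsigned delay = 1;
        while (atomic_exchange(&lock_flag, true)) {
            for (volatile unsigned i = 0; i < delay; i++)
                ;                            /* spin locally, off the bus */
            if (delay < 1024)
                delay *= 2;                  /* back off further on repeated failure */
        }
    }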

Caltech CS184 Spring DeHon 50 Primitive Bus Performance [Culler/Singh/Gupta 5.29]

Caltech CS184 Spring DeHon 51 Bad Effects Performance decreases as the number of users grows –from the growing traffic already noted

Caltech CS184 Spring DeHon 52 Test-test-and-Set Try: ld R1, A.lock bnz R1, Try t&s R1, A.lock bnz R1, Try return The read can be satisfied from the local cache –spinning does not generate bus traffic –generates less contention traffic
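
The same idea in C11, continuing the exchange-lock sketch above (a test-and-test-and-set loop; names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>
    extern atomic_bool lock_flag;            /* from the exchange-lock sketch above */

    void lock_acquire_ttas(void) {
        for (;;) {
            /* Spin on an ordinary load: served from the local cache, no bus traffic. */
            while (atomic_load(&lock_flag))
                ;
            /* Lock looked free; only now pay for the atomic exchange. */
            if (!atomic_exchange(&lock_flag, true))
                return;                      /* exchange returned false: we own the lock */
        }
    }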

Caltech CS184 Spring DeHon 53 Detecting atomicity sufficient It is sufficient to detect whether the operation will appear atomic Pair of instructions –ll -- load locked: load the value and mark the cache line as locked –sc -- store conditional: stores the value iff there was no intervening write to the address e.g. the cache line was never invalidated by a write

Caltech CS184 Spring DeHon 54 LL/SC operation (R2 holds the value to store, e.g. 1) Try: LL R1, A.lock BNZ R1, Try SC R2, A.lock BEQZ R2, Try return from lock

Caltech CS184 Spring DeHon 55 LL/SC The pair doesn't really lock the value Just detects whether the result would appear that way Ok to have arbitrary interleaving between LL and SC Ok to have a capacity eviction between LL and SC –the SC will just fail and we retry
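
Portable C has no direct LL/SC, but atomic_compare_exchange_weak is the closest analogue: on LL/SC machines it typically compiles to an LL/SC pair, and the "weak" form may fail spuriously, mirroring an SC that loses its reservation. A sketch under those assumptions, continuing the lock_flag example:

    #include <stdatomic.h>
    #include <stdbool.h>
    extern atomic_bool lock_flag;            /* from the exchange-lock sketch above */

    bool try_lock(void) {
        bool expected = false;
        /* Succeeds only if lock_flag was false and nothing intervened;
           like SC, the weak form may fail spuriously, so callers just retry. */
        return atomic_compare_exchange_weak(&lock_flag, &expected, true);
    }

    void lock_acquire_llsc_style(void) {
        while (!try_lock())
            ;                                /* failed or lost the reservation: retry */
    }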

Caltech CS184 Spring DeHon 56 LL/SC and MP Traffic Address can be cached Spinning on LL does not generate global traffic (everyone has their own copy) After a write (e.g. unlock) –everyone misses -- O(p) message traffic No need to lock down the bus during the operation

Caltech CS184 Spring DeHon 57 Performance Bus [Culler/Singh/Gupta 5.30] [talk about array+ticket later]

Caltech CS184 Spring DeHon 58 Big Ideas Simple Model –Preserve model –While optimizing implementation Exploit Locality –Reduce bandwidth and latency

Caltech CS184 Spring DeHon 59 Big Ideas Simple primitives –Must have primitives to support atomic operations –don't have to implement them atomically, just detect non-atomicity Make the fast case common –optimize for locality –minimize contention