CS492B Analysis of Concurrent Programs: Lock Basics. Jaehyuk Huh, Computer Science, KAIST.


Consistency Model

Lock in Shared Memory Spin lock: the processor continuously tries to acquire the lock, spinning in a loop until it gets it. Lock acquire:
        li    R2, #1
lockit: lw    R3, 0(R1)    ; load lock variable
        bnez  R3, lockit   ; ≠ 0 → not free → spin
        sw    R2, 0(R1)
– Does it work? Lock release:
        sw    R0, 0(R1)    ; R0 = 0

Why We Need Atomic Load and Store Both threads run the same acquire sequence:
    thread 0              thread 1
    li   R2, #1           li   R2, #1
    lw   R3, 0(R1)        lw   R3, 0(R1)
    bnez R3, lockit       bnez R3, lockit
    sw   R2, 0(R1)        sw   R2, 0(R1)
Both threads can acquire the lock. Why? If both load the variable before either stores, both see 0 (free) and both write 1. The value must not change between the load and the store → we need an atomic load and store.

Hardware Support For Locks Atomic exchange: interchange a value in a register with a value in memory
– 0 → synchronization variable is free
– 1 → synchronization variable is locked and unavailable
– Set the register to 1 and swap; the value left in the register tells you whether you got the lock:
    0 if you succeeded in setting the lock (you were first)
    1 if another processor had already claimed it
– The key is that the exchange operation is indivisible
Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
– 0 → synchronization variable is free

Spin Lock Implementation Spin lock with atomic exchange:
        li    R2, #1
lockit: exch  R2, 0(R1)    ; atomic exchange
        bnez  R2, lockit   ; already locked?
What about a multiprocessor with cache coherence?
– Want to spin on the cached copy to avoid full memory latency
– Likely to get cache hits for such variables

Spin Lock Implementation Problem: the exchange includes a write, which invalidates all other cached copies; this generates considerable bus traffic. Solution: start by simply repeatedly reading the variable; only when it changes, try the exchange ("test and test-and-set"):
try:    li    R2, #1
lockit: lw    R3, 0(R1)    ; load lock variable
        bnez  R3, lockit   ; ≠ 0 → not free → spin
        exch  R2, 0(R1)    ; atomic exchange
        bnez  R2, try      ; already locked?

Hardware Support For Locks It is hard to put a read and a write in one instruction, so use two instead: load linked (or load locked) + store conditional.
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
Atomic swap with LL & SC:
try:    mov   R3, R4       ; move exchange value
        ll    R2, 0(R1)    ; load linked
        sc    R3, 0(R1)    ; store conditional
        beqz  R3, try      ; branch if store fails (R3 = 0)
        mov   R4, R2       ; put loaded value in R4
Fetch-and-increment with LL & SC:
try:    ll    R2, 0(R1)    ; load linked
        addi  R2, R2, #1   ; increment (OK, register-register)
        sc    R2, 0(R1)    ; store conditional
        beqz  R2, try      ; branch if store fails (R2 = 0)

Lock Implementation: LL & SC Using LL & SC to implement a lock; the LL does not cause any bus traffic:
lockit: ll     R2, 0(R1)    ; load linked
        bnez   R2, lockit   ; ≠ 0 → not free → spin
        daddui R2, R0, #1   ; R2 = 1
        sc     R2, 0(R1)    ; store conditional
        beqz   R2, lockit   ; retry if the store fails (R2 = 0)

How to Implement Atomic Load-Store Atomic exchange (atomic load-and-store)
– Internally separated by hardware into a load part and a store part (one instruction visible to software)
– The load part invalidates other caches
– Until the store part completes, invalidations from other caches are held off (if other processors need to write the variable, make them wait)
Load-linked / store-conditional
– Remember the address of the last load-linked
– An invalidation from another processor clears the load-linked address (sets it to 0)
– The store-conditional fails if the load-linked address is 0

Programming With Locks Writing good programs with locks is tricky. Coarse-grained locking
– One lock for a large data structure shared by many processors
– The entire data structure may not be used by all processors
– Programming is simple, but performance can be bad (too much lock contention)
Fine-grained locking
– Many fine-grained locks for different parts of a large data structure
– Different parts can be updated by multiple processors simultaneously
– Programming is difficult: many locks to maintain
Can hardware remove the need for locks?

Programming with Locks Avoid data races in parallel programs.
– A data race: multiple threads access a shared memory location in an undetermined order, and at least one access is a write
– Example: what if every thread executes total_count += local_count, where total_count is a global variable, without proper synchronization?
Writing highly parallel and correctly synchronized programs is difficult.
– A correct parallel program has no data races → shared data must be protected by locks
Common problems with locking
– Priority inversion: a higher-priority process waits for a lower-priority process holding a lock
– Lock convoying: occurs under high contention on locks
– Deadlock: gets worse with many fine-grained locks
– Locking granularity issues

Coarse-Grain Locks Lock the entire data structure → correct but slow.
+ Easy to guarantee correctness: avoids any possible interference between threads
- Limits parallelism: only a single thread may access the data at a time
Example:
struct acct_t accounts[MAX_ACCT];

acquire(lock);
if (accounts[id].balance >= amount) {
    accounts[id].balance -= amount;
    give_cash();
}
release(lock);

Fine-Grain Locks Lock only part of the shared data structure → more parallel but harder to program.
+ Reduces the portion locked by a processor at a time → fast
- Harder to get right → easy to make mistakes
- May require multiple locks for one task → deadlocks
Example:
struct acct_t accounts[MAX_ACCT];

acquire(accounts[id].lock);
if (accounts[id].balance >= amount) {
    accounts[id].balance -= amount;
    give_cash();
}
release(accounts[id].lock);

Difficulty of Fine-grain Locks May need multiple locks for one task.
– Example: account-to-account transfer → need two locks
acquire(accounts[id_from].lock);
acquire(accounts[id_to].lock);
if (accounts[id_from].balance >= amount) {
    accounts[id_from].balance -= amount;
    accounts[id_to].balance += amount;
}
release(accounts[id_from].lock);
release(accounts[id_to].lock);
Deadlock: a circular wait for shared resources.
– Thread 0: id_from = 10, id_to = 20
– Thread 1: id_from = 20, id_to = 10
Thread 0                                 Thread 1
acquire(accounts[10].lock)               acquire(accounts[20].lock)
// try acquire(accounts[20].lock)        // try acquire(accounts[10].lock)
// waiting for accounts[20].lock         // waiting for accounts[10].lock

Difficulty of Fine-grain Locks II Avoiding deadlock: acquire all locks in the same order.
id_first  = min(id_from, id_to);
id_second = max(id_from, id_to);

acquire(accounts[id_first].lock);
acquire(accounts[id_second].lock);
if (accounts[id_from].balance >= amount) {
    accounts[id_from].balance -= amount;
    accounts[id_to].balance += amount;
}
release(accounts[id_second].lock);
release(accounts[id_first].lock);
There are many more complex cases with locks.
– Lock-based programming is difficult → easy to make mistakes
– May lead to deadlocks or performance problems
– May still cause race conditions if the locks are not used carefully

Lock Overhead with No Contention Lock variables do not contain real data → they exist only to make program execution correct.
– They consume extra memory (and cache space) → worse with fine-grain locks
Acquiring locks is expensive.
– Requires slow atomic instructions (atomic swap, load-linked/store-conditional)
– Requires write permission on the lock variable's cache line
Efficient parallel programs must not have much lock contention.
– Most of the time, locks do nothing: only one thread is accessing the shared location at a time
– Yet the lock must still be acquired to protect the shared location (even if, say, only 1% of accesses actually conflict)