Lecture 12: TM, Consistency Models

Lecture 12: TM, Consistency Models
Topics: TM pathologies, sequential consistency, HW and HW/SW optimizations

Paper on TM Pathologies (ISCA'07)
LL: lazy versioning, lazy conflict detection, committing transaction wins conflicts
EL: lazy versioning, eager conflict detection, requester succeeds and others abort
EE: eager versioning, eager conflict detection, requester stalls
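The three design points boil down to a conflict-resolution policy. A minimal sketch (not code from the paper; all names are illustrative) that restates the mapping:

```cpp
// Minimal sketch, not code from the paper: the LL/EL/EE design points
// restated as a conflict-resolution lookup. All names are illustrative.
#include <cstdio>

enum class Versioning { Lazy, Eager };   // how speculative writes are kept
enum class Detection  { Lazy, Eager };   // when conflicts are noticed
enum class Resolution { CommitterWins, RequesterWins, RequesterStalls };

Resolution resolve(Versioning vm, Detection cd) {
    if (cd == Detection::Lazy)  return Resolution::CommitterWins;    // LL
    if (vm == Versioning::Lazy) return Resolution::RequesterWins;    // EL
    return Resolution::RequesterStalls;                              // EE
}

int main() {
    const char* names[] = {"committer wins", "requester wins", "requester stalls"};
    std::printf("EE -> %s\n",
                names[static_cast<int>(resolve(Versioning::Eager, Detection::Eager))]);
    return 0;
}
```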

Pathology 1: Friendly Fire
VM: any; CD: eager; CR: requester wins
Two conflicting transactions that keep aborting each other
Can do exponential back-off to handle livelock
Fixable by doing requester stalls? Also fixable by doing requester wins only if the requester is older
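A minimal sketch of that exponential back-off, assuming a hypothetical try_transaction() callback that returns false when the transaction aborts (this is not an interface from any real HTM system):

```cpp
// Exponential back-off sketch: after each abort, wait a random delay drawn
// from a window that doubles, so repeatedly conflicting transactions spread out.
#include <chrono>
#include <functional>
#include <random>
#include <thread>

void run_with_backoff(const std::function<bool()>& try_transaction) {
    std::mt19937 rng(std::random_device{}());
    int max_delay_us = 1;                                   // window doubles after each abort
    while (!try_transaction()) {                            // false => aborted by a conflict
        std::uniform_int_distribution<int> pick(0, max_delay_us);
        std::this_thread::sleep_for(std::chrono::microseconds(pick(rng)));
        if (max_delay_us < (1 << 16))                       // cap the back-off window
            max_delay_us *= 2;
    }
}
```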

Pathology 2: Starving Writer
VM: any; CD: eager; CR: requester stalls
A writer has to wait for the reader to finish; if more readers keep showing up, the writer is starved (note that the directory allows new readers to proceed by just adding them to the list of sharers)
Fixable by forcing the directory to override requester-stalls on a starvation alarm

Pathology 3: Serialized Commit
VM: lazy; CD: lazy; CR: any
If there's a single commit token, transaction commit is serialized
There are ways to alleviate this problem (discussed in the last class)
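A minimal sketch of why a single commit token serializes commits, modeling the token as one global mutex that every committer must hold while publishing its lazily buffered writes (WriteEntry and the write-set layout are illustrative, not from any real system):

```cpp
// Every committing transaction funnels through the same token, so commit
// bandwidth, not true data conflicts, becomes the bottleneck.
#include <mutex>
#include <vector>

struct WriteEntry { int* addr; int value; };

std::mutex commit_token;                              // the single, shared commit token

void commit(const std::vector<WriteEntry>& write_set) {
    std::lock_guard<std::mutex> hold(commit_token);   // commits serialize here
    for (const WriteEntry& w : write_set)             // drain the buffered (lazy) updates
        *w.addr = w.value;
}
```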

Pathology 4: Futile Stall
VM: any; CD: eager; CR: requester stalls
A transaction is stalling on another transaction that ultimately aborts and takes a while to reinstate old values
No good workaround

Pathology 5: Starving Elder
VM: lazy; CD: lazy; CR: committer wins
Small successful transactions can keep aborting a large transaction
The large transaction can eventually grab the token and not release it until after it commits

Pathology 6: Restart Convoy
VM: lazy; CD: lazy; CR: committer wins
A number of similar (conflicting) transactions execute together – one wins, the others all abort – shortly, these transactions all return and repeat the process
Use exponential back-off

Pathology 7: Dueling Upgrades
VM: eager; CD: eager; CR: requester stalls
If two transactions both read the same object and then both decide to write it, a deadlock is created
Exacerbated by the Futile Stall pathology
Solution?

Four Extensions
Predictor: predict if a read will soon be followed by a write and acquire write permissions aggressively
Hybrid: if a transaction believes it is a Starving Writer, it can force other readers to abort; for everything else, use requester stalls
Timestamp: in the EL case, the requester wins only if it is the older transaction (handles the Friendly Fire pathology)
Backoff: in the LL case, aborting transactions invoke exponential back-off to prevent convoy formation

Coherence vs. Consistency
Recall that coherence guarantees (i) that a write will eventually be seen by other processors, and (ii) write serialization (all processors see writes to the same location in the same order)
The consistency model defines the ordering of writes and reads to different memory locations – the hardware guarantees a certain consistency model and the programmer attempts to write correct programs with those assumptions

Example Programs
Initially, A = B = 0

Program 1:
  P1:  A = 1;  if (B == 0) critical section
  P2:  B = 1;  if (A == 0) critical section

Program 2:
  P1:  A = 1
  P2:  if (A == 1) B = 1
  P3:  if (B == 1) register = A

Program 3:
  P1:  Data = 2000;  Head = 1
  P2:  while (Head == 0) { }  … = Data

Sequential Consistency
  P1: Instr-a, Instr-b, Instr-c, Instr-d, …
  P2: Instr-A, Instr-B, Instr-C, Instr-D, …
We assume:
  Within a program, program order is preserved
  Each instruction executes atomically
  Instructions from different threads can be interleaved arbitrarily
Valid executions: abAcBCDdeE… or ABCDEFabGc… or abcAdBe… or aAbBcCdDeE… or …

Sequential Consistency
Programmers assume SC; it makes it much easier to reason about program behavior
Hardware innovations can disrupt the SC model
For example, if we assume write buffers, or out-of-order execution, or if we drop ACKs in the coherence protocol, the previous programs yield unexpected outputs

Consistency Example - I
Consider a multiprocessor with bus-based snooping cache coherence and a write buffer between CPU and cache
Initially A = B = 0
  P1:  A = 1;  … ;  if (B == 0) Crit. Section
  P2:  B = 1;  … ;  if (A == 0) Crit. Section
The programmer expected the above code to implement a lock – because of write buffering, both processors can enter the critical section
The consistency model lets the programmer know what assumptions they can make about the hardware's reordering capabilities
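For reference, a minimal sketch of the same flag-based "lock" written with C++11 atomics, assuming the default memory_order_seq_cst; under that ordering the read of the other flag cannot bypass the buffered write of our own flag, so at most one thread enters the critical section. Plain variables, like the write-buffered stores in the slide, give no such guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0};
std::atomic<int> in_cs{0};          // counts how many threads entered the critical section

void p1() {
    A.store(1);                     // seq_cst store
    if (B.load() == 0)              // seq_cst load: cannot be hoisted above the store
        in_cs.fetch_add(1);         // "critical section"
}
void p2() {
    B.store(1);
    if (A.load() == 0)
        in_cs.fetch_add(1);
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join(); t2.join();
    assert(in_cs.load() <= 1);      // both threads entering is impossible under SC ordering
    return 0;
}
```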

Consistency Example - 2
  P1:  Data = 2000;  Head = 1
  P2:  while (Head == 0) { }  … = Data
Sequential consistency requires program order:
  the write to Data has to complete before the write to Head can begin
  the read of Head has to complete before the read of Data can begin
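A minimal sketch of the same handshake with explicit ordering in C++ atomics: the release store to Head and the acquire load of Head provide exactly the two program-order requirements listed above, while plain stores and loads would let either access be reordered.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int Data = 0;
std::atomic<int> Head{0};

void producer() {
    Data = 2000;                                   // must become visible before Head
    Head.store(1, std::memory_order_release);      // "write to Data completes first"
}
void consumer() {
    while (Head.load(std::memory_order_acquire) == 0) { }  // "read Head first"
    assert(Data == 2000);                          // guaranteed by the acquire/release pair
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
    return 0;
}
```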

Consistency Example - 3
Initially, A = B = 0
  P1:  A = 1
  P2:  if (A == 1) B = 1
  P3:  if (B == 1) register = A
Sequential consistency can be had if a process makes sure that everyone has seen an update before that value is read – else, write atomicity is violated
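A sketch of this three-processor example using default (seq_cst) C++ atomics, which, unlike some relaxed hardware models, also make writes appear atomic: if P3 observes B == 1, it must also observe A == 1, so register can never end up 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0};
int reg = -1;                       // -1 means "P3's branch was not taken"

void p1() { A.store(1); }
void p2() { if (A.load() == 1) B.store(1); }
void p3() { if (B.load() == 1) reg = A.load(); }

int main() {
    std::thread t1(p1), t2(p2), t3(p3);
    t1.join(); t2.join(); t3.join();
    assert(reg == -1 || reg == 1);  // reg == 0 would require a non-atomic write of A
    return 0;
}
```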

Sequential Consistency
A multiprocessor is sequentially consistent if the result of the execution is achievable by maintaining program order within a processor and interleaving accesses by different processors in an arbitrary fashion
The multiprocessors in the previous examples are not sequentially consistent
Can implement sequential consistency by requiring the following: program order, write serialization, everyone has seen an update before a value is read – very intuitive for the programmer, but extremely slow

HW Performance Optimizations
Program order is a major constraint – the following try to get around this constraint without violating sequential consistency:
  if a write has been stalled, prefetch the block in exclusive state to reduce traffic when the write eventually happens
  allow reads to execute out of order, with the ability to roll back if the ROB detects a violation (by re-executing the read later)

Relaxed Consistency Models (HW/SW)
We want an intuitive programming model (such as sequential consistency) and we want high performance
We care about data races and re-ordering constraints for some parts of the program and not for others – hence, we relax some of the sequential-consistency constraints for most of the program, but enforce them for specific portions of the code
Fence instructions are special instructions that require all previous memory accesses to complete before proceeding, i.e., they enforce sequential consistency at that point

Fences
P1 and P2 each execute:
  {
    Region of code with no races
  }
  Fence
  Acquire_lock
  {
    Racy code
  }
  Fence
  Release_lock
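A minimal sketch of this pattern in C++, assuming a hypothetical shared_counter guarded by a std::mutex; the explicit std::atomic_thread_fence calls mirror the Fence lines above. (In C++, mutex lock/unlock already impose acquire/release ordering, so the fences here are purely illustrative of the hardware instruction.)

```cpp
#include <atomic>
#include <mutex>

int shared_counter = 0;   // the "racy" data, touched only while holding the lock
std::mutex lock;

void worker(int private_input) {
    int local = private_input * 2;                          // region of code with no races
    std::atomic_thread_fence(std::memory_order_seq_cst);    // Fence
    lock.lock();                                            // Acquire_lock
    shared_counter += local;                                // racy code, now protected
    std::atomic_thread_fence(std::memory_order_seq_cst);    // Fence
    lock.unlock();                                          // Release_lock
}
```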

Potential Relaxations
Program order (all refer to accesses to different memory locations):
  Write-to-Read program order
  Write-to-Write program order
  Read-to-Read and Read-to-Write program orders
Write atomicity (refers to the same memory location):
  Read others' write early
Write atomicity and program order:
  Read own write early

Relaxations

Relaxation   W→R Order   W→W Order   R→RW Order   Rd others' Wr early   Rd own Wr early
IBM 370          X
TSO              X                                                             X
PC               X                                         X                   X
SC

IBM 370: a read can complete before an earlier write to a different address, but a read cannot return the value of a write unless all processors have seen the write
SPARC V8 Total Store Ordering (TSO): a read can complete before an earlier write to a different address, but a read cannot return the value of a write by another processor unless all processors have seen the write (it can return the value of its own write before others see it)
Processor Consistency (PC): a read can complete before an earlier write (by any processor to any memory location) has been made visible to all

Performance Comparison
Taken from Gharachorloo, Gupta, Hennessy, ASPLOS'91
Studies three benchmark programs on three different hardware configurations:
  MP3D: 3-D particle simulator
  LU: LU-decomposition for dense matrices
  PTHOR: logic simulator
  LFC: aggressive; lockup-free caches, write buffer with bypassing
  RDBYP: only a write buffer with bypassing
  BASIC: no write buffer, no lockup-free caches

Performance Comparison

Summary
Sequential consistency restricts performance (even more when memory and network latencies increase relative to processor speeds)
Relaxed memory models relax different combinations of the five constraints for SC
Most commercial systems are not sequentially consistent and rely on the programmer to insert appropriate fence instructions to provide the illusion of SC
