Slide 1 Kubiatowicz, Chaiken and Agarwal, "Closing the Window of Vulnerability in Multiphase Memory Transactions" MIT Computer Science Dept. CS258 Lecture.

Slide 1 Kubiatowicz, Chaiken and Agarwal, "Closing the Window of Vulnerability in Multiphase Memory Transactions" MIT Computer Science Dept. CS258 Lecture by: Dan Bonachea a.k.a. "Kubi's baby"

Slide 2 Outline
Intro & Scope
  –What architectural features create a WOV
Window of Vulnerability - what is it?
  –Multiphase memory access
  –Potential for livelocks with WOV
  –Empirical measurements of severity
Deadlocks that can arise
Good & Bad Solutions for Closing the Window
Alewife implementation & Conclusions

Slide 3 Scope
Hardware cache-coherent distributed shared-memory multiprocessors, with:
  - multiphase shared memory transactions (request/reply)
    »long delays for accessing remote memory
  - polling-based completion (CPU retries until success)
    »as opposed to a signaling-based approach
  - and one or more of:
    »hardware context-switching, possibly with context-switch disable capabilities
    »high-availability interrupts (HAI)
    »prefetching or weak ordering
Key property: hardware might not immediately consume the reply to its shared memory transaction and commit the load/store instruction

Slide 4 Anatomy of a Multiphase Memory Access If response data is lost during the WOV due to invalidation or cache conflict, requestor cannot make forward progress

Slide 5 Architectural Features that lead to WOV problem
Prefetching or Weak ordering
  –allow processor to have multiple outstanding memory transactions (from same or different context)
  –some of the data addresses may conflict in the cache
  –with unified caches, response data may even conflict with instruction that initiated the transaction
Hardware context-switching
  –Hardware keeps several threads ready to run and quickly switches between them when one stalls
  –Often also have a mechanism to disable context switching (to support fast atomic operations & critical sections)
High-availability interrupts
  –any time we interrupt a load/store in progress to process network messages
  –used to implement software-assisted cache coherence, optimistic network deadlock recovery, etc.
  –has essentially the same effect as hardware context-switching

Slide 6 Livelocks that can occur with WOV
Invalidation thrashing
  –external protocol invalidation during the WOV
Intercontext thrashing
  –different local contexts with outstanding data transactions that conflict in cache
High-Availability Interrupt thrashing
  –a cache conflict during the interrupt handler replaces a data response
Instruction-Data thrashing
  –response data conflicts with the initiating instruction in the cache
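Intercontext thrashing can be illustrated with a toy model. The sketch below is not from the slides or the paper: it assumes a direct-mapped cache, round-robin context switching, and a reply that always arrives before the requesting context runs again. The function name and parameters are hypothetical.

```python
def simulate_intercontext_thrashing(addresses, num_sets=1, max_steps=20):
    """Toy model: one processor, round-robin contexts, direct-mapped cache.

    addresses[i] is the single address that context i is trying to load.
    Returns a list saying which contexts managed to commit their load.
    """
    cache = {}  # set index -> address currently held in that set
    committed = [False] * len(addresses)
    for step in range(max_steps):
        ctx = step % len(addresses)
        if committed[ctx]:
            continue
        addr = addresses[ctx]
        index = addr % num_sets
        if cache.get(index) == addr:
            # The reply survived the window of vulnerability.
            committed[ctx] = True
        else:
            # Miss: the request goes out and the context switches away.
            # The reply fills the set (evicting its occupant) before this
            # context runs again, but another context's reply may evict
            # it in turn.
            cache[index] = addr
    return committed


# Two contexts whose addresses map to the same set never make progress.
print(simulate_intercontext_thrashing([0, 1], num_sets=1))
# With one set per address, both loads commit on the second attempt.
print(simulate_intercontext_thrashing([0, 1], num_sets=2))
```

In the conflicting case each context's reply is evicted by the other's before it can be consumed, which is the livelock pattern the slide describes.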

Slide 7 Empirical measurements of WOV Alewife simulator: 64 processors, 4 contexts per processor, 1.5M cycles of a numerical integration app.

Slide 8 Broken Solution #1: Simple Locking
One simple idea for closing the WOV:
  –Add a "lock" bit to the cache line that delays invalidation and prevents conflict replacement on response data (set on arrival, clear on access)
  –Also need a bit to save the fact that an external invalidate is pending for the cache line
  –Also need a "transaction-in-progress" cache line state to prevent new transactions during request phase that would conflict in the cache
Not a perfect solution
  –Different context accessing same data could touch & unlock the line (fixable by adding more state)
  –Otherwise, fixes the WOV livelock problems, but...
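The lock bit and the pending-invalidate bit can be sketched as a small state machine. This is a minimal illustration, not the Alewife hardware: the class and method names are hypothetical, and the "transaction-in-progress" state is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheLine:
    data: Optional[int] = None
    valid: bool = False
    locked: bool = False         # set on reply arrival, cleared on access
    inval_pending: bool = False  # remembers a deferred external invalidate

    def reply_arrives(self, data: int) -> None:
        self.data, self.valid, self.locked = data, True, True

    def external_invalidate(self) -> bool:
        """Return True if applied immediately, False if deferred."""
        if self.locked:
            self.inval_pending = True
            return False
        self.data, self.valid = None, False
        return True

    def can_replace(self) -> bool:
        # A locked line must not be evicted by a conflicting fill.
        return not self.locked

    def access(self) -> Optional[int]:
        """Consume the data, unlock, then apply any deferred invalidate."""
        if not self.valid:
            return None
        value = self.data
        self.locked = False
        if self.inval_pending:
            self.inval_pending = False
            self.data, self.valid = None, False
        return value


line = CacheLine()
line.reply_arrives(42)
line.external_invalidate()  # deferred, because the line is locked
print(line.access())        # the requester still gets its data
print(line.valid)           # the deferred invalidate has now been applied
```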

Slide 9 Deadlocks Caused by Simple Locking
D=Data, I=Instruction, P=Primary, S=Secondary; 1,2 = node #; A,B,C,D = context #
X and Y variables conflict in cache, Z does not
Waits-for dependency arcs:
  Congruence –cache conflicts
  Protocol –external read req on data locked for write
  Execution –program order on instruction completion
  Disable –context switching has been disabled
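A deadlock corresponds to a cycle in the waits-for graph. The sketch below finds such a cycle with a depth-first search. The two-context example is an illustrative assumption, not the scenario drawn on the slide: context A spins with context switching disabled on X while context B's locked line Y occupies the cache set that X needs.

```python
def find_cycle(arcs):
    """Return one cycle as a list of (src, dst, kind) arcs, or None.

    arcs is an iterable of (waiter, waited_on, kind) tuples.
    """
    graph = {}
    for src, dst, kind in arcs:
        graph.setdefault(src, []).append((dst, kind))
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node, path):
        color[node] = GREY
        for dst, kind in graph[node]:
            arc = (node, dst, kind)
            if color[dst] == GREY:
                # Back arc closes a cycle: keep the arcs from dst onward.
                start = next(
                    (i for i, a in enumerate(path) if a[0] == dst), len(path)
                )
                return path[start:] + [arc]
            if color[dst] == WHITE:
                found = visit(dst, path + [arc])
                if found:
                    return found
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical example: A cannot fill X over B's locked line (congruence),
# and B cannot run to unlock it because A disabled switching (disable).
arcs = [("A", "B", "congruence"), ("B", "A", "disable")]
print(find_cycle(arcs))
# Dropping the congruence arc breaks the cycle.
print(find_cycle([("B", "A", "disable")]))
```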

Slide 10 Solution #1: Associative Locking
Basic Idea:
  –Add a small, fully associative transaction buffer
  –Include address, state bits and space for data
  –Perform all locking on the transaction buffer entries
    »Defer invalidates on locked data (need address associativity to handle invalidates)
    »Optimization: merge references to same data from different contexts to reduce number of messages
Avoids the conflicts due to limited cache associativity that lead to some of the deadlocks
  –Removes all the "congruence" dependency arcs
  –Also solves all the livelock scenarios
Still can deadlock if we allow context-switch disable
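A fully associative transaction buffer can be sketched as a small table keyed by address. All names below are hypothetical. The sketch also assumes that an entry is freed once every merged context has read it, which is one possible choice for the extra state and is not stated on the slides.

```python
class TransactionBuffer:
    """Toy fully associative buffer: any address can use any entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}        # address -> entry state
        self.messages_sent = 0   # requests actually sent to the network

    def request(self, addr, ctx):
        """Start or join a transaction. Returns False if the buffer is full."""
        entry = self.entries.get(addr)
        if entry is not None:
            entry["waiters"].add(ctx)  # merge: no second message is sent
            return True
        if len(self.entries) >= self.capacity:
            return False
        self.entries[addr] = {
            "data": None, "locked": False,
            "inval_pending": False, "waiters": {ctx},
        }
        self.messages_sent += 1
        return True

    def reply(self, addr, data):
        entry = self.entries[addr]
        entry["data"], entry["locked"] = data, True

    def invalidate(self, addr):
        """Return True if nothing had to be deferred, False if deferred."""
        entry = self.entries.get(addr)
        if entry is not None and entry["locked"]:
            entry["inval_pending"] = True
            return False
        return True

    def access(self, addr, ctx):
        entry = self.entries.get(addr)
        if entry is None or not entry["locked"]:
            return None
        entry["waiters"].discard(ctx)
        value = entry["data"]
        if not entry["waiters"]:
            # Assumption: the last reader frees the entry, at which point
            # any deferred invalidate takes effect.
            del self.entries[addr]
        return value


buf = TransactionBuffer(capacity=2)
buf.request(0x100, "A")
buf.request(0x100, "B")   # merged with A's outstanding request
buf.reply(0x100, 7)
buf.invalidate(0x100)     # deferred while the entry is locked
print(buf.messages_sent, buf.access(0x100, "A"), buf.access(0x100, "B"))
```

Because lookup is by full address, two addresses that would collide in a direct-mapped cache simply occupy two different entries.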

Slide 11 Solution #2: Thrashwait
Observation:
  –locking is pessimistic: locks data to prevent vulnerability during WOV, thereby ensuring progress (prevention)
  –optimistic option: allow vulnerability, but detect livelock/thrashing when it happens and take steps to correct it (detection and recovery)
Basic idea:
  –dynamically detect when data got lost during WOV
    »tried-once bit on context says we attempted an access
    »transaction-in-progress state says transaction is complete, but data is missing
  –when we detect a loss, retry access and spin-wait for result (with context-switching disabled)
    »without HAI, this ensures WOV is length zero
Can still livelock in the presence of HAI
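The detection step can be written as a small decision function. This is one reading of the slide under stated assumptions, not the paper's exact algorithm: a miss with no transaction in progress and the tried-once bit already set is taken to mean the data arrived and was lost. The names `Action` and `thrashwait_step` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    COMMIT = "commit the load/store"
    SWITCH = "context switch and wait for the reply"
    REQUEST_AND_SWITCH = "send request, then context switch"
    REQUEST_AND_SPIN = "send request, then spin-wait with switching disabled"


def thrashwait_step(cache_hit, in_progress, tried_once):
    """Return (action, new value of the context's tried-once bit)."""
    if cache_hit:
        return Action.COMMIT, False
    if in_progress:
        # The reply has not come back yet; nothing has been lost so far.
        return Action.SWITCH, tried_once
    if not tried_once:
        # First attempt: behave optimistically and leave the window open.
        return Action.REQUEST_AND_SWITCH, True
    # The transaction completed but the data is gone: thrashing detected.
    return Action.REQUEST_AND_SPIN, True


print(thrashwait_step(cache_hit=False, in_progress=False, tried_once=False))
print(thrashwait_step(cache_hit=False, in_progress=False, tried_once=True))
```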

Slide 12 Broken Solution #2: Associative Thrashwait
Want to fix livelock problems of thrashwait in the presence of HAI
One possibility is to add associativity
  –add a transaction buffer similar to the one in associative locking
This is only a partial solution
  –Removes problems caused by cache conflicts
  –Prevents 3 of the 4 livelock scenarios
    »those involving cache conflicts
  –Still have invalidation thrashing
    »doesn't prevent external invalidations on the data while HAI is running
    »so WOV is still open during recovery and we can still livelock

Slide 13 Solution #3: Associative Thrashlock
Hybrid approach - combines benefits of:
  –Thrashwait, Associativity and Locking
Idea:
  –Augment Associative Thrashwait partial solution with a lock that defers all invalidations (one lock bit per CPU)
    »lock is turned on while spin-waiting in thrashing recovery
    »can run HAI handlers without danger of an invalidation
  –This solves the final livelock in Associative Thrashwait
  –Need a discipline for HAI handler code to prevent introducing new dependencies due to invalidation deferral
    »handlers can't reference global memory
    »must always return to interrupted context
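The per-CPU lock can be sketched as a flag that queues incoming invalidations while the processor is spin-waiting in recovery. This is a simplified illustration with hypothetical names. It models only the deferral itself, not the transaction buffer or the HAI handler discipline.

```python
class ThrashlockCpu:
    """Toy model of one CPU with a single invalidation-deferring lock bit."""

    def __init__(self):
        self.lock_bit = False
        self.deferred = []  # addresses whose invalidation was postponed
        self.cache = {}     # address -> data

    def begin_spin_wait(self):
        # Thrashing was detected: lock before retrying the access.
        self.lock_bit = True

    def fill(self, addr, data):
        self.cache[addr] = data

    def external_invalidate(self, addr):
        """Return True if applied immediately, False if deferred."""
        if self.lock_bit:
            self.deferred.append(addr)
            return False
        self.cache.pop(addr, None)
        return True

    def commit_and_unlock(self, addr):
        """Consume the data, then release the lock and apply deferrals."""
        value = self.cache.get(addr)
        self.lock_bit = False
        for pending in self.deferred:
            self.cache.pop(pending, None)
        self.deferred.clear()
        return value


cpu = ThrashlockCpu()
cpu.begin_spin_wait()
cpu.fill(0x40, 9)
cpu.external_invalidate(0x40)       # deferred: an HAI handler may run safely
print(cpu.commit_and_unlock(0x40))  # the access completes with its data
```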

Slide 14 Alewife Implementation
Hardware:
  –Distributed shared-memory cache-coherent multiprocessor
  –33 MHz SPARC-like CPUs
  –4 hardware contexts with register windows
Uses Associative Thrashlock to close WOV
Hardware Requirements:
  –16 transaction buffers
  –8 tried-once bits and 2 lock bits
Provides:
  –HAI, context-switch with disable, non-binding prefetch
  –2 simultaneous transactions per context
  –Access merging between contexts

Slide 15 Conclusions
Window of Vulnerability is a problem for systems which have:
  –polling-based cache-coherent distributed shared-memory
  –and one or more of:
    »Multiple hardware contexts, possibly with context-switch disable
    »High-availability interrupts
    »Prefetching/weak ordering
Paper presents 3 solutions:
  –(correct choice based on architectural features)

Slide 16 Extra Slides

Slide 17 High-Availability Interrupts

Slide 18 Internode Thrashing Detail

Slide 19 Technique Tables