Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory
Qi Zhu, CSE 340, Spring 2008, University of Connecticut

Paper Source: ISCA'07, San Diego, CA

Motivation Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging. Many proposed implementations are very complex.

History In 1993, Herlihy and Moss proposed an elegant and efficient hardware implementation. Unfortunately, this implementation limits the amount of data a transaction can access and does not allow transactions to survive a context switch.

Recent Development Recently, several proposals have emerged to provide unbounded hardware transactions and support overflowed transactions with the same concurrency as non-overflowed transactions. However, these systems do not have simple implementations.

Methodology Proposed In order to simplify the design, the author explores a different approach. The two key concepts proposed are: (1) the permissions-only cache, and (2) OneTM, in two variants: OneTM-Serialized and OneTM-Concurrent.

An example execution on three systems

Making the Fast Case Common Permissions-Only Cache The permissions-only cache allows a transaction to access more data than can be buffered on-chip without transitioning to a higher-overhead overflowed execution mode. Only when the permissions-only cache itself overflows does the system need to fall back on some other mechanism for detecting conflicts for overflowed blocks.

Operation The permissions-only cache is: (1) read by external coherence requests as part of conflict detection; (2) updated when a transactional block is replaced from the data cache; (3) invalidated on a commit or abort; (4) read on transactional store misses.

Efficient Encoding Because the permissions-only cache does not contain data, it can more efficiently encode the transactional read/write bits. By using sector cache techniques, the tag overhead can be reduced dramatically. For example, with good page-level spatial locality, a 4KB permissions-only cache allows a transaction to access up to 1MB of data without overflow.

Discussion By efficiently tracking transactions' read and write sets, the permissions-only cache increases the size of transactions that can successfully complete without invoking an overflowed execution mode. With a sufficiently large permissions-only cache, overflowed transactions will likely be rare.

Background on Unbounded Hardware TM To provide a basis for later comparison, we first take a look at three hardware-based unbounded transactional memory proposals that precisely detect conflicts at the cache block granularity: 1. UTM 2. VTM 3. PTM

UTM UTM (Unbounded Transactional Memory) was the first transactional memory proposal to support unbounded transactions. UTM maintains its transactional state in a single, shared, memory-resident data structure called the xstate.

VTM VTM (Virtualizing Transactional Memory) tracks overflowed transactional state using a shared data structure mapped into the virtual address space (called the XADT). Entries in the XADT are allocated when blocks overflow the cache. Much like the xstate, the XADT also uses linked lists and supports accessing all entries for a specific virtual memory block or all entries for a specific transaction.

PTM PTM (Page-Based Transactional Memory) supports unbounded transactional memory by associating state with physical addresses at the block granularity but allocating/deallocating this shadow state on a per-page basis. PTM's shadow pages behave similarly to UTM's log pointers, except that PTM's Transaction Access Vector (TAV) lists track data for an entire page.

Discussion These three proposals require the hardware to dynamically allocate/deallocate, maintain, and concurrently manipulate complex linked data structures (UTM's xstate, VTM's XADT, and PTM's Transaction Access Vectors) and the corresponding cached versions of these structures.

Discussion Manipulating and accessing these structures can add overhead to both overflowed transactions and concurrently-executing non-overflowed transactions. More importantly, the hardware for correctly manipulating these structures is not simple.

Making the Uncommon Case Simple The author proposes OneTM, a transactional memory system in which only a single overflowed transaction per process can be active at a time. The principal advantage of this design is that the implementation is relatively simple. The impact of the concurrency restriction on overall system throughput is small because the permissions-only cache ensures that overflow will be the uncommon case.

OneTM-Serialized The serialized implementation simply stalls all other threads in an application when one of the threads needs to execute an overflowed transaction. At the same time, overflowed transactions still support an explicit abort operation because overflowed transactions continue to log.

OneTM Serialized STSW: Shared (per-process) transaction status word PTSW: Private (per-thread) transaction status word

Description of transaction status words
(a) Shared Transaction Status Word (STSW): located at a fixed address in virtual memory known to all threads
- Overflowed?: Is there a current overflowed transaction?
- OTID: ID of the current overflowed transaction
(b) Private Transaction Status Word (PTSW): per-thread architected register, saved/restored on context switches
- Overflowed?: Is this thread in an overflowed transaction?
- TND: Nesting depth of the current transaction
- No-user-abort: Disables logging, allows I/O
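
A hypothetical software sketch (names and structure are illustrative, not the paper's hardware interface) of how the STSW serializes overflow: a thread enters overflowed mode only by atomically setting the shared overflowed bit, bumping the OTID as it does so:

```python
import threading

class STSW:
    """Toy model of the shared (per-process) transaction status word."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic compare-and-swap
        self.overflowed = False        # is there a current overflowed transaction?
        self.otid = 0                  # ID of the current overflowed transaction

    def try_enter_overflowed(self):
        """Attempt to become the single overflowed transaction."""
        with self._lock:
            if self.overflowed:
                return False           # another transaction already overflowed
            self.overflowed = True
            self.otid += 1             # lazy metadata clearing keys off this ID
            return True

    def exit_overflowed(self):
        """On commit/abort, clear only the bit; stale metadata is left behind."""
        with self._lock:
            self.overflowed = False

stsw = STSW()
assert stsw.try_enter_overflowed()       # first overflow wins
assert not stsw.try_enter_overflowed()   # a second overflow must wait or stall
stsw.exit_overflowed()
assert stsw.try_enter_overflowed()       # now another transaction may overflow
assert stsw.otid == 2                    # each overflow got a fresh OTID
```

The lock is only a stand-in: in the actual proposal the check-and-set would be a single atomic operation on the STSW's fixed virtual address.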

Discussion Although this implementation is simple, the price of its simplicity is the loss of all concurrency when a transaction overflows, which will have a significant negative impact on performance.

OneTM-Concurrent In order to allow other code to execute concurrently with a single overflowed transaction, we introduce per-block persistent metadata. The system uses this metadata to track the read and write set of the single overflowed transaction; other threads then check the metadata to detect conflicts.

Metadata Operation When a transaction overflows, it transitions to overflowed execution mode. A simple way to accomplish this is to abort the transaction and restart it in overflowed mode after ensuring that no other thread in the application is already executing in overflowed mode.

Metadata Operation Alternatively, OneTM-Concurrent can avoid an abort by transitioning more gracefully to overflowed mode. As before, the processor must first ensure that no other thread in the application is executing in overflowed mode. Next, the processor walks both the data cache and the permissions-only cache to set the overflowed metadata for blocks read or written by the transaction; this action ensures that the conflict detection information for these blocks is not lost if the overflowed transaction is context switched.
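
A rough software sketch of this graceful transition (the structures and names are hypothetical): walk both caches and publish each transactionally-accessed block's read/write bits into the persistent per-block metadata before continuing in overflowed mode:

```python
def transition_to_overflowed(data_cache, perm_cache, block_metadata, otid):
    """Publish in-cache transactional read/write bits as persistent metadata.

    data_cache / perm_cache: dicts mapping block address -> {'read': bool, 'write': bool}
    block_metadata: dict standing in for per-block metadata in memory
    otid: the new overflowed transaction's identifier
    """
    for cache in (data_cache, perm_cache):
        for addr, bits in cache.items():
            if bits["read"] or bits["write"]:
                # Tag the metadata with the new OTID so that entries from
                # earlier overflowed transactions are recognizably stale.
                block_metadata[addr] = {"otid": otid, **bits}
    return block_metadata

meta = transition_to_overflowed(
    data_cache={0x100: {"read": True, "write": False}},
    perm_cache={0x240: {"read": True, "write": True}},
    block_metadata={},
    otid=3,
)
assert meta[0x240] == {"otid": 3, "read": True, "write": True}
```

After this walk, the metadata (not the caches) is the authoritative record of the overflowed transaction's read and write sets, so a context switch cannot lose conflict information.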

Lazy Metadata Clearing OTID: Overflowed Transaction Identifier. The OTID is used to differentiate between stale and current metadata. The OTID of the active overflowed transaction is also stored in the STSW (see the table above). When a transaction transitions to overflowed mode, it increments the OTID in the STSW. Instead of explicitly clearing the metadata bits when it completes, the overflowed transaction simply clears the overflowed bit in the STSW as before.
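
The stale-versus-current distinction can be sketched as follows (an illustrative model, not the hardware): per-block metadata records the OTID under which it was set, and conflict detection ignores any metadata whose OTID does not match the STSW's current OTID:

```python
def conflicts(block_meta, stsw, is_write):
    """Does an access by a non-overflowed thread conflict with the
    overflowed transaction's recorded access to this block?

    block_meta: dict with 'otid', 'read', 'write'
    stsw: dict with 'overflowed', 'otid' (the shared status word)
    """
    if not stsw["overflowed"]:
        return False                      # no active overflowed transaction
    if block_meta["otid"] != stsw["otid"]:
        return False                      # stale bits from an old overflow
    if is_write:
        return block_meta["read"] or block_meta["write"]
    return block_meta["write"]            # a read conflicts only with a write

stsw = {"overflowed": True, "otid": 7}
stale = {"otid": 6, "read": True, "write": False}
fresh = {"otid": 7, "read": True, "write": False}
assert not conflicts(stale, stsw, is_write=True)   # old metadata is ignored
assert conflicts(fresh, stsw, is_write=True)       # real read-write conflict
assert not conflicts(fresh, stsw, is_write=False)  # read-read is allowed
```

This is why completion only needs to clear the STSW's overflowed bit: any leftover per-block bits carry an out-of-date OTID and are harmlessly ignored.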

Lazily Coherent Metadata To prevent an out-of-order writeback from overwriting more recent metadata with stale metadata, the system allows only the owner of the block (non-exclusive or exclusive) to set the metadata. Once the metadata has been written, it is the owner's responsibility to ensure the data is eventually written back to memory (or to transfer the ownership, and thus the responsibility, to another processor).

Lazily Coherent Metadata The key to the correctness of this lazy updating of metadata is that the system guarantees that any new request for the block receives the most recent version of the metadata. Once an overflowed transaction has set the read bit (and thus has the block in owned state), any other processor that tries to write the block will issue a cache request and receive the most recent version of the metadata, indicating the conflict.

Example Execution ‘O’ state indicates a single non-exclusive read-only dirty owner

Experimental Evaluation Simulated machine configuration:
- Processor: eight in-order x86 cores, 1 IPC
- L1 cache: 64KB, 4-way set associative, 64B blocks
- L1 miss latency: 10 cycles
- L2 cache: 4MB, 4-way set associative, 64B blocks
- L2 miss latency: 200 cycles

Experimental Evaluation Benchmark summary (program: input):
- barnes: input-2K
- cholesky: tk14.O
- ocean-non-contiguous: n130
- radix: N r1024
- raytrace: teapot.env
- volrend: head-scaleddown2
- water-spatial: input-512
- tree-N%: N% scanning

Performance

Scalability analysis using the tree-10% microbenchmark

Results Summary These results show that allowing only one overflowed transaction at a time can deliver performance competitive with an ideally concurrent implementation. The addition of even a small permissions-only cache closes the gap for our benchmarks. For larger multiprocessors, the performance of OneTM-Concurrent degrades relative to the ideal unbounded transactional memory.

Conclusion The permissions-only cache eases the memory-footprint restrictions imposed by the bounded execution mode of the processor. This structure permits bounded transactions to access potentially hundreds of megabytes of data before overflowing. The permissions-only cache can be introduced into previously proposed hardware-only and hybrid hardware-software systems to make the fast case common.

Conclusion We also proposed OneTM, an unbounded transactional memory system that assumes overflowed transactions will be rare in order to simplify the implementation. Specifically, we limit the number of overflowed transactions that may exist in an application at a time to one. By bounding concurrency among overflowed transactions, OneTM avoids the complexity of managing and traversing linked data structures in hardware, admitting simple conflict detection and transaction commit.

Questions?

Thank you! I wish everyone a great summer!