Faster Data Structures in Transactional Memory using Three Paths


Trevor Brown, University of Toronto

Problem
- Handcrafted data structures are fast and important for libraries
- Hardware transactional memory (HTM) can be used to make them even faster
- How can we get the most benefit from HTM?

Handcrafted data structures often have significant synchronization overhead, and hardware transactional memory can eliminate much of that overhead, making them even faster.

Hardware transactional memory (HTM)
- Can perform arbitrary blocks of code atomically
- Each transaction either commits or aborts
- HTM implementations are best effort: transactions can abort for any reason
- Need a non-transactional fallback path

We have seen many talks about hardware transactional memory, so I will just remind you that HTM implementations are best effort, which means that transactions can abort for any reason; for example, they can abort if they are too large. To guarantee progress, the programmer has to provide a non-transactional fallback path.
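The best-effort contract described above can be sketched as a bounded retry loop followed by a mandatory software fallback. This is only an illustrative sketch, not code from the talk: the attempt/fallback hooks are hypothetical stand-ins for hardware primitives such as Intel TSX's _xbegin/_xend.

```cpp
#include <cassert>
#include <functional>

// Sketch of the best-effort HTM contract: retry a hardware transaction a
// bounded number of times, then fall back to non-transactional code that is
// guaranteed to make progress. attempt() stands in for a real hardware
// transaction (e.g. _xbegin/_xend on Intel TSX) and returns true on commit.
const int MAX_ATTEMPTS = 5;

// Returns true if the operation committed on the HTM path,
// false if it completed on the software fallback path.
bool run_best_effort(const std::function<bool()>& attempt,
                     const std::function<void()>& fallback) {
    for (int i = 0; i < MAX_ATTEMPTS; ++i) {
        if (attempt()) return true;  // transaction committed
        // transaction aborted (conflict, capacity, interrupt, ...): retry
    }
    fallback();                      // non-transactional, guarantees progress
    return false;
}
```

The key point the sketch captures is that the fallback path is not optional: without it, a transaction that always aborts (for example, one whose footprint exceeds the hardware's capacity) would never complete.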

Transactional lock elision (TLE)
- Parallelizes code protected by a global lock
- Fast path: begin txn; if lock is held, abort txn; run sequential code; commit txn
- On abort: wait until the lock is free, then try again (up to K times)
- Fallback path: acquire lock; run sequential code; release lock

One common use of HTM is to parallelize code protected by a global lock. This is called transactional lock elision. In TLE, the fallback path simply acquires a lock, performs the sequential code, and releases the lock. On the fast path, we begin a transaction and check whether the lock is held; if so, we abort to avoid interfering with a process on the fallback path. Otherwise, we perform the sequential code and commit. If we experience an abort, we wait until the lock is free and then retry the transaction, up to K times, before moving to the fallback path.
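The TLE control flow on the slide can be sketched as follows. This is a sketch, not the talk's implementation: txn_begin/txn_abort/txn_commit are hypothetical stand-ins for hardware primitives (e.g. _xbegin/_xabort/_xend on Intel TSX), mocked here with a forced-abort counter so the retry logic can be traced in a single thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Mocked HTM primitives. In this single-threaded sketch, txn_begin()
// "aborts" a configurable number of times to exercise the retry logic;
// real transactions can abort at any point and roll back all their writes.
static int forced_aborts = 0;
static bool txn_begin()  { return forced_aborts-- <= 0; }
static void txn_abort()  {}  // real HTM: roll back, resume after txn_begin
static bool txn_commit() { return true; }

static const int MAX_TXN_ATTEMPTS = 10;  // "K" on the slide

// Returns true if the critical section committed on the fast (HTM) path,
// false if it completed on the lock-based fallback path.
bool tle_execute(std::atomic<bool>& lock,
                 const std::function<void()>& seq_code) {
    for (int i = 0; i < MAX_TXN_ATTEMPTS; ++i) {
        if (txn_begin()) {
            if (lock.load()) {   // read ("subscribe to") the lock in the txn:
                txn_abort();     // a fallback lock-holder forces us to abort
            } else {
                seq_code();      // run the sequential code atomically
                if (txn_commit()) return true;
            }
        }
        while (lock.load()) {}   // wait until the lock is free, then retry
    }
    // Fallback path: acquire the global lock and run sequentially.
    bool expected = false;
    while (!lock.compare_exchange_weak(expected, true)) expected = false;
    seq_code();
    lock.store(false);
    return false;
}
```

Reading the lock inside the transaction is what makes elision safe: if a fallback thread acquires the lock after our check, its write to the lock word conflicts with our transactional read, and the hardware aborts us automatically.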

Performance results for a binary search tree
- 100% updates, key range [0, 65536)
- 98% updates, 2% range queries
- Problem: poor performance if operations frequently run on Fallback

Here are some results from a simple microbenchmark on a binary search tree, run on a new 72-thread Intel machine with HTM. Up is good. With 50% insertions and 50% deletions in a tree containing approximately 32,000 keys, TLE performs well and scales almost linearly. The problem with TLE is that it performs very poorly if operations frequently run on the fallback path. For example, if we replace just 2% of the updates with range queries, which are large and frequently have to run on the fallback path, the performance of TLE is crippled.

Concurrency between two paths
- Fallback uses fine-grained synchronization
- Fast manipulates the synchronization metadata used by Fallback
- Processes on Fast and Fallback can run concurrently
- Problem: Fallback imposes overhead on Fast

We can improve performance when operations frequently run on the fallback path by replacing the global lock-based fallback path with one that uses fine-grained synchronization, and modifying the fast path so that it manipulates the synchronization metadata used by the fallback path. This allows processes on Fast and Fallback to run concurrently. The problem is that the fallback path imposes overhead on the fast path.

New approach: add a Middle path To eliminate this overhead, I propose adding an HTM-based middle path.

Paths and concurrency
- Fast path: sequential code in a transaction
- Fallback path: fine-grained synchronization
- Middle path: transactions manipulate the synchronization metadata used by Fallback
- Fallback does not impose overhead on Fast

Concurrency:
            Fast   Middle   Fallback
  Fast       -       Y         N
  Middle     Y       -         Y
  Fallback   N       Y         -

More specifically, the fast path consists of sequential code run in one big transaction, the fallback path uses fine-grained synchronization, and, on the middle path, transactions manipulate the synchronization metadata used by the fallback path. The middle path can therefore run concurrently with the fallback path, and it can also run concurrently with the fast path, since both use HTM. However, the fast path cannot run concurrently with the fallback path, and this is desirable: it means that processes on the fast path never need to coordinate with processes on the fallback path. Consequently, Fallback does not impose any overhead on Fast, irrespective of how it operates.
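The overall control flow is an escalation cascade: try the fast path a bounded number of times, then the middle path, then complete on the fallback path. The sketch below illustrates only that cascade; the helper names are hypothetical, and the per-path attempt functions are mocked (in a real implementation each would wrap an HTM transaction, with the fast path aborting whenever any operation is active on the fallback path, since those two paths must not overlap).

```cpp
#include <cassert>

// Hypothetical per-path attempt functions, mocked with failure counters so
// the escalation logic can be traced. In a real three-path implementation:
//  - try_fast_txn():   the whole operation as one big sequential-code txn;
//                      aborts if any operation is active on the fallback path
//  - try_middle_txn(): a txn that also updates the fine-grained
//                      synchronization metadata used by the fallback path,
//                      so it can run concurrently with fallback operations
//  - run_fallback():   fine-grained (e.g. lock-free) code, guaranteed progress
static int fast_failures = 0, middle_failures = 0;
static bool try_fast_txn()   { return fast_failures-- <= 0; }
static bool try_middle_txn() { return middle_failures-- <= 0; }
static void run_fallback()   {}

enum class Path { FAST, MIDDLE, FALLBACK };
static const int FAST_RETRIES = 5, MIDDLE_RETRIES = 5;

// Returns which path the operation ultimately completed on.
Path execute_op() {
    for (int i = 0; i < FAST_RETRIES; ++i)
        if (try_fast_txn()) return Path::FAST;      // cheapest: no metadata
    for (int i = 0; i < MIDDLE_RETRIES; ++i)
        if (try_middle_txn()) return Path::MIDDLE;  // concurrent with Fallback
    run_fallback();                                 // guaranteed progress
    return Path::FALLBACK;
}
```

The design choice the cascade reflects: the fast path pays no metadata cost precisely because it refuses to run alongside the fallback path, while the middle path pays that cost only for operations that have already aborted repeatedly.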

Performance results for a binary search tree
- 100% updates, key range [0, 8192)
- 98% updates, 2% range queries, key range [0, 65536)

Let's see how this idea performs. With 50% insertions and 50% deletions in a tree containing approximately 4,000 keys, TLE and my new three-path algorithm perform best, and their curves almost perfectly overlap. For the two-path algorithm, whose fast and fallback paths can run concurrently, the overhead imposed on the fast path by the fallback path is clearly visible. For reference, I also included the lock-free algorithm that is used as the fallback path, and the global lock. With 98% updates and 2% range queries in a tree containing approximately 32,000 keys, the performance of TLE plummets, but the three-path algorithm remains at the top. The three-path algorithm also outperforms the two-path algorithm up to 66 threads, because some of its range queries complete on the fast path. At 72 threads and beyond, range queries stop succeeding in hardware due to contention, and the three-path algorithm behaves identically to the two-path algorithm. These graphs show that, when few operations run on the fallback path, the three-path algorithm has extremely low overhead, and when many operations run on the fallback path, it obtains a large benefit from concurrency between the fallback path and the middle path. In other words, we get the best of both worlds.

Conclusion
- We proposed adding a transactional middle path, which can run concurrently with the other two paths
- Middle-Fallback concurrency provides high performance when many operations run on Fallback
- Since there is no Fast-Fallback concurrency, Fallback does not impose any overhead on Fast