Massively Concurrent Red-Black Trees with Hardware Transactional Memory Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris.

Slides:

Advertisements

Similar presentations

TRAMP Workshop Some Challenges Facing Transactional Memory Craig Zilles and Lee Baugh University of Illinois at Urbana-Champaign.

Advertisements

Transactional Memory Parag Dixit Bruno Vavala Computer Architecture Course, 2012.

Exploiting Distributed Version Concurrency in a Transactional Memory Cluster Kaloian Manassiev, Madalin Mihailescu and Cristiana Amza University of Toronto,

Hybrid Transactional Memory Nir Shavit MIT and Tel-Aviv University Joint work with Alex Matveev (and describing the work of many in this summer school)

Concurrent Data Structures in Architectures with Limited Shared Memory Support Ivan Walulya Yiannis Nikolakopoulos Marina Papatriantafilou Philippas Tsigas.

Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.

1 MetaTM/TxLinux: Transactional Memory For An Operating System Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter and Owen S. Hofmann Presenter:

1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.

1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.

1 New Architectures Need New Languages A triumph of optimism over experience! Ian Watson 3 rd July 2009.

Unbounded Transactional Memory Paper by Ananian et al. of MIT CSAIL Presented by Daniel.

Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory Written by: Paul E. McKenney Jonathan Walpole Maged.

Relativistic Red Black Trees. Relativistic Programming Concurrent reading and writing improves performance and scalability – concurrent readers may disagree.

Modularizing B+-trees: Three-Level B+-trees Work Fine Shigero Sasaki* and Takuya Araki NEC Corporation * currently with 1st Nexpire Inc.

Architectural Support for Fine-Grained Parallelism on Multi-core Architectures Sanjeev Kumar, Corporate Technology Group, Intel Corporation Christopher.

Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory Alexander Matveev Nir Shavit MIT.

WG5: Applications & Performance Evaluation Pascal Felber

Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics Minjia Zhang, 1 Jipeng Huang, Man Cao, Michael D. Bond.

On the Performance of Window-Based Contention Managers for Transactional Memory Gokarna Sharma and Costas Busch Louisiana State University.

Darko Makreshanski Department of Computer Science ETH Zurich

CS510 Concurrent Systems Why the Grass May Not Be Greener on the Other Side: A Comparison of Locking and Transactional Memory.

Hardware and Software transactional memory and usages in MRE

Architectural Features of Transactional Memory Designs for an Operating System Chris Rossbach, Hany Ramadan, Don Porter Advanced Computer Architecture.

Adaptive Software Lock Elision

Hathi: Durable Transactions for Memory using Flash

Lecture 20: Consistency Models, TM

Maurice Herlihy and J. Eliot B. Moss, ISCA '93

Irina Calciu Justin Gottschlich Tatiana Shpeisman Gilles Pokam

Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

Multiway Search Trees Data may not fit into main memory

Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun

Transactional Memory : Hardware Proposals Overview

Alex Kogan, Yossi Lev and Victor Luchangco

Combining HTM and RCU to Implement Highly Efficient Balanced Binary Search Trees Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas and Nectarios.

PHyTM: Persistent Hybrid Transactional Memory

Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory By McKenney, Michael, Triplett and Walpole.

Atomic Operations in Hardware

Atomic Operations in Hardware

Faster Data Structures in Transactional Memory using Three Paths

Effective Data-Race Detection for the Kernel

Reactive Synchronization Algorithms for Multiprocessors

CO4301 – Advanced Games Development Week 10 Red-Black Trees continued

Martin Rinard Laboratory for Computer Science

Challenges in Concurrent Computing

Trees 4 The B-Tree Section 4.7

Anders Gidenstam Håkan Sundell Philippas Tsigas

Professor, No school name

Software Cache Coherent Control by Parallelizing Compiler

Concurrent Data Structures Concurrent Algorithms 2017

Lecture 2: Snooping-Based Coherence

Multi-Way Search Trees

KISS-Tree: Smart Latch-Free In-Memory Indexing on Modern Architectures

Lecture 6: Transactions

Lecture 21: Transactional Memory

Yiannis Nikolakopoulos

Lecture 22: Consistency Models, TM

Hybrid Transactional Memory

Software Transactional Memory Should Not be Obstruction-Free

CS510 - Portland State University

Locking Protocols & Software Transactional Memory

Multicore programming

Programming with Shared Memory Specifying parallelism

Lecture 23: Transactional Memory

Lecture 21: Transactional Memory

Lecture: Consistency Models, TM

Concurrent Cache-Oblivious B-trees Using Transactional Memory

Lecture: Transactional Memory

Dynamic Performance Tuning of Word-Based Software Transactional Memory

Controlled Interleaving for Transactions

Concurrency control (OCC and MVCC)

Presentation transcript:

Massively Concurrent Red-Black Trees with Hardware Transactional Memory Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris National Technical University of Athens (NTUA) School of Electrical and Computer Engineering (ECE) Computing Systems Laboratory (CSLab) {jimsiak,knikas,goumas,nkoziris}@cslab.ece.ntua.gr http://research.cslab.ece.ntua.gr PDP 2016

Motivation Hardware Transactional Memory (HTM) has become mainstream IBM Power8, Blue Gene/Q, zEC12 Intel Haswell, Broadwell … Available on highly multi-threaded machines (10s to 100s of hardware threads available) We need to evaluate it on real-life applications Red-Black trees Classic data structure Widely used for dictionary implementations Challenging to devise efficient concurrent implementations using locks or atomic primitives Their properties favor the usage of HTM Siakavaras et. al cslab@ntua

Our Contributions We have implemented concurrent red-black trees with HTM We have evaluated then with high number of threads Intel Haswell-EP: 56 hardware threads IBM Power8: 160 hardware threads We have found out that: Programming with HTM can be simple … … but to get performance one needs to put some extra effort Different challenges faced on each system Depending on each HTM’s resources Different optimizations applied Siakavaras et. al cslab@ntua

Transactional Memory Programming model alternative to locks Aims to simplify programming (complexity is moved to the TM system) Programming effort similar to coarse-grained locking … … with performance similar or even better than fine-grained locking More robust than locks Programming with TM Programmer annotates regions of code to be executed atomically (transactions) TM system guarantees atomic execution of transactions atomic: either all writes become visible (commit) or none of them (abort) TM system keeps track of read- and write- sets for each transaction and if a conflict is detected some transaction is aborted Thread 0 Thread 1 Thread 2 Thread 0 Thread 1 Thread 2 tx_begin() tx_begin() tx_begin() lock(L) lock(L) lock(L) acq. wr A rd B wr B rd A x conflict tx_begin() tx_end() tx_end() unlock(L) ... other code ... lock(global_lock); ... critical section ... unlock(global_lock); tx_begin(); tx_end(); Lock-based code TM code acq. tx_end() unlock(L) acq. TM is expected to perform better than locks when no coflicts are present worse than locks when conflicts arise unlock(L) Siakavaras et. al cslab@ntua

Hardware Transactional Memory Hardware implementation of TM (HTM) ISA extensions to support the transactional model xbegin: begins a transaction xend: ends a transaction xabort: explicitly aborts a transaction with some abort code Currently supported on various processors IBM Power8, Blue Gene/Q, zEC12 Intel Haswell, Broadwell … On the plus side eliminates the overheads of Software TM But Hardware limitations read-/write- sets are limited by the hardware buffers Best-effort implementations (a transaction may never commit) aborts due to read-/write- set buffers overflow, interrupts, cache line eviction … programmer needs to specify non-transactional fallback code Siakavaras et. al cslab@ntua

Red-Black Trees in a nutshell Definition A node is either red or black The root is always black All leaves are black Every red node must have two black children Every path from a given node to any of its descendant leaves contains the same number of black nodes. The above properties guarantee that the tree is almost balanced Siakavaras et. al cslab@ntua

Red-Black Trees in a nutshell Applications Dictionary ADT Collection of (key, value) pairs Supports three operations: Lookup(key) Insert(key, value) Delete(key) Siakavaras et. al cslab@ntua

Internal vs. External RBTs Both keys and values are stored in every node External Values are only stored in the leaves Internal nodes are only used for routing to the appropriate leaf Occupy more memory Simplify delete operation All our implementations are external trees Siakavaras et. al cslab@ntua

Bottom-Up vs. Top-Down RBTs Example: insert(30) Bottom-Up[1] Top-Down[2] 1 2 3 2 traversals Less tree modifications (only the necessary) Not convenient for fine-grained synchronization approaches (threads might traverse the tree in opposite directions) 1 traversal More tree modifications (some could be avoided) Convenient for fine-grained synchronization approaches 1 Root->leaf (top-down) traversal (read-only phase) 1 Root -> Leaf (top-down) traversal 2 Insert the new key 2 Perform necessary modifications on the way to the leaf 3 Leaf -> Root traversal (bottom-up) to fix red-black tree violations 3 When the new key is inserted no leaf->root traversal is necessary [1] T. H. Cormen, C. E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms. The MIT Press, 3rd ed., 2009. [2] R. A. Tarjan, Efficient top-down updating of red-black trees, Tech. Rep. TR-006-85, Department of Computer Science, Princeton University, 1985. Siakavaras et. al cslab@ntua

A naïve HTM red-black tree: bu-cg-htm code: insert operation in bu-cg-htm Bottom-Up external RBT Each operation enclosed in a single HTM transaction Single global lock (SGL) fallback path. int aborts = MAX_TX_RETRIES; AGAIN: while (SGL is taken) ; int status = xbegin(); if (status == TX_BEGIN_SUCCESS) { if (SGL is taken) xabort(0xFF); rbt_insert(...); xend(); } else { if (--aborts > 0) goto AGAIN; acquire_lock(SGL); release_lock(SGL); } Begin a transaction. Return values: TX_BEGIN_SUCCESS: a transaction has just began. != TX_BEGIN_SUCCESS: a previously initialized transaction has aborted. Return value contains information about abort reason. Spin until the fallback lock is released Abort if lock is taken (another thread executes in non-transactional mode) Critical section executed in transactional mode Execute operation in transactional mode Commit transaction Retry in transactional mode Code executed on abort Execute operation in non-transactional mode (SGL fallback path) Siakavaras et. al cslab@ntua

Experimental Platforms Intel Haswell-EP IBM Power8 Processors 2 x Intel Xeon E5-2697 v3 2 x IBM Power8 # Cores 2 x 14 2 x 10 # Threads 56 (2-way SMT) 160 (8-way SMT) Core clock 2.6 GHz 3.7 GHz L1 (Data) 32KB, 8-way, 64B block size 64KB, 8-way, 128B block size L2 256KB, 8-way, 64B block size 512KB, 8-way, 128B block size L3 35MB, 20-way, 64B block size (shared per die) 80MB, 8-way, 128B block size Memory 64 GB (4 NUMA regions) 256 GB (4 NUMA regions) Siakavaras et. al cslab@ntua

Experimental Platforms: HTM characteristics Intel Haswell-EP IBM Power8 Versioning Lazy Progress guarantees Best effort Conflict detection Eager Conflict granularity Cache line Cache line size 64B 128B TX read-set (total / per HW thread) 4MB / 2MB 8KB / 1KB TX write-set 22KB / 11KB Haswell-EP can support larger transactions! Abort reasons Conflict Transactional conflict Non-transactional conflict Capacity Explicit Siakavaras et. al cslab@ntua

Experimental Setup Code HTM Evaluation Benchmarks Execution Written in C, using GCC intrinsics and inline assembly for HTM instructions GCC 4.9.2/4.9.1, -O3 optimization Nodes are padded and aligned to occupy exactly one cache line HTM Evaluation Tree sizes (3 configurations) Medium/Large trees: 2M/20M/100M keys Tree initialized to contain half keys in specified range (e.g. 1M keys in the 2M tree) Operations Workload (%lookups-%insertions-%deletions) Read-intensive: 80-10-10 Read-write: 50-25-25 Write-intensive: 20-40-40 Benchmarks Execution Benchmarks duration is 15 seconds Employ empty physical cores before SMT contexts 10 transactional retries before acquiring SGL Results are the average of 20 independent executions (box plots shown when necessary) Siakavaras et. al cslab@ntua

bu-cg-htm: Performance Haswell-EP Power8 Siakavaras et. al cslab@ntua

Optimizing for haswell-ep Siakavaras et. al cslab@ntua

Haswell-EP: bu-cg-htm throughput Siakavaras et. al cslab@ntua

Haswell-EP: tuning number of retries 2M keys (20-40-40) 10 retries 20 retries Tuning the number of transactional retries is vital to get stable and high performance As more cores are added in future processors this effect will probably become more intense 30 retries 50 retries Siakavaras et. al cslab@ntua

Haswell-EP: Performance Hyperthreading enabled bu-cg-htm scales on Haswell-EP for 2 reasons: Absence of conflicts (in most cases threads modify disjoint subtrees) HW resources (read-/write- set buffers) are large enough (no capacity aborts) for the resulting transactions Generally: transactions manage to commit without serializing on the global lock bu-cg-lock: bottom-up coarse-grained locking td-fg-lock: top-down fine-grained locking td-wf[1]: top-down wait-free (atomic operations, CAS) [1] A. Natarajan, L. H. Savoie, N. Mittal, Concurrent Wait-Free Red Black Trees, SSS 2013 Siakavaras et. al cslab@ntua

Optimizing for POWer8 Siakavaras et. al cslab@ntua

Power8: bu-cg-htm throughput Insertions/deletions are more expensive than lookups hyperthreading enabled performance collapse no hyperthreads very good scalability Siakavaras et. al cslab@ntua

Power8: bu-cg-htm aborts’ breakdown max 10 transactional retries per operation 2M keys (20-40-40) transactional conflicts: close to zero → non-conflicting operations on red-black tree non-transactional conflicts: due to global lock acquisition from other threads capacity: read-/write- set overflows explicit: due to global lock found taken > 20 threads → capacity aborts due to sharing of TM buffers between multiple SMT contexts Capacity aborts → SGL acquisitions → non-transactional conflict aborts Siakavaras et. al cslab@ntua

1st optimization: PCL Fallback SGL Fallback observation: When some thread acquires the global lock all concurrent transactions are aborted. Even those executed on other cores than the one acquiring the lock (which do not share hardware transactional buffers). Core 0 Core 19 TRANSACTIONAL BUFFERS . . . TRANSACTIONAL BUFFERS Thread 1 … 6 7 Thread 1 … 6 7 Idea! Before resorting to SGL try to execute transaction alone on its core i.e. first acquire a PCL (Per-Cpu Lock) that only aborts transactions on the same core x capacity . . . . . . x capacity . . . x capacity x x x acquire(SGL) x x x x Siakavaras et. al cslab@ntua

1st optimization: PCL Fallback Idea! Before resorting to SGL try to execute transaction alone on its core i.e. first acquire a PCL (Per-Cpu Lock) that only aborts siblings Core 0 Core 19 TRANSACTIONAL BUFFERS . . . TRANSACTIONAL BUFFERS Thread 1 … 6 7 Thread 1 … 6 7 x capacity . . . . . . x capacity . . . x capacity x x x acquire(PCL) tx_begin() ... Siakavaras et. al cslab@ntua

PCL Fallback: throughput solid lines: SGL fallback dotted lines: PCL fallback Siakavaras et. al cslab@ntua

PCL Fallback: aborts’ breakdown 2M keys SGL PCL Siakavaras et. al cslab@ntua

PCL Fallback: lock acquisitions Percentage of operations executed serially 20-40-40 50-25-25 80-10-10 Siakavaras et. al cslab@ntua

2nd optimization: Fine-grained transactions PCL leads to underutilization (1 thread per core executes) PCL does not solve the problem of capacity aborts Naively enclosing each RBT operation in an HTM transaction results in large transactions Review example: insert(30) Each transaction in bu-cg-htm encloses 2 phases: Lookup phase Rebalance phase Transactional size of each phase: Lookup: proportional to the average depth of the tree (typically 20-30 levels) Rebalance: ~97% times < 3 levels, ~75% 1 level Idea! Split lookup phase in multiple shorter transactions Siakavaras et. al cslab@ntua

2nd optimization: Fine-grained transactions Splitting lookup phase in multiple transactions is not straightforward What can go wrong? We need a way to inform threads about concurrent modifications. we add a version number on each node 8 Τ2: rotate_right(18) Τ1: lookup(10) 18 15 25 T1 is now on the wrong subtree and will not find key 10 13 16 10 Siakavaras et. al cslab@ntua

2nd optimization: Fine-grained transactions A version number is added on each node Version number is increased when node is modified Fine-grained transactions: Validate the version of current node If current node has changed, abort and restart operation from root Otherwise, move to next node, read its version and commit 8 Τ2: rotate_right(18) Τ1: lookup(10) version = 10 18 15 25 version = 11 When T1 tries to move to the next node, validation of node 18 will fail and T1 will restart from root 13 16 10 Siakavaras et. al cslab@ntua

2nd optimization: Fine-grained transactions Coarse-grained transactions 1 large transaction Fine-grained transactions Multiple short transactions Siakavaras et. al cslab@ntua

Fine-grained transactions: throughput 2nd optimization (fg-tm) 1st optimization (PCL) No optimization Siakavaras et. al cslab@ntua

Fine-grained transactions: aborts/operation #Threads bu-cg-htm bu-fg-htm 2M 20M 100M 20 0.02 0.23 0.27 0.001 0.002 0.007 40 6.2 9.7 9.9 0.003 60 9.8 0.004 0.01 80 0.009 0.008 0.012 120 0.24 0.17 0.11 160 1.15 1.07 0.85 bu-fg-htm manages to keep very low abort rate, independently of tree size Siakavaras et. al cslab@ntua

Conclusions & Future work Programming with HTM might be simple achieving high performance is not hardware limitations need to considered different HTM systems need different software optimizations Future Work Extend evaluation to more data structures / algorithms (e.g. graph algorithms) Evaluation on machines with more physical cores / hardware threads Siakavaras et. al cslab@ntua

Intel Corporation and IBM Hellas for kindly providing the two servers. THANK YOU! QUESTIONS? ACKNOWLEDGMENT Intel Corporation and IBM Hellas for kindly providing the two servers. I-PARTS project of Action ARISTEIA, co-financed by European Union (European Social Fund) and Hellenic national funds through the Operational Program Education and Lifelong Learning (NSRF 2007-2013). Siakavaras et. al cslab@ntua