Getting to the root of concurrent binary search tree performance


Maya Arbel-Raviv (Technion), Trevor Brown (IST Austria), Adam Morrison (Tel Aviv University)

Motivation
Optimistic concurrent search trees are crucial in many applications (e.g., in-memory databases, internet routers, and operating systems). We want to understand their performance. We study BSTs because there are many variants, which makes comparisons easier.

BST protected by a global lock
Synthetic experiment: insert 100,000 keys, then n threads perform searches.
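For concreteness, a minimal sketch of this approach in C (the type and function names are ours, not the talk's):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct node { int key; struct node *left, *right; } node_t;
    typedef struct { node_t *root; pthread_mutex_t lock; } tree_t;

    /* Every search acquires the one tree-wide mutex: simple and correct,
     * but all threads serialize, and each acquisition writes the lock's
     * cache line. */
    bool bst_contains(tree_t *t, int key) {
        pthread_mutex_lock(&t->lock);
        node_t *n = t->root;
        while (n != NULL && n->key != key)
            n = (key < n->key) ? n->left : n->right;
        pthread_mutex_unlock(&t->lock);
        return n != NULL;
    }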

Hand-over-hand locking (HOH)
The same experiment, but with hand-over-hand locking (one lock per node) instead of a global lock.
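A sketch of hand-over-hand (lock coupling) search, assuming one mutex per node (illustrative names, not the talk's code). Note that the search now writes a lock line at every node it visits:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct hnode {
        int key;
        struct hnode *left, *right;
        pthread_mutex_t lock;           /* one lock per node */
    } hnode_t;

    /* Hold at most two locks at a time: acquire the child's lock before
     * releasing the parent's, so no updater can slip in between. */
    bool hoh_contains(hnode_t *root, int key) {
        if (root == NULL) return false;
        pthread_mutex_lock(&root->lock);
        hnode_t *cur = root;
        while (cur->key != key) {
            hnode_t *next = (key < cur->key) ? cur->left : cur->right;
            if (next == NULL) { pthread_mutex_unlock(&cur->lock); return false; }
            pthread_mutex_lock(&next->lock);    /* lock child ...          */
            pthread_mutex_unlock(&cur->lock);   /* ... then release parent */
            cur = next;
        }
        pthread_mutex_unlock(&cur->lock);
        return true;
    }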

Locking and cache coherence
Acquiring a lock writes to its cache line, which evicts that line from every other thread's cache. Hand-over-hand locking writes a lock in every node it visits, so it misses even more than the global lock:

    Algorithm     L2 misses / search   L3 misses / search
    Global lock   15.9                 3.9
    HOH           25.1                 7.7

Achieving high performance
No locking while searching! Example: the unbalanced DGT BST performs a standard BST search with no synchronization.
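The search is then plain pointer chasing. A minimal sketch (our naming; whether unsynchronized reads are safe depends on the tree's update protocol and on correct use of volatile, discussed below):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct dnode {
        int key;
        struct dnode *volatile left, *volatile right;
    } dnode_t;

    /* No locks, no version checks: just read child pointers and descend. */
    bool search(dnode_t *root, int key) {
        dnode_t *n = root;
        while (n != NULL && n->key != key)
            n = (key < n->key) ? n->left : n->right;
        return n != NULL;
    }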

State of the art BSTs
Not covered: lock-free balanced BSTs.

    Algorithm   Progress    Tree type   Search overhead
    BCCO        Blocking    Internal*   Read per node
    DVY         Blocking    Internal    Write per search
    AA          Blocking    Internal    Write per search
    DGT         Blocking    External    None
    HJ          Lock-free   Internal    Read per node
    RM          Lock-free   Internal    Read per search
    NM          Lock-free   External    None
    EFRB        Lock-free   External    None

* Technically, these are partially external, but when trees are prefilled with 100% insertions, as in the workloads we consider, they are actually internal.

How do they perform?
Experiment: 100% searches with 64 threads.
[Throughput chart comparing BCCO1, BCCO2, HJ, DVY, RM, AA, NM1, NM2, DGT, and EFRB, grouped by tree type (balanced, internal, external) and annotated with each algorithm's search overhead (read per node, read per search, write per search).]

Basic implementation issues
- Bloated nodes. Why does node size matter? Larger nodes mean fewer nodes fit in the cache, which means more cache misses.
- Scattered fields. Why does node layout matter? Searches may access only a few fields; scattered fields mean more cache lines touched, which means more cache misses.
- Incorrect usage of C volatile. Missing volatiles are a correctness issue; unnecessary volatiles are a performance issue.
(See the sketch after this list.)
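A sketch of what these issues look like in C, with invented field names:

    /* Bloated, scattered layout: the inline payload bloats the node and
     * pushes the child pointers that every search needs onto a different
     * cache line. */
    struct bad_node {
        long key;
        char payload[64];               /* rarely read during searches */
        struct bad_node *left, *right;
    };

    /* Compact layout: search-relevant fields first, cold data behind a
     * pointer, and volatile only on the mutable pointers so the compiler
     * re-reads them (missing volatile: correctness bug; volatile on
     * immutable fields like key: needless performance cost). */
    struct good_node {
        long key;
        struct good_node *volatile left;
        struct good_node *volatile right;
        char *payload;
    };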

Impact of fixing these issues
[Chart: search performance of the same algorithms after fixing bloated nodes, scattered fields, and incorrect C volatile usage, with the same groupings and search-overhead annotations as above.]

How a fast allocator works
jemalloc: threads allocate from private arenas. Each thread has an arena for each size class: 8, 16, 32, 48, 64, 80, 96, 112, 128, 192, 256, 320, 384, 448, 512, … bytes. A request is rounded up to the nearest class, so malloc(33) through malloc(48) (e.g., malloc(36), malloc(40), malloc(44)) all come from the 48-byte arena, while malloc(17) through malloc(32) come from the 32-byte arena. Each arena packs its objects back to back into 4096-byte pages, e.g., at offsets 0, 48, 96, 144, …, 4048 for the 48-byte class.
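A toy model of this rounding behaviour, using the size classes listed above (a sketch of what jemalloc does, not its code):

    #include <stdio.h>
    #include <stddef.h>

    static const size_t classes[] = {8, 16, 32, 48, 64, 80, 96, 112, 128,
                                     192, 256, 320, 384, 448, 512};

    /* Round a request up to its size class. */
    static size_t size_class(size_t req) {
        for (size_t i = 0; i < sizeof classes / sizeof classes[0]; i++)
            if (req <= classes[i]) return classes[i];
        return req;  /* larger requests are handled by other mechanisms */
    }

    int main(void) {
        printf("malloc(17) -> %zu-byte class\n", size_class(17));  /* 32 */
        printf("malloc(36) -> %zu-byte class\n", size_class(36));  /* 48 */
        printf("malloc(48) -> %zu-byte class\n", size_class(48));  /* 48 */
        return 0;
    }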

Cache line crossings
With 48-byte objects packed at offsets 0, 48, 96, 144, … within a page, half of the objects straddle a 64-byte cache line boundary. So fixing bloated nodes (e.g., shrinking 64-byte nodes into the 48-byte class) can worsen performance! This is not a big deal if the tree fits in the cache; if it does not, you double the cache misses for half of your nodes.
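The "half of your nodes" claim can be checked by walking a page of 48-byte slots and counting boundary crossings (a self-contained sketch):

    #include <stdio.h>

    int main(void) {
        unsigned crossings = 0, total = 0;
        /* 48-byte objects packed back to back in a 4096-byte page */
        for (unsigned off = 0; off + 48 <= 4096; off += 48, total++)
            if (off / 64 != (off + 47) / 64)  /* first and last byte land
                                                 on different cache lines */
                crossings++;
        printf("%u of %u objects cross a cache line\n", crossings, total);
        return 0;  /* prints 42 of 85: roughly half */
    }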

Cache sets
A cache is sort of like a hash table: it maps addresses to buckets, called cache sets (4096 of them on our machine), and each bucket holds at most c elements (c = 64 here, the associativity). The hash function takes a cache line's physical address mod 4096. If you load an address that maps to a full bucket, some cache line is evicted from that bucket. "Mod 4096" is not a good hash function: patterns in allocation can lead to patterns in bucket occupancy.
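In code, this mapping looks roughly like the following (a sketch assuming 64-byte lines and 4096 sets, and glossing over virtual-to-physical translation):

    #include <stdint.h>

    /* Cache set of an address: line number (addr / 64) mod 4096. */
    static inline unsigned cache_set(uintptr_t addr) {
        return (unsigned)((addr >> 6) & 4095);
    }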

Cache set usage in the HJ BST
Insert creates a node and a descriptor (to facilitate lock-free helping). Node size class: 64 bytes; descriptor size class: 64 bytes. Which cache sets will the nodes map to? Because nodes and descriptors alternate within the arena's pages, nodes occupy only even-numbered cache line indexes (0, 2, 4, …). Taken modulo 4096, these map only to even-numbered cache sets: only half of the cache can be used to store nodes!
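A small simulation of the alternating node/descriptor pattern makes the problem visible (addresses are illustrative offsets within the arena):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* node (64 B), descriptor (64 B), node, descriptor, ... so the
         * i-th node starts at byte 128 * i */
        for (unsigned i = 0; i < 6; i++) {
            uintptr_t node_addr = (uintptr_t)i * 128;
            unsigned set = (unsigned)((node_addr >> 6) & 4095);
            printf("node %u -> cache set %u (%s)\n", i, set,
                   (set % 2 == 0) ? "even" : "odd");  /* always even */
        }
        return 0;
    }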

Simple fix: random allocations
Hypothesis: the problem is the rigid even/odd allocation behaviour. Idea: break the pattern with an occasional dummy 64-byte allocation. This fixes the problem! It reduces unused cache sets to 1.6% and improves search performance by 41% … on our first experimental system, which was an AMD machine. However, on an Intel system, this did not improve search performance!
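The fix can be sketched as a wrapper around node allocation; the thread-local counter and the every-100th threshold are our illustration, not the paper's exact mechanism:

    #include <stdlib.h>

    /* Occasionally make a dummy 64-byte allocation so that subsequent
     * node/descriptor pairs shift by one cache line, breaking the
     * all-nodes-on-even-lines pattern.  The dummy is retained (leaked
     * here, for brevity) so the shift persists.  __thread is the
     * GCC/Clang thread-local extension. */
    void *alloc_node_with_dummy(void) {
        static __thread unsigned count = 0;
        if (++count % 100 == 0)
            (void)malloc(64);   /* dummy: shifts later allocations */
        return malloc(64);      /* the real node */
    }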

The difference between systems
Intel processors prefetch more aggressively. The adjacent line prefetcher loads one extra adjacent cache line, and not always the next one (it can be the previous one). In effect, the smallest unit of memory loaded is 128 bytes (two cache lines), and this is also the unit of memory contention.

Effect of prefetching on the HJ BST
The occasional dummy allocations break up the even/odd pattern. But whenever a search loads a node, it also loads the adjacent cache line, which holds a descriptor or a dummy allocation: useless for the search. So only half of the cache is used for nodes. Fix: add padding to nodes or descriptors so they are in different size classes (a sketch follows below).
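A sketch of the padding fix, with invented descriptor fields; any size class other than the nodes' class works, since different classes live in different arenas and pages:

    /* Nodes stay in the 64-byte class; pad descriptors into the next
     * class (80 bytes here), so the two object types come from different
     * arenas and a node's adjacent cache line holds another node. */
    struct descriptor {
        void *fields[8];                    /* 64 bytes of real state */
        char pad[80 - 8 * sizeof(void *)];  /* pushes sizeof to 80 */
    };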

Segregating memory for different object types
1. The previously described solution: add padding so objects have different size classes.
2. A more principled solution: use multiple instances of jemalloc; each instance has its own arenas, and different object types are allocated from different instances.
3. An even better solution: use an allocator with support for segregating object types (see the sketch below).
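As a sketch of the third option, jemalloc's non-standard mallctl/mallocx API can create explicit arenas and route each object type to its own arena; verify the control names against your jemalloc version:

    #include <jemalloc/jemalloc.h>
    #include <stddef.h>

    static unsigned node_arena, desc_arena;

    /* Create one explicit arena per object type (jemalloc >= 5 exposes
     * "arenas.create"; older versions called it "arenas.extend"). */
    void init_arenas(void) {
        size_t sz = sizeof(unsigned);
        mallctl("arenas.create", &node_arena, &sz, NULL, 0);
        mallctl("arenas.create", &desc_arena, &sz, NULL, 0);
    }

    /* Route each object type to its own arena, hence its own pages. */
    void *alloc_node(size_t n) { return mallocx(n, MALLOCX_ARENA(node_arena)); }
    void *alloc_desc(size_t n) { return mallocx(n, MALLOCX_ARENA(desc_arena)); }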

Performance after all fixes
[Chart: search performance of all algorithms after all fixes, with the same groupings and search-overhead annotations as above; the remaining differences track node size class (32-byte vs. 48-byte nodes).]

Application benchmarks
We used the in-memory database DBx1000 [Yu et al., VLDB 2014] with the Yahoo! Cloud Serving Benchmark (YCSB) and the TPC-C database benchmark, using each BST as a database index. This merges the memory spaces of DBx1000 and the BST! Merging creates similar memory layout issues (e.g., underutilized cache sets), new ones (e.g., scattering of nodes across many pages), and even accidentally fixes some memory layout issues in DBx1000. See the paper for details.

Recommendations
When designing and testing a data structure, understand your memory layout! How are nodes laid out in cache lines? In pages? What types of objects are near one another?
When adding a data structure to a program, you are merging two memory spaces: understand your new memory layout! Are new tools needed?
http://tbrown.pro: tutorials, code, benchmarks.
Companion talk: Good data structure experiments are R.A.R.E.