FASTER: A Concurrent Key-Value Store with In-Place Updates


1 FASTER: A Concurrent Key-Value Store with In-Place Updates
Badrish Chandramouli, Guna Prasaad*, Donald Kossmann, Justin Levandoski, Mike Barnett, James Hunter
*Intern from

2 Introduction
Tremendous growth in data-intensive applications: tracking IoT devices, data center monitoring, streaming, online services, map-reduce, …
State management is a hard problem:
State consists of largely independent objects – devices, users, ads
Overall state is very large and doesn’t fit in memory
Point operations are sufficient
Significant update traffic – e.g., updating a per-device average CPU reading

There has recently been a tremendous growth in data-intensive applications in the cloud. For example, an app may track per-device or per-user information in an Internet of Things setting, perform data center telemetry monitoring, or execute streaming workloads or online services. State management is a hard problem across all these apps. The state usually consists of a large number of largely independent objects, such as per-device average CPU readings. The overall quantity of state is very large and does not fit in the memory of a single machine. Further, per-object operations are usually sufficient, and the workload is highly update-intensive: we may need to frequently update the per-device reading. A common characteristic of such workloads is temporal locality. For instance, consider a search engine platform that tracks per-user stats such as counts of keywords searched for in the last week. Even though billions of users are alive in the system, only a few million may be actively surfing and updating counts at any given instant, and the working set of active users shifts over time.

Temporal locality property:
Search engine maintains per-user stats over the last week
Billions of users “alive”
Only millions actively surfing at a given instant of time

3 Requirements and Current Solutions
Requirements:
Support larger-than-memory data
High throughput for the working set
Optimize for heavy updates
Handle a drifting hot working set
Recoverable after failure
Current solutions, compared against these requirements:
In-memory structures (e.g., hash maps, Masstree)
Persistent key-value stores (e.g., RocksDB)
Caching systems (e.g., Redis)

4 What is FASTER
A latch-free concurrent multi-core hash key-value store
Designed for high performance and scalability across threads
Supports data larger than main memory, plus recovery
Shapes the (dynamic) hot working set in memory
Performance: up to 160 million ops/sec for YCSB variants on a single machine (two sockets, 56 threads)
Exceeds the throughput of pure in-memory systems when the working set fits in memory
FASTER interface: Read, blind Update, and atomic read-modify-write (RMW) – for running aggregates (like count/sum), partial field updates, …
Implemented as an embedded C# component using code generation (see paper for details)
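To make the RMW idea concrete, here is a minimal C++ sketch of what such an interface could look like for a running sum. The KVStore type, its method names, and RecordCpuReading are illustrative assumptions for this transcript, not the actual FASTER (C#) API.

```cpp
// Minimal sketch (not the actual FASTER API): a key-value store exposing
// Read, blind Upsert, and atomic RMW. The RMW callback lets the store run
// aggregates such as a per-device sum in place when the record is mutable.
#include <cstdint>
#include <functional>

struct KVStore {
    virtual bool Read(uint64_t key, uint64_t& value_out) = 0;
    virtual void Upsert(uint64_t key, uint64_t value) = 0;   // blind update
    // Atomically apply `update` to the current value, or store `initial`
    // if the key is absent -- e.g., a running sum or partial field update.
    virtual void Rmw(uint64_t key, uint64_t initial,
                     const std::function<uint64_t(uint64_t)>& update) = 0;
    virtual ~KVStore() = default;
};

// Example: maintain a per-device running sum of CPU readings.
void RecordCpuReading(KVStore& store, uint64_t device_id, uint64_t cpu_reading) {
    store.Rmw(device_id, /*initial=*/cpu_reading,
              [cpu_reading](uint64_t current) { return current + cpu_reading; });
}
```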

5 System Architecture
Key technical contributions:
Threading: epoch protection framework with trigger actions
Indexing: concurrent hash index
Record storage: “hybrid log” record allocator

6 Epoch Protection Basics
System requirement:
Avoid any coordination between threads in the common case
Agree on a mechanism to synchronize on shared system state
Solution: epoch protection
The system maintains a shared counter E (the current epoch), which can be “bumped” by any thread
Each thread keeps a (possibly stale) local epoch counter copied from E
An epoch c is “safe” if all thread-local epochs are greater than c
Epochs are incremented by threads when they need to initiate global coordination
[Figure: timeline of four threads’ local epochs as the current epoch advances from 2 to 5, showing how the safe epoch (1 through 4) trails the minimum thread-local epoch]
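A minimal C++ sketch of this bookkeeping, assuming a fixed-size thread table; the names Refresh, BumpEpoch, and SafeEpoch are illustrative rather than FASTER's actual code.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

constexpr int kMaxThreads = 64;              // fixed-size thread table (assumption)

struct alignas(64) ThreadEpoch {             // one cache line per slot to avoid false sharing
    std::atomic<uint64_t> local{0};          // 0 means "thread not currently registered"
};

std::atomic<uint64_t> g_current_epoch{1};    // the shared counter E
ThreadEpoch g_thread_epochs[kMaxThreads];    // per-thread (possibly stale) copies of E

// Called frequently by each thread: copy E into its own slot; no coordination needed.
void Refresh(int tid) {
    g_thread_epochs[tid].local.store(
        g_current_epoch.load(std::memory_order_acquire), std::memory_order_release);
}

// Any thread may bump the shared epoch; returns the new current epoch.
uint64_t BumpEpoch() {
    return g_current_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
}

// An epoch c is safe if every registered thread's local epoch is greater than c,
// so the safe epoch is one less than the minimum registered local epoch.
uint64_t SafeEpoch() {
    uint64_t min_epoch = std::numeric_limits<uint64_t>::max();
    for (int i = 0; i < kMaxThreads; ++i) {
        uint64_t e = g_thread_epochs[i].local.load(std::memory_order_acquire);
        if (e != 0 && e < min_epoch) min_epoch = e;
    }
    return min_epoch - 1;
}
```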

7 Adding Trigger Actions
Basic idea:
Associate a trigger (function callback) with an epoch bump from c to c+1
The trigger action is executed later, when c becomes safe
This simplifies lazy synchronization in multi-threaded systems
Example: invoke function F() when the (shared) status becomes “active”
A thread updates the shared variable: status = “active”
Then it bumps the current epoch with trigger = “invoke function F()”: BumpEpoch( () => F() );
It is guaranteed that all threads have seen the “active” status before F() is invoked
FASTER uses epochs + triggers extensively (see paper): threads agree to respect global system state at epoch refresh boundaries
Used for garbage collection, non-blocking index resizing, log buffer maintenance, and recovery
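A sketch of how trigger actions could be layered on the epoch counters above, using a simple mutex-protected drain list for clarity; FASTER's real drain list is latch-free, so treat the names and structure here as assumptions.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

uint64_t BumpEpoch();   // bump the shared epoch (from the sketch above)
uint64_t SafeEpoch();   // largest safe epoch (from the sketch above)

struct TriggerAction {
    uint64_t epoch;                  // fires once this epoch becomes safe
    std::function<void()> action;
};

std::mutex g_drain_mutex;
std::vector<TriggerAction> g_drain_list;

// Bump the current epoch and register a callback that runs once the prior
// epoch is safe, i.e. once every thread has refreshed past the bump.
void BumpEpoch(std::function<void()> action) {
    uint64_t prior = BumpEpoch() - 1;
    std::lock_guard<std::mutex> lock(g_drain_mutex);
    g_drain_list.push_back({prior, std::move(action)});
}

// Called opportunistically (e.g., during a refresh): run actions whose epoch is safe.
void DrainTriggers() {
    uint64_t safe = SafeEpoch();
    std::lock_guard<std::mutex> lock(g_drain_mutex);
    for (auto it = g_drain_list.begin(); it != g_drain_list.end();) {
        if (it->epoch <= safe) {
            it->action();
            it = g_drain_list.erase(it);
        } else {
            ++it;
        }
    }
}
```

In the status example, the thread sets status = “active” and then calls the callback-taking BumpEpoch; the action runs only after every thread has refreshed past the bump and therefore seen the new status.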

8 Hash Index in Brief
Array of 2^k hash buckets; each bucket is a 64-byte cache line
Each bucket entry points to a linked list of colliding records in the record log (logical addresses)
All operations are latch-free
See paper for details: a new latch-free insert technique, and latch-free index resizing based on epochs with triggers
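A sketch of a 64-byte bucket along these lines; the split into seven record entries plus an overflow pointer and the exact field widths follow the paper's general description, but should be treated as illustrative here.

```cpp
#include <atomic>
#include <cstdint>

struct HashBucketEntry {
    // Packed into one 64-bit word so it can be updated with a single CAS:
    //   [ tentative:1 | tag:15 | logical address into the record log:48 ]
    // (bit widths are an assumption for illustration)
    std::atomic<uint64_t> control{0};

    static constexpr uint64_t kAddressBits = 48;
    static constexpr uint64_t kAddressMask = (1ULL << kAddressBits) - 1;

    uint64_t Address() const { return control.load() & kAddressMask; }
    bool Tentative() const   { return (control.load() >> 63) != 0; }
};

// One bucket per 64-byte cache line: seven record entries plus an overflow
// pointer to a chained bucket (assumed split).
struct alignas(64) HashBucket {
    HashBucketEntry entries[7];
    std::atomic<uint64_t> overflow_bucket{0};
};

// Holds on common platforms where std::atomic<uint64_t> is 8 bytes.
static_assert(sizeof(HashBucket) == 64, "bucket must fit one cache line");
```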

9 FASTER Record Allocator

10 Strawman: Log-Structured Record Allocator
Create a single global logical address space that spans main memory and storage
Tail pages live in an in-memory circular buffer
Threads allocate records at the tail with an atomic fetch-and-add: a temporal log
Epochs with triggers ensure flush safety without pinning pages
Not scalable:
Contention on the tail of the log
Every update is a copy-on-write
Log growth can stress storage
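A sketch of tail allocation in such a global logical address space; the page size, buffer size, and helper names are assumptions, and page-boundary and flush/close handling via epoch triggers are omitted.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kPageSizeBits = 22;                 // 4 MB pages (assumption)
constexpr uint64_t kPageSize = 1ULL << kPageSizeBits;
constexpr uint64_t kBufferPages = 64;                  // in-memory circular buffer of tail pages

std::atomic<uint64_t> g_tail_address{0};               // next free logical address
uint8_t* g_buffer_pages[kBufferPages];                 // backing frames for in-memory pages

// Allocate `size` bytes in the log; a single fetch-and-add hands out addresses.
uint64_t Allocate(uint32_t size) {
    return g_tail_address.fetch_add(size, std::memory_order_acq_rel);
}

// Translate an in-memory logical address into a pointer within the circular buffer.
// (Records that straddle a page boundary are not handled in this sketch.)
uint8_t* MemoryPointer(uint64_t logical_address) {
    uint64_t page = (logical_address >> kPageSizeBits) % kBufferPages;
    uint64_t offset = logical_address & (kPageSize - 1);
    return g_buffer_pages[page] + offset;
}
```

The single fetch-and-add on g_tail_address is exactly the contention point the slide calls out: every update from every thread funnels through the same cache line.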

11 Hybrid Log Allocator
Divide the log’s address space into three regions:
Stable (on disk) → read-copy-update
Read-only (in memory) → read-copy-update
Mutable (in memory) → in-place update
As the tail grows, the offsets also move forward accordingly
Basic algorithm – lifecycle of a record (see the sketch below):

Logical Address        Operation
< Head Offset          Issue async IO request
< ReadOnly Offset      Copy to tail, update hash table
< Infinity             Update in place
New record             Add to tail, update hash table
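The lifecycle table translates almost directly into a dispatch on the record's logical address; this C++ sketch uses illustrative offset and helper names rather than FASTER's actual code.

```cpp
#include <atomic>
#include <cstdint>

extern std::atomic<uint64_t> g_head_offset;       // addresses below this are on disk
extern std::atomic<uint64_t> g_read_only_offset;  // [head, read-only): in memory, read-only
                                                  // [read-only, tail): in memory, mutable

enum class Action { IssueAsyncIO, CopyToTail, UpdateInPlace, AppendNew };

// Decide how to apply an update, given the record's current logical address.
Action DecideUpdate(uint64_t logical_address, bool record_exists) {
    if (!record_exists) return Action::AppendNew;                    // add to tail, update hash table
    if (logical_address < g_head_offset.load()) return Action::IssueAsyncIO;
    if (logical_address < g_read_only_offset.load()) return Action::CopyToTail;
    return Action::UpdateInPlace;                                    // address < infinity
}
```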

12 Lost Update Anomaly
Threads read the offsets at epoch boundaries; we don’t want to take a lock on the offsets
Problem in a multi-threaded setting: an update of the ReadOnly Offset is not seen by all threads
Mutable for one thread == read-only for another thread
Example:
Thread 1 sees the old ReadOnly Offset = R1 and does an in-place update
Thread 2 sees the new ReadOnly Offset = R2 and does a read-copy-update
The update by Thread 1 is lost!

13 Solution: Fuzzy Region
Fuzzy region: a region of memory whose mutability status is not agreed upon by all threads
Safe ReadOnly Offset: tracks the ReadOnly Offset seen by all threads; updated using an epoch trigger action:
ReadOnlyOffset = K; BumpEpoch( () => { SafeReadOnlyOffset = K } );
Updated RMW algorithm (see the sketch below):

Logical Address             Operation
< Head Offset               Issue async IO request
< Safe ReadOnly Offset      Copy to tail, update hash table
< ReadOnly Offset           Go pending
< Infinity                  Update in place
New record                  Add to tail, update hash table
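A sketch of the updated decision logic and of shifting the read-only boundary with an epoch trigger; the offsets, enum values, and ShiftReadOnlyOffset are illustrative assumptions.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

void BumpEpoch(std::function<void()> action);          // from the trigger-action sketch

extern std::atomic<uint64_t> g_head_offset;            // below this: on disk
extern std::atomic<uint64_t> g_safe_read_only_offset;  // boundary seen by ALL threads
extern std::atomic<uint64_t> g_read_only_offset;       // newest boundary, possibly not yet seen by all

enum class RmwAction { IssueAsyncIO, CopyToTail, GoPending, UpdateInPlace, AppendNew };

RmwAction DecideRmw(uint64_t logical_address, bool record_exists) {
    if (!record_exists) return RmwAction::AppendNew;                 // add to tail, update hash table
    if (logical_address < g_head_offset.load()) return RmwAction::IssueAsyncIO;
    if (logical_address < g_safe_read_only_offset.load()) return RmwAction::CopyToTail;
    if (logical_address < g_read_only_offset.load()) return RmwAction::GoPending;   // fuzzy region
    return RmwAction::UpdateInPlace;                                 // mutable region
}

// Shifting the read-only boundary, as on the slide: publish the new offset,
// then raise the safe offset only after every thread has observed it.
void ShiftReadOnlyOffset(uint64_t k) {
    g_read_only_offset.store(k);
    BumpEpoch([k] { g_safe_read_only_offset.store(k); });
}
```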

14 Other Details - See Paper
Natural caching behavior of the hybrid log: captures temporal locality
Sizing the hybrid log regions of memory: mutable vs. read-only region sizes
Recovery and consistency
Temporal analytics on the log
Code generation and language integration
Garbage collection and read-hot record handling

15 Evaluation

16 Setup and Workload
Machine: Dell PowerEdge R730 server
2 sockets, 14 cores per socket, 2 hyper-threads per core
256 GB RAM, 3.2 TB FusionIO NVMe SSD
Modified YCSB-A workload
250 million distinct 8-byte keys; values of 8 and 100 bytes
Varying fraction of reads, blind updates, and read-modify-writes
Baseline systems
In-memory structures: Intel TBB hash map, Masstree
Key-value store: RocksDB
Caching system: Redis

17 Throughput: Single and Multi Threaded
[Charts: single-threaded and multi-threaded throughput]

18 Scalability with # Threads
[Charts: scalability with the number of threads, for 100% RMW with 8-byte payloads and 100% blind updates with 100-byte payloads]

19 Throughput with Increasing Memory Budget
[Chart: 27 GB dataset]

20 Conclusions
FASTER is a high-performance concurrent multi-core hash key-value store
It shows that a single design can “have it all”:
Handle larger-than-memory data with heavy updates
Exploit temporal locality and a drifting working set in the workload
Achieve “bare metal” throughput exceeding pure in-memory structures (up to 160 million ops/sec) when the working set fits in memory
Degrade gracefully when memory is limited
Recover to a (checkpointed) consistent point after failure

21 Thank You

