The Single Node B-tree for Highly Concurrent Distributed Data Structures by Barbara Hohlt 11/23/2018
Why a B-tree DDS?
  To do range queries (the queries need NOT be degree-3 transaction protected).
  Need only sequential scans for related indexed items (retrieve mail messages 3-50, etc.).
  Performance impact illustrated later.
Prototype DDS: Distributed B-tree
  [Diagram labels: clients, WAN, service with DDS lib, SAN, storage “brick”.]
  Clients interact with any service “front-end”, as all persistent service state is in the DDS and is consistent throughout the entire cluster.
  The service interacts with the DDS via a library; the library is the 2PC coordinator, handles partitions, replication, etc., and exports the B-tree API.
  A “brick” is a durable single-node data structure (hashtable, btree)…
  Here a “brick” is a durable single-node B-tree plus RPC skels for network access; a brick can be on the same node as the service.
  Example of a distributed B-tree partition with 3 replicas in a group.
Architecture
  [Diagram labels: clients, WAN, Service (Worker) with DDS lib, SAN, storage “brick”, Pull, Event.]
  Clients interact with any service “front-end” [all persistent state is in the DDS and is consistent across the cluster].
  The service interacts with the DDS via a library [the library is the 2PC coordinator, handles partitioning, replication, etc., and exports the B-Tree + HT API].
  A “brick” is a durable single-node B-Tree or HT plus RPC skeletons for network access.
  Example of a distributed DDS partition with 3 replicas in a group.
Single-node hashtable or btree…
Component Layers
  [Diagram labels: Application; Distributed Btrees; Single-Node Btrees; Buffer Cache; asynchronous I/O Core with “sinks and sources” (TCP network, VIA, file system storage, raw disk); queued requests; queued completions.]
  The application layer makes “search” and “insert” requests to a btree instance.
  The btree determines what data blocks it needs and fetches them from the global buffer cache.
  If the cache does not have the needed blocks, it fetches them from the global I/O core, which is transparent to the btree instance.
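The fetch path described on this slide can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: the class names IOCore and BufferCache, their methods, and the in-memory dict standing in for storage are all assumptions made for the example.

```python
from collections import deque

class IOCore:
    """Hypothetical asynchronous I/O core: requests go in, completions come out."""
    def __init__(self, disk):
        self.disk = disk             # block_id -> bytes; stands in for real storage
        self.requests = deque()      # queued requests
        self.completions = deque()   # queued completions

    def submit(self, block_id):
        self.requests.append(block_id)

    def poll(self):
        # Service every queued request and queue one completion for each.
        while self.requests:
            block_id = self.requests.popleft()
            self.completions.append((block_id, self.disk[block_id]))

class BufferCache:
    """Hypothetical global buffer cache sitting between btrees and the I/O core."""
    def __init__(self, io_core):
        self.io = io_core
        self.blocks = {}             # blocks already in memory
        self.waiting = {}            # block_id -> callbacks waiting for that block

    def fetch(self, block_id, on_ready):
        if block_id in self.blocks:              # hit: deliver immediately
            on_ready(self.blocks[block_id])
            return
        first_waiter = block_id not in self.waiting
        self.waiting.setdefault(block_id, []).append(on_ready)
        if first_waiter:                         # miss: one I/O request per block
            self.io.submit(block_id)

    def drain(self):
        # Move completed reads into the cache and wake the waiting callers.
        self.io.poll()
        while self.io.completions:
            block_id, data = self.io.completions.popleft()
            self.blocks[block_id] = data
            for callback in self.waiting.pop(block_id, []):
                callback(data)

io = IOCore(disk={7: b"leaf-7"})
cache = BufferCache(io)
seen = []
cache.fetch(7, seen.append)   # miss: request is queued, nothing delivered yet
pending = len(seen)
cache.drain()                 # completion arrives, callback fires
cache.fetch(7, seen.append)   # hit: served from the cache
```

The caller only ever talks to the cache; whether a block came from memory or from the I/O core is invisible to it, which is the transparency the slide describes.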
API Flavor
  SN_BtreeCloseRequest, SN_BtreeCloseComplete
  SN_BtreeCreateRequest, SN_BtreeCreateComplete
  SN_BtreeOpenRequest, SN_BtreeOpenComplete
  SN_BtreeDestroyRequest, SN_BtreeDestroyComplete
  SN_BtreeReadRequest, SN_BtreeReadComplete
  SN_BtreeWriteRequest, SN_BtreeWriteComplete
  SN_BtreeRemoveRequest, SN_BtreeRemoveComplete
API Flavor, Contd.
  Distributed_BtreeCreateRequest, Distributed_BtreeCreateComplete
  Distributed_BtreeDestroyRequest, Distributed_BtreeDestroyComplete
  Distributed_BtreeReadRequest, Distributed_BtreeReadComplete
  …
  Errors: timeout (even after retries), replica_dead, lockgrab_failed, doesn’t exist, etc.
Evaluation Metrics
  Speedup: performance versus resources (data size fixed)
  Scaleup: data size versus resources (fixed performance)
  Sizeup: performance versus data size
  Throughput: total number of reads/writes completed per second
  Latency: time to satisfy a single request
Single Node B-tree Performance
  [Figure; axis labels: Btrees, Megabits per second.]
Single Node B-tree Performance
FSM-based Data Scheduling
  Scheduling is for:
    Performance (including fairness, avoiding starvation)
    Correctness/isolation
  This functionality has traditionally resided in two different modules (the kernel schedules threads; the app/database schedules locks), and each module is optimized individually.
  Our claim is that there can be significant performance wins from jointly optimizing both.
How to Achieve Isolation?
  Use threads and locks
  Do careful scheduling (e.g. B-trees)
  Unify all scheduling decisions
  Problem is: such globally optimal scheduling is hard
  In restricted settings, similar to hardware scoreboarding techniques
  A useful lesson for database concurrency:
    You can choose the order of operations to avoid conflicts (have a prepare/prefetch phase) and to avoid locking across blocking I/O (Lesson: do not lock if you block)
    This can be implemented more naturally with asynchronous FSMs than with straight-line threaded code
Benefits of Using FSMs+events for Concurrency Control
  Control-flow based concurrency control, as opposed to lock-based concurrency control
  Can avoid wrong scheduling decisions
  Unnecessary locks can be eliminated
  “Locks” can be released faster
  More flexibility for concurrency control based on isolation requirements
  Explicit concurrency control also avoids deadlocks, priority inversions, race conditions, and convoy formations
  [Diagram labels: tasks T1, T2; blocks b1, b2.]
Benefits of using FSMs+Queues for concurrency control
  Control-flow based concurrency control using FSMs and queues, as opposed to lock-based concurrency control
  Can avoid wrong scheduling decisions
  Unnecessary locks can be eliminated
  “Locks” can be released faster
  More flexibility for concurrency control based on isolation requirements
  Explicit scheduling also avoids deadlocks, priority inversions, race conditions, and convoy formations
  [Diagram labels: tasks T1, T2; blocks b1, b2.]
The Convoy Problem Illustrated
  Most tasks execute code like: lock(b); read(b); lock(b->next); unlock(b); …
  Problem is: if task T1 blocks on I/O for b4, then task T2 cannot acquire a lock on b4 and so keeps b3 locked, task T3 cannot acquire a lock on b3 and so keeps b2 locked, and so on, forming a convoy even though most blocks are in cache and each task may require only a finite number of locks.
  [Diagram: convoy over blocks b1 to b4]
    b1: locked by T4, waiting for lock on b2
    b2: locked by T3, waiting for lock on b3
    b3: locked by T2, waiting for lock on b4
    b4: locked and blocked on I/O by T1
Scheduling Based on Data Availability
  Two transactions, T1 and T2, request blocks b1, b2 and b1, b3 respectively, and T1 acquires the lock on b1 first.
  Problem is: if T1 acquires a lock on b2 and blocks, T2 cannot make progress, even though T2 can access both b1 and b3.
  Lesson: schedule according to how data becomes available, not how requests enter the system.
  [Diagram: timeline. b1: locked by T1; b2: locked and blocked on I/O by T1; b3: ready; T2 blocked by T1.]
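The lesson on this slide can be illustrated with a small sketch. The function name schedule and its arguments are hypothetical, chosen only for this example; it assumes each request declares the blocks it needs up front and that the set of cached blocks is known.

```python
def schedule(transactions, cached):
    """Pick what to run by data availability rather than arrival order.

    transactions: list of (name, [block ids]) in arrival order.
    cached: set of block ids already in memory.
    Returns (run_now, waiting), where waiting lists the missing blocks.
    """
    run_now, waiting = [], []
    for name, blocks in transactions:
        missing = [b for b in blocks if b not in cached]
        if missing:
            waiting.append((name, missing))   # start I/O, hold no locks meanwhile
        else:
            run_now.append(name)              # every block is ready: safe to run
    return run_now, waiting

# T1 arrived first but needs b2, which is not in memory; T2 needs only ready blocks.
arrivals = [("T1", ["b1", "b2"]), ("T2", ["b1", "b3"])]
run_now, waiting = schedule(arrivals, cached={"b1", "b3"})
```

Here T2 runs immediately while T1 waits for b2 without holding b1, so the later arrival is not stalled behind the earlier one.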
Scheduling Based on Data Availability (Example of Misordering)
  Transferring funds from checking to savings:
    Begin(transaction)
    1: read(checking_account)
    2: read(savings_account)
    3: read(teller) // in cache
    4: read(bank) // in cache
    5: update(savings_account)
    6: update(checking_account)
    7: update(teller)
    8: update(bank)
    End(transaction)
  If steps 3 and 4 were swapped with 1 and 2, we would be blocking while holding locks on the bank and teller balances.
  In a global scheduling model, the ordering of reads does not matter, because a request does not start execution unless all the required data in the most probable execution path is available.
Distributed Synchronization
  [Diagram labels: P1, T2, P2, b2, T3, P3, T4, P4.]
  Conventional lock-based implementations serialize the lock manager code. In the example diagram, T1 serializes against T3, although T1 and T3 should ideally execute concurrently.
  Distributed synchronization on distinct queues is possible in FSMs running on multiprocessors, without requiring static data partitioning.
Single Node Btree “Brick”
  [Diagram labels: btree “instance”, requests, completions, completion queues, global event queue, global buffer cache.]
  Btree requests are queued in the global event queue.
  Request completions are queued in the individual btree completion queues.
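The queue arrangement on this slide can be sketched as below. The Brick class, its method names, and the plain dict standing in for each btree instance are assumptions made for illustration; they are not the SN_Btree API.

```python
from collections import deque

class Brick:
    """Hypothetical brick: one global event queue, one completion queue per btree."""
    def __init__(self):
        self.event_queue = deque()   # requests for every btree instance
        self.trees = {}              # name -> dict standing in for a btree
        self.completions = {}        # name -> that btree's completion queue

    def open(self, name):
        self.trees[name] = {}
        self.completions[name] = deque()

    def submit(self, name, op, key, value=None):
        # Callers never touch a btree directly; they only enqueue a request.
        self.event_queue.append((name, op, key, value))

    def run(self):
        # Single event loop: take each request off the global queue and post
        # its result to the completion queue of the btree it addressed.
        while self.event_queue:
            name, op, key, value = self.event_queue.popleft()
            tree = self.trees[name]
            if op == "write":
                tree[key] = value
                result = True
            elif op == "read":
                result = tree.get(key)
            else:
                raise ValueError(f"unknown operation: {op}")
            self.completions[name].append((op, key, result))

brick = Brick()
brick.open("mail")
brick.open("users")
brick.submit("mail", "write", 48, "msg-48")
brick.submit("users", "read", 1)
brick.submit("mail", "read", 48)
brick.run()
mail_done = list(brick.completions["mail"])
users_done = list(brick.completions["users"])
```

Requests from both btrees interleave in one queue, while each btree's client reads only its own completions.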
FSM for Non-blocking Fetch
  [State diagram. States: start, moving down, moving right, stop. Transition labels: “has descendants”; “is leaf”; “key > highkey”; “key <= highkey && not leaf”; “key <= highkey && is leaf”.]
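One plausible reading of this state machine is a search over a tree whose nodes carry a high key and a link to their right sibling. The sketch below follows that reading; the Node layout, the fetch function, and the exact wiring of transitions are assumptions, since the slide only gives the state and transition labels. Cache misses are left out: each loop iteration is one transition, which is the point at which an event-driven version could suspend and resume.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    keys: List[int]
    highkey: int                                  # largest key this node may hold
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    right: Optional["Node"] = None                # link to the right sibling

    @property
    def is_leaf(self) -> bool:
        return not self.children

def fetch(root, key):
    """Walk to the leaf responsible for key; return (leaf, states visited).

    Assumes key <= highkey of the rightmost node on every level, and that in
    an index node keys[i] is the high key of children[i].
    """
    state, node, trace = "start", root, ["start"]
    while state != "stop":
        if key > node.highkey:                    # key lives further right
            state, node = "moving right", node.right
        elif node.is_leaf:                        # key <= highkey && is leaf
            state = "stop"
        else:                                     # key <= highkey && not leaf
            state = "moving down"
            i = next(i for i, k in enumerate(node.keys) if key <= k)
            node = node.children[i]
        trace.append(state)
    return node, trace

# A leaf was split into a_new and b_new, but the root has not been updated yet,
# so its first child pointer still leads to a_new for every key <= 40.
c = Node(keys=[47, 62, 99], highkey=99)
b_new = Node(keys=[36, 40], highkey=40, right=c)
a_new = Node(keys=[25, 35], highkey=35, right=b_new)
root = Node(keys=[40, 99], highkey=99, children=[a_new, c])

leaf, trace = fetch(root, 36)
```

With the stale root, the search for 36 lands on a_new, sees that 36 exceeds its high key, and follows the right link to b_new instead of failing or taking a lock.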
Splitting node a into nodes a’ and b’
  [Diagram panels (c) and (d); node labels: f, c, a, a’, b’, f’.]
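A leaf split of this kind can be sketched as follows, assuming nodes carry a high key and a right-sibling link. The Leaf class and split_leaf function are hypothetical names for this example, and the update of the parent (f becoming f’ in the diagram) is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Leaf:
    keys: List[int]
    highkey: int                       # largest key this leaf may hold
    right: Optional["Leaf"] = None     # link to the right sibling

def split_leaf(a):
    """Split a in place into a' (the same object) and a new right sibling b'."""
    mid = len(a.keys) // 2
    # Build b' first, inheriting a's old high key and right link, so that b'
    # is complete before anything points at it.
    b = Leaf(keys=a.keys[mid:], highkey=a.highkey, right=a.right)
    # Then shrink a into a' and link it to b'.
    a.keys = a.keys[:mid]
    a.highkey = a.keys[-1]
    a.right = b
    return b

c = Leaf(keys=[47, 62, 99], highkey=99)
a = Leaf(keys=[25, 35, 36, 40], highkey=40, right=c)
b = split_leaf(a)
```

Until the parent learns about b’, a reader arriving at a’ with a key above its new high key can still reach b’ through the right link.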
A Single Node B-tree
  [Figure: example B-tree with a meta data block, nodes holding keys between 25 and 99, and data entries such as “Key: 48 <values>” and “Key: 51”.]
[Figure: leaf node layout: P0 K0 . . . K2k+1 P2k+1, with a blink field.]