The Single Node B-tree for Highly Concurrent Distributed Data Structures by Barbara Hohlt 11/23/2018.


Why a B-tree DDS?
To support range queries (the queries need not be degree-3 transaction protected). Range queries need only sequential scans over related indexed items (e.g., retrieve mail messages 3-50). The performance impact is illustrated later.

Prototype DDS: Distributed B-tree
Clients interact with any service "front-end" over the WAN, as all persistent service state is in the DDS and is consistent throughout the entire cluster.
A service interacts with the DDS via a library; the library is the 2PC coordinator, handles partitions, replication, etc., and exports the B-tree API.
A "brick" is a durable single-node data structure (hashtable, btree)… Here a "brick" is a durable single-node B-tree plus RPC skeletons for network access; a brick can be on the same node as a service.
(Diagram: clients, WAN, services each with a DDS lib, SAN, storage "bricks"; includes an example of a distributed B-tree partition with 3 replicas in its group.)

Architecture
Clients interact with any service "front-end" [all persistent state is in the DDS and is consistent across the cluster].
A service interacts with the DDS via a library [the library is the 2PC coordinator, handles partitioning, replication, etc., and exports the B-Tree + HT API].
A "brick" is a durable single-node B-Tree or HT plus RPC skeletons for network access.
(Diagram labels: WAN, Service (Worker), DDS lib, SAN, storage "brick", Pull, Event; includes an example of a distributed DDS partition with 3 replicas in its group.)

Single-node hashtable or btree…

Component Layers
The application layer makes "search" and "insert" requests to a btree instance. The btree determines which data blocks it needs and fetches them from the global buffer cache. If the cache does not have the needed blocks, it fetches them from the global I/O core; this is transparent to the btree instance.
(Diagram layers: Application, Distributed Btrees, Single-Node Btrees, Buffer Cache, asynchronous I/O Core ("sinks and sources"), TCP network, VIA, file system storage, raw disk. Other labels: queued requests, queued completions.)
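The fetch path described on this slide can be sketched as follows. This is a minimal illustration, not the presented system's code: the class names `IOCore` and `BufferCache`, their methods, and the dictionary standing in for the disk are all assumptions made for the example.

```python
from collections import deque


class IOCore:
    """Hypothetical stand-in for the asynchronous I/O core."""

    def __init__(self, disk):
        self.disk = disk              # block id -> bytes (pretend disk)
        self.requests = deque()       # queued requests
        self.completions = deque()    # queued completions

    def submit(self, block_id):
        self.requests.append(block_id)

    def run_once(self):
        # Service one queued request and queue its completion.
        if self.requests:
            block_id = self.requests.popleft()
            self.completions.append((block_id, self.disk[block_id]))


class BufferCache:
    """Hypothetical global buffer cache sitting above the I/O core."""

    def __init__(self, io_core):
        self.io = io_core
        self.blocks = {}              # cached blocks
        self.pending = set()          # block ids with an I/O request in flight

    def fetch(self, block_id):
        """Return the block if cached; otherwise queue one I/O request and return None."""
        if block_id in self.blocks:
            return self.blocks[block_id]
        if block_id not in self.pending:
            self.pending.add(block_id)
            self.io.submit(block_id)
        return None

    def drain_completions(self):
        # Move finished reads from the I/O core into the cache.
        while self.io.completions:
            block_id, data = self.io.completions.popleft()
            self.pending.discard(block_id)
            self.blocks[block_id] = data
```

The caller (the btree instance) never talks to the I/O core directly: a miss simply returns `None`, and the same `fetch` call succeeds once the completion has been drained.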


API Flavor
SN_BtreeCloseRequest, SN_BtreeCloseComplete
SN_BtreeCreateRequest, SN_BtreeCreateComplete
SN_BtreeOpenRequest, SN_BtreeOpenComplete
SN_BtreeDestroyRequest, SN_BtreeDestroyComplete
SN_BtreeReadRequest, SN_BtreeReadComplete
SN_BtreeWriteRequest, SN_BtreeWriteComplete
SN_BtreeRemoveRequest, SN_BtreeRemoveComplete

API Flavor, Contd.
Distributed_BtreeCreateRequest, Distributed_BtreeCreateComplete
Distributed_BtreeDestroyRequest, Distributed_BtreeDestroyComplete
Distributed_BtreeReadRequest, Distributed_BtreeReadComplete
…
Errors: timeout (even after retries), replica_dead, lockgrab_failed, doesn't exist, etc.
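The Request/Complete pairing suggests a split-phase API: the caller issues a request and later picks up a matching completion that carries either a result or an error. The sketch below shows that shape from the caller's side only. `SingleNodeStore`, `Completion`, the `tag` field, and the status string `"doesnt_exist"` are hypothetical names chosen for illustration, and a plain dictionary stands in for the B-tree.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Completion:
    op: str                       # "read", "write", or "remove"
    key: Any
    status: str                   # "ok" or an error such as "doesnt_exist"
    value: Optional[Any] = None
    tag: Any = None               # caller-supplied token to match request and completion


class SingleNodeStore:
    """Hypothetical split-phase store; a dict stands in for the B-tree."""

    def __init__(self):
        self._data = {}
        self.completions = deque()

    def read_request(self, key, tag=None):
        if key in self._data:
            self.completions.append(Completion("read", key, "ok", self._data[key], tag))
        else:
            self.completions.append(Completion("read", key, "doesnt_exist", None, tag))

    def write_request(self, key, value, tag=None):
        self._data[key] = value
        self.completions.append(Completion("write", key, "ok", None, tag))

    def remove_request(self, key, tag=None):
        if key in self._data:
            del self._data[key]
            status = "ok"
        else:
            status = "doesnt_exist"
        self.completions.append(Completion("remove", key, status, None, tag))
```

A caller would issue several requests, then drain `completions` and use each `tag` to resume whatever work was waiting on that request, checking `status` before using `value`.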

Evaluation Metrics
Speedup: performance versus resources (data size fixed)
Scaleup: data size versus resources (performance fixed)
Sizeup: performance versus data size
Throughput: total number of reads/writes completed per second
Latency: time to satisfy a single request
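Two of these metrics reduce to simple ratios. The sketch below assumes that "performance" is measured as throughput and that speedup compares a run on n resources against a one-resource baseline over the same data; both readings are assumptions, and the function names are hypothetical.

```python
def throughput(ops_completed, elapsed_seconds):
    """Reads/writes completed per second."""
    return ops_completed / elapsed_seconds


def speedup(throughput_n, throughput_1):
    """Ratio of throughput on n resources to throughput on one, data size fixed."""
    return throughput_n / throughput_1
```

For instance, with made-up numbers (not measurements from the talk), 1000 operations in 2.0 seconds is a throughput of 500.0 operations per second.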

Single Node B-tree Performance
(Figure; axis labels: Btrees, Megabits per second.)

Single Node B-tree Performance

FSM-based Data Scheduling
Scheduling serves two purposes: performance (including fairness and avoiding starvation) and correctness/isolation. This functionality has traditionally resided in two different modules (the kernel schedules threads; the application or database schedules locks), and each module is optimized individually. Our claim is that jointly optimizing both can yield significant performance wins.

How to Achieve Isolation?
Options: use threads and locks; do careful scheduling (e.g., B-trees); or unify all scheduling decisions. The problem is that such globally optimal scheduling is hard. In restricted settings, it is similar to hardware scoreboarding techniques.
A useful lesson for database concurrency: you can choose the order of operations to avoid conflicts (have a prepare/prefetch phase) and so avoid locking across blocking I/O. (Lesson: do not lock if you block.) This can be implemented more naturally with asynchronous FSMs than with straight-line threaded code.

Benefits of Using FSMs+events for Concurrency Control
Control-flow based concurrency control, as opposed to lock-based concurrency control. Can avoid wrong scheduling decisions. Unnecessary locks can be eliminated. "Locks" can be released faster. More flexibility for concurrency control based on isolation requirements. Explicit concurrency control also avoids deadlocks, priority inversions, race conditions, and convoy formations.
(Diagram labels: T1, T2, b1, b2.)

Benefits of Using FSMs+Queues for Concurrency Control
Control-flow based concurrency control using FSMs and queues, as opposed to lock-based concurrency control. Can avoid wrong scheduling decisions. Unnecessary locks can be eliminated. "Locks" can be released faster. More flexibility for concurrency control based on isolation requirements. Explicit scheduling also avoids deadlocks, priority inversions, race conditions, and convoy formations.
(Diagram labels: T1, T2, b1, b2.)

The Convoy Problem Illustrated
Most tasks execute code like: lock(b); read(b); lock(b->next); unlock(b); …
The problem: if task T1 blocks on I/O for b4, then task T2 cannot unlock b3 because it is still waiting to acquire a lock on b4, task T3 cannot unlock b2 because it is still waiting for a lock on b3, and so on. A convoy forms even though most blocks are in cache and each task may require only a finite number of locks.
(Diagram: b1 is locked by T4, which waits for the lock on b2; b2 is locked by T3, which waits for the lock on b3; b3 is locked by T2, which waits for the lock on b4; b4 is locked by T1, which is blocked on I/O.)

Scheduling Based on Data Availability
Two transactions, T1 and T2, request blocks b1, b2 and b1, b3 respectively, and T1 acquires the lock on b1 first. The problem: if T1 then acquires a lock on b2 and blocks, T2 cannot make progress, even though T2 can access both b1 and b3. Lesson: schedule according to how data becomes available, not how requests enter the system.
(Diagram, over time: b1 locked by T1; b2 locked and blocked on I/O by T1; b3 ready; T2 blocked by T1.)
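The lesson can be sketched as a tiny scheduler that starts a request only when every block it needs is already in memory. This is an illustrative model under stated assumptions, not the scheduler from the talk: the function name `schedule`, its arguments, and the simplification that each runnable request executes to completion without blocking (so no locks are modeled) are all assumptions.

```python
from collections import deque


def schedule(requests, cached, arrivals):
    """Return the order in which requests run.

    requests: list of (name, set_of_block_ids) in arrival order.
    cached:   block ids already in memory.
    arrivals: block ids in the order their I/O completes.
    """
    waiting = deque(requests)
    cached = set(cached)
    arrivals = deque(arrivals)
    order = []
    while waiting:
        # A request is runnable only if all of its blocks are in memory.
        runnable = [r for r in waiting if r[1] <= cached]
        if runnable:
            for r in runnable:
                waiting.remove(r)
                order.append(r[0])
        elif arrivals:
            # Nothing can run; wait for the next I/O completion.
            cached.add(arrivals.popleft())
        else:
            raise RuntimeError("requests wait on blocks that never arrive")
    return order
```

With T1 needing b1 and b2, T2 needing b1 and b3, and only b2 missing from memory, T2 runs first even though T1 arrived first; T1 runs once b2 arrives.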

Scheduling Based on Data Availability (Example of Misordering)
Transferring funds from checking to savings:
Begin(transaction)
1: read(checking_account)
2: read(savings_account)
3: read(teller) // in cache
4: read(bank) // in cache
5: update(savings_account)
6: update(checking_account)
7: update(teller)
8: update(bank)
End(transaction)
If steps 3 and 4 were swapped with steps 1 and 2, we would be blocking while holding locks on the bank and teller balances. In a global scheduling model, the ordering of reads does not matter, because a request does not start execution unless all the required data on the most probable execution path is available.

Distributed Synchronization
Conventional lock-based implementations serialize the lock manager code. In the example shown in the diagram, T1 serializes against T3, although T1 and T3 should ideally execute concurrently. Distributed synchronization on distinct queues is possible in FSMs running on multiprocessors, without requiring a static data partition.
(Diagram labels: P1, P2, P3, P4, T2, T3, T4, b2.)

Single Node Btree "Brick"
Btree requests are queued in the global event queue. Request completions are queued in the individual btree completion queues.
(Diagram labels: btree "instance", requests, completions, completion queues, global event queue, global buffer cache.)
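The queue arrangement on this slide (one shared event queue in, one completion queue per btree instance out) can be sketched as below. The classes `Brick` and `BtreeInstance`, the operation names, and the dictionary standing in for each tree are hypothetical; the sketch shows only the routing of requests and completions.

```python
from collections import deque


class BtreeInstance:
    """Hypothetical btree instance; a dict stands in for the tree itself."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.completions = deque()    # this instance's own completion queue


class Brick:
    """Hypothetical brick holding the global event queue."""

    def __init__(self):
        self.events = deque()         # shared by all btree instances

    def submit(self, instance, op, key, value=None):
        self.events.append((instance, op, key, value))

    def run(self):
        # Drain the global queue, routing each completion to its instance.
        while self.events:
            inst, op, key, value = self.events.popleft()
            if op == "insert":
                inst.data[key] = value
                inst.completions.append((op, key, "ok"))
            elif op == "search":
                inst.completions.append((op, key, inst.data.get(key)))
            else:
                inst.completions.append((op, key, "unsupported"))
```

Requests from different instances interleave freely in the global queue, while each instance sees only its own completions, in the order its requests were processed.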

FSM for Non-blocking Fetch
(State diagram. States: start, moving down, moving right, stop. Transition labels: key > highkey; key <= highkey && not leaf; key <= highkey && is leaf; is leaf; has descendants.)
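One plausible reading of this state diagram is the standard B-link tree search: move right while the key exceeds the node's high key, move down while the node is not a leaf, and stop at the leaf. The sketch below follows that reading and is an assumption, not a transcription of the slide. The `Node` layout (each internal key is the high key of the corresponding child, so keys and children have equal length) and the `fetch` function are hypothetical, and the loop runs synchronously, whereas a non-blocking version would yield whenever the next block is not in the cache.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]
    highkey: int                                   # largest key this node may hold
    children: List["Node"] = field(default_factory=list)   # empty for a leaf
    right: Optional["Node"] = None                 # blink: link to the right sibling

    @property
    def is_leaf(self):
        return not self.children


def fetch(root, key):
    """Return (leaf, trace), where trace lists the FSM states visited."""
    state, node, trace = "start", root, ["start"]
    while state != "stop":
        if key > node.highkey:
            # Key lies beyond this node: follow the blink pointer.
            if node.right is None:
                raise KeyError(key)
            node, state = node.right, "moving right"
        elif node.is_leaf:
            state = "stop"
        else:
            # Assumed layout: children[i] covers keys <= keys[i].
            i = bisect.bisect_left(node.keys, key)
            node, state = node.children[i], "moving down"
        trace.append(state)
    return node, trace
```

The "moving right" state matters when a parent has not yet learned about a split: the search lands on the left half, sees that the key exceeds that node's high key, and follows the blink pointer instead of failing.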

Splitting node a into nodes a' and b'
(Diagram panels (c) and (d); node labels: a, a', b', c, f, f'.)

A Single Node B-tree
(Figure: example tree with a meta data block, index nodes labeled with key values such as 40 and 99, and leaf entries of the form "Key: 48 <values>", "Key: 51", . . .)

(Figure: leaf node layout: P0, K0, . . ., K2k+1, P2k+1, followed by the blink pointer.)