Building a Database on S3. Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, Tim Kraska. Presented by Xiang Zhang (2014-04-10).

Outline Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Background Utility services need to provide users with basic ingredients such as storage, with high scalability and full availability. Amazon's Simple Storage Service (S3) is most successful for storing large objects which are rarely updated, such as multimedia objects. Eventual consistency: S3 only guarantees that updates will eventually become visible to all clients and that the updates persist.

Goal Explore how Web-based database applications (at any scale) can be implemented on top of S3 – Preserving scalability and availability – Maximizing the level of consistency

Contributions Show how small objects which are frequently updated by concurrent clients can be implemented on S3 Show how a B-tree can be implemented on S3 Present protocols that show how different levels of consistency can be implemented on S3 Study the cost (response time and $) of running a Web-based application at different levels of consistency on S3

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Simple Storage Service S3 is an infinite store for objects of variable size (min 1 Byte, max 5 GB). S3 allows clients to read and update S3 objects remotely using a SOAP or REST-based interface:
– get(uri): returns the object identified by uri
– put(uri, bytestream): writes a new version of the object
– get-if-modified-since(uri, timestamp): retrieves a new version of the object only if the object has changed since the specified timestamp
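
As a rough illustration of this interface, the sketch below maps the three calls onto boto3-style S3 operations. The bucket/key addressing and the 304 (Not Modified) handling are assumptions for illustration, not part of the paper.

```python
# Minimal sketch of the S3 interface described above, mapped onto boto3-style
# calls; bucket/key naming and the 304 handling are illustrative assumptions.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def get(bucket, key):
    """get(uri): return the current version of an object (a page) as bytes."""
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

def put(bucket, key, bytestream):
    """put(uri, bytestream): write a new version of the object."""
    s3.put_object(Bucket=bucket, Key=key, Body=bytestream)

def get_if_modified_since(bucket, key, timestamp):
    """get-if-modified-since(uri, ts): fetch only if changed since ts, else None."""
    try:
        resp = s3.get_object(Bucket=bucket, Key=key, IfModifiedSince=timestamp)
        return resp["Body"].read()
    except ClientError as err:
        if err.response["ResponseMetadata"]["HTTPStatusCode"] == 304:
            return None  # object unchanged since the cached timestamp
        raise
```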

Simple Storage Service S3 is not free. Users pay by use:
– $0.15 to store 1 GB of data for 1 month
– $0.01 per 10,000 get requests
– $0.01 per 1,000 put requests
– $0.10–0.18 per GB of network bandwidth consumed
Services like SmugMug use their own servers to cache the data in order to save money as well as to reduce latency.
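
To make the pricing concrete, here is a back-of-the-envelope cost calculation at the 2008 prices listed above; the workload figures are invented for illustration and today's prices differ.

```python
# Illustrative monthly cost for a made-up workload at the listed 2008 prices.
storage_gb, gets, puts, transfer_gb = 10, 2_000_000, 500_000, 50
cost = (storage_gb * 0.15          # $0.15 per GB-month of storage
        + gets / 10_000 * 0.01     # $0.01 per 10,000 get requests
        + puts / 1_000 * 0.01      # $0.01 per 1,000 put requests
        + transfer_gb * 0.10)      # $0.10 per GB of bandwidth (lower bound)
print(f"estimated monthly cost: ${cost:.2f}")
```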

Simple Storage Service Small objects are clustered into pages, and a page is the unit of transfer.

Simple Queue Service SQS allows users to manage an infinite number of queues with infinite capacity. Each queue is referenced by a URI and supports sending and receiving messages via an HTTP or REST-based interface:
– createQueue(uri): creates a new queue identified by uri
– send(uri, msg): sends a message to the queue identified by uri
– receive(uri, number-of-msg, timeout): receives number-of-msg messages from the queue; the returned messages are locked for the specified timeout period
– delete(uri, msg-id): deletes a message from the queue
– addGrant(uri, user): allows another user to access the queue
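
For illustration, the sketch below gives boto3-style counterparts of the first four calls; the queue naming and the mapping of timeout onto the SQS visibility timeout are assumptions.

```python
# Rough boto3-style counterparts of the SQS calls listed above.
import boto3

sqs = boto3.client("sqs")

def create_queue(name):
    """createQueue(uri): create a new queue and return its URL."""
    return sqs.create_queue(QueueName=name)["QueueUrl"]

def send(queue_url, msg):
    """send(uri, msg): send a message to the queue."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=msg)

def receive(queue_url, number_of_msg, timeout):
    """receive(uri, number-of-msg, timeout): returned messages stay invisible
    to other clients for `timeout` seconds (SQS visibility timeout)."""
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=min(number_of_msg, 10),
                               VisibilityTimeout=timeout)
    return resp.get("Messages", [])

def delete(queue_url, receipt_handle):
    """delete(uri, msg-id): remove a received message from the queue."""
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```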

Simple Queue Service SQS is also not free; users pay by use:
– $0.01 for sending 1,000 messages
– $0.10 per GB of data transferred
SQS only makes a best effort to return messages in FIFO order: there is no guarantee that SQS returns the messages in the right order.

Simple Queue Service Round trip time: the total time between initiating the request from the application and the delivery of the result or ack

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Client-Server Architecture Page Manager : coordinates read and write requests to S3 and buffers pages from S3 in local main memory or disk Record Manager : provides a record-oriented interface, organizes records on pages and carries out free-space management for the creation of new records

Record Manager A record is composed of a key and payload data A collection is implemented as a bucket in S3 and all the pages that store records of that collection are stored as S3 objects in that bucket A collection is identified by a URI

Record Manager The record manager provides functions to operate on records:
– Create(key, payload, uri): creates a new record in the collection identified by uri
– Read(key, uri): reads the payload information of a record given the key of the record and the uri of the collection
– Update(key, payload, uri): updates the payload information of a record given the key of the record and the uri of the collection
– Delete(key, uri): deletes a record given the key of the record and the uri of the collection
– Scan(uri): scans through all records of a collection identified by uri
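
A minimal interface sketch of this record manager is given below; the PageManager dependency and the helper names (find_page_with_space, find_page_for, pages_of, mark_dirty) are assumptions made for illustration, not the paper's API.

```python
# Sketch of the record manager interface on top of an assumed page manager.
class RecordManager:
    def __init__(self, page_manager):
        self.pm = page_manager

    def create(self, key, payload, uri):
        # find (or allocate) a page of the collection with enough free space,
        # insert the record, and mark the page dirty in the buffer pool
        page = self.pm.find_page_with_space(uri, len(payload))
        page.insert(key, payload)
        self.pm.mark_dirty(page)

    def read(self, key, uri):
        return self.pm.find_page_for(uri, key).lookup(key)

    def update(self, key, payload, uri):
        page = self.pm.find_page_for(uri, key)
        page.replace(key, payload)
        self.pm.mark_dirty(page)

    def delete(self, key, uri):
        page = self.pm.find_page_for(uri, key)
        page.remove(key)
        self.pm.mark_dirty(page)

    def scan(self, uri):
        for page in self.pm.pages_of(uri):
            yield from page.records()
```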

Page Manager It implements a buffer pool for S3 pages It supports reading pages from S3, pinning the pages in the buffer pool, updating the pages in the buffer pool and marking the pages as updated It provides a way to create new pages in S3 It implements commit and abort methods

Page Manager If an application commits a transaction, all the updates are propagated to S3 and all the affected pages are marked as unmodified in the client’s buffer pool If an application aborts a transaction, all pages marked modified or new are simply discarded from the client’s buffer pool
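
The sketch below illustrates such a buffer pool with commit and abort, assuming the naive variant in which commit writes dirty and new pages directly back to S3 (the commit protocols later replace this with log records sent to SQS); the page states and the injected get/put functions are assumptions.

```python
# Sketch of the page manager's buffer pool with commit/abort (naive variant).
class PageManager:
    def __init__(self, s3_get, s3_put):
        self.get_page, self.put_page = s3_get, s3_put
        self.buffer = {}  # uri -> (page bytes, state in {"clean", "dirty", "new"})

    def read(self, uri):
        if uri not in self.buffer:
            self.buffer[uri] = (self.get_page(uri), "clean")
        return self.buffer[uri][0]

    def update(self, uri, page):
        self.buffer[uri] = (page, "dirty")   # mark the page as updated

    def create(self, uri, page):
        self.buffer[uri] = (page, "new")     # new page, not yet on S3

    def commit(self):
        # propagate all updates to S3 and mark the pages as unmodified
        for uri, (page, state) in self.buffer.items():
            if state in ("dirty", "new"):
                self.put_page(uri, page)
                self.buffer[uri] = (page, "clean")

    def abort(self):
        # discard pages marked modified or new from the buffer pool
        self.buffer = {u: v for u, v in self.buffer.items() if v[1] == "clean"}
```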

B-tree Indexes A B-tree is identified by the URI of its root page Each node contains a pointer to its right sibling at the same level B-trees can be implemented on top of the page manager – The root and intermediate nodes of the B-tree are stored as pages on S3 and contain (key, uri) pairs: uri refers to the appropriate page at the next lower level – The leaf pages of a primary index contain entries of (key, payload) – The leaf pages of a secondary index contain entries of (search key, record key)
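
A lookup in this layout can be sketched as follows; the page-parsing helpers (is_leaf, lookup, child_uri_for) are assumptions, not the paper's API.

```python
# Sketch of a B-tree lookup: inner pages hold (key, uri) separators pointing
# one level down; primary-index leaves hold (key, payload).
def btree_lookup(page_manager, root_uri, search_key):
    uri = root_uri
    while True:
        page = page_manager.read(uri)
        if page.is_leaf:
            return page.lookup(search_key)   # payload (or record key, for a secondary index)
        # follow the child whose separator range covers search_key
        uri = page.child_uri_for(search_key)
```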

Logging It is assumed that log records are idempotent: applying a log record twice or more often has the same effect as applying it only once. There are 3 types of log records:
– (insert, key, payload): describes the creation of a new record
– (delete, key): describes the deletion of a record
– (update, key, payload): describes the update of a record
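
A sketch of applying these log records to a page is shown below; put and remove_if_present are assumed page helpers. Idempotence holds because each operation overwrites or removes state rather than accumulating it.

```python
# Sketch of applying the three log record types to a page (idempotent).
def apply_log_record(page, rec):
    kind = rec["kind"]
    if kind == "insert":
        page.put(rec["key"], rec["payload"])     # (insert, key, payload)
    elif kind == "update":
        page.put(rec["key"], rec["payload"])     # (update, key, payload)
    elif kind == "delete":
        page.remove_if_present(rec["key"])       # (delete, key)
```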

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Issue to be addressed When concurrent clients commit updates to records stored on the same page, the updates of one client may be overwritten by the other client, even if they update different records. The reason is that the unit of transfer is a page rather than an individual record.

Overview of commit protocol Step 1. Commit: the client generates log records for all the updates that are committed as part of the transaction and sends them to SQS Step 2. Checkpoint: the log records are applied to the pages stored on S3
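
A minimal sketch of the commit step is shown below; the queue_url_for routing function (made precise by the PU queues later) and the JSON encoding are assumptions.

```python
# Sketch of the commit step: package the transaction's updates as log records
# and send them to the appropriate SQS queues.
import json

def commit(sqs_send, log_records, queue_url_for):
    # queue_url_for(rec) decides which queue a log record belongs to (assumption)
    for rec in log_records:
        sqs_send(queue_url_for(rec), json.dumps(rec))
```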

Overview of commit protocol The commit step can be carried out in constant time, assuming the number of messages per commit is constant. The checkpoint step can be carried out asynchronously and outside the execution of a client application, so that end users are not blocked by it.

Overview of commit protocol This protocol is resilient to failures: if a client crashes during commit, it simply resends all log records when it restarts. This protocol does not guarantee atomicity: if a client crashes during commit and never comes back again (or loses the uncommitted log records), then some of the remaining log records will never be applied. This protocol provides eventual consistency: eventually all updates become visible to everyone.

PU Queues (Pending Update Queues) Clients propagate their log records to PU queues. Each B-tree has one PU queue associated with it, which receives insert and delete log records. One PU queue is associated with each leaf node of a primary B-tree of a collection, which receives update log records.
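
The routing rule can be sketched as follows; the queue-naming convention and the leaf_uri_for_key helper are assumptions for illustration.

```python
# Sketch of routing a log record to its PU queue, following the rule above:
# inserts/deletes go to the queue of the affected B-tree, updates go to the
# queue of the leaf page that currently holds the record.
def pu_queue_for(rec, btree_uri, leaf_uri_for_key):
    if rec["kind"] in ("insert", "delete"):
        return btree_uri + "/pu-queue"
    return leaf_uri_for_key(rec["key"]) + "/pu-queue"   # update log record
```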

Checkpoint Protocol for Data Pages The input of a checkpoint is a PU queue. To make sure that nobody else is concurrently carrying out a checkpoint on the same PU queue, a LOCK queue is associated with each PU queue.

Checkpoint Protocol for Data Pages
1. receive(URIofLOCKQueue, 1, timeout): if the token message is returned, continue with step 2; otherwise, terminate. (The timeout period must be set appropriately.)
2. If the leaf page is cached at the client, refresh the cached copy; otherwise, get a copy from S3.

Checkpoint Protocol for Data Pages
3. receive(URIofPUQueue, 256, 0): receive as many log records from the PU queue as possible.
4. Apply the log records to the local copy of the page.

Checkpoint Protocol for Data Pages
5. If steps 2–4 are completed within the timeout period, put the new version of the page to S3; otherwise, terminate.
6. If step 5 succeeds within the timeout, delete all log records received in step 3 from the PU queue.
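
Putting steps 1–6 together, a sketch of the data-page checkpoint might look as follows. It assumes simplified get(uri)/put(uri, bytes) and receive/delete wrappers in the spirit of those sketched earlier (here taking a single uri argument for brevity), plus assumed load_page, serialize, and apply_log_record helpers.

```python
# Sketch of the data-page checkpoint protocol (steps 1-6 above).
import json
import time

def checkpoint_data_page(lock_queue_url, pu_queue_url, page_uri, timeout):
    start = time.time()
    if not receive(lock_queue_url, 1, timeout):      # step 1: obtain the token
        return                                       # someone else holds the lock
    page = load_page(get(page_uri))                  # step 2: refresh the page from S3
    msgs = receive(pu_queue_url, 256, 0)             # step 3: fetch pending log records
    for m in msgs:                                   # step 4: apply them locally
        apply_log_record(page, json.loads(m["Body"]))
    if time.time() - start < timeout:                # step 5: write back within timeout
        put(page_uri, page.serialize())
        if time.time() - start < timeout:            # step 6: delete applied log records
            for m in msgs:
                delete(pu_queue_url, m["ReceiptHandle"])
```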

Checkpoint Protocol for B-trees
1. Obtain the token from the LOCK queue.
2. Receive log records from the PU queue.
3. Sort the log records by key.
4. Take the first unprocessed log record, go to the leaf node it refers to, and refresh that leaf node from S3.
5. Apply all log records that are relevant to that leaf node.
6. If the timeout has not expired, put the new version of the node to S3.
7. If the timeout has not expired, delete the log records applied in step 5 from the PU queue.
8. If not all log records have been processed yet, go to step 4; otherwise, terminate.
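
A sketch of this B-tree checkpoint loop is given below, under the same assumptions as the data-page sketch plus an assumed leaf_uri_for_key helper that maps a key to the URI of its leaf page.

```python
# Sketch of the B-tree checkpoint loop (steps 1-8 above): sort the log records
# by key and process them leaf by leaf within the timeout.
import json
import time

def checkpoint_btree(lock_queue_url, pu_queue_url, timeout, leaf_uri_for_key):
    start = time.time()
    if not receive(lock_queue_url, 1, timeout):                     # step 1
        return
    msgs = receive(pu_queue_url, 256, 0)                            # step 2
    recs = sorted(({"body": json.loads(m["Body"]), "handle": m["ReceiptHandle"]}
                   for m in msgs), key=lambda r: r["body"]["key"])  # step 3
    while recs and time.time() - start < timeout:
        leaf_uri = leaf_uri_for_key(recs[0]["body"]["key"])         # step 4
        page = load_page(get(leaf_uri))
        batch = [r for r in recs
                 if leaf_uri_for_key(r["body"]["key"]) == leaf_uri]
        for r in batch:                                             # step 5
            apply_log_record(page, r["body"])
        put(leaf_uri, page.serialize())                             # step 6
        for r in batch:                                             # step 7
            delete(pu_queue_url, r["handle"])
        recs = recs[len(batch):]                                    # step 8: next leaf
```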

Checkpoint Strategies A checkpoint on a page X can be carried out by:
– Reader: a reader of X
– Writer: a client who just committed updates to X
– Watchdog: a process which periodically checks the PU queues of X
– Owner: a special client which periodically checks the PU queues of X
This paper uses writers in general and readers in exceptional cases to carry out checkpoints.

Checkpoint Strategies A writer initiates a checkpoint under the following conditions:
1. Each data page records the timestamp of its last checkpoint in its header.
2. When a client commits a log record, it computes the difference between its current time and the timestamp of the last checkpoint.
3. If the absolute value of this difference is bigger than a threshold (the checkpoint interval), then the client carries out the checkpoint.

Checkpoint Strategies Flaw in the writer-only strategy: it is possible that a page which is updated once and then never again is never checkpointed; as a result, the update never becomes visible. Solution: readers can also initiate checkpoints if they see a page whose last checkpoint was a long time ago, with a probability proportional to 1/x, where x is the time period since the last checkpoint.
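
The two triggers can be sketched as follows; the last_checkpoint_ts attribute, the proportionality constant, and the time units are assumptions (the slide only states that the reader probability is proportional to 1/x).

```python
# Sketch of the writer and reader checkpoint triggers described above.
import random
import time

def writer_should_checkpoint(page, checkpoint_interval, now=time.time):
    # writer rule: checkpoint if the last checkpoint is older than the interval
    return abs(now() - page.last_checkpoint_ts) > checkpoint_interval

def reader_should_checkpoint(page, c=1.0, now=time.time):
    # reader rule: checkpoint with probability proportional to 1/x,
    # where x is the time since the page's last checkpoint
    x = max(now() - page.last_checkpoint_ts, 1e-9)
    return random.random() < min(c / x, 1.0)
```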

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Atomicity Problem: the basic commit protocol described above cannot achieve full atomicity. Solution: rather than committing log records to the PU queues directly, the client first commits them to its own ATOMIC queue on SQS. Every log record is annotated with an ID which uniquely identifies its transaction. If a client fails and comes back:
– It checks its ATOMIC queue; log records that carry the ID of a transaction that committed are propagated to the PU queues and deleted from the ATOMIC queue after all of them have been propagated
– Log records of transactions that did not commit are deleted immediately from the ATOMIC queue
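
A sketch of the recovery pass over the ATOMIC queue is shown below, under the assumption that a transaction counts as committed only if all of its log records (tagged with the same transaction ID) reached the ATOMIC queue; the message structure and helper functions are illustrative.

```python
# Sketch of recovery from the ATOMIC queue after a client failure.
def recover_atomic_queue(atomic_msgs, committed_txn_ids, propagate, delete):
    # first propagate every log record of committed transactions to its PU queue
    committed = [m for m in atomic_msgs
                 if m["record"]["txn_id"] in committed_txn_ids]
    for msg in committed:
        propagate(msg["record"])
    # only after all of them have been propagated, delete them from the ATOMIC queue
    for msg in committed:
        delete(msg)
    # log records of transactions that never committed are deleted immediately
    for msg in atomic_msgs:
        if msg not in committed:
            delete(msg)
```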

Consistency Levels Monotonic Reads : If a client reads the value of a data item x, any successive read operation on x by that client will always return the same value or a more recent value Monotonic Writes : A write operation by a client on data item x is completed before any successive write operation on x by the same client Read your writes : The effect of a write operation by a client on data item x will always be seen by a successive read operation on x by the same client Write follows read : A write operation by a client on data item x following a previous read operation on x by the same client, is guaranteed to take place on the same or a more recent value of x that was read

Isolation: The Limits The idea of isolation is to serialize transactions in the order of the time they started. Snapshot Isolation or BOCC (backward-oriented optimistic concurrency control) can be implemented on S3, but both need a global counter, which may become the bottleneck of the system.

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Experiment Naïve way as a baseline:
– Write all dirty pages directly to S3
3 levels of consistency:
– Basic: the protocol only supports the commit and checkpoint steps; it provides eventual consistency
– Monotonicity: the protocol additionally supports full client-side consistency
– Atomicity: on top of Monotonicity and Basic, it provides the highest level of consistency
The only difference between these 4 variants is the implementation of commit and checkpoint.

TPC-W Benchmark Models an online bookstore with queries asking for the availability of products and an update workload that involves the placement of orders – Retrieve the customer record from the database – Search for 6 specific products – Place orders for 3 of the 6 products

Running Time [secs] Each transaction simulates about 12 clicks (1 sec each) of a user. The higher the level of consistency, the lower the overall running times. Naïve writes all affected pages, while the other approaches only propagate log records. Latency can be reduced by sending several messages in parallel to S3 and SQS.

Cost The cost was computed by running a large number of transactions and taking the cost measurements from AWS. Cost increases as the level of consistency increases. Cost can be reduced by setting the checkpoint interval to a larger value.

Vary Checkpoint Interval A checkpoint interval below 10 seconds effectively initiates a checkpoint for every update that was committed. Increasing the checkpoint interval above 10 seconds decreases the cost.

Introduction What is AWS Using S3 as a disk Basic commit protocols Transactional properties Experiments and results Conclusion and future works

Conclusion and Future Work This paper focuses on high scalability and availability, but there might be scenarios where ACID properties are more important. New algorithms need to be devised; for instance, this system is not able to carry out chained I/O in order to scan through several pages on S3. The right security infrastructure will be crucial for an S3-based information system.

Thank you Q&A