Scaling Out Without Partitioning
A Novel Transactional Record Manager for Shared Raw Flash
Phil Bernstein & Colin Reid, Microsoft Corporation
Jan. 16, 2010
© 2010 Microsoft Corporation

What is Hyder?

It's a research project: a software stack for transactional record management
– Stores [key, value] pairs, which are accessed within transactions
– Transactional record management is the standard interface that underlies all database systems

Functionality
– Records: stored [key, value] pairs
– Record operations: Insert, Delete, Update, Get record where field = X, Get next
– Transactions: Start, Commit, Abort

Why build another one?
– Make it easier to scale out for large-scale web services
– Exploit technology trends: flash memory, high-speed networks
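To make the operation list concrete, here is a minimal in-memory sketch of such a record-manager interface. The class and method names are illustrative assumptions, not Hyder's actual API, and a single-process dict stands in for the real log-backed store.

```python
# A minimal in-memory sketch of the interface above; class and method names
# are illustrative assumptions, not Hyder's actual API.

class Transaction:
    def __init__(self, store):
        self._store = store
        self._snapshot = dict(store._data)   # reads come from a private snapshot
        self._writes = {}

    def insert(self, key, value):
        self._writes[key] = value

    def update(self, key, value):
        self._writes[key] = value

    def delete(self, key):
        self._writes[key] = None             # tombstone

    def get(self, key):
        if key in self._writes:
            return self._writes[key]
        return self._snapshot.get(key)

    def commit(self):
        # Apply the writes to the shared store (no conflict check in this toy).
        for k, v in self._writes.items():
            if v is None:
                self._store._data.pop(k, None)
            else:
                self._store._data[k] = v

    def abort(self):
        self._writes.clear()


class RecordManager:
    """Stores [key, value] pairs that are read and written within transactions."""

    def __init__(self):
        self._data = {}

    def start(self):
        return Transaction(self)


# Usage: every record operation happens inside a transaction.
db = RecordManager()
t = db.start()
t.insert("user:42", "online")
assert t.get("user:42") == "online"
t.commit()
```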

Scaling Out with Partitioning

[Diagram: web servers (App + cache) behind the Internet, each talking to several database partitions; each partition has its own log and data.]

– Database is partitioned across multiple servers
– For scalability, avoid distributed transactions
– Several layers of caching
– App is responsible for cache coherence and for consistency of cross-partition queries
– Must carefully configure to balance the load

The Problem of Many-to-Many Relationships

FriendStatus( UserId, FriendId, Status, Updated )

– The status of each user, U, is duplicated with all of U's friends
– If you partition FriendStatus by UserId, then retrieving the status of a user's friends is efficient. But every user's status is duplicated in many partitions.
– Can be optimized using stateless distributed caches, with the "truth" in a separate Status(UserId) table
– But every status update of a user must be fanned out
  – This is inefficient
  – It can also cause update anomalies, because the fanout is not atomic
  – The application maintains consistency manually, where necessary
– So maybe a centralized cache of Status(UserId) is better, if you can make it fault tolerant
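A small sketch of the fan-out the slide describes, assuming a hypothetical hash partitioning of FriendStatus by UserId; it shows why one logical status update turns into many non-atomic writes.

```python
# Hypothetical hash partitioning of FriendStatus by UserId; partition_of and
# the in-memory dicts are stand-ins used only to illustrate the fan-out.

NUM_PARTITIONS = 4
# Each partition holds rows keyed by (UserId, FriendId) -> (Status, Updated).
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def partition_of(user_id: int) -> dict:
    return partitions[user_id % NUM_PARTITIONS]

def fan_out_status_update(user_id: int, friends: list, status: str, updated: str) -> None:
    # One logical update becomes one write in every friend's partition.
    # If this loop fails partway through, some friends see the new status
    # and others the old one -- the non-atomic fan-out anomaly on the slide.
    for friend_id in friends:
        partition_of(friend_id)[(friend_id, user_id)] = (status, updated)

# User 7's status must be rewritten into the partition of each friend.
fan_out_status_update(7, friends=[1, 2, 3, 12], status="at VLDB", updated="2010-01-16")
```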

Hyder Scales Out Without Partitioning

[Diagram: web servers, each running App + Hyder + cache, all appending to and reading from one shared Hyder log.]

– The log is the database
– No partitioning is required: servers share a reliable, distributed log
– Database is multi-versioned, so server caches are trivially coherent
– Servers can fetch pages from the log or from other servers' caches
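A minimal sketch of the shared-log idea, assuming an in-process stand-in for the log: each server keeps its own cache and brings it up to date by replaying log records in order, which is why the caches stay coherent without any cross-server protocol.

```python
# SharedLog and Server are illustrative stand-ins, not Hyder's components.

class SharedLog:
    """A single append-only log shared by all servers."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1                 # the record's log offset

    def read_from(self, offset: int):
        return list(enumerate(self._records))[offset:]


class Server:
    """An application server with its own cache of the committed state."""

    def __init__(self, log: SharedLog):
        self._log = log
        self._next = 0
        self.cache = {}

    def refresh(self):
        # The log is append-only and multi-versioned, so replaying it in
        # order is all that is needed to keep this cache coherent.
        for offset, record in self._log.read_from(self._next):
            self.cache.update(record)
            self._next = offset + 1


log = SharedLog()
a, b = Server(log), Server(log)
log.append({"user:42": "online"})
a.refresh(); b.refresh()
assert a.cache == b.cache                             # caches trivially agree
```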

Hyder Runs in the Application Process

– Simple programming model
– No need for client and server caches plus a cache server
– Avoids the expense of RPCs to a database server

[Diagram: the same shared-log architecture, with Hyder embedded in each web server's application process.]

Enabling Hardware Assumptions

I/O operations are now cheap and abundant
– Raw flash offers at least 10^4 more IOPS/GB than HDD
– With 4K pages, ~100 µs to read and ~200 µs to write, per chip
– Can spread the database across a log, with less physical contiguity

Cheap high-performance data center networks
– 1 Gbps broadcast, with 10 Gbps coming soon
– Round-trip latencies already under 25 µs on 10 GigE
⇒ Many servers can share storage, with high performance

Large, cheap, 64-bit addressable memories
– Commodity web servers can maintain huge in-memory caches
⇒ Reduces the rate at which Hyder needs to access the log

Many-core web servers
– Computation can be squandered
⇒ Hyder uses it to maintain consistent views of the database

Issues with Flash

Flash cannot be destructively updated
– A block of 64 pages must be erased before programming (writing) each page once
– Erases are slow (~2 ms) and block the chip

Flash has limited erase durability
– MLC can be erased ~10K times; SLC, ~100K times
– Needs wear-leveling
– SLC is ~3x the price of MLC
– Dec. 2009: 4 GB MLC is $8.50; 1 GB SLC is $6.00

Flash densities can double only 2 or 3 more times
– But other nonvolatile technologies are coming, e.g., phase-change memory (PCM)

The Hyder Stack

– Persistent programming language: LINQ or SQL layered on Hyder
– Optimistic transaction protocol: supports standard isolation levels
– Multi-versioned binary search tree: mapped to log-structured storage
– Segments, stripes and streams: highly available, load-balanced and self-managing log-structured storage
– Custom controller interface: flash units are append-only

Hyder Stores its Database in a Log

– The log uses RAID erasure coding for reliability

Database is a Binary Search Tree

[Diagram: a binary search tree rooted at G over keys A, B, C, D, H, I; it is marshaled into the log as the node sequence A D C B I H G.]

– The tree is marshaled into the log
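A sketch of the marshaling idea, under the assumption that children are appended before their parent so a parent can refer to its children by log offset; the node encoding is illustrative, not Hyder's on-log format.

```python
# Children are appended before their parent, so a parent can refer to its
# children by log offset. The node encoding is illustrative, not Hyder's
# on-log format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def marshal(node: Optional[Node], log: list) -> Optional[int]:
    """Append node's subtree to the log and return the node's log offset."""
    if node is None:
        return None
    left_off = marshal(node.left, log)     # children reach the log first...
    right_off = marshal(node.right, log)
    log.append({"key": node.key, "value": node.value,
                "left": left_off, "right": right_off})
    return len(log) - 1                    # ...so the parent can point to them

# The tree from the slide, rooted at G.
tree = Node("G", "g",
            left=Node("B", "b",
                      left=Node("A", "a"),
                      right=Node("C", "c", right=Node("D", "d"))),
            right=Node("H", "h", right=Node("I", "i")))

log: list = []
root_offset = marshal(tree, log)
# The log now holds the nodes in the order A, D, C, B, I, H, G, with the
# root written last -- the same order shown on the slide.
```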

Binary Tree is Multi-versioned

[Diagram: updating D's value copies D and its ancestors C, B, and G; unchanged nodes are shared between versions.]

– Copy on write: to update a node, replace the nodes on its path up to the root
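A sketch of the copy-on-write (path-copying) update: changing one node produces new copies of that node and its ancestors up to the root, while every other node is shared with the previous version. The Node shape is the same illustrative structure as in the previous sketch.

```python
# A path-copying update: the Node shape matches the previous sketch; frozen
# instances make the sharing between versions explicit.

from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def update(root: Optional[Node], key: str, value: str) -> Node:
    """Return the root of a new version with `key` set to `value`."""
    if root is None:
        return Node(key, value)                       # new leaf
    if key == root.key:
        return replace(root, value=value)             # copy this node only
    if key < root.key:
        return replace(root, left=update(root.left, key, value))
    return replace(root, right=update(root.right, key, value))

old_root = Node("G", "g",
                left=Node("B", "b",
                          left=Node("A", "a"),
                          right=Node("C", "c", right=Node("D", "d"))),
                right=Node("H", "h", right=Node("I", "i")))

new_root = update(old_root, "D", "d2")
# D, C, B and G were copied; A, H and I are shared, so readers of the old
# version still see a consistent snapshot.
assert old_root.left.right.right.value == "d"
assert new_root.left.right.right.value == "d2"
assert new_root.right is old_root.right               # H subtree is shared
```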

Transaction Execution

– Each server has a cache of the last-committed database state
– A transaction reads a snapshot and generates an intention log record

Transaction execution:
1. Get pointer to snapshot
2. Generate updates locally
3. Append intention log record

[Diagram: the transaction reads the last-committed state (A, B, C, D, G, H, I) from the server's DB cache and produces new copies of D, C, B, G as its intention.]
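A sketch of the three execution steps, with the committed state simplified to a flat dict and the intention's fields (snapshot position, readset, writeset, updates) assumed for illustration; the real intention carries the copy-on-write subtree from the previous slides.

```python
# Illustrative intention record and execution steps; field names and the
# dict-based state are assumptions, not Hyder's actual formats.

from dataclasses import dataclass, field

@dataclass
class Intention:
    snapshot_pos: int                    # log position the snapshot corresponds to
    readset: set = field(default_factory=set)
    writeset: set = field(default_factory=set)
    updates: dict = field(default_factory=dict)   # new [key, value] versions

def run_transaction(snapshot: dict, snapshot_pos: int, log: list) -> Intention:
    intent = Intention(snapshot_pos=snapshot_pos)

    # 1. Get a pointer to the snapshot: reads below are served from it,
    #    never from shared mutable state.
    old = snapshot.get("D")
    intent.readset.add("D")

    # 2. Generate updates locally (copy-on-write against the snapshot).
    intent.writeset.add("D")
    intent.updates["D"] = (old or "") + "'"

    # 3. Append the intention log record; whether it commits is decided
    #    later, during roll-forward.
    log.append(intent)
    return intent

log: list = []
last_committed = {"A": "a", "B": "b", "C": "c", "D": "d",
                  "G": "g", "H": "h", "I": "i"}
run_transaction(last_committed, snapshot_pos=0, log=log)
```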

Log Updates are Broadcast

[Diagram: the appended intention is broadcast to all servers, followed by a broadcast ack from the log.]

Transaction Commit

– Every server executes a roll-forward of the log
– When it processes an intention log record,
  – it checks whether the transaction experienced a conflict
  – if not, the transaction committed and the server merges the intention into its last-committed state
– All servers make the same commit/abort decisions

[Diagram: for transaction T's intention (new copies of D, C, B, G), the conflict test asks whether a committed transaction wrote into T's readset or writeset between T's snapshot and T's position in the log.]
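A sketch of the roll-forward and conflict test, again over a simplified dict state rather than the marshaled tree; because the decision depends only on log order and contents, every server computes the same commit/abort outcome.

```python
# Illustrative roll-forward: an intention commits only if no transaction that
# committed inside its conflict zone (between its snapshot position and its
# own position) wrote into its readset or writeset. Record fields and the
# dict-based state are simplifications, not Hyder's actual meld algorithm.

def roll_forward(log):
    state = {}                 # this server's cache of the last-committed state
    committed = []             # (log position, writeset) of committed txns
    decisions = []
    for pos, intent in enumerate(log):
        conflict = any(
            cpos > intent["snapshot_pos"] and
            cwrites & (intent["readset"] | intent["writeset"])
            for cpos, cwrites in committed)
        if conflict:
            decisions.append((pos, "abort"))
        else:
            # Meld: fold the intention's new versions into the committed state.
            state.update(intent["updates"])
            committed.append((pos, set(intent["updates"])))
            decisions.append((pos, "commit"))
    return state, decisions

# Two transactions that both read and write D from the same snapshot: the
# first commits, the second aborts because D changed in its conflict zone.
log = [
    {"snapshot_pos": -1, "readset": {"D"}, "writeset": {"D"}, "updates": {"D": "d1"}},
    {"snapshot_pos": -1, "readset": {"D"}, "writeset": {"D"}, "updates": {"D": "d2"}},
]
state, decisions = roll_forward(log)
assert decisions == [(0, "commit"), (1, "abort")]
```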

Hyder Transaction Flow

1. Transaction starts with a recent consistent snapshot
2. Transaction executes on an application server
3. Transaction "intention" is appended to the log and partially broadcast to other servers
4. Intention is durably stored in the log
5. Intention log sequence is broadcast to all servers
6. Messages are received over UDP and parsed in parallel
7. Each server sequentially merges each intention into its committed-state cache
8. An optimistic concurrency violation causes the transaction to abort and optionally retry

Performance

The system scales out without partitioning

System-wide throughput of update transactions is bounded by the slowest step in the update pipeline
– 15K update transactions per second possible over 1 Gigabit Ethernet
– 150K update transactions per second expected on 10 Gigabit Ethernet
– Conflict detection & merge can do about 300K update transactions per second

Abort rate on write-hot data is bounded by a txn's conflict zone
– Which is determined by end-to-end transaction latency
– About 200 µs in our prototype ⇒ ~1500 update TPS if all txns conflict

[Diagram: the conflict zone is the stretch of log between a transaction's snapshot and its intention; minimize the length of the conflict zone.]

Major Technologies

– Flash is append-only; a custom controller has mechanisms for synchronization & fault tolerance
– Storage is striped, with a self-adaptive algorithm for storage allocation and load balancing
– Fault-tolerant protocol for a totally ordered log
– Fast algorithm for conflict detection and merging of intention records into the last-committed state

Status

Most parts have been prototyped
– But there's a long way to go

We're working on papers
– There's a short abstract on the HPTS 2009 website