Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Megastore: Providing Scalable, Highly Available Storage for Interactive Services
J. Baker, C. Bond, J.C. Corbett, JJ Furman, A. Khorlin, J. Larson, J-M Léon, Y. Li, A. Lloyd, V. Yushprakh - Google Inc.
Originally presented at CIDR 2011 by James Larson. Presented by George Lee.

With Great Scale Comes Great Responsibility
- A billion Internet users - even a small fraction is huge
- Must please users:
  - Bad press is expensive - never lose data
  - Support is expensive - minimize confusion
  - No unplanned downtime
  - No planned downtime
  - Low latency
- Must also please developers and admins

Megastore
- Started in 2006 for app development at Google
- Service layered on:
  - Bigtable (NoSQL scalable data store per datacenter)
  - Chubby (config data, config locks)
- Turnkey scaling (apps, users)
- Developer-friendly features
- Wide-area synchronous replication
- Partition by "Entity Group" (see the layout sketch below)
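
The paper describes each entity group's data as living under contiguous Bigtable rows keyed by the group's root entity, so a single prefix scan covers the whole group. Below is a minimal sketch of that layout; the key format and helper names are illustrative assumptions, not Megastore's actual schema.

# Minimal sketch (not Megastore's actual schema): entity-group data laid out as
# contiguous Bigtable-style row keys so one prefix scan covers a whole group.

def row_key(root_key, table, child_key=""):
    # Every entity in a group shares the root entity's key as a prefix.
    return f"{root_key}/{table}/{child_key}" if child_key else f"{root_key}/{table}"

store = {}  # a plain dict stands in for Bigtable's lexicographically ordered rows
store[row_key("User:42", "User")] = {"name": "Ada"}
store[row_key("User:42", "Photo", "p001")] = {"caption": "sunrise"}
store[row_key("User:7", "User")] = {"name": "Bob"}

def scan_entity_group(root_key):
    # Reading a whole entity group is a single prefix scan.
    prefix = root_key + "/"
    return {k: v for k, v in sorted(store.items()) if k.startswith(prefix)}

print(scan_entity_group("User:42"))  # only user 42's entities

Keeping each group contiguous is what makes single-group transactions and scans cheap while still letting different groups spread across the keyspace.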

Entity Group Examples
Application | Entity Groups | Cross-EG Ops
Email       | User accounts | none
Blogs       | Users, Blogs  | Access control, notifications, global indexes
Mapping     | Local patches | Patch-spanning ops
Social      | Users, Groups | Messages, relationships, notifications
Resources   | Sites         | Shipments

Achieving Technical Goals
- Scale
  - Bigtable within datacenters
  - Easy to add Entity Groups (storage, throughput)
- ACID Transactions
  - Write-ahead log per Entity Group (see the sketch below)
  - 2PC or Queues between Entity Groups
- Wide-Area Replication
  - Paxos
  - Tweaks for optimal latency
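
The ACID story on this slide is a write-ahead log per entity group with optimistic concurrency: a transaction notes the log position it read at, and its mutations commit only if they win the next log position. A minimal sketch of that idea, assuming an in-memory group; the class and method names are illustrative, not Megastore's API.

# Minimal sketch: per-entity-group write-ahead log with an optimistic
# concurrency check at commit time (illustrative names only).

class EntityGroup:
    def __init__(self):
        self.log = []    # write-ahead log: one entry per committed transaction
        self.state = {}  # current key/value state of the group

    def read_position(self):
        return len(self.log)             # the log position a transaction reads at

    def commit(self, read_position, mutations):
        # The commit wins only if nothing else committed since we read.
        if read_position != len(self.log):
            raise RuntimeError("write conflict: retry the transaction")
        self.log.append(mutations)       # 1. append the batch to the log
        self.state.update(mutations)     # 2. apply it to the group's state

eg = EntityGroup()
pos = eg.read_position()
eg.commit(pos, {"post:1": "hello"})      # succeeds
try:
    eg.commit(pos, {"post:1": "stale"})  # same read point, log moved on -> conflict
except RuntimeError as e:
    print(e)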

Two-Phase Commit
- Commit-request/voting phase
  - Coordinator sends a query to commit
  - Cohorts prepare and reply with their votes
- Commit/completion phase
  - Success: commit and acknowledge
  - Failure: roll back and acknowledge
- Disadvantage: blocking protocol
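
For readers who want the shape of the protocol in code, here is a generic (non-Megastore) sketch of two-phase commit. It also illustrates why the slide calls it blocking: a cohort that has voted yes is stuck holding its prepared state until the coordinator's decision arrives.

# Generic two-phase commit sketch (illustrative, not Megastore-specific).

class Cohort:
    def prepare(self, txn):
        # Phase 1: persist the transaction tentatively and vote.
        self.pending = txn
        return True                      # vote "yes"; a real cohort may vote "no"

    def commit(self):
        print("committed", self.pending)

    def rollback(self):
        print("rolled back", self.pending)

def two_phase_commit(txn, cohorts):
    votes = [c.prepare(txn) for c in cohorts]     # commit-request / voting phase
    if all(votes):                                # commit / completion phase
        for c in cohorts:
            c.commit()
        return True
    for c in cohorts:
        c.rollback()
    return False

two_phase_commit({"op": "move entity between groups"}, [Cohort(), Cohort()])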

Basic Paxos
- Prepare and Promise
  - Proposer selects proposal number N and sends a Prepare(N) request to the acceptors
  - Acceptors either promise to ignore lower-numbered proposals or deny the request
- Accept! and Accepted
  - Proposer sends out its value in an Accept!(N, V) request
  - Acceptors accept the value and notify the proposer and learners
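
As a companion to the bullets above, here is a compact single-decree Paxos sketch: in-process, a majority of three acceptors, no failure handling. It is purely illustrative, not Megastore's implementation.

# Single-decree ("Basic") Paxos sketch.

class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.accepted = None                 # (n, value) once something is accepted

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n, or deny.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("deny", None)

    def accept(self, n, value):
        # Phase 2b: accept unless we have promised a higher-numbered proposal.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(n, value, acceptors):
    # Phase 1a: Prepare(n) to all acceptors.
    replies = [a.prepare(n) for a in acceptors]
    promises = [prior for tag, prior in replies if tag == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None                          # no majority of promises
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior_values = [p for p in promises if p is not None]
    if prior_values:
        value = max(prior_values)[1]
    # Phase 2a: Accept!(n, value) to all acceptors.
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted > len(acceptors) // 2 else None

print(propose(1, "v1", [Acceptor(), Acceptor(), Acceptor()]))  # -> v1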

Message Flow: Basic Paxos
Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(N)
   |         |<---------X--X--X       |  |  Promise(N,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(N,Vn)
   |         |<---------X--X--X------>|->|  Accepted(N,Vn)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Paxos: Quorum-based Consensus
"While some consensus algorithms, such as Paxos, have started to find their way into [large-scale distributed storage systems built over failure-prone commodity components], their uses are limited mostly to the maintenance of the global configuration information in the system, not for the actual data replication."
-- Lamport, Malkhi, and Zhou, May 2009

Paxos and Megastore
- In practice, basic Paxos is not used
- Master-based approach?
- Megastore's tweaks:
  - Coordinators
  - Local reads (sketch below)
  - Read/write from any replica
  - Replicate log entries on each write
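
A rough sketch of the coordinator/local-read tweak, under the assumption that each replica runs a coordinator tracking which entity groups it has fully caught up on; the names and the fallback path are illustrative, not Megastore's API.

# Illustrative sketch of coordinators and local reads: if the coordinator says
# this replica is current for an entity group, reads are served locally with
# no cross-datacenter round trip.

class Coordinator:
    def __init__(self):
        self.up_to_date = set()          # entity groups this replica is current for

    def is_current(self, eg):
        return eg in self.up_to_date

    def invalidate(self, eg):
        self.up_to_date.discard(eg)      # called when this replica misses a write

    def validate(self, eg):
        self.up_to_date.add(eg)          # called after catching the group up

def read(eg, key, local, other_replicas, coordinator):
    # Values are stored as (version, payload) pairs in this toy model.
    if coordinator.is_current(eg):
        return local[eg][key]            # fast local read
    # Fallback stand-in for catch-up: ask the other replicas as well and take
    # the highest-versioned value seen.
    candidates = [local.get(eg, {}).get(key)] + \
                 [r.get(eg, {}).get(key) for r in other_replicas]
    return max((c for c in candidates if c is not None), key=lambda vv: vv[0])

coord = Coordinator()
coord.validate("blog:1")
local = {"blog:1": {"title": (3, "Hello")}}
print(read("blog:1", "title", local, [], coord))   # served locally: (3, 'Hello')

In the paper, a writer invalidates the coordinator of any replica that did not accept the write before the commit is acknowledged, which is what keeps the local-read fast path consistent.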

Omissions
These were noted in the talk:
- No current query language
- Apps must implement query plans
- Apps have fine-grained control of physical placement
- Limited per-Entity Group update rate

Is Everybody Happy?
- Admins
  - Linear scaling, transparent rebalancing (Bigtable)
  - Instant transparent failover
  - Symmetric deployment
- Developers
  - ACID transactions (read-modify-write)
  - Many features (indexes, backup, encryption, scaling)
  - Single-system image makes code simple
  - Little need to handle failures
- End Users
  - Fast up-to-date reads, acceptable write latency
  - Consistency

Take-Aways
- Constraints acceptable to most apps:
  - Entity Group partitioning
  - High write latency
  - Limited per-EG throughput
- In production use for over 4 years
- Available on Google App Engine as HRD (High Replication Datastore)

Questions?
- Is the rate limitation on writes insignificant?
- Why not lots of RDBMSs?
- Why not NoSQL with “mini-databases”?