Sinfonia: A New Paradigm for Building Scalable Distributed Systems Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis.

Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis (HP Laboratories and VMware)
Presented by Mahesh Balakrishnan
*Some slides stolen from Aguilera's SOSP presentation

Motivation
- Datacenter infrastructural apps: clustered file systems, lock managers, group communication services…
- These apps all manage distributed state
- Current solution space:
  - Message-passing protocols: replication, cache consistency, group membership, file data/metadata management
  - Databases: powerful but expensive/inefficient
  - Distributed Shared Memory: doesn't work!

Sinfonia
- Distributed shared memory as a service
- Consists of multiple memory nodes exposing flat, fine-grained address spaces
- Accessed by processes running on application nodes via a user library

Assumptions
- Assumptions:
  - Datacenter environment
  - Trustworthy applications, no Byzantine failures
  - Low, steady latencies
  - No network partitions
- Goal: help build infrastructural services with fault-tolerance, scalability, consistency, and performance

Design Principles
- Principle 1: Reduce operation coupling to obtain scalability. [flat address space]
- Principle 2: Make components reliable before scaling them. [fault-tolerant memory nodes]
- Partitioned address space: (mem-node-id, addr)
  - Allows for clever data placement by the application (clustering / striping)

Piggybacking Transactions
- 2PC transaction: coordinator + participants; a set of actions followed by two-phase commit
- The last action can be piggybacked onto the commit phase if:
  - it does not affect the coordinator's abort/commit decision, or
  - its impact on the coordinator's abort/commit decision is known by the participant
- Can we piggyback the entire transaction onto two-phase commit?

Minitransactions

- A minitransaction consists of:
  - a set of compare items
  - a set of read items
  - a set of write items
- Semantics:
  - Check the data in the compare items (equality)
  - If all match: retrieve the data in the read items and write the data in the write items
  - Else abort
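To make the compare/read/write semantics concrete, here is a minimal Python sketch (not Sinfonia's actual API) of a minitransaction and of how a memory node could evaluate its local part during the prepare phase; real items also carry a length, and writes would be staged until commit.

from dataclasses import dataclass, field

@dataclass
class Item:
    mem_node: int       # memory node holding the word range
    addr: int           # offset within that node's flat address space
    data: bytes = b""   # expected value (compare item) or value to write (write item)

@dataclass
class Minitransaction:
    compare_items: list = field(default_factory=list)
    read_items: list = field(default_factory=list)
    write_items: list = field(default_factory=list)

def exec_and_prepare(memory, t):
    """Evaluate the local part of minitransaction t against this node's
    memory (a dict keyed by address). Returns (vote, read_results)."""
    # 1. Check every compare item for equality.
    for c in t.compare_items:
        if memory.get(c.addr) != c.data:
            return "abort", []
    # 2. All compares matched: collect the read results.
    results = [memory.get(r.addr) for r in t.read_items]
    # 3. Vote to commit; write items are applied only in the commit phase.
    return "commit", results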

Minitransactions

Example Minitransactions
- Examples: swap, compare-and-swap, atomic read of many data, acquire a lease, acquire multiple leases, change data only if a lease is held
- Minitransaction idioms:
  - Validate a cache using compare items and apply writes only if the cache is valid
  - Use compare items alone, without read/write items, to validate data for read-only operations; commit indicates validation was successful
- How powerful are minitransactions?
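Two of the listed examples, expressed against the hypothetical Item/Minitransaction sketch above (the empty-slot convention for a free lease is an assumption for illustration); executing the returned minitransaction is left to the user library, and the commit/abort outcome is the operation's result.

def compare_and_swap(node, addr, expected, new):
    t = Minitransaction()
    t.compare_items.append(Item(node, addr, expected))  # value must still be the old one
    t.write_items.append(Item(node, addr, new))         # applied only if the compare passes
    return t  # commit => swap succeeded; abort => someone else changed the value

def acquire_lease(node, addr, my_id):
    t = Minitransaction()
    t.compare_items.append(Item(node, addr, b""))       # lease slot must be free
    t.write_items.append(Item(node, addr, my_id))       # claim it atomically
    return t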

Other Design Choices
- Caching: none; delegated to the application
  - Minitransactions allow the developer to atomically validate cached data and apply writes
- Load-balancing: none; delegated to the application
  - Minitransactions allow the developer to atomically migrate many pieces of data (see the sketch below)
  - Complication: migration changes the address of the data…
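A sketch of the migration idiom mentioned above, again using the hypothetical API; the directory word recording an item's current location is an application-level convention assumed for illustration, not part of Sinfonia itself.

def migrate(old_node, old_addr, new_node, new_addr, dir_node, dir_addr, value):
    """Move a previously read value to a new memory node and repoint the
    directory entry in one atomic step, so readers never see a half-move."""
    old_loc = f"{old_node}:{old_addr}".encode()
    new_loc = f"{new_node}:{new_addr}".encode()
    t = Minitransaction()
    # Migrate only if the directory still points at the old copy
    # and the old copy still holds the value we read earlier.
    t.compare_items.append(Item(dir_node, dir_addr, old_loc))
    t.compare_items.append(Item(old_node, old_addr, value))
    # Place the data at the new node and repoint the directory atomically.
    t.write_items.append(Item(new_node, new_addr, value))
    t.write_items.append(Item(dir_node, dir_addr, new_loc))
    # Freeing or zeroing the old range is left to the application's allocator.
    return t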

Fault-Tolerance
- Application node crash → no data loss or inconsistency
- Levels of protection:
  - Single memory node crashes do not impact availability
  - Multiple memory node crashes do not impact durability (if they restart and stable storage is unaffected)
  - Disaster recovery
- Four mechanisms:
  - Disk image – durability
  - Logging – durability
  - Replication – availability
  - Backups – disaster recovery

Fault-Tolerance Modes

Fault-Tolerance
- Standard 2PC blocks on coordinator crashes
  - Undesirable: application nodes fail frequently
  - Traditional solution: 3PC, with an extra phase
- Sinfonia instead uses a dedicated backup 'recovery coordinator'
- The protocol blocks on participant crashes
  - Assumption: memory nodes always recover from crashes (from principle 2: make components reliable before scaling them)
- A single-site minitransaction can be done as 1PC

Protocol Timeline
- Serializability via per-item locks, acquired all-or-nothing (sketched below)
  - If acquisition fails, abort and retry the transaction after an interval
- What if the coordinator fails between phase C and D?
  - Participant 1 has committed; participants 2 and 3 have not…
  - But 2 and 3 cannot be read until the recovery coordinator triggers their commit → no observable inconsistency
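A minimal sketch of the participant side of the prepare phase under these rules, reusing exec_and_prepare from the earlier sketch; the lock table, and the detail of force-logging the vote before replying, are simplified assumptions.

def prepare(memory, locks, t):
    """Try-lock every address touched by t, all-or-nothing, then evaluate it."""
    needed = {i.addr for i in t.compare_items + t.read_items + t.write_items}
    acquired = []
    for addr in needed:
        if locks.get(addr):            # held by another in-flight transaction
            for a in acquired:         # all-or-nothing: give back what we took
                locks[a] = False
            return "abort-retry", []   # coordinator retries after an interval
        locks[addr] = True
        acquired.append(addr)
    vote, results = exec_and_prepare(memory, t)
    if vote == "abort":
        for a in acquired:
            locks[a] = False
    # On "commit", the locks stay held until the coordinator's (or recovery
    # coordinator's) decision arrives in the second phase.
    return vote, results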

Recovery from coordinator crashes
- The recovery coordinator periodically probes memory node logs for orphan transactions
- Phase 1: requests participants to vote 'abort'; participants reply with their previously existing votes, or vote 'abort'
- Phase 2: tells participants to commit iff all votes are 'commit'
- Note: once a participant votes 'abort' or 'commit', it cannot change its vote → temporary inconsistencies due to coordinator crashes cannot be observed
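A hedged sketch of this recovery procedure; existing_vote, vote_abort, and finish are hypothetical RPCs against the participants' logs, and the key invariant is that a participant never changes a vote it has already cast.

def recover_orphan(participants, tid):
    # Phase 1: ask each participant to vote "abort" unless it already voted.
    votes = []
    for p in participants:
        previous = p.existing_vote(tid)          # "commit", "abort", or None
        votes.append(previous if previous else p.vote_abort(tid))
    # Phase 2: commit iff every participant had already voted "commit".
    decision = "commit" if all(v == "commit" for v in votes) else "abort"
    for p in participants:
        p.finish(tid, decision)
    return decision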

Redo Log
- The redo log is one of several data structures kept by a memory node
- Memory node recovery uses the redo log
- Log garbage collection: an entry is reclaimed only when its transaction has been durably applied at every memory node involved
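The garbage-collection rule can be stated as a one-line predicate; the seq numbers and the applied_upto bookkeeping are assumptions for illustration.

def can_gc(log_entry, applied_upto):
    """log_entry.mem_nodes lists the memory nodes involved in the transaction;
    applied_upto maps each memory node to the highest log sequence number it
    has durably applied to its disk image."""
    return all(applied_upto.get(node, -1) >= log_entry.seq
               for node in log_entry.mem_nodes)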

Additional Details
- Consistent disaster-tolerant backups: lock all addresses on all nodes (a blocking lock, not all-or-nothing)
- Replication: replicas are updated in parallel with 2PC; false fail-overs are handled by powering down the primary when a fail-over occurs
- Naming: a directory server maps logical ids to (ip, application id)

SinfoniaFS
- A cluster file system: cluster nodes (application nodes) share a common file system stored across memory nodes
- Sinfonia simplifies the design:
  - Cluster nodes do not need to be aware of each other
  - Do not need logging to recover from crashes
  - Do not need to maintain caches at remote nodes
  - Can leverage the write-ahead log for better performance
- Exports NFS v2: all NFS operations are implemented by minitransactions!

SinfoniaFS Design
- Inodes, data blocks (16 KB), and chaining-list blocks
- Cluster nodes can cache extensively; validation occurs before use via compare items in minitransactions
  - Read-only operations require only compare items; if a transaction aborts due to a mismatch, the cache is refreshed before retry (see the sketch below)
- Node locality: inode, chaining list, and file collocated
- Load-balancing: migration not implemented
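For example, an NFS setattr in a SinfoniaFS-like design could be a single minitransaction that validates the cached inode and installs the new attributes; the inode address and byte layout here are assumptions, not SinfoniaFS's actual on-disk format.

def setattr_mini(inode_node, inode_addr, cached_inode, new_inode):
    t = Minitransaction()
    # Cache validation: the inode on the memory node must match our cached copy.
    t.compare_items.append(Item(inode_node, inode_addr, cached_inode))
    # If it does, install the updated attributes atomically.
    t.write_items.append(Item(inode_node, inode_addr, new_inode))
    return t  # abort => cache is stale: refetch the inode and retry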

SinfoniaFS Design

SinfoniaGCS
- Design 1: a global queue of messages
  - Write: find the tail, add to it
  - Inefficient – retries require resending the message
- Better design: a global queue of pointers
  - Actual messages are stored in per-member queues
  - Write: add the message to the sender's data queue, then use a minitransaction to add a pointer to the global queue (sketched below)
- Metadata: view, member queue locations
- Essentially provides totally ordered broadcast
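A sketch of the improved design's broadcast path using the same hypothetical API; the tail-index word, the pointer-slot address, and the byte encodings are assumptions for illustration.

def enc(n):
    return str(n).encode()

def broadcast(data_node, data_addr, msg, gq_node, tail_addr, slot_addr, tail):
    # Step 1: store the message body in the sender's own data queue
    # (no contention with other senders).
    put_msg = Minitransaction()
    put_msg.write_items.append(Item(data_node, data_addr, msg))

    # Step 2: append a pointer at the global queue's tail. The compare on the
    # tail index makes the append conditional, which yields a total order; on
    # abort, re-read the tail and retry without resending the message body.
    append_ptr = Minitransaction()
    append_ptr.compare_items.append(Item(gq_node, tail_addr, enc(tail)))
    append_ptr.write_items.append(Item(gq_node, slot_addr, f"{data_node}:{data_addr}".encode()))
    append_ptr.write_items.append(Item(gq_node, tail_addr, enc(tail + 1)))
    return put_msg, append_ptr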

SinfoniaGCS Design

Sinfonia is easy to use

Evaluation: base performance
- Multiple threads on a single application node accessing a single memory node
- Each minitransaction accesses 6 items out of 50,000
- Comparison against Berkeley DB (using the address as the key), which suffers from B-tree contention

Evaluation: optimization breakdown
- Non-batched items: standard transactions
- Batched items: batched actions + 2PC
- 2PC combined: Sinfonia multi-site minitransactions
- 1PC combined: Sinfonia single-site minitransactions

Evaluation: scalability
- Aggregate throughput increases as memory nodes are added to the system
- A system of size n has n/2 memory nodes and n/2 application nodes
- Each minitransaction involves six items and two memory nodes (except for system size 2)

Evaluation: scalability

Effect of Contention
- The total number of items is reduced to increase contention
- Workloads: compare-and-swap, atomic increment (validate + write)
- Operations not directly supported by minitransactions suffer under contention due to the lock retry mechanism

Evaluation: SinfoniaFS
- Comparison of single-node Sinfonia with NFS

Evaluation: SinfoniaFS

Evaluation: SinfoniaGCS

Discussion
- No notification functionality: in a producer/consumer setup, how does the consumer know data is in the shared queue?
- Rudimentary naming
- No allocation / protection mechanisms
- SinfoniaGCS/SinfoniaFS: are the comparisons fair?

Discussion
- Does shared storage have to be infrastructural?
  - Extra power/cooling costs of a dedicated bank of memory nodes
  - Lack of fate-sharing: application reliability depends on external memory nodes
- Transactions versus locks
- What can minitransactions (not) do?

Conclusion
- Minitransactions: a fast subset of transactions, yet powerful (all NFS operations, for example)
- Infrastructural approach: dedicated memory nodes, dedicated backup 2PC coordinator
- Implication: a fast two-phase protocol can tolerate most failure modes