Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
Presented by Jay Chen, 9/19/07

Problem
- Building (scalable) distributed systems is hard
- Specifically, sharing data via message passing is error prone
- Distributed state protocols must be developed for:
  - Replication
  - File data and metadata management
  - Cache consistency
  - Group membership

Goals
- Want to build infrastructure applications such as cluster file systems, lock managers, and group communication services
- Want shared application data that is fault-tolerant, scalable, and consistent
- Want to make building these applications easier

Solution
- Change the paradigm for building scalable distributed systems
- Transform the problem from designing message-passing protocols to designing and manipulating data structures
- Export a minitransaction primitive that atomically accesses, and conditionally modifies, data at multiple nodes

Design Principles
- Principle 1: Reduce operation coupling to obtain scalability
  - Sinfonia does this by not imposing structure on the data it services
- Principle 2: Make components reliable before scaling them
  - Individual Sinfonia nodes are fault-tolerant

Components
- Memory nodes – hold application data, either in RAM or on stable storage
- User library – runs on application nodes
- Memory nodes and application nodes are logically distinct, but may run on the same machine
- Data lives in a linear address space referenced via (memory-node-id, address) pairs
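To make the addressing model concrete, a reference into this space can be modeled as a small record. A minimal Python sketch, with illustrative names that are assumptions, not Sinfonia's actual API:

```python
from dataclasses import dataclass

# Hypothetical reference to a byte range in Sinfonia's global address
# space; data is named by (memory-node-id, address) pairs. The type and
# field names are illustrative, not the paper's API.
@dataclass(frozen=True)
class ItemRef:
    mem_node_id: int  # which memory node holds the data
    addr: int         # offset into that node's linear address space
    length: int       # number of bytes referenced
```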

Minitransactions
- The coordinator executes a transaction by asking participants to perform one or more actions
  - At the end of the transaction the coordinator executes two-phase commit
  - Sinfonia piggybacks transactions on top of the two-phase commit protocol (see the sketch below)
- Guarantees:
  - Atomicity – the minitransaction executes completely or not at all
  - Consistency – data is not corrupted
  - Isolation – minitransactions are serializable
  - Durability – minitransactions are not lost even given failures
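A minimal sketch of the coordinator's side of this piggybacked protocol, assuming hypothetical participant objects with prepare/commit/abort methods; none of these names come from the paper:

```python
# Phase 1 (prepare) carries the minitransaction's items, so each
# participant executes its compares/reads/writes and votes in the same
# round trip; phase 2 just delivers the decision.
def exec_and_commit(items_by_node, participants):
    votes = {}
    for node_id, part in participants.items():
        # participant locks its items, checks compare items, buffers
        # write items, and votes yes/no
        votes[node_id] = part.prepare(items_by_node.get(node_id, []))
    committed = all(votes.values())
    for part in participants.values():
        if committed:
            part.commit()  # apply buffered write items
        else:
            part.abort()   # discard buffered writes, release locks
    return committed
```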

Minitransaction Details
- A minitransaction contains:
  - Compare items
  - Read items
  - Write items
- Minitransactions are powerful enough to implement many useful primitives:
  - Swap – a read item returns the old value and a write item replaces it
  - Compare-and-swap
  - Atomic read of many data items
  - Acquire a lease
  - Acquire multiple leases atomically
  - Change data only if a lease is held
- The application uses the user library to communicate with memory nodes through RPCs
  - Minitransactions are implemented on top of this (see the sketch below)
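A minimal sketch of how the three item types compose, using a stand-in for the user library; the class and method names are modeled loosely on the paper's description and are assumptions, not the real interface:

```python
# Stand-in for Sinfonia's user-library minitransaction object.
class Minitransaction:
    def __init__(self):
        self.cmp_items = []    # (ref, expected): commit only if equal
        self.read_items = []   # ref: contents returned on success
        self.write_items = []  # (ref, data): applied iff compares pass

    def cmp(self, ref, expected):
        self.cmp_items.append((ref, expected))

    def read(self, ref):
        self.read_items.append(ref)

    def write(self, ref, data):
        self.write_items.append((ref, data))

# Compare-and-swap falls out directly: the compare item guards the write.
def compare_and_swap(ref, old, new):
    t = Minitransaction()
    t.cmp(ref, old)     # abort if another client changed the value
    t.write(ref, new)   # otherwise install the new value atomically
    return t            # caller submits via the commit protocol
```

Acquiring a lease looks the same: compare the lease word against a "free" value and write the client's id if the compare succeeds.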

Various Implementation Details and Optimizations
- Fault tolerance – transparent recovery from:
  - Coordinator crashes – dedicated recovery coordinator node
  - Participant crashes – redo logs, decided lists (sketched below)
  - Complete system crashes – replay logs and vote
- Log garbage collection
- Read-only minitransactions are not logged
- Consistent backups – via locked disk snapshots
- Replication – primary-copy replication scheme
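A rough sketch of what participant-side replay from a redo log plus decided list could look like, under assumed record formats; Sinfonia's actual on-disk structures are not shown in the slides:

```python
# Replay committed writes from a redo log after a crash. Each log record
# is assumed to hold (tid, write_items) and the decided list maps a tid
# to its outcome; both formats are illustrative.
def replay_redo_log(redo_log, decided, memory):
    for tid, write_items in redo_log:
        if decided.get(tid) == "committed":
            for ref, data in write_items:
                memory[ref] = data  # reapply writes of committed txns
        # in-doubt transactions are not replayed here; their outcome is
        # settled first (e.g., the recovery coordinator collects votes)
```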

Application: Cluster File System
- NFS v2 interface for a cluster file system; on-disk structures:
  - Superblock – global info
  - Inodes – keep file attributes
  - Data blocks – 16 KB each
  - Free-block bitmap
  - Chaining-list blocks – indicate the blocks in a file
- All NFS functions are implemented with a single minitransaction (see the sketch below)
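As an illustration of the one-minitransaction-per-function idea, the sketch below reuses the hypothetical Minitransaction stand-in from earlier to allocate a data block for a file; every item reference and layout detail here is an assumption:

```python
# Allocate a block for a file in one atomic step: detect a concurrent
# inode update with a compare item, then flip the free-bitmap bit and
# thread the block into the file's chaining list.
def append_block(inode_ref, inode_before, inode_after,
                 bitmap_ref, bitmap_after, chain_ref, block_ptr):
    t = Minitransaction()
    t.cmp(inode_ref, inode_before)     # abort if the inode changed under us
    t.write(inode_ref, inode_after)    # updated attributes (size, mtime)
    t.write(bitmap_ref, bitmap_after)  # mark the new block as allocated
    t.write(chain_ref, block_ptr)      # link the block into the file
    return t                           # submit via the commit protocol
```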

Application: Group Communication
- The service ensures that all members receive the same messages, in the same order
- Instead of ensuring total order via token-ring schemes, each member has a dedicated queue stored on a memory node
- Messages are threaded together with "next" pointers to create a global list
  - Each message is given a global sequence number (GSN) once threaded
- Writers write to their own queue and update their lastThreaded value instead of updating a global tail pointer (see the sketch below)
  - To find the global tail, members read all the lastThreaded values and take the message with the highest GSN
- Readers keep a pointer to the latest message received, and follow "next" pointers to retrieve further messages
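A sketch of the threading step, again with the hypothetical Minitransaction stand-in: the compare item on the chosen tail's "next" pointer guarantees that at most one writer threads a given successor, and a failed compare just means retry against the new tail. All references and encodings are assumptions:

```python
NULL_PTR = b"\x00" * 8  # assumed encoding of a null "next" pointer

# Thread our newest queued message into the global list and bump our own
# lastThreaded GSN; no global tail pointer is ever written.
def thread_message(tail_next_ref, my_msg_ptr,
                   my_last_threaded_ref, new_gsn_bytes):
    t = Minitransaction()
    t.cmp(tail_next_ref, NULL_PTR)                # lost the race: abort, retry
    t.write(tail_next_ref, my_msg_ptr)            # link message after the tail
    t.write(my_last_threaded_ref, new_gsn_bytes)  # record our new GSN
    return t
```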

Costs and Considerations
- The evaluation shows that the system does not scale when data is spread across nodes or under contention
  - It is the application writer's job to consider node locality during application design (data accessed together should be on the same node)
  - This contrasts with data striping, which the paper argues improves single-user throughput but reduces scalability
  - Load migration is also the application's responsibility
- The evaluation focuses on data throughput; there are few measurements of latency
  - Latency seems fairly important for group communication systems

Discuss