Sinfonia: A New Paradigm for Building Scalable Distributed Systems Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis.

Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
Presented by Jay Chen, 9/19/07

Problem
- Building (scalable) distributed systems is hard
- Specifically, sharing data via message passing is error-prone
- Distributed state protocols must be developed for:
  - Replication
  - File data and metadata management
  - Cache consistency
  - Group membership

Goals
- Want to build infrastructure applications such as cluster file systems, lock managers, and group communication services
- Want shared application data that is fault-tolerant, scalable, and consistent
- Want to make building these applications easier

Solution
- Change the paradigm for building scalable distributed systems
- Transform the problem from designing message-passing protocols to designing and manipulating data structures
- Export a minitransaction primitive that atomically accesses and conditionally modifies data at multiple nodes

Design Principles
- Principle 1: Reduce operation coupling to obtain scalability
  - Sinfonia does this by not imposing structure on the data it services
- Principle 2: Make components reliable before scaling them
  - Individual Sinfonia nodes are fault-tolerant

Components
- Memory nodes: hold application data, either in RAM or on stable storage
- User library: runs on application nodes
- Memory nodes and application nodes are logically distinct, but may run on the same machine
- Linear address space referenced via (memory-node-id, address) pairs
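The addressing scheme on this slide can be illustrated with a minimal sketch. It assumes each memory node is just a flat byte array; the class name `MemoryNode` and the helpers `read_at` and `write_at` are hypothetical and are not part of Sinfonia's actual interface.

```python
class MemoryNode:
    """Hypothetical memory node: a flat, unstructured byte array."""

    def __init__(self, node_id, size=1024):
        self.node_id = node_id
        self.data = bytearray(size)

    def read(self, addr, length):
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        # Refuse writes that would run past the end of the address space.
        if addr < 0 or addr + len(payload) > len(self.data):
            raise IndexError("write outside the node's address space")
        self.data[addr:addr + len(payload)] = payload


# Assumption: three memory nodes, identified by small integers.
nodes = {i: MemoryNode(i) for i in range(3)}


def read_at(node_id, addr, length):
    """Read `length` bytes at the pair (memory-node-id, address)."""
    return nodes[node_id].read(addr, length)


def write_at(node_id, addr, payload):
    """Write bytes at the pair (memory-node-id, address)."""
    nodes[node_id].write(addr, payload)


write_at(2, 16, b"hello")
```

Because the nodes impose no structure on the bytes they hold, any layout (inodes, queues, bitmaps) is the application's own convention on top of these pairs.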

Minitransactions
- Coordinator executes a transaction by asking participants to perform one or more actions
  - At the end of the transaction the coordinator executes two-phase commit
  - Sinfonia piggybacks transactions on top of the two-phase commit protocol
- Guarantees:
  - Atomicity: minitransaction executes completely or not at all
  - Consistency: data is not corrupted
  - Isolation: minitransactions are serializable
  - Durability: minitransactions are not lost even given failures
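The piggybacking idea can be sketched as a two-phase commit whose first phase already carries the transaction's actions. This is a single-process illustration under simplifying assumptions (no network, no crashes, no logging); `Participant`, `prepare`, `finish`, and `run_transaction` are hypothetical names, not Sinfonia's API.

```python
class Participant:
    """Hypothetical memory node taking part in a two-phase commit."""

    def __init__(self):
        self.data = {}        # address -> value
        self.locked = set()   # addresses held by in-flight transactions
        self.staged = {}      # txid -> (locked addresses, writes to apply)

    def prepare(self, txid, compares, reads, writes):
        """Phase 1: lock, evaluate comparisons, read, and vote."""
        addrs = set(compares) | set(reads) | set(writes)
        if addrs & self.locked:
            return False, {}  # an address is busy: vote to abort
        if any(self.data.get(a) != v for a, v in compares.items()):
            return False, {}  # a comparison failed: vote to abort
        self.locked |= addrs
        self.staged[txid] = (addrs, dict(writes))
        return True, {a: self.data.get(a) for a in reads}

    def finish(self, txid, commit):
        """Phase 2: apply staged writes on commit, then release locks."""
        if txid not in self.staged:
            return            # this participant voted to abort
        addrs, writes = self.staged.pop(txid)
        if commit:
            self.data.update(writes)
        self.locked -= addrs


def run_transaction(txid, work):
    """Coordinator. `work` maps participant -> (compares, reads, writes)."""
    votes, results = [], {}
    for p, (compares, reads, writes) in work.items():
        ok, values = p.prepare(txid, compares, reads, writes)
        votes.append(ok)
        results[p] = values
    commit = all(votes)       # commit only if every participant voted yes
    for p in work:
        p.finish(txid, commit)
    return commit, (results if commit else {})


a, b = Participant(), Participant()
a.data["x"] = 1
ok1, _ = run_transaction("t1", {a: ({"x": 1}, [], {"x": 2}), b: ({}, [], {"y": 9})})
ok2, _ = run_transaction("t2", {a: ({"x": 1}, [], {"x": 3}), b: ({}, [], {"y": 0})})
```

In the second call the comparison on `x` fails at one participant, so neither participant applies its write.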

Minitransaction Details
- A minitransaction contains:
  - Compare items
  - Read items
  - Write items
- Minitransactions are expressive enough to implement powerful primitives:
  - Swap: read item returns old value and write item replaces it
  - Compare and swap
  - Atomic read of many data
  - Acquire a lease
  - Acquire multiple leases atomically
  - Change data if lease is held
- Application uses the user library to communicate with memory nodes through RPCs
  - Minitransactions are implemented on top of this
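A rough model of how the three item types combine into the listed primitives is sketched below. It runs in one process with no concurrency, so it shows only the conditional semantics, not atomicity across real nodes. The `memory` dictionary, the method name `exec_and_commit`, the helper functions, and the convention that a zero byte means "lease is free" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for two memory nodes: node id -> raw bytes.
memory = {0: bytearray(64), 1: bytearray(64)}
FREE = b"\x00"  # assumed marker for an unheld lease


@dataclass
class Minitransaction:
    compare_items: list = field(default_factory=list)  # (node, addr, expected bytes)
    read_items: list = field(default_factory=list)     # (node, addr, length)
    write_items: list = field(default_factory=list)    # (node, addr, new bytes)

    def exec_and_commit(self):
        """Return (committed, read results); writes apply only if all compares match."""
        for node, addr, expected in self.compare_items:
            if bytes(memory[node][addr:addr + len(expected)]) != expected:
                return False, []
        results = [bytes(memory[n][a:a + length]) for n, a, length in self.read_items]
        for node, addr, data in self.write_items:
            memory[node][addr:addr + len(data)] = data
        return True, results


def compare_and_swap(node, addr, old, new):
    """One compare item guarding one write item."""
    t = Minitransaction(compare_items=[(node, addr, old)],
                        write_items=[(node, addr, new)])
    ok, _ = t.exec_and_commit()
    return ok


def swap(node, addr, new):
    """The read item returns the old value; the write item replaces it."""
    t = Minitransaction(read_items=[(node, addr, len(new))],
                        write_items=[(node, addr, new)])
    _, (old,) = t.exec_and_commit()
    return old


def acquire_leases(owner, locations):
    """Take every lease in `locations`, or none of them."""
    t = Minitransaction(compare_items=[(n, a, FREE) for n, a in locations],
                        write_items=[(n, a, owner) for n, a in locations])
    ok, _ = t.exec_and_commit()
    return ok
```

Real leases also carry an expiry time; this sketch models only the ownership marker.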

Various Implementation Details and Optimizations
- Fault tolerance: transparent recovery from:
  - Coordinator crashes: dedicated recovery coordinator node
  - Participant crashes: redo logs, decided lists
  - Complete system crashes: replay logs and vote
- Log garbage collection
- Read-only minitransactions are not logged
- Consistent backups: via locked disk snapshots
- Replication: primary copy replication scheme
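The general idea behind recovering a participant from a redo log can be sketched as a replay loop. This is a generic illustration, not Sinfonia's actual log format: the shape of `redo_log` (a list of transaction id and write pairs) and of `decided` (a mapping from transaction id to outcome) are assumptions, and undecided transactions are simply skipped here rather than resolved.

```python
def recover(redo_log, decided):
    """Rebuild state by replaying the logged writes of committed transactions."""
    state = {}
    for txid, writes in redo_log:           # replay in original log order
        if decided.get(txid) == "commit":   # skip aborted and undecided entries
            state.update(writes)
    return state


# Hypothetical log contents: t2 aborted, t3 was never decided.
log = [("t1", {"x": 1}), ("t2", {"x": 2}), ("t3", {"y": 5})]
outcomes = {"t1": "commit", "t2": "abort"}
recovered = recover(log, outcomes)
```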

Application: Cluster File System
- NFS v2 interface for cluster file system
  - Superblock: global info
  - Inodes keep file attributes
  - Data blocks: 16 KB each
  - Free-block bitmap
  - Chaining-list blocks: indicate blocks in a file
- Each NFS function is implemented with a single minitransaction

Application: Group Communication
- Service ensures that all members receive the same messages and in the same order
- Instead of ensuring total order via token ring schemes, each member has a dedicated queue stored on a memory node
- Messages are threaded together with "next" pointers to create a global list
  - Each message is given a global sequence number (GSN) once threaded
- Writers write to their queue and update their lastThreaded value instead of updating a global tail pointer
  - To find the global tail, members can read all the lastThreaded values and find the message with the highest GSN
- Readers keep a pointer to the latest message received, and follow "next" pointers to retrieve further messages
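The queue-and-threading scheme can be sketched in a single process as follows. In the real service the threading step would run as a minitransaction against memory nodes; here it is plain Python with no concurrency, and the names `Message`, `GroupState`, `Reader`, and `broadcast` are hypothetical.

```python
class Message:
    def __init__(self, sender, body):
        self.sender, self.body = sender, body
        self.gsn = None    # global sequence number, assigned once threaded
        self.next = None   # "next" pointer in the global list


class GroupState:
    def __init__(self, members):
        self.queues = {m: [] for m in members}           # one queue per member
        self.last_threaded = {m: None for m in members}  # per-member lastThreaded
        self.head = None                                 # first message ever threaded

    def global_tail(self):
        """Read every lastThreaded value and pick the highest GSN."""
        threaded = [m for m in self.last_threaded.values() if m is not None]
        return max(threaded, key=lambda m: m.gsn, default=None)

    def broadcast(self, sender, body):
        msg = Message(sender, body)
        self.queues[sender].append(msg)      # write to the sender's own queue
        tail = self.global_tail()
        if tail is None:
            msg.gsn, self.head = 1, msg      # first message in the global list
        else:
            msg.gsn = tail.gsn + 1
            tail.next = msg                  # thread onto the global list
        self.last_threaded[sender] = msg     # no shared tail pointer to update
        return msg


class Reader:
    def __init__(self, state):
        self.state = state
        self.latest = None                   # latest message received so far

    def receive(self):
        """Follow "next" pointers from the latest message received."""
        out = []
        nxt = self.state.head if self.latest is None else self.latest.next
        while nxt is not None:
            out.append(nxt)
            self.latest = nxt
            nxt = nxt.next
        return out


g = GroupState(["a", "b"])
for sender, body in [("a", "m1"), ("b", "m2"), ("a", "m3")]:
    g.broadcast(sender, body)
r = Reader(g)
first_batch = [m.body for m in r.receive()]
```

Each writer touches only its own queue and its own lastThreaded entry, plus the "next" pointer of the current tail, which is the point of avoiding a single global tail pointer.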

Costs and Considerations
- The evaluation shows that the system does not scale well with data spread or with contention
  - It is the application writer's job to consider node locality during application design (data accessed together should be on the same node)
  - This contrasts with data striping, which the authors argue improves single-user throughput but reduces scalability
  - Load migration is also the application's responsibility
- The evaluations focus on data throughput; there are few latency measurements
  - Latency seems fairly important for group communication systems

Discuss
