1
Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
Presented by Jay Chen, 9/19/07
2
Problem
Building (scalable) distributed systems is hard
Specifically, sharing data via message passing is error-prone
Distributed state protocols must be developed for:
  Replication
  File data and metadata management
  Cache consistency
  Group membership
3
Goals
Want to build infrastructure applications such as cluster file systems, lock managers, and group communication services
Want shared application data that is fault-tolerant, scalable, and consistent
Want to make building these applications easier
4
Solution
Change the paradigm for building scalable distributed systems
Transform the problem from designing message-passing protocols to designing and manipulating data structures
Export a minitransaction primitive that atomically accesses and conditionally modifies data at multiple nodes
5
Design Principles
Principle 1: Reduce operation coupling to obtain scalability
  Sinfonia does this by not imposing structure on the data it services
Principle 2: Make components reliable before scaling them
  Individual Sinfonia nodes are fault-tolerant
6
Components
Memory nodes: hold application data, either in RAM or on stable storage
User library: runs on application nodes
Memory nodes and application nodes are logically distinct, but may run on the same machine
Each memory node keeps a linear address space; data is referenced via (memory-node-id, address) pairs
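A minimal sketch of the (memory-node-id, address) addressing scheme described above, using in-process byte arrays in place of real memory nodes. The class and function names (MemoryNode, read, write) are hypothetical and chosen for illustration; they are not Sinfonia's actual API.

```python
class MemoryNode:
    """Hypothetical in-process stand-in for one memory node."""

    def __init__(self, size):
        # Unstructured linear address space: just bytes, no imposed layout.
        self.data = bytearray(size)

    def read(self, addr, length):
        if addr < 0 or addr + length > len(self.data):
            raise IndexError("read outside the node's address space")
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        if addr < 0 or addr + len(payload) > len(self.data):
            raise IndexError("write outside the node's address space")
        self.data[addr:addr + len(payload)] = payload


# Two memory nodes, keyed by memory-node-id (the ids are arbitrary here).
nodes = {0: MemoryNode(1024), 1: MemoryNode(1024)}


def read(node_id, addr, length):
    """Resolve a (memory-node-id, address) pair and read from that node."""
    return nodes[node_id].read(addr, length)


def write(node_id, addr, payload):
    """Resolve a (memory-node-id, address) pair and write to that node."""
    nodes[node_id].write(addr, payload)


write(1, 16, b"hello")
```

The same address on different nodes names different data, so an application has to decide for itself which node holds which item.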
7
Minitransactions
Coordinator executes a transaction by asking participants to perform one or more actions
At the end of the transaction the coordinator executes two-phase commit
Sinfonia piggybacks the transaction's actions onto the two-phase commit protocol itself
Guarantees:
  Atomicity: a minitransaction executes completely or not at all
  Consistency: data is not corrupted
  Isolation: minitransactions are serializable
  Durability: minitransactions are not lost even given failures
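The piggybacking idea can be sketched as a two-phase commit in which the first-phase message already carries each participant's share of the work. This is a single-process illustration under stated assumptions: the names (Participant, prepare, finish, run_minitransaction) are hypothetical, locking is per starting address only, and the choice to vote abort when a location is locked is an assumption of this sketch rather than something the slides state. Logging and failure handling are omitted.

```python
class Participant:
    """Hypothetical memory node taking part in two-phase commit."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.locked = set()   # addresses held by in-flight minitransactions
        self.pending = {}     # tid -> (locked addresses, staged writes)

    def prepare(self, tid, compares, writes):
        """Phase 1: receive the work, check it, and vote."""
        addrs = {a for a, _ in compares} | {a for a, _ in writes}
        if addrs & self.locked:
            return False      # assumption: vote abort instead of blocking
        for addr, expected in compares:
            if bytes(self.data[addr:addr + len(expected)]) != expected:
                return False  # a failed comparison is a vote to abort
        self.locked |= addrs
        self.pending[tid] = (addrs, writes)
        return True

    def finish(self, tid, commit):
        """Phase 2: apply staged writes on commit, then release locks."""
        if tid not in self.pending:
            return            # this participant voted abort earlier
        addrs, writes = self.pending.pop(tid)
        if commit:
            for addr, payload in writes:
                self.data[addr:addr + len(payload)] = payload
        self.locked -= addrs


def run_minitransaction(tid, participants, work):
    """Coordinator. work maps node id -> (compares, writes)."""
    votes = [participants[n].prepare(tid, c, w) for n, (c, w) in work.items()]
    decision = all(votes)
    for n in work:
        participants[n].finish(tid, decision)
    return decision


parts = {0: Participant(8), 1: Participant(8)}
# Commits: node 0 still holds a zero byte at address 0.
first = run_minitransaction(
    1, parts, {0: ([(0, b"\x00")], [(0, b"\x07")]), 1: ([], [(4, b"\x09")])})
# Aborts everywhere: the comparison on node 0 no longer holds.
second = run_minitransaction(
    2, parts, {0: ([(0, b"\x00")], [(0, b"\x01")]), 1: ([], [(4, b"\x05")])})
```

Because the work travels with the prepare message, no separate round of messages is needed to execute the actions before commit begins.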
8
Minitransaction Details
A minitransaction contains:
  Compare items
  Read items
  Write items
Minitransactions are expressive enough to implement useful primitives:
  Swap: read item returns the old value and write item replaces it
  Compare-and-swap
  Atomic read of many data items
  Acquire a lease
  Acquire multiple leases atomically
  Change data if a lease is held
The application uses the user library to communicate with memory nodes through RPCs
Minitransactions are implemented on top of this
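To make the three item types concrete, here is a minimal single-process sketch, assuming memory nodes are plain byte arrays and that a minitransaction commits only if every compare item matches. The names (Minitransaction, exec_and_commit, swap, compare_and_swap) and the tuple layouts are hypothetical, not Sinfonia's actual interface, and the sketch ignores distribution and concurrency entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Minitransaction:
    compare_items: list = field(default_factory=list)  # (node_id, addr, expected bytes)
    read_items: list = field(default_factory=list)     # (node_id, addr, length)
    write_items: list = field(default_factory=list)    # (node_id, addr, new bytes)


def exec_and_commit(nodes, t):
    """Run t against nodes (dict: node_id -> bytearray).

    Returns (committed, read_results). If any comparison fails, nothing
    is read or written.
    """
    for node_id, addr, expected in t.compare_items:
        if bytes(nodes[node_id][addr:addr + len(expected)]) != expected:
            return False, []
    # Reads are taken before writes so a swap sees the old value.
    results = [bytes(nodes[n][a:a + length]) for n, a, length in t.read_items]
    for n, a, payload in t.write_items:
        nodes[n][a:a + len(payload)] = payload
    return True, results


def swap(nodes, node_id, addr, new):
    """Swap: the read item returns the old value, the write item replaces it."""
    t = Minitransaction(read_items=[(node_id, addr, len(new))],
                        write_items=[(node_id, addr, new)])
    _, results = exec_and_commit(nodes, t)
    return results[0]


def compare_and_swap(nodes, node_id, addr, old, new):
    """Compare-and-swap: write new only if the location still holds old."""
    t = Minitransaction(compare_items=[(node_id, addr, old)],
                        write_items=[(node_id, addr, new)])
    committed, _ = exec_and_commit(nodes, t)
    return committed


nodes = {0: bytearray(8), 1: bytearray(8)}
ok = compare_and_swap(nodes, 0, 0, b"\x00", b"\x01")      # succeeds
stale = compare_and_swap(nodes, 0, 0, b"\x00", b"\x02")   # fails: value is now 1
old = swap(nodes, 1, 4, b"\x09")                          # returns the previous byte
```

A lease acquisition would follow the same pattern as compare_and_swap: compare a lease location against "free" and write the new holder, possibly across several nodes in one minitransaction.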
9
Various Implementation Details and Optimizations
Fault tolerance: transparent recovery from:
  Coordinator crashes: dedicated recovery coordinator node
  Participant crashes: redo logs, decided lists
  Complete system crashes: replay logs and vote
Log garbage collection
Read-only minitransactions are not logged
Consistent backups: via locked disk snapshots
Replication: primary-copy replication scheme
10
Application: Cluster File System
NFS v2 interface for the cluster file system
Superblock: global info
Inodes: keep file attributes
Data blocks: 16 KB each
Free-block bitmap
Chaining-list blocks: indicate the blocks in a file
Each NFS function is implemented with a single minitransaction
11
Application: Group Communication
The service ensures that all members receive the same messages in the same order
Instead of ensuring total order via token-ring schemes, each member has a dedicated queue stored on a memory node
Messages are threaded together with “next” pointers to create a global list
Each message is given a global sequence number (GSN) once threaded
Writers write to their own queue and update their lastThreaded value instead of updating a global tail pointer
To find the global tail, members read all the lastThreaded values and pick the message with the highest GSN
Readers keep a pointer to the latest message received and follow “next” pointers to retrieve further messages
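The queue-and-threading scheme above can be sketched in a single process as follows. This is an illustration only: the names (Message, Group, broadcast, find_tail, read_from) are hypothetical, Python object references stand in for addresses on memory nodes, and there is no concurrency, so the atomic step that a real implementation would need when linking a message to the tail is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    sender: str
    body: str
    gsn: Optional[int] = None          # assigned once the message is threaded
    next: Optional["Message"] = None   # link to the following message


class Group:
    def __init__(self, members):
        # One dedicated queue and one lastThreaded value per member.
        self.queues = {m: [] for m in members}
        self.last_threaded = {m: None for m in members}

    def find_tail(self):
        """Read every lastThreaded value and pick the highest GSN."""
        threaded = [m for m in self.last_threaded.values() if m is not None]
        return max(threaded, key=lambda m: m.gsn, default=None)

    def broadcast(self, sender, body):
        msg = Message(sender, body)
        self.queues[sender].append(msg)      # write to the sender's own queue
        tail = self.find_tail()
        msg.gsn = 0 if tail is None else tail.gsn + 1
        if tail is not None:
            tail.next = msg                  # thread onto the global list
        self.last_threaded[sender] = msg     # no shared tail pointer to update
        return msg


def read_from(msg):
    """Reader: follow next pointers starting from a known message."""
    out = []
    while msg is not None:
        out.append((msg.gsn, msg.sender, msg.body))
        msg = msg.next
    return out


g = Group(["a", "b"])
first = g.broadcast("a", "x")
g.broadcast("b", "y")
g.broadcast("a", "z")
delivered = read_from(first)
```

Every reader that starts from the same message and follows the links sees the same sequence, which is what gives all members the same messages in the same order.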
12
Costs and Considerations
The evaluation shows that the system does not scale well with data spread or under contention
It is the application writer’s job to consider node locality during application design (data accessed together should be on the same node)
  This contrasts with data striping, which is argued to improve single-user throughput but reduce scalability
Load migration is also the application’s responsibility
The evaluations focus on data throughput; there are few evaluations of latency
  Latency seems fairly important for group communication systems
13
Discuss