1
CS294, YelickApplications, p1 CS 294-8 Applications of Reliable Distributed Systems http://www.cs.berkeley.edu/~yelick/294
2
CS294, YelickApplications, p2 Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail
3
CS294, YelickApplications, p3 Specialization vs. Single Solution
What does Grapevine do?
– Message delivery (mail)
– Naming, authentication, & access control
– Resource location
What does Porcupine do?
– Primarily a mail server
– Uses DNS for naming
Difference in distributed systems infrastructure
4
CS294, YelickApplications, p4 Grapevine Prototype
1983 configuration
– 17 servers (typically Altos): 128 KB memory, 5 MB disk, 30 µs procedure call
– 4400 individuals and 1500 groups
– 8500 messages, 35,000 receptions per day
Designed for up to 30 servers, 10K users
Used as the actual mail server at PARC
– Grew from 5 to 30 servers over 3 years
5
CS294, YelickApplications, p5 Porcupine Prototype
30-node PC cluster (not-quite-identical nodes)
– Linux 2.2.7
– 42,000 lines of C++ code
– 100 Mb/s Ethernet + 1 Gb/s hubs
Designed for up to 1 billion messages/day
Evaluated with a synthetic load
6
CS294, YelickApplications, p6 Functional Homogeneity
Any node can perform any function
– Why did they consider abandoning it in Grapevine?
– Other internet services?
Principle: functional homogeneity
Techniques: replication, automatic reconfiguration, dynamic scheduling
Goals: availability, manageability, performance
7
CS294, YelickApplications, p7 Evolutionary Growth Principle
To grow over time, a system must use scalable data structures and algorithms
Given p nodes:
– O(1) memory per node
– O(1) time to perform important operations
This is "ideal" but often impractical
– E.g., O(log(p)) may be fine
Each order of magnitude "feels" different: under 10, 100, 1K, 10K
8
CS294, YelickApplications, p8 Separation of Hard/Soft State
Hard state: information that cannot be lost, so it must use stable storage; replication is also used to increase availability
– E.g., message bodies, passwords
Soft state: information that can be reconstructed from hard state; not replicated (except for performance)
– E.g., the list of nodes containing a user's mail
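A minimal sketch of the distinction, using hypothetical class names rather than either system's real code: hard state is forced to stable storage before anything else depends on it, while soft state lives only in memory and is rebuilt from the hard state after a crash.

  #include <fstream>
  #include <map>
  #include <set>
  #include <string>
  #include <utility>

  // Hard state: must survive crashes, so every update is forced to
  // stable storage (and, in the real systems, also replicated) before
  // it is acknowledged to the client.
  class MessageStore {
  public:
      explicit MessageStore(std::string path) : path_(std::move(path)) {}
      void append(const std::string& user, const std::string& body) {
          std::ofstream f(path_, std::ios::app);
          f << user << '\n' << body << "\n--\n";
          f.flush();                      // stand-in for fsync / stable storage
      }
  private:
      std::string path_;
  };

  // Soft state: the per-user "mail map" of nodes holding mail fragments.
  // It lives only in memory; after a crash it is rebuilt by asking every
  // node to scan its hard state (the on-disk fragments).
  class MailMap {
  public:
      void noteFragment(const std::string& user, int node) {
          nodes_[user].insert(node);
      }
      void rebuild(const std::map<std::string, std::set<int>>& scanned) {
          nodes_ = scanned;               // cheap to lose, cheap to recompute
      }
  private:
      std::map<std::string, std::set<int>> nodes_;
  };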
9
CS294, YelickApplications, p9 Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail
10
CS294, YelickApplications, p10 Sending a Message: Grapevine
– User calls the Grapevine User Package (GVUP) on their own machine
– GVUP broadcasts looking for servers; a name server returns a list of registration servers
– GVUP selects one and sends the mail to it
– The mail server looks up the name in the "to" field
– It connects to the server holding the primary or secondary inbox for that name
[Diagram: client running GVUP, registration server, mail server, primary and secondary inbox servers]
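A compact sketch of that flow; the function names and in-memory stand-ins are hypothetical illustrations, not Grapevine's actual interfaces, and in the real system the inbox lookup and delivery run on the mail server that accepted the message rather than on the client.

  #include <iostream>
  #include <string>
  #include <vector>

  // Hypothetical, in-memory stand-ins for Grapevine's actual interfaces.
  struct Server { std::string name; };

  std::vector<Server> broadcastForRegistrationServers() {            // find servers
      return {{"reg-1"}, {"reg-2"}};
  }
  Server pickOne(const std::vector<Server>& s) { return s.front(); } // any will do
  std::vector<Server> inboxSitesFor(const std::string& /*recipient*/) {
      return {{"primary-inbox"}, {"secondary-inbox"}};
  }
  bool deliver(const Server& site, const std::string& to, const std::string&) {
      std::cout << "delivered mail for " << to << " via " << site.name << "\n";
      return true;
  }

  // End-to-end flow: hand the message to some registration/mail server,
  // then deliver to the recipient's primary inbox site, falling back to
  // the secondary if the primary is unreachable.
  bool sendMail(const std::string& to, const std::string& msg) {
      auto regs = broadcastForRegistrationServers();
      if (regs.empty()) return false;
      Server accepting = pickOne(regs);
      std::cout << "handed message to " << accepting.name << "\n";
      for (const Server& site : inboxSitesFor(to))
          if (deliver(site, to, msg)) return true;
      return false;
  }

  int main() { sendMail("bob.pa", "hello"); }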
11
CS294, YelickApplications, p11 Replication in Grapevine
Sender-side replication
– Any node can accept any mail
Receiver-side replication
– Every user has 2 copies of their inbox
Message bodies are not replicated
– Stored on disk; almost always recoverable
– Message bodies are shared among recipients (4.7x sharing on average)
What conclusions can you draw?
12
CS294, YelickApplications, p12 Reliability Limits in Grapevine
– Only one copy of each message body
– Direct connection between the mail server and (one of the 2) inbox machines
– Others?
13
CS294, YelickApplications, p13 Limits to Scaling in Grapevine
Every registration server knows the names of all (15 KB for 17 nodes):
– Registration servers
– Registries: logical groups/mailing lists
– Could add hierarchy for scaling
Resource discovery
– Linear search through all servers to find the "closest"
– How important is distance?
14
CS294, YelickApplications, p14 Configuration Questions
When to add servers:
– When load is too high
– When the network is unreliable
Where to distribute registries
Where to distribute inboxes
All decisions are made by humans in Grapevine
– Some rules of thumb, e.g., for registration and mail servers the primary inbox is local, the secondary is nearby, and a third is at the other end of the internet
– Is there something fundamental here?
Handling node failures and link failures (partitions)?
15
CS294, YelickApplications, p15 Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail
16
CS294, YelickApplications, p16 Porcupine Structures
– Mailbox fragment: chunk of mail messages for 1 user (hard)
– Mail map: list of nodes containing fragments for each user (soft)
– User profile db: names, passwords, … (hard)
– User profile soft state: copy of the profile, used for performance (soft)
– User map: maps user name (hashed) to the node currently storing the mail map and profile
– Cluster membership: nodes currently available (soft, but replicated)
Saito, 99
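Restating that taxonomy as a few C++ types makes the hard/soft split and the lookup path easier to see; the declarations below are hypothetical illustrations, not Porcupine's real definitions.

  #include <functional>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  using NodeId = int;

  // Hard state: a chunk of one user's messages stored on one node's disk.
  struct MailboxFragment {
      std::string user;
      std::vector<std::string> messages;
  };

  // Hard state: authoritative user database entry (name, password, ...).
  struct UserProfile {
      std::string name;
      std::string passwordHash;
  };

  // Soft state: which nodes currently hold mailbox fragments for each
  // user; never made stable, since a disk scan can always rebuild it.
  using MailMap = std::map<std::string, std::set<NodeId>>;

  // Maps a hashed user name to the node currently managing that user's
  // mail map and profile soft state.
  struct UserMap {
      std::vector<NodeId> buckets;        // one entry per hash bucket
      NodeId managerFor(const std::string& user) const {
          return buckets[std::hash<std::string>{}(user) % buckets.size()];
      }
  };

  // Soft state, but agreed on by every node via the membership protocol:
  // the set of nodes currently believed to be available.
  using ClusterMembership = std::set<NodeId>;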
17
CS294, YelickApplications, p17 Porcupine Architecture
[Architecture diagram: every node (A, B, …, Z) runs the same set of components: SMTP, POP, and IMAP servers, a load balancer, an RPC layer, a membership manager, a replication manager, a mailbox manager, a user DB manager, and a user manager, plus the user map, mail map, and user-profile soft state]
Saito, 99
18
CS294, YelickApplications, p18 Porcupine Operations
Protocol handling, user lookup, load balancing, message store:
1. "send mail to bob" arrives from the Internet at node A (picked by DNS round-robin)
2. A asks: who manages bob?
3. A asks bob's manager (B) to "verify bob"
4. B replies: "OK, bob has msgs on C and D"
5. A picks the best node(s) to store the new message (here C)
6. A sends "store msg" to C
Saito, 99
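Collapsed into one routine running on the node that received the SMTP connection, those six steps look roughly like the sketch below; the globals stand in for the cluster-wide soft state and every name here is a hypothetical illustration.

  #include <functional>
  #include <iostream>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  using NodeId = char;   // 'A', 'B', 'C', ... for readability

  // In-memory stand-ins for the cluster-wide soft state (hypothetical).
  std::vector<NodeId> userMap = {'B', 'A', 'C', 'B'};   // hash bucket -> manager
  std::map<std::string, std::set<NodeId>> mailMapAtManager = {{"bob", {'C', 'D'}}};

  NodeId managerFor(const std::string& user) {
      return userMap[std::hash<std::string>{}(user) % userMap.size()];
  }

  // Steps 1-6 from the slide, run on the node the SMTP connection
  // happened to reach (node A, chosen by DNS round-robin).
  void handleSend(const std::string& user, const std::string& msg) {
      NodeId mgr = managerFor(user);                 // 2. who manages bob?
      auto it = mailMapAtManager.find(user);         // 3-4. "verify bob" -> {C, D}
      if (it == mailMapAtManager.end()) { std::cout << "no such user\n"; return; }
      // 5. pick a storage node: here simply one that already holds bob's
      //    mail; the real load balancer weighs load and affinity.
      NodeId target = *it->second.begin();
      // 6. forward the message body to that node's mailbox manager.
      std::cout << "manager " << mgr << ": store " << msg.size()
                << " bytes for " << user << " on node " << target << "\n";
  }

  int main() { handleSend("bob", "hello"); }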
19
CS294, YelickApplications, p19 Basic Data Structures
[Diagram: the hash function applied to "bob" indexes into the user map (buckets B C A C A B A C), which names the node managing bob; that node's mail map / user info records bob : {A,C} (similarly ann : {B}, suzy : {A,C}, joe : {B}); mailbox storage on nodes A, B, and C holds the corresponding message fragments]
Saito, 99
20
CS294, YelickApplications, p20 Performance in Porcupine
Goal: scale performance linearly with cluster size
Strategy: avoid creating hot spots
– Partition data uniformly among nodes
– Fine-grain data partitioning
Saito, 99
21
CS294, YelickApplications, p21 How does Performance Scale?
[Throughput-vs-cluster-size graph; annotated points at 68m/day and 25m/day]
Saito, 99
22
CS294, YelickApplications, p22 Availability in Porcupine
Goals:
– Maintain function after failures
– React quickly to changes, regardless of cluster size
– Graceful performance degradation / improvement
Strategy: two complementary mechanisms
– Hard state (email messages, user profile): optimistic fine-grain replication
– Soft state (user map, mail map): reconstruction after membership change
Saito, 99
23
CS294, YelickApplications, p23 Soft-state Reconstruction
[Timeline diagram for nodes A, B, C after a membership change:
1. The membership protocol runs and the user map is recomputed (bucket assignments such as B C A B A B A C are remapped onto the live nodes)
2. A distributed disk scan rebuilds the mail-map entries (bob : {A,C}, joe : {C}, suzy : {A,B}, ann : {B}) on their new managers]
Saito, 99
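A toy, centralized version of those two steps, using the hypothetical structures sketched earlier; the real system distributes this work across the cluster and only rescans the buckets whose manager actually changed.

  #include <cstddef>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  using NodeId = int;

  // Step 1 (after the membership protocol settles): reassign user-map
  // buckets across the nodes that are currently alive.
  std::vector<NodeId> recomputeUserMap(const std::set<NodeId>& alive,
                                       std::size_t buckets) {
      std::vector<NodeId> nodes(alive.begin(), alive.end());
      std::vector<NodeId> map;
      for (std::size_t b = 0; b < buckets; ++b)
          map.push_back(nodes[b % nodes.size()]);
      return map;
  }

  // Step 2: every node scans its local mailbox fragments and reports
  // each user it finds; managers rebuild their mail-map entries from the
  // reports. localUsers stands in for the per-node disk scan results.
  std::map<std::string, std::set<NodeId>>
  rebuildMailMap(const std::map<NodeId, std::set<std::string>>& localUsers) {
      std::map<std::string, std::set<NodeId>> mailMap;
      for (const auto& [node, users] : localUsers)
          for (const auto& u : users)
              mailMap[u].insert(node);   // "node holds a fragment for u"
      return mailMap;
  }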
24
CS294, YelickApplications, p24 How does Porcupine React to Configuration Changes? Saito, 99
25
CS294, YelickApplications, p25 Hard-state Replication
Goals:
– Keep serving hard state after failures
– Handle unusual failure modes
Strategy: exploit Internet semantics
– Optimistic, eventually consistent replication
– Per-message, per-user-profile replication
– Efficient during normal operation
– Small window of inconsistency
Saito, 99
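One way to read "optimistic, eventually consistent" is the retry-until-acknowledged sketch below. It is a simplification under assumed names (no durable log, no conflict resolution), not Porcupine's actual replication manager.

  #include <algorithm>
  #include <functional>
  #include <string>
  #include <utility>
  #include <vector>

  using NodeId = int;

  // One replicated update, e.g. "store this message body for bob".
  struct Update {
      std::string key;
      std::string value;
      std::vector<NodeId> replicas;      // nodes that should hold a copy
      std::vector<NodeId> pendingAcks;   // replicas not yet confirmed
  };

  // Optimistic replication, very roughly: apply the update locally,
  // acknowledge the client, then keep retrying the push to each replica
  // until it acknowledges. Until every ack arrives the replicas can
  // disagree (the "small window of inconsistency"); mail tolerates this
  // because storing and deleting individual messages commute.
  class Replicator {
  public:
      explicit Replicator(std::function<bool(NodeId, const Update&)> push)
          : push_(std::move(push)) {}

      void submit(Update u) {
          u.pendingAcks = u.replicas;
          log_.push_back(std::move(u));  // a durable log in the real system
      }

      void retryPending() {              // called periodically
          for (auto& u : log_) {
              auto& p = u.pendingAcks;
              p.erase(std::remove_if(p.begin(), p.end(),
                                     [&](NodeId n) { return push_(n, u); }),
                      p.end());
          }
      }

  private:
      std::function<bool(NodeId, const Update&)> push_;
      std::vector<Update> log_;
  };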
26
CS294, YelickApplications, p26 How Efficient is Replication?
[Throughput graph; annotated points at 68m/day and 24m/day]
Saito, 99
27
CS294, YelickApplications, p27 How Efficient is Replication?
[Same throughput graph with a third configuration added; annotated points at 68m/day, 24m/day, and 33m/day]
Saito, 99
28
CS294, YelickApplications, p28 Load Balancing: Deciding Where to Store Messages
Goals:
– Handle skewed workloads well
– Support hardware heterogeneity
– No voodoo parameter tuning
Strategy: spread-based load balancing (sketched below)
– Spread: soft limit on the number of nodes per mailbox
– Large spread → better load balance; small spread → better affinity
– Load is balanced within the spread
– The number of pending I/O requests is the load measure
Saito, 99
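A minimal sketch of spread-based selection under those rules; the hash-derived candidate set and the per-node pending-I/O vector are assumptions for illustration, and the real balancer also takes into account which nodes already store the user's mail.

  #include <climits>
  #include <cstddef>
  #include <functional>
  #include <string>
  #include <vector>

  using NodeId = int;   // node ids 0..N-1 also index the pendingIo vector

  // Spread-based choice of a storage node for one incoming message.
  // The candidate set is a hash-derived subset of the cluster of size
  // `spread`; within it we pick the node with the fewest pending disk
  // I/O requests. A larger spread balances load better; a smaller
  // spread keeps one user's mail on fewer nodes (better affinity).
  NodeId pickStorageNode(const std::string& user,
                         const std::vector<NodeId>& cluster,
                         const std::vector<int>& pendingIo,
                         std::size_t spread) {
      std::size_t h = std::hash<std::string>{}(user);
      NodeId best = -1;
      int bestLoad = INT_MAX;
      for (std::size_t i = 0; i < spread && i < cluster.size(); ++i) {
          NodeId n = cluster[(h + i) % cluster.size()];   // candidate set
          if (pendingIo[n] < bestLoad) { bestLoad = pendingIo[n]; best = n; }
      }
      return best;
  }

With spread = 2, for example, one user's mailbox fragments stay on at most a couple of nodes even when the cluster is large, while load spikes are still absorbed by picking the less busy of the two.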
29
CS294, YelickApplications, p29 How Well does Porcupine Support Heterogeneous Clusters?
[Graph; annotated throughput gains of +16.8m/day (+25%) and +0.5m/day (+0.8%)]
Saito, 99
30
CS294, YelickApplications, p30 Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail
31
CS294, YelickApplications, p31 Other Approaches
[Chart positioning the alternative designs (monolithic server, cluster-based OS, distributed file system & frontend, static partitioning, and Porcupine?) along axes of manageability and manageability & availability per dollar]
32
CS294, YelickApplications, p32 Consistency
Both systems use distribution and replication to achieve their goals
Ideally, these should be properties of the implementation, not the interface; i.e., they should be transparent
A common definition of "reasonable" behavior is transaction (ACID) semantics
33
CS294, YelickApplications, p33 ACID Properties
– Atomicity: A transaction's changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.
– Consistency: A transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.
– Isolation: Even though transactions execute concurrently, it appears to each transaction T that others executed either before T or after T, but not both.
– Durability: Once a transaction completes successfully (commits), its changes to the state survive failures.
Reuter
34
CS294, YelickApplications, p34 Consistency in Grapevine
Operations in Grapevine are not atomic
– Add name; put name on list
  Visible failure: the name is not yet available for the 2nd operation
  Could stick with a single server per session?
  A problem for sysadmins, not general users
– Add user to distribution list; mail to the list
  A problem for general users
  Invisible failure: mail not delivered to someone
– Distributed garbage collection (GC) is a well-known, hard problem
  Removing unused distribution lists is a related problem
35
CS294, YelickApplications, p35 Human Intervention
Grapevine has two types of operators
– Basic administrators
– Experts
In what ways is Porcupine easier to administer?
– Automatic load balancing
– Both do some dynamic resource discovery
36
CS294, YelickApplications, p36 Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail
37
CS294, YelickApplications, p37 Characteristics of Mail
Scale: commercial services handle 10M messages per day
Write-intensive; the following don't work:
– Stateless transformation
– Web caching
Consistency requirements are fairly weak
– Compared to file systems or databases
38
CS294, YelickApplications, p38 Other Applications
How would support for other applications differ?
– Web servers
– File servers
– Mobile network services
– Sensor network services
Read-mostly, write-mostly, or both
Disconnected operation (IMAP)
Continuous vs. discrete input
39
CS294, YelickApplications, p39 Harvest and Yield
Yield: probability of completing a query
Harvest: (application-specific) fidelity of the answer
– Fraction of data represented?
– Precision?
– Semantic proximity?
Harvest/yield questions (see the sketch below):
– When can we trade harvest for yield to improve availability?
– How do we measure the harvest "threshold" below which a response is not useful?
Copyright Fox, 1999
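Under the "fraction of data represented" reading of harvest, the bookkeeping is small enough to show directly; the struct and threshold parameter below are purely illustrative, and the threshold value itself is exactly the application-specific knob the questions above ask about.

  #include <cstddef>

  // Harvest/yield bookkeeping for queries against several partitions:
  // yield   = fraction of offered queries that were answered at all,
  // harvest = fraction of the data reflected in a given answer.
  struct QueryStats {
      std::size_t offered = 0, answered = 0;

      // Answer only if enough partitions reported back to be useful.
      bool record(std::size_t partitionsReporting, std::size_t partitionsTotal,
                  double harvestThreshold) {
          ++offered;
          double harvest =
              static_cast<double>(partitionsReporting) / partitionsTotal;
          bool useful = harvest >= harvestThreshold;   // app-specific cutoff
          if (useful) ++answered;
          return useful;
      }

      double yield() const {
          return offered ? static_cast<double>(answered) / offered : 0.0;
      }
  };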
40
CS294, YelickApplications, p40 Search Engine
Stripe the database randomly across all nodes; replicate high-priority data
– Random striping: worst case == average case
– Replication: high-priority data unlikely to be lost
– Harvest: fraction of nodes reporting
Questions…
– Why not just wait for all nodes to report back?
– Should harvest be reported to the end user?
– What is the "useful" harvest threshold?
– Is nondeterminism a problem?
Trade harvest for yield/throughput
Copyright Fox, 1999
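A sketch of that placement policy and the resulting harvest computation; the function names and hash-based placement are assumptions for illustration, standing in for whatever a real engine uses.

  #include <cstddef>
  #include <functional>
  #include <set>
  #include <string>

  using NodeId = int;

  // Random striping with extra replicas for high-priority documents:
  // every document lands on one hash-chosen node, and "important"
  // documents get extra copies so that losing a few nodes rarely loses
  // them entirely.
  std::set<NodeId> placeDocument(const std::string& docId, bool highPriority,
                                 std::size_t clusterSize,
                                 std::size_t extraCopies) {
      std::size_t h = std::hash<std::string>{}(docId);
      std::set<NodeId> homes = {static_cast<NodeId>(h % clusterSize)};
      if (highPriority)
          for (std::size_t i = 1; i <= extraCopies; ++i)
              homes.insert(static_cast<NodeId>((h + i) % clusterSize));
      return homes;
  }

  // Harvest for one query: the fraction of index nodes that reported
  // before the deadline. Because striping is random, any subset of
  // nodes is roughly representative, so the worst case looks like the
  // average case.
  double harvest(std::size_t nodesReporting, std::size_t clusterSize) {
      return static_cast<double>(nodesReporting) / clusterSize;
  }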
41
CS294, YelickApplications, p41 General Questions
What do both systems do to achieve:
– Parallelism (scalability)? Partitioned data structures
– Locality (performance)? Replication, scheduling of related tasks/data
– Reliability? Replication, stable storage
What are the trade-offs?
42
CS294, YelickApplications, p42 Administrivia
Read the wireless (Baker) paper for 9/7
– Short discussion next Thursday 9/7 (4:30-5:00 only)
Read the Network Objects paper for Tuesday
How to get the Mitzenmacher paper for next week
– Read tornado codes as well, if interested