CS 294-8 Applications of Reliable Distributed Systems

Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail

Specialization vs. Single Solution
What does Grapevine do?
– Message delivery (mail)
– Naming, authentication, & access control
– Resource location
What does Porcupine do?
– Primarily a mail server
– Uses DNS for naming
The difference reflects the distributed-systems infrastructure available to each: Grapevine had to provide its own naming layer, while Porcupine can rely on existing Internet services.

Grapevine Prototype
1983 configuration:
– 17 servers (typically Altos): 128 KB memory, 5 MB disk, 30 µs procedure call
– 4,400 individuals and 1,500 groups
– 8,500 messages and 35,000 receptions per day
Designed for up to 30 servers and 10K users
Used as the actual mail server at PARC:
– Grew from 5 to 30 servers over 3 years

Porcupine Prototype
30-node PC cluster (not quite identical machines):
– Linux
– 42,000 lines of C++ code
– 100 Mb/s Ethernet + 1 Gb/s hubs
Designed for up to 1 billion messages/day
Evaluated with a synthetic load

Functional Homogeneity
Any node can perform any function.
– Why did they consider abandoning this in Grapevine? What about other Internet services?
Principle: functional homogeneity.
Techniques: replication, automatic reconfiguration, dynamic scheduling.
Goals: availability, manageability, performance.

Evolutionary Growth Principle
To grow over time, a system must use scalable data structures and algorithms.
Given p nodes:
– O(1) memory per node
– O(1) time to perform important operations
This is the "ideal" but often impractical:
– E.g., O(log p) may be fine
Each order of magnitude "feels" different: under 10, 100, 1K, 10K nodes.

Separation of Hard/Soft State
Hard state: information that cannot be lost; it must be kept on stable storage, and replication is also used to increase availability.
– E.g., message bodies, passwords
Soft state: information that can be reconstructed from hard state; it is not replicated (except for performance).
– E.g., the list of nodes containing a user's mail
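To make the distinction concrete, here is a minimal Python sketch (all class and directory names are invented for illustration; neither system is written this way): hard state is fsync'd to stable storage, and optionally copied to a replica, before the operation is acknowledged, while soft state lives only in memory and is rebuilt by scanning the hard state after a crash.

```python
import os, tempfile

class Node:
    """Toy illustration of hard vs. soft state (hypothetical; not Grapevine or Porcupine code)."""

    def __init__(self, data_dir, replica=None):
        self.data_dir = data_dir          # stable storage for hard state
        self.replica = replica            # optional peer holding a second copy
        self.mail_map = {}                # SOFT state: user -> locations of that user's mail
        os.makedirs(data_dir, exist_ok=True)

    def store_message(self, user, msg_id, body):
        """Hard state: must not be lost, so write it to stable storage (and replicate it)."""
        path = os.path.join(self.data_dir, f"{user}.{msg_id}.msg")
        fd, tmp = tempfile.mkstemp(dir=self.data_dir)
        with os.fdopen(fd, "w") as f:
            f.write(body)
            f.flush()
            os.fsync(f.fileno())          # force to disk before acknowledging
        os.replace(tmp, path)             # atomic rename: the file is either complete or absent
        if self.replica is not None:
            self.replica.store_message(user, msg_id, body)   # replication for availability
        # Soft state is derived information; update it opportunistically.
        self.mail_map.setdefault(user, set()).add(self.data_dir)

    def rebuild_soft_state(self):
        """Soft state is not stored durably: reconstruct it by scanning the hard state."""
        self.mail_map = {}
        for name in os.listdir(self.data_dir):
            if name.endswith(".msg"):
                user = name.split(".")[0]
                self.mail_map.setdefault(user, set()).add(self.data_dir)

if __name__ == "__main__":
    node = Node("node_a_store")
    node.store_message("bob", "m1", "hello")
    node.mail_map.clear()                 # simulate losing soft state in a crash
    node.rebuild_soft_state()
    print(node.mail_map)                  # {'bob': {'node_a_store'}}
```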

Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail

Sending a Message: Grapevine
1. The user calls the Grapevine User Package (GVUP) on their own machine.
2. GVUP broadcasts looking for servers.
3. A name server returns a list of registration servers.
4. GVUP selects one and sends the mail to it.
5. The mail server looks up each name in the "to" field.
6. It connects to the server holding the primary or secondary inbox for that name.
(Diagram: client running GVUP, registration server, mail server, and the primary and secondary inbox servers.)
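The steps above might be sketched as the following toy simulation; every class and method name here is a hypothetical stand-in, not an actual GVUP or Grapevine interface.

```python
import random

# Minimal, self-contained sketch of Grapevine's send path. All classes are toy stand-ins
# invented for illustration; they are not GVUP's real interfaces.

class InboxServer:
    def __init__(self, name, up=True):
        self.name, self.up, self.inboxes = name, up, {}
    def deliver(self, recipient, body):
        if not self.up:
            return False
        self.inboxes.setdefault(recipient, []).append(body)
        return True

class MailServer:
    def __init__(self, name, inbox_sites):
        self.name = name
        self.inbox_sites = inbox_sites    # recipient -> [primary, secondary] inbox servers
    def accept(self, to, body):
        # Steps 5-6: look up each recipient and connect to its primary inbox server,
        # falling back to the secondary if the primary is unreachable.
        for recipient in to:
            for site in self.inbox_sites[recipient]:
                if site.deliver(recipient, body):
                    break

def send_message(mail_servers, to, body):
    # Steps 1-4 (collapsed): GVUP locates the servers and picks one;
    # any mail server can accept any message.
    random.choice(mail_servers).accept(to, body)

if __name__ == "__main__":
    primary = InboxServer("inbox-1", up=False)   # simulate a failed primary inbox server
    secondary = InboxServer("inbox-2")
    ms = MailServer("mail-1", {"bob": [primary, secondary]})
    send_message([ms], to=["bob"], body="hello")
    print(secondary.inboxes)                     # {'bob': ['hello']}
```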

Replication in Grapevine
Sender-side replication:
– Any node can accept any mail
Receiver-side replication:
– Every user has 2 copies of their inbox
Message bodies are not replicated:
– Stored on disk; almost always recoverable
– Message bodies are shared among recipients (4.7x sharing on average)
What conclusions can you draw?
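One way to picture this arrangement is the toy model below (invented names, not Grapevine code): the message body is stored once by the accepting mail server, while each recipient gets inbox entries, placed on a primary and a secondary inbox server, that merely point at that body.

```python
class GrapevineStorageModel:
    """Toy model of Grapevine-style mail storage, for counting copies only (not real code)."""

    def __init__(self):
        self.bodies = {}         # (mail server, message id) -> body: stored once, not replicated
        self.inbox_entries = {}  # (inbox server, user) -> list of message ids (pointers)

    def accept(self, mail_server, msg_id, body, recipients, inbox_sites):
        # The body is written once on the accepting mail server and shared by all
        # recipients (the slide reports about 4.7x sharing on average).
        self.bodies[(mail_server, msg_id)] = body
        for user in recipients:
            primary, secondary = inbox_sites[user]
            # Receiver-side replication: the inbox entry (a pointer to the body) is
            # placed on both the primary and the secondary inbox server for the user.
            for site in (primary, secondary):
                self.inbox_entries.setdefault((site, user), []).append(msg_id)

if __name__ == "__main__":
    model = GrapevineStorageModel()
    model.accept("mail-1", "m1", "lunch?", ["bob", "ann"],
                 {"bob": ("srv1", "srv2"), "ann": ("srv1", "srv3")})
    print(len(model.bodies), "stored body;",
          sum(len(v) for v in model.inbox_entries.values()), "inbox entries")
    # Output: 1 stored body; 4 inbox entries -- one body, but 2 recipients x 2 inbox copies.
```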

Reliability Limits in Grapevine
– Only one copy of the message body
– Direct connection between the mail server and (one of the 2) inbox machines
– Others?

Limits to Scaling in Grapevine
Every registration server knows the names of all (15 KB for 17 nodes):
– Registration servers
– Registries: logical groups/mailing lists
– Could add hierarchy for scaling
Resource discovery:
– Linear search through all servers to find the "closest"
– How important is distance?

Configuration Questions
When to add servers:
– When load is too high
– When the network is unreliable
Where to distribute registries; where to distribute inboxes.
All decisions are made by humans in Grapevine:
– Some rules of thumb, e.g., for registration and mail servers, the primary inbox is local, the second is nearby, and a third is at the other end of the internet.
– Is there something fundamental here?
Handling node failures and link failures (partitions)?

Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail

Porcupine Structures
– Mailbox fragment: a chunk of mail messages for one user (hard)
– Mail map: the list of nodes containing fragments for each user (soft)
– User profile db: names, passwords, … (hard)
– User profile soft state: a copy of the profile, used for performance (soft)
– User map: maps a user name (hashed) to the node currently storing that user's mail map and profile
– Cluster membership: the nodes currently available (soft, but replicated)
(Saito, 99)
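These structures might be summarized roughly as follows. Porcupine itself is written in C++; this is an illustrative Python sketch with invented field names, with the hard/soft classification from the slide recorded in comments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class MailboxFragment:          # HARD: a chunk of mail messages for one user, on one node
    user: str
    messages: List[str] = field(default_factory=list)

@dataclass
class UserProfile:              # HARD: names, passwords, ...
    name: str
    password_hash: str

@dataclass
class NodeState:
    """Illustrative per-node state in a Porcupine-like cluster (not the real C++ types)."""
    # SOFT: mail map -- for each user this node manages, the nodes holding fragments
    mail_map: Dict[str, Set[str]] = field(default_factory=dict)
    # SOFT: cached copies of user profiles, kept only for performance
    profile_cache: Dict[str, UserProfile] = field(default_factory=dict)
    # SOFT: user map -- hash bucket -> node managing that bucket's users
    user_map: List[str] = field(default_factory=list)
    # SOFT (but replicated): cluster membership -- nodes currently believed to be up
    membership: Set[str] = field(default_factory=set)
    # HARD: this node's on-disk mailbox fragments and its slice of the user profile DB
    fragments: Dict[str, MailboxFragment] = field(default_factory=dict)
    profile_db: Dict[str, UserProfile] = field(default_factory=dict)
```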

Porcupine Architecture
(Diagram: every node, A through Z, runs an identical set of components: SMTP, POP, and IMAP protocol servers; a load balancer; RPC; a mailbox manager with the mail map; a user manager holding the user map and user profile soft state; a user DB manager; a replication manager; and a membership manager. Saito, 99)

Porcupine Operations
Protocol handling, user lookup, load balancing, message store:
1. A client on the Internet says "send mail to bob"; DNS-RR selection picks a node, B, to handle the connection.
2. B asks: who manages bob? The user map answers: node A.
3. B asks A to "verify bob".
4. A replies: "OK, bob has msgs on C and D".
5. B picks the best node(s) to store the new message: C.
6. B tells C: "store msg".
(Saito, 99)

Basic Data Structures
(Diagram: the name "bob" is run through a hash function that indexes into the user map, an array of node IDs such as B C A C A B A C; the node found there holds bob's mail map / user info entry, e.g., bob : {A,C}, ann : {B}, suzy : {A,C}, joe : {B}; the mail map in turn points to the mailbox storage nodes, A, B, and C, that hold each user's message fragments. Saito, 99)
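A minimal sketch of the lookup path in this diagram, assuming a fixed-size user map of hash buckets replicated on every node; the bucket count and hash function here are illustrative, not Porcupine's actual parameters.

```python
import hashlib

# Illustrative user map: 8 hash buckets mapped to managing nodes.
USER_MAP = ["B", "C", "A", "C", "A", "B", "A", "C"]

# Each node's mail map (soft state); populated below for whichever users it manages.
MAIL_MAPS = {node: {} for node in set(USER_MAP)}

def bucket(user):
    # A stable hash (unlike Python's built-in hash(), which is salted per process).
    return int(hashlib.sha1(user.encode()).hexdigest(), 16) % len(USER_MAP)

def manager_of(user):
    """Hash the user name and index the user map to find the node managing that user."""
    return USER_MAP[bucket(user)]

def record_fragment(user, storage_node):
    """Tell the user's manager that storage_node now holds a mailbox fragment for the user."""
    MAIL_MAPS[manager_of(user)].setdefault(user, set()).add(storage_node)

def nodes_with_mail(user):
    """Ask the manager's mail map which nodes hold the user's mailbox fragments."""
    return MAIL_MAPS[manager_of(user)].get(user, set())

if __name__ == "__main__":
    record_fragment("bob", "A"); record_fragment("bob", "C"); record_fragment("ann", "B")
    for u in ("bob", "ann"):
        print(u, "-> manager", manager_of(u), "-> fragments on", sorted(nodes_with_mail(u)))
```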

Performance in Porcupine
Goal: scale performance linearly with cluster size.
Strategy: avoid creating hot spots.
– Partition data uniformly among nodes
– Fine-grain data partitioning
(Saito, 99)

How does Performance Scale?
(Graph: throughput vs. cluster size; curves labeled 68m/day and 25m/day. Saito, 99)

Availability in Porcupine
Goals:
– Maintain function after failures
– React quickly to changes regardless of cluster size
– Graceful performance degradation / improvement
Strategy: two complementary mechanisms
– Hard state (messages, user profile): optimistic fine-grain replication
– Soft state (user map, mail map): reconstruction after membership change
(Saito, 99)

Soft-state Reconstruction
(Diagram: a timeline across nodes A, B, and C. 1. The membership protocol runs and the user map is recomputed, reassigning hash buckets among the live nodes. 2. A distributed disk scan follows: each node reports the mailbox fragments on its disk, and the managers rebuild mail map entries such as bob : {A,C}, joe : {C}, suzy : {A,B}, ann : {B}. Saito, 99)
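The reconstruction shown in the diagram could be simulated roughly as below. This is an assumed simplification, not the real protocol: a membership change triggers recomputation of the user map over the live nodes, then each node scans its own disk and reports its fragments to the new managers, which rebuild their mail maps from that hard state.

```python
import hashlib

BUCKETS = 8

def bucket(user):
    return int(hashlib.sha1(user.encode()).hexdigest(), 16) % BUCKETS

class Node:
    def __init__(self, name, fragments):
        self.name = name
        self.fragments = set(fragments)   # HARD state: users with mailbox fragments on this disk
        self.mail_map = {}                # SOFT state: rebuilt below for users this node manages

def reconstruct(live_nodes):
    """Simplified soft-state reconstruction after a membership change (illustrative only)."""
    # 1. The membership protocol has produced `live_nodes`; recompute the user map by
    #    reassigning every hash bucket round-robin across the nodes still alive.
    user_map = [live_nodes[b % len(live_nodes)].name for b in range(BUCKETS)]
    by_name = {n.name: n for n in live_nodes}
    for n in live_nodes:
        n.mail_map = {}                   # discard stale soft state
    # 2. Distributed disk scan: each node reports its on-disk fragments to the new managers,
    #    which rebuild the mail map entries from that hard state.
    for n in live_nodes:
        for user in n.fragments:
            manager = by_name[user_map[bucket(user)]]
            manager.mail_map.setdefault(user, set()).add(n.name)
    return user_map

if __name__ == "__main__":
    a = Node("A", {"bob", "suzy"})
    b = Node("B", {"ann", "suzy"})
    c = Node("C", {"bob", "joe"})
    reconstruct([a, b, c])                # full cluster
    reconstruct([a, b])                   # node C fails: A and B take over C's hash buckets
    for n in (a, b):
        print(n.name, n.mail_map)
```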

How does Porcupine React to Configuration Changes?
(Graph not reproduced in this transcript. Saito, 99)

Hard-state Replication
Goals:
– Keep serving hard state after failures
– Handle unusual failure modes
Strategy: exploit Internet semantics
– Optimistic, eventually consistent replication
– Per-message, per-user-profile replication
– Efficient during normal operation
– Small window of inconsistency
(Saito, 99)
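The following sketch illustrates the general flavor of optimistic, eventually consistent replication: updates are applied and acknowledged locally, pushed to peers asynchronously, and reconciled deterministically (here by a simple highest-version-wins rule). It is an illustration of the technique, not Porcupine's actual replication manager.

```python
import itertools

_clock = itertools.count(1)   # stand-in for a (timestamp, node-id) ordering shared for the demo

class Replica:
    """Optimistic replication of per-object hard state (illustrative, not Porcupine's code)."""

    def __init__(self, name):
        self.name = name
        self.store = {}        # object id -> (version, value)
        self.peers = []

    def update(self, obj_id, value):
        # Optimistic: apply locally and acknowledge immediately, then push to peers.
        # Until the pushes complete there is a small window of inconsistency,
        # which Internet mail semantics tolerate.
        version = next(_clock)
        self.store[obj_id] = (version, value)
        for peer in self.peers:
            peer.receive(obj_id, version, value)   # in reality: queued and retried after failures

    def receive(self, obj_id, version, value):
        # Reconcile concurrent updates deterministically: the highest version wins,
        # so all replicas eventually converge to the same value.
        current = self.store.get(obj_id, (0, None))
        if version > current[0]:
            self.store[obj_id] = (version, value)

if __name__ == "__main__":
    a, b = Replica("A"), Replica("B")
    a.peers, b.peers = [b], [a]
    a.update("bob/msg1", "hello")
    b.update("bob/msg1", "hello, revised")
    print(a.store["bob/msg1"], b.store["bob/msg1"])   # both converge to the later update
```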

CS294, YelickApplications, p26 How Efficient is Replication? 68m/day 24m/day Saito, 99

CS294, YelickApplications, p27 How Efficient is Replication? 68m/day 24m/day 33m/day Saito, 99

Load Balancing: Deciding Where to Store Messages
Goals:
– Handle skewed workloads well
– Support hardware heterogeneity
– No voodoo parameter tuning
Strategy: spread-based load balancing
– Spread: a soft limit on the number of nodes per mailbox
– Large spread → better load balance; small spread → better affinity
– Load is balanced within the spread
– Use the number of pending I/O requests as the load measure
(Saito, 99)
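A minimal sketch of spread-based selection as described above; the data layout, spread value, and tie-breaking here are illustrative assumptions rather than Porcupine's implementation.

```python
def pick_storage_node(user, cluster, spread=2):
    """Pick a node to store a new message for `user`, balancing affinity against load.

    Illustrative sketch of spread-based load balancing: `cluster` maps node name ->
    {"pending_io": int, "users": set of users with mailbox fragments on that node}.
    """
    # Candidate set: nodes that already hold fragments for this user (affinity) ...
    candidates = [n for n, info in cluster.items() if user in info["users"]]
    # ... widened, up to the spread limit, with the least-loaded other nodes.
    if len(candidates) < spread:
        others = sorted((n for n in cluster if n not in candidates),
                        key=lambda n: cluster[n]["pending_io"])
        candidates += others[:spread - len(candidates)]
    # Within the spread, choose by load, measured as the number of pending I/O requests.
    return min(candidates, key=lambda n: cluster[n]["pending_io"])

if __name__ == "__main__":
    cluster = {
        "A": {"pending_io": 9, "users": {"bob"}},
        "B": {"pending_io": 2, "users": set()},
        "C": {"pending_io": 5, "users": {"bob"}},
    }
    # bob already has fragments on A and C (spread reached): pick the less loaded of the two.
    print(pick_storage_node("bob", cluster))   # C
    # ann has no fragments yet: the two least-loaded nodes form her spread.
    print(pick_storage_node("ann", cluster))   # B
```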

How Well does Porcupine Support Heterogeneous Clusters?
(Graph: data points labeled +16.8m/day (+25%) and +0.5m/day (+0.8%). Saito, 99)

Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail

Other Approaches
(Diagram: alternative designs (a monolithic server, a cluster-based OS, a distributed file system with frontends, and static partitioning) compared with Porcupine along axes of manageability and of manageability & availability per dollar.)

Consistency
Both systems use distribution and replication to achieve their goals.
Ideally, these should be properties of the implementation, not the interface; i.e., they should be transparent.
A common definition of "reasonable" behavior is transaction (ACID) semantics.

ACID Properties
– Atomicity: a transaction's changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.
– Consistency: a transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.
– Isolation: even though transactions execute concurrently, it appears to each transaction T that others executed either before T or after T, but not both.
– Durability: once a transaction completes successfully (commits), its changes to the state survive failures.
(Reuter)
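As a concrete single-node instance of atomicity (and, with a file-backed database, durability), the sketch below uses Python's sqlite3 module: the two inserts inside one transaction either both commit or both roll back, which is exactly the behavior that the multi-step Grapevine operations on the next slide lack.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE list_members (list TEXT, name TEXT)")

def add_user_and_subscribe(name, mailing_list, fail=False):
    """Atomicity demo: add a name AND put it on a list, or do neither."""
    try:
        with conn:                      # one transaction: commits on success, rolls back on error
            conn.execute("INSERT INTO users VALUES (?)", (name,))
            if fail:
                raise RuntimeError("simulated crash between the two operations")
            conn.execute("INSERT INTO list_members VALUES (?, ?)", (mailing_list, name))
    except Exception as e:
        print("aborted:", e)

add_user_and_subscribe("bob", "csl")               # both rows inserted
add_user_and_subscribe("ann", "csl", fail=True)    # neither row remains after rollback
print(conn.execute("SELECT name FROM users").fetchall())          # [('bob',)]
print(conn.execute("SELECT * FROM list_members").fetchall())      # [('csl', 'bob')]
```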

Consistency in Grapevine
Operations in Grapevine are not atomic:
– Add a name; put the name on a list. A visible failure: the name is not yet available for the 2nd operation. Could we stick with a single server per session? This is a problem for sysadmins, not general users.
– Add a user to a distribution list; mail to the list. This is a problem for general users, and an invisible failure: mail is silently not delivered to someone.
– Distributed garbage collection (GC) is a well-known, hard problem; removing unused distribution lists is a related problem.

Human Intervention
Grapevine has two types of operators:
– Basic administrators
– Experts
In what ways is Porcupine easier to administer?
– Automatic load balancing
– Both do some dynamic resource discovery

Agenda
– Design principles
– Specifics of Grapevine
– Specifics of Porcupine
– Comparisons
– Other applications besides mail

Characteristics of Mail
Scale: commercial services handle 10M messages per day.
Write-intensive; the following don't work:
– Stateless transformation
– Web caching
Consistency requirements are fairly weak:
– Compared to file systems or databases

Other Applications
How would support for other applications differ?
– Web servers
– File servers
– Mobile network services
– Sensor network services
Dimensions to consider: read-mostly, write-mostly, or both; disconnected operation (IMAP); continuous vs. discrete input.

Harvest and Yield
Yield: the probability of completing a query.
Harvest: the (application-specific) fidelity of the answer.
– Fraction of data represented?
– Precision?
– Semantic proximity?
Harvest/yield questions:
– When can we trade harvest for yield to improve availability?
– How do we measure the harvest "threshold" below which a response is not useful?
(Copyright Fox, 1999)

Search Engine
Stripe the database randomly across all nodes; replicate high-priority data.
– Random striping: worst case == average case
– Replication: high-priority data is unlikely to be lost
– Harvest: the fraction of nodes reporting
Questions:
– Why not just wait for all nodes to report back?
– Should harvest be reported to the end user?
– What is the "useful" harvest threshold?
– Is nondeterminism a problem?
Trade harvest for yield/throughput.
(Copyright Fox, 1999)
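The harvest side of this design can be sketched as a scatter-gather query with a deadline: send the query to every stripe, answer with whatever came back in time, and report harvest as the fraction of stripes that responded. The threading and timing scheme below is just one illustrative way to show the idea, not the architecture of any particular search engine.

```python
import concurrent.futures
import random
import time

def query_stripe(stripe_id, query):
    """Stand-in for searching one stripe of the index; a slow node may miss the deadline."""
    time.sleep(random.uniform(0.0, 0.2))
    return [f"{query}-hit-from-stripe-{stripe_id}"]

def scatter_gather(query, num_stripes=8, deadline=0.1):
    """Return (results, harvest): answer with whatever stripes reported before the deadline."""
    results, reported = [], 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_stripes) as pool:
        futures = [pool.submit(query_stripe, s, query) for s in range(num_stripes)]
        done, _ = concurrent.futures.wait(futures, timeout=deadline)
        for f in done:
            results.extend(f.result())
            reported += 1
    harvest = reported / num_stripes      # fraction of the (randomly striped) data represented
    return results, harvest

if __name__ == "__main__":
    hits, harvest = scatter_gather("porcupine")
    print(f"{len(hits)} hits, harvest = {harvest:.0%}")   # yield stays high; harvest may be < 100%
```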

General Questions
What do both systems do to achieve:
– Parallelism (scalability)? Partitioned data structures.
– Locality (performance)? Replication, and scheduling of related tasks with their data.
– Reliability? Replication and stable storage.
What are the trade-offs?

Administrivia
– Read the wireless (Baker) paper for 9/7: short discussion next Thursday 9/7 (4:30–5:00 only).
– Read the Network Objects paper for Tuesday.
– How to get the Mitzenmacher paper for next week: read the tornado codes paper as well, if interested.