Manageability, Availability and Performance in Porcupine: A Highly Scalable, Cluster-based Mail Service


What is Porcupine?
Porcupine is a scalable, highly available Internet mail service. It supports the SMTP protocol for sending and receiving messages across the Internet, and users retrieve their messages using any mail user agent that supports either the POP or IMAP retrieval protocols.
It runs on a cluster of 30 commodity PCs connected by 1 Gb/s Ethernet hubs, on Linux, using the ext2 file system for storage.
The system consists of fourteen major components written in C++; the total system size is about forty-one thousand lines of code, yielding a 1 MB executable.

System requirements
Manageability: the system must self-configure with respect to load and data distribution and self-heal with respect to failure and recovery.
Availability: despite component failures, the system should deliver good service to all of its users at all times.
Performance: the system should scale to hundreds of machines, which is sufficient to service a few billion mail messages per day with today's commodity PC hardware and system-area networks.

How to satisfy the system requirements

System architecture overview
A key aspect of Porcupine is its functional homogeneity: any node can perform any function. Replicated state provides the guarantees the service needs for the data it manages.
Hard state: information that cannot be lost and therefore must be maintained in stable storage, such as a mail message or a user's password.
Soft state: information that, if lost, can be reconstructed from existing hard state; for example, the list of nodes containing mail for a particular user.
BASE: basically available, soft state, eventual consistency.

Key Data Structures
Mailbox fragment: the collection of mail messages stored for a given user at a given node.
Mail map: the list of nodes containing mailbox fragments for a given user.
User profile database: describes Porcupine's client population.
User profile soft state: to separate the storage of user profiles from their management, each Porcupine node uniquely stores a soft-state copy of a subset of the profile database entries.
User map: a table that maps the hash value of each user name to the node currently responsible for managing that user's profile soft state and mail map (see the sketch below).
Cluster membership list: each node maintains its own view of the set of functioning nodes.
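
To make the user map concrete, here is a minimal sketch of hash-based lookup. It is hypothetical (Porcupine itself is written in C++): the table size, hash function, and node names are assumptions, not values from the paper.

```python
import hashlib

N_BUCKETS = 256  # assumed user-map size; the real table size may differ

def bucket(user: str) -> int:
    """Hash a user name into a fixed-size user-map bucket."""
    digest = hashlib.md5(user.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_BUCKETS

# user_map[bucket] -> node currently managing that bucket's soft state;
# here buckets are spread round-robin over a 30-node cluster.
user_map = {b: f"node{b % 30}" for b in range(N_BUCKETS)}

def manager_for(user: str) -> str:
    """Node managing this user's profile soft state and mail map."""
    return user_map[bucket(user)]

print(manager_for("ann"))
```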

Data Structure Managers (on each node)
User manager: manages soft state, including user profile soft state and mail maps.
Mailbox manager and user database manager: maintain persistent storage and enable remote access to mailbox fragments and user profiles.
Replication manager: ensures the consistency of replicated objects stored in the node's local persistent storage.
Membership manager: maintains the node's view of the overall cluster state.
Load balancer: guides placement decisions based on node load.
RPC manager: supports remote inter-module communication.
Delivery proxy: handles incoming SMTP requests.
Retrieval proxies: handle POP and IMAP requests.
(A composition sketch follows.)
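
A minimal sketch of the "any node can perform any function" idea, with the managers collapsed into a few fields. All names are assumptions; the real system is fourteen C++ components.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    user_soft_state: dict = field(default_factory=dict)  # user manager
    fragments: dict = field(default_factory=dict)        # mailbox manager
    membership: set = field(default_factory=set)         # membership manager

    def handle(self, protocol: str, user: str, payload=None):
        """Functional homogeneity: every node accepts every protocol."""
        if protocol == "SMTP":                # delivery proxy path
            self.fragments.setdefault(user, []).append(payload)
            return "250 OK"
        if protocol in ("POP", "IMAP"):       # retrieval proxy path
            return self.fragments.get(user, [])
        raise ValueError(f"unknown protocol {protocol}")

n = Node("node1")
n.handle("SMTP", "ann", "hello")
print(n.handle("POP", "ann"))  # ['hello']
```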

A Mail Transaction in Progress

Advantages and Tradeoffs
By decoupling the delivery and retrieval agents from the storage services and the user manager in this way, the system can:
balance mail delivery tasks dynamically, since any node can store mail for any user and no single node is permanently responsible for a user's mail or soft profile information;
be extremely fault tolerant.
The dilemma: load balance and fault tolerance argue for distributing a user's mail across a large number of nodes, while affinity and locality argue for limiting the spread of a user's mail.

SELF MANAGEMENT
1. Membership services
Porcupine uses a variant of the Three Round Membership Protocol (TRM) [Cristian and Schmuck 1995] to detect membership changes. Once membership has been established, the coordinator periodically broadcasts probe packets over the network.
How one node discovers the failure of another: a timeout mechanism in which nodes within a membership set periodically "ping" their next-highest neighbor in IP address order, with the node holding the largest IP address pinging the smallest (see the ring sketch below).
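
A minimal sketch of that ping ring, assuming IPv4 addresses sorted numerically; the helper name and the timeout handling are hypothetical.

```python
def ping_target(my_ip: str, members: list) -> str:
    """Next-highest IP in sorted order; the largest wraps to the smallest."""
    ring = sorted(members, key=lambda ip: tuple(map(int, ip.split("."))))
    i = ring.index(my_ip)
    return ring[(i + 1) % len(ring)]

members = ["10.0.0.3", "10.0.0.1", "10.0.0.7", "10.0.0.2"]
print(ping_target("10.0.0.7", members))  # wraps around: 10.0.0.1
# If the ping times out, the detecting node triggers the TRM protocol.
```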

An example of TRM

SELF MANAGEMENT
2. User map
The user map is replicated across all nodes and is recomputed during each membership change as a side effect of the TRM protocol. Each entry in the user map is associated with an epoch ID (see the earlier example). A reassignment sketch follows.
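
A minimal sketch of recomputing the map after a membership change. The round-robin reassignment policy and the epoch handling here are assumptions, not the paper's algorithm.

```python
import itertools

def recompute_user_map(user_map, epochs, alive, new_epoch):
    """Reassign buckets whose manager left the cluster, bumping their epoch."""
    survivors = itertools.cycle(sorted(alive))
    for b, node in user_map.items():
        if node not in alive:
            user_map[b] = next(survivors)
            epochs[b] = new_epoch  # new assignment is tagged with the epoch

umap = {0: "A", 1: "B", 2: "C", 3: "A"}
epochs = {b: 1 for b in umap}
recompute_user_map(umap, epochs, alive={"A", "C"}, new_epoch=2)
print(umap, epochs)  # bucket 1 moved off the failed node B, epoch bumped
```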

SELF MANAGEMENT
3. Soft state reconstruction
Reconstruction rebuilds the user profile soft state and the mail map for each user. It is a two-step process, completely distributed but unsynchronized:
The first step occurs immediately after the third round of membership reconfiguration: each node compares the previous and current user maps to identify any buckets having new assignments.
In the second step, every node that identifies a new bucket assignment sends the new manager of that bucket any soft state corresponding to its local hard state for the bucket (mailbox fragments and user profiles, if it has any).
Cost: constant per node in the long term, regardless of cluster size. (A sketch appears below.)
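
A minimal sketch of the two steps, with an assumed stub standing in for Porcupine's RPC manager; all names are hypothetical.

```python
def changed_buckets(old_map, new_map):
    """Step 1: buckets whose manager changed after reconfiguration."""
    return [b for b in new_map if old_map.get(b) != new_map[b]]

def send(target, payload):  # stand-in for the RPC manager
    print(f"-> {target}: {payload}")

def push_soft_state(fragments, old_map, new_map, bucket_of):
    """Step 2: send soft state derived from local hard state to new managers."""
    for b in changed_buckets(old_map, new_map):
        users = [u for u in fragments if bucket_of(u) == b]
        if users:
            send(new_map[b], {"bucket": b, "users": users})

push_soft_state({"ann": ["msg1"]}, {9: "B"}, {9: "C"}, bucket_of=lambda u: 9)
```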

SELF MANAGEMENT
4. Effects of configuration changes on mail sessions
When a node fails, all SMTP, POP, and IMAP sessions hosted on that node abort. For SMTP sessions, the remote MTAs retry delivery later (at the cost of delay and a possible duplicate message). A POP or IMAP session reports an error but can continue to retrieve messages stored on other nodes.
5. Node addition
An administrator installs the Porcupine software on the new node; the membership protocol notices it and adds it to the cluster; other nodes then upload soft state to it. A service rebalancer can then be run manually in the background.

REPLICATION AND AVAILABILITY
Porcupine replicates the user database and mailbox fragments to ensure their availability.
1. Replication properties
1) Update anywhere: an update can be initiated at any replica.
2) Eventual consistency: during periods of failure, replicas may become inconsistent for short periods of time, but conflicts are eventually resolved.
3) Total update: an update replaces an object's entire contents rather than modifying it in place.
4) Lock free: no distributed locks are needed.
5) Ordering by loosely synchronized clocks (see the sketch below).
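
A minimal last-writer-wins sketch of property 5, assuming updates are stamped with (local clock, node id) so that node ids break clock ties; this is an illustration, not the paper's exact rule.

```python
import time

def stamp(node_id: str):
    """Timestamp an update with the local clock; node id breaks ties."""
    return (time.time(), node_id)

def merge(current, incoming):
    """Keep whichever (stamp, value) pair carries the newer stamp."""
    return incoming if incoming[0] > current[0] else current

a = (stamp("A"), "old password")
b = (stamp("B"), "new password")
print(merge(a, b)[1])  # replicas applying both updates converge on one value
```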

REPLICATION AND AVAILABILITY
2. Replication manager
A replication manager running on each host exchanges messages among nodes to ensure replication consistency. It has two interfaces:
one for the creation and deletion of objects, used by the higher-level delivery and retrieval agents;
another for interfacing to the specific managers, which are responsible for maintaining on-disk data structures.
A user's mail map reflects all the replicas. For example, if Alice has two fragments, one replicated on nodes A and B and another replicated on nodes B and C, the mail map for Alice records {{A,B}, {B,C}} (see the sketch below).
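
A minimal sketch of a mail map holding replica sets, following the {{A,B}, {B,C}} example; the pick-first-live-replica policy is an assumption for illustration.

```python
# Alice's mail map: one replica set per mailbox fragment.
mail_map = {"alice": [{"A", "B"}, {"B", "C"}]}

def nodes_to_read(user, down=()):
    """Retrieval needs one live replica from each fragment's replica set."""
    picks = []
    for replicas in mail_map[user]:
        live = sorted(replicas - set(down))
        picks.append(live[0])  # any live replica can serve the fragment
    return picks

print(nodes_to_read("alice"))              # ['A', 'B']
print(nodes_to_read("alice", down=["B"]))  # still readable: ['A', 'C']
```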

REPLICATION AND AVAILABILITY
3. Creating and updating objects
To create a new replicated object (at delivery time), an agent generates an object ID and the set of nodes on which the object is to be replicated. An object ID is simply an opaque, unique string. Given an object ID and an intended replica set, a delivery or retrieval agent can initiate an update by sending an update message to any replication manager in the set.
Each replication manager maintains a persistent update log whose entries have the form <timestamp, objectID, target-nodes, remaining-nodes>.

The coordinating replication manager pushes updates to all the nodes found in the remaining-nodes field of an entry. When multiple updates to the same object are pending, the one with the newest timestamp wins. Each peer applies the update to its log and its replica and sends an ACK; the participants then retire the completed log entries. If the coordinator fails, the agent selects another coordinator; as a consequence, a message may be delivered to a user more than once. Updates remain in the log for at most a week; if a node is restored after that time, it must reenter the Porcupine cluster as a "new" node. (A sketch of the log and push loop follows.)
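
A minimal sketch of the coordinator's update log and push loop. The entry fields follow the text; the ACK simulation and all names are assumptions.

```python
import time
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    timestamp: float
    object_id: str
    targets: set                                  # full intended replica set
    remaining: set = field(default_factory=set)   # nodes still lacking the update

def push_updates(log, send):
    """Push each pending update to its remaining nodes; retire when all ACK."""
    for entry in list(log):
        for peer in sorted(entry.remaining):
            if send(peer, entry.object_id):   # peer applied the update, ACKed
                entry.remaining.discard(peer)
        if not entry.remaining:
            log.remove(entry)                 # retire the completed entry

log = [LogEntry(time.time(), "msg-42", {"B", "C"}, {"B", "C"})]
push_updates(log, send=lambda peer, oid: True)
print(log)  # [] once every replica has acknowledged
```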

DYNAMIC LOAD BALANCING
The delivery and retrieval proxies make the load-balancing decisions.
Load information is collected in two ways: (1) as a side effect of RPC operations, and (2) through a virtual ring. The load information consists of a boolean, which indicates whether or not the disk is full, and an integer, which is the number of pending remote procedure calls.
The load balancer is spread-limiting: a given user's mail will be found on relatively few nodes, but those nodes can change entirely each time the user retrieves and deletes mail from the server. (A selection sketch follows.)
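
A minimal sketch of spread-limited node selection under the two-part load metric; the spread-limit value and the selection policy are assumptions.

```python
def choose_node(user_nodes, load, spread_limit=4):
    """Pick a node for a new message. load[node] = (disk_full, pending_rpcs)."""
    local = [n for n in user_nodes if n in load and not load[n][0]]
    if len(local) >= spread_limit:
        return min(local, key=lambda n: load[n][1])  # enough spread: stay local
    usable = [n for n in load if not load[n][0]]
    return min(usable, key=lambda n: load[n][1])     # widen the user's spread

load = {"A": (False, 3), "B": (False, 1), "C": (True, 0)}
print(choose_node(["A"], load))  # 'B': A alone is under the spread limit
```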

Platform and Workload
System configuration: mailbox fragment files are stored under a per-bucket spool directory; if Ann's hash value is 9, then her fragment files are spool/9/ann and spool/9/ann.idx (the index file). Most of a node's memory is consumed by the soft user profile state.
A synthetic workload:
mean message size of 4.7 KB, with a fairly fat tail up to about 1 MB;
mail delivery (SMTP) accounts for about 90% of the transactions, with mail retrieval (POP) accounting for about 10%.
(A path sketch follows.)
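
A minimal sketch of that file layout; the helper name is hypothetical and only the spool/<hash>/<user> pattern comes from the text.

```python
import os

def fragment_paths(user: str, hash_value: int, root: str = "spool"):
    """Return (mail file, index file) for a user's local mailbox fragment."""
    d = os.path.join(root, str(hash_value))
    return os.path.join(d, user), os.path.join(d, user + ".idx")

print(fragment_paths("ann", 9))  # ('spool/9/ann', 'spool/9/ann.idx')
```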
