
Neptune: Scalable Replication Management and Programming Support for Cluster-based Network Services
Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Douglas K. Kuschner, and Huican Zhu
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/research/Neptune
USITS 2001, San Francisco

Motivations
Availability, incremental scalability, and manageability are key requirements for building large-scale network services. They are especially challenging for services with frequent updates to persistent data.
Existing solutions for managing persistent data:
- Pure data partitioning: no availability guarantee; poor at handling runtime hot-spots.
- Disk sharing: inherently unscalable; single point of failure.
- Replication provided by database vendors: tied to specific database systems; inflexible consistency.

Neptune Project Goal
- Design a scalable clustering architecture for aggregating and replicating network services with persistent data.
- Provide a simple and flexible programming model that shields the complexity of data replication, service discovery, load balancing, and failover management.
- Provide flexible replica consistency support to address availability and performance tradeoffs for different services.

Related Work
- TACC, MultiSpace: infrastructure support for cluster-based network services.
- DDS: distributed persistent data structures for network services.
- Porcupine: cluster-based email service (with commutative updates).
- Bayou: weak consistency for wide-area applications.
- BEA Tuxedo: platform middleware supporting transactional RPC.

Outline
- Motivations and Related Work
- System Architecture and Assumptions
- Replica Consistency and Failure Recovery
- System Implementation and Service Deployments
- Experimental Studies

Partitionable Network Services
Characteristics of network services:
- Information independence: service data can be divided into independent categories (e.g. discussion groups).
- User independence: data accessed by different users tends to be independent (e.g. an email service).
Neptune targets partitionable network services:
- Service data can be divided into independent partitions.
- Each service access can be delivered independently on a single partition, or aggregated from sub-services, each of which can be delivered independently on a single partition.
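Below is a minimal C sketch of the partitioning idea: each independent data category maps to exactly one partition, and every access for that category is delivered on that partition. The hash function and the 16-partition count are illustrative assumptions, not Neptune's actual scheme.

    #include <stddef.h>

    #define NUM_PARTITIONS 16   /* assumed partition count for illustration */

    /* Map an independent data category (e.g. a discussion group name) to the
     * single partition that every access for it will be delivered on. */
    int partition_for(const char *category)
    {
        unsigned long h = 5381;                       /* djb2 string hash */
        for (size_t i = 0; category[i] != '\0'; i++)
            h = h * 33 + (unsigned char)category[i];
        return (int)(h % NUM_PARTITIONS);
    }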

Conceptual Architecture for a Neptune Service Cluster
(architecture diagram; not reproduced in the transcript)

Neptune Components
Neptune components on the server and client sides:
- Neptune server module: starts, regulates, and terminates registered service instances, and maintains replica data consistency.
- Neptune client module: provides location-transparent access to application service clients.

Programming Interfaces
Request/response communication:
- Client-side API (called by service clients):
  NeptuneCall(CltHandle, Service, Partition, SvcMethod, Request, Response);
- Service interface (abstract interface that application services implement):
  SvcMethod(SvcHandle, Partition, Request, Response);
Stream-based communication:
- Neptune sets up a bidirectional stream between the service client and the service instance.
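The slide gives only the two signatures above; the C sketch below shows how they might look in practice. The handle and message types, the AddMessage method, and the post_message helper are all hypothetical, added only to make the calling pattern concrete.

    #include <stddef.h>

    typedef struct NeptuneCltHandle NeptuneCltHandle;   /* opaque client-side handle (assumed) */
    typedef struct NeptuneSvcHandle NeptuneSvcHandle;   /* opaque server-side handle (assumed) */
    typedef struct { const char *data; size_t len; } NeptuneMsg;  /* assumed message type */

    /* Client-side API, called by service clients. */
    int NeptuneCall(NeptuneCltHandle *clt, const char *service, int partition,
                    const char *svc_method, const NeptuneMsg *request,
                    NeptuneMsg *response);

    /* One service method implementing the abstract service interface. */
    int AddMessage(NeptuneSvcHandle *svc, int partition,
                   const NeptuneMsg *request, NeptuneMsg *response);

    /* Example client call: post a message to partition 3 of a hypothetical
     * "DiscussionGroup" service. */
    int post_message(NeptuneCltHandle *clt, const NeptuneMsg *msg, NeptuneMsg *reply)
    {
        return NeptuneCall(clt, "DiscussionGroup", 3, "AddMessage", msg, reply);
    }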

Assumptions
- All system modules follow the fail-stop failure model.
- Network partitions do not occur inside the service cluster. Neptune does, however, allow persistent data to survive all-node failures.
- Atomic execution is supported as long as each underlying service module ensures atomicity in a stand-alone configuration.

Neptune Replica Consistency Model
A service access is called a write if it changes the state of the persistent data, and a read otherwise.
- Level 1: Write-anywhere replication for commutative writes. Writes are accepted at any replica and propagated to peers. Example: an append-only message board.
- Level 2: Primary-secondary replication for ordered writes. Writes are accepted only at the primary node, then ordered and propagated to the secondaries.
- Level 3: Primary-secondary replication with staleness control, providing a soft time-based staleness bound and progressive version delivery. This is not strong consistency, because writes complete independently at each replica.
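A minimal C sketch of how a client module might route writes under the three levels; the type names and the round-robin hint are assumptions for illustration, not Neptune's implementation.

    enum ConsistencyLevel { WRITE_ANYWHERE = 1, PRIMARY_SECONDARY = 2, PRIMARY_WITH_STALENESS = 3 };

    struct Replica { const char *host; int is_primary; };

    /* Choose the replica that should accept a write for one data partition. */
    const struct Replica *route_write(enum ConsistencyLevel level,
                                      const struct Replica *replicas, int n, int rr_hint)
    {
        if (level == WRITE_ANYWHERE)        /* level 1: commutative writes go to any replica */
            return &replicas[rr_hint % n];
        for (int i = 0; i < n; i++)         /* levels 2 and 3: only the primary accepts writes */
            if (replicas[i].is_primary)
                return &replicas[i];
        return NULL;                        /* no primary currently known */
    }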

Soft Time-based Staleness Bound
Semantics: each read serviced at a replica is at most x seconds stale compared to the primary. This is important for services such as on-line auctions.
Implementation:
- Each replica periodically announces its data version.
- The Neptune client module directs requests only to replicas with a fresh enough version.
The bound is soft: it depends on network latency, announcement frequency, and intermittent packet losses.
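A sketch of the "fresh enough" test under stated assumptions: the client module records each replica's latest announced version along with an estimate of the primary's version roughly x seconds ago, reconstructed from the periodic announcements. The structure and field names are hypothetical.

    #include <time.h>

    struct ReplicaAnnounce {
        long   version;      /* data version in the replica's last announcement */
        time_t recv_time;    /* when the Neptune client module received it      */
    };

    /* Return 1 if a read may be sent to this replica under an x-second bound.
     * primary_version_then is the primary's version roughly x seconds ago.
     * The bound is soft: announcement frequency, network latency, and lost
     * packets all loosen it, as noted on the slide. */
    int may_serve_read(const struct ReplicaAnnounce *replica,
                       long primary_version_then)
    {
        return replica->version >= primary_version_then;
    }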

Progressive Version Delivery
From each client's point of view:
- Writes are always seen by subsequent reads.
- Versions delivered by reads are progressive.
This is important for services such as on-line auctions.
Implementation:
- Each replica periodically announces its data version.
- Each service invocation returns a version number, which the service client keeps as a session variable.
- The Neptune client module directs a read only to a replica whose announced version is at least as new as every previously returned version.
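A sketch of the session-variable mechanism just described; the structure and function names are assumptions.

    struct Session { long last_seen_version; };   /* kept by the service client */

    /* A replica may serve this client's read only if its announced version is
     * at least as new as every version previously returned to the client, so
     * reads are progressive and see the client's own completed writes. */
    int replica_ok_for_session(long announced_version, const struct Session *s)
    {
        return announced_version >= s->last_seen_version;
    }

    /* After each service invocation, the returned version updates the session. */
    void session_update(struct Session *s, long returned_version)
    {
        if (returned_version > s->last_seen_version)
            s->last_seen_version = returned_version;
    }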

Failure Recovery
A REDO log is maintained for each data partition at each replica, with two portions:
- Committed portion: completed writes.
- Uncommitted portion: writes received but not yet completed.
Three-phase recovery for primary-secondary replication (levels 2 and 3):
1. Synchronize with the underlying service module.
2. Recover missed writes from the current primary.
3. Resume normal operation.
Only phase one is necessary for write-anywhere replication (level 1).
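A structural C sketch of the per-partition REDO log and the three-phase recovery order; the record layout and function names are illustrative assumptions.

    struct WriteRecord { long seqno; /* ordered write identifier; payload omitted */ };

    struct RedoLog {
        struct WriteRecord *committed;     /* completed writes                      */
        int committed_count;
        struct WriteRecord *uncommitted;   /* writes received but not yet completed */
        int uncommitted_count;
    };

    void sync_with_service_module(struct RedoLog *log);           /* phase 1 */
    void recover_missed_writes_from_primary(struct RedoLog *log); /* phase 2 */
    void resume_normal_operation(struct RedoLog *log);            /* phase 3 */

    /* Recovery for primary-secondary replication (levels 2 and 3). */
    void recover_partition(struct RedoLog *log)
    {
        sync_with_service_module(log);
        recover_missed_writes_from_primary(log);
        resume_normal_operation(log);
    }
    /* Write-anywhere replication (level 1) needs only the first phase. */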

Outline
- Motivations and Related Work
- System Architecture and Assumptions
- Replica Consistency and Failure Recovery
- System Implementation and Service Deployments
- Experimental Studies

Prototype System Implementation
Implemented on a Linux cluster.
- Service availability and node runtime workload are announced through IP multicast; nodes multicast once a second, and announcements are kept as soft state that expires after five seconds.
- Service instances can run either as processes or as threads in the Neptune server runtime environment.
- Each Neptune server module maintains a process/thread pool and a waiting queue.
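A sketch of the soft-state membership check implied by these parameters (announcements once a second, expiry after five seconds); the NodeState structure is an assumption.

    #include <time.h>

    #define ANNOUNCE_INTERVAL_SEC 1   /* each node multicasts once a second     */
    #define EXPIRE_AFTER_SEC      5   /* soft state expires after five seconds  */

    struct NodeState {
        double load;            /* runtime workload carried in the announcement  */
        time_t last_announce;   /* local receipt time of the latest announcement */
    };

    /* A node is treated as available only while its soft state is fresh; a node
     * that stops announcing simply ages out, with no explicit leave protocol. */
    int node_available(const struct NodeState *n, time_t now)
    {
        return (now - n->last_announce) <= EXPIRE_AFTER_SEC;
    }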

Experience with Service Deployments
- On-line discussion group: view message headers, view a message, and add a message. All three consistency levels can be applied.
- Auction: level 3 consistency with staleness control is used.
- Persistent cache: stores key-value pairs (e.g. caching query results). Level 2 (primary-secondary) consistency is used.
These deployments allowed fast prototyping and implementation without worrying about replication and clustering complexities.

Experimental Settings for Performance Evaluation
Synthetic workloads:
- 10% and 50% write percentages.
- A balanced workload to assess best-case scalability.
- A skewed workload to evaluate the impact of runtime hot-spots.
Metric: maximum throughput at which at least 98% of client requests complete within 2 seconds.
Evaluation environment:
- Linux cluster of dual 400MHz Pentium II nodes with 512MB or 1GB of memory and dual 100Mb/s Ethernet interfaces.
- Lucent P550 Ethernet switch with 22Gb/s backplane bandwidth.
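A sketch of the acceptance test behind this metric: a run at a given offered rate passes if at least 98% of client requests complete within 2 seconds, and the maximum passing rate is the reported throughput. The bookkeeping below is an assumed illustration, not the paper's harness.

    #define LATENCY_BOUND_SEC 2.0
    #define REQUIRED_FRACTION 0.98

    /* Return 1 if a run satisfies the completion criterion, given the measured
     * latencies (in seconds) of all issued requests; timed-out or failed
     * requests should be recorded with a latency above the bound. */
    int run_passes(const double *latencies, long n)
    {
        long within = 0;
        for (long i = 0; i < n; i++)
            if (latencies[i] <= LATENCY_BOUND_SEC)
                within++;
        return (double)within / (double)n >= REQUIRED_FRACTION;
    }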

Scalability under Balanced Workload
- NoRep is about twice as fast as Rep=4 under 50% writes.
- The performance difference across the three consistency levels is insignificant under a balanced workload.

Skewed Workload
- Each skewed workload consists of requests chosen from a set of partitions according to a Zipf distribution.
- The workload imbalance factor is defined as the proportion of requests directed to the most popular partition.
- For a 16-partition service, an imbalance factor of 1/16 indicates a completely balanced workload; an imbalance factor of 1 means all requests are directed to a single partition.
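A sketch of how such a skewed workload could be generated and its imbalance factor measured; the Zipf exponent (1) and the generator itself are assumptions, since the slide does not specify them.

    #include <stdlib.h>

    #define PARTITIONS 16

    /* Draw a partition index with probability proportional to 1/(rank+1),
     * i.e. a Zipf-like distribution with exponent 1 (assumed). */
    int zipf_partition(void)
    {
        static double cdf[PARTITIONS];
        static int init = 0;
        if (!init) {
            double sum = 0.0, acc = 0.0;
            for (int i = 0; i < PARTITIONS; i++) sum += 1.0 / (i + 1);
            for (int i = 0; i < PARTITIONS; i++) {
                acc += (1.0 / (i + 1)) / sum;
                cdf[i] = acc;
            }
            init = 1;
        }
        double u = (double)rand() / RAND_MAX;
        for (int i = 0; i < PARTITIONS; i++)
            if (u <= cdf[i]) return i;
        return PARTITIONS - 1;
    }

    /* Imbalance factor: fraction of requests hitting the most popular partition
     * (1/16 means perfectly balanced, 1 means all requests on one partition). */
    double imbalance_factor(const long counts[PARTITIONS], long total)
    {
        long max = 0;
        for (int i = 0; i < PARTITIONS; i++)
            if (counts[i] > max) max = counts[i];
        return (double)max / (double)total;
    }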

Impact of Workload Imbalance on Replication Degrees
10% writes; level-2 consistency; 8 nodes.
Replication provides dynamic load sharing for runtime hot-spots: Rep=4 can be up to 3 times as fast as NoRep.

Impact of Workload Imbalance on Consistency Levels
10% writes; replication degree 4; 8 nodes.
Modest performance differences:
- Up to 12% between level 2 and level 3.
- Up to 9% between level 1 and level 2.

Failure Recovery for Primary-secondary Replication
- Graceful performance degradation.
- Performance drops after the three-node failure.
- Errors and timeouts trail each recovery, due to write recovery and synchronization overhead.

Conclusions
Contributions:
- Scalable replication for cluster-based network services, with multi-level consistency and staleness control.
- A simple programming model that shields replication and clustering complexities from application service authors.
Evaluation results:
- Replication improves performance for runtime hot-spots.
- The performance of level 3 consistency is competitive.
- Levels 2 and 3 carry extra overhead during failure recovery.
http://www.cs.ucsb.edu/research/Neptune