Presentation transcript:

© 2011 Cisco. All rights reserved. Cisco Confidential.

Slide 1

[Diagram: write path inside a Couchbase Server node (app server with client library, managed cache, replication queue, disk-write queue, disk and NIC), shown in three variants: the write response is returned (a) before writing to disk, (b) after writing to disk, and (c) after writing to disk and to one or more replicas on other nodes within the cluster.]

Data is persisted to disk by Couchbase Server asynchronously, based on rules in the system. This gives great performance, but if a node fails, all data still sitting in its queues is lost forever, with no chance to recover it. That is not acceptable for some of the Videoscape applications.

Trading performance for better reliability on a per-operation basis is supported in Couchbase, but there is a performance impact that needs to be quantified (assumed to be significant until proven otherwise). Even then, a node failure can still result in the loss of data sitting in the replication queue, meaning local writes to disk may not yet have replicas on other nodes.

Q1. Is this mode supported by Couchbase, i.e. respond to a write only after writing to the local disk and ensuring that at least one replica has been successfully copied to another node within the cluster?
Q2. If supported, can this be done on a per-operation basis?
Q3. Do you have any performance data for this and the previous case?

Please see the notes section below.
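For Q1 and Q2, a minimal sketch of what such a per-operation durable write could look like from the client side, assuming the observe-based durability options (PersistTo / ReplicateTo) of the Couchbase Java SDK 1.1.x; the node address, bucket, key and value are placeholders, and the exact package and enum names may vary between SDK versions:

```java
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;
import net.spy.memcached.ReplicateTo;

import java.net.URI;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class DurableWriteExample {
    public static void main(String[] args) throws Exception {
        // Placeholder node address and bucket name.
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://node1.example.com:8091/pools")),
                "videoscape-bucket", "");

        // Per-operation durability: block until the item has been persisted to disk on
        // the active node AND replicated to at least one other node in the cluster.
        boolean ok = client.set("doc::123", 0, "{\"state\":\"acked\"}",
                PersistTo.MASTER, ReplicateTo.ONE).get();

        if (!ok) {
            // The durability requirement was not met within the client timeout;
            // the application must retry or treat the write as failed.
            System.err.println("durable write was not confirmed");
        }

        client.shutdown(10, TimeUnit.SECONDS);
    }
}
```

Because the PersistTo/ReplicateTo arguments are passed per call, latency-sensitive operations can simply omit them, which is the per-operation trade-off Q2 asks about.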

Slide 2: Replication vs. Persistence

[Diagram: an app server writes to Server 1, whose managed cache feeds both a disk-write queue (to local disk) and a replication queue feeding the managed caches of Server 2 and Server 3.]

Replication allows us to block the write while a node pushes that write into the memcached (managed-cache) layer of two other nodes. This is typically a very quick operation, so control can be returned to the app server almost immediately instead of waiting on a write to disk.
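A corresponding sketch of the replicate-only mode this slide describes, reusing a CouchbaseClient set up as in the previous sketch (same assumed 1.1.x SDK, so the same caveats apply): the call blocks on in-memory replication to two other nodes and places no requirement on disk.

```java
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;
import net.spy.memcached.ReplicateTo;

public class ReplicateOnlyWrite {
    // Returns once the item is held in the managed cache of two other nodes;
    // there is no requirement that it has reached disk anywhere yet, so control
    // comes back to the app server much sooner than with a disk-backed write.
    public static boolean write(CouchbaseClient client, String key, Object value)
            throws Exception {
        return client.set(key, 0, value, PersistTo.ZERO, ReplicateTo.TWO).get();
    }
}
```

The trade-off is the one raised on the previous slide: a simultaneous failure of the active node and both replica nodes before their disk-write queues drain would still lose the item.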

Slide 3

[Diagram: a Couchbase Server node as in slide 1, with an additional XDCR queue behind the NIC shipping replicas to other data centers; the write response is shown being returned only after the replica reaches the remote DC.]

When are the replicas destined for other data centers placed in the XDCR queue? Based on the user manual, this is done only after writing to local disk, which will obviously add major latency.

Q4. Is this mode supported by Couchbase, i.e. respond to the client only after the replica has been successfully copied to a remote data center node/cluster?
Q5. Is there another option to accelerate this, as in the case of local replicas? My understanding is that when the replication queue is used for local replication, Couchbase puts the data in the replication queue before writing it to disk. Is this correct? Why is this different for the XDCR queue, i.e. write to disk first?

Please see the notes section below.

Slide 4

Q4. Is this mode supported by Couchbase, i.e. respond to the client only after the replica has been successfully copied to a remote data center node/cluster?

- As a workaround on the application side, the write operation can be overloaded so that, before it returns control to the app, it performs two GETs: one against the current cluster and one against the remote cluster for the item that was written, to determine when that write has been persisted to the remote cluster (a data center write). See the pseudo-code in the notes.
- We will support synchronous writes to a remote data center from the client side in an upcoming release of Couchbase Server.
- Lastly, XDCR is only necessary to span AWS Regions; a Couchbase cluster without XDCR configured can span AWS availability zones without issue.

Q5. Is there another option to accelerate this, as in the case of local replicas? My understanding is that when the replication queue is used for local replication, Couchbase puts the data in the replication queue before writing it to disk. Is this correct? Why is this different for the XDCR queue, i.e. write to disk first?

- XDCR replication to a remote DC is built on a different technology from the in-memory intra-cluster replication. XDCR is performed alongside writing data to disk so that it is more efficient: the flusher that writes to disk de-duplicates data (it writes only the last mutation for each document it is updating), and XDCR benefits from that de-duplication. This is particularly helpful for write-heavy / update-heavy workloads; the XDCR queue sends less data over the wire and is therefore more efficient.
- Lastly, we are also looking at a prioritized disk-write queue as a roadmap item. This feature could be used to accelerate a write's persistence to disk and so reduce latency for indexes and XDCR.

Q6 - per command.
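One possible shape of the overloaded-write workaround described under Q4, assuming two CouchbaseClient instances, one connected to each cluster (same assumed 1.1.x SDK as earlier). This is only an illustration of the approach, not the pseudo-code from the slide notes, and the timeout and backoff values are arbitrary:

```java
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;
import net.spy.memcached.ReplicateTo;

public class CrossDcWrite {
    /**
     * Write to the local cluster with intra-cluster durability, then poll GETs against
     * both clusters until the item is visible in the remote data center (or we time out).
     */
    public static boolean writeAndWaitForRemote(CouchbaseClient local,
                                                CouchbaseClient remote,
                                                String key, Object value,
                                                long timeoutMs) throws Exception {
        // 1. Durable local write: on disk on the active node plus one intra-cluster replica.
        if (!local.set(key, 0, value, PersistTo.MASTER, ReplicateTo.ONE).get()) {
            return false;
        }
        // 2. The two GETs described above: compare the local and remote copies until
        //    XDCR has shipped this mutation to the remote cluster.
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            Object localCopy = local.get(key);
            Object remoteCopy = remote.get(key);
            if (localCopy != null && localCopy.equals(remoteCopy)) {
                return true;                     // remote DC has caught up for this key
            }
            Thread.sleep(100);                   // fixed 100 ms backoff, arbitrary choice
        }
        return false;                            // not confirmed remotely within the timeout
    }
}
```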

Slide 5

[Diagram: a data center with Couchbase nodes 1-7 spread across Rack #1, Rack #2 ... Rack #n; an app server using the client library writes File-1, whose Replica-1 and Replica-2 land on different racks.]

Q6. Can Couchbase support rack-aware replication (as in Cassandra)? If we cannot control where the replicas are placed, a rack failure could lose all replicas (i.e. the docs become unavailable until the rack recovers).
Q7. How does Couchbase deal with that today? We need at least one of the replicas to be on a different rack. Note that in actual deployments we cannot always assume that Couchbase nodes will be placed in different racks. See the following link for an AWS/Hadoop example:

Please see the notes section below.

Slide 6: Couchbase Server Sharding Approach

Slide 7: Fail Over Node

[Diagram: a Couchbase Server cluster of five servers, each holding active documents and replica documents (user-configured replica count = 1); App Server 1 and App Server 2 use the Couchbase client library and its cluster map to route requests.]

1. App servers are accessing docs.
2. Requests to Server 3 fail.
3. The cluster detects that the server has failed.
4. Replicas of its docs are promoted to active on the remaining servers.
5. The cluster map is updated.
6. Requests for those docs now go to the appropriate server.
7. Typically a rebalance would follow.

Slide 8

[Diagram: a data center with Racks #1-#4; the active copy of each vBucket and its replica copies (e.g. vb1 active, replica 1, replica 2 and replica 3) are spread across different racks so that no single rack holds every copy.]

Couchbase supports rack-aware replication through the number of replica copies combined with limiting the number of Couchbase nodes in a given rack.

Slide 9

[Diagram: AWS East region with Zones A, B and C containing nodes 1-9 of a single Couchbase cluster. Assume inter-AZ latency of ~1.5 ms.]

Q8. If all the nodes of a single Couchbase cluster (nodes 1-9) are distributed among three availability zones (AZs) as shown, can Couchbase support AZ-aware replication? That is, can it ensure that the replicas of a doc are distributed across different zones, so that a zone failure does not result in doc unavailability?

Slide 10

[Diagram: the same nine-node, three-zone layout as slide 9; assume inter-AZ latency of ~1.5 ms.]

Couchbase currently supports zone affinity through the number of replicas combined with limiting the number of Couchbase nodes in a given zone. The replica factor is applied at the bucket level, and up to 3 replicas can be specified. Each replica is one full copy of the data set, distributed across the available nodes in the cluster; with 3 replicas the cluster contains 4 full copies of the data (1 active + 3 replicas).

By limiting the number of Couchbase nodes in a given zone to the replica count, losing a zone does not result in data loss, because a full copy of the data still exists in another zone. In the worst-case scenario for the example above, with 3 replicas enabled, the active copy lives on node 1, replica 1 on node 2, replica 2 on node 3 and replica 3 on node 4. If the zone containing nodes 1-3 goes down, those nodes can be automatically failed over, which promotes replica 3 on node 4 to active.

Explicit zone-aware replication (affinity) is a roadmap item for 2013.
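The sizing rule above can be written down as a simple check (an illustration of the reasoning, not a Couchbase tool): with R replicas there are R + 1 copies of every item, each on a distinct node, so as long as no zone (or rack) contains more than R nodes, a single zone failure always leaves at least one copy elsewhere.

```java
import java.util.Map;

public class ZoneSizingCheck {
    // True if losing any single zone still leaves at least one copy of every item.
    public static boolean survivesSingleZoneLoss(Map<String, Integer> nodesPerZone,
                                                 int replicaCount) {
        int copies = replicaCount + 1;            // 1 active + R replicas, on distinct nodes
        for (int nodesInZone : nodesPerZone.values()) {
            if (nodesInZone >= copies) {
                // This zone could host every copy of some vBucket, so a zone outage
                // could make those documents unavailable.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The layout on this slide: 9 nodes, 3 per zone.
        Map<String, Integer> layout = Map.of("zoneA", 3, "zoneB", 3, "zoneC", 3);
        System.out.println(survivesSingleZoneLoss(layout, 3));  // true: 3 nodes < 4 copies
        System.out.println(survivesSingleZoneLoss(layout, 2));  // false: 3 nodes >= 3 copies
    }
}
```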

Slide 11

We need performance tests showing how a single cluster scales up from 3 nodes, with throughput and latency measured in 5- or 10-node increments. During these tests, the following configuration and assumptions must be used:
- Nodes are physically distributed across 3 availability zones (e.g. AWS East zones).
- No data loss of any acknowledged write when a single node fails in any zone, or when an entire availability zone fails. It is OK to lose unacknowledged writes, since clients can deal with that. To achieve this, we need durable writes enabled, i.e. do not acknowledge the client's write request until the write has physically reached disk on the local node and at least one more replica has been written to disk on other nodes in different availability zones.

Even though the shared performance tests look great, the test assumptions used (lack of reliable writes) are unfortunately unrealistic for our Videoscape deployments and use-case scenarios. We need performance test results that are close to our use case. Please see the following Netflix tests, whose write-durability and multi-zone assumptions are very close to our use case:

Q9. Please advise if you are willing to conduct such tests!
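As a rough sketch of how the cost of the durable-write requirement could be measured on a test cluster (same assumed 1.1.x SDK and caveats as earlier; the payload size, key naming and single-threaded loop are arbitrary choices, not a proposed benchmark methodology):

```java
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.PersistTo;
import net.spy.memcached.ReplicateTo;

public class DurabilityLatencyProbe {
    public static void probe(CouchbaseClient client, int ops) throws Exception {
        String value = "x".repeat(1024);                        // 1 KB payload, arbitrary

        long t0 = System.nanoTime();
        for (int i = 0; i < ops; i++) {
            client.set("async::" + i, 0, value).get();          // acked from the managed cache only
        }
        long asyncNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < ops; i++) {
            client.set("durable::" + i, 0, value,
                    PersistTo.MASTER, ReplicateTo.ONE).get();   // disk on active + 1 replica
        }
        long durableNs = System.nanoTime() - t1;

        System.out.printf("avg write latency: default=%.2f ms, durable=%.2f ms%n",
                asyncNs / 1e6 / ops, durableNs / 1e6 / ops);
    }
}
```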

Slide 12

We need performance test results that are close to our use case. Please see the following Netflix tests, whose write-durability and multi-zone assumptions are very close to our use case:

The Netflix write-up states that writes can be durable across regions, but it does not specify whether the latency and throughput they observed came from quorum writes or from single writes (single writes are faster and map to an effectively unacknowledged write). This is similar to our own benchmark numbers: although we have the ability to do durable writes across regions and racks, we do not call that out in the data. Can you confirm that the Netflix benchmark numbers for latency, throughput and CPU utilization were obtained wholly with quorum writes?
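For clarity on the distinction being asked about, a small illustration of single versus quorum writes in Cassandra, using the DataStax Java driver (the contact point, keyspace and table are placeholders, and this is not necessarily the client Netflix used for their benchmark):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumVsSingleWrite {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");

        SimpleStatement write =
                new SimpleStatement("INSERT INTO kv (k, v) VALUES ('key1', 'value1')");

        // "Single" write: acknowledged as soon as one replica accepts it - the fast,
        // weakly durable mode.
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        // Quorum write: acknowledged only after a majority of replicas accept it -
        // the durable mode the question asks Netflix to confirm they measured.
        write.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(write);

        cluster.close();
    }
}
```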