© 2011 Cisco. All rights reserved. Cisco Confidential

[Diagram, repeated for each of the three cases below: APP server (client library) writes to a Couchbase Server node. Inside the node: Memory (Managed Cache), Queue to disk, Disk, Replication Queue, NIC. The NIC sends replicas to other nodes within the cluster.]

Case 1: Response before writing to disk.
Couchbase Server persists data to disk asynchronously, based on rules in the system. Performance is great, but if the node fails, all data still in the queues is lost forever, with no chance of recovery. This is not acceptable for some of the Videoscape apps.

Case 2: Response after writing to disk.
This trades performance for better reliability, on a per-operation basis. Couchbase supports this, but the performance impact needs to be quantified (assumed to be significant until proven otherwise). A node failure can still result in the loss of data in the replication queue, meaning that local writes to disk may not yet have replicas on other nodes.

Case 3: Response after writing to disk and to one or more replicas?

Q1. Is the above mode supported by Couchbase, i.e. respond to a write only after writing to local disk and ensuring that at least one replica has been successfully copied to another node within the cluster?
Q2. If supported, can this be done on a per-operation basis?
Q3. Do you have any performance data on this and the previous case?

Please see notes section below.
Replication vs. Persistence

[Diagram: an App Server writes to Server 1, which replicates to Server 2 and Server 3. Each server contains a Managed Cache, Disk Queue, Disk and Replication Queue.]

Replication allows us to block the write while a node persists that write to the memcached layer of 2 other nodes. This is typically a very quick operation, and it means we can return control to the app server almost immediately instead of waiting on a write to disk.
[Diagram: APP server (client library) writes to a Couchbase Server node. Inside the node: Memory (Managed Cache), Queue to disk, Disk, Replication Queue and XDCR Queue. One NIC sends replicas to other nodes within the cluster; another sends replicas to other DCs. Response: after writing replica to remote DC?]

When are the replicas to other DCs put in the XDCR queue? Based on the user manual, this is done after writing to local disk. This will obviously add major latency.

Q4. Is this mode supported by Couchbase, i.e. respond to the client only after the replica has been successfully copied to a remote datacenter node/cluster?
Q5. Is there another option to accelerate this, as is the case for local replicas? My understanding is that when the "replication queue" is used for local replication, Couchbase puts the data in the replication queue before writing to disk. Is this correct? Why is this different for the "XDCR queue" case, i.e. write to disk first?

Please see notes section below.
Q4. Is this mode supported by Couchbase, i.e. respond to the client only after the replica has been successfully copied to a remote datacenter node/cluster?

As a workaround from the application side, the write operation can be overloaded so that, before it returns control to the app, it does 2 gets for the item that was written: one against the current cluster and one against the remote cluster. This determines when the write has persisted to the remote cluster (data center write). See pseudo-code in notes.

We will support synchronous writes to a remote datacenter from the client side in the release of Couchbase Server – available Q

Lastly, XDCR is only necessary to span AWS Regions. A Couchbase cluster without XDCR configured can span AWS zones without issue.

Q5. Is there another option to accelerate this, as is the case for local replicas? My understanding is that when the "replication queue" is used for local replication, Couchbase puts the data in the replication queue before writing to disk. Is this correct? Why is this different for the "XDCR queue" case, i.e. write to disk first?

XDCR replication to a remote DC is built on a different technology from the in-memory intra-cluster replication. XDCR is done along with writing data to disk so that it is more efficient. The flusher that writes to disk de-duplicates data, that is, it writes only the last mutation for each document it is updating on disk, and XDCR benefits from this. This is particularly helpful for write-heavy / update-heavy workloads: the XDCR queue sends less data over the wire and hence is more efficient.

Lastly, we are also looking at a prioritized disk write queue as a roadmap item for
This feature could be used to accelerate a write's persistence to disk for the purpose of reducing latencies for indexes and XDCR.

Q6. – per command
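The application-side workaround described above (write, then poll the local and remote clusters with gets) could look roughly like the following. This is a minimal sketch and not the pseudo-code from the notes: `write_and_confirm`, its parameters, and the three callables standing in for client operations are all hypothetical names, and no real Couchbase client API is used. It compares values for simplicity; a real client would more likely compare a revision or CAS value.

```python
import time
from typing import Any, Callable, Optional

# Hypothetical stand-ins for client calls: a getter returns the stored
# value for a key, or None if the key is not (yet) present.
Getter = Callable[[str], Optional[Any]]
Setter = Callable[[str, Any], None]


def write_and_confirm(
    key: str,
    value: Any,
    local_set: Setter,
    local_get: Getter,
    remote_get: Getter,
    timeout_s: float = 2.0,
    poll_interval_s: float = 0.01,
) -> bool:
    """Write to the local cluster, then poll both clusters with gets.

    Returns True once both the local and the remote cluster return the
    written value, or False if that does not happen before the timeout.
    """
    local_set(key, value)
    deadline = time.monotonic() + timeout_s
    while True:
        # The "2 gets": one against the current cluster, one against the remote.
        if local_get(key) == value and remote_get(key) == value:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

The caller blocks for at least one inter-datacenter round trip per write, and a False result only means the remote copy was not observed in time, not that the local write failed.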
[Diagram: a data center with Rack #1, Rack #2 ... Rack #n holding Couch node1 to Couch node7. An APP server (client library) writes File-1, which is stored as File-1 Replica-1 and File-1 Replica-2.]

Q6. Can Couchbase support rack-aware replication (as in Cassandra)? If we can't control where the replicas are placed, a rack failure could lose all replicas (i.e. docs become unavailable until the rack recovers).
Q7. How does Couchbase deal with that today? We need at least one of the replicas to be on a different rack.

Note that in actual deployments we can't always assume that Couchbase nodes are guaranteed to be placed in different racks. See the following link for AWS/Hadoop use case for example:

Please see notes section below.
Couchbase Server Sharding Approach
Fail Over Node

[Diagram: a Couchbase Server cluster of five servers (Server 1 to Server 5), each holding a set of ACTIVE docs and a set of REPLICA docs. APP SERVER 1 and APP SERVER 2 each run the Couchbase client library with a cluster map. User-configured replica count = 1.]

1. App servers are accessing docs.
2. Requests to Server 3 fail.
3. The cluster detects that the server has failed.
4. It promotes the replicas of the affected docs to active.
5. It updates the cluster map.
6. Requests for docs now go to the appropriate server.
7. Typically a rebalance would follow.
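The promotion and cluster-map update steps on this slide can be illustrated with a toy model. This is a sketch under stated assumptions, not Couchbase's implementation: `Placement` and `fail_over` are hypothetical names, and the map is keyed by document as in the diagram, whereas Couchbase itself maps vBuckets to servers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Placement:
    """Where one document lives: its active server and its replica servers."""
    active: str
    replicas: List[str]


def fail_over(cluster_map: Dict[str, Placement], failed: str) -> Dict[str, Placement]:
    """Return a new cluster map with `failed` removed.

    Docs that were active on the failed server have their first surviving
    replica promoted to active; replicas on the failed server are dropped.
    """
    new_map: Dict[str, Placement] = {}
    for doc, placement in cluster_map.items():
        surviving = [r for r in placement.replicas if r != failed]
        if placement.active != failed:
            new_map[doc] = Placement(placement.active, surviving)
        elif surviving:
            # Promote a replica to active.
            new_map[doc] = Placement(surviving[0], surviving[1:])
        else:
            raise RuntimeError(f"no surviving copy of {doc}")
    return new_map
```

With a replica count of 1, every doc touched by the failure is left with a single copy afterwards, which is why a rebalance typically follows to recreate the missing replicas.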
[Diagram: a data center with Rack #1 to Rack #4 and an APP server (client library). Active and replica copies of vBuckets are spread across the racks: Vb1 Active, Vb1 Replica 1, Vb1 Replica 2, Vb1 Replica 3; Vb1000 Active, Vb1000 Replica 1, Vb1000 Replica 2; Vb256 Active, Vb256 Replica 2; Vb277 Active; Vb500 Replica 3; Vb900 Replica 1.]

Couchbase supports rack-aware replication through the number of replica copies and by limiting the number of Couchbase nodes in a given rack.
[Diagram: AWS EAST with Zone A, Zone B and Zone C, containing Node 1 to Node 9. Assume inter-AZ latency of ~1.5 ms.]

Q8. If all the nodes of a single Couchbase cluster (nodes 1-9) are distributed among 3 availability zones (AZs) as shown, can Couchbase support AZ-aware replication? That is, can it ensure that the replicas of a doc are distributed across different zones, so that a zone failure does not result in doc unavailability?
[Diagram: AWS EAST with Zone A, Zone B and Zone C, containing Node 1 to Node 9. Assume inter-AZ latency of ~1.5 ms.]

Couchbase currently supports zone affinity through the number of replicas and by limiting the number of Couchbase nodes in a given zone.

The replica factor is applied at the bucket level, and up to 3 replicas can be specified. Each replica is 1 full copy of the data set, distributed across the available nodes in the cluster. With 3 replicas, the cluster contains 4 full copies of the data: 1 active + 3 replicas.

By limiting the number of Couchbase nodes in a given zone to the replica count, losing a zone does not result in data loss, as there is still a full copy of the data in another zone.

In the example above, take the worst-case scenario with 3 replicas enabled: the active copy lives on node1, replica1 on node2, replica2 on node3 and replica3 on node4. If zone 1 goes down, those nodes can be automatically failed over, which promotes replica3 on node4 to active.

Explicit zone-aware replication (affinity) is a roadmap item for 2013.
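The rule above (no more nodes per zone than the replica count) can be checked by brute force on a small cluster. The sketch below is illustrative only: the function names are hypothetical, and the node-to-zone mapping (nodes 1-3 in zone A, 4-6 in zone B, 7-9 in zone C) is an assumption read off the diagram, not something the slide states.

```python
from itertools import combinations
from typing import Dict, Iterable


def survives_zone_loss(copy_nodes: Iterable[str], zone_of: Dict[str, str]) -> bool:
    """True if the copies span at least two zones, so that no single
    zone failure can take out every copy."""
    return len({zone_of[node] for node in copy_nodes}) >= 2


def every_placement_survives(zone_of: Dict[str, str], replica_count: int) -> bool:
    """Check every way of placing 1 active + replica_count replicas on
    distinct nodes, including the worst case where copies cluster in one zone."""
    copies = replica_count + 1
    return all(
        survives_zone_loss(placement, zone_of)
        for placement in combinations(zone_of, copies)
    )


# Assumed layout: three nodes per zone, as the diagram suggests.
ZONE_OF = {f"node{i}": zone for i, zone in zip(range(1, 10), "AAABBBCCC")}
```

With three nodes per zone, `every_placement_survives(ZONE_OF, 3)` holds because four copies cannot fit on three nodes, while a replica count of 1 or 2 leaves placements where all copies share a zone.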
Even though the shared performance tests look great, the test assumptions used (lack of reliable writes) are unfortunately unrealistic for our Videoscape deployments and use-case scenarios. We need performance test results that are close to our use case. Please see the following Netflix tests and their write durability / multi-zone assumptions, which are very close to our use case.

We need performance tests showing scaling from 3 nodes to nodes in a single cluster, reporting throughput and latencies in 5- or 10-node increments. During these tests, the following configurations/assumptions must be used:

- Nodes are physically distributed across 3 availability zones (e.g. AWS EAST zones).
- No data loss (of any acknowledged write) when a single node fails in any zone, or when an entire availability zone fails. It is OK to lose unacknowledged writes, since clients can deal with that.
- To achieve this, we need durable writes enabled, i.e. don't acknowledge the client's write request until the write is physically done to disk on the local node and at least one more replica is written to the disks of other nodes in different availability zones.

Q9. Please advise if you are willing to conduct such tests.
We need performance test results that are close to our use case. Please see the following Netflix tests and their write durability / multi-zone assumptions, which are very close to our use case.

The Netflix test describes that writes can be durable across regions, but does not specify whether the latency and throughput they saw came from Quorum writes or Single writes (Single writes are faster and map to an unacknowledged write). This is similar to our benchmark numbers: although we have the ability to do durable writes across regions and racks, we do not specify it in the data.

Can you confirm that the Netflix benchmark numbers for latency, throughput and CPU utilization were obtained wholly with Quorum writes?