Best Practices for Alfresco Replication, Backup and Disaster Recovery

Richard McKnight, Principal Consultant
Brian Long, Principal Consultant

Agenda
  Intro and a bit of history
  BC/DR and Replication Primer
  Our BC/DR Choices
  About our Implementation
  Demo
  At the end of this presentation you should have enough information to make an informed decision about BC/DR and global replication.

A Bit of History

It’s Been a Long Time Coming!
  First proposed in the summer of 2009 for replication
    Good feedback, but seen as a lot of work.
  Socialized with customers from 2011 through 2013
    The general pushback was rooted in a reluctance to pay for this as a bespoke engagement.
  Socialized internally circa 2012
    There was not enough demand to justify prioritizing this over other work.
  First prototype commissioned in 2013
    We verified that Transfer Services was not designed with this use case in mind.

So What Changed?
  A wave of deals requiring global replication and BC/DR.
  A couple of POC projects willing to try Active/Active replication.
  A couple of projects willing to use code built on the POC projects.
  A framework that allows consulting to build and price reusable IP.
  Projects were willing to think outside the box.
  The consulting team was willing to take a risk.

Why is Global Replication so Hard?
  Global replication ≠ WAN-level HA clustering.
  You cannot guarantee that the system will run without failures all the time.
  During outages, trade-offs must be made between consistency and availability (the trade-off described by the CAP theorem).
  Detecting and recovering from outages adds complexity to global implementations.
  Under normal operations, latency between data centers must be considered.

BC/DR and Replication Primer
This section goes into some of the basics of Business Continuity, Disaster Recovery and Replication.

Disasters MUST Be Planned For
  Outages happen.
  Interruptions in operations can be costly.
  Disaster recovery costs money too!
  Which BC/DR method you choose is a balancing act!
  Notes: causes of outages include
    Power outages due to weather conditions or human error on the grid
    ISP service interruptions
    Component failures
    Errors in operations
    Sabotage
    Fire
    Earthquake
    Scheduled maintenance
    Loss of a data center due to terrorism
    Loss of a data center due to flooding

BC/DR Requirements
  Recovery Point Objective (RPO)
    The time period for which missing data could be tolerated.
  Recovery Time Objective (RTO)
    The length of time that the organization could tolerate a service interruption.
  Examples
    Public-facing transactions may have an RPO of 0 seconds. If it is a financial or legal transaction, then once the customer gets the confirmation they need to be confident that the transaction is safe.
    Critical systems that must be up may have a short RTO.
    It might be OK for a system to be unavailable for a period of time, as long as the users know that it is down and why.
    Internal systems may have a more relaxed RPO as long as the work can be redone.
    In some instances a near-zero RPO may be achieved by using multiple systems: a remote standby plus a redundant cache.
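
As a rough illustration of the two definitions above, the sketch below checks a single incident against an RPO and an RTO. The function name, the timestamps and the 15/30 minute objectives are all made up for the example; nothing here comes from an Alfresco API.

```python
from datetime import datetime, timedelta


def meets_objectives(last_replicated, outage_start, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single incident."""
    data_loss_window = outage_start - last_replicated  # work that never reached the DR site
    downtime = service_restored - outage_start          # how long users were without service
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical incident: last replicated change at 11:55, outage at 12:00, service back at 12:45.
result = meets_objectives(
    last_replicated=datetime(2014, 1, 1, 11, 55),
    outage_start=datetime(2014, 1, 1, 12, 0),
    service_restored=datetime(2014, 1, 1, 12, 45),
    rpo=timedelta(minutes=15),
    rto=timedelta(minutes=30),
)
```

In this made-up incident five minutes of data is missing, which is inside the RPO, but the 45 minute interruption exceeds the RTO.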

Systems Have Become Global
  Companies want to collaborate across the globe.
  Global systems must operate properly even in the face of outages.
  The mechanisms that support global systems can also support BC/DR.
  Outages can be network failures or system failures.

Backing Up Alfresco
  An Alfresco repository consists of content, metadata and a search index.
  The various pieces of the repository must be backed up in a specific order.
  Some BC/DR strategies are simply a remote backup.

BC/DR Approaches
  Warm Standby
    The infrastructure is up and running with periodic backups.
    Alfresco is not running.
    The backup site is a replica of the main site.
  Hot Standby
    The far system is running in read-only mode.
    Replication happens at the application level.
  Active/Active
    Replication happens at the application level.
    All sites are in read/write mode.
  We will use these definitions to distinguish between the methods discussed.
  Hot standby and active/active use the same replication method.

Compare and Contrast
Characteristic      | Warm                                        | Hot
Identical Instances | Yes                                         | No
Replication Method  | Infrastructure components                   | Application level
Area of Complexity  | Synchronizing the snapshots                 | Application code to support replication
RTO Considerations  | DR site must be started                     | DR site already running
RPO Considerations  | Supports only fully replicated transactions | Supports partially replicated transactions

Notes
  Identical Instances
    Warm: the warm standby is a replica with the same NodeRefs.
    Hot: the hot standby is a separate instance with different NodeRefs; a distributed object ID is added.
  Replication Method
    Warm: replication is handled by disk mirroring and tools like GoldenGate for DB mirroring.
    Hot: replication happens as part of the application code via an add-on module.
  Area of Complexity
    Warm: there is no easy way to synchronize the snapshots, and figuring out the earliest incomplete transaction would be somewhat difficult.
    Hot: any application code that deals with distributed objects would not be able to use NodeRef.
  RTO Considerations
    Warm: the warm standby would need to roll back to before the earliest incomplete transaction (completion would mean no missing files).
    Hot: the hot standby would just need to switch to R/W mode. In the case of active/active, some notion of a DC acting on behalf of an offline data center would need to be activated.
  RPO Considerations
    Warm: warm standby only supports full transactions; an upcoming illustration explores the drawbacks of this.
    Hot: replication is done at the application level, so it can be partial. The standby side could potentially be aware of all committed transactions (stub mode).

The Problem with Warm Standby
  The biggest challenge to supporting really tight RPO criteria with warm standby is getting rid of integrity errors.
  Currently the only way to remove all integrity errors from a continuous backup is to roll back the DB transactions until no content integrity errors exist.

Warm Standby at its Worst
# | Transaction        | DB / File status
1 | Add 10 files       | Completely transferred; 10 files transferred
2 | Add 100 more files | 99 files transferred
3 | Update 20 objects  | All updates complete
4 | Delete 20 files    | N/A
5 | Add 10 more files  | Partially transferred; 7 files transferred

  In the current warm standby we would only be able to include transaction #1.
  Transaction #5 would always be thrown out because the database would not commit a partial transaction.
  Because transaction #2 would be incomplete (due to a missing file), it and all subsequent transactions would need to be discarded.

Warm Standby #FAIL
  Using current methods, updates to 140 objects are sacrificed because of one missing file.
  In the absence of any coordination between the database and file system replication, this scenario is not far-fetched.
  The inability to synchronize the replication of content and metadata makes warm standby unattractive.
  What if we only had to throw out the bad object(s)?
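
The 140 figure can be reproduced from the transaction table on the previous slide. The sketch below is only arithmetic over that table: the tuple layout and function name are invented for the example, and it assumes that transactions 1 through 4 are committed in the replicated database while transaction 5 is not.

```python
from typing import NamedTuple


class Txn(NamedTuple):
    number: int
    description: str
    objects: int
    committed: bool        # present in the replicated database (assumed for #1-#4)
    files_complete: bool   # every file the transaction needs has arrived


transactions = [
    Txn(1, "Add 10 files", 10, True, True),
    Txn(2, "Add 100 more files", 100, True, False),  # 99 of 100 files arrived
    Txn(3, "Update 20 objects", 20, True, True),
    Txn(4, "Delete 20 files", 20, True, True),
    Txn(5, "Add 10 more files", 10, False, False),   # never committed on the replica
]


def objects_lost_to_rollback(txns):
    """Count committed objects discarded by rolling back to before the first incomplete transaction."""
    committed = [t for t in txns if t.committed]
    for index, txn in enumerate(committed):
        if not txn.files_complete:
            return sum(t.objects for t in committed[index:])
    return 0


lost = objects_lost_to_rollback(transactions)  # 100 + 20 + 20
kept_if_only_bad_object_dropped = lost - 1     # one file is actually missing
```

Rolling back discards transactions 2, 3 and 4, even though only one object in transaction 2 is actually broken.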

Warm Standby Alternative
  The following has been discussed as a way to handle a less-than-perfect backup (which is exactly what could happen after an incident):
    Accepting objects with “hanging content URLs”.
    Adding a “Lost Content” aspect to identify the object as being “broken” and to record the original content URL.
    Updating the content URL to point to a “missing content” file.
  This sort of recovery process would allow the 139 objects that were updated to be included in the recovered repository.
  This would support very aggressive RTO and RPO criteria, but it has not been developed.
  Additional points
    The SOLR index backup would have to lag the other streams.
    An alternative would be to allow SOLR to index only fully processed transactions.
    This could be accomplished if we had a special mode in which to run the repository.
    The broken object might be easier to fix because all of the metadata would be available.
    This could be done by running a DR site in perpetual recovery mode.
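
Since this recovery process has not been developed, the following is only a sketch of the three steps listed above, using plain dictionaries in place of repository nodes. The function name, the field names and the placeholder URL are hypothetical; no Alfresco API is used.

```python
MISSING_CONTENT_URL = "store://placeholder/missing-content.bin"  # hypothetical placeholder file


def mark_lost_content(nodes, content_store):
    """Flag nodes whose content URL has no file behind it; return how many were flagged."""
    flagged = 0
    for node in nodes:
        if "Lost Content" in node["aspects"]:
            continue  # already handled by an earlier pass
        if node["content_url"] not in content_store:
            node["aspects"].add("Lost Content")                 # identify the object as broken
            node["original_content_url"] = node["content_url"]  # record the original content URL
            node["content_url"] = MISSING_CONTENT_URL           # point at the "missing content" file
            flagged += 1
    return flagged


nodes = [
    {"id": "a", "content_url": "store://2014/1/1/a.bin", "aspects": set()},
    {"id": "b", "content_url": "store://2014/1/1/b.bin", "aspects": set()},
]
content_store = {"store://2014/1/1/a.bin"}  # b.bin never arrived at the DR site
flagged = mark_lost_content(nodes, content_store)
```

Because the original URL is kept on the node, the broken object can be repaired later if the file turns up.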

Our BC/DR Choices
  All choices are trade-offs

And the Nominees Are
  Warm Standby
    Usually the first choice, but the lack of synchronization between the database and file replication makes it unattractive.
  Hot Standby and Active/Active
    Attractive because of the ability to support BC/DR and global repositories.
    Early clients wanted this, but as a supported part of the product.
  Modified Warm Standby
    This could potentially provide the best RTO/RPO for BC/DR only.
    Risky because it could potentially require changes to core Alfresco code.
  Notes
    No one wanted to foot the bill for this.
    We finally had some customers willing to pay for a portion of it.

And the Winner Is
  Hot Standby – Active/Active
    Support for both global repositories and BC/DR.
    Progressive replication to allow for soft failures.
    Met requirements for multiple customers and prospects with business-critical use cases.
  Notes
    The soft failures allow for manual recovery, which is important in a BC/DR incident.
    There was a critical mass of customers wanting this capability and willing to foot part of the bill for it.

The Runner Up & Supporting Cast
  Modified Warm Standby
    Could work in concert with Active/Active.
    Warm standby could provide the preferred BC/DR.
    Active/Active could provide BC/DR if the warm standby is unavailable.
    Active/Active paired with Modified Warm Standby is a killer combination.
  Notes
    The advantage of having the warm standby is that running an active site as a backup is a bit more complicated.
    The modified warm standby may support a much tighter RPO than active/active.
    The content transfer could be shared between the modified warm standby and active/active replication.

About our Implementation
This section has slides built from the Infosys engagement report.

Replication Overview

Queues
  Node Stub Queue – Nodes on this queue are pending replication of their stubs.
  Node Metadata Queue – Nodes on this queue are pending replication of their metadata.
  Node Content Queue – Nodes on this queue are pending replication of their content.
  Node Parent Association Queue – Nodes on this queue are pending replication of their parent associations.
  Node Delete Queue – Nodes on this queue are pending deletion.

Source Repository Aspects
  Replicate Stub – This node is pending replication of its stub.
  Replicate Parent Assoc – This node is pending replication of its parent associations.
  Replicate Child Assoc – This node is pending replication of its child associations.
  Replicate Metadata – This node is pending replication of its metadata.
  Replicate Content – This node is pending replication of its content.
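
One way to picture how the queues and the source aspects could fit together is sketched below. The pairing of each queue with a pending aspect, the class name and the stage names are assumptions made for illustration; the slides do not show the actual module code, and the delete queue and child associations are left out.

```python
from collections import deque

# Stage names follow the queue names on the slides; pairing a queue with an aspect is assumed.
STAGES = ("stub", "metadata", "content", "parent_assoc")


class SourceTracker:
    """Hypothetical in-memory model of pending replication work on the source."""

    def __init__(self):
        self.queues = {stage: deque() for stage in STAGES}
        self.aspects = {}  # node id -> set of stages still pending ("Replicate ..." aspects)

    def node_created(self, node_id):
        # A new node needs every stage replicated, so it is marked and queued for each one.
        self.aspects[node_id] = set(STAGES)
        for stage in STAGES:
            self.queues[stage].append(node_id)

    def replicate_next(self, stage):
        """Pretend to send the next node on a queue, then clear its pending aspect."""
        if not self.queues[stage]:
            return None
        node_id = self.queues[stage].popleft()
        self.aspects[node_id].discard(stage)
        return node_id


tracker = SourceTracker()
tracker.node_created("node-1")
tracker.replicate_next("stub")
tracker.replicate_next("metadata")
pending = tracker.aspects["node-1"]  # content and parent associations are still outstanding
```

Keeping each stage on its own queue is what lets replication proceed progressively: a stub can reach the target long before a large content stream does.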

Target Repository Aspects
  Replicated Stub – This partially replicated node originated elsewhere.
  Replicated Metadata – This partially replicated node originated elsewhere and has valid metadata.
  Replicated Content – This partially replicated node originated elsewhere and has valid content.
  Once the node (which originated elsewhere) is completely replicated, the aspects listed above are removed.
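
The life cycle of the target aspects can be sketched as a small state holder. This is an illustration only: the class is hypothetical, and it assumes that "completely replicated" means both the metadata and the content have arrived.

```python
from enum import Enum


class TargetAspect(Enum):
    REPLICATED_STUB = "Replicated Stub"
    REPLICATED_METADATA = "Replicated Metadata"
    REPLICATED_CONTENT = "Replicated Content"


class TargetNode:
    """Hypothetical in-memory stand-in for a node on the target repository."""

    def __init__(self, node_id):
        self.node_id = node_id
        # A stub has arrived, so the node is known here but is not yet complete.
        self.aspects = {TargetAspect.REPLICATED_STUB}

    def mark(self, aspect):
        self.aspects.add(aspect)
        # Assumption: the node is complete once metadata and content are both valid.
        complete = {TargetAspect.REPLICATED_METADATA, TargetAspect.REPLICATED_CONTENT}
        if complete <= self.aspects:
            self.aspects.clear()  # completely replicated: the marker aspects are removed

    @property
    def fully_replicated(self):
        return not self.aspects


node = TargetNode("node-1")
state_after_stub = set(node.aspects)
node.mark(TargetAspect.REPLICATED_METADATA)
complete_after_metadata = node.fully_replicated
node.mark(TargetAspect.REPLICATED_CONTENT)
```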

Transport Layer
  The transport layer sends requests from the source to the target and waits for responses.
  Each request is for a specific action.
  The request header contains the action and a count of the sections of data associated with the action.
  Each section has a length and a block of bytes.
  The structure of the requests and responses is shown in the slides that follow.

Request
  Request Header / Multi Request Header
    UUID
    Action
    Count – number of sections (for Multi Request Header)
  Sections
    Size (Long, 8 bytes)
    ByteBuffer – Certain MIME types are compressed (configurable)
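
To make the request layout concrete, here is a minimal encode/decode sketch. Only the 8-byte section size comes from the slide; the 16-byte UUID, the 1-byte action code, the 4-byte count and big-endian byte order are assumptions, and compression is omitted.

```python
import struct
import uuid

HEADER_TAIL = ">BI"  # assumed: 1-byte action code followed by a 4-byte section count


def encode_request(request_id, action_code, sections):
    """Serialize a request as header + length-prefixed sections."""
    header = request_id.bytes + struct.pack(HEADER_TAIL, action_code, len(sections))
    # Each section is an 8-byte signed length (a Java long) followed by its bytes.
    body = b"".join(struct.pack(">q", len(section)) + section for section in sections)
    return header + body


def decode_request(data):
    """Inverse of encode_request; returns (uuid, action_code, sections)."""
    request_id = uuid.UUID(bytes=data[:16])
    action_code, count = struct.unpack_from(HEADER_TAIL, data, 16)
    offset = 16 + struct.calcsize(HEADER_TAIL)
    sections = []
    for _ in range(count):
        (size,) = struct.unpack_from(">q", data, offset)
        offset += 8
        sections.append(data[offset:offset + size])
        offset += size
    return request_id, action_code, sections


request_id = uuid.uuid4()
wire = encode_request(request_id, 6, [b"stream-1", b"stream-two"])  # 6 is an arbitrary code
decoded = decode_request(wire)
```

Length-prefixing each section lets the receiver read one content stream at a time without scanning for delimiters.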

Response
  Response Header
    UUID
    Status
    Message
    Throwable

Actions
  Node Metadata Invalid – The source node’s metadata has been updated.
  Node Content Invalid – The source node’s content has been updated.
  Node Stub – A new node has been added at the source.
  Node Delete – A node at the source has been deleted.
  Node Metadata – The section contains metadata for the replica.
  Node Content – The section(s) contain content for the replica.
  Assoc Parent Child – The section has a parent-child association.
  Node Pull – This is a request to pull a node from the source.
  Notes
    Node Pull is the only request made from the target to the source.
    Node Content is the only action that has multiple sections. This is because a node can have multiple content streams, and each content stream is contained in its own section.
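
The two rules in the notes (only Node Pull travels from target to source, and only Node Content carries multiple sections) can be expressed as a small check. The enum members mirror the action names on the slide; the validate function and the "source"/"target" strings are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NODE_METADATA_INVALID = "Node Metadata Invalid"
    NODE_CONTENT_INVALID = "Node Content Invalid"
    NODE_STUB = "Node Stub"
    NODE_DELETE = "Node Delete"
    NODE_METADATA = "Node Metadata"
    NODE_CONTENT = "Node Content"
    ASSOC_PARENT_CHILD = "Assoc Parent Child"
    NODE_PULL = "Node Pull"

    @property
    def sent_by_target(self):
        # Node Pull is the only request made from the target to the source.
        return self is Action.NODE_PULL

    @property
    def allows_multiple_sections(self):
        # Only Node Content has one section per content stream.
        return self is Action.NODE_CONTENT


def validate(action, section_count, sender):
    """Return True if a request is consistent with the two rules above."""
    expected_sender = "target" if action.sent_by_target else "source"
    if sender != expected_sender:
        return False
    if section_count > 1 and not action.allows_multiple_sections:
        return False
    return True


ok_content = validate(Action.NODE_CONTENT, 3, "source")
bad_metadata = validate(Action.NODE_METADATA, 2, "source")
```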

Demo

