Lessons Learned from Disaster Recovery
Jinny Chien
April 20, th APGridPMA in Taipei
Motivation
  ASGCCA encountered an accident in February. How can we avoid the same situation happening again?
  Do we have a sufficient backup procedure?
  Is our CA server in a safe place?
  Do we have a standard policy or incident response procedure to guide the recovery?
ASGCCA
  Event time: 9:00 Feb UTC
  Event description: An unexpected accident in part of the data center at ASGC shut down all online services, including the ASGCCA web server.
  Result: ASGCCA certificate activities were down, and the CRL was not published in time.
Process
  9:00 Feb 25 UTC: Sent an EGEE broadcast to all ROC managers, VO managers, WLCG users, APGridPMA members, and ASGCCA users.
  2:00 Feb 26 UTC: Sent an announcement to the IGTF-RAT and IGTF-general lists. Tried to recover the ASGCCA web page.
  12:00 Feb 26 UTC: Moved the ASGCCA web and CA server (offline) to a safe place and connected to the Internet.
  16:00 Feb 26 UTC: The ASGCCA web site was up. Sent an announcement to the IGTF-RAT, IGTF-general, and ASGCCA user lists.
Review the process
  Feb UTC: The ASGCCA web site went down. Sent an announcement to APGridPMA, ASGCCA users, IGTF-RAT, and IGTF-general.
  Feb 26 UTC: Recovery was complete and the ASGCCA web site was up. Sent the final announcement and checked that all CA activities were working.
  The whole process took two days.
Basic Recovery procedure
  Evaluate the scope of the disaster and how many days recovery will take.
  Send a notification to IGTF-RAT, IGTF-general, APGridPMA members, and your end entities. The message should describe the disaster and the recovery schedule.
  Carry out the recovery activities.
  Check that all CA activities work and that the CRL is published regularly.
  Resume operation and send the final announcement to IGTF-RAT, APGridPMA members, IGTF, and your end entities.
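The "check that the CRL is published regularly" step can be partly automated. The following is a minimal sketch, not ASGCCA's actual tooling: it assumes the CRL's nextUpdate time has already been read by some other means, and the function name crl_status, the three-day warning margin, and the dates in the usage lines are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def crl_status(next_update, now, warn_margin=timedelta(days=3)):
    """Classify a CRL by comparing its nextUpdate time with the current time.

    Returns "expired" if nextUpdate has passed, "expiring" if it falls
    within warn_margin of now, and "ok" otherwise.
    Both datetimes must be timezone-aware (UTC is assumed here).
    """
    if now >= next_update:
        return "expired"
    if next_update - now <= warn_margin:
        return "expiring"
    return "ok"


# Hypothetical dates, chosen only to illustrate the three outcomes.
now = datetime(2030, 1, 10, 9, 0, tzinfo=timezone.utc)
print(crl_status(now + timedelta(days=30), now))  # ok
print(crl_status(now + timedelta(days=2), now))   # expiring
print(crl_status(now - timedelta(hours=1), now))  # expired
```

Running such a check from a machine outside the data center would let an outage like the one described here be noticed before relying parties see an expired CRL.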
IGTF-RAT
  The International Grid Trust Federation (IGTF) Risk Assessment Team (RAT) is responsible for assessing risk and for setting timelines and deadlines for response and action on concerns and vulnerabilities.
  International Grid Trust Federation address:
  Members:
    APGridPMA: Yoshio Tanaka, Jinny Chien
    EUGridPMA: Jens Jensen, Willy Weisz, David Groep, Sajjad Asghar
    TAGPMA: Jim Basney, Vinod Rebello, Jim Marsteller
  Public webpage
Conclusion
  Back up the CA server and web server regularly.
  Keep the backup archive in a safe place.
  Write down the recovery procedure for your CA activities.
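As an illustration of the first two points, a regular backup job could pack the web content into an archive and record a checksum, so that the copy kept in the safe place can be verified before it is used for recovery. This is a sketch under stated assumptions: the function names and paths are hypothetical, and handling of the CA private key (which needs its own protection) is deliberately left out.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_backup(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar archive.

    Returns the archive's SHA-256 digest, to be stored separately
    from the archive itself.
    """
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    return sha256_of(archive_path)


def verify_backup(archive_path, expected_digest):
    """Check that an archive still matches the digest recorded at backup time."""
    return sha256_of(archive_path) == expected_digest
```

A typical use would be to call make_backup("/path/to/ca-web", "/path/to/offsite/ca-web.tar.gz") from a scheduled job (both paths are placeholders) and to run verify_backup on the off-site copy as the first step of the written recovery procedure.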
Discussion
  What if the CA server and the web server are destroyed at the same time?
    Evaluate the disaster and plan a schedule.
    Ask IGTF-RAT for help.
  Should we have an incident response procedure?
  What recovery time frame is acceptable if a CA encounters an accident?
Thank you for listening