
Business Continuity Efforts at Fermilab
Keith Chadwick, Fermilab
25-Apr-2012
Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359.

Outline
–Introduction,
–Site Power,
–Computing Facilities,
–Networking,
–Fermilab-Argonne DR Agreement,
–DNS & Authentication,
–Data Distribution,
–FermiGrid and FermiCloud,
–Outsourced Services.

Introduction
ITIL defines several Service Management processes that are formally involved in "Business Continuity":
–Availability Management,
–Continuity Management,
–Capacity Management.
The work that I will be discussing in this talk largely predates the adoption of ITIL practices by the Fermilab Computing Sector.
Many people at Fermilab have been involved in the development of these capabilities and plans. Please credit them with all the hard work and accomplishments, and blame me for any errors or misunderstandings.

Site Power
Two substations on the Commonwealth Edison 345 kV electrical grid "backbone":
–Master substation,
–Main Injector substation.
Auxiliary 13.6 kV service to Fermilab:
–Kautz Road substation.

Fermilab Computing Facilities
Feynman Computing Center (FCC):
–FCC-2 Computer Room,
–FCC-3 Computer Room(s),
–All rooms have refrigerant-based cooling, UPS, and generators.
Grid Computing Center (GCC):
–GCC-A, GCC-B, GCC-C Computer Rooms,
–GCC-TRR Tape Robot Room,
–GCC-Network-A, GCC-Network-B,
–All rooms have refrigerant-based cooling, UPS, and "quick connect" taps for generators.
Lattice Computing Center (LCC):
–LCC-107 & LCC-108 Computer Rooms,
–Refrigerant-based cooling,
–No UPS or generator.

GCC Load Shed Plans
As mentioned previously, Fermilab has encountered a cooling issue with the GCC-B and GCC-C computer rooms. Work is underway to remove the berm that obscures the heat exchangers.
In the event that the berm removal does not completely address the issue(s), we have a three-step "load shed" plan agreed with our stakeholders; whole racks have been identified that will participate:
–Phase 1: 30% load shed racks,
–Phase 2: a larger set of load shed racks (a strict superset of the 30% load shed racks),
–Phase 3: 100% load shed.
If the "load shed" trigger conditions are met:
–During business hours, the expectation is that the agreed racks will be shut down under program control within 30 minutes (see the sketch below),
–During non-business hours, the shutdown interval may be longer,
–If the racks are not shut down in the agreed-upon time window, facilities personnel may elect to throw the corresponding breakers in the electrical distribution system.
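The actual shutdown tooling is not described in the slides; the following is a minimal sketch, assuming a per-phase text file of rack node hostnames and passwordless (Kerberos or ssh-key) root SSH access, of how an orderly program-controlled load shed could be driven. All file names and hosts are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical load-shed driver sketch (not the actual Fermilab tooling)."""
import subprocess
import sys

# Hypothetical phase definitions; the real rack lists are site-specific.
PHASE_FILES = {
    "phase1": "racks_30pct.txt",       # 30% load shed racks
    "phase2": "racks_superset.txt",    # strict superset of the 30% racks
    "phase3": "racks_all.txt",         # 100% load shed
}

def nodes_in_phase(phase: str) -> list[str]:
    """Read the hostnames assigned to a load-shed phase."""
    with open(PHASE_FILES[phase]) as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]

def shed(phase: str, dry_run: bool = True) -> None:
    """Issue an orderly shutdown to every node in the given phase."""
    for host in nodes_in_phase(phase):
        cmd = ["ssh", f"root@{host}", "shutdown", "-h", "+5",
               "load shed in progress"]
        if dry_run:
            print("DRY RUN:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=False, timeout=60)

if __name__ == "__main__":
    shed(sys.argv[1] if len(sys.argv) > 1 else "phase1", dry_run=True)
```

Running it with dry_run=True first lets the agreed rack list be reviewed before anything is actually powered off.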

FCC and GCC
The FCC and GCC buildings are separated by approximately 1 mile (1.6 km).
FCC has UPS and generator; GCC has UPS.

Chicago Area MAN
Fermilab has multiple paths to Starlight via the Chicago Area Metropolitan Area Network (MAN).
The Chicago Area MAN lands at two locations at Fermilab:
–FCC-2 Network Area,
–GCC-A Network Room.
There is a multiple-path buried fiber cross connect between the two MAN locations at Fermilab (see later slide).
(Slide diagram: MAN connectivity among Fermilab, Argonne, and Starlight.)

Fermilab – Argonne
The two laboratories are ~20 miles (33 km) apart, ~24 miles (40 km) by road.

Fermilab – Argonne DR Agreement
Reciprocal agreement to host financial DR services between Fermilab and Argonne; the agreement has been in place for multiple years.
"Lukewarm" backup – requires reinstall from backup tapes shipped offsite every week.
Fermilab/Argonne personnel visit the corresponding DR site a "handful" of times a year; most work is performed via remote administration.
The DR plan is activated ~once a year to verify that it still works.

Buried Fiber Loop Path Diversity (as of Nov-2011)
(Slide shows a site map of the diverse buried fiber paths between the two MAN locations.)

Distributed Network Core Provides Redundant Connectivity
(Slide diagram: redundant Nexus 7010 core switches at GCC-A, FCC-2, FCC-3, and GCC-B, interconnected by a 20 Gigabit/s L3 routed network, an 80 Gigabit/s L2 switched network, and 40 Gigabit/s L2 switched networks, with private networks carried over dedicated fiber. The core serves the robotic tape libraries, disk servers, FermiGrid, FermiCloud, and the grid worker nodes. Intermediate-level and top-of-rack switches are not shown. Deployment completed in January 2012.)

DNS, Kerberos, Windows and KCA
DNS at Fermilab is implemented using Infoblox DNSSEC appliances:
–A master system feeds the primary and secondary name servers,
–The primary and secondary name servers feed a network of distributed "slaves" located at key points around the Fermilab network,
–End systems can use the primary/secondary name servers, the distributed slave in their area, or the "anycast" address,
–Offsite name servers hosted by DOE ESnet are maintained via periodic zone transfers.
MIT Kerberos is implemented with a single master KDC and multiple slave KDCs:
–In the event of a failure of the master KDC, the slave KDCs can take over most functions,
–If the master KDC failure is "terminal", it is possible to "promote" a slave KDC to serve as the master KDC.
The Windows Active Directory domain controllers are likewise distributed around the Fermilab site.
The Fermilab Kerberos Certificate Authority (KCA) is implemented on two (load-sharing) systems that are in separate buildings.
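The slides do not describe how the redundant name service is monitored; as an illustration only, here is a small sketch using the dnspython package that checks whether each configured resolver (primary, secondary, and the site anycast address) still answers. The addresses below are placeholders, not Fermilab's actual servers.

```python
"""Hedged sketch: probe each configured name server to confirm it answers."""
import dns.resolver

NAME_SERVERS = {
    "primary": "192.0.2.10",     # placeholder (TEST-NET) address
    "secondary": "192.0.2.11",   # placeholder
    "anycast": "192.0.2.53",     # placeholder for the site anycast address
}

def answers(server_ip: str, name: str = "fnal.gov") -> bool:
    """Return True if the given server resolves the test name."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server_ip]
    resolver.lifetime = 5.0
    try:
        resolver.resolve(name, "A")
        return True
    except Exception:
        return False

if __name__ == "__main__":
    for label, ip in NAME_SERVERS.items():
        print(f"{label:10s} {ip:15s} {'OK' if answers(ip) else 'FAILED'}")
```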

Minimal Internet Presence
In the event of a triggering security incident, Fermilab has a predefined intermediate protection level – "Minimal Internet Presence":
–Only permit essential services that keep the lab "open for business" on the Internet, plus the necessary infrastructure to support them,
–Full disconnection is always an option.
This plan stops short of "cutting the connection" to the Internet for Fermilab:
–One web server with static content,
–Email.
This plan can be implemented at the discretion of the Fermilab computer security team with the concurrence of the Fermilab Directorate.

Data Distribution
Raw Data Location:       FCC Tape Robot(s)
Analyzed Data Location:  GCC Tape Robot(s), FCC Tape Robot(s)

Virtualization & Cloud Computing
Virtualization is a significant component of our core computing (business) and scientific computing continuity planning.
Virtualization allows us to consolidate multiple functions onto a single physical server, saving power & cooling.
Physical to virtual (P2V), virtual to virtual (V2V), and virtual to physical (V2P) migrations are supported.
It is significantly easier, faster, and safer to "lift" virtual machines rather than physical machines.
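As a hedged illustration of "lifting" a virtual machine between hypervisors, the sketch below uses the libvirt Python bindings to live-migrate a guest. The connection URIs and domain name are placeholders, and a real Xen deployment would use Xen-specific URIs and shared-storage arrangements rather than the qemu+ssh example shown here.

```python
"""Hedged sketch: live-migrate a guest between two hypervisors with libvirt."""
import libvirt

SRC_URI = "qemu+ssh://src-hypervisor.example.org/system"   # placeholder
DST_URI = "qemu+ssh://dst-hypervisor.example.org/system"   # placeholder
DOMAIN = "gums-service-vm"                                  # placeholder VM name

def live_migrate(domain_name: str) -> None:
    src = libvirt.open(SRC_URI)
    dst = libvirt.open(DST_URI)
    try:
        dom = src.lookupByName(domain_name)
        # VIR_MIGRATE_LIVE keeps the guest running during the move;
        # PEER2PEER lets the hypervisors negotiate the transfer directly.
        flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER
        dom.migrate(dst, flags, None, None, 0)
    finally:
        src.close()
        dst.close()

if __name__ == "__main__":
    live_migrate(DOMAIN)
```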

FermiGrid-HA / FermiGrid-HA2
(Slide diagram: clients reach the services through active/standby LVS directors (with heartbeat between them), which direct traffic to redundant active VOMS, GUMS, and SAZ instances backed by two active MySQL servers kept in sync by replication.)
FermiGrid-HA uses three key technologies:
–Linux Virtual Server (LVS),
–Xen Hypervisor,
–MySQL Circular Replication.
FermiGrid-HA2 added:
–Redundant services in both FCC-2 and GCC-B,
–Non-redundant services are split across both locations, and go to reduced capacity in the event of a building or network outage.
The deployment has been tested under real-world conditions.
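The FermiGrid-HA monitoring itself is not shown in the slides; the following sketch illustrates one way to verify that MySQL circular replication is flowing in both directions, by checking the replication threads on each node. Host names and credentials are placeholders.

```python
"""Hedged sketch: check MySQL circular replication health on both DB nodes."""
import pymysql

DB_NODES = ["db1.example.org", "db2.example.org"]   # placeholder hosts

def replication_ok(host: str) -> bool:
    """Return True if the node's replication IO and SQL threads are running."""
    conn = pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            if row is None:
                return False
            return (row.get("Slave_IO_Running") == "Yes"
                    and row.get("Slave_SQL_Running") == "Yes")
    finally:
        conn.close()

if __name__ == "__main__":
    for node in DB_NODES:
        state = "replication OK" if replication_ok(node) else "replication BROKEN"
        print(node, state)
```

In a circular setup each server is both a master and a slave, so both nodes must report healthy replication threads for the ring to be intact.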

FermiCloud
Services are split across FCC-3 and GCC-B.
Working towards a distributed & replicated SAN:
–FY2012 SAN hardware has just been delivered and installed in the GCC-B rack,
–We are evaluating GFS.

Outsourced Services
Fermilab uses several outsourced services, including:
–Kronos – timecard reporting,
–Service-Now – ITIL support.
In each case where Fermilab has contracted for outsourced services, the business continuity plans of the service operators have been carefully evaluated.

Summary
–Utilities (power, cooling, etc.),
–Services distributed over multiple geographic locations,
–Distributed and redundant network core,
–Load sharing services – Active-Active,
–Standby services – Active-Standby,
–Published plan for load shed (if required),
–Virtualization & Cloud Services,
–Careful selection of outsourced services.

Thank You! Any Questions?