Virtualization within FermiGrid
Keith Chadwick
Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359

FermiGrid – The People
- Keith Chadwick
- Neha Sharma
- Steve Timm
- Dan Yocum

Previous Work
Previous talks on “FermiGrid High Availability”:
- HEPiX 2007 in St. Louis:
- OSG All Hands 2008 at RENCI:
Fermilab detailed documentation:

FermiGrid-HA – Highly Available Grid Services
The majority of the services listed in the FermiGrid service catalog are deployed in a high availability (HA) configuration that is collectively known as “FermiGrid-HA”.
FermiGrid-HA utilizes three key technologies:
- Linux Virtual Server (LVS).
- Xen Hypervisor.
- MySQL Circular Replication.

HA Services Deployment
FermiGrid employs several strategies to deploy HA services:
- Trivial monitoring or information services (such as Ganglia and Zabbix) are deployed on two independent virtual machines.
- Services that natively support HA operation (Condor Information Gatherer, FermiGrid internal ReSS deployment) are deployed in the standard service HA configuration on two independent virtual machines.
- Services that maintain intermediate routing information (Linux Virtual Server) are deployed in an active/passive configuration on two independent virtual machines. A periodic heartbeat process is used to perform any necessary service failover (a sketch of such a heartbeat follows this list).
- Services that do not maintain intermediate context (i.e. pure request/response services such as GUMS and SAZ) are deployed using a Linux Virtual Server (LVS) front end to active/active servers on two independent virtual machines.
- Services that support active-active database functions (circularly replicating MySQL servers) are deployed on two independent virtual machines.
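A minimal sketch of the active/passive heartbeat pattern described above. This is illustrative only, not the production FermiGrid heartbeat: the hostname, check port, polling interval and the promote_standby() action are hypothetical placeholders.

```python
#!/usr/bin/env python3
# Illustrative active/passive heartbeat sketch (not FermiGrid production code).
# The standby node periodically checks that the active LVS director answers
# on a service port; after several consecutive failures it triggers a
# (hypothetical) promotion step.
import socket
import time

ACTIVE_HOST = "lvs-active.example.gov"   # hypothetical hostname
CHECK_PORT = 80                          # hypothetical health-check port
INTERVAL_SECONDS = 5
FAILURES_BEFORE_FAILOVER = 3

def active_is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to the active node succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    """Placeholder for the real failover action (e.g. bringing up the
    virtual IP and loading the LVS routing table on the standby node)."""
    print("FAILOVER: promoting standby LVS director to active")

def main():
    failures = 0
    while True:
        if active_is_alive(ACTIVE_HOST, CHECK_PORT):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                promote_standby()
                break
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```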

HA Services Communication
[Diagram: a client connects through the active LVS director (with an LVS standby kept in step via heartbeat) to active/active pairs of VOMS, GUMS and SAZ servers; the two active MySQL servers replicate to one another.]

FermiGrid – Organization of Physical Hardware and Virtual Services
[Diagram: mapping of the FermiGrid virtual service machines onto the underlying physical hosts.]

Virtualized Non-HA Services
The following services are virtualized, but not currently implemented as HA services:
- Globus gatekeeper services (such as the CDF and D0 experiment globus gatekeeper services) are deployed in segmented pools.
  – Loss of any single pool will reduce the available resources by approximately 50%.
- MyProxy.
- OSG Gratia Accounting service [Gratia].
  – Not currently implemented as an HA service.
  – If the service fails, it will not be available until appropriate manual intervention is performed to restart it.
- OSG Resource Selection Service [ReSS].
  – Not currently implemented as an HA service.
  – If the service fails, it will not be available until appropriate manual intervention is performed to restart it.
We are working to address these services as part of the FermiGrid FY2009 activities.

Measured Service Availability
FermiGrid actively measures the service availability of the services in the FermiGrid service catalog; the availability URLs are updated on an hourly basis (a sketch of the hourly roll-up follows this slide).
The goal for FermiGrid-HA is > 99.999% service availability.
- Not including building or network failures.
- These will be addressed by FermiGrid-RS (redundant services) in FY2010/11.
For the period 01-Dec-2007 through 30-Jun-2008, we achieved a service availability of %.
For the period 01-Jul-2008 through the present, we have achieved a service availability of % (and climbing…).
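To illustrate how hourly probes roll up into the quoted availability percentages, here is a minimal sketch; the probe record layout, service name and sample counts are hypothetical and are not taken from the FermiGrid monitoring code.

```python
# Illustrative availability roll-up (not the FermiGrid monitoring code).
# Each service is probed once per hour; availability over a period is the
# fraction of probes that found the service up.
from dataclasses import dataclass

@dataclass
class ProbeRecord:
    service: str
    up: bool          # result of one hourly probe

def availability(records, service):
    """Percentage of hourly probes in which `service` was up."""
    samples = [r.up for r in records if r.service == service]
    if not samples:
        return None
    return 100.0 * sum(samples) / len(samples)

# Example: ~7 months of hourly probes (5088 samples) with 3 failed probes
records = [ProbeRecord("VOMS", True)] * 5085 + [ProbeRecord("VOMS", False)] * 3
print(f"VOMS availability: {availability(records, 'VOMS'):.4f}%")  # ~99.9410%
```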

FermiGrid Service Level Agreement
Authentication and Authorization Services:
- The service availability goal for the critical Grid authorization and authentication services provided by the FermiGrid Services Group shall be 99.9% (measured on a weekly basis) for the periods that any supported experiment is actively involved in data collection, and 99% overall.
Incident Response:
- FermiGrid has deployed an extensive automated service monitoring and verification infrastructure that is capable of automatically restarting failed (or about-to-fail) services, as well as notifying a limited pager rotation.
- The person who receives an incident notification is expected to respond within 15 minutes if the notification occurs during standard business hours (Monday through Friday, 8:00 through 17:00), and within 1 (one) hour at all other times, provided that this response interval does not create a hazard (a sketch of this response-window rule follows below).
FermiGrid SLA Document:
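To make the response-window rule concrete, here is a minimal sketch of the business-hours check. The hours and windows come from the SLA text above; everything else (function names, example timestamp) is illustrative.

```python
from datetime import datetime, timedelta

# Illustrative sketch of the SLA response-window rule (not FermiGrid code):
# 15 minutes during standard business hours (Mon-Fri, 8:00-17:00),
# 1 hour at all other times.

def is_business_hours(when: datetime) -> bool:
    # weekday() is 0-4 for Monday-Friday; hours 8:00 up to 17:00
    return when.weekday() < 5 and 8 <= when.hour < 17

def response_deadline(notified_at: datetime) -> datetime:
    window = timedelta(minutes=15) if is_business_hours(notified_at) else timedelta(hours=1)
    return notified_at + window

# Example: a page that goes out at 02:30 on a Saturday gets the 1-hour window
print(response_deadline(datetime(2009, 3, 7, 2, 30)))  # 2009-03-07 03:30:00
```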

Why five 9’s?
A service availability of 99.999% (five 9’s) corresponds to 5m 15s of downtime in a year (the arithmetic is sketched below).
- This is a challenging availability goal.
- The SLA only requires 99.9% service availability = 8.76 hours of downtime per year.
So, really – why target five 9’s?
- If we aim for five 9’s and miss, we are still likely to hit a target that is better than the SLA.
- The hardware has shown that it is capable of supporting this goal.
- The software is also capable of meeting this goal (modulo denial of service attacks from some members of the user community…).
- The critical key is to carefully plan service upgrades and configuration changes.
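The downtime figures on this slide follow directly from the availability percentages; a quick check (illustrative arithmetic only):

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_percent):
    """Hours of allowed downtime per year for a given availability."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)

print(downtime_hours_per_year(99.999) * 60)  # ~5.26 minutes, i.e. about 5m 15s
print(downtime_hours_per_year(99.9))         # 8.76 hours
```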

FermiGrid Persistent ITB
Gatekeepers are Xen VMs. Worker nodes are also partitioned with Xen VMs:
- Condor
- PBS
- Sun Grid Engine
- Plus a couple of “extra” CPUs for future cloud investigation work.

Cloud Computing
FermiGrid is also looking at Cloud Computing. We have a proposal in this FY that, if funded, will allow us to deploy an initial cloud computing capability:
- Dynamic provisioning of computing resources for test, development and integration efforts.
- Retirement of several racks of out-of-warranty systems.
  – Both of the above help improve the “green-ness” of the computing facility.
- Additional capacity for the GP Grid cluster.

Conclusions
Virtualization is working well within FermiGrid.
- All services are deployed in Xen virtual machines.
- The majority of the services are also deployed in a variety of high availability configurations.
We are actively working on:
- The configuration modifications necessary to deploy the non-HA services as HA services.
- The necessary foundation work to allow us to move forward with a cloud computing initiative (if funded).

Fin
Any questions?