FermiGrid
Keith Chadwick
CD Facility Planning, January 5, 2009

Overall Deployment Summary

5 Racks in FCC:
• 3 Dell Racks on FCC1
 – Can be relocated to FCC2 in FY2009.
 – Would prefer a location on FCC2 that can be served by panels fed from 2 UPS for the FermiGrid-HA services (requires 4 x L5-30R).
• 2 Racks on FCC2
 – Will be retiring 1 "soon".

1 Dell Rack in WH8 "falcon's nest" systems area.

Several Racks in GCC:
• GCC-A
• GCC-B
• Expect to add additional racks in FY2009 and FY2010.

Physical Services

FCC1

OSG [Gratia] Services Rack
• 8 Dell 2850 Systems, many near end of warranty.
• Support: 8x5 + Best Effort.
• These systems are proposed to be upgraded/replaced this FY.
• New systems for OSG ReSS will also be added to this rack this FY.
• Old systems will be moved to WH and retained for development/integration/test.

FermiGrid Gatekeeper Services Rack
• 15 systems, various vendors.
• Support: 8x5 + Best Effort.
• Many systems have been recently upgraded.

FermiGrid HA Services Rack
• 7 Dell 2950 Systems with dual power supplies.
• 4 power controllers spread across 2 panels.
• Support: 24x7 for GUMS and SAZ services, all others 8x5 + Best Effort.

FCC2

OSG Resource Selection Services Rack
• ReSS systems overdue for hardware refresh.
• Replacement systems (2 x Dell 2950) will be installed in the OSG Services Rack (on FCC1).
• OSG Bestman test nodes will be installed in this FCC2 rack.

FermiGrid ITB Worker Node Rack
• In the process of being replaced/retired.
• Replacement systems already in GCC-B.

Wilson Hall 8

FermiGrid Test System Dell Rack
• 4 x Dell 2850 (old production fermigrid[1-4]).
• 1 x Dell 2950 (old production fermigrid5).
• 3 x 1U systems.

GCC

These systems are mostly managed by FEF.

GCC-A:

GCC-B:

FermiGrid-HA - Highly Available Grid Services

The majority of the services listed in the FermiGrid service catalog are deployed in a high availability (HA) configuration that is collectively known as "FermiGrid-HA".
• The goal for FermiGrid-HA is > 99.999% service availability.
 – Not including building or network failures.
 – These will be addressed by FermiGrid-RS (redundant services) in FY2010/11.
• For the period of 01-Dec-2007 through 30-Jun-2008, FermiGrid achieved % service availability.

FermiGrid-HA utilizes three key technologies:
• Linux Virtual Server (LVS).
• Xen Hypervisor.
• MySQL Circular Replication.
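To make the availability goal concrete, here is a minimal sketch (a hypothetical helper, not part of the FermiGrid tooling) that converts an availability target into the downtime it allows per week, month, and year:

```python
# Hypothetical helper: translate an availability target into a downtime budget.

def downtime_budget_minutes(availability_pct: float) -> dict:
    """Maximum allowed downtime (in minutes) implied by an availability target."""
    down_fraction = 1.0 - availability_pct / 100.0
    periods = {"week": 7 * 24 * 60, "month": 30 * 24 * 60, "year": 365 * 24 * 60}
    return {name: total * down_fraction for name, total in periods.items()}

for target in (99.0, 99.9, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target:>7}% -> {budget['week']:6.1f} min/week, "
          f"{budget['month']:6.1f} min/month, {budget['year']:7.1f} min/year")
```

At 99.999% the budget works out to roughly five minutes of downtime per year.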

HA Services Deployment

FermiGrid employs several strategies to deploy HA services:
• Trivial monitoring or information services (such as Ganglia and Zabbix) are deployed on two independent virtual machines.
• Services that natively support HA operation (Condor Information Gatherer, the FermiGrid internal ReSS deployment) are deployed in the standard service HA configuration on two independent virtual machines.
• Services that maintain intermediate routing information (Linux Virtual Server) are deployed in an active/passive configuration on two independent virtual machines. A periodic heartbeat process is used to perform any necessary service failover (see the sketch after this list).
• Services that do not maintain intermediate context (i.e., pure request/response services such as GUMS and SAZ) are deployed using a Linux Virtual Server (LVS) front end to active/active servers on two independent virtual machines.
• Services that support active-active database functions (circularly replicating MySQL servers) are deployed on two independent virtual machines.
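The active/passive pattern in the third bullet can be pictured with a small sketch. This is an illustrative Python model, not FermiGrid's actual heartbeat implementation; the class name and timeout value are assumptions.

```python
# Illustrative active/passive failover: the standby node promotes itself when
# the heartbeat from the active node goes stale (names and timeout are assumed).

import time

HEARTBEAT_TIMEOUT_SECONDS = 30  # assumed value, not from the slides

class StandbyDirector:
    """Standby director that watches the active director's heartbeat."""

    def __init__(self):
        self.last_heartbeat = time.time()
        self.is_active = False

    def on_heartbeat(self):
        # Called each time a heartbeat message arrives from the active director.
        self.last_heartbeat = time.time()

    def periodic_check(self):
        # Run periodically; fail over if the active director has gone quiet.
        if not self.is_active and time.time() - self.last_heartbeat > HEARTBEAT_TIMEOUT_SECONDS:
            self.promote()

    def promote(self):
        # In a real deployment this step would take over the routing role;
        # here we only record the state change.
        self.is_active = True
        print("Heartbeat lost: standby director taking over as active")
```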

HA Services Communication

[Diagram: a client reaches the active LVS director (with a standby LVS director kept in sync via heartbeat), which distributes requests across active/active VOMS, GUMS, and SAZ servers; these in turn use active/active MySQL servers kept consistent via circular replication.]
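As a rough illustration of the request path in the diagram, the sketch below models an LVS director handing stateless GUMS/SAZ requests to the two active servers in round-robin fashion; the host names are invented for the example.

```python
# Toy round-robin dispatch, standing in for the LVS director in the diagram.
# Host names are hypothetical.

from itertools import cycle

gums_backends = cycle(["gums-active-1", "gums-active-2"])

def dispatch(request_id: int) -> str:
    """Send a stateless authorization request to the next active backend."""
    backend = next(gums_backends)
    print(f"request {request_id} -> {backend}")
    return backend

for request_id in range(4):
    dispatch(request_id)
```

Because GUMS and SAZ do not maintain intermediate context, either backend can answer any request, so losing one server only reduces capacity.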

Non-HA Services

The following services are not currently implemented as HA services:
• Globus gatekeeper services (such as the CDF and D0 experiment Globus gatekeeper services) are deployed in segmented pools.
 – Loss of any single pool will reduce the available resources by approximately 50%.
• MyProxy.
• OSG Gratia accounting service [Gratia]
 – Not currently implemented as an HA service.
 – If the service fails, it will not be available until appropriate manual intervention is performed to restart it.
• OSG Resource Selection Service [ReSS]
 – Not currently implemented as an HA service.
 – If the service fails, it will not be available until appropriate manual intervention is performed to restart it.

We are working to address these services as part of the FermiGrid FY2009 activities.

FermiGrid Service Level Agreement

Authentication and Authorization Services:
• The service availability goal for the critical Grid authorization and authentication services provided by the FermiGrid Services Group shall be 99.9% (measured on a weekly basis) for the periods that any supported experiment is actively involved in data collection, and 99% overall.

Incident Response:
• FermiGrid has deployed an extensive automated service monitoring and verification infrastructure that is capable of automatically restarting failed (or about-to-fail) services, as well as performing notification to a limited pager rotation.
• It is expected that the person who receives an incident notification shall attempt to respond to the incident within 15 minutes if the notification occurs during standard business hours (Monday through Friday, 8:00 through 17:00), and within 1 (one) hour at all other times, provided that this response interval does not create a hazard.

FermiGrid SLA Document:
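The slides do not show the monitoring code itself; the following is a simplified sketch of the probe/restart/page pattern described under Incident Response, assuming init-script-managed services. The "service" commands and the "gums" name are placeholders, not the production tooling.

```python
# Simplified watchdog sketch: probe a service, restart it on failure, and page
# the on-call rotation if the restart does not bring it back.

import subprocess

def probe(service: str) -> bool:
    """Return True if the init system reports the service as running."""
    result = subprocess.run(["service", service, "status"], capture_output=True)
    return result.returncode == 0

def restart(service: str) -> None:
    subprocess.run(["service", service, "restart"], capture_output=True)

def notify_pager(message: str) -> None:
    # Stand-in for the real paging mechanism.
    print(f"PAGE: {message}")

def check_service(service: str) -> None:
    if probe(service):
        return                      # service is healthy, nothing to do
    restart(service)                # attempt automatic recovery first
    if not probe(service):
        notify_pager(f"{service} failed and did not restart cleanly")

if __name__ == "__main__":
    check_service("gums")           # hypothetical service name
```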

Measured Service Availability

                   Last Week   Last Month   Last Quarter   Since 10-Jul-2008
Hardware
Core Services                   99.992%      99.995%        99.976%
Gatekeepers                     99.864%      98.474%        99.082%
Batch Services                  99.801%      99.813%        99.844%
ReSS                                                        99.717%
Gratia                                                      99.812%

The above figures are through the week ending Saturday 03-Jan-2009 at 23:59:59.
The goal for FermiGrid-HA is > 99.999% service availability.
The SLA for GUMS and SAZ is 99.9% (during data taking) / 99.0% (otherwise).
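For reference, figures like those above follow from a simple ratio of uptime to elapsed time. The sketch below (with an invented downtime value) computes a weekly availability and compares it against the 99.9% / 99.0% SLA targets.

```python
# Availability arithmetic behind the table: uptime divided by elapsed time.
# The 10-minute downtime figure is invented for the example.

WEEK_SECONDS = 7 * 24 * 3600

def availability_pct(downtime_seconds: float, period_seconds: float = WEEK_SECONDS) -> float:
    """Percentage of the period during which the service was available."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)

weekly = availability_pct(downtime_seconds=10 * 60)
print(f"weekly availability: {weekly:.3f}%")
if weekly >= 99.9:
    print("meets the 99.9% data-taking SLA")
elif weekly >= 99.0:
    print("meets only the 99.0% overall SLA")
else:
    print("below SLA")
```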