Managing the CERN LHC Tier0/Tier1 centre: Status and Plans. March 27th 2003. CERN.ch


CERN.ch 2: The Problem
• ~6,000 PCs, plus another ~1,000 boxes
 – c.f. ~1,500 PCs and ~200 disk servers at CERN today.
• But! Affected by:
 – ramp-up profile
 – system lifetime
 – I/O performance
 – …
• Uncertainty factor: 2x

CERN.ch 3: Issues
• Hardware Management
 – Where are my boxes? And what are they?
• Hardware Failure
 – #boxes × MTBF + Manual Intervention = Problem!
• Software Consistency
 – operating system and managed components
 – experiment software
• State Management
 – Evolve configuration with high-level directives, not low-level actions.
• Maintain service despite failures
 – or, at least, avoid dropping catastrophically below the expected service level.

CERN.ch 4: Hardware Management
• We are not used to handling boxes on this scale.
 – Essential databases were designed in the ’90s for handling a few systems at a time.
   » 2 FTE-weeks to enter 450 systems!
 – Chain of people involved:
   » prepare racks, prepare allocations, physical install, logical install
   » and people make mistakes…

CERN.ch 5: Connection Management
• Boxes and cables must all be in the correct place, or our network management system complains about the MAC/IP address association.
• One or two errors are not unlikely when 400 systems are installed.
 – Correct the cabling? Or correct the database? (The kind of cross-check involved is sketched below.)
• (Or buy pre-racked systems with a single 10Gb/s uplink. But CERN doesn’t have the money for these at present…)
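A minimal sketch of the cross-check this implies, with invented MAC/IP data and field names; in reality the comparison is between the network management system's view and the hardware database:

```python
# Hypothetical sketch: compare MAC/IP pairs seen on the network against the
# associations recorded in the hardware database. Addresses and structure are
# invented for illustration; this is not the real CERN schema or tooling.

recorded = {            # database: MAC -> expected IP
    "00:30:48:12:34:56": "137.138.1.10",
    "00:30:48:12:34:57": "137.138.1.11",
}

observed = {            # what the rack switches actually see
    "00:30:48:12:34:56": "137.138.1.11",   # swapped cables, or a swapped database entry?
    "00:30:48:12:34:57": "137.138.1.10",
}

def find_mismatches(recorded: dict, observed: dict) -> dict:
    """Return MACs whose observed IP differs from the database entry."""
    return {
        mac: (recorded.get(mac), ip)
        for mac, ip in observed.items()
        if recorded.get(mac) != ip
    }

for mac, (expected, actual) in find_mismatches(recorded, observed).items():
    print(f"{mac}: database says {expected}, network sees {actual}")
```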

CERN.ch 6: Hardware Management
• We are not used to handling boxes on this scale.
 – Essential databases were designed in the ’90s for handling a few systems at a time.
   » 2 FTE-weeks to enter 450 systems!
 – Chain of people involved:
   » prepare racks, prepare allocations, physical install, logical install
   » and people make mistakes…
• Developing a Hardware Management System to track systems.
 – 1st benefit has been to understand what we do!

CERN.ch 7: (figure only; no recoverable text)

CERN.ch 8: Hardware Management
• We are not used to handling boxes on this scale.
 – Essential databases were designed in the ’90s for handling a few systems at a time.
   » 2 FTE-weeks to enter 450 systems!
 – Chain of people involved:
   » prepare racks, prepare allocations, physical install, logical install
   » and people make mistakes…
• Developing a Hardware Management System to track systems.
 – 1st benefit has been to understand what we do!
 – Being used to track systems as we migrate to our new machine room.
 – Would now like SOAP interfaces to all databases (a client-side sketch follows).
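Purely as an illustration of what a SOAP interface to these databases might offer a client: the WSDL location, operation names and fields below are invented, and only the zeep client usage pattern is real.

```python
# Illustrative sketch of a SOAP client for the hardware database.
from zeep import Client

client = Client("https://hms.example.cern.ch/hardware.wsdl")  # hypothetical WSDL endpoint

# Look up a box: where is it, and what is it?
box = client.service.GetSystem(serial="CH-2003-00042")        # hypothetical operation
print(box.rack, box.room, box.cpu_type, box.state)

# Feed a repair back into the databases (e.g. a replaced disk with a new MAC
# address), so configuration information stays consistent with reality.
client.service.RecordIntervention(                            # hypothetical operation
    serial="CH-2003-00042",
    action="disk replaced",
    new_mac="00:30:48:aa:bb:cc",
)
```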

CERN.ch 9: Hardware Failure
• MTBF is high, but so is the box count.
 – ~2,400 disks at CERN today: 3.5 × 10^6 disk-hours/week
   » ~1 disk failure per week (arithmetic sketched below)
• Worse, these problems need human intervention.
• Another role for the Hardware Management System:
 – Manage the list of systems needing local intervention.
   » Expect this to be a prime-shift activity only; maintain the list overnight and present it for action in the morning.
 – Track systems scheduled for vendor repair.
   » Ensure vendors meet contractual obligations for intervention.
   » Feed subsequent system changes (e.g. new disk, new MAC address) back into the configuration databases.
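The failure-rate arithmetic behind these bullets can be made explicit with a small sketch; the MTBF value used here is an illustrative assumption, not a CERN measurement.

```python
# Back-of-the-envelope sketch of the "#boxes × MTBF" problem: for a large
# population of independent disks, expected failures per week is roughly
# (number of disks × 168 hours) / MTBF.

HOURS_PER_WEEK = 168

def expected_failures_per_week(n_disks: int, mtbf_hours: float) -> float:
    return n_disks * HOURS_PER_WEEK / mtbf_hours

# A few thousand disks with a nominal MTBF of a few hundred thousand hours
# still produce on the order of one failure per week, each of which currently
# needs a manual intervention.
print(f"{expected_failures_per_week(2_400, 400_000):.2f} failures/week")
```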

CERN.ch 10: Software Consistency - Installation
• System installation requires knowledge of the hardware configuration and its use.
• There will be many different system configurations:
 – different functions (CPU vs disk servers) and acquisition cycles
 – hardware drift over time (40 different CPU/memory/disk combinations today)
 – Fortunately, there are major groupings:
   » a batch of 350 systems bought last year; 450 being installed now
   » the 800 production batch systems should have identical software
• Use a Configuration Management Tool that allows the definition of high-level groupings (a sketch follows below).
 – EDG/WP4 CDB & SPM tools are being deployed now,
   » but much work is still required to integrate all configuration information and software packages.
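A minimal sketch of what such high-level groupings might look like; the group names, package lists and fields are invented for illustration and are not the EDG/WP4 CDB data model.

```python
# Node configurations composed from a group template plus per-node overrides,
# rather than edited machine by machine. All names below are assumptions.

BASE = {"os": "RedHat 7.3", "packages": ["openafs", "lsf"], "kernel": "2.4.20"}

GROUPS = {
    "batch_2002": {**BASE, "role": "cpu-server", "disk_layout": "single"},
    "batch_2003": {**BASE, "role": "cpu-server", "disk_layout": "single", "kernel": "2.4.21"},
    "disk_server": {**BASE, "role": "disk-server", "packages": BASE["packages"] + ["raid-tools"]},
}

def node_config(group: str, **overrides) -> dict:
    """Desired configuration for one node: group template + per-node overrides."""
    return {**GROUPS[group], **overrides}

# 800 production batch nodes get identical software from one template;
# only identity-specific values differ per node.
print(node_config("batch_2003", hostname="lxbatch042"))
```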

CERN.ch 11: Software Consistency - Updates
• Large-scale software updates pose two problems:
 – deployment rapidity
 – ensuring consistency across all targets
• Deployment rapidity
 – The EDG/WP4 SPM tool enables predeployment of software packages to a local cache with delayed activation (sketched below).
   » This reduces the peak bandwidth demand, which is good for both the network and the central software repository infrastructure.
• Consistency
 – As for installation, accurate configuration information is needed.
 – But we also need a tight feedback loop between the configuration database, the monitoring tools and the software installation system.
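A sketch of the predeploy-then-activate idea, assuming a hypothetical local cache path and plain RPM activation; this is not the real SPM interface.

```python
# Phase 1 spreads downloads over time; phase 2 activates from the local cache
# in a short window, with no load on the central repository.
import subprocess
import urllib.request
from pathlib import Path

CACHE = Path("/var/cache/swrep")    # hypothetical local package cache

def prefetch(package_urls: list[str]) -> list[Path]:
    """Phase 1: pull packages to the local cache ahead of the update window."""
    CACHE.mkdir(parents=True, exist_ok=True)
    local = []
    for url in package_urls:
        dest = CACHE / url.rsplit("/", 1)[-1]
        if not dest.exists():                      # idempotent: skip cached packages
            urllib.request.urlretrieve(url, dest)
        local.append(dest)
    return local

def activate(packages: list[Path]) -> None:
    """Phase 2: install from the local cache only; no network needed."""
    subprocess.run(["rpm", "-Uvh", *map(str, packages)], check=True)
```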

CERN.ch 12: Keeping nodes in order
[Diagram linking the Node with the Configuration System, the Monitoring System, the Installation System and the Fault Management System.]
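A toy sketch, with entirely invented interfaces, of the reconciliation loop such a diagram suggests: desired state comes from the configuration system, observations from monitoring, and the installation and fault-management systems act on any difference.

```python
def reconcile(node: str, desired: dict, monitor, install, fault_mgmt) -> None:
    """One pass of the loop for a single node."""
    observed = monitor(node)                       # Monitoring System
    if observed.get("health") == "broken":
        fault_mgmt(node, observed)                 # Fault Management System
    elif observed.get("config") != desired:
        install(node, desired)                     # Installation System applies
                                                   # the Configuration System's target

# Example wiring with stub callables (all hypothetical):
reconcile(
    "lxbatch042",
    desired={"os": "RedHat 7.3"},
    monitor=lambda n: {"health": "ok", "config": {"os": "RedHat 7.2"}},
    install=lambda n, d: print(f"reinstall {n} to {d}"),
    fault_mgmt=lambda n, o: print(f"repair {n}: {o}"),
)
```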

CERN.ch 13: Deployment of Experiment Software
• We still need to understand the requirements for the deployment of experiment software.
 – Today it is either placed in a shared filesystem or downloaded per job; neither solution is satisfactory for large-scale clusters.

CERN.ch 14: State Management
• Clusters are not static:
 – OS upgrades
 – reconfiguration for special assignments
   » c.f. the Higgs analysis for LEP
 – load-dependent reconfiguration
   » but best handled by a common configuration!
• Today:
 – Humans identify the nodes to be moved and manually track them through the required steps.
• Tomorrow:
 – "Give me 200, any 200. Make them like this. By then."
 – A State Management System (sketched below).
   » Development is starting now.
   » Again, it needs tight coupling to the monitoring and configuration systems.
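A sketch of the kind of high-level directive such a system might accept; the request schema and the configuration name are invented for illustration.

```python
# "Give me 200, any 200. Make them like this. By then." expressed as data.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class StateRequest:
    count: int                 # how many nodes ("200, any 200")
    target_config: str         # named configuration grouping ("like this")
    deadline: datetime         # when they must be ready ("by then")
    candidates: str = "any"    # constraint on which nodes may be taken

request = StateRequest(
    count=200,
    target_config="lxbatch-alice-prod",   # hypothetical grouping name
    deadline=datetime(2003, 4, 15),
)

# The system, not a human, would then pick the nodes, drive them through the
# drain / reinstall / reconfigure steps, and report progress against the deadline.
print(request)
```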

CERN.ch 15: Grace under Pressure
• The pressure:
 – hardware failures
 – software failures
   » 1 mirror failure per day
   » 1% of CPU server nodes fail per day
 – infrastructure failures
   » e.g. AFS servers
• We need a Fault Tolerance System
 – Repair simple local failures
   » and tell the monitoring system…
 – Recognise failures with wide impact and take action
   » e.g. temporarily suspend job submission
 – A complete system would be highly complex, but we are starting to address the simple cases (see the sketch below).
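A sketch of rule-based handling for the simple cases mentioned here; the event shapes and responses are assumptions for illustration.

```python
# Local repairs where possible, cluster-wide protective actions where a
# failure has wide impact, and escalation to humans otherwise.

def handle_event(event: dict) -> str:
    kind = event["kind"]
    if kind == "mirror_degraded":
        # Simple local failure: rebuild and tell the monitoring system.
        return f"rebuild mirror on {event['node']}; notify monitoring"
    if kind == "afs_server_down":
        # Wide-impact infrastructure failure: protect the batch service.
        return "temporarily suspend job submission until AFS recovers"
    if kind == "node_unreachable":
        # Needs hands: queue for the morning intervention list (cf. slide 9).
        return f"add {event['node']} to the local-intervention list"
    return "escalate to operator"

print(handle_event({"kind": "mirror_degraded", "node": "lxbatch042"}))
```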

CERN.ch 16: Conclusions
• The scale of the Tier0/Tier1 centre amplifies simple problems:
 – physical and logical installation,
 – maintaining operations,
 – system interdependencies.
• Some basic tools are now being deployed, e.g.
 – a Hardware Management System,
 – the EDG/WP4-developed configuration and installation tools.
• Much work still to do, though, especially for
 – a State Management System, and
 – a Fault Tolerance System.