Computing Facilities, CERN IT Department, CH-1211 Geneva 23, Switzerland (www.cern.ch/it)
Automatic server registration and burn-in framework
HEPiX'13, 28th October 2013
Speaker: Afroditi XAFI
Co-authors: Olof BÄRRING, Eric BONFILLOU, Liviu VALSAN

Outline
– Motivation
– Preparation
– Implementation
– Workflow
– Results of a 1000+ server bulk delivery: network registration; burn-in & performance tests
– Conclusions
– Future work

Motivation
Until the beginning of this year, running acceptance tests meant:
– Registering the servers manually in the network database and in the system administration toolkit. This was error prone, as it relied on input supplied by the vendors in Excel format (cells often not in the right format), and failing to register the servers prevented the acceptance tests from starting.
– Installing the servers with Linux (SLC). For very large deliveries the parallel installation could fail because the installation servers were overloaded.
– Reviewing the test results, which was not straightforward: a semi-automated log analysis with no dashboards.
Following up a given delivery required significant effort: on average one person was assigned full time per delivery, and every single error had to be understood and addressed manually.

Motivation
Ultimately, the goals we wanted to achieve by automating the process were to:
– Reduce the number of errors at network registration time, and detect them more reliably
– Avoid unnecessary installation and premature registration in the system administration toolkits
– Minimise the effort needed to carry out the acceptance
– Ease the analysis of the results
– Deliver the resources to the users more quickly (provided there are no generic hardware issues)

Preparation
We had to define our requirements to the vendors more systematically.
Infrastructure requirements prior to delivery:
– A sticker with a unique ID in barcode format (purchase order and serial number), at an agreed location on the chassis, to ease asset management
– A schema of the I/O ports, provided to ease the physical installation and cabling process
Remote access given by the suppliers to the first production systems prior to delivery:
– Allows the procurement team to define the desired hardware configuration of the systems (e.g. BIOS settings, boot list order); an example follows
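As an illustration of the kind of remote configuration this access enables, here is a minimal sketch of setting the boot order over IPMI. The BMC hostname and credentials are placeholders, and the tool choice (ipmitool) is an assumption; the slides do not say which tool is used.

```python
import subprocess

# Hedged example: persistently make PXE the first boot device on a
# remote server via its BMC. Host, user and password are placeholders.
def set_pxe_first(bmc_host, user, password):
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password]
    # "options=persistent" applies to all subsequent boots, not just the next one.
    subprocess.check_call(base + ["chassis", "bootdev", "pxe",
                                  "options=persistent"])

set_pxe_first("bmc-node001.example.cern.ch", "admin", "secret")
```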

Implementation
– Python application running on the live image
– Monitors hardware and software failures; a Lemon agent running on the live image embeds all the necessary hardware sensors
– Reports events to Splunk
– Maintains a hardware profile of each server in a database
– Supports the x86 architecture, with ARM support coming soon
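The slides say events are reported to Splunk but do not describe the transport. Below is a minimal sketch of what the reporting side of such a Python application could look like, assuming a JSON-lines log file picked up by a Splunk forwarder; the file path and field names are illustrative, not CERN's actual schema.

```python
import json
import socket
import time

# Hypothetical event log monitored by a Splunk forwarder; the path and
# the event schema are assumptions for illustration only.
EVENT_LOG = "/var/log/burnin/events.log"

def report_event(test, status, detail=""):
    """Append one burn-in event as a JSON line for Splunk to index."""
    event = {
        "time": time.time(),
        "host": socket.gethostname(),
        "test": test,      # e.g. "memtest", "badblocks", "hepspec06"
        "status": status,  # e.g. "started", "passed", "failed"
        "detail": detail,
    }
    with open(EVENT_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

report_event("badblocks", "failed", "3 bad blocks on /dev/sdb")
```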

Process Steps – Registration
1. PXE boot: the server obtains a temporary IP via DHCP from the network DB
2. Load the live image
3. Discover the MAC addresses
4. Register a permanent IP in the network DB
5. Hardware discovery and hardware inventory: register the asset information
6. Start burn-in
7. Get certificates
A sketch of the MAC-discovery and registration steps follows.
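To make steps 3 and 4 concrete, here is a hedged Python sketch as it might run from the live image. Reading /sys/class/net is standard Linux; the registration endpoint and JSON payload are placeholders, since the slides do not detail the network database's interface.

```python
import glob
import json
import os
import urllib.request

NETDB_URL = "https://netdb.example.cern.ch/register"  # placeholder endpoint

def discover_macs():
    """Return the MAC addresses of all NICs, skipping the loopback device."""
    macs = []
    for path in glob.glob("/sys/class/net/*/address"):
        iface = os.path.basename(os.path.dirname(path))
        if iface != "lo":
            with open(path) as f:
                macs.append(f.read().strip())
    return macs

def register_permanent_ip(serial, macs):
    """Ask the network DB to replace the temporary DHCP lease with a permanent IP."""
    payload = json.dumps({"serial": serial, "macs": macs}).encode()
    req = urllib.request.Request(
        NETDB_URL, data=payload,
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req).read()
```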

Burn-in & performance tests
Run as part of the live (in-memory) image:
1. Memory (memtest) and CPU (burnK7 or burnP6, and burnMMX) endurance tests
2. Disk endurance tests (badblocks, SMART self-tests)
3. Disk and CPU performance tests (HEP-SPEC06, FIO)
Based on HATS, presented at HEPiX Spring '13. The performance tests are aimed at certifying conformance to the technical specifications and are quite efficient at finding hardware failures. A sketch of the disk endurance step follows.
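As an illustration of step 2, here is a minimal Python wrapper around the disk endurance tools named on the slide. The flags are standard badblocks/smartctl usage, not necessarily the exact invocation used at CERN; note that badblocks -w is destructive and is only acceptable here because the disks are blank, pre-acceptance hardware.

```python
import subprocess
import time

def disk_endurance(device):
    """Destructive badblocks pass plus a SMART extended self-test."""
    # Write-mode test over the whole device (-w), with progress (-s) and
    # verbose diagnostics (-v). Bad block numbers are printed to stdout.
    out = subprocess.run(["badblocks", "-wsv", device],
                         capture_output=True, text=True, check=True)
    bad_blocks = [l for l in out.stdout.split() if l.isdigit()]

    # Start a SMART extended self-test; it runs in the background on the
    # drive itself, so wait before reading the health status. A real
    # framework would poll (smartctl -c reports the estimated duration).
    subprocess.check_call(["smartctl", "-t", "long", device])
    time.sleep(2 * 3600)  # crude fixed wait, for illustration only
    health = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return not bad_blocks and "PASSED" in health.stdout

if disk_endurance("/dev/sdb"):
    print("disk OK")
```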

Results – Registration
[Chart slides; the registration-result figures are not captured in this transcript.]

Results – Registration
Some reasons for the failures and retries in the process:
– Faulty cabling, i.e. a wrong port cabled, or a cable not fully plugged in
– Faulty switch ports or switch settings
– Faulty main boards
– Not a failure as such: a few racks at CERN were missing switch up-links, which prevented PXE boot of some servers until the problem was fixed

Results – Burn-in & Performance tests
Burn-in tests and HEP-SPEC06 results: a total HEP-SPEC06 of ~260k across the delivery.

Conclusions
Impact on our procurement activities:
– The current framework allows acceptance tests to run over a very short period: 1000+ servers and their attached storage went through the process in about 1.5 weeks (instead of 3 to 4 months)
– It requires a minimal amount of effort and resources: one person follows up what is happening, using dashboards, for about one hour per day if no errors are detected
– However, it can only work that well if the servers are delivered as requested: preparation is key to success!

Future work
Functionality that we plan to add to further automate the process:
– Integration of a fully automated point-to-point (P2P) network test
– Better integration of RAID controllers, which require third-party tools and specific hardware sensors to detect errors
– Automation of the allocation process: if a server is error free, register it directly in Foreman (a hedged sketch follows)
– Decoupling the framework from the CERN infrastructure so that it can be distributed
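For the planned Foreman step, a registration call could look roughly like the sketch below, using Foreman's REST API. The URL, credentials and host attributes are placeholders, and the required parameters depend on the Foreman setup; this is an assumption about the planned integration, not a description of CERN's implementation.

```python
import base64
import json
import urllib.request

FOREMAN_URL = "https://foreman.example.cern.ch/api/hosts"  # placeholder

def register_in_foreman(name, mac, user, password):
    """Create a host entry in Foreman so an error-free server can be built."""
    payload = json.dumps({"host": {"name": name, "mac": mac, "build": True}})
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        FOREMAN_URL, data=payload.encode(), method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + auth})
    return urllib.request.urlopen(req)
```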

Thank you. Questions?