Experience with procuring, deploying and maintaining hardware at remote co-location centre. CHEP’13, 14th October 2013. Afroditi XAFI, Alain GENTIT, Anthony GROSSIR et al.


Experience with procuring, deploying and maintaining hardware at remote co-location centre
CHEP’13, 14th October 2013
Afroditi XAFI, Alain GENTIT, Anthony GROSSIR, Benoit CLEMENT, Eric BONFILLOU, Liviu VALSAN (since May 2013), Miguel COELHO DOS SANTOS, Olof BÄRRING, Vincent DORE, Wayne SALTER

Outline
  Background: why remote co-location?
  Preparation
  Hands-on access: why, when, who?
  First remote deployment
  Ramp up remote operation
  Conclusions

Background
  Initial forecast predicted exceeding available power (2.5MW) in ~
  New DC? Containers?
  Start with local area co-location: 17 racks, <100kW safe power
  Tender for contract for co-location up to 2.5MW equipment for a duration years
  Contract signed with Wigner Data Centre in Budapest (*)
  First deployment: 400 servers, 80 disk arrays (6PB)
  Exercise ‘remote’ operation in local co-location
  Construction
  Updated forecast: exceed available power in ~
  (*) See

Preparation
  Review main processes
    – Delivery requirements
    – Hardware handling
    – Stock management
    – Inventory
    – Network registration
    – Burn-in
    – Production deployment
    – Remote console
    – Onsite maintenance

Delivery requirements
  Delivery requirements stipulated in RFP spec and purchase order:
    – Firmware versions & settings
    – Labeling stickers (s/n, MAC, I/O ports, disk, ...)
  Wrong settings tend to break procedures and automation
    – Boot order, NIC with PXE disabled, …
    – Struggle with suppliers
  Remote console access before delivery
    – Check and determine detailed settings for supplier
  Custom barcode stickers

Network registration
  Custom Asset Identifier set by supplier
    – FRU attributes in BMC
        Contract number in ‘Product Asset Tag’ (PAT)
        Serial number in ‘Product Serial’ (PS)
    – On chassis
        Bar-code sticker “PAT-PS”
  Network registration:
    – Host generates its name from Asset ID in BMC
        1. Asset ID too long for Windows NETBIOS name. Compromise: see the host name layout below
        2. Host name to Asset ID association is stored in the network registration database (LANDB)
  Host name layout (compromise):
    Example: ‘P’ ‘0’ ‘9’ ‘4’ ‘7’ ‘2’ ‘9’ ‘6’ ‘4’ ‘7’ ‘5’ ‘3’ ‘2’ ‘7’ ‘9’
    Fields, in order: ‘P’, contract CERN doc number, random decimal number
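The naming compromise above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide gives the field order ('P', contract number, random decimal number) and a 15-character example, but not the width of each field, so the 7-digit contract number used here is hypothetical, the lowercase 'p' is a choice made for the sketch, and a plain dict stands in for LANDB. The 15-character limit is the usable length of a NetBIOS name.

```python
import random

NETBIOS_MAX_LEN = 15  # a NetBIOS name has at most 15 usable characters


def make_hostname(contract_number: str, rng: random.Random) -> str:
    """Build 'p' + contract digits + random decimal digits, 15 chars in total."""
    if not (contract_number.isascii() and contract_number.isdigit()):
        raise ValueError("contract number must consist of decimal digits")
    pad = NETBIOS_MAX_LEN - 1 - len(contract_number)
    if pad < 1:
        raise ValueError("contract number leaves no room for a random suffix")
    suffix = "".join(rng.choice("0123456789") for _ in range(pad))
    return "p" + contract_number + suffix


def register(landb: dict, asset_id: str, contract_number: str,
             rng: random.Random) -> str:
    """Pick an unused host name and record its association with the asset ID.

    `landb` is a stand-in dict (host name -> asset ID), not the real LANDB API.
    """
    while True:
        name = make_hostname(contract_number, rng)
        if name not in landb:  # retry on the (unlikely) name collision
            landb[name] = asset_id
            return name
```

A call such as `register({}, "0947296-SN0001", "0947296", random.Random())` returns a 15-character name beginning with `p0947296`; the asset ID string follows the “PAT-PS” sticker format with made-up values.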

Automated registration
  Components: Network DB, DHCP
  Steps:
    – PXE boot
    – Temporary IP
    – Load live image
    – Discover MAC addresses
    – Register ‘p abcd’
    – Permanent IP
    – HW discovery
    – HW inventory
    – Register asset info
    – Start burn-in
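One way to picture the registration flow is as a fixed sequence of steps that each host walks through. The sketch below assumes the steps run strictly in the order they appear in the transcript of the diagram; the step identifiers and the `Host` class are hypothetical names introduced for illustration, not part of the CERN tooling.

```python
from dataclasses import dataclass, field

# Step order as read from the slide transcript (assumed to be sequential).
STEPS = [
    "pxe_boot",
    "temporary_ip",
    "load_live_image",
    "discover_mac_addresses",
    "register_host",
    "permanent_ip",
    "hw_discovery",
    "hw_inventory",
    "register_asset_info",
    "start_burn_in",
]


@dataclass
class Host:
    serial: str
    done: list = field(default_factory=list)

    def next_step(self):
        """Return the next pending step, or None once the workflow is complete."""
        return STEPS[len(self.done)] if len(self.done) < len(STEPS) else None

    def advance(self):
        """Mark the next pending step as done and return its name."""
        step = self.next_step()
        if step is None:
            raise RuntimeError("workflow already complete")
        self.done.append(step)
        return step
```

Tracking progress per host in this way makes it easy to see where a stalled machine stopped, which matters when nobody is standing next to the rack.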

Burn-in & performance tests
  Runs as part of the live (in memory) image
    1. Memory (memtest) and CPU (burnK7 or burnP6, and burnMMX) endurance tests
    2. Disk endurance tests (badblocks)
    3. CPU and disk performance tests (HEP-SPEC06, FIO)
  Network endurance & performance tests (netperf) currently require manual start-up
  Figure: HEP-SPEC06 too low (expected >280)… traced to wrong BIOS settings
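The HEP-SPEC06 check mentioned on this slide amounts to comparing each host's score with the expected value. A minimal sketch, assuming the scores have already been collected into a dict keyed by host name (the function name and the data layout are hypothetical; only the >280 threshold comes from the slide):

```python
EXPECTED_MIN_HS06 = 280.0  # the slide states that scores above 280 were expected


def flag_slow_hosts(scores: dict, minimum: float = EXPECTED_MIN_HS06) -> list:
    """Return the hosts whose HEP-SPEC06 score does not exceed the minimum."""
    return sorted(host for host, score in scores.items() if score <= minimum)
```

Hosts returned by this check would be candidates for a look at their BIOS settings, since that is where the low scores on the slide were traced to.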

Automation
  Guiding principles:
  If some process can be fully described in a manual procedure…
    – …it might also be scripted
    – Not always worthwhile in the short term
  Resilience is paramount
    – Failures are unavoidable and usually require manual action
    – BUT, it might be possible to carry on anyway
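The "carry on anyway" principle can be illustrated with a batch runner in which a failure parks the affected host for manual follow-up instead of stopping the whole batch. This is a sketch of the general pattern, not the framework used at CERN; `run_resilient` and the step callables are hypothetical.

```python
def run_resilient(hosts, steps):
    """Run every step on every host; a failure parks that host, not the batch.

    Returns (completed, needs_attention), where needs_attention maps each
    failed host to the name of the failing step and the error message.
    """
    completed, needs_attention = [], {}
    for host in hosts:
        try:
            for step in steps:
                step(host)
        except Exception as exc:
            # Record what went wrong for later manual action and move on.
            needs_attention[host] = f"{step.__name__}: {exc}"
        else:
            completed.append(host)
    return completed, needs_attention
```

With several hundred machines in a delivery, a handful of parked hosts can then be handled by hand while the rest proceed unattended.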

Onsite maintenance <2012
  Diagram: CERN in the centre
    – Service company X, Service company Y, Service company Z, …: service contracts, repair tickets
    – Supplier A, Supplier B, Supplier C, …: supply contracts, repair tickets

  Diagram: CERN in the centre
    – Supplier A, Supplier B, Supplier C, …: supply contracts, stock of spares (part of supply)
    – Service company: service contract, repair tickets, stock mgmt, failed/replacement parts shipping
  Contract with one service company at each location (Geneva, Budapest)

Hands-on access
  Why, when, who?
    – In principle only to rack mount, cable and repair
  But…
    – Remote console missing or not enough
    – BMC stuck or remote access not working
        Use of switched PDUs helps
    – Wrong settings: improve delivery process!
    – Cabling: unavoidable!
  Risk mitigation:
    – Label with I/O ports
    – Cabling diagrams
    – Resilient automation
  Restricting physical access
    – Unpopular but unclear why…
    – … say hello?

First remote deployment
  In autumn 2012 we sent out two RFPs
    – Servers: 300’000 HEP-SPEC06
    – Storage: 28PB raw disk in JBODs
  Delivery to CERN and Wigner
  Two supply contracts per tender
    – Deliver 35% of servers to Wigner
    – Deliver 25% of storage to Wigner
    – + stock of spare parts for on-site repairs

Power on

Registration + burn-in
  Power up 400 servers + 80 JBODs
  Ran unassisted
  Whole process completed in 2 weeks
  99% success

Status of remote operation
  Hardware handling
    – Delivery notification
    – VAT exemption
    – Goods reception
        Scan bar codes
        Inventory
    – Rack mounting
  Hardware repair
    – Notification tickets (Service Now)
    – Training and documentation
    – Scheduling
    – Stock management (Infor EAM)
  Starting now

Conclusions
  Remote co-location is our way to scale beyond local power limitation
  Wigner contract awarded following competitive tender
  Preparation had positive impact also on local operation
    – Design workflows and automation with remote operation in mind
  Production service is up and running
    – But work still required to finalise operational procedures
  Started preparations for large scale (90%) deployment of new deliveries in