Published by Gervase Lane. Modified over 9 years ago.
1
Automatic server registration and burn-in framework
HEPiX’13, 28th October 2013
Speaker: Afroditi XAFI
Co-authors: Olof BÄRRING, Eric BONFILLOU, Liviu VALSAN
Computing Facilities (CF), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
2
Outline
Motivation
Preparation
Implementation
Workflow
Results of 1k+ bulk delivery:
– Network Registration
– Burn-in & Performance Tests
Conclusions
Future work
3
Motivation
Up to the beginning of this year, running acceptance tests meant:
– Manually registering the servers in the network database and in the system administration toolkit
    Error prone: based on input given by the suppliers in Excel format (cells not in the right format)
    Not being able to register the servers would prevent the acceptance tests from starting
– Installing the servers with Linux SLC
    For very large deliveries, the parallel installation could fail because the installation servers were overloaded
– Reviewing the test results, which was not straightforward
    It was a semi-automated log analysis, with no dashboards
It required significant effort to follow up a given delivery:
– On average, one person was assigned full time per delivery
– Every single error had to be understood and addressed manually
4
Motivation
Ultimately, the goals we wanted to achieve by automating the process were to:
– Reduce the number of errors at network registration time, and detect them better
– Avoid unnecessary installation and early registration in the system administration toolkits
– Minimize the effort needed to carry out the acceptance
– Ease the analysis of the results
– Deliver the resources more quickly to the users (provided there are no generic hardware issues)
5
Preparation
We had to define our requirements for the vendors more systematically:
Infrastructure requirements prior to delivery:
– Sticker with a unique ID in barcode format, at a defined location on the chassis, to ease asset management
– IO port schema provided to ease the physical installation and cabling process
Remote access given by the suppliers to the first production systems prior to delivery:
– Allows the procurement team to define the desired hardware configuration of the systems (e.g. BIOS settings, boot list order)
Figure labels: Purchase Order, Serial Number
6
Implementation
Python application running on the live image
Monitors hardware and software failures
– Lemon agent running on the live image, embedding all the necessary hardware sensors
Reports events to Splunk
Maintains a hardware profile of each server in a DB
x86 architecture, soon ARM
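As a rough illustration of the last bullets above, the sketch below shows one way a Python application on the live image could keep a hardware profile per server and format a monitoring event. Everything in it is an assumption made for illustration: the `hw_profile` table, the field names, and the use of SQLite are hypothetical, and the event is only serialized to JSON rather than sent to Splunk.

```python
import json
import sqlite3
import time


def open_db(path=":memory:"):
    """Open the (hypothetical) profile database and create its single table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS hw_profile (serial TEXT PRIMARY KEY, profile TEXT)"
    )
    return db


def save_profile(db, serial, profile):
    """Store or overwrite the hardware profile of one server as a JSON document."""
    db.execute(
        "INSERT OR REPLACE INTO hw_profile (serial, profile) VALUES (?, ?)",
        (serial, json.dumps(profile, sort_keys=True)),
    )
    db.commit()


def load_profile(db, serial):
    """Return the stored profile as a dict, or None if the serial is unknown."""
    row = db.execute(
        "SELECT profile FROM hw_profile WHERE serial = ?", (serial,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def make_event(serial, source, status, detail=""):
    """Serialize one monitoring event as a JSON line (illustrative field names)."""
    return json.dumps(
        {"time": time.time(), "serial": serial, "source": source,
         "status": status, "detail": detail},
        sort_keys=True,
    )
```

Keying the profile on the serial number mirrors the unique ID sticker required from the vendors, so a profile recorded during burn-in can later be matched to the physical asset.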
7
Process Steps – Registration
Workflow diagram (labels as extracted from the slide): PXE boot, Network DB, DHCP, Temporary IP, Load Live image, Discover MAC addresses, Register, Permanent IP, HW Discovery, HW Inventory, Register asset info, Start burn-in, Get Certificates
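The registration workflow can be read as a chain of steps in which each one depends on the previous one having succeeded. The sketch below models that idea only. The step order is an assumption taken from the order of the diagram labels, and the step functions, the `Server` class, and the placeholder MAC and IP values are hypothetical stand-ins for the real actions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Server:
    serial: str
    macs: List[str] = field(default_factory=list)
    ip: Optional[str] = None
    completed: List[str] = field(default_factory=list)


Step = Tuple[str, Callable[[Server], bool]]


def run_steps(server: Server, steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for name, action in steps:
        if not action(server):
            return name
        server.completed.append(name)
    return None


# Hypothetical stand-ins for the real actions; each returns True on success.
def discover_macs(server: Server) -> bool:
    server.macs = ["00:11:22:33:44:55"]  # placeholder value
    return True


def register_permanent_ip(server: Server) -> bool:
    if not server.macs:  # cannot register without a discovered MAC address
        return False
    server.ip = "192.0.2.10"  # documentation-range address, not a real one
    return True


STEPS: List[Step] = [
    ("discover MAC addresses", discover_macs),
    ("register permanent IP", register_permanent_ip),
    ("start burn-in", lambda server: server.ip is not None),
]
```

Returning the name of the failing step, rather than a bare boolean, gives a dashboard something specific to show for each server that did not complete registration.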
8
Burn-in & performance tests
Run as part of the live (in-memory) image:
1. Memory (memtest) and CPU (burnK7 or burnP6, and burnMMX) endurance tests
2. Disk endurance tests (badblocks, SMART self-tests)
3. Disk and CPU performance tests (HEP-SPEC06, FIO)
Based on HATS, presented at HEPiX Spring ’13
– Performance tests aimed at certifying conformance to the technical specifications; also quite efficient at finding hardware failures
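Certifying conformance to the technical specifications amounts to comparing each measured figure against a required minimum. The sketch below shows that comparison in isolation. The metric names and all numbers are made up for illustration, and the function does not run HEP-SPEC06 or FIO itself.

```python
from typing import Dict, Optional


def find_shortfalls(measured: Dict[str, float],
                    required: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Map each metric that is missing or below its minimum to the measured value."""
    shortfalls: Dict[str, Optional[float]] = {}
    for metric, minimum in required.items():
        value = measured.get(metric)
        if value is None or value < minimum:
            shortfalls[metric] = value
    return shortfalls


# Made-up numbers, for illustration only.
required = {"hepspec06": 250.0, "fio_seq_read_mb_s": 150.0}
measured = {"hepspec06": 262.5, "fio_seq_read_mb_s": 120.0}
print(find_shortfalls(measured, required))
```

A missing measurement is treated as a shortfall, on the assumption that a test which produced no result should not count as a pass.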
9
Results – Registration
[Figure not preserved in the transcript]
10
Results – Registration
[Figure not preserved in the transcript]
11
Results – Registration
Some reasons for the failures and retries in the process:
– Faulty cabling, i.e. wrong port cabled, or cable not fully plugged in
– Faulty switch ports or settings
– Faulty main-board
– Not a failure: a few racks missing switch up-links at CERN prevented PXE boot of some servers until the problem was fixed
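Several of the causes listed above are fixed by a manual intervention (re-cabling, a switch setting), after which the same step succeeds on a later attempt. The sketch below is one hedged way to express such bounded retries while keeping the error of every failed attempt for later analysis. The function name, the attempt limit, and the delay are assumptions, not details of the CERN framework.

```python
import time
from typing import Callable, List, Tuple


def retry_step(action: Callable[[], object],
               max_attempts: int = 3,
               delay_s: float = 0.0) -> Tuple[bool, List[str]]:
    """Call action until it stops raising or max_attempts is reached.

    Returns (succeeded, errors), where errors holds one message per failed attempt.
    """
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return True, errors
        except Exception as exc:  # any failure counts as a failed attempt
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                time.sleep(delay_s)  # wait before the next attempt
    return False, errors
```

Keeping the per-attempt messages even when the step eventually succeeds makes it possible to count retries per rack or per switch, which is where cabling and up-link problems tend to cluster.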
12
Results – Burn-in & Performance tests
Burn-in tests
HEP-SPEC06
– Total HEP-SPEC06 of ~260k
13
Conclusions
Impact on our procurement activities:
The current framework allows acceptance tests to be run over a very short period
– 1000+ servers and attached storage went through the process in about 1.5 weeks (instead of 3 to 4 months)
It requires a minimal amount of effort and resources
– One person follows up what is happening using dashboards for about one hour per day, if no errors are detected
However, it can only work that well if the servers are delivered as requested
– Preparation is key to success!
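To put the quoted durations on a common scale, the short calculation below converts 3 to 4 months into weeks and divides by 1.5 weeks. It assumes an average month of 52/12 weeks, which is an approximation introduced here and not a figure from the slides. Under that assumption the process ran roughly 9 to 12 times faster.

```python
WEEKS_PER_MONTH = 52 / 12  # average month length in weeks (approximation)


def speedup(old_months: float, new_weeks: float) -> float:
    """Ratio of the old duration (in months) to the new duration (in weeks)."""
    return old_months * WEEKS_PER_MONTH / new_weeks


low = speedup(3, 1.5)   # 3 months versus 1.5 weeks
high = speedup(4, 1.5)  # 4 months versus 1.5 weeks
print(f"{low:.1f}x to {high:.1f}x")
```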
14
Future work
Functionality that we plan to add in the future to further automate the process:
Integration of a fully automated P2P network test
Better integration of RAID controllers
– They require 3rd party tools and specific hardware sensors to detect errors
Automation of the allocation process
– If the server is error free, direct registration to Foreman
Decoupling from the CERN infrastructure so that we can distribute it
15
Thank you
Questions?
16
Contact: it-dep-cf-fpp@cern.ch