Managing managed storage: CERN Disk Server operations
HEPiX 2004 / BNL
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith
FIO/DS2 Outline
- Which are our Data Services?
- Disk servers at CERN
- Management tools
- What's next?
A lot of hardware
Disk storage:
- 350 "storage in a box" Linux disk servers
- 6700 disks
- 550 TB of raw disk space
Tape storage:
- 2 robotic installations, each with 5 STK 9310 silos
- B drives, tapes, 2.8 PB
- drives, 8000 tapes, 160 TB
Many applications
- 200 CASTOR!
- 40 Oracle
- 20 CDR
- 10 AFS scratch
- dCache, LCG, OpenLab, EGEE, data challenges
- 40 in repair/spare
A very heterogeneous environment! And very dynamic too.
Players
Many teams involved:
- Application responsibles / users
- Service managers
- System administrators team
- Suppliers
Software often not redundant… so the hardware should be!
Need to minimize downtime!
Storage in a box
13 different hardware configurations:
- 8 – 26 IDE disks, hot-swappable trays
- 2 – 4 3-Ware RAID controllers
- 2 CPUs
- 2 – 3 power supplies
- GigE network card
Should be redundant…
Hardware interventions
- 55 interventions since Sep 1
- Disk replacements (70%)
- Trays, cables, fans, PSU
- 33% involve (un)scheduled downtime
- Older hardware harder to maintain
- One supplier out of business
- Incidents to spice up life…
Disk replacement
- 10 months before case agreed: head instabilities
- 4 weeks to execute
- 1224 disks exchanged (= 18%), and the cages as well
- Christmas
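
As a quick consistency check, the 18% quoted for the disk exchange can be compared with the 6700 disks listed on the hardware overview slide. The snippet below is only an illustrative calculation; the function and variable names are hypothetical and not part of the talk.

```python
# Figures quoted on the slides.
TOTAL_DISKS = 6700       # "6700 disks" on the hardware overview slide
EXCHANGED_DISKS = 1224   # "1224 disks exchanged" on the disk replacement slide


def exchanged_fraction(exchanged: int, total: int) -> float:
    """Return the share of the disk population that was exchanged."""
    if total <= 0:
        raise ValueError("total must be positive")
    return exchanged / total


fraction = exchanged_fraction(EXCHANGED_DISKS, TOTAL_DISKS)
print(f"{fraction:.1%} of all disks were exchanged")
```

Rounded to a whole percent, the result agrees with the 18% stated on the slide.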
65 Jumbos
- 1 – 1.5 TB raw disk space
- 3-Ware controllers
- 600 MHz PIII
- No PXE
- Becoming hard to maintain
- Many still under warranty
- Make good mini-bars!
4U servers
- 4U (5U) rack mounted
- 1 – 1.5 TB
- 2 * 3-Ware 7000 series, currently upgrading firmware
- 2 * 1 GHz PIIIs
- No PXE (yet)
- Various maintenance issues
8U servers
- 8U rack mounted
- 2 – 2.5 TB
- 3 – 4 * 3-Ware 7500(6)-8
- 2 * 2.4 GHz Xeon
- Well controlled, well maintained, well behaved, after disk replacements
Diskserver evolution
That was then…
- HW RAID1
- Ext2 filesystems, many of them
- 13 different kernels!
- RedHat 6.1/6.2, 7.2/7.3, 2.1ES
Need for automation + standardization: ELFms toolsuite
- Quattor – installation + configuration
- LEMON – performance + exception monitoring
- LEAF – Hardware and State Management
…this is now
- RedHat 7.3, preparing for SLC3
- Oracle: RHEL 2.1, preparing RHEL 3; kernel has old 3-Ware driver
- HW RAID5 + hot spare disk: up to 50% more usable space; on 3-Ware 7000 controller with up-to-date firmware
- SW RAID0 + XFS: improved performance expected; iozone benchmark; old XFS version
- Improved kernel / elevator tuning
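
The "up to 50% more usable space" claim can be illustrated with a small capacity calculation. This is a minimal sketch under stated assumptions: RAID1 is taken as mirrored pairs (half the disks usable), RAID5 with a hot spare as losing one disk to parity and one to the spare, and each array is assumed to span all 8 disks of one controller. The array size and the function names are assumptions for illustration, not figures given in the talk.

```python
def usable_raid1(disks: int, disk_gb: float) -> float:
    """Usable capacity with mirrored pairs: half of the disks hold copies."""
    if disks < 2 or disks % 2 != 0:
        raise ValueError("RAID1 mirroring needs an even number of disks (>= 2)")
    return (disks // 2) * disk_gb


def usable_raid5_hot_spare(disks: int, disk_gb: float) -> float:
    """Usable capacity with RAID5 plus one hot spare.

    One disk's worth of space goes to parity and one disk sits idle as spare.
    """
    if disks < 4:
        raise ValueError("RAID5 plus a hot spare needs at least 4 disks")
    return (disks - 2) * disk_gb


def relative_gain(disks: int) -> float:
    """Extra usable space of RAID5 + hot spare relative to RAID1."""
    # The disk size cancels out, so a unit size of 1.0 is enough.
    return usable_raid5_hot_spare(disks, 1.0) / usable_raid1(disks, 1.0) - 1.0


# Assumed array size: 8 disks on one controller.
print(f"gain with 8 disks: {relative_gain(8):.0%}")
```

With 8 disks the mirror keeps 4 disks' worth of space and RAID5 plus hot spare keeps 6, which is the 50% gain. Smaller arrays gain less under the same assumptions.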
Updating the toolbox
- SMART: to predict disk failure; daily and weekly self-tests, on every disk
- IPMI v1.5: HW monitoring and event control; power control, resets
- lm_sensors: temperature monitoring; hardware and software specific
- All data flows into Lemon repository
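
The slides do not say how the daily and weekly SMART self-tests are scheduled. One possible way, sketched here purely as an assumption, is to let the smartmontools daemon `smartd` run them via `-s` schedule directives in `smartd.conf`, addressing each disk behind a 3-Ware controller with `-d 3ware,N`. The device path `/dev/twe0`, the 8 ports, the test times, and the helper name `smartd_lines` are all hypothetical.

```python
from typing import List


def smartd_lines(controller_dev: str, ports: int,
                 short_hour: int = 2,
                 long_weekday: int = 6,
                 long_hour: int = 3) -> List[str]:
    """Build one smartd.conf line per disk behind a 3-Ware controller.

    The schedule regex follows smartd's T/MM/DD/d/HH pattern:
    a short self-test every day, a long self-test once a week.
    """
    schedule = "(S/../.././{:02d}|L/../../{}/{:02d})".format(
        short_hour, long_weekday, long_hour)
    return [
        "{} -d 3ware,{} -a -s {}".format(controller_dev, port, schedule)
        for port in range(ports)
    ]


# Assumed setup: one controller exposed as /dev/twe0 with 8 disks.
for line in smartd_lines("/dev/twe0", 8):
    print(line)
```

Generating the lines from the port count keeps the "on every disk" requirement explicit, which matters when hardware configurations differ from box to box.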
Wintertime?
This is now
- Quattorized + Lemonized
- Rely on Operator and SysAdmin teams
- Operated in same way as PC farms
- Getting more out of suppliers: BIOS upgrade necessary for PXE enabling
- BTW: most applies to tapeservers as well
What's next?
New hardware:
- 360 TB SATA in a box, 2 different suppliers
- 140 TB FC attached external SATA disk arrays
New software:
- SLC3, RHEL 3
- New CASTOR stager
New challenges:
- Oracle SAN setup
- Alice data challenge
Conclusions
A lot of work has been done to:
- Stabilize hardware and software
- Automate + hand over basic operations
- Integrate into standard work flows
- Get more out of available hardware
Achieved pro-active data management
Useful links
Standing on the shoulders of giants:
- Tim Smith, CHEP 2004
- Helge Meinhard, CHEP 2004
- Peter Kelemen, CERN IT After C5
- Jan Iven, HEPiX 2004 Edinburgh