Managing managed storage: CERN Disk Server operations
HEPiX 2004 / BNL
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

FIO/DS2 Outline
- What are our Data Services?
- CERN disk server operations
- Management tools
- What's next?

FIO/DS3 A lot of hardware
Disk storage:
- 350 "storage in a box" Linux diskservers
- 6700 disks, 550 TeraBytes of raw disk space
Tape storage:
- 2 robotic installations, each with 5 STK 9310 silos
- 9940B drives and tapes: 2.8 PB
- drives and 8000 tapes: 160 TB

FIO/DS4 Many applications
- 200 CASTOR!
- 40 Oracle
- 20 CDR
- 10 AFS scratch
- dCache, LCG, OpenLab, EGEE, data challenges
- 40 in repair/spare
A very heterogeneous environment! And very dynamic too.

FIO/DS5 Players
Many teams involved:
- Application responsibles / users
- Service managers
- System administrators team
- Suppliers
Software often not redundant… need to minimize downtime! …so the hardware should be!

FIO/DS6 Storage in a box
13 different hardware configurations:
- 8 – 26 IDE disks, hot-swappable trays
- 2 – 4 3-Ware RAID controllers
- 2 CPUs
- 2 – 3 power supplies
- GigE network card
Should be redundant…

FIO/DS7 Hardware interventions
- 55 interventions since Sep 1
- disk replacements (70%); also trays, cables, fans, PSUs
- 33% involve (un)scheduled downtime
- Older hardware is harder to maintain
- One supplier out of business
- Incidents to spice up life…

FIO/DS8 Disk replacement
- 10 months before the case was agreed: head instabilities
- 4 weeks to execute
- 1224 disks exchanged (= 18% of the 6700 disks), and the cages as well
- Christmas

FIO/DS9 65 Jumbos
- 1 – 1.5 TB raw disk space
- 3-Ware controllers
- 600 MHz PIII
- No PXE
- Becoming hard to maintain
- Many still under warranty
- Make good mini-bars!

FIO/DS10 4U servers
- 4U (5U) rack mounted
- 1 – 1.5 TB
- 2 * 3-Ware 7000 series (currently upgrading firmware)
- 2 * 1 GHz PIIIs
- No PXE (yet)
- Various maintenance issues

FIO/DS11 8U servers
- 8U rack mounted
- 2 – 2.5 TB
- 3 – 4 * 3-Ware 7500(6)-8
- 2 * 2.4 GHz Xeon
- Well controlled, well maintained, well behaved… after disk replacements

FIO/DS12 Diskserver evolution

FIO/DS13 That was then…
- HW RAID1
- Ext2 filesystems, many of them
- 13 different kernels! RedHat 6.1/6.2, 7.2/7.3, 2.1ES
Need for automation + standardization: the ELFms toolsuite
- Quattor – installation + configuration
- LEMON – performance + exception monitoring (a toy sketch of such a check follows below)
- LEAF – Hardware and State Management
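As a rough illustration of the kind of exception check a LEMON-style sensor runs on every node, the minimal Python sketch below flags machines whose kernel is not one of the site-approved versions (motivated by the "13 different kernels!" problem above). The approved-kernel list and the reporting format are assumptions for illustration, not the real LEMON sensor API:

```python
#!/usr/bin/env python
import platform
import sys

# Hypothetical list of site-approved kernels after standardizing on RedHat 7.3
APPROVED_KERNELS = {"2.4.20-28.7.cernsmp", "2.4.21-27.EL.cernsmp"}

def check_kernel():
    running = platform.release()  # e.g. "2.4.20-28.7.cernsmp"
    if running not in APPROVED_KERNELS:
        # A real LEMON sensor would raise an exception metric towards the
        # central repository; here we just report and exit non-zero.
        print("EXCEPTION: non-standard kernel %s" % running)
        return 1
    print("OK: kernel %s is site-approved" % running)
    return 0

if __name__ == "__main__":
    sys.exit(check_kernel())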

FIO/DS14 …this is now
- RedHat 7.3, preparing for SLC3
- Oracle: RHEL 2.1, preparing RHEL 3 (kernel has an old 3-Ware driver)
- HW RAID5 + hot spare disk
  - up to 50% more usable space
  - on 3-Ware 7000 controllers with up-to-date firmware
- SW RAID0 + XFS (see the sketch below)
  - improved performance expected (iozone benchmark)
  - old XFS version
  - improved kernel / elevator tuning
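A minimal sketch of the SW RAID0 + XFS layering described above, assuming the standard mdadm and xfsprogs tools. The device names, chunk size and mount point are invented for illustration; this is not CERN's actual deployment procedure:

```python
#!/usr/bin/env python
import subprocess

# Hypothetical device names: one logical unit exported per 3-Ware controller,
# each already configured as HW RAID5 + hot spare in the controller firmware.
UNITS = ["/dev/sda", "/dev/sdb"]
MD_DEVICE = "/dev/md0"
MOUNT_POINT = "/srv/data"  # invented mount point

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# Stripe the HW RAID5 units together as a software RAID0 device
run(["mdadm", "--create", MD_DEVICE, "--level=0",
     "--raid-devices=%d" % len(UNITS), "--chunk=64"] + UNITS)

# XFS on top of the stripe; throughput would then be checked with iozone
run(["mkfs.xfs", "-f", MD_DEVICE])
run(["mount", "-t", "xfs", MD_DEVICE, MOUNT_POINT])
```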

FIO/DS15 Updating the toolbox
- SMART – to predict disk failures; daily and weekly self-tests, on every disk
- IPMI v1.5 – HW monitoring and event control; power control, resets
- lm_sensors – temperature monitoring
- Hardware- and software-specific checks
All data flows into the Lemon repository (illustrative sketch below).
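As a sketch of what such a SMART health sweep might look like, assuming standard smartmontools: the device glob and the metric line format are invented for illustration, and in the real setup the data flows through LEMON sensors rather than stdout:

```python
#!/usr/bin/env python
import glob
import subprocess

def smart_health(dev):
    """Return True if 'smartctl -H' reports an overall-health PASSED verdict."""
    p = subprocess.Popen(["smartctl", "-H", dev],
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out = p.communicate()[0]
    return b"PASSED" in out

# Invented device glob; the periodic self-tests mentioned above would be
# kicked off separately with 'smartctl -t short' / 'smartctl -t long'.
for dev in sorted(glob.glob("/dev/sd?")):
    status = "ok" if smart_health(dev) else "failing"
    # A LEMON sensor would push this metric to the central repository
    print("smart_health %s %s" % (dev, status))
```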

FIO/DS16 Wintertime?

FIO/DS17 This is now
- Quattorized + Lemonized
- Rely on Operator and SysAdmin teams
- Operated in the same way as the PC farms
- Getting more out of suppliers: BIOS upgrade necessary for PXE enabling
- BTW: most of this applies to tapeservers as well

FIO/DS18 What's next?
New hardware:
- 360 TB SATA in a box, 2 different suppliers
- 140 TB FC-attached external SATA disk arrays
New software:
- SLC3, RHEL 3
- new CASTOR stager
New challenges:
- Oracle SAN setup
- ALICE data challenge

FIO/DS19 Conclusions
A lot of work has been done to:
- stabilize hardware and software
- automate + hand over basic operations
- integrate into standard work flows
- get more out of the available hardware
Achieved: pro-active data management.

FIO/DS20 Useful links
Standing on the shoulders of giants:
- Tim Smith, CHEP 2004
- Helge Meinhard, CHEP 2004
- Peter Kelemen, CERN IT After-C5
- Jan Iven, HEPiX 2004 Edinburgh