Operation of the CERN Managed Storage environment; current status and future directions. CHEP 2004, Interlaken. Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith.

1 Operation of the CERN Managed Storage environment; current status and future directions
CHEP 2004 / Interlaken
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

2 Managed Storage Dream (2004/09/29, CERN Managed Storage: Tim.Smith@cern.ch)
- Free to open … instant access
- Any time later … unbounded recall
- Find the exact same coins … goods integrity

3 Managed Storage Reality
- Maintain + upgrade, innovate + technology refresh
- Ageing equipment, escalating requirements
- Dynamic store / active data management (diagram: disk cache in front of the tape store)

4 CASTOR Service
- CERN Managed Storage: disk caches with stage servers in front of the tape store, coordinated by CASTOR servers
- CASTOR Grid Service: GRIDftp servers and a new SRM service
- Goals: reliability, uniformity, automation, scalability, redundancy
- A highly distributed system: 42 stagers/disk caches, 370 disk servers, 6,700 spinning disks, 70 tape servers, 35,000 tapes

5 CASTOR Service
- Running experiments
  - CDR for NA48, COMPASS, Ntof
  - Experiment peaks of 120 MB/s; combined average 10 TB/day
  - Sustained 10 MB/s per dedicated 9940B drive
  - Record 1.5 PB in 2004
  - Pseudo-online analysis
- Experiments in the analysis phase: LEP and fixed target
- LHC experiments in construction
  - Data production / analysis (Tier0/1 operations)
  - Test beam CDR
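The quoted rates can be sanity-checked with a little arithmetic; a sketch using the figures from this slide (decimal units assumed, purely illustrative):

```python
# Sanity check of the quoted CDR figures (decimal units; illustrative only).

TB = 1e12    # bytes
MB = 1e6     # bytes
DAY = 86400  # seconds

daily_volume = 10 * TB                 # "combined average 10 TB/day"
avg_rate = daily_volume / DAY / MB     # implied average ingest rate, MB/s
drive_rate = 10                        # MB/s sustained per 9940B drive
drives_needed = avg_rate / drive_rate  # drives needed to absorb the average

print(f"average ingest rate: {avg_rate:.0f} MB/s")         # ~116 MB/s
print(f"9940B drives to sustain it: {drives_needed:.1f}")  # ~11.6
```

The 120 MB/s experiment peaks are thus close to the implied day-long average, which is why dedicated drives per experiment were needed.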

6 Quattor-ising
- Motivation: scale (see G. Cancio's talk); uniformity, manageability, automation
- Configuration description (into CDB): HW and SW; nodes and services
- Reinstallation: quiescing a server ≠ draining a client!
  - Gigabit card gymnastics; BIOS upgrades for PXE
- Eliminate peculiarities from CASTOR nodes
  - Switch misconfigurations, firmware upgrades, ext2 -> ext3
- Result: manageable servers

7 LEMON-ising
- Lemon agent everywhere: Linux box monitoring and alarms; automatic HW static checks
- Adding CASTOR-server-specific service monitoring
- HW monitoring: temperatures, voltages, fans, etc.; lm_sensors -> IPMI (see tape section)
- Disk errors: SMART via smartmontools; automatic checks, predictive monitoring
- Tape drive errors: SMART
- Result: uniformly monitored servers
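An automated SMART health check via smartmontools could look roughly like this; a minimal sketch assuming the usual `smartctl -H` output format for ATA disks (the device path and wrapper names are illustrative, not CASTOR's actual tooling):

```python
import subprocess


def health_from_output(text: str) -> bool:
    """Parse `smartctl -H` output: True only if overall health is PASSED."""
    for line in text.splitlines():
        if "overall-health self-assessment" in line:
            return line.rstrip().endswith("PASSED")
    return False  # no health line found: treat as a failed check


def disk_is_healthy(device: str) -> bool:
    """Query one disk; needs smartmontools installed and root privileges."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return health_from_output(result.stdout)

# Usage (device path is site-specific): disk_is_healthy("/dev/sda")
```

A monitoring agent would run such a check periodically and raise an alarm on the transition to an unhealthy state, which is what enables the predictive replacement mentioned above.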

8 Warranties

9 Disk Replacement
- Unacceptably high failure rate: head instabilities
- 10 months before the case was agreed; 4 weeks to execute
- 1,224 disks exchanged (18%), and the cages

10 Disk Storage Developments
- Disk configurations / file systems: HW RAID-1/ext3 -> HW RAID-5 + SW RAID-0/XFS
- IPMI: HW health monitoring + remote access
  - Remote reset + power on/off (independent of the OS)
  - Serial console redirection over LAN
- LEAF: hardware and state management
- Next generations (see H. Meinhard's talk): 360 TB SATA in a box; 140 TB external SATA disk arrays
- New CASTOR stager (see J.-D. Durand's talk)
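Out-of-band access of this kind is typically scripted over `ipmitool`; a hedged sketch that only assembles the command lines (the BMC hostname and credentials are placeholders, and the helper name is invented for illustration):

```python
# Build ipmitool command lines for out-of-band control of a disk server.
# BMC hostname, user, and password below are placeholders.

def ipmi_cmd(bmc_host: str, user: str, password: str, *action: str) -> list:
    """Assemble an ipmitool invocation over the lanplus interface."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, *action]

# Remote reset, independent of the OS running on the server:
power_cycle = ipmi_cmd("bmc-host", "admin", "secret", "chassis", "power", "reset")

# Serial console redirection over LAN:
console = ipmi_cmd("bmc-host", "admin", "secret", "sol", "activate")

# Each list can be handed to subprocess.run(...) to execute.
```

Because the BMC answers even when the OS is wedged, this replaces a trip to the machine room for the most common interventions.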

11 Tape Service
- 70 tape servers (Linux), (mostly) single FibreChannel-attached drives
- 2 symmetric robotic installations, 5 STK 9310 silos in each
- Drives and media tiers: bulk physics, fast access, backup

12 Chasing Instabilities
- Tape server temperatures?

13 Media Migration
- Technology generations: migrate data to avoid obsolescence and reliability issues in drives
  - 1986: 3480 / 3490
  - 1995: Redwood
  - 2001: 9940
- Financial: capacity gain in sub-generations

14 Media Migration
- 9940A (60 GB, 12 MB/s) -> 9940B (200 GB, 30 MB/s)
- Replace A drives by B drives; migrate A to B format: capacity, performance, reliability
- 9 months; 25% of B resources
- 1% of A tapes unreadable on B drives (drive head tolerances): keep A drives
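The scale of the A-to-B migration can be estimated from the drive specs on this slide; an illustrative best-case calculation (assuming the B drive's nominal rate is the bottleneck and ignoring mount and positioning overheads):

```python
# Rough sizing of the 9940A -> 9940B migration, from the slide's figures.

GB = 1e9
a_capacity = 60 * GB   # 9940A cartridge
b_capacity = 200 * GB  # 9940B cartridge
b_rate = 30 * 1e6      # 9940B drive, bytes/s

# Best case: one full A cartridge streamed through a B drive.
seconds_per_tape = a_capacity / b_rate   # 2000 s, about 33 minutes
capacity_gain = b_capacity / a_capacity  # ~3.3x fewer cartridges
```

With tens of thousands of cartridges at half an hour each, the quoted 9 months on 25% of the B-drive pool is plausible once real-world mount, seek, and retry overheads are added.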

15 Tape Service Developments
- Removing tails…
  - Tracking of all tape errors (18 months)
  - Retiring of problematic media
  - Proactive retiring of heavily used media (>5,000 mounts); repack onto new media
- Checksums: populated when writing to tape; verified when loading back to disk
- Drive testing: commodity LTO-2; high-end IBM 3592 / STK-NG
- New technology: SL8500 library / Indigo
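The write-then-verify checksum flow can be sketched as follows; adler32 is used here purely as an example algorithm (the slide does not name the one used), and the helper names are illustrative:

```python
import zlib


def adler32_of(data: bytes) -> int:
    """Example checksum, computed once when the file is written to tape."""
    return zlib.adler32(data) & 0xFFFFFFFF


def verify_on_recall(data: bytes, stored_checksum: int) -> bool:
    """Recompute on recall to disk and compare with the stored value."""
    return adler32_of(data) == stored_checksum


payload = b"example event data"
stored = adler32_of(payload)                         # populated at write time
assert verify_on_recall(payload, stored)             # clean recall passes
assert not verify_on_recall(payload + b"!", stored)  # corruption is detected
```

Storing the checksum alongside the file's metadata turns silent tape corruption into a detectable error at recall time.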

16 CASTOR Central Servers
- Today: a combined Oracle DB and application-daemon node, with assorted helper applications distributed (historically) across ageing nodes
- Front-end / back-end split
  - FE: load-balanced application servers; eliminates interference with the DB; load distribution, overload localisation
  - BE: (developing) clustered DB; reliability, security

17 GRID Data Management
- Former setup: standalone, experiment-dedicated GridFTP + SRM servers; hard to intervene on, not scalable
- New load-balanced shared 6-node service: castorgrid.cern.ch
  - DNS hacks for Globus reverse-lookup issues
  - SRM modifications to support operation behind a load balancer
- GridFTP standalone client; retire ftp and bbftp access to CASTOR

18 Conclusions
- Stabilising HW and SW
- Automation; monitoring and control
- Reactive -> proactive data management

