Slide 1: Operation of the CERN Managed Storage environment; current status and future directions
CHEP 2004, Interlaken
Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith
Slide 2: Managed Storage Dream
- Free to open… instant access
- Any time later… unbounded recall
- Get back the exact same coins… goods (data) integrity
Slide 3: Managed Storage Reality
- Maintain + upgrade, innovate + technology refresh
- Ageing equipment, escalating requirements
- A dynamic store / active data management
(Diagram: disk cache in front of the tape store)
Slide 4: CASTOR Service: CERN Managed Storage
- Tape store and disk caches (stage servers), CASTOR central servers, and the CASTOR Grid service (GridFTP servers, SRM service)
- Qualities highlighted in the architecture diagram: reliability, uniformity, automation, scalability, redundancy, with the Grid service as the new service
- A highly distributed system: 42 stagers/disk caches, 370 disk servers, 6,700 spinning disks, 70 tape servers, 35,000 tapes
Slide 5: CASTOR Service
- Running experiments: CDR for NA48, COMPASS, and nTOF
  - Experiment peaks of 120 MB/s; combined average of 10 TB/day
  - Sustained 10 MB/s per dedicated 9940B drive
  - A record 1.5 PB in 2004
  - Pseudo-online analysis
- Experiments in the analysis phase: LEP and fixed-target
- LHC experiments in construction: data production / analysis (Tier-0/1 operations) and test-beam CDR
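As a sanity check (not a figure from the talk), the slide's numbers can be restated in consistent units: 10 TB/day is roughly 116 MB/s on average, i.e. on the order of a dozen dedicated 9940B drives writing at the quoted 10 MB/s.

    # Illustrative arithmetic only, restating the slide's figures: 10 TB/day of combined
    # CDR traffic as an average rate, and how many dedicated 9940B drives that rate
    # would keep busy at 10 MB/s sustained each.
    TB_PER_DAY = 10
    DRIVE_RATE_MBS = 10            # sustained MB/s per dedicated 9940B drive (per slide)

    avg_rate_mbs = TB_PER_DAY * 1e6 / 86_400    # 10 TB/day in MB/s (1 TB = 1e6 MB)
    drives_kept_busy = avg_rate_mbs / DRIVE_RATE_MBS

    print(f"average {avg_rate_mbs:.0f} MB/s, ~{drives_kept_busy:.0f} drives writing")
    # -> average 116 MB/s, ~12 drives (per-experiment peaks reach 120 MB/s)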
Slide 6: Quattor-ising
- Motivation: scale (see G. Cancio's talk); uniformity, manageability, automation
- Configuration description (into the CDB): HW and SW, nodes and services (illustrative sketch below)
- Reinstallation: quiescing a server ≠ draining a client!
  - Gigabit-card gymnastics; BIOS upgrades for PXE
- Eliminate peculiarities from CASTOR nodes: switch misconfigurations, firmware upgrades (ext2 -> ext3)
- Result: manageable servers
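On the real system these descriptions are written in Quattor's own template language; the snippet below is only a hypothetical Python illustration of the kind of per-node HW/SW and service facts such a configuration description captures, with every field name invented for the sketch.

    # Hypothetical illustration only: the sort of per-node HW/SW description a
    # configuration database holds.  Field names and values are invented; Quattor's
    # CDB uses its own template language, not Python.
    disk_server = {
        "hardware": {
            "cpu_count": 2,
            "memory_gb": 2,
            "nic": "gigabit",
            "data_disks": 24,
        },
        "software": {
            "os": "linux",
            "filesystem": "xfs",
            "packages": ["castor-stager", "lemon-agent"],
        },
        "services": ["castor-diskserver"],
    }

    # A reinstallation driven from such a description reproduces the node exactly,
    # instead of relying on per-machine peculiarities.
    print(disk_server["software"]["packages"])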
Slide 7: LEMON-ising
- Lemon agent everywhere: Linux box monitoring and alarms; automatic HW static checks
- Adding CASTOR-server-specific service monitoring
- HW monitoring: temperatures, voltages, fans, etc.; lm_sensors -> IPMI (see the tape section)
- Disk errors: SMART, with smartmontools auto checks and predictive monitoring (illustrative check below)
- Tape drive errors: SMART
- Result: uniformly monitored servers
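A minimal sketch, assuming smartmontools is installed, of the kind of SMART health sweep referred to above; the device list and the alarm action are placeholders, not the LEMON sensor code.

    # Illustrative only: a minimal SMART health sweep of the kind smartmontools
    # automates.  Device list and alarm handling are assumptions for the sketch.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]      # hypothetical device list

    def smart_healthy(device: str) -> bool:
        """Return True if smartctl reports the drive's overall health as PASSED."""
        result = subprocess.run(["smartctl", "-H", device],
                                capture_output=True, text=True)
        return "PASSED" in result.stdout

    for dev in DEVICES:
        if not smart_healthy(dev):
            print(f"ALARM: {dev} failed its SMART health check")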
Slide 8: Warranties
Slide 9: Disk Replacement
- Unacceptably high failure rate: head instabilities
- 10 months before the case was agreed; 4 weeks to execute
- 1,224 disks exchanged (= 18%), and the cages
Slide 10: Disk Storage Developments
- Disk configurations / file systems: HW RAID-1/ext3 -> HW RAID-5 + SW RAID-0/XFS
- IPMI: HW health monitoring + remote access (sketch below)
  - Remote reset + power-on/off (independent of the OS)
  - Serial console redirection over LAN
- LEAF: hardware and state management
- Next generations (see H. Meinhard's talk): 360 TB SATA in a box; 140 TB external SATA disk arrays
- New CASTOR stager (see J.-D. Durand's talk)
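A minimal sketch of the out-of-band control IPMI provides, using ipmitool; the BMC host name and credentials are placeholders, and this is not the CERN tooling itself.

    # Illustrative only: out-of-band power control and serial-over-LAN via ipmitool.
    # Host name and credentials are placeholders, not the CERN setup.
    import subprocess

    BMC = ["ipmitool", "-I", "lanplus", "-H", "diskserver-bmc.example.org",
           "-U", "admin", "-P", "secret"]

    def power_status() -> str:
        """Query the chassis power state, independent of the host OS."""
        return subprocess.run(BMC + ["chassis", "power", "status"],
                              capture_output=True, text=True).stdout.strip()

    def power_cycle() -> None:
        """Hard power-cycle a hung server remotely."""
        subprocess.run(BMC + ["chassis", "power", "cycle"], check=True)

    # Serial console redirection over LAN (interactive session):
    #   ipmitool -I lanplus -H diskserver-bmc.example.org -U admin sol activate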
Slide 11: Tape Service
- 70 tape servers (Linux), (mostly) single FibreChannel-attached drives
- 2 symmetric robotic installations, 5 x STK 9310 silos in each
(Table: drives and media for bulk physics, fast access, and backup)
Slide 12: Chasing Instabilities
- Tape server temperatures?
Slide 13: Media Migration
- Technology generations: migrate data to avoid obsolescence and reliability issues in drives
  - 1986: 3480 / 3490
  - 1995: Redwood
  - 2001: 9940
- Financial: capacity gain within sub-generations
Slide 14: Media Migration: 9940A -> 9940B
- 9940A: 60 GB, 12 MB/s; 9940B: 200 GB, 30 MB/s
- Replace A drives by B drives: capacity, performance, reliability
- Migrate A format to B format: 9 months, 25% of the B-drive resources (rough arithmetic below)
- 1% of A tapes are unreadable on B drives (drive head tolerances), so A drives are kept
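A back-of-the-envelope estimate (not figures from the talk) of why such a repack is a multi-month campaign; the cartridge count, the drive allocation, and the assumption that A-format tapes stream at their native 12 MB/s are all invented for the illustration.

    # Back-of-the-envelope only: rough duration of a 9940A -> 9940B repack.
    # Cartridge count, drive count, and the A-format read rate as bottleneck
    # are assumptions, not slide figures.
    TAPES_A = 20_000                 # hypothetical number of full 9940A cartridges
    CAPACITY_A_GB = 60               # per slide: 60 GB per 9940A cartridge
    READ_RATE_MBS = 12               # per slide: 12 MB/s in A format (assumed bottleneck)
    DRIVES_USED = 10                 # hypothetical share of the 9940B drives

    total_gb = TAPES_A * CAPACITY_A_GB
    seconds = total_gb * 1024 / (READ_RATE_MBS * DRIVES_USED)
    months = seconds / (3600 * 24 * 30)

    print(f"{total_gb / 1e6:.1f} PB to move, ~{months:.1f} months at full streaming rate")
    # Mount/seek overhead and verification stretch this well beyond the streaming estimate.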
Slide 15: Tape Service Developments
- Removing the tails…
  - Tracking of all tape errors (18 months)
  - Retiring problematic media
  - Proactive retiring of heavily used media (> 5,000 mounts): repack onto new media
- Checksums: populated when writing to tape, verified when loading back to disk (sketch below)
- Drive testing: commodity LTO-2; high-end IBM 3592 / STK-NG
- New technology: SL8500 library / Indigo
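A minimal sketch of the write-time / read-back checksum idea, shown here with adler32; the file paths and helper names are invented, and the actual CASTOR checksum type and bookkeeping are not taken from the slide.

    # Illustrative only: checksum populated at write time, verified at recall time.
    # Paths and helper names are invented for the sketch.
    import zlib

    def file_checksum(path: str, chunk_size: int = 1 << 20) -> int:
        """Compute an adler32 checksum over a file, streaming in 1 MB chunks."""
        value = 1
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                value = zlib.adler32(chunk, value)
        return value

    # At write time: record the checksum alongside the tape copy.
    recorded = file_checksum("/pool/run12345.raw")

    # At recall time: recompute on the staged-back copy and compare.
    if file_checksum("/pool/run12345.raw.recalled") != recorded:
        raise IOError("checksum mismatch: tape copy or recall is corrupt")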
Slide 16: CASTOR Central Servers
- Currently a combined Oracle DB and application-daemons node, with assorted helper applications distributed (historically) across ageing nodes
- Front-end / back-end split:
  - FE: load-balanced application servers; eliminates interference with the DB; load distribution and overload localisation
  - BE: (developing) a clustered DB, for reliability and security
Slide 17: GRID Data Management
- GridFTP + SRM servers, formerly standalone and experiment-dedicated: hard to intervene on, not scalable
- New load-balanced, shared 6-node service: castorgrid.cern.ch (resolution example below)
  - DNS hacks for Globus reverse-lookup issues
  - SRM modifications to support operation behind a load balancer
- GridFTP standalone client
- Retire ftp and bbftp access to CASTOR
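A small illustration of what a load-balanced DNS alias such as castorgrid.cern.ch looks like from the client side; the behaviour shown is generic DNS resolution, not the CERN load-balancing implementation, and the port is assumed to be the standard GridFTP control port.

    # Illustrative only: a client resolving a DNS-load-balanced service alias.
    # The alias is from the slide; the generic-DNS behaviour and the GridFTP
    # control port (2811) are assumptions of this sketch.
    import socket

    def backend_addresses(alias: str, port: int = 2811) -> set[str]:
        """Resolve a service alias and return the set of addresses behind it."""
        infos = socket.getaddrinfo(alias, port, proto=socket.IPPROTO_TCP)
        return {addr[0] for _, _, _, _, addr in infos}

    # Successive lookups may rotate the record order, spreading GridFTP/SRM
    # connections across the back-end nodes.
    print(backend_addresses("castorgrid.cern.ch"))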
Slide 18: Conclusions
- Stabilising HW and SW
- Automation
- Monitoring and control
- Reactive -> proactive data management