ASGC Site Report
Jason Shih, ASGC Grid Ops
CASTOR External Operation Face-to-Face Meeting


Outline
- Current infrastructure
- The incident
- Releases and architecture
- Resource level
- Monitoring services, alarms and automation
- Operations overview
- Challenges and issues
- Plans for Q4 2009 and 2010

The incident
- All facilities moved out for cleaning
- Containers used for storage and humidity control
- Racks protected from dust during ceiling removal

All tape drives lost
- Snapshots of the decommissioned tape drives after the incident

IDC colocation
- Facility installation completed on Mar 27th
- Tape system delayed past Apr 9th: realignment, RMA for faulty parts

Current infrastructure (I)
Shared core services (ATLAS and CMS):
- Stager, ns, dlf, repack, and LSF
- Same DB cluster: 3 RAC nodes for SRM + stager/ns/dlf
Disk pools and servers:
- 80 disk servers (6 more online by the end of the 3rd week of Oct)
- Total capacity: 1.67 PB (0.3 PB allocated dynamically)
- Current usage: 0.79 PB (~58% of the statically allocated 1.37 PB)
- 14 disk pools: 8 for ATLAS, 3 for CMS, and three more for bio, SAM, and dynamic allocation
- Total capacity: 0.63 PB for CMS and 0.7 PB for ATLAS
- Current usage: 63% (CMS) and 44% (ATLAS)

Current infrastructure (II)
Shared tape drives:
- 12 before the incident, all decommissioned
- 7 during STEP (2 loaned LTO3 + 5 LTO4)
- 18 drives added in the 1st week of Oct: 24 drives in total

Monitoring services
- Standard Nagios probes: NRPE + customized plugins (a sketch follows below)
- SMS to OSE/SM for all types of critical alarms
- Availability metrics
- Tape metrics (SLS): throughput, capacity and scheduler load per VO and disk pool
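The customized plugins themselves are not shown in the slides; the sketch below only illustrates the general shape of an NRPE-style check, with a hypothetical `diskpool-usage` command and illustrative thresholds standing in for the real probes.

```python
#!/usr/bin/env python
# Illustrative NRPE-style plugin sketch, NOT the actual ASGC probe:
# checks a disk pool's usage via a hypothetical `diskpool-usage` command
# and maps the result onto the standard Nagios exit codes.
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
WARN_PCT, CRIT_PCT = 80.0, 90.0          # illustrative thresholds

def pool_usage_pct(pool):
    """Return percentage used, parsing 'used total' (bytes) output."""
    out = subprocess.check_output(["diskpool-usage", pool]).decode()
    used, total = (float(x) for x in out.split()[:2])
    return 100.0 * used / total

if __name__ == "__main__":
    try:
        pct = pool_usage_pct(sys.argv[1])
    except Exception as exc:
        print("UNKNOWN - %s" % exc)
        sys.exit(UNKNOWN)
    status = OK if pct < WARN_PCT else (WARNING if pct < CRIT_PCT else CRITICAL)
    print("%s - pool at %.1f%% | usage=%.1f%%"
          % (("OK", "WARNING", "CRITICAL")[status], pct, pct))
    sys.exit(status)
```

NRPE invokes such a script on the monitored host and relays the exit code and output line back to Nagios, which is what drives the SMS alarms mentioned above.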

Tape system during STEP
- Before the incident: 8 LTO3 + 4 LTO4 drives; 720 TB on LTO3, 530 TB on LTO4 (see the back-of-envelope check below)
- May 2009: two loaned LTO3 drives; MES added 6 LTO4 drives at the end of May
- Capacity: 1.3 PB (old, mixed LTO3/LTO4) + 0.8 PB (LTO4)
- New S54 frame model introduced: 2K slots with the tiered model; requires an ALMS upgrade and the enhanced gripper
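As a back-of-envelope check (not from the slides), the quoted capacities are consistent with the native cartridge sizes of the two generations, 400 GB for LTO-3 and 800 GB for LTO-4:

```python
# Back-of-envelope check (not from the slides): quoted library capacity
# versus native cartridge sizes of 400 GB (LTO-3) and 800 GB (LTO-4).
NATIVE_GB = {"LTO-3": 400, "LTO-4": 800}

def cartridges(capacity_tb, generation):
    return int(round(capacity_tb * 1000.0 / NATIVE_GB[generation]))

print(cartridges(720, "LTO-3"))   # ~1800 cartridges behind the 720 TB
print(cartridges(530, "LTO-4"))   # ~660 cartridges behind the 530 TB
```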

Current resource level (I): ATLAS space tokens

Space token       Cap./JobLimit  Disk servers  Tape pool/Cap.
atlasMCDISK       163TB/790      8             -
atlasMCTAPE       38TB/80        2             atlasMCtp/39TB
atlasPrdD1T0      278TB/…        …             -
atlasPrdD0T1      61TB/210       3             atlasPrdtp/105TB
atlasGROUPDISK    19TB/40        1             -
atlasScratchDisk  28TB/80        1             -
atlasHotDisk      2TB/40         2             -
Total             950TB/…

Current resource level (II): CMS

Disk pool   Cap./JobLimit  Disk servers  Tape pool/Cap.
cmsLTD0T1   278T/488       9             *
cmsPrdD1T0  284T/…         …             -
cmsWanOut   72T/220        4             -

* Depends on tape family.

Tape pools:

Tape pool       Cap(TB)/Usage  Drive dedication  LTO3/4 mixed
atlasMCtp       8.98/40%       N                 Y
atlasPrdtp      101/65%        N                 Y
cmsCSA08cruzet  15.6/46%       N                 N
cmsCSA08reco    5/0%           N                 N
cmsCSAtp        639/99%        N                 Y
cmsLTtp         34.4/44%       N                 N
dteamTest       3.5/1%         N                 N

CASTOR release overview

Service type  OS level    Release  Remark
Core          SLC 4.7/x   …        Stager/ns/dlf
SRM           SLC 4.7/x   …        Headnodes
Disk servers  SLC 4.7/x   …        Q3 2009 (20+ in Q4)
Tape servers  SLC 4.7/…   …        x86-64 OS deployed for new tape servers

Storage performance
Environment: sequential I/O, dual channels, different cache sizes.
Results (IOPS):
- With 0.5 kB I/O size: 76.4k read and 54k write
- Both read and write decrease slightly (~9%) when the I/O size is increased to 4 kB (see the conversion below)
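For context, the IOPS figures translate into approximate bandwidth via throughput = IOPS × I/O size; here the ~9% IOPS drop at 4 kB is applied uniformly to read and write, as the slide states:

```python
# throughput = IOPS x I/O size; the ~9% IOPS drop at 4 kB is applied
# uniformly to read and write, as the slide states.
def mb_per_s(iops, io_kb):
    return iops * io_kb / 1024.0

for label, iops in (("read", 76400), ("write", 54000)):
    print("%s: %.1f MB/s @ 0.5 kB, %.1f MB/s @ 4 kB"
          % (label, mb_per_s(iops, 0.5), mb_per_s(iops * 0.91, 4.0)))
```

At 0.5 kB the arrays are IOPS-bound (tens of MB/s); at 4 kB the same hardware sustains a few hundred MB/s, which is why the small percentage drop in IOPS is a good trade.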

Roadmap: host interfaces (2009, Q1-Q4)
- U320 SCSI (≈320 MB/s)
- 4G FC (≈400 MB/s) moving to 8G FC (≈800 MB/s)
- SAS 3G (4-lane ≈1200 MB/s) moving to SAS 6G (4-lane ≈2400 MB/s)
- iSCSI 1 Gb moving to iSCSI 10 Gb
- 3U/16-bay FC-SAS in May; 2U/12-bay and 4U/24-bay in June

Roadmap: drive interfaces (2009, Q1-Q4)
- U320 SCSI
- 4G FC
- SAS 3G moving to SAS 6G
- SATA-II
- 2.5" SSD (B12F series)

Estimated density (see the sketch below)
- 2009 H1: 1 TB drives, one 42U rack = 240 TB
- 2009 H2: 2 TB drives, one 42U rack = 480 TB
- 2010 H1: 2 TB drives, one 42U rack = 480 TB
- 2010 H2: 3 TB drives, one 42U rack = 720 TB
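These milestones are consistent with racks of 4U/24-bay enclosures (ten chassis in 40U of a 42U rack = 240 bays); the enclosure geometry is an assumption, only the TB-per-rack figures come from the slide:

```python
# Assumed geometry: 4U/24-bay enclosures, ten per 42U rack (40U used),
# i.e. 240 drive bays; only the TB/rack milestones come from the slide.
RACK_U, CHASSIS_U, BAYS = 42, 4, 24
drives_per_rack = (RACK_U // CHASSIS_U) * BAYS   # 10 chassis * 24 = 240

for half, drive_tb in (("2009 H1", 1), ("2009 H2", 2),
                       ("2010 H1", 2), ("2010 H2", 3)):
    print("%s: %d TB per rack" % (half, drives_per_rack * drive_tb))
```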

Pledged and future expansion
2008:
- 0.5 PB expansion of the tape system in Q2; MOU target met mid Nov
- 1.3 MSI2k per rack based on the recent E5450 processor
2009 Q1:
- 150 quad-core blade servers
- 2 TB drives for the RAID subsystem: 42 TB net capacity per chassis, 0.75 PB in total (cross-checked below)
2009 Q3/Q4:
- Additional LTO4 drives (mid Oct)
- 330 Xeon quad-core (SMP, Intel 5450) blade servers
- 2nd-phase tape MES: 5 LTO4 drives + HA
- 3rd-phase tape MES: 6 LTO4 drives; 0.8 PB expansion delivery ETA mid Nov
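A small cross-check of the 2009 Q1 disk figures: at 42 TB net per chassis, the quoted 0.75 PB total implies roughly 18 chassis (the chassis count itself is inferred, not stated on the slide):

```python
# Inferred chassis count for the 2009 Q1 disk purchase: 0.75 PB total
# at 42 TB net per chassis (the count itself is not on the slide).
print("~%d chassis" % round(0.75 * 1000 / 42.0))   # ~18
```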

Tape HA considered for Q4
If an accessor (including its associated controllers) fails, the remaining active accessor takes over all work requests, including any in progress when the fault occurred.
[Diagram: IBM dual-accessor library layout with active frames 1-4, service bays A and B, and the medium changer, accessor, XY and operator panel controllers.]

Network Infrastructure

Issues
Network infrastructure:
- Split the edge level serving T1 MSS and T2 DPM disk servers
Rotation shifts (+1 FTE, 24x7 operation):
- Review instruction manuals
- Regular meetings between OSE (shifters) and SM
- Evaluate performance (eLog and escalated tickets)
Release upgrades:
- Consider a gradual upgrade of the core services
- Recommended components to start with? Mimic on CTB first

Incidents
- Power surges caused critical services to crash twice in the last 4 months: disk and tape servers, RAC, and all core services
- Tape migration problem (CMS): wrong label type on 0.2k cartridges; empty tapes relabeled
- Controller failures: no regular pattern; kernel dump at firmware level
- Archiver log errors: 4 TB for SRM and CASTOR backups (1.6 TB/week; rough retention estimated below); new backup scratch space attached to the SAN
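The backup numbers imply a short retention window; a trivial calculation, assuming growth stays at the quoted 1.6 TB/week:

```python
# Retention implied by the quoted numbers: 4 TB of backup space at
# ~1.6 TB/week of archive-log growth (assumes growth stays constant).
print("~%.1f weeks of backups fit" % (4.0 / 1.6))   # ~2.5 weeks
```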

Upcoming expansion
- 1K LTO4 cartridges + 11 LTO4 drives; move to a different datacenter area
- Tape system HA setup
- VDQM2: priority scheduling and better group reservation
- Evaluate platform for RAC: NFS based on NAS
- 2nd-tier backup: regular restore exercises (2nd backup on disk cache); TSM setup complete, POC continues through the end of Nov