NERSC Reliability Data
NERSC - PDSI
Bill Kramer, Jason Hick, Akbar Mokhtarani
PDSI BOF, FAST '08, Feb. 27, 2008
Production Systems Studied at NERSC
- HPSS: 2 High Performance Storage Systems
- Seaborg: IBM SP RS/6000, AIX, 416 nodes (380 compute)
- Bassi: IBM p575 POWER5, AIX, 122 nodes
- DaVinci: SGI Altix 350 (SGI ProPack 4, 64-bit Linux)
- Jacquard: Opteron cluster, Linux, 356 nodes
- PDSF: networked distributed computing, Linux
- NERSC Global Filesystem (NGF): shared file system based on IBM's GPFS
Datasets
- Data were extracted from the problem-tracking database, paper records kept by the operations staff, and vendors' repair records.
- Coverage is 2001-2006; in some cases only a subset of that period.
- Preliminary results on system availability and component failures were presented at the HEC FSIO Workshop last August.
- A more detailed analysis has since been done to classify the underlying causes of outages and failures.
- Statistics for the NERSC Global Filesystem (NGF) were produced and uploaded to the CMU website. These differ from fsstats output; fsstats was used to cross-check results on a smaller directory tree on NGF (see the sketch below).
- Workload characterizations of selected NERSC applications were made available; they were produced with IPM, a performance-monitoring tool.
- Trace data for selected applications were also made available.
- Results from a number of I/O-related studies done by other groups at NERSC were posted to the website.
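fsstats gathers summaries such as file-size histograms by walking a directory tree. The Python sketch below illustrates the same kind of cross-check on a small directory tree; it is not the fsstats tool itself, and the size-bucket boundaries and output format are purely illustrative.

```python
import os
import sys
from collections import Counter

# Illustrative only: build a file-size histogram for a directory tree,
# the kind of summary fsstats reports (this is NOT the fsstats tool).
SIZE_BUCKETS = [2**k for k in range(10, 41, 5)]  # 1 KiB .. 1 TiB boundaries

def bucket_label(size):
    for bound in SIZE_BUCKETS:
        if size < bound:
            return f"< {bound} bytes"
    return f">= {SIZE_BUCKETS[-1]} bytes"

def scan(root):
    hist = Counter()
    total_files = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            hist[bucket_label(size)] += 1
            total_files += 1
            total_bytes += size
    return hist, total_files, total_bytes

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    hist, nfiles, nbytes = scan(root)
    print(f"{nfiles} files, {nbytes} bytes under {root}")
    for label in sorted(hist, key=lambda lab: (len(lab), lab)):
        print(f"{label:>22}: {hist[label]}")
```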
Results
- Overall system availability is 96%-99%.
- Seaborg and HPSS have comprehensive data for the full six-year period and show availability of 97%-98.5% (scheduled and unscheduled outages).
- Disk-drive failure rates for Seaborg are consistent with "aging" and "infant mortality", averaging 1.2% per year.
- Tape-drive failures for HPSS show the same pattern, with an average annual rate of ~19%, higher than the ~3% suggested by the manufacturer-stated MTBF (see the sketch below).
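For reference, availability percentages and drive MTBF figures translate into outage hours and annual failure rates with simple arithmetic. The sketch below (plain Python, with the 290,000-hour MTBF chosen only to illustrate a ~3% rate) shows the conversions; the exponential-lifetime assumption behind AFR = 1 - exp(-hours_per_year / MTBF) is a common modeling convention, not something stated on the slide.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760

def annual_downtime_hours(availability):
    """Hours of outage per year implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR

def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate under an exponential-lifetime assumption."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

if __name__ == "__main__":
    # 96%-99% availability corresponds to roughly 88-350 outage hours per year.
    for a in (0.96, 0.97, 0.985, 0.99):
        print(f"{a:.1%} available -> {annual_downtime_hours(a):6.0f} outage hours/year")

    # Illustrative: a vendor MTBF of ~290,000 hours implies roughly a 3% AFR,
    # far below the ~19% observed for the HPSS tape drives.
    print(f"MTBF 290,000 h -> AFR {afr_from_mtbf(290_000):.1%}")
```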
Average Annual Outage
- Unscheduled outages include a center-wide power outage (~96 hours) and some network outages.
- The fairly low downtime of the PDSF system can be attributed to its distributed nature: components can be upgraded or updated without a system-wide outage. Only three critical components can make the whole system unavailable: the /home file system, the master batch scheduler, and the application file system. In other words, an outage is not designated system-wide unless one of these critical components fails.
- The other compute systems share a single global file system (the NERSC Global Filesystem, NGF) mounted under a common directory, and any downtime of this file system limits the use of the compute systems for production runs.
- Data since: Seaborg (2001), Bassi (Dec. 2005), Jacquard (July 2005), DaVinci (Sept. 2005), HPSS (2003), NGF (Oct. 2005), PDSF (2001). A sketch of how annual outage figures can be derived from outage records follows.
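The annual outage figures come from the problem-tracking and operations records mentioned earlier. As an illustration of the bookkeeping only (the record layout, field names, and rows below are hypothetical, not NERSC's actual schema or data), a minimal sketch:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical outage records: (system, start, end, scheduled?).
# The real data come from the problem-tracking database and operations logs;
# these rows are made up for illustration.
OUTAGES = [
    ("seaborg", "2006-01-10 06:00", "2006-01-10 14:00", True),
    ("seaborg", "2006-03-02 22:15", "2006-03-03 04:45", False),
    ("hpss",    "2006-02-18 00:00", "2006-02-21 12:00", True),
]

HOURS_PER_YEAR = 365 * 24

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def summarize(outages):
    """Total scheduled/unscheduled outage hours per system."""
    totals = defaultdict(lambda: {"scheduled": 0.0, "unscheduled": 0.0})
    for system, start, end, scheduled in outages:
        hours = (parse(end) - parse(start)).total_seconds() / 3600.0
        totals[system]["scheduled" if scheduled else "unscheduled"] += hours
    return totals

if __name__ == "__main__":
    for system, t in summarize(OUTAGES).items():
        down = t["scheduled"] + t["unscheduled"]
        # Availability over one year of records (illustrative).
        availability = 1.0 - down / HOURS_PER_YEAR
        print(f"{system:8s} sched {t['scheduled']:6.1f} h  "
              f"unsched {t['unscheduled']:6.1f} h  availability {availability:.2%}")
```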
Seaborg and HPSS Annual Outage
- HPSS had a major software upgrade in Feb. 2006 (~3.5 days of downtime).
- Seaborg had a major security breach in the fall of 2006.
Disk and Tape Drive Failures
Extra Slides
Outage Classifications
- The large share of outages attributed to the file system does not reflect problems with the file system per se. Failures of other components, such as a single Virtual Shared Disk (VSD) on an I/O node, exhibit themselves as file-system problems (a single point of failure). The sketch below illustrates this kind of cause-category tally.
- Note that system-wide outages for accounting, benchmarking, and dedicated testing render the system unavailable to users even though the system is still operational.
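As an illustration of how such a classification can be tabulated (the categories and records below are hypothetical examples, not the actual NERSC data), one can simply sum outage hours by the assigned cause:

```python
from collections import Counter

# Hypothetical (system, cause, hours) records; in practice each outage in the
# tracking database is assigned a cause category by hand. Note that the failure
# of a single component such as a VSD on an I/O node is recorded under
# "file system" because that is how it appears to users.
RECORDS = [
    ("seaborg", "file system", 6.5),
    ("seaborg", "hardware", 2.0),
    ("seaborg", "dedicated test", 8.0),  # system up but unavailable to users
    ("hpss", "software upgrade", 84.0),
]

def hours_by_cause(records):
    totals = Counter()
    for _system, cause, hours in records:
        totals[cause] += hours
    return totals

if __name__ == "__main__":
    for cause, hours in hours_by_cause(RECORDS).most_common():
        print(f"{cause:16s} {hours:6.1f} h")
```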
Seaborg Data
- IBM SP RS/6000, AIX 5.2
- 416 nodes; 380 compute nodes
- 16 CPUs per node; 16-64 GB shared memory per node
- Total disk space: 44 TB
- 4,280 disk drives (4,160 SSA, 120 Fibre Channel)
- The large number of disk failures in 2003 can be attributed to "aging" of the older drives and "infant mortality" of the newer disks (see the sketch below).
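The combination of "infant mortality" and "aging" is the classic bathtub-curve pattern. A minimal sketch of how the two regimes are often modeled, using Weibull hazard rates (the shape and scale parameters below are illustrative, not fitted to the Seaborg drive data):

```python
def weibull_hazard(t_years, shape, scale_years):
    """Weibull hazard rate h(t) = (k/lambda) * (t/lambda)**(k-1).
    shape < 1 gives a falling hazard (infant mortality);
    shape > 1 gives a rising hazard (wear-out / aging)."""
    return (shape / scale_years) * (t_years / scale_years) ** (shape - 1)

if __name__ == "__main__":
    # Illustrative parameters only: one early-failure regime and one
    # wear-out regime; neither is fitted to the Seaborg data.
    for t in (0.25, 0.5, 1, 2, 3, 4, 5):
        infant = weibull_hazard(t, shape=0.5, scale_years=60.0)
        wearout = weibull_hazard(t, shape=3.0, scale_years=12.0)
        print(f"year {t:4}: infant-mortality hazard {infant:.3f}/yr, "
              f"wear-out hazard {wearout:.3f}/yr")
```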
HPSS Data
- Two HPSS systems available at NERSC
- Eight tape silos with 100 tape drives attached
- 650 disk drives providing 100 TB of buffer cache
- Theoretical capacity: 22 PB; theoretical maximum throughput: 3.2 GB/sec
- ~1,500 tape mounts/dismounts per day
- Tape drives show the same failure pattern as Seaborg's disk drives: "aging" and "infant mortality".
NGF stats
Seaborg Outages
HPSS Outages