National Energy Research Scientific Computing Center (NERSC)
HPC in a Production Environment
Nicholas P. Cardo, NERSC Center Division, LBNL
November 19, 2003

Scientific Computing
– Climate
– Chemistry
– Physics
– Nano-Science
– Genomics
– Molecular Modeling
– Materials
– Simulation of Large Systems
– Algorithms Development

System Configuration
– 184 Compute Nodes
– 16 GPFS Nodes
– 4 Service Nodes
– 3 Login Nodes
– 1 Network/Admin Node
– 24.7 TB Formatted SSA Disk
– ~13 TB Scratch

System Utilization (chart, y-axis: Hours)

Job Size Breakdown (chart, y-axis: Hours; annotated: Scaling Efforts)

Large Jobs (chart, y-axis: Percent; 50% reference line; annotated: Scaling Efforts)

System Expanded March 2003
The system doubled in size. Difficult decisions:
– Change in operating model: a single large-scale production system
– Cable length limitations required existing hardware to be relocated
– Integration with minimal disruption of service

System Configuration
– 380 Compute Nodes (+106%)
– 20 GPFS Nodes (+25%)
– 8 Service Nodes (+100%)
– 6 Login Nodes
– 2 Network/Admin Nodes
– 44.7 TB SSA Disk (+80%)
– ~33 TB Scratch (+153%)
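As a quick sanity check of the figures above, here is a minimal Python sketch (not part of the original talk) that recomputes the growth percentages from the before/after counts on the two configuration slides; the values on the slide are evidently rounded:

```python
# Recompute the growth percentages from the two "System Configuration" slides.
before = {"compute nodes": 184, "GPFS nodes": 16, "service nodes": 4,
          "SSA disk (TB)": 24.7, "scratch (TB)": 13}
after  = {"compute nodes": 380, "GPFS nodes": 20, "service nodes": 8,
          "SSA disk (TB)": 44.7, "scratch (TB)": 33}

for item, old in before.items():
    new = after[item]
    print(f"{item:15s} {old:>6} -> {new:>6}  (+{100 * (new - old) / old:.1f}%)")
```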

SCSI Disks
– 2 × 36.4 GB SCSI drives, mirrored for availability
– 36.4 GB available space (rootvg)

SSA Disks
– 16 drives per drawer; 3 RAID groups per drawer, plus hot spare
– RAID 5 for RAS
– Each node twin-tailed to five other nodes in the same frame
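To put rough numbers on that layout, here is a sketch (not from the talk) assuming three 4-data+1-parity RAID-5 groups and one hot spare per 16-drive drawer; the group width is an assumption inferred from the drawer arithmetic, and the per-disk size is derived from the later "Fun Facts" slide (65.4 TB raw across 3,440 SSA disks):

```python
# Rough drawer arithmetic, assuming 3 x (4 data + 1 parity) RAID-5 groups
# and 1 hot spare per 16-drive drawer.  Per-disk size is derived from the
# "Fun Facts" slide (65.4 TB raw SSA across 3,440 disks).
disk_gb = 65.4 * 1000 / 3440                    # ~19 GB per SSA disk

drives_per_drawer = 16
groups_per_drawer = 3
disks_per_group   = 5                           # assumption: 4 data + 1 parity
hot_spares = drives_per_drawer - groups_per_drawer * disks_per_group

data_disks = groups_per_drawer * (disks_per_group - 1)
print(f"per-disk size      : {disk_gb:.1f} GB")
print(f"hot spares/drawer  : {hot_spares}")
print(f"usable per drawer  : {data_disks * disk_gb:.0f} GB "
      f"of {drives_per_drawer * disk_gb:.0f} GB raw")
```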

Networking
(diagram: Login Node and Network Node, each with Jumbo Frame and Production network connections)

Fun Facts
– 6,656 processors
– 39,936 DIMMs / 7.7 TB memory
– 832 SCSI disks / 29.6 TB SCSI disk
– 3,440 SSA disks / 65.4 TB raw SSA
– 210 SSA adapters
– 30 Gigabit adapters
– 35 miles of cable
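Dividing these aggregates by the 416-node total from the expanded configuration (380 + 20 + 8 + 6 + 2) recovers the per-node picture; a small sketch, not part of the original talk:

```python
# Per-node figures derived from the "Fun Facts" aggregates, assuming the
# 416-node total from the expanded configuration (380 + 20 + 8 + 6 + 2).
nodes      = 380 + 20 + 8 + 6 + 2              # 416 nodes
processors = 6656
dimms      = 39936
scsi_disks = 832
memory_tb  = 7.7

print(f"processors per node : {processors / nodes:.0f}")    # 16-way SMP nodes
print(f"DIMMs per node      : {dimms / nodes:.0f}")
print(f"SCSI disks per node : {scsi_disks / nodes:.0f}")    # the mirrored rootvg pair
print(f"avg memory per node : {memory_tb * 1024 / nodes:.1f} GB")
```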

System Utilization (chart, y-axis: Hours)

Job Size Breakdown (chart, y-axis: Hours)

New Batch Configuration
– Class of Service: premium, regular, low, interactive, debug
– Job Class: pre_128, pre_32, pre_1, reg_128, reg_32, reg_1, reg_1l, interactive, debug, low
– Priority: high to low, in the order listed
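The slide only names the classes; purely as an illustration of how a class of service and job size might map onto a job class, here is a hypothetical Python helper (the 32/128-node breakpoints are an assumption read off the class names, not something stated in the talk):

```python
# Illustrative mapping only: class of service + node count -> job class,
# following the pre_*/reg_* naming on the slide.  The 32/128 breakpoints
# are assumptions inferred from the class names.
def job_class(cos: str, nodes: int) -> str:
    if cos in ("interactive", "debug", "low"):
        return cos                              # these map to a class of the same name
    prefix = {"premium": "pre", "regular": "reg"}[cos]
    if nodes >= 128:
        return f"{prefix}_128"
    if nodes >= 32:
        return f"{prefix}_32"
    return f"{prefix}_1"

print(job_class("premium", 256))   # pre_128
print(job_class("regular", 8))     # reg_1
```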

System Utilization (chart, y-axis: Hours)

Job Size Breakdown (chart, y-axis: Hours)

Large Jobs (chart, y-axis: Percent; 50% reference line; annotated: allocation depletion)

Job Efficiency (chart, y-axis: Hours)

Performance Variation
– Performance variation problem detected: the original nodes appeared to perform slower than the nodes added to the system.
– Hardware was swapped between original and new nodes; no improvement.
– Accounting showed the occurrence of specific commands was significantly higher on the original nodes.
– Four problem management definitions were found to be deactivated but still executing constantly on the original nodes.
– Analysis performed by NERSC’s David Skinner.
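The slide does not show how the accounting data was examined; as a rough sketch of the kind of comparison described (per-command counts on original vs. added nodes), here is a hypothetical Python fragment with made-up record, node, and command names:

```python
from collections import Counter

# Hypothetical (node, command) records extracted from process accounting;
# the node names, commands, and grouping are illustrative only.
records = [
    ("node001", "cmd_a"), ("node001", "cmd_a"), ("node001", "cmd_b"),
    ("node002", "cmd_a"), ("node201", "cmd_b"), ("node202", "cmd_a"),
]
original_nodes = {"node001", "node002"}        # nodes from the pre-expansion system

counts = {"original": Counter(), "added": Counter()}
for node, cmd in records:
    group = "original" if node in original_nodes else "added"
    counts[group][cmd] += 1

# A command far more frequent on the original nodes is a candidate culprit,
# as with the deactivated problem management definitions on the slide.
for group, c in counts.items():
    print(group, c.most_common(3))
```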

FY04 System Utilization (chart, y-axis: Hours)

FY04 Job Size Breakdown (chart, y-axis: Hours)

FY04 Large Jobs (chart, y-axis: Percent; 50% reference line)

Job Efficiency (chart)