
NCCS User Forum September 25, 2012

Agenda
Introduction
Discover Updates
NCCS Operations & User Services Updates
Question & Answer

Accomplishments
Discover SCU8 augmentation.
– Intel Xeon Sandy Bridge: ~160 TFLOPS (480 nodes), pioneer users this week.
– Many Integrated Core (MIC): expected October 2012 (240 units).
~4 petabytes (usable) of Discover nobackup disk.
– Testing ongoing; starting to provide some project nobackup later this week.
Ten NCCS Brown Bag seminars so far, more to come.
NCCS Earth System Grid Federation (ESGF) Data Nodes: over 141 TB and 6 million data sets served (April 2011 to September 2012).
– NASA's CMIP5/IPCC AR5 climate simulation contributions.
– obs4MIPs – observations formatted for use in climate/ocean/weather model intercomparison studies (e.g., CERES EBAF, TRMM, …).
– ana4MIPs – analyses, initially MERRA monthly means.

Staff Additions
Brandon Rives, Intern

Discover Update
Dan Duffy, NCCS Lead Architect

NCCS Compute Capacity Evolution

NCCS Architecture
[Diagram: Discover Fabric 1 (139 TF, 13K cores) and Discover Fabric 2 (650 TF, 31K cores, 39K GPU cores, 14K MIC cores; FY13 upgrade planned), login nodes, Dali and Dali-GPU analysis nodes, GPFS I/O nodes and GPFS disk subsystems (~8.0 PB), viz wall, internal services (management servers, license servers, GPFS management, PBS servers, data management, other services), archive (~970 TB disk, ~30 PB tape), data gateways, and Data Portal, interconnected via the NCCS LAN (1 GbE and 10 GbE); the legend distinguishes existing components from those planned for FY13.]

Discover SCU8 – Sandy Bridge Nodes
It's here! Running some final system-level tests; ready for pioneer users later this week.
480 IBM iDataPlex nodes, each configured with:
– Dual Intel Sandy Bridge 2.6 GHz processors (E5-2670), 20 MB cache
– 16 cores per node (8 cores per socket)
– 32 GB of RAM (maintains the ratio of 2 GB/core)
– 8 floating-point operations per clock cycle
– Quad Data Rate InfiniBand
– SLES11 SP1
Advanced Vector Extensions (AVX)
– New instruction set
– Just have to recompile

Discover SCU8 – Many Integrated Core (MIC)
The NCCS will be integrating 240 Intel MIC coprocessors later this year (October).
– ~1 TFLOP per coprocessor unit
– PCI-E Gen3 connected
– Will start with 1 per node in half of SCU8
How do you program for the MIC?
– Full suite of Intel compilers
– Doris Pan and Hamid Oloso have access to a prototype version and have developed experience over the past 6 months or so
– Different usage modes; the common ones are "offload" and "native" (see the build sketch below)
– Expectation: significant performance gain for highly parallel, highly vectorizable applications
– Easier code porting using native mode, but potential for better performance using offload mode
– NCCS/SSSO will host Brown Bags and training sessions soon!
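The slide names the two usage modes but gives no build details. As a rough, non-authoritative sketch (file names are placeholders, and the exact modules and flags on Discover may differ), the two modes might be built like this with the Intel compilers of that era:

```bash
# Native mode: cross-compile the whole program for the MIC card with -mmic,
# then launch the resulting binary on the coprocessor itself.
icc -O2 -openmp -mmic mycode.c -o mycode.mic

# Offload mode: compile for the host as usual; regions marked in the source
# (e.g., "#pragma offload target(mic)" blocks) are shipped to the card at run time.
icc -O2 -openmp mycode.c -o mycode.host
```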

Sandy Bridge Memory Bandwidth Performance
STREAM Copy benchmark comparison of the last three processor generations:
– Nehalem (8 cores/node)
– Westmere (12 cores/node)
– Sandy Bridge (16 cores/node)
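For readers who want to reproduce this kind of comparison themselves, here is a minimal sketch (not the NCCS's own benchmarking procedure) of building and running the standard STREAM benchmark with the Intel compiler; the array size and thread count are illustrative.

```bash
# Build the reference STREAM source (stream.c from www.cs.virginia.edu/stream/)
# with OpenMP; the array-size macro name varies by STREAM version.
icc -O3 -openmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream

# Run with one thread per core on the node (16 on a Sandy Bridge node)
# and pick out the Copy bandwidth line.
export OMP_NUM_THREADS=16
./stream | grep '^Copy'
```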

SCU8 Finite Volume Cubed-Sphere Performance
Comparison of the GEOS-5 FV-CS Benchmark 4 shows an improvement of 1.3x to 1.5x over the previous systems' processors.
– JCSDA (Jibb): Westmere
– Discover SCU3+/SCU4+: Westmere
– Discover SCU8: Sandy Bridge

Discover: Large "nobackup" Augmentation
Discover NOBACKUP disk expansion:
– 5.4 petabytes raw (about 4 petabytes usable)
– Doubles the disk capacity in Discover NOBACKUP
– NetApp racks and 6 controller pairs (2 per rack)
– 1,800 x 3 TB disk drives (near-line SAS)
– 48 x 8 Gb FC connections
Have performed a significant amount of performance testing on these systems.
First file systems to go live this week.
If you need some space or have an outstanding request waiting, please let us know.

NCCS Operations & User Services Update
Ellen Salmon

Upcoming
New Resources:
– Initial "NetApp" Discover nobackup, starting later this week.
– Discover SCU8 Sandy Bridge: some nodes available to all for "early use" later this week. Use the "sp1" queue (specify "ncpus=16" to get SCU8 nodes); a sample job script is sketched below.
– Discover Intel MIC (240 units) installed in October; user training and testing period later in fall/winter.
Planned Outages (to date):
– Goddard's multi-building electrical outage: NCCS and all of B28, plus Bldgs 5, 18, 19, 20, 28D, 29. NCCS: ~2000 ET Friday, Oct. 12, through Sun., Oct. 14, or Mon., Oct. 15 (TBD).
– Maybe all of Discover, definitely SCUs 5, 6, 7, 8: required InfiniBand OFED (software stack) upgrade. Likely mid-October, post-HS3 Field Campaign.
– Half of SCU8 (only) will be taken offline to install MIC units (mid-October?).
– All of SCU8 (only) downtime to meet benchmark commitments (mid/late October weekend).
Required Changes:
– Rolling SLES11 SP1 upgrades for Discover in the next few weeks (mostly transparent). Try now! Use the "sp1" queue (PBS); ssh discover-test or dali-test from Discover or Dali (interactive).
– All codes must be recompiled after the required InfiniBand OFED (software stack) upgrade, likely mid-October (after the HS3 Field Campaign finishes).
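As a hedged illustration of the "sp1" queue guidance above, a batch job targeting the SCU8 Sandy Bridge nodes might look roughly like the script below; the select syntax, walltime, and executable name are assumptions, not NCCS-verified settings.

```bash
#!/bin/bash
#PBS -q sp1                    # early-access queue mentioned above
#PBS -l select=1:ncpus=16      # ncpus=16 steers the job to SCU8 nodes
#PBS -l walltime=00:30:00
#PBS -N sandybridge_test

cd $PBS_O_WORKDIR

# Optional sanity check that the job landed on an AVX-capable (Sandy Bridge) node.
grep -m1 -q avx /proc/cpuinfo && echo "AVX available on $(hostname)"

./my_model_exe                 # placeholder for an application recompiled for Sandy Bridge
```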

NCCS User Survey – Coming Soon
Ten-minute online survey via SurveyMonkey.
Provides a more systematic way for NCCS to gauge what's working well for you, and what needs more work.
We intend to repeat the survey annually so we can evaluate progress.
[Image: sample NCCS User Survey screen.]
Your frank opinions and suggestions are very much appreciated.

Discover SCU8 Sandy Bridge: AVX
The Sandy Bridge processor family features Intel Advanced Vector eXtensions (AVX).
Intel AVX is a new, wider 256-bit instruction set extension to Intel SSE (Streaming 128-bit SIMD Extensions), hence higher peak FLOPS with good power efficiency.
Designed for applications that are floating-point intensive.

Discover SCU8 Sandy Bridge: User Changes
Compiler flags to take advantage of Intel AVX (for Intel compilers 11.1 and up):
– -xavx: generates an optimized executable that runs on the Sandy Bridge processors ONLY.
– -axavx -xsse4.2: generates an executable that runs on any SSE4.2-compatible processor, with an additional specialized code path optimized for AVX-compatible processors (i.e., it runs on all Discover processors). Application performance is affected slightly compared with "-xavx" due to the run-time checks needed to determine which code path to use.
A hedged compile example follows.
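A hedged example of the two build styles described above, using the Intel Fortran compiler; the source and executable names are placeholders.

```bash
# AVX-only executable: fastest on SCU8 Sandy Bridge nodes, but it will not run
# on the older Nehalem/Westmere nodes.
ifort -O3 -xavx mymodel.f90 -o mymodel_avx

# Multi-path executable: a baseline SSE4.2 code path plus an AVX path selected
# at run time, so one binary runs on all Discover processor types.
ifort -O3 -axavx -xsse4.2 mymodel.f90 -o mymodel_multi
```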

Sandy Bridge vs. Westmere: Application Performance Comparison – Preliminary
Sandy Bridge execution speedup compared to Westmere, measured core-to-core and node-to-node, for two cases: the same executable, and a different executable compiled with -xavx on Sandy Bridge.
– WRF NMM 4 km
– GEOS-5 GCM half degree
[Speedup values from the original table are not preserved in this transcript.]

NCCS Brown Bag Seminars
~Twice a month in GSFC Building 33 (as available).
Content is available on the NCCS web site following each seminar.
Next talk: "Tips for Monitoring Memory in PBS Jobs," Tue., 16 October, 12:30–1:30, Building 33, Room E125.
Prioritize topics of interest (and add your suggestions) on today's feedback or signup sheet, or via e-mail.
We will repeat seminars upon request.

Ongoing Investigations
Discover: intermittent PBS slowness (qstat, qsub delays). Problem severity escalated with Altair, the PBS vendor.
Discover: GPFS hangs due to jobs exhausting available node memory. Continuing to refine automated monitoring; PBS slowness complicates this.
Dirac (/archive): occasional hangs in NFS exports to Discover; SGI investigating. Workaround: use scp or sftp to dirac – /archive files are available, just not via NFS (see the sketch below).
Data Portal: resolving a problem with one of four disk arrays. Updated disk array microcode applied; file sanity checking is ongoing.
Discover: investigating hardware options for heavy GPFS metadata workloads (many concurrent small, random I/O actions for directories, filenames, etc.).
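A minimal sketch of the scp/sftp workaround mentioned above; the username and paths are placeholders, not actual NCCS paths.

```bash
# Copy a file out of the archive by talking to dirac directly,
# bypassing the /archive NFS mount on Discover.
scp youruserid@dirac:/archive/path/to/somefile.nc /discover/nobackup/youruserid/

# Or browse interactively:
sftp youruserid@dirac
```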

Questions & Answers
NCCS User Services:

Contact Information
NCCS User Services:
Thank you

Supporting Slides

Resolved Issues
Dirac (Archive) Database Problem, August 2012: Files on the Dirac DMF Archive cluster were unavailable for several days while NCCS staff worked with SGI to recover from a rare-occurrence database corruption, which did not affect the integrity of the content of files stored in the archive. SGI has identified the cause of the corruption and has created a remedy to prevent future occurrences.
Long PBS Startup: The NCCS resolved a longstanding and vexing problem with very long initialization times at startup for Discover's PBS batch system by installing PBS 11 and applying a newly available bugfix patch from PBS vendor Altair (June 2012).
Problematic "HDE" Discover nobackup Disks: Replaced 540+ problematic disks via a staff bucket brigade, and began migrating data back to the repaired disk array (July 2012).
Post-Derecho Recovery: Severe storms the evening of Friday, June 29 led to a Goddard-wide power outage starting Saturday, June 30. The power outage and subsequent restoration led to the failure of a breaker required to power the air conditioning units in one of the two main NCCS computer rooms, as well as multiple Discover DDN disk storage hardware errors. The extensive recovery efforts included replacing the failed breaker and two DDN disk controllers, as well as performing multiple disk rebuilds on failed storage tiers. The Discover cluster was restored to service on Thursday, July 5.

NASA Center for Climate Simulation Supercomputing Environment
[Diagram: Discover and JIBB clusters, Dali/Dali-GPU analysis nodes, Dirac archive with tape libraries, Data Portal, GPFS I/O servers, InfiniBand and storage area networks, and 10 GbE links. Supported by HQ's Science Mission Directorate.]
① Discover Linux Supercomputer (batch) – Summer 2012: ~3,400 nodes (35,560 cores total), ~400 TFLOPS peak, 78 TB memory (2 or 3 GB per core), 3.6 PB disk. Fall 2012 additions: 480 nodes (7,680 cores), ~160 TFLOPS peak, 15 TB memory, ~3 PB disk. Components: Base (offline, ~3 TF peak); SCUs 1, 2, 3, 4 (Westmere, ~139 TF peak); SCU5 & 6 (Nehalem, ~92 TF peak); SCU7 (Westmere, ~161 TF peak); SCU8 (Sandy Bridge, ~160 TF peak, late summer 2012).
② Dali and Dali-GPU Analysis Nodes (interactive) – 12- and 16-core nodes, 16 GB memory per core; Dali-GPU has NVIDIA GPUs.
③ Dirac Archive – 0.9 PB disk, ~60 PB robotic tape library, Data Management Facility (DMF) space management on a parallel DMF cluster.
④ Data Portal – data sharing services: Earth System Grid, OPeNDAP, data download (http, https, ftp), Web Mapping Services (WMS) server.
⑤ JIBB – Linux cluster for the Joint Center for Satellite Data Assimilation (JCSDA) community (~39 TF peak, Westmere).

Twice-a-Month NCCS Brown Bag Seminars: Delivered & Proposed Topics
– Intel Many Integrated Core (MIC) Prototype Experiences
– Code Debugging Using TotalView on Discover, Parts 1 and 2
– Code Optimization Using the TAU Profiling Tool
– Climate Data Analysis & Visualization Using UVCDAT
– NCCS Archive Usage & Best Practices Tips, & dmtag
– Using winscp with Discover, Dali, and Dirac
– SIVO-PyD Python Distribution for Scientific Data Analysis
– Monitoring PBS Jobs and Memory
– Introduction to Using Matlab with GPUs
– Best Practices for Using Matlab with GPUs
– Using CUDA with NVIDIA GPUs
– Using OpenCL
– Introduction to the NCCS Discover Environment
– Scientific Computing with Python
– Using GNU Octave
– Using Database Filesystems for Many Small Files

Slides from June 19, 2012 NCCS User Forum

Coming Discover Changes: Linux SLES 11 SP1
Reason for upgrades:
– SLES 11 SP1 is required to maintain current Linux security patches.
SLES 11 SP1:
– NCCS staff tested & documented the few changes (updated libraries and Linux kernel).
– Planning for phased, rolling deployments with minimal downtime.

Dirac Archive Single Tape Copy Default
Started May 31, 2012.
Default became a single tape copy of NCCS archive files:
– …for all newly created files, and
– …for all existing archive files.
You can request a second tape copy of your critical archive files at any time*:
– dmtag –t 2 (a hedged usage sketch follows)
– *The second tape copy will be made within a few hours to a few days.
See the NCCS web site for more details.
Please contact NCCS User Services if you have questions.
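A hedged usage sketch of the dmtag command noted above; the file path is a placeholder, and the no-option query form is an assumption (consult "man dmtag" on dirac for exact options).

```bash
# Request a second tape copy for a critical archive file.
dmtag -t 2 /archive/path/to/critical_output.nc

# Running dmtag on a file without -t is expected to report its current tag
# (an assumption; verify against the dmtag man page).
dmtag /archive/path/to/critical_output.nc
```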

NCCS Observed "Get_File" Tape Operation Error Rates, Temporary and Permanent, May 2010 – May 2012
– Mode (most frequent situation): no Get_File errors (in 98 of 731 days)
– Median: 1 error in 2,371 Get_File operations
– Mean: 1 error in 258 Get_File operations

NCCS Metrics Slides (Through August 31, 2012)

NCCS Discover Linux Cluster Utilization, Normalized to 30-Day Month

Discover Linux Cluster Expansion Factor
Expansion Factor = (Queue Wait + Runtime) / Runtime
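As a hedged worked example with illustrative numbers (not measured values), a job that waits 3 hours in the queue and then runs for 6 hours has:

```latex
\[
\text{Expansion Factor} = \frac{\text{Queue Wait} + \text{Runtime}}{\text{Runtime}}
                        = \frac{3\ \mathrm{h} + 6\ \mathrm{h}}{6\ \mathrm{h}} = 1.5
\]
```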

Discover Linux Cluster Downtime
[Chart: monthly scheduled and unscheduled downtime percentages, September 2011 – August 2012.]
July 2012 unavailability reflects downtime and recovery from the multi-day, weather-related Goddard-wide power outage.

NCCS Mass Storage
① "Tape Library Potential Capacity" decreased beginning in March 2012 because, at NCCS request, users began deleting unneeded data, and that tape capacity had not all been reclaimed so that new data could be written to the tapes.
② In late May 2012, NCCS changed the Mass Storage default so that two tape copies are made only for files for which two copies have been explicitly requested. NCCS is gradually reclaiming second-copy tape space from legacy files for which two copies have not been requested.