Near-Term NCCS & Discover Cluster Changes and Integration Plans: A Briefing for NCCS Users October 30, 2014

Agenda
–Storage Augmentations
–Discover Cluster Hardware Changes
–SLURM Changes
–Q & A

Storage Augmentations
Dirac (Mass Storage) Disk Augmentation
–4 Petabytes usable (5 Petabytes “raw”)
–Arriving November 2014
–Gradual integration (many files and “inodes” to move)
Discover Storage Expansion
–8 Petabytes usable (10 Petabytes “raw”)
–Arriving December 2014
–For both the targeted “Climate Downscaling” project and general use
–Begin operational use in mid/late December

Motivation for Near-Term Discover Changes
Due to demand for more resources, we are undertaking major Discover cluster augmentations. The result will be a 3x increase in computing capacity and a net increase of more than 20,000 cores for Discover.
But floor space and power limitations mean we need a phased removal of the oldest Discover processors (12-core Westmeres).
The interim reduction in Discover cores will be partly relieved by the addition of previously dedicated compute nodes.
Prudent use of SLURM features can help optimize your job’s turnaround during the “crunch time” (more on this later).

Discover Hardware Changes
What we have now (October 2014):
–12-core ‘Westmere’ and 16-core ‘Sandy Bridge’
What’s being delivered near term:
–28-core ‘Haswell’
What’s being removed to make room:
–12-core ‘Westmere’
Impact for Discover users: there will be an interim “crunch time” with fewer nodes/cores available. (The transition schedule is subject to change.)

Discover Compute Nodes, Early October 2014 (Peak ~600 TFLOPS)
“Westmere” nodes, 12 cores per node, 2 GB memory per core
–SLES11 SP1
–SCU7: 1,200 nodes, 14,400 cores total, 161 TFLOPS peak
–SCU1, SCU2, SCU3, SCU4: 1,032 nodes, 12,384 cores total, 139 TFLOPS peak
“Sandy Bridge” nodes, 16 cores per node
–SLES11 SP1
–SCU8, 2 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak
–SCU9, 4 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak

Discover Compute Nodes, Late January 2015 (Peak 2,200 TFLOPS)
No remaining “Westmere” nodes
“Sandy Bridge” nodes, 16 cores per node (no change)
–SLES11 SP1
–SCU8, 2 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak
–SCU9, 4 GB memory per core: 480 nodes, 7,680 cores, 160 TFLOPS peak
“Haswell” nodes, 28 cores per node (new)
–SLES11 SP3
–SCU10, 4.5 GB memory per core: 1,080 nodes, 30,240 cores total, 1,229 TFLOPS peak
–SCU11, 4.5 GB memory per core: ~600 nodes, 16,800 cores total, 683 TFLOPS peak
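During the transition the mix of available node types will change from week to week. As a generic sketch (partition names and feature tags such as ‘west’ or ‘sand’ depend on the local configuration), SLURM’s standard sinfo command can summarize what is currently up:

# Summarize partitions: node count, CPUs per node, memory (MB), and feature tags
sinfo -o "%P %D %c %m %f"
# Show only nodes that are currently idle
sinfo -t idle -o "%P %D %c %m %f"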

Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only) [chart]

Total Number of Discover Intel Xeon Processor Cores as a Function of Time [chart]

Projected Weekly Detail (Subject to Change): Discover Processor Cores for General Work [chart]

Discover “Crunch Time” Transition: SLURM Job Turnaround Tips (1)
See the NCCS SLURM tips web page for many helpful tips.
Time Limits:
–Specify both a preferred maximum time limit and a minimum time limit if your workflow performs self-checkpointing.
–In this example, if you know that your job will save its intermediate results within the first 4 hours, these specifications will cause SLURM to schedule your job in the earliest available time window of 4 hours or longer, up to 12 hours:
#SBATCH --time=12:00:00
#SBATCH --time-min=04:00:00
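For illustration, a minimal sketch of a self-checkpointing job script using these two directives might look like the following (the job name, task count, executable, and checkpoint file are hypothetical placeholders, not NCCS-specific names):

#!/bin/bash
#SBATCH --job-name=ckpt_example    # hypothetical job name
#SBATCH --ntasks=16                # hypothetical task count
#SBATCH --time=12:00:00            # preferred maximum wall time
#SBATCH --time-min=04:00:00        # job saves intermediate results within 4 hours

# Resume from the most recent checkpoint if one exists; otherwise start fresh.
if [ -f checkpoint.dat ]; then
    srun ./model.exe --restart checkpoint.dat
else
    srun ./model.exe
fi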

Discover “Crunch Time” Transition: SLURM Job Turnaround Tips (2)
See the NCCS SLURM tips web page for many additional helpful tips.
Don’t specify any SLURM partition unless you are trying to access specialized hardware, such as datamove or co-processor nodes.
Do specify memory requirements explicitly, either as memory per node or as memory per CPU, e.g.:
–#SBATCH --mem=12G
–#SBATCH --mem-per-cpu=3G
Don’t specify any processor architecture (e.g., ‘west’ or ‘sand’) if your job can run on any of the processors. NCCS’s SLURM configuration ensures that each job will run on only one type of processor architecture.
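Putting these tips together, a flexible job header might look like the following sketch (the job name, task count, and executable are hypothetical; the memory and time values are examples, not recommendations):

#!/bin/bash
#SBATCH --job-name=flexible_job    # hypothetical job name
#SBATCH --ntasks=64                # hypothetical task count
#SBATCH --mem-per-cpu=3G           # explicit memory request per CPU
#SBATCH --time=12:00:00
# Note: no --partition and no processor-architecture constraint, so SLURM can
# place the job on whichever eligible node type becomes available first.

srun ./my_app                      # placeholder executable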

Questions & Answers
NCCS User Services
Thank you

SUPPLEMENTAL SLIDES

SLURM Quality of Service (‘qos’)

Quality of service   Time limit   Max CPUs/user   Max running jobs/user
allnccs (default)    12 hrs       4096            N/A
debug                1 hr         512             1
long                 24 hrs       516             N/A
serial               12 hrs       4096            1

Example SBATCH directive for qos:
–#SBATCH --qos=long
There is no need to specify the default qos (allnccs).
See the NCCS web site for more details (more SLURM information will be coming soon).
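For example, the qos can also be supplied on the sbatch command line instead of in the script, which overrides any directive in the file (the script name here is a placeholder):

sbatch --qos=long --time=24:00:00 run_model.sh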

FY14-FY15 Cluster Upgrade
Combined funding from FY14 and FY15
–Taking advantage of new Intel processors: double the floating-point operations over Sandy Bridge
–Decommission SCU7 (Westmeres)
Scalable Unit 10
–Target: effectively double the NCCS compute capability
–128 GB of RAM per node with FDR IB (56 Gbps) or greater
–Benchmarks used in procurement include GEOS5 and WRF
Target delivery date: ~October 2014
(From the NCCS User Forum, July 22, 2014)

Letter to NCCS Users

The NCCS is committed to providing the best possible high performance solutions to meet the NASA science requirements. To this end, the NCCS is undergoing major integration efforts over the next several months to dramatically increase both the overall compute and storage capacity within the Discover cluster. The end result will increase the total number of processors by over 20,000 cores while increasing the peak computing capacity by almost a factor of 3x!

Given the budgetary and facility constraints, the NCCS will be removing parts of Discover to make room for the upgrades. The charts shown on this web page (PUT URL FOR INTEGRATION SCHEDULE HERE) show the current time schedules and the impacts of changes to the cluster environment. The decommissioning of Scalable Compute Unit 7 (SCU7) has already begun and will be complete by early November. After the availability of the new Scalable Compute Unit 10 (SCU10), the removal of Scalable Compute Units 1 through 4 will occur later this year. The end result will be the removal of all Intel Westmere processors from the Discover environment by the end of the 2014 calendar year.

While we are taking resources out of the environment, users may run into longer wait times as the new systems are integrated into operations. In order to alleviate this potential issue, the NCCS has coordinated with projects that are currently using dedicated systems in order to free up resources for general processing. Given the current workload, we are confident that curtailing the dedication of resources for specialized projects will keep the wait times at their current levels.

The NCCS will be communicating frequently with our user community throughout the integration efforts. Email will be sent out with information about the systems that are being taken offline and added. This web page, while subject to change, will be updated frequently, and as always, users are welcome to contact the support desk with any questions.

There is never a good time to remove computational capabilities, but the end result will be a major boost to the overall science community. Throughout this process, we are committed to doing everything possible to work with you to get your science done. We are asking for your patience as we work through these changes to the environment, and we are excited about the future science that will be accomplished using the NCCS resources!

Sincerely,
The NCCS Leadership Team

Discover Xeon Sandy Bridge Nodes: SCU8 and SCU9 (weekly timeline chart, late October through December 2014)
–SCU8 (SLES11 SP1, 480 nodes, 7,680 cores, Intel Sandy Bridge, 160 TF peak): No changes will be made to SCU8 throughout this time period. SCU8 will be available for general use.
–SCU9 (SLES11 SP1, 480 nodes, 7,680 cores, Intel Sandy Bridge, 160 TF peak): No changes will be made to SCU9 throughout this time period. This system has been dedicated to a specific project, but will be made available for general use starting in early November.

Discover Compute: Decommissioning and Integration Schedule (weekly timeline chart, late October through December 2014)
–SCU7 decommissioning (SLES11 SP1, 1,200 nodes, 14,400 cores, Intel Westmere, 161 TF peak): The first 200 nodes of Scalable Unit 7 (SCU7, installed 2010) will be removed the week of October 27th. The rest of SCU7 will be removed the week of November 3rd. The space vacated by SCU7 will be used to house the new SCU10 compute nodes.
–SCU10 integration (SLES11 SP3, 1,080 nodes, 30,240 cores, Intel Haswell, 1,229 TF peak): Delivery of the system is scheduled for November 12th. It will take about one week for the vendor to cable the system and another week to perform the initial burn-in of the equipment. After that, the NCCS will provision the system with Discover images and integrate it with the storage, followed by pioneer access. The target for general access is mid December.
–SCU11 integration (SLES11 SP3, 600 nodes, 16,800 cores, Intel Haswell, 683 TF peak): The schedule for the next 600 Haswell nodes (SCU11) is still being worked. The NCCS is targeting delivery by the middle of December; this is subject to change. Removal of the final 516 Westmere nodes will coincide with SCU11 integration.
–SCU1, 2, 3, 4 decommissioning (SLES11 SP1, 1,032 nodes, 12,384 cores, Intel Westmere, 139 TF peak): To make room for the new SCU11 compute nodes, the nodes of Scalable Units 1 through 4 (installed in 2011) will be removed from operations, in two groups of 516 nodes, during the middle of December. The removal of the first half of these nodes will coincide with general access to SCU10.

Discover and Mass Storage Disk Augmentations (weekly timeline chart, late October through December 2014)
–Dirac mass storage disk expansion (5,080 TB raw, 4,064 TB usable): Additional disk capacity for mass storage will be delivered at the beginning of November. This equipment will run through a variety of tests before being put into operations. Once in operations, user directories on Dirac will be migrated to the new storage; the system administrators will coordinate the movement of file systems with users. The additional capacity will dramatically increase the residence time of data on disk as new data is stored on Dirac, and it will make recalls of recently stored data much faster.
–Discover storage expansion (10,080 TB raw, 8,064 TB usable): Additional disk capacity for Discover will be delivered at the beginning of December. This equipment will run through a variety of tests before being put into operations. This disk environment will be used for the downscaling experiments and for general use.

Total Number of Discover Intel Xeon Processor Cores as a Function of Time [chart]

Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only) [chart]