Discover Cluster Upgrades: Hello Haswells and SLES11 SP3, Goodbye Westmeres February 3, 2015 NCCS Brown Bag
NASA Center for Climate Simulation
Agenda
Discover Cluster Hardware Changes & Schedule – Brief Update
Using Discover SCU10 Haswell / SLES11 SP3
Q & A
Discover: Haswells & SLES11 SP3, Feb. 3, 2015
DISCOVER HARDWARE CHANGES & SCHEDULE UPDATE
Discover’s New Intel Xeon “Haswell” Nodes
Discover’s Intel Xeon “Haswell” nodes:
28 cores per node, 2.6 GHz
Usable memory: 120 GB per node, ~4.25 GB per core (128 GB total)
FDR InfiniBand (56 Gbps), 1:1 blocking
SLES11 SP3
NO SWAP space, but DO have lscratch and shmem disk space
SCU10:
  – 720* Haswell nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total
  * Up to 360 of the 720 nodes may be episodically allocated for priority work
SCU11:
  – ~600 Haswell nodes, 16,800 cores total, 683 TFLOPS peak
Discover Hardware Changes in a Nutshell
January 30, 2015 (-70 TFLOPS):
  – Removed: 516 Westmere (12-core) nodes (SCU3, SCU4)
February 2, 2015 (+806 TFLOPS for general work):
  – Added: ~720* Haswell (28-core) nodes (2/3 of SCU10)
  * Up to 360 of the 720 nodes may be episodically allocated to a priority project
Week of February 9, 2015 (-70 TFLOPS):
  – Removed: 516 Westmere (12-core) nodes (SCU1, SCU2)
  – Removed: 7 oldest (‘Dunnington’) Dalis (dali02-dali08)
Late February/early March 2015 (+713 TFLOPS for general work):
  – Added: 600 Haswell (28-core) nodes (SCU11)
,180 ~1,100 1,804 TFLOPS for General User Work
Discover Node Count for General Work – Fall/Winter Evolution
Discover Processor Cores for General Work – Fall/Winter Evolution
Oldest Dali Nodes to Be Decommissioned
The oldest Dali nodes (dali02 – dali08) will be decommissioned starting February 9 (plenty of newer Dali nodes remain).
You should see no impact from the decommissioning of old Dali nodes, provided you have not been explicitly specifying one of the dali02 – dali08 node names when logging in.
USING DISCOVER SCU10 AND HASWELL / SLES11 SP3
How to use SCU10
Haswell nodes on SCU10 are available in the sp3 partition.
To be placed on a login node with the SP3 development environment, after providing your NCCS LDAP password, specify “discover-sp3” at the “Host” prompt:
  Host: discover-sp3
However, you may submit to the sp3 partition from any login node.
How to use SCU10
To submit a job to the sp3 partition, use either:
  – Command line:
      sbatch --partition=sp3 --constraint=hasw
  – Or inline directives:
      #SBATCH --partition=sp3
      #SBATCH --constraint=hasw
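As an illustration of the inline-directive form, a minimal job script might look like the sketch below. The task count and time limit are assumptions chosen only for the example. The script body just reports two environment variables that SLURM normally sets inside a job (SLURM_JOB_ID and SLURM_NTASKS).

```shell
#!/bin/bash
# Directives from the slide: run on the Haswell nodes in the sp3 partition.
#SBATCH --partition=sp3
#SBATCH --constraint=hasw
# Assumed values for illustration only (not from the slides):
#SBATCH --ntasks=56
#SBATCH --time=00:10:00

# Report what SLURM allocated; falls back to "unset" when run outside a job.
msg="Job ${SLURM_JOB_ID:-unset} has ${SLURM_NTASKS:-unset} tasks"
echo "$msg"
```

Saved under a name of your choosing, the script would be passed to sbatch as its final argument.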
Porting your work: the fine print…
There is a small (but non-zero) chance your scripts and binaries will run with no changes at all.
Nearly all scripts and binaries will require changes to make best use of SCU10, sooo…
With great power comes great responsibility. - Ben Parker (2002)
Adjust for new core count
Haswell nodes have 28 cores, 128 GB
  –>x2 memory/core from Sandy Bridge
Specify total cores/tasks needed, not nodes.
  – Example: for Sandy Bridge nodes:
      #SBATCH --ntasks=800
    Not
      #SBATCH --nodes=50
This allows SLURM to allocate whatever resources are available.
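To see why a fixed --nodes value does not carry over between node types, the arithmetic can be sketched as follows. The helper nodes_needed is hypothetical (not an NCCS or SLURM tool) and assumes one task per core.

```shell
# nodes_needed NTASKS CORES_PER_NODE
# Hypothetical helper: smallest number of nodes that holds NTASKS tasks
# at 1 task per core (integer ceiling division).
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 800 16   # Sandy Bridge, 16 cores per node
nodes_needed 800 28   # Haswell, 28 cores per node
```

The same 800 tasks fill 50 Sandy Bridge nodes but only 29 Haswell nodes, so a script that hard-codes --nodes=50 would over-request on SCU10.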
If you must control the details…
… still don’t use --nodes.
If you need more than ~4 GB/core, use fewer cores/node.
    #SBATCH --ntasks-per-node=N…
  – Assumes 1 task/core (the usual case).
Or specify required memory:
    #SBATCH --mem-per-cpu=N_MB…
SLURM will figure out how many nodes are needed to meet this specification.
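A rough way to pick N for --ntasks-per-node is to divide the usable node memory by the memory each task needs. The sketch below uses a hypothetical helper, max_tasks_per_node, and assumes the 120 GB usable figure quoted earlier corresponds to 122880 MB; the memory SLURM actually reports for a node may differ.

```shell
# max_tasks_per_node MEM_PER_TASK_MB [NODE_MEM_MB] [CORES_PER_NODE]
# Hypothetical helper: how many tasks fit on one node given per-task memory,
# capped at the core count (1 task per core).
max_tasks_per_node() {
    mem_per_task_mb=$1
    node_mem_mb=${2:-122880}   # assumption: 120 GB usable = 120 * 1024 MB
    cores=${3:-28}             # Haswell core count from the slides
    n=$(( node_mem_mb / mem_per_task_mb ))
    if [ "$n" -gt "$cores" ]; then
        n=$cores
    fi
    echo "$n"
}

max_tasks_per_node 8192   # tasks needing 8 GB each
max_tasks_per_node 2000   # small tasks: limited by cores, not memory
```

The result could then be used as the N in --ntasks-per-node, or the per-task figure could be passed directly as --mem-per-cpu so that SLURM does the same calculation.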
Script changes summary
Avoid specifying --partition unless absolutely necessary.
  – And sometimes not even then…
Avoid specifying --nodes.
  – Ditto.
Let SLURM do the work for you.
  – That’s what it’s there for, and it allows for better resource utilization.
Source code changes
You might not need to recompile…
  – … but SP3 upgrade may require it.
SCU10 hardware is brand-new, possibly needing a recompile.
  – New features, e.g. AVX2 vector registers
  – SGI nodes, not IBM
  – FDR vs QDR InfiniBand
  – NO SWAP SPACE!
And did I mention…
… NO SWAP SPACE!
This is critical.
  – When you run out of memory now, you won’t start to swap – your code will throw an exception.
Ameliorated by higher GB/core ratio…
  – … but we still expect some problems from this.
Use policeme to monitor the memory requirements of your code.
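policeme is the NCCS tool for this job. As a generic complement, the sketch below reads the standard Linux /proc files to confirm that a node has no swap and to report a process's peak resident memory. The helper names are hypothetical, and the approach assumes a Linux /proc filesystem; it is not a description of how policeme works.

```shell
# swap_total_kb [MEMINFO_FILE]
# Hypothetical helper: configured swap in kB (0 means no swap space).
swap_total_kb() {
    awk '/^SwapTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# peak_rss_kb STATUS_FILE
# Hypothetical helper: peak resident set size (VmHWM) in kB, read from a
# /proc/<pid>/status file.
peak_rss_kb() {
    awk '/^VmHWM:/ {print $2}' "$1"
}

# Example calls on a Linux node (left commented so the block has no side effects):
#   swap_total_kb
#   peak_rss_kb /proc/$$/status
```

Comparing the peak figure against the per-core memory available on a Haswell node gives an early warning before a job hits the no-swap limit.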
If you do recompile…
Current working compiler modules:
  – All Intel C compilers (ifortran not tested yet)
  – gcc 4.5, 4.81, 4.91
  – g
Current working MPI modules:
  – SGI MPT
  – Intel and later
  – MVAPICH2 1.81, 1.9, 1.9a, 2.0, 2.0a, 2.1a
  – OpenMPI: 1.8.1, 1.8.2,
MPI “gotchas”
Programs using old Intel MPI must be upgraded.
MVAPICH2 and OpenMPI have only been tested on single-node jobs.
All MPI modules (except SGI MPT) may experience stability issues when node counts are >~300.
  – Symptom: Abnormally long MPI teardown times.
cron jobs
discover-cron is still at SP1.
  – When running SP3-specific code, may need to ssh to SP3 node for proper execution.
  – Not extensively tested yet.
Sequential job execution
Jobs may not execute in submission order.
  – Small and interactive jobs favored during the day.
  – Large jobs favored at night.
If execution order is important, the dependencies must be specified to SLURM.
Multiple dependencies can be specified with the --dependency option.
  – Can depend on start, end, failure, error, etc.
Dependency example
    # String to hold the job IDs.
    job_ids=''
    # Submit the first parallel processing job, save the job ID.
    job_id=`sbatch | cut -d ' ' -f 4`
    job_ids="$job_ids:$job_id"
    # Submit the second parallel processing job, save the job ID.
    job_id=`sbatch | cut -d ' ' -f 4`
    job_ids="$job_ids:$job_id"
    # Submit the third parallel processing job, save the job ID.
    job_id=`sbatch | cut -d ' ' -f 4`
    job_ids="$job_ids:$job_id"
    # Wait for the processing jobs to finish successfully, then
    # run the post-processing job.
    sbatch --dependency=afterok$job_ids
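The same pattern can be wrapped in two small helpers, sketched below. Both helper names and the job script names in the usage comments are hypothetical. The parsing assumes sbatch prints its usual "Submitted batch job <id>" line, which is what the cut command in the example above relies on.

```shell
# job_id_from_output "SBATCH_OUTPUT"
# Hypothetical helper: extract the job ID (4th space-separated field)
# from a line such as "Submitted batch job 12345".
job_id_from_output() {
    echo "$1" | cut -d ' ' -f 4
}

# build_dependency TYPE ID...
# Hypothetical helper: join job IDs into a --dependency value,
# e.g. afterok:101:102:103.
build_dependency() {
    dep=$1
    shift
    for id in "$@"; do
        dep="$dep:$id"
    done
    echo "$dep"
}

# Usage sketch (script names are placeholders, left commented out):
#   id1=$(job_id_from_output "$(sbatch process_a.sh)")
#   id2=$(job_id_from_output "$(sbatch process_b.sh)")
#   sbatch --dependency="$(build_dependency afterok "$id1" "$id2")" postprocess.sh
```

Building the dependency string in one place makes it easy to swap afterok for another dependency type without touching the submission lines.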
Coming attraction: shared nodes
SCU10 nodes will initially be exclusive: 1 job/node
This is how we roll on discover now.
May leave a lot of unused cores and/or memory.
Eventually, SCU10 nodes (and maybe others) will be shared among jobs.
  – Same or different users.
What does this mean?
Shared nodes (future)
You will no longer be able to assume that all of the node resources are for you.
Specifying task and memory requirements will ensure SLURM gets you what you need.
Your jobs must learn to “work and play well with others”.
  – Unexpected job interactions, esp. with I/O, may cause unusual behavior when nodes are shared.
Shared nodes (future, continued)
If you absolutely must have a minimum number of CPUs in a node, the --mincpus=N option to sbatch will ensure you get it.
Questions & Answers NCCS User Services: Thank you
SUPPLEMENTAL SLIDES
Discover Compute Nodes, February 3, 2015 (Peak ~1,629 TFLOPS)
“Haswell” nodes, 28 cores per node, 2.6 GHz (new)
  – SLES11 SP3
  – SCU10, 4.5 GB memory per core (new)
      720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total
      (*360 nodes episodically allocated for priority work)
“Sandy Bridge” nodes, 16 cores per node, 2.6 GHz (no change)
  – SLES11 SP1
  – SCU8, 2 GB memory per core
      480 nodes, 7,680 cores, 160 TFLOPS peak
  – SCU9, 4 GB memory per core
      480 nodes, 7,680 cores, 160 TFLOPS peak
“Westmere” nodes, 12 cores per node, 2 GB memory per core, 2.6 GHz
  – SLES11 SP1
  – SCU1, SCU2 (SCUs 3, 4, and 7 already removed)
      516 nodes, 6,192 cores total, 70 TFLOPS peak
Discover Compute Nodes, March 2015 (Peak ~2,200 TFLOPS)
“Haswell” nodes, 28 cores per node
  – SLES11 SP3
  – SCU10, 4.5 GB memory per core
      720* nodes general use (1,080 nodes total), 30,240 cores total, 1,229 TFLOPS peak total
      (*360 nodes episodically allocated for priority work)
  – SCU11, 4.5 GB memory per core (new)
      ~600 nodes, 16,800 cores total, 683 TFLOPS peak
“Sandy Bridge” nodes, 16 cores per node (no change)
  – SLES11 SP1
  – SCU8, 2 GB memory per core
      480 nodes, 7,680 cores, 160 TFLOPS peak
  – SCU9, 4 GB memory per core
      480 nodes, 7,680 cores, 160 TFLOPS peak
No remaining “Westmere” nodes
Discover COMPUTE (timeline chart, January – March 2015)
SCU10 Integration (SLES11 SP3, 1,080 nodes, 30,240 cores, Intel Haswell, 1,229 TF peak)
  – SCU10 General Access: +720* Nodes
  – SCU10 arrived in mid-November. Following installation & resolution of initial power issues, the NCCS provisioned SCU10 with Discover images and integrated it with GPFS storage. NCCS stress testing and targeted high-priority use occurred in January 2015. (*360 nodes episodically allocated for priority work)
SCU 8 and 9 (SLES11 SP1, 960 nodes, 15,360 cores, Intel Sandy Bridge, 320 TF peak)
  – No changes during this period (January – March 2015). In November 2014, 480 nodes previously allocated for a high-priority project were made available for all user processing.
SCU11 Integration (SLES11 SP3, 600 nodes, 16,800 cores, Intel Haswell, 683 TF peak)
  – Physical Installation, Configuration, Stress Testing, then SCU11 General Access: +600 Nodes
  – SCU 11 (600 Haswell nodes) has been delivered, and will be installed starting Feb. 9th. Then the NCCS will provision the system with Discover images and integrate it with GPFS storage. Power and I/O connections from Westmere SCUs 1, 2, 3, and 4 are needed for SCU11. Thus, SCUs 1, 2, 3, and 4 must be removed prior to SCU11 integration.
SCU 1, 2, 3, 4 Decommissioning (SLES11 SP1, 1,032 nodes, 12,384 cores, Intel Westmere, 139 TF peak)
  – Drain: 516 Nodes, Remove: 516 Nodes (twice, one half at a time)
  – To make room for the new SCU11 compute nodes, the nodes of Scalable Units 1, 2, 3, and 4 (12-core Westmeres installed in 2011) are being removed from operations during February. Removal of half of these nodes will coincide with the general access to SCU10, the remaining half during installation of SCU11.
Discover “SBU” Computational Capacity for General Work – Fall/Winter Evolution
Total Discover Peak Computing Capability as a Function of Time (Intel Xeon Processors Only)
Total Number of Discover Intel Xeon Processor Cores as a Function of Time
Storage Augmentations
Dirac (Mass Storage) Disk Augmentation
  – 4 Petabytes usable (5 Petabytes “raw”), installed
  – Gradual data move: starts week of February 9 (many files, “inodes” to move)
Discover Storage Expansion
  – 8 Petabytes usable (10 Petabytes “raw”), installed
  – For both general use and targeted “Climate Downscaling” project
  – Phased deployment, including optimizing the arrangement of existing project and user nobackup space