CSU-CU Summit High-Performance Computing System
General Info
Websites:
- hpc.colostate.edu (this presentation will be posted under the "News" link)
- rc.colorado.edu
Summit: Schematic Rack Layout
[Schematic: one storage rack plus compute racks 1-7]
- Storage rack: 1 PB scratch GPFS (DDN SFA14K); Ethernet mgt. nodes; OmniPath leaf and core nodes; gateway nodes
- Compute racks: Intel Haswell nodes (376); Nvidia K80 GPU nodes (10); Intel Knights Landing Phi nodes (20); HiMem nodes (5, 2 TB RAM / node); OPA fabric
Note: actual rack layout may differ from this schematic
CPU Nodes
- 380 CPU nodes; Dell PowerEdge C6320
- 9,120 Intel Haswell CPU cores
- 2X Intel Xeon E5-2680v3; 2.5 GHz
- 24 CPU cores / node
- 128 GB RAM / node
- 5.3 GB RAM / CPU core
- 200 GB SATA SSD / node
GPU Nodes
- 10 GPU nodes; Dell PowerEdge C4130
- 99,840 GPU cores
- 2X Nvidia K80 GPU cards / node
- 2X Intel Xeon E5-2680v3; 2.5 GHz
- 24 CPU cores / node
- 128 GB RAM / node
- 5.3 GB RAM / CPU core
- 200 GB SATA SSD / node
Knights Landing Nodes
- 20 KnL nodes
- 1,440 KnL-F cores
- 72 Silvermont/Atom cores / node; 1.3 GHz
- 16 GB HBM (high-bandwidth memory): 3D-stacked MCDRAM (multi-channel DRAM)
- 384 GB DDR4 platform RAM
- 200 GB SATA SSD / node
- Delivery: Q3 2017
GPU / KnL-F
GPU and KnL-F computing will be addressed in separate workshops
HiMem Nodes
- 5 HiMem nodes; Dell PowerEdge R930
- 4X Intel Xeon E7-4830v3; 2.1 GHz
- 2 TB RAM / node (DDR4)
- 48 CPU cores / node
- 42 GB RAM / CPU core
- 200 GB SAS SSD / node
- 12 TB SAS HDD / node
Scratch Storage
- 1 petabyte (PB) scratch storage
- DDN SFA14K block storage appliance
- 21 GB/s sequential R/W; 6M random 4K IOPS
- RAID6 array
- GPFS (General Parallel File System; IBM Spectrum Scale): high-speed parallel I/O
- Quota: initially 10 TB / account; request an increase by emailing rc-help@colorado.edu
Interconnect
- OmniPath
- 100 Gb/s bandwidth; 1.5 µs latency
- Fat-tree topology
- 2:1 oversubscription (cost-performance tradeoff)
Accounts
See hpc.colostate.edu, "Get Started":
1) Get a CSU eID (eid.colostate.edu) OR a CSU Associate's eID
2) Fill in the Account Application Form on the "Get Started" page
3) Set up DUO two-factor authentication (PDF online)
4) Request a CU-Boulder account (rcamp.rc.colorado.edu/accounts/account-request/create/general)
   - Organization: Colorado State University
   - Username: CSU eID
   - Password: CSU password,DUO_key
   - Role: choose anything
   - Preferred login shell: leave it as "bash"
   - Check "Summit supercomputing cluster"
5) Receive account confirmation email
NOTE: the DUO_key cycles every 15 seconds
Allocations
1 Service Unit (SU) = 1 core-hour (i.e. full utilization of a single core on 1 compute node for 1 hour)
- 380 CPU nodes = ca. 80M SU/yr
- 10 GPU nodes = ca. 5M SU/yr
- 5 HiMem nodes = ca. 12M SU/yr
- Total = ca. 97M SU/yr
- RMACC gets 10% = 9.7M SU/yr
- UCB gets 75% of the non-RMACC share = 65M SU/yr
- CSU gets 25% of the non-RMACC share = 22M SU/yr
Two types of allocation:
1) Initial allocation: all new accounts get 50K SUs; expires after 1 year
2) Project allocation: if you need >50K SUs / yr, request a Project allocation
   - Submit the request form (PDF online)
   - Reviewed by the Management & Allocations committee
Remote Login
SSH:
>ssh csu_eName@colostate.edu@login.rc.colorado.edu
Password: csu_ePassword,push
OR
Password: csu_ePassword,DUO_key
SSH client software:
- Apple macOS: Terminal (en.wikipedia.org/wiki/Terminal_(macOS))
- Windows: PuTTY (putty.org)
- Linux: Terminal (www.digitalocean.com/community/tutorials/an-introduction-to-the-linux-terminal)
Remote Login
After login, always enter:
>ssh scompile
- scompile nodes = Haswell CPU compute nodes
- users are load-balanced among multiple nodes
- use these nodes to compile & test code
- this step will no longer be required after the CU Janus system is decommissioned (later this year)
File Transfer: Slow
SFTP:
>sftp csu_eName@colostate.edu@login.rc.colorado.edu
Password: csu_ePassword,push OR csu_ePassword,DUO_key
Also: FileZilla (filezilla-project.org), PuTTY (putty.org)
File Transfer: Fast
GLOBUS (globus.org): fast parallel file transfer
- CU Summit is ready with Globus
- CSU is NOT quite ready with Globus
- At CSU, you can use Globus Connect Personal: create a Globus "personal endpoint" on a workstation, server, etc.
- Bandwidth is limited by the slowest link, usually 1 Gb/s Ethernet
- Endpoints outside CSU can also be used
File Transfer: Fast
Usage: go to globus.org
(screenshots of the Globus web interface)
File Transfer: Fast
Endpoints:
- Local endpoint: csu_cray
  - Server: cray2.colostate.edu
  - Username: cray_userid
  - Password: cray_password
- Remote endpoint: CU-Boulder Research Computing
  - Server: dtn02.rc.colorado.edu:7512
  - Username: csu_eName@colostate.edu
  - Password: csu_ePassword,push
Directories - Files
- /home/eName@colostate.edu: 2 GB, permanent, daily incremental backup
- /projects/eName@colostate.edu: 250 GB, permanent, daily incremental backup
- /scratch/summit/eName@colostate.edu: 1 PB total, purged after 90 days, no backup; 20 TB default quota (to request a quota increase, send a message to rc-help@colorado.edu)
- /scratch/local: 200 GB SSD, local to individual nodes, no backup
Modules
Linux modules simplify shell environment & software management
Lmod: hierarchical modules ("ml" = shorthand for "module")
- module list / ml list : show currently loaded modules
- ml avail : show available modules (with dependencies)
- ml spider : show all available modules
- ml module_name1 : load module "name1"
- ml unload module_name : unload module "name"
- ml module_name2 : swap out "name1", swap in "name2" (if they conflict)
- ml help module_name : description of module "name"
- ml show module_name : show paths etc. for module "name"
- ml help : general help
Modules
>ml avail
------ Compilers ------
gcc/6.1.0          intel/16.0.3 (m,D)   intel/17.0.0 (m)        pgi/16.5
------ Independent Applications ------
R/3.3.0            cuda/7.5.18 (g)      gnu_parallel/20160622   matlab/R2016b
allinea/6.0.4 (m)  cuda/8.0.61 (g,D)    idl/8.5                 ncl/6.3.0
autotools/2.69     cudnn/4.0 (g)        jdk/1.7.0               papi/5.4.3
cmake/3.5.2        cudnn/5.1 (g,D)      jdk/1.8.0 (D)           paraview/5.0.1
cube/3.4.3         expat/2.1.1          loadbalance/0.1         pdtoolkit/3.22
cube/4.3.4 (D)     git/2.8.3            mathematica/9.0         perl/5.24.0
Where:
g: built for GPU
m: built for host and native MIC
D: default module
Modules
>ml gcc
>ml avail
----- MPI Implementations -----
impi/5.1.3.210     openmpi/1.10.2 (D)   openmpi/2.0.1
----- Compiler Dependent Applications -----
antlr/2.7.7        fftw/3.3.4           geos/3.5.0        gsl/2.1       jasper/1.900.1   mkl
atlas/3.10.2       gdal/2.1.0           grib_api/1.15.0   hdf5/1.10.0   jpeg/9b          nco
------ Compilers ------
gcc/6.1.0 (L)      intel/16.0.3 (m,D)   intel/17.0.0 (m)        pgi/16.5
------ Independent Applications ------
R/3.3.0            cuda/7.5.18 (g)      gnu_parallel/20160622   matlab/R2016b
allinea/6.0.4 (m)  cuda/8.0.61 (g,D)    idl/8.5                 ncl/6.3.0
autotools/2.69     cudnn/4.0 (g)        jdk/1.7.0               papi/5.4.3
cmake/3.5.2        cudnn/5.1 (g,D)      jdk/1.8.0 (D)           paraview/5.0.1
cube/3.4.3         expat/2.1.1          loadbalance/0.1         pdtoolkit/3.22
cube/4.3.4 (D)     git/2.8.3            mathematica/9.0         perl/5.24.0
Where:
g: built for GPU
L: module is loaded
m: built for host and native MIC
D: default module
Compilers
- icc : Intel C compiler
- icpc : Intel C++ compiler
- ifort : Intel Fortran compiler
- gcc : GNU C compiler
- g++ : GNU C++ compiler
- gfortran : GNU Fortran compiler
- pgcc : PGI C compiler
- pgCC : PGI C++ compiler
- pgfortran : PGI Fortran compiler
Compilers - Interpreters
- ml intel; ml impi; mpicc : Intel MPI C compiler
- ml intel; ml impi; mpicxx : Intel MPI C++ compiler
- ml intel; ml impi; mpif90 : Intel MPI Fortran compiler
- ml openmpi; mpicc : OpenMPI C compiler
- ml openmpi; mpicxx : OpenMPI C++ compiler
- ml openmpi; mpif90 : OpenMPI Fortran compiler
- nvcc : Nvidia CUDA compiler
- python : Python interpreter
- perl : Perl interpreter
Debug
GNU gdb debugger (gnu.org); search the web for tutorials, cheat sheets, etc.
>ml gcc
>gcc -g -o hello hello.c
>gdb hello
(gdb) break hello.c:1
(gdb) run
(gdb) step
(gdb) continue
(gdb) quit
Intel, PGI: developing material for these compilers - check website
Valgrind (valgrind.org)
>ml valgrind
Performance Analysis
- Allinea (allinea.com)
- PAPI (icl.cs.utk.edu/papi/overview/)
- TotalView (roguewave.com/products-services/totalview)
- PerfSuite (perfsuite.ncsa.illinois.edu)
- TAU (cs.uoregon.edu/research/tau/home.php)
>ml allinea
>ml papi
>ml totalview
>ml perfsuite
>ml tau
Developing material for these tools - check website
Libraries
Intel MKL (Math Kernel Library)
- library of optimized math routines for Intel architectures
- BLAS, LAPACK, ScaLAPACK, FFT, vector math, etc.
>ssh scompile
>ml intel
>ml mkl
>icc -o hello hello.c -mkl
Interactive Jobs
>ssh scompile
>cd /projects/csu_eName@colostate.edu
>srun executable
Hello world!
>srun -n2 executable (-n = # of tasks; default 1 task)
>srun -n2 --cpus-per-task=1 executable (--cpus-per-task = # of CPUs per task)
>srun -N2 executable (-N = # of nodes)
>srun -t 00:01:00 executable (-t = runtime limit, HH:MM:SS)
>man srun (man pages)
Batch Queues
Slurm batch scheduler (slurm.schedmd.com)
Cheat sheet: slurm.schedmd.com/pdfs/summary.pdf
All users have access to all partitions/nodes:
- Haswell CPU
- Nvidia GPU
- Intel KnL-F
- HiMem
- Condo
Batch Queues
short name   long name        compute node type   default time   max time   QoS
shas         summit-haswell   (380 nodes)         4 hr           24 hr      N,D,C
sgpu         summit-gpu       (10 nodes)
sknl         summit-knl       (20 nodes)
smem         summit-himem     (5 nodes)                          (7 D)      N,D,L,C
Batch Queues
QoS          Description                                             Limits
normal (N)   Default                                                 Normal priority
debug (D)    Quick turnaround for testing                            Priority boost
long (L)     For jobs with long runtimes                             Normal priority
condo (C)    For users who purchased compute nodes ("condo model")   Priority boost (= 1 D wait in queue)
Batch Queues
Batch job files. Suppose we have a text batch file "filename":
#!/bin/bash
#SBATCH -J job_name                           #job name
#SBATCH -p shas                               #partition name
#SBATCH --qos debug                           #QoS
#SBATCH -t 01:00:00                           #wall clock time
#SBATCH --nodes 1                             #number of nodes
#SBATCH --mail-user csu_eName@colostate.edu   #mail recipient
#SBATCH --mail-type END                       #send mail at job finish
module load intel                             #load intel module
module load impi                              #load impi libraries
mpicc -o mpic mpic.c                          #compile
mpirun -n 1 ./mpic                            #run
Batch Queues
>sbatch filename (submit job)
>squeue (show job status - all jobs)
>squeue -u csu_eName@colostate.edu (show job status for user only)
>scancel jobid (cancel job; get jobid from squeue)
>sinfo (show partitions)
Fairshare Scheduler
An allocation is not a specific number of hours; it is a share of the computer
- Shares are averaged over a 4-week period
Motivation:
- Several universities are involved; instead of allocating time, everyone has an equal share of the machine
- Helps prevent the system from being idle
- You cannot run out of allocation time; accounts are never "frozen" or shut down
Fairshare Scheduler
FS scheduling uses a complex formula to determine priority in the queue:
- Examines the load for each user & balances utilization to fairly share resources
- Involves historical use by the user (averaged over a 4-week period)
- Considers QoS
- Considers how long the job has been in the queue
- Compares current use to the FS target and adjusts job queue priority
Fairshare Scheduler
- If you are under your target FS share -> queue priority is increased
- If you are over your target FS share -> queue priority is decreased
- This only impacts pending jobs in the queue: if there are no other pending jobs and enough resources are available, your job will run regardless of previous usage
- Encourages consistent, steady usage; discourages sporadic, "burst-like" usage
HiMem Nodes
- 5 HiMem compute nodes
- 2 TB RAM / node
In the Slurm batch script file add:
#SBATCH -p smem
MPI
Intel MPI
- optimized for Intel microprocessor architectures & the OmniPath interconnect
- based on MPICH; supports the MPI-3.1 standard
- distributed-memory message-passing libraries
Usage:
>ssh scompile
>ml intel
>ml impi

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello from pe %d of %d\n", rank, numprocs);
    MPI_Finalize();
    return 0;
}

>mpicc -o hello hello.c
>mpirun -n 2 ./hello
MPI
OpenMPI
- open-source implementation of the MPI standard; supports MPI-3.1
- distributed-memory message-passing libraries
Usage:
>ssh scompile
>ml gcc
>ml openmpi

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello from pe %d of %d\n", rank, numprocs);
    MPI_Finalize();
    return 0;
}

>mpicc -o hello hello.c
>mpirun -np 2 ./hello
OpenMP
OpenMP: open-source multithreading API; shared-memory libraries
Usage:
>ssh scompile
>ml gcc

Source code:
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf("hello(%d)\n", ID);
    }
    return 0;
}
OpenMP
Batch script file ("fname"):
#!/bin/bash
#SBATCH -J openmp
#SBATCH -p shas
#SBATCH --qos debug
#SBATCH -t 00:01:00
export OMP_NUM_THREADS=8
gcc -fopenmp -o hello hello.c
./hello

Submit job:
>sbatch fname
Output (thread ordering may vary between runs):
hello(1)
hello(0)
hello(2)
hello(3)
hello(4)
hello(5)
hello(6)
hello(7)
Condo Model
Researchers purchase:
• CPU compute nodes
• GPU accelerators (if applicable)
• KnL-F accelerators (if applicable; available Q3 2017)
• Memory
• Disk storage
Central IT provides:
• Data center facility
• Shared service nodes (i.e. login nodes)
• Shared OmniPath interconnect switches and cables
• Ethernet management switches and cables
• Shared scratch storage
• Server racks, power, cooling, security
• Purchase, order & installation of equipment
• OS installation
• System administration
• Assistance with software application installation
Condo Model
Condo jobs have the following privileges:
• request longer run times (up to 168 hrs (7 D))
• queue priority boost (= 1 D wait in queue)
• access to all nodes
To properly activate Condo shares, Condo users should send the following info to richard.casey@colostate.edu:
• full name
• csu_eName
• condo group ID
Your csu_eName will be added to the appropriate condo ID.
Condo Model
PI                  Dept.                    Condo group ID
Michael Antolin     Biology                  bio
Wolfgang Bangerth   Mathematics              mat
Asa Ben-Hur         Computer Science         hal
Stephen Guzik       Mechanical Engineering   cfd
Tony Rappe          Chemistry                akr
Chris Weinberger                             crw
Ander Wilson        Statistics               fhw
Condo Model
Ex.: suppose the following text is in file "filename":
#SBATCH -p shas
#SBATCH --qos condo
#SBATCH -A csu-summit-xxx
#SBATCH -t 40:00:00
- "shas" = Haswell CPU compute nodes
- "condo" = charges usage to the condo account (required); note the double dash for "qos"
- "xxx" = 3-digit condo ID (required)
- "40:00:00" = HH:MM:SS
>sbatch filename
Software Installation
Open source:
- github (github.com)
- sourceforge (sourceforge.net)
- other sites
Commercial:
- license fees (no $$ in CSU IT)
- license server (FlexLM)
No root account access; no sudo:
- redirect the install path to /projects/csu_eName@colostate.edu
Package managers etc.: yum, git, pip, rpm, Makefile, cmake, curl
"Dependency hell": libraries, compilers, applications, versions (major.minor.patch; semver.org)
Support
Useful Commands
Check current SU usage:
>sreport -n -t hours -P cluster AccountUtilizationByUser start=2017-01-01 tree user=csu_eName@colostate.edu
summit|csu-general|csu_eName@colostate.edu|First Last|20000
Check fairshare usage:
>sshare -U
Account      User        RawShares  NormShares  RawUsage  EffectvUsage  FairShare
csu-general  rmc7777@c+  1          0.500000    243       0.000000      0.802469
Support
Trouble tickets:
• Submit support requests to rc-help@colorado.edu
System status:
• To receive system updates and other announcements, send a message to rc-announce-join@colorado.edu
"Summit System User's Guide" on hpc.colostate.edu
Contacts:
Richard Casey, PhD • (970) 492-4127 • richard.casey@colostate.edu
Tobin Magle, PhD • Data Management Specialist • tobin.magle@colostate.edu • (970) 491-0517 • see http://lib.colostate.edu/services/data-management for more information