The performance of NAMD on a large Power4 system
Joachim Hein
EPCC, The University of Edinburgh
Measurement-based load balancing
- NAMD measures its performance for the first 200 steps
- It then redistributes the workload to optimise performance
- Performance benefit for larger numbers of processors
- Benchmark time: better estimate for production jobs from short jobs
NAMD on Power4 11 May 2019
Load balance example
- 128 CPUs, 96769 atoms, 32000 iterations
- All but one CPU lie in a narrow window
- Effect of the “slow” guy is negligible
MP_EAGER_LIMIT
- The environment variable MP_EAGER_LIMIT changes the behaviour of MPI
- Messages smaller than MP_EAGER_LIMIT are sent immediately
- Messages larger than MP_EAGER_LIMIT are sent using a “hand-shake”
- The default value is small and not optimal for NAMD
- Tune it!
Sample LoadLeveler script
    #@ shell = /bin/ksh
    #@ job_type = parallel
    #@ network.MPI = csss,shared,us
    #@ account_no = z001
    #@ output = namd_run.$(schedd_host)_$(jobid).out
    #@ error = namd_run.$(schedd_host)_$(jobid).err
    #@ wall_clock_limit = 00:30:00
    #@ node = 1
    #@ tasks_per_node = 8
    #@ queue
    export MP_SHARED_MEMORY=yes
    export MP_EAGER_LIMIT=65536
    poe path/namd2 inputfile
- MP_SHARED_MEMORY=yes: communication via shared memory
- MP_EAGER_LIMIT=65536: setting MP_EAGER_LIMIT
- Set path and inputfile
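Since the best MP_EAGER_LIMIT has to be found by trial, one way to tune it is to run the same short job with several values and compare the time per step. The following is a minimal sketch under stated assumptions: the candidate values, the function name make_job and the file names namd_eager_*.ll are illustrative and not from the talk; path and inputfile remain placeholders as in the sample script.

```shell
#!/bin/sh
# Sketch: write one LoadLeveler job file per candidate MP_EAGER_LIMIT.
# Candidate values and file names are illustrative assumptions.

make_job() {
    # $1 = eager limit in bytes; prints a job file based on the sample script
    cat <<EOF
#@ shell = /bin/ksh
#@ job_type = parallel
#@ network.MPI = csss,shared,us
#@ account_no = z001
#@ output = namd_eager_$1.\$(schedd_host)_\$(jobid).out
#@ error = namd_eager_$1.\$(schedd_host)_\$(jobid).err
#@ wall_clock_limit = 00:30:00
#@ node = 1
#@ tasks_per_node = 8
#@ queue
export MP_SHARED_MEMORY=yes
export MP_EAGER_LIMIT=$1
poe path/namd2 inputfile
EOF
}

for limit in 4096 16384 65536; do
    make_job "$limit" > "namd_eager_$limit.ll"
    # llsubmit "namd_eager_$limit.ll"   # uncomment to submit on a LoadLeveler system
done
```

Each generated file differs only in the MP_EAGER_LIMIT value and in the names of its output and error files, so the timings reported by the runs can be compared directly.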
Benchmarks
- Joint Amber Charm (JAC) benchmark: dihydrofolate reductase in water, 23558 atoms, www.scripps.edu/brooks/Benchmarks
- Apo A-1 benchmark: apolipoprotein A-1, 92224 atoms, www.ks.uiuc.edu/Research/apoa1
- TCR peptide-MHC: 96796 atoms, www.hpcx.ac.uk/about/newsletter/HPCxNews02.pdf
- F1-ATP synthase: F1 subunit of ATP synthase, 327506 atoms, www.sc-2002.org/paperpdfs/pap.pap277.pdf
The HPCx system
Presently:
- 40 IBM p690 Regatta H frames
- 32 POWER4 processors per frame (1.3 GHz)
- Frames subdivided into LPARs of 8 processors
- 8 GB of main memory per LPAR
- IBM SP Switch2 (Colony) network
- 2 switch adapters per LPAR
- Dual plane
Future (Summer 2004):
- Upgrade to p690+ frames (1.7 GHz)
- LPARs of 32 processors
- IBM HPS (Federation) network
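The totals implied by the present configuration follow from simple arithmetic. The sketch below assumes that all 40 frames are configured identically, which the slide suggests but does not state; the variable names are illustrative.

```shell
#!/bin/sh
# Figures taken from the slide (present configuration)
frames=40
procs_per_frame=32
procs_per_lpar=8
mem_per_lpar_gb=8

# Derived totals, assuming every frame is partitioned the same way
total_procs=$((frames * procs_per_frame))
lpars_per_frame=$((procs_per_frame / procs_per_lpar))
total_lpars=$((frames * lpars_per_frame))
total_mem_gb=$((total_lpars * mem_per_lpar_gb))

echo "processors:      $total_procs"
echo "LPARs per frame: $lpars_per_frame"
echo "LPARs in total:  $total_lpars"
echo "main memory:     $total_mem_gb GB"
```

Under that assumption the system has 1280 processors in 160 LPARs with 1280 GB of main memory in total.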
Time per step for 32 processors
- dhf reductase (23558 atoms): NAMD 2.4 0.051 s, NAMD 2.5 0.032 s; too small for 32 CPUs
- APO A-1 (92224 atoms): NAMD 2.4 0.28 s, NAMD 2.5 0.19 s
- TCR MHC (96796 atoms): NAMD 2.4 0.30 s, NAMD 2.5 0.21 s
- F1-ATP (327506 atoms): 0.58 s
NAMD 2.5 is substantially faster than NAMD 2.4
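To put a number on "substantially faster", the ratio of the NAMD 2.4 time per step to the NAMD 2.5 time per step can be computed for the three benchmarks that list both values. This is a small sketch; the helper name speedup is illustrative.

```shell
#!/bin/sh
# speedup T_OLD T_NEW: time per step of the old version divided by the new one
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

echo "dhf reductase: $(speedup 0.051 0.032)"
echo "APO A-1:       $(speedup 0.28 0.19)"
echo "TCR MHC:       $(speedup 0.30 0.21)"
```

With the times listed above the ratios are 1.59, 1.47 and 1.43, so on 32 processors NAMD 2.5 is roughly 1.4 to 1.6 times faster per step than NAMD 2.4.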
Large number of processors
Further reading
- Full technical report: Joachim Hein, "The performance of NAMD on HPCx", www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0310.pdf