The performance of NAMD on a large Power4 system

Joachim Hein
EPCC, The University of Edinburgh

Measurement-based load balancing

- NAMD measures its own performance for the first 200 steps
- It then redistributes the workload to optimise performance
- This gives a performance benefit for larger numbers of processors
- The reported benchmark time is therefore a better estimate of production performance than the overall time of a short job

[Figure: measurement-based load balancing]
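To pick these timings out of a run, one can search the job output for NAMD's "Benchmark time" lines. This is only a sketch: the exact wording of the log line varies between NAMD versions, and the output file name assumed here is the one produced by the LoadLeveler script shown later in these slides.

# List NAMD's own per-step timing estimates, printed after the initial
# measurement and load-balancing steps. These are a better guide to
# production performance than the overall wall-clock time of a short job.
grep "Benchmark time" namd_run.*.out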

Load balance example

- 128 CPUs, 96769 atoms, 32000 iterations
- All but one of the CPUs lie in a narrow window
- The effect of the single "slow" CPU is negligible

MP_EAGER_LIMIT

- The environment variable MP_EAGER_LIMIT changes the behaviour of MPI
- Messages smaller than MP_EAGER_LIMIT are sent immediately (eager protocol)
- Messages larger than MP_EAGER_LIMIT are sent using a "hand-shake" (rendezvous) protocol
- The default value is small and not optimal for NAMD
- Tune it!

[Figure: MP_EAGER_LIMIT]
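The same tuning can be tried by hand before launching NAMD with poe, if interactive runs are permitted on the system. A minimal sketch, assuming 8 tasks and the same placeholder path and input file as the batch script on the next slide; the value 65536 is the one used in these slides, not a universal optimum.

export MP_PROCS=8            # number of MPI tasks for the interactive run
export MP_SHARED_MEMORY=yes  # communicate via shared memory inside the LPAR
export MP_EAGER_LIMIT=65536  # messages below 64 kB are sent eagerly
poe path/namd2 inputfile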

Sample LoadLeveler script

#@ shell = /bin/ksh
#@ job_type = parallel
#@ network.MPI = csss,shared,us
#@ account_no = z001
#@ output = namd_run.$(schedd_host)_$(jobid).out
#@ error = namd_run.$(schedd_host)_$(jobid).err
#@ wall_clock_limit = 00:30:00
#@ node = 1
#@ tasks_per_node = 8
#@ queue

export MP_SHARED_MEMORY=yes    # communication via shared memory
export MP_EAGER_LIMIT=65536    # setting MP_EAGER_LIMIT
poe path/namd2 inputfile       # set path and inputfile
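The script is submitted through LoadLeveler. A brief usage sketch, assuming it has been saved as namd_run.ll (the file name is only an example):

llsubmit namd_run.ll    # submit the job
llq -u $USER            # check the state of your jobs in the queue
llcancel <job id>       # remove a job if necessary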

Benchmarks

- Joint Amber Charm (JAC) benchmark: dihydrofolate reductase in water, 23558 atoms (www.scripps.edu/brooks/Benchmarks)
- Apo A-1 benchmark: apolipoprotein A-1, 92224 atoms (www.ks.uiuc.edu/Research/apoa1)
- TCR peptide-MHC: 96796 atoms (www.hpcx.ac.uk/about/newsletter/HPCxNews02.pdf)
- F1-ATP synthase: F1 subunit of ATP synthase, 327506 atoms (www.sc-2002.org/paperpdfs/pap.pap277.pdf)

The HPCx system

Presently:
- 40 IBM p690 Regatta-H frames with 32 POWER4 processors (1.3 GHz) per frame, i.e. 1280 processors in total
- Frames subdivided into LPARs of 8 processors
- 8 GB of main memory per LPAR
- IBM SP Switch2 (Colony) network: 2 switch adapters per LPAR, dual plane

Future (Summer 2004):
- Upgrade to p690+ frames (1.7 GHz)
- LPARs of 32 processors
- IBM HPS (Federation) network
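On the current phase a LoadLeveler "node" corresponds to one LPAR of 8 processors, so a 32-processor run such as those in the following table spans 4 LPARs. A sketch of the corresponding resource directives, reusing the settings of the sample script above:

#@ node = 4                       # four LPARs of 8 POWER4 CPUs each
#@ tasks_per_node = 8             # one MPI task per processor
#@ network.MPI = csss,shared,us   # same network setting as the sample script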

Time per step for 32 processors

Benchmark                      NAMD 2.4   NAMD 2.5   Comment
dhf reductase (23558 atoms)    0.051 s    0.032 s    Too small for 32 CPUs
APO A-1 (92224 atoms)          0.28 s     0.19 s
TCR MHC (96796 atoms)          0.30 s     0.21 s
F1-ATP (327506 atoms)                     0.58 s

NAMD 2.5 is substantially faster than NAMD 2.4: for APO A-1, for example, the time per step drops from 0.28 s to 0.19 s, a reduction of roughly 30%.

[Figure: large numbers of processors]

Further reading

Full technical report: "The performance of NAMD on HPCx", Joachim Hein,
www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0310.pdf