Workshop on Parallelization of Coupled-Cluster Methods
Panel 1: Parallel efficiency
An incomplete list of thoughts

Bert de Jong
High Performance Software Development
Molecular Science Computing Facility

Overall hardware issues
- Compute power per node has increased
  - Increase in single-CPU speed has flattened out (but you never know!)
  - Multiple cores together tax the other hardware resources in a node
- Bandwidth and latency of the other major hardware resources are far behind
  - Affecting the flops we actually use
  - Memory
    - Very difficult to feed the CPU
    - Multiple cores further reduce bandwidth per core
  - Network
    - Data access considerably slower than memory
    - Speed of light is our enemy
  - Disk input/output
    - Slowest of them all, disks spin only so fast
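The point about "the flops we actually use" can be made concrete with a roofline-style estimate: a core runs no faster than its arithmetic peak, and no faster than its share of memory bandwidth can deliver operands. The sketch below is only an illustration. The peak and bandwidth figures are made-up assumptions, not measurements of any system, and `attainable_gflops` and `blocked_gemm_intensity` are hypothetical helper names.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte, cores_sharing=1):
    """Roofline-style bound for one core.

    The core can go no faster than its arithmetic peak, and no faster than
    the rate at which its share of the memory bandwidth delivers operands.
    """
    per_core_bandwidth = bandwidth_gb_s / cores_sharing
    return min(peak_gflops, per_core_bandwidth * flops_per_byte)


# Illustrative numbers only (assumptions, not measurements of any machine).
PEAK = 10.0       # GFLOP/s per core
BANDWIDTH = 5.0   # GB/s of memory bandwidth per node

# y = a*x + y on doubles: 2 flops per 24 bytes moved (read x, read y, write y).
DAXPY_INTENSITY = 2 / 24


def blocked_gemm_intensity(n):
    """Flops per byte for an n x n x n matrix product, assuming each of the
    three n x n double-precision blocks crosses the memory bus once."""
    return (2 * n**3) / (3 * n**2 * 8)


for label, intensity in [("daxpy", DAXPY_INTENSITY),
                         ("gemm, n=240", blocked_gemm_intensity(240))]:
    for cores in (1, 4):
        rate = attainable_gflops(PEAK, BANDWIDTH, intensity, cores)
        print(f"{label:12s} cores={cores}: {rate:6.3f} GFLOP/s")
```

Under these assumptions the vector update is limited by bandwidth and slows further when cores share the memory bus, while the blocked matrix product stays at the compute peak.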

Dealing with memory
- Amounts of data needed in coupled cluster can be huge
  - Amplitudes
    - Too large to store on a single node (except for T1)
    - Shared memory would be good, but will shared memory of 100s of terabytes be feasible and accessible?
  - Integrals
    - Recompute vs store (on disk or in memory)
    - Can we avoid access to memory when recomputing?
- Coupled cluster has one advantage: it can easily be formulated as matrix multiplication
  - Can be very efficient: DGEMM on EMSL’s 1.5 GHz Itanium-2 system reached over 95% of peak performance
  - As long as we can get all the needed data in memory!
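To illustrate the matrix-multiplication formulation, here is a minimal sketch for one representative contraction, R[a,b,i,j] = sum over c,d of W[a,b,c,d] * T[c,d,i,j]. Fusing the index pairs (a,b), (c,d) and (i,j) into single compound indices turns it into an ordinary matrix product. The tensor names, the dictionary storage and the plain-Python `matmul` are assumptions made for illustration; a production code would hand the fused matrices to a tuned DGEMM.

```python
from itertools import product


def ladder_direct(W, T, nv, no):
    """Reference: explicit loops over R[a,b,i,j] = sum_cd W[a,b,c,d] * T[c,d,i,j]."""
    R = {}
    for a, b, i, j in product(range(nv), range(nv), range(no), range(no)):
        R[a, b, i, j] = sum(W[a, b, c, d] * T[c, d, i, j]
                            for c, d in product(range(nv), repeat=2))
    return R


def matmul(A, B):
    """Plain matrix product; stands in for a tuned DGEMM call."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][c] for k in range(inner)) for c in range(cols)]
            for row in A]


def ladder_as_matmul(W, T, nv, no):
    """Same contraction after fusing index pairs into compound indices."""
    # Rows of Wm are (a,b) -> a*nv + b, columns are (c,d) -> c*nv + d.
    Wm = [[W[a, b, c, d] for c, d in product(range(nv), repeat=2)]
          for a, b in product(range(nv), repeat=2)]
    # Rows of Tm are (c,d) -> c*nv + d, columns are (i,j) -> i*no + j.
    Tm = [[T[c, d, i, j] for i, j in product(range(no), repeat=2)]
          for c, d in product(range(nv), repeat=2)]
    Rm = matmul(Wm, Tm)
    return {(a, b, i, j): Rm[a * nv + b][i * no + j]
            for a, b, i, j in product(range(nv), range(nv), range(no), range(no))}


# Tiny integer-valued test tensors (arbitrary fill pattern, for checking only).
NV, NO = 3, 2
W = {(a, b, c, d): (a + 2 * b + 3 * c + 5 * d) % 7 - 3
     for a, b, c, d in product(range(NV), repeat=4)}
T = {(c, d, i, j): (2 * c + d + 3 * i + j) % 5 - 2
     for c, d, i, j in product(range(NV), range(NV), range(NO), range(NO))}

print(ladder_direct(W, T, NV, NO) == ladder_as_matmul(W, T, NV, NO))
```

The fused matrices have dimensions nv*nv by nv*nv and nv*nv by no*no, which shows why the efficiency depends on getting those operands into memory.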

Dealing with networks
- With 10s of terabytes of data and distributed-memory systems, getting data from remote nodes is inevitable
- Can be no problem, as long as you can hide the communication behind computation
  - Fetch data while computing = one-sided communication
  - NWChem uses Global Arrays to accomplish this
- Issues are
  - Low bandwidth and high latency relative to increasing node speed
  - Non-uniform network
    - Cabling a full fat tree can be cost prohibitive
    - Effect of network topology
    - Fault resiliency of network
  - Multiple cores need to compete for a limited number of busses
  - Data contention increases with increasing node count
- Data locality, data locality, data locality
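Hiding communication behind computation amounts to requesting the next block of data before starting work on the current one. The sketch below shows that double-buffering pattern in plain Python. It does not use Global Arrays: a background thread stands in for the one-sided fetch, and `fetch_block`, `compute` and `process_with_prefetch` are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(store, index):
    """Stand-in for a one-sided get of a remote block (hypothetical helper)."""
    return list(store[index])


def compute(block):
    """Stand-in for the local work done on one block."""
    return sum(x * x for x in block)


def process_with_prefetch(store):
    """Process all blocks, always keeping one fetch in flight."""
    if not store:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_block, store, 0)
        for nxt in range(1, len(store) + 1):
            block = pending.result()  # wait for the block requested earlier
            if nxt < len(store):
                # Start fetching the next block before computing on this one.
                pending = pool.submit(fetch_block, store, nxt)
            total += compute(block)   # runs while the next fetch is in flight
    return total


blocks = [[1, 2], [3, 4], [5]]
print(process_with_prefetch(blocks))
```

The overlap only pays off when the computation on one block takes at least as long as fetching the next, which is where the bandwidth and latency issues listed above come in.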

Dealing with spinning disks
- Using local disk
  - Will only contain data needed by its own node
  - Can be fast enough if you put a large number of spindles behind it
  - And, again, if you can hide the access behind computation (pre-fetch)
  - With 100,000s of disks, chances of failure become significant
    - Fault tolerance of computation becomes an issue
- Using globally shared disk
  - Crucial when going to very large systems
  - Allows for large files shared by large numbers of nodes
  - Lustre file system of petabytes possible
  - Speed limited by number of access points (hosts)
    - Large numbers of reads and writes need to be handled by a small number of hosts, creating lock and access contention
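A rough estimate shows why failure becomes significant at this scale. The sketch below assumes independent disks with a constant failure rate, so the number of failures during a run is Poisson distributed. The 2% annual failure rate and the 24-hour run length are illustrative assumptions, not figures from this talk, and `prob_any_disk_failure` is a hypothetical helper.

```python
import math

HOURS_PER_YEAR = 8760


def prob_any_disk_failure(n_disks, annual_failure_rate, run_hours):
    """Chance that at least one disk fails during a run.

    Assumes independent disks with a constant failure rate, so the number
    of failures in the run is Poisson with the mean computed below.
    """
    expected_failures = n_disks * annual_failure_rate * run_hours / HOURS_PER_YEAR
    # 1 - exp(-x), written with expm1 to stay accurate when x is tiny.
    return -math.expm1(-expected_failures)


# Assumed values for illustration: 2% annual failure rate, 24-hour run.
for n in (1, 1_000, 100_000):
    p = prob_any_disk_failure(n, annual_failure_rate=0.02, run_hours=24)
    print(f"{n:>7d} disks: P(at least one failure) = {p:.4f}")
```

With these assumed inputs a single disk almost never fails during the run, while at 100,000 disks at least one failure is close to certain, so the computation has to tolerate it.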

What about beyond 1 petaflop?
- Possibly 100,000s of multicore nodes
  - How does one create a fat enough network between that many nodes?
- Possibly 32, 64, 128 or more cores per node
  - All cores simply cannot do the same thing anymore
    - Not enough memory bandwidth
    - Not enough network bandwidth
  - Heterogeneous computing within a node (CPU+GPU)
  - Designate nodes for certain tasks
    - Communication
    - Memory access, put and get
    - Recompute integrals, hopefully using cache only
    - DGEMM operations
- Task scheduling will become an issue
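One reason task scheduling matters is that tasks of uneven cost leave workers idle under a fixed assignment. The sketch below compares a static round-robin assignment with a simple dynamic one in which each task goes to whichever worker becomes free first. It is a toy model under stated assumptions: task costs are known numbers, scheduling itself is free, and the function names are hypothetical.

```python
import heapq


def static_makespan(costs, n_workers):
    """Finish time when task i is fixed in advance to worker i % n_workers."""
    loads = [0] * n_workers
    for i, cost in enumerate(costs):
        loads[i % n_workers] += cost
    return max(loads)


def dynamic_makespan(costs, n_workers):
    """Finish time when each task, in order, goes to the first free worker."""
    free_at = [0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Arbitrary example: two expensive tasks among six cheap ones, four workers.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print("static :", static_makespan(costs, 4))
print("dynamic:", dynamic_makespan(costs, 4))
```

In this example the round-robin assignment puts both expensive tasks on the same worker, while the dynamic assignment spreads them out. At 100,000s of nodes the bookkeeping behind such dynamic assignment is no longer free, which is part of why scheduling becomes an issue.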

