
1 Performance Benefits on HPCx from Power5 chips and SMT HPCx User Group Meeting 28 June 2006 Alan Gray EPCC, University of Edinburgh

2 28/06/06 Contents
Introduction and System Overview
Benchmark Results: Synthetic
Benchmark Results: Applications
Simultaneous Multithreading
Conclusions

3 28/06/06 Introduction and System Overview

4 28/06/06 Introduction
HPCx was upgraded from Power4 to Power5 technology in November 2005.
The new system features Simultaneous Multithreading (SMT).
We compare the new and old systems via benchmark results, both synthetic and from real applications representing typical use of the system.
The use of SMT is also investigated.
Results from EPCC's Blue Gene/L system are also included for comparison.

5 28/06/06 Systems for comparison
Previous HPCx (Phase 2): 50 IBM e-Server p690+ nodes
–SMP cluster, 32 Power4 1.7 GHz processors per node
–32 GB of RAM per node
–Federation interconnect
–6.2 TFLOP/s Linpack
HPCx (Phase 2a): 96 IBM e-Server p575 nodes
–SMP cluster, 16 Power5 1.5 GHz processors per node
–Power5 has an improved memory architecture over Power4
–32 GB of RAM per node (twice as much per processor as Phase 2)
–Federation interconnect (same as Phase 2)
–7.4 TFLOP/s Linpack, No. 46 on the Top500
BlueSky: single e-Server Blue Gene frame
–1024 dual-core chips, 2048 PowerPC 440 processors, 700 MHz
–512 MB of RAM per chip (distributed-memory system), shared between the two cores
–4.7 TFLOP/s Linpack, joint No. 73 on the Top500

6 28/06/06 Benchmark Results: Synthetic

7 28/06/06 Synthetic benchmarks: Intel MPI suite
Ping Pong benchmark – two processes communicate, either over the switch or within a node via shared memory.
Switch communication: insignificant difference (not surprising – same switch).
Intra-node communication: Phase 2a has better asymptotic bandwidth but slightly higher latency.
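As a rough illustration of what the ping-pong test measures (a minimal sketch, not the Intel MPI suite's own code; the message size and repetition count are arbitrary), two MPI ranks bounce a buffer back and forth and time the round trips:

```c
/* Minimal MPI ping-pong sketch (illustrative only, not the Intel MPI
 * suite itself).  Rank 0 and rank 1 bounce a buffer back and forth;
 * whether this exercises the switch or shared memory depends on how
 * the two tasks are placed on nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;      /* 1 MB message, arbitrary size  */
    const int reps   = 100;          /* arbitrary repetition count    */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double oneway = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time */

    if (rank == 0)
        printf("%d bytes: %.1f us one-way, %.1f MB/s\n",
               nbytes, oneway * 1e6, nbytes / oneway / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```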

8 28/06/06 Synthetic benchmarks: Intel MPI suite
Multi Ping Pong: all available processors utilised.
Modified to ensure that all communications utilise the switch.
No difference between Phase 2 and Phase 2a (not surprising – same switch).

9 28/06/06 Streams performance (scale)
The Streams benchmark gives a measure of memory bandwidth.
The hardware limit is 2 loads+stores per cycle.
The cache levels are clearly visible in the results.
Phase 2a is significantly better than Phase 2 at all memory levels.
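For reference, the "scale" kernel that this Streams measurement is based on is essentially the loop below (a simplified single-threaded sketch with an arbitrary array size, not the official STREAM code):

```c
/* Simplified sketch of the STREAM "scale" kernel: a[i] = q * b[i].
 * Not the official benchmark -- STREAM repeats several kernels over
 * arrays chosen to be much larger than the caches and reports the
 * best bandwidth observed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L                    /* arbitrary, but >> cache size */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    const double q = 3.0;

    for (long i = 0; i < N; i++) b[i] = 1.0;   /* touch memory first */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = q * b[i];                       /* one load + one store */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    /* 16 bytes move per element: 8 read from b, 8 written to a.
     * Printing a[0] also stops the compiler removing the loop. */
    printf("scale: %.2f GB/s (a[0] = %f)\n", 16.0 * N / secs / 1e9, a[0]);

    free(a); free(b);
    return 0;
}
```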

10 28/06/06 Benchmark Results: Applications

11 28/06/06 CASTEP: Al2O3
Density functional theory application (Payne et al., 2002; Segall et al., 2002).
Widely used in the UK (the largest single user of HPCx cycles).
Benchmark: Al2O3, a 270-atom slab sampled with 2 k-points.
Phase 2a is around 1.3 times faster than Phase 2
–even though the clock frequency is lower
–the code is taking advantage of the improved memory bandwidth

12 28/06/06 H2MOL
Solves the time-dependent Schrödinger equation for laser-driven dissociation of H2 molecules.
The grid is refined as the processor count increases, hence constant work per processor.
Phase 2a is almost a factor of 2 faster than Phase 2.
Writing of intermediate results exposes the poor I/O performance of Blue Gene.

13 28/06/06 PCHAN
Finite difference code for turbulent flow: shock/boundary-layer interaction (SBLI).
Communications: halo exchanges between adjacent computational sub-domains, as sketched below.
Phase 2a is around 2 times faster than Phase 2.
Very good scaling on all systems – HPCx superscales.
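To illustrate the halo-exchange pattern mentioned above (a generic 1-D sketch under assumed array layout, not PCHAN's actual routines), each rank swaps its boundary values with its left and right neighbours:

```c
/* Generic 1-D halo exchange sketch (not PCHAN's actual code).
 * Each rank stores nlocal interior points in u[1..nlocal], with halo
 * cells u[0] and u[nlocal+1] holding copies of the neighbours' edges. */
#include <mpi.h>

void halo_exchange(double *u, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send rightmost interior point to the right, receive left halo */
    MPI_Sendrecv(&u[nlocal], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);

    /* send leftmost interior point to the left, receive right halo */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[nlocal + 1], 1, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}
```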

14 28/06/06 AIMPRO
Ab Initio Modelling PROgram: determines atomic structure using the Born-Oppenheimer approximation.
Benchmark: DS-4 – 433 atoms, 12124 basis functions and 4 k-points.
Phase 2a outperforms Phase 2 by around a factor of 1.2.

15 28/06/06 MDCASK
MDCASK: classical molecular dynamics code used to study radiation damage in metals.
Benchmark: 1,372,000 atoms in a Ti lattice.
Performance is worse on Phase 2a than on Phase 2
–by a factor larger than the clock frequency ratio
–scaling is also worse on Phase 2a
Classical molecular dynamics codes are characterised by many strided memory accesses
–the degradation could be due to sensitivity to increased latency in some part of the memory subsystem

16 28/06/06 LAMMPS
Classical molecular dynamics code that can simulate a wide range of materials.
Rhodopsin benchmark: 2,048,000 atoms.
Performance is again degraded on the new system: the slowdown matches the clock ratio at low processor counts, but scaling is worse on Phase 2a.

17 28/06/06 NAMD 2.6b1: ApoA1
NAMD: classical molecular dynamics code designed for high-performance simulation of large biomolecular systems.
ApoA1 benchmark: 92,224 atoms.
Like the other classical molecular dynamics codes, it performs worse on Phase 2a.

18 28/06/06 DL_POLY3: Gramicidin
DL_POLY is a general-purpose molecular dynamics package; DL_POLY3 uses a distributed domain decomposition model.
Benchmark: a system of eight Gramicidin-A species (792,960 atoms).
Performs slightly better on Phase 2a, but the improvement is not as large as for some of the other codes.

19 28/06/06 Simultaneous Multithreading

20 28/06/06 Simultaneous Multithreading (SMT)
The theoretical peak floating point performance of microprocessors has risen steadily in recent years.
The actual performance of applications, relative to the theoretical peak, has dropped substantially
–i.e. the number of cycles for which the floating point units sit idle is rising
–this is due to the latencies involved in processor operations.
The compiler attempts to schedule instructions to minimise wasted cycles
–but its effectiveness is limited by a lack of independent instructions.
SMT: multiple threads can issue instructions to the functional units in each cycle
–the number of independent instructions increases, so the number of idle cycles decreases.

21 28/06/06 Simultaneous Multithreading (SMT)
The Power5 processors on HPCx have 2 floating point units and support SMT with 2 threads.
Hence 2 virtual processes (MPI tasks or OpenMP threads) run per physical processor.
No SMT:
#@ tasks_per_node = 16
With SMT:
#@ tasks_per_node = 32
#@ requirements = (Feature == "SMT")
Disadvantages:
–more communication
–the memory limit per task is halved
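Put together, a LoadLeveler batch script requesting SMT might look roughly as follows. Only the tasks_per_node and requirements lines come from the slide above; the job name, node count, wall-clock limit, executable name, and remaining keywords are illustrative placeholders:

```
#!/bin/ksh
# Hypothetical LoadLeveler script for an SMT run on one HPCx node.
# Only tasks_per_node and requirements are taken from the slide;
# everything else here is an illustrative placeholder.
#@ job_type         = parallel
#@ job_name         = smt_test
#@ node             = 1
#@ tasks_per_node   = 32
#@ requirements     = (Feature == "SMT")
#@ wall_clock_limit = 00:30:00
#@ queue

poe ./my_application
```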

22 28/06/06 SMT: Streams
Compare the open squares (no SMT) with the open circles (SMT).
With SMT there are twice as many tasks per node, so for a direct comparison the SMT results have been multiplied by a factor of 2.
No difference in memory bandwidth is observed with SMT; of course, the caches are effectively halved in size.
Therefore, any improvements in applications must be due to reduced memory latency (as expected).

23 28/06/06 SMT: Classical Molecular Dynamics
Reminder: classical molecular dynamics codes did worse than expected on Phase 2a, likely due to sensitivity to increased memory latency.
Such codes benefit from SMT: it seems the latencies are successfully hidden.
The benefit is limited to lower processor counts; at high counts the large amount of communication takes over.
For NAMD, up to a factor of 1.4 improvement; the crossover point is around 512 processors.

24 28/06/06 SMT: Classical Molecular Dynamics
Reminder: classical molecular dynamics codes did worse than expected on Phase 2a, likely due to sensitivity to increased memory latency.
Such codes benefit from SMT: it seems the latencies are successfully hidden.
The benefit is limited to lower processor counts; at high counts the large amount of communication takes over.
For MDCASK, up to a factor of 1.4 improvement; the crossover point is around 256 processors.

25 28/06/06 SMT: Classical Molecular Dynamics
Reminder: classical molecular dynamics codes did worse than expected on Phase 2a, likely due to sensitivity to increased memory latency.
Such codes benefit from SMT: it seems the latencies are successfully hidden.
The benefit is limited to lower processor counts; at high counts the large amount of communication takes over.
For DL_POLY, up to a factor of 1.2 improvement; the crossover point is around 64 processors.

26 28/06/06 SMT: CASTEP and H2MOL
Reminder: the performance of CASTEP and H2MOL improved on the new system.
–No performance benefit is seen with SMT.
–SMT degrades performance in certain situations.

27 28/06/06 Conclusions
HPCx was recently upgraded from Power4 to Power5 technology.
Although the new chips have a slightly lower clock frequency, significant improvements are observed for the majority of applications
–due to better memory bandwidth.
Some types of application, in particular classical molecular dynamics, have not performed as well as expected on the new system
–these applications are characterised by many strided memory accesses
–sensitivity to an increased latency could be to blame.
Performance benefits from the use of SMT have been observed in certain situations
–in particular for those codes which did not do as well as expected
–users should benchmark their own codes.

