S3D: Performance Impact of Hybrid XT3/XT4
Sameer Shende
tau-team@cs.uoregon.edu

TAU Performance System: S3D Scalability Study

Slide 2: Acknowledgements
• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]
The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d

Slide 3: TAU Parallel Performance System
• http://www.cs.uoregon.edu/research/tau/
• Multi-level performance instrumentation
• Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid

Slide 4: The Story So Far...
• Scalability study of S3D using TAU
  • MPI_Wait
  • I/O (WRITE_SAVEFILE)
  • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC]
  • Loop: ReactionRateBounds (374-386) [exp]
• Earlier 3D scatter plots pointed to a single “slow” node
• Identifying individual nodes by mapping ranks to nodes within TAU
• Cray utilities: nodeinfo, xtshowmesh, xtshowcabs
• Ran a 6400-core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)

Slide 5: Total Runtime Breakdown by Events - Time
Labeled events in the figure: MPI_Wait, WRITE_SAVEFILE

Slide 6: Relative Efficiency

Slide 7: MPI Scaling

Slide 8: Relative Efficiency & Speedup for One Event
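
The transcript keeps only the titles of the efficiency and speedup plots, not the formulas behind them. As a rough sketch, one common convention measures both against the smallest run and assumes a fixed total problem size. That convention, the function names, and the timings below are assumptions for illustration, not values from the study.

```python
def relative_speedup(base_time, time):
    """Speedup of a run relative to the baseline (smallest) run."""
    return base_time / time


def relative_efficiency(base_procs, base_time, procs, time):
    """Relative speedup divided by the ideal speedup procs / base_procs."""
    return relative_speedup(base_time, time) * base_procs / procs


# Hypothetical wallclock times in seconds, keyed by core count.
runs = {8: 100.0, 64: 14.0, 512: 2.5}
base_procs = min(runs)
base_time = runs[base_procs]

for procs in sorted(runs):
    s = relative_speedup(base_time, runs[procs])
    e = relative_efficiency(base_procs, base_time, procs, runs[procs])
    print(f"{procs:4d} cores: speedup {s:6.2f}, relative efficiency {e:5.2f}")
```

The same two functions can be applied to the time of a single event (for example MPI_Wait) instead of the whole run, which is presumably what the per-event plot shows.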

Slide 9: ParaProf’s Source Browser (8 core profile)

Slide 10: Case Study
• Harness testcase
• Platform: Jaguar Combined Cray XT3/XT4 at ORNL
• 6400p
• Goal:
  • To evaluate the performance impact of combined XT3/XT4 nodes on S3D executions
  • Performance evaluation of MPI_Wait
  • Study mapping of MPI ranks to nodes

Slide 11: TAU: ParaProf Profile

Slide 12: Overall Mean Profile: Exclusive Wallclock Time

Slide 13: Overall Inclusive Time

Slide 14: Mean Mflops observed over all ranks

Slide 15: Inclusive Total Instructions Executed

Slide 16: Total Instructions Executed (Exclusive)

Slide 17: Comparing Exclusive PAPI Counters, MFlops

Slide 18: 3D Scatter Plots
• Plot four routines along X, Y, Z, and Color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
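
The mapping described in this list can be sketched as follows: each routine's per-rank values are scaled to its own (min, max) range, and each rank then gets one (x, y, z, color) tuple. This is a minimal illustration, not ParaProf's implementation. The function names, the short labels for the two loops, and the per-rank values are hypothetical.

```python
def normalize(values):
    """Scale values to [0, 1] using the routine's own min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All ranks identical: place everything at the low end of the axis.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def scatter_points(per_routine):
    """Turn four per-rank value lists into one (x, y, z, color) tuple per rank."""
    if len(per_routine) != 4:
        raise ValueError("expected exactly four routines (X, Y, Z, Color)")
    axes = [normalize(values) for values in per_routine.values()]
    return list(zip(*axes))


# Hypothetical exclusive times (seconds) for three ranks.
profile = {
    "MPI_Wait": [10.0, 20.0, 30.0],
    "WRITE_SAVEFILE": [5.0, 5.0, 5.0],
    "ComputeSpeciesDiffFlux loop": [4.0, 2.0, 0.0],
    "ReactionRateBounds loop": [1.0, 3.0, 2.0],
}

points = scatter_points(profile)
for rank, point in enumerate(points):
    print(rank, point)
```

Ranks that behave alike end up close together in this normalized space, which is how separate clusters of nodes become visible.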

Slide 19: Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!

Slide 20: 3D Triangle Mesh Display
• Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
• Color can be shown by a fourth metric
• Scalable view
• Suitable for very large number of processors

Slide 21: MPI_Wait: 3D View

Slide 22: 3D View: Zooming In... Jagged Edges!

Slide 23: 3D View: Uh Oh!

Slide 24: Zoom, Change Color to L1 Data Cache Misses
The loop in ComputeSpeciesDiffFlux (630-656) has high L1 DCMs (red).
It takes longer to execute on this “slice” of processors. So do other routines.
Slower memory?

Slide 25: Changing Color to MFLOPS
The loop in ComputeSpeciesDiffFlux (630-656) has lower Mflops (dark blue).

Slide 26: Getting Back to MPI_Wait()
Why does MPI_Wait take less time on these cores?
What does the profile of MPI_Wait look like?

Slide 27: MPI_Wait - Sorted by Exclusive Time
MPI_Wait takes 435.84 seconds on rank 3101.
It takes 59.6 s on rank 3233 and 29.2 s on rank 3200.
It takes 15.49 seconds on rank 0!
How is rank 3101 different from rank 0?
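
To put the spread in these numbers in perspective, the short sketch below sorts the four quoted ranks by MPI_Wait exclusive time and compares the extremes. The times are the ones quoted on this slide; the variable names are illustrative only.

```python
# MPI_Wait exclusive time in seconds, per rank, as quoted on the slide.
wait_seconds = {3101: 435.84, 3233: 59.6, 3200: 29.2, 0: 15.49}

# Sort ranks from the longest wait to the shortest.
by_time = sorted(wait_seconds.items(), key=lambda item: item[1], reverse=True)

slowest_rank, slowest = by_time[0]
fastest_rank, fastest = by_time[-1]
ratio = slowest / fastest

print(f"rank {slowest_rank} waits {ratio:.1f}x longer than rank {fastest_rank}")
```

Rank 3101 spends roughly 28 times as long in MPI_Wait as rank 0, which is why the two are compared directly on the following slides.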

Slide 28: Comparing Ranks 3101 and 0 (extremes)

Slide 29: Comparing Inclusive Times - Same for S3D

Slide 30: Comparing PAPI Floating Point Instructions
PAPI_FP_INS are the same - as expected

Slide 31: Comparing Performance - MFLOPS
For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only about 65% of the Mflops of rank 3101 (114 vs. 174 Mflops)!
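
The 65% figure follows directly from the two rates. The worked arithmetic below also derives how much longer the loop runs on rank 0, under the assumption that the loop executes the same number of floating point instructions on both ranks (the previous slide reports equal PAPI_FP_INS). Variable names are illustrative.

```python
# Mflops of the ComputeSpeciesDiffFlux loop, as quoted on the slide.
mflops_rank0 = 114.0
mflops_rank3101 = 174.0

# Fraction of rank 3101's rate that rank 0 achieves.
mflops_ratio = mflops_rank0 / mflops_rank3101

# Assumption: identical floating point instruction counts on both ranks,
# so time scales inversely with the Mflops rate.
time_ratio = mflops_rank3101 / mflops_rank0

print(f"rank 0 reaches {mflops_ratio:.1%} of rank 3101's Mflops")
print(f"the loop takes about {time_ratio:.2f}x longer on rank 0")
```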

Slide 32: Comparing MFLOPS: Rank 3101 vs Rank 0
Rank 0 appears to be “slower” than rank 3101.
Are there other nodes that are similarly slow, with shorter wait times?
What does the MPI_Wait profile look like over all nodes?

Slide 33: MPI_Wait Profile
What is this rank?

Slide 34: MPI_Wait Profile
Shifts at rank 114!
Ranks 0 through 113 take less time in MPI_Wait than rank 114...

Slide 35: Another Shift in MPI_Wait()
This shift is observed in ranks 3200 through 3313.
Again 114 processors... (like ranks 0 through 113)
Hmm...
How do other routines perform on these ranks?
What are the physical node ids?
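
Shifts like these were read off the plots by eye, but the same blocks can be found programmatically. The sketch below scans per-rank MPI_Wait times for contiguous runs of ranks below a threshold. The function name, the threshold, and the synthetic times are hypothetical; only the two rank ranges come from the slides.

```python
def low_wait_blocks(wait_times, threshold):
    """Return (first_rank, last_rank) for each contiguous run below threshold."""
    blocks = []
    start = None
    for rank, seconds in enumerate(wait_times):
        if seconds < threshold:
            if start is None:
                start = rank
        elif start is not None:
            blocks.append((start, rank - 1))
            start = None
    if start is not None:
        blocks.append((start, len(wait_times) - 1))
    return blocks


# Synthetic profile for 6400 ranks: hypothetical times, with low values
# placed on the two rank ranges reported in the slides.
wait_times = [60.0] * 6400
for rank in list(range(0, 114)) + list(range(3200, 3314)):
    wait_times[rank] = 15.0

blocks = low_wait_blocks(wait_times, threshold=30.0)
for first, last in blocks:
    print(f"ranks {first}..{last}: {last - first + 1} ranks")
```

Both blocks contain 114 ranks, matching the observation that the second shift has the same width as the first.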

Slide 36: MPI_Wait
While MPI_Wait takes less time on these CPUs, other routines take longer.
This points to a load imbalance!

Slide 37: Identifying Physical Processors using Metadata

Slide 38: MetaData for Ranks 3200 and 0
Ranks 3200 and 0 both lie on the same physical node, nid03406!

Slide 39: Mapping Ranks from TAU to Physical Processors
Ranks 0..113 lie on processors 3406..3551
Ranks 3200..3313 are also on 3406..3551
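
Once each rank's node name has been read from the profile metadata, grouping ranks by node makes this co-location explicit. The sketch below is a minimal illustration: the pairing of ranks 0 and 3200 on nid03406 is from the slides, while the function name and the other two entries are hypothetical.

```python
from collections import defaultdict


def ranks_by_node(rank_to_node):
    """Group MPI ranks by the physical node name recorded for each rank."""
    nodes = defaultdict(list)
    for rank, node in sorted(rank_to_node.items()):
        nodes[node].append(rank)
    return dict(nodes)


rank_to_node = {
    0: "nid03406",      # from the slides
    3200: "nid03406",   # from the slides
    1: "nid03407",      # hypothetical entry for illustration
    114: "nid00020",    # hypothetical entry for illustration
}

grouped = ranks_by_node(rank_to_node)
for node, ranks in grouped.items():
    print(node, ranks)
```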

Slide 40: Results from Cray’s nodeinfo Utility
Processors 3406..3551 (physical ids) are located on the XT3 partition.
The XT3 partition has slower DDR-400 memory (5986 MB/s) and a slower SS1 interconnect (1109 MB/s).
The XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and a faster Seastar2 (SS2) interconnect (2022 MB/s).
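
The gap between the two partitions can be quantified from these figures. The sketch below computes the bandwidth ratios and classifies a node name by the XT3 id range quoted above. Treating every node outside that range as XT4 is a simplifying assumption for this job, and the function and variable names are hypothetical.

```python
import re

# Physical id range reported on the XT3 partition (from the slide).
XT3_FIRST, XT3_LAST = 3406, 3551

# Bandwidth figures in MB/s, as quoted on the slide.
BANDWIDTH_MB_S = {
    "XT3": {"memory": 5986, "interconnect": 1109},
    "XT4": {"memory": 7147, "interconnect": 2022},
}


def partition_of(node_name):
    """Classify a node name such as 'nid03406' as 'XT3' or 'XT4'."""
    match = re.fullmatch(r"nid(\d+)", node_name)
    if match is None:
        raise ValueError(f"unexpected node name: {node_name!r}")
    node_id = int(match.group(1))
    # Assumption: nodes outside the quoted XT3 range belong to XT4.
    return "XT3" if XT3_FIRST <= node_id <= XT3_LAST else "XT4"


memory_ratio = BANDWIDTH_MB_S["XT4"]["memory"] / BANDWIDTH_MB_S["XT3"]["memory"]
interconnect_ratio = (
    BANDWIDTH_MB_S["XT4"]["interconnect"] / BANDWIDTH_MB_S["XT3"]["interconnect"]
)

print(f"XT4 memory bandwidth is {memory_ratio:.2f}x that of XT3")
print(f"XT4 interconnect bandwidth is {interconnect_ratio:.2f}x that of XT3")
print("nid03406 is on", partition_of("nid03406"))
```

The memory advantage of XT4 is about 19%, while the interconnect is about 1.8 times faster.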

Slide 41: Location of Physical Nodes in the Cabinets
Using the Cray utilities xtshowcabs and xtshowmesh
All nodes marked with a Job “c” came from our S3D job

Slide 42: xtshowcabs
Nodes marked with a “c” are from our S3D run
What does the mesh look like?

Slide 43: xtshowmesh (1 of 2)
Nodes marked with a “c” are from our S3D run

Slide 44: xtshowmesh (2 of 2)
Nodes marked with a “c” are from our S3D run

Slide 45: Conclusions
• Using a combination of XT3/XT4 nodes slowed down parts of S3D
• The application spends a considerable amount of time spinning/polling in MPI_Wait
• The load imbalance is probably caused by non-uniform nodes
• Conducted a performance characterization of S3D
• This data will help derive communication models that explain the performance data observed [John Mellor-Crummey, Rice]
• Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice]
• I/O characterization of S3D will help identify I/O scaling issues

Slide 46: S3D - Building with TAU
• Change name of compiler in build/make.XT3
  • ftn => tau_f90.sh
  • cc => tau_cc.sh
• Set compile time environment variables
  • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    • Disabled tracking message communication statistics in TAU
    • MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    • Selective instrumentation file eliminates instrumentation in lightweight routines
    • Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI event counter selection in the job submission script:
  • export TAU_THROTTLE=1
  • export COUNTER1=GET_TIME_OF_DAY
  • export COUNTER2=PAPI_FP_INS
  • export COUNTER3=PAPI_L1_DCM
  • export COUNTER4=PAPI_TOT_INS
  • export COUNTER5=PAPI_L2_DCM

Slide 47: Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION

Slide 48: Getting Access to TAU on Jaguar
• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in:
  • ~sameer/scratch/S3D-BINARIES
    • withtau
      » papi, multiplecounters, mpi, pdt, pgi options
    • without_tau

Slide 49: Concluding Discussion
• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  • More automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.
• http://www.cs.uoregon.edu/research/tau

Slide 50: Support Acknowledgements
• Department of Energy (DOE)
• Office of Science
• LLNL, LANL, ORNL, ASC
• PERI
