1
S3D: Comparing Performance of XT3+XT4 with XT4
Sameer Shende
tau-team@cs.uoregon.edu
2
TAU Performance System / S3D Scalability Study
Acknowledgements
- Alan Morris [UO]
- Kevin Huck [UO]
- Allen D. Malony [UO]
- Kenneth Roche [ORNL]
- Bronis R. de Supinski [LLNL]
- John Mellor-Crummey [Rice]
- Nick Wright [SDSC]
- Jeff Larkin [Cray, Inc.]
The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d
3
TAU Parallel Performance System
http://www.cs.uoregon.edu/research/tau/
- Multi-level performance instrumentation
- Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid
4
The Story So Far...
- Scalability study of S3D using TAU
- 3D scatter plots and the mapping of ranks to physical processors point to partitioning in XT3/XT4
- Memory and network on the XT3 partition cause the rest of the application to slow down
- Hypothesis: running S3D on a ‘pure’ XT4 system will improve performance significantly
- Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)
5
3D Scatter Plots
- Plot four routines along X, Y, Z, and Color axes
- Each routine has a range (max, min)
- Each process (rank) has a unique position along the three axes and a unique color
- Allows us to examine the distribution of nodes (clusters)
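The mapping described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the code the TAU visualizer uses: each routine's per-rank time is assumed to be scaled linearly onto [0, 1] using that routine's (min, max) range, and the four scaled values become a rank's X, Y, Z, and color coordinates. The function names and the timing values are hypothetical.

```python
def normalize(values):
    """Scale per-rank times linearly onto [0, 1] using their (min, max) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every rank spent the same time: place them all at the origin of this axis.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_points(profile, routines):
    """Return one (x, y, z, color) tuple per rank from four routines' times."""
    if len(routines) != 4:
        raise ValueError("exactly four routines are needed: X, Y, Z and color")
    axes = [normalize(profile[name]) for name in routines]
    return list(zip(*axes))


# Hypothetical exclusive times in seconds for four ranks (not measured data).
profile = {
    "MPI_Wait": [10.0, 20.0, 30.0, 50.0],
    "WRITE_SAVEFILE": [2.0, 6.0, 6.0, 10.0],
    "GETRATES": [100.0, 90.0, 60.0, 50.0],
    "MPI_Barrier": [4.0, 4.0, 4.0, 4.0],
}
routines = ["MPI_Wait", "WRITE_SAVEFILE", "GETRATES", "MPI_Barrier"]
points = scatter_points(profile, routines)

for rank, point in enumerate(points):
    print(rank, point)
```

Ranks that behave alike end up close together in this space, which is why two groups of nodes with different performance show up as two separate clusters.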
6
Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!
Previous work proved: blue nodes are XT3, red are XT4
7
3D Triangle Mesh Display
- Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
- Color can be shown by a fourth metric
- Scalable view
- Suitable for very large number of processors
8
XT3+XT4: MPI_Wait
Gap represents XT3 nodes
9
3D View: Large MPI_Wait times on most CPUs
To improve performance, we must reduce MPI_Wait time on other CPUs
10
3D View: XT3 Partition, Imbalance
On XT3: MPI_Wait takes less time, other routines take more time!
11
Getting Back to MPI_Wait()
- MPI_Wait takes less time on XT3 nodes
- Other routines take longer
12
XT3+XT4: MPI_Wait - Sorted by Exclusive Time
- MPI_Wait takes 435.84 seconds on rank 3101
- It takes 15.49 seconds on rank 0!
- Rank 3101 is on XT4, rank 0 is on XT3
13
Comparing XT4 and XT3 ranks (Best vs worst)
14
Improving S3D Performance
Hypothesis: running S3D on a ‘pure’ XT4 system will improve performance significantly and reduce the time spent idling in MPI_Wait
15
XT4 Profile: Main Window
16
XT4: Mean Profile Sorted by Exclusive Time
MPI_Wait has moved down!
17
XT4: Mean Profile Sorted by Inclusive Time
18
Comparing XT4 with XT3+XT4
On XT4, MPI_Wait takes 26% of the time it takes on the combined XT3+XT4 system!
19
Comparing Mean Inclusive Time
20
XT4: 3D View
The “exp” loop [~1GFlop] takes most time now!
21
XT3+XT4: Scatter Plot (Before)
22
XT4 Scatter Plot (After)
MPI_Wait now ranges from 78 to 121 s!
23
Comparing Performance
- Hypothesis confirmed: XT4 is faster than XT3+XT4
  - Inclusive time down from 1935 to 1702 s
  - 12% improvement
  - Saved 24853.3 minutes (414 hours) of time summed over all 6400 cores (233 s of wallclock time per core)!
- Reduction in MPI_Wait time is most significant
  - 390 s (mean) down to 104 s (mean)
- Lessons learned:
  - Slower XT3 nodes can have a significant impact on a large-scale S3D run
  - S3D harness testcase does not perform well on non-homogeneous nodes
  - We recommend running S3D on XT4 partition only!
    #PBS -lfeature=xt4
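The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the slides and assumes that the 24853.3-minute saving is the per-core saving multiplied by the 6400 cores of the run, which reproduces the quoted value. The variable names are made up for illustration.

```python
CORES = 6400                 # size of the run quoted in the slides

before_s = 1935.0            # mean inclusive time on XT3+XT4, seconds
after_s = 1702.0             # mean inclusive time on XT4 only, seconds

saved_per_core_s = before_s - after_s
improvement_pct = 100.0 * saved_per_core_s / before_s

# Assumption: the total saving is the per-core saving summed over all cores.
saved_total_min = saved_per_core_s * CORES / 60.0
saved_total_h = saved_total_min / 60.0

wait_before_s = 390.0        # mean MPI_Wait time on XT3+XT4, seconds
wait_after_s = 104.0         # mean MPI_Wait time on XT4 only, seconds
wait_ratio_pct = 100.0 * wait_after_s / wait_before_s

print(f"saved per core: {saved_per_core_s:.0f} s ({improvement_pct:.1f}%)")
print(f"saved in total: {saved_total_min:.1f} min ({saved_total_h:.1f} h)")
print(f"MPI_Wait on XT4 relative to XT3+XT4: {wait_ratio_pct:.1f}%")
```

Under that assumption the 12% improvement, the 24853.3 minutes, the 414 hours, and the roughly 26% MPI_Wait ratio are all consistent with one another.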
24
Discussion
- Did we get optimal performance on XT4 nodes?
- Are the nodes performing at similar rates uniformly now?
- Let us see the standard deviation plot of all routines...
25
XT4: Standard Deviation
I/O routines!
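A standard deviation view of this kind ranks routines by how much their time varies from rank to rank. The following is a minimal sketch of that ranking, assuming a population standard deviation over per-rank exclusive times; which estimator the TAU display uses is not stated in the slides. The function name and the timing values are hypothetical, not measured data.

```python
import statistics


def rank_variation(profile):
    """Return (routine, mean, std dev) tuples, largest std dev first."""
    rows = []
    for routine, times in profile.items():
        rows.append((routine, statistics.mean(times), statistics.pstdev(times)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical per-rank exclusive times in seconds (four ranks per routine).
profile = {
    "GETRATES": [50.0, 50.0, 51.0, 51.0],
    "WRITE_SAVEFILE": [2.0, 40.0, 42.0, 44.0],
    "MPI_Barrier": [41.0, 3.0, 2.0, 1.0],
}

ranking = rank_variation(profile)
for routine, mean, stdev in ranking:
    print(f"{routine:16s} mean={mean:6.2f} s  stddev={stdev:6.2f} s")
```

Routines whose time is spread unevenly across ranks rise to the top of such a list, while compute routines that run at the same rate everywhere fall to the bottom.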
26
Scatter Plot: One CPU...
WRITE_SAVEFILE
27
WRITE_SAVEFILE
Rank 0 is quicker!
28
MPI_Barrier
29
I/O is not performed uniformly
30
I/O Becomes a Bottleneck: XT3, XT3+XT4...
MPI_Wait
WRITE_SAVEFILE
31
Conclusions
- Using pure XT4 improved performance by 12%
- Need to investigate I/O in XT4/Lustre further to achieve better performance...
- Discuss I/O issues with S3D developers
32
S3D - Building with TAU
- Change name of compiler in build/make.XT3
    ftn => tau_f90.sh
    cc => tau_cc.sh
- Set compile-time environment variables
    setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
  - Disabled tracking message communication statistics in TAU
    - MPI_Comm_compare() is not called inside TAU’s MPI wrapper
  - Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
    setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
  - Selective instrumentation file eliminates instrumentation in lightweight routines
  - Pre-process Fortran source code using cpp before compiling
- Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script:
    export TAU_THROTTLE=1
    export COUNTER1=GET_TIME_OF_DAY
    export COUNTER2=PAPI_FP_INS
    export COUNTER3=PAPI_L1_DCM
    export COUNTER4=PAPI_TOT_INS
    export COUNTER5=PAPI_L2_DCM
33
Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
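The structure of the selective instrumentation file can be illustrated with a small Python sketch that pulls the routine names out of the exclude list. This is not TAU's own parser: it assumes one literal routine name per line between the BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST markers and ignores wildcards and every other section. The function name and the shortened sample text are hypothetical.

```python
def parse_exclude_list(text):
    """Collect the routine names listed between the exclude-list markers."""
    names = []
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN_EXCLUDE_LIST":
            inside = True
        elif line == "END_EXCLUDE_LIST":
            inside = False
        elif inside and line:
            names.append(line)
    return names


# Shortened sample in the same layout as select.tau (illustration only).
SELECT_TAU = """\
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
"""

excluded = parse_exclude_list(SELECT_TAU)
print(excluded)
```

Routines on the exclude list are left uninstrumented, which keeps measurement overhead out of lightweight routines that are called very often.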
34
Getting Access to TAU on Jaguar
- set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
- Choose stub makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  - Makefile.tau-mpi-pdt-pgi (flat profile)
  - Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  - Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
- Binaries of S3D can be found in ~sameer/scratch/S3D-BINARIES
  - withtau
    - papi, multiplecounters, mpi, pdt, pgi options
  - without_tau
35
Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems for productive use
  - Evolve to application-specific performance technology
  - Deal with scale by “full range” performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - More automatically controlled and efficiently used
- Develop next-generation tools and deliver to community
- Open source with support by ParaTools, Inc.
- http://www.cs.uoregon.edu/research/tau
36
Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
  - LLNL, LANL, ORNL, ASC
- PERI