S3D: Performance Impact of Hybrid XT3/XT4
Sameer Shende

Acknowledgements
- Alan Morris [UO]
- Kevin Huck [UO]
- Allen D. Malony [UO]
- Kenneth Roche [ORNL]
- Bronis R. de Supinski [LLNL]
- John Mellor-Crummey [Rice]
- Nick Wright [SDSC]
- Jeff Larkin [Cray, Inc.]
The performance data presented here is available at:

TAU Parallel Performance System
- Multi-level performance instrumentation
  - Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid

The Story So Far...
- Scalability study of S3D using TAU
  - MPI_Wait
  - I/O (WRITE_SAVEFILE)
  - Loop: ComputeSpeciesDiffFlux ( ) [Rice, SDSC]
  - Loop: ReactionRateBounds ( ) [exp]
- 3D scatter plots pointed to a single "slow" node before
- Identifying individual nodes by mapping ranks to nodes within TAU
  - Cray utilities: nodeinfo, xtshowmesh, xtshowcabs
- Ran a 6400-core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)

Total Runtime Breakdown by Events - Time
(Chart; the dominant events are MPI_Wait and WRITE_SAVEFILE.)

Relative Efficiency

MPI Scaling

Relative Efficiency & Speedup for One Event

ParaProf's Source Browser (8 core profile)

Case Study
- Harness testcase
- Platform: Jaguar, the combined Cray XT3/XT4 at ORNL
- 6400 processors
- Goals:
  - Evaluate the performance impact of combined XT3/XT4 nodes on S3D executions
  - Performance evaluation of MPI_Wait
  - Study the mapping of MPI ranks to nodes

TAU: ParaProf Profile

Overall Mean Profile: Exclusive Wallclock Time

Overall Inclusive Time

Mean MFLOPS observed over all ranks

Inclusive Total Instructions Executed

Total Instructions Executed (Exclusive)

Comparing Exclusive PAPI Counters, MFLOPS

3D Scatter Plots
- Plot four routines along the X, Y, Z, and color axes
- Each routine has a range (min, max)
- Each process (rank) has a unique position along the three axes and a unique color
- Allows us to examine the distribution of nodes (clusters)

Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!

3D Triangle Mesh Display
- Plot MPI rank, routine name, and exclusive time along the X, Y, and Z axes
- A fourth metric can be shown as color
- Scalable view
- Suitable for a very large number of processors

MPI_Wait: 3D View

3D View: Zooming In... Jagged Edges!

3D View: Uh Oh!

Zoom, Change Color to L1 Data Cache Misses
The loop in ComputeSpeciesDiffFlux ( ) has a high L1 data cache miss count (red) and takes longer to execute on this "slice" of processors. So do other routines. Slower memory?

Changing Color to MFLOPS
The loop in ComputeSpeciesDiffFlux ( ) achieves lower MFLOPS (dark blue).

Getting Back to MPI_Wait()
Why does MPI_Wait take less time on these cores? What does the profile of MPI_Wait look like?

MPI_Wait - Sorted by Exclusive Time
MPI_Wait takes the most time on rank 3101. It takes 59.6 s on rank 3233 and 29.2 s on rank 3200. It takes the least time on rank 0! How is rank 3101 different from rank 0?

Comparing Ranks 3101 and 0 (extremes)

Comparing Inclusive Times - Same for S3D

Comparing PAPI Floating Point Instructions
PAPI_FP_INS counts are the same, as expected.

Comparing Performance - MFLOPS
For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only 65% of the MFLOPS of rank 3101 (114 vs. 174 MFLOPS)!
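As a quick check of the 65% figure, the ratio of the two observed rates is $114 / 174 \approx 0.655$, i.e. roughly 65%.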

Comparing MFLOPS: Rank 3101 vs. Rank 0
Rank 0 appears to be "slower" than rank 3101. Are there other nodes that are similarly slow, with lower wait times? What does the MPI_Wait profile look like over all nodes?

MPI_Wait Profile
What is this rank?

MPI_Wait Profile
The profile shifts at rank 114! Ranks 0 through 113 take less time in MPI_Wait than the remaining ranks.

Another Shift in MPI_Wait()
This shift is observed in ranks 3200 through 3313. Again, 114 processors... (like ranks 0 through 113). Hmm... How do other routines perform on these ranks? What are the physical node IDs?

MPI_Wait
While MPI_Wait takes less time on these CPUs, other routines take longer. This points to a load imbalance!

Identifying Physical Processors using Metadata

Metadata for Ranks 3200 and 0
Ranks 3200 and 0 both lie on the same physical node, nid03406!
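One way to confirm this from the command line is to search the per-rank profile files for the node name. This is a minimal sketch, assuming TAU's default profile naming (profile.<rank>.<context>.<thread>) and that the hostname is recorded in each rank's profile metadata:

  # Hypothetical check, run in the directory containing the TAU profiles.
  # Lists any node IDs (e.g. nid03406) mentioned in the profiles of ranks 3200 and 0.
  for f in profile.3200.0.0 profile.0.0.0; do
    echo "== $f =="
    grep -o "nid[0-9]*" "$f" | sort -u
  done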

Mapping Ranks from TAU to Physical Processors
Ranks ... lie on processors ...
Ranks ... are also on ...

Results from Cray's nodeinfo Utility
- Processors (physical ids) are located on the XT3 partition
- The XT3 partition has slower DDR-400 memory (5986 MB/s)
- The XT3 partition has a slower SS1 interconnect (1109 MB/s)
- The XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and a faster Seastar2 (SS2) interconnect (2022 MB/s)
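For context, the bandwidth figures above imply that an XT4 node has roughly $7147 / 5986 \approx 1.19$ times the memory bandwidth and $2022 / 1109 \approx 1.82$ times the interconnect bandwidth of an XT3 node, which is consistent with the XT3 "slice" running the memory-intensive loops more slowly.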

Location of Physical Nodes in the Cabinets
Using the Cray xtshowcabs and xtshowmesh utilities. All nodes marked with job "c" came from our S3D job.

xtshowcabs
Nodes marked with a "c" are from our S3D run. What does the mesh look like?

xtshowmesh (1 of 2)
Nodes marked with a "c" are from our S3D run.

xtshowmesh (2 of 2)
Nodes marked with a "c" are from our S3D run.

Conclusions
- Using a combination of XT3 and XT4 nodes slowed down parts of S3D
- The application spends a considerable amount of time spinning/polling in MPI_Wait
- The load imbalance is probably caused by the non-uniform nodes
- Conducted a performance characterization of S3D
  - This data will help derive communication models that explain the observed performance [John Mellor-Crummey, Rice]
  - Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice]
  - I/O characterization of S3D will help identify I/O scaling issues

S3D - Building with TAU
- Change the compiler names in build/make.XT3
  - ftn => tau_f90.sh
  - cc => tau_cc.sh
- Set compile-time environment variables
  - setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
  - Disabled tracking of message communication statistics in TAU (MPI_Comm_compare() is not called inside TAU's MPI wrapper)
  - Choose callpath, PAPI counters, MPI profiling, and PDT for source instrumentation
  - setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    - The selective instrumentation file eliminates instrumentation in lightweight routines
    - Pre-process Fortran source code with cpp before compiling
- Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script (see the consolidated sketch below):
  - export TAU_THROTTLE=1
  - export COUNTER1=GET_TIME_OF_DAY
  - export COUNTER2=PAPI_FP_INS
  - export COUNTER3=PAPI_L1_DCM
  - export COUNTER4=PAPI_TOT_INS
  - export COUNTER5=PAPI_L2_DCM
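A consolidated sketch of the setup above, written in bash (the slide uses csh-style setenv for the compile-time variables; the paths, makefile name, and counter choices come from the slide, and the build step itself is illustrative):

  # Compile-time setup: point the TAU compiler wrappers at the desired configuration.
  export TAU_MAKEFILE=/spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
  export TAU_OPTIONS='-optTauSelectFile=select.tau -optPreProcess'
  # In build/make.XT3, replace ftn with tau_f90.sh and cc with tau_cc.sh, then build S3D as usual.

  # Runtime setup in the job submission script: throttle lightweight events and select metrics.
  export TAU_THROTTLE=1
  export COUNTER1=GET_TIME_OF_DAY
  export COUNTER2=PAPI_FP_INS
  export COUNTER3=PAPI_L1_DCM
  export COUNTER4=PAPI_TOT_INS
  export COUNTER5=PAPI_L2_DCM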

Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION

Getting Access to TAU on Jaguar
- set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
- Choose a stub makefile (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  - Makefile.tau-mpi-pdt-pgi (flat profile)
  - Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  - Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
- Binaries of S3D can be found in ~sameer/scratch/S3D-BINARIES
  - withtau (papi, multiplecounters, mpi, pdt, pgi options)
  - without_tau
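After an instrumented run completes, the resulting profiles can be inspected with TAU's standard tools. A minimal sketch, assuming the multiple-counter configuration above (which writes one MULTI__<metric> directory per counter); the directory name shown is illustrative:

  # Each metric gets its own directory of profile.<rank>.<context>.<thread> files.
  cd MULTI__GET_TIME_OF_DAY    # e.g. wallclock-time profiles
  pprof | less                 # text summary of the flat profile
  paraprof &                   # GUI: bar charts, 3D views, scatter plots, metadata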

Concluding Discussion
- Performance tools must be used effectively
- More intelligent performance systems are needed for productive use
  - Evolve to application-specific performance technology
  - Deal with scale by "full range" performance exploration
  - Autonomic and integrated tools
  - Knowledge-based and knowledge-driven process
- Performance observation methods do not necessarily need to change in a fundamental sense
  - They should be more automatically controlled and efficiently used
- Develop next-generation tools and deliver them to the community
  - Open source, with support by ParaTools, Inc.

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
  - LLNL, LANL, ORNL, ASC
- PERI