S3D: Comparing Performance of XT3+XT4 with XT4
Sameer Shende

Acknowledgements
 Alan Morris [UO]
 Kevin Huck [UO]
 Allen D. Malony [UO]
 Kenneth Roche [ORNL]
 Bronis R. de Supinski [LLNL]
 John Mellor-Crummey [Rice]
 Nick Wright [SDSC]
 Jeff Larkin [Cray, Inc.]
The performance data presented here is available at:

TAU Parallel Performance System
 Multi-level performance instrumentation
   Multi-language automatic source instrumentation
 Flexible and configurable performance measurement
 Widely-ported parallel performance profiling system
   Computer system architectures and operating systems
   Different programming languages and compilers
 Support for multiple parallel programming paradigms
   Multi-threading, message passing, mixed-mode, hybrid

The Story So Far...
 Scalability study of S3D using TAU
 3D scatter plots, and the mapping of ranks to physical processors, point to the XT3/XT4 partitioning
 The memory and network on the XT3 partition cause the rest of the application to slow down
 Hypothesis: running S3D on a 'pure' XT4 system will improve performance significantly
 Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4; see the job-script sketch below)...
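A minimal job-script sketch of how such a pure-XT4 run can be requested. Only the -lfeature=xt4 directive comes from the slides; the size request, the script layout, and the executable name s3d.x are assumptions:

#!/bin/bash
#PBS -lfeature=xt4       # restrict the job to XT4 nodes (directive from the slide)
#PBS -l size=6400        # assumed: core count is requested via 'size' on Cray XT
cd $PBS_O_WORKDIR        # run from the directory the job was submitted from
aprun -n 6400 ./s3d.x    # launch S3D on 6400 cores; executable name is assumed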

3D Scatter Plots
 Plot four routines along the X, Y, Z, and color axes
 Each routine has a range (min, max)
 Each process (rank) has a unique position along the three axes and a unique color
 Allows us to examine the distribution of nodes (clusters); a sketch of opening this view follows below
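A minimal sketch of bringing up this view: ParaProf is launched on a directory of TAU profiles, and the scatter plot is then chosen from its 3D visualization window. The profile path here is hypothetical; with the multiplecounters option TAU writes one MULTI__<metric> directory per metric:

# open the wallclock-time profiles of the run in ParaProf
paraprof ~/scratch/s3d-run/MULTI__GET_TIME_OF_DAY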

Scatter Plot: 6400 Cores, XT3+XT4 - Two Clusters!
Previous work showed that the blue nodes are XT3 and the red nodes are XT4.

3D Triangle Mesh Display
 Plot MPI rank, routine name, and exclusive time along the X, Y, and Z axes
 A fourth metric can be shown as color
 Scalable view
 Suitable for a very large number of processors

XT3+XT4: MPI_Wait
The gap represents the XT3 nodes.

3D View: Large MPI_Wait Times on Most CPUs
To improve performance, we must reduce the MPI_Wait time on the other (non-XT3) CPUs.

3D View: XT3 Partition, Imbalance
On the XT3 partition, MPI_Wait takes less time while other routines take more time!

Getting Back to MPI_Wait()
MPI_Wait takes less time on the XT3 nodes; other routines take longer.

XT3+XT4: MPI_Wait, Sorted by Exclusive Time
MPI_Wait takes seconds on rank 3101, and it takes seconds on rank 0! Rank 3101 is on XT4; rank 0 is on XT3.

Comparing XT4 and XT3 Ranks (Best vs. Worst)

Improving S3D Performance
 Hypothesis: running S3D on a 'pure' XT4 system will improve performance significantly and reduce the time spent idling in MPI_Wait

XT4 Profile: Main Window

XT4: Mean Profile Sorted by Exclusive Time
MPI_Wait has moved down!

XT4: Mean Profile Sorted by Inclusive Time

Comparing XT4 with XT3+XT4
On XT4, MPI_Wait takes 26% of the time it takes on the combined XT3+XT4 run!

Comparing Mean Inclusive Time

XT4: 3D View
The "exp" loop [~1 GFlop] takes the most time now!

XT3+XT4: Scatter Plot (Before)

XT4: Scatter Plot (After)
MPI_Wait now takes from 78 to 121 seconds!

Comparing Performance
 Hypothesis confirmed: XT4 is faster than XT3+XT4
   Inclusive time is down from 1935 s to 1702 s, a 12% improvement ((1935 - 1702) / 1935 ≈ 12%)
   Saved about 414 hours of aggregate compute time (233 s saved × 6400 cores ≈ 1.49 million core-seconds)!
 The reduction in MPI_Wait time is the most significant: 390 s (mean) down to 104 s (mean)
 Lessons learned:
   Slower XT3 nodes can have a significant impact on a large-scale S3D run
   The S3D harness testcase does not perform well on non-homogeneous nodes
   We recommend running S3D on the XT4 partition only: #PBS -lfeature=xt4 (see the job-script sketch above)

Discussion
 Did we get optimal performance on the XT4 nodes?
 Are the nodes now performing at uniformly similar rates?
 Let us look at the standard deviation plot of all routines...

XT4: Standard Deviation
I/O routines!

Scatter Plot: One CPU...
WRITE_SAVEFILE

WRITE_SAVEFILE
Rank 0 is quicker!

MPI_Barrier

I/O is not performed uniformly

I/O Becomes a Bottleneck: XT3, XT3+XT4...
MPI_Wait
WRITE_SAVEFILE

Conclusions
 Using the pure XT4 partition improved performance by 12%
 Need to investigate I/O on XT4/Lustre further to achieve better performance...
 Discuss I/O issues with the S3D developers

S3D - Building with TAU
 Change the name of the compiler in build/make.XT3
   ftn => tau_f90.sh
   cc => tau_cc.sh
 Set compile-time environment variables
   setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
   Tracking of message communication statistics is disabled in TAU (MPI_Comm_compare() is not called inside TAU's MPI wrapper)
   Choose callpath profiling, PAPI counters, MPI profiling, and PDT source instrumentation through the makefile name
   setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
   The selective instrumentation file eliminates instrumentation in lightweight routines
   Fortran source code is pre-processed with cpp before compiling
 Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script (see the sketch below):
   export TAU_THROTTLE=1
   export COUNTER1=GET_TIME_OF_DAY
   export COUNTER2=PAPI_FP_INS
   export COUNTER3=PAPI_L1_DCM
   export COUNTER4=PAPI_TOT_INS
   export COUNTER5=PAPI_L2_DCM
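Assembled, the tail of the job submission script might look like the following minimal sketch; the environment variables are from the slide, while the comments, the aprun line, and the executable name are assumptions:

export TAU_THROTTLE=1             # throttle very frequently called, lightweight events
export COUNTER1=GET_TIME_OF_DAY   # wallclock time
export COUNTER2=PAPI_FP_INS       # floating-point instructions
export COUNTER3=PAPI_L1_DCM       # L1 data-cache misses
export COUNTER4=PAPI_TOT_INS      # total instructions
export COUNTER5=PAPI_L2_DCM       # L2 data-cache misses
aprun -n 6400 ./s3d.x             # launch the instrumented binary; name is assumed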

Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
The loops directive enables loop-level instrumentation in all routines ("#" is the wildcard matching every routine name).

Getting Access to TAU on Jaguar
 set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
 Choose a stub makefile (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.* (used in the sketch below)
   Makefile.tau-mpi-pdt-pgi (flat profile)
   Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
   Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
 Binaries of S3D can be found in ~sameer/scratch/S3D-BINARIES
   with_tau (built with the papi, multiplecounters, mpi, pdt, pgi options)
   without_tau
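A minimal csh sketch of building an instrumented S3D object file with this installation; the paths and makefile name are from the slide, while the Fortran file name solve_driver.f90 is hypothetical:

set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-mpi-pdt-pgi
tau_f90.sh -c solve_driver.f90   # the wrapper parses, instruments, and compiles the file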

Concluding Discussion
 Performance tools must be used effectively
   More intelligent performance systems for productive use
   Evolve toward application-specific performance technology
   Deal with scale by "full range" performance exploration
   Autonomic and integrated tools
   Knowledge-based and knowledge-driven processes
 Performance observation methods do not necessarily need to change in a fundamental sense, but they must be controlled more automatically and used more efficiently
 Develop next-generation tools and deliver them to the community
   Open source, with support by ParaTools, Inc.

Support Acknowledgements
 Department of Energy (DOE)
   Office of Science
   LLNL, LANL, ORNL, ASC
 PERI