Extreme Performance Engineering: Petascale and Heterogeneous Systems
Allen D. Malony
Department of Computer and Information Science, University of Oregon
UVSQ, October 21, 2010

Outline
- Part 1:
  - Scalable systems and productivity
  - Role of performance knowledge
  - Performance engineering claims
  - Model-based performance engineering and expectations
- Part 2:
  - Introduction to the TAU performance system
- Part 3:
  - Performance data mining
  - Heterogeneous performance measurement (TAUcuda)
  - Parallel performance monitoring (TAUmon)

Parallel Performance Engineering and Productivity
- Scalable, optimized applications deliver on the HPC promise
- Optimization through a performance engineering process: understand performance complexity and inefficiencies; tune the application to run optimally at scale
- How do we make the process more effective and productive, and what performance technology should be used?
- Performance technology is part of a larger environment: programmability, reusability, portability, robustness; application development and optimization productivity
- The process, the performance technology, and its use will change as parallel systems evolve and grow to extreme scale
- Goal: deliver effective performance with high productivity value, now and in the future

Parallel Performance Engineering Process
- Traditionally an empirically based approach: observation, experimentation, diagnosis, tuning
- Performance technology developed for each level:
  - Performance observation: instrumentation, measurement, analysis, visualization
  - Performance experimentation: experiment management, performance storage
  - Performance diagnosis: data mining, models, expert systems
(Diagram: cycle from observation to experimentation (characterization), diagnosis (hypotheses, properties), and tuning)

Performance Technology and Scale
- What does it mean for performance technology to scale? Instrumentation, measurement, analysis, tuning
- Scaling complicates observation and analysis:
  - Performance data size and value: standard approaches deliver a lot of data with little value
  - Measurement overhead, intrusion, perturbation, noise: a tradeoff with analysis accuracy; "noise" in the system distorts performance
  - Analysis complexity increases: more events, larger data, more parallelism
- The traditional empirical process breaks down at extremes

Performance Engineering Process and Scale
- What will enhance productive application development?
- Address application (developer) requirements at scale
- Support application-specific performance views:
  - What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Tied to application domain and algorithms
  - Process and tools must be more application-aware
- Support online, dynamic performance analysis:
  - Complement static, offline assessment and adjustment
  - Introspection of execution (performance) dynamics
  - Requires scalable performance monitoring
  - Integration with runtime infrastructure

Whole Performance Evaluation
- Extreme-scale performance is an optimized orchestration of application, processor, memory, network, and I/O
- Reductionist approaches to performance will be unable to support optimization and productivity objectives
- An application-level-only performance view is myopic: the interplay of hardware, software, and system components ultimately determines how performance is delivered
- Performance should be evaluated in toto: application and system components together; understand the effects of performance interactions; identify opportunities for optimization across levels
- Need a whole-performance evaluation practice

Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent: even with intelligent, application-specific tools, deciding what to analyze is difficult
- More automation and knowledge-based decision making:
  - Build these capabilities into performance tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise

Role of Structure in Performance Understanding
- Dealing with the increased complexity of the problem (measurement and analysis) should not be the only focus
- There is fundamental structure in the computational model and in an application's behavior
- Performance is a consequence of structural forms and behaviors
- It is the relationship between performance data and application semantics that is important to understand
- This need not be complex (not as complex as it seems)
(Image: Shigeo Fukuda, "Lunch with a Helmet On", 1987)

Parallel Performance Diagnosis

Extreme Performance Engineering Claims
- Claim 1: "Scaling up" current performance measurement and analysis techniques is insufficient for extreme-scale (ES) performance diagnosis and tuning.
- Claim 2: Performance of ES applications and systems should be evaluated in toto, to understand the effects of performance interactions and identify opportunities for optimization based on whole-performance engineering.
- Claim 3: Adaptive ES applications might require performance monitoring, dynamic control, and tight coupling between the application and system at run time.
- Claim 4: Optimization of ES applications requires model-level knowledge of parallel algorithms, parallel computation, the system, and performance.

Model-based Performance Engineering
- Performance engineering tools and practice must incorporate a performance knowledge discovery process
- Focus on and automate performance problem identification, and use it to guide tuning decisions
- Model-oriented knowledge:
  - Computational semantics of the application
  - Symbolic models for algorithms
  - Performance models for system architectures / components
- Use models to define performance expectations
- A higher-level abstraction for performance investigation: application developers can be more directly involved in the performance engineering process

Performance Expectations
- Provide a context for understanding the performance behavior of the application and system
- Traditional performance analysis implicitly requires an expectation: users are forced to reason from the perspective of absolute performance for every performance experiment and every application operation; peak measures provide an absolute upper bound
- Empirical data alone does not lead to performance understanding and is insufficient for optimization: empirical measurements lack a context for determining whether the operations under consideration are performing well or poorly

Sources of Performance Expectations
- Computational model: the operational semantics of a parallel application, specified by its algorithms and structure, define a space of relevant performance behavior
- Symbolic performance model: a symbolic or analytical model provides the relevant parameters for the performance model, which generates expectations
- Historical performance data: provides real execution history and empirical data on similar platforms
- Relative performance data: used to compare similar application operations across architectural components
- Architectural parameters: based on what is known of processor, memory, and communication performance

Model Use
- Performance models are derived from application knowledge, performance experiments, and symbolic analysis
- Models are used to predict the performance of individual system components, such as communication and computation, as well as the application as a whole
- Models can be compared to empirical measurements, manually or automatically, to pinpoint or dynamically rectify performance problems
- Models can also be evaluated at runtime to reduce perturbation and data management burdens, and to dynamically reconfigure system software

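To make the comparison concrete, here is a minimal sketch of a runtime expectation check against a symbolic model. Everything in it is an illustrative assumption (the model function, its latency/bandwidth constants, and the 25% tolerance); a real model would come from algorithm and architecture analysis.

```c
/* Sketch: compare a measured operation against a model expectation.
 * Model constants and the tolerance are illustrative assumptions. */
#include <stdio.h>
#include <mpi.h>

/* Hypothetical symbolic model of halo-exchange cost:
 * per-neighbor latency plus message bytes over link bandwidth. */
static double model_halo_time(long nbytes, int neighbors) {
    const double latency = 5.0e-6;   /* assumed 5 us per message */
    const double bandwidth = 2.0e9;  /* assumed 2 GB/s per link  */
    return neighbors * (latency + (double)nbytes / bandwidth);
}

static void halo_exchange(long nbytes, int neighbors) {
    (void)nbytes; (void)neighbors;   /* real neighbor exchange here */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    long nbytes = 1L << 20;
    int neighbors = 4;

    double t0 = MPI_Wtime();
    halo_exchange(nbytes, neighbors);
    double measured = MPI_Wtime() - t0;

    /* Flag operations running well outside the model expectation. */
    double expected = model_halo_time(nbytes, neighbors);
    if (measured > 1.25 * expected)
        fprintf(stderr, "halo exchange %.1fx over expectation\n",
                measured / expected);

    MPI_Finalize();
    return 0;
}
```
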
Where are we now?
- Large-scale performance measurement and analysis: TAU performance system, HPCToolkit; running on LCF machines
- SciDAC PERI: automatic performance tuning; application / architecture modeling; tiger teams
- Performance database: PERI DB project; TAU PerfDMF

Where are we now? (2)
- Performance data mining: TAU PerfExplorer; multi-experiment data mining analysis; scripting, inference
- SciDAC TASCS: Computational Quality of Service (CQoS); configuration, adaptation
- DOE FastOS (with Argonne): Linux (ZeptoOS) performance measurement (KTAU); scalable performance monitoring

Where are we now? (3)
- Language integration: OpenMP, UPC, Charm++ (Projections / TAU)
- SciDAC FACETS: load-balance modeling; automated performance testing

What do we need?
- Performance knowledge engineering, at all levels: support the building of models, represented in a form the tools can reason about
- Understanding of "performance" interactions: between integrated components; control and data interactions; properties of concurrency and dependencies
- With respect to scientific problem formulation: translate it into performance requirements
- More robust tools that are used more broadly

TAU Performance System Components
- Program analysis: PDT
- Parallel profile analysis: ParaProf
- Performance data management: PerfDMF
- Performance data mining: PerfExplorer
- Performance monitoring: TAUoverSupermon
(Diagram: TAU architecture)

Performance Data Management (PerfDMF)
- Provides an open, flexible framework to support common data management tasks and foster multi-experiment performance evaluation
- An extensible toolkit to promote integration and reuse across available performance tools
- Originally designed to address critical TAU requirements
- Can import multiple profile formats
- Supports multiple DBMSs: PostgreSQL, MySQL, ...
- Profile query and analysis API
- Reference implementation for the PERI-DB project

Metadata Collection
- Integration of XML metadata with each parallel profile
- Three ways to incorporate metadata:
  - Measured hardware/system information (TAU, PERI-DB): CPU speed, memory in GB, MPI node IDs, ...
  - Application instrumentation (application-specific): TAU_METADATA() inserts any name/value pair, e.g. application parameters, input data, domain decomposition (see the sketch below)
  - PerfDMF data management tools can incorporate an XML file of additional metadata: compiler flags, submission scripts, input files, ...
- Metadata can be imported from / exported to PERI-DB

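As a minimal sketch, application-side metadata tagging with TAU's TAU_METADATA() macro might look as follows; the field names and values are made-up examples.

```c
#include <stdio.h>
#include <TAU.h>   /* TAU instrumentation API */

int main(int argc, char **argv) {
    (void)argc; (void)argv;
    TAU_PROFILE("main", "", TAU_DEFAULT);

    /* Attach application-specific name/value pairs to the profile;
     * these particular fields are illustrative only. */
    TAU_METADATA("input deck", "c2h4_benchmark.in");
    TAU_METADATA("domain decomposition", "30x30x20");

    char cores[16];
    snprintf(cores, sizeof(cores), "%d", 12000);
    TAU_METADATA("target core count", cores);

    /* ... application work ... */
    return 0;
}
```
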
Performance Data Mining / Analytics
- Conduct a systematic and scalable analysis process: multi-experiment performance analysis; support automation, collaboration, and reuse
- Performance knowledge discovery framework: data mining analysis applied to parallel performance data (comparative, clustering, correlation, dimension reduction, ...), using the existing TAU infrastructure
- PerfExplorer v1 performance data mining framework:
  - Multiple experiments and parametric studies
  - Integrates available statistics and data mining packages: Weka, R, Matlab / Octave
  - Applies data mining operations in an interactive environment

How to explain performance?
- Should not just redescribe the performance results, but explain the performance phenomena: What are the causes of the performance observed? What are the factors, and how do they interrelate?
- Performance analytics, forensics, and decision support
- Need to add knowledge to do more intelligent things: automated analysis needs good, informed feedback (iterative tuning, performance regression testing); performance model generation requires interpretation
- We need better methods and tools for integrating meta-information and for knowledge-based performance problem solving

PerfExplorer 2.0
- Performance data mining framework
- Parallel profile performance data and metadata
- Programmable, extensible workflow automation
- Rule-based inference for expert-system analysis

Automation & Knowledge Engineering

S3D Scalability Study
- S3D: a flow solver for simulation of turbulent combustion; a targeted application of the DOE SciDAC PERI tiger team
- Performance scaling study (C2H4 benchmark):
  - Cray XT3+XT4 hybrid and XT4 (ORNL, Jaguar)
  - IBM BG/P (ANL, Intrepid)
  - Weak scaling (1 to 12000 cores)
  - Evaluate scaling of code regions and MPI
- A demonstration of scalable performance measurement, analysis, and visualization
- Understanding scalability requires an environment for performance investigation: performance factors relative to the main events

S3D Computational Structure
- Domain decomposition with wavefront evaluation and recursion dependences in all 3 grid directions
- Communication is affected by cell location: in a 4x4 example, corner cells have 2 neighbors, edge cells 3 neighbors, and center cells 4 neighbors
(Figure: 4x4 process grid, ranks 0-15; data from Sweep3D on a Linux cluster, 16 processes)

S3D on a Hybrid System (Cray XT3 + XT4)
- 6400-core execution on the transitional Jaguar configuration (ORNL Jaguar, Cray XT3/XT4)
(Profile view with MPI_Wait highlighted)

Zoom View of the Parallel Profile (S3D, XT3+XT4)
- The gap represents XT3 nodes
- MPI_Wait takes less time; other routines take more time

Scatterplot (S3D, XT3+XT4)
- Scatterplot of the top three routines (6400 cores)
- Process metadata is used to color performance by machine type (2 clusters)
- Memory speed accounts for the performance difference

S3D Run on XT4 Only
- Better balance across nodes
- More uniform performance

Capturing Analysis and Inference Knowledge
- Create a PerfExplorer workflow for the S3D case study:
  load data → extract non-callpath data → extract top 5 events → load metadata → correlate events with metadata → process inference rules

PerfExplorer Workflow Applied to S3D Performance

  --------------- JPython test script start ------------
  doing single trial analysis for sweep3d on jaguar
  Loading Rules...
  Reading rules: rules/GeneralRules.drl... done.
  Reading rules: rules/ApplicationRules.drl... done.
  Reading rules: rules/MachineRules.drl... done.
  loading the data...
  Getting top 10 events (sorted by exclusive time)...
  Firing rules...
  MPI_Recv(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
  MPI_Send(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
  MPI_Send(): "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is correlated with the metadata field "total neighbors". The correlation is 0.8596 (moderate).
  SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9792 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9785 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9819 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9810 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is correlated with the metadata field "Memory Speed (MB/s)". The correlation is 0.9985 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is correlated with the metadata field "Seastar Speed (MB/s)". The correlation is 0.9964 (very high).
  SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9980 (very high).
  SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9960 (very high).
  ...done with rules.
  ---------------- JPython test script end -------------

- Identified hardware differences
- Correlated communication behavior with metadata
- Cray-specific inference

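For reference, the coefficients reported in this output are ordinary Pearson correlations computed across processes between a metric and a metadata field. A self-contained illustration (with toy data, not S3D's):

```c
#include <math.h>
#include <stdio.h>

/* Pearson correlation r between per-process metric values x and
 * per-process metadata values y (r = 1 direct, r = -1 inverse). */
static double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i];  sy += y[i];
        sxx += x[i] * x[i];  syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void) {
    /* Toy data: MPI_Send call counts vs. "total neighbors". */
    double calls[]     = {2, 3, 3, 4, 3, 2};
    double neighbors[] = {2, 3, 3, 4, 3, 2};
    printf("r = %.4f\n", pearson(calls, neighbors, 6));  /* 1.0000 */
    return 0;
}
```
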
S3D Weak Scaling (Cray XT4, Jaguar)
(C2H4 benchmark)

S3D Weak Scaling (IBM BG/P, Intrepid)
(C2H4 benchmark; Jaguar vs. Intrepid comparison)

S3D Total Runtime Breakdown by Events (Jaguar)

S3D FP Instructions / L1 Data Cache Miss (Jaguar)

S3D Total Runtime Breakdown by Events (Intrepid)

S3D Relative Efficiency / Speedup by Event (Jaguar)

S3D Relative Efficiency by Event (Intrepid)

S3D Relative Speedup by Event (Intrepid)

S3D Event Correlation to Total Time (Jaguar)
- r = 1 implies direct correlation

S3D Event Correlation to Total Time (Intrepid)

S3D 3D Correlation Cube (Intrepid, MPI_Wait)
- GETRATES_I vs. RATI_I vs. RATX_I vs. MPI_Wait

Heterogeneous Parallel Systems and Performance
- Heterogeneous parallel systems are highly relevant today
- Heterogeneous hardware technology is more accessible:
  - Multicore processors (e.g., 4-core, 6-core, 8-core, 16-core)
  - Manycore (throughput) accelerators (e.g., Tesla, Fermi)
  - High-performance engines (e.g., Intel)
  - Special-purpose components (e.g., FPGAs)
- Performance is the main driving concern: heterogeneity is an important (the?) path to extreme scale
- Heterogeneous software technology is needed to get that performance: more sophisticated parallel programming environments; integrated parallel performance tools that support heterogeneous performance models and perspectives

Implications for Parallel Performance Tools
- The current status quo is somewhat comfortable:
  - Mostly homogeneous parallel systems and software
  - Shared-memory multithreading (OpenMP); distributed-memory message passing (MPI)
  - Parallel computational models are relatively stable (simple), and the corresponding performance models are relatively tractable
  - Parallel performance tools can keep up and evolve
- Heterogeneity creates richer computational potential, resulting in greater performance diversity and complexity
- Heterogeneous systems will utilize more sophisticated programming and runtime environments
- Performance tools have to support richer computation models and more versatile performance perspectives

Heterogeneous Performance Views
- Want to create performance views that capture heterogeneous concurrency and execution behavior:
  - Reflect interactions between heterogeneous components
  - Capture performance semantics relative to the computation model
  - Assimilate performance for all execution paths into a shared view
- Existing parallel performance tools are CPU (host)-centric:
  - Event-based sampling (not appropriate for accelerators)
  - Direct measurement (through instrumentation of events)
- What perspective does the host have of the other components? It determines the semantics of the measurement data and the assumptions about behavior and interactions
- Performance views may have to work with reduced data

Task-based Performance View
- Consider the "task" abstraction for a GPU accelerator scenario:
  - The host regards external execution as a task
  - Tasks operate concurrently with respect to the host
  - Requires support for tracking asynchronous execution
- The host creates the measurement perspective for the external task: it maintains local and remote performance data
- Tasks may have limited measurement support and may depend on the host for performance data I/O; performance data might be received from the external task
- How to create a view of heterogeneous external performance? (see the sketch below)

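The host-side task view can be illustrated with plain CUDA events (independent of TAUcuda): the host brackets an asynchronous kernel launch with events, continues concurrently, and harvests the task's timing later. The kernel and problem size below are placeholders.

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {        /* placeholder task */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    /* Events bracket the asynchronous "task" so the host can
     * measure it without blocking at launch time. */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d, n);      /* returns immediately */
    cudaEventRecord(stop, 0);

    /* ... host work proceeds concurrently with the task ... */

    cudaEventSynchronize(stop);                 /* harvest the timing */
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU task time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```
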
TAUcuda Performance Measurement (Version 2)
- Performance measurement of heterogeneous applications using CUDA
- CUDA system architecture is implemented by the CUDA libraries: driver and device (cuXXX) libraries; runtime (cudaYYY) library
- Existing tool support (Parallel Nsight (Nexus), CUDA Profiler) is not intended to integrate with other HPC performance tools
- TAUcuda (v2) is built on an experimental Linux CUDA driver: driver R190.86 supports a callback interface
- Currently working with NVIDIA to develop a version based on a production performance-tools API

TAUcuda Architecture
(Diagram: TAU events and TAUcuda events)

TAUcuda Instrumentation
(Diagram: TAU events and TAUcuda events in instrumented code)

TAUcuda Experimentation Environments
- University of Oregon:
  - Linux workstation: dual quad-core Intel Xeon, GTX 280
  - GPU cluster (Mist): four dual quad-core Intel Xeon server nodes; two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)
- Argonne National Laboratory (Eureka): 100 dual quad-core nodes with NVIDIA Quadro Plex S4 (200 Quadro FX5600 GPUs, 2 per S4)
- University of Illinois at Urbana-Champaign GPU cluster (AC): 32 nodes, one S1070 each (4 GPUs per node)

CUDA Linpack Profile (4 processes, 4 GPUs)
- GPU-accelerated Linpack benchmark (NVIDIA)

CUDA Linpack Trace
- MPI communication (yellow); CUDA memory transfer (white)

SHOC Stencil2D (512 iterations, 4 CPU×GPU)
- Scalable HeterOgeneous Computing (SHOC) benchmark suite: CUDA / OpenCL kernels and microbenchmarks (ORNL)
- CUDA memory transfer (white)

Scalable Parallel Performance Monitoring
- Performance problem analysis is increasingly complex:
  - Multi-core, heterogeneous, and extreme-scale computing
  - Adaptive algorithms and runtime application tuning
  - Performance dynamics: variability within and between executions
- A new measurement and analysis perspective: static, offline analysis → dynamic, online analysis
  - Scalable runtime analysis of parallel performance data
  - Performance feedback to the application for adaptive control
  - Integrated performance monitoring (measurement + query)
  - Co-allocation of additional (tool-specific) system resources
- Goal: scalable, integrated parallel performance monitoring

Parallel Performance Measurement and Data
- Parallel performance tools measure concurrently: scaling dictates "local" measurements (profile, trace) that save data with local context (processes or threads), done without synchronization or central control
- As a result, the parallel performance state is globally distributed; logically it is part of the application's global data space
- Offline: data is output at the end for post-mortem analysis
- Online: the performance state is accessed for runtime analysis
- Definition: monitoring is online access to the parallel performance (data) state; it may or may not involve runtime analysis

Monitoring for Performance Dynamics
- Runtime access to parallel performance data must be scalable and lightweight; it raises concerns of overhead and intrusion
- Support for performance-adaptive, dynamic applications
- Alternative 1: extend existing performance measurement into an integrated monitoring infrastructure; disadvantage: maintaining our own monitoring framework
- Alternative 2: couple with other monitoring infrastructure; leverage scalable middleware from other supported projects; challenge: measurement system / monitor integration

Performance Dynamics: Parallel Profile Snapshots
- Profile snapshots are parallel profiles recorded at runtime
- Show performance profile dynamics (all profile types allowed)
(Diagram: information vs. overhead tradeoff — profiles < profile snapshots < traces)
A. Morris, W. Spear, A. Malony, and S. Shende, "Observing Performance Dynamics using Parallel Profile Snapshots," European Conference on Parallel Processing (EuroPar), 2008.

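A minimal sketch of periodic snapshot recording in an iterative code, assuming TAU's TAU_PROFILE_SNAPSHOT(name) macro (check the TAU documentation for the exact interface in your release):

```c
#include <stdio.h>
#include <TAU.h>

static void timestep(int it) { (void)it; /* application work */ }

void solve(int nsteps) {
    TAU_PROFILE("solve", "", TAU_DEFAULT);
    for (int it = 0; it < nsteps; it++) {
        timestep(it);
        /* Record a named snapshot every 10 iterations; each one
         * captures the cumulative profile at this point in time. */
        if (it % 10 == 0) {
            char name[32];
            snprintf(name, sizeof(name), "iteration_%d", it);
            TAU_PROFILE_SNAPSHOT(name);
        }
    }
}

int main(void) {
    solve(100);
    return 0;
}
```
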
Parallel Profile Snapshots of FLASH 3.0 (UIC)
- Simulation of astrophysical thermonuclear flashes
- Snapshots show profile differences since the last snapshot
- Captures all events since the beginning, per thread
- Mean profile calculated post-mortem
- Highlights changes in performance per iteration and at checkpointing
(Figure: initialization, checkpointing, and finalization phases visible)

FLASH 3.0 Performance Dynamics (Periodic)
(Figure: periodic behavior of the INTRFC event)

TAUmon: Design
- Design a transport-neutral application monitoring framework:
  - Based on prior / existing work with various transport systems: Supermon, MRNet, MPI
  - Enable efficient development of monitoring functionality
- Objectives:
  - Scalable access to a running application's performance, both at the end of the application (before parallel teardown) and while the application is still running
  - Support for scalable performance data analysis: reduction, statistical evaluation
  - Feedback (data, control) to the application
- Monitoring engineering and performance efficiency issues

TAUmon: Architecture
(Diagram: MPI processes 0..P-1, each with per-thread TAU profiles, feeding the TAUmon monitoring infrastructure)

TAUmon: Current Usage
- TAU_ONLINE_DUMP() collective operation in the application (see the sketch below):
  - Called by all threads / processes (originally to output profiles)
  - Arguments will specify the data analysis operation (future)
- The appropriate version of TAU is selected for the transport system:
  - TAUmonMRNet: TAUmon using the MRNet infrastructure
  - TAUmonMPI: TAUmon using the MPI infrastructure
- The user instruments the application with TAU support for the desired monitoring transport system (temporary)
- The user submits the instrumented application to the parallel job system; other launch systems must be submitted along with the application to the job scheduler as needed (different machine-specific job-submission scripts)

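A minimal sketch of how the collective call might sit in an MPI application; TAU_ONLINE_DUMP() is named on this slide, but treat the surrounding details (header, call frequency) as assumptions.

```c
#include <mpi.h>
#include <TAU.h>

static void timestep(int it) { (void)it; /* application work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    TAU_PROFILE("main", "", TAU_DEFAULT);

    for (int it = 0; it < 100; it++) {
        timestep(it);
        /* Collective: every process must reach this call so the
         * monitoring transport can gather current profile data. */
        if (it % 10 == 0)
            TAU_ONLINE_DUMP();
    }

    MPI_Finalize();
    return 0;
}
```
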
TAUmon: Parallel Profile Data Analysis
- Total parallel profile data size depends on: #events × size of event × #execution threads (event size depends on #metrics)
- Example: 200 events × 100 bytes × 64,000 threads = 1.28 GB (worked below)
- Monitoring operations:
  - Periodic profile data output (à la profile snapshots)
  - Event unification
  - Basic statistics: mean, min/max, standard deviation, ...
  - Advanced statistics: histogram, clustering, ...
- Strong motivation to implement the operations in parallel

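The sizing formula above, written out with the slide's example numbers:

```c
#include <stdio.h>

/* Total profile volume = #events x bytes-per-event x #threads
 * (bytes per event grows with the number of metrics recorded). */
int main(void) {
    long events      = 200;     /* instrumented events   */
    long event_bytes = 100;     /* per-event record size */
    long threads     = 64000;   /* execution threads     */

    double total = (double)events * event_bytes * threads;
    printf("total profile volume: %.2f GB\n", total / 1e9);  /* 1.28 */
    return 0;
}
```
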
TAUmonMRNet (a.k.a. ToM Revisited)
- TAU over MRNet (ToM):
  - Previous work with MRNet 2.1 (Cluster 2008 paper)
  - 1-phase and 3-phase filters
  - Explored overlay networks with different fan-outs (nodes)
- TAUmon re-engineered for MRNet 3.0 (just released): re-implements ToM functionality using the new MRNet support
- The current implementation uses a pre-release MRNet 3.0 version; testing with the released version

Monitoring Operation with MRNet
- The application collectively invokes TAU_ONLINE_DUMP() to start monitoring operations on current performance information
- TAU data is accessed and sent through MRNet's communication API via streams and filters
- Filters perform the appropriate aggregation operations on the data
- The TAU front-end is responsible for collecting the data, storing it, and eventually delivering it to a consumer

TAUmonMPI
- Uses MPI-based transport: no separate launch mechanism
- Parallel gather operations are implemented as a binomial heap with staged MPI point-to-point calls (rank 0 serves as root); see the sketch below
- Current limitations:
  - The application shares parallel resources with the monitoring transport
  - Monitoring operations may cause performance intrusion
  - No user control over the transport network configuration
- Potential advantages: easy to use; could be more robust overall

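A generic sketch of the staged point-to-point pattern: a binomial-tree reduction onto rank 0, in the spirit of the TAUmonMPI gather described above (an illustration, not TAU's actual code).

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Reduce (sum) nvals doubles onto rank 0 in ~log2(P) stages. */
static void binomial_reduce(double *vals, int nvals, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *recv = malloc(nvals * sizeof(double));
    for (int stage = 1; stage < size; stage <<= 1) {
        if (rank & stage) {
            /* Subtree complete: ship partial result and drop out. */
            MPI_Send(vals, nvals, MPI_DOUBLE, rank - stage, 0, comm);
            break;
        } else if (rank + stage < size) {
            MPI_Recv(recv, nvals, MPI_DOUBLE, rank + stage, 0, comm,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < nvals; i++)
                vals[i] += recv[i];   /* merge child's contribution */
        }
    }
    free(recv);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double metric[4] = {1.0, 2.0, 3.0, 4.0};  /* stand-in profile data */
    binomial_reduce(metric, 4, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("aggregated metric[0] = %g\n", metric[0]);
    MPI_Finalize();
    return 0;
}
```
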
TAUmon Experiments: PFLOTRAN
- PFLOTRAN: predictive modeling of subsurface reactive flows
- Machines: ORNL Jaguar and UTK Kraken (Cray XT5)
- Processor counts: 16,380 cores and 131K cores; 12K (interactive)
- Instrumentation (source, PMPI):
  - Full: 1131 events total, lots of small routines
  - Partial: 1% exclusive + all MPI; 68 events total (44 MPI, 19 PETSc), with and without callpaths
- Measurements (PAPI):
  - Execution time: TOT_CYC
  - Counters: FP_OPS, TOT_INS, L1 DCA/DCM, L2 TCA/TCM, RES_STL

PFLOTRAN Profile (Exclusive, 16,380 cores)
(TAU ParaProf view: dominant events include MPI_Allreduce, MPI_Waitany, KSPSolve, and oursnesjacobian)

TAUmonMPI Event Unification (Cray XT5)
(Chart: TAU unification and merge time)

TAUmonMPI Scaling (PFLOTRAN, Cray XT5)
- New histogram timings: 12,288 cores: 0.8643 seconds; 24,576 cores: 0.6238 seconds

TAUmonMRNet Scaling (PFLOTRAN, Cray XT5)

TAUmonMPI Scaling (PFLOTRAN, BG/P)

TAUmonMRNet: Snapshots (PFLOTRAN, Cray XT5)
- 4104 cores running with 374 extra cores for MRNet
- Each line bar shows the mean profile of an iteration

TAUmonMRNet: Snapshots (PFLOTRAN, Cray XT5) (2)
- Frames (iterations) 12, 17, and 21 of a 12K-core PFLOTRAN execution
- Shifts in the order of events sorted by average value over time

TAUmonMRNet Snapshots (FLASH, Cray XT5)
- Sod 2D on 1,536 Cray XT5 cores; over 200 iterations; 15 maximum levels of refinement
- MPI_Alltoall plateaus correspond to AMR refinement

TAUmonMRNet Clustering (FLASH, Cray XT5)
(Cluster views: dominant events include MPI_Init, MPI_Alltoall, MPI_Allreduce, COMPRESS_LIST, and DRIVER_COMPUTEDT)

High-Productivity Performance Engineering (POINT)
- Testbed applications: ENZO, NAMD, NEMO3D

Model Oriented Global Optimization (MOGO)
- Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

Performance Refactoring (PRIMA) (UO, Juelich)
- Integration of instrumentation and measurement: core infrastructure; focus on TAU and Scalasca (University of Oregon, Research Centre Juelich)
- Refactor instrumentation, measurement, and analysis; build next-generation tools on the new common foundation
- Extend to involve the SILC project (Juelich, TU Dresden, TU Munich)

Heterogeneous Exascale Software (Vancouver)
- DOE X-Stack program
- Partners: Oak Ridge National Laboratory, University of Oregon, University of Illinois, Georgia Institute of Technology
- Components: compilers; scheduling and runtime resource management; libraries; performance measurement, analysis, and modeling

Call for "Extreme" Performance Engineering
- Improve the performance problem-solving process: a strategy for responding to scaling challenges and technology changes / disruptions
- Evolve process and technology: build on a robust, integrated performance infrastructure; carry forward performance expertise and knowledge
- Model-oriented, with knowledge-based reasoning: community-driven knowledge engineering; automated data / decision analytics
- Closer integration with the application development and execution environment

Support Acknowledgements
- Department of Energy (DOE): Office of Science, ASC/NNSA
- Department of Defense (DoD): HPC Modernization Office (HPCMO)
- NSF: Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich, Argonne National Laboratory, Technical University Dresden
- ParaTools, Inc., NVIDIA