Extreme Performance Engineering: Petascale and Heterogeneous Systems
Allen D. Malony
Department of Computer and Information Science, University of Oregon
UVSQ, October 21, 2010

Outline
- Part 1:
  - Scalable systems and productivity
  - Role of performance knowledge
  - Performance engineering claims
  - Model-based performance engineering and expectations
- Part 2:
  - Introduction to the TAU performance system
- Part 3:
  - Performance data mining
  - Heterogeneous performance measurement (TAUcuda)
  - Parallel performance monitoring (TAUmon)

Parallel Performance Engineering and Productivity
- Scalable, optimized applications deliver on the HPC promise
- Optimization through a performance engineering process: understand performance complexity and inefficiencies; tune the application to run optimally at scale
- How do we make the process more effective and productive, and what performance technology should be used?
- Performance technology is part of a larger environment: programmability, reusability, portability, robustness; application development and optimization productivity
- The process, the performance technology, and its use will change as parallel systems evolve and grow to extreme scale
- Goal: deliver effective performance with high productivity value, now and in the future

Parallel Performance Engineering Process
- Traditionally an empirically based approach: observation, experimentation, diagnosis, tuning
- Performance technology developed for each level:
  - Performance observation: instrumentation, measurement, analysis, visualization
  - Performance experimentation: experiment management, performance storage
  - Performance diagnosis: data mining, models, expert systems
(Diagram: cycle from observation to experimentation (characterization), diagnosis (hypotheses, properties), and tuning)

Performance Technology and Scale
- What does it mean for performance technology to scale? Instrumentation, measurement, analysis, tuning
- Scaling complicates observation and analysis:
  - Performance data size and value: standard approaches deliver a lot of data with little value
  - Measurement overhead, intrusion, perturbation, noise: a tradeoff with analysis accuracy; "noise" in the system distorts performance
  - Analysis complexity increases: more events, larger data, more parallelism
- The traditional empirical process breaks down at extremes

Performance Engineering Process and Scale
- What will enhance productive application development?
- Address application (developer) requirements at scale
- Support application-specific performance views:
  - What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Tied to application domain and algorithms
  - Process and tools must be more application-aware
- Support online, dynamic performance analysis:
  - Complement static, offline assessment and adjustment
  - Introspection of execution (performance) dynamics
  - Requires scalable performance monitoring
  - Integration with runtime infrastructure

Whole Performance Evaluation
- Extreme-scale performance is an optimized orchestration of application, processor, memory, network, and I/O
- Reductionist approaches to performance will be unable to support optimization and productivity objectives
- An application-level-only performance view is myopic: the interplay of hardware, software, and system components ultimately determines how performance is delivered
- Performance should be evaluated in toto: application and system components together; understand the effects of performance interactions; identify opportunities for optimization across levels
- Need a whole-performance evaluation practice

Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent: even with intelligent, application-specific tools, deciding what to analyze is difficult
- More automation and knowledge-based decision making:
  - Build these capabilities into performance tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise

Role of Structure in Performance Understanding
- Dealing with the increased complexity of the problem (measurement and analysis) should not be the only focus
- There is fundamental structure in the computational model and in an application's behavior
- Performance is a consequence of structural forms and behaviors
- It is the relationship between performance data and application semantics that is important to understand
- This need not be complex (not as complex as it seems)
(Image: Shigeo Fukuda, "Lunch with a Helmet On", 1987)

Parallel Performance Diagnosis

Extreme Performance Engineering Claims
- Claim 1: "Scaling up" current performance measurement and analysis techniques is insufficient for extreme-scale (ES) performance diagnosis and tuning.
- Claim 2: Performance of ES applications and systems should be evaluated in toto, to understand the effects of performance interactions and identify opportunities for optimization based on whole-performance engineering.
- Claim 3: Adaptive ES applications might require performance monitoring, dynamic control, and tight coupling between the application and system at run time.
- Claim 4: Optimization of ES applications requires model-level knowledge of parallel algorithms, parallel computation, the system, and performance.

Model-based Performance Engineering
- Performance engineering tools and practice must incorporate a performance knowledge discovery process
- Focus on and automate performance problem identification, and use it to guide tuning decisions
- Model-oriented knowledge:
  - Computational semantics of the application
  - Symbolic models for algorithms
  - Performance models for system architectures / components
- Use models to define performance expectations
- A higher-level abstraction for performance investigation: application developers can be more directly involved in the performance engineering process

Performance Expectations
- Provide a context for understanding the performance behavior of the application and system
- Traditional performance analysis implicitly requires an expectation: users are forced to reason from the perspective of absolute performance for every performance experiment and every application operation; peak measures provide an absolute upper bound
- Empirical data alone does not lead to performance understanding and is insufficient for optimization: empirical measurements lack a context for determining whether the operations under consideration are performing well or poorly

Sources of Performance Expectations
- Computational model: the operational semantics of a parallel application, specified by its algorithms and structure, define a space of relevant performance behavior
- Symbolic performance model: a symbolic or analytical model provides the relevant parameters for the performance model, which generates expectations
- Historical performance data: provides real execution history and empirical data on similar platforms
- Relative performance data: used to compare similar application operations across architectural components
- Architectural parameters: based on what is known of processor, memory, and communication performance

Model Use
- Performance models are derived from application knowledge, performance experiments, and symbolic analysis
- Models are used to predict the performance of individual system components, such as communication and computation, as well as the application as a whole
- Models can be compared to empirical measurements, manually or automatically, to pinpoint or dynamically rectify performance problems
- Models can also be evaluated at runtime to reduce perturbation and data management burdens, and to dynamically reconfigure system software

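To make the comparison concrete, here is a minimal sketch of a runtime expectation check against a symbolic model. Everything in it is an illustrative assumption (the model function, its latency/bandwidth constants, and the 25% tolerance); a real model would come from algorithm and architecture analysis.

```c
/* Sketch: compare a measured operation against a model expectation.
 * Model constants and the tolerance are illustrative assumptions. */
#include <stdio.h>
#include <mpi.h>

/* Hypothetical symbolic model of halo-exchange cost:
 * per-neighbor latency plus message bytes over link bandwidth. */
static double model_halo_time(long nbytes, int neighbors) {
    const double latency = 5.0e-6;   /* assumed 5 us per message */
    const double bandwidth = 2.0e9;  /* assumed 2 GB/s per link  */
    return neighbors * (latency + (double)nbytes / bandwidth);
}

static void halo_exchange(long nbytes, int neighbors) {
    (void)nbytes; (void)neighbors;   /* real neighbor exchange here */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    long nbytes = 1L << 20;
    int neighbors = 4;

    double t0 = MPI_Wtime();
    halo_exchange(nbytes, neighbors);
    double measured = MPI_Wtime() - t0;

    /* Flag operations running well outside the model expectation. */
    double expected = model_halo_time(nbytes, neighbors);
    if (measured > 1.25 * expected)
        fprintf(stderr, "halo exchange %.1fx over expectation\n",
                measured / expected);

    MPI_Finalize();
    return 0;
}
```
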
Where are we now?
- Large-scale performance measurement and analysis: TAU performance system, HPCToolkit; running on LCF machines
- SciDAC PERI: automatic performance tuning; application / architecture modeling; tiger teams
- Performance database: PERI DB project; TAU PerfDMF

Where are we now? (2)
- Performance data mining: TAU PerfExplorer; multi-experiment data mining analysis; scripting, inference
- SciDAC TASCS: Computational Quality of Service (CQoS); configuration, adaptation
- DOE FastOS (with Argonne): Linux (ZeptoOS) performance measurement (KTAU); scalable performance monitoring

Where are we now? (3)
- Language integration: OpenMP, UPC, Charm++ (Projections / TAU)
- SciDAC FACETS: load-balance modeling; automated performance testing

What do we need?
- Performance knowledge engineering, at all levels: support the building of models, represented in a form the tools can reason about
- Understanding of "performance" interactions: between integrated components; control and data interactions; properties of concurrency and dependencies
- With respect to scientific problem formulation: translate it into performance requirements
- More robust tools that are used more broadly

TAU Performance System Components
- Program analysis: PDT
- Parallel profile analysis: ParaProf
- Performance data management: PerfDMF
- Performance data mining: PerfExplorer
- Performance monitoring: TAUoverSupermon
(Diagram: TAU architecture)

Performance Data Management (PerfDMF)
- Provides an open, flexible framework to support common data management tasks and foster multi-experiment performance evaluation
- An extensible toolkit to promote integration and reuse across available performance tools
- Originally designed to address critical TAU requirements
- Can import multiple profile formats
- Supports multiple DBMSs: PostgreSQL, MySQL, ...
- Profile query and analysis API
- Reference implementation for the PERI-DB project

Metadata Collection
- Integration of XML metadata with each parallel profile
- Three ways to incorporate metadata:
  - Measured hardware/system information (TAU, PERI-DB): CPU speed, memory in GB, MPI node IDs, ...
  - Application instrumentation (application-specific): TAU_METADATA() inserts any name/value pair, e.g. application parameters, input data, domain decomposition (see the sketch below)
  - PerfDMF data management tools can incorporate an XML file of additional metadata: compiler flags, submission scripts, input files, ...
- Metadata can be imported from / exported to PERI-DB

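As a minimal sketch, application-side metadata tagging with TAU's TAU_METADATA() macro might look as follows; the field names and values are made-up examples.

```c
#include <stdio.h>
#include <TAU.h>   /* TAU instrumentation API */

int main(int argc, char **argv) {
    (void)argc; (void)argv;
    TAU_PROFILE("main", "", TAU_DEFAULT);

    /* Attach application-specific name/value pairs to the profile;
     * these particular fields are illustrative only. */
    TAU_METADATA("input deck", "c2h4_benchmark.in");
    TAU_METADATA("domain decomposition", "30x30x20");

    char cores[16];
    snprintf(cores, sizeof(cores), "%d", 12000);
    TAU_METADATA("target core count", cores);

    /* ... application work ... */
    return 0;
}
```
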
Performance Data Mining / Analytics
- Conduct a systematic and scalable analysis process: multi-experiment performance analysis; support automation, collaboration, and reuse
- Performance knowledge discovery framework: data mining analysis applied to parallel performance data (comparative, clustering, correlation, dimension reduction, ...), using the existing TAU infrastructure
- PerfExplorer v1 performance data mining framework:
  - Multiple experiments and parametric studies
  - Integrates available statistics and data mining packages: Weka, R, Matlab / Octave
  - Applies data mining operations in an interactive environment

How to explain performance?
- Should not just redescribe the performance results, but explain the performance phenomena: What are the causes of the performance observed? What are the factors, and how do they interrelate?
- Performance analytics, forensics, and decision support
- Need to add knowledge to do more intelligent things: automated analysis needs good, informed feedback (iterative tuning, performance regression testing); performance model generation requires interpretation
- We need better methods and tools for integrating meta-information and for knowledge-based performance problem solving

PerfExplorer 2.0
- Performance data mining framework
- Parallel profile performance data and metadata
- Programmable, extensible workflow automation
- Rule-based inference for expert-system analysis

Automation & Knowledge Engineering

S3D Scalability Study
- S3D: a flow solver for simulation of turbulent combustion; a targeted application of the DOE SciDAC PERI tiger team
- Performance scaling study (C2H4 benchmark):
  - Cray XT3+XT4 hybrid and XT4 (ORNL, Jaguar)
  - IBM BG/P (ANL, Intrepid)
  - Weak scaling (1 to 12000 cores)
  - Evaluate scaling of code regions and MPI
- A demonstration of scalable performance measurement, analysis, and visualization
- Understanding scalability requires an environment for performance investigation: performance factors relative to the main events

S3D Computational Structure
- Domain decomposition with wavefront evaluation and recursion dependences in all 3 grid directions
- Communication is affected by cell location: in a 4x4 example, corner cells have 2 neighbors, edge cells 3 neighbors, and center cells 4 neighbors
(Figure: 4x4 process grid, ranks 0-15; data from Sweep3D on a Linux cluster, 16 processes)

S3D on a Hybrid System (Cray XT3 + XT4)
- 6400-core execution on the transitional Jaguar configuration (ORNL Jaguar, Cray XT3/XT4)
(Profile view with MPI_Wait highlighted)

Zoom View of the Parallel Profile (S3D, XT3+XT4)
- The gap represents XT3 nodes
- MPI_Wait takes less time; other routines take more time

Scatterplot (S3D, XT3+XT4)
- Scatterplot of the top three routines (6400 cores)
- Process metadata is used to color performance by machine type (2 clusters)
- Memory speed accounts for the performance difference

S3D Run on XT4 Only
- Better balance across nodes
- More uniform performance

Capturing Analysis and Inference Knowledge
- Create a PerfExplorer workflow for the S3D case study:
  load data → extract non-callpath data → extract top 5 events → load metadata → correlate events with metadata → process inference rules

PerfExplorer Workflow Applied to S3D Performance

  --------------- JPython test script start ------------
  doing single trial analysis for sweep3d on jaguar
  Loading Rules...
  Reading rules: rules/GeneralRules.drl... done.
  Reading rules: rules/ApplicationRules.drl... done.
  Reading rules: rules/MachineRules.drl... done.
  loading the data...
  Getting top 10 events (sorted by exclusive time)...
  Firing rules...
  MPI_Recv(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
  MPI_Send(): "CALLS" metric is correlated with the metadata field "total neighbors". The correlation is 1.0 (direct).
  MPI_Send(): "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is correlated with the metadata field "total neighbors". The correlation is 0.8596 (moderate).
  SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9792 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_FP_INS:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9785 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9819 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L1_TCA:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9810 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is correlated with the metadata field "Memory Speed (MB/s)". The correlation is 0.9985 (very high).
  SOURCE [{source.f} {2,18}]: "PAPI_L2_TCM:EXCLUSIVE" metric is correlated with the metadata field "Seastar Speed (MB/s)". The correlation is 0.9964 (very high).
  SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Memory Speed (MB/s)". The correlation is -0.9980 (very high).
  SOURCE [{source.f} {2,18}]: "P_WALL_CLOCK_TIME:EXCLUSIVE" metric is inversely correlated with the metadata field "Seastar Speed (MB/s)". The correlation is -0.9960 (very high).
  ...done with rules.
  ---------------- JPython test script end -------------

- Identified hardware differences
- Correlated communication behavior with metadata
- Cray-specific inference

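For reference, the coefficients reported in this output are ordinary Pearson correlations computed across processes between a metric and a metadata field. A self-contained illustration (with toy data, not S3D's):

```c
#include <math.h>
#include <stdio.h>

/* Pearson correlation r between per-process metric values x and
 * per-process metadata values y (r = 1 direct, r = -1 inverse). */
static double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i];  sy += y[i];
        sxx += x[i] * x[i];  syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void) {
    /* Toy data: MPI_Send call counts vs. "total neighbors". */
    double calls[]     = {2, 3, 3, 4, 3, 2};
    double neighbors[] = {2, 3, 3, 4, 3, 2};
    printf("r = %.4f\n", pearson(calls, neighbors, 6));  /* 1.0000 */
    return 0;
}
```
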
S3D Weak Scaling (Cray XT4, Jaguar)
(C2H4 benchmark)

S3D Weak Scaling (IBM BG/P, Intrepid)
(C2H4 benchmark; Jaguar vs. Intrepid comparison)

S3D Total Runtime Breakdown by Events (Jaguar)

S3D FP Instructions / L1 Data Cache Miss (Jaguar)

S3D Total Runtime Breakdown by Events (Intrepid)

S3D Relative Efficiency / Speedup by Event (Jaguar)

S3D Relative Efficiency by Event (Intrepid)

S3D Relative Speedup by Event (Intrepid)

S3D Event Correlation to Total Time (Jaguar)
- r = 1 implies direct correlation

S3D Event Correlation to Total Time (Intrepid)

S3D 3D Correlation Cube (Intrepid, MPI_Wait)
- GETRATES_I vs. RATI_I vs. RATX_I vs. MPI_Wait

Heterogeneous Parallel Systems and Performance
- Heterogeneous parallel systems are highly relevant today
- Heterogeneous hardware technology is more accessible:
  - Multicore processors (e.g., 4-core, 6-core, 8-core, 16-core)
  - Manycore (throughput) accelerators (e.g., Tesla, Fermi)
  - High-performance engines (e.g., Intel)
  - Special-purpose components (e.g., FPGAs)
- Performance is the main driving concern: heterogeneity is an important (the?) path to extreme scale
- Heterogeneous software technology is needed to get that performance: more sophisticated parallel programming environments; integrated parallel performance tools that support heterogeneous performance models and perspectives

Implications for Parallel Performance Tools
- The current status quo is somewhat comfortable:
  - Mostly homogeneous parallel systems and software
  - Shared-memory multithreading (OpenMP); distributed-memory message passing (MPI)
  - Parallel computational models are relatively stable (simple), and the corresponding performance models are relatively tractable
  - Parallel performance tools can keep up and evolve
- Heterogeneity creates richer computational potential, resulting in greater performance diversity and complexity
- Heterogeneous systems will utilize more sophisticated programming and runtime environments
- Performance tools have to support richer computation models and more versatile performance perspectives

Heterogeneous Performance Views
- Want to create performance views that capture heterogeneous concurrency and execution behavior:
  - Reflect interactions between heterogeneous components
  - Capture performance semantics relative to the computation model
  - Assimilate performance for all execution paths into a shared view
- Existing parallel performance tools are CPU (host)-centric:
  - Event-based sampling (not appropriate for accelerators)
  - Direct measurement (through instrumentation of events)
- What perspective does the host have of the other components? It determines the semantics of the measurement data and the assumptions about behavior and interactions
- Performance views may have to work with reduced data

Task-based Performance View
- Consider the "task" abstraction for a GPU accelerator scenario:
  - The host regards external execution as a task
  - Tasks operate concurrently with respect to the host
  - Requires support for tracking asynchronous execution
- The host creates the measurement perspective for the external task: it maintains local and remote performance data
- Tasks may have limited measurement support and may depend on the host for performance data I/O; performance data might be received from the external task
- How to create a view of heterogeneous external performance? (see the sketch below)

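The host-side task view can be illustrated with plain CUDA events (independent of TAUcuda): the host brackets an asynchronous kernel launch with events, continues concurrently, and harvests the task's timing later. The kernel and problem size below are placeholders.

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {        /* placeholder task */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    /* Events bracket the asynchronous "task" so the host can
     * measure it without blocking at launch time. */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d, n);      /* returns immediately */
    cudaEventRecord(stop, 0);

    /* ... host work proceeds concurrently with the task ... */

    cudaEventSynchronize(stop);                 /* harvest the timing */
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU task time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```
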
TAUcuda Performance Measurement (Version 2)
- Performance measurement of heterogeneous applications using CUDA
- CUDA system architecture is implemented by the CUDA libraries: driver and device (cuXXX) libraries; runtime (cudaYYY) library
- Existing tool support (Parallel Nsight (Nexus), CUDA Profiler) is not intended to integrate with other HPC performance tools
- TAUcuda (v2) is built on an experimental Linux CUDA driver: driver R190.86 supports a callback interface
- Currently working with NVIDIA to develop a version based on a production performance-tools API

TAUcuda Architecture
(Diagram: TAU events and TAUcuda events)

TAUcuda Instrumentation
(Diagram: TAU events and TAUcuda events in instrumented code)

TAUcuda Experimentation Environments
- University of Oregon:
  - Linux workstation: dual quad-core Intel Xeon, GTX 280
  - GPU cluster (Mist): four dual quad-core Intel Xeon server nodes; two NVIDIA S1070 Tesla servers (4 Tesla GPUs per S1070)
- Argonne National Laboratory (Eureka): 100 dual quad-core nodes with NVIDIA Quadro Plex S4 (200 Quadro FX5600 GPUs, 2 per S4)
- University of Illinois at Urbana-Champaign GPU cluster (AC): 32 nodes, one S1070 each (4 GPUs per node)

CUDA Linpack Profile (4 processes, 4 GPUs)
- GPU-accelerated Linpack benchmark (NVIDIA)

CUDA Linpack Trace
- MPI communication (yellow); CUDA memory transfer (white)

SHOC Stencil2D (512 iterations, 4 CPU×GPU)
- Scalable HeterOgeneous Computing (SHOC) benchmark suite: CUDA / OpenCL kernels and microbenchmarks (ORNL)
- CUDA memory transfer (white)

Scalable Parallel Performance Monitoring
- Performance problem analysis is increasingly complex:
  - Multi-core, heterogeneous, and extreme-scale computing
  - Adaptive algorithms and runtime application tuning
  - Performance dynamics: variability within and between executions
- A new measurement and analysis perspective: static, offline analysis → dynamic, online analysis
  - Scalable runtime analysis of parallel performance data
  - Performance feedback to the application for adaptive control
  - Integrated performance monitoring (measurement + query)
  - Co-allocation of additional (tool-specific) system resources
- Goal: scalable, integrated parallel performance monitoring

Parallel Performance Measurement and Data
- Parallel performance tools measure concurrently: scaling dictates "local" measurements (profile, trace) that save data with local context (processes or threads), done without synchronization or central control
- As a result, the parallel performance state is globally distributed; logically it is part of the application's global data space
- Offline: data is output at the end for post-mortem analysis
- Online: the performance state is accessed for runtime analysis
- Definition: monitoring is online access to the parallel performance (data) state; it may or may not involve runtime analysis

Monitoring for Performance Dynamics
- Runtime access to parallel performance data must be scalable and lightweight; it raises concerns of overhead and intrusion
- Support for performance-adaptive, dynamic applications
- Alternative 1: extend existing performance measurement into an integrated monitoring infrastructure; disadvantage: maintaining our own monitoring framework
- Alternative 2: couple with other monitoring infrastructure; leverage scalable middleware from other supported projects; challenge: measurement system / monitor integration

Performance Dynamics: Parallel Profile Snapshots
- Profile snapshots are parallel profiles recorded at runtime
- Show performance profile dynamics (all profile types allowed)
(Diagram: information vs. overhead tradeoff — profiles < profile snapshots < traces)
A. Morris, W. Spear, A. Malony, and S. Shende, "Observing Performance Dynamics using Parallel Profile Snapshots," European Conference on Parallel Processing (EuroPar), 2008.

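A minimal sketch of periodic snapshot recording in an iterative code, assuming TAU's TAU_PROFILE_SNAPSHOT(name) macro (check the TAU documentation for the exact interface in your release):

```c
#include <stdio.h>
#include <TAU.h>

static void timestep(int it) { (void)it; /* application work */ }

void solve(int nsteps) {
    TAU_PROFILE("solve", "", TAU_DEFAULT);
    for (int it = 0; it < nsteps; it++) {
        timestep(it);
        /* Record a named snapshot every 10 iterations; each one
         * captures the cumulative profile at this point in time. */
        if (it % 10 == 0) {
            char name[32];
            snprintf(name, sizeof(name), "iteration_%d", it);
            TAU_PROFILE_SNAPSHOT(name);
        }
    }
}

int main(void) {
    solve(100);
    return 0;
}
```
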
Parallel Profile Snapshots of FLASH 3.0 (UIC)
- Simulation of astrophysical thermonuclear flashes
- Snapshots show profile differences since the last snapshot
- Captures all events since the beginning, per thread
- Mean profile calculated post-mortem
- Highlights changes in performance per iteration and at checkpointing
(Figure: initialization, checkpointing, and finalization phases visible)

FLASH 3.0 Performance Dynamics (Periodic)
(Figure: periodic behavior of the INTRFC event)

TAUmon: Design
- Design a transport-neutral application monitoring framework:
  - Based on prior / existing work with various transport systems: Supermon, MRNet, MPI
  - Enable efficient development of monitoring functionality
- Objectives:
  - Scalable access to a running application's performance, both at the end of the application (before parallel teardown) and while the application is still running
  - Support for scalable performance data analysis: reduction, statistical evaluation
  - Feedback (data, control) to the application
- Monitoring engineering and performance efficiency issues

TAUmon: Architecture
(Diagram: MPI processes 0..P-1, each with per-thread TAU profiles, feeding the TAUmon monitoring infrastructure)

TAUmon: Current Usage
- TAU_ONLINE_DUMP() collective operation in the application (see the sketch below):
  - Called by all threads / processes (originally to output profiles)
  - Arguments will specify the data analysis operation (future)
- The appropriate version of TAU is selected for the transport system:
  - TAUmonMRNet: TAUmon using the MRNet infrastructure
  - TAUmonMPI: TAUmon using the MPI infrastructure
- The user instruments the application with TAU support for the desired monitoring transport system (temporary)
- The user submits the instrumented application to the parallel job system; other launch systems must be submitted along with the application to the job scheduler as needed (different machine-specific job-submission scripts)

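A minimal sketch of how the collective call might sit in an MPI application; TAU_ONLINE_DUMP() is named on this slide, but treat the surrounding details (header, call frequency) as assumptions.

```c
#include <mpi.h>
#include <TAU.h>

static void timestep(int it) { (void)it; /* application work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    TAU_PROFILE("main", "", TAU_DEFAULT);

    for (int it = 0; it < 100; it++) {
        timestep(it);
        /* Collective: every process must reach this call so the
         * monitoring transport can gather current profile data. */
        if (it % 10 == 0)
            TAU_ONLINE_DUMP();
    }

    MPI_Finalize();
    return 0;
}
```
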
TAUmon: Parallel Profile Data Analysis
- Total parallel profile data size depends on: #events × size of event × #execution threads (event size depends on #metrics)
- Example: 200 events × 100 bytes × 64,000 threads = 1.28 GB (worked below)
- Monitoring operations:
  - Periodic profile data output (à la profile snapshots)
  - Event unification
  - Basic statistics: mean, min/max, standard deviation, ...
  - Advanced statistics: histogram, clustering, ...
- Strong motivation to implement the operations in parallel

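The sizing formula above, written out with the slide's example numbers:

```c
#include <stdio.h>

/* Total profile volume = #events x bytes-per-event x #threads
 * (bytes per event grows with the number of metrics recorded). */
int main(void) {
    long events      = 200;     /* instrumented events   */
    long event_bytes = 100;     /* per-event record size */
    long threads     = 64000;   /* execution threads     */

    double total = (double)events * event_bytes * threads;
    printf("total profile volume: %.2f GB\n", total / 1e9);  /* 1.28 */
    return 0;
}
```
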
TAUmonMRNet (a.k.a. ToM Revisited)
- TAU over MRNet (ToM):
  - Previous work with MRNet 2.1 (Cluster 2008 paper)
  - 1-phase and 3-phase filters
  - Explored overlay networks with different fan-outs (nodes)
- TAUmon re-engineered for MRNet 3.0 (just released): re-implements ToM functionality using the new MRNet support
- The current implementation uses a pre-release MRNet 3.0 version; testing with the released version

Monitoring Operation with MRNet
- The application collectively invokes TAU_ONLINE_DUMP() to start monitoring operations on current performance information
- TAU data is accessed and sent through MRNet's communication API via streams and filters
- Filters perform the appropriate aggregation operations on the data
- The TAU front-end is responsible for collecting the data, storing it, and eventually delivering it to a consumer

TAUmonMPI
- Uses MPI-based transport: no separate launch mechanism
- Parallel gather operations are implemented as a binomial heap with staged MPI point-to-point calls (rank 0 serves as root); see the sketch below
- Current limitations:
  - The application shares parallel resources with the monitoring transport
  - Monitoring operations may cause performance intrusion
  - No user control over the transport network configuration
- Potential advantages: easy to use; could be more robust overall

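A generic sketch of the staged point-to-point pattern: a binomial-tree reduction onto rank 0, in the spirit of the TAUmonMPI gather described above (an illustration, not TAU's actual code).

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Reduce (sum) nvals doubles onto rank 0 in ~log2(P) stages. */
static void binomial_reduce(double *vals, int nvals, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *recv = malloc(nvals * sizeof(double));
    for (int stage = 1; stage < size; stage <<= 1) {
        if (rank & stage) {
            /* Subtree complete: ship partial result and drop out. */
            MPI_Send(vals, nvals, MPI_DOUBLE, rank - stage, 0, comm);
            break;
        } else if (rank + stage < size) {
            MPI_Recv(recv, nvals, MPI_DOUBLE, rank + stage, 0, comm,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < nvals; i++)
                vals[i] += recv[i];   /* merge child's contribution */
        }
    }
    free(recv);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double metric[4] = {1.0, 2.0, 3.0, 4.0};  /* stand-in profile data */
    binomial_reduce(metric, 4, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("aggregated metric[0] = %g\n", metric[0]);
    MPI_Finalize();
    return 0;
}
```
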
TAUmon Experiments: PFLOTRAN
- PFLOTRAN: predictive modeling of subsurface reactive flows
- Machines: ORNL Jaguar and UTK Kraken (Cray XT5)
- Processor counts: 16,380 cores and 131K cores; 12K (interactive)
- Instrumentation (source, PMPI):
  - Full: 1131 events total, lots of small routines
  - Partial: 1% exclusive + all MPI; 68 events total (44 MPI, 19 PETSc), with and without callpaths
- Measurements (PAPI):
  - Execution time: TOT_CYC
  - Counters: FP_OPS, TOT_INS, L1 DCA/DCM, L2 TCA/TCM, RES_STL

PFLOTRAN Profile (Exclusive, 16,380 cores)
(TAU ParaProf view: dominant events include MPI_Allreduce, MPI_Waitany, KSPSolve, and oursnesjacobian)

TAUmonMPI Event Unification (Cray XT5)
(Chart: TAU unification and merge time)

TAUmonMPI Scaling (PFLOTRAN, Cray XT5)
- New histogram timings: 12,288 cores: 0.8643 seconds; 24,576 cores: 0.6238 seconds

TAUmonMRNet Scaling (PFLOTRAN, Cray XT5)

TAUmonMPI Scaling (PFLOTRAN, BG/P)

TAUmonMRNet: Snapshots (PFLOTRAN, Cray XT5)
- 4104 cores running with 374 extra cores for MRNet
- Each line bar shows the mean profile of an iteration

TAUmonMRNet: Snapshots (PFLOTRAN, Cray XT5) (2)
- Frames (iterations) 12, 17, and 21 of a 12K-core PFLOTRAN execution
- Shifts in the order of events sorted by average value over time

TAUmonMRNet Snapshots (FLASH, Cray XT5)
- Sod 2D on 1,536 Cray XT5 cores; over 200 iterations; 15 maximum levels of refinement
- MPI_Alltoall plateaus correspond to AMR refinement

TAUmonMRNet Clustering (FLASH, Cray XT5)
(Cluster views: dominant events include MPI_Init, MPI_Alltoall, MPI_Allreduce, COMPRESS_LIST, and DRIVER_COMPUTEDT)

High-Productivity Performance Engineering (POINT)
- Testbed applications: ENZO, NAMD, NEMO3D

Model Oriented Global Optimization (MOGO)
- Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

Performance Refactoring (PRIMA) (UO, Juelich)
- Integration of instrumentation and measurement: core infrastructure; focus on TAU and Scalasca (University of Oregon, Research Centre Juelich)
- Refactor instrumentation, measurement, and analysis; build next-generation tools on the new common foundation
- Extend to involve the SILC project (Juelich, TU Dresden, TU Munich)

Heterogeneous Exascale Software (Vancouver)
- DOE X-Stack program
- Partners: Oak Ridge National Laboratory, University of Oregon, University of Illinois, Georgia Institute of Technology
- Components: compilers; scheduling and runtime resource management; libraries; performance measurement, analysis, and modeling

Call for "Extreme" Performance Engineering
- Improve the performance problem-solving process: a strategy for responding to scaling challenges and technology changes / disruptions
- Evolve process and technology: build on a robust, integrated performance infrastructure; carry forward performance expertise and knowledge
- Model-oriented, with knowledge-based reasoning: community-driven knowledge engineering; automated data / decision analytics
- Closer integration with the application development and execution environment

Support Acknowledgements
- Department of Energy (DOE): Office of Science, ASC/NNSA
- Department of Defense (DoD): HPC Modernization Office (HPCMO)
- NSF: Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich, Argonne National Laboratory, Technical University Dresden
- ParaTools, Inc., NVIDIA