Presentation is loading. Please wait.

Presentation is loading. Please wait.

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University.

Similar presentations


Presentation on theme: "Allen D. Malony Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University."— Presentation transcript:

1 Allen D. Malony malony@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University of Oregon Performance Technology for Productive, High-End Parallel Computing

2 PSC2 Research Motivation  Tools for performance problem solving  Empirical-based performance optimization process  Performance technology concerns characterization Performance Tuning Performance Diagnosis Performance Experimentation Performance Observation hypotheses properties Instrumentation Measurement Analysis Visualization Performance Technology Experiment management Performance storage Performance Technology

3 Performance Technology for Productive, High-End Parallel ComputingPSC3 Challenges in Performance Problem Solving  How to make the process more effective (productive)?  Process may depend on scale of parallel system  What are the important events and performance metrics?  Tied to application structure and computational model  Tied to application domain and algorithms  Process and tools can/must be more application-aware  Tools have poor support for application-specific aspects  What are the significant issues that will affect the technology used to support the process?  Enhance application development and benchmarking  New paradigm in performance process and technology

4 Performance Technology for Productive, High-End Parallel ComputingPSC4 Large Scale Performance Problem Solving  How does our view of this process change when we consider very large-scale parallel systems?  What are the significant issues that will affect the technology used to support the process?  Parallel performance observation is clearly needed  In general, there is the concern for intrusion  Seen as a tradeoff with performance diagnosis accuracy  Scaling complicates observation and analysis  Performance data size becomes a concern  Analysis complexity increases  Nature of application development may change

5 Performance Technology for Productive, High-End Parallel ComputingPSC5 Role of Intelligence, Automation, and Knowledge  Scale forces the process to become more intelligent  Even with intelligent and application-specific tools, the decisions of what to analyze is difficult and intractable  More automation and knowledge-based decision making  Build automatic/autonomic capabilities into the tools  Support broader experimentation methods and refinement  Access and correlate data from several sources  Automate performance data analysis / mining / learning  Include predictive features and experiment refinement  Knowledge-driven adaptation and optimization guidance  Address scale issues through increased expertise

6 Performance Technology for Productive, High-End Parallel ComputingPSC6 Outline of Talk  Performance problem solving  Scalability, productivity, and performance technology  Application-specific and autonomic performance tools  TAU parallel performance system and advances  Performance data management and data mining  Performance Data Management Framework (PerfDMF)  PerfExplorer  Multi-experiment case studies  Clustering analysis  Comparative analysis (PERC tool study)  Future work and concluding remarks

7 Performance Technology for Productive, High-End Parallel ComputingPSC7 TAU Performance System  Tuning and Analysis Utilities (13+ year project effort)  Performance system framework for HPC systems  Integrated, scalable, flexible, and parallel  Targets a general complex system computation model  Entities: nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance problem solving  Instrumentation, measurement, analysis, and visualization  Portable performance profiling and tracing facility  Performance data management and data mining  University of Oregon, Research Center Jülich, LANL

8 Performance Technology for Productive, High-End Parallel ComputingPSC8 TAU Parallel Performance System Goals  Multi-level performance instrumentation  Multi-language automatic source instrumentation  Flexible and configurable performance measurement  Widely-ported parallel performance profiling system  Computer system architectures and operating systems  Different programming languages and compilers  Support for multiple parallel programming paradigms  Multi-threading, message passing, mixed-mode, hybrid  Support for performance mapping  Support for object-oriented and generic programming  Integration in complex software, systems, applications

9 Performance Technology for Productive, High-End Parallel ComputingPSC9 TAU Performance System Architecture event selection

10 Performance Technology for Productive, High-End Parallel ComputingPSC10 TAU Performance System Architecture

11 Performance Technology for Productive, High-End Parallel ComputingPSC11 Advances in TAU Instrumentation  Source instrumentation  Program Database Toolkit (PDT)  automated Fortran 90/95 support (Cleanscape Flint parser)  statement level support in C/C++ (Fortran soon)  TAU_COMPILER to automate instrumentation process  Automatic proxy generation for component applications  automatic CCA component instrumentation  Python instrumentation and automatic instrumentation  Continued integration with dynamic instrumentation  Update of OpenMP instrumentation (POMP2)  Selective instrumentation and overhead reduction  Improvements in performance mapping instrumentation

12 Performance Technology for Productive, High-End Parallel ComputingPSC12 Program Database Toolkit (PDT) Application / Library C / C++ parser Fortran parser F77/90/95 C / C++ IL analyzer Fortran IL analyzer Program Database Files IL DUCTAPE PDBhtml SILOON CHASM TAU_instr Program documentation Application component glue C++ / F90/95 interoperability Automatic source instrumentation

13 Performance Technology for Productive, High-End Parallel ComputingPSC13 Advances in TAU Measurement  Profiling (four types)  Memory profiling  global heap memory tracking (several options)  Callpath profiling and calldepth profiling  user-controllable callpath length and calling depth  Phase-based profiling  Tracing  Generation of VTF3 / SLOG traces files (fully portable)  Inclusion of hardware performance counts in trace files  Hierarchical trace merging  Online performance overhead compensation  Component software proxy generation and monitoring

14 Performance Technology for Productive, High-End Parallel ComputingPSC14 Profile Measurement  Flat profiles  Metric (e.g., time) spent in an event (callgraph nodes)  Exclusive/inclusive, # of calls, child calls  Callpath profiles (Calldepth profiles)  Time spent along a calling path (edges in callgraph)  “main=> f1 => f2 => MPI_Send” (event name)  TAU_CALLPATH_LENGTH environment variable  Phase-based profiles  Flat profiles under a phase (nested phases are allowed)  Default “main” phase  Supports static or dynamic (per-iteration) phases

15 Performance Technology for Productive, High-End Parallel ComputingPSC15 Advances in TAU Performance Analysis  Enhanced parallel profile analysis (ParaProf)  Callpath analysis integration in ParaProf  Event callgraph view  Performance Data Management Framework (PerfDMF)  First release of prototype  In use by several groups  S. Moore (UTK), P. Teller (UTEP), P. Hovland (ANL), …  Integration with Vampir Next Generation (VNG)  Online trace analysis  3D Performance visualization prototype (ParaVis)  Component performance modeling and QoS

16 Performance Technology for Productive, High-End Parallel ComputingPSC16 Pprof – Flat Profile (NAS PB LU)  Intel Linux cluster  F90 + MPICH  Profile - Node - Context - Thread  Events - code - MPI  Metric - time  Text display

17 Performance Technology for Productive, High-End Parallel ComputingPSC17 ParaProf – Manager Window performance database derived performance metrics

18 Performance Technology for Productive, High-End Parallel ComputingPSC18 ParaProf – Full Profile (Miranda) 8K processors!

19 Performance Technology for Productive, High-End Parallel ComputingPSC19 ParaProf– Flat Profile (Miranda)

20 Performance Technology for Productive, High-End Parallel ComputingPSC20 ParaProf– Callpath Profile (Flash)

21 Performance Technology for Productive, High-End Parallel ComputingPSC21 ParaProf– Callpath Profile (ESMF) 21-level callpath

22 Performance Technology for Productive, High-End Parallel ComputingPSC22 ParaProf – Phase Profile (MFIX) In 51 st iteration, time spent in MPI_Waitall was 85.81 secs Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations dynamic phases one per interation

23 Performance Technology for Productive, High-End Parallel ComputingPSC23 ParaProf – Histogram View (Miranda) 8k processors 16k processors  Scalable 2D displays

24 Performance Technology for Productive, High-End Parallel ComputingPSC24 ParaProf –Callgraph View (MFIX)

25 Performance Technology for Productive, High-End Parallel ComputingPSC25 ParaProf – Callpath Highlighting (Flash) MODULEHYDRO_1D:HYDRO_1D

26 Performance Technology for Productive, High-End Parallel ComputingPSC26 Profiling of Miranda on BG/L (Miller, LLNL) 128 Nodes512 Nodes1024 Nodes  Profile code performance (automatic instrumentation)  Scaling studies (problem size, number of processors)  Run on 8K and 16K processors!

27 Performance Technology for Productive, High-End Parallel ComputingPSC27 ParaProf – 3D Full Profile (Miranda) 16k processors

28 Performance Technology for Productive, High-End Parallel ComputingPSC28 ParaProf – 3D Scatterplot (Miranda)  Each point is a “thread” of execution  A total of four metrics shown in relation  ParaVis 3D profile visualization library  JOGL

29 Performance Technology for Productive, High-End Parallel ComputingPSC29 Performance Tracing on Miranda  Use TAU to generate VTF3 traces for Vampir analysis  MPI calls with HW counter information (not shown)  Detailed code behavior to focus optimization efforts

30 Performance Technology for Productive, High-End Parallel ComputingPSC30 S3D on Lemieux (TAU-to-VTF3, Vampir)

31 Performance Technology for Productive, High-End Parallel ComputingPSC31 S3D on Lemieux (Zoomed)

32 Performance Technology for Productive, High-End Parallel ComputingPSC32 TAU Performance System Status  Computing platforms (selected)  IBM SP/pSeries, SGI Origin, Cray T3E/SV-1/X1/XT3, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows  Programming languages  C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python  Thread libraries (selected)  pthreads, SGI sproc, Java,Windows, OpenMP  Compilers (selected)  Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, PathScale, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft

33 Performance Technology for Productive, High-End Parallel ComputingPSC33 Project Affiliations (selected)  Center for Simulation of Accidental Fires and Explosion  University of Utah, ASCI ASAP Center, C-SAFE  Uintah Computational Framework (UCF) (C++)  Center for Simulation of Dynamic Response of Materials  California Institute of Technology, ASCI ASAP Center  Virtual Testshock Facility (VTF) (Python, Fortran 90)  Earth Systems Modeling Framework (ESMF)  NSF, NOAA, DOE, NASA, …  Instrumentation for ESMF framework and applications  C, C++, and Fortran 95 code modules  MPI wrapper library for MPI calls

34 Performance Technology for Productive, High-End Parallel ComputingPSC34 Project Affiliations (selected) (continued)  Lawrence Livermore National Lab  Hydrodynamics (Miranda), Radiation diffusion (KULL)  Sandia National Lab and Los Alamos National Lab  DOE CCTTSS SciDAC project  Common component architecture (CCA) integration  Argonne National Lab  OS / RTS for Extreme Scale Scientific Computation  Zeptos - scalable components for petascale architectures  KTAU - integration of TAU infrastructure in Linux kernel  Oak Ridge National Lab  Contribution to the Joule Report: S3D, AORSA3D

35 Performance Technology for Productive, High-End Parallel ComputingPSC35 Important Questions for Application Developers  How does performance vary with different compilers?  Is poor performance correlated with certain OS features?  Has a recent change caused unanticipated performance?  How does performance vary with MPI variants?  Why is one application version faster than another?  What is the reason for the observed scaling behavior?  Did two runs exhibit similar performance?  How are performance data related to application events?  Which machines will run my code the fastest and why?  Which benchmarks predict my code performance best?

36 Performance Technology for Productive, High-End Parallel ComputingPSC36 Performance Problem Solving Goals  Answer questions at multiple levels of interest  Data from low-level measurements and simulations  use to predict application performance  High-level performance data spanning dimensions  machine, applications, code revisions, data sets  examine broad performance trends  Discover general correlations application performance and features of their external environment  Develop methods to predict application performance on lower-level metrics  Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

37 Performance Technology for Productive, High-End Parallel ComputingPSC37 Automatic Performance Analysis Tool (Concept) Performance database Build application Execute application Simple analysis feedback 105% Faster! 72% Faster! build information environment / performance data Offline analysis

38 Performance Technology for Productive, High-End Parallel ComputingPSC38 Performance Data Management Framework

39 Performance Technology for Productive, High-End Parallel ComputingPSC39 ParaProf Performance Profile Analysis HPMToolkit MpiP TAU Raw files PerfDMF managed (database) Metadata Application Experiment Trial

40 Performance Technology for Productive, High-End Parallel ComputingPSC40 PerfExplorer (K. Huck, Ph.D. student, UO)  Performance knowledge discovery framework  Use the existing TAU infrastructure  TAU instrumentation data, PerfDMF  Client-server based system architecture  Data mining analysis applied to parallel performance data  comparative, clustering, correlation, dimension reduction,...  Technology integration  Relational DatabaseManagement Systems (RDBMS)  Java API and toolkit  R-project / Omegahat statistical analysis  WEKA data mining package  Web-based client

41 Performance Technology for Productive, High-End Parallel ComputingPSC41 PerfExplorer Architecture

42 Performance Technology for Productive, High-End Parallel ComputingPSC42 PerfExplorer Client GUI

43 Performance Technology for Productive, High-End Parallel ComputingPSC43 Hierarchical and K-means Clustering (sPPM)

44 Performance Technology for Productive, High-End Parallel ComputingPSC44 Miranda Clustering on 16K Processors

45 Performance Technology for Productive, High-End Parallel ComputingPSC45 PERC Tool Requirements and Evaluation  Performance Evaluation Research Center (PERC)  DOE SciDAC  Evaluation methods/tools for high-end parallel systems  PERC tools study (led by ORNL, Pat Worley)  In-depth performance analysis of select applications  Evaluation performance analysis requirements  Test tool functionality and ease of use  Applications  Start with fusion code – GYRO  Repeat with other PERC benchmarks  Continue with SciDAC codes

46 Performance Technology for Productive, High-End Parallel ComputingPSC46 Primary Evaluation Machines  Phoenix (ORNL – Cray X1)  512 multi-streaming vector processors  Ram (ORNL – SGI Altix (1.5 GHz Itanium2))  256 total processors  TeraGrid  ~7,738 total processors on 15 machines at 9 sites  Cheetah (ORNL – p690 cluster (1.3 GHz, HPS))  864 total processors on 27 compute nodes  Seaborg (NERSC – IBM SP3)  6080 total processors on 380 compute nodes

47 Performance Technology for Productive, High-End Parallel ComputingPSC47 GYRO Execution Parameters  Three benchmark problems  B1-std: 16n processors, 500 timesteps  B2-cy: 16n processors, 1000 timesteps  B3-gtc: 64n processors, 100 timesteps (very large)  Test different methods to evaluate nonlinear terms:  Direct method  FFT (“nl2” for B1 and B2, “nl1” for B3)  Task affinity enabled/disabled (p690 only)  Memory affinity enabled/disabled (p690 only)  Filesystem location (Cray X1 only)

48 Performance Technology for Productive, High-End Parallel ComputingPSC48 PerfExplorer Analysis of Self-Instrumented Data  PerfExplorer  Focus on comparative analysis  Apply to PERC tool evaluation study  Look at user timer data  Aggregate data  no per process data  process clustering analysis is not applicable  Timings output every N timesteps  some phase analysis possible  Goal  Recreate manually generated performance reports

49 Performance Technology for Productive, High-End Parallel ComputingPSC49 PerfExplorer Interface Select experiments and trials of interest Data organized in application, experiment, trial structure (will allow arbitrary in future) Experiment metadata

50 Performance Technology for Productive, High-End Parallel ComputingPSC50 PerfExplorer Interface Select analysis

51 Performance Technology for Productive, High-End Parallel ComputingPSC51 B1-std B3-gtc Timesteps per Second  Cray X1 is the fastest to solution in all 3 tests  FFT (nl2) improves time for B3-gtc only  TeraGrid faster than p690 for B1-std?  Plots generated automatically B1-std B2-cy B3-gtc TeraGrid

52 Performance Technology for Productive, High-End Parallel ComputingPSC52 Relative Efficiency (B1-std)  By experiment (B1-std)  Total runtime (Cheetah (red))  By event for one experiment  Coll_tr (blue) is significant  By experiment for one event  Shows how Coll_tr behaves for all experiments 16 processor base case CheetahColl_tr

53 Performance Technology for Productive, High-End Parallel ComputingPSC53 Automated Parallel Performance Diagnosis

54 Performance Technology for Productive, High-End Parallel ComputingPSC54 Current and Future Work  ParaProf  Developing phase-based performance displays  PerfDMF  Adding new database backends and distributed support  Building support for user-created tables  PerfExplorer  Extending comparative and clustering analysis  Adding new data mining capabilities  Building in scripting support  Performance regression testing tool (PerfRegress)  Integrate in Eclipse Parallel Tool Project (PTP)

55 Performance Technology for Productive, High-End Parallel ComputingPSC55 Concluding Discussion  Performance tools must be used effectively  More intelligent performance systems for productive use  Evolve to application-specific performance technology  Deal with scale by “full range” performance exploration  Autonomic and integrated tools  Knowledge-based and knowledge-driven process  Performance observation methods do not necessarily need to change in a fundamental sense  More automatically controlled and efficiently use  Develop next-generation tools and deliver to community  Open source with support by ParaTools, Inc.

56 Performance Technology for Productive, High-End Parallel ComputingPSC56 Support Acknowledgements  Department of Energy (DOE)  Office of Science contracts  University of Utah ASCI Level 1 sub-contract  ASC/NNSA Level 3 contract  NSF  High-End Computing Grant  Research Centre Juelich  John von Neumann Institute  Dr. Bernd Mohr  Los Alamos National Laboratory

57 Performance Technology for Productive, High-End Parallel ComputingPSC57


Download ppt "Allen D. Malony Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University."

Similar presentations


Ads by Google