1 Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System
Allen D. Malony
Performance Research Laboratory (PRL), Neuroinformatics Center (NIC)
Department of Computer and Information Science, University of Oregon

2 Outline
[Slide footer: Performance Technology for Productive, High-End Parallel Computing, Sun]
- Research interests and motivation
- TAU performance system
  - Instrumentation
  - Measurement
  - Analysis tools
    - Parallel profile analysis (ParaProf)
    - Performance data management (PerfDMF)
    - Performance data mining (PerfExplorer)
- TAU on Solaris 10
- ZeptoOS and KTAU

3 Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
[Figure: optimization cycle of Performance Observation, Performance Experimentation, Performance Diagnosis, and Performance Tuning, with links labeled characterization, properties, and hypotheses; supported by performance technology for instrumentation, measurement, analysis, and visualization, and for experiment management and performance data storage]

4 Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
- Process likely to change as parallel systems evolve
- What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Tied to application domain and algorithms
- What are the significant issues that will affect the technology used to support the process?
- Enhance application development and optimization
  - Process and tools can/must be more application-aware
  - Tools have poor support for application-specific aspects
- Integrate performance technology and process

5 Performance Process, Technology, and Scale
- How does our view of this process change when we consider very large-scale parallel systems?
- Scaling complicates observation and analysis
  - Performance data size
    - standard approaches deliver a lot of data with little value
  - Measurement overhead and intrusion
    - tradeoff with analysis accuracy
    - "noise" in the system
  - Analysis complexity increases
- What will enhance productive application development?
- Process and technology evolution
- Nature of application development may change

6 Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent
- Even with intelligent and application-specific tools, deciding what to analyze is difficult and can be intractable
- More automation and knowledge-based decision making
- Build automatic/autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
- Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise

7 TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Partners: LLNL, ANL, Research Center Jülich, LANL

8 TAU Parallel Performance System Goals
- Portable (open source) parallel performance system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Multi-level, multi-language performance instrumentation
- Flexible and configurable performance measurement
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component
- Support for performance mapping
- Integration of leading performance technology
- Scalable (very large) parallel performance analysis

9 General Complex System Computation Model
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view (SMP nodes, each with node memory, joined by an interconnection network carrying inter-node message communication) beside the model view (node, context with its VM space, threads)]

10 TAU Performance System Architecture

11 TAU Performance System Architecture

12 TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation optimization
  - Eliminate instrumentation in lightweight routines

13 TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically-linked and dynamically-linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DynInstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate instrumentation process

14 Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within/between levels
- Mapping
  - Associate performance data with high-level semantic abstractions
[Figure: instrumentation applied at each level (user-level abstractions of the problem domain, source code, object code, libraries, executable, runtime image, VM) via preprocessor, compiler, linker, and OS, producing performance data from a run]

15 Program Database Toolkit (PDT)
[Figure: PDT workflow. Application / library source passes through the C / C++ parser or the Fortran parser (F77/90/95) to IL, then through the C / C++ IL analyzer or the Fortran IL analyzer to Program Database Files. DUCTAPE provides access to these files for PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), and tau_instrumentor (automatic source instrumentation)]

16 Program Database Toolkit (PDT)
- Program code analysis framework
  - Develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - Commercial grade front-end parsers
  - Portable IL analyzer, database format, and access API
  - Open software approach for tool development
- Multiple source languages
- Implement automatic performance instrumentation tools
  - tau_instrumentor

17 TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to Open Trace Format (OTF)
  - Trace streams and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Performance measurement for CCA component software

18 TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, phase profiling
  - Support for memory profiling
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events

19 Types of Parallel Performance Profiling
- Flat profiles
  - Metric (e.g., time) spent in an event (callgraph nodes)
  - Exclusive/inclusive, # of calls, child calls
- Callpath profiles (Calldepth profiles)
  - Time spent along a calling path (edges in callgraph)
  - "main => f1 => f2 => MPI_Send" (event name)
  - TAU_CALLPATH_LENGTH environment variable
- Phase profiles
  - Flat profiles under a phase (nested phases are allowed)
  - Default "main" phase
  - Supports static or dynamic (per-iteration) phases
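The difference between flat and callpath profiles can be illustrated with a small sketch. This is not TAU code: `CallNode`, `flatProfile`, `callpathProfile`, and the `maxLength` parameter are hypothetical names chosen for illustration, and `maxLength` only stands in for the idea of limiting how much of the calling path is kept in the event name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One node of a call tree: an event, its inclusive time, and its callees.
struct CallNode {
    std::string name;
    double inclusive;                  // time in this event plus everything it calls
    std::vector<CallNode> children;
};

// Exclusive time = inclusive time minus the inclusive time of direct children.
double exclusiveTime(const CallNode& node) {
    double t = node.inclusive;
    for (const CallNode& child : node.children) t -= child.inclusive;
    return t;
}

// Flat profile: exclusive time summed per event name, ignoring the calling path.
void flatProfile(const CallNode& node, std::map<std::string, double>& out) {
    out[node.name] += exclusiveTime(node);
    for (const CallNode& child : node.children) flatProfile(child, out);
}

// Callpath profile: exclusive time keyed by "a => b => c", keeping only the
// last maxLength events of each path (hypothetical path-length limit).
void callpathProfile(const CallNode& node, std::vector<std::string> path,
                     std::size_t maxLength,
                     std::map<std::string, double>& out) {
    path.push_back(node.name);
    if (path.size() > maxLength) path.erase(path.begin());
    std::string key;
    for (std::size_t i = 0; i < path.size(); ++i) {
        if (i > 0) key += " => ";
        key += path[i];
    }
    out[key] += exclusiveTime(node);
    for (const CallNode& child : node.children)
        callpathProfile(child, path, maxLength, out);
}
```

With a path limit of 4, a tree main -> f1 -> f2 -> MPI_Send yields the event name "main => f1 => f2 => MPI_Send" from the slide; with a limit of 2 the same call is reported as "f2 => MPI_Send".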

20 Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)

21 ParaProf Parallel Performance Profile Analysis
[Figure: ParaProf data sources: raw files (HPMToolkit, MpiP, TAU) and PerfDMF-managed database data, with metadata organized as Application / Experiment / Trial]

22 Example Applications
- sPPM: ASCI benchmark, Fortran, C, MPI, OpenMP or pthreads
- Miranda: research hydrodynamics code, Fortran, MPI
- GYRO: tokamak turbulence simulation, Fortran, MPI
- FLASH: physics simulation, Fortran, MPI
- WRF: weather research and forecasting, Fortran, MPI
- S3D: 3D combustion, Fortran, MPI

23 ParaProf – Flat Profile (Miranda, BG/L)
8K processors; figure labels: node, context, thread
Miranda
- hydrodynamics
- Fortran + MPI
- LLNL
Run to 64K

24 ParaProf – Stacked View (Miranda)

25 ParaProf – Callpath Profile (Flash)
Flash
- thermonuclear flashes
- Fortran + MPI
- Argonne

26 ParaProf – Histogram View (Miranda)
Panels: 8K processors and 16K processors

27 NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

28 NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

29 ParaProf – 3D Full Profile (Miranda)
16K processors

30 ParaProf – 3D Full Profile (Flash)
128 processors

31 ParaProf – 3D Scatterplot (Miranda)
- Each point is a "thread" of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library
  - JOGL

32 ParaProf – Callgraph Zoom (Flash)
Zoom in (+), zoom out (-)

33 Performance Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
- MPI calls with HW counter information (not shown)
- Detailed code behavior to focus optimization efforts

34 S3D on Lemieux (TAU-to-VTF3, Vampir)
S3D
- 3D combustion
- Fortran + MPI
- PSC

35 S3D on Lemieux (Zoomed)

36 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube

    Particle* P[MAX]; /* Array of particles */

    int GenerateParticles() {
      /* distribute particles over all faces of the cube */
      for (int face = 0, last = 0; face < 6; face++) {
        /* particles on this face */
        int particles_on_this_face = num(face);
        /* indices [last, last + particles_on_this_face) belong to this face */
        for (int i = last; i < last + particles_on_this_face; i++) {
          /* particle properties are a function of face */
          P[i] = ... f(face);
          ...
        }
        last += particles_on_this_face;
      }
    }

37 Hypothetical Mapping Example (continued)
- How much time (flops) spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?

    int ProcessParticle(Particle *p) {
      /* perform some computation on p */
    }

    int main() {
      GenerateParticles();          /* create a list of particles */
      for (int i = 0; i < N; i++)   /* iterates over the list */
        ProcessParticle(P[i]);
    }

[Figure: work packets feeding an engine]
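One way to picture the answer to these questions is to charge each call's cost to the face that produced the particle, rather than to the routine alone. The sketch below is illustrative only and does not use the TAU mapping API: `FaceProfile`, `processAll`, the `face` field, and the abstract cost returned by `processParticle` are all assumptions made here. Costs are counted in made-up work units so the result is deterministic; a real measurement would read a timer or hardware counter around the call.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Particle {
    int face;       // cube face (0..5) that generated this particle
    double value;   // placeholder for particle properties
};

// Per-face accumulators: the semantic entity that performance is mapped to.
struct FaceProfile {
    std::array<long, 6> calls{};
    std::array<double, 6> work{};
};

// Stand-in for real computation; returns an abstract cost in work units.
// Hypothetical cost model: particles on higher-numbered faces cost more.
double processParticle(const Particle& p) {
    return 1.0 + p.face;
}

// Process every particle and attribute its cost to the face it belongs to,
// instead of lumping all of it under processParticle().
FaceProfile processAll(const std::vector<Particle>& particles) {
    FaceProfile profile;
    for (const Particle& p : particles) {
        const double cost = processParticle(p);
        const std::size_t face = static_cast<std::size_t>(p.face);
        profile.calls[face] += 1;
        profile.work[face] += cost;
    }
    return profile;
}
```

A routine-level profile would report a single entry for `processParticle`; the per-face arrays show how the same cost is distributed across the six faces. In a parallel run, each thread would keep its own `FaceProfile` and the per-face totals would be combined afterwards.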

38 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- Does not provide support for mapping
- TAU's performance mapping can observe performance with respect to scientist's programming and problem abstractions
Panels: TAU (no mapping), TAU (w/ mapping)

39 Component-Based Scientific Applications
- How to support performance analysis and tuning process consistent with application development methodology?
  - Common Component Architecture (CCA) applications
- Performance tools should integrate with software
- Design performance observation component
  - Measurement port and measurement interfaces
- Build support for application component instrumentation
  - Interpose a proxy component for each port
  - Inside the proxy, track caller/callee invocations, timings
  - Automate the process of proxy component creation
    - using PDT for static analysis of components
    - include support for selective instrumentation

40 Flame Reaction-Diffusion (Sandia) CCAFFEINE

41 Earth Systems Modeling Framework
- Coupled modeling with modular software framework
- Instrumentation for ESMF framework and applications
  - PDT automatic instrumentation
    - Fortran 95 code modules
    - C / C++ code modules
  - MPI wrapper library for MPI calls
- ESMF component instrumentation (using CCA)
  - CCA measurement port manual instrumentation
  - Proxy generation using PDT and runtime interposition
- Significant callpath profiling used by ESMF team

42 Using TAU Component in ESMF/CCA

43 Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?

44 Performance Problem Solving Goals
- Answer questions at multiple levels of interest
  - Data from low-level measurements and simulations
    - use to predict application performance
  - High-level performance data spanning dimensions
    - machine, applications, code revisions, data sets
    - examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

45 Automatic Performance Analysis Tool (Concept)
[Figure: build application and execute application steps send build information and environment / performance data to a performance database; offline analysis of the database returns simple analysis feedback, with example callouts "105% Faster!" and "72% Faster!"]

46 Performance Data Management (PerfDMF)
K. Huck, A. Malony, R. Bell, A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," ICPP 2005. (awarded best paper)

47 Performance Data Mining (Objectives)
- Conduct parallel performance analysis in a systematic, collaborative and reusable manner
  - Manage performance complexity
  - Discover performance relationships and properties
  - Automate process
- Multi-experiment performance analysis
- Large-scale performance data reduction
  - Summarize characteristics of large processor runs
- Implement extensible analysis framework
  - Abstraction / automation of data mining operations
  - Interface to existing analysis and data mining tools

48 Performance Data Mining (PerfExplorer)
- Performance knowledge discovery framework
  - Data mining analysis applied to parallel performance data
    - comparative, clustering, correlation, dimension reduction, …
  - Use the existing TAU infrastructure
    - TAU performance profiles, PerfDMF
  - Client-server based system architecture
- Technology integration
  - Java API and toolkit for portability
  - PerfDMF
  - R-project/Omegahat, Octave/Matlab statistical analysis
  - WEKA data mining package
  - JFreeChart for visualization, vector output (EPS, SVG)

49 Performance Data Mining (PerfExplorer)
K. Huck and A. Malony, "PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing," SC 2005.

50 PerfExplorer Analysis Methods
- Data summaries, distributions, scatterplots
- Clustering
  - k-means
  - Hierarchical
- Correlation analysis
- Dimension reduction
  - PCA
  - Random linear projection
  - Thresholds
- Comparative analysis
- Data management views

51 Cluster Analysis
- Performance data represented as vectors: each dimension is the cumulative time for an event
- k-means: k random centers are selected and instances are grouped with the "closest" (Euclidean) center
- New centers are calculated and the process repeated until stabilization or max iterations
- Dimension reduction necessary for meaningful results
- Virtual topology, summaries constructed
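The k-means procedure described above can be sketched in a few lines. This is a minimal illustration, not PerfExplorer's implementation (which the deck says builds on WEKA and R). One deliberate simplification, made so the example is reproducible: the first k points seed the centers instead of a random choice. The names `Point`, `squaredDistance`, and `kmeans` are assumptions of this sketch.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;   // one thread's profile: cumulative time per event

double squaredDistance(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Returns the cluster index assigned to each point.
// Assumptions: 1 <= k <= points.size(), all points have the same dimension,
// and the first k points seed the centers (the slide uses random centers).
std::vector<int> kmeans(const std::vector<Point>& points, std::size_t k,
                        int maxIterations) {
    std::vector<Point> centers(points.begin(),
                               points.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<int> label(points.size(), -1);
    for (int iter = 0; iter < maxIterations; ++iter) {
        bool changed = false;
        // Assignment step: each point joins its closest (Euclidean) center.
        for (std::size_t p = 0; p < points.size(); ++p) {
            int best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredDistance(points[p], centers[c]);
                if (d < bestDist) { bestDist = d; best = static_cast<int>(c); }
            }
            if (label[p] != best) { label[p] = best; changed = true; }
        }
        if (!changed) break;   // assignments stabilized
        // Update step: move each center to the mean of its members.
        std::vector<Point> sum(k, Point(points[0].size(), 0.0));
        std::vector<int> count(k, 0);
        for (std::size_t p = 0; p < points.size(); ++p) {
            const std::size_t c = static_cast<std::size_t>(label[p]);
            for (std::size_t d = 0; d < points[p].size(); ++d) sum[c][d] += points[p][d];
            ++count[c];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (count[c] == 0) continue;   // keep an empty cluster's old center
            for (std::size_t d = 0; d < sum[c].size(); ++d)
                centers[c][d] = sum[c][d] / count[c];
        }
    }
    return label;
}
```

Each input vector would hold one thread's cumulative time per event, typically after dimension reduction as the slide notes, since distances in very high-dimensional event spaces carry little meaning.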

52 sPPM Cluster Analysis

53 Hierarchical and K-means Clustering (sPPM)

54 Miranda Clusters, Average Values (16K CPUs)
- Two primary clusters due to MPI_Alltoall behavior …
- … also inverse relationship between MPI_Barrier and MPI_Group_translate_ranks

55 Miranda Modified
- After code modifications, work distribution is even
- MPI_Barrier and MPI_Group_translate_ranks are no longer significant contributors to run time

56 Flash Clustering on 16K BG/L Processors
- Four significant events automatically selected
- Clusters and correlations are visible

57 Correlation Analysis
- Describes strength and direction of a linear relationship between two variables (events) in the data
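The slide does not name a specific coefficient; the usual measure of the strength and direction of a linear relationship is the Pearson correlation coefficient, which is assumed in this sketch. The function name `pearson` and its inputs (per-thread times for two events) are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equally long samples, e.g. the per-thread
// time spent in two different events.
// Assumptions: x.size() == y.size() >= 2 and neither sample is constant.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < n; ++i) { meanX += x[i]; meanY += y[i]; }
    meanX /= static_cast<double>(n);
    meanY /= static_cast<double>(n);

    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        cov  += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    return cov / std::sqrt(varX * varY);   // in [-1, 1]
}
```

A value near +1 means the two events grow together across threads, a value near -1 indicates an inverse relationship such as the one reported between MPI_Barrier and MPI_Group_translate_ranks for Miranda, and a value near 0 means no linear relationship.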

58 Comparative Analysis
- Relative speedup, efficiency
  - total runtime, by event, one event, by phase
- Breakdown of total runtime
- Group fraction of total runtime
- Correlating events to total runtime
- Timesteps per second
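Relative speedup and efficiency compare one trial against a baseline trial that need not be a single-processor run. The sketch below uses the conventional definitions; the `Trial` struct and function names are assumptions, not PerfExplorer identifiers.

```cpp
// One measured trial of the same problem.
struct Trial {
    int processors;
    double runtime;   // seconds (total, or for one event or phase)
};

// Relative speedup: how many times faster the trial is than the baseline.
double relativeSpeedup(const Trial& base, const Trial& trial) {
    return base.runtime / trial.runtime;
}

// Relative efficiency: speedup divided by the factor by which the
// processor count grew; 1.0 means perfect scaling from the baseline.
double relativeEfficiency(const Trial& base, const Trial& trial) {
    return relativeSpeedup(base, trial) * base.processors / trial.processors;
}

// Fraction of total runtime spent in one event group (e.g. all MPI calls).
double groupFraction(double groupTime, double totalRuntime) {
    return groupTime / totalRuntime;
}
```

For example, going from 8 processors at 100 s to 32 processors at 40 s gives a relative speedup of 2.5 and a relative efficiency of 0.625. Passing per-event or per-phase times instead of total runtime gives the "by event" and "by phase" variants listed on the slide.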

59 User-Defined Views
- Reorganization of data for multiple parametric studies
- Construction of views / sub-views with simple operators
- Simple "wizard"-like interface for creating views
[Figure: two example view hierarchies: Application / Processors / Problem size, and Application / Problem type / Processors]

60 PerfExplorer Future Work
- Extensions to PerfExplorer framework
- Examine properties of performance data
- Automated guidance of analysis
  - Workflow scripting for repeatable analysis
- Dependency modeling (go beyond correlation)
- Time-series analysis of phase-based data

61 TAU Eclipse Integration
- Eclipse GUI integration of existing TAU tools
- New Eclipse plug-in for code instrumentation
- Integration with CDT and FDT
- Java, C/C++, and Fortran projects
  - Can be instrumented and run from within Eclipse
  - Each project can be given multiple build configurations corresponding to available TAU makefiles
  - All TAU configuration options are available
  - ParaProf tool can be launched automatically

62 TAU Eclipse Integration
Panels: TAU configuration, TAU experimentation

63 TAU Eclipse Future Work
- Development of the TAU Eclipse plugins for Java and the CDT/FDT is ongoing
- Planned features include:
  - Full integration with the Eclipse Parallel Tools project
  - Database storage of project performance data
  - Refinement of the plugin settings interface to allow easier selection of TAU runtime and compile-time options
  - Accessibility of TAU configuration and command-line tools via the Eclipse UI

64 Porting TAU to Sun Solaris 10 / Opteron
- Already supported earlier versions of Solaris
- Already supported Linux and Windows
  - Will work directly with Sun Opteron systems
- Now have full support for Solaris 10
  - Intel-based systems
  - Opteron systems
  - Profiling and tracing with all options
- Compilers supported
  - Sun Studio 10 compilers (C, C++, and F90)
- Parallel models supported
  - MPI, OpenMP, hybrid

65 Porting Challenge – Nested OpenMP
- TAU did not support nested parallelism before at all
  - Opari could instrument
    - but does not distinguish nesting
- Lack of OpenMP support for performance tools
  - determining thread information for nested threads
- Want to build a portable mechanism for OpenMP
  - OpenMP may provide runtime call in the future
  - Explore what can be implemented otherwise now
- Now supports nested OpenMP parallelism
  - OpenMP runtime system independent
  - Depends on language feature

66 Nested Parallelism Implementation in TAU
- TAU normally uses omp_get_thread_num()
  - Identify threads with globally unique identifier (tid)
  - Use to create measurement structures for thread
    - event callstack, profile objects, …
- Approach breaks down for nested parallelism
  - omp_get_thread_num() returns id in current team
  - Not globally unique
- Use #pragma omp threadprivate() directive
  - Gives thread local storage (TLS) for identifying threads
  - TAU generates tid for each thread first come, first served
  - Supported in Intel and Sun compilers (probably IBM)
  - Tested on Sun (Sparc, x64) and SGI Prism (Itanium2)

67 ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation
- ZeptoOS
  - scalable components for petascale architectures
  - Argonne National Laboratory and University of Oregon
- University of Oregon
  - Kernel-level performance monitoring
  - OS component performance assessment and tuning
  - KTAU (Kernel Tuning and Analysis Utilities)
    - integration of TAU infrastructure in Linux kernel
    - integration with ZeptoOS
    - installation on BG/L
  - Port to 32-bit and 64-bit Linux platforms

68 Linux Kernel Profiling using TAU – Goals
- Fine-grained kernel-level performance measurement
  - Parallel applications
  - Support both profiling and tracing
  - Both process-centric and system-wide view
- Merge user-space performance with kernel-space
  - User-space: (TAU) profile/trace
  - Kernel-space: (KTAU) profile/trace
- Detailed program-OS interaction data
  - Including interrupts (IRQ)
- Analysis and visualization compatible with TAU

69 KTAU System Architecture

70 KTAU on BG/L I/O Node

71 KTAU Usage Models
- Daemon-based monitoring (KTAU-D)
  - Use KTAU-D to monitor (profile/trace) a single process (e.g., CIOD) or entire IO-Node kernel
  - No access to source code of user-space program
  - CIOD kernel activity is available even though the CIOD source is not
- "Self" monitoring
  - A user-space program can be instrumented (e.g., with TAU) to access its own kernel-level trace/profile data
  - ZIOD (ZeptoOS IO-D) source (when available) can be instrumented
  - Can produce merged user-kernel trace/profile

72 KTAU-D Profile Data
- KTAU-D can be used to access profile data (system-wide and individual process) of BGL IO-Node
- Data is obtained at the start and stop of KTAU-D, and then the resulting profile is generated
- (Work in progress)
  - Currently flat profiles with inclusive/exclusive times and function call counts are produced
  - (Future work: call-graph profiles)
- Profile data is viewed using ParaProf visualization tool

73 KTAU-D Trace
- KTAU-D can be used to access system-wide and individual process trace data of BGL IO-Node
- Trace from KTAU-D is converted into TAU trace format, which can then be converted into other formats
  - Vampir, Jumpshot
- Trace from KTAU-D can be used together (merged) with trace from TAU to monitor both user and kernel space activities
  - (Work in progress)

74 Experiment to Show KTAU in Use
- Workload
  - "iotest" benchmark on BGL
  - 2, 4, 16, 32, 48, and 64 compute nodes
- Use KTAU in IO-Node ZeptoOS Kernel
- Collect trace data
  - KTAU-D on IO-Node periodically monitors system activities and dumps out trace data
- We visualize the activities in the trace using Vampir

75 Experiment Setup (Parameters)
- KTAU:
  - Enable all instrumentation points
  - Number of kernel trace entries per process = 10K
- KTAU-D:
  - System-wide tracing
  - Access the trace every 1 second and dump trace output to a file in the user's home directory through NFS
- IOTEST:
  - An MPI-based benchmark (open/write/read/close)
  - Running with default parameters (blocksize = 16MB)

76 SYS_WRITE KTAU Trace of CIOD running 2, 4, 8, 16, 32 nodes
As the number of compute nodes increases, CIOD has to handle a larger number of forwarded system calls.
sys_write counts shown in the figure: 1,769; 3,142; 5,838; 10,980; 37,985

77 Zoomed View of CIOD Trace (8 compute nodes)

78 TAU Performance System Status
- Computing platforms
  - IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, …
- Programming languages
  - C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
- Thread libraries
  - pthreads, SGI sproc, Java, Windows, OpenMP
- Communications libraries
  - MPI-1/2, PVM, shmem, …
- Compilers
  - IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64

79 Project Affiliations (selected)
- Lawrence Livermore National Lab
  - Hydrodynamics (Miranda), radiation diffusion (KULL)
  - Open Trace Format (OTF) implementation on BG/L
- Argonne National Lab
  - ZeptoOS project and KTAU
  - Astrophysical thermonuclear flashes (Flash)
- Center for Simulation of Accidental Fires and Explosion
  - University of Utah, ASCI ASAP Center, C-SAFE
  - Uintah Computational Framework (UCF)
- Oak Ridge National Lab
  - Contribution to the Joule Report (S3D, AORSA3D)

80 Project Affiliations (continued)
- Sandia National Lab
  - Simulation of turbulent reactive flows (S3D)
  - Combustion code (CFRFS)
- Los Alamos National Lab
  - Monte Carlo transport (MCNP)
  - SAIC's Adaptive Grid Eulerian (SAGE)
- CCSM / ESMF / WRF climate/earth/weather simulation
  - NSF, NOAA, DOE, NASA, …
- Common component architecture (CCA) integration
- Performance Evaluation Research Center (PERC)
  - DOE SciDAC center

81 Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
    - MICS, Argonne National Lab
  - ASC/NNSA
    - University of Utah ASC/NNSA Level 1
    - ASC/NNSA, Lawrence Livermore National Lab
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF Software and Tools for High-End Computing
- Research Centre Juelich
- Los Alamos National Laboratory
- ParaTools

82 Acknowledgements
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, PRL staff
- Scott Biersdorff, PRL staff
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
- Kai Li, Ph.D. student
- Li Li, Ph.D. student
- Adnan Salman, Ph.D. student
- Suravee Suthikulpanit, M.S. student
