Published by Shannon Anthony. Modified over 9 years ago.
1
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA. UNCLASSIFIED
The Case for Monitoring and Testing
David Montoya
CScADS, July 15, 2013
LA-UR-13-25132
2
From a Production Computing Perspective
- Where do traditional performance analysis tools fit in the process, and what is the usage model?
  - Low use / usage entry cost / skill required
- What usage model will increase awareness and drive both application and environment efficiency?
  - Monitor health of both applications and system resources
  - Baseline and track
  - Proper balance of tools to track and probe
3
Target Usage – Monitoring and Testing
User
- Understand how applications are utilizing platform resources
- Diagnose problems
- Adjust mapping of processes onto resources to optimize for: minimum resource use, minimum power consumption, shortest run-time
System/Software Administrators
- Diagnose problems / discover root causes
- Ensure health and balance of the system
- Mitigate effects of errors
- Develop better utilization policies for all resources
System Architects
- Develop a deep understanding of interactions between system components (hardware, firmware, system software, application)
- Develop new architectural features to address current shortcomings
4
Current State of Affairs
- It is no longer enough to analyze the performance of the application alone.
- A wide range of evolving node/processor architectures forces closer assessment of the environment.
- As the scale of resources and of the compute environment increases, machine failure rates (MTTF / MTTI) come to the forefront.
- New resources such as burst buffers, file system architectures, IO approaches (PLFS), tools, and programming models impact resource utilization and performance.
- Issues such as power management are having a larger impact.
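To illustrate why failure rates come to the forefront as scale grows, here is a minimal sketch. It assumes independent nodes with constant (exponential) failure rates, under which system MTTF is roughly the per-node MTTF divided by the node count. The 50,000-hour node MTTF and the function name are hypothetical values chosen for illustration, not figures from the presentation.

```python
def system_mttf_hours(node_mttf_hours: float, node_count: int) -> float:
    """Approximate system MTTF, assuming independent nodes with
    constant failure rates (a simplifying assumption)."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mttf_hours / node_count


if __name__ == "__main__":
    # Hypothetical per-node MTTF; real values depend on the hardware.
    node_mttf = 50_000.0
    for nodes in (100, 1_000, 10_000):
        print(f"{nodes:>6} nodes -> {system_mttf_hours(node_mttf, nodes):8.1f} h")
```

Under these assumptions, a tenfold increase in node count cuts the expected time between failures by a factor of ten, which is why interrupts become a first-order concern at scale.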
5
Moving toward tighter integration
- As scale increases, the computing architecture becomes more integrated with sub-systems to provide services. Distributed approaches for those services are evolving.
- Additional run-time systems that are more tightly integrated are evolving.
- We have come full circle: compute environments are no longer individual components or loosely coupled systems, but architected systems that need to behave in a more holistic manner.
- The focus of HPC performance analysis needs to move from application performance alone to the application's ability to perform in a given computing environment, and to the environment's own performance. This is a move toward balance and resource utilization, and it targets application flexibility.
6
The current tool box and evolution
Typical monitoring systems target failure detection, uptime, and resource state/trend overview:
- Information targeted to system administration
- Collection intervals of minutes
- Relatively high overhead (both compute node and aggregators)
Application profiling/debugging/tracing tools:
- Collection intervals of sub-seconds (even sub-millisecond)
- Typically require linking (i.e., tools may perturb the application profile)
- Limits on scale
- Don't account for external applications competing for the same resource
Lightweight Distributed Metric Service (LDMS), a monitoring tool example:
- Continuous data collection, transport, storage as a system service
- Targets system administrators, users, and applications
- Enables collection of a reasonably large number of metrics with collection periods that enable job-centric resource utilization analysis and run-time anomaly detection
- Variable collection period (~seconds)
- On-node interface to run-time data
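As a rough illustration of what run-time anomaly detection over seconds-scale samples could look like, the sketch below flags a sample when it deviates from the trailing-window mean by more than a chosen number of standard deviations. This is not the LDMS API or its detection method; `detect_anomalies`, the window size, and the threshold are hypothetical choices made for this example.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing-window
    mean by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Floor sigma so a perfectly flat window does not flag
            # every tiny change or divide-by-zero style comparisons.
            if abs(value - mu) > threshold * max(sigma, 1e-9):
                flagged.append(i)
        history.append(value)
    return flagged


if __name__ == "__main__":
    # Hypothetical metric samples, one per collection period.
    metric = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
    print(detect_anomalies(metric))
```

A trailing window keeps memory per metric constant, which matters when a service collects many metrics per node. One trade-off of this simple scheme is that a flagged spike enters the window and temporarily widens the tolerance for the samples that follow.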
7
How do you move forward? Data and integration.
- You need to understand the health of the system and where there is stress, and tie that back to application behavior.
- This includes aspects of traditional application analysis, plus system monitoring of all key subsystems, with the ability to assess the impact of application behavior and resource interaction.
- Integrate the data to assess the application and the various subsystems, then apply solutions to better balance, enact efficiencies, and establish throughput.
Monitoring and Testing
- Collect system and subsystem data: network, file systems, compute nodes, resource manager data, etc.
- Currently collaborating with monitoring tools development (SNL, others).
- Taking inventory via the Monitoring and Testing Summit.
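One way to picture tying system data back to application behavior is to join node-level metric samples with resource manager job records by node and time window. The sketch below assumes hypothetical record shapes (a `Job` with an ID, node set, and start/end times, and samples as `(node, timestamp, value)` tuples); real resource manager logs and monitoring stores will differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record as it might come from a resource manager."""
    job_id: str
    nodes: frozenset
    start: float  # seconds since some epoch
    end: float


def per_job_mean(jobs, samples):
    """Average a node-level metric over each job's nodes and run window.

    `samples` is an iterable of (node, timestamp, value) tuples.
    Samples outside every job's nodes or time window are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for node, ts, value in samples:
        for job in jobs:
            if node in job.nodes and job.start <= ts < job.end:
                sums[job.job_id] += value
                counts[job.job_id] += 1
    return {job_id: sums[job_id] / counts[job_id] for job_id in sums}


if __name__ == "__main__":
    jobs = [
        Job("a", frozenset({"n1", "n2"}), 0.0, 100.0),
        Job("b", frozenset({"n3"}), 50.0, 150.0),
    ]
    samples = [
        ("n1", 10.0, 2.0),
        ("n2", 20.0, 4.0),
        ("n3", 60.0, 10.0),
        ("n3", 10.0, 99.0),   # before job b started, ignored
        ("n1", 120.0, 7.0),   # after job a ended, ignored
    ]
    print(per_job_mean(jobs, samples))
```

The nested loop is fine for a sketch; at production scale an interval index or a time-partitioned store would likely be needed.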
8
LANL Monitoring and Testing Summit
Monitoring / Testing Frameworks: Splunk Zenoss RabbitMQ LDMS framework Monitoring Infrastructure OVIS – HPC system analysis Gazebo Testing Framework CTS Testing Framework
Application: MTT OpenMPI - testing Darshan IO analysis EAP and LAP dashboards ByFL
Network: IB Performance monitoring IB Monitoring IB Error monitoring ibperf_seq, ibperf_ring, ibperf_agg, mpiring IDS Project (security) Network Monitoring in Splunk DISCOM Testing Trilab Data Transfer Cat 2 function/Performance testing
9
LANL Monitoring and Testing Summit – cont.
File Systems: File Systems Monitoring in Splunk New System Integration testing FStools, file system tools for the users File system treewalk File System Health Check Splunk FTA monitoring PLFS Regression and Performance Testing Panfs Release File System Testing and Analysis HPSS Monitoring
Cluster/Node: Baler -- Log file analysis tool – LDMS node collection Automatic Library Tracking Database (ALTD) General software usage tracking Cielo DRAM and SRAM monitoring – HPCSTATs (Reporting more than monitoring) Moab Logs CBTF based GPU/Nvidia monitoring GPU/Cluster Testing SecMon / Security Monitoring via Zenoss Splunk Cluster Testing New System Integration Post DST /Utilization testing Software testing
10
Next Steps
- Assess efforts, integrate
- Assess data, integrate
- Assess information view to target users, integrate
- Start over