
Presentation transcript:

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
UNCLASSIFIED
The Case for Monitoring and Testing
David Montoya
CScADS, July 15, 2013
LA-UR

From a Production Computing Perspective
- Where do traditional performance analysis tools fit in the process, and what is the usage model?
  - Low use / usage entry cost / skill required
- What is the usage model that will increase awareness and drive both application and environment efficiency?
  - Monitor health of both applications and system resources
  - Baseline and track
  - Proper balance of tools to track and probe

Target Usage - Monitoring and Testing
User
- Understand how applications are utilizing platform resources
- Diagnose problems
- Adjust mapping of processes onto resources to optimize for: minimum resource use, minimum power consumption, shortest run-time
System/Software Administrators
- Diagnose problems / Discover root causes
- Ensure health and balance of the system
- Mitigate effects of errors
- Develop better utilization policies for all resources
System Architects
- Develop a deep understanding of interactions between system components (hardware, firmware, system software, application)
- Develop new architectural features to address current shortcomings

Current State of Affairs
- It is no longer enough to analyze the performance of the application alone. A wide range of evolving node/processor architectures forces closer assessment of the environment.
- With the increasing scale of resources and compute environments, machine failure rates come to the forefront (MTTF / MTTI).
- New resources such as burst buffers, file system architectures, I/O approaches (PLFS), tools, programming models, etc. impact resource utilization and performance.
- Issues such as power management are having a larger impact.
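To illustrate why failure rates come to the forefront as scale increases, here is a minimal sketch, not taken from the slides. It assumes nodes fail independently with exponentially distributed failure times, so that per-node failure rates add. The function name and the 5-year per-node MTTF are hypothetical figures chosen only for illustration.

```python
def system_mttf_hours(node_mttf_hours: float, node_count: int) -> float:
    """Mean time until the first node failure anywhere in the system.

    Assumption: nodes fail independently with exponentially distributed
    failure times, so the system failure rate is node_count / node_mttf_hours.
    """
    if node_mttf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mttf_hours and node_count must be positive")
    return node_mttf_hours / node_count


# Hypothetical figure: each node fails on average once every 5 years.
NODE_MTTF_HOURS = 5 * 365 * 24

for nodes in (100, 1_000, 10_000):
    print(f"{nodes:>6} nodes -> {system_mttf_hours(NODE_MTTF_HOURS, nodes):.2f} h")
```

Under these assumptions the time between failures shrinks in direct proportion to node count, which is why a per-node failure rate that is negligible on a small cluster becomes an operational concern at scale.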

Moving toward tighter integration
- As scale increases, the computing architecture becomes more integrated with sub-systems to provide services. Distributed approaches for those services are evolving.
- Additional run-time systems that are more tightly integrated are evolving.
- We have come full circle: compute environments are no longer individual, loosely coupled components or systems, but architected systems that need to behave in a more holistic manner.
- The focus of HPC performance analysis needs to move from application performance alone to the application's ability to perform in a given computing environment, and to the environment’s own performance.
- This is a move toward balance and resource utilization, and it targets application flexibility.

The current tool box and evolution
Typical monitoring systems target failure detection, uptime, and resource state/trend overview:
- Information targeted to system administration
- Collection intervals of minutes
- Relatively high overhead (both compute node and aggregators)
Application profiling/debugging/tracing tools:
- Collection intervals of sub seconds (even sub-millisecond)
- Typically requires linking (i.e. tools may perturb the application profile)
- Limits on scale
- Don’t account for external applications competing for the same resource
Monitoring tool example, Lightweight Distributed Metric Service (LDMS):
- Continuous data collection, transport, storage as a system service
- Targets system administrators, users, and applications
- Enables collection of a reasonably large number of metrics with collection periods that enable job-centric resource utilization analysis and run-time anomaly detection
- Variable collection period (~seconds)
- On-node interface to run-time data
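The run-time anomaly detection mentioned above can be sketched in a generic way. The following is an illustrative example only: it does not use or represent the LDMS API, and the function name, window size, and threshold are hypothetical choices. It assumes one metric value arrives per collection period and flags values that deviate strongly from a trailing baseline.

```python
from collections import deque
from statistics import mean, stdev


def flag_anomalies(samples, window=30, threshold=3.0):
    """Return indices of samples that deviate from the trailing baseline.

    samples:   sequence of floats, one per collection period
    window:    number of preceding samples that form the baseline
    threshold: number of baseline standard deviations treated as anomalous
    """
    baseline = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(baseline) == window:
            mu = mean(baseline)
            sigma = stdev(baseline)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                flagged.append(i)
                continue  # keep the outlier out of the baseline
        baseline.append(value)
    return flagged


# Hypothetical per-second readings of one metric with a single spike.
readings = [10.0, 10.5, 9.5, 10.2, 9.8] * 6 + [50.0] + [10.0] * 3
print(flag_anomalies(readings))  # prints [30]
```

A trailing window is used so the baseline follows slow drift in the metric. Excluding flagged values from the window keeps a single spike from inflating the baseline for later samples.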

How do you move forward? Data and integration.
- You need to understand the health of the system and where there is stress, and tie it back to application behavior.
- This includes aspects of traditional application analysis, plus system monitoring of all key subsystems, with the ability to assess the impact of application behavior and resource interaction.
- Integrate the data to assess the application and the various subsystems, then apply solutions to better balance, enact efficiencies, and establish throughput.
Monitoring and Testing
- Collect system and subsystem data: network, file systems, compute nodes, resource manager data, etc.
- Currently collaborating with monitoring tools development (SNL, others).
- Taking inventory via the Monitoring and Testing Summit.
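One way to tie system data back to application behavior is to join node-level samples with job records from the resource manager. The sketch below is a hedged illustration under assumed data shapes: the `Job` and `Sample` record types, their field names, and the metric name are all hypothetical and do not correspond to any specific tool named in this talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical resource manager record for one job."""
    job_id: str
    nodes: frozenset
    start: float  # epoch seconds
    end: float    # epoch seconds


@dataclass(frozen=True)
class Sample:
    """Hypothetical monitoring sample from one node."""
    node: str
    timestamp: float  # epoch seconds
    metric: str
    value: float


def per_job_means(jobs, samples):
    """Average each metric over samples taken on a job's nodes during its run."""
    totals = defaultdict(lambda: [0.0, 0])  # (job_id, metric) -> [sum, count]
    for s in samples:
        for job in jobs:
            if s.node in job.nodes and job.start <= s.timestamp <= job.end:
                acc = totals[(job.job_id, s.metric)]
                acc[0] += s.value
                acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


jobs = [Job("j1", frozenset({"n1", "n2"}), start=0.0, end=100.0)]
samples = [
    Sample("n1", 10.0, "mem_used_gb", 4.0),
    Sample("n2", 20.0, "mem_used_gb", 6.0),
    Sample("n3", 20.0, "mem_used_gb", 100.0),  # node not in the job
    Sample("n1", 200.0, "mem_used_gb", 50.0),  # after the job ended
]
print(per_job_means(jobs, samples))  # prints {('j1', 'mem_used_gb'): 5.0}
```

The nested loop is fine for a small example. At production scale an interval index or a database join on node and time range would likely be needed.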

LANL Monitoring and Testing Summit
Monitoring / Testing Frameworks: Splunk Zenoss RabbitMQ LDMS framework Monitoring Infrastructure OVIS – HPC system analysis Gazebo Testing Framework CTS Testing Framework
Application: MTT OpenMPI - testing Darshan IO analysis EAP and LAP dashboards ByFL
Network: IB Performance monitoring IB Monitoring IB Error monitoring ibperf_seq, ibperf_ring, ibperf_agg, mpiring IDS Project (security) Network Monitoring in Splunk DISCOM Testing Trilab Data Transfer Cat 2 function/Performance testing

LANL Monitoring and Testing Summit – cont.
File Systems: File Systems Monitoring in Splunk New System Integration testing FStools, file system tools for the users File system treewalk File System Health Check Splunk FTA monitoring PLFS Regression and Performance Testing Panfs Release File System Testing and Analysis HPSS Monitoring
Cluster/Node: Baler -- Log file analysis tool – LDMS node collection Automatic Library Tracking Database (ALTD) General software usage tracking Cielo DRAM and SRAM monitoring – HPCSTATs (Reporting more than monitoring) Moab Logs CBTF based GPU/Nvidia monitoring GPU/Cluster Testing SecMon / Security Monitoring via Zenoss Splunk Cluster Testing New System Integration Post DST /Utilization testing Software testing

Next Steps
- Assess efforts, integrate
- Assess data, integrate
- Assess information view to target users, integrate
- Start over