Dennis Hoppe (HLRS)
ATOM: A near-real time Monitoring Framework for HPC and Embedded Systems

WHY?
It’s all about saving energy
Energy consumption is a major challenge in HPC (Exascale Challenge) [Ashby et al., 2010]
– Energy consumption must be a design goal in future algorithm design
– Standardization of interfaces and APIs to collect energy consumption data (cf. PAPI) is needed
– Use of fine-grained measurement tools to evaluate energy saving effects on performance and vice versa
Greening of the HPC domain will become as important as the greening movement of the automotive domain

High Demand throughout European Projects
EXCESS [EXCESS, 2013]
– Build an energy-aware programming framework
DreamCloud [DreamCloud, 2013]
– Enable dynamic resource allocation to satisfy performance guarantees and minimize energy consumption
Predict performance and energy consumption of applications at run-time for further optimizations
Employ monitoring to retrieve detailed application profiles at run-time and for post-processing

Exploiting Monitoring Data in DreamCloud

Properties of Monitoring Solutions
Timeliness, Granularity, Extensibility, Architecture, Scalability, Adaptability, Data Storage, Visualization, Predictability, Non-Intrusiveness
[Aceto et al., 2013], [Katsaros et al., 2011], [Telesca et al., 2014]

Requirements Analysis (Selection)
Compared solutions: Zabbix, Nagios, OpenNMS
Key properties assessed: Architecture, Non-Intrusiveness, Scalability, Timeliness, Granularity, Extensibility, Data Storage, Visualization, Adaptability, Predictability
None of the existing monitoring solutions satisfies the requirements imposed by current projects!
Towards a novel monitoring framework

WHAT?
Key Features of ATOM
– Analyzing the system’s run-time context
– Low-intrusive, highly scalable architecture
– Flexible, language-independent plug-in system
– Light-weight and easy-to-grasp user library
– Integration with PBS resource manager for on-demand monitoring of applications
– Interactive web-based front-end for data exploration and analysis

Design Decisions
Key Property: ATOM
Architecture:
- agent-based (producer-consumer principle)
- implementation using the programming language C
Non-Intrusiveness:
- a minimal impact on the application’s performance at run-time
Scalability:
- allow high update rates of metrics while being low-intrusive
- allow online analytics (planned)
Timeliness:
- allow high update rates of metrics while being low-intrusive
Granularity:
- monitoring at different levels (infrastructure, applications, …)
Extensibility:
- easy-to-use plug-in system via RESTful API
Data Storage:
- efficient data storage, export, and analysis at run-time
Visualization:
- provide basic visualization functionality
Adaptability:
- platform-specific plug-in support
- re-configure plug-ins at run-time
Predictability:
- via plug-in system (outlook)

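The producer-consumer principle named under Architecture can be illustrated with a minimal Python sketch. ATOM itself is implemented in C; all names below (`producer`, `consumer`, `run_pipeline`) and the queue-based design are hypothetical and serve only to show the pattern, not ATOM's actual code.

```python
import queue
import threading

# Hypothetical illustration of the producer-consumer principle; not ATOM code.
_STOP = object()  # sentinel telling the consumer that no more samples follow


def producer(samples, out_queue):
    """Push each (metric, value) sample onto the queue, then signal completion."""
    for sample in samples:
        out_queue.put(sample)
    out_queue.put(_STOP)


def consumer(in_queue, sink):
    """Drain the queue into `sink` until the sentinel arrives."""
    while True:
        item = in_queue.get()
        if item is _STOP:
            break
        sink.append(item)


def run_pipeline(samples, maxsize=8):
    """Run one producer and one consumer thread over a bounded queue."""
    # A bounded queue makes a slow consumer throttle the producer.
    q = queue.Queue(maxsize=maxsize)
    sink = []
    threads = [
        threading.Thread(target=producer, args=(samples, q)),
        threading.Thread(target=consumer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Decoupling collection from storage in this way lets the collecting side keep its sampling rate while the consuming side handles the slower work of forwarding data.
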
HOW?
ATOM Deployment on the HLRS/EXCESS Cluster
– Used for software development, testing, profiling, and evaluations within HLRS and for external project partners
– Cluster is highly configurable and extensible; current power consumption is roughly between 0.5 and 2.0 kW
– Power measurement framework integrated with PBS system; no further performance overhead is induced while profiling applications

HLRS Power and Performance Measurement System

ATOM Architecture
– MONITOR: ATOM monitoring server
– ACTOR: ATOM metric collector
– Rickshaw (D3.js)
– NodeJS
– Elasticsearch

EXTENSIBILITY
List of Plug-ins
Performance Metrics
– PAPI-C
– Infiniband
– NVIDIA SMI
– /proc/meminfo
– /proc/vmstat
– Iostat
Energy Metrics
– RAPL
– Likwid
– hw_power (external measurement system)

Plug-in Development
Plug-ins are in general language-independent, as long as they satisfy the communication interface (RESTful API). We have currently implemented two types of plug-ins:
– Shell-based plug-ins
  Good for prototyping
  Need to handle configuration on their own, i.e.,
  – update frequency
  – enable/disable plug-in at run-time
  Induce extra performance costs
– C-based plug-ins
  Initial extra cost of implementation
  Simple configuration and integration with the monitoring framework

Shell-based Plugin (/proc/vmstat)

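Since plug-ins are described as language-independent, the core of a /proc/vmstat plug-in can be sketched in Python as well. This is a minimal sketch, not the shell plug-in from the slide: the function names and the JSON field names (`type`, `timestamp`, `metrics`) are assumptions, and sending the document to the monitoring server is left out because the request format is not given here.

```python
import json
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict of ints."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        try:
            metrics[name] = int(value)
        except ValueError:
            continue  # skip lines whose value is not an integer
    return metrics


def build_payload(metrics, wanted, timestamp=None):
    """Select the wanted counters and serialize them as a JSON document.

    The field names used here are hypothetical, not ATOM's actual schema.
    """
    selected = {name: metrics[name] for name in wanted if name in metrics}
    doc = {
        "type": "vmstat",
        "timestamp": time.time() if timestamp is None else timestamp,
        "metrics": selected,
    }
    return json.dumps(doc, sort_keys=True)


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the counters from a Linux system."""
    with open(path) as handle:
        return parse_vmstat(handle.read())
```

On a Linux node, `build_payload(read_vmstat(), ["pgfault", "pgpgin"])` would yield one sample; a real plug-in would repeat this at its configured update frequency.
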
Supporting Application-specific Metrics
Provided plug-ins based on Shell-scripting and C applications measure the impact of an application on the infrastructure (cf. PAPI-C, RAPL)
To capture application-specific data, we need code instrumentation!
– ATOM user library available in C and Python
Our API is based on the Application Response Measurement (ARM) standard for monitoring applications [Elarde et al., 2000]:
– init()
– start()
– update()
– stop()

ATOM User Library

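The four ARM-style calls (init, start, update, stop) suggest a simple life cycle for instrumented code. The class below is a hypothetical in-memory stand-in that only mirrors this call sequence; its name, arguments, and return values are assumptions and do not reproduce the real ATOM user library, which would send data to the monitoring server.

```python
import time


class MetricSession:
    """Hypothetical ARM-style session: init, start, update, stop."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock, useful for testing
        self.application = None
        self.records = []
        self._started_at = None

    def init(self, application):
        """Register the application name and reset any previous state."""
        self.application = application
        self.records = []
        self._started_at = None

    def start(self):
        """Mark the beginning of the measured region."""
        if self.application is None:
            raise RuntimeError("init() must be called before start()")
        self._started_at = self._clock()

    def update(self, name, value):
        """Record one application-specific metric with its elapsed time."""
        if self._started_at is None:
            raise RuntimeError("start() must be called before update()")
        elapsed = self._clock() - self._started_at
        self.records.append({"metric": name, "value": value, "elapsed": elapsed})

    def stop(self):
        """Close the measured region and return a summary of the session."""
        if self._started_at is None:
            raise RuntimeError("start() must be called before stop()")
        duration = self._clock() - self._started_at
        self._started_at = None
        return {
            "application": self.application,
            "duration": duration,
            "records": list(self.records),
        }
```

A typical call sequence would be `s = MetricSession(); s.init("solver"); s.start(); s.update("iterations", 42); summary = s.stop()`.
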
DATA ANALYSIS [MF.EXCESS-PROJECT.EU]
Summary of Experiments

Visualization of Metric Data

ATOM Export API
Applications = Workflows & Tasks (Register and retrieve workflows)
– GET /mf/workflows
– PUT /mf/workflows/:wid
– GET /mf/workflows/:wid
Experiments (Retrieve information on experiments)
– GET /mf/experiments
– GET /mf/experiments/:eid
Application Profiles (Retrieve application data)
– GET /mf/profiles/:wid
– GET /mf/profiles/:wid/:tid
– GET /mf/profiles/:wid/:tid/:eid
Energy Profiles (HLRS power measurement system)
– GET /mf/energy/:wid/:eid
– GET /mf/energy/:wid/:tid/:eid
Legend: wid = Workflow ID, tid = Task ID, eid = Experiment ID

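The routes listed above follow a regular pattern, which a client could capture in a few path-building helpers. The sketch below is illustrative only: the helper names are hypothetical, only the `/mf/...` paths come from the slide, and the server's host, port, and response format are not specified here, so no request is actually sent.

```python
from urllib.parse import quote


def _join(*segments):
    """Build an /mf/... path, percent-encoding every segment."""
    return "/mf/" + "/".join(quote(str(s), safe="") for s in segments)


def workflows_path(wid=None):
    """/mf/workflows or /mf/workflows/:wid"""
    return _join("workflows") if wid is None else _join("workflows", wid)


def experiments_path(eid=None):
    """/mf/experiments or /mf/experiments/:eid"""
    return _join("experiments") if eid is None else _join("experiments", eid)


def profiles_path(wid, tid=None, eid=None):
    """/mf/profiles/:wid, optionally narrowed by :tid and then :eid"""
    if eid is not None and tid is None:
        raise ValueError("an experiment ID requires a task ID")
    segments = ["profiles", wid]
    if tid is not None:
        segments.append(tid)
    if eid is not None:
        segments.append(eid)
    return _join(*segments)


def energy_path(wid, eid, tid=None):
    """/mf/energy/:wid/:eid or /mf/energy/:wid/:tid/:eid"""
    if tid is None:
        return _join("energy", wid, eid)
    return _join("energy", wid, tid, eid)
```

For example, `profiles_path("w1", "t1", "e1")` returns `/mf/profiles/w1/t1/e1`, which a client would append to the base URL of its monitoring server.
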
SUMMARY
Preliminary Experimental Results

Take Away Messages
ATOM
– is a light-weight and easy-to-use monitoring framework focusing on HPC and embedded system support
– has fundamental performance and energy metric support that can be easily extended via a user-friendly plug-in system
– offers users various interfaces to explore the profiling data (i.e., front-end, RESTful service, C and Python libraries)

References
[Aceto et al., 2013] – Cloud Monitoring: A Survey, Computer Networks 57(9)
[Ashby et al., 2010] – The Opportunities and Challenges of Exascale Computing, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee at the US Department of Energy Office of Science
[DreamCloud, 2013] –
[Elarde et al., 2000] – Performance analysis of application response measurement (ARM) version 2.0 measurement agent software implementations, Performance, Computing, and Communications Conference
[EXCESS, 2013] –
[Katsaros et al., 2011] – Monitoring: A fundamental Process to provide QoS Guarantees in Cloud based Platforms, Cloud Computing: Methodology, System, and Applications
[Telesca et al., 2014] – System Performance Monitoring of the ALICE Data Acquisition System with Zabbix, Journal of Physics