
CERN Agile Infrastructure Monitoring
Pedro Andrade, CERN IT/GT
HEPiX Spring 2012
CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)

Introduction
Motivation
– Several independent monitoring activities in IT: similar overall approach, different tool-chains, similar limitations
– High-level services are interdependent: combination of data from different groups necessary, but difficult
– Understanding performance became more important: requires more combined data and complex analysis
– Move to a virtualized, dynamic infrastructure: comes with complex new requirements on monitoring
Challenge
– Find a shared architecture and tool-chain components while preserving/improving our investment in monitoring

CERN/IT
>30 monitoring applications
– Number of producers: ~40k
– Input data volume: ~280 GB per day
Covering a wide range of different resources
– Hardware, OS, applications, files, jobs, etc.
Application-specific monitoring solutions
– Using different technologies (including commercial tools)
– Sharing similar needs: aggregate metrics, get alarms, etc.
Limited sharing of monitoring data
– Hard to implement complex monitoring queries

Architecture Constraints: Data
Aggregate monitoring data in a large data store
– For storage and combined analysis tasks
Make monitoring data easy to access by everyone
– Not forgetting possible security constraints
Select a simple and well supported data format
– Monitoring payload to be schema free
Rely on central metadata services
– To discover computer centre resource information
– E.g. which is the physical node running virtual machine A

Architecture Constraints: Technology
Follow a tool-chain approach
– Each tool can be easily replaced by a better one
Provide well established solutions for each layer of the architecture: transport, storage/analysis, etc.
– Adopt existing tools whenever possible
– Avoid home-grown solutions
Allow a phased transition to the new architecture
– Existing applications can be gradually integrated

Architecture Constraints: Use Cases
Guarantee that all essential monitoring use cases are considered and satisfied
– Fast and Furious (FF): get metric values for hardware and selected services; raise alarms according to appropriate thresholds (a minimal sketch follows below)
– Digging Deep (DD): curation of hardware and network historical data; analysis and statistics on batch job and network data
– Correlate and Combine (CC): correlation between usage, hardware, and services; correlation between job status and grid status
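To make the Fast and Furious case concrete, a minimal threshold-check sketch in Python; the metric names, threshold values, and the notify() helper are illustrative assumptions, not part of the deck:

# Minimal "Fast and Furious" sketch: compare incoming metric values against
# per-metric thresholds and raise an alarm. All names and values below are
# hypothetical examples, not taken from the CERN tool-chain.
THRESHOLDS = {
    "cpu.load.1min": 10.0,
    "disk.used.percent": 90.0,
}

def notify(message):
    # Placeholder delivery: in the architecture described here this would go
    # through the notification/dashboard layer to operators or service managers.
    print(message)

def check(node, metric, value):
    limit = THRESHOLDS.get(metric)
    if limit is not None and value > limit:
        notify("ALARM %s: %s=%s exceeds threshold %s" % (node, metric, value, limit))

check("node042.cern.ch", "cpu.load.1min", 12.3)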

Architecture Overview
[Architecture diagram: producers/sensors (including Lemon) publish to the messaging broker (Apollo); storage consumers feed the storage and analysis engine (Hadoop), operations consumers feed the operations tools, and dashboards and APIs (Splunk) sit on top.]

Architecture Overview
All components can be changed easily
– Including the messaging system (standard protocol)
Messaging and storage as central components
– Tools connect either to the messaging or to the storage
Scalability can be addressed
– Horizontal scaling
– Adding additional layers (e.g. pre-aggregation)
Messages based on a common format (JSON)
– A simple common schema is being defined to guarantee cross-referencing between data sets: timestamp, node, etc. (a sketch of such a message follows below)
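For illustration, a minimal sketch in Python of what such a JSON message could look like; only the timestamp and node fields come from the slide, the remaining field names are hypothetical since the common schema was still being defined at the time:

import json
import socket
import time

# Hypothetical monitoring message following the common-format idea above:
# a few fixed fields for cross-referencing plus a schema-free payload.
message = {
    "timestamp": int(time.time()),   # epoch seconds, common to all data sets
    "node": socket.getfqdn(),        # producing host, common to all data sets
    "producer": "example-sensor",    # hypothetical field: originating application
    "payload": {                     # schema-free monitoring payload
        "metric": "cpu.load.1min",
        "value": 0.42,
    },
}

print(json.dumps(message))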

Producers and Sensors
Monitoring data generated by all resources
– Monitoring metadata available at the node
– Published to messaging using common libraries (a producer sketch follows below)
– May also be produced as a result of pre-aggregation or post-processing tasks
Support and integrate closed monitoring solutions
– By injecting final results into the messaging layer or exporting relevant data at an intermediate stage
[Diagram: the layers Sensor, Messaging, Storage, Analysis, Visualization, with an integrated (closed) product shown spanning these layers.]
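A sketch of how a producer could publish such a message over STOMP, the protocol spoken by Apollo/ActiveMQ, assuming the stomp.py client library as one possible "common library"; the broker host, credentials, and destination name are invented for the example, and the send() signature may differ slightly between stomp.py versions:

import json
import socket
import time
import stomp  # third-party STOMP client (assumed here, not named in the deck)

message = {
    "timestamp": int(time.time()),
    "node": socket.getfqdn(),
    "payload": {"metric": "cpu.load.1min", "value": 0.42},
}

# Hypothetical broker endpoint and destination.
conn = stomp.Connection([("broker.example.cern.ch", 61613)])
conn.connect("producer", "secret", wait=True)
conn.send(
    destination="/topic/monitoring.metrics",
    body=json.dumps(message),
    content_type="application/json",
)
conn.disconnect()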

Messaging Broker
Monitoring data transported via messaging
– Provide a network of messaging brokers
– Support for multiple configurations
– The needs of each monitoring application must be clearly analyzed and defined: total number of producers and consumers, size of the monitoring message, rate of the monitoring message (see the estimate below)
– Realistic testing environments are required to produce reliable performance numbers
First tests with Apollo (ActiveMQ)
– Prior positive experience in IT and the experiments
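As an example of the kind of analysis the slide asks for, a back-of-the-envelope estimate of broker load in Python; only the producer count matches the ~40k quoted earlier, the average message size and publication rate are assumptions:

# Rough broker-load estimate. Only the producer count comes from the deck
# (~40k); average message size and publication rate are assumed values.
producers = 40000
avg_message_bytes = 5000        # assumed average JSON message size
messages_per_minute = 1         # assumed rate per producer

msgs_per_second = producers * messages_per_minute / 60.0
throughput_mb_per_s = msgs_per_second * avg_message_bytes / 1e6
daily_volume_gb = msgs_per_second * avg_message_bytes * 86400 / 1e9

print("%.0f msg/s, %.2f MB/s, %.0f GB/day"
      % (msgs_per_second, throughput_mb_per_s, daily_volume_gb))

With these assumed figures the daily volume lands in the same order of magnitude as the ~280 GB/day quoted for today's monitoring applications, which is one way to sanity-check the assumptions before sizing a realistic testing environment.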

Storage and Analysis
Monitoring data stored in a common location
– Eases the sharing of monitoring data and analysis tools
– Allows feeding already processed data into the system
– NoSQL technologies are the most suitable solutions: focus on column/tabular solutions
First tests with Hadoop (Cloudera distribution)
– Prior positive experience in IT and the experiments
– Map-reduce paradigm is a good match for the use cases (a streaming sketch follows below)
– Has been used successfully at scale
– Several related modules available (Hive, HBase)
For particular use cases a parallel relational database solution (Oracle) can be considered
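A minimal Hadoop Streaming sketch of the map-reduce match mentioned above: averaging one metric per node from JSON messages stored one per line. The field names follow the hypothetical message sketch earlier in these notes; this is an illustration, not the actual analysis code.

#!/usr/bin/env python
# mapper.py: emit "node <TAB> value" for one metric of interest.
import sys, json

for line in sys.stdin:
    try:
        msg = json.loads(line)
    except ValueError:
        continue  # skip malformed records
    payload = msg.get("payload", {})
    if payload.get("metric") == "cpu.load.1min":
        print("%s\t%s" % (msg["node"], payload["value"]))

and the matching reducer:

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so average the values per node.
import sys

node, total, count = None, 0.0, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != node:
        if node is not None:
            print("%s\t%f" % (node, total / count))
        node, total, count = key, 0.0, 0
    total += float(value)
    count += 1
if node is not None:
    print("%s\t%f" % (node, total / count))

The pair would typically be submitted through the Hadoop Streaming jar with its -mapper and -reducer options; the same aggregation could equally be expressed in Hive, one of the related modules mentioned above.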

Notifications and Dashboards
Provide an efficient delivery of notifications
– Notifications directly sent to the correct consumer targets
– Possible targets: operators, service managers, etc.
Provide powerful dashboards and APIs
– Complex queries on cross-domain monitoring data
First tests with Splunk (a query sketch follows below)
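As an illustration of the kind of cross-domain query Splunk would serve, a sketch using its REST search API from Python; the host, credentials, index, and field names are invented for the example and are not taken from the CERN setup:

import requests

# Hypothetical Splunk endpoint (8089 is the usual management port).
resp = requests.post(
    "https://splunk.example.cern.ch:8089/services/search/jobs/export",
    auth=("admin", "changeme"),
    data={
        "search": "search index=monitoring metric=cpu.load.1min | stats avg(value) by node",
        "output_mode": "json",
    },
    verify=False,  # test instances often use self-signed certificates
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))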

Status and Plans
Q
– Definition of monitoring model for notifications
– Definition and porting of Lemon as monitoring producer
– Deployment and initial testing of Apollo, Hadoop, Splunk
Q
– Enable operations notifications via the messaging infrastructure
– Test Splunk as dashboard for operations notifications
– Consume monitoring data to Hadoop and initial testing
Q
– Several applications publishing to messaging infrastructure
– More complex data store/analysis with different data sets on Hadoop
Q
– All monitoring data managed via the messaging infrastructure
– Large scale data store/analysis on Hadoop cluster

Summary
A monitoring architecture has been defined
– Promotes sharing of monitoring data between applications
– Tool-chain approach based on a few core components
– Several existing external technologies identified
A concrete implementation plan is ongoing
– It assures a smooth transition for today's applications
– It enables the Agile Infrastructure to be continuously monitored
– It allows moving towards a common system

Thank You!

Backup Slides

Links
– Monitoring WG Twiki (new location!)
– Monitoring WG Report (ongoing): https://twiki.cern.ch/twiki/bin/view/MonitoringWG/MonitoringReport
– Agile Infrastructure TWiki
– Agile Infrastructure JIRA