Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework.

Slides:



Advertisements
Similar presentations
ALICE G RID SERVICES IP V 6 READINESS
Advertisements

May 2005 Iosif Legrand 1 Iosif Legrand California Institute of Technology May 2005 An Agent Based, Dynamic Service System to Monitor, Control and Optimize.
May 2005 Iosif Legrand 1 Iosif Legrand California Institute of Technology ICFA WORKSHOP Daegu, May 2005 Daegu, May 2005 An Agent Based, Dynamic Service.
MONITORING WITH MONALISA Costin Grigoras. M ONITORING WITH M ON ALISA What is MonALISA ? MonALISA communication architecture Monitoring modules ApMon.
DataGrid is a project funded by the European Union 22 September 2003 – n° 1 EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance
June 2003 Iosif Legrand MONitoring Agents using a Large Integrated Services Architecture Iosif Legrand California Institute of Technology.
Monitoring and controlling VRVS Reflectors Catalin Cirstoiu 3/7/2003.
October 2003 Iosif Legrand Iosif Legrand California Institute of Technology.
The new The new MONARC Simulation Framework Iosif Legrand  California Institute of Technology.
Institute of Computer Science AGH Performance Monitoring of Java Web Service-based Applications Włodzimierz Funika, Piotr Handzlik Lechosław Trębacz Institute.
Grid Monitoring By Zoran Obradovic CSE-510 October 2007.
Thomas Finnern Evaluation of a new Grid Engine Monitoring and Reporting Setup.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
September 2005 Iosif Legrand 1 End User Agents: extending the "intelligence" to the edge in Distributed Service Systems Iosif Legrand California Institute.
Experience of xrootd monitoring for ALICE at RDIG sites G.S. Shabratova JINR A.K. Zarochentsev SPbSU.
Oracle10g RAC Service Architecture Overview of Real Application Cluster Ready Services, Nodeapps, and User Defined Services.
Online Monitoring with MonALISA Dan Protopopescu Glasgow, UK Dan Protopopescu Glasgow, UK.
Module 10: Monitoring ISA Server Overview Monitoring Overview Configuring Alerts Configuring Session Monitoring Configuring Logging Configuring.
INFSO-RI Enabling Grids for E-sciencE Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the.
ACAT 2003 Iosif Legrand Iosif Legrand California Institute of Technology.
Ramiro Voicu December Design Considerations  Act as a true dynamic service and provide the necessary functionally to be used by any other services.
1 Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa Monitoring and.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES PhEDEx Monitoring Nicolò Magini CERN IT-ES-VOS For the PhEDEx.
Monitoring in EGEE EGEE/SEEGRID Summer School 2006, Budapest Judit Novak, CERN Piotr Nyczyk, CERN Valentin Vidic, CERN/RBI.
February 2006 Iosif Legrand 1 Iosif Legrand California Institute of Technology February 2006 February 2006 An Agent Based, Dynamic Service System to Monitor,
1 Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Catalin Cirstoiu, Ciprian Dobre An Agent Based, Dynamic Service System to Monitor, Control.
May PEM status report. O.Bärring 1 PEM status report Large-Scale Cluster Computing Workshop FNAL, May Olof Bärring, CERN.
CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18 Monitoring of a distributed computing system: the AliEn Grid Alice Offline weekly meeting.
Costin Grigoras ALICE Offline. In the period of steady LHC operation, The Grid usage is constant and high and, as foreseen, is used for massive RAW and.
The huge amount of resources available in the Grids, and the necessity to have the most up-to-date experimental software deployed in all the sites within.
And Tier 3 monitoring Tier 3 Ivan Kadochnikov LIT JINR
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks WMSMonitor: a tool to monitor gLite WMS/LB.
Site operations Outline Central services VoBox services Monitoring Storage and networking 4/8/20142ALICE-USA Review - Site Operations.
LISHEP 2004 Iosif Legrand Iosif Legrand California Institute of Technology DISTRIBUTED SERVICES.
SAN DIEGO SUPERCOMPUTER CENTER Inca TeraGrid Status Kate Ericson November 2, 2006.
Overview of ALICE monitoring Catalin Cirstoiu, Pablo Saiz, Latchezar Betev 23/03/2007 System Analysis Working Group.
Local Monitoring at SARA Ron Trompert SARA. Ganglia Monitors nodes for Load Memory usage Network activity Disk usage Monitors running jobs.
Monitoring with MonALISA Costin Grigoras. What is MonALISA ?  Caltech project started in 2002
1 MonALISA Team Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan MonALISA capabilities for the LHCOPN LHCOPN.
Xrootd Monitoring and Control Harsh Arora CERN. Setting Up Service  Monalisa Service  Monalisa Repository  Test Xrootd Server  ApMon Module.
ClearQuest XML Server with ClearCase Integration Northwest Rational User’s Group February 22, 2007 Frank Scholz Casey Stewart
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
April 2003 Iosif Legrand MONitoring Agents using a Large Integrated Services Architecture Iosif Legrand California Institute of Technology.
PPDG February 2002 Iosif Legrand Monitoring systems requirements, Prototype tools and integration with other services Iosif Legrand California Institute.
JAliEn Java AliEn middleware A. Grigoras, C. Grigoras, M. Pedreira P Saiz, S. Schreiner ALICE Offline Week – June 2013.
AliEn central services Costin Grigoras. Hardware overview  27 machines  Mix of SLC4, SLC5, Ubuntu 8.04, 8.10, 9.04  100 cores  20 KVA UPSs  2 * 1Gbps.
+ AliEn site services and monitoring Miguel Martinez Pedreira.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
October 2006 Iosif Legrand 1 Iosif Legrand California Institute of Technology An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed.
03/09/2007http://pcalimonitor.cern.ch/1 Monitoring in ALICE Costin Grigoras 03/09/2007 WLCG Meeting, CHEP.
Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.
Gennaro Tortone, Sergio Fantinel – Bologna, LCG-EDT Monitoring Service DataTAG WP4 Monitoring Group DataTAG WP4 meeting Bologna –
TF meeting – July 13, 2006 Support for taking actions in MonALISA Costin Grigoras.
MONITORING WITH MONALISA Costin Grigoras. M ON ALISA COMMUNICATION ARCHITECTURE MonALISA software components and the connections between them Data consumers.
1 R. Voicu 1, I. Legrand 1, H. Newman 1 2 C.Grigoras 1 California Institute of Technology 2 CERN CHEP 2010 Taipei, October 21 st, 2010 End to End Storage.
A System for Monitoring and Management of Computational Grids Warren Smith Computer Sciences Corporation NASA Ames Research Center.
TIFR, Mumbai, India, Feb 13-17, GridView - A Grid Monitoring and Visualization Tool Rajesh Kalmady, Digamber Sonvane, Kislay Bhatt, Phool Chand,
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES Author etc Alarm framework requirements Andrea Sciabà Tony Wildish.
Enabling Grids for E-sciencE Claudio Cherubino INFN DGAS (Distributed Grid Accounting System)
MONALISA MONITORING AND CONTROL Costin Grigoras. O UTLINE MonALISA services and clients Usage in ALICE Online SE discovery mechanism Data management 3.
January 2006 Iosif Legrand 1 Iosif Legrand California Institute of Technology January 2006 An Agent Based, Dynamic Service System to Monitor, Control and.
Storage discovery in AliEn
ECSAC- August 2009, Veli Losink California Institute of Technology
California Institute of Technology
ALICE Monitoring
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
Sergio Fantinel, INFN LNL/PD
A Messaging Infrastructure for WLCG
Publishing ALICE data & CVMFS infrastructure monitoring
Monitoring of the infrastructure from the VO perspective
Presentation transcript:

Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif Legrand 25/06/2007 HPDC 2007 Workshop on Grid Monitoring Monterey, California

Contents Monitoring requirements MonALISA overview Application monitoring Monitoring architecture in AliEn Jobs monitoring Traffic monitoring Services monitoring Nodes monitoring Actions framework Feature snapshots

Monitoring Requirements Global view of the entire distributed system Least-intrusive As accurate as possible Best-effort data transport Minimizing the requirements for open ports Providing Near real-time information Long-term history of aggregated data On key parameters like System status Resource usage Helping with Correlating events System debugging Generating reports Taking automated actions based on the monitored data

Data Store MonALISA Overview MonALISA is a dynamic distributed framework Collects any type of information from different systems Aggregates and analyzes it in near-real time Provides support for automated control decisions and global optimization of workflows in complex distributed systems. Data Cache Service & DB Configuration Control (SSL) Predicates & Agents Data (via ML Proxy) Applications Java Client (other service) Agents Filters Data Modules WS Client (other service) Web Service WSDL SOAP Lookup Service Lookup Service Registration Discovery Postgres MySQL

ML Discovery System & Services Hierarchical structure of loosely coupled services Independent & autonomous entities able to Publish their existence Discover other available Jini-enabled services Use a dynamic set of proxies to cooperate with them Network of JINI-LUSs Secure & Public MonALISA services Proxies Clients, Repositories, HL services Global Services or Clients Dynamic load balancing Scalability & Replication Security AAA for Clients Distributed System for gathering and Analyzing Information Distributed Dynamic Discovery-based on a lease Mechanism and REN Agents

ApMon – Application Monitoring MonALISA Service MonALISA Service ApMon APPLICATION Monitoring Data UDP/XDR Mbps_out: 0.52 Status: reading App. Monitoring MB_inout: ApMon Config parameter1: value parameter2: value App. Monitoring... Time;IP;procID Monitoring Data UDP/XDR Monitoring Data UDP/XDR load1: 0.24 processes: 97 System Monitoring pages_in: 83 MonALISA hosts Config Servlet dynamic reloading ApMon configuration generated automatically by a servlet / CGI script No Lost Packages Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Services High comm. performance Flexible Accounting Sys Mon

Long History DB Monitoring architecture in AliEn LCG Tools ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn Job Agent MonALISA LCG Site ApMon AliEn CE ApMon AliEn SE ApMon Cluster Monitor ApMon AliEn TQ ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn Job Agent ApMon AliEn CE ApMon AliEn SE ApMon Cluster Monitor ApMon AliEn IS ApMon AliEn Optimizers ApMon AliEn Brokers ApMon MySQL Servers ApMon CastorGrid Scripts ApMon API Services MonaLisaRepository Aggregated Data rss vsz cpu time run time job slots free space nr. of files open files Queued JobAgents cpu ksi2k job status disk used processes load net In/out jobs status sockets migrated mbytes active sessions MyProxy status Alerts Actions

Job status monitoring Global summaries For each/all conditions For each/all sites For each/all users Running & cumulative Error status From job agents From central services Real-time map view Integrated pie charts History plots

Real-time Map

Integrated Pie Charts

History Plots, Annotations

Job Resource Usage Cumulative parameters CPU Time & CPU KSI2K Wall time & Wall KSI2K Read & written files Input & output traffic (xrootd) Running parameters Resident memory Virtual memory Open files Workdir size Disk usage CPU usage Aggregated per site

Job Network Traffic Based on the xrootd transfer from every job Aggregated statistics for Sites (incoming, outgoing, site to site, internal) Storage Elements (incoming, outgoing) Of Read and written files Transferred MB/s

Individual job tracking Based on AliEn shell cmds. top, ps, spy, jobinfo, masterjob Using the GUI ML Client Status, resource usage, per job

AliEn & LCG Services monitoring AliEn services Periodically checked PID check + SOAP call Simple functional tests SE space usage Efficiency LCG environment and tools Integrating the VoBOX tests previously run by ML within the SAM framework Proxy lifetime, gsiscp, LCG CE/SE, Job submission, BDII, Local catalog, software area etc. Error messages in case of failure Efficiency ML Alerts are used for problems notification.

FTD/FTS Monitoring Status of the transfers Transfer rates Success/failures Efficiency via ARDA Experiment Dashboard

VOBox/Head node monitoring Machine parameters, real-time & history Load, memory & swap usage, processes, sockets

Actions framework Based on monitoring information, actions can be taken in ML Service ML Repository Actions can be triggered by Values above/below given thresholds Absence/presence of values Correlation between multiple values Possible actions types Alerts Instant messaging RSS Feeds External commands Event logging MLRepository ML Service Actions based on global information Actions based on local information Traffic Jobs Hosts Apps Temperature Humidity A/C Power … Sensors Local decisions Global decisions

Alerts and actions MySQL daemon is automatically restarted when it runs out of memory Trigger: threshold on VSZ memory usage ALICE Production jobs queue is automatically kept full by the automatic resubmission Trigger: threshold on the number of aliprod waiting jobs Administrators are kept up-to-date on the services’ status Trigger: presence/absence of monitored information

Fact figures Raw parameters when running 4K Jobs Unique data series: 300K with frequency 1-15 minutes Message rate: 16K / minute Site aggregated parameters Message rate: 2K / minute Bandwidth rate: 300Kbps Repository DB Size: 70 GB ~ 700M records Data Reduction Schema 2 Months with 2 minutes bins 1 Year with 30 minutes bins ~Forever with 2 hours bins Response time App -> ML Service – network speed ML Service -> ML Clients Subscribed parameters – network speed One shot requests (history requests) – ~5 seconds Repository dynamic history requests – ~300ms / page No incoming ports are required

Summary The MonALISA framework is used as a primary monitoring tool for the ALICE Grid since 2004 Presently the system is used for monitoring of all (identified) services, jobs and network parameters necessary for the Grid operation and debugging The add-on tools for automatic events notification allow for more efficient reaction to problems The framework design and flexibility answers all requirements for a monitoring system The accumulated information allows to construct and implement automated decision making algorithms, thus increasing further the efficiency of the Grid operations

Thank you! Questions?