CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18
Monitoring of a distributed computing system: the AliEn Grid
Alice Offline weekly meeting, Thursday 3rd February 2005
Marco MEONI

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 2/18
Content
Document I have been working on since mid-December 2004: ~100 pages so far, not too far from the final version. Available on … (let me discuss the thesis first).
Outline (~35 pages + ~65 pages):
1. ALICE and AliEn
2. Grid Monitoring
3. MonALISA
4. MonALISA adaptations and extensions
5. PDC 2004 monitoring and results
6. Conclusions and Outlook

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 3/18
Section I: Grid Concepts and Monitoring

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 4/18
Grid, ALICE, AliEn
Grid Computing overview: "coordinated use of large sets of different, geographically distributed resources in order to allow high-performance computation"
ALICE experiment and ALICE Off-line
AliEn: a PULL rather than PUSH architecture. The scheduling service does not need to know the status of all other resources in the system, which gives a robust and fault-tolerant system where resources can come and go at any point in time. It is also possible to interface an entire foreign Grid as a large Computing and Storage Element (LCG).
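To illustrate the PULL model, here is a minimal sketch of a job-agent loop. The Job and TaskQueue types and all names are hypothetical, not the AliEn implementation; the point is only that a Computing Element asks for work when it has a free slot, so the central service needs no global view of the resources.

```java
// Minimal sketch of PULL scheduling (hypothetical types, not the AliEn code):
// the agent on a Computing Element asks the central task queue for work when
// it has a free slot, so the scheduler needs no global view of the resources.
interface Job {
    String id();
    boolean execute();                      // run the payload, true on success
}

interface TaskQueue {
    Job requestJob(String ceName, int freeSlots);   // null if nothing matches
    void reportStatus(String jobId, String status); // e.g. RUNNING, DONE, ERROR_E
}

class JobAgent implements Runnable {
    private final TaskQueue queue;
    private final String ceName;

    JobAgent(TaskQueue queue, String ceName) {
        this.queue = queue;
        this.ceName = ceName;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Job job = queue.requestJob(ceName, 1);  // pull one job per free slot
            if (job == null) {                      // nothing to do: retry later
                try { Thread.sleep(60 * 1000); } catch (InterruptedException e) { return; }
                continue;
            }
            queue.reportStatus(job.id(), "RUNNING");
            boolean ok = job.execute();
            queue.reportStatus(job.id(), ok ? "DONE" : "ERROR_E");
        }
    }
}
```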

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 5/18
GMA architecture
R-GMA: an example implementation; Jini (Sun) provides the technical basis.
[Diagram: Grid Monitoring with Producer, Consumer and Registry. The Producer stores its location in the Registry, the Consumer looks up that location, and the data are then transferred directly from Producer to Consumer.]
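A minimal sketch of that interaction, with hypothetical interface names (not the R-GMA API): the Registry only brokers locations, while the measurement data flow directly between Producer and Consumer.

```java
// Hypothetical GMA-style interfaces (illustrative, not the R-GMA API).
// The Registry only matches Consumers with Producer locations; monitoring
// data travel directly from Producer to Consumer.
import java.util.List;

interface Registry {
    void storeLocation(String metricName, String producerUrl);   // Producer registers
    List<String> lookupLocations(String metricName);             // Consumer queries
}

interface Producer {
    double readLatestValue(String metricName);                   // direct data transfer
}

interface ProducerResolver {
    Producer connect(String producerUrl);                        // URL -> Producer stub
}

class Consumer {
    private final Registry registry;
    private final ProducerResolver resolver;

    Consumer(Registry registry, ProducerResolver resolver) {
        this.registry = registry;
        this.resolver = resolver;
    }

    double fetch(String metricName) {
        // 1. look up the location of a Producer publishing this metric
        String url = registry.lookupLocations(metricName).get(0);
        // 2. transfer the data directly from that Producer
        return resolver.connect(url).readLatestValue(metricName);
    }
}
```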

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 6/18
Section II: MonALISA Adaptations and Extensions

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 7/18
MonALISA Adaptations
Farms monitoring: a user Java class to interface MonALISA and a bash script to monitor the site (a sketch of this pairing follows below).
A WEB Repository as a front-end: stores the history of the monitored data, plots any kind of chart, interfaces to user code (custom consumers, config modules, new charts, distributions).
[Diagram: bash monitoring script on the WNs/CE feeds a Java interface class, which feeds the MonALISA Agent with the monitored data; layers shown are ALICE's resources, the MonALISA framework and user code.]
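A minimal sketch of such a Java interface class, under the assumption that the site bash script prints one "name value" pair per line. The script path, class and method names are hypothetical, and the hand-off to MonALISA is a placeholder rather than the real module API.

```java
// Hypothetical sketch: run the site monitoring bash script, parse its
// "name value" output lines and hand the values to the monitoring framework.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

class FarmMonitor {

    // Path to the site script is an assumption for illustration.
    private static final String SCRIPT = "/opt/alien/monitoring/site_monitor.sh";

    Map<String, Double> collect() throws Exception {
        Map<String, Double> values = new HashMap<>();
        Process p = new ProcessBuilder("bash", SCRIPT).start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {                    // expected: "<name> <value>"
                    values.put(parts[0], Double.parseDouble(parts[1]));
                }
            }
        }
        p.waitFor();
        return values;
    }

    // Placeholder for the hand-off to the MonALISA service (the real module API differs).
    void publish(Map<String, Double> values) {
        values.forEach((name, value) ->
                System.out.printf("site parameter %s = %.2f%n", name, value));
    }
}
```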

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 8/18
Repository
An additional, ad hoc Java thread feeds the monitored data directly into the repository (Tomcat, JSP/servlets); see the sketch below.
AliEn Jobs Monitoring: if the Grid executes jobs, then it works! Centralized or distributed? AliEn native APIs are used to retrieve job status snapshots.
[Diagram: job state machine from "Job is submitted" onwards, with >1h and >3h timeouts and the error states Error_I, Error_A, Error_S, Error_E, Error_R, Error_V/VT/VN, Error_SV.]
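A minimal sketch of such a feeding thread, assuming a hypothetical AliEnApi wrapper for the native APIs and a plain JDBC insert; the class, table and column names and the polling interval are illustrative, not the actual repository code.

```java
// Hypothetical sketch of the ad hoc feeding thread: periodically take a job
// status snapshot through an AliEn API wrapper and store the counters in the
// repository database. Class, table and column names are assumptions.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Map;

interface AliEnApi {
    Map<String, Integer> jobStatusSnapshot() throws Exception;   // e.g. {"RUNNING": 430, ...}
}

class RepositoryFeeder extends Thread {
    private final AliEnApi alien;       // hypothetical wrapper around the native APIs
    private final Connection db;        // JDBC connection to the repository DB

    RepositoryFeeder(AliEnApi alien, Connection db) {
        this.alien = alien;
        this.db = db;
        setDaemon(true);
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            try {
                Map<String, Integer> snapshot = alien.jobStatusSnapshot();
                PreparedStatement ps = db.prepareStatement(
                        "INSERT INTO job_status (ts, status, count) VALUES (NOW(), ?, ?)");
                for (Map.Entry<String, Integer> e : snapshot.entrySet()) {
                    ps.setString(1, e.getKey());
                    ps.setInt(2, e.getValue());
                    ps.executeUpdate();
                }
                ps.close();
                Thread.sleep(150 * 1000);   // new data roughly every 2-3 minutes
            } catch (InterruptedException ie) {
                return;
            } catch (Exception ex) {
                ex.printStackTrace();       // keep feeding even if one cycle fails
            }
        }
    }
}
```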

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 9/18
Repository DataBase(s)
7.5 GB of monitored information, 52M records. During Data Challenges, data from ~2K monitored parameters arrive every 2-3 minutes.
Data replication: online replication between a MASTER DB and a SPARE DB (alimonitor.cern.ch and aliweb01.cern.ch), covering data collecting, Grid monitoring and Grid analysis.
Averaging process: each basic parameter is kept in three time series of 1 min, 10 min and 100 min resolution, with 60 FIFO bins per series (a sketch follows below).
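A minimal sketch of such a multi-resolution averaging scheme, under the assumption that every ten bins of the finer series are averaged into one bin of the coarser series; the 60-bin FIFO per resolution matches the slide, while class and method names are illustrative.

```java
// Hypothetical sketch of the repository averaging: each parameter keeps
// 60 FIFO bins at 1-min resolution; every 10 one-minute bins are averaged
// into one 10-min bin, and every 10 ten-minute bins into one 100-min bin.
import java.util.ArrayDeque;
import java.util.Deque;

class AveragedSeries {
    private static final int BINS = 60;             // 60 bins per resolution
    private final Deque<Double> bins = new ArrayDeque<>(BINS);
    private final AveragedSeries coarser;           // next resolution, or null
    private double sum = 0;
    private int pending = 0;                        // bins since the last roll-up

    AveragedSeries(AveragedSeries coarser) {
        this.coarser = coarser;
    }

    void add(double value) {
        if (bins.size() == BINS) {
            bins.removeFirst();                     // FIFO: drop the oldest bin
        }
        bins.addLast(value);
        sum += value;
        pending++;
        if (pending == 10 && coarser != null) {     // roll 10 bins into the coarser series
            coarser.add(sum / 10.0);
            sum = 0;
            pending = 0;
        }
    }
}

// Usage: 1-min values cascade automatically into the 10-min and 100-min series.
//   AveragedSeries hundredMin = new AveragedSeries(null);
//   AveragedSeries tenMin     = new AveragedSeries(hundredMin);
//   AveragedSeries oneMin     = new AveragedSeries(tenMin);
//   oneMin.add(measuredValue);
```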

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 10/18 SourceCategoryNumberExamples AliEn APICE load factors63Run load, queue load SE occupancy62Used space, free space, files number Job information557Running, saving, done, failed Soap callsCERN Network traffic29MBs, files LCGCPU – Jobs48Free CPUs, job running and waiting ML services on MQJob summary34Running, saving, done, failed AliEn parameters15MySQL load, Perl processes ML servicesSites info1060Paging, threads, I/O, processes Job execution efficiencySuccessfuly done jobs / all submitted jobs System efficiencyError (CE) free jobs / all submitted jobs AliRoot efficiencyError (AliROOT) free jobs / all submitted jobs Resource efficiencyRunning (queued) jobs / max_running (queued) Monitored parameters Derived classes… 1868

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 11/18
Extensions
Job monitoring by user: AliEn "ps -xxx" commands, the job's JDL, results presented in the same web front-end.
Repository Web Services.
Application Monitoring (ApMon) at the WNs (a sketch follows below).
Grid Analysis: repository interfaced to ROOT and Carrot.
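A minimal sketch of the idea behind ApMon instrumentation on a worker node, using a hypothetical sender interface that stands in for the real ApMon client API (whose class names and signatures may differ): the job wrapper periodically pushes lightweight name/value parameters towards the monitoring service.

```java
// Hypothetical sketch of ApMon-style application monitoring on a worker node:
// the job wrapper periodically sends name/value parameters (memory usage,
// progress, ...) to the monitoring service. The sender interface below is a
// stand-in for the real ApMon client.
import java.util.LinkedHashMap;
import java.util.Map;

interface ParameterSender {
    void send(String cluster, String node, Map<String, Double> parameters);
}

class WorkerNodeMonitor implements Runnable {
    private final ParameterSender sender;
    private final String cluster;   // e.g. the site name
    private final String node;      // e.g. the worker node hostname

    WorkerNodeMonitor(ParameterSender sender, String cluster, String node) {
        this.sender = sender;
        this.cluster = cluster;
        this.node = node;
    }

    @Override
    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (!Thread.currentThread().isInterrupted()) {
            Map<String, Double> params = new LinkedHashMap<>();
            params.put("used_memory_mb",
                    (rt.totalMemory() - rt.freeMemory()) / (1024.0 * 1024.0));
            params.put("available_processors", (double) rt.availableProcessors());
            sender.send(cluster, node, params);          // one lightweight push per cycle
            try { Thread.sleep(60 * 1000); } catch (InterruptedException e) { return; }
        }
    }
}
```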

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 12/18
Section III: PDC 2004 Monitoring and Results

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 13/18
Phase 1 (simulation)
Start 10/03, end 29/05 (58 days active). Maximum jobs running in parallel: 1450; average during the active period: 430 (sum of all sites).
[Plots: job execution efficiency (successfully done jobs / all submitted jobs), system efficiency (Error (CE)-free jobs / all submitted jobs) and AliRoot efficiency (Error (AliROOT)-free jobs / all submitted jobs).]

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 14/18
Phase 2 (merging)
As in the 1st phase, a general equilibrium in CPU contribution: no single site dominating the production. Jobs successfully done: 76% AliEn, 24% LCG.

Job failure           Reason                                            Rate
Submission            CE scheduler not responding                         1%
Loading input data    Remote SE not responding                            3%
During execution      Job aborted, not started, killed, WN malfunction   10%
Saving output data    Local SE not responding                             2%

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 15/18
Phase 3 (analysis)
Occupancy changes with respect to the number of queued jobs in the local batch system.

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 16/18
Salutations…

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 17/18
Credits
Federico, Predrag and Peter: they could pick up another TS
Latchezar: continuous help and suggestions, review of my thesis
MonALISA team: collaborative anytime I needed them
Guenter: very useful integrations
My fiancee: moral support ("did they hire you just to look at some plots?")

CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 18/18
…thanks to all, and to all the others I couldn't find a pic of!