Published by Ralph Simpson. Modified over 9 years ago.
1
CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18
Monitoring of a distributed computing system: the AliEn Grid
Alice Offline weekly meeting
Thursday 3rd February 2005
Marco MEONI
2
Content
Document I’ve been working on since mid Dec 2004
~100 pages up to now
Not too far from the final version
Available on http://... (let me discuss the thesis first)
1. ALICE and AliEn
2. Grid Monitoring
3. MonALISA
4. MonALISA adaptations and extensions
5. PDC 2004 monitoring and results
6. Conclusions and Outlook
~35 pages / ~65 pages
3
Section I
Grid Concepts and Monitoring
4
Grid, ALICE, AliEn
Grid Computing overview
  “coordinated use of large sets of different, geographically distributed resources in order to allow high-performance computation”
ALICE experiment and ALICE Off-line
AliEn
  PULL rather than PUSH architecture: the scheduling service does not need to know the status of all other resources in the system
  Robust and fault-tolerant system where resources can come and go at any point in time
  Possible to interface an entire foreign Grid as a large Computing and Storage Element (LCG)
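The PULL idea can be illustrated with a minimal Python sketch. This is not AliEn code: the names `Job`, `TaskQueue`, `pull` and the `min_free_mb` requirement are hypothetical, chosen only to show that the central queue keeps no record of site status and that sites ask for work when they have capacity.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: int
    min_free_mb: int  # hypothetical requirement: scratch space the job needs


class TaskQueue:
    """Central queue: holds waiting jobs but tracks no site state."""

    def __init__(self) -> None:
        self._waiting: deque = deque()

    def submit(self, job: Job) -> None:
        self._waiting.append(job)

    def pull(self, free_mb: int) -> Optional[Job]:
        """Called by a site; returns the first waiting job the site can run."""
        for job in self._waiting:
            if job.min_free_mb <= free_mb:
                self._waiting.remove(job)
                return job
        return None  # nothing suitable; the site simply asks again later


queue = TaskQueue()
queue.submit(Job(1, min_free_mb=500))
queue.submit(Job(2, min_free_mb=50))
small_site_job = queue.pull(free_mb=100)   # only job 2 fits this site
large_site_job = queue.pull(free_mb=1000)  # job 1 is still waiting
```

A site that disappears just stops calling `pull`, so the queue needs no cleanup, which is one way to read "resources can come and go at any point in time".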
5
Grid Monitoring
GMA architecture
R-GMA: an example of implementation
Jini (Sun) provides the technical basis
[Diagram labels: Producer, Consumer, Registry; arrows: Store location, Lookup location, Transfer Data]
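The three GMA roles named on this slide can be sketched in a few lines of Python. This is only an illustration of the pattern, assuming the usual reading of the diagram (the producer stores its location in the registry, the consumer looks the location up and then gets the data from the producer directly). All class and method names are hypothetical and are not taken from R-GMA or Jini.

```python
from typing import Dict, List


class Producer:
    """Holds monitoring data for one metric."""

    def __init__(self, metric: str, values: List[float]) -> None:
        self.metric = metric
        self._values = values

    def transfer(self) -> List[float]:
        return list(self._values)


class Registry:
    """Knows where producers are; never stores the monitoring data itself."""

    def __init__(self) -> None:
        self._locations: Dict[str, Producer] = {}

    def store_location(self, producer: Producer) -> None:
        self._locations[producer.metric] = producer

    def lookup_location(self, metric: str) -> Producer:
        return self._locations[metric]


class Consumer:
    def __init__(self, registry: Registry) -> None:
        self._registry = registry

    def read(self, metric: str) -> List[float]:
        producer = self._registry.lookup_location(metric)  # lookup location
        return producer.transfer()                         # transfer data


registry = Registry()
registry.store_location(Producer("cpu_load", [0.4, 0.7]))
readings = Consumer(registry).read("cpu_load")
```

In a real deployment the registry would hold network addresses instead of object references; the object reference stands in for that here.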
6
Section II
MonALISA Adaptations and Extensions
7
MonALISA Adaptations
Farms monitoring
  User Java class to interface MonALISA and bash script to monitor the site
A WEB Repository as a front-end
  Stores history of the monitored data
  Plots any kind of chart
  Interfaces to user code (custom consumers, config modules, new charts, distributions)
[Diagram labels: MonALISA Agent, WNs, CE, Bash monitoring script, Java interface class, Monitored data; layers: User code, MonALISA framework, ALICE’s resources]
8
Repository
  Additional Java thread to feed the repository directly
  [Diagram labels: Ad hoc java thread, Monitored data, TOMCAT JSP/servlets]
AliEn Jobs Monitoring
  If the Grid executes jobs then it works!
  Centralized or distributed?
  AliEn native APIs to retrieve job status snapshots
  [Diagram: job status flow starting at "Job is submitted"; labels: >1h, >3h, (Error_I), (Error_A), (Error_S), (Error_E), (Error_R), (Error_V, VT, VN), (Error_SV)]
9
Repository DataBase(s)
7.5 GB of monitored information, 52M records
During DCs, data from ~2K monitored parameters arrive every 2/3 mins
Data Replication: MASTER DB, SPARE DB (online replication)
  Hosts: alimonitor.cern.ch, aliweb01.cern.ch
  Uses: Grid Analysis; data collecting and Grid Monitoring
Averaging process: 1 min, 10 min, 100 min
  60 bins for each basic information
  FIFO
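One possible reading of the averaging process is a cascade of fixed-size FIFO histories: each level keeps 60 bins, and every 10 values of one level are averaged into one bin of the next, coarser level (1 min, 10 min, 100 min). The Python sketch below implements that reading only. The cascade rule, the class name `CascadedHistory` and its parameters are assumptions; the slide does not describe the actual repository code.

```python
from collections import deque


class CascadedHistory:
    """Fixed-size FIFO histories at increasingly coarse time resolution."""

    def __init__(self, bins: int = 60, factor: int = 10, levels: int = 3) -> None:
        self.factor = factor
        # One FIFO per resolution; deque(maxlen=...) drops the oldest bin when full.
        self.levels = [deque(maxlen=bins) for _ in range(levels)]
        # Values waiting to be averaged into the next coarser level.
        self._pending = [[] for _ in range(levels - 1)]

    def add(self, value: float, level: int = 0) -> None:
        self.levels[level].append(value)
        if level + 1 < len(self.levels):
            pending = self._pending[level]
            pending.append(value)
            if len(pending) == self.factor:
                mean = sum(pending) / self.factor
                pending.clear()
                self.add(mean, level + 1)  # feed the coarser history


history = CascadedHistory()
for minute in range(100):          # 100 one-minute samples: 0.0, 1.0, ..., 99.0
    history.add(float(minute))
```

With this layout the storage per parameter stays constant (3 x 60 values here) however long the system runs, at the cost of losing fine detail for older data.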
10
Monitored parameters

Source             Category              Number  Examples
AliEn API          CE load factors           63  Run load, queue load
                   SE occupancy              62  Used space, free space, files number
                   Job information          557  Running, saving, done, failed
Soap calls         CERN Network traffic      29  MBs, files
LCG                CPU – Jobs                48  Free CPUs, job running and waiting
ML services on MQ  Job summary               34  Running, saving, done, failed
                   AliEn parameters          15  MySQL load, Perl processes
ML services        Sites info              1060  Paging, threads, I/O, processes
Total                                      1868

Derived classes…

Job execution efficiency  Successfully done jobs / all submitted jobs
System efficiency         Error (CE) free jobs / all submitted jobs
AliRoot efficiency        Error (AliROOT) free jobs / all submitted jobs
Resource efficiency       Running (queued) jobs / max_running (queued)
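The four derived ratios can be written out as a small Python function. This is a sketch under stated assumptions: "error-free jobs" is taken to mean submitted jobs minus jobs that hit that error class, and the function name, argument names and the sample numbers are made up for illustration and are not PDC 2004 results.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Safe division: an empty denominator yields 0.0 instead of an error."""
    return numerator / denominator if denominator else 0.0


def derived_efficiencies(submitted: int, done: int, ce_errors: int,
                         aliroot_errors: int, running: int,
                         max_running: int) -> dict:
    return {
        # Successfully done jobs / all submitted jobs
        "job_execution": ratio(done, submitted),
        # Error (CE) free jobs / all submitted jobs
        "system": ratio(submitted - ce_errors, submitted),
        # Error (AliROOT) free jobs / all submitted jobs
        "aliroot": ratio(submitted - aliroot_errors, submitted),
        # Running jobs / max_running (same form applies to queued jobs)
        "resource": ratio(running, max_running),
    }


# Illustrative numbers only.
example = derived_efficiencies(submitted=1000, done=800, ce_errors=50,
                               aliroot_errors=100, running=200, max_running=400)
```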
11
Extensions
Job monitoring by user
Repository Web Services
Application Monitoring (ApMon) at WNs
Grid Analysis
AliEn “ps –xxx” commands
Job’s JDL
Results presented in the same web front end
Repository interfaced to ROOT and Carrot
12
Section III
PDC 2004 Monitoring and Results
13
Phase 1 (simulation)
Start 10/03, end 29/05 (58 days active)
Maximum jobs running in parallel: 1450
Average during active period: 430
Sum of all sites
Successfully done jobs / all submitted jobs
Error (CE) free jobs / all submitted jobs
Error (AliROOT) free jobs / all submitted jobs
14
Phase 2 (merging)
As in the 1st phase, general equilibrium in CPU contribution: no single site dominating the production
Jobs successfully done: 76% AliEn, 24% LCG

Jobs failure        Reason                                            Rate
Submission          CE scheduler not responding                         1%
Loading input data  Remote SE not responding                            3%
During execution    Job aborted, not started, killed, WN malfunction   10%
Saving output data  Local SE not responding                             2%
15
Phase 3 (analysis)
Occupancy changes with respect to the number of queued jobs in the local batch system
16
Salutations…
17
Credits
Federico, Predrag and Peter: they could have picked another TS
Latchezar: continuous help and suggestions, review of my thesis
MonALISA team: collaborative any time I needed
Guenter: very useful integrations
My fiancée: moral support: “did they hire you just to look at some plots?”
18
…thanks to all
…and all the others I couldn’t find a pic!