1
The Architecture of the WLCG Monitoring System
Michel Jouvin (GRIF/LAL) on behalf of James Casey (CERN) (all materials from J. Casey)
EGEE France, Lyon, April 10, 2008
CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
2
Outline
WLCG Monitoring Working Group
  – Mandate, background and key principles
Technology investigation
  – Messaging system
  – Reporting tools
Site Monitoring Prototype Example
  – OSG RSV publication
  – Job Reliability Monitoring
  – WLCG/CCRC08 VO-oriented views
Summary
3
WLCG Monitoring Working Group
The WLCG Monitoring working group was set up in Nov. 2006
“… help improve the reliability of the grid infrastructure …”
“… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service …”
“… stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management”
Now acting as a project rather than a WG
  – Provides and maintains deliverables
  – Part of normal operations
4
Rely on Sites
“Site administrators are closest to the problems, and need to know about them first”
  – On the front line to reduce time to respond
Initial focus has been on site monitoring
Implications
  – Improved understanding of how to monitor services
    – “Service Cards” developed by EGEE SA3
  – Need to deploy components to sites
    – Sometimes an entire monitoring system
    – Needs active participation of site admins
5
Tell others what you know
“If you’re monitoring a site remotely, it’s only polite to give the data to the site” (Chris Brew, RAL)
  – Remote systems should feed back information to sites
Implications
  – Common publication mechanisms
  – Integration into fabric monitoring
  – Discovery of data
  – Site trust of data: is it a “backdoor” communications mechanism?
6
Authority for data…
Currently repositories have direct DB connections to (all) other repositories
  – E.g. SAM, Gridview, Gstat, GOCDB, CIC
And they cache, merge and process the data
Implications
  – We have an “interlinked distributed schema”
  – Each tool should take responsibility for the contents of its part of it
7
Visualization for each community
“User-targeted” visualization
  – All should use the same underlying data
    – Extract information processing out of visualization tools
    – Provide same processed info to all visualizations
  – Interface with community-specific information, e.g. names
Implications
  – Many “similar” dashboards
  – Everyone sees the same data
  – Common frameworks/widgets would help
8
Process
Review existing monitoring systems
  – “Improving reliability is our goal!”
Identify gaps
Design integrated architecture for monitoring
  – Prototype some solutions
  – Reduce to a minimum the specific components to develop and maintain
Must be usable by whole WLCG
  – EGEE, OSG, NDG
9
The pieces to work with…
The starting point was what we have now:
  – Availability testing framework: SAM/RSV
  – Job and Data reliability monitoring: Gridview
  – Grid topology: GOCDB/Registration DB
  – Dynamic view of the grid: BDII/CeMon
  – Accounting: APEL/Gratia
  – Experiment views: Dashboards
  – Fabric monitoring: Nagios, LEMON, …
  – Grid operations tools: CIC Portal
They work together right now
  – To a certain extent!
10
We’ve got an integration problem!
11
No monolithic systems
Different systems should specialize in their areas of expertise
  – And not have to also invent all the common infrastructure
Implications
  – Less overlap and duplication of work
  – Someone needs to manage some common infrastructure
  – We need to agree on the common infrastructure
12
Don’t have central bottlenecks
“Local problems detected locally shouldn’t require remote services to work out what the problem is”
  – Still a role for central detection of problems
    – As long as they’re reported locally too
Lots of central processing is done now in SAM/Gridview
Implications
  – Do as much processing locally (or regionally)
  – Helps scaling, improves robustness
  – Enables automation, reduces manpower
  – Harder to deploy
13
Re-use, don’t re-invent
What do we do?
  – Collect some information, move it around
  – Store it, view it, report on it
This is pretty common
  – We should look at existing systems
Already happening for site fabric…
  – Nagios, LEMON, …
Implications
  – Less code to develop and maintain
  – Integration nightmare?
14
Don’t impose systems on sites
We can’t dictate a monitoring system
  – Many (big?) sites already have a deployed system
  – We have to be pluggable into them
Implications
  – Modular approach
  – Specifications to define interfaces between existing systems and new components
15
Broker at the centre
Reliability and persistence of messaging built into the broker network
Mitigates the single points of failure we’ve had with previous solutions
Message delivery is guaranteed
16
Plug’n’Play Components
Still can end up with spaghetti
  – A component must take care only of its own job and ignore details of others (e.g. data schema)
Tight specification of interaction of components is required
  – Message format specifications
  – Standard metadata schema
  – Message queue naming schemas
  – Protocols
Standard “Patterns” can act as a basis
  – http://enterpriseintegrationpatterns.com/
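As a rough illustration of what a message format specification plus a queue naming scheme could look like, here is a minimal Python sketch. The field names, the `grid.probe.<site>.<service>` naming pattern and the site name are hypothetical, invented for this example; they are not the formats the working group defined.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical schema: these field names are assumptions for illustration only.
REQUIRED_FIELDS = ("site", "service", "metric", "status", "timestamp")


@dataclass
class MetricMessage:
    site: str
    service: str
    metric: str
    status: str
    timestamp: str


def destination_for(msg):
    # Hypothetical naming scheme: one destination per site and service type.
    return f"grid.probe.{msg.site}.{msg.service}".lower()


def encode(msg):
    # Producers only need to know the agreed wire format, not the consumers.
    return json.dumps(asdict(msg), sort_keys=True)


def decode(text):
    # Consumers reject anything that does not follow the agreed format.
    data = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"message missing fields: {missing}")
    return MetricMessage(**{name: data[name] for name in REQUIRED_FIELDS})


msg = MetricMessage("EXAMPLE-SITE", "CE", "job-submit", "OK", "2008-04-10T09:00:00Z")
wire = encode(msg)
```

With the format and the naming rule fixed, a producer and a consumer can be developed independently: each only depends on `encode`/`decode` and `destination_for`, not on the other component's internals.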
17
Messaging Systems for Integration
We need:
  – Loose coupling of systems
  – Distributed components
  – Reliable delivery of messages
  – Standard methods of communication
  – Flexibility to add new producers and consumers of the information without having to reconfigure everything
Message Oriented Middleware provides this
  – And is widely used in similar scenarios
18
Messaging Systems
Flexible architecture:
  – Deliver messages, either in point-to-point mode (queues)…
  – … or multicast mode (topics)
  – Support synchronous or asynchronous communication
Reliable delivery of messages:
  – Provide reliability to the senders if required
  – Configurable persistency / Master-Slave
Highly scalable:
  – Network of brokers
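The difference between the two delivery modes can be sketched with a toy in-memory broker in Python. This is only an illustration of the semantics; the `Broker` class and the destination names are made up here and are not the ActiveMQ API.

```python
from collections import deque


class Broker:
    """Toy in-memory broker; a stand-in for illustration, not ActiveMQ."""

    def __init__(self):
        self.queues = {}   # queue name -> deque of pending messages
        self.topics = {}   # topic name -> list of subscriber callbacks

    def send(self, queue, message):
        # Point-to-point: the message waits until exactly one consumer takes it.
        self.queues.setdefault(queue, deque()).append(message)

    def receive(self, queue):
        pending = self.queues.get(queue)
        return pending.popleft() if pending else None

    def subscribe(self, topic, callback):
        self.topics.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Multicast: every current subscriber gets its own copy.
        for callback in self.topics.get(topic, []):
            callback(message)


broker = Broker()

# Queue: the first receive consumes the message, the second finds nothing.
broker.send("probe.results", "CE check OK")
first = broker.receive("probe.results")
second = broker.receive("probe.results")

# Topic: both subscribers see the same message.
seen_by_site, seen_by_dashboard = [], []
broker.subscribe("site.status", seen_by_site.append)
broker.subscribe("site.status", seen_by_dashboard.append)
broker.publish("site.status", "SE degraded")
```

A queue suits work that must be handled once (for example archiving a result), while a topic suits information several parties want to see, such as a site and a dashboard both following the same status.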
19
ActiveMQ
Mature open-source implementation of these ideas
  – Top-level Apache project
  – Commercial support available from IONA
Easy to integrate
  – Multiple language + transport protocol support
Good performance characteristics
  – See later …
Work done to integrate into our environment
  – RPMs, Quattor components + templates, LEMON alarms
20
ActiveMQ Architecture
21
ActiveMQ Throughput
> Consumers > Throughput ??
Consumer bottleneck!
With a larger number of producers, even more messages per second arrive, saturating the consumer.
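The consumer bottleneck can be made concrete with simple rate arithmetic. The Python sketch below assumes constant, idealized rates; the numbers used in the example are invented for illustration and are not measurements from the talk.

```python
def backlog_after(seconds, producers, rate_per_producer, consumer_rate):
    """Messages waiting in the broker after `seconds`, assuming constant rates."""
    arrival = producers * rate_per_producer
    # The backlog only grows when messages arrive faster than they are consumed.
    return max(0, arrival - consumer_rate) * seconds


def consumers_needed(producers, rate_per_producer, consumer_rate):
    """Smallest number of equal consumers that keeps up with the producers."""
    arrival = producers * rate_per_producer
    return -(-arrival // consumer_rate)  # ceiling division on integers


# Hypothetical figures: 10 producers at 500 msg/s each, one consumer at 2000 msg/s.
queued = backlog_after(60, 10, 500, 2000)
needed = consumers_needed(10, 500, 2000)
```

Under these assumed rates the single consumer falls behind by 3000 messages every second, so adding producers only makes the backlog grow faster; the remedy is more (or faster) consumers.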
22
Reporting for WLCG
Currently a post-processing of results and graphs in Excel
  – Much manual work needed!
Try to implement it directly on the GridView DB
Using a mature open-source reporting toolkit: JasperReports
  – UI report builder: iReports
  – Web-based report server: OpenReports
23
JasperReports
24
Site Monitoring & Nagios
More details in next talk:
  – “Central Europe ROC Nagios Experience”
Nagios has shown itself to be a very useful component for building many parts of our monitoring solutions
  – Local site monitoring
  – Replacing the SAM execution framework
    – Too hard to maintain, too centralized to scale
  – gStat: BDII monitoring
    – Probes within Nagios
Publish site results upwards to be part of availability/reliability computation
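A Nagios probe is an ordinary program that reports its result through a one-line status message and an exit code, using the standard plugin convention of 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The Python sketch below shows that convention only; the metric name `bdii_age` and the thresholds are hypothetical and do not come from the actual gStat or SAM probes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measurement to a Nagios state; higher values are worse."""
    if value is None:
        return UNKNOWN          # the measurement itself could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def probe(name, value, warn, crit):
    """Return (exit_code, status_line) in the usual Nagios plugin style."""
    state = evaluate(value, warn, crit)
    shown = "n/a" if value is None else value
    return state, f"{name} {LABELS[state]} - value={shown}"


# Hypothetical check: information older than 600 s warns, older than 1200 s is critical.
code, line = probe("bdii_age", 700, 600, 1200)
```

A real plugin would print `line` and call `sys.exit(code)`. Because the result is just a state and a line of text, the same probe output can be used locally by the site and also published upwards.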
25
Messaging based archiving and reporting
26
In Production: OSG RSV to SAM
RSV: Resource and Service Validation
  – Uses Gratia as native transport within OSG
  – And OSG GOC runs a bridge to SAM for WLCG
27
Job Reliability Monitoring
Requires gathering job state transitions from all jobs submitted to WLCG resources
  – EGEE (RB/WMS + Condor_G) + OSG + NDG
Only gather this information once
  – Propagate to interested parties
Use existing systems and expertise where possible
  – Don’t try to deploy components on every WMS/RB/L&B/CE/…
  – Get ‘cooked’ data from the systems
Hook up with Pilot Jobs
  – Linkage between pilot and experiment jobs as a ‘state change’
28
Current situation
Currently mines L&B log files, and sends them via R-GMA
  – Requires a specific component on every L&B
Loses many records
GridView hacks to ‘finish’ unfinished jobs after 24h
  – Inaccurate results
Jobs reported via experiment frameworks
  – Gathers from many sources: Imperial College XML files, job submission tools, MonAlisa reporting from jobs, R-GMA
But some missing information for Condor_G jobs
  – Info between submission and user job starting on WN
  – Job aborted
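Why force-closing jobs after 24 hours gives inaccurate results can be seen in a small Python sketch. The state names and the `ASSUMED_FINISHED` label are hypothetical, chosen for this example; this is not GridView's actual logic.

```python
from datetime import datetime, timedelta

TERMINAL = {"DONE", "ABORTED"}      # hypothetical state names
TIMEOUT = timedelta(hours=24)


def final_state(events, now):
    """events: list of (timestamp, state) for one job, in time order."""
    last_time, last_state = events[-1]
    if last_state in TERMINAL:
        return last_state
    if now - last_time > TIMEOUT:
        # Forced closure: the real outcome is unknown; the terminal record
        # may simply have been lost, or the job may still be running.
        return "ASSUMED_FINISHED"
    return last_state


t0 = datetime(2008, 4, 1, 12, 0)
incomplete = [(t0, "SUBMITTED"), (t0 + timedelta(hours=1), "RUNNING")]
complete = incomplete + [(t0 + timedelta(hours=3), "DONE")]

guessed = final_state(incomplete, t0 + timedelta(hours=30))
known = final_state(complete, t0 + timedelta(hours=30))
```

The two jobs may have behaved identically, but the one whose final record was lost is reported with a guessed outcome, which is the inaccuracy the slide refers to.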
29
Proposal
Use WLCG Monitoring infrastructure (MSG) for collecting and transporting the data
  – Messaging system
  – Standard message formats
Work with expert groups to instrument the job submission systems
Visualization by Gridview + Dashboards
30
EGEE L&B
Notifications mean we don’t have to run components mining L&B log files
  – Consumer of notifications can be remote
L&B is stated to scale for our needs
  – Tested at >1 million records/day
  – Testing of integration with notifications underway by GridView team
Message formats already defined
  – The old log-mining approach will be moved entirely to the messaging system to free GridView from the R-GMA dependency
31
Condor_G
Condor_G submitter instrumented to create L&B messages
  – Done by a separate listener process that is started by Condor_G
  – Limited subset of Condor_G state changes will be sent
Listener/reporter can use different transport for reporting
  – Currently MonAlisa as a transport layer
  – Will migrate to WLCG messaging system
32
Pilot Jobs
L&B client resides on every worker node
Can be used to submit additional messages to L&B for a job
  – Timestamps + environment for Job Wrapper start/end
  – Timestamp of handover to user job
  – Linkage of pilot job to experiment job ID
  – …
Benefit is that it’s all in one coherent data structure for a given job
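The idea of one coherent data structure per job can be sketched as a small Python record that accumulates the extra messages. The class, event names and job IDs below are hypothetical and only illustrate the shape of the data; this is not the L&B client API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class JobRecord:
    pilot_job_id: str
    experiment_job_id: Optional[str] = None
    events: List[Tuple[float, str]] = field(default_factory=list)

    def log(self, timestamp, event, experiment_job_id=None):
        self.events.append((timestamp, event))
        if experiment_job_id is not None:
            # Linking the pilot to the payload is recorded as one more state change.
            self.experiment_job_id = experiment_job_id

    def duration(self, start_event, end_event):
        times = {name: ts for ts, name in self.events}
        return times[end_event] - times[start_event]


# Hypothetical IDs, event names and timestamps (seconds).
record = JobRecord("pilot-001")
record.log(100.0, "wrapper_start")
record.log(130.0, "handover", experiment_job_id="exp-42")
record.log(400.0, "wrapper_end")
```

Because everything about the job lives in one record, questions such as how long the wrapper ran, or which experiment job a pilot executed, can be answered without joining data from several monitoring sources.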
33
EGEE Architecture
34
CMS SAM Portal
35
ServiceMap
What’s a ServiceMap?
  – It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure
  – Gridmap: “treemap”-based view of the grid, http://gridmap.cern.ch
What’s the CCRC’08 ServiceMap?
  – Service ‘readiness’
  – Service availability
  – Experiment metrics
A single place to see both the VO and the infrastructure view of the grid
36
CCRC’08 ServiceMap
… Demo …
http://gridmap.cern.ch/ccrc08/servicemap.html
37
WLCG Experiment metrics
Show the VO view of the infrastructure
Two extra ‘maps’ planned
  – Reliability (e.g. successful data transfers, jobs, …)
  – Metrics (MB/s, events/s, …)
Need interaction with experiments to create these two views
38
Summary
CCRC’08 is a good opportunity to try some new operational tools
  – And evaluate them in a ‘real-world’ mode
The CCRC’08 ServiceMap seems to give a useful view of the grid
  – Need to iterate on what is useful to show
  – And fill in the white spaces…
Next steps
  – MoU calculation and reporting to sites
Feedback on all the tools welcome!
39
Links to CCRC08 tools
CCRC’08 ServiceMap
  – http://gridmap.cern.ch/ccrc08/servicemap.html
CCRC’08 Observations logbook
  – https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/
  – RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/elog.rdf
Response tracking logbook
  – https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/
  – RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/elog.rdf
40
Strategy Summary
Converge to standards, but without a big bang
Leverage the underlying infrastructures rather than layer lots of systems on top
Reduce maintenance/development costs by using commodity components whenever possible
Modular and loosely-coupled to adapt to changes in infrastructure and funding models
41
Architecture Summary
Our design for a new architecture leverages commodity software components
  – Probe execution (Nagios), messaging (ActiveMQ), reporting (JasperReports)
It is essentially an integration exercise
  – Make existing tools work together better
In order to improve reliability
  – This is what we will verify over the next 12 months
42
More Information…
GDB Reports on Monitoring by James
  – Almost every month at GDB or pre-GDB
  – http://indico.cern.ch/categoryDisplay.py?categId=3l181
Improving Job Reliability
  – http://indico.cern.ch/conferenceDisplay.py?confId=20228
Email James…
  – Look at CERN directory…