Technical workshop: Grid and ROC operations; Planning
The SEE-GRID initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no. 002356.

Workshop program
DEPLOYMENT:
- Deployment, installation and certification process - Jozsef Patvarczki, 30 minutes
- Deployment strategy, operational procedures, organization - Ognjen Prnjat, 30 minutes
- Discussion and planning - Ognjen Prnjat, 30 minutes
RUNTIME OPERATIONS:
- Monitoring - Min Tsai, 30 minutes
- Helpdesk: EGEE state of the art - Alex Tudose, 15 minutes

Deployment, installation and certification process
Jozsef Patvarczki, SZTAKI

Organization, deployment strategy, operational procedures
Ognjen Prnjat, GRNET

Overall organization

Specific responsibilities
- Middleware installation, deployment and site certification activities: SZTAKI
- Helpdesk and RC list: ICI
- Monitoring: UKIM (TBC)
- Regional application integration and deployment, plus user support: UoB and TUBITAK
- EGEE application deployment coordination: IPP

Site responsibilities
Coordinated by the GIM in each country:
- Perform full cluster administration (fabric, OS, middleware).
- Port the m/w to the target platform if necessary (depending on the reference development platform from EGEE).
- Perform m/w certification, and customisations if necessary (add region-specific VOs, define service configurations for local VOs, define local configurations at the node).
- Perform m/w deployment and upgrades as often as necessary.
- Carry out site certification in collaboration with SZTAKI and CERN.
- Provide relevant documentation for the site.
- Provide front-line support for operational problems on the cluster, and for local and remote users of the cluster (respond to local trouble tickets related to the node within reasonable timelines).
- Work towards establishing automated procedures for daily checks and notification systems in case of failure (a minimal sketch follows this list).
- Participate in developing and running a coherent Trouble Ticket (TT) and knowledge base infrastructure.
- Support the local monitoring service.
- Monitor resource utilization and SLAs, and provide the necessary monitoring, accounting and SLA-compliance statistics for deliverables and other purposes on a regular basis.
- Keep detailed logs of all interventions on the site.
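
As an illustration of the kind of automated daily check with failure notification a site might run, here is a minimal Python sketch. The hostnames, mail addresses and the exact list of probed services are placeholders, not anything prescribed by the project; the ports are the usual LCG-2 defaults.

```python
#!/usr/bin/env python
"""Minimal daily site check: probe key service ports and mail the admin
on failure. Hostnames and the SMTP relay are placeholders."""
import socket
import smtplib
from email.mime.text import MIMEText

# Services a typical LCG-2 site might watch (hypothetical hostnames).
CHECKS = [
    ("ce.example.org", 2119),   # Globus gatekeeper
    ("ce.example.org", 2135),   # MDS GRIS / site information system
    ("se.example.org", 2811),   # GridFTP on the storage element
]

def port_open(host, port, timeout=10):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:
        return False

failures = ["%s:%d unreachable" % (h, p) for h, p in CHECKS if not port_open(h, p)]

if failures:
    msg = MIMEText("\n".join(failures))
    msg["Subject"] = "[site-check] %d service(s) down" % len(failures)
    msg["From"] = "grid-check@example.org"
    msg["To"] = "grid-admin@example.org"
    smtplib.SMTP("smtp.example.org").sendmail(
        msg["From"], [msg["To"]], msg.as_string())
```

Run from cron each morning, this covers the "daily checks and notification" responsibility in its simplest form; real deployments would add functional tests on top of plain reachability.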

Deployment, installation and certification
- First phase: coordination by SZTAKI, as presented before
- When production level is reached: EGEE-SEE ROC

Core services
- RB (+LBS) + BDII: workload management, logging and bookkeeping, and the information system, per VO
- MyProxy: extends the effective life of a job by renewing its short-lived proxy from a stored credential (see the sketch below)
- VO server / VOMS: VO management; the VO manager handles administration
- RLS: file and metadata catalogue service
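
To show how these services fit together from the user side, a sketch driving the LCG-2-era command-line clients follows. This assumes the standard tools of the time (voms-proxy-init, myproxy-init, edg-job-submit); the server name, VO name and JDL file are placeholders.

```python
#!/usr/bin/env python
"""Sketch of the user-side flow against the core services, driving the
LCG-2-era command-line clients. Server name, VO and JDL are placeholders."""
import subprocess

def run(cmd):
    """Run a command, echoing it first; raise on failure."""
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Get a VOMS proxy carrying SEE-GRID VO membership (VOMS server).
run(["voms-proxy-init", "--voms", "seegrid"])

# 2. Deposit a longer-lived credential in MyProxy so the broker can
#    renew the short-lived job proxy (-d: use the cert subject as username).
run(["myproxy-init", "-s", "myproxy.example.org", "-d"])

# 3. Submit a job through the Resource Broker, which matches the JDL
#    requirements against resources published in the BDII.
run(["edg-job-submit", "--vo", "seegrid", "hello.jdl"])
```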

Deployment strategy (D2.2)
- Roll out primary sites into EGEE
- Deploy core services: RB, BDII, MyProxy
- Deploy centralized services: VOMS, RLS
- Monitoring and CIC (out of scope)

Regional applications
Phases of deployment:
- 1st: only on local clusters, testing job submission through the local UI and CE (see the sketch below)
- 2nd: through EGEE core services
- 3rd: through own core services
Middleware adaptation / customisation:
- Regional applications might require some m/w customizations and specific configurations at sites
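
For the first phase, the classic smoke test is to push a trivial job straight at the local gatekeeper, bypassing the broker entirely. A minimal sketch, assuming the Globus globus-job-run client is installed on the UI; the CE contact string is a placeholder.

```python
#!/usr/bin/env python
"""Phase-1 smoke test: submit a trivial job to the local CE through the
Globus gatekeeper. The CE hostname and jobmanager are placeholders."""
import subprocess

CE_CONTACT = "ce.example.org:2119/jobmanager-pbs"  # hypothetical contact string

# globus-job-run blocks until the job finishes and prints its stdout;
# a returned hostname proves UI -> gatekeeper -> batch system -> WN works.
out = subprocess.check_output(
    ["globus-job-run", CE_CONTACT, "/bin/hostname"])
print("job ran on worker node: %s" % out.decode().strip())
```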

EGEE applications
- HEP, BioMed
- A number of experiments are available within each field; only a subset can be supported
- Some experiments require specific configurations (e.g. MPI availability)
- A scavenger Grid is not suitable for HEP

Runtime operations
- Repository of RCs: SZTAKI runs this currently; the RO should maintain it
- Monitoring: Min Tsai to present; UKIM to support (TBC)
- Operational/user support (helpdesk + TTS): RO; the relationship with EGEE must be defined
- Security: presentation by Auth later

Discussion
Ognjen Prnjat, GRNET

Discussion: goals
- Consensus on the D2.2 deployment strategy
- EGEE support still unclear: NA4Test vs. GILDA
- Core services: timeline, responsibilities and teams
- Cluster support for applications; initial timelines

Deployment strategy
- Roll out primary sites into EGEE: support of all core services by EGEE? Which ones? GILDA?
- Deploy core services (RB, BDII, MyProxy): rely on EGEE centralized services? Which ones?
- Deploy centralized services: VOMS, RLS
- Monitoring and CIC (out of scope)

Core services: SEEGRID VO
Core services needed:
- VO server / VOMS [Greece - Auth]: is one VOMS enough to support a number of VOs? Yes.
- MyProxy [Auth, running]: is one MyProxy enough to support a number of VOs? Yes.
- RB (+LBS) + BDII: who will run these?
- RLS: is it needed?
- VO manager (administration): IPP?
If the first test-job submission happens by May 2005, that is very good.
Coordination: SZTAKI

Regional applications
Phases of deployment:
- 1st: only on local clusters, testing job submission through the local UI and CE
- 2nd: through EGEE core services
- 3rd: through own core services
Middleware adaptation / customisation:
- Integrating the regional applications with the EGEE m/w: are any changes to the m/w needed? A customised version of the m/w? [UoB, TUBITAK]
- Dependency on OS / m/w? Maybe SL clusters should be deployed straight away?
- Can, and should, the UoB application be part of the EGEE BioMed VO?

EGEE applications
- HEP, BioMed
- Which experiments to support, on which clusters (based on requirements), and when?
- BioMed: RB institute?
- LHC: volunteers?

Status of resources

Country               CPUs   Storage (TB)
Bulgaria              ?
Romania
Turkey
Hungary
Albania
Bosnia-Herzegovina
FYRoM
Serbia-Montenegro
Croatia

Countries: 10   CPUs: ?   Storage: ?

EGEE/LCG Monitoring, Role of GOC, and SEE-GRID strategy
Min Tsai, CERN

EGEE/LCG Monitoring
- A look at the existing monitoring tools being used in LCG
- Grid Operations Centre

GOC Configuration Database (monitoring application slides from D. Kant)
- Secure database management via HTTPS / X.509 (GridSite front end to a MySQL server); a client-side sketch follows
- Stores a subset of the Grid Information System: people, contact information, resources
- Drives: scheduled maintenance, monitoring services, operations maps, configuration of other tools, organisation structures, secure services (site news, self-certification, accounting)
- The GOC DB can also contain information that is not present in the IS, such as scheduled maintenance, news, organisational structures, and geographic coordinates for maps
(Architecture diagram: Resource Centre resources and site information - EDG, LCG-1, LCG-2, ..., bdii, ce, se, rb - feed the GOC MySQL server over SQL/HTTPS)
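
To illustrate what "secure database management via HTTPS / X.509" means from the client side, a minimal sketch follows. The endpoint URL is hypothetical (the real GOC DB is browsed through GridSite); the certificate paths follow the usual Globus layout.

```python
#!/usr/bin/env python
"""Sketch of an HTTPS request authenticated with an X.509 client
certificate, the access model the GOC database uses. The URL and the
file paths are placeholders."""
import requests

resp = requests.get(
    "https://goc.example.org/db/sites",          # hypothetical endpoint
    cert=("/home/user/.globus/usercert.pem",     # client certificate
          "/home/user/.globus/userkey.pem"),     # matching private key
    verify="/etc/grid-security/certificates",    # CA directory to trust
)
resp.raise_for_status()
print(resp.text)
```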

EGEE/LCG Monitoring (from D. Kant)
Ganglia Monitoring - http://gridpp.ac.uk/ganglia
- Ganglia is a scalable distributed monitoring system for clusters and grids, using RRDtool for data storage and visualisation (a minimal RRD sketch follows)
- Example: the RAL Tier-1 Centre LCG PBS server display shows job status for each VO
- It is relatively easy to install, and you get a lot for little effort
- One of its strengths is that it can federate clusters together
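
Since Ganglia's storage layer is RRDtool, the round-robin idea can be shown directly. A minimal sketch using the rrdtool CLI; the metric name, step and archive size are arbitrary choices, not Ganglia's actual schema.

```python
#!/usr/bin/env python
"""Minimal round-robin database of the kind Ganglia keeps per metric,
driven via the rrdtool CLI. Metric name, step and sizes are arbitrary."""
import subprocess
import time

# One GAUGE data source sampled every 300 s; keep 600 averaged points.
subprocess.check_call([
    "rrdtool", "create", "load.rrd", "--step", "300",
    "DS:load:GAUGE:600:0:U",
    "RRA:AVERAGE:0.5:1:600",
])

# Store one sample (timestamp:value); old data is overwritten in place,
# which is what keeps the database a fixed size forever.
subprocess.check_call(
    ["rrdtool", "update", "load.rrd", "%d:0.42" % int(time.time())])
```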

Ganglia Monitoring - Federated Cluster Information
- Ganglia can also monitor clusters of clusters: separate and distinct clusters federated together
- It provides a wealth of information, much of it low level, that can be useful for operations
- Ganglia/R-GMA integration through Ranglia

GridICE - Architecture
A different kind of monitoring tool: processes, low-level metrics, grid metrics. Developed by the INFN-GRID team - http://infnforge.cnaf.infn.it/gridice
Unlike GPPMON, which runs simple functional tests, GridICE monitors services in several ways:
- A measurement service uses monitoring sensor agents to probe the "core processes" belonging to a service, plus other low-level metrics such as memory and CPU
- A publisher service collects this information in a local database (fmonServer) at the site
- A discovery service finds resources and harvests the data into a central database (PostgreSQL)
- Finally, a publication service provides a portal to the monitoring data, which can be aggregated in different ways

GridICE - Global View
- Different views of the data: site / VO / geographic
- Resource usage: CPU count, load, storage, job information; list of sites
- The web interface gives an overall global view of grid resources: a list of participating sites and a description of resource usage, such as total CPU and storage available
- GridICE uses Nagios to schedule updates of its central monitoring repository, so the information shown is reasonably up to date

GridICE - Job Monitoring
- Version 1.6.3, recently deployed on LCG, adds job monitoring: the current status of queued, running and finished jobs, organised in different ways (per site, per VO, etc.)
- XML views of the data

Gstat (GIIS Monitor)
http://goc.grid.sinica.edu.tw/gstat/
- A tool to display and check the information published by the site GIIS (BDII updates, IS sanity, R-GMA, core service checks, usage statistics)
- Developed by the GOC in Taipei to monitor the grid information system
- Primary goals: detect faults, perform sanity checks, and display useful data
- Provides an overview of the current grid status, with drill-down for more detail (the sketch below shows the kind of query involved)
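
The information system being checked here is an LDAP service, so the kind of query Gstat issues can be sketched with python-ldap. The BDII hostname is a placeholder; the port, base DN and GLUE attribute names are the usual LCG-2 conventions, cited from memory.

```python
#!/usr/bin/env python
"""Sketch of an information-system query of the kind Gstat performs:
an anonymous LDAP search against a BDII. Hostname is a placeholder."""
import ldap  # python-ldap

con = ldap.initialize("ldap://bdii.example.org:2170")  # standard BDII port
con.simple_bind_s()  # the BDII allows anonymous binds

# GLUE schema: every computing element publishes a GlueCE object.
results = con.search_s(
    "mds-vo-name=local,o=grid",          # usual LCG-2 BDII base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEStateFreeCPUs"],
)

for dn, attrs in results:
    print(attrs.get("GlueCEUniqueID"), attrs.get("GlueCEStateFreeCPUs"))
```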

Real Time Grid Monitor
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
- A visualisation tool, developed by GridPP at Imperial College, to track jobs currently running on the grid
- An applet queries the RB logging and bookkeeping (L&B) service for information about grid jobs; because the L&B service is continually updated, the tool shows jobs flowing from the RB to a site for processing, and back to the source once completed
- Helps answer: why are jobs failing? Why are jobs queued at some sites while others are empty?
- Useful for getting a global picture of trends and quickly spotting problems such as job pile-up; it also helps publicise the grid to non-experts at conferences
- [The applet queries files no older than 6 hours, so long-running jobs don't show up]

GPPMON - Job Submission Tests
- Displays the results of tests against sites; developed by the GridPP collaboration
- Test: job submission. The job is a simple test of the grid middleware components, e.g. the gatekeeper service, the RB service, and the Information System via JDL requirements
- A map shows each test result as a coloured dot: a job request is sent to a site through a resource broker; if the job executes successfully the site is marked green, if it fails the site is marked red
- The maps can be tailored for different communities, e.g. a grid community identified by a list of sites in a BDII configuration database, such as the one shown here for LCG
- These lightweight tests of the functional behaviour of the core services ("do simple jobs run?") run hourly, but they have limitations, e.g. the Dteam VO and WN reach (specialised monitoring queues)

Site Functional Tests
- A set of many (~20) tests that run on a WN and check the essential functionality of a site:
  - general WN configuration: RPM versions, environment, CSH, BrokerInfo...
  - grid tools and services: Replica Manager (local and remote SE involved), lcg-utils, R-GMA
- New tests are added promptly when new types of problem are detected or reported
- Executed every morning - now 6am for scavenger grids
- Relatively easy to install and configure (a minimal sketch of one such check follows)
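
As a flavour of what one SFT-style check looks like on a worker node, here is a minimal sketch that verifies the grid environment a job relies on. The variable list is illustrative, drawn from common LCG-2 conventions, and is not the actual SFT test set.

```python
#!/usr/bin/env python
"""Sketch of one SFT-style check run on a worker node: verify that the
grid environment variables a job relies on are present. The variable
list is illustrative, not the actual SFT test set."""
import os
import sys

# Variables an LCG-2 worker node environment was commonly expected to
# define (illustrative subset; VO_SEEGRID_SW_DIR follows the VO_<VO>_SW_DIR
# convention and is hypothetical here).
REQUIRED = ["X509_USER_PROXY", "EDG_WL_JOBID", "VO_SEEGRID_SW_DIR"]

missing = [v for v in REQUIRED if v not in os.environ]

if missing:
    print("FAIL: missing environment variables: %s" % ", ".join(missing))
    sys.exit(1)
print("OK: WN environment looks sane")
```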

Certification Test Results
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
- Test results are shown on a web page: a large matrix of data where each row identifies a site and its corresponding test results
- One difficulty of having so much information is that it can be hard to find the information you need
- It is also quite detailed: most links allow you to drill down and examine the debug information - a useful tool for the expert

Problems with Monitoring Tools
- Inconsistent site configuration sources: the monitoring tools don't have the same coverage, which makes correlation more difficult
- Difficult to correlate test results: searching through many web pages is time consuming, especially with 90 sites!
- Need a view with only alerts (see the sketch below); operators don't need to be flooded with information
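
The "alerts only" view amounts to collapsing the full status matrix down to the failing entries. A minimal sketch of the idea; the rows and field names are invented for illustration, not taken from any real monitoring feed.

```python
#!/usr/bin/env python
"""Sketch of an 'alerts only' view: collapse a full status table down
to the entries that need attention. All data here is made up."""

# (site, test, status) rows as a unified monitoring system might publish
# them; contents are invented for illustration.
results = [
    ("site-a.example.org", "job-submission", "OK"),
    ("site-b.example.org", "replica-manager", "FAIL"),
    ("site-c.example.org", "rgma", "WARN"),
]

# Keep only what needs attention, instead of showing all ~90 sites.
alerts = [r for r in results if r[2] != "OK"]

for site, test, status in alerts:
    print("%-22s %-18s %s" % (site, test, status))
```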

Unified Monitoring System (in progress)
http://goc.grid.sinica.edu.tw/gocwiki/RgmaUnifiedMonitoringSystem
- Site configurations are consistent: taken only from the GOC DB
- Data is sent over a single data transport (R-GMA), in a shared data format
- A single console displays all data; a single alarm system monitors all data
R-GMA issues:
- Reliability: the registry is a single point of failure
- Complex queries are not available
- An older, less-supported version is in production

Other Problems
- Tests can't readily be performed on demand to verify whether a problem still exists; some tests run only once a day. Solution: allow on-demand testing through a secure interface
- Not enough help on error messages: what are the possible causes and solutions? Build a knowledge base - http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq

Role of the Grid Operations Centre (CIC procedure)
Current CIC operation procedures: http://cic.in2p3.fr
- Problem detection: the Site Functional Tests provide the most detailed fault detection; other tools are also used and monitored periodically
- Diagnosis: check the detailed reports on the monitoring tools, refer to the wiki knowledge base, and collect information to help the site admin troubleshoot the problem
- Problem tracking: Savannah is used; new solutions go back into the wiki

Role of the Grid Operations Centre (CIC procedure), continued
Escalation:
1. Mail to the site admin and the ROC
2. Second mail to the ROC
3. Phone call to the ROC
4. Reported to SA1 management
Deadlines are typically 3 days per step (1 day for large sites); the timetable is sketched below.
The CIC also runs core services: RB, BDII, MyProxy, VOMS, RLS, and monitoring.
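
The escalation timetable is simple enough to compute mechanically from the date a problem is first reported. A small sketch; the step names paraphrase the slide, and the 3-day/1-day deadlines are taken from it.

```python
#!/usr/bin/env python
"""Sketch of the CIC escalation timetable: compute when each step is
due from the date a problem is first reported."""
from datetime import date, timedelta

STEPS = [
    "mail to site admin and ROC",
    "second mail to ROC",
    "phone to ROC",
    "report to SA1 management",
]

def escalation_schedule(opened, large_site=False):
    """Yield (step, due date); each step gets 3 days, or 1 for large sites."""
    step_len = timedelta(days=1 if large_site else 3)
    due = opened
    for step in STEPS:
        due += step_len
        yield step, due

for step, due in escalation_schedule(date(2004, 12, 7)):
    print("%s -> due %s" % (step, due.isoformat()))
```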

Monitoring Strategy
Phase I:
- Register as EGEE Resource Centres; sites are automatically incorporated into the monitoring systems: TZtest page, GPPMON (regional view), Gstat / GIIS Monitor (regional view), GridICE, accounting
- Gain operational experience with these tools
Phase II:
- Decide what is lacking at that point (e.g. service-level reports)
- Develop, or contribute to, other projects

EGEE Helpdesk and operational support procedures; SEE-GRID strategy
Alex and Alex, ICI