EUROPEAN UNION
Polish Infrastructure for Supporting Computational Science in the European Research Space
Operational Architecture of the PL-Grid Project
M. Radecki, T. Szepieniec, M. Krakowian, M. Tomanek and W. Ziajka
ACC CYFRONET AGH
Cracow Grid Workshop, Cracow
2 Outline
- Customers in Grid and issues to be solved
  - VO manager
  - Resource Provider
  - VO user
  - Site Administrator
- PL-Grid approach to resource sharing
  - Bazaar tool
- Other operational consequences
  - Information & monitoring system improvements
- Solving operational issues in PL-Grid
3 Grid Customers and Issues
- Virtual Organization Manager
  - How can I obtain (more) resources for my VO?
  - Does the QoS of the provided resources satisfy me?
- VO User
  - I want to use the right resources, i.e. those that are well configured and supported.
- Resource Provider
  - Are all VOs satisfied with the resources they get?
- Site Administrator
  - Oops, something failed in the grid middleware at my site. How do I fix it quickly?
4 Linking Customer with Provider
A well-defined procedure by which VO Managers and Resource Providers make an agreement on sharing resources is indispensable.
5 Resource-related SLA and Bazaar tool
- PL-Grid will use the Bazaar tool to implement the process of resource negotiation between the VO Manager and the Resource Provider
- Example contract parameters:
  - time boundaries of the contract
  - number of CPUs and disk space, with the Quality of Service (guaranteed, best effort, etc.)
  - availability/reliability of resources
  - average acknowledgement/response time to trouble tickets
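The example contract parameters above could be captured in a small record type. This is only an illustrative sketch: the class `ResourceContract`, its field names, the units, and the sample values are all assumptions made for this example and do not describe the actual Bazaar data model.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class QoS(Enum):
    """Quality of Service levels named on the slide."""
    GUARANTEED = "guaranteed"
    BEST_EFFORT = "best effort"


@dataclass(frozen=True)
class ResourceContract:
    """Hypothetical record of one VO/Resource Provider agreement."""
    vo: str
    site: str
    start: date                 # time boundaries of the contract
    end: date
    cpus: int                   # number of CPUs
    disk_tb: float              # disk space (unit is an assumption)
    qos: QoS
    min_availability: float     # fraction in [0, 1]
    min_reliability: float      # fraction in [0, 1]
    max_ack_hours: float        # average ticket acknowledgement time
    max_response_hours: float   # average ticket response time

    def __post_init__(self) -> None:
        # Reject obviously inconsistent contracts early.
        if self.end < self.start:
            raise ValueError("contract ends before it starts")
        if self.cpus < 0 or self.disk_tb < 0:
            raise ValueError("resource amounts must be non-negative")
        for value in (self.min_availability, self.min_reliability):
            if not 0.0 <= value <= 1.0:
                raise ValueError("availability/reliability must be in [0, 1]")

    def is_active(self, day: date) -> bool:
        """True if the given day lies within the contract boundaries."""
        return self.start <= day <= self.end


# Made-up example values, for illustration only.
example = ResourceContract(
    vo="examplevo", site="EXAMPLE-SITE",
    start=date(2010, 1, 1), end=date(2010, 12, 31),
    cpus=128, disk_tb=10.0, qos=QoS.GUARANTEED,
    min_availability=0.95, min_reliability=0.97,
    max_ack_hours=4, max_response_hours=24,
)
```

Keeping the record immutable (`frozen=True`) reflects the idea that a negotiated contract changes only through renegotiation, which would produce a new record.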
6 Benefits of having resource-related SLAs
- Communication channel between the parties is established
- Agreement can be monitored
  - needs accounting data
  - monitoring: availability/reliability of services
  - trouble ticket acknowledgement/response time
- First step towards a business model in grid
- Impact on operations
  - verification (certification) of resources for a particular contract
  - only certified resources are accessible to VO users
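Monitoring an agreement amounts to comparing measured values against the agreed thresholds. The sketch below assumes one common convention, in which availability is uptime divided by total time and reliability is uptime divided by total time minus scheduled downtime; PL-Grid may define these differently. All function and parameter names are hypothetical.

```python
from statistics import mean


def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the whole period in which the service was up."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Like availability, but scheduled downtime is not counted against the site."""
    return up_hours / (total_hours - scheduled_down_hours)


def sla_violations(*, up_hours: float, total_hours: float,
                   scheduled_down_hours: float, ack_hours: list[float],
                   min_availability: float, min_reliability: float,
                   max_mean_ack_hours: float) -> list[str]:
    """Return the names of the agreed parameters that were not met."""
    violations = []
    if availability(up_hours, total_hours) < min_availability:
        violations.append("availability")
    if reliability(up_hours, total_hours, scheduled_down_hours) < min_reliability:
        violations.append("reliability")
    # Tickets may be absent in a given period; then there is nothing to check.
    if ack_hours and mean(ack_hours) > max_mean_ack_hours:
        violations.append("ticket acknowledgement time")
    return violations
```

For instance, a 720-hour month with 684 hours of uptime and 24 hours of scheduled downtime gives an availability of 0.95, so a contract requiring 0.97 would be reported as violated on availability.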
7 Infrastructure Monitoring
- Typically one dedicated VO is used for monitoring all resources in the Grid (e.g. "ops VO" in EGEE)
  - sites are required to support this VO for monitoring purposes
  - configured as a high-priority VO
  - does not always reflect the status of the site as seen by other VOs
- PL-Grid uses regular VOs for monitoring
  - a special role is configured within the VO
  - high priority for jobs executed with this role
  - requires registering a technician's certificate as a VO member
  - reflects the VO status at a site
  - a site's service availability/reliability is measured within real VOs
8 Grid Information System Improvements
- Use VO-level Information Systems instead of the global instance
  - VO scope makes sense for the user
  - better scalability: no longer a global grid service
  - easier to manage
  - can be handled by the VO like other VO services: VO Membership Service, File Catalogue, Resource Broker, etc.
  - big VOs may make the effort to establish a high-availability cluster for the information service; smaller ones will not require that
  - reduces network traffic by localizing it
- Include information only about sites which have an active contract with the VO and whose resources were verified (certified)
- Requires a separate instance of the Information System including all sites for testing/certification purposes
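The inclusion rule for a VO-level Information System (active contract plus certified resources) can be expressed as a simple filter. This is a minimal sketch under assumed names: `Allocation` and `vo_site_view` are hypothetical and do not correspond to any real information-system API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Allocation:
    """Hypothetical record linking a VO to a site under one contract."""
    vo: str
    site: str
    start: date
    end: date
    certified: bool   # resources were verified for this contract


def vo_site_view(allocations: list[Allocation], vo: str, today: date) -> list[str]:
    """Sites a VO-level information system would publish for this VO."""
    return sorted({
        a.site
        for a in allocations
        if a.vo == vo                      # VO scope only
        and a.start <= today <= a.end      # contract is active
        and a.certified                    # resources passed certification
    })
```

The separate all-sites instance used for testing and certification would correspond to skipping the `certified` condition, since uncertified sites must still be visible there.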
9 Solving Grid Operations Problems
- Site Administrator's perspective
  - many sources of data: wiki pages, GGUS knowledge base
  - many of them are outdated and do not provide a working recipe
  - customers (VOs) are pressing for availability/reliability of resources
  - need for quick problem solving
  - need for interactive support, not always efficient
- PL-Grid support structures
  - Actors
    - Site Administrators
    - 1st line support team
    - Regional Operator on Duty (aka "ROD")
10 Use case: Operations Problem Handling
[Figure: timeline of problem handling. Actors and tools shown: Site, Regional Dashboard, 1st line support, ROD, Knowledge Base. Time labels: Monday 7 P.M.; Tuesday 8 A.M.; Tuesday 9 A.M.; Tuesday 7 P.M. (24h passed); Wednesday 8 A.M. Other labels: request for help; trouble ticket; problem assistance via instant messages.]
11 Summary
- Identified the need for a new operational tool for Resource Allocation (RA)
  - fills a gap between VO Managers and Resource Providers
- Improvements to the Information System related to RA and VO scope
  - to provide the VO User with a list of reliable, well-configured and supported resources
- Infrastructure monitoring should be realized within the real, existing VOs
- Procedures for fixing problems can be more efficient with:
  - a knowledge base, to find out whether somebody else has encountered the problem before
  - a 1st line support team, to get interactive contact with an expert
- The Polish NGI was assigned a share in EGI global tasks related to:
  - Coordination of resource allocation (O-E-10): Poland
  - Grid operation and oversight (O-E-5): Netherlands and Poland
12 PL-Grid news: (tentative) user registration open!