1 Grid Operations
Jinny Chien, ASGC
June 09, Academia Sinica
Slides adapted from the EGEE training material repository:
2 Agenda
- EGEE Operations
- Monitoring Systems and Information Portals
- APROC Services
3 Grid Operations
- The grid is flat, but there is a hierarchy of responsibility
- Essential to scale the operation
- CICs act as a single Operations Centre
- Operational oversight responsibility
- Operator On Duty rotates weekly between CICs
- Report problems to ROC/RC
- ROC is responsible for ensuring the problem is resolved
- ROC oversees regional RCs
[Diagram: OMC, CICs, ROCs, and RCs]
- RC = Resource Centre
- ROC = Regional Operations Centre
- CIC = Core Infrastructure Centre
- OMC = Operations Management Centre
4 Grid Operations
- ROCs organize regional operations
- Coordinate deployment of middleware
- Provide centralized Grid services
- Front-line support for Grid issues
- CERN coordinates sites not associated with a ROC
- User Support Center (GGUS)
- Single point of contact at FZK
- Interfaced to other support groups
[Diagram labels: Developers, Deployment, ROCs, Portal, Tools and Documentation]
[Diagram labels: Monitoring shows a problem, Operator-on-duty, Site, ROC, 1st level support, 2nd level support]
- ROC and Site work to resolve the problem
- Operator submits a GGUS ticket against the ROC and CCs the site
- The ticket is followed until it is solved
5 Support Model: “Regional Support with Central Coordination”
[Diagram: Central Application (GGUS) with Interface and Webportal, linked to Deployment Support, Middleware Support, Network Support, Operations Support, TPM, VO Support, and ROC 1 … ROC 10]
The ROCs, the VOs, and the other project-wide groups, such as the Core Infrastructure Center (CIC), middleware groups (JRA), network groups (NA), and service groups (SA), are connected via a central integration platform provided by GGUS.
- Regional Support units
- User Support units
- Technical Support units
6 Agenda
- EGEE Operations
- Monitoring Systems and Information Portals
- APROC Services
7 Grid Monitoring Link to monitoring tools at GOC Portal:
8 Grid Operation Center DB (GOCDB)
[Diagram: Resource Centres (RC) connect over HTTPS to the GOC GridSite server, which uses SQL against a MySQL database of resources and site information (EDG, LCG-1, LCG-2, …; ce, se, bdii, rb)]
- Information used by:
  - Monitoring tools
  - Problem tracking
  - Accounting system
- Secure database management via HTTPS / X.509
- Stores a subset of the Grid Information System (IS)
  - People, contact information, resources
  - Scheduled maintenance
- Contains information that is not in the IS:
  - Scheduled maintenance
  - Nodes/services that should be monitored
  - Site certification status
  - Site security contacts
  - Organizational structures
  - Geographic coordinates for maps
- After site registration, site managers can modify GOCDB information with certificates loaded into the browser
- It is important to keep it up to date and accurate
9 Site Functional Tests I
- Uses job submission tests
- Executes every 3 hours
- Runs a series of test scripts on the WN to test the site
  - CE and WN job submission capability
  - General WN configuration: RPM versions, environment, CSH, BrokerInfo...
  - Grid services: Replica Manager, lcg-utils, R-GMA, accounting
- Provides extensive troubleshooting information
- Standard for site certification and operations status
- Used with another tool: Freedom of Choice for Resources
  - Used by experiments to remove sites from the BDII
  - VOs can white list, black list or depend on SFT results
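The white list / black list / SFT choice on this slide can be illustrated with a small sketch. This is not the actual Freedom of Choice for Resources tool; the function name `select_sites`, the site names, and the rule that a black list entry overrides a white list entry are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def select_sites(
    sft_passed: Dict[str, bool],
    whitelist: FrozenSet[str] = frozenset(),
    blacklist: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return the sites a VO would keep, sorted by name.

    Assumed policy (hypothetical, not taken from the slides):
      * a black-listed site is always dropped;
      * a white-listed site is kept even if its SFT run failed;
      * any other site is kept only if its SFT run passed.
    Only sites that appear in the SFT results are considered.
    """
    selected = []
    for site, passed in sft_passed.items():
        if site in blacklist:
            continue
        if site in whitelist or passed:
            selected.append(site)
    return sorted(selected)


if __name__ == "__main__":
    results = {"SITE-A": True, "SITE-B": False, "SITE-C": False, "SITE-D": True}
    print(select_sites(results, frozenset({"SITE-B"}), frozenset({"SITE-D"})))
```

A VO that simply depends on SFT results would call `select_sites` with both lists empty.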
10 Site Functional Test II
11 GStat: Runs IS Tests
- Tool to display and monitor information published by site-BDIIs
- Useful for finding out why SFT fails
  - Missing or inconsistent entries, invalid information
- Availability of site-BDII services
  - GRIS availability check
- Usage and performance statistics
12 GridICE – Global View
- Different views of the data: Site / VO / Geographic
- List of sites
- Resource usage: CPU#, load, storage, job info
13 GridICE – Expert View
- Node processes
- Display shows the processes belonging to the Broker service
- Problems are flagged
14 Monitoring Tools: Good demo tools
- Real Time Monitor
  - Collects RB job information
  - Displays job status at each site
  - Indicates when jobs are submitted to a CE
- Google Maps
  - SFT and APEL test results
  - Site location is used
- Host Certificate Monitoring
  - Checks CE host certificate expiration
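As an illustration of the host certificate check, the sketch below flags certificates that expire soon. It is not the monitoring tool shown on the slide. It assumes the `notAfter` strings have already been collected in the format Python's `ssl` module uses, and the host names and the 30-day warning threshold are hypothetical.

```python
import ssl
from datetime import datetime, timezone
from typing import Dict, List


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate notAfter time.

    `not_after` uses the "Jun 15 00:00:00 2006 GMT" form accepted by
    ssl.cert_time_to_seconds; `now` must be timezone-aware.
    """
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    return (expiry_ts - now.timestamp()) / 86400.0


def expiring_hosts(
    certs: Dict[str, str], now: datetime, warn_days: int = 30
) -> List[str]:
    """Hosts whose certificate expires in fewer than `warn_days` days."""
    return sorted(
        host
        for host, not_after in certs.items()
        if days_until_expiry(not_after, now) < warn_days
    )


if __name__ == "__main__":
    # Hypothetical CE hosts and expiry dates, for illustration only.
    certs = {
        "ce01.example.org": "Jun 15 00:00:00 2006 GMT",
        "ce02.example.org": "Dec 31 00:00:00 2006 GMT",
    }
    now = datetime(2006, 6, 1, tzinfo=timezone.utc)
    print(expiring_hosts(certs, now))
```

Already expired certificates yield a negative number of days and are therefore reported as well.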
15 Grid Operations FAQ: GOCWiki
- GOC Wiki
- Operation Guides
- Admin Howtos
- Troubleshooting FAQ
- Installation Guides
- User Guides
- User FAQ
- User Tools
- New: Work-in-progress
16 GGUS Portal Services
- Browseable tickets
- Search through solved tickets
- Useful links (Wiki FAQ)
- Broadcast tools
- Latest news
- GGUS Search Engine
- Updated documentation (Wiki FAQ)
17 GGUS Portal: Search engine
- GGUS Search Engine
- Ongoing work to make it faster and to search through a wider set of docs and DBs
18 CIC Portal
- Latest news: maintenance, M/W updates, etc.
- VO users: EGEE VO information
- RC staff: broadcast tool, weekly site report
19 Problem Detection and Tracking
Operations Escalation Procedure
- Detect problems and perform diagnosis
1. Open a ticket for problem tracking in the CIC portal
   - Sends notification to Site and ROC
   - Escalation period is 1 to 3 days depending on severity
2. Send a second notification if there is no response
   - Sends notification to ROC
   - 1-3 days escalation
3. Phone call to ROC
4. If there is still no response, the CIC suggests that the site be suspended
   - Site removed from Top Level BDII configuration
   - Essentially removed from the Grid
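The four escalation steps can be read as a simple function of how long a ticket has gone unanswered. The sketch below is only an illustration under stated assumptions: the name `escalation_step` is hypothetical, and it assumes every step (including the phone call) waits one full escalation period before moving to the next, which the slide states explicitly only for steps 1 and 2.

```python
ESCALATION_STEPS = [
    "open ticket; notify site and ROC",
    "send second notification to ROC",
    "phone call to ROC",
    "suggest site suspension",
]


def escalation_step(days_without_response: int, period_days: int) -> str:
    """Return the escalation step reached after a period of silence.

    `period_days` is the escalation period (1 to 3 days depending on
    severity). Assumption: each step lasts exactly one period.
    """
    if not 1 <= period_days <= 3:
        raise ValueError("escalation period must be between 1 and 3 days")
    if days_without_response < 0:
        raise ValueError("days_without_response must not be negative")
    # Integer division counts how many full periods have elapsed;
    # the last step is terminal, so the index is capped there.
    index = min(days_without_response // period_days, len(ESCALATION_STEPS) - 1)
    return ESCALATION_STEPS[index]


if __name__ == "__main__":
    for days in range(0, 8):
        print(days, escalation_step(days, period_days=2))
```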
20 Agenda
- EGEE Operations
- Monitoring Systems and Information Portals
- APROC Services
21 APROC Introduction
- APROC goal
  - Provide deployment support facilitating Grid expansion
  - Maximize the availability of Grid services
- APROC established in April 2005
- Supports EGEE sites in Asia Pacific
  - 15 sites, 7 countries, > 500 CPUs
  - Australia, Japan, India, Pakistan, Korea, Singapore, Taiwan
- EGEE CIC
  - CIC-on-duty rotation: EGEE global operations
  - Monitoring tool development: GStat and GGUS Search
- EGEE ROC
  - Centralized services
  - Monitoring, diagnosis and problem tracking
  - M/W release deployment support
  - Security coordination
  - Site registration
  - Portal and documentation
22 ASGCCA
- Serves Taiwan, and LCG/EGEE users in Asia Pacific without a local CA
- Production service started in July 2003
- Member of both EUGridPMA and APGridPMA
23 VO Services
- APROC hosts centralized services for VOs
- Host VOMS server
  - VO assigns a manager to maintain membership
  - VO supplies the AUP
- Host LFC global file catalogue service
- Resource Broker
- Load-balanced Top-Level BDII
- Currently supporting: TWGrid, APeSci
24 Daily Operations and Support
- Goal is to achieve high availability:
  - Review and track GGUS and APROC open tickets
  - Monitor and detect new problems in the region
  - Provide detailed technical support to sites
- Stay up to date with the latest operations developments
  - Rollout mailing list
  - GOCWiki entries
- APROC support can be contacted via:
  - Phone
  - TRS Ticketing System
25 Middleware support and Security Coordination
- Middleware Support
  - Installation support
  - New release testing
    - Supplementary release notes
    - Testing varying site configurations and environments
  - Coordinate updates and upgrades
- Security Coordination
  - With Operational Security Coordination Team (OSCT)
  - Security Service Challenge completed in March
    - Sending test jobs
    - Sites recover: user DN, IP address of submission UI, UTC time, executable name
26 Site Registration I
- Obtain personal and host certificates
- Send in application form
- Site registration into GOCDB
- Learn about the Grid middleware
- Additional registration
  - Mailing lists: APROC and Rollout
  - VO membership
  - APROC TRS support account
- Site deployment
- Site certification
  - SFT and GStat tests
27 Site Registration I
Site Design
- Minimum 5 nodes: UI, CE, DPM, MON, WN
- Bare minimum configuration
  - CE/WN, DPM, MON
  - UI user space installation available
- Virtual machines to reduce physical nodes
  - Require more memory per machine
- Node specifications
  - Memory: 254MB / 1GB+
  - CPU (Pentium): 500MHz / 2GHz+
  - Disk: 20GB / 80GB+
  - Network: 10Mbps / GE
- Application requirements
  - Job memory requirements
  - Typical file size
- Network
  - Public IP required for all service nodes
  - Avoid private IPs if WN-to-DPM throughput is important, or configure selective NAT and a decent router
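A candidate node can be compared against the node specification figures with a few lines of code. The sketch below assumes that the first value in each pair on the slide (254MB, 500MHz, 20GB, 10Mbps) is a lower bound; the slide does not label the columns, so this reading, the `NodeSpec` class, and the field names are assumptions for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class NodeSpec:
    memory_mb: int
    cpu_mhz: int
    disk_gb: int
    network_mbps: int


# First value of each pair on the slide, read here as a lower bound
# (an assumption; the slide does not name its columns).
LOWER_BOUND = NodeSpec(memory_mb=254, cpu_mhz=500, disk_gb=20, network_mbps=10)


def shortfalls(node: NodeSpec, floor: NodeSpec = LOWER_BOUND) -> List[str]:
    """Names of the fields in which `node` falls below `floor`."""
    return [
        f.name
        for f in fields(node)
        if getattr(node, f.name) < getattr(floor, f.name)
    ]


if __name__ == "__main__":
    candidate = NodeSpec(memory_mb=128, cpu_mhz=2000, disk_gb=10, network_mbps=100)
    print(shortfalls(candidate))
```

An empty list means the node meets every figure in `LOWER_BOUND`; application requirements such as job memory and typical file size still need to be considered separately.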
28 Site Registration II
- Join EGEE Production Infrastructure
  - Respond to tickets within 2 working days
  - Dedicated administrator contact
  - Linux administration experience
  - 1 week upgrade deadline for security releases
- After certification
  - Attend APF meetings
    - Biweekly meeting with other APROC member sites
  - Maintain OS software
  - Monitor and address service faults and tickets
  - Enter scheduled downtime/maintenance
    - M/W upgrade
    - Resource expansion
  - CIC weekly site report
  - Learn about new Grid M/W updates
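The "2 working days" response requirement can be turned into a concrete deadline with a short calculation. This is a minimal sketch: it assumes working days are Monday to Friday and ignores public holidays, and the function name `response_deadline` is hypothetical.

```python
from datetime import date, timedelta


def response_deadline(opened: date, working_days: int = 2) -> date:
    """Date by which a ticket opened on `opened` should be answered.

    Assumption: Saturdays and Sundays are skipped; local holidays
    are not taken into account.
    """
    current = opened
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    # A ticket opened on a Friday is due the following Tuesday.
    print(response_deadline(date(2006, 6, 9)))
```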
29 APROC Portal
- Rollout highlights
- Supplemental documentation
- Getting started links
- Registration information
- Contact info and TRS links
- lists.grid.sinica.edu.tw/apwiki
  - Supplementary release notes
  - Site operations procedures
  - Technical howtos
  - Troubleshooting FAQs
  - APF and GDA meeting minutes
  - Feel free to contribute!
30 Summary
- EGEE Operations closely monitors site faults
  - Good for production sites
  - Dedicated human resources needed to maintain them
- Many monitoring tools and resources
  - Just not easy to find all of them
- APROC provides EGEE operations support for Asia Pacific
  - Please give us feedback on what we can improve
- People: Jinny Chien, Shu-Ting Liao, Jason Shih, Min Tsai
- Contact us: