1 Grid Operations Jinny Chien ASGC June 09, Academia Sinica Slides adapted from the EGEE training material repository:



2 Agenda
- EGEE Operations
- Monitoring Systems and Information Portals
- APROC Services

3 Grid Operations
- The grid is flat, but there is a hierarchy of responsibility
  - Essential to scale the operation
- CICs act as a single Operations Centre
  - Operational oversight responsibility
  - Operator on duty rotates weekly between CICs
  - Reports problems to the ROC/RC
  - The ROC is responsible for ensuring the problem is resolved
  - The ROC oversees regional RCs
- [Diagram: hierarchy of OMC, CIC, ROCs, and RCs]
- Abbreviations
  - RC = Resource Centre
  - ROC = Regional Operations Centre
  - CIC = Core Infrastructure Centre
  - OMC = Operations Management Centre

4 Grid Operations
- ROCs organize regional operations
  - Coordinate deployment of middleware
  - Provide centralized Grid services
  - Front-line support for Grid issues
  - CERN coordinates sites not associated with a ROC
- User Support Center (GGUS)
  - Single point of contact at FZK
  - Interfaced to other support groups: developers, deployment, ROCs
  - Portal: tools and documentation
- Problem handling
  - Monitoring shows a problem
  - The operator on duty submits a GGUS ticket against the ROC and CCs the site
  - The ROC and the site work to resolve the problem
  - The ticket is followed until it is solved
- [Diagram labels: operator on duty, site, ROC, 1st level support, 2nd level support]

5 Support Model: "Regional Support with Central Coordination"
- The ROCs, the VOs, and the other project-wide groups, such as the Core Infrastructure Center (CIC), middleware groups (JRA), network groups (NA), and service groups (SA), are connected via a central integration platform provided by GGUS.
- [Diagram: Central Application (GGUS) with interface and web portal, connected to TPM, ROC 1, ROC…, ROC 10, VO Support, Deployment Support, Middleware Support, Network Support, and Operations Support; the units are grouped as regional support units, user support units, and technical support units]

6 Agenda
- EGEE Operations
- Monitoring Systems and Information Portals
- APROC Services

7 Grid Monitoring Link to monitoring tools at GOC Portal:

8 Grid Operation Center DB (GOCDB)
- Secure database management via HTTPS / X.509
- Stores a subset of the Grid Information System (IS)
  - People, contact information, resources, scheduled maintenance
- Contains information that is not in the IS:
  - Scheduled maintenance
  - Nodes/services that should be monitored
  - Site certification status
  - Site security contacts
  - Organizational structures
  - Geographic coordinates for maps
- Information used by:
  - Monitoring tools
  - Problem tracking
  - Accounting system
- After site registration, site managers can modify GOCDB information with certificates loaded into their browser
- It is important to keep it up to date and accurate
- [Diagram: Resource Centres (RC) connect over HTTPS to the GOC server (GridSite, MySQL/SQL), which holds resources and site information for EDG, LCG-1, LCG-2, …; node types ce, se, bdii, rb]
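The kind of per-site information listed above can be pictured as a small record. The following is only an illustrative sketch, not the GOCDB schema: the class name `SiteRecord`, its field names, and all sample values are hypothetical, and the idea that a monitoring tool would consult the maintenance windows before raising an alarm is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SiteRecord:
    """Hypothetical record holding the GOCDB-only items named on the slide."""
    name: str
    certification_status: str
    security_contacts: list
    monitored_nodes: list
    latitude: float
    longitude: float
    # (start, end) pairs describing scheduled maintenance windows
    maintenance: list = field(default_factory=list)

    def in_maintenance(self, when):
        """True if `when` falls inside any scheduled maintenance window."""
        return any(start <= when < end for start, end in self.maintenance)


# Hypothetical sample values, for illustration only.
record = SiteRecord(
    name="EXAMPLE-SITE",
    certification_status="certified",
    security_contacts=["security@example.org"],
    monitored_nodes=["ce.example.org", "se.example.org"],
    latitude=25.04,
    longitude=121.61,
    maintenance=[(datetime(2007, 6, 9, 8, 0), datetime(2007, 6, 9, 12, 0))],
)
print(record.in_maintenance(datetime(2007, 6, 9, 9, 30)))
```

Keeping the maintenance windows next to the list of monitored nodes is one way a tool could avoid reporting a fault for a site that has announced its downtime.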

9 Site Functional Tests I
- Uses job submission tests
  - Executes every 3 hours
  - Runs a series of test scripts on a WN to test:
    - Site CE and WN job submission capability
    - General WN configuration: RPM versions, environment, CSH, BrokerInfo, ...
    - Grid services: Replica Manager, lcg-utils, R-GMA, accounting
- Provides extensive troubleshooting information
- Standard for site certification and operations status
- Used with another tool, Freedom of Choice for Resources
  - Used by experiments to remove sites from the BDII
  - VOs can white list, black list, or depend on SFT results
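How a VO might combine SFT results with its own white list and black list can be sketched as follows. This is a minimal illustration and not the Freedom of Choice for Resources tool itself: the function name `select_sites`, the site names, and the precedence rule (black list wins, white list overrides a failing SFT) are all assumptions.

```python
def select_sites(sft_results, whitelist=frozenset(), blacklist=frozenset()):
    """Return the sorted list of sites a VO would keep.

    sft_results maps a site name to True (latest SFT passed) or False.
    Assumed precedence: black list > white list > SFT result.
    """
    selected = []
    for site, passed in sft_results.items():
        if site in blacklist:
            continue  # black-listed sites are always dropped
        if site in whitelist or passed:
            selected.append(site)
    return sorted(selected)


# Hypothetical site names and SFT outcomes.
results = {"SITE-A": True, "SITE-B": False, "SITE-C": True, "SITE-D": False}
usable = select_sites(results, whitelist={"SITE-B"}, blacklist={"SITE-C"})
print(usable)
```

With no lists supplied, the function simply follows the SFT results, which corresponds to the "depend on SFT results" option.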

10 Site Functional Test II

11 GStat: Runs IS Tests
- Tool to display and monitor information published by site-BDIIs
- Useful for finding out why an SFT fails
  - Missing or inconsistent entries, invalid information
- Availability of site-BDII services
- GRIS availability check
- Usage and performance statistics

12 GridICE – Global View
- Different views of the data: Site / VO / Geographic
- List of sites
- Resource usage: CPU#, load, storage, job info

13 GridICE – Expert View
- Node processes
- The display shows the processes belonging to the Broker service. Problems are flagged.

14 Monitoring Tools: Good demo tools
- Real Time Monitor
  - Collects RB job information
  - Displays job status at each site
  - Indicates when jobs are submitted to a CE
- Google Maps
  - SFT and APEL test results
  - Site location is used
- Host Certificate Monitoring
  - Checks CE host certificate expiration
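A host certificate expiration check of the kind mentioned above can be sketched with Python's standard `ssl` module. This is an assumption-laden illustration, not the monitoring tool from the slide: the function names, the 30-day warning threshold, and the sample dates are hypothetical. The expiry string uses the `notAfter` text format that `ssl.cert_time_to_seconds` accepts.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Whole days between `now` and a certificate's notAfter time.

    not_after is text such as "Jun  9 12:00:00 2007 GMT".
    """
    expiry_seconds = ssl.cert_time_to_seconds(not_after)
    expiry = datetime.fromtimestamp(expiry_seconds, tz=timezone.utc)
    return (expiry - now).days


def needs_renewal(not_after, now, warn_days=30):
    """True when the certificate expires in fewer than warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days


# Hypothetical CE host certificate expiry and check time.
check_time = datetime(2007, 5, 20, 12, 0, 0, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun  9 12:00:00 2007 GMT", check_time)
print(remaining, needs_renewal("Jun  9 12:00:00 2007 GMT", check_time))
```

Passing the current time as an argument rather than reading the clock inside the function keeps the check easy to test.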

15 Grid Operations FAQ: GOCWiki
- GOC Wiki
  - Operation guides
  - Admin howtos
  - Troubleshooting FAQ
  - Installation guides
  - User guides
  - User FAQ
  - User tools
- New: work in progress

16 GGUS Portal Services
- Browseable tickets
- Search through solved tickets
- Useful links (Wiki FAQ)
- Broadcast tools
- Latest news
- GGUS Search Engine
- Updated documentation (Wiki FAQ)

17 GGUS Portal: Search engine
- GGUS Search Engine
- Ongoing work to make it faster and to search through a wider set of docs and DBs

18 CIC Portal
- Latest news: maintenance, M/W updates, etc.
- VO users: EGEE VO information
- RC staff: broadcast tool, weekly site report

19 Problem Detection and Tracking
Operations Escalation Procedure
- Detect problems and perform diagnosis
1. Open a ticket for problem tracking in the CIC portal
   - Sends notification to the site and the ROC
   - Escalation period is 1 to 3 days depending on severity
2. Send a second notification if there is no response
   - Sends notification to the ROC
   - 1-3 days escalation
3. Phone call to the ROC
4. If there is still no response, the CIC suggests that the site be suspended
   - Site is removed from the Top Level BDII configuration
   - Essentially removed from the Grid
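The escalation procedure above can be sketched as a simple lookup from "days without response" to the step reached. This is only a sketch under stated assumptions: the slide gives a 1 to 3 day period depending on severity but does not say which severity maps to which period, nor how long to wait after the phone call, so the `ESCALATION_DAYS` mapping and the use of the same period between every pair of steps are hypothetical.

```python
# The four steps of the procedure, in order.
STEPS = [
    "open ticket in CIC portal (notify site and ROC)",
    "send second notification (notify ROC)",
    "phone call to ROC",
    "suggest site suspension",
]

# Assumed mapping from severity to escalation period in days.
ESCALATION_DAYS = {"high": 1, "medium": 2, "low": 3}


def current_step(days_without_response, severity):
    """Return the escalation step reached after a number of silent days."""
    period = ESCALATION_DAYS[severity]
    # One step per elapsed period, capped at the final step.
    index = min(days_without_response // period, len(STEPS) - 1)
    return STEPS[index]


for days, severity in [(0, "high"), (4, "medium"), (30, "low")]:
    print(days, severity, "->", current_step(days, severity))
```

Capping the index at the last step reflects that suspension is the end of the procedure, however long the silence lasts.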

20 Agenda
- EGEE Operations
- Monitoring Systems and Information Portals
- APROC Services

21 APROC Introduction
- APROC goal
  - Provide deployment support facilitating Grid expansion
  - Maximize the availability of Grid services
- APROC established in April 2005
  - Supports EGEE sites in Asia Pacific
  - 15 sites, 7 countries, > 500 CPUs
  - Australia, Japan, India, Pakistan, Korea, Singapore, Taiwan
- EGEE CIC
  - CIC-on-duty rotation: EGEE global operations
  - Monitoring tool development: GStat and GGUS Search
- EGEE ROC
  - Centralized services
  - Monitoring, diagnosis, and problem tracking
  - M/W release deployment support
  - Security coordination
  - Site registration
  - Portal and documentation

22 ASGCCA
- Serving:
  - Taiwan
  - LCG/EGEE users in Asia Pacific without a local CA
- Production service started in July 2003
- Member of both EUGridPMA and APGridPMA

23 VO Services
- APROC hosts centralized services for VOs
  - Host VOMS server
    - The VO assigns a manager to maintain membership
    - The VO supplies an AUP
  - Host LFC global file catalogue service
  - Resource Broker
  - Load-balanced Top-Level BDII
- Currently supporting: TWGrid, APeSci

24 Daily Operations and Support
- Goal is to achieve high availability:
  - Review and track GGUS and APROC open tickets
  - Monitor and detect new problems in the region
  - Provide detailed technical support to sites
- Stay up to date with the latest operations developments
  - Rollout mailing list
  - GOCWiki entries
- APROC support can be contacted via:
  - Phone
  - TRS ticketing system

25 Middleware Support and Security Coordination
- Middleware support
  - Installation support
  - New release testing
    - Supplementary release notes
    - Testing varying site configurations and environments
  - Coordinate updates and upgrades
- Security coordination
  - With the Operational Security Coordination Team (OSCT)
  - Security Service Challenge completed in March
    - Sending test jobs
    - Sites recover: user DN, IP address of submission UI, UTC time, executable name

26 Site Registration I
- Obtain personal and host certificates
- Send in the application form
- Site registration into GOCDB
- Learn about the Grid middleware
- Additional registration
  - Mailing lists: APROC and Rollout
  - VO membership
  - APROC TRS support account
- Site deployment
- Site certification
  - SFT and GStat tests

27 Site Registration I
- Site design
  - Minimum 5 nodes: UI, CE, DPM, MON, WN
  - Bare minimum configuration: CE/WN, DPM, MON
    - UI user-space installation available
  - Virtual machines to reduce physical nodes
    - Require more memory per machine
- Node specifications
  - Memory: 254MB to 1GB+
  - CPU (Pentium): 500MHz to 2GHz+
  - Disk: 20GB to 80GB+
  - Network: 10Mbps to GE
- Application requirements
  - Job memory requirements
  - Typical file size
- Network
  - Public IP required for all service nodes
  - Avoid private IPs if WN-to-DPM throughput is important, or configure selective NAT and a decent router

28 Site Registration II
- Join the EGEE production infrastructure
  - Respond to tickets within 2 working days
  - Dedicated administrator contact
    - Linux administration experience
  - 1-week upgrade deadline for security releases
- After certification
  - Attend APF meetings
    - Biweekly meeting with other APROC member sites
  - Maintain OS software
  - Monitor and address service faults and tickets
  - Enter scheduled downtime/maintenance
    - M/W upgrade
    - Resource expansion
  - CIC weekly site report
  - Learn about new Grid M/W updates
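The "2 working days" response requirement can be turned into a concrete deadline as sketched below. The function name `response_deadline` is hypothetical, and the sketch assumes that working days are Monday to Friday with no public holidays, which the slide does not specify.

```python
from datetime import date, timedelta


def response_deadline(opened, working_days=2):
    """Date by which a ticket opened on `opened` should get a response.

    Assumes Monday-Friday working days and ignores public holidays.
    """
    current = opened
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# A ticket opened on a Friday is due the following Tuesday.
print(response_deadline(date(2007, 6, 8)))
```

The same function could model the 1-week upgrade deadline for security releases if that deadline is also counted in working days, which is an open question here.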

29 APROC Portal
- Rollout highlights
- Supplemental documentation
- Getting started links
- Registration information
- Contact info and TRS links
- lists.grid.sinica.edu.tw/apwiki
  - Supplementary release notes
  - Site operations procedures
  - Technical howtos
  - Troubleshooting FAQs
  - APF and GDA meeting minutes
  - Feel free to contribute!

30 Summary
- EGEE Operations closely monitors site faults
  - Good for production sites
  - Dedicated human resources are needed to maintain a site
- Many monitoring tools and resources
  - Just not easy to find all of them
- APROC provides EGEE operations support for Asia Pacific
  - Please give us feedback on what we can improve
- People: Jinny Chien, Shu-Ting Liao, Jason Shih, Min Tsai
- Contact us: