Grid Deployment Overview


Grid Deployment Overview
Ian Bird, CERN IT-GD
LCG Comprehensive Review, 15th November 2005

Major issues from 2004 review

- "… the type of middleware to be installed at various university computing centers and national/regional organizations may differ, and requested that the LCG Project addresses this issue in order to ensure that all middleware conforms to a given set of interfaces and to an agreed minimum set of functionalities …"
- "… underlining that interoperability and common interface tools between all Grids should be pursued …"
  - Baseline Services group
  - Work on interoperability
- "… but the LHCC noted outstanding issues concerning the LCG-2 low job success rate, inadequacies of the workload management and data management systems …"
  - Continued effort to fix problems in LCG-2
  - Significant effort to test and integrate gLite components
- "… the LHCC noted that the service provided by LCG-2 was much less than production quality"
  - Operational monitoring (CIC-on-duty)
  - Site Functional Tests, VO site selection tool

EGEE/LCG-2 Grid Sites: November 2005
EGEE/LCG-2 grid: 174 sites, 40 countries; >17,000 processors, ~5 PB storage
(Map legend: country providing resources / country anticipating joining)

OSG Production
46 CEs, 15,459 CPUs, 6 SEs
http://osg-cat.grid.iu.edu/

Accounting: jobs in EGEE
Sustained use in excess of 10k jobs/day for many months

Accounting: CPU time
Total accounted time: ~600 k-SI2k-years. Many sites only recently started accounting, so real use is ~2x this.

Recent Production Statistics (October 2005)
(Charts of running jobs and queued jobs)

Interoperability

- Significant progress since last year: focus on services rather than on different grids
- The Baseline Services group and its report are the basis, now accepted as the common understanding of which services need to be available, and where
- Service challenges have demonstrated basic interaction at the level of data transfers … and have shown up many issues (see later talk)
  - FNAL, BNL, NDGF all fully participating in SC3
- Interoperability at the level of job submission addressed in parallel
- What next?
  - Need experiment statement of priorities
  - Interoperation: sharing operational oversight

Interoperability

EGEE – OSG:
- Job submission demonstrated in both directions, done in a sustainable manner
- EGEE BDII and GIP deployed at OSG sites; will also go into VDT
- EGEE WN tools installed as a grid job on OSG nodes
- Small fixes to job managers to set up the environment correctly

EGEE – ARC:
- 2 workshops held (September, November) to agree strategy and tasks
- Longer term: want to agree standard interfaces to grid services
- Short term:
  - EGEE→ARC: try to use the Condor component that talks to the ARC CE
  - ARC→EGEE: discussions with EGEE WMS developers to understand where to interface
  - Default solution: NDGF acts as a gateway

In both cases: catalogues are experiment choices – generally local catalogues use local grid implementations

Interoperation

Goal: to improve the level of "round-the-clock" operational coverage
- OSG have been to all of the EGEE operations workshops; the latest was arranged as a joint workshop
- Can we share operational oversight?
  - Gain more coverage (2 shifts/day)
  - Share monitoring tools and experience: Site Functional Tests (SFT), common application environment tests – strong interest from both sides
  - User support workflows – interface
- Now: write a short proposal of what we can do together
  - Both EGEE and OSG have effort to work on this
  - Follow up in future operations workshops

Operations

Operator on duty
- Started November 2004
- Crucial in stabilising sites
- Essential tools: GIIS monitor and Site Functional Tests

Simplified VO selection of good sites
- A VO can select the set of functional tests that it requires
- Can white- or black-list sites
- Can include VO-specific tests (e.g. software environment)
- The SFT framework provides dynamic selection of "good" sites
- SFTs have evolved to become stricter as lessons are learned
- Normally >80% of sites pass the SFTs (NB: of 180 sites, some are not well managed)
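The VO-specific selection described above reduces to a simple filter over the SFT results matrix. The sketch below illustrates the idea only; the site names, test names, and function signature are invented for this example, not the actual SFT interface.

```python
def select_good_sites(sft_results, critical_tests, whitelist=(), blacklist=()):
    """Return the sites a VO would consider 'good'.

    sft_results maps site name -> {test name: passed?}.
    Blacklisted sites are always excluded; whitelisted ones always
    included; otherwise a site must pass every test the VO marked
    as critical.
    """
    good = set()
    for site, results in sft_results.items():
        if site in blacklist:
            continue
        if site in whitelist or all(results.get(t, False) for t in critical_tests):
            good.add(site)
    return good

# Illustrative results matrix (sites and tests are made up)
results = {
    "CERN-PROD": {"job-submit": True,  "sw-env": True},
    "SITE-A":    {"job-submit": True,  "sw-env": False},
    "SITE-B":    {"job-submit": False, "sw-env": True},
}
print(select_good_sites(results, ["job-submit", "sw-env"], blacklist={"SITE-B"}))
# {'CERN-PROD'}
```

A VO that does not care about the software environment test would simply pass `["job-submit"]` as its critical set and admit SITE-A as well.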

SFT report
- Shows a results matrix with all sites
- Selection of "critical" tests for each VO to define which sites are good/bad
- Detailed test log available for troubleshooting and debugging
- Deployed on two machines at CERN (load distribution, fault tolerance)

GIIS Monitor (GStat)

Monitoring tool for the Information System:
- Periodically queries all site BDIIs (but not top-level BDIIs)
- Checks whether site BDIIs are available
- Checks the integrity of published information: missing entities, attributes
- Detects and reports information about some of the services (RB, MyProxy, LFC) but does not monitor them
- Detects duplicated services in some cases (e.g. 2 "global" LFC servers for a single VO)
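The integrity check amounts to parsing the LDIF that a site BDII returns and flagging entries that fail to publish required GLUE attributes. The sketch below is a minimal illustration under that assumption: the parsing helper, the sample entry, and the choice of required attributes are for this example only and do not reproduce GStat's actual checks.

```python
def parse_ldif_entries(ldif_text):
    """Split LDIF text (as returned by an ldapsearch of a site BDII)
    into per-entry dicts of attribute -> list of values."""
    entries, current = [], {}
    for line in ldif_text.splitlines():
        line = line.strip()
        if not line:                      # blank line separates entries
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(": ")
        current.setdefault(key, []).append(value)
    if current:
        entries.append(current)
    return entries

def missing_attributes(entry, required):
    """Report which required GLUE attributes an entry does not publish."""
    return [a for a in required if a not in entry]

# Hypothetical CE entry from a site BDII (GLUE 1.x style attribute names)
sample = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-atlas,mds-vo-name=EXAMPLE,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-lcgpbs-atlas
GlueCEInfoTotalCPUs: 120
"""
entry = parse_ldif_entries(sample)[0]
print(missing_attributes(entry, ["GlueCEUniqueID", "GlueCEInfoTotalCPUs", "GlueCEStateStatus"]))
# ['GlueCEStateStatus']
```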

Prototype site availability metric
- Using the current data schema and R-GMA, integrates monitoring information from SFT and GStat
- A summary generator uses the list of critical tests to generate a summary per site: a binary value (good/bad) generated every hour
- A metric generator integrates the summaries over a time period (1 day, …) to generate the availability metric
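The two generators above can be sketched as follows. This is a minimal illustration of the computation only: the real pipeline read its inputs from R-GMA, and the test names are invented.

```python
from statistics import mean

def site_summary(results, critical_tests):
    """Hourly binary good/bad summary for one site: 'good' only if
    every critical test passed. `results` maps test name -> passed?"""
    return all(results.get(t, False) for t in critical_tests)

def availability(hourly_summaries):
    """Integrate the hourly binary summaries over a period (e.g. one
    day) into a fractional availability metric."""
    return mean(1.0 if good else 0.0 for good in hourly_summaries)

# Example: a site that was 'good' for 18 of 24 hourly samples
day = [True] * 18 + [False] * 6
print(availability(day))  # 0.75
```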

Evolution of SFT metric

Availability

CIC-on-duty operations
- CIC-on-duty: currently 6 teams (CERN, IN2P3, RAL, INFN, Russia, Taipei) working in weekly shifts
- The operators look at emerging alarms (CIC Dashboard) and at the monitoring tools (for details) and report problems
- Problems are submitted as tickets to GGUS, and both the ROC and the sites are notified
- The ROC is responsible for timely problem solution; otherwise the ticket is escalated
- Priorities and deadlines for tickets are set depending on site size (number of CPUs)
- Everything is described in detail in the Operations Manual
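The size-dependent prioritisation can be sketched as a simple mapping from CPU count to a priority and a solution deadline. The thresholds, labels, and deadlines below are illustrative only, not the actual values from the Operations Manual.

```python
from datetime import timedelta

def ticket_priority(cpu_count):
    """Assign a ticket priority and solution deadline from site size.

    The cut-offs here are assumptions for illustration: bigger sites
    affect more users, so their problems get tighter deadlines.
    """
    if cpu_count >= 1000:
        return "top priority", timedelta(days=1)
    if cpu_count >= 100:
        return "urgent", timedelta(days=3)
    return "less urgent", timedelta(days=7)

priority, deadline = ticket_priority(1200)
print(priority, deadline.days)  # top priority 1
```

If the deadline passes without a solution from the ROC, the ticket would then be escalated as described above.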

CIC Dashboard

Main tool for CIC-on-duty; makes the CIC-on-duty job much easier
- Integrated view of the monitoring tools (summary): shows only failures and assigned tickets
- Detailed site view with a table of open tickets and links to monitoring results
- Single tool for ticket creation and notification emails, with detailed problem categorisation and templates
- Ticket browser with highlighting of expired tickets
- Well maintained: adapts quickly to new requirements/suggestions

Dashboard panels: sites list (reporting new problems), test summary (SFT, GStat), GGUS ticket status

Service measurement – extending the metrics

Service          Class  Comment
SRM 2.1          C      Monitoring of SE
LFC              C/H
FTS                     Based on SC experience
CE                      Monitored by SFT now
RB                      Job monitor exists
Top-level BDII          Can be included in GStat
Site BDII        H      Monitored by GStat
MyProxy
VOMS
R-GMA

- Effort identified for each service
- Will all be integrated into the SFT framework
- Required to monitor MoU service commitments

Checklist for a new service

User support procedures (GGUS)
- Troubleshooting guides + FAQs
- User guides

Operations team training
- Site admins, CIC personnel, GGUS personnel

Monitoring
- Service status reporting
- Performance data

Accounting
- Usage data

Service parameters
- Scope: global/local/regional
- SLAs
- Impact of service outage
- Security implications

Contact info
- Developers, support contact
- Escalation procedure to developers
- Interoperation: documented issues

First-level support procedures
- How to start/stop/restart the service
- How to check it is up
- Which logs are useful to send to CIC/developers, and where they are

SFT tests
- Client validation, server validation
- Procedure to analyse error messages and likely causes

Tools for CIC to spot problems
- GIIS monitor validation rules (e.g. only one "global" component)
- Definition of normal behaviour
- Metrics
- CIC Dashboard alarms

Deployment info
- RPM list
- Configuration details (for YAIM)
- Security audit

User support – status
- The functionality and usability of the GGUS system have improved: more tickets submitted, more customers, and general appreciation of the service
- GGUS coordinates the effort and the operations
- The interfaces with the ROCs are quite practical and make the system function as a whole; most ROCs have established functional interfaces with GGUS, and the others are working on it
- Ticket traffic is increasing; we still do not know what a realistic figure would be for the number of tickets to expect, but we know how to scale the system
- Many metrics have been established to measure the performance of the system (performance of a supporter/support unit, tickets solved per week per VO, number of tickets filed in Wiki pages, etc.); these refer only to the central system, but measures for each ROC are also available
- We need more specialised supporters in order to help the supporters at CERN, who are now the main source of knowledge and help

User support statistics
110 tickets in the first 15 days of October
(Charts of ticket counts for September and October 2005)

Baseline Services Group
- Set up early this year to agree the common set of fundamental services required by the experiments:
  - Expose experiment plans
  - Identify commonalities
  - Identify implementations
- Report to the PEB in July
- Follow-up (September – November):
  - Understand progress on implementations
  - Propose performance and reliability "metrics"
  - Open issues (e.g. implementation of fine-grained authorisation)

Baseline Services: Priorities

Priority levels:
- A: high priority, mandatory service
- B: standard solutions required; experiments could select different implementations
- C: common solutions desirable, but not essential

Services, rated per experiment (ALICE, ATLAS, CMS, LHCb), with priorities shown where they appear in the transcript:
- Storage Element (A)
- Basic transfer tools
- Reliable file transfer service (A/B)
- Catalogue services (B)
- Catalogue and data management tools (C)
- Compute Element
- Workload Management
- VO agents
- VOMS
- Database services
- POSIX I/O
- Application software installation
- Job monitoring tools
- Reliable messaging service
- Information system

Baseline Services – status – 1

Storage Element
- Castor, dCache, DPM all available with SRM v1.1
- Plans to implement most of SRM 2.1 in response to stated requirements; expected to be available for SC4 for DPM, Castor, dCache (global space reservation will be late in dCache, …)
- Agreed on a standard SRM test suite as the definition of conformity (to be supplied by CERN)

Basic transfer tools
- GridFTP, srmCopy
- GridFTP available; the new version from GT4 should be far more reliable and should be deployed for SC4
- srmCopy: in progress for DPM; exists for the others

Reliable file transfer service
- gLite FTS available, in production in SC3; a new version with inter-VO scheduling and an interface to srmCopy is under test now
- Globus RFT (US): in principle available, but no plans at the moment to integrate it with experiment software(?)

Baseline Services – status – 2

Catalogue services
- LFC: integrated with POOL, in use by all experiments as a local file catalogue; also usable as a central catalogue
- Globus RLS: integrated with POOL; will be used in OSG by US-ATLAS/CMS (??)
- gLite Fireman: integrated with POOL, available for experiment testing

Catalogue and data management tools
- lcg-utils: quite robust; timeout handling added
- POOL command-line catalogue tools available with POOL
- gLite-IO-xxx tools: somewhat parallel to lcg-utils; need to understand which gLite components will be deployed
- Propose that lcg-utils (or similar) should absorb all these sets of CLI tools, with consistent naming and options

Baseline Services – status – 3

Compute Element
- Existing Globus/Condor-G based CE deployed in LCG-2 and OSG
- Both planning to move to a Condor-C based CE (via gLite in EGEE); not clear if the timescales are understood yet, and this new CE is not yet very stable
- ARC (NorduGrid) have a completely different, incompatible CE; not clear on future plans

Workload Management
- Existing RB; expect the gLite RB to replace it soon
- Other implementations are Condor-G based; advertised as lighter weight, but not clear whether they add in missing functionality such as the logging and bookkeeping service
- Work clearly needs to be done on RB performance
- Other developments coming out of the woodwork (Panda in US-ATLAS), …

VO agents
- Prototype VO box put together by LCG: based on a standard CE; includes gsi-enabled ssh, proxy renewal services, access to the site CE, etc.
- Being deployed now for ALICE, ATLAS, LHCb
- The Edge Service in OSG is probably a more sophisticated implementation; collaboration on this topic is being discussed, but it is good to get some experience quickly with the prototypes

Baseline Services – status – 4

VOMS
- VOMS is the only implementation; of the various management interfaces, all in LCG have agreed on the VOMRS service with coupling to the CERN HR database (and potentially others) – this exists and is being tested now
- Service lookup: LCMAPS/LCAS in EGEE, the Privilege project in OSG; again discussing collaboration
- Direct integration into the SE is in progress for DPM and dCache

Database services
- Oracle and MySQL
- The SC exercise is defining what this service needs to look like
- 3D project to enumerate the services needed

POSIX I/O
- GFAL: a library; exists, but not used as an I/O service except in tests – the GFAL library is, however, at the heart of many of the lcg-utils
- gLite-I/O: a very different security model; not tested yet by any application
- In general, needs more input from the experiments on what is appropriate

Baseline Services – status – 5

Application software installation
- Service exists in EGEE; in OSG, application software is installed by hand by the VO
- Can be improved/adapted as requested (but no real experiment requests in this area at the moment)

Job monitoring tools
- Subject to continuous development and evolution
- EGEE is in a better situation than OSG here (RB logging and bookkeeping); other functionality recently added (live view of stdout/stderr of grid jobs, etc.)

Reliable messaging service
- Experiments are variously using things like Jabber; no general service set up yet
- Need a discussion on what this should look like

Information system
- EGEE and OSG use very nearly the same schema, recently jointly updated to GLUE 1.2; ARC use a very different schema
- Common agreement on joint development of GLUE 2.0 (to start now), using the experience gained by all; quite clear ideas on what is needed now
- GLUE 2.0 will not be backwards compatible with the previous version (but there is a migration path)
- A common schema is desirable for job monitoring, accounting, information, monitoring, …

VO Boxes
- Originally proposed in Baseline Services discussions
- Recognised that all experiments need long-lived, "service-like" processes at many sites, which were being handled in an ad-hoc way
- Proposed to provide a clear mechanism for managing these needs
- OSG Edge Service Framework addresses the same issue, based on virtual machine technology
- All 4 experiments have this need
- Generated a lot of excited discussion:
  - Security concerns
  - Management and resource concerns (number of boxes)
  - "Not in the grid philosophy" … but as of now no grid services provide these additional needs

VO Boxes – 2

Status now:
- Being deployed at Tier 1 (and some Tier 2) sites for SC3/SC4 (CMS has specific services run by CMS staff "near" Tier 1s)
- Operations workshop: security document and questionnaire, operations policy and questionnaire; both are reasonable proposals, and the experiments are providing the information

Need to better understand:
- Which common services can (eventually) be extracted as generic components? Asynchronous registration service, catalogue updates, messaging service, …

And:
- Will not hold up SC3/4
- Many Tier 2 sites await more details and an assessment by the Tier 1s
- Must generalise as many services as possible
- May still need to provide long-lived VO-specific services at a site; better to do it in an understood, managed container than ad-hoc

This session …
- Service challenges: report and SC4 plans
- Experiment reports on SC3
- Site experiences in SC3: 2 Tier 1s, UK Tier 2s
- Security update: issues