John Gordon and LCG and Grid Operations John Gordon CCLRC e-Science Centre, UK LCG Grid Operations.

Presentation transcript:

Outline
The monitoring tools
How we use them in operations
What is still to be done

Grid Operations
Once middleware has been developed, tested and deployed, grid operations are the set of actions and procedures to keep a grid running for the users.

The Vision
GOC Processes and Activities
–Coordinating Grid Operations
–Defining Service Level Parameters
–Monitoring Service Performance Levels
–First-Level Fault Analysis
–Interacting with Local Support Groups
–Coordinating Security Activities
–Operations Development

Have we delivered?
Coordinating Grid Operations
Defining Service Level Parameters
Monitoring Service Performance Levels
First-Level Fault Analysis
Interacting with Local Support Groups
Coordinating Security Activities
Operations Development
Yes, RAL, CERN & Taipei No up or down Yes Policies, not operation Monitoring and accounting

Monitoring the Grid is a Challenge!

Monitoring Overview
Why We Monitor
–Keep systems up and running
–Notice failures; grid-wide services MDS
–Knowing what services a site should be running
–no point raising an alert if the site isn’t meant to run it!
–definition of services and which sites run them (SLA)
What Tools Do We Use
–Job Submission; GridIce; Nagios; GIIS Monitor
How – Database
Developments Planned nagios
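The point about not raising an alert for a service a site is not meant to run can be sketched in a few lines of Python. This is only an illustration: the site names, the `EXPECTED_SERVICES` table and the `alerts_for` helper are hypothetical and are not taken from the GOC tools.

```python
# Hypothetical table of which services each site is expected to run.
# In practice this information would come from a site database or SLA.
EXPECTED_SERVICES = {
    "site-a": {"ce", "se", "bdii"},
    "site-b": {"ce"},
}


def alerts_for(site, observed_down):
    """Return the failed services that the site is actually meant to run."""
    expected = EXPECTED_SERVICES.get(site, set())
    return sorted(expected & set(observed_down))


# site-a is meant to run an SE, so the failed SE is reported;
# nobody expects an RB at either site, so the failed RB is ignored.
print(alerts_for("site-a", ["se", "rb"]))
print(alerts_for("site-b", ["se", "rb"]))
```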

Monitoring Challenges
We have only fragmentary information about the services that sites are running.
We don’t know what RBs/SEs/Sites the VOs are using for data challenges.
We don’t know what the core services are and who is running them.
We don’t have a toolkit to test specific core services.
We have to concentrate on functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?
Not all the tests that we perform are effective at finding problems, so we must take tests written by the experts and integrate them into GOC monitoring.
We must develop tests which simulate the life cycle of real applications in a Grid environment.
There are lots of monitoring tools available, so we need to bring them together.
Do we spend time investigating new tools, or make the ones which we already have better?
…and probably lots more!

Monitoring Services
There are many frameworks which can be used to monitor distributed environments:
MAPCENTRE, GPPMON, GRIDICE, NAGIOS, MONALISA, GIIS Monitor, Ganglia
–Example: Mapcentre, 30 sites ~ 500 lines in config file (static version)
–Example: Nagios, 30 sites, 12 individual config files with dependencies
–Developed tools to configure these services to make the job easier
NAGIOS, MAPCENTER and GPPMON
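Generating monitoring configuration from a list of sites, rather than editing hundreds of lines by hand, might look roughly like the sketch below. It is not the GOC tool itself. The site list and hostnames are invented, and the `generic-host` and `generic-service` templates and the `check_tcp` command are assumed to be defined elsewhere in the Nagios configuration, as they are in the standard sample files. Port 2119 is the usual Globus gatekeeper port.

```python
# Hypothetical site list; a real one would be read from a site database.
SITES = [
    {"name": "site-a", "ce": "ce.site-a.example.org"},
    {"name": "site-b", "ce": "ce.site-b.example.org"},
]

HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {host}
    alias      {site} CE
    address    {host}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {host}
    service_description  gatekeeper
    check_command        check_tcp!2119
}}
"""


def render_nagios_config(sites):
    """Render one host and one service definition per site."""
    blocks = []
    for site in sites:
        blocks.append(HOST_TEMPLATE.format(site=site["name"], host=site["ce"]))
        blocks.append(SERVICE_TEMPLATE.format(host=site["ce"]))
    return "\n".join(blocks)


print(render_nagios_config(SITES))
```

Adding a site then means adding one entry to the list and regenerating the file.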

GOC Configuration Database
[Diagram: Resource Centres (RC) manage their entries over https in the GOC GridSite / MySQL server; monitoring reads the database via SQL.]
Secure Database Management via HTTPS / X.509
Resource Centre Resources & Site Information: EDG, LCG-1, LCG-2, …; ce, se, bdii, rb
People, Contact Information, Resources, Scheduled Maintenance

GPPMON - 2
GOC Job Submission Flow Diagram
Simple job forked on CE using globus
[Diagram: GOC (UI) → SQL query on SITE DB → build list of CE, RB resources → job script GLOBUS.CE → globus-job-run → CE → wget. Status labels: create, sent, acknowledge, received acknowledgement.]
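One reading of this flow, sketched in Python: query a site database for the CEs, then build one `globus-job-run` command per CE whose job calls back to the GOC with `wget`. The table layout, hostnames and callback URL are all hypothetical, and the commands are only constructed here, not executed.

```python
import sqlite3


def build_resource_list(conn):
    """Return (site, hostname) pairs for every CE in the site database."""
    rows = conn.execute(
        "SELECT site, hostname FROM resources WHERE type = 'ce' ORDER BY site"
    )
    return [(site, host) for site, host in rows]


def globus_command(ce_host, callback_url):
    """Build the argument list for a simple forked job on one CE.

    The job only fetches a URL with wget, so a hit on the (hypothetical)
    callback URL acts as the acknowledgement that the job ran.
    """
    return [
        "globus-job-run", ce_host,
        "/usr/bin/wget", "-q", "-O", "/dev/null",
        f"{callback_url}?ce={ce_host}",
    ]


# In-memory stand-in for the site database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (site TEXT, type TEXT, hostname TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?, ?)",
    [
        ("site-a", "ce", "ce.site-a.example.org"),
        ("site-a", "se", "se.site-a.example.org"),
        ("site-b", "ce", "ce.site-b.example.org"),
    ],
)

commands = [
    globus_command(host, "https://goc.example.org/ack")
    for _site, host in build_resource_list(conn)
]
for cmd in commands:
    print(" ".join(cmd))
```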

GPPMON - 3
Simple job through local jobmanager on CE via Resource Broker
[Diagram: GOC (UI) → SQL query on SITE DB → build list of CE, RB resources → job script RB.CE → edg-job-submit → RB (Job MatchMaking, Other.GlueCEUniqueID) → CE → WN → wget. Status labels: create, sent, acknowledge, received acknowledgement.]
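For the Resource Broker path, the test job has to be pinned to a specific CE during matchmaking. A minimal sketch of generating such a job description is shown below, assuming a JDL `Requirements` expression on `other.GlueCEUniqueID`. The script name and CE identifier are made up, and the actual GPPMON job script may differ.

```python
def make_jdl(ce_id, script="goc-test.sh"):
    """Return JDL text for a simple job that may only match the given CE."""
    return "\n".join([
        f'Executable = "{script}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f'InputSandbox = {{"{script}"}};',
        'OutputSandbox = {"std.out", "std.err"};',
        # Restrict matchmaking to one CE so the test targets that site.
        f'Requirements = other.GlueCEUniqueID == "{ce_id}";',
    ])


jdl = make_jdl("ce.site-a.example.org:2119/jobmanager-pbs-short")
print(jdl)
```

The resulting text would be written to a file and handed to `edg-job-submit`, once per CE and RB combination to be tested.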

GPPMON – 1
LCG2 Site Status: 21 July am

GRIDICE - 1

Ganglia Monitoring
Can use Ganglia to monitor a cluster
RAL Tier-1 Centre LCG PBS Server displays Job status for each VO

Ganglia Monitoring - 2
Can also use Ganglia to monitor clusters of clusters

Regional Monitoring - 1
Provide ROCs with a package to monitor the resources in the region
Tailored Monitoring
–ROCs may upload their own maps
–JAVA GUI to automate site locations on the map
Hierarchical view of Resources
–Example: GridPP made up of virtual T2 centres
[Diagram: EGEE > France, UK/I, S.E.E; UK/I > GridPP > LondonT2 (IMPERIAL, QMUL), ScotGrid (Edinburgh)]

GPPMON – 1
LCG2 Site Status: 21 July am

Regional Monitoring - 2
Active map to select individual regions

Regional Monitoring - 3
UK/I Monitoring displays GRIDPP and NGS resources.

Replica Manager Tests - 1
GOC to take over site certification testing, which is done by the CERN deployment team on a daily basis (e.g. reports by Piotr Nyczyk).
First step toward this involved running a series of replica manager tests which register files onto the grid, move them around and delete them, plus 3rd party copies from a remote SE, e.g. Castorgrid.
Demonstrates that we can integrate other people's tools into GPPMON.
Development of a portal which will:
–Make it easy to retrieve debug information from the job output.
–Connect with information provided by other monitoring tools, e.g. Taipei GIIS Monitor.
–Provide testing “on-demand” to site administrators through a secure interface.

Replica Manager Tests - 2
Results of each test are shown as a coloured index on the map.
Distinguish between jobs that have completed, have failed or are still running.

Replica Manager Tests - 3
Description of the tests
Job Outputs
GIIS Monitor Information

GIIS Monitor
Developed by MinTsai (GOC Taipei)
Tool to display and check information published by the site GIIS

Job Accounting - 1
Program publishes PBS log file information through RGMA to the GOC.
GOC aggregates data across all sites.

Job Accounting - 2
Offline testing of program using data from the CORE sites completed.
Development of an accounting portal underway to provide accounting on-demand for each site, and aggregated for each EGEE region.
Challenge! Deal with large database: 1 ROW per LCGPBS Job per Site!

GridPP Accounting

EDG-network monitoring
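The "one row per job per site" model and the aggregation step can be illustrated with a short Python sketch. It parses job-end (`E`) records in the usual PBS accounting log layout and sums CPU time per Unix group. The sample records are invented, the group is used here only as a stand-in for the VO, and the publishing step through R-GMA is left out.

```python
from collections import defaultdict

# Invented sample in PBS accounting format: date;record type;job id;key=value ...
SAMPLE_LOG = """\
07/21/2004 10:00:00;E;101.pbs.example.org;user=atlas001 group=atlas queue=short resources_used.cput=00:10:00 resources_used.walltime=00:12:00
07/21/2004 10:05:00;E;102.pbs.example.org;user=cms001 group=cms queue=short resources_used.cput=01:00:00 resources_used.walltime=01:30:00
07/21/2004 10:06:00;S;103.pbs.example.org;user=atlas002 group=atlas queue=short
07/21/2004 11:00:00;E;103.pbs.example.org;user=atlas002 group=atlas queue=short resources_used.cput=00:20:00 resources_used.walltime=00:25:00
"""


def hms_to_seconds(value):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_records(text, site):
    """Yield one row per finished job; other record types are skipped."""
    for line in text.splitlines():
        parts = line.split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue
        fields = dict(
            item.split("=", 1) for item in parts[3].split() if "=" in item
        )
        yield {
            "site": site,
            "job_id": parts[2],
            "group": fields.get("group", "unknown"),
            "cpu_seconds": hms_to_seconds(
                fields.get("resources_used.cput", "0:0:0")
            ),
        }


def aggregate_by_group(records):
    """Sum job counts and CPU seconds per group."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0})
    for rec in records:
        totals[rec["group"]]["jobs"] += 1
        totals[rec["group"]]["cpu_seconds"] += rec["cpu_seconds"]
    return dict(totals)


rows = list(parse_end_records(SAMPLE_LOG, site="site-a"))
summary = aggregate_by_group(rows)
print(summary)
```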

Security
Worked with Security Group
Defined a Security Policy
–and auditing procedures
Have a list for security contacts
–but not really exercised it yet
–still need to define procedures in the event of security incidents

Keeping the Work Flowing
Regular monitoring of job submission
–shows sites that have problems running jobs
Nagios tracks individual services
–plus certificate lifetime
RM tests show whether data can be moved
GridICE and Ganglia show what is running
Limited by RB behaviour
–we can see that jobs are not getting to sites but not why.
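A certificate-lifetime check of the kind a Nagios plugin performs could be sketched as follows. The thresholds and dates are hypothetical. The return codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL), and the expiry string uses the `notAfter` format accepted by Python's `ssl.cert_time_to_seconds`.

```python
import ssl

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def cert_status(not_after, now, warn_days=14, crit_days=3):
    """Classify a certificate by the number of days left before it expires.

    not_after: expiry time such as "Jul 31 00:00:00 2004 GMT".
    now: current time in seconds since the epoch.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    days_left = (expires - now) / 86400
    if days_left <= crit_days:
        return CRITICAL, days_left
    if days_left <= warn_days:
        return WARNING, days_left
    return OK, days_left


# Fixed "current" time so the example is reproducible.
now = ssl.cert_time_to_seconds("Jul 21 00:00:00 2004 GMT")
for host, not_after in [
    ("ce.site-a.example.org", "Dec 31 00:00:00 2004 GMT"),
    ("ce.site-b.example.org", "Jul 31 00:00:00 2004 GMT"),
    ("ce.site-c.example.org", "Jul 22 00:00:00 2004 GMT"),
]:
    code, days = cert_status(not_after, now)
    print(host, code, round(days))
```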

What have we delivered?
A set of monitoring tools
A monitoring regime
Two GOCs (RAL and Taipei)
Security Policy

Still to do
Effective problem tracking
–we see site problems and get them fixed
–but don’t manage long-term problems
Integration with User Support
–we track problems we see
–but problems users notice are not effectively dealt with
Automatic alerts
–Nagios does but EMS from Taipei looks promising
Remote repair
–agents until middleware can support this directly
Security
Deploy accounting
Distribute monitoring to EGEE ROCs and others

What Next? (1)
RSS used to send tailored streams
–sites, ROCs, management can all decide what to subscribe to
Accounting
–being tested in LCG C&T testbed
–should be in next LCG release
–Then get T2 accounts
keep your pbs log and msgs and gatekeeper logs

Monitoring Feeds
GOC server generates a lot of monitoring information.
Need a way to give this information to the right people, e.g. site administrators.
Really Simple Syndication (RSS) is an XML format.
Used by many sites which want to syndicate content, e.g. BBC, Slashdot.
Client Pull model: GOC creates RSS formatted documents; clients pull these feeds and render them in html.
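As an illustration of the server side of the pull model, the sketch below builds a small RSS 2.0 document of test results with Python's standard library. The feed title, link and test names are invented, and the real GOC feeds may carry different elements.

```python
import xml.etree.ElementTree as ET


def build_feed(site, results):
    """Return an RSS 2.0 document with one item per test result."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"GOC test results for {site}"
    ET.SubElement(channel, "link").text = "https://goc.example.org/"
    ET.SubElement(channel, "description").text = (
        f"Latest monitoring results for {site}"
    )
    for test_name, status in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{test_name}: {status}"
        ET.SubElement(item, "description").text = (
            f"Test {test_name} on {site} finished with status {status}"
        )
    return ET.tostring(rss, encoding="unicode")


feed = build_feed(
    "site-a CE",
    [("globus-job-run", "OK"), ("replica-manager", "FAILED")],
)
print(feed)
```

A site administrator's aggregator would fetch this document periodically and show only the feeds that person has subscribed to.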

Aggregator
RSSReader (Windows Client)
GOC generates RSS feeds which clients can pull using an RSS aggregator.
Aggregators available for Linux, Windows and MacOS.
The aggregator shown displays test results for the RAL CE. These results are archived and pop up on the desktop when the feed is updated.

What next? (2)
GGUS developments
–operations issues forwarded to UK GSC helpdesk
Weekly LCG GDA Operations Meeting
–see next slide
EGEE ROCs taking support load
–UK ready?
EGEE CICs taking operations load on weekly rotation

Proposal
2 hour weekly meeting, with VRVS for remote participation
–use the existing GDA slot
–Fully open meeting
Weekly operations reports (written in advance - previous Friday evening) from
–Each EGEE ROC (NE should include Nordugrid ops)
–Taipei GOC
–Grid3 (covering FNAL and BNL Tier 1’s)
–Other LCG Tier 1 sites (where different from the above) - Triumf, Tokyo – others?
–ROCs and Tier1s will report on and represent the sites they support
Weekly reports (written, submitted in advance) from customers:
–LHC experiments
–Bio-med
–Others as they come on-line
During the meeting only issues should be brought up and resolved
Need to have good representation from ROCs and Tier 1s
Need application reps involved in grid work to attend
Once a month have more general discussions (presentation style), e.g.:
–Middleware developments
–Larger issues - batch system problems, etc
Minutes, attendance and problems will be public

UK view
RAL CIC will take on part of ongoing GOC work
–including development for LCG/EGEE
UK/I ROC will monitor and support UK/I sites
–Helpdesk/DTeam/GOC
–Maps tailored for Tier2s