Next Generation of Grid Services for the NorduGrid NEC 2003 Varna, September 19, 2003 Oxana Smirnova

2 Outline
- Project overview
- Services in detail
- Present vs. future

3 NorduGrid: some facts
- NorduGrid is:
  – The Globus-based Grid middleware solution of choice in Scandinavia and Finland
  – A large international 24/7 production-quality Grid facility
  – A resource routinely used by researchers since summer 2002
  – Freely available software
  – A project under development
- NorduGrid is NOT:
  – Based on EDG/LCG solutions
  – An ATLAS- or HEP-dedicated tool
  – A testbed anymore
  – A finalized solution

4 What makes NorduGrid different from other Grids
1. It is stable by design:
   a) The nervous system: a distributed yet stable Information System (Globus MDS 2.2 + patches)
   b) The heart(s): the Grid Manager, the service installed on master nodes (based on Globus, replaces GRAM)
   c) The brain(s): the User Interface, the client/broker that can be installed anywhere as a standalone module (makes use of Globus)
2. It is light-weight, portable and non-invasive:
   a) Resource owners retain full control; the Grid Manager is effectively just another user (with many faces, though)
   b) No requirements w.r.t. OS, resource configuration, etc.
   c) Clusters need not be dedicated
   d) Runs on top of an existing Globus installation (e.g. VDT)
3. Strategy: start with something simple that works for users and add functionality gradually

5 Current standing
- The project involves several Nordic universities and HPC centers
- Will continue for 4-5 more years
  – Forms the "North European Grid Federation" of EGEE together with the Dutch Grid, Belgium and Estonia
  – Will provide middleware for the "Nordic Data Grid Facility" and related projects (SWEGRID, Danish Grid etc.)
- Shares authentication and authorization mechanisms with EDG/LCG-1/EGEE
- The first milestone was reached on time: NorduGrid provides essential facilities for the ATLAS Data Challenge (up to 15% contribution during the summer holidays)
- Current stable release:
- Next milestone: the first release, NorduGrid 1.0
- Any new development will be in the Grid Services framework

6 The resources
- Almost everything the Nordic academics can provide (ca. 1000 CPUs in total):
  – 4 dedicated test clusters (3-4 CPUs)
  – Some junkyard-class second-hand clusters (4 to 80 CPUs)
  – A few university production-class facilities (20 to 60 CPUs)
  – Two world-class clusters in Sweden, listed in the Top500 (238 and 398 CPUs)
- Other resources come and go
  – Canada, Japan – test set-ups
  – CERN, Dubna – clients
  – It's open: anybody can join or leave
- People:
  – the "core" team: 6 persons
  – local sysadmins are only called upon when users need an upgrade

7 A snapshot

8 Components (diagram): Information System

9 How does it work
- The Information System knows everything
  – Substantially re-worked and patched Globus MDS
  – Distributed and multi-rooted
  – Allows for a mesh topology
  – No need for a centralized broker
- The server (the "Grid Manager") on each gatekeeper does most of the job
  – Pre- and post-stages files
  – Interacts with the LRMS
  – Keeps track of job status
  – Cleans up the mess
  – Sends mails to users
- The client (the "User Interface") does the brokering, Grid job submission, monitoring, termination, retrieval, cleaning etc.
  – Interprets the user's job task
  – Gets the testbed status from the Information System
  – Forwards the task to the best Grid Manager
  – Does some file uploading, if requested

Information System
- Uses Globus MDS 2.2
  – Soft-state registration allows creation of any dynamic structure
  – Multi-rooted tree
  – GIIS caching is not used by the clients
  – Several patches and bug fixes are applied
- A new schema was developed to describe clusters
  – Clusters are expected to be fairly homogeneous
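Because the information system is essentially (patched) MDS with anonymous access, its contents can be inspected with a plain LDAP query. A minimal sketch, assuming the standard Globus GRIS port 2135 and base DN; the host name is hypothetical and the nordugrid-* object class and attribute names are given here as assumed examples of the NorduGrid schema:

  # Query a cluster's local information tree directly (anonymous bind)
  ldapsearch -x -H ldap://grid.example.org:2135 \
      -b 'mds-vo-name=local,o=grid' \
      '(objectClass=nordugrid-cluster)' \
      nordugrid-cluster-name nordugrid-cluster-totalcpus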

Front-end and the Grid Manager
- The Grid Manager replaces Globus GRAM, while still using Globus Toolkit 2 libraries
- All transfers are made via GridFTP
- Adds the possibility to pre- and post-stage files, optionally using Replica Catalog information
- Caching of pre-staged files is enabled
- Runtime environment support
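Runtime environment support means a job can request a named software environment (such as ATLAS-6.0.2 in the job-description example further down); on the resource side such an environment is, in essence, a shell script that sets up the requested software before the job runs. A minimal sketch under that assumption; the installation path and variable names are purely illustrative:

  #!/bin/sh
  # Hypothetical runtime-environment script, e.g. installed as <runtimedir>/ATLAS-6.0.2
  export ATLAS_ROOT=/opt/atlas/6.0.2
  export PATH=$ATLAS_ROOT/bin:$PATH
  export LD_LIBRARY_PATH=$ATLAS_ROOT/lib:$LD_LIBRARY_PATH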

The User Interface
- Provides a set of utilities to be invoked from the command line:

  ngsub    – submit a task
  ngstat   – obtain the status of jobs and clusters
  ngcat    – display the stdout or stderr of a running job
  ngget    – retrieve the result from a finished job
  ngkill   – cancel a job request
  ngclean  – delete a job from a remote cluster
  ngrenew  – renew the user's proxy
  ngsync   – synchronize the local job info with the MDS
  ngcopy   – transfer files to, from and between clusters
  ngremove – remove files

- Contains a broker that polls MDS and decides to which queue at which cluster a job should be submitted
  – The user must be authorized to use the cluster and the queue
  – The cluster's and queue's characteristics must match the requirements specified in the xRSL string (max CPU time, required free disk space, installed software etc.)
  – If the job requires a file that is registered in a Replica Catalog, the brokering gives priority to clusters where a copy of the file is already present
  – From all queues that fulfill the criteria, one is chosen randomly, with a weight proportional to the number of free CPUs available for the user in each queue
  – If there are no available CPUs in any of the queues, the job is submitted to the queue with the lowest number of queued jobs per processor
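A typical client session might look like the sketch below; the file name and the <jobid> placeholder are hypothetical, and the -f option (read the xRSL from a file) is an assumption rather than a quote from the manual:

  $ ngsub -f hello.xrsl    # submit the job described in hello.xrsl; a job ID is printed
  $ ngstat <jobid>         # poll the job status
  $ ngcat <jobid>          # look at the stdout of the running job
  $ ngget <jobid>          # retrieve the results once the job has finished
  $ ngclean <jobid>        # remove leftovers from the remote cluster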

Job Description: extended Globus RSL

  (&(executable="recon.gen.v5.NG")
    (arguments="dc lumi hlt.pythia_jet_17.zebra"
               "dc lumi02.recon hlt.pythia_jet_17.eg7.602.ntuple"
               "eg7.602.job" "999")
    (stdout="dc lumi02.recon hlt.pythia_jet_17.eg7.602.log")
    (stdlog="gridlog.txt")(join="yes")
    (|(&(|(cluster="farm.hep.lu.se")(cluster="lscf.nbi.dk")
          (*cluster="seth.hpc2n.umu.se"*)(cluster="login-3.monolith.nsc.liu.se"))
        (inputfiles=
          ("dc lumi hlt.pythia_jet_17.zebra"
           "rc://grid.uio.no/lc=dc1.lumi ,rc=NorduGrid,dc=nordugrid,dc=org/zebra/dc lumi hlt.pythia_jet_17.zebra")
          ("recon.gen.v5.NG" "…")
          ("eg7.602.job" "…")
          ("noisedb.tgz" "…")))
      (inputfiles=
        ("dc lumi hlt.pythia_jet_17.zebra"
         "rc://grid.uio.no/lc=dc1.lumi ,rc=NorduGrid,dc=nordugrid,dc=org/zebra/dc lumi hlt.pythia_jet_17.zebra")
        ("recon.gen.v5.NG" "…")
        ("eg7.602.job" "…")))
    (outputFiles=
      ("dc lumi02.recon hlt.pythia_jet_17.eg7.602.log"
       "rc://grid.uio.no/lc=dc1.lumi02.recon ,rc=NorduGrid,dc=nordugrid,dc=org/log/dc lumi02.recon hlt.pythia_jet_17.eg7.602.log")
      ("histo.hbook"
       "rc://grid.uio.no/lc=dc1.lumi02.recon ,rc=NorduGrid,dc=nordugrid,dc=org/histo/dc lumi02.recon hlt.pythia_jet_17.eg7.602.histo")
      ("dc lumi02.recon hlt.pythia_jet_17.eg7.602.ntuple"
       "rc://grid.uio.no/lc=dc1.lumi02.recon ,rc=NorduGrid,dc=nordugrid,dc=org/ntuple/dc lumi02.recon hlt.pythia_jet_17.eg7.602.ntuple"))
    (jobname="dc lumi02.recon hlt.pythia_jet_17.eg7.602")
    (runTimeEnvironment="ATLAS-6.0.2")
    (CpuTime=1440)(Disk=3000)(ftpThreads=10))
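Several source URLs and dataset numbers in the ATLAS example above did not survive (marked "…"). For orientation, a minimal self-contained xRSL of the same shape might look like the following; all names, hosts and limits are purely illustrative, and the empty output destination is assumed here to mean "keep the file on the cluster for retrieval with ngget":

  &(executable="myapp.sh")
   (arguments="input.dat")
   (inputfiles=("input.dat" "gsiftp://grid.example.org/data/input.dat"))
   (outputfiles=("result.root" ""))
   (stdout="myapp.out")(join="yes")
   (jobname="myapp-test")
   (runTimeEnvironment="ATLAS-6.0.2")
   (CpuTime=60)(Disk=500)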

Task flow (diagram): RSL, Gatekeeper, GridFTP, Grid Manager, front-end, cluster

What NorduGrid 1.0 will provide…

Stability, reliability
- The facility has been working non-stop since May 2001
- Grid components do not require special attention from sysadmins
- Failure of a single cluster/node does not cripple the whole system
- Comes complete with a test suite

Portability
- Runs on any Linux flavor and on Solaris
  – Tests run on Tru64
  – Sources are freely available to port anywhere
- Does not require a very special cluster configuration
  – Nothing needs to be installed on worker nodes, only on the gatekeeper

Smart front-end and client
- The Grid Manager replaces Globus GRAM
  – All transfers are done via GridFTP
  – Pre- and post-staging of files (possibly using other protocols)
- Caching of pre-staged files
- Support for runtime environments
- Extended RSL
- Decentralized resource brokering by each client
  – Relies on the Information System

User-oriented services
- Support for a variety of applications
  – ATLAS is the major customer
    Data Challenges: one of 19 ATLAS "sites"; contribution ramped up from 2% to 15% within a year due to expansion (ca. 4 TB of data, more than 4750 single jobs of 2 to 30 hours each)
    Higgs and B-physics team: own Data Challenges
  – Combined LEP analysis
  – Theoretical high energy physics: tests of PYTHIA components
  – Quantum lattice models
- Simple monitoring
  – Users can make their own choice of resources

NorduGrid 1.0 issues (to be addressed in the next releases):

Scalability
- Only 8 production-level clusters have NorduGrid installed
  – There are no dedicated clusters available, hence Grid users have to compete for the resources
  – Only half of the owners let all their resources be used by the Grid
- Only 2-3 regular users (ATLAS) and ~5-7 occasional ones
  – That's still a lot given the small size of the Scandinavian physics community
- The maximum observed load is only about 160 concurrent jobs, plus ca. 200 queued
  – This is due to the internal limits and priorities set by resource owners

QoS assurance
- Clusters can provide false information and yet stay in the system
  – The issue is solvable technically, but not politically
- Clusters can provide poor service, such as not updating credentials, or suffer local hardware problems causing massive job failures, and yet they cannot be suspended/banned
  – Also a political issue, not a technical one
- No warnings prior to the expiration of service and user credentials
- No job recovery and/or rescheduling mechanisms

Logging, accounting, auditing
- Transient nature of all the information
  – Very difficult to trace back eventual problems
  – Impossible to provide precise usage figures
  – Very difficult to keep track of who did what
- General lack of corresponding services
  – Unfortunately, this is an issue for any Grid solution worldwide; maybe EGEE could help?
  – Meanwhile, a "logger" service is being slowly introduced

Authorization and authentication
- Like every Grid project worldwide, NorduGrid uses rudimentary solutions from Globus 2:
  – No Virtual Organization management
  – No VO-based authorization or access-rights control
  – Insufficient and ambiguous information stored in certificates
  – No automatic certificate/key download by jobs
  – No proxy management
    No secure proxy storing/delegating service
    No automatic proxy extension/renewal
- These are also issues which should be addressed globally (GGF, EGEE?)
  – VOMS is being implemented by NorduGrid and EDG/LCG-1
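In the absence of proxy-management services, users create and renew short-lived proxies by hand with the standard Globus tools before submitting jobs; a minimal sketch, where the lifetime and the ngrenew usage are assumptions:

  $ grid-proxy-init -hours 12   # create a short-lived proxy from the user certificate
  $ grid-proxy-info             # check the remaining proxy lifetime
  $ ngrenew <jobid>             # renew the user's proxy for an already submitted job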

Information system
- Globus MDS is known to be unreliable
  – NorduGrid has provided several patches, but the implementation is still unsatisfactory
- Has a hierarchical rather than a mesh structure
  – A single failed machine can render an entire country's resources unavailable
- Does not use caching
  – Frequent timeouts
- Anonymous access
  – Potentially sensitive information is unprotected

Data management system
- There is no DMS on the market satisfying a HEP experiment's requirements – a global problem
- NorduGrid supports only the most basic Globus Replica Catalog functionality
  – No Storage Element concept
  – No mass storage support
  – Centralized – a single point of failure
  – Slow
  – Questionable scalability
  – No VO-based access control
- A "Clever SE" is being developed

Front-end functionality
- The GridFTP protocol used lacks a lot of functionality
- So far, it is interfaced only to PBS
- A shared file system between the front-end and the worker nodes is still required
- No interactivity
  – Does not provide any channel between a running job and the user

Workload management
- Resource brokering is nothing more than user request matching
  – No pre-allocation
  – No cost evaluation
  – No load balancing
- No rescheduling and/or resubmission
  – Should rely on the "logger"
- No VO-based management
- No cross-cluster parallelism

Installation and configuration
- Provided as half a dozen separate packages (tarballs or RPMs)
  – Plus twice that number of external packages, such as patched Globus, patched perl-ldap etc.
- No installer provided
- No installation instructions provided
- Configuration is a very time-consuming task
  – An XML-based configuration procedure is being considered

Manpower
- No full-time developers at the moment
  – Bug fixing is the major activity before the first release
  – 130+ bugs and enhancement requests registered since May 2003
- No researchers from computer science are volunteering to get involved
- Reduced user support
- Some contribution via EGEE is anticipated

Summary
- The NorduGrid pre-release (currently ) works reliably
- Release 1.0 is slowly but surely on its way; many fixes are needed
- Much functionality will still be missing from 1.0
  – Most of that functionality is missing from other Grids as well – the problem is global
- We welcome new users and resources
  – As soon as the political issues are solved, users "enjoy the power"
  – Beware: support is somewhat limited at the moment
- Future: the Nordic countries are committed to using and supporting NorduGrid
  – Some resources (Sweden) have to use LCG-1, but plan to use NorduGrid on top (or in parallel)
  – The project is open for members and contributions from everywhere