1 DIRAC WMS & DMS
A.Tsaregorodtsev, CPPM, Marseille
ICFA Grid Workshop, 15 October 2006, Sinaia

2 Introduction
- DIRAC is a distributed data production and analysis system for the LHCb experiment
- It includes workload and data management components
  - Uses LCG services whenever possible
- It was originally developed for the MC data production tasks; the goals were to:
  - Integrate all the heterogeneous computing resources available to LHCb
  - Minimize human intervention at the LHCb sites
- The resulting design led to an architecture based on a set of services and a network of light distributed agents

3 History
- The DIRAC project started in September 2002
- First production in the fall of 2002
- PDC1 in March-May 2003 was the first successful massive production run
- Complete rewrite of DIRAC by DC2004 in May 2004, incorporating the LCG resources (DIRAC2)
  - Keeping the same architecture
- DIRAC was extended for distributed analysis tasks in autumn 2005
DIRAC – Distributed Infrastructure with Remote Agent Control

4 Production with DataGrid (Dec 2002)
[Diagram (Eric van Herwijnen): the Workflow Editor and Production Editor define and instantiate workflows in the Production DB; the Production Server handles job requests and sends jobs with an input sandbox (job + ProdAgent) to a DataGRID CE, where the DataGRID Agent returns status updates, production data and bookkeeping updates.]

5 DIRAC Services, Agents and Resources
[Architecture diagram: Services (DIRAC Job Management Service, JobMonitorSvc, JobAccountingSvc, ConfigurationSvc, FileCatalogSvc, BookkeepingSvc, MessageSvc) accessed by clients such as the Production Manager, GANGA, the DIRAC API, the job monitor, the BK query web page and the FileCatalog browser; Agents; Resources (LCG Grid, WNs, site gatekeepers, Tier1 VO-boxes).]

6 DIRAC Services
- DIRAC Services are permanent processes, deployed centrally or running on the VO-boxes, that accept incoming connections from clients (UIs, jobs, agents)
- Reliable and redundant deployment
  - Run under a watchdog process for automatic restart on failure or reboot
  - Critical services have mirrors for extra redundancy and load balancing
- Secure service framework
  - DISET: XML-RPC protocol for client/service communication with GSI authentication and fine-grained authorization based on user identity, groups and roles
  - PyOpenSSL module updated to deal with the GSI security tools
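The DISET framework itself is DIRAC-specific, but the underlying client/service pattern can be illustrated with plain XML-RPC. This is a minimal sketch in standard Python, without the GSI authentication layer; the service method, the port number and the returned fields are illustrative assumptions, not the actual DIRAC interface.

from xmlrpc.server import SimpleXMLRPCServer

# Toy "job monitoring" style service: a permanent process accepting client connections.
def getJobStatus(job_id):
    # In a real service the status would come from the job database; here it is canned.
    return {"JobID": job_id, "Status": "Running"}

server = SimpleXMLRPCServer(("0.0.0.0", 9130), allow_none=True)  # port is arbitrary
server.register_function(getJobStatus, "getJobStatus")
server.serve_forever()

# A client (UI, job or agent) would call it with:
#   import xmlrpc.client
#   print(xmlrpc.client.ServerProxy("http://localhost:9130").getJobStatus(1234))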

7 Configuration service
- The master server at CERN is the only one allowing write access
- Redundant system with multiple read-only slave servers running on VO-boxes at the sites for load balancing and high availability
- Slaves are updated automatically from the master information
- A watchdog restarts the server in case of failures
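On the client side, such a redundant layout typically means writing to the master and spreading reads over the read-only mirrors. A minimal sketch under that assumption; the host names and the getOption/setOption methods are hypothetical, not the DIRAC configuration client API.

import random
import xmlrpc.client

MASTER = "http://cs-master.example.org:9135"        # hypothetical master URL
SLAVES = ["http://vobox1.example.org:9135",          # hypothetical read-only mirrors
          "http://vobox2.example.org:9135"]

def read_option(path):
    """Read a configuration option from any available read-only slave."""
    for url in random.sample(SLAVES, len(SLAVES)):   # simple load balancing
        try:
            return xmlrpc.client.ServerProxy(url).getOption(path)
        except OSError:
            continue                                 # unreachable: try the next slave
    # Fall back to the master if all slaves are unreachable
    return xmlrpc.client.ServerProxy(MASTER).getOption(path)

def write_option(path, value):
    """Writes are only accepted by the master server."""
    return xmlrpc.client.ServerProxy(MASTER).setOption(path, value)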

8 WMS Service
- The DIRAC Workload Management System is itself composed of a set of central services, pilot agents and job wrappers
- It realizes the PULL scheduling paradigm (see the sketch below)
  - Pilot agents deployed on the LCG Worker Nodes pull jobs from the central Task Queue
  - The central Task Queue makes it easy to apply the VO policies by prioritizing the user jobs
    - Using the accounting information and the user identities, groups and roles
  - Job scheduling is late: a job goes to a resource only for immediate execution
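A minimal sketch of the pull/match idea behind the central Task Queue (the job and resource fields and the priority handling are simplified illustrations, not the actual Matcher implementation):

import heapq

class TaskQueue:
    """Jobs wait in a priority-ordered queue; pilots pull the best match for their slot."""
    def __init__(self):
        self._queue = []          # (negative priority, insertion order, job) tuples
        self._counter = 0

    def add_job(self, job, priority):
        heapq.heappush(self._queue, (-priority, self._counter, job))
        self._counter += 1

    def request_job(self, resource):
        """Called by a pilot agent: return the highest-priority job the resource can run."""
        kept, match = [], None
        while self._queue:
            prio, order, job = heapq.heappop(self._queue)
            if (job["MaxCPUTime"] <= resource["MaxCPUTime"]
                    and job.get("Site") in (None, resource["Site"])):
                match = job
                break
            kept.append((prio, order, job))          # requirements not met: keep the job queued
        for item in kept:                            # put back the non-matching jobs
            heapq.heappush(self._queue, item)
        return match

# Usage:
#   tq = TaskQueue()
#   tq.add_job({"JobID": 1, "MaxCPUTime": 50000, "Site": None}, priority=10)
#   tq.request_job({"Site": "LCG.CERN.ch", "MaxCPUTime": 100000})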

9 DIRAC workload management
[Diagram of the central WMS components: Job Receiver, Job Database, Optimizers (Prioritizer, Data Optimizer, ...), a Priority Calculator fed by the Accounting Service (LHCb policy, quotas) and VOMS info, Task Queues holding the job requirements, ownership and priority, the Agent Director with its Agents, the Match Maker, and the Resources (WNs).]

10 Other Services
- Job monitoring service
  - Receives job heartbeats and status reports
  - Serves the job status to clients (users) through web and scripting interfaces
- Bookkeeping service
  - Receives, stores and serves job provenance information
- Accounting service
  - Receives accounting information for each job
  - Generates reports per time period, specific production or user group
  - Provides the information needed for policy decisions

11 DIRAC Agents
- Light, easy-to-deploy software components running close to a computing resource to accomplish specific tasks
  - Written in Python; only the interpreter is needed for deployment
  - Modular and easily configurable for specific needs
  - Run in user space
  - Use only outbound connections
- Agents based on the same software framework are used in different contexts
  - Agents for centralized operations at CERN, e.g. the Transfer Agents used in the SC3 data transfer phase and the production system agents
  - Agents on the LHCb VO-boxes
  - Pilot Agents deployed as LCG jobs
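In essence an agent is a periodic polling loop that uses only outbound connections to the central services. A minimal sketch of that pattern; the class names and the execute() contract are illustrative, not the DIRAC agent framework.

import time

class Agent:
    """Minimal agent skeleton: runs in user space and polls central services periodically."""
    polling_time = 120     # seconds between execution cycles

    def execute(self):
        # Subclasses implement one unit of work here, e.g. checking a queue
        # of pending transfer requests and retrying the failed ones.
        raise NotImplementedError

    def run(self):
        while True:
            try:
                self.execute()
            except Exception as exc:      # keep the agent alive on transient errors
                print("cycle failed:", exc)
            time.sleep(self.polling_time)

class TransferAgent(Agent):
    """Illustrative agent that would retry pending data-transfer requests."""
    def execute(self):
        print("checking pending transfer requests...")

if __name__ == "__main__":
    TransferAgent().run()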

12 DIRAC workload management
[Job flow diagram: the Job Receiver takes the job JDL and input sandbox into the JobDB; the Data Optimizer checks the data (getReplicas from the LFC) and fills the Task Queue; the Agent Director (checkJob) submits Pilot Jobs via the RB to a CE; on the WN the Pilot Agent obtains the job JDL from the Matcher and the Job Wrapper forks the User Application under glexec; outputs are uploaded to an SE and requests are put on the VO-box; the Agent Monitor (checkPilot), Job Monitor (getSandbox) and WMS Admin (getProxy) services complete the picture.]

13 Community Overlay Network
[Diagram: the DIRAC WMS with its Task Queue, Monitoring and Logging services overlaid on the GRID and connected to Pilot Agents running on the WNs.]
- DIRAC Central Services and Pilot Agents form a dynamic distributed system that is as easy to manage as an ordinary batch system
- Uniform view of the resources independent of their nature (grids, clusters, PCs)
- Prioritization according to VO policies and accounting
- Possibility to reuse batch system tools, e.g. the Maui scheduler

14 WMS optimizations with Pilot agents
- Combining pilot agents running right on the WNs with the central Task Queue allows fine optimization of the workload at the VO level
- The WN reserved by the pilot agent is a first-class resource: there is no more uncertainty due to delays in the local batch queue
- The pilot agent can perform different scenarios of user job execution (one is sketched below):
  - Filling the time slot with more jobs
  - Running complementary jobs in parallel
  - Preemption of a low-priority job
  - Etc.
- Especially interesting for the Distributed Analysis activity
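A minimal sketch of the "filling the time slot" scenario (the matcher endpoint, the requestJob call and the time bookkeeping are assumptions for illustration): after each payload finishes, the pilot asks for another job as long as enough of the reserved slot remains.

import time
import xmlrpc.client

matcher = xmlrpc.client.ServerProxy("http://dirac-wms.example.org:9170")  # hypothetical

def fill_time_slot(slot_seconds, safety_margin=600):
    """Keep pulling jobs until the remaining wall-clock time is too short."""
    deadline = time.time() + slot_seconds
    while True:
        remaining = deadline - time.time()
        if remaining < safety_margin:
            break                                    # not enough time left for another payload
        job = matcher.requestJob({"MaxCPUTime": int(remaining)})
        if not job:
            break                                    # nothing in the queue fits the slot
        run_payload(job)                             # execute the user job in a wrapper

def run_payload(job):
    print("running job", job.get("JobID"))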

15 Dealing with failures
- Typically MC data produced at a Tier2 centre is stored in the corresponding Tier1 SE
- If the job fails to store and/or register its output data (see the sketch below):
  - The data is stored in one of the other Tier1 SEs
  - A data replication request is sent to one of the VO-boxes for a later retry
- Other failed operations can also result in setting a request on the VO-boxes
  - Bookkeeping metadata
  - Some job state reports
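A minimal sketch of this failover logic; upload_to, register_replica and put_request are hypothetical placeholders standing in for the real storage, catalogue and request clients.

def upload_to(se, local_file, lfn):
    print("uploading", local_file, "to", se, "as", lfn)
    return True

def register_replica(lfn, se):
    print("registering replica of", lfn, "at", se)
    return True

def put_request(vo_box, request):
    print("queuing request on", vo_box, ":", request)

def store_output(local_file, lfn, home_se, other_tier1_ses, vo_box):
    """Try the 'home' Tier1 SE first; fall back to other Tier1 SEs and
    leave a replication request on a VO-box so the data ends up where it belongs."""
    if upload_to(home_se, local_file, lfn) and register_replica(lfn, home_se):
        return home_se
    for se in other_tier1_ses:
        if upload_to(se, local_file, lfn) and register_replica(lfn, se):
            # Data is safe at a fallback SE; ask a VO-box agent to replicate it
            # to the intended SE later and retry the registration if needed.
            put_request(vo_box, {"Operation": "ReplicateAndRegister",
                                 "LFN": lfn, "TargetSE": home_se})
            return se
    raise RuntimeError("could not store %s at any Tier1 SE" % lfn)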

16 VO-boxes
- LHCb VO-boxes are machines offered by the Tier1 sites to ensure the safety and efficiency of the grid operations
  - The standard LCG software is maintained by the site managers
  - The LHCb software is maintained by the LHCb administrators
  - They handle recovery of failed data transfers and bookkeeping operations
- VO-boxes behave in a completely non-intrusive way
  - They access the site grid services via standard interfaces
- Main advantage: geographical distribution
- VO-boxes are now set up in Barcelona, Lyon, RAL and CERN; more boxes will be added as necessary
- Any job can set requests on any VO-box in a round-robin way for redundancy and load balancing

17 DM Components
- DIRAC Data Management tools are built on top of, or provide interfaces to, the existing services
- The main components are:
  - Storage Element client and storage access plug-ins (SRM, GridFTP, HTTP, SFTP, FTP, ...)
  - Replica Manager: high-level operations
    - Uploading, replication, registration
    - Best replica finding
    - Failure retries with alternative data access methods (sketched below)
  - File Catalogs: LFC, Processing Database
  - High-level tools for automated bulk data transfers
- See the presentation by A.C. Smith
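The "failure retries with alternative data access methods" idea can be sketched as a loop over protocol plug-ins sharing one interface; the plug-in classes below are illustrative stubs, not the actual DIRAC storage plug-ins.

class StoragePlugin:
    """Common interface implemented by the protocol-specific plug-ins."""
    protocol = "none"
    def get_file(self, surl, destination):
        raise NotImplementedError

class SRMPlugin(StoragePlugin):
    protocol = "srm"
    def get_file(self, surl, destination):
        print("srm copy", surl, "->", destination)
        return True

class GridFTPPlugin(StoragePlugin):
    protocol = "gsiftp"
    def get_file(self, surl, destination):
        print("gridftp copy", surl, "->", destination)
        return True

def download_with_retries(surl, destination, plugins=(SRMPlugin(), GridFTPPlugin())):
    """Try each access method in turn until one of them succeeds."""
    for plugin in plugins:
        try:
            if plugin.get_file(surl, destination):
                return plugin.protocol
        except Exception as exc:
            print(plugin.protocol, "failed:", exc)
    raise RuntimeError("all access methods failed for " + surl)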

18 File Catalog
- In the past several catalogues were tried out in the same framework
  - AliEn File Catalog, Bookkeeping replica tables
- The LFC has now been chosen as the main catalogue
  - A single master LFC instance accepts write access
  - Multiple read-only replicas are foreseen
    - All sharing the same complete replica information
    - Synchronized by the underlying ORACLE streaming replication mechanism
    - Accessed in a load-balanced, round-robin way
- Other File Catalogs can be used, sharing the same interface
  - The Processing Database is used as a File Catalog: a specialized catalogue with the capability to trigger actions when new data is registered
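The "same interface" point can be sketched as a common catalogue API behind which the LFC, the Processing Database or any other back-end can sit; the class and method names are assumptions for illustration, not the actual DIRAC catalogue interface.

class FileCatalogBase:
    """Minimal common interface shared by all catalogue back-ends."""
    def addFile(self, lfn, se, size):
        raise NotImplementedError
    def getReplicas(self, lfn):
        raise NotImplementedError

class LFCCatalog(FileCatalogBase):
    """Would talk to the LFC; here the calls are only stubbed out."""
    def addFile(self, lfn, se, size):
        print("LFC: register", lfn, "at", se)
    def getReplicas(self, lfn):
        return []

class ProcessingDBCatalog(FileCatalogBase):
    """Registering a file can trigger a production action (e.g. creating a job)."""
    def addFile(self, lfn, se, size):
        print("ProcessingDB: new file", lfn, "- check transformations")
    def getReplicas(self, lfn):
        return []

def register_everywhere(catalogs, lfn, se, size):
    # Data Management clients loop over all configured catalogues, so the
    # Processing Database is populated as a side effect of normal registration.
    for catalog in catalogs:
        catalog.addFile(lfn, se, size)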

19 DIRAC user interfaces
- Command line with a generic JDL for workload description:
  dirac-proxy-init
  dirac-job-submit
  dirac-job-get-status
  dirac-job-get-output
  dirac-job-get-logging-info
  ...
- Other commands:
  - Data manipulation (copy, copyAndRegister, replicate, etc.)
  - Services administration
- Example JDL:
  Executable = "/bin/cat";
  Arguments = "MyFile";
  StdOutput = "std.out";
  StdError = "std.err";
  InputSandbox = {"MyFile"};
  OutputSandbox = {"std.out","std.err"};
  Requirements = MaxCPUTime > 10;
  Site = "LCG.CERN.ch";
  Priority = 50;

20 The DIRAC API

from DIRAC.Client.Dirac import *
dirac = Dirac()
job = Job()
job.setApplication('DaVinci', 'v12r15')
job.setInputSandbox(['DaVinci.opts', 'lib '])
job.setInputData(['/lhcb/production/DC04/DST/ _ _10.dst'])
job.setOutputSandbox(['DVNtuples.root', 'DaVinci_v12r15.log'])
jobid = dirac.submit(job, verbose=1)
print "Job ID = ", jobid

- The DIRAC API provides a transparent way for users to submit production or analysis jobs to LCG
  - Jobs can be single applications or complicated DAGs
- While it may be exploited directly, the DIRAC API also serves as the interface for the GANGA Grid front-end to perform distributed user analysis for LHCb

21 Job Monitoring

22 Job Monitoring

23 Job Monitoring
- The job monitoring pages are rather simple but functional
  - They monitor the DIRAC jobs but not the LCG Pilot Agent jobs
  - The use of the Dashboard for LCG job monitoring is being studied
- A MonALISA client is incorporated into DIRAC as well
  - Not used so far, mainly for lack of manpower
  - Interested in using the MonALISA system for monitoring complex system states: service availability and status, system use patterns

24 DIRAC on Windows
- DIRAC is implemented (almost) entirely in Python, so porting to Windows was relatively easy
  - No Globus libraries
  - Security: PyOpenSSL + OpenSSL
  - GridFTP client from the .NetGridFTP project (needs .Net installed)
- User Interface part
  - Full job submission to DIRAC/LCG, monitoring and output retrieval
  - Full Windows-based analysis chain with Bender (data analysis in Python)
- Agent part
  - Getting jobs and executing them on the Windows PC
    - Problems getting Gauss/GEANT4 to run on Windows; Boole, Brunel and DaVinci are OK
- Studying the practical use of the Windows resources

25 DIRAC infrastructure
- 4 instances of the WMS service at CERN
  - Test, production, user, data
  - Plan to merge the production and the user systems
  - A single server for each instance, except for production where the Director Agent runs on a separate machine
- VO-boxes at all 6 Tier1 centres and at CERN
- LCG services dedicated to LHCb
  - 2 dedicated RBs, plus a gLite RB soon
  - LFC write and read-only instances
  - A classic SE for log file storage

26 Processing Database
- A suite of Production Manager tools to facilitate the routine production tasks:
  - Define complex production workflows
  - Manage large numbers of production jobs
- Transformation Agents prepare data (re)processing jobs automatically as soon as the input files are registered in the Processing Database via a standard File Catalog interface (sketched below)
- Minimizes human intervention and speeds up standard production
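A minimal sketch of that mechanism (the in-memory database, the grouping policy and create_job are illustrative assumptions): newly registered files matching a transformation are grouped and turned into processing jobs automatically.

class ProcessingDB:
    """In-memory stand-in for the real Processing Database."""
    def __init__(self):
        self.files = []                  # [lfn, assigned_job] pairs
    def add_file(self, lfn):             # called through the File Catalog interface
        self.files.append([lfn, None])
    def get_unused_files(self, prefix):
        return [f for f in self.files if f[1] is None and f[0].startswith(prefix)]
    def mark_assigned(self, group, job_id):
        for f in group:
            f[1] = job_id

def create_job(workflow, input_data):
    print("creating", workflow, "job for", len(input_data), "files")
    return 42                            # would be the new job ID

def transformation_cycle(db, prefix="/lhcb/production/", group_size=2):
    """One cycle of a hypothetical Transformation Agent: group newly
    registered files and turn each full group into a processing job."""
    new_files = db.get_unused_files(prefix)
    for i in range(0, len(new_files) // group_size * group_size, group_size):
        group = new_files[i:i + group_size]
        job_id = create_job("Reconstruction", [f[0] for f in group])
        db.mark_assigned(group, job_id)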

27 DIRAC production performance
- Up to 8000 simultaneous production jobs
  - The throughput is only limited by the capacity available on LCG
- ~80 distinct sites accessed through LCG or through DIRAC directly

28 Conclusions
- DIRAC has grown into a versatile and flexible system for managing a community workload running on a variety of computing resources
- The Overlay Network paradigm employed by DIRAC has proved efficient for integrating heterogeneous resources into a single reliable system for simulation data production
- The system has now been extended to deal with the Distributed Analysis tasks
- DIRAC is becoming a complete system, but a lot of development and tidying up still lies ahead

29 Pilot agents
- Pilot agents are deployed on the Worker Nodes as regular jobs using the standard LCG scheduling mechanism
  - Together with the central services they form a distributed Workload Management system
- Once started on the WN, the pilot agent performs some checks of the environment
  - Measures the CPU benchmark and the available disk and memory space
  - Installs the application software
- If the WN is OK, a user job is retrieved from the central DIRAC Task Queue and executed
- At the end of the execution, some operations can be requested to be done asynchronously on the VO-box to complete the job
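A minimal sketch of this startup sequence; the disk-space threshold and the helper functions are illustrative stand-ins for the real checks, software installation and Task Queue calls.

import os
import shutil

def check_environment(min_disk_gb=5):
    """Basic sanity checks of the worker node before asking for a payload."""
    free_gb = shutil.disk_usage(os.getcwd()).free / 1e9
    return free_gb >= min_disk_gb        # a real pilot also benchmarks the CPU and checks memory

def pilot_main():
    if not check_environment():
        report("aborted: worker node does not pass the checks")
        return
    install_software()                   # set up the application software locally
    job = request_job()                  # pull a matching job from the central Task Queue
    if job:
        execute(job)
        flush_pending_requests()         # e.g. failover operations handed to a VO-box

# Stubs standing in for the real operations:
def report(msg): print(msg)
def install_software(): print("installing application software")
def request_job(): return {"JobID": 1}
def execute(job): print("executing job", job["JobID"])
def flush_pending_requests(): print("sending pending requests to the VO-box")

if __name__ == "__main__":
    pilot_main()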

30 Distributed Analysis
- The Pilot Agent paradigm was recently extended to the Distributed Analysis activity
- The advantages of this approach for users are:
  - Inefficiencies of the LCG grid are completely hidden from the users
  - Fine optimization of the job turnaround
  - It also reduces the load on the LCG WMS
- The system has been demonstrated to serve dozens of simultaneous users with a submission rate of about 2 Hz
  - The limitation is mainly in the capacity of the LCG RB to schedule this number of jobs