DIRAC: Workload Management System
Garonne Vincent, Tsaregorodtsev Andrei (Centre de Physique des Particules de Marseille); Stokes-Rees Ian (University of Oxford)

Presentation transcript:

The LHCb Experiment
- LHCb is a particle physics detector at CERN.
- It will generate data at 40 MB/s from 2007, that is 3.4 TB/day.
- 500 physicists, 60 institutes/sites.
[Detector schematic: Vertex, Shielding, Tracker, RICH-1, RICH-2, Calorimeters, Coil, Muon Detector, Muon Yoke]

Data Challenge 2004 (DC04)
- Production: Monte Carlo simulation.
- Goal: 10% of the final real data rate, 60 TB of simulated and reconstructed data.
- DC04 goal: execute 200 K coarse-grain jobs on ~3000 worker nodes (WN), with access to heterogeneous computing systems, including LCG, continuously during 4 months.

Proposal
- A scheduler/workload management system for "High Throughput Computing": flexible, scalable and robust.
- Implemented as a service in DIRAC*.
*See "DIRAC – the distributed production and analysis for LHCb", Track 4 – Distributed Computing Services, for details.

Job Management Services - Workload Management System (WMS)
- User: submits jobs to the system. A job is described with the Job Description Language (JDL).
- Job Receiver: receives jobs, inserts the JDL into the Job Database, and notifies the Optimizers that a new job has arrived via an instant messaging mechanism.
- Optimizers: pluggable modules. Generic optimizer functionality: extract the new job from the database on notification, insert it into a task queue with a particular rank, and sort jobs into different task queues, e.g. by site, job type or requirements. (A minimal sketch of this flow is given below.)
- Match-Maker: matches the job JDL against resource capabilities, taking task queue priorities into account.

Agent
- Participates in the scheduling decision; pulls a job from the Job Management Services; sets up the environment common to the whole site.
- Implemented as a set of pluggable modules.
- Scheduling module: if the resource is available, asks the Match-Maker for a job, providing a description of its corresponding CE, and sends the job to the CE, e.g. a batch system. The criterion of resource availability depends on the nature of the CE. (See the pull-loop sketch after this section.)
- Realizes the "pull" scheduling paradigm and is able to work with the Data Management tools.
- A Virtual Organisation/Community scheduler: DIRAC's aim is to consume all the LHCb resources accessible, whatever their nature.
[Architecture diagram: a meta-scheduling, 3-tier architecture: Job Management Services, Agent, JobWrapper]
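To make the generic Optimizer behaviour above concrete, here is a minimal sketch in Python (the language of the implementation) of a job database and an optimizer that, on notification, ranks a new job and files it into a per-site task queue. All class and method names (JobDB, TaskQueueOptimizer, rank_job) and the ranking rule are illustrative assumptions, not the actual DIRAC code.

```python
# Minimal sketch of the generic Optimizer flow described above.
# All names (JobDB, TaskQueueOptimizer, rank_job) are illustrative,
# not the real DIRAC classes.
from collections import defaultdict


class JobDB:
    """Toy in-memory stand-in for the MySQL Job Database."""

    def __init__(self):
        self.jobs = {}

    def insert(self, job_id, jdl):
        self.jobs[job_id] = {"jdl": jdl, "status": "received"}

    def get(self, job_id):
        return self.jobs[job_id]


class TaskQueueOptimizer:
    """Pluggable optimizer: on notification, rank the new job and
    place it into a per-site task queue."""

    def __init__(self, job_db):
        self.job_db = job_db
        self.task_queues = defaultdict(list)   # site -> [(rank, job_id)]

    def on_new_job(self, job_id):
        """Called when the Job Receiver notifies us (via instant
        messaging in DIRAC) that a new job has arrived."""
        job = self.job_db.get(job_id)
        site = job["jdl"].get("Site", "ANY")
        rank = self.rank_job(job["jdl"])
        self.task_queues[site].append((rank, job_id))
        self.task_queues[site].sort(reverse=True)   # highest rank first
        job["status"] = "waiting"

    def rank_job(self, jdl):
        # Illustrative ranking rule only: favour Monte Carlo simulation jobs.
        return 100 if jdl.get("JobType") == "MCSimulation" else 10


if __name__ == "__main__":
    db = JobDB()
    opt = TaskQueueOptimizer(db)
    db.insert(1, {"JobType": "MCSimulation", "Site": "ANY"})
    opt.on_new_job(1)
    print(opt.task_queues["ANY"])   # [(100, 1)]
```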

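The Agent's scheduling module reduces to a simple pull loop: check that the local Computing Element has capacity, ask the Match-Maker for a job that matches the CE description, and hand the matched job to the CE. The sketch below uses Python's standard xmlrpc.client, in line with the XML-RPC service interfaces described in the next section; the service URL, the requestJob method name and the CE description fields are assumptions for illustration, not the real DIRAC endpoints.

```python
# Sketch of the Agent's pull-scheduling loop: if the local Computing
# Element has free capacity, ask the Match-Maker for a suitable job.
# The service URL and the requestJob method name are illustrative
# assumptions, not the real DIRAC interface.
import time
import xmlrpc.client

MATCHMAKER_URL = "http://wms.example.org:8080/MatchMaker"   # hypothetical


def ce_description():
    """Description of the local CE sent with each match request (fields are examples)."""
    return {"Site": "DIRAC.Example.org", "MaxCPUTime": 48 * 3600,
            "Platform": "slc3_ia32"}


def ce_has_free_slots():
    """Availability criterion: for a batch CE this would query the local
    queue length, for a standalone PC simply the machine load."""
    return True   # placeholder


def submit_to_local_ce(job):
    """Hand the matched job to the local CE, e.g. a batch system submit command."""
    print("would submit job", job.get("JobID"), "to the local batch system")


def run_agent(poll_interval=60):
    """Main pull loop; in a real deployment this would run as a daemon."""
    matchmaker = xmlrpc.client.ServerProxy(MATCHMAKER_URL)
    while True:
        if ce_has_free_slots():
            job = matchmaker.requestJob(ce_description())   # hypothetical call
            if job:
                submit_to_local_ce(job)
        time.sleep(poll_interval)
```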
Implementation: Service Oriented Architecture
- Services are exposed via a simple XML-RPC interface, accessible over HTTP.
- 99% Python.
- Uses Condor ClassAds and the Condor Matchmaker for the job/resource description language and the match-making operation.
- MySQL for the Job Database, job queues and job status.
- Internal interfaces use instant messaging technologies*.
*See "Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC", Track 4 – Distributed Computing Services, for details.

Performance in DC'04: Results*
- Deployed on 20 "normal" and 40 "LCG" sites.
- Effectively saturated LCG and all available computing resources during the 2004 Data Challenge.
- Supported 4000 simultaneous jobs across 60 sites.
- Produced, transferred and replicated 58 TB of data, plus meta-data.
- Consumed over 400 CPU years during the last 3 months.
*See "Results of the LHCb experiment Data Challenge 2004", Track 5 – Distributed Computing Systems and Experiences, for details.

Matching Time
- Averaged 420 ms match time over 60,000 jobs.
- Queued jobs are grouped by category and matches are performed by category.
- Typically 1,000 to 20,000 jobs queued.

Computing Element (CE)
- A uniform abstraction of a computing resource, described by a resource description language (RDL).
- Offers a simple service API: SubmitJob(), KillJob(), JobStatus(). (A minimal sketch of the simplest backend is given at the end of this transcript.)
- CE interfaces are available for PBS, LSF, BQS, NQS, Sun Grid Engine, Condor-C, Condor, a standalone PC, LCG and others.
- Typical job: 2 GB of local storage, MB of data transferred at the end, hours of execution.

Agent Deployment: Two Modes
- Static: an agent deployed on a gateway submits jobs to the local computing system; it is under the control of the local site administrator.
- Dynamic: jobs are submitted which deploy agents on the worker nodes; once the agent is on the WN, it acts like a normal agent on a "virtual" DIRAC site. The LCG integration is an example of this, from the technical point of view.

The Job Wrapper
- The job is wrapped in a script, called the job wrapper, and then submitted to the local queue; this script steers the job execution on the worker node.
- Main operations: monitoring and accounting messages*, I/O data and sandbox handling; it can also provide a connection for interaction with the user who owns the job.
*See "A Lightweight Monitoring and Accounting System for LHCb DC04 Production", Track 4 – Distributed Computing Services, for details.

Conclusion
- The concept of light, easy-to-customize and easy-to-deploy agents as components of a distributed workload management system proved to be very useful.
- The scalability of the system made it possible to saturate all available resources during the recent Data Challenge exercise.
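The uniform CE API (SubmitJob, KillJob, JobStatus) is what lets the same Agent drive PBS, LSF, Condor or a standalone PC. Below is a minimal sketch of the simplest backend, the standalone PC case, running the job wrapper as a local subprocess; only the three method names come from the poster, while the class layout and behaviour are illustrative assumptions.

```python
# Sketch of the uniform Computing Element API (SubmitJob, KillJob,
# JobStatus) for the simplest backend: a standalone PC running the
# job wrapper as a local subprocess. Only the three method names
# come from the poster; the rest is illustrative.
import signal
import subprocess


class StandalonePCComputingElement:
    def __init__(self):
        self._processes = {}        # job_id -> Popen handle
        self._next_id = 0

    def SubmitJob(self, executable, arguments=()):
        """Start the job wrapper locally and return a job identifier."""
        self._next_id += 1
        proc = subprocess.Popen([executable, *arguments])
        self._processes[self._next_id] = proc
        return self._next_id

    def KillJob(self, job_id):
        """Terminate a running job, if it is still alive."""
        proc = self._processes.get(job_id)
        if proc and proc.poll() is None:
            proc.send_signal(signal.SIGTERM)

    def JobStatus(self, job_id):
        """Report the state of a previously submitted job."""
        proc = self._processes.get(job_id)
        if proc is None:
            return "Unknown"
        code = proc.poll()
        if code is None:
            return "Running"
        return "Done" if code == 0 else "Failed"


if __name__ == "__main__":
    ce = StandalonePCComputingElement()
    jid = ce.SubmitJob("/bin/sleep", ["2"])
    print(ce.JobStatus(jid))   # Running
```

Backends for real batch systems would keep the same three-method interface but translate them into the corresponding submit, kill and status commands of PBS, LSF, Condor, etc., which is exactly what makes the CE abstraction uniform from the Agent's point of view.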