Published by Cynthia Joseph. Modified over 8 years ago.
DIRAC: Workload Management System

Garonne Vincent, Tsaregorodtsev Andrei, Centre de Physique des Particules de Marseille
Stokes-Rees Ian, University of Oxford
http://dirac.cern.ch

The LHCb Experiment
- LHCb is a particle physics detector at CERN.
- From 2007 it will generate data at 40 MB/s, which is about 3.4 TB/day.
- 500 physicists, 60 institutes/sites.

[Figure: LHCb detector. Labels: Muon Yoke, Vertex, Shielding, Tracker, Calorimeters, RICH-1, RICH-2, Coil, Muon Detector]

Data Challenge 2004 (DC04)
- Production: Monte Carlo simulation.
- Goal: 10% of the final real data.
- 60 TB of simulated and reconstructed data.

DC04 Goal
- Execute 200 K coarse-grain jobs.
- ~3000 worker nodes (WN).
- Access heterogeneous computing systems, including LCG, continuously during 4 months.

Proposal
- A scheduler/Workload Management System (WMS) for « High Throughput Computing »: flexible, scalable and robust.
- Implemented as a service in DIRAC*.
- A meta-scheduling, 3-tier architecture:
  - Job Management Services
  - Agent
  - JobWrapper
- Realizes the "pull" scheduling paradigm.
- Able to work with Data Management tools.
- A Virtual Organisation/Community scheduler.
- DIRAC's aim is to consume all the accessible LHCb resources, whatever their nature.

*See « DIRAC – The distributed production and analysis for LHCb », Track 4 – Distributed Computing Services, for details.

User
- Submits jobs to the system.
- Each job is described with the Job Description Language (JDL).

Job Management Services

Job Receiver
- Receives jobs.
- Inserts the JDL into the Job Database.
- Notifies the Optimizers, via an instant messaging mechanism, that a new job has arrived.

Optimizers
- Optimizers are pluggable modules.
- Generic optimizer functionality:
  - Extracts the new job from the DB on notification.
  - Inserts it into a task queue with a particular rank.
- Sorts jobs into different task queues, e.g. by site, job type or requirement.

Match-Maker
- Matches the job JDL against resource capabilities, taking task queue priorities into account.

Agent
- Participates in the scheduling decision.
- Pulls a job from the Job Management Services (JMS).
- Sets the environment common to the whole site.
- Implemented as a set of pluggable modules.

Scheduling Module (example of an Agent module)
- If the resource is available, asks the Match-Maker for a job, providing a description of its corresponding Computing Element (CE).
- Sends the job to the CE, e.g. a batch system.
- The criterion of resource availability depends on the nature of the CE.
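The pull cycle described above (optimizers rank jobs into task queues, an agent with a free resource asks the Match-Maker for work) can be illustrated with a minimal Python sketch. It is not DIRAC code: the names `Job`, `MatchMaker`, `enqueue`, `request_job` and `agent_cycle` are hypothetical, and a plain Python predicate over a dictionary stands in for the JDL and resource descriptions.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    """Hypothetical job record; `requirements` stands in for the JDL."""
    job_id: int
    job_type: str
    requirements: Callable[[dict], bool]


class MatchMaker:
    """Toy match-maker holding ranked task queues (assumed design)."""

    def __init__(self):
        self.task_queues = {}              # queue name -> heap of entries
        self._counter = itertools.count()  # tie-breaker: insertion order

    def enqueue(self, queue_name, job, rank):
        # Optimizer step: place the job in a task queue with a rank.
        entry = (-rank, next(self._counter), job)
        heapq.heappush(self.task_queues.setdefault(queue_name, []), entry)

    def request_job(self, resource) -> Optional[Job]:
        # For each queue, find its best-ranked job that fits the resource.
        candidates = []
        for queue in self.task_queues.values():
            for entry in sorted(queue):
                if entry[2].requirements(resource):
                    candidates.append((entry, queue))
                    break
        if not candidates:
            return None
        # Hand out the highest-ranked candidate across all queues.
        entry, queue = min(candidates, key=lambda c: c[0][:2])
        queue.remove(entry)
        heapq.heapify(queue)
        return entry[2]


def agent_cycle(match_maker, resource, free_slots):
    """Agent step: pull a job only when the local resource is available."""
    if free_slots <= 0:
        return None
    return match_maker.request_job(resource)
```

In this sketch the agent never receives work it did not ask for: a job stays queued until some resource that satisfies its requirements requests it, which is the essence of the "pull" paradigm.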
Technical point of view
- Service Oriented Architecture.
- Services exposed via a simple XML-RPC interface, accessible over HTTP.
- 99% Python.
- Condor ClassAds and the Condor Matchmaker are used as the job/resource description language and for the match-making operation.
- MySQL for the Job DB, Job Queues and Job Status.
- Internal interfaces use instant messaging technologies*.

Performance in DC04

Result*
- Deployed on 20 "normal" and 40 "LCG" sites.
- Effectively saturated LCG and all available computing resources during the 2004 Data Challenge.
- Supported 4000 simultaneous jobs across 60 sites.
- Produced, transferred and replicated 58 TB of data, plus meta-data.
- Consumed over 400 CPU years during the last 3 months.

Matching Time
- Averaged 420 ms match time over 60,000 jobs.
- Queued jobs are grouped by categories.
- Matches are performed by category.
- Typically 1,000 to 20,000 jobs queued.

Typical Job
- 2 GB local storage.
- 300-600 MB transferred at the end.
- 15-24 hours execution.

Conclusion
- The concept of light agents that are easy to customize and deploy, used as components of a distributed workload management system, proved to be very useful.
- The scalability of the system made it possible to saturate all available resources during the recent Data Challenge exercise.

Agent Deployment: Two Modes

Static
- An agent deployed on a gateway submits jobs to the local computing system.
- Under the control of the local site administrator.

Dynamic
- Jobs are submitted which deploy agents on the WN.
- Once the agent is running on the WN, it acts like a normal agent on a "virtual" DIRAC site.

[Figure: LCG integration example]

Computing resources

Computing Element (CE)
- Uniform abstraction of a computing resource.
- Described by a Resource Description Language (RDL).
- Offers a simple service API:
  - SubmitJob()
  - KillJob()
  - JobStatus()
- CE interface available for:
  - PBS
  - LSF
  - BQS
  - NQS
  - Sun Grid Engine
  - Condor
  - A standalone PC
  - LCG
  - … and others
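To make the Computing Element abstraction concrete, here is a minimal Python sketch of a uniform interface with one backend. Only the three method names SubmitJob(), KillJob() and JobStatus() come from the poster; their signatures, the status strings, the `is_available` helper and the `StandalonePC` class are assumptions made for illustration, not the DIRAC implementation.

```python
import abc
import itertools
import subprocess
import sys


class ComputingElement(abc.ABC):
    """Uniform view of a computing resource (signatures are assumed)."""

    @abc.abstractmethod
    def SubmitJob(self, command):
        """Start a job and return an identifier for it."""

    @abc.abstractmethod
    def KillJob(self, job_id):
        """Stop a job if it is still running."""

    @abc.abstractmethod
    def JobStatus(self, job_id):
        """Return 'running', 'done' or 'failed'."""


class StandalonePC(ComputingElement):
    """Hypothetical backend that runs jobs as local processes."""

    def __init__(self, max_jobs=1):
        self.max_jobs = max_jobs
        self._procs = {}
        self._ids = itertools.count(1)

    def SubmitJob(self, command):
        job_id = next(self._ids)
        self._procs[job_id] = subprocess.Popen(command)
        return job_id

    def KillJob(self, job_id):
        proc = self._procs[job_id]
        if proc.poll() is None:  # still running
            proc.kill()
            proc.wait()

    def JobStatus(self, job_id):
        code = self._procs[job_id].poll()
        if code is None:
            return "running"
        return "done" if code == 0 else "failed"

    def is_available(self):
        # Availability criterion for this CE type: a free job slot.
        running = sum(1 for p in self._procs.values() if p.poll() is None)
        return running < self.max_jobs


if __name__ == "__main__":
    ce = StandalonePC()
    job = ce.SubmitJob([sys.executable, "-c", "print('hello from the CE')"])
    print(job, ce.JobStatus(job))
```

A batch-system backend would implement the same three methods by calling its own submission, cancellation and query commands, so the agent code does not need to know which kind of resource it is driving.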
*See « Grid Information and Monitoring System using XML-RPC and Instant Messaging for DIRAC », Track 4 – Distributed Computing Services, for details.
*See « Results of the LHCb experiment Data Challenge 2004 », Track 5 – "Distributed Computing Systems and Experiences", for details.

The Job Wrapper
- The job is wrapped in a script, called the job wrapper, and then submitted to the local queue.
- This script steers the job execution on the worker node.
- Main operations:
  - Monitoring & Accounting messages (see « A Lightweight Monitoring and Accounting System for LHCb DC04 Production », Track 4 – Distributed Computing Services, for details).
  - I/O data and sandbox handling.
  - Can provide a connection for interaction with the user who owns the job.
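The job wrapper's main operations can be sketched in Python as follows. This is an illustration under stated assumptions, not the DIRAC wrapper: the function `run_wrapped_job`, its parameters, the message strings and the accounting fields are all hypothetical, and in-memory dictionaries stand in for the real sandbox and data transfers.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_wrapped_job(command, input_sandbox, output_names, report):
    """Steer one job: stage inputs, run it, report status, collect outputs.

    `input_sandbox` maps file names to bytes, `report` is any callable
    that accepts a status message (a stand-in for monitoring messages).
    """
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)

        # Input sandbox: write the job's input files into the work area.
        for name, content in input_sandbox.items():
            (work / name).write_bytes(content)
        report("input sandbox staged")

        # Run the payload and time it for accounting purposes.
        report("running")
        start = time.time()
        result = subprocess.run(command, cwd=work)
        wall_seconds = time.time() - start
        report("done" if result.returncode == 0 else "failed")

        # Output sandbox: collect the requested files that were produced.
        outputs = {
            name: (work / name).read_bytes()
            for name in output_names
            if (work / name).exists()
        }

    accounting = {"exit_code": result.returncode, "wall_seconds": wall_seconds}
    return outputs, accounting


if __name__ == "__main__":
    payload = "open('out.txt', 'w').write(open('in.txt').read().upper())"
    outputs, accounting = run_wrapped_job(
        [sys.executable, "-c", payload],
        {"in.txt": b"hello"},
        ["out.txt"],
        print,
    )
    print(outputs, accounting["exit_code"])
```

Passing `report` in as a callable keeps the sketch independent of any particular messaging transport; the poster's wrapper sends such messages to the monitoring and accounting services.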