DIRAC – the distributed production and analysis for LHCb
A. Tsaregorodtsev, CPPM, Marseille
CHEP 2004, 27 September – 1 October 2004, Interlaken

Authors
- DIRAC development team: TSAREGORODTSEV Andrei, GARONNE Vincent, STOKES-REES Ian, GRACIANI-DIAZ Ricardo, SANCHEZ-GARCIA Manuel, CLOSIER Joel, FRANK Markus, KUZNETSOV Gennady, CHARPENTIER Philippe
- Production site managers: BLOUW Johan, BROOK Nicholas, EGEDE Ulrik, GANDELMAN Miriam, KOROLKO Ivan, PATRICK Glen, PICKFORD Andrew, ROMANOVSKI Vladimir, SABORIDO-SILVA Juan, SOROKO Alexander, TOBIN Mark, VAGNONI Vincenzo, WITEK Mariusz, BERNET Roland

Outline
- DIRAC in a nutshell
- Design goals
- Architecture and components
- Implementation technologies
- Interfacing to LCG
- Conclusion

DIRAC in a nutshell
- LHCb grid system for Monte-Carlo simulation data production and analysis
- Integrates computing resources available at LHCb production sites as well as on the LCG grid
- Composed of a set of light-weight services and a network of distributed agents to deliver workload to computing resources
- Runs autonomously once installed and configured on production sites
- Implemented in Python, using XML-RPC as the service access protocol
- DIRAC – Distributed Infrastructure with Remote Agent Control

DIRAC scale of resource usage
- Deployed on 20 “DIRAC” and 40 “LCG” sites
- Effectively saturated LCG and all available computing resources during the 2004 Data Challenge
- Supported 3500 simultaneous jobs across 60 sites
- Produced, transferred, and replicated 65 TB of data, plus meta-data
- Consumed over 425 CPU years during the last 4 months
- See presentation by J. Closier, Wed 16:30 [403]

DIRAC design goals
- Light implementation
  - Must be easy to deploy on various platforms
  - Non-intrusive: no root privileges, no dedicated machines on sites
  - Must be easy to configure, maintain and operate
- Using standard components and third-party developments as much as possible
- High level of adaptability
  - There will always be resources outside the LCG domain: sites that cannot afford LCG, desktops, …
  - We have to use them all in a consistent way
- Modular design at each level
  - Easily adding new functionality

DIRAC architecture

DIRAC architecture
- Service Oriented Architecture (SOA)
  - Inspired by the OGSA/OGSI “grid services” concept
  - Followed the LCG/ARDA RTAG architecture blueprint (ARDA/RTAG proposal)
- Open architecture with well-defined interfaces
  - Allowing for replaceable, alternative services
  - Providing choices and competition

DIRAC Services and Resources
(Architecture diagram: user interfaces – Production manager, GANGA UI, user CLI, job monitor, BK query webpage, FileCatalog browser – talk to the DIRAC services – Job Management Service / WMS, JobMonitorSvc, JobAccountingSvc with its AccountingDB, ConfigurationSvc, FileCatalogSvc, MonitoringSvc, BookkeepingSvc – which dispatch work to the DIRAC resources: DIRAC CEs, DIRAC sites running agents with local CEs, the LCG Resource Broker, and DIRAC Storage accessed via disk file, gridftp, bbftp or rfio.)

DIRAC workload management
- Realizes the PULL scheduling paradigm (sketched below)
- Agents request jobs whenever the corresponding resource is free
- Condor ClassAds and the Matchmaker are used to find jobs suitable to the resource profile
- Agents steer job execution on the site
- Jobs report their state and environment to the central Job Monitoring service
- See poster presentation by V. Garonne, Wed [365]
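A minimal sketch of the pull loop described above, in Python. The Matcher endpoint URL, the requestJob method and the resource description fields are assumptions for illustration, not the actual DIRAC interface.

    import time
    import xmlrpc.client   # XML-RPC is the service access protocol used by DIRAC

    MATCHER_URL = "http://dirac.example.org:9130/Matcher"   # placeholder endpoint

    def pull_loop():
        """Agent-side loop: ask for work only when the local resource is free."""
        matcher = xmlrpc.client.ServerProxy(MATCHER_URL)
        while True:
            # ClassAd-like description of the free resource offered by this site
            resource = {"Site": "DIRAC.Example.org", "MaxCPUTime": 100000}
            job = matcher.requestJob(resource)    # assumed method: returns a job or None
            if not job:
                time.sleep(60)                    # nothing matched, try again later
                continue
            execute_and_report(job)               # steer the job, report states to the monitoring service

    def execute_and_report(job):
        pass                                      # application setup, running and state reporting go here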

Matching efficiency
- Averaged 420 ms match time over 60,000 jobs
- Using Condor ClassAds and the Matchmaker
- Queued jobs are grouped by categories and matches are performed per category (illustrated below)
- Typically 1,000 to 20,000 jobs queued
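An illustrative sketch (not the actual DIRAC code) of why grouping queued jobs by category keeps the match time low: one ClassAd-style evaluation answers a whole category instead of thousands of identical jobs. The requirement fields are assumed.

    from collections import defaultdict

    def group_by_category(queued_jobs):
        """Collapse queued jobs with identical requirements into categories."""
        categories = defaultdict(list)
        for job in queued_jobs:
            key = (job["Platform"], job["MaxCPUTime"])    # assumed requirement fields
            categories[key].append(job)
        return categories

    def match(resource, queued_jobs):
        """Evaluate the resource against one representative per category."""
        for (platform, max_cpu), jobs in group_by_category(queued_jobs).items():
            if resource["Platform"] == platform and resource["MaxCPUTime"] >= max_cpu:
                return jobs[0]                            # any job of the matching category will do
        return None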

DIRAC: agent modular design
(Diagram: an Agent Container holding pluggable modules such as JobAgent, PendingJobAgent, BookkeepingAgent, TransferAgent, MonitorAgent, CustomAgent.)
- An agent is a container of pluggable modules (see the sketch below)
- Modules can be added dynamically
- Several agents can run on the same site, equipped with different sets of modules as defined in their configuration
- Data management is based on specialized agents running on the DIRAC sites
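A minimal sketch of the container idea, with assumed class and module names: the list of modules comes from the site configuration and each module is loaded dynamically.

    import importlib
    import time

    class Agent:
        """Container that runs a configurable set of pluggable modules."""

        def __init__(self, module_names):
            # e.g. ["JobAgent", "TransferAgent", "BookkeepingAgent"], taken from the configuration
            self.modules = [importlib.import_module(name).Module() for name in module_names]

        def run(self, poll_interval=60):
            while True:
                for module in self.modules:
                    module.execute()          # each module performs one unit of its specific work
                time.sleep(poll_interval)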

File Catalogs
- DIRAC incorporates two different file catalogs:
  - Replica tables in the LHCb Bookkeeping Database
  - A file catalog borrowed from the AliEn project
- Both catalogs have identical XML-RPC service interfaces, so they can be used interchangeably (see the sketch below)
- This was done for redundancy and for gaining experience
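A sketch of the interchangeability: because both catalogs expose the same XML-RPC interface, a client can register a replica in either one, or in both for redundancy. The URLs and the addReplica method name are placeholders, not the real DIRAC interface.

    import xmlrpc.client

    CATALOG_URLS = [
        "http://lhcb-bk.example.org:8080/ReplicaCatalog",   # replica tables in the Bookkeeping DB (placeholder)
        "http://alien-fc.example.org:8080/FileCatalog",     # AliEn-derived file catalog (placeholder)
    ]

    def register_replica(lfn, pfn, storage_element):
        """Register a replica through the common interface, in every configured catalog."""
        for url in CATALOG_URLS:
            catalog = xmlrpc.client.ServerProxy(url)
            catalog.addReplica(lfn, pfn, storage_element)   # assumed common method name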

Data management tools
- A DIRAC Storage Element is a combination of a standard server and a description of its access in the Configuration Service
  - Pluggable transport modules: gridftp, bbftp, sftp, ftp, http, …
- DIRAC ReplicaManager interface (API and CLI): get(), put(), replicate(), register(), etc.
- Reliable file transfer (sketched below):
  - A Request DB keeps outstanding transfer requests
  - A dedicated agent takes a file transfer request and retries it until it is successfully completed
  - The WMS is used for data transfer monitoring
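A hedged sketch of the reliable-transfer loop: a dedicated agent cycles over the outstanding requests in the Request DB and retries each until it succeeds. The RequestDB and ReplicaManager method names here are illustrative, not the actual DIRAC API.

    import time

    def transfer_cycle(request_db, replica_manager):
        """One pass over the outstanding transfer requests."""
        for request in request_db.get_pending():                       # illustrative accessor
            try:
                replica_manager.replicate(request["lfn"], request["target_se"])
                replica_manager.register(request["lfn"], request["target_se"])
                request_db.mark_done(request["id"])
            except Exception:
                request_db.mark_retry(request["id"])                    # stays pending, retried next cycle

    def run_transfer_agent(request_db, replica_manager, interval=300):
        while True:
            transfer_cycle(request_db, replica_manager)
            time.sleep(interval)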

Other Services
- Configuration Service
  - Provides configuration information for various system components (services, agents, jobs)
- Bookkeeping (metadata + provenance) database
  - Stores data provenance information
  - See presentation by C. Cioffi, Mon [392]
- Monitoring and Accounting service
  - A set of services to monitor job states and progress and to accumulate statistics of resource usage
  - See presentation by M. Sanchez, Thu [388]

Implementation technologies

XML-RPC protocol
- Standard, simple, available out of the box in the standard Python library, both server and client (see the sketch below)
- Uses the expat XML parser
- Server: very simple, socket based, multithreaded, supports request rates of up to 40 Hz
- Client: dynamically built service proxy
- We did not feel the need for anything more complex (SOAP, WSDL, …)
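A minimal sketch of this pattern with Python's standard library (modern module names; the 2004 implementation used the Python 2 equivalents). A mixin with ThreadingMixIn gives the multithreaded server, and the client works through a dynamically built proxy.

    from socketserver import ThreadingMixIn
    from xmlrpc.server import SimpleXMLRPCServer
    import xmlrpc.client

    class ThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
        """Socket-based XML-RPC server handling each request in its own thread."""

    def echo(message):
        return message

    # Server side
    server = ThreadedXMLRPCServer(("localhost", 9000), allow_none=True)
    server.register_function(echo)
    # server.serve_forever()    # blocks; in DIRAC the services run as long-lived daemons

    # Client side: method calls on the proxy become remote XML-RPC calls
    proxy = xmlrpc.client.ServerProxy("http://localhost:9000", allow_none=True)
    # print(proxy.echo("hello"))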

Instant Messaging in DIRAC
- Jabber/XMPP IM
  - Asynchronous, buffered, reliable messaging framework
  - Connection based: authenticate once, then “tunnel” back to the client – a bi-directional connection with only outbound connectivity (no firewall problems, works with NAT)
- Used in DIRAC to:
  - Send control instructions to components (services, agents, jobs): XML-RPC over Jabber, broadcast or to a specific destination
  - Monitor the state of components
  - Provide an interactivity channel with running jobs
- See poster presentation by I. Stokes-Rees, Wed [368]

Services reliability issues
- If something can fail, it will
- Regular back-up of the underlying database
- Journaling of all write operations
- Running more than one instance of a service, e.g. the Configuration Service runs at CERN and in Oxford
- Running services reliably with the runit set of tools:
  - Automatic restart on failure
  - Automatic logging with the possibility of rotation
  - Simple interface to pause, stop, and send control signals to services
  - Runs the service in daemon mode
  - Can be used entirely in user space

Interfacing to LCG

Dynamically deployed agents
How to involve resources where DIRAC agents are not yet installed or cannot be installed?
- Workload management with resource reservation
  - Send an agent as a regular job (see the sketch below)
  - Turn a worker node (WN) into a virtual LHCb production site
- This strategy was applied for the DC04 production on LCG: effectively using LCG services to deploy the DIRAC infrastructure on LCG resources
- Efficiency:
  - >90% success rate for DIRAC jobs on LCG, while the success rate of LCG jobs was 60%
  - No harm to the DIRAC production system
- One person ran the LHCb DC04 production on LCG
- Most intensive use of LCG2 up to now (> 200 CPU years)
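A sketch of the “agent as a regular job” idea: a small wrapper, submitted through the LCG Resource Broker like any other job, fetches and starts a DIRAC agent on the worker node. The download URL, paths and entry point are hypothetical.

    #!/usr/bin/env python
    import os
    import subprocess
    import urllib.request

    INSTALLER_URL = "http://dirac.example.org/dirac-agent.tar.gz"   # placeholder distribution
    WORKDIR = os.path.join(os.getcwd(), "dirac_agent")

    def bootstrap():
        """Install and start a DIRAC agent, turning this worker node into a temporary DIRAC site."""
        os.makedirs(WORKDIR, exist_ok=True)
        tarball = os.path.join(WORKDIR, "dirac-agent.tar.gz")
        urllib.request.urlretrieve(INSTALLER_URL, tarball)
        subprocess.run(["tar", "xzf", tarball, "-C", WORKDIR], check=True)
        # The started agent then pulls real workload from the central DIRAC WMS.
        subprocess.run(["python", os.path.join(WORKDIR, "run_agent.py")], check=True)   # assumed entry point

    if __name__ == "__main__":
        bootstrap()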

Conclusions
- The Service Oriented Architecture is essential for building ad hoc grid systems, meeting the needs of particular organizations out of the components available on the “market”
- The concept of light, easy-to-customize and easy-to-deploy agents as components of a distributed workload management system proved to be very useful
- The scalability of the system made it possible to saturate all available resources, including LCG, during the recent Data Challenge exercise