Grid Deployment Board meeting, 8 November 2006, CERN

LHCb glexec use case
A.Tsaregorodtsev, CPPM, Marseille
Grid Deployment Board meeting, 8 November 2006, CERN

Outline
- WMS with Pilot Agents
- Job prioritization problem and solution
- Use of glexec in LHCb
- Status

Introduction (1)
LHCb has its own WMS capable of steering the jobs of LHCb users on different resources:
- LCG, standalone clusters, PCs
The LHCb WMS (DIRAC) uses the Pilot Agent (Job) paradigm:
- Increased reliability, elimination of the "black holes"
- More precise late scheduling
- Central VO Task Queue
The Pilot Agent paradigm leads naturally to a Job Prioritization schema with generic VO Pilot Agents:
- Prioritization in the central Task Queue

DIRAC workload management
[Architecture diagram: the DIRAC services (Job Receiver, Data Optimizer, Task Queue, JobDB, Sandbox, Agent Director, Agent Monitor, Matcher, WMS Admin, Job Monitor) interact with the LCG workload layer (RB, CE, SE, LFC, VO-box); on the WN the Pilot Agent forks a Job Wrapper, which executes the User Application via glexec]

Job prioritization task
LHCb computing activities are varied and each kind of activity has a different priority:
- Simulation and reconstruction jobs
- Production and user analysis jobs
The LHCb user community is divided into groups of interest, and each group will have its well-defined share of the LHCb resources:
- Consumption according to the shares should be ensured
Each user activity can consist of different tasks with different importance for the user:
- The user should be able to control the priority of his own tasks
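A minimal sketch of how group shares and recent accounting could be turned into job priorities, in the spirit of the shares described above. The group names, share values and the formula itself are illustrative assumptions, not the actual LHCb policy engine.

```python
# Illustrative sketch only: groups below their resource share get a priority
# boost, groups above it are penalized; the user's own ranking of the task is
# added on top. All numbers and names are assumptions for illustration.

GROUP_SHARES = {'lhcb_prod': 0.60, 'lhcb_user': 0.35, 'lhcb_test': 0.05}

def group_priority(group, consumed_cpu_hours, total_cpu_hours):
    """Positive when the group is below its share, negative when above it."""
    if total_cpu_hours == 0:
        return GROUP_SHARES.get(group, 0.0)
    used_fraction = consumed_cpu_hours / total_cpu_hours
    return GROUP_SHARES.get(group, 0.0) - used_fraction

def job_priority(group, user_rank, consumed, total):
    """Combine the group-level share correction with the user's own task rank."""
    return 100 * group_priority(group, consumed, total) + user_rank

if __name__ == '__main__':
    # lhcb_user has consumed 10 of the last 100 CPU hours -> below its 35% share
    print(job_priority('lhcb_user', user_rank=5, consumed=10, total=100))
```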

Job Prioritization strategy
Submitted jobs are inserted in the central Task Queue and ordered by their priorities
The priorities are either set by hand or calculated based on the LHCb policies and accounting
- E.g., using a Maui scheduling engine
For each job in the Task Queue a Pilot Agent is sent to LCG
- Keeping track of the original job identity for now
- Generic VO Pilot Agents eventually
When the Pilot starts on a WN it asks the Matcher service for a job
- The highest-priority suitable job is served
"Centralized" prioritization as opposed to "on-site" prioritization
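A small sketch of the central matching step: the Pilot Agent presents a description of its worker node and the Matcher hands back the highest-priority queued job whose requirements it can satisfy. The data layout and field names are assumptions for illustration, not the real Matcher protocol.

```python
# Hypothetical in-memory Task Queue; each entry carries a priority and the
# job's requirements on the resource that will run it.
task_queue = [
    {'jobID': 101, 'priority': 8, 'requirements': {'site': 'LCG.CERN.ch', 'min_cpu_time': 3600}},
    {'jobID': 102, 'priority': 5, 'requirements': {'site': 'ANY',         'min_cpu_time': 600}},
]

def match_job(resource):
    """Return the highest-priority queued job the resource can run, or None."""
    def fits(req):
        site_ok = req['site'] in ('ANY', resource['site'])
        time_ok = resource['cpu_time_left'] >= req['min_cpu_time']
        return site_ok and time_ok

    candidates = [j for j in task_queue if fits(j['requirements'])]
    if not candidates:
        return None
    best = max(candidates, key=lambda j: j['priority'])
    task_queue.remove(best)          # each job is handed out exactly once
    return best

# A pilot starting on a CERN worker node with about two hours of CPU left:
print(match_job({'site': 'LCG.CERN.ch', 'cpu_time_left': 7200}))
```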

Community Overlay Network
DIRAC Central Services and Pilot Agents form a dynamic distributed system as easy to manage as an ordinary batch system
- Uniform view of the resources independent of their nature: Grids, clusters, PCs
- Prioritization according to VO policies and accounting
- Possibility to reuse batch system tools, e.g. the Maui scheduler
[Diagram: DIRAC WMS (Task Queue, Monitoring, Logging, Scheduler) dispatching Pilot Agents to WNs across the "GRID"]

Generic VO pilot agents
The central job prioritization strategy only makes sense if all the LHCb jobs (User and Production) are treated together in the same way
- Generic VO pilot agents are needed
- Each agent should take the highest-priority job independently of its ownership
Implications:
- The VO WMS should be as secure as the LCG WMS
- User credential delegation
- Interaction with the site policy enforcement system
- Job owner traceability
This is where the glexec functionality is necessary
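A sketch of the generic VO pilot loop this implies; all names are placeholders for illustration, not the real DIRAC pilot code. The pilot repeatedly asks the Matcher for the highest-priority job regardless of its owner, fetches that owner's delegated proxy, and hands the workload to glexec so it runs under the job owner's identity.

```python
def run_generic_pilot(matcher, proxy_service, glexec_runner, resource):
    while True:
        job = matcher.request_job(resource)          # highest-priority suitable job
        if job is None:
            break                                    # the Task Queue has nothing for us
        proxy = proxy_service.get_proxy(job['ownerDN'], job['ownerGroup'])
        status = glexec_runner(job['wrapper'], proxy)  # identity switch on the WN
        matcher.report_status(job['jobID'], status)

# Example wiring with trivial stand-ins for the real services:
class FakeMatcher:
    def __init__(self, jobs): self.jobs = list(jobs)
    def request_job(self, resource): return self.jobs.pop(0) if self.jobs else None
    def report_status(self, job_id, status): print(job_id, status)

class FakeProxyService:
    def get_proxy(self, dn, group): return '/tmp/proxy-of-%s' % group

run_generic_pilot(
    FakeMatcher([{'jobID': 1, 'ownerDN': '/O=GRID-FR/...', 'ownerGroup': 'lhcb_user',
                  'wrapper': './jobWrapper_1.py'}]),
    FakeProxyService(),
    glexec_runner=lambda wrapper, proxy: 'Done',
    resource={'site': 'LCG.CERN.ch'},
)
```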

Job submission to DIRAC WMS
User jobs are submitted via a JobReceiver service using the DISET security mechanism
- GSI authentication of the user credentials
- User authorization is done based on the DIRAC internal configuration, accessible to and managed by the LHCb administrators
The user proxy is passed to DIRAC via an SSL-encrypted channel
- No new private/public key pair is generated
The proxies of the users whose jobs are in the system are stored in the MySQL database
- LCG proxy cache tools can be incorporated if necessary
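A hedged sketch of the server-side bookkeeping this describes: after the DISET layer has done GSI authentication, the service checks the user against the VO's own configuration and stores the job description together with the delegated proxy. The real service uses DISET and MySQL; sqlite3 is used here only to keep the sketch self-contained, and all names are illustrative.

```python
import sqlite3

AUTHORIZED_GROUPS = {'lhcb_user', 'lhcb_prod'}   # managed by the LHCb administrators

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE Jobs    (jobID INTEGER PRIMARY KEY, ownerDN TEXT, jdl TEXT)')
db.execute('CREATE TABLE Proxies (ownerDN TEXT PRIMARY KEY, proxyPEM TEXT)')

def receive_job(owner_dn, owner_group, jdl, proxy_pem):
    """Accept a job only from an authorized VO group; keep the JDL and the proxy."""
    if owner_group not in AUTHORIZED_GROUPS:
        raise PermissionError('group %s is not authorized' % owner_group)
    cur = db.execute('INSERT INTO Jobs (ownerDN, jdl) VALUES (?, ?)', (owner_dn, jdl))
    db.execute('INSERT OR REPLACE INTO Proxies VALUES (?, ?)', (owner_dn, proxy_pem))
    return cur.lastrowid                          # the new DIRAC job ID

job_id = receive_job('/O=GRID-FR/C=FR/O=CNRS/OU=CPPM/CN=Andrei Tsaregorodtsev',
                     'lhcb_user', 'Executable = "DaVinci";', '-----BEGIN CERTIFICATE-----...')
print('Accepted job', job_id)
```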

Respecting site policy
A generic VO Pilot Agent should not run the jobs of users in a site black list
How to enforce it: the glexec utility
- Consults the site policy enforcement box
- Changes the user identity for running the user application
- Executes the user application
Accounting: LHCb does not require finer-grained accounting than the VO level on the sites
- Group- or user-level accounting is done by the LHCb VO
- Already exists
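A sketch of how a pilot could hand the user workload to glexec. The environment variable names follow later gLite glexec documentation (GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY) and the glexec path is a typical installation location; both are assumptions here, not taken from the slide.

```python
import os
import subprocess

def execute_with_glexec(wrapper_path, user_proxy_path,
                        glexec='/opt/glite/sbin/glexec'):
    """Run the user's job wrapper under the identity mapped by glexec."""
    env = dict(os.environ)
    env['GLEXEC_CLIENT_CERT'] = user_proxy_path    # proxy of the job owner
    env['GLEXEC_SOURCE_PROXY'] = user_proxy_path   # copied to the target account
    # glexec consults the site policy box, switches identity, then executes the
    # wrapper; its exit code tells the pilot whether the switch succeeded.
    return subprocess.call([glexec, wrapper_path], env=env)

# e.g. execute_with_glexec('./jobWrapper_00001455_00000744.py', '/tmp/ownerProxy.pem')
```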

User proxy delegation
A user grid job performs operations needing user credentials
- Access to data, catalogs, etc.
The DIRAC WMSAdministrator service can serve user proxies on requests from the LHCb administrator users
- DISET secure service
- Uses a MyProxy server to serve only proxies with a minimally required lifetime
The Pilot Agent should acquire the credentials (proxy) of the actual owner of a job
- For the moment using the DIRAC tools
- Eventually an LCG/gLite-provided proxy delegation mechanism; we would be happy to try it out
The user proxy is passed to glexec together with the user workload
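A hedged sketch of the proxy-serving logic described above: give the pilot a proxy of the job owner that is just long enough for the job, renewing from a MyProxy server only when the stored proxy would expire too early. The function and field names are placeholders, not the real WMSAdministrator interface.

```python
from datetime import datetime, timedelta

def serve_proxy(proxy_store, myproxy, owner_dn, required_lifetime_hours):
    record = proxy_store[owner_dn]                       # stored at submission time
    remaining = record['expires'] - datetime.utcnow()
    if remaining < timedelta(hours=required_lifetime_hours):
        # Too short: fetch a fresh, minimally long proxy from MyProxy.
        record = myproxy.retrieve(owner_dn, lifetime_hours=required_lifetime_hours)
        proxy_store[owner_dn] = record
    return record['pem']                                 # handed to the Pilot Agent

# Example with an in-memory store and a stub MyProxy client:
class StubMyProxy:
    def retrieve(self, dn, lifetime_hours):
        return {'pem': 'renewed proxy for %s' % dn,
                'expires': datetime.utcnow() + timedelta(hours=lifetime_hours)}

store = {'/O=GRID-FR/CN=Some User': {'pem': 'old proxy',
                                     'expires': datetime.utcnow() + timedelta(hours=1)}}
print(serve_proxy(store, StubMyProxy(), '/O=GRID-FR/CN=Some User', required_lifetime_hours=12))
```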

Job traceability
Job traceability: how to know which workload is being executed on a WN at each moment
Possible solutions:
- A log file on the WN, for example:
  2006-10-11 03:05:03: Starting job 00001455_00000744
  2006-10-11 03:05:03: Job owner /O=GRID-FR/C=FR/O=CNRS/OU=CPPM/CN=Andrei Tsaregorodtsev
  2006-10-11 03:05:03: Job top process id 12345
  2006-10-11 08:04:34: Finished job 00001455_00000744
  2006-10-11 08:04:34: CPU consumption for 00001455_00000744: 34875.4 s
- Providing a VO service to query this same information; the information is kept in the DIRAC Job Monitoring service anyway
- System auditing, tracing problems to the id set by glexec
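A small sketch of writing a worker-node traceability log in the format shown on the slide, so the site or a VO query service can tell which workload was running at a given moment. The writer below is illustrative, not the actual DIRAC wrapper code.

```python
import time

def trace(logfile, message):
    """Append one timestamped traceability record in the slide's format."""
    stamp = time.strftime('%Y-%m-%d %H:%M:%S')
    with open(logfile, 'a') as f:
        f.write('%s: %s\n' % (stamp, message))

logfile = 'pilot_trace.log'
trace(logfile, 'Starting job 00001455_00000744')
trace(logfile, 'Job owner /O=GRID-FR/C=FR/O=CNRS/OU=CPPM/CN=Andrei Tsaregorodtsev')
trace(logfile, 'Job top process id 12345')
trace(logfile, 'Finished job 00001455_00000744')
```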

Where we are
We are practically ready to proceed with the testing of the overall schema, and of glexec in particular
The Pilot Agent generates job wrapper scripts dynamically
- The wrappers are passed to the LRMS on the DIRAC sites
- They can be passed to glexec as the executable workload
- The LHCb/DIRAC fine-grained status and accounting reports are done within the wrapper scripts
- This will not be degraded by glexec creating a new process group
The user proxy delegation is done by DIRAC tools
Job priority evaluation is in place
- Rather simple priorities for the moment
- More elaborate mechanisms are in the works
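A sketch of dynamically generating a job wrapper of the kind described above: a small self-contained script that reports status, runs the user application, and reports accounting before exiting. The template and the report hooks are illustrative placeholders, not the real DIRAC wrapper.

```python
import os
import stat

WRAPPER_TEMPLATE = '''#!/usr/bin/env python
import subprocess, time
print("%(job_id)s: reporting status Running")        # fine-grained status report
start = time.time()
rc = subprocess.call(%(command)r)                     # the actual user application
print("%(job_id)s: wall time %%.1f s, exit code %%d" %% (time.time() - start, rc))
raise SystemExit(rc)
'''

def generate_wrapper(job_id, command, directory='.'):
    """Write an executable wrapper script for one job and return its path."""
    path = os.path.join(directory, 'jobWrapper_%s.py' % job_id)
    with open(path, 'w') as f:
        f.write(WRAPPER_TEMPLATE % {'job_id': job_id, 'command': command})
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)   # make it executable
    return path        # this path is what gets handed to the LRMS or to glexec

print(generate_wrapper('00001455_00000744', ['DaVinci', 'options.opts']))
```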

DIRAC workload management
[Diagram: Central Services with a Priority Calculator fed by the LHCb policy and quotas, VOMS info and the Accounting Service; the Job Receiver, Data Optimizer and Prioritizer fill per-group Task Queues in the Job Database with job requirements, ownership and priority; the Match Maker and Agent Director dispatch Agents to the resources (WNs)]

Where we are
We discussed the possible tests with CC-IN2P3, Lyon
- Everybody agrees; the technical details of the glexec deployment still need to be sorted out
More details need to be clarified for the sites that do workload optimization based on the job properties
- For example, CPU-bound vs I/O-bound jobs

Conclusions
LHCb intends to employ the "central strategy" for the job prioritization problem
This solution necessitates the use of generic Pilot Agents running on LCG nodes and executing arbitrary user jobs
glexec is the crucial part of it, ensuring job traceability and the application of the site policies
LHCb is ready to test the glexec functionality in the full job prioritization procedure

Backup slides

User proxy delegation in DIRAC
[Diagram: user job submission to the DIRAC WMS (Job Receiver, JobDB, WMS Admin, Agent Director, MyProxy proxy store) over DISET secure access; the Agent Director submits pilots through the LCG RB over LCG secure access to the WN, where the Agent obtains the user proxy from the WMS Admin service and runs the Job via glexec]