Direct gLExec integration with PanDA (EGI-InSPIRE RI-261323). Fernando H. Barreiro Megino, Simone Campana.

Presentation transcript:

Direct gLExec integration with PanDA
Fernando H. Barreiro Megino, Simone Campana, Ramon Medrano (CERN IT-ES-VOS)
EGI-InSPIRE RI-261323
9/30/2016

Introduction

The WLCG Grid job and Worker Node Security Assessment (http://cern.ch/go/7vK9) requires the usage of gLExec.

gLExec acts as a light-weight 'gatekeeper':
1. Take grid credentials as input
2. Consider local site policies to authenticate and authorize the credentials
3. Switch to new identity and execution sandbox to run a given command

gLExec usage comes for free with GlideInWMS, but alternatively we can integrate gLExec directly with the PanDA Pilot.
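The three gatekeeper steps can be pictured from the caller's side with a minimal Python sketch. It assumes that gLExec reads the user credential from the GLEXEC_CLIENT_CERT and GLEXEC_SOURCE_PROXY environment variables and that the binary lives at /usr/sbin/glexec; both should be checked against the local installation. The function names are hypothetical and are not taken from the pilot code.

```python
import os
import subprocess


def build_glexec_call(user_proxy, command, glexec_path="/usr/sbin/glexec"):
    """Return (argv, env) for running `command` under the user's identity.

    Assumption: gLExec picks up the credential to authorize from
    GLEXEC_CLIENT_CERT and copies GLEXEC_SOURCE_PROXY to the target account.
    The default glexec_path is also an assumption.
    """
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = user_proxy    # step 1: grid credential as input
    env["GLEXEC_SOURCE_PROXY"] = user_proxy   # proxy handed to the target user
    argv = [glexec_path] + list(command)      # step 3: command run after the switch
    return argv, env


def run_as_user(user_proxy, command):
    """Invoke gLExec; step 2 (site policy check) happens inside gLExec itself."""
    argv, env = build_glexec_call(user_proxy, command)
    return subprocess.call(argv, env=env)
```

Keeping the command construction separate from the actual call makes it possible to inspect what would be executed on a node where gLExec is not installed.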

Common Analysis Framework

[Architecture diagram. Columns: client side, server side, Grid resources. Legend: PanDA components; VO-specific, external components; GlideInWMS components. Labeled elements: (optional) client service, VO-specific client, PanDA Server, PanDA monitor and Dashboard historical views, Data Mgmt Services, Data Adaptor, GlideInWMS, PanDA Pilot Factories, glideIns, Computing Element, PanDA pilot, gLExec, Job trans.]

Direct gLExec integration: Stage 1

[Diagram elements: PanDA Server, MyProxy, Computing Element, PanDA pilot, gLExec, Job. Flow labels: "Job", "User proxy".]

Implementation steps
1. MyProxy & gLExec standalone tests
2. Refactoring of MyProxyUtils
3. gLExec SAM test
4. Integration into PanDA pilot

✓ 1. MyProxy & gLExec standalone tests

- Gain experience in using MyProxy and gLExec
  1. Upload & download proxies and test delegation between Fernando and Ramon
  2. Test gLExec through bsub on lxbatch (i.e. no pilot involved yet), switching identity from Fernando to Ramon
     1. Run id, voms-proxy-info …
     2. Copy a file in and out of EOS SCRATCHDISK

✓ 2. Refactoring of MyProxyUtils

- A few years ago Jose Caballero wrote MyProxyUtils
  - Python wrapper to MyProxy and gLExec
  - Some set-up for PanDA pilot
- Re-factored the library:
  - Included usage of gLExec tools that were not available back then
    - Environment wrap/unwrap scripts (…)
    - Creation of secure sandbox through mkgltempdir (…)
  - Reviewed the logic (e.g. removed chmod 777 of pilot directory before identity switching)
  - Effort in imposing coding standards: a side-objective of our activity is to help Paul Nilsson improve the overall pilot code
- Validated MyProxyUtils by repeating previous standalone tests
  - Problems setting target-directory mode in mkgltempdir (to be followed up)
  - Ulrich had to fix the installation of the gLExec tools and myproxy-logon on part of the lxbatch WNs
  - Are we the first people to use the auxiliary gLExec tools?
- pylint check
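The idea behind environment wrap/unwrap can be sketched in a few lines of Python. This is not the gLExec helper scripts themselves, only an illustration of the technique under stated assumptions: the environment is packed into a single ASCII string before the identity switch and unpacked afterwards, and identity-specific variables are excluded. The exclusion list and the names wrap_env and unwrap_env are hypothetical.

```python
import base64
import json
import os

# Assumption: these variables belong to the source identity and must not leak
# into the target account. The actual list used by the real tools may differ.
IDENTITY_VARS = ("HOME", "USER", "LOGNAME", "X509_USER_PROXY")


def wrap_env(environ=None, exclude=IDENTITY_VARS):
    """Pack the environment into one base64 string that survives the switch."""
    source = os.environ if environ is None else environ
    kept = {k: v for k, v in source.items() if k not in exclude}
    payload = json.dumps(kept, sort_keys=True).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


def unwrap_env(blob, base=None):
    """Rebuild the environment on the target side on top of its own base values."""
    restored = json.loads(base64.b64decode(blob.encode("ascii")).decode("utf-8"))
    merged = dict(base or {})
    merged.update(restored)
    return merged
```

The wrapped string can be passed as a single argument or through one environment variable, which avoids depending on which variables the identity switch preserves.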

gLExec SAM test

- Existing gLExec SAM test is not very complete
  - Switch from one proxy to itself and do nothing as target user
- We would like to extend the test
  1. Use MyProxy to download a different target credential
  2. Check that the gLExec tools are installed in every site
     - EGI sites should have them (Maarten Litmaath)
     - OSG sites do not have mkgltempdir (Dave Dykstra, Igor Sfiligoi)
     - Alternatively the script could be shipped with pilot
  3. Check that the full gLExec workflow succeeds in every site
- Alessandro pointed us to the SAM gLExec code and Ramon will implement the changes
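The second check (tools installed at every site) could look roughly like the following Python sketch. It assumes the tools are reachable through PATH on the worker node, which may not hold everywhere, and it is not the SAM probe code; the function name missing_tools is hypothetical.

```python
import shutil

# Tool names as mentioned in the slides; assumption: all are expected on PATH.
REQUIRED_TOOLS = ("glexec", "mkgltempdir", "myproxy-logon")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on this worker node."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing gLExec tools: " + ", ".join(missing))
    else:
        print("all gLExec tools found")
```

The lookup function is a parameter so the check can be exercised without a real worker node, for example to simulate a site that lacks mkgltempdir.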

Integration with PanDA Pilot

[Pilot workflow diagram. Pilot: collect local info, job recovery, additional cleanup. Multi-job loop: check proxy, check local space, collect local info, get job, fork sub process, clean up. Monitoring loop: check log size, (check workDir), looping check, check local space. Sub process: setup job, transfer input, execute payload, transfer output. Signal handlers (pilot and gLExec side): abort and clean up.]

Once user job is downloaded, switch identity ASAP

Integration with PanDA Pilot

- This is the hardest part of Stage 1 model implementation
  - Guidance of Paul Nilsson is important
- Improving the pilot's long-term sustainability
  - Getting familiar with the pilot code
  - Re-factoring the code for easier maintenance
    - Taking out the main pilot module (~4500 lines) is a lot of work
    - Splitting main pilot module (pilot.py) into two modules
    - Moving functions/classes between the two modules or to utility library when needed
    - Fixing warnings and errors that arise from separation
- Modularize the code
  - Need to serialize/de-serialize Python environment (variables, constants…) in order to share through gLExec
  - Ramon wrote a configuration manager
    - Dictionary-like way to store configuration values
    - In-depth serializable (JSON and pickle depending on available libraries)
    - Thread/multi-process safe
  - Solve permission problems that will arise from running parts of the pilot under two different users and sandboxes (e.g. merging log files)
- Testing will not be trivial
- Still a lot of work to be done
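A configuration manager with the listed properties might be sketched as below. This is not Ramon's implementation: the class name, method names and format switch are assumptions, and the sketch covers only thread safety within one process, not multi-process safety.

```python
import json
import pickle
import threading


class ConfigManager:
    """Dictionary-like store that can be serialized and rebuilt elsewhere."""

    def __init__(self, initial=None):
        self._lock = threading.RLock()   # guards every access to _data
        self._data = dict(initial or {})

    def __getitem__(self, key):
        with self._lock:
            return self._data[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._data[key] = value

    def __contains__(self, key):
        with self._lock:
            return key in self._data

    def dumps(self, fmt="json"):
        """Serialize all values to bytes; JSON is limited to JSON-compatible types."""
        with self._lock:
            if fmt == "json":
                return json.dumps(self._data, sort_keys=True).encode("utf-8")
            if fmt == "pickle":
                return pickle.dumps(self._data)
            raise ValueError("unknown format: %r" % fmt)

    @classmethod
    def loads(cls, blob, fmt="json"):
        """Rebuild a manager from bytes produced by dumps()."""
        if fmt == "json":
            return cls(json.loads(blob.decode("utf-8")))
        if fmt == "pickle":
            # pickle must only be used on data from a trusted source
            return cls(pickle.loads(blob))
        raise ValueError("unknown format: %r" % fmt)
```

In this picture the pilot would dump the configuration before calling gLExec and the process running under the target identity would load it again. JSON is the safer default across an identity boundary because unpickling executes whatever the blob describes.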

Drawbacks of Stage 1 model

- Users have to manage their proxy on MyProxy
  - What if it expires?
- ATLAS uses a large number of pilot certificates
  - Users would have to delegate to all pilot certificates
- Worker nodes are hammering the MyProxy server
  - John Hover successfully tested MyProxy at ~25 Hz (March 2012). At 25 accesses/s × 86,400 s/day ≈ 2.16M, over 2M accesses per day should be possible. Test conditions currently unknown
- MyProxy server as single point of failure

These shortcomings could be solved by the model proposed for Stage 2
- Proposal. Not approved yet

Stage 2: PanDA server caching and client integration

[Diagram elements: PanDA client, PanDA Server with proxy cache, MyProxy, Computing Element, PanDA pilot, gLExec, Job. Flow labels: 1. User proxy and job; 2. User proxy; 2. Job; 3. User proxy; 4. User proxy and job; 5. Notification: proxy about to expire.]

Disclaimer: This model is only a proposal and has not been discussed or presented for approval
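The server-side proxy cache and the "about to expire" notification can be illustrated with a small in-memory Python sketch. Since the model is only a proposal, everything here is hypothetical: the class name, the one-hour warning window, and the choice of keying entries by user are assumptions, and persistence and secure storage of credentials are left out.

```python
import time


class ProxyCache:
    """In-memory cache of user proxies with an expiry warning window."""

    def __init__(self, warn_before=3600, clock=time.time):
        self._entries = {}                # user -> (proxy, expires_at in epoch seconds)
        self._warn_before = warn_before   # assumption: warn one hour ahead
        self._clock = clock               # injectable clock for testing

    def store(self, user, proxy, expires_at):
        """Cache the proxy that arrived with the job submission."""
        self._entries[user] = (proxy, expires_at)

    def get(self, user):
        """Return a still-valid proxy for the pilot, or None if absent or expired."""
        entry = self._entries.get(user)
        if entry is None:
            return None
        proxy, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[user]       # drop expired credentials
            return None
        return proxy

    def expiring_soon(self):
        """Users who should be notified that their proxy is about to expire."""
        now = self._clock()
        return sorted(
            user
            for user, (_, expires_at) in self._entries.items()
            if now < expires_at <= now + self._warn_before
        )
```

A periodic task on the server could call expiring_soon() and send the notification, while pilots only ever call get(), so worker nodes would no longer contact MyProxy directly.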

Questions?