Security and VO management enhancements in Panda Workload Management System Jose Caballero Maxim Potekhin Torre Wenaus Presented by Maxim Potekhin at HPDC08.

Slides:



Advertisements
Similar presentations
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MyProxy and EGEE Ludek Matyska and Daniel.
Advertisements

FP7-INFRA Enabling Grids for E-sciencE EGEE Induction Grid training for users, Institute of Physics Belgrade, Serbia Sep. 19, 2008.
CERN LCG Overview & Scaling challenges David Smith For LCG Deployment Group CERN HEPiX 2003, Vancouver.
Haga clic para cambiar el estilo de título Haga clic para modificar el estilo de subtítulo del patrón DIRAC Framework A.Casajus and R.Graciani (Universitat.
CoreGRID Workpackage 5 Virtual Institute on Grid Information and Monitoring Services Authorizing Grid Resource Access and Consumption Erik Elmroth, Michał.
A Model for Grid User Management Rich Baker Dantong Yu Tomasz Wlodek Brookhaven National Lab.
MultiJob PanDA Pilot Oleynik Danila 28/05/2015. Overview Initial PanDA pilot concept & HPC Motivation PanDA Pilot workflow at nutshell MultiJob Pilot.
System Design/Implementation and Support for Build 2 PDS Management Council Face-to-Face Mountain View, CA Nov 30 - Dec 1, 2011 Sean Hardman.
Riccardo Bruno INFN.CT Sevilla, Sep 2007 The GENIUS Grid portal.
DIRAC Web User Interface A.Casajus (Universitat de Barcelona) M.Sapunov (CPPM Marseille) On behalf of the LHCb DIRAC Team.
OSG Middleware Roadmap Rob Gardner University of Chicago OSG / EGEE Operations Workshop CERN June 19-20, 2006.
Publication and Protection of Site Sensitive Information in Grids Shreyas Cholia NERSC Division, Lawrence Berkeley Lab Open Source Grid.
INFSO-RI Enabling Grids for E-sciencE SA1: Cookbook (DSA1.7) Ian Bird CERN 18 January 2006.
INFSO-RI Enabling Grids for E-sciencE Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the.
PanDA Multi-User Pilot Jobs Maxim Potekhin Brookhaven National Laboratory Open Science Grid WLCG GDB Meeting CERN March 11, 2009.
Evolution of the Open Science Grid Authentication Model Kevin Hill Fermilab OSG Security Team.
Responsibilities of ROC and CIC in EGEE infrastructure A.Kryukov, SINP MSU, CIC Manager Yu.Lazin, IHEP, ROC Manager
Grid User Management System Gabriele Carcassi HEPIX October 2004.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks David Kelsey RAL/STFC,
OSG Area Coordinator’s Report: Workload Management Maxim Potekhin BNL
Tarball server (for Condor installation) Site Headnode Worker Nodes Schedd glidein - special purpose Condor pool master DB Panda Server Pilot Factory -
CERN Using the SAM framework for the CMS specific tests Andrea Sciabà System Analysis WG Meeting 15 November, 2007.
Mine Altunay July 30, 2007 Security and Privacy in OSG.
14.1/21 Part 5: protection and security Protection mechanisms control access to a system by limiting the types of file access permitted to users. In addition,
Pilot Jobs John Gordon Management Board 23/10/2007.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Direct gLExec integration with PanDA Fernando H. Barreiro Megino CERN IT-ES-VOS.
Getting started DIRAC Project. Outline  DIRAC information system  Documentation sources  DIRAC users and groups  Registration with DIRAC  Getting.
INFSO-RI Enabling Grids for E-sciencE LCAS/LCMAPS and WSS Site Access Control boundary conditions David Groep NIKHEF.
Overview of Privilege Project at Fermilab (compilation of multiple talks and documents written by various authors) Tanya Levshina.
Role Based VO Authorization Services Ian Fisk Gabriele Carcassi July 20, 2005.
US LHC OSG Technology Roadmap May 4-5th, 2005 Welcome. Thank you to Deirdre for the arrangements.
Conference name Company name INFSOM-RI Speaker name The ETICS Job management architecture EGEE ‘08 Istanbul, September 25 th 2008 Valerio Venturi.
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
A PanDA Backend for the Ganga Analysis Interface J. Elmsheuser 1, D. Liko 2, T. Maeno 3, P. Nilsson 4, D.C. Vanderster 5, T. Wenaus 3, R. Walker 1 1: Ludwig-Maximilians-Universität.
Trusted Virtual Machine Images a step towards Cloud Computing for HEP? Tony Cass on behalf of the HEPiX Virtualisation Working Group October 19 th 2010.
INFSO-RI Enabling Grids for E-sciencE EGEE is a project funded by the European Union under contract INFSO-RI Grid Accounting.
VO Privilege Activity. The VO Privilege Project develops and implements fine-grained authorization to grid- enabled resources and services Started Spring.
Glite. Architecture Applications have access both to Higher-level Grid Services and to Foundation Grid Middleware Higher-Level Grid Services are supposed.
INFSO-RI Enabling Grids for E-sciencE glexec deployment models local credentials and grid identity mapping in the presence of complex.
SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.
Testing and integrating the WLCG/EGEE middleware in the LHC computing Simone Campana, Alessandro Di Girolamo, Elisa Lanciotti, Nicolò Magini, Patricia.
OSG Site Admin Workshop - Mar 2008Using gLExec to improve security1 OSG Site Administrators Workshop Using gLExec to improve security of Grid jobs by Alain.
DIRAC Pilot Jobs A. Casajus, R. Graciani, A. Tsaregorodtsev for the LHCb DIRAC team Pilot Framework and the DIRAC WMS DIRAC Workload Management System.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI How to integrate portals with the EGI monitoring system Dusan Vudragovic.
OSG Area Coordinator’s Report: Workload Management Maxim Potekhin BNL May 8 th, 2008.
WLCG Authentication & Authorisation LHCOPN/LHCONE Rome, 29 April 2014 David Kelsey STFC/RAL.
INFSO-RI Enabling Grids for E-sciencE glexec on worker nodes David Groep NIKHEF.
INFSO-RI Enabling Grids for E-sciencE Policy management and fair share in gLite Andrea Guarise HPDC 2006 Paris June 19th, 2006.
Proxy management mechanism and gLExec integration with the PanDA pilot Status and perspectives.
1 A Scalable Distributed Data Management System for ATLAS David Cameron CERN CHEP 2006 Mumbai, India.
VOX Project Tanya Levshina. 05/17/2004 VOX Project2 Presentation overview Introduction VOX Project VOMRS Concepts Roles Registration flow EDG VOMS Open.
Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.
INFSO-RI Enabling Grids for E-sciencE DGAS, current status & plans Andrea Guarise EGEE JRA1 All Hands Meeting Plzen July 11th, 2006.
Ian Collier, STFC, Romain Wartel, CERN Maintaining Traceability in an Evolving Distributed Computing Environment Introduction Security.
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.
DGAS Distributed Grid Accounting System INFN Workshop /05/1009, Palau Giuseppe Patania Andrea Guarise 6/18/20161.
Running User Jobs In the Grid without End User Certificates - Assessing Traceability Anand Padmanabhan CyberGIS Center for Advanced Digital and Spatial.
INFSO-RI Enabling Grids for E-sciencE Padova site report Massimo Sgaravatto On behalf of the JRA1 IT-CZ Padova group.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI Services for Distributed e-Infrastructure Access Tiziana Ferrari on behalf.
Antonio Fuentes RedIRIS Barcelona, 15 Abril 2008 The GENIUS Grid portal.
The EPIKH Project (Exchange Programme to advance e-Infrastructure Know-How) gLite Grid Introduction Salma Saber Electronic.
Why you should care about glexec OSG Site Administrator’s Meeting Written by Igor Sfiligoi Presented by Alain Roy Hint: It’s about security.
Enabling Grids for E-sciencE Claudio Cherubino INFN DGAS (Distributed Grid Accounting System)
Dynamic Accounts: Identity Management for Site Operations Kate Keahey R. Ananthakrishnan, T. Freeman, R. Madduri, F. Siebenlist.
OGF PGI – EDGI Security Use Case and Requirements
Glexec deployment models local credentials and grid identity mapping in the presence of complex schedulers David Groep NIKHEF.
Patricia Méndez Lorenzo ALICE Offline Week CERN, 13th July 2007
Grid Deployment Board meeting, 8 November 2006, CERN
Leigh Grundhoefer Indiana University
Presentation transcript:

Security and VO management enhancements in Panda Workload Management System Jose Caballero Maxim Potekhin Torre Wenaus Presented by Maxim Potekhin at HPDC08 – EGEE/OSG Workshop on VO management in Production Grids Boston, June 24 th 2008 Open Science Grid Brookhaven National Laboratory

Outline Panda is a comprehensive Workload Management System that performs aggregation of job requests, allocation of Grid resources to requests according to pre-defined criteria, and tracking of job execution. Central to the concept of Panda is the use of pilot jobs which probe the environment on the remote worker node, before pulling down the payload job from the server and executing it. Such design allows for improved logging and monitoring capabilities, made possible by the pilot being a "smart wrapper" for the payload job. Recently, we enhanced the overall security of the Panda system by optionally allowing the pilot job to change UID via the use of gLExec wrapper. This is based on credentials of the end user (requestor) of the payload job, deposited on a caching service (MyProxy) and retrieved by the pilot prior to payload execution. In addition, work is under way to implement a resource allocation mechanism within Panda, based on user affiliation with Virtual Organizations (VOs) and resources allocated by individual sites to VOs.

The Panda job lifecycle A brief outline of the job lifecycle within the Panda System before recent security enhancements: –a user registers their job with the Panda Service (which may be comprised of more than one actual servers). This includes registering the payload definition (the “transformation” in Panda terminology), which is a URL of the script to be executed. –separately, the “pilot” processes are created on the WNs of the target Grid sites. –the pilots probe the environment and contact the Panda Service via https, requesting assignment of a job –they receive the URL of the transformation script (job payload script) –the transformation script is downloaded by the pilot from the Panda Service and executed, and records are correspondingly updated in the Panda Service database.

Pilot Instance (on the Worker Node) get a PandaID get the transform URL get the transformation run the job (under Pilot UID) mark job as done in the DB Panda Service http(s) transformation repository The Panda job lifecycle (pre-2008) User (Client) Submits a job via transformation Autopilot (pilot submission agent)

The Panda job lifecycle: earlier security concerns In the Panda version which existed until recently, the payload was always executed on the remote site under the UID traceable to a “production account”, namely the one from which the Pilot jobs were launched, but not to the account of the user who submitted jobs for execution. While an audit of user activity is still possible as it is recorded in the Panda database, there is admittedly a lack of immediate authorization mechanism at the user-account level, which is a concern to certain sites. –solution: to employ the gLExec mechanism to optionally modify UID from a production account to a user-specific account. This approach includes GUMS or LCAS/LCMAPS as the backend repository of DN- to-UID mapping. A VOMS proxy caching service such as MyProxy will be necessary to properly handle credentials. In principle a user can make use of the pilot’s proxy to initiate actions with the pilot proxy’s authority –We are addressing this by instantiating pilots with limited proxies and by using a special “role=pilot” VOMS annotations to limit the privileges of the Pilot process as it runs on the Worker Node.

The Panda job lifecycle: introduction of gLExec gLExec functionality is enabled on specific sites only, the ones who –Have such requirement in their site policy –Are actually ready to support it, along with relevant middleware Only select US sites satisfy both requirements at the moment gLExec can operate in “log only” mode (which helps in auditing and incident investigation) and also in setuid mode, when the Pilot Job’s uid is changed to that of the user who runs the job. In the latter case, such change is done based on the X509 proxy that must be provided prior to invocation of this utility. In our approach, the proxy is obtained from an instance of the MyProxy server (see the following diagram) and is not cached in any of the elements of the Panda system itself gLExec: –is a Grid version of suexec program –runs as setuid process on the CE –performs the switch based on results from GUMS/LCAS/LCMAPS mapping –can be configured to restrict which users allowed to invoke it

Pilot Instance (on the Worker Node) get a PandaID get the transform URL get the transform get the user proxy determine the user’s DN run job under “real” UID via gLExec mark job as done in the DB gLExec MyProxy GUMS/ LCAS/ LCMAPS The Panda job lifecycle: introduction of gLExec User (Client) submits a job deposits a VOMS proxy Autopilot (pilot submission agent) Panda Service transform repository

gLExec integration: additional considerations gLExec is activated not in every pilot job, but only in those running on sites that are marked in the Panda database as “gLExec enabled”. Possible errors due to misconfiguration of gLExec, MyProxy client software etc on sites, are reported to the Panda service and logged for future debugging Because of change of identity while the job (Pilot) has already started on the remote batch system, the $HOME variable will be redefined. This may necessitate copying files to a new location. The environment specified by the Pilot when it starts is reset upon execution of gLExec. Variables such as $LD_LIBRARY_PATH and others need to be re- established. periodic renewal of the user proxies is done by a process forked from the Pilot and run with the Pilot identity security check:verification that the user DN and the “logname” mach. Otherwise, a malicious user could delegate a proxy using a different “logname” from his and runs jobs that he should not (logname is an identifier provided by the user to the MyProxy service when depositing the credentials there).

gLExec integration: additional considerations In the architecture being described, we are consciously avoiding having to implement an application-specific (Panda specific) cache of credentials, choosing instead to rely on established middleware elements such as MyProxy We are working on ways to minimize and mitigate impact of potential security incidents which theoretically may involve the Pilot proxy being compromised. Such measures include managing lifetimes of proxies and utilizing additional unique keys stored on the Panda service, which must be used to retrieve proxies from a MyProxy instance (note it’s not the same as caching proxies in Panda!).

gLExec integration: current status and plans BNL is running an instance of MyProxy specifically for Panda use, which ensures optimal configuration, control and security Panda jobs are been executed for testing purposes in gLExec mode on two US sites, BNL and FNAL BNL installation of gLExec is using the GUMS system as backend, as opposed to LCAS/LCMAPS which will be used by EGEE EGEE is currently working to resolve remaining issues in gLExec/LCAS/LCMAPS interface - Panda team is ready to expand testing and job submission to EGEE sites when this is done Possibility of GUMS installation on EGEE sites? Tested platform, expertise available.

Management of user access, VOs and resources in Panda Panda features several tiers of user access control: – A valid X509 certificate proxy is required to submit jobs to Panda – The Panda database table containing user info has a column to flag users who must be denied access for whatever reason (thus allowing for a quick ban during a security incident) – Before a job is dispatched for execution in the Panda service, the user’s DN is checked against a gridmap file that is periodically generated based on the VOMS service data, thus ensuring an existing affiliation of the user with an authorized VO Work is currently under way to use ReSS and similar tools to automatically populate “native” database tables in Panda with information mapping VOs to sites – More optimal distribution of workload according to resources made available, in near time Additionally, a resource allocation table is being added to the Panda database, which, in conjunction with addiotional job broker logic, will make it easier to follow usage level guidelines for each VO/site pair – Contains sub-VO (working group) level of detail

Conclusions Panda development team is proactive in finding ways to adhere to emerging site policies and improve the security of its Pilot-based execution framework We are open to all and any integration tests on EGEE sites, in that area Work has started on enhancing Panda with VO and resource management capabilities