Glexec deployment models local credentials and grid identity mapping in the presence of complex schedulers David Groep NIKHEF.

Slides:



Advertisements
Similar presentations
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MyProxy and EGEE Ludek Matyska and Daniel.
Advertisements

29 June 2006 GridSite Andrew McNabwww.gridsite.org VOMS and VOs Andrew McNab University of Manchester.
CERN LCG Overview & Scaling challenges David Smith For LCG Deployment Group CERN HEPiX 2003, Vancouver.
OSG AuthZ Architecture AuthZ Components Legend VO Management Services Grid Site GUMS Site Services SAZ CE Gatekeeper Prima Is Auth? Yes / No SE SRM gPlazma.
INFSO-RI Enabling Grids for E-sciencE Glexec overview Gerben Venekamp NIKHEF.
INFSO-RI Enabling Grids for E-sciencE gLExec, SCAS and the paths forward Introduction to pilot jobs and gLExec and SCAS framework.
INFSO-RI Enabling Grids for E-sciencE gLExec and OS compatibility David Groep Nikhef.
Grid Resource Allocation and Management (GRAM) Execution management Execution management –Deployment, scheduling and monitoring Community Scheduler Framework.
Supporting further and higher education The Akenti Authorisation System Alan Robiette, JISC Development Group.
EGEE is a project funded by the European Union under contract IST Gap analysis draft v2 Olle Mulmo, David Groep, Joni Hahkala JRA3 Gap, 10.
Maarten Litmaath (CERN), GDB meeting, CERN, 2006/02/08 VOMS deployment Extent of VOMS usage in LCG-2 –Node types gLite 3.0 Issues Conclusions.
8-Jul-03D.P.Kelsey, LCG-GDB-Security1 LCG/GDB Security (Report from the LCG Security Group) RAL, 8 July 2003 David Kelsey CCLRC/RAL, UK
June 24-25, 2008 Regional Grid Training, University of Belgrade, Serbia Introduction to gLite gLite Basic Services Antun Balaž SCL, Institute of Physics.
Tarball server (for Condor installation) Site Headnode Worker Nodes Schedd glidein - special purpose Condor pool master DB Panda Server Pilot Factory -
Mine Altunay July 30, 2007 Security and Privacy in OSG.
WP3 Authorization and R-GMA Linda Cornwall WP3 workshop 2-4 April 2003.
Pilot Jobs John Gordon Management Board 23/10/2007.
Getting started DIRAC Project. Outline  DIRAC information system  Documentation sources  DIRAC users and groups  Registration with DIRAC  Getting.
INFSO-RI Enabling Grids for E-sciencE LCAS/LCMAPS and WSS Site Access Control boundary conditions David Groep NIKHEF.
Overview of Privilege Project at Fermilab (compilation of multiple talks and documents written by various authors) Tanya Levshina.
US LHC OSG Technology Roadmap May 4-5th, 2005 Welcome. Thank you to Deirdre for the arrangements.
INFSO-RI Enabling Grids for E-sciencE LCAS/LCMAPS and WSS Site Access Control boundary conditions David Groep et al. NIKHEF.
Trusted Virtual Machine Images a step towards Cloud Computing for HEP? Tony Cass on behalf of the HEPiX Virtualisation Working Group October 19 th 2010.
VO Privilege Activity. The VO Privilege Project develops and implements fine-grained authorization to grid- enabled resources and services Started Spring.
EMI INFSO-RI Argus Policies in Action Valery Tschopp (SWITCH) on behalf of the Argus PT.
INFSO-RI Enabling Grids for E-sciencE glexec deployment models local credentials and grid identity mapping in the presence of complex.
Ian D. Alderman Computer Sciences Department University of Wisconsin-Madison Condor Week 2008 End-to-end.
Jun 12, 20071/17 AuthZ Interoperability – Status and Plan Gabriele Garzoglio AuthZ Interoperability Status and Plans June 12, 2007 Middleware Security.
OSG Site Admin Workshop - Mar 2008Using gLExec to improve security1 OSG Site Administrators Workshop Using gLExec to improve security of Grid jobs by Alain.
LCG Support for Pilot Jobs John Gordon, STFC GDB December 2 nd 2009.
1 AHM, 2–4 Sept 2003 e-Science Centre GRID Authorization Framework for CCLRC Data Portal Ananta Manandhar.
INFSO-RI Enabling Grids for E-sciencE glexec deployment models local credentials and grid identity mapping in the presence of complex.
GRID Security & DIRAC A. Casajus R. Graciani A. Tsaregorodtsev.
DIRAC Pilot Jobs A. Casajus, R. Graciani, A. Tsaregorodtsev for the LHCb DIRAC team Pilot Framework and the DIRAC WMS DIRAC Workload Management System.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Update Authorization Service Christoph Witzig,
WLCG Authentication & Authorisation LHCOPN/LHCONE Rome, 29 April 2014 David Kelsey STFC/RAL.
INFSO-RI Enabling Grids for E-sciencE glexec on worker nodes David Groep NIKHEF.
Open Science Grid Build a Grid Session Siddhartha E.S University of Florida.
Security and VO management enhancements in Panda Workload Management System Jose Caballero Maxim Potekhin Torre Wenaus Presented by Maxim Potekhin at HPDC08.
Probes Requirement Review OTAG-08 03/05/ Requirements that can be directly passed to EMI ● Changes to the MPI test (NGI_IT)
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Job Management Claudio Grandi.
Trusted Virtual Machine Images the HEPiX Point of View Tony Cass October 21 st 2011.
Why you should care about glexec OSG Site Administrator’s Meeting Written by Igor Sfiligoi Presented by Alain Roy Hint: It’s about security.
Job Priorities and Resource sharing in CMS A. Sciabà ECGI meeting on job priorities 15 May 2006.
EGEE-II INFSO-RI Enabling Grids for E-sciencE Simone Campana (CERN) Job Priorities: status.
CREAM Status and plans Massimo Sgaravatto – INFN Padova
June 5, 2006gLexec1 gLexec (within the OSG and Fermilab) June 5, 2006 Keith Chadwick.
Dynamic Accounts: Identity Management for Site Operations Kate Keahey R. Ananthakrishnan, T. Freeman, R. Madduri, F. Siebenlist.
How to integrate portals with EGI accounting system R.Graciani EGI TF 2012.
gLExec and OS compatibility
OGF PGI – EDGI Security Use Case and Requirements
Grid Security.
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Design rationale and status of the org.glite.overlay component
Workload Management System ( WMS )
Towards GLUE Schema 2.0 Sergio Andreozzi INFN-CNAF Bologna, Italy
John Gordon, STFC-RAL GDB 10 October 2007
Global Banning List and Authorization Service
Grid Deployment Board meeting, 8 November 2006, CERN
Short update on the latest gLite status
Security mechanisms and vulnerabilities in .NET
Glexec/SCAS Pilot: IN2P3-CC status
THE STEPS TO MANAGE THE GRID
Building Grids with Condor
Update on EDG Security (VOMS)
The New Virtual Organization Membership Service (VOMS)
Gridification Gatekeeper LCAS: Local Centre AuthZ Service LCAS
A Real-time Intrusion Detection System for UNIX
Privilege Separation in Condor
Grid Computing Software Interface
Presentation transcript:

glexec deployment models local credentials and grid identity mapping in the presence of complex schedulers David Groep NIKHEF

What is glexec? glexec a thin layer to change unix credentials based on grid identity and attribute information you can think of it as: ‘a replacement for the gatekeeper’ ‘a griddy version of Apache’s suexec(8)’ ‘a program wrapper around LCAS, LCMAPS or GUMS’ glexec deployment models, LCG Operations W/S June 19-20, 2006

What glexec does Input a certificate chain, possibly with VOMS extensions a user program name & arguments to run Action check authorization (LCAS, GUMS) user credentials, proper VOMS attributes, executable name acquire local credentials local (uid, gid) pair, possibly across a cluster enforce the local credential on the process Result user program is run with the mapped credentials glexec deployment models, LCG Operations W/S June 19-20, 2006

Why was glexec devised? gatekeeper and other schedulers are complex, and need not be run with root privileges all the time take an example from Apache httpd, where user cgi scripts can be run under their own identity, but without the web server itself having to run as root to accomplish this, a small, program is needed with setuid(2) power: ‘suexec(8)’ variety in grid job submission systems is increasing need a common way of obtaining and enforcing site policy and credential mapping without the need to modify each and every system as such, glexec in this deployment mode is an alternative to having authorization and mapping call-outs in each system glexec deployment models, LCG Operations W/S June 19-20, 2006

glexec traditional deployments There are three ‘traditional’ deployment models, where glexec has a role in two of these direct per-user job submission to a ‘gatekeeper’ running with root privileges (GT2GK, today’s model) a non-privileged dedicated CE or scheduler, accepting authenticated user jobs and submitting to the batch system on-demand CE, submitted by VO or user to a front-end system, that then receives user jobs and submits these to the batch system Submitting user’s identity & job VO identity/process or VO placeholder manager Site managed and trusted services glexec deployment models, LCG Operations W/S June 19-20, 2006

Jobs submission today (GT2 GK) Deployment model without glexec (‘mode GT2GK’) jobs are submitted with an identity (hopefully the original user’s one) to the site Gatekeeper running as root one job manager is run for each user on the head node with the user’s (uid,gid) as set by the gatekeeper glexec deployment models, LCG Operations W/S June 19-20, 2006

Glexec in a one-per-site mode Deployment model with a CE ‘service’ running in a non-privileged account or with a CE run (maybe one per VO) on a single front-end per site examples CREAM GT4 WS-GRAM glexec deployment models, LCG Operations W/S June 19-20, 2006

glexec with an on-demand CE Deployment model with on-demand CEs (‘mode on-demand CEs’) The user or the VO start their own scheduler on a front-end system All these on-demand schedulers are resource-limited by a site-managed master scheduler (via a GT2GK or Condor) the on-demand schedulers eat jobs for their VO or user and set the proper identity before the job gets submitted to the site batch system glexec deployment models, LCG Operations W/S June 19-20, 2006

glexec with an on-demand CE Deployment model with on-demand CEs (‘mode on-demand for VOs’ with legacy interface) glexec deployment models, LCG Operations W/S June 19-20, 2006

Traditional model summary In all three models, the submission of the user job to the batch system is done with the original job owner’s mapped (uid, gid) identity grid-to-local identity mapping is done only on the front-end system (CE) batch system accounting provides per-user records inspection of Unix process on worker nodes are per-user glexec deployment models, LCG Operations W/S June 19-20, 2006

Pilot jobs A pilot job is basically just a small script which downloads a real job from a repository once it starts executing, hence it is not committed to any particular task, or perhaps even a particular user, until that point. If there are no tasks waiting the pilot job exits immediately. In principle, if the time limits on the queue are long enough a single pilot job could run more than one real job, although I'm not sure if anyone is actually doing that at the moment. (thanks to Stephen Burke, on LCG-ROLLOUT) glexec deployment models, LCG Operations W/S June 19-20, 2006

From the VO side Background: some large VOs develop and prefer to use their own scheduling & job management framework late binding of jobs to job slots first establishing an overlay network subsequent scheduling and starting of jobs is faster hide details between the various grid flavours implement VO priorities full use of allocated slots, up to max wall clock time but these VOs will need their ‘own’ scheduler some of them do have it already, but then others don’t and most never will, so the use of pilots should not be the only option (or even the default) way of things glexec deployment models, LCG Operations W/S June 19-20, 2006

Situation today ‘VO-type’ pilot jobs submitted as if regular user jobs run with the identity of one or a few individuals from a VO obtain jobs from any user (within the VO) and run that payload on the WN allocated site ‘sees’ only a single identity, not the true owner of the workload no effective mechanisms today can deny this use model note that this does not apply to the regular ‘per-user’ pilot jobs glexec deployment models, LCG Operations W/S June 19-20, 2006

again, ‘per-user’ pilot jobs satisfy these rules by design Issues Issues that drove the original glexec-on-WN scenario: VO supplied pilot jobs must observe and honour the same policies the site uses for normal job execution preferably without requiring alternate mechanisms to describe the policies be continuously in synch with the site policies again, ‘per-user’ pilot jobs satisfy these rules by design glexec deployment models, LCG Operations W/S June 19-20, 2006

Pieces of the solution Three pieces that go together: glexec on the worker-node deployment mechanism for pilot job to submit themselves and their payload to site policy control give incontrovertible evidence of who is running on which node at any one time needed at selected sites for regulatory compliance ability to nail individual culprits by requiring the VO to present a valid delegation from each user VO should want this to keep user jobs from interfering with each other honouring site ban lists for individuals may help in not banning the entire VO in case of an incident glexec deployment models, LCG Operations W/S June 19-20, 2006

Pieces of the solution glexec on the worker-node deployment way to keep the pilot jobs submitters to their word system-level auditing of the pilot jobs, to see they are not doing the user job by themselves or evading the controls relies on advanced auditing features of the OS (from EAL3+) but auditing data on the WN is useful for incident investigations only internal accounting should be done by the VO the regular site accounting mechanisms are via the batch system, and will see the pilot job identity the site can easily show from those logs the usage by the pilot job (for which wall-clock-time accounting should be used) making a site do accounting based glexec jobs is non-standard, requires effort, may be intrusive, and messes up normal accounting ‘a VO capable of writing their own submission framework, ought to be able to write their own accounting system as well …’ glexec deployment models, LCG Operations W/S June 19-20, 2006

glexec on WN deployment model VO submits a pilot job to the batch system the VO ‘pilot job’ submitter is responsible for the pilot behaviour this might be a specific role in the VO, or a locally registered ‘badged’ user at each site Pilot job is subject to normal site policies for jobs Pilot job obtains the true user job, and presents the user credentials and the job (executable name) to the site (glexec) to request a decision Submitting user’s identity & job VO identity/process or VO placeholder manager Site managed and trusted services glexec deployment models, LCG Operations W/S June 19-20, 2006

VO pilot job on the node On success: the site will set the uid/gid of the new user’s job On failure: glexec will return with an error, and pilot job can terminate or obtain other job Note: proper uid change by Gatekeeper or Condor-C/BLAHP on head node should remain default glexec deployment models, LCG Operations W/S June 19-20, 2006

What is needed in this model? Agreement on the three ingredients deployment of glexec on the WN to do setuid detailed auditing on the head node and the WNs site accounting done at the VO (i.e. pilot job) level glexec needs feature enhancements compared to single-CE version see status of glexec on the next slide Inspection of the audit logs detect abuse patterns in the system-call auditing logs Grid job logging capabilities glexec will log (uid, user/system/real time usage) via syslog credential mapping framework (LCMAPS) will log mapping (also via syslog) centralisation of glexec mappings, e.g. via JobRepository glexec deployment models, LCG Operations W/S June 19-20, 2006

Status today Status of ‘glexec’ today implementation ready & tested, based off the Apache HTTP suexec code base uses the LCAS and LCMAPS for enforcement and mapping in their library-based implementation new modules have been added LCAS: RSL (executable path) constraints validation of cert chain and proxy lifetime restrictions policy should be located on local POSIX-accessible file systems policy transport should be ‘trustworthy’ glexec executable by specific users only (today via Unix permissions) Needed specifically for the –on-WN model make the credential acquisition process (LCAS/LCMAPS) work with a site-central policy engine enforcement will have to stay local changeover to standard callouts for both are needed needs more site-sysadmin configuration capabilities glexec deployment models, LCG Operations W/S June 19-20, 2006

Notes and alternatives glexec, like any site-managed ingress point, trusts the submitter not to have mixed up the user credentials and the jobs we trust the RB today do this correctly, and RBs are unknown quantities to the receiving site a longer term solution is to have the job request singed by the submitting user since the description is modified by intermediaries (brokers), the signature can only be to the original content, and the site would have to evaluate whether the job received matches the signed JDL or use an inheritance model for the job description, and treat the job like you would, e.g., a CIM entity glexec deployment models, LCG Operations W/S June 19-20, 2006

Summary Realize that today some VOs are doing ‘pilot’ jobs today there is no effective enforcement against this some sites may even just don’t care yet, whilst others have hard requirements on auditability and regulatory compliance The glexec-on-WN model gives the VOs tools to comply with site requirements at least makes it ‘better’ than it is today but you, as a site, will miss that warm and fuzzy feeling of trust a glexec-on-WN is always replaceable by the ‘null operation’ for sites that don’t care or want it but realize this is for just one of the glexec deployment models glexec deployment models, LCG Operations W/S June 19-20, 2006