FermiGrid - PRIMA, VOMS, GUMS & SAZ

Keith Chadwick Fermilab

What is FermiGrid?
FermiGrid is:
  The Fermilab campus Grid.
  A set of common services to support the campus Grid: the site globus gateway, VOMS, VOMRS, GUMS, SAZ, MyProxy, Gratia Accounting, etc.
  A forum for promoting stakeholder interoperability and resource sharing within Fermilab.
  The portal from the Open Science Grid to Fermilab Compute and Storage Services:
    Production: fermigrid1, fngp-osg, fcdfosg1, fcdfosg2, docabosg2, sdss-tam, FNAL_FERMIGRID_SE (public dcache), stken, etc.
    Integration: fgtest1, fnpcg, etc.
FermiGrid Web Site & Additional Documentation:
23 Oct 2006 Keith Chadwick

FermiGrid - Infrastructure
Site Globus Gateway: Job forwarding gateway using Condor-G and CEMon. Makes use of the "accept limited" globus gatekeeper option.
VOMS & VOMRS: VO Membership Service & VO Management Registration Service. Allow users to select roles.
GUMS: Grid User Management System. Maps the FQAN in an x509 proxy to a site-specific UID/GID.
SAZ: Site AuthoriZation Service. Allows the site to make fine-grained job authorization decisions.
MyProxy: Service to securely store and retrieve signed x509 proxies.
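The GUMS mapping step can be illustrated with a minimal Python sketch. It is not GUMS code: the FQAN layout follows the example shown later in this talk (/fermilab/Role=NULL/Capability=NULL), and the mapping table, the "production" role and the UID/GID numbers are made up for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class FQAN(NamedTuple):
    group: str                 # e.g. "/fermilab"
    role: Optional[str]        # None when the proxy carries Role=NULL
    capability: Optional[str]  # None when the proxy carries Capability=NULL


def parse_fqan(fqan: str) -> FQAN:
    """Split an FQAN string into its group path, role and capability."""
    group_parts = []
    role = capability = None
    for part in fqan.strip("/").split("/"):
        if part.startswith("Role="):
            value = part[len("Role="):]
            role = None if value == "NULL" else value
        elif part.startswith("Capability="):
            value = part[len("Capability="):]
            capability = None if value == "NULL" else value
        else:
            group_parts.append(part)
    return FQAN("/" + "/".join(group_parts), role, capability)


# Hypothetical mapping table: (group, role) -> (uid, gid).
# The role name and the account numbers are invented for this example.
MAPPINGS = {
    ("/fermilab", None): (5001, 5000),
    ("/fermilab", "production"): (5002, 5000),
}


def map_fqan(fqan: str) -> Optional[Tuple[int, int]]:
    """Return the site (uid, gid) for an FQAN, or None if nothing matches."""
    parsed = parse_fqan(fqan)
    return MAPPINGS.get((parsed.group, parsed.role))
```

Keying the table on (group, role) mirrors the idea that the same user can land in different site accounts depending on the role selected at voms-proxy-init time.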

Site Gatekeeper Job Forwarding
Why?
  Single point of control.
  Hide site internal details.
  Facilitate resource sharing.
  Allow (some) load balancing.
  Support specification of user job requirements (via ClassAds).
Why not?
  Complicates problem diagnosis.
  Non-standard configuration.
  Can confuse users.

Site Gateway Job Forwarding with CEMon and BlueArc - Animation
Step 1 - user issues voms-proxy-init; user receives voms signed credentials.
Step 2 - user submits their grid job via globus-job-run, globus-job-submit, or condor-g.
Step 3 - Gateway requests GUMS mapping based on VO & Role.
Step 4 - Gateway checks against the Site Authorization Service (SAZ).
Step 5 - Grid job is forwarded to the target cluster.
In the background: periodic synchronization between the VOMS server and the GUMS server; clusters send ClassAds via CEMon to the site wide gateway.
Diagram labels: VOMS Server, GUMS Server, SAZ Server, Site Gateway, BlueArc, and the clusters CMS WC1, CDF OSG1, CDF OSG2, D0 CAB2, SDSS TAM, GP Farm, LQCD.

Globus gatekeeper - GUMS & SAZ interface
GUMS and SAZ are interfaced to the globus gatekeeper through the gsi_authz callout:
/etc/grid-security/gsi_authz.conf
  ##### PRIMA
  globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
  ##### SAZ
  globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
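To make the layout of that file explicit, here is a small Python sketch that reads it. It assumes, based only on the two entries shown on this slide, that every non-comment line holds three whitespace-separated fields: the abstract callout type, the library path, and the symbol to call. The field names in the code are labels chosen for this example.

```python
from typing import List, NamedTuple


class Callout(NamedTuple):
    abstract_type: str  # e.g. "globus_mapping"
    library: str        # path to the shared library, without extension
    symbol: str         # function looked up inside the library


def parse_gsi_authz(text: str) -> List[Callout]:
    """Parse gsi_authz.conf-style text, skipping blank and comment lines."""
    callouts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        callouts.append(Callout(*fields))
    return callouts


# The two entries from the slide.
SAMPLE = """\
##### PRIMA
globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
##### SAZ
globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
"""

for callout in parse_gsi_authz(SAMPLE):
    print(callout.abstract_type, "->", callout.symbol)
```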

SAZ - Site AuthoriZation Service
We deployed the Fermilab Site AuthoriZation (SAZ) service on the Fermilab Site Globus Gatekeeper (fermigrid1) on Monday October 2, 2006.
SAZ allows Fermilab to make Grid job authorization decisions for the Fermilab site using the DN, VO, Role and CA information contained in the proxy certificate provided by the user.
Fermilab has currently configured SAZ to operate in a default accept mode for user proxy credentials that are associated with VOs (user proxy credentials generated by voms-proxy-init).
Users that continue to use grid-proxy-init may no longer be able to execute on Fermilab Compute Elements.

SAZ Database Table Structure
DN: user_name, enabled, trusted, changedAt
VO: vo_name, enabled, trusted, changedAt
Role: role_name, enabled, trusted, changedAt
CA: ca_name, enabled, trusted, changedAt
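As an illustration of that structure, the four tables can be sketched with Python's built-in sqlite3 module. This is not the actual SAZ schema: the database engine, the column types and the primary keys are assumptions. The table names (user, vo, role, ca) and the defaults (enabled = Y, trusted = F) are taken from the pseudo-code later in this talk.

```python
import sqlite3

# Table name -> name of its key column, as listed on the slide.
TABLES = {
    "user": "user_name",
    "vo": "vo_name",
    "role": "role_name",
    "ca": "ca_name",
}


def create_sazdb(path: str = ":memory:") -> sqlite3.Connection:
    """Create the four SAZ-style tables (column types are assumed)."""
    conn = sqlite3.connect(path)
    for table, key in TABLES.items():
        conn.execute(
            f"CREATE TABLE {table} ("
            f"{key} TEXT PRIMARY KEY, "
            "enabled TEXT NOT NULL DEFAULT 'Y', "
            "trusted TEXT NOT NULL DEFAULT 'F', "
            "changedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
        )
    conn.commit()
    return conn


conn = create_sazdb()
conn.execute("INSERT INTO vo (vo_name) VALUES ('fermilab')")
print(conn.execute("SELECT vo_name, enabled, trusted FROM vo").fetchall())
```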

SAZ - Site AuthoriZation Pseudo-Code
The site authorization callout on the globus gateway sends a SAZ authorization request (example):
  user: /DC=org/DC=doegrids/OU=People/CN=Keith Chadwick
  VO: fermilab
  Role: /fermilab/Role=NULL/Capability=NULL
  CA: /DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1
The SAZ server on fermigrid4 receives the SAZ authorization request, and:
1. Verifies certificate and trust chain.
2. if [ the certificate does not verify or the trust chain is invalid ]; then SAZ returns "Not-Authorized" fi
3. Issues select on "user:" against the SAZDB user table
4. if [ the select on "user:" fails ]; then a record corresponding to the "user:" is inserted into the SAZDB user table with (user.enabled = Y, user.trusted = F) fi
5. Issues select on "VO:" against the local SAZDB vo table
6. if [ the select on "VO:" fails ]; then a record corresponding to the "VO:" is inserted into the SAZDB vo table with (vo.enabled = Y, vo.trusted = F) fi
7. Issues select on "Role:" against the local SAZDB role table
8. if [ the select on "Role:" fails ]; then a record corresponding to the "Role:" is inserted into the SAZDB role table with (role.enabled = Y, role.trusted = F) fi
9. Issues select on "CA:" against the local SAZDB ca table
10. if [ the select on "CA:" fails ]; then a record corresponding to the "CA:" is inserted into the SAZDB ca table with (ca.enabled = Y, ca.trusted = F) fi
11. The SAZ server then returns the logical AND of (user.enabled, vo.enabled, role.enabled, ca.enabled) to the SAZ client (which was called by either the globus gatekeeper or glexec).
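The numbered steps above can be sketched in Python roughly as follows. This is an illustration, not the SAZ server code: in-memory dictionaries stand in for the SAZDB tables, and the chain_ok argument is a placeholder for the certificate and trust chain verification of steps 1 and 2, which is not implemented here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Record:
    enabled: str = "Y"  # new records start enabled (steps 4, 6, 8, 10)
    trusted: str = "F"  # and untrusted


@dataclass
class SazDb:
    """In-memory stand-in for the user, vo, role and ca tables."""
    user: Dict[str, Record] = field(default_factory=dict)
    vo: Dict[str, Record] = field(default_factory=dict)
    role: Dict[str, Record] = field(default_factory=dict)
    ca: Dict[str, Record] = field(default_factory=dict)


def authorize(db: SazDb, user: str, vo: str, role: str, ca: str,
              chain_ok: bool) -> bool:
    """Return True for "Authorized", False for "Not-Authorized"."""
    # Steps 1-2: reject outright if the certificate or chain is invalid.
    if not chain_ok:
        return False
    # Steps 3-10: select each record, inserting a default one if missing.
    records = [
        db.user.setdefault(user, Record()),
        db.vo.setdefault(vo, Record()),
        db.role.setdefault(role, Record()),
        db.ca.setdefault(ca, Record()),
    ]
    # Step 11: logical AND of the four enabled flags.
    return all(record.enabled == "Y" for record in records)


db = SazDb()
request = dict(
    user="/DC=org/DC=doegrids/OU=People/CN=Keith Chadwick",
    vo="fermilab",
    role="/fermilab/Role=NULL/Capability=NULL",
    ca="/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1",
)
print(authorize(db, chain_ok=True, **request))
```

Because unknown entries are inserted as enabled, a first-time request is accepted by default, and an administrator blocks a DN, VO, Role or CA by flipping its enabled flag afterwards.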

SAZ - Animation
[Animation not reproduced. Labels: Job, Gatekeeper, SAZ, ADMIN, and the DN, VO, Role and CA tables.]

SAZ - A Couple of Caveats
What about grid-proxy-init or voms-proxy-init without a VO?
  The "NULL" VO is specifically disabled (vo.enabled="F", vo.trusted="F").
  If a user has user.trusted="Y" in their user record then >>> we allow them to execute jobs without VO "sponsorship" <<<.
  The granting of user.trusted="Y" is not automatic.
  The number of users with this privilege will be VERY limited.
What about pilot jobs / glide-in operation?
  To comply with the (draft) Fermilab policy on pilot jobs, VOs that submit pilot jobs will shortly be required to use glexec to launch the user portion of their glide-in jobs.
  SAZ authorization requests from glexec may require the VO to have role.trusted="Y" in the VO-specific role record that it uses for glide-in operations.
  The granting of role.trusted="Y" will not be automatic.
Authorization for trusted="Y" flags in the SAZ database tables is granted and revoked by the Fermilab Computer Security Executive based on explicit trust relationships.
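One possible reading of these two caveats, written as a Python sketch. The exact rules inside SAZ are not given in this talk, so the structure below is an assumption: a request in the "NULL" VO passes only for a trusted user, and a request arriving from glexec additionally needs a trusted role. The Flags class and the from_glexec parameter are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    enabled: str  # "Y" or "F"
    trusted: str  # "Y" or "F"


def decide(user: Flags, vo_name: str, vo: Flags, role: Flags, ca: Flags,
           from_glexec: bool = False) -> bool:
    """Authorization decision including the trusted flags (assumed rules)."""
    # The user and the CA must be enabled in every case.
    if user.enabled != "Y" or ca.enabled != "Y":
        return False
    # No VO sponsorship: the "NULL" VO is disabled, so only a trusted
    # user gets through.
    if vo_name == "NULL":
        return user.trusted == "Y"
    # Normal case: VO and role must both be enabled.
    if vo.enabled != "Y" or role.enabled != "Y":
        return False
    # Pilot jobs: a request from glexec needs a trusted role record.
    if from_glexec and role.trusted != "Y":
        return False
    return True


null_vo = Flags(enabled="F", trusted="F")
ordinary_user = Flags(enabled="Y", trusted="F")
print(decide(ordinary_user, "NULL", null_vo, Flags("Y", "F"), Flags("Y", "F")))
```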

SAZ - Open Issues
Extra /CN=<random number> in DN.
  Examples:
    /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) /CN=
    /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) /CN=
    /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) /CN=
  Result of user issuing grid-proxy-init. Does not occur in voms-proxy-init.
  Looking at code changes to handle the "extra CN problem".
Condor fails to properly delegate the full voms proxy attributes.
  This can be worked around in condor_config by setting: DELEGATE_JOB_GSI_CREDENTIALS=FALSE
  A ticket on this issue has been opened with the Condor developers.
Testing by Chris Green and John Weigand shows that Reliable File Transfer (RFT) with WS-GRAM is also failing to properly delegate the full voms attributes:
  RFT is using the full voms proxy for the first transaction, but uses a cached copy without the role information for the second transaction.
  A ticket on this issue has been opened with the Globus developers.
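One way the "extra CN problem" could be handled is to normalize the DN before the lookup. The Python sketch below is not the code change under consideration for SAZ; it simply strips trailing /CN= components that consist only of digits. The number in the example is invented.

```python
import re

# One or more trailing "/CN=<digits>" components at the end of the DN.
_PROXY_CN = re.compile(r"(?:/CN=\d+)+$")


def strip_proxy_cn(dn: str) -> str:
    """Remove trailing all-numeric /CN= components from a DN."""
    return _PROXY_CN.sub("", dn)


# The trailing number is made up for illustration.
proxy_dn = "/DC=org/DC=doegrids/OU=People/CN=Keith Chadwick/CN=12345678"
print(strip_proxy_cn(proxy_dn))
```

Only components made up entirely of digits are removed, so a user CN that merely ends in a number is left untouched.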

Draft Fermilab VO Trust Relationship Policy
Fermilab will only accept jobs from Virtual Organizations (VOs) which have established trust relationships in good standing.
Trust relationships can be requested by VO management by contacting Fermilab Computer Security, and are granted and revoked by the Fermilab Computer Security Executive.
Some VOs such as CDF, D0, MINOS, LQCD, already possess a valid trust relationship with Fermilab due to overlap of staff or the umbrella of Fermilab's own operational and management controls. Other VOs will be expected to establish the trust relationship as described below in order to continue using Fermilab resources.
Criteria for Establishing Trust Relationships:
  Policies and practices for mutual security are continually adjusted to meet changes in risk perceptions. (NIST)
  Acceptable use of Fermilab resources is governed by both the VO's and Fermilab's Acceptable Use Policies. The Open Science Grid's User AUP (V2.0, February 9, 2006) is an example of an AUP acceptable to Fermilab and applies to users operating under OSG's auspices.
  A VO must describe and operate its technical infrastructure in a transparent manner which permits verification of its functioning.
  A VO must have an operational organization with an appropriate number of staff members who respond to Fermilab requests ( and/or phone calls) within a reasonable time, generally during the normal business hours of its home site.
  A VO must have an established and published response plan to deal with security incidents and reports of unauthorized use, and the staff to implement the plan.
Non-compliance with site policies by a VO or its members may trigger early or frequent re-examination of the trust relationship with the VO.

Draft Pilot Job Policy
A Pilot Job (also called a glide-in or late-binding job) is a batch job which starts on a grid worker node but loads some other job, termed the User Job, which has been created by another user.
Rules:
  Pilot Jobs will only be acceptable from VOs whose trust relationships with Fermilab include authorization to use them.
  A Pilot Job must use the site provided glexec facility to map the application and data files to the actual owner of the User Job. glexec will perform the necessary callout to the Grid User Management System (GUMS) and Site Authorization Service (SAZ), and the Pilot Job must respect the result of these Policy Decision Points.
  A Pilot Job and the User Job will not attempt to circumvent job accounting or limits placed on system resources by the batch system.
  A Pilot Job may launch multiple User Jobs in serial fashion, but must not attempt to maintain data files between jobs belonging to different users.
  When transferring a User Job into the worker node, the Pilot Job will use a level of security equivalent to that of the original job submission process.
Consequences:
  Fermilab reserves the right to terminate any batch jobs that appear to be operating beyond their authorization, including Pilot Jobs and User Jobs not in compliance with this policy.
  The DN of the Job Manager or the entire VO may be placed on the Site Black List until the situation is rectified.
  Fermilab expects any VO authorized to run Pilot Jobs to assure compliance by its users.

glexec
Joint development by David Groep / Gerben Venekamp / Oscar Koeroo (NIKHEF) and Dan Yocum / Igor Sfiligoi (Fermilab).
Integrated (via "plugins") with LCAS / LCMAPS infrastructure (for LCG) and GUMS / SAZ infrastructure (for OSG).
glexec is currently deployed on a couple of small clusters at Fermilab, moving towards a "significant" deployment at Fermilab this week.
Will be included in Condor 6.9.x.

glexec block diagram

High Availability / Service Redundancy Plans
Gatekeeper: Redundant Condor_Master and Condor_Negotiator.
VOMS: Sticky problem. Have requested a change to VOMRS that will make things much easier.
GUMS: Have a test active/standby GUMS service operating with Linux-HA. Believe that we know how to implement an active/active service.
SAZ: Can implement either active/standby or active/active.
MyProxy: Need for MyProxy will be eliminated by new CEMon based job forwarding mechanism.

Metrics
In addition to the normal operation effort of installing, running and upgrading the various FermiGrid services over the past year, we have spent significant effort to collect and publish operational metrics.
Examples:
  Globus gatekeeper calls by jobmanager per day
  Globus gatekeeper IP connections per day
  VOMS calls per day
  VOMS server IP connections per day
  GUMS calls per day
  GUMS server IP connections per day
  GUMS server unique Certificates and Mappings per day
  SAZ Authorizations and Rejections per day
  SAZ server IP connections per day
  SAZ server unique DN, VO, Role & CA per day
Metrics collection scripts run once a day and collect information for the previous day.
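A once-a-day collection script of this kind might look like the Python sketch below. The log line layout ("<date> <time> <result> ...") and the result keywords are hypothetical; the real gatekeeper, VOMS, GUMS and SAZ logs each have their own format.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable


def previous_day(today: date) -> date:
    """The day a daily metrics run reports on."""
    return today - timedelta(days=1)


def daily_counts(lines: Iterable[str], day: date) -> Counter:
    """Count log entries per result keyword for a single day."""
    counts: Counter = Counter()
    wanted = day.isoformat()  # e.g. "2006-10-22"
    for line in lines:
        fields = line.split()
        # Skip malformed lines and entries from other days.
        if len(fields) >= 3 and fields[0] == wanted:
            counts[fields[2]] += 1
    return counts


# Hypothetical SAZ-style log excerpt.
LOG = [
    "2006-10-22 13:05:01 AUTHORIZED fermilab",
    "2006-10-22 13:06:10 REJECTED NULL",
    "2006-10-23 00:00:02 AUTHORIZED fermilab",
]
print(daily_counts(LOG, previous_day(date(2006, 10, 23))))
```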

Metrics - fermigrid1

Service Monitoring
Service Monitor scripts run multiple times per day (typically once per hour). They gather detailed information about the service that they are monitoring.
They also verify the health of the service that they are monitoring (together with any dependent services), notify administrators and automatically restart the service(s) as necessary to ensure continuous operations.
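The check, notify and restart cycle described here can be sketched in Python as a single pass of a monitor. This is a sketch, not the FermiGrid monitor scripts: the health check, restart action and notification channel are passed in as plain callables because the talk does not say how each is done.

```python
from typing import Callable


def monitor_once(name: str,
                 is_healthy: Callable[[], bool],
                 restart: Callable[[], None],
                 notify: Callable[[str], None]) -> str:
    """Run one monitoring pass; return "ok", "restarted" or "down"."""
    if is_healthy():
        return "ok"
    notify(f"{name} failed its health check; restarting")
    restart()
    # Check again so administrators learn whether the restart helped.
    if is_healthy():
        notify(f"{name} recovered after restart")
        return "restarted"
    notify(f"{name} is still down after restart")
    return "down"


# Toy service that comes back once it is restarted.
state = {"up": False}
print(monitor_once("saz",
                   is_healthy=lambda: state["up"],
                   restart=lambda: state.update(up=True),
                   notify=print))
```

A scheduler such as cron would call a pass like this once per hour for each service and its dependent services.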

Service Monitor - fermigrid1

Areas of Current Work within FermiGrid
SAZ and glexec - nearing completion.
BlueArc storage and public dcache storage element - ongoing.
Further Metrics and Service Monitor Development - ongoing.
Gratia Accounting.
Web Services.
XEN.
Service Failover.
Research, Development & Deployment of future ITBs and OSG releases.

Parting Comments
Extracting metrics and service monitor information needs to be easier - trolling through (globus gatekeeper, voms, gums, saz) log files is not an efficient method.
Having a uniform standard time format (and some sort of unique process/thread id) is essential.
Problem diagnosis is also very difficult (our job forwarding gateway does compound this problem).
David Bianco from Jefferson Lab gave a presentation on Sguil at the Fall 2006 HEPiX conference. Having a similar common interface for the globus gatekeepers and services log files together with the ability to correlate events from multiple sources would significantly improve problem diagnosis.
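To show why a uniform time format matters for correlation, here is a Python sketch that merges several logs into one timeline. It assumes every line starts with the same UTC timestamp layout followed by a message, and that each log is already in time order. That layout is exactly what the services do not share today, so the format and the sample lines are hypothetical.

```python
import heapq
from datetime import datetime
from typing import Dict, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, source, message)

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed common format


def parse(source: str, line: str) -> Event:
    """Split "<timestamp> <message>" and tag it with its source log."""
    stamp, _, message = line.partition(" ")
    return (datetime.strptime(stamp, STAMP_FORMAT), source, message)


def correlate(logs: Dict[str, List[str]]) -> List[Event]:
    """Merge per-service logs (each already time-ordered) into one timeline."""
    streams = [[parse(source, line) for line in lines]
               for source, lines in logs.items()]
    return list(heapq.merge(*streams))


# Hypothetical excerpts from two services.
LOGS = {
    "gatekeeper": ["2006-10-22T13:05:01Z job accepted",
                   "2006-10-22T13:05:09Z job forwarded"],
    "saz": ["2006-10-22T13:05:03Z authorized"],
}
for when, source, message in correlate(LOGS):
    print(when.isoformat(), source, message)
```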

fin
Any questions?
